Re:analysis - Mushroom Data Mining Season

We have had a warm autumn, and here in Uppsala it is still possible to pick mushrooms in the forest. This has turned out to be one of the best mushroom seasons in many years. It therefore seems a particularly good moment to illustrate how RDS (Rule Discovery System) is used by re-analysing one of the classic publicly available data sets: the UCI mushrooms data set.

The data set

This data set includes descriptions of 8124 hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. The records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf  (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended.  This latter class has been combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Analysis

Import the data into RDS. Click the "Modeling data set" hyperlink to view the data set in the spreadsheet view. Change the data type of the first column (called class) to class, and you are ready to go.

Where I would like to go is to the forest. And although I like using RDS, I do not wish to bring my computer this time. Instead, I would like to find some useful rules of thumb that are easy to remember when I come across a specimen of the Agaricus or the Lepiota species. Therefore, I add the default rule set method and the default tree method and press the Start button.

Results

Since we did not change the default settings for the validation method, running the experiment gives us two models, each created from a random sample of 70 % of the data set and tested on the remaining 30 %.
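For readers who want to see what a 70/30 validation split amounts to outside RDS, here is a minimal sketch in Python. It is not how RDS implements its validation; the function name holdout_split, the fixed seed and the stand-in records are assumptions made for illustration. Only the row count (8124) is taken from the data set description.

```python
import random


def holdout_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into a training and a test part."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in records: integers instead of real mushroom descriptions.
rows = list(range(8124))
train, test = holdout_split(rows)
print(len(train), len(test))
```

Every record ends up in exactly one of the two parts, so the test cases are never seen while the model is being built.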

Let's first have a look at the rule set model. It starts with a number of simple rules involving the odor, which identify a number of cases of poisonous mushrooms (shown in brown). It seems that if the smell of the mushroom is unpleasant, one had better leave it where it is.

How do we find the best rules for the edible mushrooms? Take a look to the right in the model browser window. There is a drop-down that allows us to sort all the rules of the model according to different criteria. Since I am interested in finding the best rules for edible mushrooms, I choose to order by probability for edible. This gives me a list of the best rules for identifying edible mushrooms.

I click the first rule in the list, which says that if a mushroom is grey below the ring it is edible, and get to see the details. Apparently, this rule applies to almost 500 of the cases in the data set, and the estimated probability that a mushroom that is grey below the ring is edible is 99.3 %. Among the test cases, all are edible. In fact, all of the cases in the training set are also edible; the estimate stays below 100 % only because RDS uses an internal sampling scheme to compensate for the higher uncertainty in estimating a probability when a rule is derived from fewer examples.
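The article does not describe the internal sampling scheme that RDS uses, so the sketch below swaps in a different and much simpler technique, Laplace-style smoothing, purely to show the same qualitative effect: a rule that is correct on every case it covers still gets an estimate below 100 %, and the estimate is more cautious the fewer cases the rule covers. The function name and the pseudo-count of 1 are assumptions, and the numbers it produces are not expected to match the 99.3 % reported by RDS.

```python
def smoothed_probability(hits, covered, pseudo_count=1.0):
    """Laplace-style estimate: add pseudo_count imaginary cases to each of the two classes.

    This is an illustrative stand-in, not the scheme RDS uses.
    """
    return (hits + pseudo_count) / (covered + 2 * pseudo_count)


# A rule that is right on every covered case, for growing numbers of covered cases.
for covered in (5, 50, 500):
    print(covered, round(smoothed_probability(covered, covered), 4))
```

The estimate approaches 1 as the number of covered cases grows but never reaches it, which is the behaviour described above.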

This model contains many more rules for finding edible mushrooms that are simple enough to remember on the mushroom hunt. Please download the experiment and have a look for yourself if you are curious. For now, let's turn to the tree model.

Tree model

Here we can see that the most important way of distinguishing between edible and poisonous mushrooms is to use your nose. The first splitting rule at the top says that if the mushroom does not smell of anything, then it is most likely edible. Well, I had better make sure that I do not have a cold when applying this rule in practice… Following this (the left) part of the decision tree, we find two refinements that separate out some poisonous mushrooms based on the spore print colour and the stalk surface below the ring. Eventually, we end up in rule number 4, which covers the vast majority of the edible mushrooms and states that it is safe to eat the mushroom if it does not smell, does not have spore prints that are green, and does not have a stalk that is scaly.

Almost all other edible mushrooms can be found in rule number 8 in the right part of the tree, which says that if the mushroom does indeed have a smell that is not bad (foul or pungent), has bruises, and does not live in an urban habitat, it is most likely edible too.

I would say that these are two rules that are easy enough to bring to the forest as well.
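Written out as code, the two rules read as follows. This is only a transcription of rule number 4 and rule number 8 as they are worded above, not an export from RDS; the function names, the dictionary keys and the spelled-out attribute values (such as "none" or "scaly") are assumptions chosen for readability.

```python
BAD_ODORS = ("foul", "pungent")


def rule_4(mushroom):
    """No smell, spore print not green, stalk surface below the ring not scaly."""
    return (
        mushroom["odor"] == "none"
        and mushroom["spore_print_color"] != "green"
        and mushroom["stalk_surface_below_ring"] != "scaly"
    )


def rule_8(mushroom):
    """Has a smell that is not bad, has bruises, does not live in an urban habitat."""
    return (
        mushroom["odor"] != "none"
        and mushroom["odor"] not in BAD_ODORS
        and mushroom["bruises"]
        and mushroom["habitat"] != "urban"
    )


def matches_rule_of_thumb(mushroom):
    """True if either of the two rules classifies the mushroom as most likely edible."""
    return rule_4(mushroom) or rule_8(mushroom)


# A made-up specimen, used only to show how the functions are called.
specimen = {
    "odor": "none",
    "spore_print_color": "brown",
    "stalk_surface_below_ring": "smooth",
    "bruises": False,
    "habitat": "woods",
}
print(matches_rule_of_thumb(specimen))
```

Keeping each rule as its own function mirrors the two leaves of the tree, so each can be checked against the model browser separately.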

Lessons learned

Tree and rule set models are straightforward and simple, yet powerful, tools when you are looking for simple rules of thumb, whether to tell you something new, to pinpoint what you already suspected, or to find some interesting sub-groups in your data.
Personally, I will still stick to the mushrooms I know next time on the hunt. Please do the same!

Download the experiment including the data set as an RDS experiment file.

Further reading

Picking mushrooms? Check out www.svampguiden.com (in Swedish) or www.morelmushroomhunting.com (in English).

More exercise data like this can be found at the UCI machine learning repository.