What data-mining method to use? Decision Trees, Rule Sets and Ensembles

RDS can be used to create three basic types of model: trees, rule sets and ensembles. Each type of model has its own characteristics. In this article we go through the characteristics of each model type and how RDS uses them to make predictions.

Decision Trees

A tree can be regarded as a hierarchically organized set of rules. A prediction for a new example is made by following a path from the root of the tree to a leaf node (the tree is actually drawn upside-down in RDS, so the root is displayed at the top of the graph and the leaves at the bottom). The path to follow from the root is decided by the outcome of the tests associated with each node.
For example, consider making a prediction with the tree in Fig. 1, which has been built from the Cleveland heart disease dataset in the UCI repository. The number of major vessels colored by fluoroscopy determines whether to go to the left or right child of the root. Assume that we have an example for which this number is 1. Hence, we go to the right, and the next test concerns whether or not the chest pain type is asymptomatic. Assume that in our example the chest pain type is indeed asymptomatic. We then reach a leaf of the tree (leaf no. 5 when counting from the left), and no more tests are required to make a prediction for the example.
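The traversal described above can be sketched in a few lines of Python. This is only an illustration and not how RDS is implemented: the `Node` class, the `find_leaf` function, the variable names, the threshold `>= 1` and the left/right layout are all assumptions, and the toy tree has three leaves rather than the full structure of Fig. 1.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """A tree node: internal nodes hold a test, leaves hold only a name."""
    test: Optional[Callable[[dict], bool]] = None
    left: Optional["Node"] = None    # followed when the test is False
    right: Optional["Node"] = None   # followed when the test is True
    name: str = ""


def find_leaf(node: Node, example: dict) -> Node:
    """Follow the test outcomes from the root until a leaf is reached."""
    while node.test is not None:
        node = node.right if node.test(example) else node.left
    return node


# Hypothetical toy tree (not the full tree of Fig. 1): the threshold and the
# left/right layout are assumptions made for illustration.
toy_tree = Node(
    test=lambda ex: ex["major_vessels"] >= 1,
    left=Node(name="leaf A"),
    right=Node(
        test=lambda ex: ex["chest_pain"] == "asymptomatic",
        left=Node(name="leaf B"),
        right=Node(name="leaf C"),
    ),
)

example = {"major_vessels": 1, "chest_pain": "asymptomatic"}
reached = find_leaf(toy_tree, example)
print(reached.name)
```

Only the tests on the path are evaluated: once a leaf is reached, the remaining variables of the example are never inspected.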
Since this particular tree concerns classification (i.e., prediction of a categorical value as opposed to a numerical value, which we will get back to later), the estimated class probabilities at the leaf are used to form a prediction. The pie chart shown at each node in the tree represents these estimated probabilities, which have been formed from training data (i.e., data that has been used to build the model). In this tree, the blue and brown segments correspond to the probability of an example belonging to the class 'healthy' and 'narrowed vessel', respectively.
Looking at leaf no. 5, one can see that the most likely class is 'narrowed vessel', so this is the class the tree predicts for the example. Had the example instead reached, say, leaf no. 4, the most likely class according to the tree would be 'healthy'.

Figure 1. A tree built with RDS from Cleveland heart disease data.

Each leaf in a tree corresponds to an if-then rule, where the condition of the rule (the if-part) is formed by making a conjunction of all conditions on the path from the root to the leaf (that is, they are combined with the Boolean operator AND). The conclusion of the rule (the then-part) is, in the case of classification, to predict the most likely class. The rule obtained from leaf no. 5 of the above tree is shown in Fig. 2, together with some statistics on training and test data (data that has been left out to test the model).
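To make the leaf-to-rule correspondence concrete, here is a minimal sketch that collects the conditions along each root-to-leaf path and joins them with AND. The nested-tuple tree format, the `extract_rules` function and the conditions and classes of the toy tree are hypothetical; only the combination "vessels at least 1 and asymptomatic chest pain gives 'narrowed vessel'" mirrors the example in the text, and the threshold itself is an assumption.

```python
def extract_rules(tree, path=()):
    """Yield (conditions, predicted_class) for every leaf of the tree.

    An internal node is a tuple (condition, subtree_if_false, subtree_if_true);
    a leaf is a string naming the most likely class at that leaf.
    """
    if isinstance(tree, str):
        yield list(path), tree
        return
    condition, if_false, if_true = tree
    yield from extract_rules(if_false, path + (f"NOT ({condition})",))
    yield from extract_rules(if_true, path + (condition,))


# Hypothetical toy tree; thresholds and leaf classes are assumed for illustration.
toy_tree = (
    "major_vessels >= 1",
    "healthy",
    ("chest_pain == asymptomatic", "healthy", "narrowed vessel"),
)

rules = list(extract_rules(toy_tree))
for conditions, predicted_class in rules:
    print("IF " + " AND ".join(conditions) + f" THEN predict '{predicted_class}'")
```

Because every internal node sends an example down exactly one branch, the rules produced this way are mutually exclusive and together cover every possible example.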
Besides the estimated class probabilities, which are expressed not only by a pie chart but also as real numbers, the display shows the number of training and test examples that fall into the leaf (these are said to be covered by the rule) as well as the relative frequencies of the classes in the test data. It may appear strange that the number of covered examples can be anything other than an integer, but fractions of examples are counted when examples have missing values for variables that are used in the rule. We will ignore this for now and return to it in a later tutorial.

Figure 2. A rule built with RDS from Cleveland heart disease data.

As a consequence of how the rules in a tree are organized, they are all non-overlapping, i.e., the conditions of the rules are mutually exclusive, and together the rules cover all possible examples. This means that we do not have to worry about multiple, possibly conflicting, rules covering the same example, or about cases where none of the rules apply: there is always exactly one rule that applies.

Rule Sets

Rule sets are sets of rules with no specific order. This means that one may look at any rule in the set and interpret it independently of its position. This contrasts with so-called decision lists, in which a rule can only be interpreted in the context of the rules that precede it. Part of a rule set generated for the Cleveland heart disease dataset is shown in Fig. 3.
Figure 3. Part of a rule set built with RDS from Cleveland heart disease data.

Each rule in a rule set is interpreted in exactly the same way as a rule in a tree. There is, however, one major difference between the rules in a rule set and those in a tree when considered as a whole: each rule in a set has its own root, to which tests are associated independently of the other rules. In contrast, rules in a tree share the conditions corresponding to the part of the path from the root that they have in common.
In particular, this means that all rules in a tree contain a condition concerning the variable associated with the root (e.g., the number of major vessels colored by fluoroscopy in the tree in Fig. 1). In contrast, the rules in a rule set may consider completely different variables, as illustrated by Fig. 3. This characteristic of rule sets may be beneficial in particular when the target function is disjunctive, or in other words, when there are sub-groups of a class that can be defined using different sets of variables. This is of course seldom known in advance, and is typically discovered only after comparing rule sets and trees. In contrast to rules in a tree, however, the rules in a rule set may overlap, and in some cases none of the generated rules apply. These problems are handled by RDS, but the details of how this is done are left for a later tutorial.

Ensembles

If you are less interested in finding relationships in the data and more interested in making accurate predictions, it is advisable to use an ensemble. An ensemble is simply a collection of models (trees in RDS) that makes a prediction by forming a collective vote from all contained models. The predictive performance of an ensemble is affected by, among other things, the number of contained models, how accurate each individual model is, and how much the models differ from each other, but in general an ensemble clearly outperforms each of its individual models. One drawback of ensembles is that they typically contain so many models that it is not meaningful to display them as they are. In RDS, one may however get a good overall picture of an ensemble by looking at the variable importance graph. In Fig. 4, this graph is shown for an ensemble generated from the Cleveland heart disease data.
The two most important variables according to this graph, the number of major vessels colored by fluoroscopy and the chest pain type, were also used in the tree and rule set above, while the third most important variable, Thal, did not appear in the tree. Typically, ensembles are more robust than trees and rule sets regarding which variables are considered important, in the sense that they are less affected by small variations in the training data.

Figure 4. Variable importance for ensemble built with RDS from Cleveland heart disease data.
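The idea of a collective vote can be illustrated with a simple majority vote over the class predicted by each model. This is a sketch under assumptions: the article does not specify the exact voting scheme used by RDS, and the three one-test models, their thresholds and the variable values below are hypothetical.

```python
from collections import Counter


def ensemble_predict(models, example):
    """Return the class predicted by the largest number of models."""
    votes = Counter(model(example) for model in models)
    predicted_class, _count = votes.most_common(1)[0]
    return predicted_class


# Three hypothetical one-test models; real ensemble members would be full trees.
def model_vessels(ex):
    return "narrowed vessel" if ex["major_vessels"] >= 1 else "healthy"


def model_chest_pain(ex):
    return "narrowed vessel" if ex["chest_pain"] == "asymptomatic" else "healthy"


def model_thal(ex):
    return "healthy" if ex["thal"] == "normal" else "narrowed vessel"


models = [model_vessels, model_chest_pain, model_thal]
example = {"major_vessels": 1, "chest_pain": "asymptomatic", "thal": "normal"}

prediction = ensemble_predict(models, example)
print(prediction)
```

Here two of the three models vote for 'narrowed vessel', so the ensemble predicts that class even though one model disagrees. Using an odd number of models avoids ties in a two-class problem.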
Further reading

Download the RDS User Guide here.

The data set used for the illustrations can be downloaded here.