Why sorting out the same data set results in different trees?
Christos Begleris
Posts: 1
Joined: 7/2/07
Why sorting out the same data set results in... | 7/11/07 5:58 PM
I am using a data set with 3 categorical variables and 1 regression (numeric target) variable. I am trying to establish the variables' importance for the regression value using a tree model.

My data looks as follows:
Captain               Port               Vessel    Amount
ALGIANNAKIS GEORGIOS  ALGECIRAS, SPAIN   Suezmax   142.23
KATSANTONIS NIKOS     ASHKELON, ISRAEL   Aframax   188.1
EMPENADO ORATIO       AUGUSTA, GEORGIA   Suezmax   1300
CADRON ROMAN          BOSPORUS, TURKEY   Suezmax   35.155

There are a total of 800 entries, with captains, ports and vessels being periodically repeated.

I am sorting the data as follows:
a. By captain (alphabetically - with captain data in column A)
b. By port (alphabetically - with port data in column A)
c. By type (alphabetically - with type data in column A)

In all three cases above I am getting a different tree and different values for variable importance.

Would you be able to explain why?

How should the data be sorted to get the 'truest' results possible?

Many thanks,

Christos Begleris
Compumine Support
Posts: 19
Joined: 9/8/06
RE: Why sorting out the same data set results in... | 7/20/07 2:12 PM as a reply to Christos Begleris.
Dear Christos,

When RDS creates a split in the tree, a number of stochastic components are involved. First, RDS randomly partitions the data into training and testing subsets. Then, when searching for rules and creating splits, the system only looks at a randomly selected subset of the training examples available in each node.

I assume that your data set has many different values for the three background variables, of which none really stands out in information content. Therefore RDS will select different variables, and probably also different splitting criteria each time you sort the data set differently.
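
As an illustration of how a random partition can interact with row order, here is a minimal Python sketch. It is not RDS code: the function `split_by_position`, the fixed seed, and the toy rows are all assumptions made for the example. It assumes the random draw selects row positions, so the same positions point at different examples once the rows are re-sorted.

```python
import random

def split_by_position(rows, test_fraction=0.3, seed=42):
    """Assign row POSITIONS to train/test using a seeded random draw.

    Hypothetical helper: how RDS actually partitions data is not
    described in this thread.
    """
    rng = random.Random(seed)
    positions = list(range(len(rows)))
    rng.shuffle(positions)
    n_test = round(len(rows) * test_fraction)
    test_positions = set(positions[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_positions]
    test = [r for i, r in enumerate(rows) if i in test_positions]
    return train, test

# Toy stand-in for the captain / port / vessel / amount table.
rows = [(f"captain_{i % 7}", f"port_{i % 11}", f"vessel_{i % 3}", float(i))
        for i in range(40)]

by_captain = sorted(rows, key=lambda r: r[0])
by_port = sorted(rows, key=lambda r: r[1])

train_a, test_a = split_by_position(by_captain)
train_b, test_b = split_by_position(by_port)

# Same data, same seed, different sort order: compare which examples
# ended up in the training subset under the two orderings.
print(set(train_a) == set(train_b))
```

Under these assumptions, the same file sorted the same way always yields the same partition, while a different sort order changes which examples the tree is trained on.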

Since variable importance is calculated based on how much each variable contributes to reducing the predictive error of the model, you will see different results each time for the same reasons.
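
To make "reducing the predictive error" concrete, the following is a rough sketch of one common way to score a categorical variable: the drop in the sum of squared errors when the target is predicted by the group mean instead of the overall mean. The function names and toy rows are hypothetical, and the exact formula RDS uses is not stated in this thread.

```python
from collections import defaultdict

def sse(values):
    """Sum of squared errors around the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def error_reduction(rows, column, target=-1):
    """SSE before splitting minus SSE after grouping by one categorical column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[target])
    before = sse([row[target] for row in rows])
    after = sum(sse(group) for group in groups.values())
    return before - after

# Toy rows: (variable_0, variable_1, target)
rows = [("a", "x", 1.0), ("a", "y", 3.0), ("b", "x", 5.0), ("b", "y", 7.0)]

print(error_reduction(rows, 0))  # 16.0
print(error_reduction(rows, 1))  # 4.0
```

On a fixed set of examples this score does not depend on row order. The variation comes from which examples are sampled before the score is computed.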

If you want to see a more stable variable importance score, better reflecting the information in your data, I suggest that you try using an ensemble of multiple trees. You can start with, e.g., 25 or 50 trees. Ensembles are generally more robust than a single tree, because averaging over many trees reduces the influence of any one random partition.
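
The ensemble idea can be sketched as follows, again in plain Python and not as RDS's implementation. Each "tree" here is simplified to a single multiway split per variable on a bootstrap sample, and the importance is the normalised error reduction averaged over all trees. The names `stump_importance` and `ensemble_importance`, the seeds, and the generated toy data are assumptions for illustration only.

```python
import random
from collections import defaultdict

def sse(values):
    """Sum of squared errors around the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def stump_importance(rows, n_features):
    """Normalised error reduction per variable on one sample (a one-split 'tree')."""
    total = sse([row[-1] for row in rows])
    reductions = []
    for col in range(n_features):
        groups = defaultdict(list)
        for row in rows:
            groups[row[col]].append(row[-1])
        reductions.append(total - sum(sse(g) for g in groups.values()))
    norm = sum(reductions) or 1.0  # guard against a degenerate sample
    return [r / norm for r in reductions]

def ensemble_importance(rows, n_features, n_trees=25, seed=0):
    """Average the per-tree importances over bootstrap samples of the rows."""
    rng = random.Random(seed)
    sums = [0.0] * n_features
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap: draw with replacement
        for col, imp in enumerate(stump_importance(sample, n_features)):
            sums[col] += imp
    return [s / n_trees for s in sums]

# Toy data: variable 0 drives the target, variable 1 is noise.
data_rng = random.Random(1)
rows = []
for _ in range(60):
    a = data_rng.choice("ABC")
    b = data_rng.choice("XY")
    rows.append((a, b, {"A": 1.0, "B": 5.0, "C": 9.0}[a] + data_rng.random()))

print(ensemble_importance(rows, n_features=2, n_trees=25))
print(ensemble_importance(rows, n_features=2, n_trees=50))
```

Comparing the scores from a single tree (`n_trees=1`) across several seeds with those from 25 or 50 trees is one way to see how much the averaging steadies the result on your own data.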

Good luck!

/Per