
Re:analysis - Phenol toxicity classification

Many of our users use RDS for predictive modelling of relationships between chemical structures and properties that are typically measured in the laboratory, so-called Quantitative Structure-Activity Relationship (QSAR) modelling. In this first issue we re-analyse a QSAR problem that we published earlier this year.

For the re-analysis we use a basic set of compound descriptors calculated with the publicly available software JOELib. This means that this time, in addition to providing the structures, we can also provide the data set in exactly the format used here.

The data set

Phenols are commonly used substances both in industry and in consumer products around the world. A number of different mechanisms of toxic action have been related to the phenol class of compounds.

The data set consists of 220 phenols for which toxicity data for the ciliate protozoan Tetrahymena pyriformis are available. The objective of the modelling is to identify the correct mechanistic class of toxic action for each compound.

The structures and assigned mechanisms of toxic action are taken from reference 2. The class distribution of the phenols is summarised in Table 1.

Table 1: Class distribution
MOA class                         Group 1  Group 2  Total  Fraction of total
Polar narcotics                   75       77       152    69.1 %
Precursors to soft electrophiles  13       14       27     12.3 %
Soft electrophiles                11       12       23     10.5 %
Weak acid respiratory uncouplers  9        9        18     8.2 %
Total                             108      112      220    100.0 %

As we can see, the examples of this data set belong to four different classes, of which one is much more abundant than the others (69 %). The minority class, "weak acid respiratory uncouplers", only constitutes about 8 % of the data.
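The fractions in Table 1 can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are our own and are not part of RDS.

```python
# Counts per class in validation groups 1 and 2, copied from Table 1.
counts = {
    "Polar narcotics": (75, 77),
    "Precursors to soft electrophiles": (13, 14),
    "Soft electrophiles": (11, 12),
    "Weak acid respiratory uncouplers": (9, 9),
}

# Total number of compounds per class and in the whole data set.
totals = {name: g1 + g2 for name, (g1, g2) in counts.items()}
n = sum(totals.values())

# Share of each class in percent of the whole data set.
fractions = {name: 100.0 * t / n for name, t in totals.items()}

for name, t in totals.items():
    print(f"{name:34s} {t:4d} {fractions[name]:5.1f} %")
print(f"{'Total':34s} {n:4d}")
```

The largest class is roughly eight times the size of the smallest one (152 versus 18 compounds), which is the skew that the class-weights are meant to address.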

In addition to the descriptors calculated with JOELib, we also keep the descriptors used in the original publication.

To validate their models, the authors of the publication from which we obtained the data set divided it into two groups. Since we want to compare our results with theirs, we use the same two groups and validate our models using group-wise cross-validation, with the group membership variable set to the data type 'group' in RDS.
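To illustrate the idea of group-wise cross-validation, here is a minimal sketch in Python. It is not how RDS implements the procedure; the function name group_wise_folds and the six-compound group column are hypothetical and serve only as an example. Each group is held out once for testing while the models are built on the remaining group(s).

```python
from collections import defaultdict


def group_wise_folds(groups):
    """Yield (held_out_group, train_indices, test_indices) once per group label."""
    by_group = defaultdict(list)
    for index, label in enumerate(groups):
        by_group[label].append(index)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        # Train on every example that does not belong to the held-out group.
        train_idx = [
            i
            for label, indices in sorted(by_group.items())
            if label != held_out
            for i in indices
        ]
        yield held_out, train_idx, test_idx


# Hypothetical group column for six compounds (1 or 2, as in the data set).
groups = [1, 2, 1, 2, 2, 1]
folds = list(group_wise_folds(groups))
for held_out, train_idx, test_idx in folds:
    print(f"test on group {held_out}: train={train_idx} test={test_idx}")
```

With two groups this gives exactly two folds, so every compound is predicted once by a model that never saw any compound from its own group.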

Analysis

The data is imported into RDS. Once imported, the column called name can be used to identify the compounds by setting its data type to id. The column called group contains the validation group from the original publication. We want to use it to partition the data for group-wise cross-validation, so change its data type to group and the validation method to Group-wise cross-validation. The columns number, MDLNUMBER, and log 1/IGC50 ([mmol/L]) are omitted from the analysis by setting their data type to ignore. All other columns are kept as numeric, which is what RDS guesses during import.

[Figure: Data types]

It is always interesting to see how much performance is improved when going from a single tree model to an ensemble. Therefore, first add a custom tree method, and then add a custom ensemble method. The custom methods are used because we would like to use class-weights since the class distribution is skewed in this data set.

Change the class-weights to 10 for the class 'Weak acid respiratory uncouplers', 5 for the classes 'Precursors to soft electrophiles' and 'Soft electrophiles', and to 2 for 'Polar narcotics'. These are the same class-weights as we used in the first analysis of this data set. How to best choose class-weights will be covered in a later article.
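To get a feeling for what these class-weights do, the sketch below assumes that each example simply counts with its class-weight, which is a common way of applying class-weights in tree learners. Whether RDS applies the weights in exactly this way is an assumption on our part; the class sizes are taken from Table 1.

```python
# Class sizes from Table 1 and the class-weights chosen above.
class_sizes = {
    "Polar narcotics": 152,
    "Precursors to soft electrophiles": 27,
    "Soft electrophiles": 23,
    "Weak acid respiratory uncouplers": 18,
}
class_weights = {
    "Polar narcotics": 2,
    "Precursors to soft electrophiles": 5,
    "Soft electrophiles": 5,
    "Weak acid respiratory uncouplers": 10,
}

# Assumption: every example counts with its class-weight, so the effective
# size of a class is (number of examples) * (class-weight).
weighted = {name: class_sizes[name] * class_weights[name] for name in class_sizes}
raw_total = sum(class_sizes.values())
weighted_total = sum(weighted.values())

for name in class_sizes:
    raw_share = 100.0 * class_sizes[name] / raw_total
    weighted_share = 100.0 * weighted[name] / weighted_total
    print(f"{name:34s} raw {raw_share:5.1f} %  weighted {weighted_share:5.1f} %")
```

Under this assumption the share of the majority class drops from 69.1 % to about 41.4 %, while the share of the smallest class rises from 8.2 % to about 24.5 %.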

Do not change any other parameters of the methods. RDS will now apply each of the two methods twice (once per group), creating four models: two decision trees and two ensembles of trees. Each model created from one group is assessed on the phenol compounds in the other group. Press the Start button to run the experiment. It should take about 15 seconds to complete.

Results

Table 2: Method performance

Rows give the correct class. The columns PN, PSE, SE, and WARU give the number of compounds per predicted class (PN = Polar narcotics, PSE = Precursors to soft electrophiles, SE = Soft electrophiles, WARU = Weak acid respiratory uncouplers).

Method: Tree. Accuracy: 86.364 %. Total AUC: 0.929.
Correct class                     Precision  Recall  AUC    PN   PSE  SE  WARU
Polar narcotics                   0.917      0.947   0.926  144  5    0   3
Precursors to soft electrophiles  0.737      0.519   0.898  12   14   0   1
Soft electrophiles                0.941      0.696   0.981  0    0    16  7
Weak acid respiratory uncouplers  0.593      0.889   0.936  1    0    1   16

Method: Ensemble. Accuracy: 90.909 %. Total AUC: 0.981.
Correct class                     Precision  Recall  AUC    PN   PSE  SE  WARU
Polar narcotics                   0.966      0.934   0.982  142  9    1   0
Precursors to soft electrophiles  0.735      0.926   0.986  2    25   0   0
Soft electrophiles                0.833      0.87    0.97   1    0    20  2
Weak acid respiratory uncouplers  0.867      0.722   0.985  2    0    3   13
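
The accuracy, precision, and recall values in Table 2 follow directly from the counts of correct versus predicted classes. The Python sketch below recomputes them. The function name summarize is our own, and the AUC values cannot be recomputed this way because they depend on the predicted class probabilities, which the table does not contain.

```python
CLASSES = [
    "Polar narcotics",
    "Precursors to soft electrophiles",
    "Soft electrophiles",
    "Weak acid respiratory uncouplers",
]

# Rows: correct class, columns: predicted class (both in the order of CLASSES).
TREE = [
    [144, 5, 0, 3],
    [12, 14, 0, 1],
    [0, 0, 16, 7],
    [1, 0, 1, 16],
]
ENSEMBLE = [
    [142, 9, 1, 0],
    [2, 25, 0, 0],
    [1, 0, 20, 2],
    [2, 0, 3, 13],
]


def summarize(confusion):
    """Return (accuracy in percent, {class name: (precision, recall)})."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    per_class = {}
    for i, name in enumerate(CLASSES):
        predicted = sum(confusion[r][i] for r in range(k))  # column sum
        actual = sum(confusion[i])  # row sum
        precision = confusion[i][i] / predicted if predicted else 0.0
        recall = confusion[i][i] / actual if actual else 0.0
        per_class[name] = (precision, recall)
    return 100.0 * correct / total, per_class


for label, matrix in (("Tree", TREE), ("Ensemble", ENSEMBLE)):
    accuracy, per_class = summarize(matrix)
    print(f"{label}: accuracy {accuracy:.3f} %")
    for name, (precision, recall) in per_class.items():
        print(f"  {name:34s} precision {precision:.3f}  recall {recall:.3f}")
```

Recall is the diagonal count divided by its row sum (all compounds that truly belong to the class), and precision is the diagonal count divided by its column sum (all compounds predicted to belong to the class).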

These results are comparable to those obtained previously. The single-tree method gives better results than we obtained before, and the ensemble method is slightly worse than before, but the ensemble is still better than the tree (90.9 % versus 86.4 % accuracy, and a total AUC of 0.981 versus 0.929).

The two trees that are generated are slightly different, but they have comparable performance and partition the data set into similar groups, although they use different variables.

The ensemble gives us a more complete insight into the most important variables, highlighting the same variables as previously identified: HOMO and LUMO energies, pKa, and the number of hydrogen donors. The ensemble models created for the two groups both highlight the same variables as the most important.


Lessons learned

By constructing an ensemble model consisting of multiple decision trees, the predictive performance can be increased.
RDS is able to gracefully handle skewed class distributions by using class-weights.
Multiple decision trees constructed from similar data can look very different when looking at the variables used, in particular when using many variables in relation to the number of examples, but may still correspond to the same groups in the data.
Ensembles are more robust than trees with regard to handling data with many variables.
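
As an illustration of how an ensemble can combine several trees, the sketch below uses a simple plurality vote over the class predictions of the individual trees. This is a generic technique and an assumption on our part, not a description of how RDS combines its trees; the function name majority_vote and the example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by most trees for one compound.

    Ties are broken in favour of the class that appears first in the list.
    Raises ValueError if the list of predictions is empty.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical predictions from three trees for a single phenol.
votes = ["Polar narcotics", "Soft electrophiles", "Polar narcotics"]
print(majority_vote(votes))
```

Because the individual trees may split on different variables, a single unusual split in one tree is outvoted by the others, which is one intuition for why ensembles tend to be more robust.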

Download the experiment including the data set as an RDS experiment file. Structures can be obtained here.

Further reading:

1. Norinder, U., Lidén, P., and Boström, H. Discrimination between modes of toxic action of phenols using rule based methods. Molecular Diversity 2006; 10:207-212
2. Aptula, A.O., Netzeva, T.I., Valkova, I.V., Cronin, M.T.D., Schultz, T.W., Kühne, R. and Schüürmann, G., Multivariate Discrimination between Modes of Toxic Action of Phenols, Quant. Struct.-Act. Relat., 21 (2002) 12-22.
3. Wegner, J.K., JOELib, a Java based computational chemistry (cheminformatics) library. Version: 2004-01-16, 2004. http://www-ra.informatik.uni-tuebingen.de/software/joelib/index.html