NATreeClassifier
From CVRG Wiki
Contents |
NA Tree Classifier Doc
Document Information
Package base version 2.8.0
4/19/2009
Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: yqin@jhu.edu
CardioVascular Research Grid
Johns Hopkins University
NA Tree Classifier Summary
In the traditional classification tree modeling, it is assumed that there are no missing (NA) data, or is at least relatively rare. data. For the Reynolds cardiovascular datasets we have been studying, missing data are the rule rather than the exception. For example, imaging data is available only for a small subset of subjects. To address this situation we developed a classification method which we refer to as the NA-Tree Classifier. It makes use of traditional classification tree concepts. However, instead of creating a decision tree with the familiar binary structure, it creates a ternary tree, adding an extra NA branch and corresponding node associated with subjects missing data for a the variable defining a given split.
For instance, when we ask subjects survey questions there may be informative responses such as "Yes", "No", but there is also the possibility of noninformative responses such as "Not Applicable" or "Not available". For informative responses, we can follow the the traditional tree approach and create branches and child nodes for each possible response. For the latter two responses, we don't know what to do with such subjects in the traditional setting. However we solve this in the NA tree approach by categorize subjects intoto the NA node.
Conceptually, NA trees use the same ideas as traditional classification trees, but making use of a possible third branch (NA branch) at each node gives us much more flexibility when dealing with incomplete data. The method described here should prove useful in large medical studies when subsets of the subject population are chosen for particular analyses.
NA Tree Structure
As described above, each non-leaf node of the NA tree has at most three branches. Subjects traverse the tree starting at the root node. Each non-leaf node has an associated feature variable and a threshold. A subject is passed to one of the node's children as follows. If the subject's feature value is available and less than or equal to the threshold, it passes to the left child. If it is available and greater than the threshold it passes to the right child. If the feature value is unavailable for the subject, then it passes to the middle (NA )node..
A typical NA tree would have the following appearance:
As described in Section 2.2 entitled "Pruned Tree Classifier" of the Painted Tree Classifier section, our sample size is somewhat limited so in order to avoid over-fitting we restrict trees to have a maximum depth of two, meaning that classification of a subject involves decisions made in at most two nodes. In instances where the sample size is larger higher depth trees should certainly be considered.
Special Adjustment for Root-NA-NA Node
Most of the NA trees resemble the one shown above, with the only exception being that, for the split Root -> NA -> NA, although it is in depth of 2, the real depth is still 0, that is, it has not been split by any real feature at all. So if it has impurity at that node, we split it again to further classify that part of data. If node * (indicated in the previous graph) has both firing and not firings (i.e. impurity), the NA tree will have the following appearance:
Categorical Feature
Some of the features we use are categorical variables, for example, sex and blood type. To incorporate such features in the NA tree, we create binary variables associated with each possble variable value, and split the tree based on criterion feature equals value vs. feature does not equal value, and of course NA.
The discrete tree will look like:
Note: In the above figure "!=" is used to denote "not equal to".
Splitting Criterion
At each node, we need to find an optimal split, which amounts to picking a best feature and a threshold (or value to split on in the categorical case). For this, we use the entropy criterion in selection. We try all the possible features and a large collection of thresholds (or values), and we calculate the entropy decrease caused by each split. We then select the feature and threshold (value) pair giving rise to the largest entropy decrease. If none of the choice leads to a decrease, or if the decrease in entropy is not sufficiently large, the node is made into a leaf.
We define the entropy at each node as:
Where
Definitions of entropy and other relevant concepts are introduced in Document 2. "Feature Selection Instruction", 1.1 "Information Theory, Entropy, CMI".
And the entropy after splitting is defined as:
Where
From the definition, we know that
is weighted average of the three child nodes. It describes the average entropy these three nodes carry.
Naturally, the entropy difference is:
Finally, we pick the feature i and threshold t which maximize D at each node.
We illustrate the construction with an example. We reach a node (parent node) where there are 10 firings and 100 not firings. After we split the node with an arbitrary feature and threshold, we arrive at the following tree:
Then the entropy at parent node is:
Follow the same formula, we obtain entropies at left node, NA node and right node as respectively, 0.7219281, 0.6500224 and 0.2472944.
We get the entropy after split is
The difference (i.e. entropy decrease) D is =0.439497-0.3990997=0.0403973
We repeat the same procedure for every possible feature and threshold, then pick the pair with the largest D. That is our best split selection at that node.
Threshold List
For numeric features, it is inefficent to try every possible real number as a threshold. In addition, allowing for every possible thresold is likely to lead to over-fitting. Instead, we calculate the 20%, 50% and 80% percentiles of the feature in training data and use these as thresholds.
For discrete feature, we simply use every possible value (or level) as threshold. For each level in that feature, the splitting criterion becomes: 1. feature A = level i; 2. feature A != level i; 3 feature A is NA.
Depth, Real Depth and Label
As discussed in section 1.1.1, it is important to distinguish between the depth of a node in an NA-tree (tree depth) and the decision depth, i.e. the number of decisions that have actually been made based on available features to get to a given node.
An graph illustration will make this more sensible.
Since there are so many nodes and features, it is very hard to keep track where the each observation is classified. So we assign numerical labels to the nodes. The labels are shown in the graph. These labels are fixed and do not change when we change datasets. Thus, n the future, Node 5 is always referring to the node which has first split to left (<= for numeric feature and = for discrete feature) and then to right (> for numeric feature and != for discrete feature). And node 12 will always referr to the NA-NA node.
If node 12 has impurity, we will further split it and label the three new nodes as node 13(left, <=), 14(right, >), 15(middle, NA).
Pre-selected Features
Our CVRG dataset contains approximately 100 features. Given the available sample size, it is no surprising that using all of these features for our NA tree produces poor cross-validated classification rates. So, instead, we pre-select roughly 12 features among the 100 based on cardiovascular experts' judgement, and perform the analysis based on these 12 features.
These features are "meanHRbpm", "percPVC", "QTVI_log", "LFdivHFPow", "SDNN_RST_msec", "snp2", "snp5", "snp6", "Gender.x", "Inducible", "LVEF", "DEmass". We will update our pre-selected feature list according to the result. We will update our pre-selected feature list according to the result.
We note the full feature lists of all the important numeric features and discrete features here:
Discrete list contains: "snp1", "snp2", "snp3", "snp4", "snp5", "snp6", "Gender.x", "Race", "cardiom.type", "Inducible".
Continuous list contains: "meanHRbpm", "perckept", "percPVC", "QTVI_log", "QTVInumer.1000", "QTVIdenom.1000", "QTV_msecsqu", "HRV_bpmsqu", "QTintmean_sec", "HRmean_bpm", "QTRRslope", "QTRRintercept", "QTRR_r2", "meancoh", "VLFPow", "LFPow", "HFPow", "TotPow_clinical_use", "TotPow_mathematical", "LFdivHFPow", "LFdivHFplLF", "RMSSD_msec", "pNN50", "meanNN_msec", "SDNN_msec", "total_RRvariance_msec", "SDNN_RST_msec", "sdRLTESQTintmsec", "avgNssthoseelim", "avgNssthosekept", "LVEDV", "LVESV", "LVEF", "LVEDMASS", "DEmass".
Data
In our analysis, we have total number of observations of 937. After our data cleaning procedure, i.e. removing observations with NA implant dates, and removing observations with negative days of observations. These two steps removed 343 observations, and left us 594 observations. We use as a phenotype "fired within ndays of implant." Since the phenotype is based on observing individuals for at least ndays, in order to eliminate sampling bias, we must exclude all subjects who have not implanted at least ndays ago. We then build up our NA tree based on the response variable "fired" using the other features. The phenotype "fired" is define to mean d$fired<-d$Days.To.First.App.Firing<=ndays, where ndays is a pre-specified number. This means that if the days of firing is smaller than ndays, we code a new variable "fired" 1, and otherwise 0. The value of ndays value can be changed depending on the interest of the researcher's goal. We have focused on a vaue of ndays of 545, i.e. one and one half years. There are multiple reasons why we use the 545 day criterion. One point to make is that if we choose too small time of observation criterion we end up with too small a number of firings. On the other hand if we require that subjects need to be observed for too long an amount of time to be included, then we lose sample size.
For example, if we include all the patients who had fired before 545 days, then both patient X and Y should be included since they both have days of firing less than 545 days (300 and 500 respectively). However, only patient X is observable to us and it is impossible to observed patient Y. It is because that we only observed patient Y for 400 days and cannot know whether he fires after 400 days. So without being able to include both patient X and Y, we are introducing bias into our model.
Note:
Days of firing are independent of days of observations.
So, to sum up, in order to analyze firings happened before 545 days, we have to exclude all the observations with days of observations less than 545 days. So observations left will have either days of firing before or after 545 days, but all of them have days of observations larger than 545 days.
n-Fold Cross Validation
As we did for every analysis, we need to do cross validation for this method. The definition of cross validation and basic method is explained in document 5, "Painted Tree Classifier", 1.3. "Leave One Out Cross Validation (LOOCV)".
Cross validation is a method in machine learning to test the performance of a statistical analysis. It is mainly used when the statistical analysis's goal is prediction and one wants to test its accuracy. The routine procedure is that first partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set or learning set) to get the model and parameters, and validating the model on the rest dataset (called testing set).
The n-fold cross-validation is basically partitioning the original dataset into n subsets (approximately equal size). One of them is drawn as the testing set, and the remaining n−1 subsets are used as training set. The cross-validation process is then repeated n times (the folds), with each of the subset used exactly once as the testing set. Then the n results from each validation can be averaged to produce a single evaluation. The advantage of this method is that all observations are used for both training and validation, and each observation is used in testing set exactly once.
The leave one out cross validation mentioned in "Painted Tree Classifier" is the special case of n-fold cross validation with n being the sample size.
In our analysis, we used 8 fold cross validation.
Program Description
Since we changed the structure of the tree model, there is not an existing software package to do such analysis. We program our own NA tree from scratch, which involves many functions. I am going to introduce them one by one with basic input, output and functions.
Fucntion tree.init()
This function basically takes a dataset which contains features and response variable, and creates a tree frame. Please note that this function does not really generate the NA tree, instead, it initiates the tree model by building up the starting point of a NA tree root and the tree frame which tree.grow() will be base on.
We can see below that the output of tree.int() is just a table with one entry (root node).
Output of tree.int()
node.label depth real.depth split.var.name split.var.value dec.rule
1 0 0 0 <leaf> NA NA
parent.node call count.fired count.nonfired
1 NA FALSE 26 411
As we can see, the dataset has 411+26=437 patients (or observations), of which 26 patients have fired and 411 patients have not fired.
Function tree.grow()
This function takes a dataset and a tree frame which is created by tree.int(), and generates the whole NA tree and also the classification of the training data. This is the biggest function since it calls nearly all the rest of functions at some point during the program. It can be regard as an outline of the whole procedure, while each other function just does one single step of the procedure.
Output Description
We need to carefully explain the output of this function since it is most important information our model gives. It is crucial to fully understand the output. For example, we build up a tree and print the output.
> tree.d<-tree.init(d) > tree.struct.d <- tree.grow(d, tree.d, 1, features, 0.5) > tree.struct.d
Output:
$frame
node.label depth real.depth split.var.name split.var.value dec.rule
1 0 0 0 snp2 AA <NA>
2 1 1 1 LFdivHFPow 0.8012 ==
3 2 1 1 meanHRbpm 83.8006 !=
4 3 1 0 LVEF 34.62 <NA>
5 4 2 2 <leaf> <NA> <=
6 5 2 2 <leaf> <NA> >
7 6 2 1 <leaf> <NA> <NA>
8 7 2 2 <leaf> <NA> <=
9 8 2 2 <leaf> <NA> >
10 9 2 1 <leaf> <NA> <NA>
11 10 2 1 <leaf> <NA> <=
12 11 2 1 <leaf> <NA> >
13 12 2 0 <leaf> <NA> <NA>
parent.node call count.fired count.nonfired
1 NA FALSE 26 411
2 0 FALSE 2 109
3 0 FALSE 22 219
4 0 FALSE 2 83
5 1 FALSE 2 13
6 1 FALSE 0 57
7 1 FALSE 0 39
8 2 FALSE 13 153
9 2 FALSE 7 35
10 2 FALSE 2 31
11 3 FALSE 0 21
12 3 FALSE 2 3
13 3 FALSE 0 59
$where
[1] 7 7 9 9 6 6 6 6 7 6 7 12 9 6 12 7 12 9 9 12 6
[22] 9 6 6 6 10 6 9 6 9 6 6 6 9 9 9 6 12 9 9 9 7
[43] 7 6 9 6 9 9 7 9 7 7 9 12 9 7 7 6 7 9 9 7 6
[64] 9 7 7 7 7 8 9 12 12 7 7 8 6 7 6 9 7 9 7 9 6
[85] 6 7 6 7 6 6 7 5 7 5 7 7 7 5 8 7 8 10 9 8 6
[106] 8 9 7 7 6 7 6 7 7 7 7 8 7 5 7 4 7 8 4 8 7
[127] 7 7 5 6 7 6 8 8 6 6 8 5 4 7 5 8 9 7 7 7 7
[148] 4 8 5 6 5 7 7 7 8 9 7 7 7 8 7 7 7 4 7 8 7
[169] 4 7 7 5 7 5 5 4 7 7 8 7 7 10 5 8 12 7 7 12 7
[190] 7 7 7 5 7 7 7 7 5 7 7 7 7 7 5 10 7 12 12 7 12
[211] 7 7 7 7 7 7 7 8 5 7 7 8 7 7 7 5 5 5 10 7 7
[232] 12 8 5 7 8 9 6 8 12 7 8 8 8 5 5 5 4 7 7 7 8
[253] 7 7 4 8 7 7 5 7 12 5 7 7 8 7 4 7 8 5 8 7 7
[274] 8 5 8 7 7 7 5 7 7 12 7 5 7 7 5 7 5 5 7 7 12
[295] 5 5 8 7 8 7 7 7 7 7 8 7 5 5 5 7 8 12 5 7 8
[316] 4 5 7 4 7 10 4 5 7 7 6 9 8 7 7 5 8 7 8 7 4
[337] 5 7 7 7 5 7 5 7 7 12 7 5 12 5 7 12 10 12 12 12 12
[358] 12 12 12 12 12 12 11 10 12 12 12 12 12 10 12 10 12 12 11 12 12
[379] 11 12 10 12 12 12 12 12 10 11 11 12 10 10 12 12 10 10 12 10 12
[400] 10 10 12 7 7 6 7 7 5 7 12 5 7 5 5 5 7 5 10 7 7
[421] 7 5 6 7 7 7 5 7 4 7 5 5 12 10 12 12 12
This is the main output. The field $frame give the structure of the NA tree using a table. Since R cannot draw a tree with arrows and lines, we can only express the tree by using a table. For each row in the output table, it stands for node and its related information. For example, there are some basic information about the node label, its depth and real depth. It also shows whether it is a leaf or not. If it is not a leaf, it shows what feature it uses to split this node along with its corresponding threshold. If the node is not root, it shows how this node is split from its parent node (i.e. dec.rule) and parent node label. In the end, it shows how the firings and not firings are distributed at each node, in order to give people an idea how the NA tree is splitting.
Each table above can be translated to a graph below, which indicates an exactly same NA tree structure.
The second part of the output is where the training sample is classified. Since only node 4 to node 12 are leaves, the 437 patients are classified into this 9 leaves.
Function best.split()
This function is used at each node. It takes the dataset at that node and a list of features, and returns the best feature and threshold based on the criterion mentioned in the previous section. It is one of most important function in this NA tree model.
Functions split.cont() and split.discrete()
These two functions are used after best.split() has decided which feature and threshold to use. They are also used at each node. They take the feature and threshold and split the dataset at each node.
In the output table of tree.grow(), each row is actually generated by either split.cont() or split.discrete() depending on whether it is a numeric or discrete feature.
Function impurity.at.node()
This function is used to calculate the entropy at each node. It just does what the formula of
does. The return of this function is one number and it takes two numbers, which are numbers of firings and not firings.
Function predict.na.tree()
This function is used to predict classification of new observations or patients. It takes the existing NA tree model and a vector or matrix of new observations, and returns the classification of these observations. It basically goes through the NA tree from top to bottom for every single observation and record the node that observation ends up with.
Function draw.tree() and draw.node()
These two functions print the NA tree in a more fashion way. It is easier to distinguish between different depth and understand the tree structure. Function draw.tree() print the whole NA tree, while draw.node() only print one node. For example, some common output looks like the following (It is the same NA tree developed above):
> draw.tree(tree.struct.d$frame)
Node 0 : Root node Fired 26 Nonfired 411 Total 437
Node 1 : snp2 == AA Fired 2 Nonfired 109 Total 111
Node 4 : LFdivHFPow <= 0.8012 Fired 2 Nonfired 13 Total 15
Node 5 : LFdivHFPow > 0.8012 Fired 0 Nonfired 57 Total 57
Node 6 : LFdivHFPow is.na Fired 0 Nonfired 39 Total 39
Node 2 : snp2 != AA Fired 22 Nonfired 219 Total 241
Node 7 : meanHRbpm<=83.8006 Fired 13 Nonfired 153 Total 166
Node 8 : meanHRbpm > 83.8006 Fired 7 Nonfired 35 Total 42
Node 9 : meanHRbpm is.na Fired 2 Nonfired 31 Total 33
Node 3 : snp2 is.na Fired 2 Nonfired 83 Total 85
Node 10 : LVEF <= 34.62 Fired 0 Nonfired 21 Total 21
Node 11 : LVEF > 34.62 Fired 2 Nonfired 3 Total 5
Node 12 : LVEF is.na Fired 0 Nonfired 59 Total 59
> draw.node(tree.struct.d$frame,2)
Node 2 : snp2 != AA Fired 22 Nonfired 219 Total 241
Data Preparation
In order to prepare data to be used in the NA tree model, we need the following code to organize data. The following part of code clears the data and generates the key variable (¡°fired¡±) on which our NA tree model develops. The data.csv file is just the file we get after merging all the datasets.
d <- read.csv("./data.csv",as.is=TRUE)
# keep only the implanted subjects with positive days of observation
d <- d[!is.na(d$Implant.Date),]
d <- d[d$Days.Of.Observation>0,]
# keep only subjects with at least ndays of observation
ndays<-545
d <- d[d$Days.Of.Observation>=ndays,]
#
# Create a "fired" [logical] variable
d$fired<-d$Days.To.First.App.Firing<=ndays
R Code
Commented Code
To view the full commented R Code, Click Here








