QuadrantTreeClassifier

From CVRG Wiki

Jump to: navigation, search

Contents

Quadrant Tree Classifier Doc

Document Information
Package base version 2.8.0
1/5/2009
Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: yqin@jhu.edu
CardioVascular Research Grid
Johns Hopkins University

Quadrant Tree Classifier Summary

The quadrant tree classifier is based on the idea of selecting a pair of numerical or ordered features x, y, choosing thresholds c, d for each feature breaking up the feature space into four quadrants based on which inequalities x<c, x>c and y<d, y>d hold. Such a classifier amounts to a tree (stub) of depth two in which the same classification criterion for splitting is used for the two nodes at depth two. Constraining the available trees in this manner reduces the level of complexity of the space of allowable classifiers and can thus help to avoid over-fitting while providing a search space that is computationally simpler to exhaustively search.

In addition, our aim is to produce a classifier having a high degree of sensitivity with optimum specificity, so build these criteria into the search through the classifier space, identifying what is an optimum pair of features in the sense of maximizing apparent specificity subject to the constraint that the apparent sensitivity is above a prescribed level.

The search for feature pairs and corresponding thresholds requires calculations involving likelihoods as well as survival probabilities. We will break the discussion of these two aspects of the problem and introduce them one by one in the following sections.


Sensitivity and Specificity

The most important concepts in quadrant tree classifier are sensitivity and specificity. They are statistical measures of the performance of a classification test. The sensitivity measures the proportion of actual positives which are correctly identified as such (or, in our case, the percentage of "fired" patients who are identified as "fired"); and the specificity measures the proportion of negatives which are correctly identified (i.e. the percentage of "not-fired" patients who are identified as "not-fired"). They are closely related to the concepts of type I and type II errors.

Consider, as an example, a scenario we have been investigating, in which heart disease patients implanted with cardioverter-defibrillators. Firing of the device is used by us as a surrogate phenotype for sudden cardiac death. When this occurs, we say that the device has "fired". We are trying to build a model to identify these subjects who will "fire". The model outcomes for patients are "fired" or "not-fired". Since the model is not 100% accurate, the actual "fired" (or "not-fired") status of each person may be different. In that setting:

  • True "fired": Patients whose devices fail are correctly identified as "fired"
  • False "fired": Patients whose devices do not fail are wrongly identified as "fired"
  • True "not-fired": Patients whose devices do not fail are correctly identified as "not-fired"
  • False "not-fired": Patients whose devices fail are wrongly identified as "not-fired"

Then,

Sensitivity = number of true "fired" / (number of true "fired" + number of false "not-fired") i.e. patients who are correctly diagnosed as "fired" / "fired" patients in total

Specificity = number of true "not-fired" / (number of true "not-fired" + number of false "fired") i.e. patients who are correctly diagnosed as "not-fired" / "not-fired" patients in total

We can put them in a table.

"fired" patients

"not-fired" patients

Identified as "fired"

A = True "fired"

B = False "fired"
(Type I error,)

Identified as "not-fired"

C = False "not-fired"
(Type II error)

D = True "not-fired"

Sensitivity =
A / (A + C)

Specificity =
B / (B + D)


Conditional Probability

Conditional probability plays an important role in our quadrant tree classifier. Conditional probability is the probability of some event A, given the occurrence of some other event B, denoted as Image:qtclass_image002.gif.

Conditional probability can be calculated as: Image:qtclass_image004.gif

The numerator Image:qtclass_image006.gif is called joint probability. That is, it is the probability of both events occurring. It is also written as Image:qtclass_image008.gif.

Based on conditional probability, we can solve more complicated problems in our setting. For example, A is an event that can be caused by events Image:qtclass_image010.gif. If we are given that event A occurred, what is probability of Image:qtclass_image012.gif occurring, i.e. Image:qtclass_image014.gif? We can use conditional probability to solve this problem. Image:qtclass_image016.gif

Where Image:qtclass_image018.gif

Each term in the formula has a conventional name:

  • Image:qtclass_image020.gif is the conditional probability of Image:qtclass_image022.gif, given A. It is also called the posterior probability because it is derived from or depends upon the specified value of A.
  • Image:qtclass_image024.gif is the prior probability or marginal probability of Image:qtclass_image025.gif. It is "prior" in the sense that it does not take into account any information about A.
  • Image:qtclass_image027.gif is the prior or marginal probability of A.


Survival Analysis

The last piece of quadrant tree classifier is survival analysis. It is a part of statistics which specifically models death time and failure time, or more generally, the time to event data. In our case, fire or failure is considered an "event". In our case, survival analysis attempts to answer questions such as: what is the probability of patients not firing until a certain time? How do particular circumstances or individual characteristics increase or decrease the probability of firing?

The primary interest in survival analysis is survival function, conventionally denoted S, which is defined as

Image:qtclass_image029.gif

where t is a time, T is a random variable denoting the time of death. So the survival function is the probability that the time of fire (or failure) is later than some specified time.

There are many ways to estimate S(t). We will briefly discuss about them in the following sections:


Empirical Distribution Function

The first and simplest method is just using only those in study for at least some time t and taking the ratio between it and the total number of observations. (i.e. empirical distribution function)

Image:qtclass_image031.gif

This method is simplest, but not be the best. Subjects who have only been observed for s time units are removed from consideration if s<t. Thus we may be leaving out useful information. On the other hand, this method does lead to an unbiased estimate of the probability of firing before t time units since implant.


Kaplan-Meier Estimator, Censoring and Truncation

To solve the problem that empirical distribution function has, that is, being unbiased when patients are not observed for sufficient amount of time, we introduce another method --- Kaplan-Meier estimator (product limit estimator). It is developed to address the censored and truncated data, which is what we mostly deal with in the analysis.

Censoring (or right censoring) happens when patients are not observed for a sufficient amount of time, then we do not know whether and when they get "fired" or not. All we know is that they do not get "fired" when they are in the study. This is the most common type of censoring in our analysis since a lot of patients participate in the study only for a very short time, then decide to leave or lose contact with the hospital. Even so, their cases are still informative to us, thus we want to use their cases to give us more information about the firing. In order to put their cases in use, we only take them into consideration during the period when they are in the study. This is the basic idea of Kaplan-Meier estimator. And this is also how Kaplan-Meier is better than empirical distribution function.

Kaplan-Meier estimator apparently uses more information than empirical distribution function, which makes it more efficient. Kaplan-Meier estimator takes these patients who leave the study into consideration. For example, to estimate a survival probability at a particular time t, empirical distribution function only counts these who survive pass time t. And all the calculation is based on this group of patients. It ignores patients who drop off in the middle of the study (before time t). These patients may also survive pass time t, but they are assumed to have "fired" at the minute when they leave the study, which is an unreasonable assumption. On the other hand, to estimate the same probability, Kaplan-Meier estimator starts from the beginning time of the study and calculate the conditional survival probability at every possible times before t, and multiple them together to get a more accurate estimate, or in some sense, less biased estimate than empirical distribution function.


Kaplan-Meier estimator can be unbiased when under the assumption that censoring times are independent with "fired" times. When these two types of times are independent, then censoring do not give us any information about when these patients who get censored are going to get "fired". So we can treat censoring times as random times. But if they are dependent with each other, then censoring may mean something to us. For example, say patients tend to drop off the study when they feel bad and uncomfortable with their cardioverter-defibrillators, then there is a tendency that patients get "fired" soon after they drop off the study, or vice versa. If that is the case, then every conditional probability in the Kaplan-Meier estimator is biased, thus the Kaplan-Meier estimator is biased. In our analysis, there is no guarantee that this assumption is true, so we cannot say Kaplan-Meier estimator is unbiased either. But it at least uses all the information we have about firing, so Kaplan-Meier estimator is usually preferred over empirical distribution function.

Besides censoring, truncation (or left-truncation) is also a major problem in our analysis. It happens when some patients join the study a while after they are implanted with cardioverter-defibrillators, so we do not observe them before they join in, which means everything we observe from them are based on the fact that they join the study after time t. This gives us the problem that everything we get is conditional. This problem can also be solved by Kaplan-Meier estimator since it only takes patients into consideration when they are in the study. And everything in Kaplan-Meier estimator is conditional (conditional survival probability). So truncation can be easily solved by Kaplan-Meier estimator. But again, Kaplan-Meier estimator is only unbiased when truncation times are independent with "fired" times. The reason is similar to that of censoring which is mentioned in the previous paragraph. For example, if patients tend to go back to the hospital when they sense that their cardioverter-defibrillators are malfunctioning, then there is also a tendency that patients get fired soon after they join in the study. So Kaplan-Meier estimator is biased in that case.

In general, Kaplan-Meier estimator is a milestone for survival analysis. It is a very simple idea and yet improves the estimation quality by significant magnitude. The paper of Kaplan-Meier estimator is also one of the top 5 most referenced papers in this field.

To conduct Kaplan-Meier estimator, we first list all unique death times Image:qtclass_image033.gif, and get number at risk Image:qtclass_image035.gif and number of death Image:qtclass_image037.gif at each time. Number of risk Image:qtclass_image038.gif is the number of patients who are alive and in the study just before Image:qtclass_image040.gif. Number of death Image:qtclass_image041.gif is the number of patients who died at time Image:qtclass_image042.gif.

Then the Kaplan-Meier estimator is: Image:qtclass_image044.gif

If there is no censored or truncated data, then Image:qtclass_image046.gif. So Kaplan-Meier estimator becomes Image:qtclass_image048.gif. We can see that it is just the empirical estimate function.


Other Methods

Other than the methods mentioned above, there are still a great number of methods. For example, Proportional hazard rate model is also an important model. It is developed under the assumption that each individual's hazard rate function is proportional to the others' over the time. Under the proportional hazard rate assumption, we can assume there is a baseline hazard rate and everyone else's hazard rate functions are just proportional to the baseline. So we can just estimate the effect parameter(s) without any consideration of the hazard function. This approach is also called Cox proportional hazard rate model. Cox model is a semiparametric method since we only put the effect parameters in the model but the baseline has no parameters at all.


Quadrant Tree Classifier

With the introduction to all these fundamental knowledge, let's move to the quadrant tree classifier. Quadrant tree classifier is developed based on traditional tree model, with only two features and two thresholds. These thresholds break up the feature space into four quadrants based on which inequalities x<tx, x>ty, y<ty and y>ty holds, at mean time, the thresholds also break the dataset into four pieces which will be used for estimating survival probabilities. See the graph below.

x > tx
y > ty
x<tx, y>ty
x>tx, y>ty
x<tx, y<ty
x>tx, y<ty

Within each quadrant i (i=1,2,3,4), we conduct a survival analysis based on the dataset of that quadrant. So we can estimate the probability of fire before/after a certain point (a threshold we pre-specified), i.e. P(fired before threshold | quadrant i) and P(fired after threshold | quadrant i), or simply P(fired | quadrant i) and P(not-fired | quadrant i).

Since we divide the original dataset into four parts. We can estimate the probability of any individual falls into each of the quadrant by: Image:qtclass_image050.gif

Now we have P(fired | quadrant i) (from survival analysis) and P(quadrant i), we can apply the conditional probability to get P(quadrant i | fired) and P(quadrant i | not-fired), denoted as Pi and Qi respectively.

Image:qtclass_image052.gif

Image:qtclass_image054.gif

Now we have eight probabilities Pi and Qi i=1,2,3,4. We calculate the likelihood ratio (=Pi/Qi) for every quadrant i, and put them in descending order. For example, we have a set of Pi and Qi like this:

Quadrant

Pi

Qi

Likelihood ratio

1

0.5

0.25

0.2

2

0.2

0.1

2

3

0.35

0.35

1

4

0.4

0.3

1.33

After put them in descending order based on likelihood ratio, they become the following table:

Quadrant
Pi
Qi
Likelihood ratio
Indicator for specificity
(1-indicator for specificity)*Qi

2

0.2

0.1

2

1

0

4

0.4

0.3

1.33

1

0

3

0.35

0.35

1

0.86 = (0.9-0.6 )/0.35

0.05= (1-0.86)*0.35

1

0.05

0.25

0.2

0

0.25 = (1-0)*0.25

Sum (rows 1-2)

0.6

Sum (rows 1-4)

1

1

0.3 (specificity)

Starting with the quadrant which has the largest likelihood ratio, we take summation of their Pi's by including them in one by one based on the likelihood ratio order. We stop doing that when the summation reaches the largest number below the sensitivity (including sensitivity). In our case, we set our sensitivity to be 90%, so after we include quadrant 2 and 4, we stop, otherwise, the summation will go over 0.9.

Then we assign 1s to each quadrant whose Pi's are included in the summation, call it "indicator for specificity". For the quadrant who just missed to be included, we calculate the difference between the sensitivity and summation, divide it by Pi of that quadrant, and assign the ratio to the indicator. In our case, the summation is 0.6, the sensitivity is 0.9, the quadrant's Pi is 0.35. So, for quadrant 3, the indicator is 0.86 = (0.9 - 0.6 )/ 0.35. For the rest of quadrants, assign 0s for the indicator. In our case, it is quadrant 1.

The indicators mean that: if it is 1, then persons falling in this quadrant will be identified as "fired". If it is 0, then persons falling in this quadrant will be identified as "not-fired". If it is a fraction p, then persons falling in this quadrant will be randomly identified as "fired" with a probability p, and identified as "not-fired" with a probability 1-p.

After this setting, we can lock the sensitivity at 90%. It is because sensitivity is the probability of being identified as "fired" given "fired" patients, i.e. P(identified as "fired" I "fired" patients). So, in our case, it is can be written as:

Image:qtclass_image056.gif

With the fixed sensitivity, we can get the maximized specificity. Since specificity is the probability of being identified as "not-fired" given "not-fired" patients, i.e. P(identified as "not-fired" I "not-fired" patients). So it is can be written as:

Image:qtclass_image058.gif

Now, with a threshold and a sensitivity given, we can repeat the procedure above with different criterions (tx and ty) and find an optimal pair of tx and ty with the largest specificity. This is basically how the quadrant tree classifier work.


Quadrant Tree Classifier R Code

Click Here To view R Code for this project.

Personal tools
Project Infrastructures