


Statistical Learning with Multi-Scale Cardiovascular Data

Document Information
Package base version 2.8.0
Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email:
CardioVascular Research Grid
Johns Hopkins University


In large studies of cardiovascular disease, there are usually multiple sources of data for each subject, including ECG measurements, genotypes and various marker loci, proteomic analyses, MRI and CT images, age, gender, ethnicity, and results of additional clinical evaluations. The goal of this part of the CVRG project is to develop statistical tools that integrate available patient data from several such modalities and scales in order to classify patients according to their risk of sudden cardiac death (SCD), and to make these tools available to researchers in the cardiovascular community for analyzing their own data. We envision this grid becoming a widely used resource for collaboration among cardiovascular researchers, allowing them to share data, statistical methods, and software implementing the methods developed in this project.

Our team has developed new methodology using data from subjects in the Reynolds project. These subjects received implanted defibrillators because they were considered at high risk for SCD. Since these devices are implanted at considerable financial cost as well as inconvenience to patients, and can have a significant effect on quality of life, it is desirable to determine for which patients, if any, implantation can be safely avoided. In particular, we have focused on developing predictors with high sensitivity, meaning that they identify the patients who should be implanted because of high risk of SCD, at the possible expense of specificity, meaning that some low-risk patients who perhaps need not be implanted are still flagged for implantation. Various other aspects of this study, and of the data collected in it, have driven our design of methodology:

  • It is necessary to merge data on subjects from several different modalities, including SNP, ECG, and IMAGE measurements.
  • Although data from at least one such modality are available for most of the subjects, only a small number of subjects have data for all modalities; consequently, missing data are the rule rather than the exception.
  • The number of sudden cardiac deaths among the subjects is relatively small, so we needed to choose an appropriate surrogate phenotype to stand in for SCD.
  • After considering inducibility and other possible surrogate phenotypes, we settled on appropriate firing of the defibrillator within a certain period after implantation. Although such firings are not as rare as SCD itself in our population, their number is still quite small, and any proposed method must address this extreme imbalance between class sizes.
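
The considerations above can be illustrated with a small sketch. It is not one of the project's scripts, which are written in R and described on the topic pages below; it is a minimal Python example using only the standard library. The subject IDs, feature names (`snp_a`, `qt_ms`, `scar_pct`), values, and function names are all hypothetical and chosen only for illustration. It shows a union ("outer") merge of per-modality tables in which absent measurements stay explicitly missing, a count of the subjects that have every modality, and the sensitivity and specificity of a set of binary predictions when the event of interest is rare.

```python
# Hypothetical toy records keyed by subject ID; not data from the study.
snp = {"s1": {"snp_a": 0}, "s2": {"snp_a": 2}, "s3": {"snp_a": 1}}
ecg = {"s1": {"qt_ms": 410.0}, "s3": {"qt_ms": 455.0}, "s4": {"qt_ms": 430.0}}
image = {"s1": {"scar_pct": 12.5}, "s4": {"scar_pct": 30.0}}


def outer_merge(*modalities):
    """Union of subjects across modalities; absent features are set to None."""
    features = sorted({f for m in modalities for row in m.values() for f in row})
    subjects = sorted({s for m in modalities for s in m})
    merged = {}
    for s in subjects:
        row = dict.fromkeys(features)  # every feature starts out missing
        for m in modalities:
            row.update(m.get(s, {}))   # fill in whatever this modality has
        merged[s] = row
    return merged


def complete_cases(merged):
    """Subjects for which no feature is missing."""
    return [s for s, row in merged.items()
            if all(v is not None for v in row.values())]


def sensitivity_specificity(y_true, y_pred):
    """Both arguments are sequences of 0/1, where 1 marks the rare event."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


merged = outer_merge(snp, ecg, image)
print(len(merged), "subjects;", len(complete_cases(merged)), "with all modalities")

# Hypothetical labels: 1 = event (rare), 0 = no event.
y_true = [1, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print("sensitivity:", sens, "specificity:", round(spec, 3))
```

In this toy example the predictor catches every event while flagging some non-events, which is the trade-off the text describes: high sensitivity is preferred even when it costs some specificity.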

The algorithms we have developed and proposed for dissemination are all written in the R programming language and build on various existing R packages. The scripts we present are extensively annotated, allowing researchers with limited experience with R to make use of them.


Look below for information on specific research topics.

Classification Tree Model

Click here for detailed information on classification tree models.

Data Merging

Click here for information on Data Merging.

Feature Selection

Click here for code associated with Feature Selection.

Painted Tree Model

Click here for information on the Painted Tree Model.

Quadrant Tree Classifier

Click here for information on Quadrant Tree Classifiers.

NA Tree Classifier

Click here for information on NA Tree Classifiers.

Click here for information on implementation of NA Tree Classifiers using Google Web Toolkit.

Feature/Phenotype Association

Click here for information on Feature/Phenotype Association (under construction).

Model Fitting

Click here for information on Model Fitting (under construction).

Summaries (data)

Click here for Summaries (under construction).
