Decision Tree Gadget

A Google Web Toolkit (GWT) interface for NA Tree Analysis

Introduction to GWT interface for NA Tree Analysis

Classification Tree analysis is one of the main techniques used in Data Mining. Classification trees predict or explain responses on a categorical dependent variable.
The NA Tree Classifier follows traditional binary classification trees, but makes use of a possible third branch (NA branch) at each node. The additional branch provides much more flexibility when dealing with incomplete (Not Available) data.
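The three-way split can be sketched as follows. This is only an illustration, not the R implementation: the Split class, the choose_branch function and the branch names are hypothetical, and the equality test is an assumption modelled on the "feature equals value", "feature does not equal value" and "NA" rules that appear in the result file.

```python
from dataclasses import dataclass


@dataclass
class Split:
    """Hypothetical description of the test applied at one tree node."""
    feature: str   # name of the feature tested at this node
    value: object  # value the feature is compared against


def choose_branch(split: Split, record: dict) -> str:
    """Return which of the three child branches a record falls into."""
    observed = record.get(split.feature)
    # Assumption: missing, empty or literal "NA" values count as Not Available.
    if observed is None or observed == "" or observed == "NA":
        return "NA"
    return "equal" if observed == split.value else "not_equal"
```

A record with no usable value for the tested feature is routed to the NA branch instead of being discarded or forced into one of the two ordinary branches.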
The core of the NA Tree Classifier is R script code developed by Yichen Qin at Johns Hopkins University.
The original R script requires that R be installed on the user's machine and runs as a command line program; the GWT version only requires the user to have internet access and a web browser.
The R script takes as inputs:

  • A rectangular array of data in the form of a Comma Separated Value (CSV) file.
  • A comma-separated, one-dimensional array of integers specifying which columns of the data array to analyze.
  • An integer indicating the desired depth of the result tree.
  • The boolean goal column is not currently passed in as a parameter.
    The current R script derives a default boolean goal column (fired) from the data array's columns "Implant.Date", "Days.Of.Observation", and "Days.To.First.App.Firing", effectively making these columns required.
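The inputs listed above could be checked before an analysis is started. The sketch below is not part of the R script or the gadget; check_inputs is a hypothetical helper, and the 1-based column numbering is an assumption based on R's usual convention.

```python
import csv
import io

# Columns the current R script needs in order to derive the "fired" goal column.
REQUIRED_COLUMNS = ("Implant.Date", "Days.Of.Observation", "Days.To.First.App.Firing")


def check_inputs(csv_text: str, columns: str, depth: int) -> list:
    """Validate the three inputs and return the parsed column numbers."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    indices = [int(part) for part in columns.split(",")]
    # Assumption: column numbers are 1-based.
    out_of_range = [i for i in indices if not 1 <= i <= len(header)]
    if out_of_range:
        raise ValueError(f"column numbers out of range: {out_of_range}")
    if depth < 1:
        raise ValueError("tree depth must be a positive integer")
    return indices
```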

The output is a CSV file with 10 columns (see: Result columns) listing the details of the nodes in the result tree. Because each node has three child branches, a tree with a depth of 2 will have 13 rows (1 + 3 + 9) in the result file.
The defaults in the current R script cause a bias towards certain outputs. Broad use of this algorithm will require some structural design changes in the future.
The data in this file can be used to draw a diagram of the result tree, as below:
Conceptual drawing of an NA tree with a depth of 3; see above link for full details.
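The row count quoted in this section follows from every node having three children. A minimal sketch, assuming the result file always describes a full three-branch tree (na_tree_rows is a hypothetical name):

```python
def na_tree_rows(depth: int) -> int:
    """Rows in the result file for a full three-branch tree of the given depth."""
    # Tier k holds 3**k nodes; the root is tier 0.
    return sum(3 ** tier for tier in range(depth + 1))


print(na_tree_rows(2))  # 1 + 3 + 9 = 13
print(na_tree_rows(3))  # 1 + 3 + 9 + 27 = 40
```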

Implementation as Google Web Toolkit (GWT)

In an attempt to make the classifier more accessible, it has been converted into a web service using the rJava/jri library, running on an Apache Server. This web service is not designed to be called directly, but as a part of the GWT gadget below.

A GWT gadget web page has been created to combine the web service with data upload, storage, graphical rendering and retrieval of the analysis results.
This gadget allows the user to upload a large dataset, select a subset of its fields for analysis and generate an NA Tree that can be used to predict the outcome of future data of the same type.

This GWT gadget is derived from the CVRG ECG Gadget and maintains much of its user interface. The user interface consists of two sections below the title bar:

  • The stack of tinted panels on the left with icons and a brief label, which select the next action.
  • The white content panel on the right, which displays selection lists, graphics and results of the actions.

Many of the content panels also have an expandable Instructions section.

Below is the full widget with the Connect panel selected, so the white content panel displays the UserID and Password form. The Instructions section is expanded.
After filling in the blanks, click the "Connect to Grid" checkbox in the left panel.

Screen capture of log in panel.

Usage summary

Enter a User ID of "demo" with a Password of "demo" to try the Gadget. Any data uploaded to the demo account will be visible to anyone and will be periodically purged to reset the demo.
To obtain a real user ID on a working copy of the Decision Tree Gadget, please contact us.
After logging in, running an NA Tree analysis consists of the following steps:

  1. Upload Data, in the Store panel, select a CSV file from the local machine which contains a rectangular dataset (try this sample dataset: DemoDataSet64.zip).
  2. Select Dataset, in the Analyze panel, choose one of the datasets previously uploaded.
  3. Select Features, still in the Analyze panel, specify a name for the result file, tree depth, and which columns to use as analyzed features.
    • Multiple analyses may be done on a given dataset.
    • Each must have a different result file name.
    • Each should have different features or depth selections.
  4. View result files, in the Visualize panel.
    • Lists all result files, sorted by dataset.
    • View an interactive graphical tree of one of the analyses.
      1. Clicking the features line at the top of the tree displays a table of the features.
      2. Clicking the text of any tree node displays a table of the details of that node.
      3. Clicking the -/+ icon at the left of a tree branch collapses/expands that branch.
      4. The tree diagram and the tables are in independently scrollable sub-panels.
  5. Download result file, in the Review panel, as a CSV file.
    • Lists all result files, sorted by dataset.

Figures and details

The following figures show only the right-hand panel, as the stack of blue panels on the left doesn't change much.

1. Upload Data, in the Store panel:

  1. Enter a single-word dataset name, which will be used to identify the dataset later. Spaces and underscores will be removed if present (e.g. "Sample Data_set" becomes "SampleDataset").
  2. Browse for a CSV file on the local hard drive. The first line of the file must contain header names.
  3. Clicking the Store Data Set button uploads the file to the analysis server. It will be retained there so that multiple analyses can be performed with different features selected without needing to upload the data file each time.
  4. Data must be de-identified before uploading to avoid HIPAA violations. There is no automated checking of this.
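The dataset name clean-up described in step 1 amounts to deleting two characters. A minimal sketch (clean_dataset_name is a hypothetical name; the gadget's actual code is not shown in this document):

```python
def clean_dataset_name(raw: str) -> str:
    """Drop spaces and underscores so the name becomes a single word."""
    return raw.replace(" ", "").replace("_", "")


print(clean_dataset_name("Sample Data_set"))  # SampleDataset
```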

Screen capture of data upload panel

2. Select Dataset, in the Analyze panel:

  1. Lists all datasets uploaded under the current userID.
  2. Future versions will allow marking datasets to be shared with other users, in which case shared datasets will also be listed. This part of the specification has not been finalized.
  3. The size and dimensions of the dataset are displayed in bytes, rows and columns.
  4. Clicking on the icon in the list starts the analysis by displaying the feature selection controls in the same panel.
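The size and dimensions shown in the list could be derived from the uploaded file as sketched below. This is an assumption about how the numbers are obtained, not the gadget's code: dataset_dimensions is a hypothetical name, the file is assumed to be UTF-8, and the header line is assumed not to count as a data row.

```python
import csv
import io


def dataset_dimensions(csv_bytes: bytes) -> tuple:
    """Return (size in bytes, data rows, columns) for an uploaded CSV file."""
    text = csv_bytes.decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    # Assumption: the header line is not counted as a data row.
    return len(csv_bytes), len(data), len(header)
```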

Screen capture of dataset list

3. Select features to analyze:

  1. The checkbox in the header checks or clears all feature checkboxes.
  2. Allows entering a title for the result tree; it must be unique within this input dataset.
  3. Allows setting the tree depth.
  4. Displays a collection of feature selection checkboxes derived from the headers of the dataset file.
  5. The parameters can be identical to those of a previous analysis, as long as the name is different, but that would be of questionable utility.
  6. The Analyze button executes the analysis on the selected dataset with the selected parameters.
  7. The result will be added to the Visualize and Review lists.
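The checked features have to be turned into the comma-separated list of column numbers that the R script expects, and the title has to be checked for uniqueness. A minimal sketch under stated assumptions: both function names are hypothetical, and columns are assumed to be numbered from 1 in header order.

```python
def selected_columns(header: list, checked: set) -> str:
    """Turn the checked feature names into a comma-separated column-number list."""
    # Assumption: columns are numbered from 1, in header order.
    numbers = [str(i) for i, name in enumerate(header, start=1) if name in checked]
    return ",".join(numbers)


def is_unique_title(title: str, existing_titles: set) -> bool:
    """A result title must not repeat within the same input dataset."""
    return title not in existing_titles
```

For a header of Age, Sex, EF with Age and EF checked, selected_columns returns "1,3".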

Screen capture of checkboxes for selecting features to analyze.

4. View result files, in the Visualize panel:

  1. Lists all result trees generated under the current userID, sorted by dataset.
  2. Future versions will allow marking result trees to be shared with other users, in which case shared result trees will also be listed. This part of the specification has not been finalized.
  3. Each line displays the parent dataset in blue, followed by the result tree's name in parentheses (also in blue).
  4. This is followed, in black, by the result type (NATree is currently the only type).
  5. Next, in black and square brackets, is the actual name the result file will have when downloaded, which includes the dataset name and the tree name.
  6. Last, in black, is the language-specific description property "analysisResults" from the constants file: "analysis results" in English or "Analyse resultiert" in German.
  7. Clicking on the icon in the list displays the graphical representation of the result tree.

Screen capture of result file list.

Graphical tree, showing list of features:

  1. Contains 2 sub-panels; the graphical tree on the left, and a table with details on the right.
  2. Selecting a node in the tree graphic causes the table to display the data for that node.
  3. The first (pseudo)node lists, by column number, the features that were used in this analysis.
  4. As shown below, selecting the feature list displays the feature's column number and name in the details table.

Screen capture of graphical tree, showing list of features for displaying graphically.

Graphic tree display:

  1. Each node in the graphic tree displays:
    • The numerical label of the node.
    • The "Rule" that this node's parent uses to decide to assign data to this branch (this node and its children).
    • The percentage of the total samples represented by this branch.
  2. The details table displays all of the data from the result file for the selected node; for details see: Result columns.
  3. Also displays derived meta-data:
    • The "Rule" that this node's parent uses to decide to assign data to this branch.
    • Total: Count of data represented by this branch, essentially (Fired + Not Fired).
    • Portion of this Node fired: Percentage of data in this branch which are fired.
    • Portion of all firings: Percent of total fired data in the dataset which are in this branch.
    • Portion of all samples: Percent of all data in the dataset which are in this branch.
  4. The Node details table is longer than the space allotted for the sub-panel, so the 2nd image below shows the sub-panel scrolled down.
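The derived values listed above are simple ratios of the fired and non-fired counts. A minimal sketch, assuming the dataset-wide totals are supplied by the caller; node_summary and its dictionary keys are hypothetical names, not taken from the gadget's source.

```python
def node_summary(fired: int, nonfired: int, all_fired: int, all_samples: int) -> dict:
    """Compute the derived meta-data for one branch from its two counts."""
    total = fired + nonfired
    return {
        "total": total,
        # Percentage of the data in this branch which are fired.
        "portion_of_node_fired": 100.0 * fired / total if total else 0.0,
        # Percentage of all fired data in the dataset which are in this branch.
        "portion_of_all_firings": 100.0 * fired / all_fired if all_fired else 0.0,
        # Percentage of all data in the dataset which are in this branch.
        "portion_of_all_samples": 100.0 * total / all_samples if all_samples else 0.0,
    }
```

For a branch with 5 fired and 15 non-fired samples, in a dataset of 100 samples containing 20 firings, this gives a total of 20, with 25% of the node fired, 25% of all firings and 20% of all samples.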

Screen capture of graphical tree, showing first part of details of selected node.
Screen capture of graphical tree, showing last part of details of selected node (frame was scrolled).

5. Download result file, in the Review panel:

  1. Displays the same list as Visualize except:
    • The icon is different.
    • Clicking on the icon in the list downloads the result tree as a CSV file.
  2. Lists all result trees generated under the current userID.
  3. Future versions will allow marking result trees to be shared with other users, in which case shared result trees will also be listed. This part of the specification has not been finalized.
  4. Each line displays the parent dataset in blue, followed by the result tree's name in parentheses (also in blue).
  5. This is followed, in black, by the result type (NATree is currently the only type).
  6. Next, in black and square brackets, is the actual name the result file will have when downloaded, which includes the dataset name and the tree name.
  7. Last, in black, is the language-specific description property "analysisResults" from the constants file: "analysis results" in English or "Analyse resultiert" in German.

Screen capture of result file list for downloading.

Downloaded file to spreadsheet:

  1. The downloaded CSV file can be read directly by a spreadsheet (e.g. MS-Excel or Open Office Calc).
  2. Contains one row for each tree node, as many as needed for the tree's depth, with three nodes below each parent node.
  3. Contains 10 columns:
    Name             Description
    node.label       Numerical label assigned to the node, used because there are so many nodes and features.
    depth            Indicates which tier of the tree the node is on.
    real.depth       Indicates the number of decisions that have actually been made to reach this node, which can be less than depth because of NA nodes in its parent branch. E.g. for the split Root -> NA -> NA, although the node is at a depth of 2, its real depth is still 0; that is, it has not been split by any real feature at all.
    split.var.name   Name of the feature this node's parent used for split decisions. (see: Categorical Feature)
    split.var.value  The threshold value of the feature this node's parent used for split decisions. (see: Categorical Feature)
    dec.rule         The algebraic rule this node's parent used for split decisions, e.g. "feature equals value", "feature does not equal value", and "NA". (see: Categorical Feature)
    parent.node      The node.label of this node's parent.
    call             Purpose not documented; it is always ' false ' in the available sample data (open question noted by M.S.).
    count.fired      Count of data in this node with "Fired" = true.
    count.nonfired   Count of data in this node with "Fired" = false.
    Note: Fired is a derived column, and is not part of the original dataset. This makes the analysis difficult to use on other dataset types, and should be re-evaluated.

    Screen capture of a downloaded result opened in Excel.
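Besides a spreadsheet, the downloaded file can be read by a short script. The sketch below rebuilds the parent/child links from parent.node and recomputes real.depth by counting the non-NA rules on the path to the root. It is an illustration only: load_tree and real_depth are hypothetical names, and it assumes that the root's parent.node value does not match any node.label and that NA branches carry the literal rule "NA".

```python
import csv
import io
from collections import defaultdict


def load_tree(csv_text: str):
    """Index result rows by node.label and group child labels by parent.node."""
    by_label = {}
    children = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_label[row["node.label"]] = row
        children[row["parent.node"]].append(row["node.label"])
    return by_label, children


def real_depth(label: str, by_label: dict) -> int:
    """Count the real (non-NA) decisions on the path from the root to a node."""
    depth = 0
    node = by_label[label]
    # Assumption: the root's parent.node is not a label present in the file.
    while node["parent.node"] in by_label:
        if node["dec.rule"] != "NA":
            depth += 1
        node = by_label[node["parent.node"]]
    return depth
```

With this approach a node reached through Root -> NA -> NA has a real depth of 0, matching the description of real.depth above.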


    Document Information

    version 1.0
    4/9/2010
    Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
    Contact email: [1]
    CardioVascular Research Grid
    Johns Hopkins University
