FeatureSelection
From CVRG Wiki
Contents |
Feature Selection Instruction
Document Information
Package base version 2.7.1
10/21/2008
Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: yqin@jhu.edu
CardioVascular Research Grid
Johns Hopkins University
Introduction to Feature Selection
Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique, commonly used in machine learning, of selecting a subset of relevant features for building robust learning models. By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by:
- Reducing overfitting of learning methods.
- Increasing the computation speed of prediction and learning process.
- Alleviating the effect of the curse of dimensionality.
- Enhancing generalization capability.
- Improving model interpretability.
Feature selection also helps people to acquire better understanding about their data by telling them that which are the important features and how they are related with each other.
Information Theory, Entropy and Conditional Mutual Information
Now let us introduce several fundamental concepts in information theory. Information theory plays a very important role in feature selection. It gives us criterion to compare variables and quantify how much information each variable carries out for predicting the dependent variable. In this article, we only consider finite random variables.
Now, suppose U, V and W are three random variables. First, let’s introduce a very fundamental concept --- entropy, denoted by H(U). A entropy is a number to quantify the uncertainty of a random variable. Since U|V is also a random variable, we can as well define H(U|V) as the conditional entropy which quantify the uncertainty of U given V, that is V is known. H(U|V) can be 0 when U is determined by V (eg. U=f(V)). Or H(U|V) equals to H(U) when V carries no information about U (independent), that is to say, knowing V doesn’t help us at all to understand U.
Another concept is conditional mutual information, which tells us how much information two variables share given some other random variables. It is denoted by I(U ; V |W). Note that:
I(U ; V |W) = H(U |W) - H(U |W, V)
An intuitive explanation of this formula is that the conditional mutual information is the difference between the entropy of U when W is known and the entropy of U when V are W are both known. So this formula tells how much information V carries about U which is not carried by W.
Conditional Mutual Information Maximization Method
The Conditional Mutual Information Maximization Method is an algorithm that selects a small subset of features that carries as much information as possible based on conditional mutual information mentioned above. To be specific, the ultimate goal of CMIM would be to choose v(1), . . . , v(K) which minimize
. v(1), . . . , v(K) are the number of the variables we select from the whole set of variables. However,
cannot be estimated easily. Furthermore, even if there were ways to have an estimation, its minimization would be computationally intractable. So we have alternative method to solve this problem.
We propose an intermediate solution --- Conditional Mutual Information Maximization Method. To avoid the overlap information between variables and also make sure new variables we bring in carry new information about the dependent variable, we tradeoff between individual power and independence by comparing each new variable with the ones already picked. We bring in a new feature
only if
is large for every X we already picked. This means that
carries new information that about Y, and this information has not been caught by any of the X already picked. To be more formally, the formulas for selecting features are:
So
is small either because
does not bring new information about Y or because this information was already carried by
which we picked already. So if
is small, we don’t pick this
. To be more efficient, we create a score vector s(n, k) which represents
, where k means the kth variable that is going to be picked.
So, by picking the feature Xn with the maximum score in s(n, k), we ensure that, in term of predicting Y, the new feature carries the most new information that has not been caught by the preceding features.
For the computation of the score vectors, there are two common ways to do that. We will detail the cost of these algorithms in the later sections. And the implementations are also showed in the code section.
Other Feature Selection Methods
This section lists the various other feature selection methods that are commonly used. Users can also use these methods for comparison with CMIM.
Random Sampling
The most trivial form of feature selection consists of a uniform random subsampling without repetition. Such an approach leads to features as independent as the original but does not pick the informative ones. This leads to poor results when only a small fraction of the features actually provide information about the class to predict.
Mutual Information Maximization
To avoid the main weakness of the random sampling described above, we have also implemented a method which picks the K features n(1), . . . , n(K) maximizing individually the mutual information
with the class to predict. Selection based on such a ranking does not ensure weak dependency among features, and can lead to redundant and poorly informative families of features. We call this method MIM for Mutual Information Maximization.
C4.5 Binary Trees
As proposed by Ratanamahatana and Gunopulos (2003), binary decision trees can be used for feature selection. The idea is to grow several binary trees and to rank features according to the number of times they appear in the top nodes. This technique is proposed in the literature as a good filter for naive Bayesian classifiers, and is a good example of a scheme able to spot statistical dependencies between more than two features, since the choice of a feature in a binary tree depends on the statistical behavior conditionally on the values of the ones picked above.
Fast Correlation-Based Filter
The FCBF method addresses explicitly the correlation between features. It first ranks the features according to their mutual information with the class to predict, and remove those which mutual information is lesser than a threshold d.
In a second step, it iteratively removes any feature Xi if there exist a feature Xj such that
and
, i.e. Xj is better as a predictor of Y and Xi is more similar to Xj than to Y. The threshold d can be adapted to get the expected number of features.
CMIM Implementations
Now we introduce two ways to compute efficiently entropy and mutual information for CMIM. The first is more straightforward and but time consuming. The second one is more efficiently and subtle. The implementation codes are given in next section.
First, let us compute the estimation of the conditional entropy, mutual information, and conditional mutual information, which can be done by summing and subtracting estimation of entropies of these variables. Let x, y, and z be three boolean vectors and u, v, and w, three boolean values. We denote by ||.|| the cardinal of a set and define three counting functions
From this, if we define
, with the usual convention
, we have
And by definition, we have
Now we have the conditional mutual information. Then the implementation can be optimized by using a look-up table to choose the useful features from the pool. In the pseudo-code showed later, the function mut_inf(n) standards for
and cond_mut_inf(n, m) standards for
.
Standard Implementation
The most straight-forward implementation of CMIM keeps a score vector s which contains the score for every feature Xn. Each time after bringing in a new feature, the score vector is updated by take the minimum of
, where n is the new feature we select and v(1)....v(k) are the feature already chosen. In the beginning, the score vector is initialized with the values
.
The algorithm picks at each iteration the feature v(k) with the highest score, and then refreshes every score s[n] by taking the minimum of s[n] and
. This implementation is given in pseudo-code on Algorithm 1 and has a cost of O(K×N ×T).
Algorithm 1 Simple version of CMIM
for n = 1. . .N do
s[n]=mut_inf(n)
end
for k = 1. . .K do
nu[k] = argmax s[n]
for n = 1. . .N do
s[n]=min(s[n], cond_mut_inf(n, nu[k]))
end
end
Fast Implementation
The most expensive part in the algorithm described above are the K ×N calls to cond_mut_inf, each costing O(T) operations. The fast implementation of CMIM relies on the fact that because the score vector can only decrease when the process goes on, bad scores may not need to be refreshed. This implementation does not rely on any approximation and produces the exact same results as the naive implementation described above.
Intuitively, consider a set of features containing several ones almost identical. Picking one of them makes all the other ones of this group useless during the rest of the computation. This can be spotted early because their scores are low, and will remain so because scores can only decrease.
The fast version of CMIM stores for every feature Xn a partial score ps[n], which is the minimum over a few of the conditional mutual informations appearing in the min in equation (2) in section 2.3. Another vector m[n] contains the index of the last picked feature taken into account in the computation of ps[n]. Thus, we have at any moment
At every iteration, the algorithm goes through all candidates and update its score only if the best one found so far in that iteration is not better, since scores can only go down when updated. For instance, if the up-to-date score of the first feature was 0.02 and the non-already updated score of the second feature was 0.005, it is not necessary to update the later, since it can only go down. The pseudo-code on Algorithm 2 is an implementation of that idea. It goes through all the candidate features, but does not compute the conditional mutual information between a candidate and the class to predict, given the most recently picked features, if the score of that candidate is below the best up-to-date score s* found so far in that iteration.
Algorithm 2 Fast version of CMIM
for n = 1. . .N do
ps[n]=mut_inf(n)
m[n]=0
end
for k = 1. . .K do
s*=0
for n = 1. . .N do
while ps[n] > s* and m[n] < k-1 do
m[n]=m[n]+1
ps[n]=min(ps[n], cond mut inf(n, nu[m[n]]))
if ps[n] > s* then
s*=ps[n]
nu[k]=n
end
end
R Code
Commented R Code
#
# Define a function of f(x)=x*log2(x) log2() is based on 2
#
myxlogx<-function(x)
{
if (x>0)
{
return(x*log2(x))
}
else
{
# is x<0, then log2(x) is error, so we return 0.
return(0)
}
}
#
# Estimated entropy of categorical variable x i.e. H(px) where
# px is the empirical probability mass function of x.
#
entropy<-function(x)
{
#
# Generate a indicator vector SUBSET, whose individual element SUBSET[i]
# is a TRUE if x[i] is a valid data, or a FALSE if x[i] is a NA.
#
SUBSET<-!is.na(x)
#
# Create a table Tx using all the valid data.
#
Tx<-table(x[SUBSET])
#
# Calculate the probabilities for all the data points.
#
px<-Tx/sum(Tx)
#
# Calculate the entropy for x.
#
-sum(sapply(px,myxlogx))
}
#
# Estimated entropy of a pair of categorical variables x,y
# i.e. for the empirical probability mass function pxy
#
entropy2<-function(x,y)
{
#
# Generate a indicator vector SUBSET, whose individual element SUBSET[i]
# is a TRUE if both of x[i] and y[i] are valid data, or a FALSE if any
# of x[i] and y[i] are NAs.
#
SUBSET<-(!is.na(x))&(!is.na(y))
#
# Create a table Txy using all the valid data.
#
Txy<-table(x[SUBSET],y[SUBSET])
#
# Calculate the probabilities for all the data points.
#
pxy<-Txy/sum(Txy)
#
# Calculate the entropy for x and y.
#
-sum(sapply(pxy,myxlogx))
}
#
# Estimated entropy of a 3-tuple of categorical variables x,y,z
# i.e. for the empirical probability mass function pxyz
#
entropy3<-function(x,y,z)
{
#
# Generate a indicator vector SUBSET, whose individual element SUBSET[i]
# is a TRUE if all of x[i], y[i] and z[i] are valid data, or a FALSE
# if any of x[i], y[i] and z[i] are NAs.
#
SUBSET<-(!is.na(x))&(!is.na(y))&(!is.na(z))
#
# Create a table Txyz using all the valid data.
#
Txyz<-table(x[SUBSET],y[SUBSET],z[SUBSET])
#
# Calculate the probabilities for all the data points.
#
pxyz<-Txyz/sum(Txyz)
#
# Calculate the entropy for x, y and z.
#
-sum(sapply(pxyz,myxlogx))
}
#
#
# Given variables x and y, compute the mutual information.
#
#
MI<-function(x,y)
{
#
# Based on formula
#
entropy(x)+entropy(y)-entropy2(x,y)
}
#
# Given variables X1,X2,Y, compute the mutual information H({X1,X2},Y}
#
MI2<-function(x1,x2,y)
{
#
# Based on formula
#
entropy2(x1,x2)+entropy(y)-entropy3(x1,x2,y)
}
#
# Given variables X1,X2,Y, compute the conditional mutual information given Y
#
CMI2<-function(x1,x2,y )
{
#
# Based on formula
#
-entropy3(x1,x2,y)-entropy(y)+entropy2(x1,y)+entropy2(x2,y)
}
Standard Implementation Commented R Code
#
# Standard implementation of CMIM
#
# Inputs:
#
# d = data frame of features
# y = column of response variable
#
CMIM.standard<-function(d,y)
{
#
# Assign N the number of variables in d.
#
N<-dim(d)[2]
#
# Assign s (score) and nu (selection result) all 0s.
#
s<-rep(0,N)
nu<-rep(0,N)
#
# Assign s with mutual information between y and every variables in d.
#
for (n in seq(1,N))
{
s[n]<-MI(d[,n],y)
}
#
# Assign k from 1 to N, and select the kth feature each time.
#
for (k in seq(1,N))
{
#
# Assign nu[k] the variable number (from 1 to N) who maximize the
# score vectro s.
#
nu[k]<-which(s==max(s))
#
# Update score vector s by assigning n from 1 to N
#
for (n in seq(1,N))
{
#
# update score vector with the minimum of current s and conditional
# mutual information given the vector we just picked.
#
s[n]<-min(s[n],CMI2(d[,n],y,d[,nu[k]]))
}
}
list(nu=nu,s=s)
}
Fast Implementation Commented R Code
#
# Fast implementation of CMIM
#
# Inputs:
#
# d = data frame of features
# y = column of response variable
#
CMIM.fast<-function(d,y)
{
#
# Assign N the number of variables in d.
#
N<-dim(d)[2]
#
# Assign ps (partial score), m (counting indicator), and
# nu (selection result) all 0s.
#
ps<-rep(0,N)
m<-rep(0,N)
nu<-rep(0,N)
#
# Assign ps with mutual information between y and every variables in d.
#
for (n in seq(1,N))
{
ps[n]<-MI(d[,n],y)
}
#
# Assign k from 1 to N, and select the kth feature each time.
#
for (k in seq(1,N))
{
#
# Assign sstar a initial value of 0. It is used in the future to
# identify the maxium of sorce vector s
#
sstar<-0
#
# Look for the variable for the kth feature with the maxium
# conditional mutual information by assigning n from 1 to N
#
for (n in seq(1,N))
{
#
# If s (partial score) is larger than sstar, and m[n] is smaller
# than k-1, then execute the following program.
#
# This while program will execute only once for each k in 1:N
#
# Since the ps is decreacing, as long as ps[n] is smaller than sstar,
# then we don't need to update ps any more. That is the reason of
# (ps[n]>sstar)
#
# If we find a ps[n] which is larger than sstar, then we update ps[n]
# with the all the conditional mutual information of given the
# variables we already picked ( from nu[1] to nu[k-1] ). The reason why
# (m[n]<k-1) is needed is that the loop need to update all the
# variables we picked. from nu[1] to nu [k-1].
#
# Note that for k=1, k-1=0, so this part of program is never executed
# when k is 1.
#
while((ps[n]>sstar)&&(m[n]<k-1))
{
#
# Increace m[n] by 1.
#
m[n]<-m[n]+1
#
# Update the ps with the minimum of current ps[n] and the
# conditional mutual information given variable nu[m[n]], where
# m[n] is between 1 and k-1.
#
ps[n]<-min(ps[n],CMI2(d[,n],y,d[,nu[m[n]]]))
}
#
# If, after updating, the ps[n] is still larger than sstar, then
# we replace the old variable nu[k] with this new variable n, whose ps is
# larger than the former. And we set sstar to the current ps[n], so
# we can compare it with the ps of the rest of variables.
#
if (ps[n]>sstar)
{
#
# We set sstar to the current ps[n], so that we can compare it
# with the ps of the rest of variables later.
#
sstar<-ps[n]
#
# Record the current variable n to the nu vector, which stores
# all the feature selection result variables.
#
nu[k]<-n
}
}
}
#
# List the feature selection result and ps.
#
list(nu=nu,ps=ps)
}
#
# A example
#
> d=read.csv("data.csv",as.is=TRUE)
> f<-d[,52:56]
> y<-d[,51]
> CMIM.fast(f,y)
$nu
[1] 3 1 5 2 4
$ps
[1] 0.000000e+00 -2.220446e-16 0.000000e+00 9.714629e-03 0.000000e+00
> CMIM.standard(f,y)
$nu
[1] 3 1 5 2 4
$s
[1] 0.000000e+00 -2.220446e-16 0.000000e+00 4.440892e-16 0.000000e+00
Uncommented R Code
myxlogx<-function(x)
{
if (x>0)
{
return(x*log2(x))
}
else
{
return(0)
}
}
entropy<-function(x)
{
SUBSET<-!is.na(x)
Tx<-table(x[SUBSET])
px<-Tx/sum(Tx)
-sum(sapply(px,myxlogx))
}
entropy2<-function(x,y)
{
SUBSET<-(!is.na(x))&(!is.na(y))
Txy<-table(x[SUBSET],y[SUBSET])
pxy<-Txy/sum(Txy)
-sum(sapply(pxy,myxlogx))
}
entropy3<-function(x,y,z)
{
SUBSET<-(!is.na(x))&(!is.na(y))&(!is.na(z))
Txyz<-table(x[SUBSET],y[SUBSET],z[SUBSET])
pxyz<-Txyz/sum(Txyz)
-sum(sapply(pxyz,myxlogx))
}
MI<-function(x,y)
{
entropy(x)+entropy(y)-entropy2(x,y)
}
MI2<-function(x1,x2,y)
{
entropy2(x1,x2)+entropy(y)-entropy3(x1,x2,y)
}
CMI2<-function(x1,x2,y)
{
-entropy3(x1,x2,y)-entropy(y)+entropy2(x1,y)+entropy2(x2,y)
}
Standard Implementation Uncommented R Code
#
# Standard implementation of CMIM
#
# Inputs:
#
# d = data frame of features
# y = column of response variable
#
CMIM.standard<-function(d,y)
{
N<-dim(d)[2]
s<-rep(0,N)
nu<-rep(0,N)
for (n in seq(1,N))
{
s[n]<-MI(d[,n],y)
}
for (k in seq(1,N))
{
nu[k]<-which(s==max(s))
for (n in seq(1,N))
{
s[n]<-min(s[n],CMI2(d[,n],y,d[,nu[k]]))
}
}
list(nu=nu,s=s)
}
Fast implementation Uncommented R Code
#
# Fast implementation of CMIM
#
# Inputs:
#
# d = data frame of features
# y = column of response variable
#
CMIM.fast<-function(d,y)
{
N<-dim(d)[2]
ps<-rep(0,N)
m<-rep(0,N)
nu<-rep(0,N)
for (n in seq(1,N))
{
ps[n]<-MI(d[,n],y)
}
for (k in seq(1,N))
{
sstar<-0
for (n in seq(1,N))
{
while((ps[n]>sstar)&&(m[n]<k-1))
{
m[n]<-m[n]+1
ps[n]<-min(ps[n],CMI2(d[,n],y,d[,nu[m[n]]]))
}
if (ps[n]>sstar)
{
sstar<-ps[n]
nu[k]<-n
}
}
}
list(nu=nu,ps=ps)
}

