Monte-Carlo testing of Classifier-based Analyses
It is often desirable to be able to make statements like “Performance is
significantly above chance-level”, and to help with that PyMVPA supports Null
hypothesis (aka H0) testing for any Measure. Measures take an optional
constructor argument null_dist that can be used to provide an instance of some
NullDist estimator. If the properties of the expected Null distribution are
known a-priori, it is possible to use any distribution specified in SciPy’s
stats module for this purpose (see e.g. FixedNullDist).
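For instance, a minimal sketch of that route (assuming only that FixedNullDist wraps a frozen SciPy distribution, and relying on the wildcard import used further below; check the FixedNullDist documentation for the exact constructor signature) could look like this:
import scipy.stats as stats
# hedged sketch: an a-priori Null assuming errors scatter normally around the
# chance-level of a balanced binary problem (arguments are illustrative only)
fixed_null = FixedNullDist(stats.norm(0.5, 0.1), tail='left')
# such an instance could then be passed as the null_dist constructor argument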
However, as with other applications of statistics in classifier-based analyses, there is the problem that we typically do not know the distribution of a variable like error or performance under the Null hypothesis (i.e. the probability of a result given that there is no signal), hence we cannot easily assign the adored p-values. Even worse, the chance-level or guess probability of a classifier depends on the content of the validation dataset, e.g. a balanced or unbalanced number of samples per label and the total number of labels.
One approach to deal with this situation is to estimate the Null distribution using permutation testing. The Null distribution is then estimated by computing the measure of interest multiple times using the original data samples but with permuted targets. Since exploring all permutations is quite often infeasible, Monte-Carlo testing (see Nichols et al. (2006)) allows one to obtain a stable estimate with only a limited number of random permutations.
Given the results computed from permuted targets, one can now determine the probability of the empirical result (i.e. the one computed from the original training dataset) under the no-signal condition. This is simply the fraction of results from the permutation runs that are larger or smaller than the empirical one (depending on whether one is looking at performances or errors).
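Stripped of all PyMVPA machinery, the underlying computation is just a fraction, as in this minimal NumPy sketch (the numbers are made up for illustration):
import numpy as np
# errors obtained from analyses with permuted targets (made-up values)
null_errors = np.array([0.52, 0.48, 0.55, 0.45, 0.61])
empirical_error = 0.30
# left-tail probability: fraction of permutation errors at or below the
# empirical error
p_left = np.mean(null_errors <= empirical_error)
print 'left-tail p:', p_left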
Here is how this looks for a simple cross-validated classification in PyMVPA. We start by generating a dataset with 200 samples and 3 features, of which 2 carry some relevant signal.
# lazy import
from mvpa2.suite import *
# enable progress output for MC estimation
if __debug__:
    debug.active += ["STATMC"]
# some example data with signal
ds = normal_feature_dataset(perlabel=100, nlabels=2, nfeatures=3,
                            nonbogus_features=[0,1], snr=0.3, nchunks=2)
Now we can start collecting the pieces that play a role in this analysis. We need a classifier.
clf = LinearCSVMC()
We need a generator that will produce partitioned datasets, one for each
fold of the cross-validation. A partitioned dataset is basically the same as the
original dataset, but has an additional samples attribute that indicates whether
a particular sample will be part of the data used for training the
classifier, or for testing it. By default, the NFoldPartitioner will create a
sample attribute partitions that labels one chunk in each fold
differently from all the others (hence marks it as taken out for testing).
partitioner = NFoldPartitioner()
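If you are curious, an optional peek (not part of the original example, and assuming the usual generator interface of PyMVPA nodes) shows what the partitioner produces:
# each generated dataset carries a 'partitions' sample attribute; by default
# the chunk taken out for testing is marked with a distinct value
for pds in partitioner.generate(ds):
    print pds.sa.partitions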
We need two pieces for the Monte Carlo shuffling. The first of them is
an instance of an AttributePermutator that will permute the target attribute
of the dataset for each iteration. We will instruct it to perform 200
permutations. In a real analysis the number of permutations should be larger
to get stable estimates.
permutator = AttributePermutator('targets', count=200)
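Again as an optional illustration (not part of the original example, and assuming the usual generator interface), one can inspect a single permuted dataset to see that only the targets get shuffled:
# take the first of the 200 permuted datasets
permuted = permutator.generate(ds).next()
print 'original targets:', ds.sa.targets[:5]
print 'permuted targets:', permuted.sa.targets[:5]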
The second mandatory piece for a Monte-Carlo-style estimation of
the Null distribution is the actual “estimator”. MCNullDist will use the
constructed permutator to shuffle the targets and later on report the
p-value from the left tail of the Null distribution, because we are
going to compute errors and are interested in them being lower than
chance. Finally, we also ask for all results from the Monte-Carlo shuffling
to be stored for subsequent visualization of the distribution.
distr_est = MCNullDist(permutator, tail='left', enable_ca=['dist_samples'])
Now we have all the pieces and can conduct the actual cross-validation. We
assign a post-processing node mean_sample that will take care of averaging
error values across all cross-validation folds. Consequently, the Null
distribution of the average cross-validated classification error will be
estimated and used for statistical evaluation.
cv = CrossValidation(clf, partitioner,
                     errorfx=mean_mismatch_error,
                     postproc=mean_sample(),
                     null_dist=distr_est,
                     enable_ca=['stats'])
# run
err = cv(ds)
Now we have the usual cross-validation error, and cv stores
conditional attributes such as confusion matrices:
print 'CV-error:', 1 - cv.ca.stats.stats['ACC']
However, in addition it also provides the results of the statistical
evaluation. The conditional attribute null_prob holds a dataset that contains
the p-values representing the likelihood of an error equal to or lower than
the observed one under the Null hypothesis, i.e. when there is no actual
relevant signal in the data. For a reason that will become apparent later on,
the p-value is contained in a dataset.
p = cv.ca.null_prob
# should be exactly one p-value
assert(p.shape == (1,1))
print 'Corresponding p-value:', np.asscalar(p)
We can now look at the distribution of the errors under H0 and plot the expected chance level as well as the empirical error.
# make new figure
pl.figure()
# histogram of all computed errors from permuted data
pl.hist(np.ravel(cv.null_dist.ca.dist_samples), bins=20)
# empirical error
pl.axvline(np.asscalar(err), color='red')
# chance-level for a binary classification with balanced samples
pl.axvline(0.5, color='black', ls='--')
# scale x-axis to full range of possible error values
pl.xlim(0,1)
pl.xlabel('Average cross-validated classification error')
We can see that the Null or chance distribution is centered around the expected chance-level and that the empirical error value lies in the far left tail, thus it is unlikely to belong to the Null distribution, hence the low p-value.
This could be the end, but sometimes one needs to take a closer look. Let’s say your
data is not that homogeneous. Let’s say that some chunk may be very
different from the others. You might want to look at the error value probability for
specific cross-validation folds. Sounds complicated? Luckily, it is very simple.
It only needs a tiny change in the cross-validation setup – the removal of the
mean_sample post-processing node.
cv = CrossValidation(clf, partitioner,
                     errorfx=mean_mismatch_error,
                     null_dist=distr_est,
                     enable_ca=['stats'])
# run
err = cv(ds)
assert (err.shape == (2,1))
print 'CV-errors:', np.ravel(err)
Now we get two errors – one for each cross-validation fold – and, most importantly, we also get the two associated p-values.
p = cv.ca.null_prob
assert(p.shape == (2,1))
print 'Corresponding p-values:', np.ravel(p)
What happened is that a dedicated Null distribution has been estimated for
each element in the measure results. Without mean_sample an error is
reported for each CV-fold, hence a separate distribution is estimated for
each CV-fold too. And because we have also asked for all distribution samples
to be reported, we can now plot both distributions and both empirical errors.
But how do we figure out which value is which?
As mentioned earlier all results are returned in Datasets. All datasets have compatible sample and feature axes, hence corresponding elements.
assert(err.shape == p.shape == cv.null_dist.ca.dist_samples.shape[:2])
# let's make a function this time
def plot_cv_results(cv, err, title):
    # make new figure
    pl.figure()
    colors = ['green', 'blue']
    # null distribution samples
    dist_samples = np.asarray(cv.null_dist.ca.dist_samples)
    for i, e in enumerate(err):
        # histogram of all computed errors from permuted data per CV-fold
        pl.hist(np.ravel(dist_samples[i]), bins=20, color=colors[i],
                label='CV-fold %i' % i, alpha=0.5,
                range=(dist_samples.min(), dist_samples.max()))
        # empirical error
        pl.axvline(np.asscalar(e), color=colors[i])
    # chance-level for a binary classification with balanced samples
    pl.axvline(0.5, color='black', ls='--')
    # scale x-axis to full range of possible error values
    pl.xlim(0, 1)
    pl.xlabel(title)
plot_cv_results(cv, err, 'Per CV-fold classification error')
We have already seen that the statistical evaluation is pretty flexible. However, we haven’t yet seen whether it is flexible enough. To illustrate that, think about what was done in the above Monte Carlo analyses.
A dataset was shuffled repeatedly, and for each iteration a full cross-validation of the classification error was performed. However, the shuffling was done on the full dataset, hence targets were permuted in both the training and testing portions of each CV-fold. This basically means that for each Monte Carlo iteration the classifier was tested on new data/signal. However, we may be more interested in what the classifier has to say about the actual data when it was trained on randomly permuted data.
As you can guess, this is possible too, and it goes like this. The most important difference is that we are now going to use a dedicated measure to estimate the Null distribution. That measure is very similar to the cross-validation we have used before, but differs in one important bit. Let’s look at the pieces.
# how often do we want to shuffle the data
repeater = Repeater(count=200)
# permute the training part of a dataset exactly ONCE
permutator = AttributePermutator('targets', limit={'partitions': 1}, count=1)
# CV with null-distribution estimation that permutes the training data for
# each fold independently
null_cv = CrossValidation(
    clf,
    ChainNode([partitioner, permutator], space=partitioner.get_space()),
    errorfx=mean_mismatch_error)
# Monte Carlo distribution estimator
distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
                       enable_ca=['dist_samples'])
# actual CV with H0 distribution estimation
cv = CrossValidation(clf, partitioner, errorfx=mean_mismatch_error,
                     null_dist=distr_est, enable_ca=['stats'])
The repeater is a simple node that returns any given dataset a
configurable number of times. We use this helper to configure the number of
Monte Carlo iterations. The new permutator is again configured to shuffle
the targets attribute, but only once and only for samples that were
labeled as being part of the training set in a particular CV-fold (the
partitions sample attribute will be created by the NFoldPartitioner that we
have configured earlier).
The most important difference is a new dedicated measure that will be used to perform a cross-validation analysis under the H0 hypothesis. To this end we set up a standard CV procedure with a twist: we use a chained generator (comprising the typical partitioner and the new one-time permutator). This will cause the CV to permute the training set of each CV-fold internally (and that is exactly what we wanted).
Now we assign the H0 cross-validation procedure to the distribution
estimator and use the repeater to set the number of iterations. Lastly, we
plug everything into a standard CV analysis with, again, a non-permuting
partitioner and the pimped Null distribution estimator.
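To see what the chained generator actually feeds into the H0 cross-validation, an optional peek (not part of the original example, and assuming the usual generator interface) can rebuild the chain and inspect its first output – only the training portion should end up with shuffled targets:
chain = ChainNode([partitioner, permutator], space=partitioner.get_space())
pds = chain.generate(ds).next()
# partition labels and (partially permuted) targets of one generated fold
print pds.sa.partitions
print pds.sa.targets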
Now we just need to run it, and plot the results the same way we did before.
err = cv(ds)
print 'CV-errors:', np.ravel(err)
p = cv.ca.null_prob
print 'Corresponding p-values:', np.ravel(p)
# plot
plot_cv_results(cv, err,
                'Per CV-fold classification error (only training permutation)')
There are many ways to further tweak the statistical evaluation. For example, if
the family of the distribution is known (e.g. Gaussian/Normal) and provided
via the dist_class parameter of MCNullDist, then the permutation tests
done by MCNullDist can be used to determine the distribution parameters. Under the
(strong) assumption of a Gaussian distribution, 20-30 permutations should be
sufficient to get sensible estimates of the distribution parameters.
But that would be another story...
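As a rough sketch of the underlying idea only – using SciPy directly on the stored permutation results rather than MCNullDist’s dist_class parameter, and pooling the per-fold samples purely for illustration:
import scipy.stats as stats
# fit a Normal distribution to all stored permutation errors
samples = np.ravel(cv.null_dist.ca.dist_samples)
mu, sigma = stats.norm.fit(samples)
# parametric left-tail p-value for the mean empirical error
print 'Gaussian-fit p-value:', stats.norm.cdf(np.mean(err.samples), mu, sigma)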
See also
The full source code of this example is included in the PyMVPA source distribution (doc/examples/permutation_test.py).