Frequently Asked Questions¶
General¶
I’m a Matlab user. How hard is learning Python and PyMVPA for me?¶
If you are coming from Matlab, you will soon notice a lot of similarities between Matlab and Python (besides the huge advantages of Python over Matlab). For an easy transition you might want to have a look at a basic comparison of Matlab and NumPy.
It would be nice to have some guidelines on how to use PyMVPA for users who are already familiar with the Matlab MVPA toolbox. If you are using both packages and could compile a few tips, your contribution would be most welcome.
A recent paper by Jurica and van Leeuwen (2009) describes an open-source MATLAB®-to-Python compiler which might be a very useful tool to migrate a substantial amount of Matlab-based source code to Python and therefore also aids the migration of developers from Matlab to the new “general open-source lingua franca for scientific computation”.
It is sloooooow. What can I do?¶
Have you tried running the Python interpreter with -O? PyMVPA provides lots of debug messages with information that is computed in addition to the work that really has to be done. However, if Python is running in optimized mode, PyMVPA will not waste time on this and really tries to be fast.
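For example, a long-running analysis script (my_analysis.py is just a placeholder name) can be executed in optimized mode like this:
python -O my_analysis.py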
If you are already running it optimized, then maybe you are doing something really demanding...
I am tired of writing these endless import blocks. Any alternative?¶
Sure. Instead of individually importing all pieces that are required by a script, you can import them all at once. A simple:
>>> import mvpa2.suite as mvpa2
makes everything directly accessible through the mvpa2 namespace, e.g. mvpa2.datasets.base.Dataset becomes mvpa2.Dataset. Really lazy people can even do:
>>> from mvpa2.suite import *
However, as always there is a price to pay for this convenience. In contrast to the individual imports there is some initial performance and memory cost. In the worst case you’ll get all external dependencies loaded (e.g. a full R session), just because you have them installed. Therefore, it might be better to limit this use to cases where individual key presses matter, and to use individual imports for production scripts.
I feel like I want to contribute something, do you mind?¶
Not at all! If you think there is something that is not well explained in the documentation, send us an improvement. If you implemented a new algorithm using PyMVPA that you want to share, please share. If you have an idea for some other improvement (e.g. speed, functionality), but you have no time/cannot/do not want to implement it yourself, please post your idea to the PyMVPA mailing list.
I want to develop a new feature for PyMVPA. How can I do it efficiently?¶
The best way is to use Git for both getting the latest code from the repository and preparing the patch. Here is a quick sketch of the workflow.
First get the latest code:
git clone git://github.com/PyMVPA/PyMVPA.git
This will create a new PyMVPA subdirectory that contains the complete repository. Enter this directory and run gitk --all to browse the full history and all branches that have ever been published.
You can run:
git fetch origin
in this directory at any time to get the latest changes from the main repository.
Next, you have to decide what you want to base your new feature on. In the simplest case this is the master branch (the one that contains the code that will become the next release). Creating a local branch based on the (remote) master branch is done with:
git checkout -b my_hack origin/master
Now you are ready to start hacking. You are free to use all the powers of Git (and yours, of course). You can do multiple commits, fetch new stuff from the repository, and merge it into your local branch (see the sketch below). To get a feeling for what can be done, take a look at a very short description of Git or a more comprehensive Git tutorial.
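For example, recording your work and then pulling in the latest upstream changes might look like this sketch (the commit message is only a placeholder):
git commit -a -m "WIP: my new feature"
git fetch origin
git merge origin/master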
When you are done with the new feature, you can prepare the patch for inclusion into PyMVPA. If you have done multiple commits you might want to squash them into a single patch containing the new feature. You can do this with git rebase. Any recent version of git rebase has an option --interactive, which allows you to easily pick, squash or even further edit any of the previous commits you have made. Rebase your local branch against the remote branch you started hacking on (origin/master in this example):
git rebase --interactive origin/master
When you are done, you can generate the final patch file:
git format-patch origin/master
The above command will generate a file for each commit in your local branch that is not yet part of origin/master. The patch files can then be easily emailed.
The manual is quite insufficient. When will you improve it?¶
Writing a manual can be a tricky task if you already know the details and have to imagine what might be the most interesting information for someone who is just starting. If you feel that something is missing which has cost you some time to figure out, please drop us a note and we will add it as soon as possible. If you have developed some code snippets to demonstrate some feature or non-trivial behavior (maybe even trivial ones, which are not as obvious as they should be), please consider sharing this snippet with us and we will put it into the example collection or the manual. Thanks!
Data import, export and storage¶
What file formats are understood by PyMVPA?¶
Please see the data_formats section.
What if there is no special file format for some particular datatype?¶
With the h5save() function, PyMVPA supports storing any kind of serializable data into a (compressed) HDF5 file. The facility is particularly useful for storing any number of intermediate analysis results, e.g. for post-processing.
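A minimal sketch of storing and restoring a dataset (results.hdf5 is a placeholder filename; the compression value is passed through to the underlying HDF5 library):
>>> from mvpa2.suite import h5save, h5load
>>> h5save('results.hdf5', dataset, compression=9)
>>> restored = h5load('results.hdf5')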
Data preprocessing¶
Is there an easy way to remove invariant features from a dataset?¶
You might have to deal with invariant features in cases like an fMRI dataset, where the brain mask is slightly larger than the thresholded fMRI timeseries image. Such invariant features (i.e. features with zero variance) are sometimes a problem, e.g. they will lead to numerical difficulties when z-scoring the features of a dataset (i.e. division by zero).
The mvpa2.datasets.miscfx module provides a convenience function remove_invariant_features() that strips such features from a dataset.
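A minimal sketch of its use:
>>> from mvpa2.datasets.miscfx import remove_invariant_features
>>> cleaned = remove_invariant_features(dataset)
>>> cleaned.nfeatures <= dataset.nfeatures
True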
How can I do block-averaging of my block-design fMRI dataset?¶
The easiest way is to use a mapper to transform/average the respective samples. Suppose you have a dataset:
>>> dataset = normal_feature_dataset()
>>> print dataset
<Dataset: 100x4@float64, <sa: chunks,targets>>
Averaging all samples with the same label in each chunk individually is done by applying a mapper to the dataset.
>>> from mvpa2.mappers.fx import mean_group_sample
>>>
>>> m = mean_group_sample(['targets', 'chunks'])
>>> mapped_dataset = dataset.get_mapped(m)
>>> print mapped_dataset
<Dataset: 10x4@float64, <sa: chunks,targets>, <a: mapper>>
mean_group_sample creates an FxMapper that applies a function to every group of samples in each chunk individually and therefore yields one sample of each label per chunk.
Data analysis¶
How do I know which features were finally selected by a classifier doing feature selection?¶
All feature selection classifiers use a built-in mapper to slice datasets. This mapper can be queried for the selected features, or simply used to apply the same feature selection to other datasets.
>>> clf = FeatureSelectionClassifier(
... kNN(k=5),
... SensitivityBasedFeatureSelection(
... SMLRWeights(SMLR(lm=1.0), postproc=maxofabs_sample()),
... FixedNElementTailSelector(1, tail='upper', mode='select')))
>>> clf.train(dataset)
>>> len(clf.mapper.slicearg)
1
>>> final_dataset = clf.mapper.forward(dataset)
>>> print final_dataset
<Dataset: 100x1@float64, <sa: chunks,targets>>
In the above code snippet a kNN classifier is defined that performs a feature selection step prior to training. Features are selected according to the maximum absolute magnitude of the weights of an SMLR classifier trained on the data (the same training data that will also go into kNN). Absolute SMLR weights are used for feature selection because large negative values also indicate important information. Finally, the classifier is configured to select the single most important feature (given the SMLR weights). After training, the classifier's mapper provides the desired information, which can e.g. be applied to generate a stripped dataset for an analysis of the similarity structure.
How do I extract sensitivities from a classifier used within a cross-validation?¶
In various parts of PyMVPA it is possible to extract information from inside loops via callbacks. To extract sensitivities from inside a cross-validation analysis without unnecessary retraining of the classifier, one only needs to write a corresponding callback function. Here is a sketch:
>>> sensitivities = []
>>> def store_me(data, node, result):
... sens = node.measure.get_sensitivity_analyzer(force_train=False)(data)
... sensitivities.append(sens)
>>>
>>> cv = CrossValidation(SMLR(), OddEvenPartitioner(), callback=store_me)
>>> merror = cv(dataset)
>>> len(sensitivities)
2
>>> sensitivities[0].shape == (len(dataset.uniquetargets), dataset.nfeatures)
True
First we set up a container (a list) to store the sensitivities of all cross-validation folds. Next is the callback: it takes three arguments, as described in the documentation of RepeatedMeasure. The second argument is the node that is evaluated inside the loop. For a cross-validation this is a TransferMeasure that exposes its internal classifier via the measure property. The rest is straightforward. We construct a sensitivity analyzer and pass the input dataset. Finally, we store the returned sensitivities.
Can PyMVPA deal with literal class labels?¶
Yes. For all external machine learning libraries that do not support literal labels, PyMVPA will transparently convert them to numerical ones, and also revert this transformation for all output values.
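A minimal sketch with string targets (the sample data are made up for illustration, and an SVM backend is assumed to be available):
>>> import numpy as np
>>> from mvpa2.suite import dataset_wizard, LinearCSVMC
>>> ds = dataset_wizard(samples=np.random.randn(6, 3),
...                     targets=['face', 'house', 'face',
...                              'house', 'face', 'house'])
>>> clf = LinearCSVMC()
>>> clf.train(ds)
>>> predictions = clf.predict(ds.samples)  # literal labels are returned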