Dataset basics and concepts
Note
This tutorial part is also available for download as an IPython notebook: [ipynb]
A Dataset is the basic data container in PyMVPA. It serves as the primary form of data storage, but also as a common container for results returned by most algorithms. In this tutorial part we will take a look at what a dataset consists of, and how it works.
Most datasets in PyMVPA are represented as a two-dimensional array, where the first axis is the samples axis, and the second axis represents the features of the samples. In the simplest case, a dataset only contains data that is a matrix of numerical values.
>>> from mvpa2.tutorial_suite import *
>>> data = [[ 1, 1, -1],
... [ 2, 0, 0],
... [ 3, 1, 1],
... [ 4, 0, -1]]
>>> ds = Dataset(data)
>>> ds.shape
(4, 3)
>>> len(ds)
4
>>> ds.nfeatures
3
>>> ds.samples
array([[ 1, 1, -1],
[ 2, 0, 0],
[ 3, 1, 1],
[ 4, 0, -1]])
In the above example, every row vector in the data matrix becomes an observation, a sample, in the dataset, and every column vector represents an individual variable, a feature. The concepts of samples and features are essential for a dataset, hence we take a closer look.
The dataset assumes that the first axis of the data is to be used to define individual samples. If the dataset is created using a one-dimensional vector it will therefore have as many samples as elements in the vector, and only one feature.
>>> one_d = [ 0, 1, 2, 3 ]
>>> one_ds = Dataset(one_d)
>>> one_ds.shape
(4, 1)
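If, instead, a single sample with four features is desired, one simple option is to wrap the vector in another list, so that the first axis has length one:
>>> Dataset([one_d]).shape
(1, 4)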
On the other hand, if a dataset is created from multi-dimensional data, only its second axis represents the features:
>>> import numpy as np
>>> m_ds = Dataset(np.random.random((3, 4, 2, 3)))
>>> m_ds.shape
(3, 4, 2, 3)
>>> m_ds.nfeatures
4
In this case we have a dataset with three samples and four features, where each feature is a 2x3 matrix. In case you are wondering why each value in the data array is not simply treated as its own feature (which would yield 24 features) – stay tuned, as this is going to be of importance later on.
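If you want to check what full flattening would produce, a quick plain-NumPy reshape (not a PyMVPA operation) confirms the 24 features mentioned above:
>>> m_ds.samples.reshape(len(m_ds), -1).shape
(3, 24)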
Attributes
What we have seen so far does not really warrant the use of a dataset over a plain array or matrix of samples. However, in the MVPA context we often need to know more about each sample than just the values of its features. For example, in order to train a supervised-learning algorithm to discriminate two classes of samples, we need per-sample target values that label each sample with its respective class. Such information can then be used, for example, to split a dataset into specific groups of samples. For this type of auxiliary information a dataset can also contain collections of three types of attributes: sample attributes, feature attributes, and dataset attributes.
For samples
Each sample in a dataset can have an arbitrary number of additional attributes. They are stored as vectors of the same length as the number of samples in a collection, and are accessible via the sa attribute. A collection is similar to a standard Python dict, and hence adding sample attributes works just like adding elements to a dictionary:
>>> ds.sa['some_attr'] = [ 0., 1, 1, 3 ]
>>> ds.sa.keys()
['some_attr']
However, sample attributes are not directly stored as plain data, but for various reasons as a so-called Collectable that in turn embeds a NumPy array with the actual attribute:
>>> type(ds.sa['some_attr'])
<class 'mvpa2.base.collections.ArrayCollectable'>
>>> ds.sa['some_attr'].value
array([ 0., 1., 1., 3.])
This “complication” is done to be able to extend attributes with additional functionality that is often needed and can offer a significant speed-up of processing. For example, sample attributes carry a list of their unique values. This list is only computed once (upon first request) and can subsequently be accessed directly without repeated and expensive searches:
>>> ds.sa['some_attr'].unique
array([ 0., 1., 3.])
However, for most interactive uses of PyMVPA this type of access to attributes' .value is relatively cumbersome (too much typing), therefore collections support direct access by name:
>>> ds.sa.some_attr
array([ 0., 1., 1., 3.])
Another purpose of the sample attribute collection is to preserve data integrity, by disallowing improper attributes:
>>> ds.sa['invalid'] = 4
Traceback (most recent call last):
File "/usr/lib/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
File "<doctest tutorial_datasets.rst[20]>", line 1, in <module>
ds.sa['invalid'] = 4
File "/home/test/pymvpa/mvpa2/base/collections.py", line 459, in __setitem__
value = ArrayCollectable(value)
File "/home/test/pymvpa/mvpa2/base/collections.py", line 171, in __init__
% self.__class__.__name__)
ValueError: ArrayCollectable only takes sequences as value.
>>> ds.sa['invalid'] = [ 1, 2, 3, 4, 5, 6 ]
Traceback (most recent call last):
File "/usr/lib/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
File "<doctest tutorial_datasets.rst[21]>", line 1, in <module>
ds.sa['invalid'] = [ 1, 2, 3, 4, 5, 6 ]
File "/home/test/pymvpa/mvpa2/base/collections.py", line 468, in __setitem__
str(self)))
ValueError: Collectable 'invalid' with length [6] does not match the required length [4] of collection '<SampleAttributesCollection: some_attr>'.
But other than basic plausibility checks, no further constraints on the values of sample attributes exist. As long as the length of the attribute vector matches the number of samples in the dataset, and the attribute values can be stored in a NumPy array, any value is allowed. Consequently, it is even possible to have n-dimensional arrays, not just vectors, as attributes – as long as their first axis matches the number of samples in a dataset. Moreover, it is perfectly possible and supported to store literal (non-numerical) attributes. It should also be noted that each attribute may have its own individual data type, hence it is possible to have literal and numeric attributes in the same dataset.
>>> ds.sa['literal'] = ['one', 'two', 'three', 'four']
>>> sorted(ds.sa.keys())
['literal', 'some_attr']
>>> for attr in ds.sa:
... print "%s: %s" % (attr, ds.sa[attr].value.dtype.name)
literal: string40
some_attr: float64
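To illustrate the point about n-dimensional attributes, here is a sketch that attaches a two-column per-sample attribute (the name some_coords and its zero values are made up for this example):
>>> ds.sa['some_coords'] = np.zeros((4, 2))
>>> ds.sa.some_coords.shape
(4, 2)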
For features
Feature attributes are almost identical to sample attributes; the only difference is that instead of having one attribute value per sample, feature attributes have one value per (guess what? ...) feature. Moreover, they are stored in a separate collection in the dataset that is called fa:
>>> ds.nfeatures
3
>>> ds.fa['my_fav'] = [0, 1, 0]
>>> ds.fa['responsible'] = ['me', 'you', 'nobody']
>>> sorted(ds.fa.keys())
['my_fav', 'responsible']
For the entire dataset
Lastly, there can also be attributes that are neither per-sample nor per-feature, but apply to the dataset as a whole: so-called dataset attributes. Both assigning such attributes and accessing them later on work in exactly the same way as for the other two types of attributes, except that dataset attributes are stored in their own collection, which is accessible via the a property of the dataset. However, in contrast to sample and feature attributes, no constraints on the type or size are imposed – anything can be stored. Let's store a list with the names of all files in the current directory, just because we can:
>>> from glob import glob
>>> ds.a['pointless'] = glob("*")
>>> 'setup.py' in ds.a.pointless
True
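Since anything can be stored, even a plain dictionary should work as a dataset attribute; the following keys and values are made up for illustration:
>>> ds.a['metadata'] = {'subject': 'sub001', 'comment': 'toy data'}
>>> ds.a.metadata['subject']
'sub001'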
Slicing, resampling, feature selection
At this point we can already construct a dataset from simple arrays and enrich it with an arbitrary number of additional attributes. But just having a dataset isn’t enough. We often need to be able to select subsets of a dataset for further processing.
Slicing a dataset (i.e. selecting specific subsets) is very similar to slicing a NumPy array. It actually works almost identically. A dataset supports Python's slice syntax, but also selection by boolean masks and indices. The following three slicing operations result in equivalent output datasets, by always selecting every other sample in the dataset:
>>> # original
>>> ds.samples
array([[ 1, 1, -1],
[ 2, 0, 0],
[ 3, 1, 1],
[ 4, 0, -1]])
>>>
>>> # Python-style slicing
>>> ds[::2].samples
array([[ 1, 1, -1],
[ 3, 1, 1]])
>>>
>>> # Boolean mask array
>>> mask = np.array([True, False, True, False])
>>> ds[mask].samples
array([[ 1, 1, -1],
[ 3, 1, 1]])
>>>
>>> # Slicing by index -- Python indexing start with 0 !!
>>> ds[[0, 2]].samples
array([[ 1, 1, -1],
[ 3, 1, 1]])
Exercise
Search the NumPy documentation for the difference between “basic slicing” and “advanced indexing”. The aspect of memory consumption, especially, applies to dataset slicing as well, and being aware of this fact might help to write more efficient analysis scripts. Which of the three slicing approaches above is the most memory-efficient? Which of the three slicing approaches above might lead to unexpected side-effects if the output dataset gets modified?
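For the memory aspect, a plain NumPy comparison with np.may_share_memory() illustrates the difference between basic slicing and advanced indexing (whether a sliced dataset additionally copies its attribute collections is a separate question):
>>> arr = np.arange(12).reshape(4, 3)
>>> np.may_share_memory(arr, arr[::2])     # basic slicing yields a view
True
>>> np.may_share_memory(arr, arr[[0, 2]])  # advanced indexing yields a copy
False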
All three slicing styles are equally applicable to the selection of feature subsets within a dataset. Remember, features are represented on the second axis of a dataset.
>>> ds[:, [1,2]].samples
array([[ 1, -1],
[ 0, 0],
[ 1, 1],
[ 0, -1]])
By applying a selection by indices to the second axis, we can easily get the last two features of our example dataset. Please note that the : is supplied for the first axis slicing. This is the Python way to indicate that everything along this axis should be taken, thus including all samples.
As you can guess, it is also possible to select subsets of samples and features at the same time.
>>> subds = ds[[0,1], [0,2]]
>>> subds.samples
array([[ 1, -1],
[ 2, 0]])
If you have prior experience with NumPy you might be confused now. What you might have expected is this:
>>> ds.samples[[0,1], [0,2]]
array([1, 0])
The above code applies the same slicing directly to the NumPy array of .samples, and the result is fundamentally different. For NumPy arrays this style of slicing allows selection of specific elements by their indices on each axis of an array. For PyMVPA's datasets this mode is not very useful; instead, we typically want to select rows and columns, i.e. samples and features, given by their indices.
Exercise
Try to select samples [0,1] and features [0,2] simultaneously using dataset slicing. Now apply the same slicing to the samples array itself (ds.samples) – make sure that the result doesn't surprise you, and find a pure NumPy way to achieve a similar selection.
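For reference, one possible pure-NumPy way to select whole rows and columns simultaneously is np.ix_(), which builds an open mesh from the two index lists:
>>> ds.samples[np.ix_([0, 1], [0, 2])]
array([[ 1, -1],
       [ 2,  0]])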
One last interesting thing to look at, in the context of dataset slicing, are the attributes. What happens to them when a subset of samples and/or features is chosen? Our original dataset had both samples and feature attributes:
>>> print ds.sa.some_attr
[ 0. 1. 1. 3.]
>>> print ds.fa.responsible
['me' 'you' 'nobody']
Now let’s look at what they became in the subset-dataset we previously created:
>>> print subds.sa.some_attr
[ 0. 1.]
>>> print subds.fa.responsible
['me' 'nobody']
We see that both attributes are still there and, moreover, that the corresponding value subsets have been selected along with the samples and features. This makes it convenient to select subsets of a dataset matching specific values of sample or feature attributes, or both:
>>> subds = ds[ds.sa.some_attr == 1., ds.fa.responsible == 'me']
>>> print subds.shape
(2, 1)
To simplify such selections based on attribute values, the desired selection can be specified as a dictionary for either the samples or the features dimension, where each key corresponds to an attribute name, and each value specifies a list of desired attribute values. Specifying multiple keys for either dimension can be used to obtain the intersection of matching elements:
>>> subds = ds[{'some_attr': [1., 0.], 'literal': ['two']}, {'responsible': ['me', 'you']}]
>>> print subds.sa.some_attr, subds.sa.literal, subds.fa.responsible
[ 1.] ['two'] ['me' 'you']
Exercise
Check the documentation of the select() method, which can also be used to implement such a selection, but provides an additional argument strict. Modify the example above to select non-existing elements via [], and compare the result to the output of select() with strict=False.
Load fMRI data
Enough theoretical foreplay – let's look at a concrete example of loading an fMRI dataset. PyMVPA has several helper functions to load data from specialized formats, and the one for fMRI data is fmri_dataset(). The example dataset we are going to look at is a single subject from Haxby et al. (2001). For more convenience and less typing, we have a shortcut for the path of the directory with the fMRI data: tutorial_data_path.
In the simplest case, we now let fmri_dataset do its job by just pointing it to the fMRI data file. The data is stored as a NIfTI file that has all volumes of one experiment concatenated into a single file.
>>> bold_fname = os.path.join(tutorial_data_path, 'haxby2001', 'sub001',
... 'BOLD', 'task001_run001', 'bold.nii.gz')
>>> ds = fmri_dataset(bold_fname)
>>> len(ds)
121
>>> ds.nfeatures
163840
>>> ds.shape
(121, 163840)
We can notice two things. First – it worked! Second, we obtained a two-dimensional dataset with 121 samples (these are volumes in the NIfTI file), and over 160k features (these are voxels in the volume). The voxels are represented as a one-dimensional vector, and it seems that they have lost their association with the 3D-voxel-space. However, this is not the case, as we will see later. PyMVPA represents data in this simple format to make it compatible with a vast range of generic algorithms that expect data to be a simple matrix.
We loaded all data from that NIfTI file, but usually we would be interested in a subset only, i.e. “brain voxels”. fmri_dataset is capable of performing data masking; we just need to specify a mask image. Such a mask image is generated in pretty much any fMRI analysis pipeline – be it a full-brain mask computed during skull-stripping, or an activation map from a functional localizer. We are going to use the original GLM-based localizer mask of ventral temporal cortex from Haxby et al. (2001). Let's reload the dataset:
>>> mask_fname = os.path.join(tutorial_data_path, 'haxby2001', 'sub001',
... 'masks', 'orig', 'vt.nii.gz')
>>> ds = fmri_dataset(bold_fname, mask=mask_fname)
>>> len(ds)
121
>>> ds.nfeatures
577
As expected, we get the same number of samples, but now only 577 features – voxels corresponding to non-zero elements in the mask image. Now, let’s explore this dataset a little further.
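As a quick sanity check, the number of features should equal the number of non-zero voxels in the mask image. PyMVPA uses NiBabel for NIfTI input, so the mask can be inspected directly; this sketch assumes nibabel is importable and uses its (then-current) get_data() call:
>>> import nibabel as nb
>>> np.sum(nb.load(mask_fname).get_data() != 0)
577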
Exercise
Explore the dataset attribute collections. What kind of information do they contain?
Besides samples, the dataset offers a number of attributes that enhance the data with information that is present in the NIfTI image file header. Each sample has information about its volume index in the time series and the actual acquisition time (relative to the beginning of the file). Moreover, the original voxel index (sometimes referred to as ijk) for each feature is available too. Finally, the dataset also contains information about the dimensionality of the input volumes, voxel size, and any other NIfTI-specific information, since it also includes a dump of the full NIfTI image header.
>>> ds.sa.time_indices[:5]
array([0, 1, 2, 3, 4])
>>> ds.sa.time_coords[:5]
array([ 0. , 2.5, 5. , 7.5, 10. ])
>>> ds.fa.voxel_indices[:5]
array([[ 6, 23, 24],
[ 7, 18, 25],
[ 7, 18, 26],
[ 7, 18, 27],
[ 7, 19, 25]])
>>> ds.a.voxel_eldim
(3.5, 3.75, 3.75)
>>> ds.a.voxel_dim
(40, 64, 64)
>>> 'imghdr' in ds.a
True
In addition to all this information, the dataset also carries a key additional attribute: the mapper. A mapper is an important concept in PyMVPA, and hence has its own tutorial chapter.
>>> print ds.a.mapper
<Chain: <Flatten>-<StaticFeatureSelection>>
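As a small preview (the details follow in the mapper chapter), the mapper can reverse a flat sample back into a 3D volume. Assuming reverse1() behaves as in recent PyMVPA versions, the shape of the reconstructed volume should match ds.a.voxel_dim:
>>> ds.a.mapper.reverse1(ds.samples[0]).shape
(40, 64, 64)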
Having all these attributes as part of a dataset is often useful, but in some cases (e.g. when it comes to efficiency, and/or very large datasets) one might want a leaner dataset with just the information that is really necessary. One way to achieve this is to strip all unwanted attributes. The Dataset class' copy() method can help with that.
>>> stripped = ds.copy(deep=False, sa=['time_coords'], fa=[], a=[])
>>> print stripped
<Dataset: 121x577@int16, <sa: time_coords>>
We can see that all attributes besides time_coords have been filtered out. Setting the deep argument to False causes the copy function to reuse the data from the source dataset to generate the new stripped one, without duplicating all data in memory – meaning both datasets now share the sample data, and any change made to ds will also affect stripped.
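One way to convince yourself of this sharing is a plain NumPy check:
>>> np.may_share_memory(ds.samples, stripped.samples)
True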
Intermediate storage
Some data preprocessing can take a long time. One would rather not have to do it over and over again, and instead just store the preprocessed data in a file for subsequent analyses. PyMVPA offers functionality to store a large variety of objects, including datasets, in HDF5 files. A variant of this format is also used by recent versions of Matlab to store data.
For HDF5 support, PyMVPA depends on the h5py package. If it is available, any dataset can be saved to a file by simply calling save() with the desired filename.
>>> import tempfile, shutil
>>> # create a temporary directory
>>> tempdir = tempfile.mkdtemp()
>>> ds.save(os.path.join(tempdir, 'mydataset.hdf5'))
HDF5 is a flexible format that also supports, for example, data compression. To enable it, you can pass additional arguments to save() that are supported by h5py's Group.create_dataset(). Instead of using save() one can also use the h5save() function in a similar way. Saving the same dataset with maximum gzip-compression looks like this:
>>> ds.save(os.path.join(tempdir, 'mydataset.gzipped.hdf5'), compression=9)
>>> h5save(os.path.join(tempdir, 'mydataset.gzipped.hdf5'), ds, compression=9)
Loading datasets from a file is easy too. h5load() takes a filename as an argument and returns the stored dataset. Compressed data will be handled transparently.
>>> loaded = h5load(os.path.join(tempdir, 'mydataset.hdf5'))
>>> np.all(ds.samples == loaded.samples)
True
>>> # cleanup the temporary directory, and everything it includes
>>> shutil.rmtree(tempdir, ignore_errors=True)
Note that this type of dataset storage is not appropriate for long-term archival of data, as it relies on a stable software environment. For long-term storage, use other formats.