mvpa2.misc.data_generators.Dataset

class mvpa2.misc.data_generators.Dataset(samples, sa=None, fa=None, a=None)

Generic storage class for datasets with multiple attributes.

A dataset consists of four pieces. The core is a two-dimensional array that has variables (so-called features) in its columns and the associated observations (so-called samples) in its rows. In addition, a dataset may have any number of attributes for features and samples. Unsurprisingly, these are called ‘feature attributes’ and ‘sample attributes’. Each attribute is a vector of any datatype that contains one value for each item (feature or sample). Both types of attributes are organized in their respective collections, accessible via the sa (sample attribute) and fa (feature attribute) attributes. Finally, a dataset itself may have any number of additional attributes (e.g. a mapper) that are stored in their own collection, accessible via the a attribute (see examples below).

Notes

Any dataset might have a mapper attached that is stored as a dataset attribute called mapper.

Examples

The simplest way to create a dataset is from a 2D array.

>>> import numpy as np
>>> from mvpa2.datasets import *
>>> samples = np.arange(12).reshape((4,3))
>>> ds = AttrDataset(samples)
>>> ds.nsamples
4
>>> ds.nfeatures
3
>>> ds.samples
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

The above dataset can only be used for unsupervised machine-learning algorithms, since it doesn’t have any targets associated with its samples. However, creating a labeled dataset is equally simple.

>>> ds_labeled = dataset_wizard(samples, targets=range(4))

Both the labeled and the unlabeled dataset share the same samples array. No copying is performed.

>>> ds.samples is ds_labeled.samples
True

If the data should not be shared the samples array has to be copied beforehand.

The targets are available from the samples attributes collection, but also via the convenience property targets.

>>> ds_labeled.sa.targets is ds_labeled.targets
True

If desired, it is possible to add an arbitrary number of additional attributes. Regardless of their original sequence type, they will be converted into an array.

>>> ds_labeled.sa['lovesme'] = [0,0,1,0]
>>> ds_labeled.sa.lovesme
array([0, 0, 1, 0])

An alternative method to create datasets with arbitrary attributes is to provide the attribute collections to the constructor itself – which would also test for an appropriate size of the given attributes:

>>> fancyds = AttrDataset(samples, sa={'targets': range(4),
...                                'lovesme': [0,0,1,0]})
>>> fancyds.sa.lovesme
array([0, 0, 1, 0])

Exactly the same logic applies to feature attributes as well.

Datasets can be sliced (selecting a subset of samples and/or features) similar to arrays. Selection is possible using boolean selection masks, index sequences or slicing arguments. The following calls for samples selection all result in the same dataset:

>>> sel1 = ds[np.array([False, True, True, False])]
>>> sel2 = ds[[1,2]]
>>> sel3 = ds[1:3]
>>> np.all(sel1.samples == sel2.samples)
True
>>> np.all(sel2.samples == sel3.samples)
True

During selection, data is only copied if necessary. If the slicing syntax is used, the resulting dataset will share the samples with the original dataset (here and below we compare .base against both ds.samples and its .base for compatibility with NumPy < 1.7):

>>> sel1.samples.base in (ds.samples.base, ds.samples)
False
>>> sel2.samples.base in (ds.samples.base, ds.samples)
False
>>> sel3.samples.base in (ds.samples.base, ds.samples)
True
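
The same view-versus-copy behaviour can be observed on a plain NumPy array, independent of PyMVPA. The following is only a minimal sketch; it assumes that the dataset passes the selection through to the underlying array and uses np.shares_memory instead of the .base comparison shown above.

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))

# Basic slicing returns a view onto the original buffer.
by_slice = samples[1:3]
# Index lists and boolean masks use "advanced" indexing, which copies.
by_index = samples[[1, 2]]
by_mask = samples[np.array([False, True, True, False])]

shares_slice = np.shares_memory(samples, by_slice)
shares_index = np.shares_memory(samples, by_index)
shares_mask = np.shares_memory(samples, by_mask)
print(shares_slice, shares_index, shares_mask)
```

All three selections contain the same rows; only the slice shares memory with the original array.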

For feature selection the syntax is very similar; features are simply represented on the second axis of the samples array. Plain feature selection is achieved by keeping all samples and selecting a subset of features (all syntax variants for samples selection are also supported for feature selection).

>>> fsel = ds[:, 1:3]
>>> fsel.samples
array([[ 1,  2],
       [ 4,  5],
       [ 7,  8],
       [10, 11]])

It is also possible to simultaneously select a subset of samples and features. Using the slicing syntax, no copying will be performed.

>>> fsel = ds[:3, 1:3]
>>> fsel.samples
array([[1, 2],
       [4, 5],
       [7, 8]])
>>> fsel.samples.base in (ds.samples.base, ds.samples)
True

Please note that simultaneous selection of samples and features is not always congruent to array slicing.

>>> ds[[0,1,2], [1,2]].samples
array([[1, 2],
       [4, 5],
       [7, 8]])

By contrast, the call ‘ds.samples[[0,1,2], [1,2]]’ on the plain array would not be possible. In AttrDatasets, selection of samples and features is always applied individually and independently to each axis.
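
For comparison, here is a sketch of how the same per-axis selection might be reproduced on a plain NumPy array. It uses np.ix_ purely as an illustration; whether the dataset uses it internally is not stated here.

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))
rows, cols = [0, 1, 2], [1, 2]

# Passing both lists directly asks NumPy to pair them element-wise,
# which fails because their lengths (3 and 2) cannot be broadcast.
try:
    samples[rows, cols]
    paired_ok = True
except IndexError:
    paired_ok = False

# np.ix_ builds an open mesh, so each list is applied to its own axis.
independent = samples[np.ix_(rows, cols)]
print(paired_ok)
print(independent)
```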

Attributes

sa (Collection) Access to all sample attributes, where each attribute is a named vector (1d-array) of an arbitrary datatype, with as many elements as rows in the samples array of the dataset.
fa (Collection) Access to all feature attributes, where each attribute is a named vector (1d-array) of an arbitrary datatype, with as many elements as columns in the samples array of the dataset.
a (Collection) Access to all dataset attributes, where each attribute is a named element of an arbitrary datatype.

Methods

aggregate_features(dataset[, fx]) Apply a function to each row of the samples matrix of a dataset.
append(other) This method should not be used and will be removed in the future
coarsen_chunks(source[, nchunks]) Change chunking of the dataset
copy([deep, sa, fa, a, memo]) Create a copy of a dataset.
find_collection(attr) Lookup collection that contains an attribute of a given name.
from_channeltimeseries(samples[, targets, ...]) Create a dataset from segmented, per-channel timeseries.
from_hdf5(source[, name]) Load a Dataset from HDF5 file
from_npz(filename) Load dataset from NumPy’s .npz file.
from_wizard(samples[, targets, chunks, ...]) Convenience method to create dataset.
get_attr(name) Return an attribute from a collection.
get_mapped(mapper) Feed this dataset through a trained mapper (forward).
get_nsamples_per_attr(dataset, attr) Returns the number of samples per unique value of a sample attribute.
get_samples_by_attr(dataset, attr, values[, ...]) Return indices of samples given a list of attributes
get_samples_per_chunk_target(dataset[, ...]) Returns an array with the number of samples per target in each chunk.
init_origids(which[, attr, mode]) Initialize the dataset’s ‘origids’ attribute.
item() Provide the first element of samples array.
random_samples(dataset, npertarget[, ...]) Create a dataset with a random subset of samples.
remove_invariant_features(dataset) Returns a new dataset with all invariant features removed.
remove_nonfinite_features(dataset) Returns a new dataset with all non-finite (NaN,Inf) features removed
save(dataset, destination[, name, compression]) Save Dataset into HDF5 file
select([sadict, fadict, strict]) Helper to select samples/features given dictionaries describing selection
set_attr(name, value) Set an attribute in a collection.
summary(dataset[, stats, lstats, sstats, ...]) String summary over the object
summary_targets(dataset[, targets_attr, ...]) Provide summary statistics over the targets and chunks
to_npz(filename[, compress]) Save dataset to a .npz file storing all fa/sa/a which are ndarrays

A Dataset might have an arbitrary number of attributes for samples, features, or the dataset as a whole. However, only the data samples themselves are required.

Parameters:

samples : ndarray

Data samples. This has to be a two-dimensional (samples x features) array. If the samples are not in that format, please consider one of the AttrDataset.from_* classmethods.

sa : SampleAttributesCollection

Samples attributes collection.

fa : FeatureAttributesCollection

Features attributes collection.

a : DatasetAttributesCollection

Dataset attributes collection.

Attributes

C
O
S
T
UC
UT
chunks
idhash To verify whether the dataset is in the same state as when something else was done
mapper
nfeatures
nsamples Number of samples (rows of the samples array)
shape
targets
uniquechunks
uniquetargets

find_collection(attr)

Lookup collection that contains an attribute of a given name.

Collections are searched in the following order: sample attributes, feature attributes, dataset attributes. The first collection containing a matching attribute is returned.

Parameters:

attr : str

Attribute name to be looked up.

Returns:

Collection

If no matching collection is found, a LookupError exception is raised.
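
The search order described above can be illustrated with a small stand-alone sketch. It models the three collections as plain dictionaries; the function and the sample data are hypothetical and are not the library's implementation.

```python
def find_collection(collections, attr):
    """Return the first collection (in sa, fa, a order) that holds attr."""
    for name in ("sa", "fa", "a"):
        if attr in collections[name]:
            return name, collections[name]
    raise LookupError("no collection contains attribute %r" % attr)


# Hypothetical collections; 'roi' exists in both 'fa' and 'a'.
collections = {
    "sa": {"targets": [0, 1, 2, 3]},
    "fa": {"roi": ["a", "b", "c"]},
    "a": {"mapper": None, "roi": "whole-brain"},
}

found_name, found = find_collection(collections, "roi")
print(found_name)
```

Because feature attributes are searched before dataset attributes, the lookup of 'roi' resolves to the 'fa' collection in this sketch.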

classmethod from_channeltimeseries(samples, targets=None, chunks=None, t0=None, dt=None, channelids=None)

Create a dataset from segmented, per-channel timeseries.

Channels are assumed to contain multiple, equally spaced acquisition timepoints. The dataset will contain additional feature attributes associating each feature with a specific channel and timepoint.

Parameters:

samples : ndarray

Three-dimensional array: (samples x channels x timepoints).

t0 : float

Reference time of the first timepoint. Can be used to preserve information about the onset of some stimulation. Preferably in seconds.

dt : float

Temporal distance between two timepoints. Preferably in seconds.

channelids : list

List of channel names.

targets, chunks

See Dataset.from_wizard for documentation about these arguments.
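
To make the relationship between features, channels and timepoints concrete, the following NumPy sketch flattens a (samples x channels x timepoints) array and derives one channel label and one time value per feature. The row-major flattening order, the variable names, and the example channel names are assumptions made for illustration; the names of the feature attributes actually created are not given here.

```python
import numpy as np

nsamples, nchannels, ntimepoints = 2, 3, 4
samples = np.arange(nsamples * nchannels * ntimepoints).reshape(
    (nsamples, nchannels, ntimepoints))
t0, dt = -0.1, 0.05                 # seconds
channelids = ["Fz", "Cz", "Pz"]     # hypothetical channel names

# Keep the first axis and merge channels and timepoints (row-major order).
flat = samples.reshape((nsamples, -1))

# One label per feature: each channel is repeated for all of its timepoints.
channel_per_feature = np.repeat(channelids, ntimepoints)
# One time value per feature: the time axis restarts for every channel.
time_per_feature = np.tile(t0 + dt * np.arange(ntimepoints), nchannels)

print(flat.shape, channel_per_feature[5], time_per_feature[5])
```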

classmethod from_wizard(samples, targets=None, chunks=None, mask=None, mapper=None, flatten=None, space=None)

Convenience method to create dataset.

Datasets can be created from N-dimensional samples. Data arrays with more than two dimensions are going to be flattened, while preserving the first axis (separating the samples) and concatenating all others as the second axis. Optionally, it is possible to specify targets and chunk attributes for all samples, and masking of the input data (only selecting elements corresponding to non-zero mask elements).

Parameters:

samples : ndarray

N-dimensional samples array. The first axis separates individual samples.

targets : scalar or ndarray, optional

Labels for all samples. If a scalar is provided, its value is assigned as label to all samples.

chunks : scalar or ndarray, optional

Chunks definition for all samples. If a scalar is provided, its value is assigned as chunk of all samples.

mask : ndarray, optional

The shape of the array has to correspond to the shape of a single sample (shape(samples)[1:] == shape(mask)). Its non-zero elements are used to mask the input data.

mapper : Mapper instance, optional

A trained mapper instance that is used to forward-map possibly already flattened (see flatten) and masked samples upon construction of the dataset. The mapper must have a simple feature space (samples x features) as output. Use a ChainMapper to achieve that, if necessary.

flatten : None or bool, optional

If None (default) and no mapper is provided, the data gets flattened. A bool value explicitly instructs whether to flatten the data before possibly passing it into the mapper, if no mask is given.

space : str, optional

If provided it is assigned to the mapper instance that performs the initial flattening of the data.

Returns:

instance : Dataset
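
The flattening and masking described for from_wizard can be approximated in plain NumPy. This is only a sketch of the array operations under the stated shape rule (shape(samples)[1:] == shape(mask)); it does not create the mapper that the real method attaches, and the variable names are illustrative.

```python
import numpy as np

samples = np.arange(24).reshape((2, 3, 4))   # 2 samples, each of shape (3, 4)

# Flattening: keep the first axis, concatenate all others into the second.
flat = samples.reshape((samples.shape[0], -1))

# Masking: the mask has the shape of a single sample; non-zero entries select.
mask = np.zeros((3, 4), dtype=int)
mask[0, 1] = 1
mask[2, 3] = 1
assert mask.shape == samples.shape[1:]
masked = samples[:, mask != 0]

# A scalar target is assigned to every sample.
targets = np.repeat(1, samples.shape[0])

print(flat.shape, masked.shape)
print(masked)
```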

get_attr(name)

Return an attribute from a collection.

A collection can be specified, but can also be auto-detected.

Parameters:

name : str

Attribute name. The attribute name can also be prefixed with any valid collection name (‘sa’, ‘fa’, or ‘a’) separated with a ‘.’, e.g. ‘sa.targets’. If no collection prefix is found auto-detection of the collection is attempted.

Returns:

(attr, collection)

2-tuple: First element is the requested attribute and the second element is the collection that contains the attribute. If no matching attribute can be found a LookupError exception is raised.
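
The naming convention for get_attr can be sketched as follows, again with plain dictionaries standing in for the collections. The helper below is hypothetical; in particular, how the library treats an unrecognised prefix is an assumption.

```python
COLLECTION_NAMES = ("sa", "fa", "a")


def get_attr(collections, name):
    """Return (attribute, collection) for 'targets' or 'sa.targets'."""
    prefix, sep, rest = name.partition(".")
    if sep and prefix in COLLECTION_NAMES:
        # Explicit prefix: look only in the named collection.
        candidates, key = (prefix,), rest
    else:
        # No recognised prefix: try every collection in turn.
        candidates, key = COLLECTION_NAMES, name
    for col_name in candidates:
        if key in collections[col_name]:
            return collections[col_name][key], collections[col_name]
    raise LookupError("no attribute named %r" % name)


collections = {"sa": {"targets": [0, 1, 2, 3]}, "fa": {}, "a": {"mapper": None}}
explicit = get_attr(collections, "sa.targets")
detected = get_attr(collections, "targets")
print(explicit[0], detected[1] is collections["sa"])
```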

get_mapped(mapper)

Feed this dataset through a trained mapper (forward).

Parameters:

mapper : Mapper

This mapper instance has to be trained.

Returns:

Dataset

The forward-mapped dataset.

idhash

To verify whether the dataset is in the same state as it was when some other operation was performed.

For example, to check whether a classifier was trained on the very dataset in question.

item()

Provide the first element of samples array.

Notes

Introduced to provide compatibility with numpy.asscalar. See numpy.ndarray.item for more information.

mapper
select(sadict=None, fadict=None, strict=True)

Helper to select samples/features given dictionaries describing selection

Generally __getitem__ (i.e. []) should be used, but this function might be useful whenever non-strict selection (strict=False) is required.

See match() for more information about specification of selection dictionaries.

Parameters:

sadict, fadict : dict, optional

Dictionaries describing the selection for samples and features, respectively.

strict : bool, optional

If True, a ValueError exception is raised when any specified selection key/value pair has no match. If False, missing matches are tolerated; however, if only a single value is given for a key, or none of the values match, the result is an empty selection.

set_attr(name, value)

Set an attribute in a collection.

Parameters:

name : str

Collection and attribute name. This has to be in the same format as for get_attr().

value : array

Value of the attribute.

targets
uniquechunks
uniquetargets