mvpa2.misc.surfing.queryengine.AttrDataset

class mvpa2.misc.surfing.queryengine.AttrDataset(samples, sa=None, fa=None, a=None)

Generic storage class for datasets with multiple attributes.

A dataset consists of four pieces. The core is a two-dimensional array that has variables (so-called features) in its columns and the associated observations (so-called samples) in the rows. In addition, a dataset may have any number of attributes for features and samples. Unsurprisingly, these are called ‘feature attributes’ and ‘sample attributes’. Each attribute is a vector of any datatype that contains one value per item (feature or sample). Both types of attributes are organized in their respective collections, accessible via the sa (sample attribute) and fa (feature attribute) attributes. Finally, a dataset itself may have any number of additional attributes (e.g. a mapper) that are stored in their own collection, accessible via the a attribute (see examples below).

Notes

Any dataset might have a mapper attached that is stored as a dataset attribute called mapper.

Examples

The simplest way to create a dataset is from a 2D array.

>>> import numpy as np
>>> from mvpa2.datasets import *
>>> samples = np.arange(12).reshape((4,3))
>>> ds = AttrDataset(samples)
>>> ds.nsamples
4
>>> ds.nfeatures
3
>>> ds.samples
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

The above dataset can only be used for unsupervised machine-learning algorithms, since it doesn’t have any targets associated with its samples. However, creating a labeled dataset is equally simple.

>>> ds_labeled = dataset_wizard(samples, targets=range(4))

Both the labeled and the unlabeled dataset share the same samples array. No copying is performed.

>>> ds.samples is ds_labeled.samples
True

If the data should not be shared the samples array has to be copied beforehand.

The targets are available from the samples attributes collection, but also via the convenience property targets.

>>> ds_labeled.sa.targets is ds_labeled.targets
True

If desired, it is possible to add an arbitrary number of additional attributes. Regardless of their original sequence type, they will be converted into an array.

>>> ds_labeled.sa['lovesme'] = [0,0,1,0]
>>> ds_labeled.sa.lovesme
array([0, 0, 1, 0])

An alternative method to create datasets with arbitrary attributes is to provide the attribute collections to the constructor itself – which would also test for an appropriate size of the given attributes:

>>> fancyds = AttrDataset(samples, sa={'targets': range(4),
...                                'lovesme': [0,0,1,0]})
>>> fancyds.sa.lovesme
array([0, 0, 1, 0])

Exactly the same logic applies to feature attributes as well.

Datasets can be sliced (selecting a subset of samples and/or features) similar to arrays. Selection is possible using boolean selection masks, index sequences or slicing arguments. The following calls for samples selection all result in the same dataset:

>>> sel1 = ds[np.array([False, True, True, False])]
>>> sel2 = ds[[1,2]]
>>> sel3 = ds[1:3]
>>> np.all(sel1.samples == sel2.samples)
True
>>> np.all(sel2.samples == sel3.samples)
True

During selection, data is only copied if necessary. If the slicing syntax is used, the resulting dataset will share the samples with the original dataset (here and below we compare .base against both ds.samples and its .base for compatibility with NumPy < 1.7):

>>> sel1.samples.base in (ds.samples.base, ds.samples)
False
>>> sel2.samples.base in (ds.samples.base, ds.samples)
False
>>> sel3.samples.base in (ds.samples.base, ds.samples)
True
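
The view-versus-copy behaviour shown above mirrors plain NumPy indexing. The following is a minimal sketch that uses only NumPy arrays (no AttrDataset is involved), so it illustrates the underlying mechanism rather than the dataset implementation itself; np.shares_memory is used here as a stand-in for the .base comparison.

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))

# Boolean masks and index lists trigger NumPy "advanced" indexing,
# which always produces a copy of the selected data.
by_mask = samples[np.array([False, True, True, False])]
by_index = samples[[1, 2]]

# A slice object triggers "basic" indexing, which returns a view
# onto the memory of the original array.
by_slice = samples[1:3]

print(np.shares_memory(by_mask, samples))   # False
print(np.shares_memory(by_index, samples))  # False
print(np.shares_memory(by_slice, samples))  # True
```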

For feature selection the syntax is very similar; features are simply represented on the second axis of the samples array. Plain feature selection is achieved by keeping all samples and selecting a subset of features (all syntax variants for sample selection are also supported for feature selection).

>>> fsel = ds[:, 1:3]
>>> fsel.samples
array([[ 1,  2],
       [ 4,  5],
       [ 7,  8],
       [10, 11]])

It is also possible to simultaneously select a subset of samples and features. Using the slicing syntax, no copying will be performed.

>>> fsel = ds[:3, 1:3]
>>> fsel.samples
array([[1, 2],
       [4, 5],
       [7, 8]])
>>> fsel.samples.base in (ds.samples.base, ds.samples)
True

Please note that simultaneous selection of samples and features is not always congruent to array slicing.

>>> ds[[0,1,2], [1,2]].samples
array([[1, 2],
       [4, 5],
       [7, 8]])

By contrast, the call ‘ds.samples[[0,1,2], [1,2]]’ on the plain array would not be possible, because NumPy tries to pair the two index lists element by element. In an AttrDataset, selection of samples and features is always applied individually and independently to each axis.
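
For comparison, the difference can be reproduced with NumPy alone. This is a sketch of the indexing semantics only; using np.ix_ to obtain per-axis selection on a plain array is an illustration chosen here, not a statement about how AttrDataset is implemented.

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))
rows, cols = [0, 1, 2], [1, 2]

# Passing both lists directly asks NumPy to pair them element by element;
# index shapes (3,) and (2,) cannot be broadcast together.
try:
    samples[rows, cols]
    paired_ok = True
except IndexError:
    paired_ok = False

# np.ix_ builds an open mesh, so each list is applied to its own axis.
block = samples[np.ix_(rows, cols)]

print(paired_ok)
print(block)
```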

Attributes

sa (Collection) Access to all sample attributes, where each attribute is a named vector (1d-array) of an arbitrary datatype, with as many elements as rows in the samples array of the dataset.
fa (Collection) Access to all feature attributes, where each attribute is a named vector (1d-array) of an arbitrary datatype, with as many elements as columns in the samples array of the dataset.
a (Collection) Access to all dataset attributes, where each attribute is a named element of an arbitrary datatype.

Methods

aggregate_features(dataset[, fx]) Apply a function to each row of the samples matrix of a dataset.
append(other) This method should not be used and will be removed in the future
coarsen_chunks(source[, nchunks]) Change chunking of the dataset
copy([deep, sa, fa, a, memo]) Create a copy of a dataset.
from_hdf5(source[, name]) Load a Dataset from HDF5 file
from_npz(filename) Load dataset from NumPy’s .npz file, as e.g. stored by to_npz
get_nsamples_per_attr(dataset, attr) Returns the number of samples per unique value of a sample attribute.
get_samples_by_attr(dataset, attr, values[, ...]) Return indices of samples given a list of attributes
get_samples_per_chunk_target(dataset[, ...]) Returns an array with the number of samples per target in each chunk.
init_origids(which[, attr, mode]) Initialize the dataset’s ‘origids’ attribute.
random_samples(dataset, npertarget[, ...]) Create a dataset with a random subset of samples.
remove_invariant_features(dataset) Returns a new dataset with all invariant features removed.
remove_nonfinite_features(dataset) Returns a new dataset with all non-finite (NaN,Inf) features removed
save(dataset, destination[, name, compression]) Save Dataset into HDF5 file
summary(dataset[, stats, lstats, sstats, ...]) String summary over the object
summary_targets(dataset[, targets_attr, ...]) Provide summary statistics over the targets and chunks
to_npz(filename[, compress]) Save dataset to a .npz file storing all fa/sa/a which are ndarrays

A Dataset might have an arbitrary number of attributes for samples, features, or the dataset as a whole. However, only the data samples themselves are required.

Parameters:

samples : ndarray

Data samples. This has to be a two-dimensional (samples x features) array. If the samples are not in that format, please consider one of the AttrDataset.from_* classmethods.

sa : SampleAttributesCollection

Samples attributes collection.

fa : FeatureAttributesCollection

Features attributes collection.

a : DatasetAttributesCollection

Dataset attributes collection.

Attributes

nfeatures
nsamples len(object) -> integer
shape

aggregate_features(dataset, fx=<function mean>)

Apply a function to each row of the samples matrix of a dataset.

The functor given as fx has to honour an axis keyword argument in the way that NumPy uses it (e.g. numpy.mean, numpy.var).

Returns: a new Dataset object with the aggregated feature(s).
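
The axis requirement can be illustrated with NumPy alone. In this sketch, aggregate_rows is a hypothetical helper (not part of PyMVPA) that reduces each row of a samples matrix with fx(..., axis=1); keeping the result two-dimensional is an assumption made for the example.

```python
import numpy as np

def aggregate_rows(samples, fx=np.mean):
    """Hypothetical helper: reduce each row to one value via fx(..., axis=1)."""
    reduced = fx(samples, axis=1)
    # Keep a (samples x features) layout with a single aggregated feature.
    return np.asarray(reduced).reshape(len(samples), -1)

samples = np.arange(12).reshape((4, 3))
row_means = aggregate_rows(samples)         # default functor: np.mean
row_vars = aggregate_rows(samples, np.var)  # any functor accepting axis=

print(row_means.ravel())
print(row_vars.ravel())
```
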
append(other)

This method should not be used and will be removed in the future

coarsen_chunks(source, nchunks=4)

Change chunking of the dataset

Group chunks together to match the desired number of chunks. This makes sense if originally there was no strong grouping into chunks, or if each sample was independent and thus belonged to its own chunk.

Parameters:

source : Dataset or list of chunk ids

dataset or list of chunk ids to operate on. If Dataset, then its chunks get modified

nchunks : int

desired number of chunks
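
One way to picture the regrouping is the NumPy sketch below. The function coarsen_chunk_ids and its strategy (splitting the sorted unique chunk ids into contiguous, nearly equal groups) are assumptions made for illustration; the actual grouping strategy used by coarsen_chunks may differ.

```python
import numpy as np

def coarsen_chunk_ids(chunks, nchunks=4):
    """Hypothetical sketch: merge unique chunk ids into nchunks groups."""
    chunks = np.asarray(chunks)
    unique_ids, inverse = np.unique(chunks, return_inverse=True)
    # Split positions of the sorted unique ids into contiguous groups.
    groups = np.array_split(np.arange(len(unique_ids)), nchunks)
    new_id = np.empty(len(unique_ids), dtype=int)
    for group_id, members in enumerate(groups):
        new_id[members] = group_id
    # Map every sample back to the id of the group its chunk landed in.
    return new_id[inverse]

chunks = np.arange(8)  # every sample in its own chunk
coarse = coarsen_chunk_ids(chunks, nchunks=4)
print(coarse)
```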

copy(deep=True, sa=None, fa=None, a=None, memo=None)

Create a copy of a dataset.

By default this is going to return a deep copy of the dataset, hence no data would be shared between the original dataset and its copy.

Parameters:

deep : boolean, optional

If False, a shallow copy of the dataset is returned instead. The copy contains only views of the samples, sample attributes and feature attributes, as well as shallow copies of all dataset attributes.

sa : list or None

List of attributes in the sample attributes collection to include in the copy of the dataset. If None all attributes are considered. If an empty list is given, all attributes are stripped from the copy.

fa : list or None

List of attributes in the feature attributes collection to include in the copy of the dataset. If None all attributes are considered. If an empty list is given, all attributes are stripped from the copy.

a : list or None

List of attributes in the dataset attributes collection to include in the copy of the dataset. If None all attributes are considered. If an empty list is given, all attributes are stripped from the copy.

memo : dict

Developers only: This argument is only useful if copy() is called inside the __deepcopy__() method and refers to the dict-argument memo in the Python documentation.
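
The practical difference between a view and an independent copy can be seen on a plain NumPy array. This sketch only demonstrates the array-level semantics that the deep parameter description refers to; it does not call copy() itself.

```python
import numpy as np

original = np.arange(6).reshape((2, 3))
shallow = original.view()  # new array object, same underlying buffer
deep = original.copy()     # independent buffer

# A later change to the original is visible through the view only.
original[0, 0] = 99

print(shallow[0, 0])  # 99
print(deep[0, 0])     # 0
```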

classmethod from_hdf5(source, name=None)

Load a Dataset from HDF5 file

Parameters:

source : string or h5py.highlevel.File

Filename or HDF5’s File to load dataset from

name : string, optional

If the file contains multiple entries at the first level, name (if provided) specifies the group to be loaded as the AttrDataset.

Returns:

AttrDataset

Raises:

ValueError

classmethod from_npz(filename)

Load dataset from NumPy’s .npz file, as e.g. stored by to_npz

The file is expected to have a ‘samples’ item, which serves as the samples, and other items prefixed with the name of the corresponding collection (e.g. ‘sa.’ or ‘fa.’). All other entries are skipped.

Parameters:

filename: str

Filename for the .npz file. Can be specified without .npz suffix
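
The expected file layout can be sketched with NumPy alone. The snippet below writes an archive to an in-memory buffer and sorts its entries by prefix; the attribute names (targets, roi), the ‘unrelated’ entry and the handling of an ‘a.’ prefix are assumptions for illustration, and the parsing loop is not the from_npz implementation.

```python
import io
import numpy as np

# Write an archive laid out as described: 'samples' plus prefixed items.
buf = io.BytesIO()
np.savez(buf, **{
    'samples': np.arange(12).reshape((4, 3)),
    'sa.targets': np.arange(4),
    'fa.roi': np.array([1, 1, 2]),
    'unrelated': np.zeros(2),  # no known prefix, skipped below
})
buf.seek(0)

collections = {'sa': {}, 'fa': {}, 'a': {}}
with np.load(buf) as archive:
    samples = archive['samples']
    for key in archive.files:
        prefix, _, name = key.partition('.')
        if name and prefix in collections:
            collections[prefix][name] = archive[key]

print(samples.shape, sorted(collections['sa']), sorted(collections['fa']))
```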

get_nsamples_per_attr(dataset, attr)

Returns the number of samples per unique value of a sample attribute.

Parameters:

attr : str

Name of the sample attribute

Returns:

dict with the number of samples (value) per unique attribute (key).
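
The shape of the returned mapping can be illustrated with NumPy. This is a sketch of the counting idea on a plain attribute vector (the target labels are made up for the example), not the method's implementation.

```python
import numpy as np

targets = np.array(['rest', 'face', 'face', 'house', 'rest', 'face'])

# Count how often each unique attribute value occurs.
values, counts = np.unique(targets, return_counts=True)
nsamples_per_value = {str(v): int(c) for v, c in zip(values, counts)}

print(nsamples_per_value)
```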

get_samples_by_attr(dataset, attr, values, sort=True)

Return indices of samples given a list of attributes
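
As a rough NumPy analogue, the indices of samples whose attribute matches any of the given values can be obtained as follows. The labels are made up, and only the case of indices in ascending order is covered; how the sort parameter changes the result is not shown here.

```python
import numpy as np

targets = np.array(['face', 'house', 'rest', 'face', 'rest'])
wanted = ['rest', 'face']

# Indices (in ascending order) of samples whose value is in `wanted`.
indices = np.flatnonzero(np.isin(targets, wanted))

print(indices)
```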

get_samples_per_chunk_target(dataset, targets_attr='targets', chunks_attr='chunks')

Returns an array with the number of samples per target in each chunk.

Array shape is (chunks x targets).

Parameters:

dataset : Dataset

Source dataset.
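
The (chunks x targets) layout can be reproduced on plain attribute vectors. This is a NumPy sketch with made-up chunk and target values; ordering rows and columns by sorted unique value is an assumption of the example.

```python
import numpy as np

chunks = np.array([0, 0, 0, 1, 1, 1])
targets = np.array(['a', 'b', 'a', 'a', 'b', 'b'])

# Translate each value into its position among the sorted unique values.
uchunks, chunk_idx = np.unique(chunks, return_inverse=True)
utargets, target_idx = np.unique(targets, return_inverse=True)

# Accumulate one count per sample into the (chunks x targets) table.
counts = np.zeros((len(uchunks), len(utargets)), dtype=int)
np.add.at(counts, (chunk_idx, target_idx), 1)

print(counts)
```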

init_origids(which, attr='origids', mode='new')

Initialize the dataset’s ‘origids’ attribute.

The purpose of origids is to make it possible to track the identity of a feature or a sample through the lifetime of a dataset (e.g. across subsequent feature selections).

Calling this method will overwrite any potentially existing IDs (of the XXX)

Parameters:

which : {‘features’, ‘samples’, ‘both’}

An attribute is generated for each feature, sample, or both that represents a unique ID. This ID incorporates the dataset instance ID and should allow merging multiple datasets without causing multiple identical IDs in the resulting dataset.

attr : str

Name of the attribute to store the generated IDs in. By convention this should be ‘origids’ (the default), but might be changed for specific purposes.

mode : {‘existing’, ‘new’, ‘raise’}, optional

Action to take if attr is already present in the collection. With the default, ‘new’, new IDs are generated and replace existing values if any are present. With ‘existing’, existing content is not altered. With ‘raise’, a RuntimeError is raised.

Raises:

RuntimeError

If mode == ‘raise’ and attr is already defined

nfeatures
nsamples

len(object) -> integer

Return the number of items of a sequence or collection.

random_samples(dataset, npertarget, targets_attr='targets')

Create a dataset with a random subset of samples.

Parameters:

dataset : Dataset

npertarget : int or list

If an int is given, the specified number of samples is randomly chosen from the group of samples sharing a unique target value. Total number of selected samples: npertarget x len(uniquetargets). If a list is given of length matching the unique target values, it specifies the number of samples chosen for each particular unique target.

targets_attr : str, optional

Returns:

Dataset

A dataset instance for the chosen samples. All feature attributes and dataset attributes share their data with the source dataset.
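
The two forms of npertarget can be sketched on a plain targets vector. The helper random_sample_indices, its rng argument, and the choice to draw without replacement are assumptions made for this illustration; it returns indices rather than a dataset.

```python
import numpy as np

def random_sample_indices(targets, npertarget, rng=None):
    """Hypothetical sketch: draw sample indices per unique target value."""
    rng = np.random.default_rng() if rng is None else rng
    targets = np.asarray(targets)
    utargets = np.unique(targets)
    # An int applies to every target; a list gives one count per target.
    if np.isscalar(npertarget):
        npertarget = [npertarget] * len(utargets)
    picked = []
    for target, n in zip(utargets, npertarget):
        candidates = np.flatnonzero(targets == target)
        picked.extend(rng.choice(candidates, size=n, replace=False))
    return np.sort(np.array(picked, dtype=int))

targets = np.array([0, 0, 0, 1, 1, 1, 1])
idx_int = random_sample_indices(targets, 2, rng=np.random.default_rng(0))
idx_list = random_sample_indices(targets, [1, 3], rng=np.random.default_rng(0))

print(targets[idx_int], targets[idx_list])
```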

remove_invariant_features(dataset)

Returns a new dataset with all invariant features removed.
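
On a plain samples array the idea can be sketched as follows. Treating a feature as invariant when its value range across samples is zero is an assumption of this example; the criterion used by remove_invariant_features itself may be formulated differently.

```python
import numpy as np

samples = np.array([[1.0, 5.0, 0.0],
                    [2.0, 5.0, 0.0],
                    [3.0, 5.0, 1.0]])

# Keep only columns whose values actually vary across samples.
varies = np.ptp(samples, axis=0) > 0
reduced = samples[:, varies]

print(varies)
print(reduced.shape)
```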

remove_nonfinite_features(dataset)

Returns a new dataset with all non-finite (NaN,Inf) features removed

Removes all features for which not all values are finite.

Parameters:

dataset : Dataset

Input dataset

Returns:

finite_dataset: Dataset

Dataset based on data from the input, but only the features for which all samples are finite are kept.
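
The selection rule can be expressed on a plain samples array with NumPy. This is a sketch of the rule as described above, not the function's implementation.

```python
import numpy as np

samples = np.array([[1.0, np.nan, 3.0],
                    [4.0, 5.0, np.inf],
                    [7.0, 8.0, 9.0]])

# A feature (column) is kept only if every one of its values is finite.
finite_mask = np.isfinite(samples).all(axis=0)
finite_samples = samples[:, finite_mask]

print(finite_mask)
print(finite_samples)
```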

save(dataset, destination, name=None, compression=None)

Save Dataset into HDF5 file

Parameters:

dataset : Dataset

destination : h5py.highlevel.File or str

name : str, optional

compression : None or int or {‘gzip’, ‘szip’, ‘lzf’}, optional

Level of compression for gzip, or another compression strategy.

shape
summary(dataset, stats=True, lstats='auto', sstats='auto', idhash=False, targets_attr='targets', chunks_attr='chunks', maxc=30, maxt=20)

String summary over the object

Parameters:

stats : bool

Include some basic statistics (mean, std, var) over dataset samples

lstats : ‘auto’ or bool

Include statistics on chunks/targets. If ‘auto’, includes only if both targets_attr and chunks_attr are present.

sstats : ‘auto’ or bool

Sequence (order) statistics. If ‘auto’, includes only if targets_attr is present.

idhash : bool

Include idhash value for dataset and samples

targets_attr : str, optional

Name of sample attributes of targets

chunks_attr : str, optional

Name of sample attributes of chunks – independent groups of samples

maxt : int

Maximal number of targets when providing details on targets/chunks

maxc : int

Maximal number of chunks when providing details on targets/chunks

summary_targets(dataset, targets_attr='targets', chunks_attr='chunks', maxc=30, maxt=20)

Provide summary statistics over the targets and chunks

Parameters:

dataset : Dataset

Dataset to operate on

targets_attr : str, optional

Name of sample attributes of targets

chunks_attr : str, optional

Name of sample attributes of chunks – independent groups of samples

maxc : int

Maximal number of chunks when providing details

maxt : int

Maximal number of targets when providing details

to_npz(filename, compress=True)

Save dataset to a .npz file storing all fa/sa/a which are ndarrays

Parameters:

filename : str

compress : bool, optional

If True, savez_compressed is used