multiml.StoreGate

class multiml.StoreGate(backend='numpy', backend_args=None, data_id=None)

Data management class for multiml execution.

StoreGate provides a common interface for managing data shared between multiml agents and tasks. Its main features are:

  • Different backends are supported (numpy, zarr, or a hybrid of the two),

  • Data are split into train, valid and test phases for ML,

  • Data are retrieved by the var_names, phase and index options.

Each dataset in the storegate is keyed by a unique data_id. All data in a dataset are identified by var_names (column names). Within a phase, every variable is assumed to have the same number of samples across multiml agents and tasks. The compile() method checks that the dataset is valid.
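
The equal-sample-count assumption can be illustrated without multiml. The following is a minimal sketch of the kind of check described above, not the library's compile() implementation; the function name check_phase_lengths and the nested-dict layout are hypothetical and chosen only for illustration.

```python
def check_phase_lengths(dataset):
    """Return {phase: n_samples}; raise ValueError if variables disagree.

    `dataset` is a hypothetical stand-in for one data_id:
    {phase: {var_name: list_of_samples}}.
    """
    sizes = {}
    for phase, variables in dataset.items():
        lengths = {name: len(values) for name, values in variables.items()}
        # All variables in one phase must hold the same number of samples.
        if len(set(lengths.values())) > 1:
            raise ValueError(
                f"phase '{phase}' has inconsistent sample counts: {lengths}"
            )
        sizes[phase] = next(iter(lengths.values()), 0)
    return sizes


dataset = {
    "train": {"var0": [0, 3], "var1": [1, 4]},
    "valid": {"var0": [6], "var1": [7]},
    "test": {"var0": [9], "var1": [10]},
}
sizes = check_phase_lengths(dataset)
print(sizes)
```

A dataset in which var0 and var1 had different lengths in the same phase would raise ValueError in this sketch.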

Examples

>>> from multiml.storegate import StoreGate
>>>
>>> # User defined parameters
>>> var_names = ['var0', 'var1', 'var2']
>>> data = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
>>> phase = (0.5, 0.25, 0.25) # fraction of train, valid, test
>>>
>>> # Add data to storegate
>>> storegate = StoreGate(backend='numpy', data_id='test_id')
>>> storegate.add_data(var_names=var_names, data=data, phase=phase)
>>>
>>> # Get data from storegate
>>> storegate.get_data(var_names=var_names, phase='train')
>>> storegate['train'][var_names][0]
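
To make the phase fractions in the example concrete, here is a minimal pure-Python sketch of a fraction-based split. It is an assumption about the general idea only: the helper split_by_fraction is hypothetical, and the rounding rule and the ordering of samples (no shuffling here) may differ from what add_data actually does.

```python
def split_by_fraction(rows, fractions):
    """Split rows into (train, valid, test) lists according to fractions.

    Hypothetical helper: counts are truncated with int(), and any
    remainder is assigned to the test phase.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    n_train = int(len(rows) * fractions[0])
    n_valid = int(len(rows) * fractions[1])
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test


data = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
train, valid, test = split_by_fraction(data, (0.5, 0.25, 0.25))
print(len(train), len(valid), len(test))
```

With four samples and fractions of (0.5, 0.25, 0.25), this sketch assigns two samples to train and one each to valid and test.
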
__init__(backend='numpy', backend_args=None, data_id=None)

Initialize the storegate and the backend architecture.

Initialize the storegate and the backend architecture with its options. The numpy backend manages data in memory, while the zarr backend reads and writes data to storage at a given path. The hybrid backend combines the numpy and zarr backends, which allows data to be moved between memory and storage.

Parameters:
  • backend (str) – numpy (in memory), zarr (on storage), or hybrid.

  • backend_args (dict) – backend options, e.g. path to zarr database. Please see ZarrDatabase and HybridDatabase classes for details.

  • data_id (str) – set default data_id if given.

Methods

__init__([backend, backend_args, data_id])

Initialize the storegate and the backend architecture.

add_data(var_names, data[, phase, shuffle, ...])

Add data to the storegate with given options.

argmax(var_names, axis[, phase])

Convert data to argmax (operation is limited by memory)

astype(var_names, dtype[, phase])

Convert data type to given dtype (operation is limited by memory)

clear_data()

Delete all data in the current data_id and backend

compile([reset, show_info])

Check if registered samples are valid.

create_empty(var_names, shape[, phase, dtype])

Create empty data in the current data_id and backend.

delete_data(var_names[, phase, do_compile])

Delete data associated with var_names.

get_data(var_names[, phase, index])

Retrieve data from storegate with given options.

get_data_ids()

Returns registered data_ids in the backend.

get_metadata()

Returns a dict of metadata.

get_var_names([phase])

Returns registered var_names for given phase.

get_var_shapes(var_names[, phase])

Returns shapes of variables for given phase.

onehot(var_names, num_classes[, phase])

Convert data to onehot vectors (operation is limited by memory)

set_data_id(data_id)

Set the default data_id and initialize the backend.

set_mode(mode)

Set backend mode of hybrid architecture.

show_info()

Show information currently registered in storegate.

shuffle([phase, seed])

Shuffle data in given phase.

to_memory(var_names[, phase, ...])

Move data from storage to memory.

to_storage(var_names[, phase, ...])

Move data from memory to storage.

update_data(var_names, data[, phase, index, ...])

Update data in storegate with given options.

Attributes

backend

Return the current backend of storegate.

data_id

Returns the current data_id.

__getitem__(item)

Retrieve data using Python getitem syntax.

Retrieve data using Python getitem syntax, i.e. storegate[phase][var_names][index]. data_id, phase, var_names and index must all be set before data are returned. Once they are, the selected data are returned; otherwise the StoreGate instance itself is returned with the parameters given so far.

Parameters:

item (str or list or int or slice) – If item is one of the strings 'train', 'valid' or 'test', the phase is set. If item is any other str or a list of strs, var_names is set. If item is an int or slice, the data at that index (or slice) are returned.

Returns:

The StoreGate instance while parameters are still missing, or the selected data once phase, var_names and index are all set.

Return type:

self or ndarray

Example

>>> # get all train data
>>> storegate['train']['var0'][:]
>>> # slice train data by index
>>> storegate['train']['var0'][0:2]
>>> # loop by index
>>> for data in storegate['train']['var0']:
...     print(data)
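
The chained selection described above can be sketched with a toy class. This is an illustration of the dispatch-on-item-type pattern under stated assumptions, not StoreGate's implementation: the class ChainedSelector and its nested-dict storage are hypothetical, and it returns a shallow copy at each step rather than self so that partial selections do not share state.

```python
import copy

PHASES = ("train", "valid", "test")


class ChainedSelector:
    """Toy stand-in for the storegate[phase][var_names][index] pattern."""

    def __init__(self, data):
        self._data = data          # {phase: {var_name: list_of_samples}}
        self._phase = None
        self._var_names = None

    def __getitem__(self, item):
        # A phase name sets the phase and returns a new selector.
        if isinstance(item, str) and item in PHASES:
            selector = copy.copy(self)
            selector._phase = item
            return selector
        # Any other str, or a list of strs, sets var_names.
        if isinstance(item, (str, list)):
            selector = copy.copy(self)
            selector._var_names = item
            return selector
        # An int or slice returns data once everything else is set.
        if isinstance(item, (int, slice)):
            if self._phase is None or self._var_names is None:
                raise ValueError("set phase and var_names before indexing")
            columns = self._data[self._phase]
            if isinstance(self._var_names, str):
                return columns[self._var_names][item]
            return [columns[name][item] for name in self._var_names]
        raise TypeError(f"unsupported item type: {type(item).__name__}")


store = ChainedSelector({
    "train": {"var0": [0, 3], "var1": [1, 4]},
    "test": {"var0": [9], "var1": [10]},
})
all_train = store["train"]["var0"][:]
first_row = store["train"][["var0", "var1"]][0]
looped = [value for value in store["train"]["var0"]]
print(all_train, first_row, looped)
```

The loop works in this sketch because Python falls back to calling `__getitem__` with 0, 1, 2, ... until IndexError is raised when a class defines no `__iter__`.
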
__setitem__(item, data)

Update data using Python setitem syntax.

Update data using Python setitem syntax, i.e. storegate[phase][var_names][index] = data. data_id, phase, var_names and index must all be set to update data.

Parameters:
  • item (int or slice) – Index of data to be updated.

  • data (list or ndarray) – new data.

Example

>>> # update all train data
>>> storegate['train']['var0'][:] = data
>>> # update train data by index
>>> storegate['train']['var0'][0:2] = data[0:2]
__delitem__(item)

Delete data using Python delitem syntax.

Delete data using Python delitem syntax, i.e. del storegate[phase][var_names]. data_id, phase and var_names must all be set to delete data.

Parameters:

item (str or list) – var_names to be deleted.

Example

>>> # delete var0 from train phase
>>> del storegate['train']['var0']
__len__()

Return the number of samples for the given phase and data_id.

Returns:

The number of samples in the selected phase and data_id.

Return type:

int

Examples

>>> len(storegate['train'])
>>> len(storegate['test'])
__contains__(item)

Check if the given var_name is available in the storegate.

Parameters:

item (str) – name of the variable.

Returns:

Whether item exists under the given conditions.

Return type:

bool

Examples

>>> 'var0' in storegate['train']
>>> 'var1' in storegate['test']