multiml.storegate module
StoreGate module.
- class multiml.storegate.StoreGate(backend='numpy', backend_args=None, data_id=None)
Bases: object
Data management class for multiml execution.
StoreGate provides common interfaces to manage data between multiml agents and tasks with features of:
Different backends are supported (numpy or zarr, and hybrid of them),
Data are split into train, valid and test phases for ML,
Data are retrieved by var_names, phase and index options.
Each dataset in the storegate is keyed by a unique data_id. All data in the dataset are identified by var_names (column names). The number of samples in a phase is assumed to be the same for all variables in multiml agents and tasks. The compile() method ensures the validity of the dataset.
Examples
>>> from multiml.storegate import StoreGate
>>>
>>> # User defined parameters
>>> var_names = ['var0', 'var1', 'var2']
>>> data = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
>>> phase = (0.5, 0.25, 0.25)  # fraction of train, valid, test
>>>
>>> # Add data to storegate
>>> storegate = StoreGate(backend='numpy', data_id='test_id')
>>> storegate.add_data(var_names=var_names, data=data, phase=phase)
>>>
>>> # Get data from storegate
>>> storegate.get_data(var_names=var_names, phase='train')
>>> storegate['train'][var_names][0]
- __init__(backend='numpy', backend_args=None, data_id=None)
Initialize the storegate and the backend architecture.
Initialize storegate and the backend architecture with its options.
The numpy backend manages data in memory, and the zarr backend reads and writes data to storage at a given path. The hybrid backend is a combination of the numpy and zarr backends, which allows data to be moved between memory and storage.
- Parameters:
backend (str) – numpy (on memory), zarr (on storage), hybrid.
backend_args (dict) – backend options, e.g. path to zarr database. Please see the ZarrDatabase and HybridDatabase classes for details.
data_id (str) – set default data_id if given.
- __getitem__(item)
Retrieve data by python getitem syntax.
Retrieve data by python getitem syntax, i.e. storegate[phase][var_names][index]. data_id, phase, var_names and index need to be given to return selected data. If all parameters are set, selected data are returned. Otherwise, the storegate instance itself, with the given parameters set, is returned.
- Parameters:
item (str or list or int or slice) – If item is the str train, valid or test, phase is set. If item is any other str or a list of strs, var_names is set. If item is an int or slice, data at that index (slice) are returned.
- Returns:
please see description above.
- Return type:
self or ndarray
Example
>>> # get all train data
>>> storegate['train']['var0'][:]
>>> # slice train data by index
>>> storegate['train']['var0'][0:2]
>>> # loop by index
>>> for data in storegate['train']['var0']:
...     print(data)
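The chained lookup described above can be pictured with a small stand-in class. This is a minimal sketch and not multiml's implementation: MiniGate and its attributes are hypothetical names, data live in plain Python lists rather than a backend, and each step returns a new view instead of the same instance to keep the sketch free of shared state.

```python
class MiniGate:
    """Hypothetical stand-in illustrating storegate[phase][var_names][index]."""

    PHASES = ('train', 'valid', 'test')

    def __init__(self, store, phase=None, var_names=None):
        self._store = store          # {phase: {var_name: list of samples}}
        self._phase = phase
        self._var_names = var_names

    def __getitem__(self, item):
        # A phase string selects the phase and returns a new view.
        if isinstance(item, str) and item in self.PHASES:
            return MiniGate(self._store, item, self._var_names)
        # Any other string (or a list of strings) selects the variables.
        if isinstance(item, (str, list)):
            return MiniGate(self._store, self._phase, item)
        # An int or slice is the final step: return the selected data.
        if isinstance(item, (int, slice)):
            if self._phase is None or self._var_names is None:
                raise KeyError('phase and var_names must be set before indexing')
            columns = self._store[self._phase]
            if isinstance(self._var_names, str):
                return columns[self._var_names][item]
            return [columns[name][item] for name in self._var_names]
        raise TypeError(f'unsupported item type: {type(item).__name__}')


store = {'train': {'var0': [0, 1, 2, 3], 'var1': [10, 11, 12, 13]}}
gate = MiniGate(store)
first_two = gate['train']['var0'][0:2]            # slice of one variable
both_at_1 = gate['train'][['var0', 'var1']][1]    # one index, two variables
```

Until an int or slice is supplied, every step only narrows the selection, which is why a partially indexed expression such as gate['train'] is still a gate-like object.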
- __setitem__(item, data)
Update data by python setitem syntax.
Update data by python setitem syntax, i.e. storegate[phase][var_names][index] = data. data_id, phase, var_names and index need to be given to update data.
- Parameters:
item (int or slice) – Index of data to be updated.
data (list or ndarray) – new data.
Example
>>> # update all train data
>>> storegate['train']['var0'][:] = data
>>> # update train data by index
>>> storegate['train']['var0'][0:2] = data[0:2]
- __delitem__(item)
Delete data by python delitem syntax.
Delete data by python delitem syntax, i.e. del storegate[phase][var_names]. data_id, phase and var_names need to be given to delete data.
- Parameters:
item (str or list) – var_names to be deleted.
Example
>>> # delete var0 from train phase
>>> del storegate['train']['var0']
- __len__()
Returns the number of samples for the given phase and data_id.
- Returns:
the number of samples in given conditions.
- Return type:
int
Examples
>>> len(storegate['train'])
>>> len(storegate['test'])
- __contains__(item)
Check if the given var_name is available in storegate.
- Parameters:
item (str) – name of variables.
- Returns:
Whether item exists under the given conditions.
- Return type:
bool
Examples
>>> 'var0' in storegate['train']
>>> 'var1' in storegate['test']
- property data_id
Returns the current data_id.
- Returns:
the current data_id.
- Return type:
str
- set_data_id(data_id)
Set the default data_id and initialize the backend.
If the default data_id is set, all methods defined in storegate, e.g. add_data(), use the default data_id to manage data.
- Parameters:
data_id (str) – the default data_id.
- property backend
Return the current backend of storegate.
- Returns:
numpy or zarr or hybrid.
- Return type:
str
- add_data(var_names, data, phase='train', shuffle=False, do_compile=False)
Add data to the storegate with given options.
If var_names already exists in the given data_id and phase, the data are appended; otherwise var_names are newly registered and the data are stored.
- Parameters:
var_names (str or list) – list of variable names, e.g. ['var0', 'var1', 'var2']. A single string, e.g. 'var0', is also allowed to add only one variable.
data (list or ndarray) – If var_names is a single string, data shape must be (N, k), where N is the number of samples and k is an arbitrary shape of each data. If var_names is a tuple, data shape must be (N, M, k), where M is the number of variables. If var_names is a list, data must be a list of [(N, k), (N, k), (N, k)…], where different shapes of k are allowed.
phase (str or tuple or list) – all (auto), train, valid, test or tuple. all divides the data into train, valid and test automatically, but only after compile. If a tuple (x, y, z) is given, the data are divided into train, valid and test. If the contents of the tuple are floats and their sum is 1.0, the data are split into phases with fractions of (x, y, z) respectively. If the contents of the tuple are ints, the data are split by the given indexes.
shuffle (bool or int) – data are shuffled if True or int. If an int is given, it is used as the random seed of np.random.
do_compile (bool) – do compile if True after adding data.
Examples
>>> import numpy as np
>>> # add data to train phase
>>> storegate.add_data(var_names='var0', data=np.array([0, 1, 2]), phase='train')
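The fractional form of phase, e.g. (0.5, 0.25, 0.25), can be illustrated with a short stand-alone function. This is a sketch under stated assumptions rather than the library's code: split_by_fractions is a hypothetical helper, and the rounding rule (train and valid sizes truncated, test takes the remainder) is an assumption made for the example.

```python
def split_by_fractions(samples, fractions):
    """Split samples into (train, valid, test) by fractions summing to 1.0.

    Assumption for this sketch: train and valid sizes are truncated to
    integers and test receives whatever remains.
    """
    if len(fractions) != 3:
        raise ValueError('fractions must be (train, valid, test)')
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError('fractions must sum to 1.0')
    n_total = len(samples)
    n_train = int(n_total * fractions[0])
    n_valid = int(n_total * fractions[1])
    train = samples[:n_train]
    valid = samples[n_train:n_train + n_valid]
    test = samples[n_train + n_valid:]
    return train, valid, test


data = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
train, valid, test = split_by_fractions(data, (0.5, 0.25, 0.25))
```

With four samples, the split yields two train samples, one valid sample and one test sample.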
- update_data(var_names, data, phase='train', index=-1, do_compile=True)
Update data in storegate with given options.
Update (replace) data in the storegate. If var_names does not exist in the given data_id and phase, data are newly added. Otherwise, selected data are replaced with the given data.
- Parameters:
var_names (str or list(str)) – see add_data() method.
data (list or ndarray) – see add_data() method.
phase (str or tuple) – see add_data() method.
index (int or tuple) – If index is -1 (default), all data are updated for the given options. If index is an int, only the data at index are updated. If index is (x, y), data in the range (x, y) are updated.
do_compile (bool) – do compile if True after updating data.
Examples
>>> # update data of train phase
>>> storegate.update_data(var_names='var0', data=[1], phase='train', index=1)
- get_data(var_names, phase='train', index=-1)
Retrieve data from storegate with given options.
Get data from the storegate. Python getitem syntax is also supported, please see the __getitem__ method.
- Parameters:
var_names (tuple or list or str) – If a tuple of variable names is given, e.g. ('var0', 'var1', 'var2'), data in ndarray format are returned. A single string, e.g. 'var0', is also allowed. Please see the matrix below for the shape of data. If a list of variable names is given, e.g. ['var0', 'var1', 'var2'], a list of ndarray data for each variable is returned.
phase (str or None) – all, train, valid, test or None. If phase is all or None, data in all phases are returned, but this is allowed only after compile.
index (int or tuple) – see update_data method.
- Returns:
selected data by given options.
- Return type:
ndarray or list
- Shape of returns:
>>> # index / var_names   | single var | tuple vars
>>> # ------------------------------------------------------------
>>> # single index (>=0)  | k          | (M, k)
>>> # otherwise           | (N, k)     | (N, M, k)
>>> # ------------------------------------------------------------
>>> # k = arbitrary shape of data
>>> # M = number of var_names
>>> # N = number of samples
Examples
>>> # get data by var_names, phase and index
>>> storegate.get_data(var_names='var0', phase='train', index=1)
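The shape matrix above can be reproduced with plain numpy arrays. This is only an illustration of the shapes, not of how get_data() is implemented: stacking variables along axis 1 is an assumption that happens to produce the documented (N, M, k) layout, and the names var0, var1 and stacked are chosen for the example.

```python
import numpy as np

N, M, k = 4, 2, 3                       # samples, variables, per-sample size
var0 = np.arange(N * k).reshape(N, k)   # one variable: shape (N, k)
var1 = var0 + 100                       # second variable with the same shape

# Tuple of variables: stack along a new axis 1 to get (N, M, k).
stacked = np.stack([var0, var1], axis=1)

shapes = {
    ('single var', 'all'): var0.shape,            # (N, k)
    ('tuple vars', 'all'): stacked.shape,         # (N, M, k)
    ('single var', 'index 1'): var0[1].shape,     # k
    ('tuple vars', 'index 1'): stacked[1].shape,  # (M, k)
}
```

A list of var_names, by contrast, would correspond to keeping var0 and var1 as separate arrays, which is what allows their per-sample shapes to differ.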
- delete_data(var_names, phase='train', do_compile=True)
Delete data associated with var_names.
All data associated with var_names are deleted. Partial deletion by index is not supported for now.
- Parameters:
var_names (str or list) – see add_data() method.
phase (str) – see update_data() method.
do_compile (bool) – do compile if True after deletion.
Examples
>>> # delete data associated with var_names
>>> storegate.delete_data(var_names='var0', phase='train')
- clear_data()
Delete all data in the current data_id and backend.
- create_empty(var_names, shape, phase='train', dtype='f4')
Create empty data in the current data_id and backend.
- Parameters:
var_names (str or list) – see add_data() method.
shape (tuple) – shape of empty data.
phase (str) – see update_data() method.
dtype (str) – dtype of empty data. Default float32.
- get_data_ids()
Returns registered data_ids in the backend.
- Returns:
list of registered data_id.
- Return type:
list
- get_var_names(phase='train')
Returns registered var_names for given phase.
- Parameters:
phase (str) – train or valid or test.
- Returns:
list of variable names.
- Return type:
list
- get_var_shapes(var_names, phase='train')
Returns shapes of variables for given phase.
- Parameters:
var_names (str or list) – variable names.
phase (str) – train or valid or test.
- Returns:
shape of a variable, or list of shapes.
- Return type:
ndarray.shape or list
- get_metadata()
Returns a dict of metadata.
The metadata is available only after compile.
- Returns:
dict of metadata. Please see below for contents.
- Return type:
dict
- Metadata contents:
>>> {
>>>     'compiled': 'compiled or not',
>>>     'total_events': 'total events, sum of each phase',
>>>     'sizes': {
>>>         'train': 'total events of train phase',
>>>         'valid': 'total events of valid phase',
>>>         'test': 'total events of test phase',
>>>         'all': 'total events',
>>>     },
>>>     'valid_phases': 'phases containing events'
>>> }
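As a rough picture of how such a dict relates to per-phase sample counts, the sketch below assembles one from a mapping of sizes. The keys follow the listing above, but the value types (a bool for compiled, a list for valid_phases) and the helper build_metadata are assumptions made for illustration, not the library's actual output.

```python
PHASES = ('train', 'valid', 'test')


def build_metadata(sizes, compiled=True):
    """Assemble a metadata-like dict from {phase: number of samples}."""
    per_phase = {phase: sizes.get(phase, 0) for phase in PHASES}
    total = sum(per_phase.values())
    return {
        'compiled': compiled,
        'total_events': total,
        'sizes': {**per_phase, 'all': total},
        # Only phases that actually contain samples are listed.
        'valid_phases': [phase for phase in PHASES if per_phase[phase] > 0],
    }


metadata = build_metadata({'train': 2, 'valid': 1})
```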
- astype(var_names, dtype, phase='train')
Convert data type to given dtype (operation is limited by memory)
- Parameters:
var_names (str or list) – see add_data() method.
dtype (numpy.dtype) – numpy dtype. Please see the numpy documentation.
phase (str) – all, train, valid, test.
- onehot(var_names, num_classes, phase='train')
Convert data to onehot vectors (operation is limited by memory)
- Parameters:
var_names (str or list) – see add_data() method.
num_classes (int) – the number of classes.
phase (str) – all, train, valid, test.
- argmax(var_names, axis, phase='train')
Convert data to argmax (operation is limited by memory)
- Parameters:
var_names (str or list) – see add_data() method.
axis (int) – specifies axis.
phase (str) – all, train, valid, test.
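The onehot and argmax conversions are inverse operations on class labels, which the following numpy sketch shows on a bare array. It does not call the storegate methods; the identity-matrix indexing trick and the f4 dtype are choices made for this example only.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # integer class labels, shape (N,)
num_classes = 3

# One-hot encode: pick rows of the identity matrix by label.
onehot = np.eye(num_classes, dtype='f4')[labels]   # shape (N, num_classes)

# argmax along the class axis recovers the original labels.
recovered = np.argmax(onehot, axis=1)
```

Both steps create a full new array, which is in line with the note above that these operations are limited by memory.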
- shuffle(phase='all', seed=0)
Shuffle data in given phase.
- Parameters:
phase (str) – all, train, valid, test.
seed (int) – seed of numpy.random
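When several variables are shuffled, they need to share one permutation so that sample i of every variable still refers to the same event. The sketch below shows that idea with the standard library random module instead of numpy.random (a deliberate substitution to keep it dependency-free); shuffle_aligned is a hypothetical helper, not part of multiml.

```python
import random


def shuffle_aligned(columns, seed=0):
    """Shuffle every variable with one shared, seeded permutation."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError('all variables must have the same number of samples')
    n_samples = lengths.pop() if lengths else 0
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)     # reproducible for a fixed seed
    return {name: [values[i] for i in order] for name, values in columns.items()}


columns = {'var0': [0, 1, 2, 3], 'label': ['a', 'b', 'c', 'd']}
shuffled = shuffle_aligned(columns, seed=0)
shuffled_again = shuffle_aligned(columns, seed=0)
```

Calling the function twice with the same seed gives the same order, and each label stays paired with its original var0 value.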
- set_mode(mode)
Set backend mode of hybrid architecture.
This method is valid only for the hybrid database. If mode is numpy, data will basically be written to memory, and if mode is zarr, data will be written to storage.
- Parameters:
mode (str) – numpy or zarr.
- to_memory(var_names, phase='train', output_var_names=None, callback=None)
Move data from storage to memory.
This method is valid only for the hybrid backend. It should help reduce the impact of data I/O.
- Parameters:
var_names (str or list) – see add_data() method.
phase (str) – all, train, valid, test.
output_var_names (str or list) – new var_names in numpy mode.
callback (obj) – callback function, which receives var_names and data and returns new var_names and data.
- to_storage(var_names, phase='train', output_var_names=None, callback=None)
Move data from memory to storage.
This method is valid only for the hybrid backend. It is useful when data are too large to keep in memory and need to be moved to storage.
- Parameters:
var_names (str or list) – see add_data() method.
phase (str) – all, train, valid, test.
output_var_names (str or list) – new var_names in zarr mode.
callback (obj) – callback function, which receives var_names and data and returns new var_names and data.
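The callback contract used by to_memory() and to_storage() (receive var_names and data, return new var_names and data) can be sketched with two dicts standing in for the memory and storage sides. Everything here is hypothetical: scale_callback, move and the dict stores are illustrative names, only a single string var_names is handled, and removing the data from the source is an assumption about what "move" means.

```python
def scale_callback(var_names, data):
    """Hypothetical callback: rename the variable and rescale its values."""
    new_var_names = f'{var_names}_scaled'
    new_data = [value / 255.0 for value in data]
    return new_var_names, new_data


def move(source, target, var_names, callback=None):
    """Move one variable between two dict-based stores (stand-ins for backends)."""
    data = source.pop(var_names)           # assumption: source copy is removed
    if callback is not None:
        var_names, data = callback(var_names, data)
    target[var_names] = data


storage = {'pixels': [0, 51, 255]}   # stand-in for the zarr side
memory = {}                          # stand-in for the numpy side
move(storage, memory, 'pixels', callback=scale_callback)
```

A callback of this kind lets a transformation happen once during the transfer instead of on every later read.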
- compile(reset=False, show_info=False)
Check if registered samples are valid.
It is assumed that compile is always called after the add_data() or update_data() methods to validate registered data.
- Parameters:
reset (bool) – the special variable active is (re)set if True. The active variable is used to indicate whether samples should be used or not, e.g. in the metric calculation.
show_info (bool) – show information after compile.
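The core validity condition stated for the storegate is that every variable in a phase has the same number of samples. A minimal sketch of such a check follows, assuming a nested dict layout of {phase: {var_name: samples}}; check_phase_sizes is a hypothetical helper and the real compile() may verify more than this.

```python
def check_phase_sizes(dataset):
    """Return {phase: n_samples}; raise if variables in a phase disagree."""
    sizes = {}
    for phase, columns in dataset.items():
        lengths = {name: len(values) for name, values in columns.items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(f'inconsistent sample counts in {phase}: {lengths}')
        sizes[phase] = next(iter(lengths.values()), 0)
    return sizes


dataset = {
    'train': {'var0': [0, 1, 2], 'var1': [3, 4, 5]},
    'valid': {'var0': [6], 'var1': [7]},
}
sizes = check_phase_sizes(dataset)

# A dataset where var1 is one sample short in the train phase.
broken = {'train': {'var0': [0, 1, 2], 'var1': [3, 4]}}
try:
    check_phase_sizes(broken)
    error_message = None
except ValueError as error:
    error_message = str(error)
```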
- show_info()
Show information currently registered in storegate.