Welcome to multiml’s documentation!

multiml is a prototype framework for developing multi-step machine learning workflows.

[Figure: _static/classes.png]

Quick start

This section walks through the core APIs by running a simple grid search optimization.

Installation

Requirements:
  • CentOS 7.6+
  • Python 3.8+

$ git clone https://github.com/UTokyo-ICEPP/multiml.git
$ cd multiml
$ pip install -e .

Preparing data (StoreGate)

import numpy as np
from multiml import StoreGate

storegate = StoreGate(data_id='dataset0')
phase = (0.8, 0.1, 0.1) # fraction of train, valid, test
storegate.add_data(var_names='data', data=np.arange(0, 10), phase=phase)
storegate.add_data(var_names='true', data=np.arange(0, 10), phase=phase)
storegate.compile()
storegate.show_info()

Out:

================================================================================
data_id : dataset0, compiled : True
--------------------------------------------------------------------------------
phase  backend  var_names       var_types       total_events    var_shape
================================================================================
train  numpy    data            int64           8               (8,)
train  numpy    true            int64           8               (8,)
--------------------------------------------------------------------------------
phase  backend  var_names       var_types       total_events    var_shape
================================================================================
valid  numpy    data            int64           1               (1,)
valid  numpy    true            int64           1               (1,)
--------------------------------------------------------------------------------
phase  backend  var_names       var_types       total_events    var_shape
================================================================================
test   numpy    data            int64           1               (1,)
test   numpy    true            int64           1               (1,)
================================================================================

Please see the StoreGate tutorial for more details.
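The 8/1/1 split shown in the output above can be reproduced with a minimal sketch in plain Python. It assumes a contiguous, unshuffled split in train, valid, test order, which is consistent with the output shown here but is not taken from the StoreGate implementation. `split_by_phase` is a hypothetical helper, not part of multiml.

```python
def split_by_phase(values, phase):
    """Split values into contiguous train/valid/test chunks.

    Hypothetical helper for illustration only; not a multiml API.
    """
    n_total = len(values)
    n_train = round(n_total * phase[0])
    n_valid = round(n_total * phase[1])
    # The test phase takes whatever remains after train and valid.
    return {
        'train': values[:n_train],
        'valid': values[n_train:n_train + n_valid],
        'test': values[n_train + n_valid:],
    }


chunks = split_by_phase(list(range(10)), (0.8, 0.1, 0.1))
for name, chunk in chunks.items():
    print(name, len(chunk), chunk)
```

Under this assumption the test phase holds only the last event, which matches the single test event reported by show_info().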

Implementing algorithms (Task)

from multiml import logger
from multiml.task import BaseTask

class MyTask(BaseTask):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._weight = None

    def execute(self):
        if self._weight is None:
            return # no hyperparameter

        data = self.storegate['test']['data'][:]
        pred = data * self._weight

        logger.info(f'pred value = {pred}')
        self.storegate['test']['pred'][:] = pred

task = MyTask(storegate=storegate)
task.set_hps(dict(weight=0.5)) # set hyperparameter
task.execute()

Out:

pred value = [4.5]

Please see the machine learning task examples for more details.
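The example above suggests that set_hps() stores each hyperparameter as an underscore-prefixed attribute (weight becomes self._weight). The following sketch mimics that behavior without multiml. `SketchTask` is a hypothetical stand-in, and the attribute-naming rule is an assumption inferred from the example, not a statement about the BaseTask internals.

```python
class SketchTask:
    """Hypothetical stand-in for a task; not the multiml BaseTask."""

    def __init__(self):
        self._weight = None

    def set_hps(self, params):
        # Assumption: each key is stored as an underscore-prefixed attribute.
        for key, value in params.items():
            setattr(self, f'_{key}', value)

    def execute(self, data):
        if self._weight is None:
            return None  # no hyperparameter set, nothing to do
        return [x * self._weight for x in data]


sketch = SketchTask()
skipped = sketch.execute([9])     # returns None because no weight is set yet
sketch.set_hps(dict(weight=0.5))  # set hyperparameter
pred = sketch.execute([9])
print(pred)
```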

Registering tasks (TaskScheduler)

from multiml import TaskScheduler

task0 = MyTask()
task1 = MyTask()
hps0 = dict(weight=[0.5, 1.0, 1.5])

steps = [[(task0, hps0)], [(task1, None)]]
task_scheduler = TaskScheduler(steps)
task_scheduler.show_info()

Out:

--------------------------------------------------------------------------------
task_id: step0, DAG: True (parents: [], children: ['step1']):
subtask_id: MyTask, hps: ['weight']
--------------------------------------------------------------------------------
task_id: step1, DAG: True (parents: ['step0'], children: []):
subtask_id: MyTask, hps: []
--------------------------------------------------------------------------------
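To see why these two steps lead to three jobs in a grid search, the sketch below expands the hyperparameter lists into every combination using only the standard library. `expand_grid` is a hypothetical helper for illustration; how TaskScheduler enumerates combinations internally is not covered here.

```python
from itertools import product


def expand_grid(hps):
    """Return one dict per combination of hyperparameter values.

    Hypothetical helper for illustration only; not a multiml API.
    """
    if not hps:
        return [{}]  # a task without hyperparameters has one empty setting
    names = list(hps)
    value_lists = [hps[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hyperparameters per step, mirroring hps0 and None in the example above.
step_hps = [dict(weight=[0.5, 1.0, 1.5]), None]
grids = [expand_grid(hps) for hps in step_hps]

# Each job picks one setting per step.
jobs = list(product(*grids))
for job_id, job in enumerate(jobs):
    print(job_id, job)
```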

Optimization (Agent)

from multiml.agent import GridSearchAgent

# minimize Mean Squared Error
agent = GridSearchAgent(storegate=storegate,
                        task_scheduler=task_scheduler,
                        metric='MSE')
agent.execute_finalize()

Out:

(1/3) events processed (metric=20.25)
(2/3) events processed (metric=0.0)
(3/3) events processed (metric=20.25)
------------------------------------ Result ------------------------------------
task_id step0 and subtask_id MyTask with:
  weight = 1.0
  job_id = 1
task_id step1 and subtask_id MyTask with:
  job_id = 1
Metric (mse) is 0.0

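As expected, weight = 1.0 gives the best performance: the true values are identical to the input data, so the prediction matches exactly and the MSE is 0.0.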
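The metric values in the output can be checked by hand with a short sketch in plain Python. It assumes the test phase contains the single event 9 for both data and true, as implied by the predicted value of 4.5 for weight = 0.5. It reimplements the search loop directly and does not use GridSearchAgent.

```python
def mse(pred, true):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


# Test-phase values implied by the StoreGate example (assumption).
data = [9]
true = [9]
weights = [0.5, 1.0, 1.5]

results = {}
for weight in weights:
    pred = [x * weight for x in data]
    results[weight] = mse(pred, true)

# Pick the weight with the smallest error.
best_weight = min(results, key=results.get)
print(results)
print(best_weight)
```

Weights 0.5 and 1.5 both miss the target by 4.5, which gives the same squared error of 20.25.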

API references
