
Practical MLOps — autoscaled and platform agnostic

MLOps using mlFlow and docker containers

Anupam Misra
7 min read · Nov 24, 2023


Introduction

Similar to my previous article Practical MLOps using Azure, this article implements MLOps with mlFlow and Docker. A brief working knowledge of mlFlow is expected; if you need a refresher, a quick and informative article on MLOps will get you up to speed.

In this article, we are going to build end-to-end MLOps using mlFlow and Docker. Training is done in Docker containers whose backend can be swapped very easily, for example from Amazon Fargate to Azure AKS. This makes the MLOps process more portable and faster to reproduce on your specific cloud.

I hope this article is useful for you.

This article will help you simulate this scenario:

There are multiple data engineers, data scientists and ML engineers working together, and there are different models for different brands of a product. Depending on business requirements, a multi-cloud setup is required, so the backends for running different workloads are different.

This is achieved via scalable containerised training and mlFlow.

Structure of article

  1. Versioning
  2. Training
  3. Dockerization for auto-scaling
  4. Experiment analysis
  5. Code run-down
  6. Execution options overview

This article does not include deployment; for deployment in Azure you can read Practical MLOps using Azure.

1. Versioning

1.1 Data versioning

In this article I have not touched upon data versioning, under the assumption that data engineers handle it in the respective cloud for the different models.

1.2 Environment versioning

Training environments are not versioned in this tutorial; the environment is maintained in the requirements.txt file. In my experience, training environments are not hugely altered, so versioning them is not critical. requirements.txt is automatically versioned as part of the repository.

1.3 Code versioning

Code versioning is done using GitHub. All code is available here.

For Azure implementation where I handle data and environment versioning automatically, you can read Practical MLOps using Azure.

2. Training

2.1 Data feed

Data is assumed to be available via two files X.csv and y.csv in /data. In your case you would be feeding data through the preferred cloud platform.

2.2 Hyperparameter set

The code spins up the required number of containers depending on the number of hyperparameter sets. If you provide ten hyperparameter sets in PARALLEL_RUNS.py, ten containers automatically start, train the model, transmit results back to the mlFlow server and terminate.

2.3 Best model registration

From all the available ML runs, filter on the number of look-back days (the number of days since the team started working on the current model's training) and the metric to optimise on. The system automatically registers the best model among all the filtered runs if the model name is new, or updates the model version otherwise.

2.4 Serving the best model

The best model can be tested locally before it is deployed to the cloud. There is provision to promote the model to higher environments if the entire solution is executed on the cloud.
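For example, assuming the best model has been registered as regression_NN_model and promoted to Staging (as in the registry dump later in this article), a local test server can be started with the mlFlow CLI; the port is arbitrary:

mlflow models serve -m "models:/regression_NN_model/Staging" -p 5001 --env-manager=local

Predictions can then be requested by POSTing JSON to the /invocations endpoint on that port.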

3. Dockerization

The training environment has been dockerized so that it scales with the number of hyperparameter sets given. An alternative approach would be to use a tool like Optuna, which prunes runs early if the loss reduction is not satisfactory; however, that would require provisioning a large VM, which is inflexible and unportable. With lightweight Docker containers, data scientists can iterate through different model architectures and other neural network components more quickly.

Here, four sets of hyperparameters were given, so four containers started. Once their training tasks were completed, they terminated automatically.

Training jobs
VS Code terminal showing test_loss

4. Experiment analysis

mlFlow offers a visual way to analyze different model runs across different metrics.

mlFlow UI
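If everything runs locally against the default file-based backend (the mlruns/ folder described below), the UI can be started from the project root:

mlflow ui
# by default the UI is served at http://127.0.0.1:5000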

5. Code

No concept was ever understood only by reading. Let’s break down the code. The code can be found here.

Repository structure:
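Reconstructed from the files discussed below (the actual repository may contain a few extra files), the layout looks roughly like this:

.
├── data/                 # X.csv and y.csv
├── mlruns/               # mlFlow run tracking and registered models
├── src/
│   ├── data_generator.py
│   ├── data_work.py
│   ├── neural_network.py
│   ├── train.py
│   ├── register.py
│   └── deploy.py
├── config.json
├── Dockerfile
├── MLProject
├── PARALLEL_RUNS.py
└── requirements.txt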

Brief information about the folders and files:

data/

Here the X.csv and y.csv are stored after generation.

mlruns/

Information about all mlFlow runs is stored here. The registered models are also stored here.

Here 1fbc9…. refers to the mlFlow run ID.

  • artifacts — Model artifact, with environment details in which it was trained. Model summary has information about model architecture.
  • metrics — Metrics generated and tracked.
  • params — Parameters of the model during the run.
  • tags — Metadata information about the run.
ML run information

mlruns/models/

regression_NN_model, version 1 in Staging

aliases: []
creation_timestamp: 1693397371746
current_stage: Staging
description: null
last_updated_timestamp: 1693398101422
name: regression_NN_model
run_id: 437b480f1f954c6e8fba4611a8e1d646
run_link: null
source: file:///Users/anupam/Documents/Codebase/Studies/MLFlow/mlruns/0/437b480f1f954c6e8fba4611a8e1d646/artifacts/model
status: READY
status_message: null
user_id: null
version: 1

src/

data_generator.py
Generates random data using sklearn.datasets.make_regression
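A minimal sketch of what such a generator could look like; the sample count, feature count and noise level are assumptions, only the use of make_regression and the two CSV files comes from the repository description:

# data_generator.py: illustrative sketch, not the author's exact script
import pandas as pd
from sklearn.datasets import make_regression

# generate a synthetic regression dataset (sizes assumed for illustration)
X, y = make_regression(n_samples=10_000, n_features=20, noise=0.1, random_state=42)

# persist as the two CSV files expected under data/
pd.DataFrame(X).to_csv("data/X.csv", index=False)
pd.DataFrame(y, columns=["target"]).to_csv("data/y.csv", index=False)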

data_work.py
Splits data into train, validation and test sets and prepares PyTorch data loaders.
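train.py below constructs a data_module(); a sketch of what that PyTorch Lightning DataModule might look like, with the split ratios and batch size as assumptions:

# data_work.py: illustrative sketch of the data module used in train.py
import pandas as pd
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset, random_split

class data_module(pl.LightningDataModule):
    def __init__(self, batch_size: int = 64):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        X = torch.tensor(pd.read_csv("data/X.csv").values, dtype=torch.float32)
        y = torch.tensor(pd.read_csv("data/y.csv").values, dtype=torch.float32)
        dataset = TensorDataset(X, y)
        # 70/15/15 split, an assumption for illustration
        n_train = int(0.7 * len(dataset))
        n_val = int(0.15 * len(dataset))
        self.train_set, self.val_set, self.test_set = random_split(
            dataset, [n_train, n_val, len(dataset) - n_train - n_val]
        )

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=self.batch_size)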

neural_network.py
Contains the neural network.
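train.py instantiates regresion_network(lr); a possible shape of that LightningModule, with the layer sizes as assumptions (the logged metric names match the ones used by the callbacks in train.py):

# neural_network.py: illustrative sketch; the layer sizes are assumptions
import torch
from torch import nn
import pytorch_lightning as pl

class regresion_network(pl.LightningModule):  # class name as referenced in train.py
    def __init__(self, lr: float):
        super().__init__()
        self.lr = lr
        self.net = nn.Sequential(
            nn.Linear(20, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )
        self.loss_fn = nn.MSELoss()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log("train_loss", loss)  # consumed by the log_losses callback
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", self.loss_fn(self(x), y))

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", self.loss_fn(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)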

train.py

This is the main file for training the neural network.

Logging metrics across epochs

class log_losses(Callback):

    def on_train_epoch_end(self, trainer, pl_module):
        mlflow.log_metric('train_loss_epochs', trainer.logged_metrics['train_loss'])

    def on_validation_epoch_end(self, trainer, pl_module):
        mlflow.log_metric('val_loss_epochs', trainer.logged_metrics['val_loss'])

Setting seed received from the config file

def set_seed(seed: int = RANDOM_STATE) -> None:
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ["PYTHONHASHSEED"] = str(seed)
    print(f"Random seed set as {seed}")

Train the models

def train(epochs, lr):
    early_stop = EarlyStopping(monitor="val_loss", mode="min", patience=5)
    checkpoint = ModelCheckpoint(monitor="val_loss")
    callbacks = [early_stop, checkpoint, log_losses()]
    set_seed()
    trainer = pl.Trainer(max_epochs=epochs, enable_progress_bar=True, callbacks=callbacks)
    data = data_module()
    model = regresion_network(lr)

    with mlflow.start_run() as run:
        # mlflow.log_input(mlflow.data.from_numpy(features=X_train, targets=y_train, source="sklearn/data/trainset.csv"), context='train')
        # mlflow.log_input(mlflow.data.from_numpy(features=X_test, targets=y_test, source="sklearn/data/testset.csv"), context='test')

        mlflow.pytorch.autolog()
        trainer.fit(model=model, datamodule=data)
        metrics = trainer.logged_metrics
        data_to_log = {
            "date": str(datetime.today().date()),
            "runID": [run.info.run_id],
            "train_loss": metrics["train_loss"].numpy(),
            "val_loss": metrics["val_loss"].numpy(),
        }
        trainer.test(model=model, datamodule=data)
        metrics = trainer.logged_metrics
        data_to_log.update({"test_loss": metrics["test_loss"].numpy()})
        print(data_to_log)
        mlflow.log_table(data=data_to_log, artifact_file="comparison_table.json")
  • Not logging inputs here (the mlflow.log_input calls are commented out)
  • Enabling automatic PyTorch logging with mlflow.pytorch.autolog()
  • Using PyTorch Lightning for training the models
  • The variable data_to_log holds the specific metrics and details that need to be logged as a table; later, when the best model is selected, this table is referred to.
  • data_to_log is logged to comparison_table.json

register.py
Filters runs by a cut-off start date for model training. Runs after this date are analysed on the selected metric, and the best model is registered.
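register.py itself is not reproduced here, but the same idea can be sketched with the public MLflow APIs; the experiment ID, metric direction and defaults are assumptions based on the MLProject parameters:

# register.py: illustrative sketch of selecting and registering the best run
from datetime import datetime, timedelta
import mlflow

def register_best(metric: str = "test_loss",
                  model_name: str = "regression_NN_model",
                  lookback_days: int = 7):
    # only consider runs started within the look-back window
    cutoff_ms = int((datetime.now() - timedelta(days=lookback_days)).timestamp() * 1000)
    runs = mlflow.search_runs(
        experiment_ids=["0"],                                # default experiment, an assumption
        filter_string=f"attributes.start_time >= {cutoff_ms}",
        order_by=[f"metrics.{metric} ASC"],                  # lower loss is better
        max_results=1,
    )
    best_run_id = runs.iloc[0]["run_id"]
    # registers a new model if the name is unseen, otherwise adds a new version
    mlflow.register_model(f"runs:/{best_run_id}/model", model_name)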

deploy.py
Updates model status — name, version and environment.
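A minimal sketch of what such a stage transition can look like with the model-registry client (the function name is mine, not necessarily the author's):

# deploy.py: illustrative sketch of promoting a registered model version
from mlflow.tracking import MlflowClient

def promote(model_name: str, model_version: str, model_stage: str = "Staging"):
    client = MlflowClient()
    # move the given version into the target stage, e.g. Staging or Production
    client.transition_model_version_stage(
        name=model_name,
        version=model_version,
        stage=model_stage,
    )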

Dockerfile

FROM python:3.10-slim
WORKDIR /app
COPY /data /app/data/
COPY /src /app/src/
COPY requirements.txt /app
COPY config.json /app
RUN pip install -r requirements.txt

The Dockerfile is very simple: it just copies the project contents into the image. Later, this container is executed by mlFlow; the execution is driven from PARALLEL_RUNS.py by varying the hyperparameters.
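Since the MLProject file below points at the image anupam/dummyproj, the image has to be built (or pulled) before launching any runs, for example:

docker build -t anupam/dummyproj .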

MLProject

name: Project

# python_env: python_env.yaml
# or
docker_env:
  image: anupam/dummyproj

entry_points:
  train:
    parameters:
      epochs: {type: int, default: 3}
      lr: {type: float, default: 0.05}
    command: "python src/train.py {epochs} {lr}"
  # register:
  #   parameters:
  #     metric: {type: str, default: "test_loss"}
  #     model_name: {type: str, default: "regression_NN_on"} # Date is added here in python
  #     lookback_duration: {type: str, default: "7"}
  #   command: "python src/register.py {metric} {model_name} {lookback_duration}"
  # deploy:
  #   parameters:
  #     model_name: {type: str, default: "regression_NN_model"}
  #     model_version: {type: str, default: "latest"}
  #     model_stage: {type: str, default: "Staging"}
  #     # endpoint_name: {type: str, default: }
  #   command: "python src/deploy.py {model_name} {model_version} {model_stage}"
  • First the environment is defined.
    Initially I used a Python environment and local execution. Later I shifted to executing the training workload in Docker. More on execution later.
  • Entry points
    These define the different mlFlow workflows. I have defined three workflows: model training, registration and deployment.
    train is the name of the entry point; it runs src/train.py, which accepts two CLI arguments defined under parameters. A sketch of how train.py could consume them follows this list.
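Because the train entry point runs python src/train.py {epochs} {lr}, train.py presumably reads the two values straight off the command line, roughly like this (a sketch; the author may parse them differently):

# possible tail of train.py: a sketch of consuming the two CLI arguments
import sys

if __name__ == "__main__":
    epochs = int(sys.argv[1])   # {epochs} from the MLProject command
    lr = float(sys.argv[2])     # {lr} from the MLProject command
    train(epochs, lr)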

6. Executing the training jobs

There are two ways to execute the training jobs.

  1. Running individual hyperparameter sets. This is useful for testing the end-to-end code flow and making sure the overall architecture works.
mlflow run . -e train -P epochs=2 -P lr=0.1 --env-manager="local"
  • . implies the MLProject file is in the same directory
  • -e is followed by the entrypoint in the MLProject file
  • -P is followed by the training hyperparameters

2. Parallel execution of all hyperparameter sets. This is useful when trying to find the best model.

PARALLEL_RUNS.py

mlflow.projects.run(
    uri=".",
    run_name="e_30_lr_0.003",
    entry_point="train",
    backend='local',
    synchronous=False,
    parameters={
        'epochs': 30,
        'lr': 0.003
    },
)


. . .
  • uri – the project directory containing the MLProject file (here the current directory).
  • run_name – the name by which the run appears.
  • entry_point – the entry point in MLProject.
  • backend – this defines where the containers run. This is where a cloud backend would be utilised.
  • synchronous – set to False so the runs execute asynchronously, in parallel.
  • parameters – the specific hyperparameters for the run.

Different hyperparameter combinations are programmatically generated and listed in the above format in PARALLEL_RUNS.py. This file is then executed as a normal Python file; it spawns multiple containers in parallel and completes the training jobs.
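A sketch of that generation step, assuming a simple grid search (the grids below are not from the repository):

# PARALLEL_RUNS.py: sketch of generating the hyperparameter combinations
from itertools import product
import mlflow

epoch_grid = [10, 30]       # assumed search space
lr_grid = [0.01, 0.003]

for epochs, lr in product(epoch_grid, lr_grid):
    mlflow.projects.run(
        uri=".",
        run_name=f"e_{epochs}_lr_{lr}",
        entry_point="train",
        backend="local",
        synchronous=False,  # fire-and-forget so the runs execute in parallel
        parameters={"epochs": epochs, "lr": lr},
    )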

By managing these containers automatically, hundreds of hyperparameter combinations can be trained in parallel, with everything automated and tracked by mlFlow.

I hope you now understand how to spawn as many containers as you need for parallel training.
