Deployment of Machine Learning Models


In one of the previous modules you learned how to use FastAPI to create an API for interacting with your machine learning models. FastAPI is a great framework, but it is a general-purpose web framework that was not developed with machine learning applications in mind. As a result, there are features you may find missing when running machine learning models at scale:

  • Dynamic batching: if you have a large number of requests coming in, you may want to process them together in batches to amortize the per-request overhead of running inference. This is especially true if you are running your model on a GPU, which is most efficient when it processes many samples in parallel.

  • Async inference: FastAPI does support async request handling, but there is no built-in way to call the model asynchronously. This means that if you have a large number of requests coming in, each request has to wait for the model to finish processing (because the model call is not async) before the next request can be processed.

  • Native GPU support: you can definitely run part of your application on a GPU with FastAPI if you want to, but again it was not built with machine learning in mind, so you will have to do some extra work to get it working.

It should come as no surprise that multiple frameworks have therefore sprung up that better support the deployment of machine learning models:

The first 6 frameworks are backend agnostic, meaning that they are intended to work with whatever computational backend your model is implemented in (Tensorflow vs. PyTorch vs. Jax), whereas the last two are specific to PyTorch and Tensorflow, respectively. In this module we are going to look at two of the frameworks, namely Torchserve, because we have developed PyTorch applications in this course, and Triton Inference Server, because it is a very popular framework for deploying models on Nvidia GPUs (but we can still use it on CPU).

But before we dive into these frameworks, we are going to look at a general way to package our machine learning models that should work with any of the above frameworks.

Model Packaging

Whenever we want to serve a machine learning model, we in general need three things:

  • The computational graph of the model, e.g. how to pass data through the model to get a prediction.
  • The weights of the model, e.g. the parameters that the model has learned during training.
  • A computational backend that can run the model.

In the previous module on Docker we learned how to package all of these things into a container. This is a great way to package a model, but it is not the only way. The core assumption we have made so far is that the computational backend used for serving is the same as the one we trained the model with. However, this does not need to be the case. As long as we can export our model and weights to a common format, we can run the model on any backend that supports this format.

This is exactly what the Open Neural Network Exchange (ONNX) is designed to do. ONNX is a standardized format for creating and sharing machine learning models. It defines an extensible computational graph model, as well as definitions of built-in operators and standard data types. The idea behind ONNX is that a model trained with a specific framework on a specific device, let's say PyTorch on your local computer, can easily be exported and run with an entirely different framework and hardware. Learning how to export your models to ONNX is therefore a great way to increase the longevity of your models and avoid being locked into a specific framework for serving them.


The ONNX format is designed to bridge the gap between development and deployment of machine learning models by making it easy to export models between different frameworks and hardware. For example, PyTorch is in general considered a developer-friendly framework, but it has historically been slower at inference than more deployment-focused frameworks. Image credit

❔ Exercises

  1. Start by installing ONNX, ONNX Runtime and ONNX Script. This can be done by running the following command

    pip install onnx onnxruntime onnxscript
    

    the first package contains the core ONNX framework, the second contains the runtime for running ONNX models, and the third is a new experimental package that is designed to make it easier to export models to ONNX.

  2. Let's start out by converting a model to ONNX. The following code snippets show three ways of exporting a PyTorch model to ONNX: using the newer TorchDynamo-based exporter, using the TorchScript-based torch.onnx.export function, and using PyTorch Lightning's built-in to_onnx method.

    import torch
    import torchvision
    
    model = torchvision.models.resnet18(weights=None)
    model.eval()
    
    dummy_input = torch.randn(1, 3, 224, 224)
    onnx_model = torch.onnx.dynamo_export(
        model,
        dummy_input,
        export_options=torch.onnx.ExportOptions(dynamic_shapes=True),
    )
    onnx_model.save("resnet18.onnx")
    
    import torch
    import torchvision
    
    model = torchvision.models.resnet18(weights=None)
    model.eval()
    
    dummy_input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(
        model=model,
        args=(dummy_input,),
        f="resnet18.onnx",
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
    )
    
    import torch
    import torchvision
    import pytorch_lightning as pl
    import onnx
    import onnxruntime
    
    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = torchvision.models.resnet18(weights="DEFAULT")
            self.model.eval()
    
        def forward(self, x):
            return self.model(x)
    
    model = LitModel()
    model.eval()
    
    dummy_input = torch.randn(1, 3, 224, 224)
    model.to_onnx(
        file_path="resnet18.onnx",
        input_sample=dummy_input,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
    )
    

    Export a model of your own choice to ONNX, or just export the resnet18 model as shown in the examples above, and confirm that the model was exported by checking that the file exists. Can you figure out what is meant by dynamic_axes?

    Solution

    The dynamic_axes argument is used to specify which axes of the input tensor should be considered dynamic. This is useful when the model can accept inputs of different sizes, e.g. when the model is used in a dynamic batching scenario. In the example above we have specified that the first axis of the input tensor should be considered dynamic, meaning that the model can accept inputs with different batch sizes. While it may be tempting to specify all axes as dynamic, this can lead to slower inference times, because the ONNX runtime will not be able to optimize the computational graph as well.
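
    As a quick sanity check of the dynamic batch axis, here is a minimal sketch (assuming the model was exported to resnet18.onnx with input_names=["input"] and the dynamic_axes shown above) that runs the same model with two different batch sizes:

    import numpy as np
    import onnxruntime as rt

    ort_session = rt.InferenceSession("resnet18.onnx")
    for batch_size in [1, 4]:
        # the first axis was marked as dynamic, so both batch sizes are accepted
        batch = {"input": np.random.randn(batch_size, 3, 224, 224).astype(np.float32)}
        (output,) = ort_session.run(None, batch)
        print(output.shape)  # (1, 1000) and (4, 1000)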

  3. Check that the model was correctly exported by loading it using the onnx package, and afterwards inspect the graph of the model using the following code:

    import onnx
    model = onnx.load("resnet18.onnx")
    onnx.checker.check_model(model)
    print(onnx.helper.printable_graph(model.graph))
    
  4. To get a better understanding of what is actually exported, let's try to visualize the computational graph of the model. This can be done using the open-source tool netron. You can either try it out directly in the browser or install it locally using pip install netron and then run it with netron resnet18.onnx. Can you figure out which method of the model is exported to ONNX?

    Solution

    When a PyTorch model is exported to ONNX, only the forward method of the model is exported. This means that it is the only method we have access to when we load the model later. Therefore, make sure that the forward method of your model is implemented in a way that it can be used for inference.
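
    As a minimal sketch of this (a hypothetical pattern, not something the exercises require), everything that should happen at inference time, including pre- and post-processing, can be placed inside forward before exporting:

    import torch
    import torchvision
    from torch import nn

    class InferenceModel(nn.Module):
        """Wrapper whose forward method contains everything needed at inference time."""

        def __init__(self):
            super().__init__()
            self.backbone = torchvision.models.resnet18(weights=None)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = (x - 0.5) / 0.5  # preprocessing becomes part of the exported graph
            return self.backbone(x).softmax(dim=-1)  # ...and so does the postprocessing

    model = InferenceModel().eval()
    torch.onnx.export(model, torch.randn(1, 3, 224, 224), "resnet18_wrapped.onnx")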

  5. After converting a model to the ONNX format we can use the ONNX Runtime to run it. The benefit of this is that ONNX Runtime is able to optimize the computational graph of the model, which can lead to faster inference times. Let's try to look into that.

    1. Figure out how to run a model using the ONNX Runtime. Relevant documentation.

      Solution

      To use the ONNX Runtime to run a model, we first need to start an inference session, then extract the input and output names of our model, and finally run the model. The following code snippet shows how to do this; after it there is a small sketch for checking that the ONNX outputs agree with PyTorch.

      import numpy as np
      import onnxruntime as rt

      ort_session = rt.InferenceSession("<path-to-model>")
      input_names = [i.name for i in ort_session.get_inputs()]
      output_names = [i.name for i in ort_session.get_outputs()]
      batch = {input_names[0]: np.random.randn(1, 3, 224, 224).astype(np.float32)}
      out = ort_session.run(output_names, batch)
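
      To also check that the exported model computes the same thing as the original, a small self-contained sketch (using a hypothetical resnet18_check.onnx file) could compare the two outputs:

      import numpy as np
      import onnxruntime as rt
      import torch
      import torchvision

      model = torchvision.models.resnet18(weights=None).eval()
      dummy_input = torch.randn(1, 3, 224, 224)
      torch.onnx.export(model, dummy_input, "resnet18_check.onnx")  # export this exact model

      ort_session = rt.InferenceSession("resnet18_check.onnx")
      with torch.no_grad():
          torch_out = model(dummy_input).numpy()
      onnx_out = ort_session.run(None, {ort_session.get_inputs()[0].name: dummy_input.numpy()})[0]
      np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)  # should not raise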
      
    2. Let's experiment with the performance of ONNX vs. PyTorch. Implement a benchmark that measures the time it takes to run a model using PyTorch and ONNX. Bonus points if you test for multiple input sizes. To get you started, we have implemented a timing decorator that you can use to measure the time it takes to run a function.

      from statistics import mean, stdev
      import time
      def timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):
          """ Decorator that times the execution of a function. """
          def wrapper(*args, **kwargs):
              timing_results = []
              for _ in range(timing_repeat):
                  start_time = time.time()
                  for _ in range(function_repeat):
                      result = func(*args, **kwargs)
                  end_time = time.time()
                  elapsed_time = end_time - start_time
                  timing_results.append(elapsed_time)
              print(f"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds")
              return result
          return wrapper
      
      Solution
      onnx_benchmark.py
      import sys
      import time
      from statistics import mean, stdev
      
      import onnxruntime as ort
      import torch
      import torchvision
      
      
      def timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):
          """Decorator that times the execution of a function."""
      
          def wrapper(*args, **kwargs):
              timing_results = []
              for _ in range(timing_repeat):
                  start_time = time.time()
                  for _ in range(function_repeat):
                      result = func(*args, **kwargs)
                  end_time = time.time()
                  elapsed_time = end_time - start_time
                  timing_results.append(elapsed_time)
              print(f"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds")
              return result
      
          return wrapper
      
      
      model = torchvision.models.resnet18()
      model.eval()
      
      dummy_input = torch.randn(1, 3, 224, 224)
      if sys.platform == "win32":
          # Windows doesn't support the new TorchDynamo-based ONNX Exporter
          torch.onnx.export(
              model,
              dummy_input,
              "resnet18.onnx",
              input_names=["input.1"],
              dynamic_axes={"input.1": {0: "batch_size", 2: "height", 3: "width"}},
          )
      else:
          torch.onnx.dynamo_export(model, dummy_input).save("resnet18.onnx")
      
      ort_session = ort.InferenceSession("resnet18.onnx")
      
      
      @timing_decorator
      def torch_predict(image):
          """Predict using PyTorch model."""
          model(image)
      
      
      @timing_decorator
      def onnx_predict(image):
          """Predict using ONNX model."""
          ort_session.run(None, {"input.1": image.numpy()})
      
      
      if __name__ == "__main__":
          for size in [224, 448, 896]:
              dummy_input = torch.randn(1, 3, size, size)
              print(f"Image size: {size}")
              torch_predict(dummy_input)
              onnx_predict(dummy_input)
      
    3. To get a better understanding of why running the model using the ONNX Runtime is usually faster, let's try to see what happens to the computational graph. By default the ONNX Runtime applies these optimizations in online mode, meaning that they are applied when the model is loaded. However, it is also possible to apply the optimizations in offline mode, such that the optimized model is saved to disk. Below is an example of how to do this.

      import onnxruntime as rt
      sess_options = rt.SessionOptions()
      
      # Set graph optimization level
      sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
      
      # To enable model serialization after graph optimization set this
      sess_options.optimized_model_filepath = "optimized_model.onnx"
      
      session = rt.InferenceSession("<model_path>", sess_options)
      

      Try to apply the optimizations in offline mode and use netron to visualize both the original and optimized model side by side. Can you see any differences?

      Solution

      You should hopefully see that the optimized model consists of fewer nodes and edges than the original model. These nodes are often called fused nodes, because they are the result of multiple nodes being fused together. In the image below we have visualized the first part of the computational graph of a resnet18 model, before and after optimization; after the image there is a small sketch for comparing the node counts programmatically.

      [Image: the first part of the resnet18 computational graph, before and after graph optimization]
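
      A minimal sketch of how to compare the two graphs programmatically (assuming the original model is resnet18.onnx and the optimized one was saved as optimized_model.onnx as above) is to simply count the nodes in each graph:

      import onnx

      original = onnx.load("resnet18.onnx")
      optimized = onnx.load("optimized_model.onnx")
      print(f"Original:  {len(original.graph.node)} nodes")
      print(f"Optimized: {len(optimized.graph.node)} nodes")  # expect fewer nodes after fusion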

  6. As mentioned in the introduction, ONNX is able to run on many different types of hardware and execution engines. You can list all providers, and the providers available on your machine, by running the following code

    import onnxruntime
    print(onnxruntime.get_all_providers())
    print(onnxruntime.get_available_providers())
    

    Can you figure out how to set which provider the ONNX runtime should use?

    Solution

    The provider that the ONNX Runtime should use can be set by passing the providers argument to the InferenceSession class. The argument takes a list of providers, which are prioritized in the order they are listed.

    import onnxruntime as rt
    provider_list = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    ort_session = rt.InferenceSession("<path-to-model>", providers=provider_list)
    

    In this case we prefer the CUDA execution provider over the CPU execution provider, if both are available.
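
    To verify which providers a session actually ended up with, you can afterwards ask the session itself (requested providers that are not available on your machine are silently skipped); a small sketch:

    import onnxruntime as rt

    ort_session = rt.InferenceSession(
        "<path-to-model>", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
    )
    print(ort_session.get_providers())  # the providers actually in use, in priority order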

  7. As you have probably realised in the exercises on Docker, the kind of containers we are working with can take a long time to build and can end up quite large. The reason is that PyTorch is a very large framework with a lot of dependencies, whereas the ONNX Runtime is much smaller. This makes sense, because PyTorch was primarily designed for developing and training models, while ONNX is designed for serving models. Let's try to quantify this.

    1. Construct a dockerfile that builds a docker image with PyTorch as a dependency. The dockerfile does not actually need to run anything. Repeat the same process for the ONNX Runtime. Bonus points for writing a dockerfile that takes a build argument specifying whether the image should be built with CUDA support or not.

      Solution

      The dockerfile for the Pytorch image could look something like this

      inference_pytorch.dockerfile
      FROM python:3.11-slim
      
      RUN apt update && \
          apt install --no-install-recommends -y build-essential gcc && \
          apt clean && rm -rf /var/lib/apt/lists/*
      
      ARG CUDA
      ENV CUDA=${CUDA}
      RUN echo "CUDA is set to: ${CUDA}"
      
      RUN echo "CUDA is set to: ${CUDA}" && \
          if [ -n "$CUDA" ]; then \
              pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121; \
          else \
              pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu; \
          fi
      

      and the dockerfile for the ONNX image could look something like this

      inference_onnx.dockerfile
      FROM python:3.11-slim
      
      RUN apt update && \
          apt install --no-install-recommends -y build-essential gcc && \
          apt clean && rm -rf /var/lib/apt/lists/*
      
      RUN echo "CUDA is set to: ${CUDA}" && \
          if [ -n "$CUDA" ]; then \
              pip install onnxruntime-gpu; \
          else \
              pip install onnxruntime; \
          fi
      
    2. Build both containers and measure the time it takes to build them. How much faster is it to build the ONNX container compared to the Pytorch container?

      Solution

      On Unix/Linux you can use the time command to measure how long it takes to build the containers. Building both images, with and without CUDA support, can be done with the following commands

      time docker build . -t pytorch_inference_cuda:latest -f inference_pytorch.dockerfile \
          --no-cache --build-arg CUDA=true
      time docker build . -t pytorch_inference:latest -f inference_pytorch.dockerfile \
          --no-cache --build-arg CUDA=
      time docker build . -t onnx_inference_cuda:latest -f inference_onnx.dockerfile \
          --no-cache --build-arg CUDA=true
      time docker build . -t onnx_inference:latest -f inference_onnx.dockerfile \
          --no-cache --build-arg CUDA=
      

      the --no-cache flag is used to make sure that the build process is not cached, ensuring a fair comparison. On my laptop this took respectively 5m1s, 1m4s, 0m43s and 0m50s, meaning that the ONNX container was respectively 7x (with CUDA) and 1.28x (no CUDA) faster to build than the PyTorch container.

    3. Find out the size of the two docker images. This can be done in the terminal by running the docker images command. How much smaller is the ONNX image compared to the PyTorch image?

      Solution

      As of writing, the docker image containing the PyTorch framework was 5.54GB (with CUDA) and 1.25GB (no CUDA). In comparison, the ONNX image was 647MB in both cases. This means that the ONNX image is respectively 8.5x (with CUDA) and 1.94x (no CUDA) smaller than the PyTorch image.

  8. (Optional) Assuming you have completed the module on FastAPI, try creating a small FastAPI application that serves a model using the ONNX Runtime.

    Solution

    Here is a simple example of how to create a FastAPI application that serves a model using the ONNX runtime.

    onnx_fastapi.py
    import numpy as np
    import onnxruntime
    from fastapi import FastAPI
    
    app = FastAPI()
    
    
    @app.get("/predict")
    def predict():
        """Predict using ONNX model."""
        # Load the ONNX model
        model = onnxruntime.InferenceSession("model.onnx")
    
        # Prepare the input data
        input_data = {"input": np.random.rand(1, 3).astype(np.float32)}
    
        # Run the model
        output = model.run(None, input_data)
    
        return {"output": output[0].tolist()}
    

This completes the exercises on the ONNX format. Do note that one limitation of the ONNX format is that it is based on Protobuf, which is a binary format. A protobuf file can have a maximum size of 2GB, which means that a single .onnx file is not enough for very large models. However, through the use of external data it is possible to circumvent this limitation, as sketched below.
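
As a rough sketch of how external data works (the file names here are hypothetical), the weights can be moved to a side-car file so that the protobuf itself stays below the 2GB limit:

    import onnx

    model = onnx.load("large_model.onnx")  # hypothetical model that is close to the size limit
    onnx.save_model(
        model,
        "large_model_external.onnx",
        save_as_external_data=True,    # store the tensors outside the protobuf
        all_tensors_to_one_file=True,  # put them all in a single side-car file
        location="large_model.data",   # created next to the .onnx file
        size_threshold=1024,           # only externalize tensors larger than 1KB
    )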

Torchserve

Text to come...

Triton Inference Server

Text to come...

🧠 Knowledge check

  1. How would you export a scikit-learn model to ONNX? Which method is exported when you export a scikit-learn model to ONNX?

    Solution

    It is possible to export a scikit-learn model to ONNX using the sklearn-onnx package. The following code snippet shows how to export a scikit-learn model to ONNX.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from skl2onnx import to_onnx

    X = np.random.randn(10, 4).astype(np.float32)
    y = np.random.randint(0, 2, 10)
    model = RandomForestClassifier(n_estimators=2)
    model.fit(X, y)  # the model must be fitted before it can be converted
    onx = to_onnx(model, X[:1])
    with open("model.onnx", "wb") as f:
        f.write(onx.SerializeToString())
    

    The method that is exported when you export a scikit-learn model to ONNX is the predict method. The sketch below shows how to inspect the outputs of the exported model with the ONNX Runtime.
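
    As a small sketch (assuming the model.onnx file produced above), you can load the exported model with the ONNX Runtime and inspect what the exported predict call returns:

    import numpy as np
    import onnxruntime as rt

    sess = rt.InferenceSession("model.onnx")
    print([o.name for o in sess.get_outputs()])  # names of the exported outputs
    dummy = {sess.get_inputs()[0].name: np.random.randn(1, 4).astype(np.float32)}
    print(sess.run(None, dummy)[0])  # the predicted class labels, i.e. what predict returns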

  2. In your own words, describe what the concept of a computational graph means.

    Solution

    A computational graph is a way to represent the mathematical operations that are performed in a model. It is essentially a graph where the nodes are the operations and the edges are the data that is passed between them. The computational graph normally represents the forward pass of the model, and it is the reason we can easily backpropagate through the model during training, because the graph contains all the information needed to calculate the gradients of the model.
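
    A tiny PyTorch sketch of this idea: every tensor operation adds a node to the graph, and calling backward traverses the graph in reverse to compute gradients.

    import torch

    x = torch.tensor([2.0], requires_grad=True)
    y = (x * 3 + 1) ** 2  # three nodes are recorded: multiply, add, power
    print(y.grad_fn)      # the last node of the recorded graph
    y.backward()          # traverse the graph backwards to compute gradients
    print(x.grad)         # d/dx (3x + 1)^2 = 6 * (3x + 1) = 42 at x = 2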

  3. In your own words, explain why fusing operations together in the computational graph often leads to better performance?

    Solution

    Each time we want to do a computation, the data needs to be loaded from memory into the CPU/GPU. This is a slow process, and the more operations we have, the more times we need to load the data. By fusing operations together, we can reduce the number of times we need to load data, because we can perform multiple operations on the same data before new data needs to be loaded.
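
    A minimal sketch of this effect, using PyTorch's built-in helper for fusing a convolution with a following batch-norm layer (one of the most common fusions an inference runtime performs):

    import torch
    from torch import nn
    from torch.nn.utils.fusion import fuse_conv_bn_eval

    conv = nn.Conv2d(3, 8, kernel_size=3, bias=False).eval()
    bn = nn.BatchNorm2d(8).eval()
    fused = fuse_conv_bn_eval(conv, bn)  # a single conv with the batch-norm folded into its weights

    x = torch.randn(1, 3, 32, 32)
    # two passes over the data (conv, then bn) collapse into one, with the same result
    torch.testing.assert_close(bn(conv(x)), fused(x), rtol=1e-4, atol=1e-5)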

This ends the module on tools specifically designed for serving machine learning models.