Deployment of Machine Learning Models


In one of the previous modules you learned how to use FastAPI to create an API for interacting with your machine learning models. FastAPI is a great framework, but it is a general-purpose web framework, meaning that it was not developed with machine learning applications in mind. This means that some features you may want when running large-scale machine learning models are missing:

  • Dynamic batching: if you have a large number of requests coming in, you may want to process them in batches to reduce the per-request overhead of running inference. This is especially true if you are running your model on a GPU, where batching significantly improves hardware utilization.

  • Async inference: FastAPI does support async request handling, but it offers no way to call the model asynchronously. This means that if you have a large number of requests coming in, you have to wait for the model to finish processing one request (because the model call is not async) before you can start processing the next one.

  • Native GPU support: you can definitely run parts of your application on a GPU with FastAPI if you want to, but again it was not built with machine learning in mind, so you will have to do some extra work to get it working.

It should come as no surprise that multiple frameworks have therefore sprung up that better support deployment of machine learning models (just listing a few here):

🌟 Framework | 🧩 Backend Agnostic | 🧠 Model Agnostic | 📂 Repository | ⭐ Github Stars
--- | --- | --- | --- | ---
Cortex | ✅ | ✅ | 🔗 Link | 8.0k
BentoML | ✅ | ✅ | 🔗 Link | 7.1k
Ray Serve | ✅ | ✅ | 🔗 Link | 33.9k
Triton Inference Server | ✅ | ✅ | 🔗 Link | 8.3k
OpenVINO | ✅ | ✅ | 🔗 Link | 7.3k
Seldon-core | ✅ | ✅ | 🔗 Link | 4.4k
Litserve | ✅ | ✅ | 🔗 Link | 2.5k
Torchserve | ❌ (PyTorch only) | ✅ | 🔗 Link | 4.2k
TensorFlow serve | ❌ (TensorFlow only) | ✅ | 🔗 Link | 6.2k
vLLM | ❌ | ❌ (LLMs only) | 🔗 Link | 29.9k

The first 7 frameworks are backend agnostic, meaning that they are intended to work with whatever computational backend your model is implemented in (TensorFlow, PyTorch, Jax, Sklearn etc.), whereas the last 3 are backend specific (PyTorch, TensorFlow and a custom framework, respectively). The first 9 frameworks are model agnostic, meaning that they are intended to work with whatever model you have implemented, whereas the last one is model specific, in this case to LLMs. When choosing a framework to deploy your model, you should consider the following:

  • Ease of use. Some frameworks are easier to use and get started with than others, but may have fewer features. As an example from the list above, Litserve is very easy to get started with but is a relatively new framework and may not have all the features you need.

  • Performance. Some frameworks are optimized for performance, but may be harder to use. As an example from the list above, vLLM is a very high performance framework for serving large language models but it cannot be used for other types of models.

  • Community. Some frameworks have a large community, which can be helpful if you run into problems. As an example from the list above, Triton Inference Server is developed by Nvidia and has a large community of users. As a good rule of thumb, the more stars a repository has on Github, the larger the community.

In this module we are going to be looking at the BentoML framework because it strikes a good balance between ease of use and having a lot of features that can improve the performance of serving your models. However, before we dive into this serving framework, we are going to look at a general way to package our machine learning models that should work with most of the above frameworks.

Model Packaging

Whenever we want to serve a machine learning model, we generally need three things:

  • The computational graph of the model, i.e. how data is passed through the model to get a prediction.
  • The weights of the model, i.e. the parameters that the model has learned during training.
  • A computational backend that can run the model.

In the previous module on Docker we learned how to package all of these things into a container. This is a great way to package a model, but it is not the only way. The core assumption we have made so far is that the computational backend used for serving is the same as the one we trained the model with. However, this does not need to be the case: as long as we can export our model and weights to a common format, we can run the model on any backend that supports this format.

This is exactly what the Open Neural Network Exchange (ONNX) is designed to do. ONNX is a standardized format for creating and sharing machine learning models. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. The idea behind ONNX is that a model trained with a specific framework on a specific device, let's say PyTorch on your local computer, can be exported and easily run with an entirely different framework and hardware. Learning how to export your models to ONNX is therefore a great way to increase the longevity of your models and avoid being locked into a specific framework for serving them.

Image

The ONNX format is designed to bridge the gap between development and deployment of machine learning models by making it easy to move models between different frameworks and hardware. For example, PyTorch is generally considered a developer-friendly framework, but it has historically been slower at running inference than more deployment-focused frameworks. Image credit

❔ Exercises

  1. Start by installing ONNX, ONNX runtime and ONNX script. This can be done by running the following command

    pip install onnx onnxruntime onnxscript
    

    the first package contains the core ONNX framework, the second contains the runtime for running ONNX models, and the third is a newer experimental package designed to make it easier to export models to ONNX.
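
    As a quick, optional sanity check, the following snippet simply prints the installed versions and the execution providers available on your machine:

    import onnx
    import onnxruntime

    print(onnx.__version__)  # core ONNX package
    print(onnxruntime.__version__)  # ONNX Runtime
    print(onnxruntime.get_available_providers())  # execution providers available on this machine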

  2. Let's start out by converting a model to ONNX. The following code snippets show three ways to export a PyTorch model to ONNX: with the newer TorchDynamo-based exporter, with the classic torch.onnx.export function, and with PyTorch Lightning's built-in to_onnx method.

    import torch
    import torchvision

    model = torchvision.models.resnet18(weights=None)
    model.eval()

    dummy_input = torch.randn(1, 3, 224, 224)
    onnx_model = torch.onnx.dynamo_export(
        model,
        dummy_input,
        export_options=torch.onnx.ExportOptions(dynamic_shapes=True),
    )
    onnx_model.save("resnet18.onnx")
    
    import torch
    import torchvision
    
    model = torchvision.models.resnet18(weights=None)
    model.eval()
    
    dummy_input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(
        model=model,
        args=(dummy_input,),
        f="resnet18.onnx",
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
    )
    
    import torch
    import torchvision
    import pytorch_lightning as pl
    import onnx
    import onnxruntime
    
    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = torchvision.models.resnet18(pretrained=True)
            self.model.eval()
    
        def forward(self, x):
            return self.model(x)
    
    model = LitModel()
    model.eval()
    
    dummy_input = torch.randn(1, 3, 224, 224)
    model.to_onnx(
        file_path="resnet18.onnx",
        input_sample=dummy_input,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
    )
    

    Export a model of your own choice to ONNX or just try to export the resnet18 model as shown in the examples above, and confirm that the model was exported by checking that the file exists. Can you figure out what is meant by dynamic_axes?

    Solution

    The dynamic_axes argument is used to specify which axes of the input tensor should be considered dynamic. This is useful when the model can accept inputs of different sizes, e.g. when the model is used in a dynamic batching scenario. In the example above we have specified that the first axis of the input tensor should be considered dynamic, meaning that the model can accept inputs of different batch sizes. While it may be tempting to specify all axes as dynamic, this can lead to slower inference times, because the ONNX runtime will not be able to optimize the computational graph as well.
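
    To see the effect in the exported file, a small sketch like the one below (assuming the model was exported as resnet18.onnx with an input named "input") inspects the input shape stored in the ONNX graph; the batch dimension should show up as a symbolic name such as batch_size instead of a fixed number:

    import onnx

    model = onnx.load("resnet18.onnx")
    for inp in model.graph.input:
        # each dimension is either a fixed dim_value or a symbolic dim_param
        dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
        print(inp.name, dims)  # e.g. input ['batch_size', 3, 224, 224]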

  3. Check that the model was correctly exported by loading it using the onnx package and afterwards checking the graph of the model using the following code:

    import onnx
    model = onnx.load("resnet18.onnx")
    onnx.checker.check_model(model)
    print(onnx.helper.printable_graph(model.graph))
    
  4. To get a better understanding of what is actually exported, let's try to visualize the computational graph of the model. This can be done using the open-source tool netron. You can either try it out directly in the browser or install it locally using pip install netron and then run it using netron resnet18.onnx. Can you figure out which method of the model is exported to ONNX?

    Solution

    When a PyTorch model is exported to ONNX, it is only the forward method of the model that is exported. This means that it is the only method we have access to when we load the model later. Therefore, make sure that the forward method of your model is implemented in a way that it can be used for inference.
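
    To make this concrete, here is a small sketch (the ResNetWithPreprocessing wrapper below is hypothetical, not part of the exercises) showing how moving preprocessing into forward makes it part of the exported ONNX graph, so clients no longer have to normalize the input themselves:

    import torch
    import torchvision


    class ResNetWithPreprocessing(torch.nn.Module):
        """Wrapper that bakes input normalization into forward so it ends up in the ONNX graph."""

        def __init__(self):
            super().__init__()
            self.model = torchvision.models.resnet18(weights=None)
            self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
            self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

        def forward(self, x):
            x = (x - self.mean) / self.std  # preprocessing is traced and exported together with the model
            return self.model(x)


    model = ResNetWithPreprocessing().eval()
    torch.onnx.export(model, torch.randn(1, 3, 224, 224), "resnet18_with_preprocessing.onnx")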

  5. After converting a model to the ONNX format we can use the ONNX Runtime to run it. The benefit of this is that ONNX Runtime is able to optimize the computational graph of the model, which can lead to faster inference times. Let's try to look into that.

    1. Figure out how to run a model using the ONNX Runtime. Relevant documentation.

      Solution

      To use the ONNX Runtime to run a model, we first need to start an inference session, then extract the input and output names of our model and finally run the model. The following code snippet shows how to do this.

      import numpy as np
      import onnxruntime as rt

      ort_session = rt.InferenceSession("<path-to-model>")
      input_names = [i.name for i in ort_session.get_inputs()]
      output_names = [i.name for i in ort_session.get_outputs()]
      batch = {input_names[0]: np.random.randn(1, 3, 224, 224).astype(np.float32)}
      out = ort_session.run(output_names, batch)
      
    2. Let's experiment with performance of ONNX vs. PyTorch. Implement a benchmark that measures the time it takes to run a model using PyTorch and ONNX. Bonus points if you test for multiple input sizes. To get you started we have implemented a timing decorator that you can use to measure the time it takes to run a function.

      from statistics import mean, stdev
      import time
      def timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):
          """ Decorator that times the execution of a function. """
          def wrapper(*args, **kwargs):
              timing_results = []
              for _ in range(timing_repeat):
                  start_time = time.time()
                  for _ in range(function_repeat):
                      result = func(*args, **kwargs)
                  end_time = time.time()
                  elapsed_time = end_time - start_time
                  timing_results.append(elapsed_time)
              print(f"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds")
              return result
          return wrapper
      
      Solution
      onnx_benchmark.py
      import sys
      import time
      from statistics import mean, stdev
      
      import onnxruntime as ort
      import torch
      import torchvision
      
      
      def timing_decorator(func, function_repeat: int = 10, timing_repeat: int = 5):
          """Decorator that times the execution of a function."""
      
          def wrapper(*args, **kwargs):
              timing_results = []
              for _ in range(timing_repeat):
                  start_time = time.time()
                  for _ in range(function_repeat):
                      result = func(*args, **kwargs)
                  end_time = time.time()
                  elapsed_time = end_time - start_time
                  timing_results.append(elapsed_time)
              print(f"Avg +- Stddev: {mean(timing_results):0.3f} +- {stdev(timing_results):0.3f} seconds")
              return result
      
          return wrapper
      
      
      model = torchvision.models.resnet18()
      model.eval()
      
      dummy_input = torch.randn(1, 3, 224, 224)
      if sys.platform == "win32":
          # Windows doesn't support the new TorchDynamo-based ONNX Exporter
          torch.onnx.export(
              model,
              dummy_input,
              "resnet18.onnx",
              input_names=["input.1"],
              dynamic_axes={"input.1": {0: "batch_size", 2: "height", 3: "width"}},
          )
      else:
          torch.onnx.dynamo_export(model, dummy_input).save("resnet18.onnx")
      
      ort_session = ort.InferenceSession("resnet18.onnx")
      
      
      @timing_decorator
      def torch_predict(image) -> None:
          """Predict using PyTorch model."""
          model(image)
      
      
      @timing_decorator
      def onnx_predict(image) -> None:
          """Predict using ONNX model."""
          ort_session.run(None, {"input.1": image.numpy()})
      
      
      if __name__ == "__main__":
          for size in [224, 448, 896]:
              dummy_input = torch.randn(1, 3, size, size)
              print(f"Image size: {size}")
              torch_predict(dummy_input)
              onnx_predict(dummy_input)
      
    3. To get a better understanding of why running the model with the ONNX Runtime is usually faster, let's try to see what happens to the computational graph. By default the ONNX Runtime applies its graph optimizations in online mode, meaning that they are applied when the model is loaded. However, it is also possible to apply the optimizations in offline mode, such that the optimized model is saved to disk. Below is an example of how to do this.

      import onnxruntime as rt
      sess_options = rt.SessionOptions()
      
      # Set graph optimization level
      sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
      
      # To enable model serialization after graph optimization set this
      sess_options.optimized_model_filepath = "optimized_model.onnx"
      
      session = rt.InferenceSession("<model_path>", sess_options)
      

      Try to apply the optimizations in offline mode and use netron to visualize both the original and optimized model side by side. Can you see any differences?

      Solution

      You should hopefully see that the optimized model consists of fewer nodes and edges than the original model. These nodes are often called fused nodes, because they are the result of multiple nodes being fused together. In the image below we have visualized the first part of the computational graph of a resnet18 model, before and after optimization.

      Image
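
      To quantify the difference without opening netron, a minimal sketch like the following (assuming the optimized model was saved as optimized_model.onnx) simply counts the nodes in the two graphs:

      import onnx

      original = onnx.load("resnet18.onnx")
      optimized = onnx.load("optimized_model.onnx")
      print("original :", len(original.graph.node), "nodes")
      print("optimized:", len(optimized.graph.node), "nodes")  # expect fewer nodes after fusion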

  6. As mentioned in the introduction, ONNX is able to run on many different types of hardware and execution engines. You can list all providers that ONNX Runtime knows about, and the ones actually available on your machine, by running the following code

    import onnxruntime
    print(onnxruntime.get_all_providers())
    print(onnxruntime.get_available_providers())
    

    Can you figure out how to set which provider the ONNX runtime should use?

    Solution

    The provider that the ONNX runtime should use can be set by passing the providers argument to the InferenceSession class. A list should be provided, which prioritizes the providers in the order they are listed.

    import onnxruntime as rt
    provider_list = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    ort_session = rt.InferenceSession("<path-to-model>", providers=provider_list)
    

    In this case we will prefer CUDA Execution Provider over CPU Execution Provider if both are available.
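
    After creating the session you can confirm which providers were actually enabled; requesting a provider that is not installed typically just results in a warning and a fallback to the next one in the list:

    import onnxruntime as rt

    ort_session = rt.InferenceSession(
        "<path-to-model>", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
    )
    print(ort_session.get_providers())  # e.g. ['CPUExecutionProvider'] if no GPU is available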

  7. As you have probably realised in the exercises on docker, it can take a long time to build the kind of containers we are working with, and they can be quite large. There is a reason for this: PyTorch is a very large framework with a lot of dependencies, whereas the ONNX runtime is a much smaller framework. This makes sense, because PyTorch was primarily designed for developing and training models, while ONNX is designed for serving models. Let's try to quantify this.

    1. Construct a dockerfile that builds a docker image with PyTorch as a dependency. The dockerfile does not actually need to run anything. Repeat the same process for the ONNX runtime. Bonus points for writing a dockerfile that takes a build arg at build time specifying whether the image should be built with CUDA support or not.

      Solution

      The dockerfile for the PyTorch image could look something like this

      inference_pytorch.dockerfile
      FROM python:3.11-slim
      
      RUN apt update && \
          apt install --no-install-recommends -y build-essential gcc && \
          apt clean && rm -rf /var/lib/apt/lists/*
      
      ARG CUDA
      ENV CUDA=${CUDA}
      
      RUN echo "CUDA is set to: ${CUDA}" && \
          if [ -n "$CUDA" ]; then \
              pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121; \
          else \
              pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu; \
          fi
      

      and the dockerfile for the ONNX image could look something like this

      inference_onnx.dockerfile
      FROM python:3.11-slim
      
      RUN apt update && \
          apt install --no-install-recommends -y build-essential gcc && \
          apt clean && rm -rf /var/lib/apt/lists/*
      
      RUN echo "CUDA is set to: ${CUDA}" && \
          if [ -n "$CUDA" ]; then \
              pip install onnxruntime-gpu; \
          else \
              pip install onnxruntime; \
          fi
      
    2. Build both containers and measure the time it takes to build them. How much faster is it to build the ONNX container compared to the PyTorch container?

      Solution

      On unix/linux you can use the time command to measure the time it takes to build the containers. Building both images, with and without CUDA support, can be done with the following commands

      time docker build . -t pytorch_inference_cuda:latest -f inference_pytorch.dockerfile \
          --no-cache --build-arg CUDA=true
      time docker build . -t pytorch_inference:latest -f inference_pytorch.dockerfile \
          --no-cache --build-arg CUDA=
      time docker build . -t onnx_inference_cuda:latest -f inference_onnx.dockerfile \
          --no-cache --build-arg CUDA=true
      time docker build . -t onnx_inference:latest -f inference_onnx.dockerfile \
          --no-cache --build-arg CUDA=
      

      the --no-cache flag is used to ensure that the build process is not cached, ensuring a fair comparison. On my laptop this respectively took 5m1s, 1m4s, 0m4s, 0m50s, meaning that the ONNX container was respectively 7x (with CUDA) and 1.28x (no CUDA) faster to build than the PyTorch container.

    3. Find out the size of the two docker images. It can be done in the terminal by running the docker images command. How much smaller is the ONNX model compared to the PyTorch model?

      Solution

      As of writing the docker image containing the PyTorch framework was 5.54GB (with CUDA) and 1.25GB (no CUDA). In comparison the ONNX image was 647MB (with CUDA) and 647MB (no CUDA). This means that the ONNX image is respectively 8.5x (with CUDA) and 1.94x (no CUDA) smaller than the PyTorch image.

  8. (Optional) Assuming you have completed the module on FastAPI try creating a small FastAPI application that serves a model using the ONNX runtime.

    Solution

    Here is a simple example of how to create a FastAPI application that serves a model using the ONNX runtime.

    onnx_fastapi.py
    import numpy as np
    import onnxruntime
    from fastapi import FastAPI
    
    app = FastAPI()
    
    
    @app.get("/predict")
    def predict():
        """Predict using ONNX model."""
        # Load the ONNX model
        model = onnxruntime.InferenceSession("model.onnx")
    
        # Prepare the input data
        input_data = {"input": np.random.rand(1, 3).astype(np.float32)}
    
        # Run the model
        output = model.run(None, input_data)
    
        return {"output": output[0].tolist()}
    

This completes the exercises on the ONNX format. Do note that one limitation of the ONNX format is that it is based on Protobuf, which is a binary format with a maximum file size of 2GB. This means that a single .onnx file is not enough for very large models. However, through the use of external data it is possible to circumvent this limitation, as sketched below.
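
A minimal sketch of how this could look is shown below (the file names are hypothetical); the weights are then stored in a separate file next to the .onnx file:

import onnx

model = onnx.load("large_model.onnx")  # hypothetical model close to the 2GB protobuf limit
onnx.save_model(
    model,
    "large_model_external.onnx",
    save_as_external_data=True,  # store the weights outside the protobuf file
    all_tensors_to_one_file=True,  # put all weights in a single side file
    location="large_model_external.onnx.data",
)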

BentoML

BentoML cloud vs BentoML OSS

We are only going to be looking at the open-source version of BentoML in this module. However, BentoML also has a cloud version that makes it very easy to deploy models that are coded in BentoML to the cloud. If you are interested in this, you can check out the BentoML cloud documentation. This business strategy of having an open-source product and a cloud product is very common in the machine learning space (HuggingFace, LightningAI, Weights and Biases etc.), because it allows companies to make money from the cloud product while still providing a free product to the community.

BentoML is a framework that is designed to make it easy to serve machine learning models. It is designed to be backend agnostic, meaning that it can be used with any computational backend. It is also model agnostic, meaning that it can be used with any machine learning model.

Let's consider a simple example of how to serve a model using BentoML. The following code snippet shows how to serve a model that uses the transformers library to summarize text.

import bentoml
from transformers import pipeline

EXAMPLE_INPUT = (
    "Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as "
    "local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-"
    "defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking "
    "20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated "
    "by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to "
    "celebrate what is being hailed as 'The Leap of the Century.'"
)

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 10})
class Summarization:
    def __init__(self) -> None:
        self.pipeline = pipeline('summarization')

    @bentoml.api
    def summarize(self, text: str = EXAMPLE_INPUT) -> str:
        result = self.pipeline(text)
        return result[0]['summary_text']

In BentoML we organize our services in classes, where each class is a service that we want to serve. The two important parts of the code snippet are the @bentoml.service and @bentoml.api decorators.

  • The @bentoml.service decorator is used to specify the resources that the service should use and in general how the service should be run. In this case we are specifying that the service should use 2 CPU cores and that the timeout for the service should be 10 seconds.

  • The @bentoml.api decorator is used to specify the API that the service should expose. In this case we are specifying that the service should have an API called summarize that takes a string as input and returns a string as output.

To serve the model using BentoML we can execute the following command, which is very similar to the command we used to serve the model using FastAPI.

bentoml serve service:Summarization
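
Once the service is running, it can be called with the built-in HTTP client (here assumed to be running locally on BentoML's default port 3000):

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    summary = client.summarize(text="Some long article text that we would like to have summarized ...")
    print(summary)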

❔ Exercises

In general, we advise looking through the BentoML docs if you need help with any of the exercises. We are going to assume that you have done the exercises on ONNX, and we are therefore going to be using BentoML to serve ONNX models. If you have not done that part, you can still follow along, but you will need to use a PyTorch model instead of an ONNX model.

  1. Install BentoML

    pip install bentoml
    

    Remember to add the dependency to your requirements.txt file.

  2. You are in principle free to serve any model you like, but we recommend just using a torchvision model as in the ONNX exercises. Write your first BentoML service that serves a model of your choice. We recommend experimenting with providing input/output as tensors, because BentoML supports this natively. Secondly, write a client that can send a request to the service and print the result. Here we recommend using the built-in bentoml.SyncHTTPClient.

    Solution

    The following implements a simple BentoML service that serves an ONNX resnet18 model. The service expects both the input and the output to be numpy arrays.

    bentoml_service.py
    from __future__ import annotations
    
    import bentoml
    import numpy as np
    from onnxruntime import InferenceSession
    
    
    @bentoml.service
    class ImageClassifierService:
        """Image classifier service using ONNX model."""
    
        def __init__(self) -> None:
            self.model = InferenceSession("model.onnx")
    
        @bentoml.api
        def predict(self, image: np.ndarray) -> np.ndarray:
            """Predict the class of the input image."""
            output = self.model.run(None, {"input": image.astype(np.float32)})
            return output[0]
    

    The service can be served using the following command

    bentoml serve bentoml_service:ImageClassifierService
    

    To test that the service works the following client can be used

    bentoml_client.py
    import bentoml
    import numpy as np
    from PIL import Image
    
    if __name__ == "__main__":
        image = Image.open("my_cat.jpg")
        image = image.resize((224, 224))  # Resize to match the minimum input size of the model
        image = np.array(image)
        image = np.transpose(image, (2, 0, 1))  # Change to CHW format
        image = np.expand_dims(image, axis=0)  # Add batch dimension
    
        with bentoml.SyncHTTPClient("http://localhost:4040") as client:
            resp = client.predict(image=image)
            print(resp)
    
  3. We are now going to look at features where BentoML really sets itself apart from FastAPI. The first is adaptive batching. As you are hopefully aware, modern machine learning models can process multiple samples at the same time, and doing so increases the throughput of the model. When we train a model we often set a fixed batch size, but we cannot do that when serving the model, because it would mean waiting for the batch to be full before we can process it. Adaptive batching simply refers to the process where we specify a maximum batch size and also a timeout. When either the batch is full or the timeout is reached, however many samples we have collected are sent to the model for processing. This can be a very powerful feature because it allows us to process samples as soon as they arrive, while still taking advantage of the increased throughput of batching.

    Image
    The overall architecture of the adaptive batching feature in BentoML. The feature is implemented on the server side and mainly consists of a dispatcher that is in charge of collecting requests and sending them to the model server when either the batch is full or a timeout is reached. Image credit

    1. Look through the documentation on adaptive batching and add adaptive batching to your service from the previous exercise. Make sure your service works as expected by testing it with the client from the previous exercise.

      Solution
      bentoml_service_adaptive_batching.py
      from __future__ import annotations
      
      import bentoml
      import numpy as np
      from onnxruntime import InferenceSession
      
      
      @bentoml.service
      class ImageClassifierService:
          """Image classifier service using ONNX model."""
      
          def __init__(self) -> None:
              self.model = InferenceSession("model.onnx")
      
          @bentoml.api(
              batchable=True,
              batch_dim=(0, 0),
              max_batch_size=128,
              max_latency_ms=1000,
          )
          def predict(self, image: np.ndarray) -> np.ndarray:
              """Predict the class of the input image."""
              output = self.model.run(None, {"input": image.astype(np.float32)})
              return output[0]
      
    2. Try to measure the throughput of your model with and without adaptive batching. Assuming that you have completed the module on testing APIs and therefore are familiar with the locust framework, we recommend that you write a simple locustfile and use the locust command to measure the throughput of your model.

      Solution

      The following locust file can be used to measure the throughput of the model with and without adaptive batching.

      locustfile.py
      import numpy as np
      from locust import HttpUser, between, task
      from PIL import Image
      
      
      def prepare_image():
          """Load and preprocess the image as required."""
          image = Image.open("my_cat.jpg")
          image = image.resize((224, 224))
          image = np.array(image)
          image = np.transpose(image, (2, 0, 1))  # Convert to CHW format
          image = np.expand_dims(image, axis=0)  # Add batch dimension
          # Convert to list format for JSON serialization
          return image.tolist()
      
      
      image = prepare_image()
      
      
      class BentoMLUser(HttpUser):
          """Locust user class for sending prediction requests to the server."""
      
          wait_time = between(1, 2)
      
          @task
          def send_prediction_request(self):
              """Send a prediction request to the server."""
              payload = {"image": image}  # Package the image as JSON
              self.client.post("/predict", json=payload, headers={"Content-Type": "application/json"})
      

      and then the following command can be used to measure the throughput of the model

      locust -f locustfile.py --host http://localhost:4040 --headless -u 50 -t 60s
      

      You should hopefully see that the throughput of the model is higher when adaptive batching is enabled, but the speedup is largely dependent on the model you are running, the configuration of the adaptive batching and the hardware you are running on.

      On my laptop I saw about a 1.5 - 2x speedup when adaptive batching was enabled.

  4. (Optional, requires GPU) Look through the documentation for inference on GPU and add this to your service. Check that your service works as expected by testing it with the client from the previous exercise and make sure you are seeing a speedup when running on the GPU.

    Solution

    A simple change to the @bentoml.service decorator is all that is needed to run the model on the GPU.

    @bentoml.service(resources={"gpu": 1})
    class MyService:
        def __init__(self):
            self.model = torch.load("model.pth").to("cuda:0")

  5. Another way to speed up inference is to simply use multiple workers. This duplicates the server over multiple processes, taking advantage of modern multi-core CPUs. It is similar to running the uvicorn command with the --workers flag for FastAPI applications. Implement multiple workers in your service and check that it works as expected by testing it with the client from the previous exercise. Also check that you are seeing a speedup when running with multiple workers.

    Solution

    Multiple workers can be added through the @bentoml.service decorator as shown below.

    @bentoml.service(workers=4)
    class MyService:
        # Service implementation
    

    Alternatively, you can set workers="cpu_count" to use all available CPU cores. The speedup depends on the model you are serving, the hardware you are running on and the number of workers you are using, but it should be higher than using a single worker.

  6. In addition to increasing the throughput of your deployments, BentoML can also help with ML applications that require some kind of composition of multiple models. It is very common in production setups to have multiple models that either

    • Run in a sequence, e.g. the output of one model is the input of another model. You may have a preprocessing service that preprocesses the data before it is sent to a model that makes a prediction.
    • Run concurrently, e.g. you have multiple models that are run at the same time and the output of all the models is combined to make a prediction. Ensemble models are a good example of this.

    BentoML makes it easy to compose multiple models together.

    1. Implement two services that run in sequence, i.e. the output of one service is used as the input of another service. As an example, you can implement either a pre- or post-processing service that is used in conjunction with the model you have implemented in the previous exercises.

      Solution

      The following code snippet shows how to implement two services that run in sequence.

      bentoml_service_composition.py
      from __future__ import annotations
      
      from pathlib import Path
      
      import bentoml
      import numpy as np
      from onnxruntime import InferenceSession
      from PIL import Image
      
      
      @bentoml.service
      class ImagePreprocessorService:
          """Image preprocessor service."""
      
          @bentoml.api
          def preprocess(self, image_file: Path) -> np.ndarray:
              """Preprocess the input image."""
              image = Image.open(image_file)
              image = image.resize((224, 224))
              image = np.array(image)
              image = np.transpose(image, (2, 0, 1))
              return np.expand_dims(image, axis=0)
      
      
      @bentoml.service
      class ImageClassifierService:
          """Image classifier service using ONNX model."""
      
          preprocessing_service = bentoml.depends(ImagePreprocessorService)
      
          def __init__(self) -> None:
              self.model = InferenceSession("model.onnx")
      
          @bentoml.api
          async def predict(self, image_file: Path) -> np.ndarray:
              """Predict the class of the input image."""
              image = await self.preprocessing_service.to_async.preprocess(image_file)
              output = self.model.run(None, {"input": image.astype(np.float32)})
              return output[0]
      
    2. Implement three services, where two of them run concurrently and the output of both is combined in the third service to make a prediction. As an example, you can expand your previous service to serve two different models and then implement a third service that combines the output of both models to make a prediction.

      Solution

      The following code snippet shows how to implement a service that consists of two concurrent services. The example assumes that two models called model_a.onnx and model_b.onnx are available.

      bentoml_service_composition.py
      from __future__ import annotations
      
      import asyncio
      
      import bentoml
      import numpy as np
      from onnxruntime import InferenceSession
      
      
      @bentoml.service
      class ImageClassifierServiceModelA:
          """Image classifier service using ONNX model."""
      
          def __init__(self) -> None:
              self.model = InferenceSession("model_a.onnx")
      
          @bentoml.api
          def predict(self, image: np.ndarray) -> np.ndarray:
              """Predict the class of the input image."""
              output = self.model.run(None, {"input": image.astype(np.float32)})
              return output[0]
      
      
      @bentoml.service
      class ImageClassifierServiceModelB:
          """Image classifier service using ONNX model."""
      
          def __init__(self) -> None:
              self.model = InferenceSession("model_b.onnx")
      
          @bentoml.api
          def predict(self, image: np.ndarray) -> np.ndarray:
              """Predict the class of the input image."""
              output = self.model.run(None, {"input": image.astype(np.float32)})
              return output[0]
      
      
      @bentoml.service
      class ImageClassifierService:
          """Image classifier service using ONNX model."""
      
          model_a = bentoml.depends(ImageClassifierServiceModelA)
          model_b = bentoml.depends(ImageClassifierServiceModelB)
      
          @bentoml.api
          async def predict(self, image: np.ndarray) -> np.ndarray:
              """Predict the class of the input image."""
              result_a, result_b = await asyncio.gather(
                  self.model_a.to_async.predict(image), self.model_b.to_async.predict(image)
              )
              return (result_a + result_b) / 2
      
    3. (Optional) Implement a service that consists of both sequential and concurrent sub-services.

  7. Similar to deploying a FastAPI application to the cloud, deploying a BentoML application often requires you to first containerize it. Because BentoML is designed to be easy to use even for users not that familiar with Docker, it introduces the concept of a bentofile: a file that specifies how the container should be built. Below is an example of what a bentofile could look like.

    service: 'service:Summarization'
    labels:
      owner: bentoml-team
      project: gallery
    include:
      - '*.py'
    python:
      packages:
        - torch
        - transformers
    

    which can then be used to build a bento using the following command

    bentoml build
    

    A bento is not a docker image, but it can be used to build a docker image with the following command

    bentoml containerize summarization:latest
    
    1. Can you figure out how the different parts of the bentofile are used to build the docker image? Additionally, can you figure out from the source repository how the bentofile is used to build the docker image?

      Solution

      The service part specifies both what the container should be called and also what service it should serve e.g. the last statement in the corresponding dockerfile is CMD ["bentoml", "serve", "service:Summarization"]. The labels part is used to specify labels about the container, see this link for more info. The include part corresponds to COPY statements in the dockerfile and finally the python part is used to specify what python packages should be installed in the container which corresponds to RUN pip install ... in the dockerfile.

      Regarding how the bentofile is used to build the docker image, the bentoml package contains a number of templates (written using the jinja2 templating language) that are used to generate the dockerfiles. The templates can be found here.

    2. Take whatever service from the previous exercises and try to containerize it. You are free to either write a bentofile or a dockerfile to do this.

      Solution

      The following bentofile can be used to containerize the very first service we implemented in this set of exercises.

      service: 'bentoml_service:ImageClassifierService'
      labels:
        owner: bentoml-team
        project: gallery
      include:
      - 'bentoml_service.py'
      - 'model.onnx'
      python:
        packages:
          - onnxruntime
          - numpy
      

      The corresponding dockerfile would look something like this

      FROM python:3.11-slim
      WORKDIR /bento
      COPY bentoml_service.py .
      COPY model.onnx .
      RUN pip install onnxruntime numpy bentoml
      CMD ["bentoml", "serve", "bentoml_service:ImageClassifierService"]
      
    3. Deploy the container to GCP Cloud Run and test that it works.

      Solution

      The following command can be used to deploy the container to Cloud Run. We assume that you have already built the container and tagged it bentoml_service:latest.

      docker tag bentoml_service:latest \
          <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest
      docker push <region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest
      gcloud run deploy bentoml-service \
          --image=<region>-docker.pkg.dev/<project-id>/<repository-name>/bentoml_service:latest \
          --platform managed \
          --port 3000  # default used by BentoML
      

      where <project-id> should be replaced with the id of the project you are deploying to. The service should now be available at the URL that is printed in the terminal.
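
      To verify the deployment, the client from earlier can be pointed at the Cloud Run URL instead of localhost (sketch below, with a placeholder URL):

      import bentoml
      import numpy as np

      # replace the placeholder with the URL printed by `gcloud run deploy`
      with bentoml.SyncHTTPClient("https://<cloud-run-service-url>") as client:
          prediction = client.predict(image=np.random.rand(1, 3, 224, 224).astype(np.float32))
          print(prediction)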

This completes the exercises on the BentoML framework. If you want to dive deeper into this, we can recommend looking into their tasks feature for use cases with very long running times and their built-in model management feature to unify the way models are loaded, managed and served.

🧠 Knowledge check

  1. How would you export a scikit-learn model to ONNX? What method is exported when you export a scikit-learn model to ONNX?

    Solution

    It is possible to export a scikit-learn model to ONNX using the sklearn-onnx package. The following code snippet shows how to export a scikit-learn model to ONNX.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from skl2onnx import to_onnx

    X = np.random.randn(10, 4).astype(np.float32)
    y = np.random.randint(0, 2, size=10)
    model = RandomForestClassifier(n_estimators=2)
    model.fit(X, y)  # the model must be fitted before it can be converted
    onx = to_onnx(model, X[:1])  # an example input is used to infer the input type and shape
    with open("model.onnx", "wb") as f:
        f.write(onx.SerializeToString())
    

    The method that is exported when you export a scikit-learn model to ONNX is the predict method.
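
    You can verify this by loading the exported model with the ONNX Runtime and inspecting its outputs; for a classifier, both the predicted label and the class probabilities are typically exposed (a sketch, assuming the model was saved as model.onnx as above):

    import numpy as np
    import onnxruntime as rt

    sess = rt.InferenceSession("model.onnx")
    print([o.name for o in sess.get_outputs()])  # e.g. ['output_label', 'output_probability']
    outputs = sess.run(None, {sess.get_inputs()[0].name: np.random.randn(1, 4).astype(np.float32)})
    print(outputs)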

  2. In your own words, describe what the concept of a computational graph means.

    Solution

    A computational graph is a way to represent the mathematical operations that are performed in a model. It is essentially a graph where the nodes are the operations and the edges are the data that is passed between them. The computational graph normally represents the forward pass of the model and is the reason that we can easily backpropagate through the model to train it, because the graph contains all the necessary information to calculate the gradients of the model.
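
    As a tiny illustration, PyTorch builds such a graph on the fly during the forward pass: every intermediate tensor remembers the operation that created it, and this is what autograd walks backwards through.

    import torch

    x = torch.tensor([2.0], requires_grad=True)
    y = x * 3  # multiplication node in the graph
    z = y + 1  # addition node in the graph
    print(z.grad_fn)  # the operation that produced z, i.e. the last node in the graph
    z.backward()  # traverse the graph backwards to compute gradients
    print(x.grad)  # tensor([3.])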

  3. In your own words, explain why fusing operations together in the computational graph often leads to better performance?

    Solution

    Each time we want to do a computation, the data needs to be loaded from memory into the CPU/GPU. This is a slow process and the more operations we have, the more times we need to load the data. By fusing operations together, we can reduce the number of times we need to load the data, because we can do multiple operations on the same data before we need to load new data.

This ends the module on tools specifically designed for serving machine learning models. As stated in the beginning of the module, there are a lot of different tools that can be used to serve machine learning models and the choice of tool often depends on the specific use case. In general, we recommend that whenever you want to serve a machine learning model, you should try out a few different frameworks and see which one fits your use case the best.