Profilers

Core Module

In general, profiling code is about improving the performance of your code. In this module we take a somewhat narrow view of what "performance" means: runtime, i.e. the time it takes to execute your program.

At the bare minimum, the two questions a proper profiling of your program should be able to answer are:

  • “How many times is each method in my code called?”
  • “How long does each of these methods take?”

The first question can help us prioritize what to optimize. If two methods A and B have approximately the same runtime, but A is called 1000 times more often than B, we should probably spend time optimizing A over B if we want to speed up our code. The second question speaks for itself, directly telling us which methods are the most expensive to call.

Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, the first one being cProfile. cProfile is Python's built-in profiler and can give you an overview of the runtime of all the functions and methods involved in your program.
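
cProfile can also be invoked programmatically if you only want to profile a small piece of code instead of a whole script. A minimal sketch (my_function is a hypothetical stand-in for whatever you want to measure):

import cProfile

def my_function():
    # hypothetical workload; replace with the code you want to measure
    return sum(i * i for i in range(100_000))

# run the statement under the profiler and sort the report by cumulative time
cProfile.run("my_function()", sort="cumulative")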

❔ Exercises

  1. Run cProfile on the vae_mnist_working.py script. Hint: you can invoke the profiler directly on a script using the -m argument

    python -m cProfile -s <sort_order> myscript.py
    

    To write the output to a file you can use the -o argument:

    python -m cProfile -s <sort_order> -o profile.txt myscript.py
    
    Script to debug
    vae_mnist_working.py
    """Adapted from https://github.com/Jackson-Kang/Pytorch-VAE-tutorial/blob/master/01_Variational_AutoEncoder.ipynb.
    
    A simple implementation of Gaussian MLP Encoder and Decoder trained on MNIST
    """
    
    import torch
    import torch.nn as nn
    import torchvision.transforms as transforms
    from torch.optim import Adam
    from torch.utils.data import DataLoader
    from torchvision.datasets import MNIST
    from torchvision.utils import save_image
    
    # Model Hyperparameters
    dataset_path = "datasets"
    device_name = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
    DEVICE = torch.device(device_name)
    batch_size = 100
    x_dim = 784
    hidden_dim = 400
    latent_dim = 20
    lr = 1e-3
    epochs = 5
    
    
    # Data loading
    mnist_transform = transforms.Compose([transforms.ToTensor()])
    
    train_dataset = MNIST(dataset_path, transform=mnist_transform, train=True, download=True)
    test_dataset = MNIST(dataset_path, transform=mnist_transform, train=False, download=True)
    
    train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
    
    
    class Encoder(nn.Module):
        """Gaussian MLP Encoder."""
    
        def __init__(self, input_dim, hidden_dim, latent_dim) -> None:
            super().__init__()
    
            self.FC_input = nn.Linear(input_dim, hidden_dim)
            self.FC_mean = nn.Linear(hidden_dim, latent_dim)
            self.FC_var = nn.Linear(hidden_dim, latent_dim)
            self.training = True
    
        def forward(self, x):
            """Forward pass."""
            h_ = torch.relu(self.FC_input(x))
            mean = self.FC_mean(h_)
            log_var = self.FC_var(h_)
    
            std = torch.exp(0.5 * log_var)
            z = self.reparameterization(mean, std)
    
            return z, mean, log_var
    
        def reparameterization(self, mean, std):
            """Reparameterization trick."""
            epsilon = torch.randn_like(std)
            return mean + std * epsilon
    
    
    class Decoder(nn.Module):
        """Bernoulli MLP Decoder."""
    
        def __init__(self, latent_dim, hidden_dim, output_dim) -> None:
            super().__init__()
            self.FC_hidden = nn.Linear(latent_dim, hidden_dim)
            self.FC_output = nn.Linear(hidden_dim, output_dim)
    
        def forward(self, x):
            """Forward pass."""
            h = torch.relu(self.FC_hidden(x))
            return torch.sigmoid(self.FC_output(h))
    
    
    class Model(nn.Module):
        """VAE Model."""
    
        def __init__(self, encoder, decoder) -> None:
            super().__init__()
            self.encoder = encoder
            self.decoder = decoder
    
        def forward(self, x):
            """Forward pass."""
            z, mean, log_var = self.encoder(x)
            x_hat = self.decoder(z)
    
            return x_hat, mean, log_var
    
    
    encoder = Encoder(input_dim=x_dim, hidden_dim=hidden_dim, latent_dim=latent_dim)
    decoder = Decoder(latent_dim=latent_dim, hidden_dim=hidden_dim, output_dim=x_dim)
    
    model = Model(encoder=encoder, decoder=decoder).to(DEVICE)
    
    
    BCE_loss = nn.BCELoss()
    
    
    def loss_function(x, x_hat, mean, log_var):
        """Reconstruction + KL divergence losses summed over all elements and batch."""
        reproduction_loss = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
        kld = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
        return reproduction_loss + kld
    
    
    optimizer = Adam(model.parameters(), lr=lr)
    
    
    print("Start training VAE...")
    model.train()
    for epoch in range(epochs):
        overall_loss = 0
        for batch_idx, (x, _) in enumerate(train_loader):
            if batch_idx % 100 == 0:
                print(batch_idx)
            x = x.view(batch_size, x_dim)
            x = x.to(DEVICE)
    
            optimizer.zero_grad()
    
            x_hat, mean, log_var = model(x)
            loss = loss_function(x, x_hat, mean, log_var)
    
            overall_loss += loss.item()
    
            loss.backward()
            optimizer.step()
        print(
            "\tEpoch",
            epoch + 1,
            "complete!",
            "\tAverage Loss: ",
            overall_loss / ((batch_idx + 1) * batch_size),
        )
    print("Finish!!")
    
    # Generate reconstructions
    model.eval()
    with torch.no_grad():
        for batch_idx, (x, _) in enumerate(test_loader):
            if batch_idx % 100 == 0:
                print(batch_idx)
            x = x.view(batch_size, x_dim)
            x = x.to(DEVICE)
            x_hat, _, _ = model(x)
            break
    
    save_image(x.view(batch_size, 1, 28, 28), "orig_data.png")
    save_image(x_hat.view(batch_size, 1, 28, 28), "reconstructions.png")
    
    # Generate samples
    with torch.no_grad():
        noise = torch.randn(batch_size, latent_dim).to(DEVICE)
        generated_images = decoder(noise)
    
    save_image(generated_images.view(batch_size, 1, 28, 28), "generated_sample.png")
    
  2. Try looking at the output of the profiling. Can you figure out which function took the longest to run? How do you show the content of the profile.txt file?

    Solution

    If you try to open profile.txt in a text editor you will see that it is not very human readable. To get a better overview of the profiling you can use the pstats module to read the file and print the results in a more readable format. For example, to print the 10 functions that took the longest time to run you can use the following code:

    import pstats
    p = pstats.Stats('profile.txt')
    p.sort_stats('cumulative').print_stats(10)
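
    If the long file paths make the table hard to read, pstats can also strip the directory names before printing (strip_dirs is part of the standard pstats API):

    p.strip_dirs().sort_stats("tottime").print_stats(10)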
    
  3. Can you explain the difference between tottime and cumtime? Under what circumstances do these differ, and when are they equal?

    Solution

    tottime is the total time spent in a function excluding time spent in the functions it calls. cumtime is the total time spent in a function including time spent in the functions it calls. Therefore, cumtime is always greater than or equal to tottime: the two differ whenever a function spends time in subcalls, and they are equal when the function does not call any other (profiled) functions.
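
    As a small illustration (hypothetical functions, not part of the exercise script), profiling the snippet below should show outer with a small tottime but a large cumtime, because nearly all of its time is spent inside inner, whereas for inner the two numbers should be roughly equal:

    import cProfile

    def inner():
        return sum(i * i for i in range(100_000))

    def outer():
        # does almost no work itself, just calls inner() repeatedly
        return [inner() for _ in range(50)]

    cProfile.run("outer()", sort="cumulative")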

  4. To get a better feeling for the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz and load a profiled run into it (HINT: snakeviz expects the run to have the file format .prof).
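
    One possible workflow (a sketch, assuming you name the output file profile.prof) could look like this:

    pip install snakeviz
    python -m cProfile -o profile.prof vae_mnist_working.py
    snakeviz profile.prof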

  5. Try optimizing the run! (Hint: the data is not returned as a torch tensor by default). After optimizing the code, make sure (using cProfile and snakeviz) that the code actually runs faster.

    Solution

    For consistency reasons, even though the data in the MNIST dataset class from torchvision is stored as tensors, it is converted to PIL images before being returned. This is the reason the script initializes the dataset with the transform

    mnist_transform = transforms.Compose([transforms.ToTensor()])
    

    such that the data is returned as tensors. However, since the data is already stored as tensors, calling this transform every time you want to access the data is redundant and can be removed. The easiest way to do this is to create a TensorDataset from the internal data and labels (which are already tensors).

    from torchvision.datasets import MNIST
    from torch.utils.data import TensorDataset
    # the ToTensor transform also scales the data to the [0, 1] range, so we divide by 255 manually
    train_dataset = MNIST(dataset_path, train=True, download=True)
    train_dataset = TensorDataset(train_dataset.data.float() / 255.0, train_dataset.targets)
    test_dataset = MNIST(dataset_path, train=False, download=True)
    test_dataset = TensorDataset(test_dataset.data.float() / 255.0, test_dataset.targets)
    

PyTorch profiling

Profiling machine learning code can become much more complex because we are suddenly mixing different devices (CPU + GPU) that can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that prevents other processes from performing at their best. This is the reason that all major deep learning frameworks include their own profilers, which can help with profiling more complex applications.

The image below shows a typical report from the built-in profiler in PyTorch. As the image shows, the profiler looks both at the kernel time (the time spent doing actual computations) and at transfer times such as memcpy (where data is copied between devices). It can even analyze your code and give recommendations.

(Figure: example report produced by the built-in PyTorch profiler)

Using the profiler can be as simple as wrapping the code that you want to profile in the torch.profiler.profile context manager:

with torch.profiler.profile(...) as prof:
    # code that I want to profile
    output = model(data)

❔ Exercises


In these exercises we investigate the profiler that is built into PyTorch. Note that these exercises require PyTorch v1.8.1 (or higher). You can always check which version you currently have installed by writing (in a Python interpreter):

import torch
print(torch.__version__)

We always recommend updating to the latest PyTorch version for the best experience. Additionally, to display the results nicely (like snakeviz does for cProfile) we are also going to use the TensorBoard profiler extension:

pip install torch_tb_profiler

  1. A good starting point is to look at the API for the profiler. Here the important class to look at is the torch.profiler.profile class.

  2. Let's try out a simple example (taken from here):

    1. Try to run the following code

      import torch
      import torchvision.models as models
      from torch.profiler import profile, ProfilerActivity
      
      model = models.resnet18()
      inputs = torch.randn(5, 3, 224, 224)
      
      with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
          model(inputs)
      

      This will profile the forward pass of a ResNet-18 model.

    2. Running this code will produce a prof object that contains all the relevant information about the profiling. Try writing the following code:

      print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
      

      Which operation is taking up most of the CPU time?

    3. Try running

      print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=30))
      

      Can you see any correlation between the shape of the input and the cost of the operation?

    4. (Optional) If you have a GPU you can also profile the operations on that device:

      with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
          model(inputs)
      
    5. (Optional) As an alternative to using profile as a context manager, we can also use its .start and .stop methods:

      prof = profile(...)
      prof.start()
      ...  # code I want to profile
      prof.stop()
      

      Try doing this on the above example.

  3. The torch.profiler.profile function takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page) Try doing it on the simple example above and make sure to sort the output by self_cpu_memory_usage.
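
    If you want to check your result, a minimal sketch (reusing model and inputs from the example above, with profile and ProfilerActivity imported as before) could look like this:

    # profile_memory=True additionally tracks tensor memory allocation/release
    with profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:
        model(inputs)

    print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))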

  4. As mentioned, we can also get graphical output for better inspection. After having done a profiling run, try to export the results with:

    prof.export_chrome_trace("trace.json")
    

    You should be able to visualize the file by going to chrome://tracing in any Chromium-based web browser. Can you still identify the information printed in the previous exercises in the visualization?

  5. Running profiling on a single forward step can produce misleading results, as it only provides a single sample that may depend on what background processes are running on your computer. Therefore, it is recommended to profile multiple iterations of your model. In that case we need to call prof.step() to tell the profiler when we start a new iteration:

    with profile(...) as prof:
        for i in range(10):
            model(inputs)
            prof.step()
    

    Try doing this. Is the conclusion about which operations take up most of the time still the same? Have the percentages changed significantly?
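
    When profiling many iterations you can also give the profiler a schedule, so that it skips the first (warm-up) iterations and only records a few active steps. A minimal sketch of this, reusing model and inputs from before:

    from torch.profiler import profile, schedule, ProfilerActivity

    # skip 1 step, warm up for 1 step, then record 3 steps
    my_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)
    with profile(activities=[ProfilerActivity.CPU], schedule=my_schedule) as prof:
        for i in range(10):
            model(inputs)
            prof.step()  # tells the scheduler that an iteration has finished

    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))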

  6. Additionally, we can also visualize the profiling results using the profiling viewer in TensorBoard.

    1. Start by initializing the profile class with an additional argument:

      from torch.profiler import profile, tensorboard_trace_handler
      with profile(..., on_trace_ready=tensorboard_trace_handler("./log/resnet18")) as prof:
          ...
      

      Try running a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json extension is produced in the log/resnet18 folder.

    2. Now try launching tensorboard

      tensorboard --logdir=./log
      

      and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:

      (Figure: the PyTorch profiler view in TensorBoard)

      Try poking around in the interface.

    3. TensorBoard has a nice feature for comparing runs under the diff tab. Try redoing the profiling run but use model = models.resnet34() instead. Load up both runs and try to look at the diff between them.

  7. As a final exercise, try to use the profiler on the vae_mnist_working.py file from the previous module on debugging, profiling a whole training run (not only the forward pass). What is the bottleneck during training? Is it still the forward pass or is it something else? Can you improve the code somehow based on the information from the profiler?
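
    When profiling a whole training run it can help to label the different phases of the loop, so the report also shows where time is spent in terms of your own code and not only at the operator level. One way to do this is with torch.profiler.record_function; a sketch (the region names are made up, and profile/ProfilerActivity are assumed to be imported as in the earlier examples):

    from torch.profiler import record_function

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for batch_idx, (x, _) in enumerate(train_loader):
            if batch_idx == 10:  # a few iterations are enough for profiling
                break
            with record_function("data_to_device"):  # made-up label
                x = x.view(batch_size, x_dim).to(DEVICE)
            with record_function("forward"):  # made-up label
                x_hat, mean, log_var = model(x)
                loss = loss_function(x, x_hat, mean, log_var)
            with record_function("backward"):  # made-up label
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            prof.step()

    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))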

This ends the module on profiling. If you want to go into more detail on this topic we can recommend looking into line_profiler and kernprof. A downside of using Python's cProfile is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code in a function, which will not be caught by cProfile. An example would be a simple indexing operation such as a[idx] = b, which for large arrays and non-sequential indices is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox. Additionally, if you do not like cProfile we can also recommend py-spy, which is another open-source profiling tool for Python programs.
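
As a quick taste of line_profiler, a sketch of how it is typically used (the file name and function are made up for illustration; the @profile decorator is injected by kernprof at runtime, so the script is run through kernprof rather than python directly):

# example_script.py
import torch

@profile  # injected by kernprof; no import needed
def slow_function(n):
    a = torch.zeros(n)
    b = torch.arange(n, dtype=torch.float)
    idx = torch.randperm(n)
    a[idx] = b  # the line-by-line report reveals how expensive this indexing is
    return a.sum()

if __name__ == "__main__":
    slow_function(1_000_000)

Running it with

pip install line_profiler
kernprof -l -v example_script.py

prints the time spent on each individual line of slow_function.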