
Code organization


Core Module

With a basic understanding of version control, it is now time to really begin filling up our code repository. The question then remains: how should we organize our code? As developers we tend not to think much about code organization; it is instead something that grows dynamically as we need it. However, spending some time getting organized up front gives us a good chance of making our code easier to develop and maintain in the long run. If we do not, we may end up with a mess of code that is hard to understand or maintain.

Big Ball of Mud

A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated.
The overall structure of the system may never have been well defined.
If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems.

Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Pattern Languages of Programs (PLoP '97/EuroPLoP '97), Monticello, Illinois, September 1997

We are here going to focus on the organization of data science and machine learning projects. The core difference these kinds of projects introduce compared to more traditional systems is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, the layout of our codebase should probably be different.

Cookiecutter

We are in this course going to use the tool cookiecutter, which is a tool for creating projects from project templates. A project template is in short just an overall structure of how you want your folders, files etc. to be organized from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.

We are not going to argue that this template is better than every other template; we are just focusing on the fact that it is a standardized way of creating project structures for machine learning projects. By standardized we mean that if two people are both using cookiecutter with the same template, the layout of their code follows the same rules, enabling one to understand the other person's code faster. Code organization is therefore not only about making the code easier for you to maintain but also for others to read and understand.

Shown below is the default code structure of cookiecutter for data science projects.

(Image: the default cookiecutter data science project structure)

What is important to keep in mind when using a template is that it is exactly that: a template. By definition a template is a guide for making something. Therefore, not all parts of a template may be important for your project at hand. Your job is to pick the parts of the template that are useful for organizing your machine learning project and add the parts that are missing.

Python projects

While the same template could in principle be used regardless of what language we are using for our machine learning or data science application, there are certain considerations to take into account based on the language. Python is currently the dominant language for machine learning and data science, which is why this section focuses on some of the special files you will need for your Python projects.

The first file you may or may not know is the __init__.py file. In Python the __init__.py file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:

├── src/
│   ├── __init__.py
│   ├── file1.py
│   ├── file2.py
├── pyproject.toml
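
The __init__.py file can be left completely empty; its mere presence is what marks the folder as a package. Optionally, it can be used to re-export the names you want users to import directly. A minimal sketch, where MyModel and train are hypothetical names standing in for whatever file1.py and file2.py define:

# src/__init__.py
# Re-export the public API so users can write `from src import MyModel, train`
# instead of importing the individual modules (MyModel and train are
# hypothetical names used purely for illustration).
from .file1 import MyModel
from .file2 import train

__all__ = ["MyModel", "train"]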

The second file to focus on is pyproject.toml. This file is what actually turns your code into an installable Python project. Essentially, whenever you run pip install, pip is in charge of both downloading the package you want and installing it. For pip to be able to install a package, it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml file.

Below we describe the structure of the pyproject.toml file, but also setup.py + setup.cfg, which is the "old" way of providing instructions for Python projects. You may still encounter a lot of projects using setup.py + setup.cfg, so it is good to at least know about them.

pyproject.toml is the new standardized way of describing project metadata declaratively, introduced in PEP 621. It is written in the toml format, which is easy to read. At the very least your pyproject.toml file should include the [build-system] and [project] sections:

[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."
requires-python = ">=3.8"
dynamic = ["dependencies"]

[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}

The [build-system] section informs pip/Python that to build this project it needs the packages setuptools and wheel, and that it should use the setuptools.build_meta backend to actually build it. The [project] section essentially contains metadata about the package, such as what it is called, which is needed if we ever want to publish it to PyPI.
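
As a small illustration of where this metadata ends up, once the package has been installed (see the pip install commands further below) it can be inspected at runtime with the standard-library importlib.metadata module. A minimal sketch, using the package name from the example above:

from importlib.metadata import metadata, requires, version

# Only works after the package has been installed, e.g. with `pip install .`;
# "my-package-name" is the name declared under [project] in pyproject.toml.
print(version("my-package-name"))               # -> 0.1.0
print(metadata("my-package-name")["Summary"])   # -> Something cool here.
print(requires("my-package-name"))              # the declared dependencies (or None)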

For specifying the dependencies of your project you have two options. Either you specify them in a requirements.txt file and add it as a dynamic field in pyproject.toml as shown above, or you add a dependencies field under the [project] header like this:

[project]
dependencies = [
    'torch==2.1.0',
    'matplotlib>=3.8.1'
]

The improvement over setup.py + setup.cfg is that pyproject.toml also allows configuration for other tools to be specified in it, essentially making sure you only need a single file for your project. For example, in the next module M7 on good coding practices you will learn about the tool ruff and how it can help format your code. If we want to configure ruff for our project, we can do that directly in pyproject.toml by adding an additional section:

[tool.ruff]
# ruff options go here, for example:
line-length = 120

To read more about how to write a pyproject.toml file, this page is a good place to start.

setup.py is the original way of describing how a Python package should be built. A basic setup.py file looks like this:

from setuptools import setup

# read the dependencies from requirements.txt
with open("requirements.txt") as file:
    requirements = [line.strip() for line in file if line.strip() and not line.startswith("#")]

setup(
    name="my-package-name",
    version="0.1.0",
    author="EM",
    description="Something cool here.",
    install_requires=requirements,
)

Essentially, it is the exact same meta information as in pyproject.toml, just written directly in Python syntax instead of toml. Because there was a wish to separate this meta information into its own file, the setup.cfg file was created, which can contain the exact same information as setup.py, just as a declarative config.

[metadata]
name = my-package-name
version = 0.1.0
author = EM
description = Something cool here.
# ...

This non-standardized way of providing meta information about a package was essentially what led to the creation of pyproject.toml.

Regardless of how a project is configured, after creating the above files the way to install it is the same:

pip install .
# or in developer mode
pip install -e . # (1)!
  1. 🙋‍♂️ The -e is short for --editable mode, also called developer mode. Since we will be continuously iterating on our package, this is the preferred way to install it, because it means that we do not have to run pip install every time we make a change. Essentially, in developer mode changes to the Python source code take effect immediately without requiring a new installation.

After running this, your code should be available to import as from project_name import ... like any other Python package you use. This is the most essential knowledge you need about creating Python packages.
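
A quick way to convince yourself that the editable install behaves as described is to check where Python resolves the package from. A minimal sketch, assuming your package is called project_name:

import project_name

# With an editable install (`pip install -e .`) this path points into your local
# repository, so edits to the source take effect the next time the module is
# imported, without reinstalling the package.
print(project_name.__file__)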

❔ Exercises

After having installed cookiecutter (exercises 1 and 2), the remaining exercises are about taking the simple CNN MNIST classifier from yesterday's exercises and forcing it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in the exercises. Whenever you need to run a file, I recommend always doing it from the root directory, e.g.

python <project_name>/data/make_dataset.py data/raw data/processed
python <project_name>/models/train_model.py <arguments>
etc...

in this way paths (for saving and loading files) are always relative to the root.
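
If you prefer your scripts not to depend on the directory they are launched from, a common alternative (not required for the exercises) is to resolve paths relative to the file itself with pathlib. A minimal sketch, assuming the script lives at <project_name>/data/make_dataset.py inside the repository:

from pathlib import Path

# parents[0] is the data/ folder, parents[1] is <project_name>/ and parents[2]
# is the repository root; adjust the index to match where you place the file.
PROJECT_ROOT = Path(__file__).resolve().parents[2]
RAW_DIR = PROJECT_ROOT / "data" / "raw"
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"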

  1. Install cookiecutter framework

    pip install cookiecutter
    
  2. Start a new project using this template, which is specialized for this course (1).

    1. If you feel like the template can be improved in some way, feel free to either open an issue with the proposed improvement or directly send a pull request to the repository 😄.

    You do this by running the cookiecutter command using the template url:

    cookiecutter <url-to-template>
    

    Valid project names

    When asked for a project name you should follow the PEP 8 guidelines for naming packages. This means that the name should be all lowercase, and if you want to separate words you should use underscores. For example, my_project is a valid name, while MyProject is not. Additionally, the package name cannot start with a number.

    Flat-layout vs src-layout

    There are two common choices for how to lay out your source directory. The first is called src-layout, where the source code is always placed in a src/<project_name> folder, and the second is called flat-layout, where the source code is just placed in a <project_name> folder. The template we are using in this course uses the flat-layout, but there are pros and cons to both.

  3. After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday feel free to use that, otherwise create a new one. Then install the project in that environment:

    pip install -e .
    
  4. Start by filling out the <project_name>/data/make_dataset.py file. When this file runs, it should take the raw data, e.g. the corrupted MNIST files from yesterday (../data/corruptmnist), which should now be located in a data/raw folder, process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.

    Solution
    make_dataset.py
    import click
    import torch
    
    
    def normalize(images: torch.Tensor) -> torch.Tensor:
        """Normalize images."""
        return (images - images.mean()) / images.std()
    
    
    @click.command()
    @click.option("--raw_dir", default="data/raw", help="Path to raw data directory")
    @click.option("--processed_dir", default="data/processed", help="Path to processed data directory")
    def make_data(raw_dir: str, processed_dir: str):
        """Process raw data and save it to processed directory."""
        train_images, train_target = [], []
        for i in range(5):
            train_images.append(torch.load(f"{raw_dir}/train_images_{i}.pt"))
            train_target.append(torch.load(f"{raw_dir}/train_target_{i}.pt"))
        train_images = torch.cat(train_images)
        train_target = torch.cat(train_target)
    
        test_images: torch.Tensor = torch.load(f"{raw_dir}/test_images.pt")
        test_target: torch.Tensor = torch.load(f"{raw_dir}/test_target.pt")
    
        train_images = train_images.unsqueeze(1).float()
        test_images = test_images.unsqueeze(1).float()
        train_target = train_target.long()
        test_target = test_target.long()
    
        train_images = normalize(train_images)
        test_images = normalize(test_images)
    
        torch.save(train_images, f"{processed_dir}/train_images.pt")
        torch.save(train_target, f"{processed_dir}/train_target.pt")
        torch.save(test_images, f"{processed_dir}/test_images.pt")
        torch.save(test_target, f"{processed_dir}/test_target.pt")
    
    
    if __name__ == "__main__":
        make_data()
    
  5. This template comes with a Makefile that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy

    make data  # runs the make_dataset.py file, try it!
    make clean  # clean __pycache__ files
    make requirements  # install everything in the requirements.txt file
    
    Windows users

    make is a GNU build tool that is not available on Windows by default. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you may already have installed. The second option is the chocolatey package manager, which enables Windows users to install packages in a way similar to a Linux system.

    In general we recommend that you add commands to the Makefile as you move along in the course. If you want to know more about how to write Makefiles then this is an excellent video.

  6. Put your model file (model.py) into the <project_name>/models folder and insert the relevant code from the main.py file into the train_model.py file. Make sure that whenever a model is trained, it gets saved to the models folder (preferably in sub-folders).

  7. When you run train_model.py, make sure that some statistics/visualizations from the trained models get saved to the reports/figures/ folder. This could be a simple .png of the training curve. A combined sketch for exercises 6 and 7 is shown below.
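
    A minimal sketch of what train_model.py could look like for exercises 6 and 7. It assumes a model class called MyAwesomeModel (the same hypothetical name used in the visualize.py solution further below) and the processed files produced by make_dataset.py; adapt it to your own model and training loop.

    train_model.py
    import click
    import matplotlib.pyplot as plt
    import torch
    from my_project_name.model import MyAwesomeModel  # hypothetical package/model name


    @click.command()
    @click.option("--lr", default=1e-3, help="Learning rate")
    @click.option("--epochs", default=10, help="Number of epochs")
    def train(lr: float, epochs: int) -> None:
        """Train a model on the processed MNIST data and save the model and a training curve."""
        model = MyAwesomeModel()

        train_images = torch.load("data/processed/train_images.pt")
        train_target = torch.load("data/processed/train_target.pt")
        train_set = torch.utils.data.TensorDataset(train_images, train_target)
        train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

        loss_fn = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)

        losses = []
        for epoch in range(epochs):
            for images, target in train_loader:
                optimizer.zero_grad()
                loss = loss_fn(model(images), target)
                loss.backward()
                optimizer.step()
                losses.append(loss.item())
            print(f"Epoch {epoch} done, last loss: {losses[-1]:.4f}")

        torch.save(model.state_dict(), "models/model.pth")  # exercise 6: save the trained model

        plt.plot(losses)  # exercise 7: save a training curve to reports/figures
        plt.xlabel("step")
        plt.ylabel("loss")
        plt.savefig("reports/figures/training_curve.png")


    if __name__ == "__main__":
        train()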

  8. (Optional) Can you figure out a way to add a train command to the Makefile such that training can be started using

    make train
    
    Solution
    train:
        python <project_name>/models/train_model.py
    
  9. Fill out the newly created <project_name>/models/predict_model.py file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that the user can give this file either a folder with raw images that get loaded in, or a numpy or pickle file with already loaded images, e.g. something like this (a possible sketch is shown after the example invocation)

    python <project_name>/models/predict_model.py \
        models/my_trained_model.pt \  # file containing a pretrained model
        data/example_images.npy  # file containing just 10 images for prediction
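
    A minimal sketch of what predict_model.py could look like, again assuming the hypothetical MyAwesomeModel class and that the images are stored either as a single .pt/.npy file or as a folder of such files; adapt the loading logic to your own data format.

    predict_model.py
    import glob
    import os

    import click
    import numpy as np
    import torch
    from my_project_name.model import MyAwesomeModel  # hypothetical package/model name


    def load_images(path: str) -> torch.Tensor:
        """Load images from a single .pt/.npy file or from a folder of such files."""
        files = sorted(glob.glob(os.path.join(path, "*"))) if os.path.isdir(path) else [path]
        tensors = []
        for file in files:
            if file.endswith(".npy"):
                tensors.append(torch.from_numpy(np.load(file)))
            else:
                tensors.append(torch.load(file))
        return torch.cat(tensors).float()


    @click.command()
    @click.argument("model_checkpoint")
    @click.argument("data_path")
    def predict(model_checkpoint: str, data_path: str) -> None:
        """Predict with a pre-trained MODEL_CHECKPOINT on the images found at DATA_PATH."""
        model = MyAwesomeModel()
        model.load_state_dict(torch.load(model_checkpoint))
        model.eval()

        images = load_images(data_path)
        if images.dim() == 3:  # add a channel dimension if it is missing
            images = images.unsqueeze(1)

        with torch.inference_mode():
            predictions = model(images).argmax(dim=1)
        print(predictions)


    if __name__ == "__main__":
        predict()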
    
  10. Fill out the file <project_name>/visualization/visualize.py with the following (as a minimum, feel free to add more visualizations)

    • Loads a pre-trained network
    • Extracts some intermediate representation of the data (your training set) from your CNN. This could be the features just before the final classification layer.
    • Visualize features in a 2D space using t-SNE to do the dimensionality reduction.
    • Save the visualization to a file in the reports/figures/ folder.
    Solution

    The solution here depends a bit on the choice of model. However, in most cases the last layer of the model will be a fully connected layer, which we here assume is named fc. The easiest way to get the features before this layer is to replace it with torch.nn.Identity, which essentially does nothing (see the corresponding line in the code below). Alternatively, if you implemented everything in a torch.nn.Sequential you can just remove the last layer from the Sequential object: model = model[:-1].

    visualize.py
    import click
    import matplotlib.pyplot as plt
    import torch
    from my_project_name.model import MyAwesomeModel
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    
    
    @click.command()
    @click.option("--model_checkpoint", default="model.pth", help="Path to model checkpoint")
    @click.option("--processed_dir", default="data/processed", help="Path to processed data directory")
    @click.option("--figure_dir", default="reports/figures", help="Path to save figures")
    @click.option("--figure_name", default="embeddings.png", help="Name of the figure")
    def visualize(model_checkpoint: str, processed_dir: str, figure_dir: str, figure_name: str) -> None:
        """Visualize model predictions."""
        model = MyAwesomeModel()
        model.load_state_dict(torch.load(model_checkpoint))  # load_state_dict modifies the model in-place
        model.eval()
        model.fc = torch.nn.Identity()
    
        test_images = torch.load(f"{processed_dir}/test_images.pt")
        test_target = torch.load(f"{processed_dir}/test_target.pt")
        test_dataset = torch.utils.data.TensorDataset(test_images, test_target)
    
        embeddings, targets = [], []
        with torch.inference_mode():
            for batch in torch.utils.data.DataLoader(test_dataset, batch_size=32):
                images, target = batch
                predictions = model(images)
                embeddings.append(predictions)
                targets.append(target)
            embeddings = torch.cat(embeddings).numpy()
            targets = torch.cat(targets).numpy()
    
        if embeddings.shape[1] > 500:  # Reduce dimensionality for large embeddings
            pca = PCA(n_components=100)
            embeddings = pca.fit_transform(embeddings)
        tsne = TSNE(n_components=2)
        embeddings = tsne.fit_transform(embeddings)
    
        plt.figure(figsize=(10, 10))
        for i in range(10):
            mask = targets == i
            plt.scatter(embeddings[mask, 0], embeddings[mask, 1], label=str(i))
        plt.legend()
        plt.savefig(f"{figure_dir}/{figure_name}")
    
  11. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)

  12. Make sure to update the README.md file with a short description of how your scripts should be run.

  13. Finally make sure to update the requirements.txt file with any packages that are necessary for running your code (see this set of exercises for help)

  14. (Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.

    1. As a starting point I would recommend that you fork either the mlops template, which you have already been using, or alternatively fork the data science template.

    2. After forking the template, clone it locally and start modifying it. The first step is changing the cookiecutter.json file. For the mlops template it looks like this:

      {
          "project_name": "project_name",
          "repo_name": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}",
          "author_name": "Your name (or your organization/company/team)",
          "description": "A short description of the project.",
          "python_version_number": "3.10",
          "open_source_license": ["No license file", "MIT", "BSD-3-Clause"]
      }
      

      Simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.

    3. The actual template is located in the {{ cookiecutter.project_name }} folder. cookiecutter works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }} with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }} folder and make sure to add the {{ cookiecutter.<variable_name> }} where you want the variable to be replaced.

    4. After you have made the changes you want to the template, you should test it locally. Just run

      cookiecutter . -f --no-input
      

      and it should create a new folder using the default values of the cookiecutter.json file.

    5. Finally, make sure to push any changes you made to the template to GitHub, so that you can use it in the future by simply running

      cookiecutter https://github.com/<username>/<my_template_repo>
      

🧠 Knowledge check

  1. Starting from complete scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?

    Solution
    1. Create a completely barebones repository, either using the GitHub UI, or if you have the GitHub CLI installed (gh, not git) you can run

      gh repo create <repo_name> --public --confirm
      
    2. Run cookiecutter with the template you want to use

      cookiecutter <template>
      

      The name of the folder created by cookiecutter should be the same as the repository name you just used.

    3. Run the following sequence of commands

      cd <project_name>
      git init
      git add .
      git commit -m "Initial commit"
      git remote add origin https://github.com/<username>/<repo_name>
      git push origin master
      

    That's it. The template should now have been pushed to the repository as the first commit.

That ends the module on code structure and cookiecutter. We again want to stress that the point of using cookiecutter is not to follow one specific template, but simply to use some template for organizing your code. What often happens in a team is that multiple templates are needed at different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical, so that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter to not only create projects but also update existing ones as the template evolves. Cruft additionally has template validation capabilities to ensure projects match the latest version of a template.