Reproducibility in MLOps through versioning

Written by Wouter Durnez | 7/4/23 10:00 PM

Have you ever yearned for a way to journey back in time? A method to undo or alter certain pivotal moments in your life? Perhaps there were decisions you made that didn't yield the desired outcomes, relationships you unintentionally damaged, instances where you could have treated someone better, or even that super awkward moment in high school when an involuntary bodily function brought you to shame in front of the whole class. You already have some events in mind, don’t you? Don’t lie. Unfortunately, life is a one-way slippy slide, all the way up to your eventual demise. Tough.

Thankfully, machine learning beats life. When it comes to cleaning up the mess we make along our model training and development journey, we have not one, but two saviors:

  • version control, and
  • dependency management.

These tools serve as safety nets, allowing us to track our progress, experiment fearlessly, and navigate through the complexities of our projects. As a result, we can focus on core functionality without worrying about inconsistent outcomes, conflicts, or unexpected behavior.

💁‍♂️ You speak fancy words, data man. But what do they mean?

Want to dive in a little deeper? Glad you asked, my little impromptu conversational facilitator!

Version all the things!

Version control acts as a magical reset button, enabling us to rewind time, retrieve earlier iterations of our models (pipelines), and rectify any missteps or deviations. It ensures that no mistake is irreversible, giving us the freedom to explore and improve.

💁‍♂️ Makes sense, but what exactly are *all the things*?

Version the source code!

If you’re a professional coder, you should already be familiar with version control (that means you too, data scientists 👀), the practice of tracking changes made to code over time. Git is the most popular tool for this purpose, although other systems like Subversion and Mercurial exist.

In 2005, Linus Torvalds flexed his brain muscles for a few moments to give birth to Git, just as a way to support his primary project, Linux. This is what peak male performance looks like, ladies.

Git is a distributed version control system. Each developer has a complete copy of the code repository on their local machine, making it easy to work offline or without access to the central repository. Although Git has a command-line interface, GUI applications like SourceTree and Fork exist to visualize the repository, branches, and committed milestones. In addition, we can use several platforms built on top of Git technology, including GitHub and Bitbucket.

💁‍♂️ Okay, but what do we want to keep track of, exactly?

When developing a machine learning model, a crucial step involves training it on a dataset using Python scripts or, better yet, a well-designed, bells-and-whistles grand pipeline. This source code comprises essential modules and (sub)pipelines responsible for tasks such as data preprocessing, model training, and model evaluation. These code components embody the initial stages of our machine learning development lifecycle.

By embracing version control for this code, we acquire a vital piece of the puzzle when it comes to reproducing our workflow. Versioning our machine learning code ensures that we can precisely track and manage the evolution of our models over time. It allows us to revisit previous versions of the codebase, compare different iterations, and understand the changes that have shaped our models' development. Moreover, version control facilitates collaboration among team members by providing a unified platform for sharing, reviewing, and integrating code changes.

💁‍♂️ Great, we can reproduce our models by documenting the code that births them. But aren’t we forgetting something?

(smooth segue, Wouter, nice 👍)

Version the data!

Indeed, machine learning models aren’t simply the result of running a bit of code. Well, ok, they kind of are (damn!). However, in productized machine learning models, other moving parts need to be tracked. The data itself, for instance, is likely variable, and should therefore be versioned. Consider a climate forecasting model trained on data from the 1970s—can we truly expect it to perform on par with a model trained on the most up-to-date data?

So how can we implement data versioning? Here are a few approaches, each with its own pros and cons:

  • Timestamp-based versioning: This involves adding a timestamp to the filename or file metadata each time a dataset is updated. For example, if you have a file called mydata.csv, you could save each version as mydata_20220503.csv for May 3rd, 2022. While this method is simple, it can quickly become unwieldy if you have many versions of the data.
    As an example, Kedro will use this type of data versioning by default.
  • Checksum-based versioning: We can generate a checksum or hash of the data and use it as a unique identifier for the version. This can be done using hash algorithms like SHA-1 or MD5. The checksum can be included in the filename or metadata to identify the version uniquely. Aside from the identification aspect, this approach also offers a way to verify data integrity, as we can recalculate the checksum and compare it later (see the small sketch after this list).
  • Data version control systems: There are specialized tools for versioning data, such as DVC (Data Version Control) and Quilt. These tools allow data scientists to version data files directly, track changes, and facilitate sharing and collaboration across team members. DVC is also built on Git (all hail Linus), which was discussed in the earlier section. It bridges the gap between data versioning and code versioning by allowing you to track associations between versions of both. Quilt is a general data hub, built specifically for AWS, in which data can be versioned along with other assets, such as Jupyter notebooks.
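
To make the checksum-based approach a bit more tangible, here's a minimal Python sketch; the file name mydata.csv and the file_checksum helper are purely illustrative, not part of any specific tool:

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha1") -> str:
    """Compute a checksum that uniquely identifies this version of a data file."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large datasets don't have to fit in memory
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


data_path = Path("mydata.csv")
if data_path.exists():
    digest = file_checksum(data_path)
    # Embed (part of) the digest in the filename, e.g. mydata_3b5d5c37.csv,
    # or store the full digest alongside the data as metadata
    versioned_name = f"{data_path.stem}_{digest[:8]}{data_path.suffix}"
    print(versioned_name)
```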

Version the model!

In MLOps, machine learning models themselves should be versioned along with the code used to train them. This includes the training process’s hyperparameters (parameters that can’t be learnt, but that control the learning process itself) and the weights of the model. This allows data scientists to reproduce the same model at a later time and easily track changes made to the model.

💁‍♂️ But didn’t you just say the source code and data lead to the model? Then why track the model separately?

Although the source code is indeed used to train the model, the model can be affected by changes that occur outside of the codebase. For example, changing the random seed used to initialize the model weights can result in a different model even if the code remains the same. And even if the seed were set in the source code, changes in the hardware environment or software dependencies can still result in subtle differences in the model training process that can impact the final model.

By versioning the trained model itself, including the hyperparameters and weights, we can easily reproduce the exact model that was trained and track changes made to the model over time. Doing so allows you to go back in time (oooh): if your latest model acts suspicious, you can grab an older version while you figure things out. This is particularly useful when the model is used in production or needs to be retrained on updated data. In addition to the model itself, there are various other model artifacts that we can store, including performance metrics and model summaries. There are plenty of tools at your disposal to track these kinds of things (such as Neptune or Weights & Biases), but the one we love the most is *drum roll*...

MLflow

MLflow is one of the more popular open-source tools to track your model development journey, from experimentation to deployment. It provides a set of tools and APIs for managing and tracking machine learning experiments, including:

  • Tracking: MLflow can track and log all aspects of an experiment, including the code, data, parameters, and metrics. This makes it easy to reproduce experiments, compare results, and share insights with collaborators.
  • Model packaging: MLflow can package machine learning models in a standard format, making it easy to deploy and serve them in production.
  • Model versioning: MLflow can version machine learning models, allowing users to compare and track changes over time.
  • Model registry: MLflow provides a central repository for managing models, allowing users to organize and share models with others.

MLflow supports a wide range of machine learning libraries and frameworks, including TensorFlow, PyTorch, scikit-learn, and XGBoost. It also supports multiple deployment options, including batch inference, real-time inference, and serverless functions.
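
To give a flavor of what this looks like in practice, here's a minimal tracking sketch using MLflow's Python API, with scikit-learn as the example framework; the toy dataset, hyperparameters, and metric below are made up for illustration rather than a prescribed setup:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The hyperparameters we want to be able to reproduce later
params = {"n_estimators": 100, "max_depth": 3, "random_state": 42}

with mlflow.start_run():
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    # Log the moving parts together: hyperparameters, metrics, and the model itself
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
```

Run this a few times and open the MLflow UI (mlflow ui) to compare runs side by side and pull up the exact model artifact that belongs to any of them.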

MLflow is designed to be flexible and customizable and can be used with a variety of development environments, such as Jupyter notebooks, standalone Python scripts, and cloud-based workflows. It is also highly scalable and can be used in large-scale production environments. MLflow has a built-in UI to visualize the results of experimentation, can be set up in a variety of ways (with custom backend, locally or in the cloud), and has the functionality to quickly deploy models to an endpoint. It also plays well with another one of our preferred frameworks: Kedro. Did we mention it’s our favorite experiment tracker?

💁‍♂️ Ok, so version all the things, got it. Did we miss anything worth versioning?

Version the infrastructure!

Also: infrastructure-as-code

For completeness’ sake, we should mention that any infrastructure in MLOps should also be versioned. This includes configuration files for cloud infrastructure, Docker images, and Kubernetes manifests. Infrastructure versioning ensures that the infrastructure used to train and deploy machine learning models is consistent and reproducible. Foregoing this practice could result in the creation of snowflakes. Outside of the political scope, the term ‘snowflake’ refers to a configuration that, as a result of gradual, small changes over time, has become ad hoc and “unique” to the environment at large. This, in turn, can lead to inconsistency issues when going from one phase (e.g., training) to another (e.g., deployment), as the environments are no longer the same. In truth, this part warrants a completely separate blog post (which I’ll gladly outsource to one of AE’s capable DevOps aficionados).

💁‍♂️ So you’ve talked about code versioning, data versioning, and model versioning, great. But how can I ensure everything runs the same, every single time?

Dependency management

A final guardian angel, from a reproducibility perspective, is dependency management. In MLOps, we want to ensure that the various components and libraries we rely on in our projects harmoniously coexist (Kumbaya). Dependency management takes care of all the intricate relationships between different pieces of code, ensuring that everything works together seamlessly. It also helps keep our runs deterministic: the same code will give consistent results, as long as we remain mindful of the environment it is run from.

Dependency management in MLOps refers to the process of managing the dependencies required for a machine learning project, including libraries, frameworks, and other software packages. To develop and run machine learning code, we often rely on a variety of external resources, and ensuring that all dependencies are installed and compatible with each other can be a complex and time-consuming process. Take this to heart: In machine learning, dependency hell is a real place. Libraries and frameworks coexist in a fragile and borderline mystical symbiosis of versions. Playing fast and loose with a project's requirements is a recipe for disaster, as it may eventually lead you into a world of errors and deprecation. Hypothetically, of course.

💁‍♂️ So how can I get this part right?

Have no fear! Several off-the-shelf tools are at your disposal to safely manage your dependency stack. First, it’s almost always a good idea to set up a virtual environment: a pristine place for your code to live, free of the stains of earlier package installs. You can then specify and sync the requirements for your project, installing and removing dependencies as you go.

Some popular tools to achieve this include pip/pip-tools/pipenv (vanilla but solid), conda (not a fan, but useful in a pinch), poetry (for the cool kids), and the more recent rye (for the ultra-cool kids with a hint of existential crisis). We’re cool kids, so we like poetry quite a bit. If we’re being really honest, though, there’s nothing wrong with good ol’ pip either. Or maybe our age is showing.
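
If you want a quick sanity check that your active environment actually matches the versions your project pins, a small Python sketch along these lines can help; the package names and version pins below are made-up examples, not recommendations:

```python
from importlib.metadata import PackageNotFoundError, version

# Pins you might find in a requirements.txt or lock file (illustrative only)
pinned = {"numpy": "1.26.4", "pandas": "2.2.2", "scikit-learn": "1.4.2"}

for package, expected in pinned.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (expected {expected})")
        continue
    status = "OK" if installed == expected else f"MISMATCH (expected {expected})"
    print(f"{package}: {installed} {status}")
```

Tools like pip-tools and poetry do this bookkeeping for you, of course; the point is simply that your environment, not just your code, determines what your pipeline produces.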

Note that dependencies don’t always seamlessly translate across different operating systems. If you are building a Docker image for your training pipelines (which you can then version as an artifact), it is important to be aware that these images will run on the Linux kernel, which may differ from your local working environment. Double-check whether your image can be run without unforeseen issues!

💁‍♂️ Roger that, data man.

Alright, time to wrap this one up!

git commit -am "📚 Added some documentation on reproducibility"

(If you get this title, you’re doing something right.)

MLOps covers a variety of moving bits and pieces. Keeping a record of the state of these bits and pieces (in other words, applying version control) helps us to stay on top of how certain models came to be, and retrace our steps if need be. Version what? Everything! Code, data, models, infrastructure... including their dependencies! This should allow you to link all of these bits and pieces together, so you have full control over what you were doing, whenever you were doing it.

Next time, we’ll discuss another (big) concept in MLOps, one that is arguably at the very core of the topic: automation!