Machine learning model deployment is a complex process that requires a long sequence of steps. In the classical software development cycle, the majority of processes are already automated, while machine learning still needs a lot of manual effort. To resolve this issue, the MLOps approach was developed. In short, it is a set of practices aimed at automating the creation of machine learning models, starting with data processing and finishing with the deployment of the trained model. In this paper we will focus on the general principles of MLOps solutions and look at the steps they usually contain.
Hi there! Every data science engineer sooner or later comes to the idea of automating their work. Usually, this happens only in a separate block such as model selection, hyperparameter tuning, or model deployment. And it is a natural approach, because full automation requires a lot of effort and is not always justified in the context of the ongoing project. For example, it can be a simple model that will be retrained once every 10 years. You just implement it and forget about it. But what if we deal with a composite production system that includes several different models that should be updated and redeployed at least once per day? In this case, the right way to solve this challenge is to think about an MLOps solution.
So here we are going to understand what MLOps is, why we need it, what principles it follows, and plenty of other useful details. As usual, let's define a short plan.
What is MLOps
Building blocks of every MLOps solution
Tools to build MLOps solution
What is MLOps
If we start googling MLOps (Machine Learning Operations), we will find that it is a set of approaches for moving ML models to production and supporting them there. MLOps encompasses different aspects of machine learning model management, including data preprocessing and preparation, model training and evaluation, deployment, and monitoring.
MLOps enables organizations to streamline their ML workflows, reduce time to deployment, and improve the performance, scalability, and reliability of their ML models. As a result, it helps to reduce the amount of human effort in developing any ML-based solution. This also reduces the need to involve separate specialists such as software developers, data scientists, and DevOps engineers: in the limit, the team only has to hire MLOps engineers.
So the motives for building MLOps solutions are pretty clear. Let's go deeper into the structure of such systems.
Building blocks of every MLOps solution
Now that we understand what MLOps is and why we need it, it is the right time to look at the structural elements of an MLOps-based solution. In my experience, the list of blocks varies a lot from one project to another, but we can still highlight the ones you are most likely to face in your work:
1. Training pipeline
Certainly one of the core elements of any MLOps solution. Basically, it is the block responsible for the model training process. The training pipeline is not a black box: it always has a predefined structure. It is good practice for your pipeline to consist of the following steps:
1) Data preparation step. A set of actions that extract data from the data source (usually the dataset registry), process it to prepare it for training, and save the processed version of the data.
2) Model training step. The part of the pipeline that is responsible for the training procedure. As a result of this step, we have model artifacts that include model weights and code for execution. This list can be extended if needed. In short, the output should contain all elements that will be used for inference serving.
3) Model evaluation step. In some cases, this step can be combined with the model training step. The goal of the evaluation step is to calculate quality metrics for the trained model. So the output here is quite simple: a file with the metrics required by the machine learning task.
4) Model registration step. The goal of this step is to prepare model artifacts for deployment. The registration process differs from one tool to another, and so does the list of actions performed inside this step. Only the output should be the same in every solution: a new version of the model stored in the model registry.
5) Deployment step (optional). It is not a mandatory element of the training pipeline. Often the model deployment process is a separate step. But if it is part of the training pipeline, it also contains a condition step that decides whether the model needs to be deployed. How does it work? The simplest way is to check the output of the evaluation step. If the metric value is greater than the expected threshold, we have a green light and can deploy the model.
The most common approach is to deploy the model to a server that responds to requests in real time; such a server is usually called an endpoint.
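To make the five steps above a bit more tangible, here is a minimal sketch of such a pipeline in plain Python. Everything in it is an assumption made for illustration: the function names, the toy threshold "model", the fixed 80/20 split, the in-memory list that plays the role of a model registry, and the min_accuracy threshold. A real pipeline would delegate each step to your MLOps tool of choice.

```python
def prepare_data(raw_rows):
    """Step 1: drop incomplete rows, then split into train and test parts."""
    clean = [(float(x), int(y)) for x, y in raw_rows if x is not None]
    split = int(len(clean) * 0.8)  # assumption: a fixed 80/20 split
    return clean[:split], clean[split:]


def train_model(train):
    """Step 2: fit a toy threshold classifier and return its artifacts."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return {"threshold": threshold}


def evaluate_model(artifacts, test):
    """Step 3: compute quality metrics on the held-out part of the data."""
    correct = sum(1 for x, y in test if int(x > artifacts["threshold"]) == y)
    return {"accuracy": correct / len(test)}


def register_model(registry, artifacts, metrics):
    """Step 4: store a new model version in the (in-memory) registry."""
    version = len(registry) + 1
    registry.append({"version": version, "artifacts": artifacts, "metrics": metrics})
    return version


def should_deploy(metrics, min_accuracy):
    """Step 5 (optional): the condition that guards deployment."""
    return metrics["accuracy"] > min_accuracy


def run_training_pipeline(raw_rows, registry, min_accuracy):
    """Run the steps in order and report what happened."""
    train, test = prepare_data(raw_rows)
    artifacts = train_model(train)
    metrics = evaluate_model(artifacts, test)
    version = register_model(registry, artifacts, metrics)
    return {
        "version": version,
        "metrics": metrics,
        "deploy": should_deploy(metrics, min_accuracy),
    }


# Toy data: (feature, label) pairs plus one incomplete row to be dropped.
rows = [(1, 0), (9, 1), (2, 0), (8, 1), (3, 0), (7, 1), (4, 0), (6, 1),
        (0, 0), (10, 1), (None, 0)]
registry = []
print(run_training_pipeline(rows, registry, min_accuracy=0.9))
```

Note that the condition uses a strict "greater than" comparison, following the description of the condition step above. Whether equality should also pass is a project decision.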
2. Dataset registry
Of course, data is at the core of any data science solution. Without a good dataset, the training procedure can't guarantee a high-quality model. That's why selecting the right strategy for storing datasets is crucial for the MLOps solution.
Not every MLOps system includes a dataset registry, but it is always good practice to integrate one into your solution, especially if you are working with a high-performing production system where different models are retrained frequently.
So what is a dataset registry? Physically it can be a database, a repository, or any other object storage (depending on the task and the preferences of the architect) that ideally satisfies the following points (none of them is strictly mandatory):
Provide an API for dataset uploading and downloading. A tool called a dataset registry is of little use if you can't operate on the data with it.
Provide dataset versioning. Different runs of the training pipeline can use different versions of the dataset. This information should be stored somewhere, and that is exactly what dataset versioning should cover.
Provide an API for getting dataset metadata. There are a lot of reasons to store dataset metadata. The simplest one is comparing different versions of the same dataset.
The functionality of the dataset registry varies from one task to another, so you will probably need to build your own features depending on the task.
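The three points above (upload/download, versioning, metadata) can be sketched as a tiny in-memory registry. This is only an illustration under my own assumptions: the class name DatasetRegistry, its method names, and the choice of metadata fields (row count, content hash, creation time) are hypothetical and do not come from any specific tool. A real registry would sit on top of a database or object storage.

```python
import hashlib
import json
from datetime import datetime, timezone


class DatasetRegistry:
    """Hypothetical in-memory dataset registry with versioning and metadata."""

    def __init__(self):
        self._store = {}  # dataset name -> list of version records

    def upload(self, name, rows):
        """Store rows as a new version of the dataset and return its number."""
        payload = json.dumps(rows, sort_keys=True).encode("utf-8")
        versions = self._store.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "rows": list(rows),
            "metadata": {
                "num_rows": len(rows),
                "sha256": hashlib.sha256(payload).hexdigest(),
                "created_at": datetime.now(timezone.utc).isoformat(),
            },
        }
        versions.append(record)
        return record["version"]

    def _get(self, name, version=None):
        versions = self._store[name]
        if version is None:
            return versions[-1]  # latest version by default
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]

    def download(self, name, version=None):
        """Return a copy of the rows for the requested (or latest) version."""
        return list(self._get(name, version)["rows"])

    def metadata(self, name, version=None):
        """Return a copy of the metadata for the requested (or latest) version."""
        return dict(self._get(name, version)["metadata"])


datasets = DatasetRegistry()
datasets.upload("sales", [[1, 2], [3, 4]])
datasets.upload("sales", [[1, 2], [3, 4], [5, 6]])
print(datasets.metadata("sales", 1)["num_rows"], datasets.metadata("sales")["num_rows"])
```

Storing a content hash in the metadata is one cheap way to compare two versions of the same dataset without downloading them.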
3. Model registry
The goal of the model registry is similar to that of the dataset registry: store trained models and provide all required information about each of them. What the model registry should provide:
Provide an API for model uploading and downloading. As with datasets, the user should be able to add new models or fetch the required one.
Provide a model versioning mechanism. Even for the same task, we can conduct a set of experiments on the same model architecture but under different conditions (changes in hyperparameters, amount of data, version of data, etc.). Of course, each separate run under new conditions produces new model artifacts. Each artifact and its related resources should be saved into the model registry as an independent model version.
Metrics storage. When we have a lot of versions of a model, it is critical to be able to understand which one is better. For this purpose, the model registry should expose the metrics of each version, for example through a dashboard. Basically, it is anything that can provide information about model quality. It can even be a simple text file that contains the metric values.
Hyperparameters storage. As with model metrics, the hyperparameters of each version should be available.
Meta information about model artifacts. To be able to restore or deploy the model, you should know where all artifacts are stored and which artifacts are required.
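As a rough illustration of the list above, here is a sketch of a model registry that keeps versions, metrics, hyperparameters, and artifact locations together. The names ModelVersion and ModelRegistry, the method names, and the artifact paths are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    artifacts: dict        # artifact name -> storage location
    metrics: dict          # metric name -> value
    hyperparameters: dict  # hyperparameter name -> value
    dataset_version: int   # which dataset version the model was trained on


class ModelRegistry:
    """Hypothetical in-memory model registry."""

    def __init__(self):
        self._models = {}  # model name -> list of ModelVersion

    def register(self, name, artifacts, metrics, hyperparameters, dataset_version):
        """Save a run as a new, independent model version."""
        versions = self._models.setdefault(name, [])
        model_version = ModelVersion(
            version=len(versions) + 1,
            artifacts=dict(artifacts),
            metrics=dict(metrics),
            hyperparameters=dict(hyperparameters),
            dataset_version=dataset_version,
        )
        versions.append(model_version)
        return model_version

    def get(self, name, version):
        """Fetch one specific version of a model."""
        for model_version in self._models[name]:
            if model_version.version == version:
                return model_version
        raise KeyError(f"{name} has no version {version}")

    def best(self, name, metric, higher_is_better=True):
        """Pick the version with the best value of the given metric."""
        pick = max if higher_is_better else min
        return pick(self._models[name], key=lambda mv: mv.metrics[metric])


models = ModelRegistry()
models.register("churn", {"weights": "models/churn/1/weights.bin"},
                {"accuracy": 0.87}, {"max_depth": 4}, dataset_version=1)
models.register("churn", {"weights": "models/churn/2/weights.bin"},
                {"accuracy": 0.91}, {"max_depth": 6}, dataset_version=2)
print(models.best("churn", "accuracy").version)
```

Keeping the dataset version next to the metrics and hyperparameters is a design choice of this sketch; it makes it possible to tell whether a better score came from new data or from new settings.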
4. Triggering mechanism
In the ideal case, our MLOps solution should work in automatic mode. Once it is published, the developer shouldn't have to touch the pipeline at all, just make requests to the deployed model. To achieve this, we need triggering mechanisms. The idea is quite simple: a trigger waits for the event described in its configuration and, once the event has occurred, starts pipeline execution. The simplest example is the dataset update trigger. It waits until a new version of the dataset is created and then starts retraining the model. You can create a lot of different triggers: dataset update triggers, code update triggers, model quality degradation triggers, time-based triggers, and so on. The core idea here is to clearly understand the situations in which you need to re-run your pipeline.
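Here is a small sketch of two such triggers, a dataset update trigger and a time-based one, together with a polling function that starts the pipeline when any of them fires. The class names, the should_fire method, and the polling approach are my own assumptions for this example; real tools often use event notifications or schedulers instead of polling.

```python
class DatasetUpdateTrigger:
    """Fires when the latest dataset version differs from the last one seen."""

    def __init__(self, get_latest_version):
        self._get_latest_version = get_latest_version
        self._last_seen = get_latest_version()

    def should_fire(self):
        latest = self._get_latest_version()
        if latest != self._last_seen:
            self._last_seen = latest
            return True
        return False


class IntervalTrigger:
    """Fires when at least interval_seconds have passed since the last firing."""

    def __init__(self, interval_seconds, clock):
        self._interval = interval_seconds
        self._clock = clock  # callable returning the current time in seconds
        self._next_run = clock() + interval_seconds

    def should_fire(self):
        now = self._clock()
        if now >= self._next_run:
            self._next_run = now + self._interval
            return True
        return False


def poll(triggers, run_pipeline):
    """Check every trigger once; run the pipeline a single time if any fired."""
    # A list (not any() on a generator) so that every trigger updates its state.
    fired = [trigger for trigger in triggers if trigger.should_fire()]
    if fired:
        run_pipeline()
    return bool(fired)


# Usage with a fake dataset version counter instead of a real registry.
state = {"dataset_version": 1}
trigger = DatasetUpdateTrigger(lambda: state["dataset_version"])
print(poll([trigger], lambda: print("retraining...")))  # nothing changed yet
state["dataset_version"] = 2
print(poll([trigger], lambda: print("retraining...")))  # new version appeared
```

In a real setup, time.monotonic could be passed as the clock for the interval trigger; a fake clock is handy for testing.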
5. Production pipeline
In some specific scenarios or projects, the deployment mechanism can be separated from the training pipeline into a block called the production pipeline. When can it be useful? Consider a simple case: you don't want to deploy a new model as soon as you train one with higher accuracy than the current one. For example, you are waiting for a data structure update, the start of a new month, or any other triggering event. Usually, the production pipeline has the following steps inside:
Get the best model from the model registry.
Deploy the new model to an endpoint or any other serving tool.
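The two steps above can be sketched as follows. The Endpoint class, the list of dictionaries that stands in for a model registry, and the toy threshold model are all assumptions made for this illustration, not the API of any particular tool.

```python
class Endpoint:
    """Hypothetical stand-in for a real-time serving endpoint."""

    def __init__(self):
        self.deployed = None  # the model version currently being served

    def deploy(self, model_version):
        self.deployed = model_version

    def predict(self, x):
        if self.deployed is None:
            raise RuntimeError("no model has been deployed yet")
        return int(x > self.deployed["artifacts"]["threshold"])


def run_production_pipeline(registry, endpoint, metric="accuracy"):
    """Deploy the best registered model unless it is already being served."""
    # Step 1: get the best model from the registry.
    best = max(registry, key=lambda model: model["metrics"][metric])
    current = endpoint.deployed
    if current is not None and current["version"] == best["version"]:
        return False  # nothing to do, the best version is already deployed
    # Step 2: deploy the new model to the endpoint.
    endpoint.deploy(best)
    return True


registry = [
    {"version": 1, "artifacts": {"threshold": 4.0}, "metrics": {"accuracy": 0.85}},
    {"version": 2, "artifacts": {"threshold": 5.0}, "metrics": {"accuracy": 0.92}},
]
endpoint = Endpoint()
print(run_production_pipeline(registry, endpoint), endpoint.deployed["version"])
```

Because this function is separate from training, it can be called by whatever event suits the project, such as the start of a new month.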
6. Model monitoring mechanism
It is always cool when your system can retrain itself and deploy a new model automatically. But I think one more cool thing here is the ability to identify that something is going wrong with your model. To achieve this at the level of the MLOps solution, the system should be integrated with a model monitoring mechanism.
Basically, it is a feedback loop that analyzes the quality of the model and monitors changes in the data (like shifts in distributions).
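As an illustration of both halves of this feedback loop, the sketch below checks for a shift in a feature distribution and for a drop in model quality. For the data part I use the Population Stability Index (PSI) with equal-width bins, which is just one common choice and not something prescribed by MLOps itself. The function names, the number of bins, the 0.2 drift threshold (a popular rule of thumb), and the 0.05 accuracy tolerance are assumptions that you would tune for your own project.

```python
import math


def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def fractions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clip out-of-range values
            counts[index] += 1
        # eps keeps the logarithm finite when a bin is empty
        return [max(count / len(values), eps) for count in counts]

    ref, cur = fractions(reference), fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def data_drifted(reference, live, threshold=0.2):
    """Flag drift when PSI exceeds the threshold (0.2 is a rule of thumb)."""
    return psi(reference, live) > threshold


def quality_degraded(baseline_accuracy, live_accuracy, tolerance=0.05):
    """Flag degradation when live accuracy falls too far below the baseline."""
    return live_accuracy < baseline_accuracy - tolerance


reference = [i / 100 for i in range(100)]    # values seen at training time
live = [0.5 + i / 200 for i in range(100)]   # production values, shifted upward
print(data_drifted(reference, reference), data_drifted(reference, live))
print(quality_degraded(0.90, 0.80))
```

Either of these checks could serve as the event behind a model quality degradation trigger.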
7. Experiments tracking
Quite an important item that is not present in every MLOps solution. Basically, it is a kind of logging, because experiment trackers usually store information from the model and dataset registries for the current run. In other words, for each pipeline execution we will have information about the model architecture, the training score, the version of the dataset, the elapsed time, metadata that can be useful in the future, and any other information the user chooses to record. I wouldn't say that it is a frequent feature in large and fully automated MLOps solutions, where (ideally) you shouldn't need to do any manual analysis. But for projects with semi-automated MLOps and a large team of developers, it is a must-have feature.
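Since experiment tracking is essentially structured logging, a very small tracker can be sketched as a file with one JSON record per run. The class name ExperimentTracker, the record fields, and the JSON Lines storage format are assumptions made for this example; dedicated tracking tools offer much more (UI, comparison of runs, artifact storage).

```python
import json
import time
import uuid
from pathlib import Path


class ExperimentTracker:
    """Hypothetical tracker that appends one JSON record per pipeline run."""

    def __init__(self, log_path):
        self._path = Path(log_path)

    def log_run(self, model_architecture, dataset_version, score,
                elapsed_seconds, **extra):
        """Append a record for one run and return its generated id."""
        record = {
            "run_id": uuid.uuid4().hex,
            "logged_at": time.time(),
            "model_architecture": model_architecture,
            "dataset_version": dataset_version,
            "score": score,
            "elapsed_seconds": elapsed_seconds,
            "extra": extra,  # any other information the user wants to keep
        }
        with self._path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        return record["run_id"]

    def runs(self):
        """Read back all logged runs in the order they were written."""
        if not self._path.exists():
            return []
        with self._path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]

    def best_run(self, higher_is_better=True):
        """Return the run with the best score, or None if nothing is logged."""
        runs = self.runs()
        if not runs:
            return None
        pick = max if higher_is_better else min
        return pick(runs, key=lambda run: run["score"])
```

A typical call would look like tracker.log_run("gradient_boosting", dataset_version=2, score=0.91, elapsed_seconds=12.5, author="me"), where every value is of course made up for the example.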
Tools to build MLOps solution
Today MLOps is a popular direction, and a lot of companies, including large corporations, invest a lot of money and effort into the development of MLOps solutions. That's why the number of available tools for building MLOps systems is growing quickly. Informally, we can highlight 2 categories of MLOps tools: cloud-based and cloud-independent solutions.
1. Cloud-based tools. As the name suggests, this is a group of tools hosted only inside a specific cloud. Normally, you can't deploy these MLOps tools locally, and all computations are performed on virtual machines inside the hosting cloud. Let's look at the pros and cons of this group.
Pros.
1) All resources are placed in the same location. All building blocks of your MLOps solution live inside one cloud, which minimizes the risks linked with authorization and infrastructure setup.
2) Solution management. You don't need to manually implement things like a high-level logging mechanism or error handling, because they are already covered by the cloud.
3) Wide range of items inside a tool. Cloud-based solutions try to cover as many MLOps sub-tasks as possible, so that users can stay inside the cloud and not use any other tools.
4) Support and development of the tools. Usually, cloud-based solutions are actively developed and have strong user support.
5) Relative reliability. Bugs and issues that users face with such tools are usually fixed quickly.
Cons.
1) High cost of cloud-based resources. Cloud-based solutions are usually much more expensive than their cloud-neutral analogs.
2) Difficult to integrate parts that are required but absent in the cloud solution. If you find that a feature you really need is missing, it will be relatively difficult to integrate your own (or an open-source) tool with the cloud.
The most popular (at least for me) cloud-based MLOps tools are AWS SageMaker, Google Vertex AI, and Azure ML Studio. Of course, to start working with these tools, you should have a cloud account.
2. Cloud-independent tools. The group of cloud-free MLOps tools is much larger than the previous one. It includes different tools that each support a different part of the whole MLOps solution. For example, some of them provide functionality only for data management or for training pipelines. Others only allow you to deploy models, and so on. They have only one thing in common: they are cloud-independent and can be deployed wherever you need.
Pros.
1) The majority of such tools are free or cost much less than cloud-based solutions.
2) High flexibility. Orchestration and extension of these systems, as well as integration with external tools, are simpler than in the cloud-based case.
3) Deployment independence. You can decide where and how to deploy your tools.
Cons.
1) Usually, cloud-neutral solutions are focused on one or two MLOps subtasks and don't provide the full set of required tools.
2) Cloud-free tools require more manual management of things that cloud providers usually handle for you.
3) Support and development of these tools is usually slower than in the case of cloud-native tools.
I think we now have a clear understanding of what MLOps is and why we need it. And in case you need to build such a system, it should not be a big problem for you to propose the structure of the solution and recommend the tools that are most suitable for your case.
Hope it was really interesting and at least a little bit informative for you. What interesting tools and approaches do you follow to build MLOps solutions? Write down your answers in the comments and, of course, thanks for reading!