The DataHour Synopsis: Writing Reusable and Reproducible Pipelines
Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. To make learning data science more engaging, we launched a new initiative: “DataHour.”
DataHour is a series of webinars in which top industry experts teach and democratize data science knowledge. On 25th June 2022, we were joined by Mr. Andrey Lukyanenko for a DataHour session on “Writing Reusable and Reproducible Pipelines for Training Neural Networks.”
Andrey Lukyanenko is a Senior Data Scientist at Careem, which provides IT solutions and consulting. He has over ten years of experience in Analytics and Data Science. He aspires to create Deep Learning applications that positively impact people, bringing value to the business while also improving the lives of users/clients.
He is the Grandmaster of Kaggle Notebooks (ranked first in the kernel ranking), Discussions, and Kaggle Competitions.
Are you excited to dive deeper into the world of Data Engineering? We’ve got you covered. Let’s start with this session’s major highlights: Writing Reusable and Reproducible Pipelines for Training Neural Networks.
Data science is concerned with reproducibility. It is critical to have dependable code if we want to ensure that our experiments are not unduly influenced by randomness. At the same time, we should be able to change the code easily as we iterate over ideas.
What is a Training Pipeline?
A training pipeline is code for training neural networks and producing checkpoints with model weights, logs, and other artifacts, for example, images for interpreting the model. Everyone who trains neural networks needs to write some kind of training pipeline. Here, however, we focus specifically on reusable pipelines, which can be used for different datasets and tasks without changing many things.
Styles of Writing Training Code
Coding style is one of the most important things to consider when writing code for business, implementing projects, and so on. In business, it is critical to have consistent results and a well-functioning production system, and machine learning may be only a small part of that system, even if a crucial one. The code is usually expected to be of good quality: tested, well written, and possibly optimized for something.
For example, you optimize models for speed if you need to handle a high load. If you want to deploy the model on small devices, you optimize the model size. If you need interpretability, you use simple models, and so on.
As a result, the emphasis is on stable code, stable metrics, and software engineering. You use a different coding style when you need to build prototypes or simply make something work.
For example, suppose you have a new task and only a little experience with it. You have a limited amount of time and must test numerous approaches. So you go to Stack Overflow, Kaggle, and other similar sites, take various pieces of code and hope they work. When they appear functional, you can decide whether to try something else or rewrite this code to do something better.
In this case, we simply iterate and don’t write much code of our own. Kaggle is somewhere in the middle: it is about iteration, and you frequently modify the code, quickly adding some features or dropping some parts. At the same time, you must be able to track and reproduce your experiments, because if you want to optimize the metrics, you must be certain that an improvement comes from your changes to the code and is not the result of randomness or of many unrelated changes at once. As you can see, there are many different approaches to writing code, and training pipelines aren’t suitable for all of them.
For example, we don’t need to write a large pipeline while trying new things and discarding them quickly. However, when you do write a training pipeline, you get more stable code. You can run many models and compare them easily, and you can reuse your old functions and transfer them to different projects.
Another important distinction is whether you work alone or as a team. If you’re working alone, you can do whatever you want; write whatever code you want, however well or poorly; the important thing is that it works for you. You can use it, reuse it, and so on. However, if you work in a team, you must compromise and make the code understandable by writing documentation, tests, or comments. It may seem a small thing, but it matters because others read this code, not only you.
How do we get started with writing code and training neural networks? The default approach is to take a popular framework such as TensorFlow or PyTorch. They are the most well-known, but there are others. Keras, for example, appears to be overtaking TensorFlow, but so far PyTorch and TensorFlow remain the most popular. So you open a notebook or a script and write the code from scratch before training the model. The advantage is that you understand exactly how everything works. You can hand the code to anyone, and they will understand it if they are familiar with the framework. However, writing this code takes a significant amount of time. Everything, such as training loops and helper functions, must be written by hand. It’s a lot of code, especially if you need to handle complex cases like distributed training or other advanced techniques, all of which you must also write yourself. Experienced people can do it faster, but many potential issues remain. You could write tests, but tests can have bugs of their own, and writing them takes even more time.
Another approach is to use a high-level framework, such as Keras (the most widely used one for TensorFlow), PyTorch Lightning, or Catalyst. The nice thing is that you can pick the framework you want, and they are usually quite different because they focus on different things. Some concentrate on cutting-edge approaches. Some are concerned with high-quality software engineering. Some have a strict API, while others are simple to use. The main advantage is that you get a lot of ready-made code and a friendly community of people who will answer your questions and help you use the framework.
However, the disadvantage is that switching between them is much more difficult. If you have written the code in one framework, it takes a long time to port it to another. The logical progression of the first approach is to write your own framework, with your own trainers, classes, methods, and abstractions. It’s great because it helps you understand how everything works, but the main issue is that it will be difficult to share your code because no one else will understand it.
So this is a good option mainly when you write code for yourself. Finally, there is a hybrid approach: writing your own wrapper on top of a standard high-level framework, for example, Keras, fast.ai, or one of the others. The reason for this is that, while frameworks are fantastic, they are frequently insufficient for certain use cases, so you must make some changes. It’s great that you can use all of the framework’s features and, at the same time, include whatever you want.
The main issue is that when other people try to use your code, they must be familiar with both the high-level framework and your code.
So, it’s more difficult for others, but I see more and more people using this approach because you don’t need to write everything from scratch and you can still add whatever you want.
Reasons for Writing Pipelines
- Writing everything from scratch takes time and can have errors.
If you have to write everything from scratch, it will take a long time, and there will likely be many errors. Many popular high-level frameworks have dozens, if not hundreds, of features, and if you think it’s possible to write your own framework without any bugs or errors, think again. It takes a lot of self-assurance to say that you can write a framework better than the hundreds of people who contribute to popular high-level frameworks.
- You have repeatable pieces of code anyway.
If you participate in or develop multiple projects, you will have some repeatable pieces of code regardless, whether for calculating metrics or for setting up optimizers. Perhaps you have optimized some code or written an optimization for a specific metric. You have some code that you copied from your first project to the second, third, and so on. Converting it into classes or functions may be the next logical step in this case.
- Standardization among the team
Suppose you have a team and work on specific projects using the same pipeline. In that case, it is much easier to share code, because if different people use different frameworks, comparing their solutions and metrics is much more difficult. Having the same pipeline is far superior.
So, where do you begin when creating a training pipeline? People usually open a Jupyter Notebook, Google Colab, Kaggle, or whatever and type in all the code. It’s great at first because it works, but you should realize that it’s not the best approach because it makes changing things in the code more difficult.
- A better understanding of how things work
Another intriguing aspect is that writing a pipeline greatly aids in understanding how these things work. For example, most people can change the top layer of a model in a framework, but they can’t change the inputs and don’t understand how the layers work, which is necessary when writing your own pipeline. This understanding will be extremely beneficial to your job or future projects.
It’s difficult to version notebook code, and it’s even more difficult to commit it, so after some time, one could try to split the code into multiple parts, such as a separate script for the dataset, a separate script for the model, and a separate script for optimization. Then you could add abstractions, configurations, and so on. After a while, your project grows larger and more modular, and you realize that you need to work on a different project.
For example, if you train a model for image classification and then need to train a model for image segmentation, a good pipeline won’t require many changes. Of course, you’ll need to change your model, but most other things, such as the training loop, shouldn’t change much. The pipeline is quite good if you can switch to a different task without many changes. By the way, an interesting way to determine whether your pipeline is okay is to ask a friend to try to run it; if this person can run your pipeline, for example, within a couple of minutes, then your pipeline is fine.
Speaker’s Pipeline – The approach he follows
His strategy evolved over time. He first wrote his code in TensorFlow and then in PyTorch. He has experimented with various frameworks, and his current approach is based on PyTorch Lightning and Hydra. PyTorch Lightning is a high-level PyTorch framework. It abstracts a lot of code, allows you to write less code, and provides a lot of flexibility and nice features. Hydra manages configuration by combining multiple configuration files and makes changing hyperparameters easier. He has used the same pipeline in several projects.
For example, he has trained on multiple GPUs and multiple nodes, for time series, tabular data, named entity recognition, and image classification, and he didn’t have to change many things. He is still developing his approach and wishes to share it.
Speaker’s Pipeline: Core Ideas
His pipeline contains several key concepts. To begin with, it has replaceable components, making it simple to change the model, the data loader, or the optimizer. As previously stated, his configuration files are managed by Hydra, and he goes into more detail on this in the session. The most important thing is that values in the configuration files can be changed from the command line: any value in any configuration file can be easily overridden. And, of course, as with any pipeline, it has logging and is reproducible.
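The idea of overriding any configuration value from the command line, and of swapping components by name, can be illustrated without Hydra itself. The following is a minimal sketch in plain Python; it is not the speaker's code and not Hydra's API. `DEFAULT_CONFIG`, `apply_overrides`, `load_object`, and `build` are hypothetical names, and `fractions.Fraction` merely stands in for a model class so that the example runs with the standard library only.

```python
import importlib
from copy import deepcopy

# Hypothetical defaults; a real project would keep these in configuration files.
# "class_name" would normally be the dotted path of a model class; a standard
# library class is used here only so that the sketch is runnable as-is.
DEFAULT_CONFIG = {
    "model": {
        "class_name": "fractions.Fraction",
        "params": {"numerator": 3, "denominator": 4},
    },
    "optimizer": {"name": "adam", "lr": 0.001},
    "training": {"epochs": 10, "seed": 42},
}


def parse_value(text):
    """Convert a command-line string to bool, int, or float, else keep it as str."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config, overrides):
    """Return a copy of config with dotted `key=value` overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


def load_object(dotted_path):
    """Import `package.module.Name` and return the named object."""
    module_name, _, attribute = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attribute)


def build(section):
    """Instantiate the class named in a config section with its params."""
    return load_object(section["class_name"])(**section.get("params", {}))
```

With this sketch, `apply_overrides(DEFAULT_CONFIG, ["optimizer.lr=0.01"])` returns a new configuration with only the learning rate changed, and replacing the model means changing a single `class_name` string rather than editing the training code.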
To learn more about his pipeline strategy, watch the full session and build a pipeline of your own to reinforce what you have learned.
The approach also has drawbacks. It’s based on two frameworks; if either of them changes its API, he has to change his pipeline and spend some time on it. If he wants to add a feature that a library does not yet support, he has to wait for the next version of the library, and he can’t do anything before that.
- Training Loops
The first step in every pipeline is a training loop, in which we iterate over the data, gather losses, compute the metrics, and possibly trigger some events. For instance, in the speaker’s example, we take batches, calculate the losses, backpropagate them, and then log the losses and the metrics. Sometimes it is preferable to split the loop into multiple functions: separate methods for events at the level of the training epoch or the batch are fine as long as they can be abstracted cleanly. Still, there are occasions when this is impossible, and you must use several different ways to carry out the training.
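The structure of such a loop can be sketched without any deep learning framework. The example below is an illustration under simplifying assumptions, not the speaker's code: it fits a single weight `w` in `y = w * x` with mini-batch gradient descent, and `iterate_batches`, `train`, and the `on_epoch_end` hook are hypothetical names. The same shape (batches, loss, gradient step, logging, events) carries over to a real PyTorch loop.

```python
def iterate_batches(data, batch_size):
    """Yield consecutive slices of the dataset."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def train(data, epochs=20, batch_size=4, lr=0.01, on_epoch_end=None):
    """Fit y = w * x with mini-batch gradient descent; return (w, loss history)."""
    w = 0.0
    history = []
    for epoch in range(epochs):
        batch_losses = []
        for batch in iterate_batches(data, batch_size):
            # Forward pass: mean squared error on the batch.
            errors = [w * x - y for x, y in batch]
            loss = sum(e * e for e in errors) / len(batch)
            # Backward pass: gradient of the loss with respect to w.
            grad = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            # Optimizer step.
            w -= lr * grad
            batch_losses.append(loss)
        # Logging: keep the mean batch loss for the epoch.
        epoch_loss = sum(batch_losses) / len(batch_losses)
        history.append(epoch_loss)
        # Event hook: lets callers add behavior without editing the loop.
        if on_epoch_end is not None:
            on_epoch_end(epoch, epoch_loss, w)
    return w, history
```

Keeping events as optional hooks is one way to get the "multiple functions" flexibility mentioned above while leaving the core loop short.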
Reproducibility is usually handled by a function that fixes the random seeds. It cannot guarantee identical results, because some layers aren’t deterministic, and even if you set torch.backends.cudnn.deterministic = True, you might still see slightly different values in training. However, in most circumstances, such a function should be sufficient to make results agree to several decimal places when you run the same experiment multiple times.
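A seed-fixing function of the kind referred to above might look like the following sketch. The name `set_seed` and the default value are assumptions for illustration; NumPy and PyTorch are seeded only if they are installed, so the block also runs with the standard library alone.

```python
import os
import random


def set_seed(seed=42):
    """Fix the random seeds of the libraries that happen to be installed."""
    random.seed(seed)
    # Only affects Python processes started afterwards, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        # Safe to call without a GPU; it is ignored when CUDA is unavailable.
        torch.cuda.manual_seed_all(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Calling `set_seed` once at the start of every run is usually enough; as noted, some operations can remain non-deterministic even with these settings.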
- Experiment Tracking
It’s crucial to track your experiments. TensorBoard is the standard option, and if you haven’t tried it yet, I encourage you to do so because it’s quite good, has many features, and might be sufficient for you. However, many companies are now developing their own solutions, such as Weights & Biases, which give you a web interface, some nice features, and frequently a community of people who can assist you.
Hyperparameters need managing as well. One of the most popular approaches is argparse: your main training script has many lines declaring its parameters, and you change them there. It is not very flexible, and it is difficult to find the parameters, so I recommend using configuration files. It isn’t essential to have many configuration files, as in my code; many libraries prefer a single large configuration file.
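To make the contrast concrete, here is a small sketch, using only the standard library, of an argparse-based script that can also read its values from a configuration file. The parameter names, the `DEFAULTS` values, and the JSON format are assumptions chosen for illustration; the speaker's own pipeline uses Hydra configuration files instead.

```python
import argparse
import json

# Hypothetical default hyperparameters.
DEFAULTS = {"lr": 0.001, "batch_size": 32, "epochs": 10}


def parse_args(argv=None):
    """Declare every parameter in the script, as argparse-based code does."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative).")
    parser.add_argument("--config", type=str, default=None,
                        help="optional JSON file with parameter values")
    parser.add_argument("--lr", type=float, default=None)
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--epochs", type=int, default=None)
    return parser.parse_args(argv)


def resolve_params(args):
    """Merge defaults, then the config file, then explicit command-line flags."""
    params = dict(DEFAULTS)
    if args.config is not None:
        with open(args.config) as handle:
            params.update(json.load(handle))
    for name in DEFAULTS:
        value = getattr(args, name)
        if value is not None:
            params[name] = value
    return params
```

Every new parameter here needs another `add_argument` line, which is the inflexibility described above; a configuration file keeps all values in one place and can be stored alongside the experiment results.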
A training pipeline should also offer the following functionality:
- Easy to modify for similar problems – It should be simple to adapt. For instance, if you trained a model for binary image classification and now have a different dataset, you shouldn’t need to make many changes to the code. Likewise, when switching from binary to multi-class classification, you shouldn’t need to make many changes. If you only have to modify the model and possibly the losses, that is fine. However, if you also have to change a lot of other things, the pipeline may not be optimal.
- Make predictions – Although it may seem strange, we have seen many pipelines intended only for training: the model was trained on the data, but no methods for making predictions were provided. Our pipeline should therefore be able to make predictions, and it should be possible to do so both with and without the pipeline.
- Make predictions without a pipeline – Why should we be able to make predictions without a pipeline? Because, for instance, using plain TensorFlow or PyTorch makes it simpler to serve predictions when the model is in production. Additionally, it should be possible to convert the model to other formats to make it simpler to use in the future.
- Changing isn’t very complicated – Additionally, it should be simple to change the model. I’ve given some examples of how to change the code, but there are others. For instance, some frameworks have many layers of abstraction, making it difficult to understand what exactly needs to be changed. I won’t criticize this approach, but I’d rather things be simpler, more modular, and easier.
- Configs, configs, everything – Of course, there are some niche cases, but as already mentioned, configuration files are the best option.
- Templates of everything – Some high-level frameworks already offer templates for image segmentation, image classification, text classification, etc. This is incredibly convenient because it reduces the thought required when switching from one task to another.
- Training on folds and hyperparameter optimization – Unfortunately, not many high-level frameworks support this, because they often concentrate on building a powerful single model. If you want to run hyperparameter optimization, you sometimes have to write the loops yourself, which may be challenging, so it would be nice to have it in the pipeline.
- Training with Stages – It would also be nice to train in stages, by which I mean, for instance, changing the size of the images after several epochs or perhaps changing the optimizer during training. It’s interesting that some high-level frameworks already have this feature, and it’s beneficial when trying to push your metric to the limit.
- Using the pipeline for a variety of tasks without rewriting all the code
- Shareable code and documentation – It’s important to write documentation so that others, and you in the future, can understand what you wrote.
- Various cool tricks – There are many possible tricks, and it would be wonderful to have them in the pipeline so we don’t have to write them repeatedly: gradient accumulation, dropout tricks, and so on.
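As an illustration of the "training on folds" item in the list above, the following sketch shows one way a pipeline could generate shuffled k-fold splits and average a score over them. It uses only the standard library; `kfold_indices`, `cross_validate`, and the `train_and_score` callback are hypothetical names, and the callback is assumed to train a model on the training indices and return a validation score.

```python
import random


def kfold_indices(n_samples, n_folds, seed=42):
    """Yield (train_indices, valid_indices) pairs for shuffled k-fold splits."""
    indices = list(range(n_samples))
    # A dedicated generator keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    # Distribute samples round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, valid


def cross_validate(n_samples, n_folds, train_and_score, seed=42):
    """Run a user-supplied training function on each fold; return (mean, scores)."""
    scores = [
        train_and_score(train, valid)
        for train, valid in kfold_indices(n_samples, n_folds, seed)
    ]
    return sum(scores) / len(scores), scores
```

Because the split depends only on the seed, the same folds can be reused across models, which keeps comparisons between experiments fair.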
The speaker provided links to resources that can be used while writing a pipeline, which he also uses himself.
This article has covered the roadmap for creating reusable and reproducible pipelines for neural network training, along with the speaker’s own pipeline as an example.
You can connect with the speaker on:
- ods.ai @artgor