We may encounter many issues when working on a machine learning project. Training and monitoring multiple models is challenging, and each model may have its own characteristics and parameters. Without suitable tools for performance tracking and model version control, assessing and reusing these models becomes complicated, and sharing them with the rest of the team for testing is equally difficult. A tool that keeps track of our models makes all of this far more convenient: a platform that makes it simple for teams to collaborate and build effective, automated machine learning pipelines.
In this article, we will learn about collaborative machine learning and how to train, track and share our machine learning models using a platform called “Layer.”
Layer is a platform for building production-level machine learning pipelines. After uploading our data and model to the platform, we can easily train and retrain our models. It seamlessly supports model version control and performance tracking, and we can share data and models with teammates, making it a simple collaborative machine learning platform. Team members can review and evaluate their peers’ model development cycles using model versioning.
Due to a lack of coordination, teams often spend time on redundant work. Layer functions like a central repository for data and models, letting team members access data used earlier in the process without having to preprocess it again, which reduces repeated effort. Automated version control lets you quickly switch back to earlier versions of a model and recreate previously obtained results.
The wonderful thing about Layer is that we don’t have to modify our present programming methods or platforms. We can use Layer’s capabilities with just a few lines of code.
In our machine learning project, we will use the water quality dataset to train a classification model that predicts water potability from factors such as pH, hardness, and other chemical properties. While retraining the model, we will change some parameters. In most workflows, the older versions of the model are lost during this procedure; here, however, we will use the Layer platform for model version control and to compare the performance of the different model versions.
!pip install -U layer -q
from layer.decorators import dataset, model, resources
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import layer
from layer import Dataset
Layer requires you to register and log in first. When you run the following code, a prompt will appear asking for a key, along with a link to retrieve it. Open the URL in your browser, log in to your Layer account to find the key, then copy it and paste it into the prompt.
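If you have not authenticated before, the login call typically looks like the snippet below (assumed from the Layer SDK at the time of writing; consult the Layer documentation for your version):

import layer

# Prints/opens a URL to app.layer.ai; paste the API key it gives you into the prompt
layer.login()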
It’s time to start working on your first Layer Project. Your entire project can be found at https://app.layer.ai.
layer.init("new_project")
To load the data into the Layer project, we use the @dataset decorator to give the dataset a name and the @resources decorator to specify the path to the data files.
@dataset("water_dataset") @resources(path="./") def create_dataset(): data = pd.read_csv('water_potability.csv') return data
Run this to build the dataset in your Layer project:
layer.run([create_dataset])
You can navigate inside your project to access the dataset.
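As a quick sanity check, you can also pull the registered dataset back into a pandas DataFrame using the same layer.get_dataset() call that the training function below relies on:

df = layer.get_dataset("water_dataset").to_pandas()
print(df.shape)
print(df.head())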
To register our training function train() with Layer, we add the @model decorator to it; for this to work, the function must return the model object. The layer.log() function logs all the specified parameters and metrics to the Layer dashboard.
@model(name='classification_model', dependencies=[Dataset('water_dataset')])
def train():
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve, precision_recall_curve

    parameters = {
        "test_size": 0.20,
        "random_state": 20,
        "n_estimators": 150
    }
    layer.log(parameters)

    # Load the dataset from Layer
    df = layer.get_dataset("water_dataset").to_pandas()
    df.dropna(inplace=True)

    features_x = df.drop(["Potability"], axis=1)
    target_y = df["Potability"]
    X_train, X_test, y_train, y_test = train_test_split(
        features_x, target_y,
        test_size=parameters["test_size"],
        random_state=parameters["random_state"])

    random_forest = RandomForestClassifier(n_estimators=parameters["n_estimators"])
    random_forest.fit(X_train, y_train)

    y_pred = random_forest.predict(X_test)
    layer.log({"accuracy": accuracy_score(y_test, y_pred)})

    # Log the confusion matrix plot
    cm = confusion_matrix(y_test, y_pred, labels=random_forest.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=random_forest.classes_)
    disp.plot()
    layer.log({"Confusion metrics": plt.gcf()})

    # Use the predicted probability of the positive class for ROC AUC
    probs = random_forest.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, probs)
    layer.log({"AUC": f'{auc:.4f}'})

    # Log a sample of predictions (copy X_test so we don't modify it in place)
    sample_preds = X_test.copy()
    sample_preds["predicted"] = y_pred
    layer.log({"Sample predictions": sample_preds.head(100)})

    return random_forest
To log the parameter and upload the trained model, pass the training function to Layer.
layer.run([train])
Open the Layer project and you will see the uploaded models and datasets, along with all the parameters and graphs you logged and the model version. Every time you run the training function, a new version of the model is uploaded along with all the logged parameters. This makes it easy to compare the performance of all model versions and to revert to an earlier one.
We can compare the logged parameters and results such as test data size, hyperparameters, accuracy, and ROC-AUC score.
The sample predictions we logged are also displayed on the dashboard.
The logged graphs from different model versions can also be visualized and compared.
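If you would rather compare versions programmatically instead of in the dashboard, a minimal sketch is shown below. It relies on the layer.get_model() call introduced in the next step; the project path "san22/new_project" and the version tags "1.1" and "2.1" are placeholders following the naming pattern used later in this article, and scoring on the full dataset is only for illustration rather than a proper held-out evaluation.

import layer
from sklearn.metrics import accuracy_score

# Fetch the registered dataset and rebuild the features/target used in training
df = layer.get_dataset("water_dataset").to_pandas().dropna()
X = df.drop(["Potability"], axis=1)
y = df["Potability"]

# Load each version's trained estimator and score it on the same data
for version in ["1.1", "2.1"]:  # hypothetical version tags
    clf = layer.get_model(f"san22/new_project/models/classification_model:{version}").get_train()
    print(version, accuracy_score(y, clf.predict(X)))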
After we have trained and uploaded the model to the Layer platform, we can load any desired version of the model to make predictions. The following code fetches the required model version from the Layer app.
import layer
model = layer.get_model("san22/new_project/models/classification_model:2.1").get_train()
This model object can be used like a regular model to make predictions on new input data.
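For example, a minimal prediction sketch looks like the following; the column names match the standard water potability dataset used above (adjust them if your CSV differs), and the sample values are made up for illustration.

import pandas as pd

# A single hypothetical water sample with the same feature columns used in training
sample = pd.DataFrame([{
    "ph": 7.0, "Hardness": 200.0, "Solids": 20000.0, "Chloramines": 7.0,
    "Sulfate": 330.0, "Conductivity": 420.0, "Organic_carbon": 14.0,
    "Trihalomethanes": 66.0, "Turbidity": 4.0
}])

print(model.predict(sample))        # predicted class: 0 (not potable) or 1 (potable)
print(model.predict_proba(sample))  # class probabilities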
In this article, we learned about the many issues teams encounter in the machine learning industry when collaborating and managing model versions, and we built a Layer project that addresses those challenges end to end. Important takeaways from this article:

- Layer acts as a central repository for datasets and models, so team members can reuse each other's work instead of repeating it.
- The @dataset, @resources, and @model decorators register data and training functions with only a few lines of code, without changing how we normally write our pipelines.
- layer.log() records parameters, metrics, plots, and sample predictions, and every training run creates a new model version whose results can be compared on the Layer dashboard.
- Any model version can be fetched later with layer.get_model() and used like a regular model to make predictions.