Data Science Pipelines And Workflow

Sharvari Santosh 02 May, 2021 • 6 min read

This article was published as a part of the Data Science Blogathon.

Data science pipelines and workflows involve many complex, varied, and iterative steps. Let’s take the workflow of developing a typical machine learning model as an example. We start with data preparation and then move on to model training. Finally, we deploy our model (or application) into production. Each of these steps consists of several subtasks.

If we use AWS, our raw data may already be available in Amazon Simple Storage Service (Amazon S3), stored as CSV, Apache Parquet, or a similar format. We can start training models right away using Amazon AI or automated machine learning (AutoML) services to establish baseline performance by pointing directly to our dataset and clicking a single “train” button.

For customized machine learning models, the main focus of this article, we begin with the manual data ingestion and exploration stages, including data analysis, data quality checks, summary statistics, missing values, quantile calculations, data skew analysis, correlation analysis, etc.

After that, we have to define the type of machine learning problem: regression, classification, clustering, etc. Once we have identified the type of problem, we can select the machine learning algorithm best suited to solving it. Depending on the algorithm we choose, we need to select a subset of our data to train, validate, and test our model. Our raw data often needs to be converted into mathematical vectors to enable numerical optimization and model training. For example, we may decide to convert categorical columns into one-hot encoded binary vectors or convert text-based columns into word-embedding vectors. After we have converted a subset of the raw data into features, we must split the features into train, validation, and test feature sets for model training, tuning, and testing.
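
The two steps described above, turning a categorical column into binary vectors and splitting the resulting features, can be sketched in plain Python. This is a minimal illustration rather than a production approach: the column values, the function names `one_hot_encode` and `train_val_test_split`, and the 80/10/10 split are all assumptions made for the example.

```python
import random


def one_hot_encode(values):
    """Map each categorical value to a binary vector containing a single 1."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        vectors.append(vector)
    return categories, vectors


def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the rows, then cut them into train, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Hypothetical raw column (illustrative values only).
raw_categories = ["books", "toys", "books", "music", "toys",
                  "books", "music", "toys", "books", "music"]

categories, features = one_hot_encode(raw_categories)
train, val, test = train_val_test_split(features)
```

In practice a library such as scikit-learn or pandas would usually handle both steps; the point here is only that encoding happens before the split, and that the three sets do not overlap.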

In the model training phase, we select an algorithm and then train our model with our training feature set to verify that our model code and algorithm are suited to solving the given problem.

In the model tuning phase, we tune the algorithm’s hyperparameters and evaluate the performance of the model against the validation feature set. We repeat these steps, adding more data or changing hyperparameters as needed, until the model achieves the expected results on the test feature set. These results should be in line with our business objective before we push the model to production.
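
The tune-and-evaluate loop can be sketched as a small grid search. The example below is an assumption-heavy illustration: the model is a one-variable linear regression fitted by gradient descent, the data is synthetic (generated from y = 2x + 1), and the hyperparameter grid is arbitrary. Candidates are compared on the validation set, and the test set is touched only once at the end.

```python
def train_linear_model(points, learning_rate, epochs):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(model, points):
    """Mean squared error of a (w, b) model on a set of (x, y) points."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


# Synthetic data for illustration only: every point lies on y = 2x + 1.
train_set = [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
val_set = [(x, 2 * x + 1) for x in (0.5, 1.5, 2.5)]
test_set = [(x, 2 * x + 1) for x in (3.5, 4.5)]

# Try every hyperparameter combination and score it on the validation set.
results = {}
for learning_rate in (0.001, 0.01, 0.05):
    for epochs in (50, 500):
        model = train_linear_model(train_set, learning_rate, epochs)
        results[(learning_rate, epochs)] = mse(model, val_set)

best_params = min(results, key=results.get)

# Retrain with the winning hyperparameters and check the held-out test set once.
best_model = train_linear_model(train_set, *best_params)
test_error = mse(best_model, test_set)
```

Keeping the test set out of the selection loop is the design choice that matters: if the test set were used to pick hyperparameters, its error would no longer be an honest estimate of production performance.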

The final phase, moving from prototyping to production, often poses the greatest challenge to data scientists and machine learning practitioners.

Once we have built all the individual steps of our machine learning workflow, we can begin to automate the steps into a single, repeatable machine learning pipeline. When new data arrives in S3, our pipeline reruns with the latest data and pushes the latest model into production to serve our applications. There are many workflow orchestration tools and AWS services available to help us build automated machine learning pipelines.
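
As a rough sketch of that idea, the steps can be chained into one callable that is rerun whenever an unseen object appears. Everything here is a stand-in: the "bucket" is an in-memory dictionary rather than S3, the "model" is just a mean label, and the function names are hypothetical.

```python
def prepare(rows):
    """Data preparation step: drop rows that contain missing values."""
    return [row for row in rows if None not in row]


def train(rows):
    """Training step: a stand-in 'model' that only stores the mean label."""
    labels = [label for _, label in rows]
    return {"mean_label": sum(labels) / len(labels), "n_rows": len(rows)}


def deploy(model, registry):
    """Deployment step: record the model as the latest production version."""
    registry.append(model)
    return len(registry)  # version number of the newly deployed model


def run_pipeline(rows, registry):
    """Run every step in order so the whole workflow is one repeatable call."""
    return deploy(train(prepare(rows)), registry)


def on_new_objects(bucket_listing, processed_keys, registry):
    """Rerun the pipeline once for each object key that has not been seen yet."""
    for key in sorted(bucket_listing):
        if key not in processed_keys:
            run_pipeline(bucket_listing[key], registry)
            processed_keys.add(key)


# Hypothetical in-memory stand-in for a bucket: key -> rows of (feature, label).
bucket = {"2021/04/data.csv": [(1.0, 0), (2.0, 1), (None, 1)]}
registry, processed = [], set()
on_new_objects(bucket, processed, registry)

bucket["2021/05/data.csv"] = [(3.0, 1), (4.0, 1)]  # new data arrives
on_new_objects(bucket, processed, registry)        # only the new key is processed
```

The orchestration tools covered in the rest of this article replace the hand-written `on_new_objects` loop with managed triggers, retries, and monitoring.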

Amazon SageMaker Pipelines

Amazon SageMaker Pipelines is the standard and most complete way to implement AI and machine learning pipelines on Amazon SageMaker. AWS describes it as the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). With SageMaker Pipelines, you can create, automate, and manage end-to-end ML workflows.

Orchestrating workflows across each step of the machine learning process (e.g., exploring and preparing data, experimenting with different algorithms and parameters, training and tuning models, and deploying models to production) can take months of coding.

AWS Step Functions Data Science SDK

Step Functions, a managed AWS service, is a great way to build complex workflows without building and maintaining our own infrastructure. The AWS Step Functions Data Science Software Development Kit (SDK) is an open-source library that lets you easily create workflows that process data and train and publish machine learning models using Amazon SageMaker and AWS Step Functions. You can create machine learning workflows in Python that orchestrate AWS infrastructure at scale, without having to provision and integrate the AWS services separately.

Kubeflow Pipelines

Kubeflow is a relatively new ecosystem built on Kubernetes that includes an orchestration subsystem called Kubeflow Pipelines. With Kubeflow, we can restart failed pipelines, schedule pipeline runs, analyze training metrics, and track pipeline lineage.

Managed Workflows for Apache Airflow on AWS

Apache Airflow is a very mature and popular option designed primarily for data engineering and extract-transform-load (ETL) pipelines. We can use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes our tasks on an array of workers while following the specified dependencies. With the Airflow user interface, we can visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
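
The idea of running tasks in dependency order over a directed acyclic graph can be sketched with Python’s standard-library `graphlib` module (Python 3.9+). This is not Airflow’s API; the task names and the `run_dag` helper are assumptions made purely to illustrate how a scheduler decides which tasks are ready to run.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "train": {"transform", "validate"},
    "deploy": {"train"},
}


def run_dag(dag, task_fn):
    """Run task_fn on every task, never before all of its dependencies finish."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    order = []
    while sorter.is_active():
        ready = sorter.get_ready()  # tasks whose dependencies are all done
        for task in sorted(ready):  # a real scheduler would hand these to workers
            task_fn(task)
            order.append(task)
            sorter.done(task)       # unblocks tasks that were waiting on this one
    return order


executed = run_dag(dag, lambda task: None)
```

Here `transform` and `validate` become ready at the same time, which is exactly the point where a scheduler with several workers could run them in parallel.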

MLflow

Source: devclass.com

MLflow is an open-source project that originally focused on experiment tracking but now supports pipelines called MLflow Workflows. We can use MLflow to track experiments with Kubeflow and Apache Airflow workflows as well. MLflow requires us to build and maintain our own Amazon EC2 or Amazon EKS clusters. MLflow is designed to work with any ML library, algorithm, deployment tool, or language.

It is built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of providing only a small set of built-in functionality. This also makes it easy to add MLflow to your existing ML code so that you benefit from it immediately, and to share code using any ML library that others in your organization can run.
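
The core idea of tracking each training run, recording its parameters and metrics in a simple data format so that runs can be compared later, can be sketched with the standard library alone. The `RunTracker` class below is a hypothetical stand-in and not MLflow’s API, and the metric values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


class RunTracker:
    """Append-only JSON-lines log of runs (hypothetical helper, not MLflow)."""

    def __init__(self, path):
        self.path = Path(path)

    def log_run(self, name, params, metrics):
        """Append one run record as a single JSON line."""
        record = {"name": name, "params": params, "metrics": metrics}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def runs(self):
        """Load every recorded run, oldest first."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def best_run(self, metric, higher_is_better=True):
        """Return the run with the best value for the given metric."""
        pick = max if higher_is_better else min
        return pick(self.runs(), key=lambda run: run["metrics"][metric])


log_dir = tempfile.mkdtemp()
tracker = RunTracker(Path(log_dir) / "runs.jsonl")

# Made-up runs and metric values, for illustration only.
tracker.log_run("run-1", {"learning_rate": 0.01}, {"val_accuracy": 0.81})
tracker.log_run("run-2", {"learning_rate": 0.05}, {"val_accuracy": 0.86})

best = tracker.best_run("val_accuracy")
```

Because the log is plain JSON, any tool or language can read it, which mirrors the "simple data formats" design argument above.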

TensorFlow Extended

Source: Tensorflow.org

TensorFlow Extended (TFX) is a collection of open-source Python libraries used within a pipeline orchestrator such as AWS Step Functions, Kubeflow Pipelines, Apache Airflow, or MLflow. TFX is specific to TensorFlow and relies on another open-source project, Apache Beam, to scale beyond a single processing node.

Human-in-the-Loop Workflows

While AI and machine learning services make our lives easier, humans are far from obsolete. In fact, the concept of “human-in-the-loop” has emerged as a key cornerstone of many AI/ML workflows. Humans provide essential quality assurance for sensitive and regulated models in production.
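
One common way to put a human in the loop is to route only low-confidence predictions to a reviewer. The sketch below assumes a fixed confidence threshold of 0.8 and hypothetical document IDs and labels; it illustrates the routing idea only and does not use any AWS API.

```python
def route_predictions(predictions, confidence_threshold=0.8):
    """Split predictions into auto-accepted ones and ones needing human review."""
    auto_accepted, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= confidence_threshold:
            auto_accepted.append((item_id, label))
        else:
            needs_review.append((item_id, label))
    return auto_accepted, needs_review


def apply_human_review(needs_review, human_labels):
    """Replace the model's label with the reviewer's label where one was given."""
    return [(item_id, human_labels.get(item_id, label))
            for item_id, label in needs_review]


# Hypothetical model output: (item id, predicted label, confidence).
predictions = [
    ("doc-1", "invoice", 0.97),
    ("doc-2", "receipt", 0.55),
    ("doc-3", "invoice", 0.80),
]

accepted, review_queue = route_predictions(predictions)
reviewed = apply_human_review(review_queue, {"doc-2": "invoice"})
final_labels = dict(accepted + reviewed)
```

The threshold is the main design lever: raising it sends more items to reviewers and increases cost, while lowering it lets more unverified predictions through.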

Amazon Augmented AI (Amazon A2I) is a fully managed service for developing human-in-the-loop workflows, which include a clean user interface, role-based access control via AWS Identity and Access Management (IAM), and low-cost data storage via S3. Amazon A2I is integrated with many Amazon services, including Amazon Rekognition for content moderation and Amazon Textract for form-data extraction. We can also use Amazon A2I with Amazon SageMaker and any of our custom ML models.

Amazon SageMaker Pipelines supports a complete MLOps strategy, including automated pipeline retraining driven by both deterministic GitOps triggers and statistical triggers such as data drift, model bias, and explainability divergence.
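
A statistical retraining check could, for example, compare incoming feature values against the values seen at training time (often called data drift). The sketch below uses a deliberately simple score, the shift of the mean measured in baseline standard deviations, with an assumed threshold of 2.0 and made-up numbers; production monitors typically rely on richer statistics.

```python
from statistics import mean, stdev


def drift_score(baseline, current):
    """Shift of the current mean from the baseline mean, in baseline std devs.

    Assumes the baseline has at least two values and is not constant.
    """
    return abs(mean(current) - mean(baseline)) / stdev(baseline)


def should_retrain(baseline, current, threshold=2.0):
    """Signal retraining when the feature has drifted past the threshold."""
    return drift_score(baseline, current) > threshold


# Made-up feature values, for illustration only.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]  # seen at training time
similar = [10.2, 9.8, 10.1, 9.9]               # new data, same distribution
shifted = [14.0, 15.0, 13.5, 14.5]             # new data, clearly shifted

retrain_on_similar = should_retrain(baseline, similar)
retrain_on_shifted = should_retrain(baseline, shifted)
```

A check like this would run on a schedule or on each new batch, and a positive result would start the retraining pipeline.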

The media shown in this article on Data Science Pipelines and Workflow are not owned by Analytics Vidhya and are used at the Author’s discretion.

I am Sharvari Raut. I love to write. I am a final-year student in Computer Science and Engineering at NCER Pune. I have worked as a freelance technical writer for a few startups and companies. With 2 years of experience in technical writing, I have written over 100 technical articles that have been published so far. Writing for Analytics Vidhya is one of my favourite things to do.
