Walkthrough of Kedro Framework Using News Classification Task

Dheeraj Bhat 18 Apr, 2023 • 7 min read

Introduction

Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It uses best practices of software engineering to build production-ready data science pipelines. This article gives you a glimpse of the Kedro framework through a news classification task.

The advantages of using Kedro are:

  • Machine Learning Engineering: It borrows concepts from software engineering and applies them to machine-learning code. It is the foundation for clean data science code.
  • Handles Complexity: Provides the scaffolding to build more complex data and machine-learning pipelines.
  • Standardisation: Standardises team workflows; the modular structure of Kedro facilitates a higher level of collaboration when teams solve problems together.
  • Production-Ready: Eases the transition from development to production, as you can write quick, throw-away exploratory code and then move to maintainable, easy-to-share code experiments.

Learning Objectives

In this article, you will learn the following:

  • Introduction to kedro
  • Core concepts of kedro
  • Step-by-step tutorial on how to install kedro
  • Step-by-step tutorial on AG News Classification task using kedro

This article was published as a part of the Data Science Blogathon.

Installation

Kedro can be installed from the PyPI repository using the following commands:

pip install kedro # core package
pip install kedro-viz # a plugin for visualization

It can also be installed using conda with the following command:

conda install -c conda-forge kedro

To confirm that Kedro is installed, run the following command in the command line; it prints an ASCII art graphic along with the Kedro version number:

kedro info

What is Node?

In Kedro, a node is a wrapper for a pure Python function that names the inputs and outputs of that function. Nodes are the building block of a pipeline, and the output of one node can be the input of another.

What is Pipeline?

A pipeline organizes the dependencies and execution order of a collection of nodes and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies and does not necessarily run the nodes in the order in which they are passed.
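
To illustrate what "resolving dependencies" means, here is a toy sketch that uses only the Python standard library. It is not Kedro's implementation, and the node and dataset names are made up; it only shows how an execution order can be derived from declared inputs and outputs, regardless of the order in which the nodes are listed:

```python
from graphlib import TopologicalSorter

# Hypothetical nodes, deliberately listed out of order.
# Each one declares the dataset names it consumes and produces.
nodes = {
    "train_model": {"inputs": ["features"], "outputs": ["model"]},
    "build_features": {"inputs": ["clean_data"], "outputs": ["features"]},
    "clean_raw_data": {"inputs": ["raw_data"], "outputs": ["clean_data"]},
}


def execution_order(nodes):
    """Return node names ordered so that producers run before consumers."""
    # Map every dataset name to the node that produces it.
    producer = {
        out: name for name, spec in nodes.items() for out in spec["outputs"]
    }
    # For each node, collect the nodes it depends on.
    # Inputs without a producer (e.g. raw data) are external and add no edge.
    graph = {
        name: {producer[i] for i in spec["inputs"] if i in producer}
        for name, spec in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(nodes)
print(order)
```

Even though `train_model` is listed first, it ends up last in the computed order because it depends on the output of `build_features`, which in turn depends on `clean_raw_data`.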

Data Catalog

The Kedro Data Catalog is the registry of all data sources that the project can use to manage loading and saving data. It maps the names of node inputs and outputs as keys in a DataCatalog (a Kedro class that can be specialized for different types of data storage).
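
The idea of a name-to-dataset registry can be sketched in a few lines of plain Python. The classes below (`ToyCsvDataset`, `ToyCatalog`) are hypothetical stand-ins written for illustration only; they are not Kedro's `DataCatalog` and cover none of its features beyond the basic name lookup:

```python
import csv
from pathlib import Path
from tempfile import TemporaryDirectory


class ToyCsvDataset:
    """Knows how to load and save rows at one CSV file path."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        with self.filepath.open(newline="", encoding="utf-8") as f:
            return list(csv.reader(f))

    def save(self, rows):
        with self.filepath.open("w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)


class ToyCatalog:
    """Maps dataset names to dataset objects, so callers only use names."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def load(self, name):
        return self._datasets[name].load()

    def save(self, name, data):
        self._datasets[name].save(data)


with TemporaryDirectory() as tmp:
    # "headlines" is a made-up dataset name; the row is made-up sample data.
    catalog = ToyCatalog({"headlines": ToyCsvDataset(Path(tmp) / "headlines.csv")})
    catalog.save("headlines", [["3", "Sample title", "Sample description"]])
    loaded = catalog.load("headlines")

print(loaded)
```

The point of the indirection is that code asking for `"headlines"` never needs to know the file path or storage format behind that name.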

Project Data Structure

The default template followed by Kedro to store datasets, notebooks, configurations, and source code is shown below. This structure makes the project easier to maintain and collaborate on. It can also be customized based on our needs.

project-dir         # Parent directory of the template
├── .gitignore      # Hidden file that prevents staging of unnecessary files to `git`
├── conf            # Project configuration files
├── data            # Local project data (not committed to version control)
├── docs            # Project documentation
├── logs            # Project output logs (not committed to version control)
├── notebooks       # Project-related Jupyter notebooks 
├── pyproject.toml  # Identifies the project root and
├── README.md       # Project README
├── setup.cfg       # Configuration options for `pytest` when doing `kedro test`
└── src             # Project source code

Kedro Project using AG News Classification Dataset

Let’s understand how to set up and use Kedro by going through a step-by-step tutorial that builds a simple text classification task 🙂

Project Setup For News Classification

It is always better to create a virtual environment to avoid package conflicts. Create a new virtual environment and install Kedro using the commands above. To create a new Kedro classification project, enter the following command in the command line and provide a name for the project:

kedro new

Fill in the name of the project as “kedro-agnews-tf” in the interactive shell. Then, go to the project directory and install the initial project dependencies using the following commands:

cd kedro-agnews-tf
pip install tensorflow
pip install scikit-learn
pip install mlxtend
pip freeze > src/requirements.txt # update the requirements file

We can set up logging, credentials, and other sensitive information in the ‘conf’ folder of the project. Currently, we do not have any in our development project, but this becomes crucial in production environments.

Data Setup for News Classification

Now, we set up the data for our development workflow. The ‘data’ folder in the project directory hosts multiple sub-folders to store the project data. This structure is based on the layered data-engineering convention as a model for managing data (for in-depth information, check out this blogpost). We store the AG News Subset data (downloaded from here) in the ‘raw’ sub-folder. The processed data goes into other sub-folders such as ‘intermediate’ and ‘feature’; the trained model goes into the ‘model’ sub-folder; model outputs and metrics go into the ‘model_output’ and ‘reporting’ sub-folders, respectively.

Then, we need to register the dataset with the Kedro Data Catalog, i.e., reference it in the ‘conf/base/catalog.yml’ file. This makes the project reproducible, because the same data definitions are shared across the complete project pipeline. Add the following to the ‘conf/base/catalog.yml’ file (note: it can also be added to the ‘conf/local/catalog.yml’ file):

# in conf/base/catalog.yml

ag_news_train:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/train.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']

ag_news_test:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/test.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']
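
The `load_args` in these entries are forwarded to `pandas.read_csv`. Since `names` is given there, this suggests the AG News CSV files have no header row, and `names` supplies the column names. The following sketch shows the equivalent pandas call on two made-up sample rows (the rows are illustrative and are not taken from the dataset):

```python
import io

import pandas as pd

# Two made-up rows in the same three-column, header-less layout.
sample_csv = io.StringIO(
    '"3","Sample business headline","Sample business description"\n'
    '"4","Sample sci/tech headline","Sample sci/tech description"\n'
)

# Passing `names` without `header` tells pandas the file has no header row,
# so the first line is treated as data rather than as column names.
df = pd.read_csv(sample_csv, names=["ClassIndex", "Title", "Description"])

print(df.columns.tolist())
print(len(df))
```

Without the `names` argument, pandas would consume the first data row as the header, and one training example would silently be lost.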

Testing Registered Dataset

To test whether Kedro can load the data, type the following command in the command line:

kedro ipython

Type the following in the IPython session:

# train data
ag_news_train_data = catalog.load("ag_news_train")
ag_news_train_data.head()

# test data
ag_news_test_data = catalog.load("ag_news_test")
ag_news_test_data.head()

After validating the output, close the IPython session using the command: exit(). This shows that data has been registered with kedro successfully. Now, we move on to the pipeline creation stage where we create Data processing and Data Science pipelines.

Pipeline Creation

Now, we create Python functions as nodes, assemble them into pipelines, and run these nodes sequentially.

Data Processing Pipeline

In the terminal, from the project root directory, run the following command to generate a new pipeline for data processing:

kedro pipeline create data_processing

This generates the following files:

  • src/kedro_agnews_tf/pipelines/data_processing/nodes.py
  • src/kedro_agnews_tf/pipelines/data_processing/pipeline.py
  • conf/base/parameters/data_processing.yml
  • src/tests/pipelines/data_processing

The steps to be followed are:

  • Add data preprocessing nodes (python functions) to nodes.py
  • Assemble the nodes in the pipeline.py
  • Add configurations in data_processing.yml file
  • Register the preprocessed data into conf/base/catalog.yml

To keep this blog succinct, I have not included the code for each of these files here. You can check out the code that needs to be added to each file in my GitHub repository here.

Run the following command to validate if you are able to execute the data processing pipeline without any errors:

kedro run --pipeline=data_processing

The above command generates data in the ‘data/02_intermediate’ and ‘data/03_primary’ folders.

Data Science Pipeline

In the terminal, from the project root directory, run the following command to generate a new pipeline for data science:

kedro pipeline create data_science

This command generates the same kinds of files as the data processing pipeline command did, but this time for the data science pipeline.

The steps to be followed are:

  • Add model training and evaluation nodes (python functions) to nodes.py
  • Assemble the nodes in the pipeline.py
  • Add configurations in data_science.yml file
  • Register the model and results into conf/base/catalog.yml
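
As a rough illustration of what an evaluation node function might compute, here is a small, dependency-free sketch. It is not the author's code (the project installs scikit-learn and mlxtend, which would normally be used for this); the function name and the shape of the returned dictionary are assumptions made for this example:

```python
from collections import Counter


def evaluate_predictions(y_true, y_pred):
    """Return accuracy and (true, predicted) label counts as a plain dict."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot evaluate an empty set of predictions")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    pair_counts = Counter(zip(y_true, y_pred))

    # String keys keep the result easy to serialize (e.g. to JSON)
    # if it is registered as a reporting dataset in the catalog.
    return {
        "accuracy": correct / len(y_true),
        "confusion": {f"{t}->{p}": n for (t, p), n in sorted(pair_counts.items())},
    }


# Made-up labels for the four AG News classes (1 to 4).
metrics = evaluate_predictions([1, 2, 3, 4], [1, 2, 3, 3])
print(metrics)
```

Returning a plain dictionary keeps the function pure, so the decision of where and how the metrics are stored stays in the Data Catalog rather than in the node code.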

You can check out the code that needs to be added for each file in my GitHub repository here.

Run the following command to validate if you are able to execute the data science pipeline without any errors:

kedro run --pipeline=data_science

The above command generates the model and results in the ‘data/06_models’ and ‘data/08_reporting’ folders, respectively.

This completes the data science pipeline. If you are interested in further building project documentation, use Sphinx to build the documentation of your kedro project.

The data folder contains different datasets, starting from raw data and continuing through intermediate data, features, models, etc. It is highly advisable to track this folder with DVC (Data Version Control), which versions large data and model files alongside the code.

Kedro Visualization

We can visualize our complete kedro project pipeline using Kedro-Viz, a plugin built by Kedro developers. We have already installed this package during initial installation (pip install kedro-viz). To visualize our kedro project, run the following command in the terminal in the project root directory:

kedro viz

This command opens a browser tab to serve the visualization (http://127.0.0.1:4141/). The image below shows the visualization of our kedro-agnews-tf project:

You can click on each of the nodes and datasets in the visualization to get more details about them. The visualization can also be refreshed automatically when a Python or YAML file in the project changes, by adding the --autoreload option to the command.

Packaging Project

To package the project, run the following in the project root directory:

kedro package

It builds the package into the ‘dist’ folder of your project and creates one .egg file and one .whl file, which are Python packaging formats for binary distribution.

Deploying Kedro Project

To deploy Kedro pipelines, we can use Kedro plugins for various deployment targets:

  • Kedro-Docker: For packaging and shipping Kedro projects within Docker containers
  • Kedro-Airflow: For converting Kedro projects into Airflow projects
  • Third-party plugins: Community-developed plugins for various deployment targets such as AWS Batch, Prefect, AWS SageMaker, Azure ML Pipelines, etc.

Conclusion

To summarize briefly, Kedro has many features that help you from the development stage through to the production of your ML workflow. To run the project directly, you can check out my GitHub repository here, and run the following commands:

git clone https://github.com/dheerajnbhat/kedro-agnews-tf.git
cd kedro-agnews-tf
tar -xzvf data/01_raw/ag_news_csv.tar.gz --directory data/01_raw/
pip install -r src/requirements.txt
kedro run
# for visualization
kedro viz

The key takeaways from this article are:

  • Understanding the capabilities kedro can offer for ML production
  • Understanding core concepts of kedro
  • Steps to install and use kedro
  • Walk-through tutorial using kedro on AG News Classification task

I hope this will help you get started with Kedro 🙂
