Walkthrough of Kedro Framework Using News Classification Task
Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It applies software engineering best practices to build production-ready data science pipelines. This article will give you a glimpse of the Kedro framework through a news classification task.
The advantages of using Kedro are:
- Machine Learning Engineering: It borrows concepts from software engineering and applies them to machine-learning code. It is the foundation for clean, data science code.
- Handles Complexity: Provides the scaffolding to build more complex data and machine-learning pipelines.
- Standardisation: Standardises team workflows; the modular structure of Kedro facilitates a higher level of collaboration when teams solve problems together.
- Production-Ready: Makes the transition from development to production seamless, as you can write quick, throwaway exploratory code and move to maintainable, easy-to-share code quickly.
In this article, you will learn the following:
- Introduction to Kedro
- Core concepts of Kedro
- Step-by-step tutorial on how to install Kedro
- Step-by-step tutorial on an AG News classification task using Kedro
This article was published as a part of the Data Science Blogathon.
Installing Kedro
Kedro can be installed from the PyPI repository using the following commands:
pip install kedro      # core package
pip install kedro-viz  # a plugin for visualization
It can also be installed using conda with the following command:
conda install -c conda-forge kedro
To confirm that Kedro is installed, type the following command in the command line; you can verify the installation by seeing an ASCII art graphic with the Kedro version number:

kedro info
What is a Node?
In Kedro, a node is a wrapper for a pure Python function that names the inputs and outputs of that function. Nodes are the building block of a pipeline, and the output of one node can be the input of another.
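To make this concrete, here is a minimal sketch of the idea behind a node, using a plain Python stand-in rather than Kedro's actual `node` class (in a real project you would use `kedro.pipeline.node`); the function and dataset names are made up for illustration:

```python
# A toy stand-in for Kedro's node concept: a pure function plus
# explicitly named inputs and outputs (NOT Kedro's real API).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ToyNode:
    func: Callable       # the pure Python function being wrapped
    inputs: List[str]    # names of datasets this node consumes
    outputs: List[str]   # names of datasets this node produces

def lowercase_titles(titles):
    """A pure function: the same input always yields the same output."""
    return [t.lower() for t in titles]

# Wrapping the function names its input and output datasets,
# which is what lets a pipeline wire nodes together later.
node = ToyNode(func=lowercase_titles,
               inputs=["ag_news_titles"],
               outputs=["ag_news_titles_lower"])

print(node.func(["Wall St. Bears", "Oil Prices UP"]))
# ['wall st. bears', 'oil prices up']
```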
What is a Pipeline?
A pipeline organizes the dependencies and execution order of a collection of nodes and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies and does not necessarily run the nodes in the order in which they are passed.
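To illustrate how execution order follows data dependencies rather than declaration order, here is a small self-contained sketch (plain Python, not Kedro's actual `Pipeline` class) that orders nodes by matching outputs to inputs:

```python
# Toy dependency resolution: nodes are dicts of (name, inputs, outputs),
# and execution order is derived from which outputs feed which inputs.
def resolve_order(nodes):
    produced = {}                       # dataset name -> producing node
    for n in nodes:
        for out in n["outputs"]:
            produced[out] = n["name"]
    order, done, pending = [], set(), list(nodes)
    while pending:
        for n in pending:
            # a node is runnable once every pipeline-produced input is done
            deps = {produced[i] for i in n["inputs"] if i in produced}
            if deps <= done:
                order.append(n["name"])
                done.add(n["name"])
                pending.remove(n)
                break
        else:
            raise ValueError("circular dependency")
    return order

# Nodes deliberately passed in the "wrong" order:
nodes = [
    {"name": "train_model",   "inputs": ["features"], "outputs": ["model"]},
    {"name": "make_features", "inputs": ["raw_news"], "outputs": ["features"]},
]
print(resolve_order(nodes))   # ['make_features', 'train_model']
```

Even though `train_model` was passed first, it runs second because its input `features` is produced by `make_features`, which mirrors how Kedro resolves node order.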
What is a Data Catalog?
The Kedro Data Catalog is the registry of all data sources that the project can use to manage loading and saving data. It maps the names of node inputs and outputs as keys in a DataCatalog, a Kedro class that can be specialized for different types of data storage.
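Conceptually, the catalog is a mapping from dataset names to load/save logic, so nodes refer to data by name instead of by file path. A rough plain-Python stand-in (Kedro's real `DataCatalog` additionally handles typed datasets, versioning, and YAML configuration) might look like:

```python
# A toy data catalog: maps dataset names to save/load behaviour.
# In-memory CSV text stands in for files on disk.
import csv
import io

class ToyCatalog:
    def __init__(self):
        self._stores = {}                      # name -> raw CSV text

    def save(self, name, rows):
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        self._stores[name] = buf.getvalue()

    def load(self, name):
        return list(csv.reader(io.StringIO(self._stores[name])))

catalog = ToyCatalog()
catalog.save("ag_news_train", [["1", "Wall St. Bears", "Short-sellers see green"]])
print(catalog.load("ag_news_train")[0][1])     # Wall St. Bears
```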
Project Data Structure
The default template followed by Kedro to store datasets, notebooks, configurations, and source code is shown below. This project structure makes the project easier to maintain and collaborate on. It can also be customized to suit your needs.
project-dir            # Parent directory of the template
├── .gitignore         # Hidden file that prevents staging of unnecessary files to `git`
├── conf               # Project configuration files
├── data               # Local project data (not committed to version control)
├── docs               # Project documentation
├── logs               # Project output logs (not committed to version control)
├── notebooks          # Project-related Jupyter notebooks
├── pyproject.toml     # Identifies the project root and contains configuration information
├── README.md          # Project README
├── setup.cfg          # Configuration options for `pytest` when doing `kedro test`
└── src                # Project source code
Kedro Project using AG News Classification Dataset
Let’s understand how to set up and use Kedro by going through a step-by-step tutorial for a simple text classification task 🙂
Project Setup For News Classification
It is always better to create a virtual environment to prevent package conflicts. Create a new virtual environment and install Kedro using the commands above. To create a new Kedro project, enter the following command in the command line:

kedro new
Fill in the name of the project as “kedro-agnews-tf” in the interactive shell. Then, go to the project and install the initial project dependencies using the command:
cd kedro-agnews-tf
pip install tensorflow
pip install scikit-learn
pip install mlxtend
pip freeze > requirements.txt  # update requirements file
We can set up logging, credentials, and sensitive information in the ‘conf’ folder of the project. We do not need any of these in our development project, but they become crucial in production environments.
Data Setup for News Classification
Now, we set up the data for our development workflow. The ‘data’ folder in the project directory hosts multiple sub-folders to store the project data. This structure is based on the layered data-engineering convention as a model of managing data (For in-depth information, check out this blogpost). We store the AG News Subset data (downloaded from here) in the ‘raw’ sub-folder. The processed data goes into other sub-folders like ‘intermediate’, and ‘feature’; the trained model goes into the ‘model’ sub-folder; model outputs and metrics go into ‘model_output’ and ‘reporting’ sub-folders respectively.
Then, we need to register the dataset with the Kedro Data Catalog, i.e. reference it in the ‘conf/base/catalog.yml’ file, which makes our project reproducible by sharing the data definitions across the complete project pipeline. Add this code to the ‘conf/base/catalog.yml’ file (note: we can also add it to the ‘conf/local/catalog.yml’ file):
# in conf/base/catalog.yml
ag_news_train:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/train.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']

ag_news_test:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/test.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']
Testing Registered Dataset
To test whether Kedro can load the data, type the following command in the command line to start an IPython session:

kedro ipython
Type the following in the IPython session:
# train data
ag_news_train_data = catalog.load("ag_news_train")
ag_news_train_data.head()

# test data
ag_news_test_data = catalog.load("ag_news_test")
ag_news_test_data.head()
After validating the output, close the IPython session using the command exit(). This shows that the data has been registered with Kedro successfully. Now, we move on to the pipeline creation stage, where we create the Data Processing and Data Science pipelines.
Now, we create Python functions as nodes and assemble them into pipelines.
Data Processing Pipeline
In the terminal from project root directory, run the following command to generate a new pipeline for data processing:
kedro pipeline create data_processing
This generates boilerplate files for the new pipeline, including nodes.py, pipeline.py, and a data_processing.yml configuration file.
The steps to be followed are:
- Add data preprocessing nodes (python functions) to nodes.py
- Assemble the nodes in the pipeline.py
- Add configurations in data_processing.yml file
- Register the preprocessed data into conf/base/catalog.yml
To keep this blog succinct, I have not included the code for each of these files here. You can check out the code that needs to be added for each file in my GitHub repository here.
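To give a flavour of what a preprocessing node in nodes.py might look like, here is a simplified, hypothetical sketch operating on plain Python lists (the actual repository code works on the DataFrames registered in the catalog):

```python
# Hypothetical AG News preprocessing node: merges title and description
# into one lowercase text field and shifts class indices to start at 0.
from typing import Dict, List

def preprocess_news(records: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Pure function suitable for wrapping in a Kedro node."""
    processed = []
    for r in records:
        text = f"{r['Title']} {r['Description']}".lower()
        processed.append({
            "label": int(r["ClassIndex"]) - 1,   # AG News labels 1-4 -> 0-3
            "text": text,
        })
    return processed

sample = [{"ClassIndex": "3", "Title": "Oil Prices Climb",
           "Description": "Crude futures rose on supply worries."}]
print(preprocess_news(sample))
# [{'label': 2, 'text': 'oil prices climb crude futures rose on supply worries.'}]
```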
Run the following command to validate if you are able to execute the data processing pipeline without any errors:
kedro run --pipeline=data_processing
The above command generates data in the ‘data/02_intermediate’ and ‘data/03_primary’ folders.
Data Science Pipeline
In the terminal from project root directory, run the following command to generate a new pipeline for data science:
kedro pipeline create data_science
This command generates files similar to those created for the data processing pipeline, but this time for the data science pipeline.
The steps to be followed are:
- Add model training and evaluation nodes (python functions) to nodes.py
- Assemble the nodes in the pipeline.py
- Add configurations in data_science.yml file
- Register the model and results into conf/base/catalog.yml
You can check out the code that needs to be added for each file in my GitHub repository here.
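As an illustration, an evaluation node might compute metrics from the model's predictions. Here is a simplified, hypothetical version using only the standard library (the repository's actual code evaluates the trained TensorFlow model; this just shows the shape of such a pure function):

```python
# Hypothetical evaluation node: computes accuracy and per-class support
# from true and predicted labels, returning a dict that Kedro could
# save to the reporting layer.
from collections import Counter
from typing import Dict, List

def evaluate_model(y_true: List[int], y_pred: List[int]) -> Dict:
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {
        "accuracy": correct / len(y_true),
        "support_per_class": dict(Counter(y_true)),
    }

metrics = evaluate_model(y_true=[0, 1, 2, 2], y_pred=[0, 1, 2, 3])
print(metrics["accuracy"])   # 0.75
```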
Run the following command to validate if you are able to execute the data science pipeline without any errors:
kedro run --pipeline=data_science
The above command generates the model and results in the ‘data/06_models’ and ‘data/08_reporting’ folders respectively.
This completes the data science pipeline. If you are interested in further building project documentation, use Sphinx to build the documentation of your kedro project.
The data folder contains different datasets, from raw data to intermediate data, features, models, etc. It is highly advisable to use DVC (Data Version Control) to track this folder, so that large data artifacts can be versioned without being committed to git.
We can visualize our complete Kedro project pipeline using Kedro-Viz, a plugin built by the Kedro developers. We have already installed this package during the initial installation (pip install kedro-viz). To visualize our Kedro project, run the following command in the terminal from the project root directory:

kedro viz
This command opens a browser tab to serve the visualization (http://127.0.0.1:4141/). The below image shows the visualization of our kedro-agnews project:
You can click on each of the nodes and datasets in the visualization to get more details about them. The visualization can also be refreshed dynamically whenever a Python or YAML file changes in the project, by adding the --autoreload option to the command.
To package the project, run the following in the project root directory:

kedro package
It builds the package into the ‘dist’ folder of your project and creates one .egg file and one .whl file, which are Python packaging formats for binary distribution.
Deploying Kedro Project
To deploy its pipelines, we can use Kedro plugins that target various deployment platforms:
- Kedro-Docker: For packaging and shipping Kedro projects within Docker containers
- Kedro-Airflow: For converting Kedro projects into Airflow DAGs
- Third-party plugins: Community-developed plugins for various deployment targets like AWS Batch, Prefect, AWS SageMaker, Azure ML Pipelines, etc.
To summarize briefly, Kedro has many features that help you from the development stage through to the production of your ML workflow. To run the project directly, you can check out my GitHub repository here and run the following commands:
git clone https://github.com/dheerajnbhat/kedro-agnews-tf.git
cd kedro-agnews-tf
tar -xzvf data/01_raw/ag_news_csv.tar.gz --directory data/01_raw/
pip install -r src/requirements.txt
kedro run

# for visualization
kedro viz
The key takeaways from this article are:
- Understanding the capabilities Kedro can offer for ML production
- Understanding the core concepts of Kedro
- Steps to install and use Kedro
- A walk-through tutorial using Kedro on the AG News classification task
I hope this will help you get started with Kedro 🙂
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.