Crop Yield Prediction Using Machine Learning And Flask Deployment

Avikumar talaviya Last Updated : 22 Jan, 2024

12 min read

Introduction

Crop yield prediction is an essential predictive analytics technique in the agriculture industry. It helps farmers and farming businesses estimate the yield of a crop in a particular season and decide when to plant and when to harvest for a better yield. Predictive analytics is a powerful tool for improving decision-making in agriculture: it can be used for crop yield prediction, risk mitigation, reducing the cost of fertilizers, and more. In this project, crop yield prediction using ML and Flask deployment analyzes factors such as weather conditions, bee densities, fruit set, and fruit mass.


Learning Objectives

We will briefly go through the end-to-end project to predict crop yield using pollination simulation modeling.
We will follow each step of the data science project lifecycle including data exploration, pre-processing, modeling, evaluation, and deployment.
Finally, we will deploy the model using a Flask API on a cloud service platform called Render.

So let’s get started with this exciting real-world problem statement.

This article was published as a part of the Data Science Blogathon.

Introduction
Project Description of Crop Yield Prediction
What is the Pollination Simulation Model?
Problem Statement
Pre-requisites
Data Description
Loading Dataset
Exploratory Data Analysis
Data Pre-processing and Data Preparation
Modeling and Evaluation
Deployment of the Model Using FlaskAPI
Conclusion
Frequently Asked Questions

Project Description of Crop Yield Prediction

The dataset used for this project was generated using a spatially explicit simulation computing model to analyze and study various factors that affect wild blueberry yield prediction, including:

Plant spatial arrangement
Outcrossing and self-pollination
Bee species compositions
Weather conditions (in isolation and in combination)

The simulation studies how these factors affect pollination efficiency and the yield of the wild blueberry in the agricultural ecosystem.

The simulation model has been validated by field observation and experimental data collected in Maine, USA, and the Canadian Maritimes during the last 30 years, and it is now a useful tool for hypothesis testing and estimation of wild blueberry yield. The simulated data gives researchers a dataset, validated against data collected from the field, for experiments on crop yield prediction, and gives developers and data scientists data to build real-world machine learning models for crop yield prediction.

A simulated wild blueberry field

What is the Pollination Simulation Model?

Pollination simulation modeling is the process of using computer models to simulate the process of pollination. There are various use cases of pollination simulation such as:

Studying the effects of different factors on pollination, such as climate change, habitat loss, and pesticides
Designing pollination-friendly landscapes
Predicting the impact of pollination on crop yields

Pollination simulation models can be used to study the movement of pollen grains between flowers, the timing of pollination events, and the effectiveness of different pollination strategies. This information can be used to improve pollination rates and crop yields which can further help farmers to produce crops effectively with optimal yield.

Pollination simulation models are still under development, but they have the potential to play an important role in the future of agriculture. By understanding how pollination works, we can better protect and manage this essential process.

In our project, we will use a dataset with various features like ‘clonesize’, ‘honeybee’, ‘RainingDays’, ‘AverageRainingDays’, etc., which were created using a pollination simulation process to estimate crop yield.

Problem Statement

In this project, our task is to predict the yield variable (the target feature) from the other 16 features, working through the project step by step. Because the target is continuous, this is a regression problem, and the evaluation metric will be RMSE (root mean squared error). We will deploy the model using Python’s Flask framework on a cloud-based platform.
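Since RMSE is the metric we will track throughout the project, here is a minimal sketch of how it is computed. The helper name `rmse` and the yield values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up yields, for illustration only (not taken from the dataset)
observed = [6000.0, 5500.0, 7000.0]
predicted = [6100.0, 5300.0, 7000.0]

# errors are -100, 200 and 0, so the mean squared error is 50000 / 3
example_rmse = rmse(observed, predicted)
print(round(example_rmse, 2))  # 129.1
```

RMSE is expressed in the same unit as the target, so a value of about 129 here would mean predictions are off by roughly 129 yield units on average, with larger errors weighted more heavily.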

Pre-requisites

This project is well-suited for intermediate learners of data science and machine learning who want to build portfolio projects. Beginners in the field can take up this project if they are familiar with the skills below:

Knowledge of Python programming language, and machine learning algorithms using the scikit-learn library
Basic understanding of website development using Python’s Flask framework
Understanding of Regression evaluation metrics

Flask Python Tutorial for Data Science Professionals

Data Description

In this section, we will look at each variable of the dataset for our project.

Clonesize — m2 — The average blueberry clone size in the field
Honeybee — bees/m2/min — Honeybee density in the field
Bumbles — bees/m2/min — Bumblebee density in the field
Andrena — bees/m2/min — Andrena bee density in the field
Osmia — bees/m2/min — Osmia bee density in the field
MaxOfUpperTRange — ℃ —The highest record of the upper band daily air temperature during the bloom season
MinOfUpperTRange — ℃ — The lowest record of the upper band daily air temperature
AverageOfUpperTRange — ℃ — The average of the upper band daily air temperature
MaxOfLowerTRange — ℃ — The highest record of the lower band daily air temperature
MinOfLowerTRange — ℃ — The lowest record of the lower band daily air temperature
AverageOfLowerTRange — ℃ — The average of the lower band daily air temperature
RainingDays — Day — The total number of days during the bloom season, each of which has precipitation larger than zero
AverageRainingDays — Day — The average of rainy days in the entire bloom season
Fruitset — Transitioning time of fruit set
Fruitmass — Mass of the fruit set
Seeds — Number of seeds in fruitset
Yield — Crop yield (A target variable)

What is the value of this data for crop prediction use-case?

This dataset provides practical information on wild blueberry plant spatial traits, bee species, and weather situations. Therefore, it enables researchers and developers to build machine learning models for early prediction of blueberry yield.
This dataset can be essential for other researchers who have field observation data but want to test and evaluate the performance of different machine learning algorithms by comparing real data against computer-simulation-generated data as input for crop yield prediction.
Educators at different levels can use the dataset to teach machine learning classification or regression problems in the agricultural industry.

Loading Dataset

In this section, we will load the dataset in whichever environment you are working in. You can either use the dataset directly in the Kaggle environment or download it to your local machine and run the code locally.

Dataset source: the Wild Blueberry Pollination Simulation dataset on Kaggle

Let’s look at the code to load the dataset and load the libraries for the project.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression, SelectKBest
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold 
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor 
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
import statsmodels.api as sm
from xgboost import XGBRegressor
import shap
import joblib  # used later to save the trained model

# setting up os env in kaggle 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# read the csv file and display the first 5 rows
df = pd.read_csv(
    "/kaggle/input/wildblueberrydatasetpollinationsimulation/"
    "WildBlueberryPollinationSimulationData.csv",
    index_col='Row#')
df.head()

The output of the above code

# print the metadata of the dataset
df.info()

# data description
df.describe()

Here, ‘df.info()’ provides a summary of the dataframe, including the number of rows, the non-null count, and the datatype of each variable, while ‘df.describe()’ provides descriptive statistics such as the count, mean, standard deviation, minimum, maximum, and percentiles (including the median) of each variable in the dataset.

Exploratory Data Analysis

In this section, we will look at the exploratory data analysis of the crops dataset and derive insights from the dataset.

Heatmap of the Dataset

# create featureset and target variable from the dataset
features_df = df.drop('yield', axis=1)
tar = df['yield']

# plot the heatmap from the dataset
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1)
plt.show()

The output of the above code: heatmap of the dataset

The above plot visualizes the correlation coefficients between the variables of the dataset. Using Python’s seaborn library, we can produce it in just three lines of code.
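A large heatmap can be hard to read, so it can help to list the most strongly correlated pairs as well. The sketch below is one possible way to do that. The helper `top_correlated_pairs` and the toy dataframe are hypothetical stand-ins, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no duplicates
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # strongest relationships first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Toy stand-in for the real dataframe (synthetic values, for illustration only)
rng = np.random.default_rng(0)
upper = rng.uniform(20, 35, size=200)
toy = pd.DataFrame({
    "MaxOfUpperTRange": upper,
    "MinOfUpperTRange": upper * 0.6 + rng.normal(0, 0.1, size=200),
    "honeybee": rng.uniform(0, 1, size=200),
})

pairs = top_correlated_pairs(toy, threshold=0.9)
for col_a, col_b, r in pairs:
    print(col_a, col_b, round(r, 3))
```

On the real data, calling the same helper on `df` would give a ranked list that can be compared against the heatmap.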

Distribution of the Target Variable

# plot the boxplot using seaborn library of the target variable 'yield'
plt.figure(figsize=(5,5))
sns.boxplot(x='yield', data=df)
plt.show()

The above code displays the distribution of the target variable using a box plot. We can see that the median of the distribution is at about 6000, with a couple of outliers at the lowest yields.
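The points a box plot draws as outliers are, by default, the values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A small sketch of that rule is shown below. The helper `iqr_outliers` and the yield values are made up for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the bounds."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

# Made-up yields, for illustration only
toy_yield = [1600, 5200, 5600, 5900, 6000, 6100, 6400, 6900, 7200]
outliers, bounds = iqr_outliers(toy_yield)

print(outliers)  # [1600.]
print(bounds)    # lower bound 4400.0, upper bound 7600.0
```

Applied to `df['yield']`, the same function would return the low-yield rows that appear as isolated points in the plot.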

Distribution by the Categorical Features of the Dataset

# matplotlib subplot for the categorical feature 
nominal_df = df[['MaxOfUpperTRange','MinOfUpperTRange','AverageOfUpperTRange','MaxOfLowerTRange',
               'MinOfLowerTRange','AverageOfLowerTRange','RainingDays','AverageRainingDays']]

fig, ax = plt.subplots(2,4, figsize=(20,13))
for e, col in enumerate(nominal_df.columns):
    if e<=3:
        sns.boxplot(data=df, x=col, y='yield', ax=ax[0,e])
    else:
        sns.boxplot(data=df, x=col, y='yield', ax=ax[1,e-4])       
plt.show()

Distribution of Types of Bees in our Dataset

# matplotlib subplot technique to plot distribution of bees in our dataset
plt.figure(figsize=(15,10))
plt.subplot(2,3,1)
plt.hist(df['bumbles'])
plt.title("Histogram of bumbles column")
plt.subplot(2,3,2)
plt.hist(df['andrena'])
plt.title("Histogram of andrena column")
plt.subplot(2,3,3)
plt.hist(df['osmia'])
plt.title("Histogram of osmia column")
plt.subplot(2,3,4)
plt.hist(df['clonesize'])
plt.title("Histogram of clonesize column")
plt.subplot(2,3,5)
plt.hist(df['honeybee'])
plt.title("Histogram of honeybee column")
plt.show()

Let’s note down some observations from the above analysis:

Upper and lower T-range columns correlate with each other
Rainy days and average rainy days correlate with each other
‘Fruitmass’, ‘fruitset’, and ‘seeds’ are correlated
The ‘bumbles’ column is highly imbalanced, while the ‘andrena’ and ‘osmia’ columns are not
‘Honeybee’ is also an imbalanced column compared to ‘clonesize’

Data Pre-processing and Data Preparation

In this section, we will pre-process the dataset for modeling. We will use ‘mutual info regression’ to select the best features from the dataset, perform clustering on the types of bees in our dataset, and normalize the dataset for efficient machine learning modeling.

Mutual Info Regression

# run the MI scores of the dataset
mi_score = mutual_info_regression(features_df, tar, n_neighbors=3,random_state=42)
mi_score_df = pd.DataFrame({'columns':features_df.columns, 'MI_score':mi_score})
mi_score_df.sort_values(by='MI_score', ascending=False)

The above code estimates the mutual information between each feature and the target variable. Unlike Pearson’s correlation coefficient, which only measures linear association, mutual information is estimated here with a nearest-neighbor method (‘n_neighbors=3’) and can also capture non-linear dependence. Sorting the scores in descending order shows which features are most informative about the target. Now we will cluster the types of bees to create a new feature.
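To see how mutual information differs from a plain linear correlation, here is a small sketch on synthetic data (not the blueberry dataset). The target depends on the first feature through a square, which a linear correlation largely misses, while the second feature is pure noise. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
n = 2000

x_nonlinear = rng.uniform(-1, 1, size=n)  # drives the target through x**2
x_noise = rng.uniform(-1, 1, size=n)      # unrelated to the target
y_syn = x_nonlinear ** 2 + rng.normal(0, 0.01, size=n)

X_syn = np.column_stack([x_nonlinear, x_noise])
mi = mutual_info_regression(X_syn, y_syn, n_neighbors=3, random_state=42)
pearson_r = np.corrcoef(x_nonlinear, y_syn)[0, 1]

print("MI scores:", mi)
print("Pearson r for the non-linear feature:", pearson_r)
```

The first feature gets a clearly higher MI score than the noise feature even though its linear correlation with the target is close to zero, which is why MI scores are useful for feature selection when relationships may be non-linear.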

Clustering Using K-means

# clustering using kmeans algorithm
X_clus = features_df[['honeybee','osmia','bumbles','andrena']]

# standardize the dataset using standard scaler
scaler = StandardScaler()
scaler.fit(X_clus)
X_new_clus = scaler.transform(X_clus)

# K means clustering 
clustering = KMeans(n_clusters=3, random_state=42)
clustering.fit(X_new_clus)
n_cluster = clustering.labels_

# add new feature to feature_Df 
features_df['n_cluster'] = n_cluster
df['n_cluster'] = n_cluster
features_df['n_cluster'].value_counts()

---------------------------------[Output]----------------------------------
1    368
0    213
2    196
Name: n_cluster, dtype: int64

The above code standardizes the dataset and then applies the clustering algorithm to group the rows into 3 different groups.
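The number of clusters is fixed to 3 in the code above. If you want to check that choice, one common option is to compare silhouette scores for several values of k. The sketch below does this on synthetic, well-separated points. It is an illustration of the technique, not a result for the bee columns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three tight groups of 100 points each (illustration only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in centers])
points_scaled = StandardScaler().fit_transform(points)

# Silhouette score for each candidate k (higher is better)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points_scaled)
    scores[k] = silhouette_score(points_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Replacing `points_scaled` with `X_new_clus` would run the same check on the standardized bee features.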

Data Normalization Using Min-Max Scaler

features_set = ['AverageRainingDays','clonesize','AverageOfLowerTRange',
               'AverageOfUpperTRange','honeybee','osmia','bumbles','andrena','n_cluster']

# final dataframe  
X = features_df[features_set]
y = tar.round(1)

# scale the selected features to the [0, 1] range for modeling
mx_scaler = MinMaxScaler()
X_scaled = pd.DataFrame(mx_scaler.fit_transform(X), columns=X.columns, index=X.index)

The above code produces the normalized feature set ‘X_scaled’ and the target variable ‘y’, which will be used for modeling.
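Min-Max scaling maps each column to the [0, 1] range using (x - min) / (max - min). The sketch below reproduces that formula by hand on synthetic data and also shows a common variation in which the scaler is fitted on the training rows only, so that the test rows do not influence the scaling. This variation is an assumption added for illustration and differs from the code above, which scales the full dataset before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature matrix, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 3))

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

# Fit on the training rows only, then reuse the same min/max for the test rows
toy_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

# The same transformation written out by hand
col_min = X_tr.min(axis=0)
col_max = X_tr.max(axis=0)
manual = (X_tr - col_min) / (col_max - col_min)

print(np.allclose(manual, X_tr_scaled))  # True
```

With this variation, scaled test values can fall slightly outside [0, 1] when a test row is more extreme than anything seen in training, which is expected.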

Modeling and Evaluation

In this section, we will look at machine learning modeling using gradient boosting and hyperparameter tuning to improve the performance of the model. We will also use the SHAP model explainer to visualize which features are most important for our target, crop yield prediction.

Machine Learning Modeling Baseline

# let's fit the data to models like AdaBoost, gradient boosting and random forest
model_dict = {"abr": AdaBoostRegressor(), 
              "gbr": GradientBoostingRegressor(), 
              "rfr": RandomForestRegressor()
             }

# Cross-validation scores of the models (RMSE from the mean MSE over 5 folds)
for key, val in model_dict.items():
    print(f"cross validation for {key}")
    score = cross_val_score(val, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
    mean_score = -np.mean(score)
    sqrt_score = np.sqrt(mean_score)
    print(sqrt_score)

-----------------------------------[Output]------------------------------------
cross validation for abr
730.974385377955
cross validation for gbr
528.1673164806733
cross validation for rfr
608.0681265123212

In the above baseline modeling, the gradient boosting regressor has the lowest cross-validated RMSE, while the AdaBoost regressor has the highest. Now, we will train the gradient boosting model and evaluate the error on a hold-out set created with scikit-learn’s ‘train_test_split’ function.

# split the train and test data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# gradient boosting regressor modeling
bgt = GradientBoostingRegressor(random_state=42)
bgt.fit(X_train, y_train)
preds = bgt.predict(X_test)
train_r2 = bgt.score(X_train, y_train)  # R2 on the training data, for reference
rmse_score = np.sqrt(mean_squared_error(y_test, preds))
# use a separate name so the imported r2_score function is not overwritten
r2 = r2_score(y_test, preds)
print("RMSE score gradient boosting machine:", rmse_score)
print("R2 score for the model: ", r2)

-----------------------------[Output]-------------------------------------------
RMSE score gradient boosting machine: 363.18286194620714
R2 score for the model:  0.9321362721127562

Here, we can see that the RMSE of the gradient boosting model without hyperparameter tuning is about 363 on the test set, and its R2 is around 0.93, meaning the model explains about 93% of the variance in yield. Note that this figure comes from a single train/test split, so it is not directly comparable with the cross-validated baseline scores above. Next, we will tune the hyperparameters to further reduce the error of the model.
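To make the R2 figure more concrete, the sketch below computes it by hand as 1 - SS_res / SS_tot on a few made-up numbers and checks the result against scikit-learn’s `r2_score`. The values are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                 # 0.975
print(r2_score(y_true, y_hat))   # same value from scikit-learn
```

An R2 of 0.975 here means the predictions leave only 2.5% of the variance in the target unexplained.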

Hyperparameters Tuning

# K-fold split the dataset
kf = KFold(n_splits = 5, shuffle=True, random_state=0)

# params grid for tuning the hyperparameters
param_grid = {'n_estimators': [100,200,400,500,800],
             'learning_rate': [0.1,0.05,0.3,0.7],
             'min_samples_split': [2,4],
             'min_samples_leaf': [0.1,0.4],
             'max_depth': [3,4,7]
             }

# GBR estimator object 
estimator = GradientBoostingRegressor(random_state=42)

# Grid search CV object 
clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=kf, 
                   scoring='neg_mean_squared_error', n_jobs=-1)
clf.fit(X_scaled,y)

# print the best estimator and its cross-validated RMSE
best_estim = clf.best_estimator_
best_score = clf.best_score_
best_param = clf.best_params_
print("Best Estimator:", best_estim)
print("Best score:", np.sqrt(-best_score))

-----------------------------------[Output]----------------------------------
Best Estimator: GradientBoostingRegressor(max_depth=7, min_samples_leaf=0.1, 
                                          n_estimators=500, random_state=42)
Best score: 306.57274619213206

We can see that the cross-validated RMSE of the tuned gradient boosting model (about 307) is lower than that of the untuned baseline (about 528), and we now have optimized parameters for our ML model.
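Grid search tries every combination in the grid, so it is worth checking how many model fits a grid implies before running it. The sketch below counts the combinations for the grid used above with scikit-learn’s `ParameterGrid`. The variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning code above
param_grid = {'n_estimators': [100, 200, 400, 500, 800],
              'learning_rate': [0.1, 0.05, 0.3, 0.7],
              'min_samples_split': [2, 4],
              'min_samples_leaf': [0.1, 0.4],
              'max_depth': [3, 4, 7]}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 4 * 2 * 2 * 3
n_fits = n_candidates * 5                      # 5-fold cross-validation

print(n_candidates)  # 240
print(n_fits)        # 1200
```

With 240 candidates and 5 folds, the search trains 1200 models, plus one final refit on the full data. If that is too slow, a randomized search over the same grid is a common alternative.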

Shap Model Explainer

Machine learning explainability is a very important aspect of ML modeling today. While ML models have given promising results in many domains, their inherent complexity makes it challenging to understand how they arrive at certain predictions or decisions. The SHAP library uses Shapley values to measure how much each feature contributes to the predicted values. Now let’s look at the SHAP model explainer plots for our gradient boosting model.

# SHAP tree explainer
shap_tree = shap.TreeExplainer(bgt)
shap_values = shap_tree.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

In the above output plot, it is clear that AverageRainingDays is the most influential variable in explaining the predicted values of the target variable, while the andrena feature has the least effect on the predictions.
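As a cross-check on a SHAP ranking, you can also compute permutation importance, which is a different technique available in scikit-learn: it shuffles one feature at a time and measures how much the test score drops. The sketch below uses synthetic data in which the first feature matters most and the third not at all. It illustrates the method and is not a result for the blueberry model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2 (illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(500, 3))
y_syn = 10 * X_syn[:, 0] + 2 * X_syn[:, 1] + rng.normal(0, 0.1, size=500)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)
model = GradientBoostingRegressor(random_state=42).fit(X_a, y_a)

# Mean drop in R2 on held-out data when each feature is shuffled
result = permutation_importance(model, X_b, y_b, n_repeats=10, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]

print(result.importances_mean)
print("features from most to least important:", ranking)
```

Running the same call with `bgt`, `X_test` and `y_test` would give a ranking to compare with the SHAP summary plot.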

Deployment of the Model Using FlaskAPI

In this section, we will deploy the machine learning model using a Flask API on a cloud service platform called render.com. Prior to deployment, we need to save the trained model to a file with the ‘.joblib’ extension so that the API deployed on the cloud can load it.

Saving the Model File

import joblib

# remove the 'n_cluster' feature from the dataset
X_train_n = X_train.drop('n_cluster', axis=1)
X_test_n = X_test.drop('n_cluster', axis=1)

# train a model for Flask API creation
xgb_model = XGBRegressor(max_depth=9, min_child_weight=7, subsample=1.0)
xgb_model.fit(X_train_n, y_train)
pr = xgb_model.predict(X_test_n)
err = mean_absolute_error(y_test, pr)
rmse_n = np.sqrt(mean_squared_error(y_test, pr))

# after training, save the model using joblib library
joblib.dump(xgb_model, 'wbb_xgb_model2.joblib')

As you can see, we have saved the model file in the above code. Now we will write the Flask app file and upload it, together with the model file, to the GitHub repo.
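One thing to keep in mind is that the model above is trained on Min-Max-scaled features, so inputs at prediction time need to be scaled in the same way. One possible way to handle this, shown here as a sketch under assumptions and not as the approach used in the project’s repository, is to bundle the scaler and the model in a scikit-learn `Pipeline` and save that single object. In this sketch, `GradientBoostingRegressor` stands in for `XGBRegressor`, the data is synthetic, and the file name `yield_pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic raw (unscaled) data with 8 features, for illustration only
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 50, size=(200, 8))
y_raw = X_raw @ np.arange(1, 9) + rng.normal(0, 1, size=200)

# The pipeline scales the inputs and then predicts, in one object
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", GradientBoostingRegressor(random_state=42)),
])
pipe.fit(X_raw, y_raw)

# Save and reload the whole pipeline (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "yield_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw inputs and gives the same predictions
sample = X_raw[:5]
same_predictions = np.allclose(pipe.predict(sample), restored.predict(sample))
print(same_predictions)  # True
```

With a saved pipeline like this, the prediction function can pass user inputs straight to `predict` without a separate scaling step.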

Application Repository Structure

Screenshot of the app repository

The above image is the snapshot of the application repository which contains the following files and directories.

app.py — Flask application file
model.py — Model prediction file
requirements.txt — Application dependencies
Model directory — Saved model files
templates directory — Front-end UI file

app.py file

from flask import Flask, render_template, Response
from flask_restful import reqparse, Api
import flask

import numpy as np
import pandas as pd
import ast

import os
import json

from model import predict_yield

curr_path = os.path.dirname(os.path.realpath(__file__))

feature_cols = ['AverageRainingDays', 'clonesize', 'AverageOfLowerTRange',
    'AverageOfUpperTRange', 'honeybee', 'osmia', 'bumbles', 'andrena']

context_dict = {
    'feats': feature_cols,
    'zip': zip,
    'range': range,
    'len': len,
    'list': list,
}

app = Flask(__name__)
api = Api(app)

# # FOR FORM PARSING
parser = reqparse.RequestParser()
parser.add_argument('list', type=list)

@app.route('/api/predict', methods=['GET','POST'])
def api_predict():
    data = flask.request.form.get('single input')
    
    # parse the submitted string (a Python list literal) into a list of numbers
    i = ast.literal_eval(data)
    
    y_pred = predict_yield(np.array(i).reshape(1,-1))
    
    return {'message':"success", "pred":json.dumps(int(y_pred))}

@app.route('/')
def index():
    
    # render the index.html template
    
    return render_template("index.html", **context_dict)

@app.route('/predict', methods=['POST'])
def predict():
    # collect the submitted form values in the order the form sends them
    test_data = []
    for val in flask.request.form.values():
        test_data.append(float(val))
    test_data = np.array(test_data).reshape(1,-1)

    y_pred = predict_yield(test_data)

    print(y_pred)

    # pass the prediction per request instead of storing it in the shared dict
    return render_template('index.html', pred=y_pred, **context_dict)

if __name__ == "__main__":
    app.run()

The above code is the Flask application file that takes the input from users and displays the crop yield prediction on the front end.
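The `predict` route relies on the form sending its values in the same order as the model’s features. A small helper that reads the values by name makes this explicit and reports missing or non-numeric fields. The sketch below assumes that the form field names match the feature names, which is an assumption because the template is not shown here. The helper `parse_features` is hypothetical and is not part of the project’s repository.

```python
FEATURE_COLS = ['AverageRainingDays', 'clonesize', 'AverageOfLowerTRange',
                'AverageOfUpperTRange', 'honeybee', 'osmia', 'bumbles', 'andrena']

def parse_features(form):
    """Return the feature values as floats, ordered like FEATURE_COLS."""
    missing = [name for name in FEATURE_COLS if name not in form]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    values = []
    for name in FEATURE_COLS:
        try:
            values.append(float(form[name]))
        except (TypeError, ValueError):
            raise ValueError(f"field {name!r} is not a number: {form[name]!r}") from None
    return values

# Example form data in a different order than FEATURE_COLS (made-up values)
form_data = {
    'andrena': '0.5', 'bumbles': '0.25', 'osmia': '0.75', 'honeybee': '0.5',
    'AverageOfUpperTRange': '71.9', 'AverageOfLowerTRange': '50.8',
    'clonesize': '12.5', 'AverageRainingDays': '0.26',
}
row = parse_features(form_data)
print(row)  # [0.26, 12.5, 50.8, 71.9, 0.5, 0.75, 0.25, 0.5]
```

Inside the route, `parse_features(flask.request.form)` could replace the loop over `form.values()`, since the form object supports the same `in` and `[]` lookups as a dictionary.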

model.py file

import joblib 
import pandas as pd
import numpy as np
import os

# load the model file
curr_path = os.path.dirname(os.path.realpath(__file__))
xgb_model = joblib.load(curr_path + "/model/wbb_xgb_model2.joblib")

# function to predict the yield
def predict_yield(attributes: np.ndarray):
    """ Returns Blueberry Yield value"""
    # print(attributes.shape) # (1,8)

    pred = xgb_model.predict(attributes)
    print("Yield predicted")

    return pred[0]

The model.py file loads the model at runtime and returns the prediction.

Deployment on Render

Once all the files are pushed to the GitHub repository, you can create an account on render.com and connect the branch of the repo that contains the app.py file along with the other artifacts, then deploy it in seconds. Moreover, Render also provides an automatic deployment option, ensuring that any changes you make to your deployment files are automatically reflected on the website.

Screenshot of the Render cloud deployment process

You can find more information about the project and the code in the GitHub repository.

Conclusion

In this article, we learned about an end-to-end project of predicting wild blueberry yield using machine learning algorithms and deployment using a Flask API. We started by loading the dataset, followed by EDA, data pre-processing, machine learning modeling, and deployment on the cloud service platform.

Results showed that the model was able to predict crop yield with an R2 of as much as 93%. The Flask API makes it easy to access the model and use it to make predictions, which makes it accessible to a wide range of users, including farmers, researchers, and policymakers. Now let’s look at a few of the lessons learned from this article.

We learned how to define problem statements for the project and perform an end-to-end ML project pipeline.
We learned about exploratory data analysis and pre-processing of the dataset for modeling.
Finally, we applied machine learning algorithms to our feature set and deployed a model for predictions.

Frequently Asked Questions

Q1. What is crop yield prediction using machine learning?

A. Farmers and agricultural industries can utilize crop yield prediction, a machine learning application, to accurately forecast and predict specific crop yields for a given year or season. This enables them to prepare for the harvesting season and effectively manage associated costs.

Q2. Which algorithms do farmers and agricultural industries use in smart agriculture?

A. In smart agriculture, farmers and agricultural industries employ various algorithms based on their applications. Some of these algorithms include Decision Tree Regressors, Random Forest Regressors, Gradient Boosting Regressors, Deep Neural Networks, and more.

Q3. How to use AI and ML in Agriculture?

A. Use AI and ML to predict and forecast crop yield and predict the estimated cost of harvesting during a season. AI algorithms help to detect crop diseases and plant classifications for the smooth sorting and distribution of crops.

Q4. What are the parameters for yield prediction?

A. Parameters like temperature, insect composition, crop height, soil location, and various weather parameters like rainfall and humidity are used to predict the crop yield.

Q5. What are the objectives of the crop yield prediction project?

A. The main objective is to help farmers and agricultural industries grow and estimate crop yield. Another objective is to help government agencies decide the price of the crop output and take appropriate measures for the storage and distribution of the crop yield.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Avikumar talaviya

I specialize in data science and machine learning with hands-on experience in working on various end-to-end data science projects. I am the chapter co-lead of the Mumbai local chapter of Omdena. I am also a kaggle master and educator ambassador at streamlit with volunteers around the world.
