The Science of T20 Cricket: Decoding Player Performance with Predictive Modeling

Akshit Behera 14 Jun, 2023 • 16 min read


Cricket has embraced data analytics as a source of strategic advantage. With franchise leagues such as the IPL and BBL, teams rely on statistical models and tools for a competitive edge. This article explores how data analytics can inform strategy by drawing on player performances and opposition weaknesses. We use Python to predict player performances, which can aid team selection and game tactics. The analysis is also useful to fantasy cricket enthusiasts and shows how machine learning and predictive modeling are changing the sport.

Learning Objectives

This project aims to demonstrate the usage of Python and machine learning to predict player performance in T20 matches. By the end of this article, you will be able to:

  1. Understand the applications of data analytics and machine learning in cricket.
  2. Learn to collect, clean, and process data using Python libraries like Pandas and NumPy.
  3. Understand the key performance metrics in cricket and how they can predict player performance.
  4. Learn how to build a predictive model using ridge regression.
  5. Apply the concepts and techniques learned in this article to real-world scenarios in the IPL and other cricket leagues.

This article was published as a part of the Data Science Blogathon.

Project Description

We aim to predict the player performance for an upcoming IPL match using Python and data analytics. The project includes collecting, processing, and analyzing data on player and team performance in previous T20 matches. It also involves building a predictive model that can forecast player performance in the next match.

Problem Statement

The problem we aim to solve is to provide IPL team coaches, management and fantasy league enthusiasts with a tool to help them make data-driven decisions about player selection and game tactics. Traditionally, the selection of players and game tactics in cricket have been based on subjective assessments and experience. However, with the advent of data-driven analytics, one can now use statistical models to gain insights into player performance and make informed decisions about team selection and game strategies.

Our solution consists of building a predictive model that can accurately forecast player performance based on historical data. This will help individuals and teams identify the best players for the next match and devise strategies to maximize their chances of winning.


Data Collection

  1. We will first extract the recent statistics of the relevant players from a cricket statistics website called
  2. Our code would iterate over a list of players and visit the website for each player to gather their batting and bowling statistics for recent T20 matches.

Data Preparation

  1. Next, we will clean and transform the data to prepare it for predictive modelling.
  2. We will remove all irrelevant columns and treat the missing values.
  3. We will then split the dataset into separate subsets for different performance metrics,
    such as runs scored, balls played, wickets taken, etc.

Model Training

  1. We will use ridge regression, a linear regression technique, to train models for predicting future performance based on past performance.
  2. For each player, we will train separate models for predicting runs scored, balls played, overs bowled, runs given, and wickets taken.
  3. We will tune the ridge hyperparameter alpha for each model by fitting on a training split, scoring every candidate value on both the training data and the held-out test data, and keeping the value with the best average score.
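
To make the role of alpha concrete, here is a minimal sketch of ridge regression on made-up numbers, assuming NumPy is available. The helper ridge_fit and the toy arrays are hypothetical and are not part of the project code. The sketch also leaves out the intercept that scikit-learn's Ridge fits by default.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical past-match features (runs scored, balls played) and next-match runs
X = np.array([[30.0, 22.0], [12.0, 15.0], [55.0, 38.0], [8.0, 10.0], [41.0, 30.0]])
y = np.array([12.0, 55.0, 8.0, 41.0, 20.0])

w_small = ridge_fit(X, y, alpha=0.0)    # alpha = 0 is ordinary least squares
w_large = ridge_fit(X, y, alpha=100.0)  # stronger penalty

# A larger alpha shrinks the coefficient vector toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

The penalty trades a little bias for less variance, which tends to help when each player has only a handful of matches to learn from.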

Prediction and Confidence Intervals

  1. Once the models are trained, we will apply them to predict the next match’s performance for each player.
  2. This gives us the point estimates for runs scored, balls played, overs bowled, runs given, and wickets taken.
  3. Additionally, we will calculate the 95% confidence intervals for these predictions.
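
As a rough illustration of the interval calculation, the sketch below uses only the Python standard library. The function name confidence_interval, the list of runs, and the point estimate of 30 are all made up for the example. It assumes the interval is built as prediction ± z × sd / √n, where sd is the standard deviation of the player's past values and n is the number of matches used.

```python
import math
from statistics import NormalDist, stdev

def confidence_interval(prediction, past_values, level=0.95):
    """Return (lower, upper) = prediction -/+ z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    margin = z * stdev(past_values) / math.sqrt(len(past_values))
    return prediction - margin, prediction + margin

# Hypothetical runs from a player's recent matches and a made-up point estimate
recent_runs = [34, 12, 61, 0, 45, 28, 7, 52]
lower, upper = confidence_interval(30.0, recent_runs)
print(round(lower, 1), round(upper, 1))
```

Note that an interval built this way reflects how much the player's past figures vary. It is narrower than a full prediction interval for a single match would be.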


Post Processing

  1. Finally, we will perform some post-processing steps on the predicted values and confidence intervals to handle specific cases.
  2. We will adjust the predicted runs given and overs bowled based on the condition that overs bowled cannot exceed 4.
  3. We will handle cases where the predicted runs scored or runs given are negative or zero by setting a minimum value.


With the IPL 2023 season reaching its peak, cricket enthusiasts eagerly await the epic last league match between Gujarat Titans and Royal Challengers Bangalore. Determining this encounter’s outcome heavily relies on how each player performs. In our pursuit of insights into potential performances, we’ve curated a lineup of individuals who have consistently demonstrated their skillsets throughout the tournament:

  • Virat Kohli
  • Glenn Maxwell
  • Faf Du Plessis
  • Mohammed Siraj
  • Wayne Parnell
  • Shubman Gill
  • Hardik Pandya
  • Rashid Khan
  • Mohammed Shami

We will attempt to predict the performances of these players for this crucial game by using advanced statistical models and historical data.

Data Collection

We will begin the data collection and preparation by scraping for the most recent statistics of the relevant players. We structure and organize the collected data for model construction.

To begin with, we will import the required libraries, including time, pandas, numpy, and selenium. We use the Selenium library to control the Chrome web browser for web scraping.

import time
import pandas as pd
import numpy as np
from selenium import webdriver

We configure the Chrome driver by specifying the path to its executable (chrome_driver_path). The directory containing the Chrome driver is specified as webdriver_path.

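# Setting up the Chrome driver (replace the placeholders with your own paths)
chrome_driver_path = r"{Your Chromedriver path}\chromedriver.exe"
webdriver_path = r"{Your Webdriver path}\Chromedriver"

# IPython magic: $name expands the Python variable
%cd $webdriver_path
# Selenium 3 call style; Selenium 4 takes the path through a Service object instead
driver = webdriver.Chrome(chrome_driver_path)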

We then initialize an empty DataFrame named final_data which will be used to store the collected player statistics. Next, we perform a loop that iterates over the list of our relevant player names.

Code Implementation

For each player, the code performs the following steps:

  1. It builds a player-specific URL by formatting the player’s name into the URL template, and uses that URL to open the page containing the player’s statistics.
  2. Once the page has loaded, it scrolls down so that all the statistics are loaded.
  3. It extracts the player’s batting statistics from a specific table on the page. The code locates the table with an XPath expression, retrieves its text content, and parses that text into a DataFrame named batting_stats.
  4. It switches to the bowling statistics tab on the webpage and waits for a few seconds to ensure the content is loaded.
  5. It extracts the bowling statistics from a table and stores them in a DataFrame called “bowling_stats”.
  6. If the bowling statistics for the current player are not found, it creates a placeholder DataFrame filled with zeros so that the data structure stays consistent.
  7. It merges the batting and bowling statistics on the “Match” column using the pd.merge() function.
  8. It fills missing values with zeros using the fillna(0) method.
  9. It sorts the merged statistics DataFrame by the “Match” column.
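
The parsing and merging steps can be sketched on made-up table text, without a browser. The helper table_text_to_frame and both text snippets are hypothetical; the real pages have more columns, and this sketch assumes every cell is a plain number.

```python
import pandas as pd

def table_text_to_frame(text):
    """Split table text into rows (newlines) and cells (spaces); first row is the header."""
    rows = [line.split(' ') for line in text.strip().split('\n')]
    frame = pd.DataFrame(rows[1:], columns=rows[0])
    # Scraped cells arrive as strings, so convert each column to numbers
    return frame.apply(pd.to_numeric)

# Hypothetical table text standing in for the scraped batting and bowling tables
batting_text = "Match Runs Balls\n1 34 25\n2 12 14\n3 61 40"
bowling_text = "Match Overs Wickets\n2 4 1\n3 3 0"

batting_stats = table_text_to_frame(batting_text)
bowling_stats = table_text_to_frame(bowling_text)

# An outer merge keeps matches that appear in only one table; the gaps become 0
merged = (pd.merge(batting_stats, bowling_stats, on='Match', how='outer')
            .fillna(0)
            .sort_values(by='Match')
            .reset_index(drop=True))
print(merged)
```

The outer join matters for players who bat but do not bowl in a given match: their row is kept, with zeros in the bowling columns.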

Data Preparation

Once we have collected the required data, we will apply the following transformations:

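  1. To predict future performance, we create target variables that hold the player’s figures from the following match. We do this by shifting the relevant columns up by one row with the shift(-1) method, so that each row also carries the values from the next row (the following match, once the data is sorted by “Match”). This process generates columns such as “next_runs”, “next_balls”, “next_overs”, “next_runs_given”, and “next_wkts”. The most recent match has no following match, so these columns are empty in its row.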
  2. We add the player’s name as a new column called “Player” at the beginning of the DataFrame.
  3. Append the player’s statistics to the final_data DataFrame.
  4. Finally, the code filters out any rows where the “Match” column is zero, as they represent empty or invalid data.
  5. Next, the code uses NumPy’s np.where() function to handle missing values in the “Bowl Avg” column. It replaces any occurrence of “-” with 0 in the “Bowl Avg” column using the following line of code: final_data['Bowl Avg'] = np.where(final_data['Bowl Avg']=='-',0,final_data['Bowl Avg']).
  6. Similarly, the code handles missing values in the “Bowl SR” column. It replaces any occurrence of “-” with 0 in the “Bowl SR” column using the following line of code: final_data['Bowl SR'] = np.where(final_data['Bowl SR']=='-',0,final_data['Bowl SR']).
  7. The code then selects a subset of columns from the final_data DataFrame in a desired order. The final columns would consist of the columns – “Player”, “Match”, batting statistics such as “Runs Scored”, “Balls Played”, “Out”, “Bat SR”, “50”, “100”, “4s Scored”, “6s Scored”, “Bat Dot%”, and bowling statistics such as “Overs Bowled”, “Runs Given”, “Wickets Taken”, “Econ”, “Bowl Avg”, “Bowl SR”, “5W”, “4s Given”, “6s Given”, “Bowl Dot%”, as well as the lagged variables.
  8. The resulting final_data DataFrame contains the collected statistics for all the players, with appropriate column names and lagged variables.
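
A small example may help show what shift(-1) does. The DataFrame below is made up for illustration and assumes the rows are ordered from the oldest match to the most recent.

```python
import pandas as pd

# Hypothetical match-by-match runs for one player, oldest match first
history = pd.DataFrame({'Match': [1, 2, 3, 4],
                        'Runs Scored': [34, 12, 61, 7]})

# shift(-1) moves each value up one row, so every row sees the following match
history['next_runs'] = history['Runs Scored'].shift(-1)
print(history)

# The most recent match has no following match yet, so its target is missing;
# dropna() keeps only the rows that can be used for training
training_rows = history.dropna()
```

The last row is the one the trained model is later applied to, since its features are known but its next-match values are not.
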
# Setting up the Chrome driver (paths as defined in the setup step above)
%cd $webdriver_path
driver = webdriver.Chrome(chrome_driver_path)

# Extracting recent stats of the players
final_data = pd.DataFrame()  # Final dataframe to store
                             # all the player data

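# Players listed earlier in the article
players = ['Virat Kohli', 'Glenn Maxwell', 'Faf Du Plessis',
           'Mohammed Siraj', 'Wayne Parnell', 'Shubman Gill',
           'Hardik Pandya', 'Rashid Khan', 'Mohammed Shami']

# Looping through all the players
for i in players: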
    # Accessing the web page for the current player's stats
                end_over=9999".format(i.replace(' ','+')))
    # Scrolling down to load all the stats
    driver.execute_script("window.scrollTo(0, 1080)")
        # Extracting batting stats of the player
        batting_table = driver.find_element_by_xpath(
        bat = batting_table.text
        stats = pd.DataFrame(bat.split('\n'))[0].str.split(' ',
        stats.columns = stats.iloc[0]
        stats = stats[1:]
        del stats['%']
        stats = stats[['Match','Runs','Balls','Outs','SR',
        stats.columns = ['Match','Runs Scored','Balls Played',
                         'Out','Bat SR','50','100','4s Scored',
                         '6s Scored','Bat Dot%']
        # Switching to bowling stats tab
        bowling_tab = driver.find_element_by_xpath(
        # Extracting bowling stats of the player
        bowling_table = driver.find_element_by_xpath(
        bowl = bowling_table.text
        stats2 = pd.DataFrame(bowl.split('\n'))[0].str.split(' ',
        stats2.columns = stats2.iloc[0]
        stats2 = stats2[1:]
        stats2 = stats2[['Match','Overs','Runs','Wickets','Econ',
        stats2.columns = ['Match','Overs Bowled','Runs Given',
                          'Wickets Taken','Econ','Bowl Avg',
                          'Bowl SR','5W','4s Given','6s Given',
                          'Bowl Dot%']
        # If stats for current player not found,
        # create empty dataframe
        stats2 = pd.DataFrame({'Match':pd.Series(stats['Match'][0:1]),
                               'Overs Bowled':[0],'Runs Given':[0],
                               'Wickets Taken':[0],'Econ':[0],
                               'Bowl Avg':[0],'Bowl SR':[0],'5W':[0],
                               '4s Given':[0],'6s Given':[0],
                               'Bowl Dot%':[0]})
    # Merge batting and bowling stats
    merged_stats = pd.merge(stats,stats2,on='Match',how='outer').fillna(0)
    merged_stats = merged_stats.sort_values(by=['Match'])
    # Create lagged variables for future performance prediction
    merged_stats.insert(loc=0, column='Player', value=i)
    merged_stats['next_runs'] = merged_stats['Runs Scored'].shift(-1)
    merged_stats['next_balls'] = merged_stats['Balls Played'].shift(-1)
    merged_stats['next_overs'] = merged_stats['Overs Bowled'].shift(-1)
    merged_stats['next_runs_given'] = merged_stats['Runs Given'].shift(-1)
    merged_stats['next_wkts'] = merged_stats['Wickets Taken'].shift(-1)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    final_data = pd.concat([final_data, merged_stats])
final_data = final_data[final_data['Match']!=0]

final_data['Bowl Avg'] = np.where(final_data['Bowl Avg']=='-',
                                 0,final_data['Bowl Avg'])
final_data['Bowl SR'] = np.where(final_data['Bowl SR']=='-',
                                0,final_data['Bowl SR'])
final_data = final_data[['Player','Match', 'Runs Scored',
                         'Balls Played', 'Out', 'Bat SR',
                         '50', '100', '4s Scored',
                         '6s Scored','Bat Dot%',
                         'Overs Bowled','Runs Given',
                         'Wickets Taken', 'Econ',
                         'Bowl Avg', 'Bowl SR', '5W',
                         '4s Given', '6s Given',
                         'Bowl Dot%', 'next_runs',
                         'next_balls', 'next_overs',
                         'next_runs_given', 'next_wkts']]
final_data = final_data.replace('-',0)

Model Building

When it comes to building the model, we first create an empty data frame called models. This DataFrame will be used to store the predictions for each player.

  1. The code iterates over the list of players (players_list) and filters the final_data DataFrame for each player, creating a player-specific DataFrame called player_data.
  2. Missing value rows are dropped from player_data using the dropna() function, resulting in player_new DataFrame.
  3. Next, a model is built to predict the next runs scored by the player. Features (X_runs) and the target variable (y_runs) are separated from player_new. The data is split into training and testing sets using train_test_split().
  4. A loop is initiated, iterating over a range of alpha values from 0 to 100. For each alpha value, a Ridge regression model is trained and evaluated on both training and testing data. Scores are stored in the ridge_runs DataFrame.
  5. The average score for each alpha value is calculated and stored in the Average column of ridge_runs.
  6. The code finds the alpha value with the highest average score by selecting the row in ridge_runs where the Average column is maximum. If multiple rows have the same maximum average score, the first one is selected.
  7. The model for predicting the next runs scored is trained using the best alpha value (k_runs), and the standard deviation of the runs scored in the training data is calculated (sd_next_runs).
  8. Steps 3-7 are repeated for predicting the next balls played (next_balls), next overs bowled (next_overs), next runs given (next_runs_given), and next wickets taken (next_wkts). Each model is trained and the respective standard deviations are calculated.
  9. The latest data for the player (the most recent row of the player_data DataFrame) is stored in the latest DataFrame.
  10. The trained models predict the next runs, balls, overs, runs given, and wickets for the player, storing the predictions in the respective columns of “latest”.
  11. Confidence intervals are calculated using standard deviations and the formula for a 95% confidence interval. The lower and upper bounds of the confidence intervals are also stored in “latest”.
  12. The “latest” DataFrame, which includes predictions and confidence intervals for the current player, is appended to the “models” DataFrame.
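
The alpha search described above can be tried on synthetic data. This sketch assumes scikit-learn is installed. The arrays are randomly generated and do not represent any player. It follows the article's heuristic of averaging the training and test scores; scikit-learn also offers RidgeCV as a built-in alternative that selects alpha by cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data standing in for one player's match history (not real statistics)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# Score one ridge model per alpha on both the training and the test split
rows = []
for alpha in range(0, 101):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rows.append({'Alpha': alpha,
                 'Train': model.score(X_train, y_train),
                 'Test': model.score(X_test, y_test)})
scores = pd.DataFrame(rows)
scores['Average'] = scores[['Train', 'Test']].mean(axis=1)

# idxmax returns the first row with the highest average score
best_alpha = int(scores.loc[scores['Average'].idxmax(), 'Alpha'])
best_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, round(best_model.score(X_test, y_test), 3))
```

Because the test split takes part in choosing alpha, the test score of the final model is an optimistic estimate of how it will do on a genuinely new match.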

Code Implementation

The above steps are repeated for each player in the players_list, resulting in a models DataFrame that contains the predictions and confidence intervals for all players.

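import math
from statistics import stdev

import scipy.stats
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Player names present in the collected data
players_list = final_data['Player'].unique()

models = pd.DataFrame()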

# Iterate over the list of players
for player_name in players_list:
    # Filter the data for the current player
    player_data = final_data[final_data['Player'] == player_name]

    # Remove rows with missing values
    player_new = player_data.dropna()

    # Predict next runs
    X_runs = player_new[player_new.columns[2:11]]
    y_runs = player_new[player_new.columns[21:22]]
    X_train_runs, X_test_runs, y_train_runs, y_test_runs = \
        train_test_split(X_runs, y_runs, random_state=123)
    ridge_runs = pd.DataFrame()

    # Iterate over a range of alpha values
    for j in range(0, 101):
        points_runs = linear_model.Ridge(alpha=j).fit(
            X_train_runs, y_train_runs)
        ridge_df_runs = pd.DataFrame({
            'Alpha': pd.Series(j),
            'Train': pd.Series(points_runs.score(X_train_runs, y_train_runs)),
            'Test': pd.Series(points_runs.score(X_test_runs, y_test_runs))})
        ridge_runs = pd.concat([ridge_runs, ridge_df_runs])

    # Calculate average score
    ridge_runs['Average'] = ridge_runs[['Train', 'Test']].mean(axis=1)

    # Find the alpha value with the highest average score
    # (the first one if several alphas tie)
    k_runs = ridge_runs[ridge_runs['Average'] ==
                        ridge_runs['Average'].max()]['Alpha']
    k_runs = k_runs.head(1).iloc[0]

    # Train the model with the best alpha value
    next_runs = linear_model.Ridge(alpha=k_runs).fit(X_train_runs, y_train_runs)
    sd_next_runs = stdev(X_train_runs['Runs Scored'].astype('float'))

    # Predict next balls
    X_balls = player_new[player_new.columns[2:11]]
    y_balls = player_new[player_new.columns[22:23]]
    X_train_balls, X_test_balls, y_train_balls, y_test_balls = \
        train_test_split(X_balls, y_balls, random_state=123)
    ridge_balls = pd.DataFrame()

    # Iterate over a range of alpha values
    for j in range(0, 101):
        points_balls = linear_model.Ridge(alpha=j).fit(
            X_train_balls, y_train_balls)
        ridge_df_balls = pd.DataFrame({
            'Alpha': pd.Series(j),
            'Train': pd.Series(points_balls.score(X_train_balls, y_train_balls)),
            'Test': pd.Series(points_balls.score(X_test_balls, y_test_balls))})
        ridge_balls = pd.concat([ridge_balls, ridge_df_balls])

    # Calculate average score
    ridge_balls['Average'] = ridge_balls[['Train', 'Test']].mean(axis=1)

    # Find the alpha value with the highest average score
    # (the first one if several alphas tie)
    k_balls = ridge_balls[ridge_balls['Average'] ==
                          ridge_balls['Average'].max()]['Alpha']
    k_balls = k_balls.head(1).iloc[0]

    # Train the model with the best alpha value
    next_balls = linear_model.Ridge(alpha=k_balls).fit(X_train_balls, y_train_balls)
    sd_next_balls = stdev(X_train_balls['Balls Played'].astype('float'))

    # Predict next overs
    X_overs = player_new[player_new.columns[11:21]]
    # next_overs is column 23 (column 25 is next_wkts)
    y_overs = player_new[player_new.columns[23:24]]
    X_train_overs, X_test_overs, y_train_overs, y_test_overs = \
        train_test_split(X_overs, y_overs, random_state=123)
    ridge_overs = pd.DataFrame()

    # Iterate over a range of alpha values
    for j in range(0, 101):
        points_overs = linear_model.Ridge(alpha=j).fit(
            X_train_overs, y_train_overs)
        ridge_df_overs = pd.DataFrame({
            'Alpha': pd.Series(j),
            'Train': pd.Series(points_overs.score(X_train_overs, y_train_overs)),
            'Test': pd.Series(points_overs.score(X_test_overs, y_test_overs))})
        ridge_overs = pd.concat([ridge_overs, ridge_df_overs])

    # Calculate average score
    ridge_overs['Average'] = ridge_overs[['Train', 'Test']].mean(axis=1)

    # Find the alpha value with the highest average score
    # (the first one if several alphas tie)
    k_overs = ridge_overs[ridge_overs['Average'] ==
                          ridge_overs['Average'].max()]['Alpha']
    k_overs = k_overs.head(1).iloc[0]

    # Train the model with the best alpha value
    next_overs = linear_model.Ridge(alpha=k_overs).fit(X_train_overs, y_train_overs)
    sd_next_overs = stdev(X_train_overs['Overs Bowled'].astype('float'))

    # Predict next runs given
    X_runs_given = player_new[player_new.columns[11:21]]
    y_runs_given = player_new[player_new.columns[24:25]]
    X_train_runs_given, X_test_runs_given, \
        y_train_runs_given, y_test_runs_given = \
        train_test_split(X_runs_given, y_runs_given, random_state=123)
    ridge_runs_given = pd.DataFrame()

    # Iterate over a range of alpha values
    for j in range(0, 101):
        points_runs_given = linear_model.Ridge(alpha=j).fit(
            X_train_runs_given, y_train_runs_given)
        ridge_df_runs_given = pd.DataFrame({
            'Alpha': pd.Series(j),
            'Train': pd.Series(points_runs_given.score(
                X_train_runs_given, y_train_runs_given)),
            'Test': pd.Series(points_runs_given.score(
                X_test_runs_given, y_test_runs_given))})
        ridge_runs_given = pd.concat([ridge_runs_given, ridge_df_runs_given])

    # Calculate average score
    ridge_runs_given['Average'] = \
        ridge_runs_given[['Train', 'Test']].mean(axis=1)

    # Find the alpha value with the highest average score
    # (the first one if several alphas tie)
    k_runs_given = ridge_runs_given[ridge_runs_given['Average'] ==
                                    ridge_runs_given['Average'].max()]['Alpha']
    k_runs_given = k_runs_given.head(1).iloc[0]

    # Train the model with the best alpha value
    next_runs_given = linear_model.Ridge(alpha=k_runs_given).fit(
        X_train_runs_given, y_train_runs_given)
    sd_next_runs_given = \
        stdev(X_train_runs_given['Runs Given'].astype('float'))

    # Predict next wickets
    # (assumed to follow the same pattern as the other bowling models)
    X_wkts = player_new[player_new.columns[11:21]]
    y_wkts = player_new[player_new.columns[25:26]]
    X_train_wkts, X_test_wkts, y_train_wkts, y_test_wkts = \
        train_test_split(X_wkts, y_wkts, random_state=123)
    ridge_wkts = pd.DataFrame()

    # Iterate over a range of alpha values
    for j in range(0, 101):
        points_wkts = linear_model.Ridge(alpha=j).fit(
            X_train_wkts, y_train_wkts)
        ridge_df_wkts = pd.DataFrame({
            'Alpha': pd.Series(j),
            'Train': pd.Series(points_wkts.score(X_train_wkts, y_train_wkts)),
            'Test': pd.Series(points_wkts.score(X_test_wkts, y_test_wkts))})
        ridge_wkts = pd.concat([ridge_wkts, ridge_df_wkts])

    # Calculate average score
    ridge_wkts['Average'] = ridge_wkts[['Train', 'Test']].mean(axis=1)

    # Find the alpha value with the highest average score
    k_wkts = ridge_wkts[ridge_wkts['Average'] ==
                        ridge_wkts['Average'].max()]['Alpha']
    k_wkts = k_wkts.head(1).iloc[0]

    # Train the model with the best alpha value
    next_wkts = linear_model.Ridge(alpha=k_wkts).fit(X_train_wkts, y_train_wkts)
    sd_next_wkts = stdev(X_train_wkts['Wickets Taken'].astype('float'))

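    # Get the latest data for the player
    latest = player_data.groupby('Player').tail(1).copy()

    # Predict next runs, balls, overs, runs given, and wickets,
    # using the same feature columns the models were trained on
    latest['next_runs'] = next_runs.predict(
        latest[latest.columns[2:11]]).ravel()
    latest['next_balls'] = next_balls.predict(
        latest[latest.columns[2:11]]).ravel()
    latest['next_overs'] = next_overs.predict(
        latest[latest.columns[11:21]]).ravel()
    latest['next_runs_given'] = next_runs_given.predict(
        latest[latest.columns[11:21]]).ravel()
    latest['next_wkts'] = next_wkts.predict(
        latest[latest.columns[11:21]]).ravel()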

    # Calculate confidence intervals for each prediction
    # (norm.ppf(.975) is about 1.96, the z-value for a two-sided 95% interval)
    z_95 = scipy.stats.norm.ppf(.975)
    for name, sd, n in [
            ('next_runs', sd_next_runs, len(X_train_runs)),
            ('next_balls', sd_next_balls, len(X_train_balls)),
            ('next_overs', sd_next_overs, len(X_train_overs)),
            ('next_runs_given', sd_next_runs_given, len(X_train_runs_given)),
            ('next_wkts', sd_next_wkts, len(X_train_wkts))]:
        margin = z_95 * sd / math.sqrt(n)
        latest[name + '_ll_95'] = latest[name] - margin
        latest[name + '_ul_95'] = latest[name] + margin

    # Append the latest predictions to the models dataframe
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    models = pd.concat([models, latest])

Post Processing

In this section of the code, we adjust and round the values obtained from the models. These adjustments follow the rules of the game and keep the figures within boundaries that make sense for T20 cricket.

Let us look at each step in turn:

1. Adjusting next_runs_given based on next_overs

  • When next_overs exceeds 4, we modify the value of next_runs_given by computing a scaling factor using the proportion of next_overs to 4.
  • This adaptation is necessary as T20 cricket restricts bowlers to a maximum of 4 overs per game. If the predicted next_overs value exceeds 4, it indicates an unrealistic scenario, so we scale down the value of next_runs_given accordingly.
  • The same adjustment is performed for the lower and upper 95% confidence interval values (next_runs_given_ll_95 and next_runs_given_ul_95).

2. Limiting next_overs to a maximum of 4

  • If the value of next_overs exceeds 4, we set it to 4.
  • This limitation is imposed because, as mentioned earlier, T20 cricket has a maximum of 4 overs per bowler.

3. Adjusting next_runs based on next_balls

  • In cases where next_balls displays a negative value, indicating an unrealistic situation, we set next_runs to zero.
  • The same corrective measure is extended to apply on both the upper and lower values encompassing the 95% confidence intervals (next_runs_ll_95 and next_runs_ul_95).

4. Setting next_runs to a minimum of 1

  • If the value of next_runs is negative, we set it to 1.
  • This adjustment ensures that even if the model predicts negative values for next_runs, we consider a minimum value of 1 since it is not possible to score negative runs in cricket.
  • The same adjustment is performed for the lower and upper 95% confidence interval values (next_runs_ll_95 and next_runs_ul_95).

5. Adjusting next_runs based on next_balls if next_balls > 100

  • When next_balls exceeds 100, we rescale next_runs: we compute the runs per ball (next_runs divided by next_balls) and multiply it by 5.
  • We make this adjustment because a T20 innings consists of a maximum of 120 balls, so a prediction of more than 100 balls for a single player indicates an unlikely scenario. Scaling next_runs down keeps it in line with a limited number of balls.
  • We perform the same adjustment for the lower and upper 95% confidence interval values (next_runs_ll_95 and next_runs_ul_95).

6. Setting next_balls to a minimum of 1

  • If the value of next_balls is negative, we set it to a baseline value of 1.
  • This keeps the predictions within the realm of possibility, since a negative ball count is impossible in cricket.
  • We apply the same adjustment to the lower and upper 95% confidence interval values (next_balls_ll_95 and next_balls_ul_95).

7. Setting next_wkts to a minimum of 1

  • If the value of next_wkts is negative, we set it to 1.
  • This adjustment ensures that even if the model predicts negative values for next_wkts, we consider a minimum value of 1 since it is not possible to have a negative number of wickets in cricket.
  • We make the same adjustment for the lower and upper 95% confidence interval values (next_wkts_ll_95 and next_wkts_ul_95).

8. Rounding values to 0 decimal places

  • We round the values of next_runs, next_balls, next_wkts, next_runs_given, and next_overs to 0 decimal places.
  • This rounding ensures that the values are presented as whole numbers, which is appropriate for representing runs, balls, and wickets in cricket.

These post-processing steps help in refining the predicted values obtained from the models by aligning them with the constraints and rules of T20 cricket. By making adjustments and rounding the values, we ensure that they are within meaningful ranges and suitable for practical interpretation in the context of the game.
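
The effect of some of these rules is easiest to see on a couple of made-up rows. The sketch below assumes NumPy and pandas and uses a hypothetical preds DataFrame rather than real model output. It applies rules 1, 2, 7, and 8 from the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw predictions for two players (not real model output)
preds = pd.DataFrame({'next_overs': [5.0, 3.0],
                      'next_runs_given': [40.0, 24.0],
                      'next_wkts': [-0.4, 1.6]})

over_limit = preds['next_overs'] > 4

# Rule 1: rescale runs given to a 4-over spell at the same economy rate
preds['next_runs_given'] = np.where(
    over_limit,
    preds['next_runs_given'] / preds['next_overs'] * 4,
    preds['next_runs_given'])

# Rule 2: cap overs at 4
preds['next_overs'] = np.where(over_limit, 4, preds['next_overs'])

# Rule 7: negative wicket predictions become 1
preds['next_wkts'] = np.where(preds['next_wkts'] < 0, 1, preds['next_wkts'])

# Rule 8: round to whole numbers
preds = preds.round(0)
print(preds)
```

For the first row, 40 runs over 5 predicted overs is 8 runs per over, so the 4-over figure becomes 32. The second row is already within the limit and keeps its 24 runs.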

# Adjusting values based on conditions and rounding

# Adjusting next_runs_given based on next_overs
models['next_runs_given'] = np.where(
    models['next_overs'] > 4,
    models['next_runs_given'] / models['next_overs'] * 4,
    models['next_runs_given'])
models['next_runs_given_ll_95'] = np.where(
    models['next_overs'] > 4,
    models['next_runs_given_ll_95'] / models['next_overs'] * 4,
    models['next_runs_given_ll_95'])
models['next_runs_given_ul_95'] = np.where(
    models['next_overs'] > 4,
    models['next_runs_given_ul_95'] / models['next_overs'] * 4,
    models['next_runs_given_ul_95'])

# Limiting next_overs to a maximum of 4
models['next_overs'] = np.where(
    models['next_overs'] > 4,
    4, models['next_overs'])
models['next_overs_ll_95'] = np.where(
    models['next_overs_ll_95'] > 4,
    4, models['next_overs_ll_95'])
models['next_overs_ul_95'] = np.where(
    models['next_overs_ul_95'] > 4,
    4, models['next_overs_ul_95'])

# Adjusting next_runs based on next_balls
models['next_runs'] = np.where(
    models['next_balls'] < 0,
    0, models['next_runs'])
models['next_runs_ll_95'] = np.where(
    models['next_balls'] < 0,
    0, models['next_runs_ll_95'])
models['next_runs_ul_95'] = np.where(
    models['next_balls'] < 0,
    0, models['next_runs_ul_95'])

# Setting next_runs to a minimum of 1
models['next_runs'] = np.where(
    models['next_runs'] < 0,
    1, models['next_runs'])
models['next_runs_ll_95'] = np.where(
    models['next_runs_ll_95'] < 0,
    1, models['next_runs_ll_95'])
models['next_runs_ul_95'] = np.where(
    models['next_runs_ul_95'] < 0,
    1, models['next_runs_ul_95'])

# Adjusting next_runs based on next_balls if next_balls > 100
models['next_runs'] = np.where(
    models['next_balls'] > 100,
    models['next_runs'] / models['next_balls'] * 5,
    models['next_runs'])
models['next_runs_ll_95'] = np.where(
    models['next_balls'] > 100,
    models['next_runs_ll_95'] / models['next_balls'] * 5,
    models['next_runs_ll_95'])
models['next_runs_ul_95'] = np.where(
    models['next_balls'] > 100,
    models['next_runs_ul_95'] / models['next_balls'] * 5,
    models['next_runs_ul_95'])

# Limiting next_balls to a maximum of 5
models['next_balls'] = np.where(
    models['next_balls'] > 100,
    5, models['next_balls'])
models['next_balls_ll_95'] = np.where(
    models['next_balls_ll_95'] > 100,
    5, models['next_balls_ll_95'])
models['next_balls_ul_95'] = np.where(
    models['next_balls_ul_95'] > 100,
    5, models['next_balls_ul_95'])

# Setting next_balls to a minimum of 1
models['next_balls'] = np.where(
    models['next_balls'] < 0,
    1, models['next_balls'])
models['next_balls_ll_95'] = np.where(
    models['next_balls_ll_95'] < 0,
    1, models['next_balls_ll_95'])
models['next_balls_ul_95'] = np.where(
    models['next_balls_ul_95'] < 0,
    1, models['next_balls_ul_95'])

# Setting next_wkts to a minimum of 1
models['next_wkts'] = np.where(
    models['next_wkts'] < 0,
    1, models['next_wkts'])
models['next_wkts_ll_95'] = np.where(
    models['next_wkts_ll_95'] < 0,
    1, models['next_wkts_ll_95'])
models['next_wkts_ul_95'] = np.where(
    models['next_wkts_ul_95'] < 0,
    1, models['next_wkts_ul_95'])

# Rounding values to 0 decimal places
models['next_runs'] = round(models['next_runs'], 0)
models['next_runs_ll_95'] = round(models['next_runs_ll_95'], 0)
models['next_runs_ul_95'] = round(models['next_runs_ul_95'], 0)

models['next_balls'] = round(models['next_balls'], 0)
models['next_balls_ll_95'] = round(models['next_balls_ll_95'], 0)
models['next_balls_ul_95'] = round(models['next_balls_ul_95'], 0)

models['next_wkts'] = round(models['next_wkts'], 0)
models['next_wkts_ll_95'] = round(models['next_wkts_ll_95'], 0)
models['next_wkts_ul_95'] = round(models['next_wkts_ul_95'], 0)

models['next_runs_given'] = round(models['next_runs_given'], 0)
models['next_runs_given_ll_95'] = round(models['next_runs_given_ll_95'], 0)
models['next_runs_given_ul_95'] = round(models['next_runs_given_ul_95'], 0)

models['next_overs'] = round(models['next_overs'], 0)
models['next_overs_ll_95'] = round(models['next_overs_ll_95'], 0)
models['next_overs_ul_95'] = round(models['next_overs_ul_95'], 0)

The outcome of the dataframe ‘models’ with the predicted values would be as follows:


Use Cases

  1. In T20 cricket, where every run counts, strategic thinking is critical. With such a predictive model at its disposal, a team no longer has to go by instinct alone. The model can offer value to coaches (who look after training), data analysts (who sift through data) and captains (who make the final calls).
  2. Moreover, selectors and team management can use the model to gain insights into player performance during selection processes or evaluation, providing a probable measure of success.
  3. Further, the model can be handy for teams that want to formulate pre-match strategies.
  4. Fans can better understand match dynamics while experiencing all aspects of T20 cricket. Fantasy cricket aficionados and wagerers alike can leverage the predictive power of this model to gain a competitive advantage. By tapping into the projected performance metrics, users are better equipped to tweak their team selections or wagering strategies, boosting their odds of attaining favourable outcomes.


Limitations

While the predictive model described in this article provides valuable insights into T20 cricket, its limitations must be acknowledged. They stem from the nature of the model and from the underlying data used for training and prediction. Understanding these limitations is essential to ensure that the model’s predictions are correctly interpreted and applied.

  1. Dependence on Historical Data: The model’s training and predictions depend heavily on historical data, so the quality, quantity, and relevance of that data determine its accuracy and dependability. Changes in team composition, player form, pitch conditions, or match dynamics over time can significantly reduce the model’s ability to predict outcomes accurately. It is therefore essential to update the model routinely with the most recent data.

  2. Variability in Playing Conditions: T20 cricket is played across a variety of stadiums, pitches, weather conditions, and tournaments. The model may not reflect the nuances of every specific condition, which leads to variation in the quality of its predictions. Factors such as humidity, pitch deterioration, and ground dimensions can significantly affect match outcomes but may not be adequately accounted for in the model. The model’s predictions should therefore be weighed alongside contextual factors and expert opinion.
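Routinely updating the model with recent data can be sketched as refitting on matches played after a cutoff date. This is only an illustration under stated assumptions: the column names `match_date`, `prev_avg_runs` and `runs`, the helpers `fit_ridge` and `refit_on_recent`, and all values are hypothetical, and the ridge fit uses NumPy's closed-form solution rather than the modeling code used earlier in the article.

```python
import numpy as np
import pandas as pd

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean  # centre the features so the intercept is not shrunk
    penalty = alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ (y - y_mean))
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def refit_on_recent(history, feature_cols, target_col, cutoff, alpha=1.0):
    """Refit using only matches played on or after `cutoff`."""
    recent = history[history["match_date"] >= pd.Timestamp(cutoff)]
    return fit_ridge(recent[feature_cols], recent[target_col], alpha=alpha)

# Made-up match history for one batter (hypothetical columns and values).
history = pd.DataFrame({
    "match_date": pd.to_datetime(
        ["2021-04-10", "2022-04-02", "2023-04-05", "2023-04-12", "2023-04-20"]
    ),
    "prev_avg_runs": [10.0, 20.0, 30.0, 40.0, 50.0],
    "runs": [90.0, 5.0, 31.0, 41.0, 51.0],
})

# Only the three 2023 matches are used for the refit.
coef, intercept = refit_on_recent(
    history, ["prev_avg_runs"], "runs", cutoff="2023-01-01", alpha=1.0
)
```

The choice of cutoff is a trade-off: a short window tracks current form but leaves few matches to fit on, while a long window is more stable but slower to reflect changes.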


Conclusion

In this article, we explored developing and applying a predictive model for T20 cricket. By leveraging historical match data and machine learning techniques such as ridge regression, we demonstrated the potential of such a model to predict player performance and provide valuable insights into the game. To conclude, here are the key learnings:

  1. Data-driven Decision-Making: Predictive models give teams, coaches, and other stakeholders a new tool for data-driven decision-making. By analyzing past performance, contextual factors, and key variables, the model can provide predictions that inform strategic decisions such as team composition, batting order, bowling tactics, and field placements.
  2. Importance of Quality Data: The importance of quality data cannot be overstated when developing a trustworthy predictive model. The accuracy and reliability of the training data significantly influence the outcomes, so access to comprehensive, up-to-date, and precise data is integral.
  3. Contextual Considerations: While the model provides insights based on historical data, it is crucial to consider contextual factors and understand their influence on the outcomes. The model may not fully capture the impact of variables such as pitch conditions, weather, player form, team strategies, and situational pressures, all of which can influence the game. Contextual knowledge plays a critical role in interpreting the model’s predictions and adapting them to the specific circumstances of each match.
  4. Acknowledging Uncertainty: Cricket, especially T20 cricket, is characterized by inherent uncertainty. The model, although valuable, cannot account for unforeseen events, exceptional individual performances, or the spontaneous nature of the game. Therefore, it is important to understand the model’s limitations and use it as a complementary tool.
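One way to keep this uncertainty in view is to check, after matches are played, how often the actual results fell inside the model's 95% bounds. The sketch below assumes a hypothetical `actual_runs` column recorded after each match; the bound columns follow the naming used earlier in the article, and all values are made up.

```python
import pandas as pd

# Hypothetical post-match table: predicted bounds plus the runs actually scored.
results = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "next_runs_ll_95": [4, 0, 10, 2],
    "next_runs_ul_95": [43, 24, 55, 30],
    "actual_runs": [18, 31, 55, 0],  # assumed column, not part of the article's data
})

# True where the actual score lies within the predicted bounds (inclusive).
inside = results["actual_runs"].between(
    results["next_runs_ll_95"], results["next_runs_ul_95"]
)

# Share of players whose actual score fell inside the bounds.
coverage = inside.mean()
```

If the observed share stays well below 0.95 over many matches, the bounds are likely too narrow and the predictions should be treated with more caution.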

Frequently Asked Questions

Q1. What are the three types of predictive models?

A. The three types of predictive models are classification models, regression models, and clustering models. Classification predicts categorical outcomes, regression predicts numerical values, and clustering identifies patterns or groups in data.

Q2. What are the two main predictive models?

A. The two main predictive models are machine learning models and statistical models. Machine learning models use algorithms to learn patterns from data, while statistical models are based on mathematical equations and assumptions.

Q3. What is predictive modeling used for?

A. Predictive modeling is used to make predictions or forecasts about future events or outcomes based on historical data and patterns. It is applied in various fields such as finance, healthcare, marketing, weather forecasting, and risk analysis.

Q4. How many types of predictive modelling techniques are there?

A. There are several types of predictive modeling techniques, including decision trees, random forests, neural networks, support vector machines, logistic regression, time series analysis, and ensemble methods. The choice of technique depends on the specific problem, data characteristics, and desired outcomes.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
