Top 3 Winning Solutions and Approaches from LTFS Data Science FinHack 2 (with Code)

Lakshay arora Last Updated : 18 Feb, 2020

Overview

Presenting the top three winning solutions and approaches from the LTFS Data Science FinHack 2
The problem statement for this hackathon was from the finance industry and geared towards predicting the number of loan applications received

The Power of Data Science Hackathons

I love participating in data science hackathons primarily for two key reasons:

I get to learn A LOT. The top winning solutions and approaches typically engineer new ways to climb up the leaderboard. This varies from feature engineering to new takes on traditional machine learning algorithms. Whatever the case, they bring a fresh perspective to my learning journey
Data science hackathons are easily the best medium for evaluating yourself. We get to challenge and pit our knowledge against the top minds in data science. I always take a step back and evaluate my own performance against the top winning solutions

That second point is a key reason why we publish winning solutions and approaches to our hackathons. Our community loves to pore over these winning solutions, understand the thought process of the winners, and incorporate that into their own hackathon framework.

I am excited to bring forward the top three winning solutions and approaches from the LTFS Data Science FinHack 2 hackathon we conducted a few weeks ago. The problem statement was taken straight from the finance industry (more on that soon).

To participate in such hackathons and to practice and hone your data science skills, I highly recommend browsing through our DataHack platform. Make sure you don’t miss out on the next hackathon!

About the LTFS Data Science FinHack 2

Data Science FinHack 2 was a 9-day hackathon held between January 18th and 26th. Here’s a quick introduction to LTFS in case you need one:

Headquartered in Mumbai, LTFS is one of India’s most respected & leading NBFCs providing finance for two wheeler, farm equipment, housing, infra & microfinance. With a strong parentage & stable leadership, it also has a flourishing Mutual Fund & Wealth Advisory business under its broad umbrella.

Data Science FinHack 2 was one of Analytics Vidhya’s biggest hackathons. The number of data scientists and aspirants who participated broke the previous record, and the number of submissions was through the roof as well:

Total Registrations: 6,319
Total Submissions: 8,090

There were a lot of lucrative prizes on offer along with interview opportunities with LTFS. Here’s the prize money distribution for the top three winners:

Rank #1: INR 2,00,000
Rank #2: INR 1,00,000
Rank #3: INR 50,000

Problem Statement for the LTFS Data Science FinHack 2

I love the problem statement posed by LTFS here. So, let’s spend a moment to understand the challenge in this hackathon before we look at the top three winning solutions.

LTFS receives a lot of requests for its various finance offerings that include housing loans, two-wheeler loans, real estate financing, and microloans. The number of applications received is something that varies a lot with the season. Going through these applications is a manual and tedious process.

Accurately forecasting the number of cases received can help with resource and manpower management resulting in quick response on applications and more efficient processing.

Here was the challenge for the LTFS Data Science FinHack 2 participants:

You have been appointed with the task of forecasting daily cases for the next 3 months for 2 different business segments aggregated at the country level keeping in consideration the following major Indian festivals (inclusive but not exhaustive list): Diwali, Dussehra, Ganesh Chaturthi, Navratri, Holi, etc. (You are free to use any publicly available open-source external datasets). Some other examples could be:

Weather
Macroeconomic variables, etc.

Understanding the LTFS Data Science FinHack 2 Dataset

The train data was provided in the following way:

For business segment 1, historical data was made available at branch ID level
For business segment 2, historical data was made available at the state level

Train File

Variable	Definition
application_date	Date of application
segment	Business Segment (1/2)
branch_id	Anonymized id for a branch at which application was received
state	State in which application was received (Karnataka, MP etc.)
zone	Zone of state in which application was received (Central, East etc.)
case_count	(Target) Number of cases/applications received

Test File

Forecasting was to be done at the country level for the dates provided in the test set for each segment.

Variable	Definition
id	Unique id for each sample in the test set
application_date	Date of application
segment	Business Segment (1/2)
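
Since the train file is at branch or state level while forecasts are scored at the country level, a natural first step is to sum `case_count` per segment and date. The sketch below is one way to do that with plain Python dictionaries; the function name `aggregate_to_country` and the sample rows are made up for illustration and are not taken from the competition data or from any winner's code.

```python
from collections import defaultdict


def aggregate_to_country(rows):
    """Sum case_count over all branches/states for each (segment, application_date)."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["segment"], row["application_date"])
        totals[key] += row["case_count"]
    return dict(totals)


# Illustrative rows only (hypothetical values).
sample_rows = [
    {"application_date": "2019-04-01", "segment": 1, "branch_id": 10, "case_count": 12},
    {"application_date": "2019-04-01", "segment": 1, "branch_id": 11, "case_count": 8},
    {"application_date": "2019-04-01", "segment": 2, "state": "KARNATAKA", "case_count": 30},
    {"application_date": "2019-04-02", "segment": 1, "branch_id": 10, "case_count": 5},
]

daily_totals = aggregate_to_country(sample_rows)
```

The resulting dictionary holds one country-level daily total per segment, which is the granularity the test file asks for.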

Sample Submission

This file contains the exact submission format for the forecasts.

Variable	Definition
id	Unique id for each sample in the test set
application_date	Date of application
segment	Business Segment (1/2)
case_count	(Target) Predicted values for test set

Evaluation Metric for the LTFS Data Science FinHack 2 Hackathon

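The evaluation metric for scoring the forecasts was MAPE (Mean Absolute Percentage Error), M, which is commonly defined as:

$$M = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$

where A_t is the actual value, F_t is the forecast value, and n is the number of forecast points. The result is often multiplied by 100 to express it as a percentage.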
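
For readers who want to score their own forecasts locally, here is a minimal sketch of MAPE in plain Python. It follows the standard definition in terms of A_t and F_t; returning the result as a percentage (multiplying by 100) and raising an error on zero actuals are choices made for this sketch, not something confirmed by the competition rules.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, returned as a percentage."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    total = 0.0
    for a, f in zip(actual, forecast):
        if a == 0:
            # The ratio |A_t - F_t| / |A_t| is undefined for A_t = 0.
            raise ValueError("MAPE is undefined when an actual value is zero")
        total += abs((a - f) / a)
    return 100.0 * total / len(actual)


# Each forecast below is off by 10% of its actual value.
example_score = mape([100, 200], [110, 180])
```

Note that days with very small actual counts can dominate this metric, which is one reason low-volume days such as Sundays and holidays matter so much in this problem.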

The final score was calculated using MAPE for both the segments using the following formula:

You can read more about evaluation metrics in machine learning here:

Data Science FinHack 2

Winners of the LTFS Data Science FinHack 2 Hackathon

Winning a data science hackathon is a herculean task. You are up against some of the top minds in data science – beating them and finishing in the top echelons of the leaderboard takes a lot of effort and analytical thinking (along with data science skills of course). So, hats off to the winners of the LTFS Data Science FinHack2.

Before we go through their winning approaches, let’s congratulate the winners:

Rank 1: Team Data Science FinHack 2 (Satya and Priyadarshi)
Rank 2: Team Data Science FinHack 2 (Abhiroop and Nitesh)
Rank 3: Zishan Kamal

You can check out the final rankings of all the participants on the contest leaderboard.

The top three finishers have shared their detailed solutions and approaches from the competition. Let’s go through each of them.

Rank #3: Zishan Kamal

I split the data by segment and processed and modeled the two segments separately
I started with a baseline model that predicts using last month’s average, then explored several simple time-series models: simple average, moving average, simple exponential smoothing, double exponential smoothing (Holt’s method), and triple exponential smoothing (Holt-Winters model)
I also explored gradient boosting methods and deep learning methods (LSTM). I finally settled on the Facebook Prophet model for segment 1 and LightGBM for segment 2
I carried out outlier treatment and also removed some initial data points since they were not consistent with recent data points
Feature Engineering – For LightGBM, I generated several date-related features but finally kept the following:
- month
- week
- day
- day_of_week
- days_remaining_in_month
- days_since
- holiday
For the Prophet model, I generated several exogenous features but nothing helped. The final Prophet model was without any exogenous features
Here’s my Modeling Strategy:

Cross-Validation Strategy:

I made a few manual adjustments (to case counts for month-end and Sundays) based on cross-validation results, since the model could not capture these patterns even after I introduced several features to help it identify them
I used an external dataset for holidays
Key Takeaway:
- A simple model can sometimes work great. Proper cross-validation is key, especially when early stopping is used, to avoid overfitting on the validation set
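
To make the LightGBM feature list above concrete, here is a rough sketch of how such date features could be derived with the Python standard library. The feature names come from the list; their exact definitions are assumptions (for instance, `days_since` is taken here as days elapsed since a chosen start date, and `week` as the ISO week number), and the function name `date_features` is hypothetical.

```python
import calendar
from datetime import date


def date_features(d, start_date, holidays):
    """Build a dictionary of date-derived features for a single day."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return {
        "month": d.month,
        "week": d.isocalendar()[1],          # ISO week number (assumed definition)
        "day": d.day,
        "day_of_week": d.weekday(),          # Monday = 0, Sunday = 6
        "days_remaining_in_month": days_in_month - d.day,
        "days_since": (d - start_date).days,  # assumed: days since start of data
        "holiday": int(d in holidays),
    }


# Illustrative holiday set with a single entry (Independence Day 2019).
example_holidays = {date(2019, 8, 15)}
features = date_features(date(2019, 8, 15), date(2019, 1, 1), example_holidays)
```

A table of these dictionaries, one per day, can be fed to a gradient boosting model for training and for the forecast horizon alike, since every feature is known in advance for future dates.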

You can check out the full code and solution here.

Rank #2: Abhiroop and Nitesh

Here’s the approach used by Abhiroop and Nitesh:

Data Exploration:
- Day level trend of Applications Received for Segment 1 & 2. Segment 1 was being highly impacted by the festive seasons while there was no such major change in segment 2:
- Day level year-on-year trend for Segments 1 & 2:
- For segment 2, the first 10 days of each month are almost constant while there is a decline during the month-end irrespective of the weekdays:
Feature Engineering and External Datasets: Based on the data visualization, we created the following features that helped the model to improve MAPE:
- Derived features from date:
  - Day of the month
  - Weekdays
  - Week number of the month
  - Week number of the year
  - Month
  - Year
  - Quarter
  - Day of the month (grouped) with Weekdays: most important feature for segment 2
- Lag Feature:
  - Lag 365: # of Applications received on the same day last year
- Features from Holiday: (Source)
  - Days elapsed since last holiday: 2nd most important feature for segment 1
  - Holiday flag
Modeling Techniques used and hyperparameters used
- Segment 1
  - A TBATS model was built on segment 1 using a seasonal period of 7 and the last 420 days of application counts
  - The final prediction for segment 1 was calculated as a weighted average ensemble of the TBATS and XGBoost predictions
  - Final Prediction = 0.8*TBATS + 0.2*XGBoost
- Segment 2
  - The final prediction for segment 2 was based just on a single XGBoost model
  - Following hyperparameters were used based on the time-split validation score:

Feature Importance Plot: Derived features like days elapsed since last holiday, modified week number (custom_date_key), flags for holidays, yearly lag of count, day of the month were found to be quite helpful in improving the model accuracy.
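
Three of the ideas above are simple enough to sketch in a few lines: the "days elapsed since last holiday" feature, the 365-day lag, and the 0.8/0.2 weighted blend for segment 1. This is an illustrative sketch only; the function names are hypothetical, the treatment of edge cases (no earlier holiday, no data a year ago) is an assumption, and the model predictions are stand-in numbers rather than real TBATS or XGBoost output.

```python
from bisect import bisect_right
from datetime import date, timedelta


def days_since_last_holiday(d, sorted_holidays):
    """Days elapsed since the most recent holiday on or before d, or None if there is none."""
    idx = bisect_right(sorted_holidays, d)
    if idx == 0:
        return None
    return (d - sorted_holidays[idx - 1]).days


def lag_365(d, counts_by_date):
    """Number of applications received 365 days before d, or None if unavailable."""
    return counts_by_date.get(d - timedelta(days=365))


def blend(tbats_pred, xgb_pred, w_tbats=0.8, w_xgb=0.2):
    """Weighted average ensemble: w_tbats * TBATS + w_xgb * XGBoost."""
    return [w_tbats * t + w_xgb * x for t, x in zip(tbats_pred, xgb_pred)]


# Illustrative inputs (hypothetical values).
holidays = sorted([date(2019, 8, 15), date(2019, 10, 8)])
elapsed = days_since_last_holiday(date(2019, 8, 20), holidays)
history = {date(2018, 8, 20): 1500}
last_year = lag_365(date(2019, 8, 20), history)
final_prediction = blend([100.0, 200.0], [200.0, 100.0])
```

Giving most of the weight to one model, as in the 0.8/0.2 split, keeps the ensemble close to the stronger forecaster while still letting the second model correct some of its errors.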

You can check out the full solution and code here.

And now, the winning solution for the LTFS Data Science FinHack 2 hackathon!

Rank #1: Satya and Priyadarshi

Derive Steady State
- The behavior of a metric time series often changes over time. Therefore, before performing any behavioral analysis, it is essential to capture these changes and find the most recent steady state. We performed the following steps:
  1. Detected all possible changes in the training data using the “ruptures” library in Python
  2. Computed statistical measures like mean, median and standard deviation for the regions between detected changes
  3. Retained only the changes that persisted for longer than a minimum duration threshold and whose statistical measures changed by more than a minimum significance threshold
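
Steps 2 and 3 above can be sketched as follows, assuming the candidate change points have already been detected (for example by a change-point library such as ruptures) and are given as indices strictly inside the series. The function names, the use of the relative change in mean as the significance measure, and the threshold values are all assumptions made for this illustration. As a simplification, each region is compared with its immediate predecessor even when the previous change was discarded.

```python
from statistics import mean, median, pstdev


def segment_stats(values, change_points):
    """Compute mean, median and standard deviation for each region between change points."""
    bounds = [0] + sorted(change_points) + [len(values)]
    stats = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        region = values[start:end]
        stats.append({
            "start": start,
            "end": end,
            "mean": mean(region),
            "median": median(region),
            "std": pstdev(region),
        })
    return stats


def significant_changes(values, change_points, min_duration, min_rel_change):
    """Keep changes whose new region lasts long enough and shifts the mean enough."""
    stats = segment_stats(values, change_points)
    kept = []
    for prev, cur in zip(stats[:-1], stats[1:]):
        duration = cur["end"] - cur["start"]
        base = abs(prev["mean"]) or 1.0  # avoid division by zero
        rel_change = abs(cur["mean"] - prev["mean"]) / base
        if duration > min_duration and rel_change > min_rel_change:
            kept.append(cur["start"])
    return kept


# Illustrative series: a short blip at index 10, then a lasting level shift at index 12.
series = [10] * 10 + [11] * 2 + [30] * 10
kept_changes = significant_changes(series, [10, 12], min_duration=5, min_rel_change=0.5)
steady_state = series[kept_changes[-1]:] if kept_changes else series
```

Only the data after the last retained change is then treated as the recent steady state for the later steps.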

Compute Temporal Behavior
- Step 1 – Data segregation using Clustering: We segregated the data to handle complex temporal behavior.
  - Cluster the data using clustering techniques (used k-means)
  - Find the best temporal behavior (pattern) in each cluster
  - If different clusters in a dataset show different patterns, we consider that data has a complex pattern and segregate it on the basis of clusters
  - Otherwise, we do not segregate the data
  - Insights derived from the given data:
    - Segment 1: We found 2 strong clusters as follows:
      - Cluster 1: Case count range (0-2724) – Best pattern: Day of Week
      - Cluster 2: Case count range (2772 – 4757) – Best pattern: Day of Week
      - Segregation not needed as both clusters had the same pattern.
    - Segment 2: We found 3 strong clusters as follows:
      - Cluster 1: Case count range (0-8623) – Best pattern: Day of Month
      - Cluster 2: Case count range (9519-19680) – Best pattern: Day of Week(Sun)
      - Cluster 3: Case count range (20638 – 32547) – Best pattern: Day of Month
      - Segregated data based on clusters & patterns, as clusters had different patterns
- Step 2 – Compute temporal behavior:
  - We performed the following steps to find the best temporal behavior:
  - Assumption: Data contains either Day of Week or Day of Month pattern (We can add more complex patterns if required)
    1. Divide data into buckets of each pattern (Day of Week and Day of Month). For example: for pattern type = Day of Week, all values of Monday in one bucket and all values of Tuesday in another bucket, etc.
    2. Compute intra-bucket variation factor to find variation in values within a bucket. Used coefficient of variation
    3. Compute inter-bucket variation factor to find variation in values across buckets
    4. Compute average inter and intra-bucket variation for each pattern type. For example: For pattern type = Day of week, average intra-bucket variation is 0.12 & average inter-bucket variation is 0.7
    5. Rank the patterns with the following objective:
      - Minimizing intra-bucket variation
      - Maximizing inter-bucket variation
    6. Choose the top-ranked pattern as the best representative pattern
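
The pattern-ranking procedure above can be sketched like this. The coefficient of variation for intra-bucket variation comes from the description; measuring inter-bucket variation as the coefficient of variation of the bucket means, and ranking patterns by the ratio of inter- to intra-bucket variation, are assumptions of this sketch, since the original text does not spell out those details. All function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by mean (0.0 if the mean is zero)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def pattern_score(dates, counts, key_fn):
    """Return (average intra-bucket variation, inter-bucket variation) for one pattern."""
    buckets = defaultdict(list)
    for d, c in zip(dates, counts):
        buckets[key_fn(d)].append(c)
    intra = mean([coefficient_of_variation(v) for v in buckets.values()])
    inter = coefficient_of_variation([mean(v) for v in buckets.values()])
    return intra, inter


PATTERNS = {
    "day_of_week": lambda d: d.weekday(),
    "day_of_month": lambda d: d.day,
}


def best_pattern(dates, counts):
    """Rank patterns by inter/intra variation (assumed rule) and return the best one."""
    scores = {name: pattern_score(dates, counts, fn) for name, fn in PATTERNS.items()}
    best = max(scores, key=lambda n: scores[n][1] / (scores[n][0] + 1e-9))
    return best, scores


# Synthetic data with a pure weekly pattern: low on Sundays, rising Monday to Saturday.
start = date(2019, 1, 7)  # a Monday
dates = [start + timedelta(days=i) for i in range(56)]
counts = [20 if d.weekday() == 6 else 100 + 10 * d.weekday() for d in dates]
best, scores = best_pattern(dates, counts)
```

On this synthetic series every weekday bucket is perfectly constant, so the Day of Week pattern has zero intra-bucket variation and wins the ranking.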

Profile Festive Behavior
- We treated the behavior during festivals differently from the normal behavior of training data. This was done to ensure that behavioral patterns and variation during festivals are correctly captured and passed on for forecasting
- We performed the following steps :
  1. Used Google to find all bank holidays within the duration of the given test and train data
  2. Using the temporal pattern derived in the previous step, we aggregated data specific to each festival (month-wise for the Day of Month pattern and week-wise for the Day of Week pattern) and computed statistical measures. For example, to find the deviation for August 15th (Independence Day), let P = the value on 15th August (say its day of week is Sunday). If the data has a Day of Week pattern, S = statistical measures of the previous few Sundays; if the data has a Day of Month pattern, S = statistical measures of the 15th day of the past few months (April 15, May 15, June 15, July 15)
  3. Compute the deviation between P and S to find variation factors
  4. Use this change/deviation factor to model festival behavior for forecasting
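
As a rough illustration of the P versus S comparison above, the sketch below expresses the festival deviation as a ratio between the festival-day value and the median of comparable normal days. Using the median as the statistical measure and a multiplicative factor as the deviation are assumptions of this sketch; the original description leaves both open. The function names and numbers are hypothetical.

```python
from statistics import median


def festival_factor(festival_value, reference_values):
    """Ratio of the festival-day value (P) to the median of comparable normal days (S)."""
    baseline = median(reference_values)
    if baseline == 0:
        raise ValueError("reference values have a zero median; factor is undefined")
    return festival_value / baseline


def apply_festival_factor(normal_forecast, factor):
    """Scale a normal-day forecast by the learned festival factor."""
    return normal_forecast * factor


# Illustrative values: a festival Sunday versus the previous four Sundays.
factor = festival_factor(40, [100, 110, 90, 100])
adjusted = apply_festival_factor(120, factor)
```

A factor below 1 means the festival suppresses application counts relative to a comparable normal day, and the same factor can be reused when that festival falls in the forecast window.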

Compute Representative Values
- We computed representative values using the derived temporal patterns and multiple statistical measures. We performed the following steps:
  1. Prepare a bucket of values using the derived temporal pattern
  2. Fit a linear equation to each bucket using the sklearn library in Python to model the bucket and derive its slope
  3. De-trend the values in each bucket using the slope derived above
  4. Derive statistical measures such as mean + k(std), median + l(MAD), and the Nth quantile for different sets of k, l, and N values. For example, for a normal distribution range: k = [-2, 2]
  5. Out of these derived measures, choose the one that best fits the values in the bucket: best fit value = min Error(f(mean, median, quantile))
  6. Compute representative values; representative value model = f(best fit value, slope)
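
The bucket-level steps above could look roughly like the following. This sketch swaps in a closed-form least-squares slope in place of sklearn to stay dependency-free, uses mean absolute error as the "Error" criterion, and picks small illustrative grids for k, l, and the quantiles; all of these, along with the function names, are assumptions rather than the winners' actual choices.

```python
from statistics import mean, median, pstdev


def fit_slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    denom = sum((x - x_bar) ** 2 for x in range(n))
    if denom == 0:
        return 0.0
    return sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values)) / denom


def nearest_rank_quantile(sorted_values, q):
    """Simple quantile: pick the element nearest to position q in the sorted list."""
    idx = int(round(q * (len(sorted_values) - 1)))
    return sorted_values[idx]


def representative_model(values, ks=(-1, 0, 1), ls=(-1, 0, 1), qs=(0.25, 0.5, 0.75)):
    """Return (best fit value, slope) for one bucket of values."""
    slope = fit_slope(values)
    detrended = [y - slope * i for i, y in enumerate(values)]
    mu, sd, med = mean(detrended), pstdev(detrended), median(detrended)
    mad = median([abs(v - med) for v in detrended])
    ordered = sorted(detrended)
    candidates = [mu + k * sd for k in ks]
    candidates += [med + l * mad for l in ls]
    candidates += [nearest_rank_quantile(ordered, q) for q in qs]
    # Assumed error criterion: mean absolute error against the de-trended values.
    best = min(candidates, key=lambda c: mean([abs(v - c) for v in detrended]))
    return best, slope


def predict(model, step):
    """Add the trend back: representative value plus slope times the bucket index."""
    best, slope = model
    return best + slope * step


# Illustrative bucket, e.g. the last five Mondays, with a perfectly linear trend.
model = representative_model([100, 104, 108, 112, 116])
next_value = predict(model, 5)
```

Because the trend is removed before the statistical measures are compared, the chosen value describes the level of the bucket, while the slope carries the growth forward to future dates.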

Predict Future Date
- By the time we reach this step, we have models across the different temporal dimension values for normal as well as festival days, and we use them to forecast future dates. We performed the following steps for the given future dates:
  1. Get temporal behavior of data
  2. Get best-fit value for the above fetched temporal behavior
  3. Add the trend to the representative value to get the final forecasted value

forecasted value = f(future time stamp, representative value model, festive behavior)

Here is the full code and solution for the winning approach.

Evolution of the Result for the Winning Solution

Final Thoughts

3 supreme winning solutions! That was quite a learning experience for me personally. Time series hackathons are a tricky prospect but there is a lot to glean from these winning solutions.

Which is your favorite winning solution from this list? Would you approach the problem in a different manner? Share your ideas in the comments section below!

Make sure you visit the DataHack platform for more such data science hackathons and practice problems!

Lakshay arora

Ideas have always excited me. The fact that we could dream of something and bring it to reality fascinates me. Computer Science provides me a window to do exactly that. I love programming and use it to solve problems, and I am a beginner in the field of Data Science.
