Anomaly Detection in Credit Card Fraud

Sanket Sarwade Last Updated : 12 May, 2023


Introduction

We live in a world that’s filled with data. Think about it: you probably use multiple online accounts every day, from email to social media to online shopping. But have you ever had a moment where you were using one of these accounts and received a notification on your phone asking if it’s really you trying to access your account? Well, this is where anomaly detection comes in, and the same idea is what powers credit card fraud detection.

For example, let’s say you’re at work and you accidentally mistype your password while trying to log in to your Google account. Suddenly, you receive a message on your phone from Google asking if the login attempt is really you. This might make you wonder how Google is able to know that it’s not actually you trying to log in, especially if it’s just a simple mistake.

Anomaly detection is a fancy way of saying that computers are really good at finding patterns and noticing when something is out of the ordinary. In the case of Google’s login security, they use machine learning algorithms to create a “normal” pattern of your login behavior. This means that they learn what kind of device you usually use, what time of day you usually log in, and other similar details that make up your typical login behavior. Google’s system is able to recognize when some unusual activity happens and flag it as potentially suspicious. It’s kind of like how your bank might call you if they see a purchase on your account from a different country; they’re just making sure that it’s really you and not someone else using your account.

Learning Objectives

Understanding the importance of data in today’s world and how it is used in various online accounts.
Learning about anomaly detection and its role in identifying unusual behavior in online security systems.
Understanding how machine learning algorithms work to identify normal patterns of login behavior and detect suspicious activities.
Understanding the significance of login security and how it can be used to protect online accounts from unauthorized access.
Understanding the similarities between Google’s login security and the way banks protect their customers’ accounts.
Understanding the importance of being vigilant and cautious while using online accounts to ensure their safety and security.

This article was published as a part of the Data Science Blogathon.

Introduction
How Google Uses Anomaly Detection to Keep Your Account Secure?
Project Description
Anomaly Detection Using Python
Data Preparation
Data Preprocessing
Anomaly Detection Algorithms
Evaluation and Model Selection
Model Deployment
Conclusion

How Google Uses Anomaly Detection to Keep Your Account Secure?

Google collects data from various sources such as browser cookies, device fingerprints, IP addresses, and user account information. They use this data to create a baseline of your normal login behavior, including your usual device, location, and time of day for logging in to your Google account. This baseline is then used to train machine learning algorithms that can detect unusual login patterns. These algorithms analyze data points such as the device used to log in, the location, the time of day, and your behavior after logging in, to identify anomalies that deviate significantly from your normal behavior.

For instance, if someone tries to log in to your Google account from a different device, location, or time of day, the algorithm may flag this as a potentially unauthorized login attempt. Similarly, if your behavior after logging in is significantly different from your normal behavior, the algorithm may flag this as a potential security threat.
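The flagging logic described above can be illustrated with a toy baseline check. This is only a sketch under simplifying assumptions: the field names (device, country, hour), the helper names, and the z-score threshold are all hypothetical, and nothing here reflects how Google's production systems actually work.

```python
from statistics import mean, pstdev


def build_baseline(logins):
    """Summarise past logins: known devices, known countries, typical login hour."""
    hours = [login["hour"] for login in logins]
    return {
        "devices": {login["device"] for login in logins},
        "countries": {login["country"] for login in logins},
        "hour_mean": mean(hours),
        # Fall back to 1.0 so a perfectly regular user does not cause a division by zero
        "hour_std": pstdev(hours) or 1.0,
    }


def is_suspicious(login, baseline, z_threshold=3.0):
    """Flag a login that uses an unseen device or country, or an unusual hour."""
    new_device = login["device"] not in baseline["devices"]
    new_country = login["country"] not in baseline["countries"]
    # Simplification: hours are treated as a plain number, ignoring the midnight wrap-around
    z = abs(login["hour"] - baseline["hour_mean"]) / baseline["hour_std"]
    return new_device or new_country or z > z_threshold


# Made-up login history for one user
history = [
    {"device": "laptop", "country": "IN", "hour": 9},
    {"device": "laptop", "country": "IN", "hour": 10},
    {"device": "laptop", "country": "IN", "hour": 9},
    {"device": "laptop", "country": "IN", "hour": 8},
    {"device": "laptop", "country": "IN", "hour": 10},
]
baseline = build_baseline(history)

print(is_suspicious({"device": "laptop", "country": "IN", "hour": 9}, baseline))
print(is_suspicious({"device": "unknown-phone", "country": "IN", "hour": 9}, baseline))
```

A real system would learn from many more signals and would weigh them rather than combining them with a simple "or", but the shape is the same: build a baseline, then measure how far a new event falls from it.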

Google’s anomaly detection algorithms are constantly evolving and improving as they learn from new data. They use a feedback loop to continuously train and update their algorithms based on new login data and user feedback.

To accomplish this, Google needs a large server infrastructure to collect, store, and process the vast amounts of data generated by their users. They also need powerful machine learning algorithms to analyze the data and detect anomalies accurately.

Project Description

Anomaly detection is widely used across various fields, industries, and platforms. One such application of anomaly detection is in detecting credit card fraud. Our project will focus on implementing anomaly detection techniques to identify potentially fraudulent credit card transactions.

We will use a dataset of credit card transactions to create a baseline of normal spending behavior. Then we will apply machine learning algorithms to identify any transactions that deviate significantly from this baseline and flag them as potentially fraudulent.

We will also explore various techniques to improve the accuracy of our anomaly detection model, such as feature engineering, data preprocessing, and hyperparameter tuning. Our ultimate goal is to create a model that can accurately identify fraudulent transactions and minimize false positives, so that cardholders can be alerted in a timely manner and take appropriate actions to protect their accounts.

Here is the link to the GitHub repository that contains the credit card fraud detection project.

You can download the credit card fraud dataset used in the project from the Kaggle website. Here is the link to the dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud

Anomaly Detection Using Python

In this section, we will provide an end-to-end guide to implementing anomaly detection using Python. We will use the Credit Card Fraud Detection dataset from Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders. The dataset contains 284,807 transactions, out of which 492 are frauds. The dataset is highly imbalanced, with fraud transactions accounting for only 0.17% of the total transactions.

Data Preparation

The first step is to prepare the data for anomaly detection. We will start by importing the necessary libraries and loading the dataset into a pandas DataFrame.

import pandas as pd

# Load the dataset
df = pd.read_csv("creditcard.csv")

# Check the shape of the dataset
print("Shape of the dataset:", df.shape)

# Check the first few rows of the dataset
print(df.head())

Output :

Shape of the dataset: (284807, 31)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26       V27       V28  Amount  Class  
0 -0.189115  0.133558 -0.021053  149.62      0  
1  0.125895 -0.008983  0.014724    2.69      0  
2 -0.139097 -0.055353 -0.059752  378.66      0  
3 -0.221929  0.062723  0.061458  123.50      0  
4  0.502292  0.219422  0.215153   69.99      0  

[5 rows x 31 columns]

The dataset contains 31 columns, including the Time, Amount, and Class columns. The remaining columns, V1 to V28, are anonymized features produced by a PCA transformation. The Class column indicates whether a transaction is fraudulent or not, where 1 indicates fraud and 0 indicates non-fraud.
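To see how lopsided the Class column is, we can rebuild just the labels from the counts quoted earlier (284,807 transactions, 492 frauds) and compute the fraud share. This is a small sketch on a reconstructed label column, not on the real file; with the actual data loaded, the same two calls would be applied to df['Class'].

```python
import pandas as pd

# Rebuild only the label column from the counts quoted in the article;
# the real dataset has 30 more columns that are not needed for this check
n_total, n_fraud = 284_807, 492
labels = pd.Series([0] * (n_total - n_fraud) + [1] * n_fraud, name="Class")

# Number of transactions per class
counts = labels.value_counts()

# The mean of a 0/1 column is the share of ones
fraud_pct = 100 * labels.mean()

print(counts)
print(f"Fraud share: {fraud_pct:.3f}%")
```

A share this small is the reason plain accuracy is a poor yardstick for this problem.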

Data Preprocessing

Before applying any anomaly detection algorithm, it is essential to preprocess the data to ensure that it is in a suitable format for the algorithm. Here are some steps that we can follow to preprocess the dataset:

Handling Missing Values

Missing values can affect the performance of the anomaly detection algorithm. Therefore, it is essential to check whether there are any missing values in the dataset and take appropriate action.

# Count the missing values across all columns of the dataset
print(df.isnull().sum().sum())

Output :

0

The output shows that there are no missing values in the dataset.
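Our dataset needs no repair, but it helps to know what "appropriate action" could look like when gaps do exist. The sketch below uses a tiny made-up DataFrame (the values are illustrative only) and shows two common options: filling gaps with the column median, or dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one gap per column
toy = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 50.0],
    "Time": [0.0, 1.0, np.nan, 3.0],
})

# Missing values per column
missing_per_column = toy.isnull().sum()

# Option 1: fill each gap with the median of its column
filled = toy.fillna(toy.median(numeric_only=True))

# Option 2: drop every row that has at least one gap
dropped = toy.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Median filling keeps every row, which matters when fraud cases are rare; dropping rows is simpler but can throw away some of the few positive examples.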

Scaling the Data

Anomaly detection algorithms can be sensitive to the scale of the data. Therefore, it is important to scale the data before applying the algorithm. We can use the StandardScaler class from the sklearn.preprocessing module to scale the data.

from sklearn.preprocessing import StandardScaler

# Scale the Amount column
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))

# Scale the Time column
df['Time'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))

# Check the first few rows of the dataset after scaling
print(df.head())

Output :

       Time        V1        V2        V3        V4        V5        V6  \
0 -1.996583 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388   
1 -1.996583  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361   
2 -1.996562 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499   
3 -1.996562 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203   
4 -1.996541 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921   

         V7        V8        V9  ...       V21       V22       V23       V24  \
0  0.239599  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928   
1 -0.078803  0.085102 -0.255425  ...
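To make clear what StandardScaler does, here is a small check that reproduces it by hand: subtract the mean and divide by the standard deviation. The sketch uses only the first five Amount values from the preview shown earlier, so the scaled numbers will not match the full-dataset output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Amount values from the dataset preview
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99]).reshape(-1, 1)

# Manual z-score; numpy's std defaults to the population formula (ddof=0),
# which is the same convention StandardScaler uses
manual = (amounts - amounts.mean()) / amounts.std()

# The same transformation through scikit-learn
scaled = StandardScaler().fit_transform(amounts)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

One design choice worth considering: when the data is later split into training and test sets, fitting the scaler on the training part only and reusing it on the test part avoids leaking test-set statistics into the model.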

Anomaly Detection Algorithms

There are various anomaly detection algorithms available. In this section, we will discuss some popular algorithms along with their implementation in Python.

Isolation Forest

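Isolation Forest is a popular algorithm for anomaly detection that is built on an ensemble of randomly constructed decision trees. Each tree repeatedly splits the data on a randomly chosen feature and split value. Because anomalies are few and different from the rest, they tend to be isolated after fewer splits, so they end up with shorter paths from the root of the tree.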

Let’s implement the Isolation Forest algorithm on our credit card fraud dataset.

from sklearn.ensemble import IsolationForest

# Create the Isolation Forest object; contamination is the share of points to flag
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.01,
                      max_features=1.0, random_state=42)

# Fit the model
# Note: df still includes the Class label here; for a strictly unsupervised fit,
# pass df.drop(columns='Class') instead
clf.fit(df)

# Get the predictions (1 = inlier, -1 = outlier)
y_pred = clf.predict(df)

# predict already returns a 1D array, so it can be used directly as a boolean mask

# Print the number of outliers
print("Number of outliers:", len(df[y_pred == -1]))

Output :

Number of outliers: 2848

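The Isolation Forest algorithm has flagged 2848 transactions as anomalies. With contamination set to 0.01, the model is told to flag about 1% of the 284,807 transactions, so this count follows from the contamination parameter rather than being discovered by the algorithm.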
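A count of flagged points says nothing about whether the right points were flagged. When labels are available, as they are in our dataset, the predictions can be compared with them. The sketch below does this on synthetic data (1,000 made-up "normal" points and 20 planted outliers, not the credit card data), fitting on the features only and using the labels purely for scoring.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)

# Synthetic stand-in: normal points around the origin, planted outliers far away
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
signs = rng.choice([-1, 1], size=(20, 2))
fraud = rng.uniform(low=6.0, high=10.0, size=(20, 2)) * signs
X = np.vstack([normal, fraud])
y_true = np.array([0] * 1000 + [1] * 20)

# Fit on the features only; the labels play no part in training
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
pred = model.fit_predict(X)

# Map the -1/1 output to the 1/0 convention of the Class column
y_pred = (pred == -1).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Flagged: {y_pred.sum()}, precision: {precision:.2f}, recall: {recall:.2f}")
```

On the real dataset the same comparison would use df['Class'] as y_true, with the model fitted on the remaining columns.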

Local Outlier Factor

Local Outlier Factor (LOF) is another popular algorithm for anomaly detection that is based on the concept of local density. It works by calculating the density of a data point relative to its neighbors and identifying points that have a much lower density than their neighbors as outliers.

Let’s implement the LOF algorithm on our credit card fraud dataset.

from sklearn.neighbors import LocalOutlierFactor

# Create the LOF object
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.01)

# Fit the data and tag the outliers (1 = inlier, -1 = outlier)
# Note: df still includes the Class label column at this point
y_pred = clf.fit_predict(df)

# fit_predict already returns a 1D array, so no reshaping is needed

# Print the number of outliers
print("Number of outliers:", len(df[y_pred == -1]))

Output :

Number of outliers: 2848

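The LOF algorithm has also flagged 2848 anomalies. The match with Isolation Forest is expected: both models were given contamination=0.01, so each flags about 1% of the data. It does not mean that the two algorithms flagged the same transactions.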
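The identical counts from the two algorithms can be reproduced on data that contains no anomalies at all. The sketch below runs both detectors on pure Gaussian noise (made-up data) with two contamination values; in each case the number of flagged points tracks contamination times the number of rows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Pure Gaussian noise with no planted anomalies
X = rng.normal(size=(2000, 5))

counts = {}
for contamination in (0.01, 0.05):
    iso = IsolationForest(contamination=contamination, random_state=0).fit_predict(X)
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination).fit_predict(X)
    # Store (Isolation Forest count, LOF count) for this contamination value
    counts[contamination] = (int((iso == -1).sum()), int((lof == -1).sum()))

print(counts)
```

This is why contamination is best chosen from domain knowledge, for example the known fraud rate, instead of being read back as a finding.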

One-class SVM

One-class SVM is another popular algorithm for anomaly detection that is based on the concept of maximum margin hyperplanes. It learns a boundary around the bulk of the data by finding, in the kernel feature space, the hyperplane that separates the data from the origin with the largest margin. Points that fall on the wrong side of this boundary are identified as anomalies.

Let’s implement the One-class SVM algorithm on our credit card fraud dataset.

from sklearn.svm import OneClassSVM

# Create the One-class SVM object; nu is an upper bound on the share of training outliers
clf = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.01)

# Fit the model
# Note: df still includes the Class label column at this point
clf.fit(df)

# Get the predictions (1 = inlier, -1 = outlier)
y_pred = clf.predict(df)

# predict already returns a 1D array, so no reshaping is needed

# Print the number of outliers
print("Number of outliers:", len(df[y_pred == -1]))

Output

Number of outliers: 492

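The One-class SVM algorithm has flagged 492 anomalies in the dataset. Although this number equals the count of labeled frauds, the count alone does not show that the flagged points are the fraudulent ones; that requires comparing the predictions with the Class column.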
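One-class SVM is often used for novelty detection: train on data believed to be normal, then score new points. Here is a minimal sketch of that workflow on synthetic data. The data, the gamma value, and the nu value are illustrative assumptions and are not tuned for the credit card dataset.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Made-up "normal" behaviour: 500 points around the origin
X_normal = rng.normal(size=(500, 2))

# nu bounds the share of training points that may fall outside the boundary;
# a small gamma gives a wide, smooth boundary around the data
ocsvm = OneClassSVM(kernel="rbf", gamma=0.05, nu=0.05).fit(X_normal)

# Share of training points the model itself treats as outliers
train_outlier_share = float((ocsvm.predict(X_normal) == -1).mean())

# Score two new points: one typical, one far from anything seen in training
new_points = np.array([[0.0, 0.0], [8.0, 8.0]])
new_pred = ocsvm.predict(new_points)

print(f"Training outlier share: {train_outlier_share:.3f}")
print(new_pred)
```

In a fraud setting, the training set for this approach would be transactions known to be legitimate, so that anything the model has not seen before stands out.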

Evaluation and Model Selection

In this code, we evaluate the performance of our models using cross-validation and select the best-performing settings for each. We hold out 30% of the data as a test set and use stratified K-fold cross-validation to split the training data into 5 folds, ensuring that the proportion of fraud cases is the same in each fold. Then, we train and evaluate three models (Logistic Regression, Decision Tree, and Random Forest) with a grid search over these folds. We use the average precision score as the evaluation metric because it is a suitable metric for imbalanced datasets.

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Define X and y
X = df.drop('Class', axis=1)
y = df['Class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pair each classifier with its own parameter grid
# (liblinear supports both the l1 and l2 penalties used in the grid)
models = [
    (LogisticRegression(solver='liblinear'),
     {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}),
    (DecisionTreeClassifier(random_state=42),
     {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 7]}),
    (RandomForestClassifier(random_state=42),
     {'n_estimators': [100, 300, 500], 'max_depth': [3, 5, 7]}),
]

# 5 stratified folds, so every fold keeps the same share of fraud cases
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Loop over classifiers and parameter grids to find the best settings for each
for classifier, param_grid in models:
    clf = GridSearchCV(classifier, param_grid, cv=cv, scoring='average_precision')
    clf.fit(X_train, y_train)
    print(classifier.__class__.__name__)
    print(clf.best_params_)
    print("Cross-validated average precision:", clf.best_score_)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))
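To see why average precision is preferred over accuracy here, consider a "model" that never predicts fraud. The sketch below rebuilds only the labels from the counts quoted earlier (284,807 transactions, 492 frauds) and scores this do-nothing model with both metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Labels rebuilt from the counts quoted in the article
n_total, n_fraud = 284_807, 492
y_true = np.array([0] * (n_total - n_fraud) + [1] * n_fraud)

# A do-nothing model: always predicts "not fraud" and gives every row the same score
y_pred = np.zeros(n_total, dtype=int)
y_score = np.zeros(n_total)

accuracy = accuracy_score(y_true, y_pred)
avg_precision = average_precision_score(y_true, y_score)

print(f"Accuracy: {accuracy:.4f}")                 # Accuracy: 0.9983
print(f"Average precision: {avg_precision:.4f}")   # Average precision: 0.0017
```

Accuracy rewards the model for the 99.83% of transactions that are easy, while average precision falls to the fraud rate itself, which exposes that the model has learned nothing about fraud.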

Model Deployment

The final step in the machine learning pipeline is to deploy the selected model to make predictions on new data. In this step, we will use the selected model to make predictions on the test dataset and evaluate its performance using classification metrics.

We will use the predict method of the trained model to make predictions on the test data, and then evaluate the model’s performance using the accuracy_score, precision_score, recall_score, and f1_score metrics from the sklearn.metrics module.

The code for this step is as follows:

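from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumption: the listing never shows how rf_model was built, so a default
# random forest is fitted here as a stand-in; the author's own model, and
# therefore the exact numbers printed below, may differ
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# make predictions on the test set
y_pred = rf_model.predict(X_test)

# evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# print the classification metrics
print(f"Accuracy: {acc}")
print(f"Precision: {prec}")
print(f"Recall: {rec}")
print(f"F1 Score: {f1}")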

Output

Accuracy: 0.9995669627705019
Precision: 0.9090909090909091
Recall: 0.8088235294117647
F1 Score: 0.8560311284046692

In this code, we first use the predict method of the trained rf_model to make predictions on the test set X_test. We then evaluate the model’s performance using the accuracy_score, precision_score, recall_score, and f1_score metrics. Finally, we print the classification metrics to the console. Keep in mind that accuracy is dominated by the non-fraud majority, so precision, recall, and the F1 score are the more informative numbers for this dataset.

Note that we have imported the required metrics from the sklearn.metrics module. These metrics help us to evaluate the performance of the model and make informed decisions about its suitability for deployment.
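The four reported numbers can be traced back to a confusion matrix. The counts below are inferred, not given in the article: a 30% split of 284,807 rows yields 85,443 test rows, and 110 true positives, 11 false positives, and 26 false negatives reproduce the printed metrics. Treat them as one consistent reading of the output, worked through as a sketch.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Inferred confusion-matrix counts (an assumption, see the paragraph above)
tp, fp, fn = 110, 11, 26
tn = 85_443 - tp - fp - fn

# Build label arrays that realise exactly these counts
y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

accuracy = accuracy_score(y_true, y_pred)     # (tp + tn) / total
precision = precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = recall_score(y_true, y_pred)         # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                 # 2 * tp / (2 * tp + fp + fn)

print(accuracy, precision, recall, f1)
```

Read this way, the model misses 26 of 136 frauds in the test set and raises 11 false alarms, which is easier to reason about than an accuracy of 99.96%.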

Conclusion

In this article, we have discussed the concept of anomaly detection and various algorithms that can be used to detect anomalies in a dataset. We have also implemented some of these algorithms in Python and applied them to a credit card fraud dataset to detect anomalies. It is important to note that the choice of algorithm and the preprocessing techniques depend on the nature of the data and the problem at hand.

Overall, anomaly detection is a powerful tool that can provide valuable insights and help detect abnormalities in various datasets. As the amount of data continues to grow, the need for effective anomaly detection techniques becomes increasingly important.

Key Takeaways for Anomaly Detection in credit card fraud

Anomaly detection is used to detect unusual data points or patterns in a dataset and can be applied in various fields such as finance, healthcare, and cybersecurity.
The choice of algorithm and preprocessing techniques should be based on the nature of the data and the problem at hand.
The isolation forest algorithm is an ensemble of randomly built isolation trees. It is effective in detecting point anomalies and can be a suitable option for anomaly detection in some cases.
Preprocessing techniques such as scaling and feature selection can improve the accuracy of the model. They should be considered when implementing anomaly detection.
As the amount of data continues to grow, stay up-to-date with the latest algorithms and techniques to improve the accuracy and effectiveness of anomaly detection methods.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Sanket Sarwade

I am Sanket Sarwade, a tech content enthusiast, who avidly explores AI, machine learning, generative AI, deep learning, blockchain, and emerging tools. As a data scientist, I'm driven to share my insights and make intricate concepts accessible through my writing. Join me on a journey of tech exploration and discovery.
