Anomaly Detection in Credit Card Fraud

Sanket Sarwade 12 May, 2023


Fraud detection | Anomaly Detection | Credit Card Fraud

Introduction

We live in a world that’s filled with data. Think about it: you probably use multiple online accounts every day, from email to social media to online shopping. But have you ever had a moment where you were using one of these accounts and received a notification on your phone asking if it’s really you trying to access your account? That check is anomaly detection at work, and the same idea is used to catch credit card fraud.

For example, let’s say you’re at work and you accidentally mistype your password while trying to log in to your Google account. Suddenly, you receive a message on your phone from Google asking if the login attempt is really you. This might make you wonder how Google is able to know that it’s not actually you trying to log in, especially if it’s just a simple mistake.

Anomaly detection is a way of saying that computers can learn the usual patterns in data and notice when something is out of the ordinary. In the case of Google’s login security, machine learning algorithms build a “normal” pattern of your login behavior. This means that they learn what kind of device you usually use, what time of day you usually log in, and other similar details that make up your typical login behavior. When some activity does not fit that pattern, Google’s system recognizes it as unusual and flags it as potentially suspicious. It’s kind of like how your bank might call you if they see a purchase on your account from a different country: they’re just making sure that it’s really you and not someone else using your account.

Learning Objectives

Understanding the importance of data in today’s world and how it is used in various online accounts.
Learning about anomaly detection and its role in identifying unusual behavior in online security systems.
Understanding how machine learning algorithms work to identify normal patterns of login behavior and detect suspicious activities.
Understanding the significance of login security and how it can be used to protect online accounts from unauthorized access.
Understanding the similarities between Google’s login security and the way banks protect their customers’ accounts.
Understanding the importance of being vigilant and cautious while using online accounts to ensure their safety and security.

This article was published as a part of the Data Science Blogathon.

How Google Uses Anomaly Detection to Keep Your Account Secure

Google collects data from various sources such as browser cookies, device fingerprints, IP addresses, and user account information. They use this data to create a baseline of your normal login behavior, including your usual device, location, and time of day for logging in to your Google account. This baseline is then used to train machine learning algorithms that can detect unusual login patterns. These algorithms analyze data points such as the device used to log in, the location, the time of day, and your behavior after logging in, to identify anomalies that deviate significantly from your normal behavior.

For instance, if someone tries to log in to your Google account from a different device, location, or time of day, the algorithm may flag this as a potentially unauthorized login attempt. Similarly, if your behavior after logging in is significantly different from your normal behavior, the algorithm may flag this as a potential security threat.
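To make the idea of a baseline concrete, here is a toy sketch. It is not Google’s actual system: the login history, the function name is_unusual, and the threshold of 3 standard deviations are all assumptions made up for illustration. It simply flags a login whose hour is far from the user’s usual login hours.

```python
from statistics import mean, stdev

# Hypothetical history of login hours (24-hour clock) for one user
usual_login_hours = [8, 9, 9, 10, 8, 9, 10, 9, 8, 9]

def is_unusual(hour, history, threshold=3.0):
    """Return True if `hour` lies more than `threshold` standard
    deviations away from the mean of the user's login history."""
    mu = mean(history)
    sigma = stdev(history)
    z_score = abs(hour - mu) / sigma
    return z_score > threshold

print(is_unusual(9, usual_login_hours))  # a typical morning login
print(is_unusual(3, usual_login_hours))  # a 3 a.m. login
```

A real system would combine many signals (device, location, behavior after login) and would treat the hour as circular, so that 23:00 and 01:00 count as close. This sketch ignores both points to keep the core idea visible.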

Google’s anomaly detection algorithms are constantly evolving and improving as they learn from new data. They use a feedback loop to continuously train and update their algorithms based on new login data and user feedback.

To accomplish this, Google needs a large server infrastructure to collect, store, and process the vast amounts of data generated by their users. They also need powerful machine learning algorithms to analyze the data and detect anomalies accurately.

Project Description

Anomaly detection is widely used across various fields, industries, and platforms. One such application of anomaly detection is in detecting credit card fraud. Our project will focus on implementing anomaly detection techniques to identify potentially fraudulent credit card transactions.

We will use a dataset of credit card transactions to establish a baseline of normal spending behavior. Then we will apply machine learning algorithms to identify any transactions that deviate significantly from this baseline and flag them as potentially fraudulent.

We will also explore various techniques to improve the accuracy of our anomaly detection model, such as feature engineering, data preprocessing, and hyperparameter tuning. Our ultimate goal is to create a model that can accurately identify fraudulent transactions and minimize false positives, so that cardholders can be alerted in a timely manner and take appropriate actions to protect their accounts.

Here is the link to the GitHub repository that contains the credit card fraud detection project.

You can download the credit card fraud dataset used in the project from the Kaggle website. Here is the link to the dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud

Anomaly Detection Using Python

In this section, we will provide an end-to-end guide to implementing anomaly detection using Python. We will use the Credit Card Fraud Detection dataset from Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders. The dataset contains 284,807 transactions, out of which 492 are frauds. The dataset is highly imbalanced, with fraud transactions accounting for only 0.17% of the total transactions.
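A quick calculation with the two counts quoted above shows how skewed the classes are. It also shows why accuracy alone can look impressive on this dataset: a model that labels every transaction as legitimate would still be right almost all of the time. The variable names below are just for illustration.

```python
# Counts quoted in the dataset description
n_total = 284_807
n_fraud = 492

# Share of fraudulent transactions
fraud_rate = n_fraud / n_total
print(f"Fraud rate: {fraud_rate:.4%}")  # 0.1727%

# Accuracy of a trivial model that predicts "non-fraud" for every transaction
baseline_accuracy = 1 - fraud_rate
print(f"Accuracy of always predicting non-fraud: {baseline_accuracy:.4%}")
```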

Data Preparation

The first step is to prepare the data for anomaly detection. We will start by importing the necessary libraries and loading the dataset into a pandas DataFrame.

import pandas as pd

# Load the dataset
df = pd.read_csv("creditcard.csv")

# Check the shape of the dataset
print("Shape of the dataset:", df.shape)

# Check the first few rows of the dataset
print(df.head())

Output :

Shape of the dataset: (284807, 31)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26       V27       V28  Amount  Class  
0 -0.189115  0.133558 -0.021053  149.62      0  
1  0.125895 -0.008983  0.014724    2.69      0  
2 -0.139097 -0.055353 -0.059752  378.66      0  
3 -0.221929  0.062723  0.061458  123.50      0  
4  0.502292  0.219422  0.215153   69.99      0  

[5 rows x 31 columns]

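The dataset contains 31 columns: Time, Amount, Class, and 28 anonymized features named V1 to V28, which are the result of a PCA transformation applied before the data was published. The Class column indicates whether a transaction is fraudulent or not, where 1 indicates fraud and 0 indicates non-fraud.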

Data Preprocessing

Before applying any anomaly detection algorithm, it is essential to preprocess the data to ensure that it is in a suitable format for the algorithm. Here are some steps that we can follow to preprocess the dataset:

Handling Missing Values

Missing values can affect the performance of the anomaly detection algorithm. Therefore, it is essential to check whether there are any missing values in the dataset and take appropriate action.

# Count the missing values across all columns of the dataset
print(df.isnull().sum().sum())

Output :

0

The output shows that there are no missing values in the dataset.
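This dataset needs no further action, but for a dataset that does contain gaps, one common option is to fill each numeric column with its median. The sketch below uses a small made-up DataFrame (toy) rather than the credit card data, and median imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
toy = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 1000.0],
    "V1": [0.1, 0.2, np.nan, 0.4],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fill each column's gaps with that column's median
toy_filled = toy.fillna(toy.median(numeric_only=True))
print(toy_filled)
```

The median is used here instead of the mean because a few very large amounts would pull the mean upward.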

Scaling the Data

Anomaly detection algorithms can be sensitive to the scale of the data. Therefore, it is important to scale the data before applying the algorithm. We can use the StandardScaler class from the sklearn.preprocessing module to scale the data.

from sklearn.preprocessing import StandardScaler

# Scale the Amount column
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))

# Scale the Time column
df['Time'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))

# Check the first few rows of the dataset after scaling
print(df.head())

Output :

       Time        V1        V2        V3        V4        V5        V6  \
0 -1.996583 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388   
1 -1.996583  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361   
2 -1.996562 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499   
3 -1.996562 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203   
4 -1.996541 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921   

         V7        V8        V9  ...       V21       V22       V23       V24  \
0  0.239599  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928   
1 -0.078803  0.085102 -0.255425  ...
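To see what StandardScaler does, the sketch below applies it to the first five Amount values shown earlier and compares the result with the formula (x - mean) / standard deviation computed by hand. The scaled numbers differ from the output above because only five rows are used here instead of the full column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Amount values from the dataset preview
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99]).reshape(-1, 1)

# Scaling with scikit-learn
scaled = StandardScaler().fit_transform(amounts)

# The same transformation by hand (population standard deviation, ddof=0)
manual = (amounts - amounts.mean()) / amounts.std()

print(np.allclose(scaled, manual))
print("Mean after scaling:", scaled.mean())
print("Std after scaling:", scaled.std())
```

In a supervised setting it is usually safer to fit the scaler on the training split only and then apply it to the test split, so that no information from the test data leaks into preprocessing.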

Anomaly Detection Algorithms

There are various anomaly detection algorithms available. In this section, we will discuss some popular algorithms along with their implementation in Python.

Isolation Forest

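Isolation Forest is a popular algorithm for anomaly detection that is based on the concept of decision trees. It builds an ensemble of random trees, each of which repeatedly splits the data on a randomly chosen feature and split value. Anomalies are few and different, so they tend to be isolated after only a few splits. Points with a short average path length across the trees are therefore scored as anomalies.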

Let’s implement the Isolation Forest algorithm on our credit card fraud dataset.

from sklearn.ensemble import IsolationForest

# Use only the feature columns so the Class label does not leak into the detector
features = df.drop('Class', axis=1)

# Create the Isolation Forest object
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.01,
                      max_features=1.0, random_state=42)

# Fit the data and tag the outliers
clf.fit(features)

# Get the predictions as a 1D array (-1 = outlier, 1 = inlier)
y_pred = clf.predict(features)

# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())

Output :

Number of outliers: 2848

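The Isolation Forest algorithm has flagged 2848 transactions as anomalies. This is about 1% of the 284,807 rows, which follows directly from contamination=0.01: the parameter sets the score threshold so that roughly that share of the data is labeled as outliers. The count by itself does not say how many of the flagged transactions are actual frauds.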
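Since the number of flagged points is driven by the contamination setting, a more informative check is to compare the flags with known labels. The sketch below does this on synthetic data (1000 normal points and 10 scattered outliers, all made up for illustration), not on the credit card dataset. The same comparison can be made against the Class column.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)

# Synthetic stand-in for real data: a dense cloud plus a few far-away points
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
signs = rng.choice([-1.0, 1.0], size=(10, 2))
outliers = signs * rng.uniform(6.0, 10.0, size=(10, 2))
X_toy = np.vstack([normal, outliers])
y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])

# Fit the detector and convert its output (-1 = anomaly, 1 = normal) to 0/1 flags
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
flags = forest.fit_predict(X_toy)
y_flag = (flags == -1).astype(int)

# Compare the flags with the known labels
precision = precision_score(y_true, y_flag)
recall = recall_score(y_true, y_flag)
print("Flagged:", int(y_flag.sum()))
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

On real fraud data the separation is far less clean than in this toy setup, so much lower precision and recall should be expected.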

Local Outlier Factor

Local Outlier Factor (LOF) is another popular algorithm for anomaly detection that is based on the concept of local density. It works by calculating the density of a data point relative to its neighbors and identifying points that have a much lower density than their neighbors as outliers.

Let’s implement the LOF algorithm on our credit card fraud dataset.

from sklearn.neighbors import LocalOutlierFactor

# Use only the feature columns so the Class label does not leak into the detector
features = df.drop('Class', axis=1)

# Create the LOF object
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.01)

# Fit the data and tag the outliers (-1 = outlier, 1 = inlier); the result is a 1D array
y_pred = clf.fit_predict(features)

# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())

Output :

Number of outliers: 2848

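The LOF algorithm has also flagged 2848 transactions, the same count as the Isolation Forest algorithm. The match is expected because both models use contamination=0.01. It does not mean that the two algorithms flagged the same transactions.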
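The local density idea is easier to see on a tiny example. The sketch below uses made-up data: a tight cluster of 200 points and one isolated point placed far away. After fitting, the negative_outlier_factor_ attribute holds a score for every point, and the most negative score belongs to the point whose density is lowest relative to its neighbors.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# A dense cluster and one isolated point (index 200)
dense = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
lonely = np.array([[4.0, 4.0]])
X_toy = np.vstack([dense, lonely])

# Fit LOF and label each point (-1 = outlier, 1 = inlier)
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_toy)

# Lower (more negative) scores mean more anomalous points
scores = lof.negative_outlier_factor_
most_anomalous = int(np.argmin(scores))
print("Most anomalous index:", most_anomalous)
print("Label of the isolated point:", labels[200])
```

Note that with its default settings LocalOutlierFactor only labels the data it was fitted on. Scoring unseen transactions requires creating it with novelty=True.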

One-class SVM

One-class SVM is another popular algorithm for anomaly detection that is based on the concept of maximum-margin hyperplanes. It learns a boundary around the bulk of the data (in the kernel feature space, a hyperplane that separates the data from the origin with maximum margin) and identifies points that fall outside this boundary as anomalies.

Let’s implement the One-class SVM algorithm on our credit card fraud dataset.

from sklearn.svm import OneClassSVM

# Use only the feature columns so the Class label does not leak into the detector
features = df.drop('Class', axis=1)

# Create the One-class SVM object
clf = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.01)

# Fit the data and tag the outliers
clf.fit(features)

# Get the predictions as a 1D array (-1 = outlier, 1 = inlier)
y_pred = clf.predict(features)

# Print the number of outliers
print("Number of outliers:", (y_pred == -1).sum())

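Output :

Number of outliers: 492

In the run reported here, the One-class SVM algorithm has detected 492 anomalies in the dataset. This count happens to equal the number of labeled frauds, but a matching count does not by itself show that the flagged transactions are the fraudulent ones. That requires comparing the predictions with the Class column.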
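The nu parameter has a strong influence on how many points a One-class SVM flags: it acts roughly as an upper bound on the fraction of training points treated as outliers. The sketch below illustrates this on made-up Gaussian data (not the credit card dataset) with gamma="scale", which is an assumption chosen for the toy data rather than the setting used above.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Synthetic data: 2000 points from a standard normal distribution
X_toy = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))

# Fit one model per nu value and record the share of points flagged as outliers
flagged = {}
for nu in (0.01, 0.05, 0.10):
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_toy)
    flagged[nu] = float((model.predict(X_toy) == -1).mean())
    print(f"nu={nu:.2f} -> flagged fraction {flagged[nu]:.3f}")
```

The flagged fraction grows with nu and usually stays close to it, although numerical tolerance in the solver can push it slightly above the nominal value.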

Evaluation and Model Selection

In the code below, we evaluate supervised classifiers with cross-validation and select the best-performing configuration of each. The dataset is first split into training and test sets. GridSearchCV then runs stratified 5-fold cross-validation on the training set, so the proportion of fraud cases is the same in each fold, for three models: Logistic Regression, Decision Tree, and Random Forest. We use the average precision score as the selection metric because it is a suitable metric for imbalanced datasets. Each tuned model is then evaluated on the held-out test set.

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Define X and y
X = df.drop('Class', axis=1)
y = df['Class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a list of classifiers to evaluate
# (the liblinear solver supports both the l1 and l2 penalties searched below)
classifiers = [LogisticRegression(solver='liblinear'),
               DecisionTreeClassifier(random_state=42),
               RandomForestClassifier(random_state=42)]

# Create parameter grids for each classifier, in the same order as the list above
lr_params = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]}
dt_params = {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 7]}
rf_params = {'n_estimators': [100, 300, 500], 'max_depth': [3, 5, 7]}
# knn_params is only used if KNeighborsClassifier() is appended to classifiers
knn_params = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
param_grids = [lr_params, dt_params, rf_params, knn_params]

# Stratified folds keep the fraud ratio the same in every fold
cv = StratifiedKFold(n_splits=5)

# Loop over classifiers and parameter grids to find the best model
for i, classifier in enumerate(classifiers):
    clf = GridSearchCV(classifier, param_grids[i], cv=cv,
                       scoring='average_precision', n_jobs=-1)
    clf.fit(X_train, y_train)
    print(classifier.__class__.__name__)
    print(clf.best_params_)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))
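The reason for preferring average precision over accuracy on imbalanced data can be shown with a deliberately useless model. In the made-up example below, 5 of 1000 transactions are fraudulent and the model gives every transaction the same score, so it never identifies a single fraud.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# 995 legitimate transactions and 5 frauds (hypothetical labels)
y_true = np.array([0] * 995 + [1] * 5)

# A useless model: the same fraud score for everyone, and no fraud ever predicted
useless_scores = np.zeros(1000)
useless_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, useless_pred)
ap = average_precision_score(y_true, useless_scores)

print(f"Accuracy: {acc:.3f}")           # 0.995
print(f"Average precision: {ap:.3f}")   # 0.005
```

Accuracy rewards the model for the 995 easy cases, while average precision falls to the share of frauds in the data, which is what a model with no ranking ability should get.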

Model Deployment

The final step in the machine learning pipeline is to deploy the selected model to make predictions on new data. In this step, we will use the selected model to make predictions on the test dataset and evaluate its performance using classification metrics.

We will use the predict method of the trained model to make predictions on the test data, and then evaluate the model’s performance using the accuracy_score, precision_score, recall_score, and f1_score metrics from the sklearn.metrics module.

The code for this step is as follows:

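from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumption: the article does not show how rf_model was built. Here it is a
# random forest fitted on the training split with illustrative hyperparameters,
# so the metrics may differ slightly from the output reported below.
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# make predictions on the test set
y_pred = rf_model.predict(X_test)

# evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# print the classification metrics
print(f"Accuracy: {acc}")
print(f"Precision: {prec}")
print(f"Recall: {rec}")
print(f"F1 Score: {f1}")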

Output

Accuracy: 0.9995669627705019
Precision: 0.9090909090909091
Recall: 0.8088235294117647
F1 Score: 0.8560311284046692

In this code, we first use the predict method of the trained rf_model to make predictions on the test set X_test. We then compute accuracy, precision, recall, and F1 score with the functions of the same names from the sklearn.metrics module and print them to the console.

Because only about 0.17% of the transactions are fraudulent, accuracy is dominated by the majority class. Precision, recall, and F1 score are therefore the more informative figures when deciding whether the model is suitable for deployment.
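The four reported numbers can be tied back to the underlying counts. The confusion-matrix counts below are not given in the article. They are inferred here as an assumption, because they are consistent with the reported metrics and with a test set of 85,443 rows (30% of 284,807, rounded up): 110 frauds caught, 11 false alarms, and 26 frauds missed.

```python
# Inferred counts (an assumption, not stated in the article)
tp, fp, fn = 110, 11, 26   # true positives, false positives, false negatives
n_test = 85_443            # assumed size of the 30% test split
tn = n_test - tp - fp - fn # true negatives

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / n_test

# Accuracy of a model that never predicts fraud on the same test set
always_legit_accuracy = (n_test - (tp + fn)) / n_test

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.6f}")
print(f"Accuracy with no fraud predicted: {always_legit_accuracy:.6f}")
```

Under these assumed counts, the random forest's accuracy is only marginally higher than that of a model that never predicts fraud, while recall shows that about one in five frauds is still missed.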

Conclusion

In this article, we have discussed the concept of anomaly detection and various algorithms that can be used to detect anomalies in a dataset. We have also implemented some of these algorithms in Python and applied them to a credit card fraud dataset to detect anomalies. It is important to note that the choice of algorithm and the preprocessing techniques depend on the nature of the data and the problem at hand.

Overall, anomaly detection is a powerful tool that can provide valuable insights and help detect abnormalities in various datasets. As the amount of data continues to grow, the need for effective anomaly detection techniques becomes increasingly important.

Key Takeaways for Anomaly Detection in credit card fraud

Anomaly detection is used to detect unusual data points or patterns in a dataset and can be applied in various fields such as finance, healthcare, and cybersecurity.
The choice of algorithm and preprocessing techniques should be based on the nature of the data and the problem at hand.
The Isolation Forest algorithm is built on an ensemble of random isolation trees. It is effective in detecting point anomalies and can be a suitable option for anomaly detection in some cases.
Preprocessing techniques such as scaling and feature selection can improve the accuracy of the model. They should be considered when implementing anomaly detection.
As the amount of data continues to grow, stay up-to-date with the latest algorithms and techniques to improve the accuracy and effectiveness of anomaly detection methods.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.