Ananya Manjunath — Updated On June 13th, 2023


Breast cancer is a serious medical condition that affects millions of women worldwide. Although diagnosis and treatment have improved, detecting breast cancer at an early stage remains difficult. Anomaly detection can help identify subtle yet important patterns in breast cancer data that might not be visible to the naked eye. More accurate screening means earlier treatment, which can save lives. In an era of increasingly data-driven healthcare, anomaly detection is a promising tool that could change how we approach breast cancer screening and treatment.

Learning Objectives

In this article, we will do the following:

  1. We will explore the data and identify any potential anomalies.
  2. We will create visualizations to understand the data and its abnormalities in a better way.
  3. We will train and build a model to detect any abnormal data points.
  4. We will analyze and interpret our results to draw meaningful conclusions about Breast Cancer.

This article was published as a part of the Data Science Blogathon.

What is Breast Cancer?

Breast cancer occurs when cells in the breast grow uncontrollably. It can start in different parts of the breast and can metastasize, spreading through blood vessels and lymph vessels to other areas of the body.

Why is Early Detection of Breast Cancer Crucial?

When cancer symptoms are ignored or treatment is delayed, the chance of survival drops. At later stages, treatment may be less effective, complications are more likely, and healthcare costs rise. Early treatment improves the odds of overcoming the cancer, so it is important to detect and treat it at the earliest possible stage.

What are the Types of Breast Cancer?

There are several types of breast cancer, and some of them are:

  • IDC (Invasive Ductal Carcinoma)
  • ILC (Invasive Lobular Carcinoma)
  • IBC (Inflammatory Breast Cancer)
  • TNBC (Triple Negative Breast Cancer)
  • MBC (Metastatic Breast Cancer)
  • DCIS (Ductal Carcinoma In Situ)
  • LCIS (Lobular Carcinoma In Situ)

Symptoms of Breast Cancer

  • New lumps in the breast or underarm.
  • Swelling of all or part of the breast.
  • Irritation of the skin around the breast.
  • Dry skin near the nipple or on the breast.
  • Pain in the breast area.

Diagnosis for Breast Cancer

For the diagnosis of breast cancer, the following is done:

  • Examination of the Breast: The doctor checks both breasts for lumps or any other abnormalities.
  • X-ray of the Breast: An X-ray of the breast is called a mammogram. Mammograms are generally used for breast cancer screening. If the X-ray shows any abnormalities, the doctor recommends further tests or procedures.
  • Ultrasound of Breast: A breast ultrasound is done to check whether a lump is a solid mass or a fluid-filled cyst.
  • Sample Collection: This process is called a biopsy. A sample of the lump is taken from the affected area using a specialized needle device that extracts a core of tissue.

Best Methods of Detecting Breast Cancer

Mammography is one of the best ways to screen for breast cancer, and a biopsy is used to confirm a diagnosis. MRI (magnetic resonance imaging) is another effective method, typically used for people at high risk of breast cancer.

How can we Detect Breast Cancer Using Machine Learning?

We can use many machine learning algorithms to detect breast cancer, including SVMs, decision trees, and neural networks.

Using these algorithms, we can flag cancer at an early stage, which helps slow the spread of the disease and increases the probability of saving the patient's life.

Understanding the Data and Problem Statement

The data set used for this project is sourced from the UCI Machine Learning Repository and contains 569 instances with 30 attributes each. Interested readers can download the data set from the repository. Alternatively, the data set is available in the scikit-learn library, a popular machine-learning library for Python. By working through this blog, readers will gain a better understanding of the complexities involved in detecting anomalies in breast cancer data and how to effectively use the data set for machine learning purposes.
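As a minimal sketch of the scikit-learn route mentioned above, the data can be loaded with `load_breast_cancer`. Note that this copy uses column names with spaces (for example `mean radius` rather than `radius_mean`) and encodes the target as 0 = malignant, 1 = benign, which is the opposite of the encoding used later in this article. The variable names below are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
data = load_breast_cancer(as_frame=True)
features = data.data    # DataFrame with 30 numeric feature columns
target = data.target    # Series: 0 = malignant, 1 = benign

print(features.shape)                # (569, 30)
print(data.target_names.tolist())    # ['malignant', 'benign']
```

The rest of this article reads the CSV version instead, so the column names in later steps follow the CSV naming.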

Problem Statement – Breast Cancer Anomaly Detection

The aim of the project is to understand the data and find irregular occurrences of breast cancer in it. We will use the Isolation Forest algorithm from Python's scikit-learn library to build and train a model that finds the anomalous data points in the dataset.

Finally, we will study and interpret our results to draw meaningful conclusions from the data.

The Pipeline of the Project

The project pipeline includes the following steps:

  • Importing the Libraries
  • Loading the dataset
  • Exploratory Data Analysis
  • Preprocessing of the data
  • Visualizing the data
  • Splitting of data into training and testing data set
  • Predicting anomalies using IsolationForest
  • Predicting anomalies using LocalOutlierFactor

Step-1: Importing the Libraries

import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

Step-2: Loading and Reading the Dataset

df = pd.read_csv('data.csv')



Step-3: Exploratory Data Analysis

3.1: Fetching the top 5 records in the data

df.head()

3.2: Listing the columns in the dataset

df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

3.3: Finding the length of data

print('length of data is', len(df))


length of data is 569

3.4: Getting the shape of the data

df.shape

(569, 33)

3.5: Information on the data

df.info()

3.6: Datatypes of the columns

df.dtypes

3.7: Finding whether the dataset has null values

df.isnull().sum()


3.8: Number of rows and columns in the dataset

print('Count of columns in the data is: ', len(df.columns))
print('Count of rows in the data is: ', len(df))


Count of columns in the data is: 31

Count of rows in the data is: 569
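
The shape reported in step 3.4 has 33 columns, while the count here is 31. The step that removes two columns is not shown. A likely candidate, and this is an assumption, is dropping the `id` column and the empty `Unnamed: 32` column. A sketch on a small stand-in frame (the values are placeholders; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data
toy = pd.DataFrame({
    'id': [101, 102, 103],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.99, 13.54, 12.45],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Drop the identifier and the all-NaN column; neither carries signal
cleaned = toy.drop(columns=['id', 'Unnamed: 32'])
print(cleaned.columns.tolist())   # ['diagnosis', 'radius_mean']
```

Applied to the real frame, the same call would take it from 33 columns to 31.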

3.9: Checking for unique values of diagnosis

df['diagnosis'].unique()

array([1, 0])

3.10: Counting the diagnosis values

df['diagnosis'].value_counts()


Step-4: Preprocessing of the Data

4.1: Handling Missing values:

Handling missing values is one of the most important preprocessing steps when a dataset contains them. Missing values can cause errors in the program or distort the model's results. There are many techniques for dealing with them, depending on the nature of the data.

No single technique is suitable in every case. Sometimes we drop a row or column, for example when only a few values are missing, when most of them are missing, or when the column is irrelevant to the problem and not useful for building a model. We will use the isnull() function to find the missing values.

def null_values(data):
    # Count missing values per column and keep only columns that have any
    counts = data.isnull().sum()
    return counts[counts > 0]

null_values(df)


Series([], dtype: int64)

All values in the data are present.

4.2: Encoding the data:

In the data pre-processing phase, the next step is encoding the data into a form suitable for model building. This involves converting categorical variables into numerical form (changing the data type of the variable from object to int64), scaling the data into a standard range, or applying any other transformations needed to create a clean dataset. In this project-based blog, we will use the LabelEncoder class from the sklearn.preprocessing module to convert the categorical variable into a numerical one so that we can use it in training the model.

Encoding also matters for visualization. Many plots cannot use a categorical variable because they are based on numerical calculations. Although we use LabelEncoder in this project-based blog, other methods such as one-hot encoding or binary encoding may suit a model better.

Scaling the data to a standard range helps ensure that the variables are weighted equally and that the model is not biased towards one particular feature. This can be achieved using methods such as standardization or normalization.
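
As an illustration of standardization, here is a minimal sketch using scikit-learn's StandardScaler on made-up values. Scaling is not applied elsewhere in this article, so treat it as optional. Isolation Forest splits on one feature at a time and is largely insensitive to feature scale, while the distance-based Local Outlier Factor used later is sensitive to it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (values are made up)
values = np.array([[10.0, 500.0],
                   [12.0, 700.0],
                   [14.0, 900.0]])

# Standardization: subtract each column's mean, divide by its standard deviation
scaler = StandardScaler()
scaled = scaler.fit_transform(values)

print(scaled.mean(axis=0))   # approximately 0 for each column
print(scaled.std(axis=0))    # approximately 1 for each column
```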

In the code below, we first import LabelEncoder from sklearn.preprocessing and create an instance of it. We then call its fit_transform function to transform the specified variable into a numerical data type.

from sklearn.preprocessing import LabelEncoder
df['diagnosis'] = LabelEncoder().fit_transform(df['diagnosis'])
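
To see what the encoder does to the labels, here is a minimal sketch on a toy frame. The frame and variable names are illustrative, and we assume the raw column holds 'M' and 'B' strings. LabelEncoder assigns integers in sorted label order, so 'B' becomes 0 and 'M' becomes 1, which matches the "0: benign, 1: malignant" convention used in the plots later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data
toy = pd.DataFrame({'diagnosis': ['M', 'B', 'B', 'M']})

le = LabelEncoder()
toy['diagnosis'] = le.fit_transform(toy['diagnosis'])

print(le.classes_.tolist())        # ['B', 'M']
print(toy['diagnosis'].tolist())   # [1, 0, 0, 1]
```

Keeping the fitted encoder around is useful because `le.inverse_transform` can map the integers back to the original labels.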



Step-5: Visualizing the data

To understand the data and its anomalies better, we will try different types of visualizations, such as scatter plots, histograms, and box plots. These help us identify outliers and patterns that are not apparent in the raw data, which in turn helps us construct an effective anomaly detection model.

In addition, techniques such as clustering or regression analysis can be used to analyze the data further. Overall, our main objective is to build a reliable model that accurately detects unusual or unexpected patterns in the data, so that potential issues can be found before they cause major harm.

#Number of Malignant(M) and Benign(B) cells

plt.figure(figsize=(8, 6))

sns.countplot(x='diagnosis', data=df, palette= ['#FFC0CB', '#ADD8E6'],  
            edgecolor='black', linewidth=1.5)

plt.title('Diagnosis Count', fontsize=20, fontweight='bold')
plt.xlabel('Diagnosis', fontsize=14)
plt.ylabel('Count', fontsize=14)

ax = plt.gca()

# Annotate each bar with its count
for patch in ax.patches:
    plt.text(x=patch.get_x() + 0.4, y=patch.get_height() + 2,
             s=str(int(patch.get_height())), fontsize=12)

plt.show()

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()

Kernel Density Estimation Plot showing the distribution of ‘radius_mean’ among benign and malignant tumors in a breast cancer dataset

def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    # fill=True replaces the deprecated shade=True in recent seaborn releases
    facet.map(sns.kdeplot, var, fill=True)
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()
    plt.show()

plot_distribution(df, var='radius_mean', target='diagnosis')



Scatter Plot showcasing the relationship between ‘radius_mean’ and ‘texture_mean’ in benign and malignant tumors of a breast cancer dataset.

def plot_scatter(df, var1, var2, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    # Draw one scatter layer per diagnosis class
    facet.map(plt.scatter, var1, var2, alpha=0.5)
    facet.add_legend()
    plt.show()

plot_scatter(df, var1='radius_mean', var2='texture_mean', target='diagnosis')


import plotly.express as px

fig = px.parallel_coordinates(
    df,
    dimensions=['radius_mean', 'texture_mean', 'perimeter_mean',
                'area_mean', 'smoothness_mean', 'compactness_mean',
                'concavity_mean', 'concave points_mean', 'symmetry_mean',
                'fractal_dimension_mean'],
    color='diagnosis',
    color_continuous_scale=px.colors.sequential.Plasma,
    labels={'radius_mean': 'Radius Mean', 'texture_mean': 'Texture Mean',
            'perimeter_mean': 'Perimeter Mean', 'area_mean': 'Area Mean',
            'smoothness_mean': 'Smoothness Mean', 'compactness_mean': 'Compactness Mean',
            'concavity_mean': 'Concavity Mean', 'concave points_mean': 'Concave Points Mean',
            'symmetry_mean': 'Symmetry Mean', 'fractal_dimension_mean': 'Fractal Dimension Mean'},
    title='Breast Cancer Diagnosis by Mean Characteristics')
fig.show()

Step-6: Model Development

The model is developed with Python's scikit-learn library. We use Isolation Forest, an unsupervised learning algorithm known for its effectiveness in anomaly detection. It builds an ensemble of isolation trees, each trained on a randomly selected subset of the data, and repeatedly splits on a randomly chosen feature and threshold. Outliers are detected based on average path length: points that are isolated after only a few splits are more likely to be anomalies.

Using this technique, we can identify outliers and patterns hidden in the data. Isolation Forest is a useful tool for anomaly detection in breast cancer data and has the potential to improve how we approach screening and treatment of this disease.
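
To make the path-length idea concrete, here is a small sketch on synthetic data (not the breast cancer data; all names and values are illustrative). One point is placed far from the rest, so the forest isolates it quickly and gives it a lower score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 200 synthetic inliers around the origin plus one obvious outlier
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])
points = np.vstack([inliers, outlier])

forest = IsolationForest(n_estimators=100, random_state=0).fit(points)

# score_samples: lower values correspond to shorter average paths (more anomalous)
scores = forest.score_samples(points)
labels = forest.predict(points)   # -1 = outlier, 1 = inlier

print(scores[-1] < scores[:-1].mean())   # True
print(labels[-1])                        # -1
```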

6.1: Splitting the data into features and target

from sklearn.feature_selection import SelectKBest, f_classif
# Split the data into features and target
X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']

6.2: Printing X and y values:

print(X)
print(y)



6.3: Performing feature selection using SelectKBest and f_classif

# Performing feature selection using SelectKBest and f_classif
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)




6.4: Get the indices of the selected features

# Getting the indices of the selected features
selected_indices = selector.get_support(indices=True)

6.5: Get the names of the selected features and print it

# Getting the names of the selected features
selected_features = X.columns[selected_indices].tolist()
# Printing the selected features
print(selected_features)


['perimeter_mean', 'concave points_mean', 'radius_worst', 'perimeter_worst', 'concave points_worst']
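
As a rough illustration of how SelectKBest with f_classif ranks features, the sketch below uses synthetic data with one column that shifts with the class and one that is pure noise. The column names and data are made up. Because f_classif uses the labels, this selection step is supervised even though the anomaly detectors that follow are not.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
labels = np.array([0] * 50 + [1] * 50)
toy = pd.DataFrame({
    'informative': labels * 3.0 + rng.normal(size=100),  # shifts with the class
    'noise': rng.normal(size=100),                        # unrelated to the class
})

# Keep the single feature with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=1).fit(toy, labels)
chosen = toy.columns[selector.get_support()].tolist()
print(chosen)   # ['informative']
```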

Step-7: Splitting of data into training and testing data set

x = df[selected_features]
y = df['diagnosis']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
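
The split above is not seeded, so the rows that land in the test set, and therefore the results below, change from run to run. A minimal sketch of a reproducible, class-balanced split on toy data follows. The `random_state` value and the use of `stratify` are our suggestions, not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, 6 benign (0) and 4 malignant (1)
features = np.arange(20).reshape(10, 2)
labels = np.array([0] * 6 + [1] * 4)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

print(X_tr.shape, X_te.shape)   # (7, 2) (3, 2)
```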

Step-8: Predicting anomalies using IsolationForest

8.1: Fit an Isolation Forest model on the training data

from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
# Fit an Isolation Forest model on the training data
clf = IsolationForest(n_estimators=100, max_samples="auto", contamination="auto", random_state=42)
clf.fit(X_train)




8.2: Use the model to predict outliers in the test data

import numpy as np

# Using the model to predict outliers in the test data
y_pred = clf.predict(X_test)
y_pred = np.where(y_pred == -1, 1, 0)  # Convert -1 (outlier) to 1, and 1 (inlier) to 0
y_pred


array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0])
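
The code above imports classification_report but the shown steps never call it. One way it could be used is to compare the outlier flags with the diagnosis labels, as sketched below on made-up arrays. Treating "outlier" as "malignant" is an assumption: Isolation Forest never sees the labels, and malignant cases are not rare in this dataset, so the agreement may be modest.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration: 1 = malignant / outlier, 0 = benign / inlier
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_hat = np.array([1, 0, 0, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)   # [[4 1]
            #  [1 2]]
print(classification_report(y_true, y_hat, target_names=['benign', 'malignant']))
```

With the real data, `y_test` and `y_pred` would take the place of the two toy arrays.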

8.3: plotting the Outliers

# Plot the diagnosis labels of the predicted inliers and outliers
plt.hist(y_test[y_pred==0], bins=20, alpha=0.5, label="Inliers")
plt.hist(y_test[y_pred==1], bins=20, alpha=0.5, label="Outliers")
plt.xlabel("Diagnosis (0: benign, 1: malignant)")
plt.ylabel("Count")
plt.title("Outliers detected by Isolation Forest")
plt.legend()
plt.show()

Step-9: Predicting anomalies using LocalOutlierFactor

9.1: Predicting anomalies:

import plotly.graph_objs as go
from sklearn.neighbors import LocalOutlierFactor

model = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
# Predicting anomalies
y_pred1 = model.fit_predict(X)
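
Local Outlier Factor compares the local density around each point with the density around its neighbors. In its default mode it only offers fit_predict, which is why the code above scores the full X rather than a held-out test set. The sketch below, on synthetic data with illustrative names, shows the labels and the `negative_outlier_factor_` attribute, where more negative means more anomalous.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# 100 synthetic points in one cluster plus one point far away from it
cluster = rng.normal(0.0, 1.0, size=(100, 2))
far_point = np.array([[10.0, 10.0]])
points = np.vstack([cluster, far_point])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(points)        # -1 = anomaly, 1 = normal

scores = lof.negative_outlier_factor_   # more negative = more anomalous
print(labels[-1])                       # -1
print(int(scores.argmin()))             # 100, the index of the far point
```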

9.2: Creating scatter plot and adding legends to the annotations:

# Creating scatter plot
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=X.iloc[:, 0],
        y=X.iloc[:, 1],
        # Assumed styling: markers colored by the LOF label (-1 anomaly, 1 normal)
        mode='markers',
        marker=dict(color=y_pred1, colorscale='Viridis'),
        hovertemplate='Feature 1: %{x}<br>Feature 2: %{y}<extra></extra>'
    )
)
fig.update_layout(
    title='Local Outlier Factor Anomaly Detection',
    xaxis_title='Feature 1',
    yaxis_title='Feature 2'
)

# Add legend annotations
normal_points = go.Scatter(x=[], y=[], mode='markers', 
            marker=dict(color='yellow'), showlegend=True, name='Normal')
anomaly_points = go.Scatter(x=[], y=[], 
        mode='markers', marker=dict(color='darkviolet'), showlegend=True, name='Anomaly')
for i in range(len(X)):
    if y_pred1[i] == 1:
        normal_points['x'] += (X.iloc[i, 0],)
        normal_points['y'] += (X.iloc[i, 1],)
    else:
        anomaly_points['x'] += (X.iloc[i, 0],)
        anomaly_points['y'] += (X.iloc[i, 1],)

fig.add_trace(normal_points)
fig.add_trace(anomaly_points)
fig.show()


Conclusion

In this project-based blog, we looked at anomaly detection in breast cancer data. We used Python's scikit-learn library to build and train an Isolation Forest model, followed by a Local Outlier Factor model, to detect anomalous data points in the dataset. These models flagged outliers and unusual patterns in the data, which helped us draw conclusions about it.

By improving the accuracy of screening methods, we can potentially save many lives. Machine learning and data visualization techniques help us better understand the complexity involved in detecting anomalies in breast cancer data and bring us a step closer to more effective screening and treatment methods. Altogether, this project shows one practical approach to breast cancer data analysis and anomaly detection.

Key Takeaways

  • By using anomaly detection methods we can identify subtle yet essential patterns in breast cancer data.
  • By enhancing the accuracy of screening methods, we can save many lives and help defeat breast cancer.
  • The Isolation Forest algorithm is a powerful tool for anomaly detection in breast cancer data and has the potential to revolutionize the way we approach screening and treatment methods for this disease.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. How do you detect breast abnormalities?

A. Breast abnormalities can be detected through various methods, including regular self-examinations, clinical breast examinations by healthcare professionals, and imaging tests. More recently, AI technology has also been used for anomaly detection.

Q2. How is breast cancer detected?

A. Breast cancer can be detected through a combination of screening methods, such as mammograms, clinical breast examinations, and breast self-exams. These screenings can help identify any suspicious lumps, changes in breast size or shape, nipple discharge, or other abnormalities that may indicate the presence of breast cancer.

Q3. What are the 5 warning signs of breast cancer?

A. The 5 warning signs of breast cancer include a new lump or mass in the breast or underarm, changes in breast size or shape, nipple discharge or inversion, skin dimpling or puckering, and redness or thickening of the breast skin.

Q4. What is the name of the blood test for breast cancer?

A. There is no single standard blood test for diagnosing breast cancer. One approach is the circulating tumor DNA (ctDNA) test, which analyzes fragments of tumor DNA that circulate in the bloodstream, allowing for the detection of genetic mutations associated with breast cancer.
