Understanding Siamese Networks: A Comprehensive Introduction



Siamese networks offer an intriguing approach to classification, allowing accurate image categorization based on just one example. These networks employ a concept called Contrastive Loss to gauge the similarity between pairs of images within a dataset. Unlike traditional methods focusing on deciphering image content, Siamese networks concentrate on understanding the variations and resemblances among images. This distinctive learning method contributes to their resilience in limited-data scenarios, enhancing performance even without domain-specific knowledge.

This article delves into the fascinating realm of Signature Verification through the lens of Siamese Networks. We’ll guide you through creating a functional model using PyTorch, providing insights and practical implementation steps along the way.

Learning Objectives

  • Understand the concept of Siamese networks and their unique architecture involving twin subnetworks.
  • Differentiate between loss functions used in Siamese networks, including Binary Cross-Entropy Loss, Contrastive Loss, and Triplet Loss.
  • Identify and describe real-world applications where Siamese networks can be effectively used, such as facial recognition, fingerprint recognition, and text similarity assessment.
  • Summarize the advantages and disadvantages of Siamese networks regarding one-shot learning, versatility, and domain-agnostic performance.

This article was published as a part of the Data Science Blogathon.

What are Siamese Networks?

Siamese Networks belong to a category of networks that employ two identical subnetworks for one-shot classification. These subnetworks share the same setup, parameters, and weights while accommodating different inputs. A Siamese Network learns a similarity function, unlike conventional CNNs, which are trained on copious amounts of data to predict multiple classes. This function allows us to discern between classes using minimal data, rendering them particularly effective for one-shot classification. This unique ability means that, in many instances, a single example is sufficient for these networks to classify images accurately.

A real-world application of Siamese Networks is in face recognition and signature verification tasks. Imagine a company implementing an automated face-based attendance system. With just one image of each employee available, traditional CNNs would struggle to classify thousands of employees precisely. Enter the Siamese network, excelling in precisely this kind of scenario.
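To make the attendance scenario concrete, here is a minimal sketch of one-shot classification by nearest reference. It assumes a trained network has already turned each image into an embedding; the three-dimensional vectors, the employee names, and the helper `one_shot_classify` are hypothetical and only illustrate the idea.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_shot_classify(query, references):
    """Return the label whose single reference embedding is closest to the query."""
    return min(references, key=lambda name: euclidean(query, references[name]))

# Hypothetical embeddings: one enrolled image per employee.
references = {
    "employee_a": [0.9, 0.1, 0.0],
    "employee_b": [0.0, 0.8, 0.3],
    "employee_c": [0.2, 0.2, 0.9],
}
query = [0.1, 0.7, 0.4]  # embedding of a new photo (also hypothetical)

print(one_shot_classify(query, references))  # prints employee_b
```

Adding a new employee only requires adding one reference embedding to the dictionary; no retraining of a classifier head is involved in this sketch.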


Exploring Few-Shot Learning

In few-shot learning, models undergo training to make predictions based on a limited number of examples. This stands in contrast to the traditional approach, which demands a substantial volume of labeled data for training purposes. The significance of few-shot learning emerges when acquiring ample labeled data becomes challenging or expensive.

Few-shot models’ architecture leverages the nuances among a small handful of samples, allowing them to make predictions based on only a few or even a single example. Various design frameworks like Siamese Networks, Meta-learning, and similar approaches facilitate this capability. These frameworks empower the model to extract meaningful data representations and use them for novel, unseen samples.

A couple of practical instances where few-shot learning shines include:

  1. Object Detection in Surveillance: Few-shot learning can effectively identify objects within surveillance footage, even when only a few examples of those objects are available. After training the model on a modest set of labeled examples, it can subsequently detect these objects in new footage, even if it has never encountered them before.
  2. Tailored Healthcare: Within personalized healthcare, medical professionals might possess only a limited set of a patient’s medical records, comprising a handful of CT scans or blood tests. With a few-shot learning model, these few instances can serve as training data to predict the patient’s prospective well-being. This might encompass forecasts about the potential onset of a specific ailment or the probable response to a particular therapeutic approach.


The Architecture of Siamese Networks

The Siamese network design comprises two identical subnetworks, each processing one of the inputs. Initially, the inputs undergo processing through a convolutional neural network (CNN), which extracts significant features from the provided images. These subnetworks then generate encoded outputs, often through a fully connected layer, resulting in a condensed representation of the input data.

Both branches share a single feature extraction component, built from blocks of convolution, batch normalization, and ReLU activation, followed by max pooling and dropout layers. The final segment is the fully connected (FC) part, which maps the extracted features to the output encoding. In the implementation later in this article, one helper function builds a linear layer followed by a ReLU activation, and another builds a block of consecutive operations (convolution, batch normalization, ReLU activation, max pooling, and dropout). The forward function passes the two inputs through the shared network, one after the other.

The Differencing layer serves to identify similarities between inputs and amplify distinctions among dissimilar pairs, accomplished using the Euclidean Distance function:

Distance(x₁, x₂) = ∥f(x₁) – f(x₂)∥₂

In this context,

  • x₁, x₂ are the two inputs.
  • f(x) represents the output of the encoding.
  • Distance denotes the distance function.

This property enables the network to acquire effective data representations and apply them to fresh, unseen samples. Consequently, the network generates an encoding whose distance to another encoding acts as a similarity score that aids in class differentiation.
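As a small worked example of the distance function, the sketch below compares hypothetical two-dimensional encodings f(x). The vectors are made up for illustration; real encodings come from the trained subnetwork. It uses `math.dist`, available in Python 3.8 and later.

```python
import math

# Hypothetical encodings f(x) produced by the shared subnetwork.
f_genuine_1 = [0.0, 0.0]
f_genuine_2 = [0.3, 0.4]   # a second signature by the same person
f_forgery = [3.0, 4.0]     # a forged signature

# Distance(x1, x2) = ||f(x1) - f(x2)||_2
d_same = math.dist(f_genuine_1, f_genuine_2)
d_diff = math.dist(f_genuine_1, f_forgery)

print(d_same, d_diff)
```

A well-trained network should produce a small distance for the first pair and a large one for the second, which is the pattern these made-up numbers mimic.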

The accompanying figure depicts the network’s architecture. Notably, this network operates as a one-shot classifier, negating the need for many examples per class.

Loss Functions Used in Siamese Networks

A loss function is a mathematical tool to gauge the dissimilarity between the anticipated and actual output within a machine-learning model, given a specific input. When training a model, the aim is to minimize this loss function by adjusting the model’s parameters.

Numerous loss functions cater to diverse problem types. For instance, mean squared error is apt for regression challenges, while cross-entropy loss suits classification tasks.

Distinct from several other network types, the Siamese Network embraces multiple loss functions, elaborated upon below.

Binary Cross-Entropy Loss

Binary cross-entropy loss proves valuable for binary classification tasks, where the objective is to predict between two possible outcomes. In the context of a Siamese network, the aim is to classify an image as either “similar” or “dissimilar” to another.

This function quantifies the disparity between the forecasted probability of the positive class and the actual outcome. Within the Siamese network, the forecasted probability pertains to the likelihood of image similarity, while the actual outcome assumes a binary form: 1 for image similarity and 0 for dissimilarity.

The function is the negative logarithm of the likelihood of the true class, calculated as:

Loss = –(y * log(p) + (1 – y) * log(1 – p))

In this formula,

  • y signifies the true label.
  • p signifies the predicted probability.

Training a model with binary cross-entropy loss strives to minimize this function by parameter adjustment. Through such minimization, the model gains proficiency in accurate class prediction.
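The following sketch evaluates binary cross-entropy for a single pair, using the standard definition with y as the true label and p as the predicted probability of similarity. The function name and the small clamping constant `eps` are choices made for this illustration.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """BCE for one pair: y is 1 (similar) or 0 (dissimilar), p is the predicted probability of similarity."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so that log(0) never occurs
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A similar pair (y = 1) scored confidently right versus confidently wrong.
confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)

print(confident_right, confident_wrong)
```

The confidently wrong prediction receives a much larger loss than the confidently right one, which is what pushes the model toward correct probabilities during training.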

Contrastive Loss

Contrastive Loss differentiates image pairs by using distance as a similarity measure. It is useful when the number of training instances per class is limited. Note that Contrastive Loss requires both positive and negative pairs of training samples. A visualization of this loss is provided in the accompanying figure.

Contrastive Loss | Siamese Networks

The Contrastive Loss equation can be written as:

(1 – Y) * 0.5 * D^2 + Y * 0.5 * max(0, m – D)^2

Here’s the breakdown:

  • Y is the pair label.
  • D stands for the Euclidean distance between the two encodings.
  • When Y equals 0, the inputs belong to the same class. On the other hand, a Y value of 1 signifies that they come from different classes.
  • The parameter ‘m’ defines a margin for the distance function: dissimilar pairs contribute to the loss only when their distance is smaller than ‘m’. The value of ‘m’ is always greater than 0.
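To see how the two terms behave, here is a plain-Python sketch of the contrastive loss for a single pair. It uses the squared-hinge form max(0, m – D)^2 for dissimilar pairs, which is also what the PyTorch implementation later in this article computes (apart from the 0.5 factor). The default margin of 2.0 matches that implementation; the distances are made-up inputs.

```python
def contrastive_loss(distance, y, margin=2.0):
    """Contrastive loss for one pair; y = 0 means same class, y = 1 means different classes."""
    same_term = (1 - y) * 0.5 * distance ** 2                # pulls same-class pairs together
    diff_term = y * 0.5 * max(0.0, margin - distance) ** 2   # pushes different-class pairs apart, up to the margin
    return same_term + diff_term

print(contrastive_loss(0.5, 0))  # same class, close together: small loss
print(contrastive_loss(0.5, 1))  # different classes, too close: penalized
print(contrastive_loss(3.0, 1))  # different classes, beyond the margin: no loss
```

Once a dissimilar pair is farther apart than the margin, it stops contributing to the loss, so training effort concentrates on the pairs that are still confusable.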

Triplet Loss

The triplet loss uses triples of data. The graphic below illustrates these triples.

Triplet Loss | Siamese Networks

The triplet loss function aims to enhance the separation between the anchor and negative samples while reducing the gap between the anchor and positive samples.

Mathematically, the triplet loss is the anchor-to-positive distance d(a, p) minus the anchor-to-negative distance d(a, n), plus a margin value, clipped at zero: L = max(d(a, p) – d(a, n) + margin, 0). When the expression inside the maximum is positive, it becomes the loss; otherwise, the loss is zero.

Here’s a breakdown of the components:

  • d signifies the Euclidean distance.
  • a represents the anchor input.
  • p denotes the positive input.
  • n stands for the negative input.

The primary goal is to ensure that the positive input is closer to the anchor input than the negative input, maintaining a margin of separation.
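A minimal sketch of the triplet loss on made-up two-dimensional encodings follows. It uses the common form max(d(a, p) – d(a, n) + margin, 0); the margin of 1.0 and the example vectors are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss for one (anchor, positive, negative) triple of encodings."""
    d_ap = math.dist(anchor, positive)   # anchor-to-positive distance
    d_an = math.dist(anchor, negative)   # anchor-to-negative distance
    return max(d_ap - d_an + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]

easy = triplet_loss(anchor, positive, negative=[3.0, 4.0])   # negative is far away
hard = triplet_loss(anchor, positive, negative=[0.0, 1.5])   # negative is barely farther than the positive

print(easy, hard)
```

The easy triple already satisfies the margin and yields zero loss, while the hard triple is still penalized because the negative is not yet a full margin farther from the anchor than the positive.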

Constructing a Siamese Network-Based Model for Signature Verification

Signature verification involves distinguishing counterfeit signatures from a collection of genuine ones. In this scenario, a model must grasp the nuances among numerous signatures. It must then discern between authentic and fake signatures when presented with either. Achieving this verification objective poses a considerable challenge for conventional CNNs due to the intricate variations and limited training instances. Compounding the difficulty, often only a solitary signature per individual exists, demanding the model’s proficiency in verifying thousands of individuals’ signatures. The forthcoming sections delve into creating a PyTorch-based model to address this intricate task.


The dataset we’ll use for signature verification is ICDAR 2011. It consists of Dutch signatures, both authentic and counterfeit. A sample of the data is shown below for reference.

Dataset | Siamese Networks

Problem Statement Description

This article delves into the task of detecting counterfeit signatures within a signature verification context. Our objective involves leveraging a dataset of signatures and employing a Siamese network to predict the authenticity of test signatures—discerning genuine ones from fraudulent ones. To accomplish this, we must establish a step-by-step process. This entails data ingestion from the dataset, creating image pairs, and their subsequent processing through the Siamese network. Upon training the network using the provided dataset, we then develop prediction functions.

Importing Essential Libraries

Building the Siamese Network necessitates the inclusion of several key libraries. We introduce the Pillow library (PIL) for image manipulation, matplotlib for visualization, numpy for numerical operations, and tqdm for a progress bar utility. Additionally, we harness the power of PyTorch and torchvision to facilitate network training and construction.

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL.Image as Image
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
import torchvision.utils as tv_utils
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm

Utility Functions

To visualize the network’s outputs, craft a utility function. This function accepts images and their corresponding labels as inputs and arranges them in a grid for convenient visualization.

import numpy as np
import matplotlib.pyplot as plt

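def display_image(img, caption=None, save=False):
    # `save` is kept from the original signature; this version only displays the image.
    image_array = img.numpy()
    if caption:
        # The text position and styling below are illustrative choices.
        plt.text(75, 8, caption, style="italic", fontweight="bold",
                 bbox={"facecolor": "white", "alpha": 0.8, "pad": 10})
    # Tensors are (channels, height, width); matplotlib expects (height, width, channels).
    plt.imshow(np.transpose(image_array, (1, 2, 0)))
    plt.show()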

Data Preprocessing

The data structure used by the Siamese network differs markedly from that of conventional image classification networks. Instead of a single image-label pair, the Dataset class for the Siamese network must supply pairs of images together with a label for the pair. Each image is converted to grayscale, resized, and finally converted into a Tensor. There are two categories of pairs: positive pairs, in which both images belong to the same class (two genuine signatures), and negative pairs, in which they do not (a genuine signature paired with a forgery). Additionally, a function returns the Dataset’s size when invoked.

import os
import pandas as pd
import torch
import torch.utils.data as data
from PIL import Image
import numpy as np

class PairedDataset(data.Dataset):
    def __init__(self, df_path=None, data_dir=None, transform=None, subset=None):
        self.df = pd.read_csv(df_path)
        if subset is not None:
            self.df = self.df[:subset]
        self.df.columns = ["image1", "image2", "label"]
        self.data_dir = data_dir
        self.transform = transform

    def __getitem__(self, index):
        pair1_path = os.path.join(self.data_dir, self.df.iat[index, 0])
        pair2_path = os.path.join(self.data_dir, self.df.iat[index, 1])

        pair1 = Image.open(pair1_path).convert("L")
        pair2 = Image.open(pair2_path).convert("L")

        if self.transform:
            pair1 = self.transform(pair1)
            pair2 = self.transform(pair2)

        label = torch.tensor([int(self.df.iat[index, 2])], dtype=torch.float32)

        return pair1, pair2, label

    def __len__(self):
        return len(self.df)
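The class above reads rows of the form (image1, image2, label) from a CSV file. As a rough sketch of how such rows could be produced, the snippet below pairs up file names from a list of genuine signatures and a list of forgeries. The file names and the helper `build_pairs` are hypothetical, and the dataset’s CSV files are assumed to already contain rows of this shape; the label convention (0 for a genuine pair, 1 for a forged pair) follows the one used in this article.

```python
from itertools import combinations, product

def build_pairs(genuine, forged):
    """Return (image1, image2, label) rows: 0 for genuine-genuine pairs, 1 for genuine-forged pairs."""
    positive = [(a, b, 0) for a, b in combinations(genuine, 2)]
    negative = [(g, f, 1) for g, f in product(genuine, forged)]
    return positive + negative

# Hypothetical file names for one signer.
genuine = ["g1.png", "g2.png", "g3.png"]
forged = ["f1.png", "f2.png"]

pairs = build_pairs(genuine, forged)
for row in pairs[:4]:
    print(row)
```

With three genuine and two forged signatures this yields three positive and six negative pairs, which illustrates how quickly pairing multiplies a small number of images into many training examples.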

Concise Overview of Features

The network’s inputs consist of images arranged as positive and negative pairs. We load these pairs as image data and transform them into Tensor format. The label attached to each pair is binary: 0 for a genuine pair and 1 for a forged pair.

Feature Standardization Process

A crucial preprocessing step is converting images to grayscale. We also resize all images to a 105×105 square, because the input size of the first fully connected layer in the network below (30976) is computed for this resolution. Finally, we convert all images into Tensors, the format PyTorch operates on and can move to the GPU.

data_transform = transforms.Compose([
    transforms.Resize((105, 105)),
    transforms.ToTensor(),
])

Splitting the Dataset

We partition the dataset into distinct training and testing segments. For ease of illustration, we focus on the first 1000 data points. Setting the ‘subset’ argument to None uses the complete dataset, at the cost of longer processing time. Data augmentation is worth considering as a way to improve the network’s long-term performance.

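# train_csv, train_dir, test_csv, and test_dir are placeholders; point them at your copy of the dataset.
train_dataset = PairedDataset(
    df_path=train_csv, data_dir=train_dir, subset=1000,
    transform=transforms.Compose([transforms.Resize((105, 105)), transforms.ToTensor()]),
)

evaluation_dataset = PairedDataset(
    df_path=test_csv, data_dir=test_dir, subset=1000,
    transform=transforms.Compose([transforms.Resize((105, 105)), transforms.ToTensor()]),
)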

Neural Network Architecture

Constructing the described architecture involves a series of steps. Initially, we establish a function that constructs sets of Convolutions, Batch Normalization, and ReLU layers, offering the flexibility to include or exclude a Dropout layer at the end. Another function is devised to generate sequences of Fully Connected (FC) layers, complemented by subsequent ReLU layers. Once the CNN component is constructed via the aforementioned functions, attention shifts to shaping the FC segment of the network. Notably, distinct padding and kernel sizes are implemented throughout the network.

The FC portion consists of blocks of Linear layers followed by ReLU activations. With the architecture defined, the forward pass sends the input data through the network. Note the “view” call, which flattens the output of the convolutional block so it can be fed to the FC layers. With this in place, the network is ready to be trained on the supplied data.

class SiameseNetwork(nn.Module):
    def __init__(self):
        super(SiameseNetwork, self).__init__()

        # Shared convolutional feature extractor applied to both inputs.
        self.cnn1 = nn.Sequential(
            self.create_conv_block(1, 96, 11, 1, False),
            self.create_conv_block(96, 256, 5, 2, True),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
            self.create_conv_block(384, 256, 3, 1, True),
        )

        # Fully connected head; 30976 = 256 channels * 11 * 11 for 105x105 inputs.
        self.fc1 = nn.Sequential(
            self.create_linear_relu(30976, 1024),
            self.create_linear_relu(1024, 128),
            nn.Linear(128, 2),
        )

    def create_linear_relu(self, input_channels, output_channels):
        # Linear layer followed by a ReLU activation.
        return nn.Sequential(nn.Linear(input_channels, output_channels),
                             nn.ReLU(inplace=True))

    def create_conv_block(self, input_channels, output_channels, kernel_size,
                          padding, dropout=True):
        # Convolution -> batch norm -> ReLU -> max pooling, with optional dropout.
        layers = [
            nn.Conv2d(input_channels, output_channels, kernel_size=kernel_size,
                      stride=1, padding=padding),
            nn.BatchNorm2d(output_channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        ]
        if dropout:
            # The dropout probability is an assumed value; the original listing omits it.
            layers.append(nn.Dropout2d(p=0.3))
        return nn.Sequential(*layers)

    def forward_once(self, x):
        output = self.cnn1(x)
        output = output.view(output.size()[0], -1)
        output = self.fc1(output)
        return output

    def forward(self, input1, input2):
        out1 = self.forward_once(input1)
        out2 = self.forward_once(input2)
        return out1, out2
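The input size of 30976 for the first fully connected layer is not arbitrary: it follows from the 105×105 input resolution. The sketch below traces the spatial size through the network, assuming each conv block applies a stride-1 convolution followed by MaxPool2d(3, stride=2), as visible in the listing. Batch normalization, ReLU, and dropout do not change the spatial size, so they are left out. The helper names are made up for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=3, stride=2):
    """Output width/height of max pooling without padding."""
    return (size - kernel) // stride + 1

# (kernel_size, padding, followed_by_pooling) for the four convolutions in cnn1.
layers = [(11, 1, True), (5, 2, True), (3, 1, False), (3, 1, True)]

size = 105
for kernel, padding, has_pool in layers:
    size = conv_out(size, kernel, padding=padding)
    if has_pool:
        size = pool_out(size)

flattened = 256 * size * size  # 256 output channels in the last block
print(size, flattened)  # prints 11 30976
```

If the images were resized to anything other than 105×105, this number would change and the first Linear layer would need to be adjusted accordingly.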

Loss Function

The contrastive loss is the loss function used to train the Siamese Network here, following the equation given earlier in the article. Rather than writing the loss as a plain function, we subclass nn.Module so that the margin is stored on the object and the loss can be called like any other PyTorch module. Note that this implementation omits the constant factor of 0.5, which scales the loss but does not change where its minimum lies.

class ContrastiveLoss(nn.Module):
    def __init__(self, margin=2.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
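        # keepdim=True keeps the distance at shape (batch, 1) so it lines up with the (batch, 1) label.
        euclidean_distance = F.pairwise_distance(output1, output2, keepdim=True)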
        loss_positive = (1 - label) * torch.pow(euclidean_distance, 2)
        loss_negative = label * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2)
        total_loss = torch.mean(loss_positive + loss_negative)
        return total_loss

Training the Siamese Network

With the data loaded and preprocessed, the stage is set to commence training the Siamese network. To initiate this process, we begin by establishing data loaders for both training and testing. Notably, the evaluation DataLoader is configured with a batch size of 1 to facilitate individualized evaluations. Subsequently, the model is deployed to the GPU, and pivotal components such as the Contrastive Loss function and the Adam optimizer are defined.

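# batch_size=32 for training is an assumed value; the evaluation loader uses batch_size=1 as described above.
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=32)
eval_loader = DataLoader(evaluation_dataset, shuffle=False, batch_size=1)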

siamese_net = SiameseNetwork().cuda()
loss_function = ContrastiveLoss()
optimizer = torch.optim.Adam(siamese_net.parameters(), lr=1e-3, weight_decay=0.0005)

Next, we write a training function that takes the training DataLoader, the model, the optimizer, and the loss function. It keeps a running total of the loss and iterates over the batches in the DataLoader. For each batch, the image pairs are moved to the GPU and passed through the network, and the Contrastive Loss is computed. A backward pass then updates the weights, and the function returns the mean loss per batch.

def train(train_loader, model, optimizer, loss_function):
    model.train()  # enable dropout and batch-norm updates
    total_loss = 0.0
    num_batches = len(train_loader)

    for batch_idx, (pair_left, pair_right, label) in enumerate(
            tqdm(train_loader, total=num_batches)):
        pair_left, pair_right, label = (
            pair_left.cuda(), pair_right.cuda(), label.cuda())

        optimizer.zero_grad()  # clear gradients from the previous batch
        output1, output2 = model(pair_left, pair_right)
        contrastive_loss = loss_function(output1, output2, label)
        contrastive_loss.backward()  # backward pass
        optimizer.step()  # update the weights

        total_loss += contrastive_loss.item()

    mean_loss = total_loss / num_batches

    return mean_loss

The model can be trained over multiple epochs utilizing our devised function. In this demonstration, the article covers only a limited number of epochs. If the evaluation loss achieved during training represents the best performance observed throughout the training duration, the model is preserved for subsequent inference at that particular epoch.

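num_epoch = 10  # assumed value; the original listing does not define it
best_eval_loss = float('inf')

# evaluate() is defined in the "Testing the Model" section and must exist before this loop runs.
for epoch in tqdm(range(1, num_epoch + 1)):
    train_loss = train(train_loader, siamese_net, optimizer, loss_function)
    eval_loss = evaluate(eval_loader)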

    print(f"Epoch: {epoch}")
    print(f"Training loss: {train_loss}")
    print(f"Evaluation loss: {eval_loss}")

    if eval_loss < best_eval_loss:
        best_eval_loss = eval_loss
        print(f"Best Evaluation loss: {best_eval_loss}")
        torch.save(siamese_net.state_dict(), "model.pth")
        print("Model Saved Successfully")

Testing the Model

An evaluation phase ensues following model training, allowing us to assess its performance and conduct inference for individual data points. Analogous to the training function, an evaluation function is constructed, taking the test data loader as input. The data loader is iterated through, processing one instance at a time. Subsequently, the image pairs for testing are extracted. These pairs are then sent to the GPU, enabling model execution. The resultant outputs from the model are utilized to compute the Contrastive loss, which is subsequently stored within a designated list.

def evaluate(eval_loader):
    siamese_net.eval()  # disable dropout and use running batch-norm statistics
    loss_list = []

    with torch.no_grad():  # no gradients are needed for evaluation
        for pair_left, pair_right, label in tqdm(eval_loader, total=len(eval_loader)):
            pair_left, pair_right, label = pair_left.cuda(), pair_right.cuda(), label.cuda()
            output1, output2 = siamese_net(pair_left, pair_right)
            contrastive_loss = loss_function(output1, output2, label)
            loss_list.append(contrastive_loss.item())

    # Mean of the per-pair losses (the loader uses batch_size=1).
    mean_loss = np.array(loss_list).mean()
    return mean_loss

We can now run the model once over the test data points. To assess performance visually, we plot each image pair and print the pairwise distance the model assigns to it, with the two images of a pair arranged side by side in a grid.

siamese_net.eval()
for i, data in enumerate(eval_loader, 0):
    x0, x1, label = data
    concat_images = torch.cat((x0, x1), 0)
    with torch.no_grad():
        out1, out2 = siamese_net(x0.to('cuda'), x1.to('cuda'))

    euclidean_distance = F.pairwise_distance(out1, out2)
    if label.item() == 0:
        label_text = "Original Pair of Signature"
    else:
        label_text = "Forged Pair of Signature"

    # Show the two images of the pair side by side.
    display_image(tv_utils.make_grid(concat_images))
    print("Predicted Euclidean Distance:", euclidean_distance.item())
    print("Actual Label:", label_text)
    if i == 4:  # stop after five pairs
        break

Output | Siamese Networks
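The model outputs a distance, not a decision, so a cut-off is needed to call a pair genuine or forged. The sketch below picks the cut-off that maximizes accuracy on a handful of labeled distances. The distances are made-up values, the helper `best_threshold` is hypothetical, and in practice the cut-off would be chosen on held-out validation pairs rather than on the test set.

```python
def best_threshold(distances, labels):
    """Return (threshold, accuracy) maximizing accuracy; a pair is predicted forged (1) when distance > threshold."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(distances)):  # every observed distance is a candidate cut-off
        preds = [1 if d > t else 0 for d in distances]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:  # keep the first (smallest) threshold with the highest accuracy
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical model distances and true labels (0 = genuine pair, 1 = forged pair).
distances = [0.2, 0.4, 0.5, 1.3, 1.6, 0.9]
labels = [0, 0, 0, 1, 1, 0]

threshold, accuracy = best_threshold(distances, labels)
print(threshold, accuracy)  # prints 0.9 1.0
```

On real data the genuine and forged distances usually overlap, so no threshold separates them perfectly and the choice becomes a trade-off between false accepts and false rejects.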

Advantages and Disadvantages of Siamese Networks

Disadvantages

  • One notable drawback of Siamese networks is their output, which provides a similarity score rather than a probability distribution that sums up to 1. This characteristic can present challenges in certain applications where probability-based outputs are preferable.

Advantages

  • Siamese networks exhibit resilience when dealing with varying numbers of examples within different classes. This adaptability stems from the network’s ability to function effectively with limited class information.
  • The network’s classification performance does not hinge on providing domain-specific information, contributing to its versatility.
  • Siamese networks can make predictions even with just a single image per class.

Applications of Siamese Networks

Siamese Networks find utility in various applications, some outlined below.

Facial Recognition: Siamese networks prove advantageous in one-shot facial recognition tasks. By utilizing contrastive loss, these networks distinguish dissimilar faces from similar ones, enabling effective facial identification with minimal data samples.


Fingerprint Recognition: Siamese networks can also be used for fingerprint recognition. Given pairs of pre-processed fingerprints, the network learns to differentiate between valid and invalid prints, enhancing the accuracy of fingerprint-based authentication.


Signature Verification: This article primarily delved into the implementation of Signature Verification through Siamese networks. As demonstrated, the network processes pairs of signatures to determine the authenticity of signatures, distinguishing between genuine and forged ones.


Text Similarity: Siamese Networks also find relevance in assessing text similarity. Through paired input, the network can discern similarities between different textual pieces. Practical applications include identifying analogous questions within a question bank or retrieving similar documents from a text repository.



A Siamese neural network, often abbreviated as SNN, is a neural network design that incorporates two or more sub-networks sharing an identical structure. In this context, “identical” means having matching configurations, parameters, and weights. Parameter updates are mirrored across the sub-networks, and the network determines how similar two inputs are by comparing their feature vectors.

Key Takeaways

  • Siamese networks excel in classifying datasets with limited examples per class, making them valuable for scenarios with scarce training data.
  • Through this exploration, we gained insight into the fundamental principles underpinning Siamese networks, encompassing their architecture, employed loss functions, and the process of training such networks.
  • Our journey encompassed the practical application of Siamese networks in the context of Signature verification, utilizing the ICDAR 2011 dataset. This involved the creation of a model capable of detecting counterfeit signatures.
  • The training and testing pipeline for Siamese networks became clear, offering a comprehensive understanding of how these networks operate. We delved into the representation of paired data, a crucial aspect of their effectiveness.

Frequently Asked Questions

Q1. What are the applications of Siamese networks?

Answer: Siamese networks find applications in various domains, such as image classification, object detection, text classification, and voice classification. They can also be used to encode specific features, and similar models can be built to classify different shapes. Furthermore, Siamese networks play a crucial role in enabling one-shot learning tasks.

Q2. What does the Siamese network mean in the context of Natural Language Processing (NLP)?

Answer: In Natural Language Processing (NLP), a Siamese network consists of multiple identical neural networks that receive input vectors and extract features from them. When the network is trained with the triplet loss, these extracted features are fed into the triplet loss function, which plays a crucial role in the few-shot learning process.

Q3. Why is it called a Siamese network?

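Answer: Siamese networks were introduced in the early 1990s by Bromley, LeCun, and colleagues for signature verification, and Gregory Koch and colleagues applied them to one-shot image recognition in 2015. The term “Siamese” originates from the network’s structure, which involves two identical sub-networks processing distinct input samples using the same set of weights.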

Q4. How does a Siamese network differ from a CNN?

Answer: A Siamese Network learns a similarity function, unlike a normal CNN, which learns to predict several classes using large amounts of data. The acquired function enables class differentiation with reduced data requirements.

Q5. How do I create a Siamese Network?

Answer:

  1. Choose a deep learning framework: TensorFlow, PyTorch, or Keras.
  2. Define the network architecture: Two identical branches with embedding outputs.
  3. Implement the contrastive loss function: Minimizes distance for similar images and maximizes it for dissimilar ones.
  4. Train the network: Feed image pairs and backpropagate the loss.
  5. Evaluate the network: Use metrics like accuracy, precision, and recall.
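For the evaluation step, accuracy, precision, and recall can be computed directly from predicted and true pair labels. The sketch below treats a forged pair (label 1) as the positive class; the prediction and label lists are made-up values, and the helper `classification_metrics` is hypothetical.

```python
def classification_metrics(predictions, labels, positive=1):
    """Return (accuracy, precision, recall) for binary predictions."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)  # forged, flagged as forged
    fp = sum(p == positive and y != positive for p, y in pairs)  # genuine, flagged as forged
    fn = sum(p != positive and y == positive for p, y in pairs)  # forged, missed
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical predictions (after thresholding the distance) and true labels.
predictions = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0]

accuracy, precision, recall = classification_metrics(predictions, labels)
print(accuracy, precision, recall)
```

For signature verification, recall on forgeries and precision on forgeries often matter differently: a missed forgery and a rejected genuine signature carry different costs, so reporting all three numbers is more informative than accuracy alone.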

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
