A Comprehensive Guide on Atrous Convolution in CNNs

Prashant Malge 19 Mar, 2024


Introduction

In the realm of computer vision, Convolutional Neural Networks (CNNs) have redefined the landscape of image analysis and understanding. These powerful networks have enabled breakthroughs in tasks such as image classification, object detection, and semantic segmentation. They have laid the foundation for a wide range of applications in fields like healthcare, autonomous vehicles, and more.

However, as the demand for more context-aware and robust models continues to grow, traditional convolutional layers within CNNs have faced limitations in capturing extensive contextual information. This has led to the need for innovative techniques that can enhance the network’s ability to understand broader contexts without significantly increasing computational complexity.

Enter Atrous Convolution, a groundbreaking approach that has disrupted the conventional norms of convolutional layers within CNNs. Atrous Convolution, also known as dilated convolution, introduces a new dimension to the world of deep learning by enabling networks to capture broader context without significantly increasing computational cost or parameters.

Learning Objectives

Learn the basics of Convolutional Neural Networks and how they process visual data to understand images.
Understand how Atrous Convolution improves upon traditional convolution methods by capturing larger context in images.
Explore well-known CNN architectures that use Atrous Convolution, like DeepLab and WaveNet, to see how it enhances their performance.
Gain a hands-on understanding of the applications of Atrous Convolution in CNNs through practical examples and code snippets.

This article was published as a part of the Data Science Blogathon.

Understanding CNNs: How It Works
Starting With Atrous Convolution
Dilated Convolutions for Multi-Scale Feature Learning
Structure Of Atrous and Normal Convolutions
Comparison of Regular Convolution and Atrous (Dilated) Convolution
Applications of Atrous Convolution
Exploring Famous Architectures
Frequently Asked Questions

Understanding CNNs: How It Works

Convolutional Neural Networks (CNNs) are a class of deep neural networks primarily designed for analyzing visual data such as images and videos. They are inspired by the human visual system and are exceptionally effective in tasks involving pattern recognition within visual data. Here’s the breakdown:

Image Classification Architecture of CNN

Convolutional Layers: CNNs consist of multiple layers, with convolutional layers being the core. These layers employ convolution operations that apply learnable filters to input data, extracting various features from the images.
Pooling Layers: After convolution, pooling layers are often used to reduce spatial dimensions, compressing the information learned by the convolutional layers. Common pooling operations include max pooling or average pooling, which reduce the size of the representation while retaining essential information.
Activation Functions: Non-linear activation functions (like ReLU – Rectified Linear Unit) are typically applied after convolutional layers to introduce non-linearity to the network, allowing it to learn complex patterns and relationships within the data.
Fully Connected Layers: Towards the end of the CNN, fully connected layers are often utilized. These layers consolidate the features extracted by the previous layers and perform classification or regression tasks.
Point-Wise Convolution: Pointwise convolution, also known as 1×1 convolution, is a technique used in CNNs to perform dimensionality reduction and feature combination. It involves applying a 1×1 filter to the input data, effectively reducing the number of input channels and allowing for the combination of features across channels. Pointwise convolution is often used in conjunction with other convolutional operations to enhance the network’s ability to capture complex patterns and relationships within the data.
Learnable Parameters: CNNs rely on learnable parameters (weights and biases) that are updated during the training process. This training involves forward propagation, where the input data is passed through the network, and backpropagation, which adjusts the parameters based on the network’s performance.
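To make the building blocks above concrete, here is a minimal pure-Python sketch of one convolution, ReLU, and max-pooling pass on a tiny single-channel image. The image and the 2×2 filter values are made up for illustration; in a real CNN the filter weights are learned, and a framework such as TensorFlow or PyTorch would do this work.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, len(feature_map[0]), 2)
        ]
        for i in range(0, len(feature_map), 2)
    ]


image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 1],
    [0, 3, 1, 0, 2],
]
kernel = [[1, 0], [0, -1]]  # hypothetical filter; real weights are learned

features = max_pool_2x2(relu(conv2d_valid(image, kernel)))
print(features)  # 5x5 image -> 4x4 after convolution -> 2x2 after pooling
```

The 5×5 input shrinks to 4×4 after the unpadded convolution and to 2×2 after pooling, which is the size reduction the pooling bullet describes.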

Starting With Atrous Convolution

Atrous convolution, also known as dilated convolution, is a type of convolutional operation that introduces a parameter called the dilation rate. Unlike regular convolution, which applies filters to adjacent pixels, atrous convolution spaces out the filter parameters by introducing gaps between them, controlled by the dilation rate. This process enlarges the receptive field of the filters without increasing the number of parameters. In simpler terms, it allows the network to capture a broader context from the input data without adding more complexity.
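As an illustration of the gaps between filter elements, the sketch below implements a one-dimensional dilated convolution in plain Python. The function name and the example signal are assumptions made for this illustration, and the operation is written as cross-correlation (no kernel flip), which is how deep learning frameworks usually define convolution.

```python
def dilated_conv1d(signal, kernel, rate=1):
    """Unpadded 1D convolution whose filter taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    return [
        sum(w * signal[start + t * rate] for t, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]  # the same three weights are used in both calls

regular = dilated_conv1d(signal, kernel, rate=1)  # taps at offsets 0, 1, 2
atrous = dilated_conv1d(signal, kernel, rate=2)   # taps at offsets 0, 2, 4

print(regular)
print(atrous)
```

Both calls use exactly three weights, but with rate 2 each output compares samples four positions apart instead of two.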

The dilation rate sets the spacing between the filter’s taps: a rate of r leaves r − 1 skipped pixels between neighboring filter elements. A rate of 1 is regular convolution, while higher rates spread the same weights over a wider area. The enlarged receptive field captures more context with the same number of weights and the same number of multiplications per output, so the network can combine local detail with wider context efficiently.
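The relationship between dilation rate and coverage can be written as a one-line formula: a kernel of size k with rate r spans k + (k − 1)(r − 1) input positions along each axis. The helper below is a small sketch of that arithmetic; the function name is chosen for this example.

```python
def effective_kernel_size(kernel_size, rate):
    """Input positions spanned along one axis by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


for rate in (1, 2, 4):
    span = effective_kernel_size(3, rate)
    # A 3x3 kernel always has 9 weights, whatever the rate
    print(f"rate={rate}: covers {span}x{span} pixels with 9 weights")
```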

In essence, atrous convolution facilitates the integration of wider context information into convolutional neural networks, enabling better modeling of large-scale patterns within the data. It’s commonly used in applications where context at varying scales is crucial, such as semantic segmentation in computer vision or handling sequences in natural language processing tasks.

Dilated Convolutions for Multi-Scale Feature Learning

Dilated convolutions, also known as atrous convolutions, have been pivotal in multi-scale feature learning within neural networks. Here are some key points about their role in enabling multi-scale feature learning:

Contextual Expansion: Atrous convolutions allow the network to capture information from a broader context without significantly increasing the number of parameters. By introducing gaps in the filters, the receptive field expands without inflating the computational load.
Variable Receptive Fields: Each dilation rate corresponds to a different receptive field size. By using several rates, either in parallel branches or in successive layers, the network can process the input at different scales or granularities and capture both fine and coarse details.
Hierarchical Feature Extraction: The dilation rates can be modulated across network layers to create a hierarchical feature extraction mechanism. Lower layers with smaller dilation rates focus on fine details, while higher layers with larger dilation rates capture a broader context.
Efficient Information Fusion: Atrous convolutions facilitate the fusion of information from different scales efficiently. They provide a mechanism to combine features from various receptive fields, enhancing the network’s understanding of complex patterns in the data.
Applications in Segmentation and Recognition: In tasks like image segmentation or speech recognition, dilated convolutions have been used to improve performance by enabling networks to learn multi-scale representations, leading to more accurate predictions.
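The hierarchical effect described above can be checked with a little arithmetic. For a stack of stride-1 convolutions, each layer with kernel size k and rate r adds (k − 1) · r pixels to the receptive field. The sketch below compares three regular 3×3 layers with three layers whose rates double each time; the rates 1, 2, 4 are an example schedule, not a prescription.

```python
def stacked_receptive_field(kernel_size, rates):
    """Receptive field (one axis) of stacked stride-1 convolutions."""
    field = 1
    for rate in rates:
        field += (kernel_size - 1) * rate
    return field


regular_stack = stacked_receptive_field(3, [1, 1, 1])
dilated_stack = stacked_receptive_field(3, [1, 2, 4])

print(regular_stack)  # three regular 3x3 layers
print(dilated_stack)  # same depth and weight count, rates 1, 2, 4
```

With the same number of layers and weights, the dilated stack sees a 15-pixel-wide region where the regular stack sees 7.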

Structure Of Atrous and Normal Convolutions

Input Image (Rectangle)
    |
    |
Regular Convolution (Box)
    - Kernel Size: Fixed kernel
    - Sliding Strategy: Across input feature maps
    - Stride: Usually 1
    - Output Feature Map: Size set by padding and stride (shrinks slightly without padding)
    
Atrous (Dilated) Convolution (Box)
    - Kernel Size: Fixed kernel with gaps (controlled by dilation)
    - Sliding Strategy: Spaced elements, increased receptive field
    - Stride: Usually 1 (set independently of the dilation rate)
    - Output Feature Map: Same size as input with matching padding, expanded receptive field

Comparison of Regular Convolution and Atrous (Dilated) Convolution

Aspect	Regular Convolution	Atrous (Dilated) Convolution
Filter Application	Applies filters to contiguous regions of input data	Introduces gaps between filter elements (holes)
Kernel Size	Fixed kernel size	Fixed kernel size, but with gaps (controlled by dilation)
Sliding Strategy	Slides across input feature maps	Spaced elements allow for an enlarged receptive field
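Stride	Usually, a stride of 1	Also usually 1; stride is set independently of the dilation rate, which spaces the filter taps rather than the filter positions
Output Feature Map Size	Depends on padding and stride; shrinks slightly without padding	Depends on padding and stride; with matching padding, input size is preserved while the receptive field grows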
Receptive Field	Limited effective receptive field	Expanded effective receptive field
Context Information Capture	Limited context capture	Enhanced capability to capture broader context
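To see how kernel size, dilation, padding, and stride each affect the output size, the standard size formula used by common frameworks can be written as a small helper. This is a sketch for one spatial axis; the function name and the 32-pixel input are assumptions for the example.

```python
def conv_output_size(size, kernel_size, dilation=1, padding=0, stride=1):
    """Output length along one axis of a (possibly dilated) convolution."""
    span = dilation * (kernel_size - 1) + 1  # input positions covered by the kernel
    return (size + 2 * padding - span) // stride + 1


n = 32
print(conv_output_size(n, 3))                         # regular, no padding
print(conv_output_size(n, 3, padding=1))              # regular, 'same' padding
print(conv_output_size(n, 3, dilation=2))             # atrous, no padding
print(conv_output_size(n, 3, dilation=2, padding=2))  # atrous, 'same' padding
print(conv_output_size(n, 3, padding=1, stride=2))    # strided regular convolution
```

Dilation and stride are separate arguments: dilation changes how far apart the taps sit, while only stride changes how far the filter moves between outputs.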

Applications of Atrous Convolution

Atrous convolutions expand the receptive field without adding parameters, so a layer sees more context at the same cost per output.
They enable selective focus on specific input regions, improving feature extraction efficiency.
Computational complexity is reduced compared to traditional convolutions with larger kernels.
Ideal for real-time video processing and handling large-scale image datasets.
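The saving compared with larger kernels is easy to quantify. The sketch below counts the weights of a 3×3 convolution with rate 2 and of a regular 5×5 convolution, which span the same 5×5 window. The channel width of 64 is a hypothetical value chosen for the example, and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a 2D convolution layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def covered_width(kernel_size, rate):
    """Input positions spanned along one axis by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


channels = 64  # hypothetical layer width
atrous_weights = conv_weights(3, channels, channels)  # 3x3 kernel, rate 2
dense_weights = conv_weights(5, channels, channels)   # 5x5 kernel, rate 1

print(covered_width(3, 2), covered_width(5, 1))  # both span 5 pixels
print(atrous_weights, dense_weights)
print(round(dense_weights / atrous_weights, 2))  # how many times more weights
```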

Exploring Famous Architectures

DeepLab [REF 1]

DeepLab is a series of convolutional neural network architectures created for semantic image segmentation. It is recognized for using atrous convolutions (also known as dilated convolutions) and atrous spatial pyramid pooling (ASPP) to capture multi-scale contextual information in images, allowing for precise pixel-level segmentation.

Here’s an overview of DeepLab:

DeepLab focuses on segmenting images into meaningful regions by assigning a label to each pixel, aiding in understanding the detailed context within an image.
Atrous Convolutions, utilized by DeepLab, are dilated convolutions that expand the network’s receptive field without sacrificing resolution. This allows DeepLab to capture context at multiple scales, enabling comprehensive information gathering without a significant increase in computational cost.
Atrous Spatial Pyramid Pooling (ASPP) is a feature used in DeepLab to efficiently gather multi-scale information. It employs parallel atrous convolutions with different dilation rates to capture context at multiple scales and effectively fuse the information.
DeepLab’s architecture, with its focus on multi-scale context and precise segmentation, has achieved state-of-the-art performance in various semantic segmentation challenges, showcasing high accuracy in segmentation tasks.
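The idea behind ASPP can be sketched in one dimension with plain Python: run the same input through parallel dilated convolutions with different rates, then stack the branch outputs at each position. This is only an illustration of the parallel-branch idea under simplifying assumptions (a single shared kernel, rates 1, 2, 4, and no learned fusion step); it is not DeepLab's implementation.

```python
def dilated_conv1d_same(signal, kernel, rate):
    """1D dilated convolution, zero-padded so output length equals input length.

    Assumes an odd-length kernel.
    """
    pad = (len(kernel) - 1) * rate // 2
    padded = [0] * pad + list(signal) + [0] * pad
    return [
        sum(w * padded[i + t * rate] for t, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def aspp_1d(signal, kernel, rates=(1, 2, 4)):
    """Run one branch per rate and stack the branch outputs per position."""
    branches = [dilated_conv1d_same(signal, kernel, r) for r in rates]
    return [list(values) for values in zip(*branches)]


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # a single active sample at index 4
fused = aspp_1d(impulse, kernel=[1, 1, 1])

for position, features in enumerate(fused):
    print(position, features)  # one feature per rate at every position
```

Position 0 only responds in the rate-4 branch, position 2 only in the rate-2 branch, and position 3 only in the rate-1 branch, so each branch contributes context from a different distance. In a real network the stacked features would be combined by further learned layers.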

Improved DeepLab v3+ network structure | Atrous Convolution in CNNs

Code:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Conv2DTranspose

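def create_DeepLab_model(input_shape, num_classes):
    # Simplified DeepLab-style sketch, not the full published architecture
    model = Sequential([
        Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=input_shape),
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        MaxPooling2D(pool_size=(2, 2)),
        
        Conv2D(128, (3, 3), activation='relu', padding='same'),
        Conv2D(128, (3, 3), activation='relu', padding='same'),
        MaxPooling2D(pool_size=(2, 2)),
        
        # Atrous convolutions: widen the receptive field without further downsampling
        Conv2D(256, (3, 3), dilation_rate=2, activation='relu', padding='same'),
        Conv2D(256, (3, 3), dilation_rate=4, activation='relu', padding='same'),
        
        # Two stride-2 transposed convolutions undo the two pooling steps,
        # so the per-pixel predictions match the input resolution
        Conv2DTranspose(128, (3, 3), strides=(2, 2), padding='same', activation='relu'),
        Conv2DTranspose(64, (3, 3), strides=(2, 2), padding='same', activation='relu'),
        Conv2D(num_classes, (1, 1), activation='softmax', padding='valid')
    ])
    return model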

# Define input shape and number of classes
input_shape = (256, 256, 3)  # Example input shape
num_classes = 21  # Example number of classes

# Create the DeepLab model
deeplab_model = create_DeepLab_model(input_shape, num_classes)

# Compile the model (you might want to adjust the optimizer and loss function based on your task)
deeplab_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Print model summary
deeplab_model.summary()

Fully Convolutional Networks (FCNs) [REF 2]

FCN Architecture | Atrous Convolution in CNNs

Fully Convolutional Networks (FCNs) and Spatial Preservation: FCNs replace fully-connected layers with 1×1 convolutions, crucial for maintaining spatial information, especially in tasks like segmentation.
Encoder Structure: The encoder, often based on VGG, undergoes a transformation where fully connected layers are converted into convolutional layers. This retains spatial details and connectivity to the image.
Atrous Convolution Integration: Atrous convolutions can be added to FCN-style networks. They let the network capture multi-scale information without significantly increasing parameters or losing spatial resolution.
Semantic Segmentation: Atrous convolutions aid in capturing wider contextual information at multiple scales, allowing the network to understand objects in various sizes and scales within the same image.
Decoder Role: The decoder network upsamples the feature maps to the original image size using transposed (backward) convolutional layers. Because atrous convolutions avoid some of the encoder’s downsampling, the decoder starts from feature maps that retain more spatial detail.
Improved Accuracy: Through the integration of Atrous convolutions, FCNs achieve improved accuracy in semantic segmentation tasks by efficiently capturing context and preserving spatial information at multiple scales.
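The first point, replacing fully connected layers with 1×1 convolutions, can be illustrated without any framework: a 1×1 convolution is the same dense layer applied independently at every pixel, which is why the spatial layout survives. The weights and feature values below are made up for the example.

```python
def dense(features, weights, bias):
    """Fully connected layer on one feature vector."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def conv1x1(feature_map, weights, bias):
    """1x1 convolution: the same dense layer applied at every pixel."""
    return [[dense(pixel, weights, bias) for pixel in row] for row in feature_map]


# 2x2 feature map with 3 channels per pixel (height x width x channels)
feature_map = [
    [[1, 0, 2], [0, 1, 1]],
    [[2, 2, 0], [1, 1, 1]],
]
weights = [[1, 0, -1], [0, 2, 1]]  # 2 output classes x 3 input channels (made up)
bias = [0, 1]

scores = conv1x1(feature_map, weights, bias)
print(scores)  # still 2x2 spatially, now with 2 class scores per pixel
```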

Code:

import tensorflow as tf

# Define the atrous convolution layer function
def atrous_conv_layer(inputs, filters, kernel_size, rate):
    return tf.keras.layers.Conv2D(filters=filters, kernel_size=kernel_size, 
    dilation_rate=rate, padding='same', activation='relu')(inputs)

# Example FCN architecture with atrous convolutions
def FCN_with_AtrousConv(input_shape, num_classes):
    inputs = tf.keras.layers.Input(shape=input_shape)

    # Encoder (VGG-style)
    conv1 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    conv2 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(conv1)
    # Downsample once so the stride-2 decoder below restores the input resolution
    pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(conv2)

    # Atrous convolution layers
    atrous_conv1 = atrous_conv_layer(pool, 128, (3, 3), rate=2)
    atrous_conv2 = atrous_conv_layer(atrous_conv1, 128, (3, 3), rate=4)
    # Add more atrous convolutions as needed...

    # Decoder (transposed convolution); the layer must be called on its input tensor
    upsample = tf.keras.layers.Conv2DTranspose(
        64, (3, 3), strides=(2, 2), padding='same')(atrous_conv2)
    output = tf.keras.layers.Conv2D(num_classes, (1, 1), activation='softmax')(upsample)

    model = tf.keras.models.Model(inputs=inputs, outputs=output)
    return model

# Define input shape and number of classes
input_shape = (256, 256, 3)  # Example input shape
num_classes = 10  # Example number of classes

# Create an instance of the FCN with AtrousConv model
model = FCN_with_AtrousConv(input_shape, num_classes)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display model summary
model.summary()

LinkNet [REF 3]

Architecture of LinkNet | Atrous Convolution in CNNs

LinkNet is an efficient image segmentation architecture built around an encoder-decoder design with skip connections that pass encoder features directly to the matching decoder stage. Atrous (dilated) convolutions can be combined with this design, as in variants such as D-LinkNet, to widen the receptive field.

Efficient Image Segmentation: LinkNet-style networks segment images efficiently, and adding atrous convolutions expands the receptive field without increasing the number of parameters.
Atrous Convolutions Integration: With atrous (dilated) convolutions in the encoder or bottleneck, the network captures contextual information effectively while keeping computational requirements manageable.
Skip Connections for Improved Flow: LinkNet’s skip connections aid in better information flow across the network. This facilitates more precise segmentation by integrating features from different network depths.
Optimized Design: The architecture is optimized to strike a balance between computational efficiency and accurate image segmentation. This makes it suitable for various segmentation tasks.
Scalable Architecture: LinkNet’s design allows for scalability, enabling it to handle segmentation tasks of varying complexities with efficiency and accuracy.
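Skip connections that add encoder features to decoder features only work when the two feature maps have the same spatial size. The sketch below traces sizes through four 2×2 pooling steps and four transposed convolutions using the standard output-size formula. The 224-pixel input and the kernel 4, stride 2, padding 1 settings are example values.

```python
def pool_out(size, window=2):
    """Output size of non-overlapping pooling."""
    return size // window


def transposed_conv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


size = 224
encoder_sizes = []
for _ in range(4):
    size = pool_out(size)
    encoder_sizes.append(size)

decoder_sizes = []
for _ in range(4):
    size = transposed_conv_out(size)
    decoder_sizes.append(size)

print(encoder_sizes)
print(decoder_sizes)
# Each decoder output (except the last) lines up with an earlier encoder stage
print(decoder_sizes[:-1] == encoder_sizes[-2::-1])
```

With these settings every decoder stage doubles the size exactly, so each intermediate decoder output has an encoder feature map of the same size to be added to.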

Code:

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Conv -> BatchNorm -> ReLU; dilation > 1 makes the convolution atrous
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1,
                 dilation=1):
        super(ConvBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size,
                              stride=stride, padding=padding, dilation=dilation)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

class DecoderBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(DecoderBlock, self).__init__()
        self.conv1 = ConvBlock(in_channels, in_channels // 4, kernel_size=1, stride=1, padding=0)
        # kernel 4, stride 2, padding 1 doubles the spatial size exactly
        self.deconv = nn.ConvTranspose2d(in_channels // 4, out_channels, kernel_size=4,
                                         stride=2, padding=1)
        self.conv2 = ConvBlock(out_channels, out_channels)

    def forward(self, x, skip=None):
        x = self.conv1(x)
        x = self.deconv(x)  # the only upsampling step in the block
        x = self.conv2(x)
        if skip is not None:
            x = x + skip  # out-of-place add keeps autograd happy after the in-place ReLU
        return x

class LinkNet(nn.Module):
    # Simplified LinkNet-style sketch; the dilated block is an illustrative addition
    def __init__(self, num_classes=21):
        super(LinkNet, self).__init__()

        # Encoder
        self.encoder = nn.Sequential(
            ConvBlock(3, 64),
            nn.MaxPool2d(2),
            ConvBlock(64, 128),
            nn.MaxPool2d(2),
            ConvBlock(128, 256),
            nn.MaxPool2d(2),
            ConvBlock(256, 512, padding=2, dilation=2),  # atrous convolution, same output size
            nn.MaxPool2d(2)
        )

        # Decoder
        self.decoder = nn.ModuleList([
            DecoderBlock(512, 256),
            DecoderBlock(256, 128),
            DecoderBlock(128, 64),
            DecoderBlock(64, 32)
        ])

        # Final prediction
        self.final_conv = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        skips = []
        for module in self.encoder:
            x = module(x)
            if isinstance(module, nn.MaxPool2d):
                skips.append(x)  # keep the output of each encoder stage

        # The deepest output is the decoder input, not a skip; the last block has no skip
        skips = skips[:-1][::-1] + [None]

        for block, skip in zip(self.decoder, skips):
            x = block(x, skip)

        x = self.final_conv(x)
        return x

# Example usage:
input_tensor = torch.randn(1, 3, 224, 224)  # Example input tensor shape
model = LinkNet(num_classes=10)  # Example number of classes
output = model(input_tensor)
print(output.shape)  # Example output shape

InstanceFCN [REF 4]

This method adapts Fully Convolutional Networks (FCNs), which are highly effective for semantic segmentation, for instance-aware semantic segmentation. Unlike the original FCN, where each output pixel is a classifier of an object category, in InstanceFCN, each output pixel is a classifier of the relative positions of instances. For example, in the score map, each pixel is a classifier of whether it belongs to the “right side” of an instance or not.

Architecture of InstanceFCN score maps | Atrous Convolution in CNNs

How InstanceFCN Works

An FCN is applied to the input image to generate k² score maps, each corresponding to a particular relative position. These are called instance-sensitive score maps. To produce object instances from these score maps, a sliding window of size m×m is used. The m×m window is divided into k² sub-windows of size m ⁄ k × m ⁄ k, one for each of the k² relative positions. Each m ⁄ k × m ⁄ k sub-window of the output directly copies values from the same sub-window in the corresponding score map. The k² sub-windows are put together according to their relative positions to assemble an m×m segmentation output. For example, the #1 sub-window of the output in the figure above is taken directly from the top-left m ⁄ k × m ⁄ k sub-window of the m×m window in the #1 instance-sensitive score map. This is called the instance assembling module.
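The assembling step described above can be sketched directly in Python. The function below is an illustration under simplifying assumptions: score maps are plain nested lists, they are indexed in row-major order of relative position, and m is divisible by k. The function and argument names are chosen for this example.

```python
def assemble_instance(score_maps, top, left, m, k):
    """Assemble an m x m output for the window whose top-left corner is (top, left).

    score_maps[p] is the 2D score map for relative position p = row * k + col.
    Assumes m is divisible by k.
    """
    cell = m // k  # side length of each sub-window
    output = [[0.0] * m for _ in range(m)]
    for y in range(m):
        for x in range(m):
            p = (y // cell) * k + (x // cell)  # sub-window this pixel falls in
            # Copy from the same location in the matching score map
            output[y][x] = score_maps[p][top + y][left + x]
    return output


k, m = 2, 4
# Toy score maps: map p is filled with the constant p, so the copies are easy to trace
score_maps = [[[p] * 6 for _ in range(6)] for p in range(k * k)]

mask = assemble_instance(score_maps, top=1, left=1, m=m, k=k)
for row in mask:
    print(row)
```

With constant-valued toy maps, the printed output shows each quadrant of the window being filled from a different score map.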

InstanceFCN Architecture

The architecture consists of applying VGG-16 fully convolutionally on the input image. On the output feature map, there are two fully convolutional branches. One of them is for estimating segment instances (as described above) and the other is for scoring the instances.

Atrous convolutions, which introduce gaps in the filter, are used in parts of this architecture to expand the network’s field of view and capture more context information.

Main Architecture of InstanceFCN | Atrous Convolution in CNNs

For the first branch, a 1×1 512-d conv. layer followed by a 3×3 conv. layer is used to generate the set of k² instance-sensitive score maps. The assembling module (as described earlier) is used to predict the m×m (m = 21) segmentation mask. The second branch consists of a 3×3 512-d conv. layer followed by a 1×1 conv. layer. This 1×1 conv. layer is a per-pixel logistic regression that classifies the m×m sliding window centered at this pixel as instance or not an instance. The output of the branch is therefore an objectness score map in which one score corresponds to one sliding window that generates one instance. As a consequence, this method is blind to the different object categories.

Code:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D

# Define your atrous convolution layer
def atrous_conv_layer(input_layer, filters, kernel_size, dilation_rate):
    return Conv2D(filters=filters, kernel_size=kernel_size,
                  dilation_rate=dilation_rate, padding='same', activation='relu')(input_layer)

# Define your InstanceFCN model (simplified sketch of the two branches described above)
def InstanceFCN(input_shape, k=3):
    # k is an example value: the instance branch produces k * k score maps
    inputs = Input(shape=input_shape)
    
    # Your VGG-16 like fully convolutional layers here
    conv1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    conv2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv1)
    
    # Atrous convolution layer
    atrous_conv = atrous_conv_layer(conv2, filters=128, kernel_size=(3, 3),
                                    dilation_rate=(2, 2))

    # Instance branch: 1x1 512-d conv, then 3x3 conv giving k*k instance-sensitive score maps
    inst = Conv2D(512, (1, 1), activation='relu', padding='same')(atrous_conv)
    instance_output = Conv2D(k * k, (3, 3), activation='sigmoid', padding='same')(inst)

    # Scoring branch: 3x3 512-d conv, then 1x1 conv as per-pixel logistic regression
    obj = Conv2D(512, (3, 3), activation='relu', padding='same')(atrous_conv)
    score_output = Conv2D(1, (1, 1), activation='sigmoid')(obj)

    return Model(inputs=inputs, outputs=[score_output, instance_output])

# Usage:
model = InstanceFCN(input_shape=(256, 256, 3))  # Example input shape
model.summary()  # View the model summary

Fully Convolutional Instance-aware Semantic Segmentation (FCIS)

Fully Convolutional Instance-aware Semantic Segmentation (FCIS) builds on the InstanceFCN method. InstanceFCN can only predict a fixed m×m dimensional mask and cannot classify the object into different categories. FCIS addresses both limitations by predicting masks of different dimensions while also predicting the object category.

Joint Mask Prediction and Classification

Architecture of FCIS Score Maps | Atrous Convolution in CNNs

Given an RoI, the pixel-wise score maps are produced by the assembling operation described above under InstanceFCN. For each pixel in the RoI, there are two tasks, so two score maps (an inside score map and an outside score map) are produced:

Detection: whether it belongs to an object bounding box at a relative position
Segmentation: whether it is inside an object instance’s boundary

Based on these, three cases arise:

1. High inside score and low outside score: detection+, segmentation+
2. Low inside score and high outside score: detection+, segmentation-
3. Both scores are low: detection-, segmentation-

For detection, the max operation is used to differentiate cases 1 and 2 (detection+) from case 3 (detection-). The detection score of the whole RoI is obtained via average pooling over all pixels’ likelihoods followed by the softmax operator across all the categories. For segmentation, softmax is used to differentiate case 1 (segmentation+) from the rest (segmentation-). The foreground mask of the RoI is the union of the per-pixel segmentation scores for each category.
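The per-pixel logic can be sketched as follows: a max over the inside and outside scores for detection, and a softmax over the same pair for segmentation. This is a simplified illustration for a single category with made-up scores; the softmax across categories is omitted and the function names are chosen for this example.

```python
import math


def fuse_pixel(inside, outside):
    """Return (detection_score, foreground_probability) for one pixel."""
    # Detection: high if either score is high (cases 1 and 2)
    detection = max(inside, outside)
    # Segmentation: softmax over (inside, outside); near 1 only when inside dominates
    e_in, e_out = math.exp(inside), math.exp(outside)
    foreground = e_in / (e_in + e_out)
    return detection, foreground


def roi_detection_score(pixel_scores):
    """Average the per-pixel detection scores over the whole RoI."""
    detections = [fuse_pixel(i, o)[0] for i, o in pixel_scores]
    return sum(detections) / len(detections)


cases = {
    "case 1": (5.0, -5.0),   # high inside, low outside
    "case 2": (-5.0, 5.0),   # low inside, high outside
    "case 3": (-5.0, -5.0),  # both low
}
for name, (inside, outside) in cases.items():
    detection, foreground = fuse_pixel(inside, outside)
    print(name, detection, round(foreground, 3))

print(roi_detection_score(list(cases.values())))
```

In case 3 the softmax alone is uninformative (both scores are equal), which is why the low detection score is needed to reject such pixels.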

Main Architecture of FCIS | Atrous Convolution in CNNs

ResNet is used to extract features from the input image fully convolutionally. A region proposal network (RPN) is added on top of the conv4 layer to generate the RoIs. From the conv5 feature map, 2k² × (C+1) score maps are produced (C object categories, one background category, two sets of k² score maps per category) using a 1×1 conv. layer. The RoIs (after non-maximum suppression) are classified as the categories with the highest classification scores. To obtain the foreground mask, all RoIs with intersection-over-union scores higher than 0.5 with the RoI under consideration are taken. The mask of the category is averaged on a per-pixel basis, weighted by their classification scores. The averaged mask is then binarized.
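Two pieces of arithmetic from this paragraph can be sketched in a few lines: the number of score maps, and the score-weighted averaging of overlapping masks followed by binarization. The values k = 7 and C = 20, the toy masks, and the 0.5 binarization threshold are assumptions for the example; the text above does not specify the threshold.

```python
def num_score_maps(k, num_categories):
    """2 * k^2 score maps (inside and outside) per category, plus background."""
    return 2 * k * k * (num_categories + 1)


def vote_mask(masks, scores, threshold=0.5):
    """Average masks per pixel, weighted by classification score, then binarize.

    `threshold` is an assumed value for this sketch.
    """
    total = sum(scores)
    height, width = len(masks[0]), len(masks[0][0])
    averaged = [
        [sum(s * m[y][x] for m, s in zip(masks, scores)) / total
         for x in range(width)]
        for y in range(height)
    ]
    return [[1 if v >= threshold else 0 for v in row] for row in averaged]


print(num_score_maps(k=7, num_categories=20))

# Two overlapping RoI masks (1 x 2 pixels) with their classification scores
masks = [[[1.0, 0.0]], [[1.0, 1.0]]]
scores = [0.9, 0.1]
print(vote_mask(masks, scores))
```

The second pixel is foreground only in the low-scoring mask, so its weighted average stays below the threshold and it is dropped from the final mask.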

Conclusion

Atrous Convolutions have transformed semantic segmentation by addressing the challenge of capturing contextual information without sacrificing computational efficiency. These dilated convolutions are designed to expand receptive fields while maintaining spatial resolution. They have become essential components of modern architectures such as DeepLab, LinkNet, and others.

The capability of Atrous Convolutions to capture multi-scale features and improve contextual understanding has led to their widespread adoption in cutting-edge segmentation models. As research progresses, the integration of Atrous Convolutions with other techniques holds the promise of further advancements in achieving precise, efficient, and contextually rich semantic segmentation across diverse domains.

Key Takeaways

Atrous Convolutions in CNNs help us understand complex images by looking at different scales without losing detail.
They preserve the resolution of the feature maps, which makes it easier to label each part of the image precisely.
They are seamlessly integrated into architectures like DeepLab, LinkNet & others, boosting their efficacy in accurately segmenting objects across diverse domains.

Frequently Asked Questions

Q1. What is the primary advantage of using Atrous Convolutions?

A. Atrous Convolutions allow exploring different scales within an image without compromising on its details, enabling more comprehensive feature extraction.

Q2. How do Atrous Convolutions differ from regular convolutions?

A. Unlike regular convolutions, Atrous Convolutions introduce gaps in the filter elements, effectively increasing the receptive field without downsampling.

Q3. In which applications are Atrous Convolutions commonly used?

A. Atrous Convolutions are prevalent in semantic segmentation, image classification, and object detection tasks due to their ability to preserve image details.

Q4. Do Atrous Convolutions impact computational efficiency?

A. Yes. Atrous Convolutions enlarge the receptive field without adding parameters compared with a regular convolution of the same kernel size, and they avoid the larger kernels or extra layers that would otherwise be needed. Note that keeping feature maps at high resolution still costs memory and computation.

Q5. Are Atrous Convolutions limited to specific neural network architectures?

A. No, Atrous Convolutions can be integrated into various architectures like DeepLab, LinkNet, and others, showcasing their versatility across different frameworks.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.