A Comprehensive Guide to UNET Architecture | Mastering Image Segmentation

Premanand S 05 Nov, 2023 • 17 min read


In computer vision, distinguishing and highlighting the objects in an image is crucial. Image segmentation, the process of splitting images into meaningful regions or objects, is essential in applications ranging from medical imaging to autonomous driving and object recognition. Accurate, automatic segmentation has long been challenging, with traditional approaches frequently falling short in accuracy and efficiency. Enter the UNET architecture, a method that has reshaped image segmentation. With its simple design and inventive techniques, UNET has paved the way for more accurate and robust segmentation results. Whether you are a newcomer to computer vision or an experienced practitioner looking to improve your segmentation skills, this in-depth article will unravel the complexities of UNET and provide a complete understanding of its architecture, components, and usefulness.

This article was published as a part of the Data Science Blogathon.

Understanding Convolution Neural Network

CNNs are a class of deep learning models frequently used in computer vision tasks, including image classification, object recognition, and image segmentation. They are designed to learn and extract relevant features from images, which makes them well suited to visual data analysis.

The critical components of CNNs

  • Convolutional Layers: A convolutional layer consists of a set of learnable filters (kernels) that are convolved with the input image or feature maps. Each filter applies element-wise multiplication and summation to produce a feature map highlighting specific patterns or local features in the input. These filters can capture many visual elements, such as edges, corners, and textures.
  • Pooling Layers: The feature maps produced by the convolutional layers are downsampled using pooling layers. Pooling reduces the spatial dimensions of the feature maps while keeping the most important information, lowering the computational cost of the following layers and making the model more robust to small variations in the input. The most common pooling operation is max pooling, which takes the largest value within a given neighborhood.
  • Activation Functions: Activation functions introduce non-linearity into the CNN model. They are applied element by element to the outputs of convolutional layers, allowing the network to model complicated relationships and make non-linear decisions. Because of its simplicity and its effectiveness against the vanishing gradient problem, the Rectified Linear Unit (ReLU) activation function is common in CNNs.
  • Fully Connected Layers: Fully connected layers, also called dense layers, use the extracted features to perform the final classification or regression. They connect every neuron in one layer to every neuron in the next, allowing the network to learn global representations and make high-level decisions based on the combined output of the previous layers.

The network begins with a stack of convolutional layers that capture low-level features, followed by pooling layers. Deeper convolutional layers learn higher-level features as the network gets deeper. Finally, one or more fully connected layers perform the classification or regression.
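To make the convolution and activation steps concrete, here is a minimal NumPy sketch. The toy image and the hand-picked edge filter are assumptions for illustration (in a real CNN the filter weights are learned), and, as in most deep learning libraries, the "convolution" is computed without flipping the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified Linear Unit: negative values become zero."""
    return np.maximum(x, 0.0)

# Toy 5x5 image: dark on the left, bright on the right (a vertical edge).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float64)
# Hand-picked vertical-edge filter; a CNN would learn such weights.
kernel = np.array([[-1, 0, 1]] * 3, dtype=np.float64)

feature_map = relu(conv2d_valid(image, kernel))
print(feature_map)
# [[3. 3. 0.]
#  [3. 3. 0.]
#  [3. 3. 0.]]
```

The feature map responds strongly where the window covers the edge and drops to zero over the uniform bright region.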

Need for a Fully Convolutional Network

Traditional CNNs are generally intended for image classification tasks in which a single label is assigned to the whole input image. They struggle, however, with finer-grained tasks like semantic segmentation, in which each pixel of an image must be assigned to a class or region. Fully Convolutional Networks (FCNs) come into play here.

Limitations of Traditional CNN Architectures in Segmentation Tasks

Loss of Spatial Information: Traditional CNNs use pooling layers to gradually reduce the spatial dimensionality of feature maps. While this downsampling helps capture high-level features, it results in a loss of spatial information, making it difficult to precisely detect and split objects at the pixel level.

Fixed Input Size: CNN architectures are often built to accept images of a specific size. However, the input images might have various dimensions in segmentation tasks, making variable-sized inputs challenging to manage with typical CNNs.

Limited Localisation Accuracy: Traditional CNNs often use fully connected layers at the end to provide a fixed-size output vector for classification. Because they do not retain spatial information, they cannot precisely localize objects or regions within the image.

Fully Convolutional Networks (FCNs) as a Solution for Semantic Segmentation

By relying exclusively on convolutional layers and maintaining spatial information throughout the network, Fully Convolutional Networks (FCNs) address the constraints of classic CNN architectures in segmentation tasks. FCNs are designed to make pixel-by-pixel predictions, with each pixel in the input image assigned a label or class. The fully connected layers at the end of the CNN are replaced with convolutional layers, and transposed convolutions (also known as deconvolutions or upsampling layers) increase the spatial resolution of the feature maps until they match the size of the input image. Upsampling the feature maps in this way lets FCNs produce a dense segmentation map with pixel-level predictions.

During upsampling, FCNs generally use skip connections, which bypass specific layers and directly link lower-level feature maps with higher-level ones. These skip connections help preserve fine-grained details and contextual information, boosting the localization accuracy of the segmented regions. FCNs are effective in various segmentation applications, including medical image segmentation, scene parsing, and instance segmentation. By using FCNs for semantic segmentation, a network can handle input images of various sizes, provide pixel-level predictions, and keep spatial information across the network.
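One way to see why a network without fully connected layers accepts any input size is to look at a 1x1 convolution, which applies the same small classifier at every pixel. The sketch below uses NumPy with made-up weights and feature sizes (all names and shapes are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, num_classes = 8, 3
# Hypothetical classifier weights, shared by every pixel.
weights = rng.normal(size=(num_features, num_classes))
bias = np.zeros(num_classes)

def pointwise_conv(feature_map):
    """Apply the same linear classifier at every pixel (a 1x1 convolution)."""
    # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)
    return feature_map @ weights + bias

small = rng.normal(size=(16, 16, num_features))
large = rng.normal(size=(40, 56, num_features))

scores_small = pointwise_conv(small)
scores_large = pointwise_conv(large)
labels_large = scores_large.argmax(axis=-1)   # one class index per pixel

print(scores_small.shape, scores_large.shape, labels_large.shape)
# (16, 16, 3) (40, 56, 3) (40, 56)
```

Because the weights do not depend on the height or width, the same layer works on both inputs and returns one prediction per pixel.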

Image Segmentation

Image segmentation is a fundamental process in computer vision in which an image is divided into meaningful, separate parts or segments. In contrast to image classification, which assigns a single label to a complete image, segmentation assigns a label to each pixel or group of pixels, essentially splitting the image into semantically significant parts. Image segmentation is important because it allows for a more detailed understanding of the contents of an image. By segmenting an image into multiple parts, we can extract considerable information about object boundaries, shapes, sizes, and spatial relationships. This fine-grained analysis is critical in various computer vision tasks, enabling improved applications and supporting higher-level interpretation of visual data.

Understanding the UNET Architecture

Traditional image segmentation techniques, such as manual annotation and pixel-wise classification, have various disadvantages that make them inefficient and poorly suited to accurate segmentation at scale. Because of these constraints, more advanced solutions, such as the UNET architecture, have been developed. Let us look at the flaws of the earlier approaches and why UNET was created to overcome them.

  • Manual Annotation: Manual annotation entails sketching and marking image boundaries or regions of interest. While this method produces reliable segmentation results, it is time-consuming, labor-intensive, and susceptible to human mistakes. Manual annotation is not scalable for large datasets, and maintaining consistency and inter-annotator agreement is difficult, especially in sophisticated segmentation tasks.
  • Pixel-wise Classification: Another common approach is pixel-wise classification, in which each pixel in an image is classified independently, generally using algorithms such as decision trees, support vector machines (SVM), or random forests. Because each pixel is treated in isolation, this approach struggles to capture global context and dependencies among neighboring pixels, resulting in over- or under-segmentation. It does not take spatial relationships into account and frequently fails to produce accurate object boundaries.

How UNET Overcomes These Challenges

The UNET architecture was developed to address these limitations and overcome the challenges faced by traditional approaches to image segmentation. Here’s how UNET tackles these issues:

  • End-to-End Learning: UNET takes an end-to-end learning approach, which means it learns to segment images directly from input-output pairs without hand-crafted features. By training on a labeled dataset, UNET can automatically extract key features and produce accurate segmentations, removing the need to annotate new images manually.
  • Fully Convolutional Architecture: UNET is based on a fully convolutional architecture, which implies that it is entirely made up of convolutional layers and does not include any fully connected layers. This architecture enables UNET to function on input images of any size, increasing its flexibility and adaptability to various segmentation tasks and input variations.
  • U-shaped Architecture with Skip Connections: The network’s characteristic architecture includes an encoding path (contracting path) and a decoding path (expanding path), allowing it to collect local information and global context. Skip connections bridge the gap between the encoding and decoding paths, maintaining critical information from previous layers and allowing for more precise segmentation.
  • Contextual Information and Localisation: UNET uses the skip connections to aggregate multi-scale feature maps from multiple layers, allowing the network to absorb contextual information and capture details at different levels of abstraction. This integration improves localization accuracy, allowing for precise object boundaries and accurate segmentation results.
  • Data Augmentation and Regularization: UNET implementations commonly use data augmentation and regularization techniques to improve robustness and generalization during training. Data augmentation increases the diversity of the training data by applying transformations such as rotations, flips, scaling, and deformations to the training images. Regularization techniques such as dropout and batch normalization help prevent overfitting and improve model performance on unseen data.
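For segmentation, one detail of augmentation matters a lot: the image and its mask must receive exactly the same geometric transformation, or the labels no longer line up with the pixels. Below is a minimal NumPy sketch with flips and 90-degree rotations only; the helper name `augment_pair` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random flip and 90-degree rotation to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    k = int(rng.integers(0, 4))              # number of 90-degree turns
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    return image, mask

rng = np.random.default_rng(42)
image = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4, 1), dtype=bool)
image[0, 1] = 255        # one bright pixel ...
mask[0, 1] = True        # ... labelled as foreground

aug_image, aug_mask = augment_pair(image, mask, rng)
# The labelled pixel still sits on the bright pixel after augmentation.
print(np.array_equal(aug_image[..., 0] > 0, aug_mask[..., 0]))   # True
```

Scaling and elastic deformations follow the same rule, but they need interpolation and are left out of this sketch.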

Overview of the UNET Architecture

UNET is a fully convolutional neural network (FCN) architecture built for image segmentation applications. It was first proposed in 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox. UNET is widely used for its accuracy in image segmentation and has become a popular choice in various medical imaging applications. UNET combines an encoding path, also called the contracting path, with a decoding path, called the expanding path. The architecture is named after its U-shaped look when depicted in a diagram. Because of this U-shaped architecture, the network can capture both local features and global context, resulting in precise segmentation results.

Critical Components of the UNET Architecture

  • Contracting Path (Encoding Path): UNET’s contracting path comprises convolutional layers followed by max pooling operations. It starts from high-resolution, low-level features and captures increasingly abstract features as it gradually reduces the spatial dimensions of the input image.
  • Expanding Path (Decoding Path): Transposed convolutions, also known as deconvolutions or upsampling layers, are used in the UNET expanding path to upsample the feature maps from the encoding path. The upsampling increases the feature maps’ spatial resolution, allowing the network to reconstruct a dense segmentation map.
  • Skip Connections: Skip connections are used in UNET to connect matching layers from the encoding path to the decoding path. These links enable the network to combine local and global information. By integrating feature maps from earlier layers with those in the decoding path, the network retains essential spatial information and improves segmentation accuracy.
  • Concatenation: Concatenation is commonly used to implement skip connections in UNET. The feature maps from the encoding path are concatenated with the upsampled feature maps from the decoding path during the upsampling procedure. This concatenation allows the network to incorporate multi-scale information for accurate segmentation, exploiting high-level context and low-level features.
  • Fully Convolutional Layers: UNET consists of convolutional layers with no fully connected layers. This convolutional architecture enables UNET to handle images of arbitrary size while preserving spatial information across the network, making it flexible and adaptable to various segmentation tasks.
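To see how these components fit together, it helps to trace the feature-map shape through the "U". The sketch below is plain Python and assumes 'same' padding, 2x2 pooling, 2x2 upsampling with stride 2, four pooling steps, and 16 filters at the first level; the 128 x 128 input size is an assumption for illustration.

```python
def unet_shapes(height, width, base_filters=16, depth=4):
    """Trace (height, width, channels) through a U-shaped network.

    Assumes 'same' padding, 2x2 pooling and 2x2 stride-2 upsampling.
    """
    shapes = []
    h, w, c = height, width, base_filters
    for _ in range(depth):                     # contracting path
        shapes.append(("down", h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    shapes.append(("bottleneck", h, w, c))
    for _ in range(depth):                     # expanding path
        h, w, c = h * 2, w * 2, c // 2
        shapes.append(("up", h, w, c))
    return shapes

for stage, h, w, c in unet_shapes(128, 128):
    print(f"{stage:>10}: {h} x {w} x {c}")
```

Each "down" level halves the resolution and doubles the channel count, and each "up" level reverses this, so every decoder level has an encoder level with the same height and width to connect to.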

Encoding Path in UNET

The encoding path, or contracting path, is an essential component of the UNET architecture. It is responsible for extracting high-level information from the input image while gradually shrinking the spatial dimensions.

Convolutional Layers

The encoding process begins with a set of convolutional layers. Convolutional layers extract information at multiple scales by applying a set of learnable filters to the input image. These filters operate on the local receptive field, allowing the network to catch spatial patterns and minor features. With each convolutional layer, the depth of the feature maps grows, allowing the network to learn more complicated representations.

Activation Function

Following each convolutional layer, an activation function such as the Rectified Linear Unit (ReLU) is applied element by element to induce non-linearity into the network. The activation function aids the network in learning non-linear correlations between input images and retrieved features.

Pooling Layers

Pooling layers are used after the convolutional layers to reduce the spatial dimensionality of the feature maps. Pooling operations such as max pooling divide the feature maps into non-overlapping regions and keep only the maximum value inside each region. Down-sampling the feature maps in this way reduces the spatial resolution and allows the network to capture more abstract, higher-level features.
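As a small illustration, 2x2 max pooling with stride 2 can be written in NumPy by splitting the feature map into non-overlapping 2x2 blocks and taking the maximum of each block. The values below are made up.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (H, W) array with even H and W."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 2]
#  [2 8]]
```

The 4x4 map becomes 2x2: the strongest response in each region is kept, while its exact position inside the region is discarded.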

The encoding path’s job is to capture features at various scales and levels of abstraction in a hierarchical manner. The encoding process focuses on extracting global context and high-level information as the spatial dimensions decrease.

Skip Connections

One of the UNET architecture’s distinguishing features is its skip connections, which link corresponding levels of the encoding path to the decoding path. These skip connections are critical for retaining information that would otherwise be lost during encoding.

Feature maps from the earlier layers of the encoding path capture local details and fine-grained information. Through the skip connections, these feature maps are concatenated with the upsampled feature maps in the decoding path. This allows the network to bring multi-scale information, both low-level features and high-level context, into the segmentation process.

By preserving spatial information from earlier layers, UNET can reliably localize objects and keep finer details in segmentation results. UNET’s skip connections help address the information loss caused by downsampling. They allow better integration of local and global information, improving segmentation performance overall.
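In terms of array shapes, a skip connection is simply a concatenation along the channel axis. Here is a minimal NumPy sketch; the 32 x 32 x 64 shapes are assumed for illustration.

```python
import numpy as np

# Assumed toy shapes: a 32x32 encoder feature map with 64 channels and the
# matching decoder feature map after upsampling, also 32x32 with 64 channels.
encoder_features = np.random.default_rng(0).normal(size=(32, 32, 64))
upsampled_features = np.random.default_rng(1).normal(size=(32, 32, 64))

# The skip connection stacks both along the channel (last) axis.
merged = np.concatenate([upsampled_features, encoder_features], axis=-1)
print(merged.shape)   # (32, 32, 128)
```

The height and width must match for the concatenation to work, and the following convolution then sees both sets of channels at every pixel.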

To summarise, the UNET encoding path is critical for capturing high-level features and reducing the spatial dimensions of the input image. It extracts progressively more abstract representations via convolutional layers, activation functions, and pooling layers. Skip connections preserve critical spatial information and combine local features with global context, which supports reliable segmentation.

Decoding Path in UNET

A critical component of the UNET architecture is the decoding path, also known as the expanding path. It is responsible for upsampling the encoding path’s feature maps and constructing the final segmentation mask.

Upsampling Layers (Transposed Convolutions)

To boost the spatial resolution of the feature maps, the UNET decoding path includes upsampling layers, frequently implemented with transposed convolutions (also called deconvolutions). In terms of shape, transposed convolutions do the opposite of regular strided convolutions: they increase the spatial dimensions rather than decrease them. Each value in the input feature map is multiplied by a learnable kernel, and the results are written to a larger output grid, with overlapping contributions summed. During training, the network learns how to fill in the gaps between the existing spatial locations, thus increasing the resolution of the feature maps.
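The following NumPy sketch shows one common configuration, a 2x2 kernel with stride 2, on a single-channel map. The kernel values are hypothetical (in a network they are learned), and real layers also sum over input channels, which is omitted here.

```python
import numpy as np

def conv_transpose_2x2(x, kernel):
    """Transposed convolution with a 2x2 kernel and stride 2 on a (H, W) array.

    Each input value is multiplied by the kernel and written to its own
    2x2 block of the output, so the result has shape (2H, 2W).
    """
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[2 * i:2 * i + 2, 2 * j:2 * j + 2] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.array([[1.0, 0.5],
                   [0.5, 0.25]])   # hypothetical weights; learned in practice
up = conv_transpose_2x2(x, kernel)
print(up.shape)   # (4, 4)
print(up)
```

With a stride equal to the kernel size the output blocks do not overlap; with a larger kernel they would, and the overlapping contributions would be added together.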


Alongside upsampling, the UNET decoding path uses skip connections from the corresponding levels of the encoding path. During decoding, the feature maps from those encoder levels are concatenated with the upsampled feature maps. This concatenation enables the network to aggregate multi-scale information for accurate segmentation, leveraging high-level context and low-level features.

By concatenating feature maps from the skip connections, the network can recover and integrate fine-grained details lost during encoding. This enables more precise object localization and delineation in the segmentation mask.

By progressively upsampling the feature maps and incorporating skip connections, the decoding path in UNET reconstructs a dense segmentation map that matches the spatial resolution of the input image.

The decoding path’s function is to recover spatial information lost during the encoding path and refine the segmentation results. It combines low-level details from the encoder with the high-level context carried by the upsampled feature maps to produce an accurate and thorough segmentation mask.

UNET can boost the spatial resolution of the feature maps by using transposed convolutions in the decoding process, thereby upsampling them to match the original image size. Transposed convolutions assist the network in generating a dense and fine-grained segmentation mask by learning to fill in the gaps and expand the spatial dimensions.

In summary, the decoding process in UNET reconstructs the segmentation mask by enhancing the spatial resolution of the feature maps via upsampling layers and skip connections. Transposed convolutions are critical in this phase because they allow the network to upsample the feature maps and build a detailed segmentation mask that matches the original input image.

Contracting and Expanding Paths in UNET

The UNET architecture follows an “encoder-decoder” structure, where the contracting path represents the encoder, and the expanding path represents the decoder. This design resembles encoding information into a compressed form and then decoding it to reconstruct the original data.

Contracting Path (Encoder)

The encoder in UNET is the contracting path. It extracts context and compresses the input image by gradually decreasing the spatial dimensions. This method includes convolutional layers followed by pooling procedures such as max pooling to downsample the feature maps. The contracting path is responsible for obtaining high-level characteristics, learning global context, and decreasing spatial resolution. It focuses on compressing and abstracting the input image, efficiently capturing relevant information for segmentation.

Expanding Path (Decoder)

The decoder in UNET is the expanding path. By upsampling the feature maps from the contracting path, it recovers spatial information and generates the final segmentation map. The expanding path comprises upsampling layers, often implemented with transposed convolutions (deconvolutions), that increase the spatial resolution of the feature maps. Through skip connections, it merges the upsampled feature maps with the corresponding maps from the contracting path and so reconstructs the original spatial dimensions. This enables the network to recover fine-grained features and localize objects accurately.

The UNET design captures global context and local details by combining the contracting and expanding paths. The contracting path compresses the input image into a compact representation, which the expanding path then decodes into a dense and precise segmentation map. Along the way, the expanding path reconstructs the missing spatial information and refines the segmentation results. This encoder-decoder structure enables precise segmentation using high-level context and fine-grained spatial information.

In summary, UNET’s contracting and expanding routes resemble an “encoder-decoder” structure. The expanding path is the decoder, recovering spatial information and generating the final segmentation map. In contrast, the contracting path serves as the encoder, capturing context and compressing the input image. This architecture enables UNET to encode and decode information effectively, allowing for accurate and thorough image segmentation.

Skip Connections in UNET

Skip connections are essential to the UNET design because they allow information to travel between the contracting (encoding) and expanding (decoding) paths. They are critical for maintaining spatial information and improving segmentation accuracy.

Preserving Spatial Information

Some spatial information may be lost during the encoding path as the feature maps undergo downsampling procedures such as max pooling. This information loss can lead to lower localization accuracy and a loss of fine-grained details in the segmentation mask.

By establishing direct connections between corresponding layers in the encoding and decoding processes, skip connections help to address this issue. Skip connections protect vital spatial information that would otherwise be lost during downsampling. These connections allow information from the encoding stream to avoid downsampling and be transmitted directly to the decoding path.
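A tiny NumPy experiment illustrates the problem that skip connections address. It downsamples a toy map with 2x2 max pooling and then upsamples it by repeating values; the repeat step is a simplified stand-in for learned upsampling, and the numbers are made up.

```python
import numpy as np

original = np.array([[0, 9, 0, 0],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0],
                     [0, 0, 0, 7]])

# Downsample with 2x2 max pooling, then upsample by repeating each value.
pooled = original.reshape(2, 2, 2, 2).max(axis=(1, 3))
restored = pooled.repeat(2, axis=0).repeat(2, axis=1)

print(restored)
# [[9 9 0 0]
#  [9 9 0 0]
#  [0 0 7 7]
#  [0 0 7 7]]

# Count the pixels where the round trip no longer matches the original.
position_error = np.count_nonzero(restored != original)
print(position_error)   # 6
```

The values 9 and 7 survive pooling, but each is smeared over a 2x2 block, so their exact positions are gone. A skip connection hands the pre-pooling map to the decoder, which gives it the information needed to place boundaries precisely.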

Multi-scale Information Fusion

Skip connections allow the merging of multi-scale information from many network layers. Later levels of the encoding path capture high-level context and semantic information, whereas earlier layers capture local details and fine-grained information. By connecting these feature maps from the encoding path to the corresponding layers in the decoding path, UNET can combine local and global information. This integration of multi-scale information improves segmentation accuracy overall. The network can use low-level information from the encoding path to refine segmentation results in the decoding path, allowing for more precise localization and better object boundary delineation.

Combining High-Level Context and Low-Level Details

Skip connections allow the decoding path to combine high-level context and low-level details. The concatenated feature maps from the skip connections include the decoding path’s upsampled feature maps and the encoding path’s feature maps.

This combination enables the network to take advantage of the high-level context carried by the decoding path and the fine-grained features captured in the encoding path. The network can incorporate information at several scales, allowing for more precise and detailed segmentation.

By adding skip connections, UNET can take advantage of multi-scale information, preserve spatial details, and merge high-level context with low-level details. As a result, segmentation accuracy and object localization improve, and fine-grained information in the segmentation mask is retained.

In conclusion, skip connections in UNETs are critical for maintaining spatial information, integrating multi-scale information, and boosting segmentation accuracy. They provide direct information flow across the encoding and decoding routes, allowing the network to collect local and global details, resulting in more precise and detailed image segmentation.

Loss Function in UNET

Selecting an appropriate loss function is critical when training UNET and optimizing its parameters for image segmentation tasks. UNET is frequently trained with segmentation-friendly loss functions such as the Dice coefficient loss or cross-entropy loss.

Dice Coefficient Loss

The Dice coefficient is a similarity measure that quantifies the overlap between the predicted and ground truth segmentation masks. The Dice coefficient loss, or soft Dice loss, is calculated as one minus the Dice coefficient. When the predicted and ground truth masks align well, the Dice coefficient is high and the loss is low.

The Dice coefficient loss is especially effective for imbalanced datasets in which the background class dominates the pixel count. By penalizing false positives and false negatives, it encourages the network to segment both foreground and background regions accurately.
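For two masks A and B, the Dice coefficient is 2|A ∩ B| / (|A| + |B|). Below is a minimal NumPy sketch of the soft version, in which the predictions may be probabilities; the small `eps` term is an assumption added to avoid division by zero when both masks are empty.

```python
import numpy as np

def dice_coefficient(y_true, y_pred, eps=1e-7):
    """Soft Dice: 2 * sum(t * p) / (sum(t) + sum(p)); eps avoids division by zero."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

def dice_loss(y_true, y_pred):
    """Dice loss is one minus the Dice coefficient."""
    return 1.0 - dice_coefficient(y_true, y_pred)

truth = np.array([[0, 1, 1, 0],
                  [0, 1, 1, 0]])
perfect = truth.copy()
shifted = np.array([[1, 1, 0, 0],      # same object, moved one pixel left
                    [1, 1, 0, 0]])

print(round(dice_loss(truth, perfect), 4))   # 0.0
print(round(dice_loss(truth, shifted), 4))   # 0.5
```

Note that the background pixels do not enter the score directly, which is why the loss stays informative even when the foreground is a small part of the image.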

Cross-Entropy Loss

Cross-entropy loss is also widely used in image segmentation tasks. It measures the dissimilarity between the predicted class probabilities and the ground truth labels. In image segmentation, each pixel is treated as an independent classification problem, and the cross-entropy loss is computed pixel-wise.

The cross-entropy loss encourages the network to assign high probabilities to the correct class labels for each pixel. It penalizes deviations from the ground truth, promoting accurate segmentation results. This loss function is effective when the foreground and background classes are balanced or when multiple classes are involved in the segmentation task.
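For the two-class case (foreground versus background), the pixel-wise loss is the binary cross-entropy averaged over all pixels. Here is a minimal NumPy sketch with made-up probabilities; the clipping constant `eps` is an assumption added to keep the logarithm finite.

```python
import numpy as np

def pixelwise_bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    per_pixel = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_pixel.mean()

truth = np.array([[1, 0],
                  [0, 1]])
confident = np.array([[0.9, 0.1],
                      [0.1, 0.9]])
unsure = np.array([[0.6, 0.4],
                   [0.4, 0.6]])

loss_confident = pixelwise_bce(truth, confident)
loss_unsure = pixelwise_bce(truth, unsure)
print(loss_confident < loss_unsure)   # True
```

Both predictions would give the same mask after thresholding at 0.5, but the loss still rewards the more confident correct prediction.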

The choice between the Dice coefficient loss and cross-entropy loss depends on the segmentation task’s specific requirements and the dataset’s characteristics. Both loss functions have advantages and can be combined or customized based on specific needs.

1: Importing Libraries

import tensorflow as tf
import os
import numpy as np
from tqdm import tqdm
from skimage.io import imread, imshow
from skimage.transform import resize
import matplotlib.pyplot as plt
import random

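2: Image Dimensions – Settings

# Assumed values: the original settings are not shown here.
# Height and width must be divisible by 16 because the model pools four times.
IMG_WIDTH = 128
IMG_HEIGHT = 128
IMG_CHANNELS = 3
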
3: Setting the Randomness

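seed = 42
random.seed(seed)       # Python's built-in RNG (used by random.randint below)
np.random.seed(seed)    # call the function; assigning to it would overwrite it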
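Seeding only takes effect when the seed functions are called. A quick way to check that a seed makes the random draws repeatable is sketched below; the helper `draw` is hypothetical and exists only for this check.

```python
import random
import numpy as np

def draw(seed):
    """Seed both generators, then draw one integer and three floats."""
    random.seed(seed)
    np.random.seed(seed)
    return random.randint(0, 100), np.random.rand(3)

first_int, first_arr = draw(42)
second_int, second_arr = draw(42)
print(first_int == second_int and np.array_equal(first_arr, second_arr))   # True
```

This covers Python's and NumPy's generators only; TensorFlow keeps its own random state, which is not touched here.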

4: Importing the Dataset

# Data downloaded from - https://www.kaggle.com/competitions/data-science-bowl-2018/data 
#importing datasets
TRAIN_PATH = 'stage1_train/'
TEST_PATH = 'stage1_test/'

5: Reading all the Images Present in the Subfolder

train_ids = next(os.walk(TRAIN_PATH))[1]
test_ids = next(os.walk(TEST_PATH))[1]

6: Training

X_train = np.zeros((len(train_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
# np.bool was removed in NumPy 1.24; the built-in bool is the replacement.
Y_train = np.zeros((len(train_ids), IMG_HEIGHT, IMG_WIDTH, 1), dtype=bool)

7: Resizing the Images

print('Resizing training images and masks')
for n, id_ in tqdm(enumerate(train_ids), total=len(train_ids)):   
    path = TRAIN_PATH + id_
    img = imread(path + '/images/' + id_ + '.png')[:,:,:IMG_CHANNELS]  
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_train[n] = img  #Fill empty X_train with values from img
    mask = np.zeros((IMG_HEIGHT, IMG_WIDTH, 1), dtype=bool)
    for mask_file in next(os.walk(path + '/masks/'))[2]:
        mask_ = imread(path + '/masks/' + mask_file)
        mask_ = np.expand_dims(resize(mask_, (IMG_HEIGHT, IMG_WIDTH), mode='constant',  
                                      preserve_range=True), axis=-1)
        mask = np.maximum(mask, mask_)  
    Y_train[n] = mask   
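In this dataset each image folder contains several mask files, and the loop above merges them into one mask with np.maximum. The merging step on its own can be sketched with two tiny hypothetical masks:

```python
import numpy as np

height, width = 4, 6
# Two hypothetical per-nucleus masks, one object each.
nucleus_a = np.zeros((height, width, 1), dtype=bool)
nucleus_b = np.zeros((height, width, 1), dtype=bool)
nucleus_a[0:2, 0:2] = True
nucleus_b[2:4, 3:6] = True

combined = np.zeros((height, width, 1), dtype=bool)
for single_mask in (nucleus_a, nucleus_b):
    combined = np.maximum(combined, single_mask)   # pixel-wise OR for booleans

print(combined[..., 0].astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [0 0 0 1 1 1]
#  [0 0 0 1 1 1]]
```

The result marks every pixel that belongs to any nucleus, which is the binary target the network is trained on.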

8: Testing the Images

# test images
X_test = np.zeros((len(test_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
sizes_test = []
print('Resizing test images') 
for n, id_ in tqdm(enumerate(test_ids), total=len(test_ids)):
    path = TEST_PATH + id_
    img = imread(path + '/images/' + id_ + '.png')[:,:,:IMG_CHANNELS]
    sizes_test.append([img.shape[0], img.shape[1]])
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_test[n] = img


9: Random Check of the Images

image_x = random.randint(0, len(train_ids) - 1)  # randint includes both endpoints

10: Building the Model

inputs = tf.keras.layers.Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = tf.keras.layers.Lambda(lambda x: x / 255)(inputs)

11: Paths

#Contraction path
c1 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(s)
c1 = tf.keras.layers.Dropout(0.1)(c1)
c1 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(c1)
p1 = tf.keras.layers.MaxPooling2D((2, 2))(c1)

c2 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(p1)
c2 = tf.keras.layers.Dropout(0.1)(c2)
c2 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(c2)
p2 = tf.keras.layers.MaxPooling2D((2, 2))(c2)

c3 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(p2)
c3 = tf.keras.layers.Dropout(0.2)(c3)
c3 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(c3)
p3 = tf.keras.layers.MaxPooling2D((2, 2))(c3)

c4 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(p3)
c4 = tf.keras.layers.Dropout(0.2)(c4)
c4 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(c4)
p4 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(c4)

c5 = tf.keras.layers.Conv2D(256, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(p4)
c5 = tf.keras.layers.Dropout(0.3)(c5)
c5 = tf.keras.layers.Conv2D(256, (3, 3), activation='relu',
                            kernel_initializer='he_normal', padding='same')(c5)

12: Expansion Paths

# Each block: upsample, concatenate with the matching encoder output, then two convolutions.
u6 = tf.keras.layers.Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(c5)
u6 = tf.keras.layers.concatenate([u6, c4])
c6 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(u6)
c6 = tf.keras.layers.Dropout(0.2)(c6)
c6 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(c6)

u7 = tf.keras.layers.Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(c6)
u7 = tf.keras.layers.concatenate([u7, c3])
c7 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(u7)
c7 = tf.keras.layers.Dropout(0.2)(c7)
c7 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(c7)

u8 = tf.keras.layers.Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same')(c7)
u8 = tf.keras.layers.concatenate([u8, c2])
c8 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(u8)
c8 = tf.keras.layers.Dropout(0.1)(c8)
c8 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(c8)

u9 = tf.keras.layers.Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same')(c8)
u9 = tf.keras.layers.concatenate([u9, c1], axis=3)
c9 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(u9)
c9 = tf.keras.layers.Dropout(0.1)(c9)
c9 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu', kernel_initializer='he_normal',
                            padding='same')(c9)

13: Outputs

outputs = tf.keras.layers.Conv2D(1, (1, 1), activation='sigmoid')(c9)

14: Summary

model = tf.keras.Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()  # print the layer-by-layer summary

15: Model Checkpoint

checkpointer = tf.keras.callbacks.ModelCheckpoint('model_for_nuclei.h5',
                                                  verbose=1, save_best_only=True)

# Stop early when the validation loss stops improving, and save the best model.
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=2, monitor='val_loss'),
    checkpointer,
]

results = model.fit(X_train, Y_train, validation_split=0.1, batch_size=16, epochs=25,
                    callbacks=callbacks)

16: Last Stage – Prediction

idx = random.randint(0, len(X_train) - 1)  # randint includes both endpoints

preds_train = model.predict(X_train[:int(X_train.shape[0]*0.9)], verbose=1)
preds_val = model.predict(X_train[int(X_train.shape[0]*0.9):], verbose=1)
preds_test = model.predict(X_test, verbose=1)

preds_train_t = (preds_train > 0.5).astype(np.uint8)
preds_val_t = (preds_val > 0.5).astype(np.uint8)
preds_test_t = (preds_test > 0.5).astype(np.uint8)
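
The thresholding step turns the sigmoid probabilities into a binary mask. As a minimal sketch of what happens per pixel, the plain-Python example below applies a sigmoid to a few made-up logits and then the same strict > 0.5 comparison. The values and the helper names sigmoid and binarize are illustrative only.

```python
import math


def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(prob_map, threshold=0.5):
    """Label a pixel 1 only if its probability is strictly above the threshold."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]


# Made-up 2x2 "image" of logits from the final layer.
logits = [[-2.0, 0.0], [0.1, 3.0]]
probs = [[sigmoid(v) for v in row] for row in logits]
mask = binarize(probs)

if __name__ == "__main__":
    print(mask)  # [[0, 0], [1, 1]]
```

Because the comparison is strict, a pixel with probability exactly 0.5 (logit 0) is labeled background.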

# Perform a sanity check on some random training samples
ix = random.randint(0, len(preds_train_t) - 1)  # randint includes both endpoints

# Perform a sanity check on some random validation samples
ix = random.randint(0, len(preds_val_t) - 1)
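
When picking a random sample for such a sanity check, the index must stay within the array bounds. Python's random.randint(a, b) can return b itself, whereas random.randrange(n) returns a value from 0 to n - 1. A minimal sketch, with pick_index as a hypothetical helper:

```python
import random


def pick_index(n_items, rng=random):
    """Return a valid random index into a sequence of length n_items."""
    # randrange excludes the upper bound, so the result is in [0, n_items - 1].
    return rng.randrange(n_items)


if __name__ == "__main__":
    samples = ["img_a", "img_b", "img_c"]
    print(samples[pick_index(len(samples))])
```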


In this comprehensive blog post, we have covered the UNET architecture for image segmentation. By addressing the constraints of earlier methods, UNET has changed how image segmentation is done. Its encoding and decoding paths and skip connections, along with variants such as U-Net++, Attention U-Net, and Dense U-Net, have proven highly effective at capturing context, preserving spatial information, and improving segmentation accuracy. Accurate, automatic segmentation with UNET opens new avenues in computer vision and beyond. We encourage readers to learn more about UNET and experiment with its implementation in their own image segmentation projects.

Key Takeaways

1. Image segmentation is essential in computer vision tasks, allowing the division of images into meaningful regions or objects.

2. Traditional approaches to image segmentation, such as manual annotation and pixel-wise classification, have limitations in terms of efficiency and accuracy.

3. The UNET architecture was developed to address these limitations and achieve accurate segmentation results.

4. It is a fully convolutional neural network (FCN) combining an encoding path that captures high-level features with a decoding path that generates the segmentation mask.

5. Skip connections in UNET preserve spatial information, enhance feature propagation, and improve segmentation accuracy.

6. UNET has found successful applications in medical imaging, satellite imagery analysis, and industrial quality control, achieving notable benchmarks and recognition in competitions.

Frequently Asked Questions

Q1. What is the U-Net architecture, and what is it used for?

A. The U-Net architecture is a popular convolutional neural network (CNN) architecture commonly used for image segmentation tasks. Initially developed for biomedical image segmentation, it has since found applications in various domains. It has a U-shaped encoder-decoder structure that handles both local and global information.

Q2. How does the U-Net architecture work?

A. The U-Net architecture consists of an encoder path and a decoder path. The encoder path gradually reduces the spatial dimensions of the input image while increasing the number of feature channels, which helps it extract abstract, high-level features. The decoder path performs upsampling and concatenation operations to recover the spatial dimensions while reducing the number of feature channels. The network learns to combine the low-level features from the encoder path with the high-level features from the decoder path to generate segmentation masks.

Q3. What are the advantages of using the U-Net architecture?

A. The U-Net architecture offers several advantages for image segmentation tasks. Firstly, its U-shaped design allows for combining low-level and high-level features, enabling better localization of objects. Secondly, the skip connections between the encoder and decoder paths help preserve spatial information, allowing for more precise segmentation. Lastly, the U-Net architecture has a relatively small number of parameters, which makes it computationally cheaper than many larger architectures.
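
A toy example can show why skip connections help preserve spatial information. The sketch below works on a single row of pixels in plain Python: max pooling halves the resolution, and upsampling alone cannot tell where inside each pooled window the original peak was. Pairing the upsampled values with the original row (a stand-in for concatenation) brings the exact positions back. This is a simplification: the real network uses learned transposed convolutions on 2D feature maps, not nearest-neighbor repetition, and the function names are hypothetical.

```python
def max_pool_1d(signal, size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]


def upsample_nearest_1d(signal, factor=2):
    """Repeat each value 'factor' times to restore the original length."""
    return [v for v in signal for _ in range(factor)]


row = [0, 9, 0, 0, 3, 0, 0, 0]           # encoder feature before pooling
pooled = max_pool_1d(row)                # [9, 0, 3, 0]
restored = upsample_nearest_1d(pooled)   # [9, 9, 0, 0, 3, 3, 0, 0]

# Stand-in for concatenation: each position now carries both the coarse
# upsampled value and the fine-grained encoder value.
with_skip = list(zip(restored, row))

if __name__ == "__main__":
    print(restored == row)   # False: the peak positions were blurred
    print(with_skip[:2])     # [(9, 0), (9, 9)]
```

The second channel of with_skip shows that the peak sits at position 1, not position 0, which is the detail the upsampled signal alone had lost.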

Q4. Why is U-Net better than CNN?

A. U-Net is itself a CNN, so the comparison is really with a plain CNN built for classification. For image segmentation, U-Net has the advantage of a U-shaped encoder-decoder design that captures both high-level and low-level features of an image, together with skip connections that preserve spatial information. This makes it better at segmenting fine-grained details, even with limited data.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 

Premanand S 05 Nov 2023

Learner, Assistant Professor Junior & Machine Learning enthusiast
