Getting Started with Amazon SageMaker Ground Truth

Shikha Sharma 07 Jul, 2023 • 7 min read


In this era of generative AI, data generation is at its peak. Building an accurate machine learning or AI model requires a high-quality dataset. Assuring the quality of that dataset is one of the most critical tasks: poor data leads to inaccurate analytics and unreliable predictions, which can affect an entire business and cause losses running into billions or more.
Source: Forbes

Data labeling is the first step towards data quality assurance, because it makes data understandable for AI models. Relying on humans alone is impractical, since people cannot keep up with the volume of data generated every day. In this article we look at Amazon SageMaker Ground Truth, a service for creating accurately labeled datasets.

This article was published as a part of the Data Science Blogathon.

What is Amazon SageMaker Ground Truth?

Amazon SageMaker Ground Truth is a self-service offering that makes it easier to create efficient, highly accurate datasets by handling data labeling tasks. Ground Truth lets you use human annotators through third-party vendors, Amazon Mechanical Turk, or your own private workforce, and it provides a managed experience for setting up end-to-end labeling jobs.

SageMaker Ground Truth can also generate millions of automatically labeled synthetic data samples without manual data collection or labeling on your part. It offers data labeling for various data types, including images, text, and videos, and supports machine learning tasks such as text classification, semantic segmentation, object detection, and image classification.
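To make the idea of a managed labeling job more concrete, here is a minimal sketch of how the request for one could be assembled in Python. The dictionary keys follow the parameters of SageMaker's `CreateLabelingJob` API as exposed by boto3; the helper name `build_labeling_job_request` and every bucket, ARN, and job name below are placeholders assumed for illustration, not values from this article.

```python
def build_labeling_job_request(
    job_name,
    manifest_s3_uri,
    output_s3_uri,
    role_arn,
    workteam_arn,
    ui_template_s3_uri,
    pre_lambda_arn,
    consolidation_lambda_arn,
    label_categories_s3_uri,
    workers_per_object=3,
    label_attribute_name="label",
):
    """Assemble keyword arguments for the CreateLabelingJob API (hypothetical helper)."""
    return {
        "LabelingJobName": job_name,
        "LabelAttributeName": label_attribute_name,
        # Where the unlabeled data (listed in a manifest file) lives.
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_s3_uri}}
        },
        # Where Ground Truth writes the labeled output.
        "OutputConfig": {"S3OutputPath": output_s3_uri},
        "RoleArn": role_arn,
        "LabelCategoryConfigS3Uri": label_categories_s3_uri,
        "HumanTaskConfig": {
            # The workforce: a vendor, Mechanical Turk, or a private team.
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": ui_template_s3_uri},
            "PreHumanTaskLambdaArn": pre_lambda_arn,
            "TaskTitle": "Classify each image",
            "TaskDescription": "Pick the category that best matches the image.",
            # More workers per object gives more votes but costs more.
            "NumberOfHumanWorkersPerDataObject": workers_per_object,
            "TaskTimeLimitInSeconds": 300,
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": consolidation_lambda_arn
            },
        },
    }


# All values below are placeholders; replace them with resources from your account.
request = build_labeling_job_request(
    job_name="demo-image-classification",
    manifest_s3_uri="s3://example-bucket/input/manifest.jsonl",
    output_s3_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::111122223333:role/ExampleGroundTruthRole",
    workteam_arn="<your-workteam-arn>",
    ui_template_s3_uri="s3://example-bucket/templates/template.liquid",
    pre_lambda_arn="<pre-annotation-lambda-arn>",
    consolidation_lambda_arn="<annotation-consolidation-lambda-arn>",
    label_categories_s3_uri="s3://example-bucket/input/categories.json",
)

# With AWS credentials configured, the request could then be submitted with:
#   import boto3
#   boto3.client("sagemaker").create_labeling_job(**request)
```

The sketch only builds the request, so it can be inspected without an AWS account. Automated labeling is switched on through an additional `LabelingJobAlgorithmsConfig` entry, which is left out here.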

Use cases of Amazon SageMaker Ground Truth

Here are some industry use cases of SageMaker Ground Truth:

  1. Autonomous Vehicles: Training models for autonomous vehicles requires a large amount of labeled data. SageMaker Ground Truth can annotate objects such as cars, pedestrians, traffic signs, and road markings to help develop accurate perception models for safe autonomous driving.
  2. Healthcare: Medical imaging datasets can be labeled with SageMaker Ground Truth to train models that help diagnose and identify conditions such as cancer, brain tumors, and other abnormalities. It can also be used to transcribe and annotate medical records for natural language processing (NLP) applications.
  3. Manufacturing: Labeling images and sensor data from manufacturing processes can help with quality control, defect detection, predictive maintenance, and optimizing production efficiency.

The flexibility of SageMaker Ground Truth enables its application across multiple industries where labeled datasets are required for training and improving machine learning models.

Automated Data Labeling via Ground Truth

Amazon SageMaker Ground Truth is itself an application of machine learning: it uses active learning to label data automatically and accurately. Active learning is a machine learning technique that identifies the data a model cannot label confidently on the first pass, extracts that data, and sends it to humans for labeling. Let’s walk through how Ground Truth works.
Source: LinkedIn

Step 1: Data Storage

Collect the raw, unlabeled data from different sources and store it in an Amazon S3 bucket.
Source: Sagemaker
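Ground Truth learns which S3 objects to label from an input manifest: a JSON Lines file in which each line points at one object through a `source-ref` key. The sketch below builds such a manifest as a string; the helper name `build_input_manifest`, the bucket, and the object keys are made up for illustration.

```python
import json


def build_input_manifest(bucket, keys):
    """Return JSON Lines text with one {"source-ref": ...} object per data item."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines) + "\n"


# Hypothetical bucket and object keys.
manifest = build_input_manifest(
    "example-bucket", ["images/cat1.jpg", "images/dog1.jpg"]
)
print(manifest, end="")
```

The resulting text would be uploaded to S3 alongside the data, and its S3 URI passed to the labeling job as the input manifest.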

Step 2: Sending Data to Humans

In this step, a random sample of the dataset is picked and sent to human workers for manual labeling.

Step 3: Human Labeling

As soon as the workers receive the data sample, they start labeling it.

Step 4: Label Consolidation Algorithm

Amazon SageMaker Ground Truth uses a label consolidation algorithm to reduce the risk of human error and improve the accuracy of the labeled dataset. The algorithm gathers all the labels submitted for each data point and consolidates them into a single label, based on the weight given to each label.
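As a rough illustration of weight-based consolidation, the following sketch uses a simple weighted vote: each worker's label counts in proportion to a reliability weight, and the label with the highest total wins. This is an assumed simplification for explanation, not Ground Truth's actual algorithm, and the worker IDs and weights are invented.

```python
from collections import defaultdict


def consolidate_labels(annotations, worker_weights, default_weight=1.0):
    """Pick one label from (worker_id, label) pairs by weighted vote."""
    totals = defaultdict(float)
    for worker_id, label in annotations:
        totals[label] += worker_weights.get(worker_id, default_weight)
    # Highest total weight wins; sorting first makes ties resolve alphabetically.
    return max(sorted(totals), key=totals.get)


annotations = [("w1", "cat"), ("w2", "dog"), ("w3", "dog")]

# With equal weights this is a plain majority vote.
print(consolidate_labels(annotations, {}))

# One highly trusted worker can outweigh two less reliable ones.
print(consolidate_labels(annotations, {"w1": 0.9, "w2": 0.4, "w3": 0.4}))
```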

Step 5: Resultant Dataset

The result of the previous steps is stored: a small, labeled dataset.

Step 6: Amazon SageMaker Model

Next, a machine learning model is created in the customer's account and trained on the small labeled dataset, so that it can label the rest of the unlabeled data on its own.

Step 7: Use the ML Model

In this step, the newly trained ML model is used to label the unlabeled data points of the original dataset.

Step 8: Automated Labeling

Automated labeling is applied to the remaining data with the help of the active learning method.

Step 9: High Confidence

Here the confidence score of the model's prediction is checked, and the automated annotation is applied only if the score is high.

Step 10: Low Confidence

If the confidence score is low, the automated annotation is not applied, and that portion of the data is sent to humans for labeling. The newly human-labeled data is then added to the training set, so the model can be retrained to improve its accuracy.

The entire dataset undergoes a cycle of repeating these steps until it is fully labeled.
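The cycle in Steps 2 to 10 can be simulated with a toy example. In the sketch below, everything is an assumption made for illustration: the data items are plain numbers, the "human" is a fixed rule (`true_label`), the "model" is a one-dimensional boundary between class means, and the 0.9 confidence threshold is arbitrary. It is not how Ground Truth trains or scores its models; it only shows the control flow of routing items between automated and human labeling.

```python
import random


def true_label(x):
    """Stand-in for a human annotator (hypothetical rule)."""
    return "high" if x >= 50 else "low"


def train(human_labeled):
    """Toy model: a boundary halfway between the two class means."""
    lows = [x for x, y in human_labeled.items() if y == "low"]
    highs = [x for x, y in human_labeled.items() if y == "high"]
    if not lows or not highs:
        return None  # both classes must be seen before training
    return (sum(lows) / len(lows) + sum(highs) / len(highs)) / 2


def predict(boundary, x, scale=25.0):
    """Return (label, confidence); confidence grows with distance from the boundary."""
    label = "high" if x >= boundary else "low"
    confidence = min(1.0, 0.5 + abs(x - boundary) / (2 * scale))
    return label, confidence


def active_learning_cycle(items, batch_size=4, threshold=0.9, seed=0):
    """Label unique items by alternating human batches and automated labeling."""
    rng = random.Random(seed)
    unlabeled = list(items)
    rng.shuffle(unlabeled)  # Step 2: humans receive a random sample
    labels, sources = {}, {}
    while unlabeled:
        # Steps 2 to 5: humans label the next batch.
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        for x in batch:
            labels[x], sources[x] = true_label(x), "human"
        # Step 6: train only on human-labeled data.
        boundary = train({x: y for x, y in labels.items() if sources[x] == "human"})
        if boundary is None:
            continue
        # Steps 7 to 10: auto-label confident items, keep the rest for humans.
        still_unlabeled = []
        for x in unlabeled:
            label, confidence = predict(boundary, x)
            if confidence >= threshold:
                labels[x], sources[x] = label, "auto"
            else:
                still_unlabeled.append(x)
        unlabeled = still_unlabeled
    return labels, sources


labels, sources = active_learning_cycle(range(0, 100, 5))
auto_count = sum(1 for s in sources.values() if s == "auto")
print(f"{len(labels)} items labeled, {auto_count} of them automatically")
```

Each pass removes at least one human-labeled batch from the pool, so the loop always terminates with every item labeled.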

How Amazon SageMaker Ground Truth Increases Accuracy

SageMaker Ground Truth offers two ways to improve the accuracy of training data:

1. Annotation Consolidation

The purpose of annotation consolidation is to counteract the error or bias of individual workers by sending each data object to two or more workers and then consolidating their responses into a single label for that object.
Source: Amazon

After collecting annotations from the various workers, Ground Truth applies the consolidation algorithm to compare them:

  • It detects outlier annotations and disregards them.
  • It applies a weighted consolidation of the annotations, assigning higher weights to more reliable annotations.
  • The label assigned to each object in the dataset is a probabilistic estimate of the true label. An object may have multiple annotations, but the output is a single label per object.
  • You can choose how many workers annotate each object. More workers increase the accuracy of the labels, but they also increase the labeling cost.
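The phrase "probabilistic estimate of the true label" can be illustrated with a small Bayesian calculation. The sketch below assumes a deliberately simple worker model that is not taken from Ground Truth: each worker reports the true class with a known accuracy and otherwise picks one of the remaining classes uniformly, and the prior over classes is uniform. The worker IDs and accuracies are invented.

```python
def class_posterior(votes, classes, worker_accuracy, default_accuracy=0.7):
    """Posterior over the true class given (worker_id, vote) pairs.

    Assumes at least two classes and accuracies strictly between 0 and 1.
    """
    k = len(classes)
    scores = {}
    for candidate in classes:
        probability = 1.0 / k  # uniform prior
        for worker_id, vote in votes:
            accuracy = worker_accuracy.get(worker_id, default_accuracy)
            if vote == candidate:
                probability *= accuracy
            else:
                probability *= (1.0 - accuracy) / (k - 1)
        scores[candidate] = probability
    total = sum(scores.values())
    return {candidate: score / total for candidate, score in scores.items()}


votes = [("w1", "cat"), ("w2", "dog"), ("w3", "dog")]
accuracy = {"w1": 0.9, "w2": 0.6, "w3": 0.6}
posterior = class_posterior(votes, ["cat", "dog"], accuracy)
print(posterior)
print(max(posterior, key=posterior.get))
```

Under these assumed accuracies the single reliable worker outweighs the two weaker ones, so "cat" ends up with a posterior of about 0.8 even though it received fewer votes. Adding more workers sharpens the posterior, which is the accuracy side of the cost trade-off mentioned above.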

The annotation consolidation function offered by Ground Truth applies to all predefined labeling tasks, including named entity recognition (NER), bounding box, semantic segmentation, and image and text classification. Let’s look at how each one works.

  • Named Entity Recognition (NER): Text selections are clustered by Jaccard similarity. Selection boundaries are calculated from the mode, or from the median if the mode is unclear. The label resolves to the most frequently assigned entity label in the cluster, with random selection as the tie-breaker.
  • Bounding Box Annotation: Consolidation takes the bounding boxes drawn by the various workers, finds the most similar ones using the Jaccard index (intersection over union) of the boxes, and averages them.
  • Multi-class Annotation Consolidation for Image and Text Classification: The true class is estimated from the class annotations of the individual workers via Bayesian inference.
  • Semantic Segmentation Annotation: Each pixel of an image is treated as a multi-class classification, and the pixel annotations from workers are treated as “votes.” Extra information from surrounding pixels is incorporated by applying a smoothing function to the image.
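The Jaccard index used for bounding boxes is easy to compute by hand. The sketch below shows intersection over union for two axis-aligned boxes, followed by a simplified consolidation that keeps boxes agreeing with the others and averages them. The filtering rule (mean IoU of at least 0.5) and the helper names are assumptions chosen for illustration, not Ground Truth's exact procedure.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def consolidate_boxes(boxes, min_mean_iou=0.5):
    """Average the boxes whose mean IoU with the other boxes is high enough."""
    if not boxes:
        return None
    if len(boxes) == 1:
        return tuple(float(c) for c in boxes[0])
    kept = []
    for i, box in enumerate(boxes):
        overlaps = [iou(box, other) for j, other in enumerate(boxes) if j != i]
        if sum(overlaps) / len(overlaps) >= min_mean_iou:
            kept.append(box)
    if not kept:
        return None  # no agreement between workers
    # Average each coordinate across the boxes that were kept.
    return tuple(sum(coords) / len(kept) for coords in zip(*kept))


# Three workers roughly agree; the fourth drew a box somewhere else.
worker_boxes = [
    (10, 10, 50, 50),
    (12, 12, 52, 52),
    (11, 11, 51, 51),
    (200, 200, 240, 240),
]
print(consolidate_boxes(worker_boxes))
```

Here the fourth box has no overlap with the others, so it is dropped as an outlier and the remaining three are averaged.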

2. Best Practices on Annotation Interface

The annotation interface has various features that improve the quality of human labeling. A well-organized, well-designed interface helps workers produce an accurate dataset with minimal errors. Best practices include displaying brief instructions in a fixed side panel, along with examples of good and bad labels. For bounding box annotations, the interface can also darken the background so that only the area inside the box boundary is highlighted.


We discussed how Amazon SageMaker Ground Truth helps generate high-quality datasets for machine learning models. The key takeaways of this article are:

  • Data labeling is the first step towards data quality assurance, because it makes data understandable for AI models.
  • Ground Truth can generate millions of automatically labeled synthetic data samples without manual data collection or labeling on your part.
  • Annotation consolidation and annotation interface best practices are two ways SageMaker Ground Truth improves training data accuracy.

Frequently Asked Questions

Q1. What is Amazon SageMaker Ground Truth?

A. It is a fully managed data labeling service that efficiently creates high-quality labeled datasets for training models. It combines automated labeling through machine learning with human review to deliver highly accurate annotations.

Q2. How does SageMaker Ground Truth work?

A. SageMaker Ground Truth uses a combination of automated and manual annotation techniques. It provides a web-based interface for human reviewers to annotate data based on predefined labeling tasks. The service also incorporates options for active learning, where it trains models on labeled data to propose labels for the remaining unlabeled data, thereby enhancing annotation efficiency.

Q3. Which types of data can SageMaker Ground Truth annotate?

A. SageMaker Ground Truth supports various data types, including images, text, audio, and video. It provides annotation tools for each data type, enabling accurate labeling for different use cases.

Q4. Can SageMaker Ground Truth integrate with other AWS services?

A. Yes, SageMaker Ground Truth integrates with other AWS services. For example, it can use Amazon S3 for storing data, Amazon Mechanical Turk for sourcing human reviewers, and Amazon Rekognition for automated image and video analysis.

Q5. How does SageMaker Ground Truth ensure the quality of labeled data?

A. SageMaker Ground Truth employs multiple mechanisms to ensure high-quality annotations. It includes features like review workflows, built-in annotation consolidation, and active learning to minimize errors and improve the accuracy of labeled datasets.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
