Tracking with Efficient Re-Identification in YOLO

Shaik Hamzah | Last Updated: 05 May, 2025
12 min read

Identifying objects with real-time detectors like YOLO, SSD, and DETR has always been key to monitoring the movement and actions of objects within a frame. Several industries, such as traffic management, shopping malls, security, and personal protective equipment monitoring, use this mechanism for tracking, monitoring, and gaining analytics. The greatest challenge for such models, however, is that anchor or bounding boxes often lose track of an object when another object overlaps the one being tracked. The occlusion changes the identification tags of objects, and such ID switches can cause unwanted increments in tracking systems, especially when they feed analytics. Further in this article, we will discuss how Re-ID in YOLO can be adopted.

Object Detection and Tracking as a Multi-Step Process

  1. Object Detection: Object detection basically detects, localizes, and classifies objects within a frame. There are many object detection algorithms out there, such as Fast R-CNN, Faster R-CNN, YOLO, Detectron, etc. YOLO is optimized for speed, while Faster R-CNN leans towards higher precision.
  2. Unique ID Assignment: In a real-world object tracking scenario, there is usually more than one object to track. Thus, following the detection in the initial frame, each object will be assigned a unique ID to be used throughout the sequence of images or videos. The ID management system plays a crucial role in generating robust analytics, avoiding duplication, and supporting long-term pattern recognition.
  3. Motion Tracking: The tracker estimates the positions of each unique object in the remaining images or frames to obtain the trajectories of each individual re-identified object. Predictive tracking models like Kalman Filters and Optical Flow are often used in conjunction to account for temporary occlusions or rapid motion.
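
To make these three steps concrete, here is a quick sketch using the Ultralytics tracking API (the model name, video path, and the idea of storing center points per ID are illustrative choices, not requirements):

```python
from collections import defaultdict

from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # step 1: per-frame object detection
trajectories = defaultdict(list)     # step 3: motion history per unique track ID

# stream=True yields one result per frame; persist=True keeps IDs across frames (step 2)
for result in model.track(source="path/to/video.mp4", stream=True, persist=True):
    if result.boxes.id is None:      # no confirmed tracks in this frame
        continue
    for box, track_id in zip(result.boxes.xywh, result.boxes.id.int().tolist()):
        x_c, y_c, w, h = box.tolist()
        trajectories[track_id].append((x_c, y_c))   # accumulate the trajectory
```

Each track_id returned by the tracker is the unique ID from step 2, and the accumulated center points form the per-object trajectory from step 3.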

So Why Re-ID?

Re-ID, or re-identification, plays an important role here: it lets us preserve the identity of a tracked object. Re-identification allows the short-term recovery of lost tracks by comparing the visual similarity between objects using embeddings, which are typically generated by a separate model that processes cropped object images. However, that extra model adds latency to the pipeline, which can hurt FPS in real-time detection.

Researchers often train these embeddings on large-scale person or object Re-ID datasets, allowing them to capture fine-grained details like clothing texture, colour, or structural features that stay consistent despite changes in pose and lighting. Several deep learning approaches have combined tracking and Re-ID in earlier work. Popular tracker models include DeepSORT, Norfair, FairMOT, ByteTrack, and others.


Let’s Discuss Some Widely Used Tracking Methods

1. Some Old Strategies

Some older strategies store each ID locally along with its corresponding frame and picture snippet, then reassign IDs to objects based on visual similarity. However, this strategy consumes significant time and memory, and because the manual Re-ID logic doesn’t handle changes in viewpoint, background clutter, or resolution degradation well, it lacks the robustness needed for scalable or real-time systems.

2. ByteTrack

ByteTrack’s core idea is really simple. Instead of ignoring all low-confidence detections, it retains the non-background low-score boxes for a second association pass, which boosts track consistency under occlusion. After the initial detection stage, the system partitions boxes into high-confidence, low-confidence (but non-background), and background (discarded) sets.

First, it matches high-confidence boxes to both active and recently lost tracklets using IoU or optionally feature-similarity affinities, applying the Hungarian algorithm with a strict threshold. The system then uses any unmatched high-confidence detections to either spawn new tracks or queue them for a single-frame retry.

In the secondary pass, the system matches low-confidence boxes to the remaining tracklet predictions using a lower threshold. This step recovers objects whose confidence has dropped due to occlusion or appearance shifts. If any tracklets still remain unmatched, the system moves them into a “lost” buffer for a certain duration, allowing it to reincorporate them if they reappear. This generic two-stage framework integrates seamlessly with any detector model (YOLO, Faster-RCNN, etc.) and any association metric, delivering 50–60 FPS with minimal overhead.
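
A minimal sketch of this two-pass association is shown below. The helper names, the thresholds, and the plain IoU cost are simplifications for illustration, not the official ByteTrack or Ultralytics implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(a, b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.prod(np.clip(br - tl, 0, None), axis=2)
    area_a = np.prod(a[:, 2:] - a[:, :2], axis=1)
    area_b = np.prod(b[:, 2:] - b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)


def match(track_boxes, det_boxes, max_cost):
    """Hungarian matching on (1 - IoU); pairs costlier than max_cost stay unmatched."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    cost = 1.0 - iou_matrix(track_boxes, det_boxes)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_t, matched_d = {r for r, _ in matches}, {c for _, c in matches}
    unmatched_t = [r for r in range(len(track_boxes)) if r not in matched_t]
    unmatched_d = [c for c in range(len(det_boxes)) if c not in matched_d]
    return matches, unmatched_t, unmatched_d


def two_pass_association(track_boxes, boxes, scores, high_conf=0.6, low_conf=0.1):
    """track_boxes, boxes: (N, 4) arrays; scores: (N,) detection confidences."""
    high = boxes[scores >= high_conf]                          # confident detections
    low = boxes[(scores >= low_conf) & (scores < high_conf)]   # kept, not discarded
    # Pass 1: high-confidence detections vs. all active/lost tracks
    m1, unmatched_tracks, new_candidates = match(track_boxes, high, max_cost=0.8)
    # Pass 2: low-confidence detections vs. the tracks left over from pass 1
    leftover = np.asarray(track_boxes, dtype=float)[unmatched_tracks]
    m2, still_lost, _ = match(leftover, low, max_cost=0.5)     # m2 indexes into `leftover`
    return m1, m2, new_candidates, still_lost
```

Unmatched high-confidence detections (new_candidates) would seed new tracks, while tracks left unmatched after both passes (still_lost) move into the lost buffer.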

However, ByteTrack still suffers identity switches when objects cross paths, disappear for longer periods, or undergo drastic appearance changes. Adding a dedicated Re-ID embedding network can mitigate these errors, but at the cost of an extra 15–25 ms per frame and increased memory usage.

If you want to refer to the ByteTrack GitHub, click here: ByteTrack


3. DeepSORT

DeepSORT enhances the classic SORT tracker by fusing deep appearance features with motion and spatial cues to significantly reduce ID switches, especially under occlusions or sudden motion changes. To see how DeepSORT builds on SORT, we need to understand the four core components of SORT:

  • Detection: A per‑frame object detector (e.g., YOLO, Faster R‑CNN) outputs bounding boxes for each object.
  • Estimation: A constant‑velocity Kalman filter projects each track’s state (position and velocity) into the next frame, updating its estimate whenever a matching detection is found.
  • Data Association: An IOU cost matrix is computed between predicted track boxes and new detections; the Hungarian algorithm solves this assignment, subject to an IOU(min) threshold to handle simple overlap and short occlusions.
  • Track Creation & Deletion: Unmatched detections initialize new tracks; tracks missing detections for longer than a user‑defined Tₗₒₛₜ frames are terminated, and reappearing objects receive new IDs.
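
For intuition, a bare-bones constant-velocity Kalman filter over SORT’s 7-dimensional state [u, v, s, r, u̇, v̇, ṡ] might look like the sketch below; the noise magnitudes are assumed values, not those of the reference implementation:

```python
import numpy as np


class ConstantVelocityKF:
    """Toy constant-velocity Kalman filter for a SORT-style box state."""

    def __init__(self, dim=7, dt=1.0):
        self.F = np.eye(dim)                 # state transition (constant velocity)
        for i in range(dim - 4):             # u += u_dot*dt, v += v_dot*dt, s += s_dot*dt
            self.F[i, i + 4] = dt
        self.H = np.eye(4, dim)              # we only observe [u, v, s, r]
        self.Q = np.eye(dim) * 1e-2          # process noise (assumed)
        self.R = np.eye(4) * 1e-1            # measurement noise (assumed)

    def predict(self, mean, cov):
        """Project the track state into the next frame."""
        return self.F @ mean, self.F @ cov @ self.F.T + self.Q

    def update(self, mean, cov, z):
        """Correct the prediction with a matched detection z = [u, v, s, r]."""
        S = self.H @ cov @ self.H.T + self.R
        K = cov @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        mean = mean + K @ (z - self.H @ mean)
        cov = (np.eye(len(mean)) - K @ self.H) @ cov
        return mean, cov
```

predict() runs for every track each frame, while update() runs only when the data-association step matches a detection to that track.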

SORT achieves real-time performance on modern hardware due to its speed, but it relies solely on motion and spatial overlap. This often causes it to swap object identities when they cross paths, become occluded, or remain blocked for extended periods. To address this, DeepSORT trains a discriminative feature embedding network offline—typically using large-scale person Re-ID datasets—to generate 128-D appearance vectors for each detection crop. During association, DeepSORT computes a combined affinity score that incorporates:

  1. Motion-based distance (Mahalanobis distance from the Kalman filter)
  2. Spatial IoU distance
  3. Appearance cosine distance between embeddings

Because the cosine metric remains stable even when motion cues fail, such as during long‑term occlusions or abrupt changes in velocity, DeepSORT can correctly reassign the original track ID once an object re‑emerges.
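
As an illustration of how these cues can be fused, the sketch below gates implausible motion with the Mahalanobis distance and blends the appearance and IoU distances into one cost matrix. The function names, the blending weights, and the exact combination are assumptions for clarity rather than the official DeepSORT code:

```python
import numpy as np


def cosine_distance(track_embs, det_embs):
    """1 - cosine similarity between appearance embeddings (rows are vectors)."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    return 1.0 - t @ d.T


def combined_affinity(maha, iou_dist, app_dist, w_app=0.7, w_iou=0.3, maha_gate=9.4877):
    """Blend the three cues into one cost matrix of shape (tracks, detections).

    maha      : squared Mahalanobis distances from the Kalman filter
    iou_dist  : 1 - IoU between predicted track boxes and new detections
    app_dist  : cosine distances between appearance embeddings
    maha_gate : chi-square gate (~9.49 for 4 degrees of freedom at 95%)
    """
    cost = w_app * app_dist + w_iou * iou_dist
    cost[maha > maha_gate] = 1e5   # motion says the pair is implausible: forbid it
    return cost
```

The Hungarian algorithm then runs on this fused cost matrix exactly as it does in the IoU-only case.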

Additional Details & Trade‑offs:

  • The embedding network typically adds ~20–30 ms of per‑frame latency and increases GPU memory usage, reducing throughput by up to 50%.
  • To limit growth in computational cost, DeepSORT maintains a fixed‑length gallery of recent embeddings per track (e.g., last 50 frames), but even so, large galleries in crowded scenes can slow association.
  • Despite the overhead, DeepSORT often improves IDF1 by 15–20 points over SORT on standard benchmarks (e.g., MOT17), making it a go-to solution when identity persistence is critical.

4. FairMOT

FairMOT is a truly single‑shot multi‑object tracker that simultaneously performs object detection and re‑identification in one unified network, delivering both high accuracy and efficiency. When an input image is fed into FairMOT, it passes through a shared backbone and then splits into two homogeneous branches: the detection branch and the Re‑ID branch. The detection branch adopts an anchor‑free CenterNet‑style head with three sub‑heads: Heatmap, Box Size, and Center Offset.

  • The Heatmap head pinpoints the centers of objects on a downsampled feature map
  • The Box Size head predicts each object’s width and height 
  • The Center Offset head corrects any misalignment (up to four pixels) caused by downsampling, ensuring precise localization. 

How Does FairMOT Work?

Parallel to this, the Re‑ID branch projects the same intermediate features into a lower‑dimensional embedding space, generating discriminative feature vectors that capture object appearance.


After producing detection and embedding outputs for the current frame, FairMOT begins its two-stage association process. In the first stage, it propagates each prior tracklet’s state using a Kalman filter to predict its current position. Then, it compares those predictions with the new detections in two ways. It computes appearance affinities as cosine distances between the stored embeddings of each tracklet and the current frame’s Re-ID vectors. At the same time, it calculates motion affinities using the Mahalanobis distance between the Kalman-predicted bounding boxes and the fresh detections. FairMOT fuses these two distance measures into a single cost matrix and solves it using the Hungarian algorithm to link existing tracks to new detections, provided the cost stays below a preset threshold.

Suppose any track remains unassigned after this first pass due to abrupt motion or weak appearance cues. FairMOT invokes a second, IoU‑based matching stage. Here, the spatial overlap (IoU) between the previous frame’s boxes and unmatched detections is evaluated; if the overlap exceeds a lower threshold, the original ID is retained, otherwise a new track ID is issued. This hierarchical matching—first appearance + motion, then pure spatial—allows FairMOT to handle both subtle occlusions and rapid reappearances while keeping computational overhead low (only ~8 ms extra per frame compared to a vanilla detector). The result is a tracker that maintains high MOTA and ID‑F1 on challenging benchmarks, all without the heavy separate embedding network or complex anchor tuning required by many two‑stage methods.

Ultralytics Re-Identification

Before starting with the changes made to this efficient re-identification strategy, we have to understand how the object-level features are retrieved in YOLO and BotSORT.

What is BoT‑SORT?

BoT‑SORT (Robust Associations Multi‑Pedestrian Tracking) was introduced by Aharon et al. in 2022 as a tracking‑by‑detection framework that unifies motion prediction and appearance modeling, along with explicit camera motion compensation, to maintain stable object identities across challenging scenarios. It combines three key innovations: an enhanced Kalman filter state, global motion compensation (GMC), and IoU–Re-ID fusion. BoT‑SORT achieves superior tracking metrics on standard MOT benchmarks.

You can read the research paper from here.


Architecture and Methodology

1. Detection and Feature Extraction

  • Ultralytics YOLOv8’s detection module outputs bounding boxes, confidence scores, and class labels for each object in a frame, which serve as the input to the BoT‑SORT pipeline.

2. BOTrack: Maintaining Object State

  • Each detection spawns a BOTrack instance (subclassing STrack), which adds:
    • Feature smoothing via an exponential moving average over a deque of recent Re-ID embeddings.
    • curr_feat and smooth_feat vectors for appearance matching.
    • An eight-dimensional Kalman filter state (mean, covariance) for precise motion prediction.

This modular design also allows hybrid tracking systems where different tracking logic (e.g., occlusion recovery or reactivation thresholds) can be embedded directly in each object instance.
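
The feature-smoothing idea can be sketched as follows. The smoothing factor and gallery length are assumed defaults, and the class is a simplified stand-in for BOTrack’s curr_feat / smooth_feat handling rather than the actual Ultralytics code:

```python
from collections import deque

import numpy as np


class TrackAppearance:
    """Keeps a smoothed appearance embedding for one track."""

    def __init__(self, feat, alpha=0.9, history=50):
        feat = feat / (np.linalg.norm(feat) + 1e-12)
        self.curr_feat = feat
        self.smooth_feat = feat.copy()
        self.features = deque([feat], maxlen=history)  # bounded gallery of recent embeddings
        self.alpha = alpha

    def update(self, feat):
        feat = feat / (np.linalg.norm(feat) + 1e-12)
        self.curr_feat = feat
        # EMA: keep most of the old appearance, nudge it toward the new crop
        self.smooth_feat = self.alpha * self.smooth_feat + (1.0 - self.alpha) * feat
        self.smooth_feat /= np.linalg.norm(self.smooth_feat) + 1e-12
        self.features.append(feat)
```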

3. BOTSORT: Association Pipeline

  • The BOTSORT class (subclassing BYTETracker) introduces:
    • proximity_thresh and appearance_thresh parameters to gate IoU and embedding distances.
    • An optional Re-ID encoder to extract appearance embeddings if with_reid=True.
    • A Global Motion Compensation (GMC) module to adjust for camera-induced shifts between frames.
  • Distance computation (get_dists) combines IoU distance (matching.iou_distance) with normalized embedding distance (matching.embedding_distance), masking out pairs exceeding thresholds and taking the element‑wise minimum for the final cost matrix.
  • Data association uses the Hungarian algorithm on this cost matrix; unmatched tracks may be reactivated (if appearance matches) or terminated after track_buffer frames.

This dual-threshold approach allows greater flexibility in tuning for specific scenes—e.g., high occlusion (lower appearance threshold), or high motion blur (lower IoU threshold).

4. Global Motion Compensation (GMC)

  • GMC leverages OpenCV’s video stabilization API to compute a homography between consecutive frames, then warps predicted bounding boxes to compensate for camera motion before matching.
  • GMC becomes especially useful in drone or handheld footage where abrupt motion changes could otherwise break tracking continuity.
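
A much-simplified stand-in for this step, using OpenCV feature tracking plus cv2.findHomography, is sketched below. The real GMC module supports several estimation methods and compensates the tracker’s internal state, so treat this purely as an illustration of the idea:

```python
import cv2
import numpy as np


def estimate_homography(prev_gray, curr_gray):
    """Estimate the camera motion between two consecutive grayscale frames."""
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=7)
    if pts_prev is None:
        return np.eye(3)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good_prev = pts_prev[status.flatten() == 1]
    good_curr = pts_curr[status.flatten() == 1]
    if len(good_prev) < 4:
        return np.eye(3)
    H, _ = cv2.findHomography(good_prev, good_curr, cv2.RANSAC, 3.0)
    return H if H is not None else np.eye(3)


def warp_boxes(boxes, H):
    """Warp [x1, y1, x2, y2] corners with H before IoU matching (ignores skew)."""
    corners = np.array([[[x1, y1], [x2, y2]] for x1, y1, x2, y2 in boxes], dtype=np.float32)
    warped = cv2.perspectiveTransform(corners.reshape(-1, 1, 2), H)
    return warped.reshape(-1, 4)
```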

5. Enhanced Kalman Filter

  • Unlike traditional SORT’s 7‑tuple, BoT‑SORT’s Kalman filter uses an 8‑tuple replacing aspect ratio a and scale s with explicit width w and height h, and adapts the process and measurement noise covariances as functions of w and h for more stable predictions.
The state vector and the width/height-scaled noise covariances take the form:

x_k = [x_c, y_c, w, h, ẋ_c, ẏ_c, ẇ, ḣ]ᵀ

Q_k = diag((σ_p ŵ_{k−1})², (σ_p ĥ_{k−1})², (σ_p ŵ_{k−1})², (σ_p ĥ_{k−1})², (σ_v ŵ_{k−1})², (σ_v ĥ_{k−1})², (σ_v ŵ_{k−1})², (σ_v ĥ_{k−1})²)

R_k = diag((σ_m ŵ_k)², (σ_m ĥ_k)², (σ_m ŵ_k)², (σ_m ĥ_k)²)

where σ_p, σ_v, and σ_m are position, velocity, and measurement noise factors, so the filter’s uncertainty scales with the tracked box’s current size.

6. IoU‑Re-ID Fusion

  • The system computes association cost elements by applying two thresholds (IoU and embedding). If either threshold exceeds its limit, the system sets the cost to the maximum; otherwise, it assigns the cost as the minimum of the IoU distance and half the embedding distance, effectively fusing motion and appearance cues.
  • This fusion enables robust matching even when one of the cues (IoU or embedding) becomes unreliable, such as during partial occlusion or uniform clothing among subjects.
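
The fusion rule just described can be written compactly as below; the threshold defaults are assumptions, and the function is a simplified stand-in for the tracker’s internal get_dists logic:

```python
import numpy as np


def fuse_iou_reid(iou_dist, emb_dist, proximity_thresh=0.5, appearance_thresh=0.25):
    """iou_dist = 1 - IoU, emb_dist = cosine distance; both of shape (tracks, detections)."""
    emb = emb_dist / 2.0                    # halve the embedding distance
    emb[emb > appearance_thresh] = 1.0      # appearance too dissimilar -> maximum cost
    emb[iou_dist > proximity_thresh] = 1.0  # boxes too far apart spatially -> maximum cost
    return np.minimum(iou_dist, emb)        # per pair, keep whichever cue is cheaper
```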

The YAML file looks as follows:

```yaml
tracker_type: botsort      # use BoT-SORT
track_high_thresh: 0.25    # confidence threshold for the first association pass
track_low_thresh: 0.10     # confidence threshold for the second association pass
new_track_thresh: 0.25     # confidence threshold for starting new tracks
track_buffer: 30           # frames to wait before deleting lost tracks
match_thresh: 0.80         # matching threshold used during association
```

### CLI Example

```bash
# Run BoT-SORT tracking on a video using the default YAML config
yolo track model=yolov8n.pt tracker=botsort.yaml source=path/to/video.mp4 show=True
```

### Python API Example

```python
from types import SimpleNamespace

from ultralytics import YOLO
from ultralytics.trackers import BOTSORT

# Load a YOLOv8 detection model
model = YOLO("yolov8n.pt")

# BoT-SORT arguments with Re-ID support and global motion compensation (GMC)
args = SimpleNamespace(
    with_reid=True,
    gmc_method="sparseOptFlow",
    proximity_thresh=0.7,
    appearance_thresh=0.5,
    fuse_score=True,
    track_buffer=30,
)

# Standalone tracker instance, useful when wiring up a custom pipeline
tracker = BOTSORT(args, frame_rate=30)

# Perform tracking; the high-level API takes a tracker YAML config here.
# Plugging the custom BOTSORT instance into model.track requires the
# patching steps shown later in this article.
results = model.track(source="path/to/video.mp4", tracker="botsort.yaml", show=True)
```

You can read more about compatible YOLO trackers here.

Efficient Re-Identification in Ultralytics

The system usually performs re-identification by comparing visual similarities between objects using embeddings. A separate model typically generates these embeddings by processing cropped object images. However, this approach adds extra latency to the pipeline. Alternatively, the system can use object-level features directly for re-identification, eliminating the need for a separate embedding model. This change improves efficiency while keeping latency virtually unchanged.

Resource: YOLO in Re-ID Tutorial

Colab Notebook: Link to Colab

Do try running your own videos to see how Re-ID in YOLO works. In the Colab notebook, just replace the path of “occluded.mp4” with your video path 🙂


To see all of the diffs in context and grab the complete botsort.py patch, check out the Link to Colab and this Tutorial. Be sure to review it alongside this guide so you can follow each change step‑by‑step.

Step 1: Patching BoT‑SORT to Accept Features

Changes Made:

  • Method signature updated: update(results, img=None) → update(results, img=None, feats=None) to accept feature arrays.
  • New attribute: self.img_width is set from img.shape[1] for later normalization.
  • Feature slicing: Extracted feats_keep and feats_second based on detection indices.
  • Tracklet initialization: init_track calls now pass the corresponding feature subsets (feats_keep/feats_second) instead of the raw img array.

Step 2: Modifying the Postprocess Callback to Pass Features

Changes Made:

  • Update invocation: tracker.update(det, im0s[i]) → tracker.update(det, result.orig_img, result.feats.cpu().numpy()) so that the feature tensor is forwarded to the tracker.

Step 3: Implementing a Pseudo-Encoder for Features

Changes Made:

  • Dummy Encoder class created with an inference(feat, dets) method that simply returns the provided features.
  • Custom BOTSORTReID subclass of BOTSORT introduced, where:
    • self.encoder is set to the dummy Encoder.
    • self.args.with_reid flag is enabled.
  • Tracker registration: track.TRACKER_MAP["botsort"] is remapped to BOTSORTReID, replacing the default.
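
A sketch of what such a pseudo-encoder and subclass could look like is shown below. The class names follow the tutorial’s intent but are not part of the official Ultralytics release, so refer to the linked notebook for the exact code:

```python
from ultralytics.trackers import track
from ultralytics.trackers.bot_sort import BOTSORT


class PassthroughEncoder:
    """Re-uses detector features as embeddings instead of running a Re-ID network."""

    def inference(self, feat, dets):
        return feat  # the "embeddings" are the object-level features YOLO already produced


class BOTSORTReID(BOTSORT):
    def __init__(self, args, frame_rate=30):
        super().__init__(args, frame_rate=frame_rate)
        self.encoder = PassthroughEncoder()
        self.args.with_reid = True  # enable the Re-ID branch of the association logic


# Route Ultralytics' "botsort" tracker name to the patched class
track.TRACKER_MAP["botsort"] = BOTSORTReID
```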

Step 4: Improving Proximity Matching Logic

Changes Made:

  • Centroid computation: Added an L2-based centroid extractor instead of relying solely on bounding-box IoU.
  • Distance calculation:
    • Compute pairwise L2 distances between track and detection centroids, normalized by self.img_width.
    • Build a proximity mask where L2 distance exceeds proximity_thresh.
  • Cost fusion:
    • Calculate embedding distances via existing matching.embedding_distance.
    • Apply both proximity mask and appearance_thresh to set high costs for distant or dissimilar pairs.
    • The final cost matrix is the element‑wise minimum of the original IoU-based distances and the adjusted embedding distances.
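
Roughly, the reworked distance function could look like the sketch below. Attribute names such as self.img_width come from Step 1, the matching helpers are the ones Ultralytics already ships, and the exact implementation lives in the linked notebook:

```python
import numpy as np

from ultralytics.trackers.utils import matching


def get_dists(self, tracks, detections):
    """Centroid-gated fusion of IoU and embedding distances (sketch)."""
    t_xyxy = np.asarray([t.xyxy for t in tracks]).reshape(-1, 4)
    d_xyxy = np.asarray([d.xyxy for d in detections]).reshape(-1, 4)
    t_c = (t_xyxy[:, :2] + t_xyxy[:, 2:]) / 2.0   # centroids of predicted track boxes
    d_c = (d_xyxy[:, :2] + d_xyxy[:, 2:]) / 2.0   # centroids of new detections
    # pairwise L2 distance between centroids, normalized by image width
    l2 = np.linalg.norm(t_c[:, None, :] - d_c[None, :, :], axis=2) / self.img_width
    far_mask = l2 > self.proximity_thresh         # moved too far across the frame

    iou_dists = matching.iou_distance(tracks, detections)
    emb_dists = matching.embedding_distance(tracks, detections)
    emb_dists[emb_dists > self.appearance_thresh] = 1.0   # too dissimilar in appearance
    emb_dists[far_mask] = 1.0                             # too far away spatially
    return np.minimum(iou_dists, emb_dists)
```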

Step 5: Tuning the Tracker Configuration

Adjust the botsort.yaml parameters for improved occlusion handling and matching tolerance:

  • track_buffer: 300 — extends how long a lost track is kept before deletion.
  • proximity_thresh: 0.2 — allows matching with objects that have moved up to 20% of image width.
  • appearance_thresh: 0.3 — requires at least 70% feature similarity for matching.
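
Collected into the tracker config, the tuned values would look roughly like this (only the changed keys are shown; everything else keeps its default):

```yaml
tracker_type: botsort
track_buffer: 300        # keep lost tracks for ~10 s at 30 FPS
proximity_thresh: 0.2    # centroids may drift up to 20% of the image width
appearance_thresh: 0.3   # cosine distance <= 0.3, i.e. >= 70% feature similarity
```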

Step 6: Initializing and Monkey-Patching the Model

Changes Made:

  • Custom _predict_once is injected into the model to extract and return feature maps alongside detections.
  • Tracker reset: After model.track(embed=embed, persist=True), the existing tracker is reset to clear any stale state.
  • Method overrides:
    • model.predictor.trackers[0].update is bound to the patched update method.
    • model.predictor.trackers[0].get_dists is bound to the new distance calculation logic.

Step 7: Performing Tracking with Re-Identification

Changes Made:

  • Convenience function track_with_reid(img) uses:
    1. get_result_with_features([img]) to generate detection results with features.
    2. model.predictor.run_callbacks(“on_predict_postprocess_end”) to invoke the updated tracking logic.
  • Output: Returns model.predictor.results, now containing both detection and re-identification data.

With these concise modifications, Ultralytics YOLO with BoT‑SORT now natively supports feature-based re-identification without adding a second Re-ID network, achieving robust identity preservation with minimal performance overhead. Feel free to experiment with the thresholds in Step 5 to tailor matching strictness to your application.

Also read: Roboflow’s RF-DETR: Bridging Speed and Accuracy in Object Detection

⚠️ Note: These changes are not part of the official Ultralytics release. They need to be implemented manually to enable efficient re-identification.

Comparison of Results

Here, the water hydrant (id8), the woman near the truck (id67), and the truck (id3) on the left side of the frame have been re-identified accurately.

While some objects are identified correctly (id4, id5, id60), a few police officers in the background received different IDs, possibly due to frame rate limitations.

The ball (id3) and the shooter (id1) are tracked and identified well, but the goalkeeper (id2 -> id8), occluded by the shooter, was given a new ID due to lost visibility.

New Development

A new open‑source toolkit called Trackers is being developed to simplify multi‑object tracking workflows. Trackers will offer:

  • Plug‑and‑play integration with detectors from Transformers, Inference, Ultralytics, PaddlePaddle, MMDetection, and more.
  • Built‑in support for SORT and DeepSORT today, with StrongSORT, BoT‑SORT, ByteTrack, OC‑SORT, and additional trackers on the way.

DeepSORT and SORT are already import-ready in the GitHub repository, and the remaining trackers will be added in subsequent weeks.

Github Link – Roboflow

Conclusion

The comparison section shows that Re-ID in YOLO performs reliably, maintaining object identities across frames. Occasional mismatches stem from occlusions or low frame rates, common in real-time tracking. Adjustable proximity_thresh and appearance_thresh values offer flexibility for varied use cases.

The key advantage is efficiency: leveraging object-level features from YOLO removes the need for a separate Re-ID network, resulting in a lightweight, deployable pipeline.

This approach delivers a robust and practical multi-object tracking solution. Future improvements may include adaptive thresholds, better feature extraction, or temporal smoothing.

Note: These updates aren’t part of the official Ultralytics library yet and must be applied manually, as shown in the shared resources.

Kudos to Yasin, M. (2025) for the insightful tutorial on Tracking with Efficient Re-Identification in Ultralytics (Yasin’s Keep). Check here.
