# Pedestrian Tracking using YOLOv4 and DeepSORT

## Introduction

Object detection in still images has received a lot of attention from the scientific community. A lesser-known area with equally widespread applications is tracking objects in a video or a real-time video stream. Tracking requires us to combine what we know about detecting objects in images with the analysis of temporal information, and to use both to predict movement trajectories.

This project is inspired by the recent *YOLOv4* and *DeepSORT* papers. It consists of 4 parts: Detection, Estimation, Association, and Track Handling.

## Kalman Filtering

Kalman filtering is an algorithm that estimates unknown variables from measurements observed over time. For linear models with additive Gaussian noise, the Kalman filter provides optimal estimates. The process model defines the evolution of the state from time $k-1$ to time $k$ as:

$$
x_k = F x_{k-1} + B u_{k-1} + w_{k-1}
$$

where $F$ is the state transition matrix applied to the previous state vector $x_{k-1}$, $B$ is the control-input matrix applied to the control vector $u_{k-1}$, and $w_{k-1}$ is the process noise vector, which is assumed to be zero-mean Gaussian with covariance $Q$.

The process model is paired with the measurement model that describes the relationship between the state and the measurement at the current time step k as:

$$
z_k = H x_k + \nu_k
$$

where $z_k$ is the measurement vector, $H$ is the measurement matrix, and $\nu_k$ is the measurement noise vector, which is assumed to be zero-mean Gaussian with covariance $R$.

The role of the Kalman filter is to provide an estimate of $x_k$ at time $k$, given the initial estimate of $x_0$, the series of measurements $z_1, z_2, \ldots, z_k$, and the information about the system described by $F$, $B$, $H$, $Q$, and $R$.

As described in the DeepSORT paper, the tracking scenario is defined on the eight-dimensional state space $(u, v, a, h, \dot{u}, \dot{v}, \dot{a}, \dot{h})$, which contains the bounding box center position $(u, v)$, aspect ratio $a$, height $h$, and their respective velocities. A Kalman filter with a constant-velocity motion (process) model and a linear observation model is used, where the bounding box coordinates $(u, v, a, h)$ are taken as direct observations of the object state.
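To make the predict and update steps concrete, here is a minimal NumPy sketch of a constant-velocity Kalman filter on the eight-dimensional state. It is an illustration under stated assumptions, not the code from the paper or from this project: the function names, the time step `dt`, and the values chosen for `Q`, `R`, and the initial covariance `P` are all made up for the example, and the control term is dropped ($B u = 0$).

```python
import numpy as np

def make_constant_velocity_model(dt=1.0):
    """Build F and H for the state (u, v, a, h, u_dot, v_dot, a_dot, h_dot)."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)  # each position component += velocity * dt
    H = np.eye(4, 8)            # only (u, v, a, h) are observed
    return F, H

def predict(x, P, F, Q):
    """Propagate the mean and covariance one time step (no control input)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def update(x_pred, P_pred, z, H, R):
    """Correct the prediction with the measurement z."""
    y = z - H @ x_pred                    # innovation
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Illustrative values only (assumptions, not tuned for real data).
F, H = make_constant_velocity_model()
Q = 0.01 * np.eye(8)
R = 0.1 * np.eye(4)
x = np.array([100.0, 50.0, 0.5, 80.0, 2.0, 0.0, 0.0, 0.0])
P = np.eye(8)

x_pred, P_pred = predict(x, P, F, Q)            # box center moves by u_dot
z = np.array([103.0, 50.0, 0.5, 80.0])          # a detection as (u, v, a, h)
x_new, P_new = update(x_pred, P_pred, z, H, R)  # estimate pulled toward z
```

In this sketch the predicted center `x_pred[0]` is 102, and the updated estimate lands between the prediction and the measurement, with the exact position depending on the assumed `Q` and `R`.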

## Track Handling

- Every existing track has a counter for the number of consecutive frames in which no measurement has been associated with it. Each time a Kalman filter prediction is made for a new frame, this counter is increased by 1. It is reset to 0 only when a measurement is associated with the track.
- Tracks whose counter exceeds a predefined maximum age are considered to have left the scene and are deleted from the set of tracks.
- For detections that are not assigned to any track, new track hypotheses are created. These are considered tentative and are confirmed only if they are associated with a detection in each of their first few frames (say, the first 3). Otherwise, they are deleted as well.
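The lifecycle rules above can be sketched as a small state machine. This is a simplified illustration, not this project's implementation: the class, its field names, and the default `max_age=30` are assumptions (the text only says the maximum age is predefined), and the detection that creates a track is counted as its first hit.

```python
from dataclasses import dataclass

TENTATIVE, CONFIRMED, DELETED = "tentative", "confirmed", "deleted"

@dataclass
class Track:
    track_id: int
    n_init: int = 3             # hits needed before a track is confirmed
    max_age: int = 30           # assumed value for the "predefined maximum age"
    hits: int = 1               # the creating detection counts as the first hit
    time_since_update: int = 0  # frames since the last association
    state: str = TENTATIVE

    def predict(self):
        """Called once per frame, together with the Kalman prediction."""
        self.time_since_update += 1

    def mark_matched(self):
        """A detection was associated with this track in the current frame."""
        self.hits += 1
        self.time_since_update = 0
        if self.state == TENTATIVE and self.hits >= self.n_init:
            self.state = CONFIRMED

    def mark_missed(self):
        """No detection was associated with this track in the current frame."""
        if self.state == TENTATIVE:
            self.state = DELETED  # tentative tracks must match every frame
        elif self.time_since_update > self.max_age:
            self.state = DELETED  # the track has left the scene

# A track created in frame 1 and matched in frames 2 and 3 becomes confirmed.
track = Track(track_id=1)
for _ in range(2):
    track.predict()
    track.mark_matched()
```

A confirmed track then survives up to `max_age` consecutive missed frames, while a tentative track is dropped at its first miss.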

## Association

For every frame, we have detections from YOLOv4 that have gone through standard non-maximum suppression, and we have the existing tracks with their Kalman filter predictions. The task is to associate the two using the Hungarian algorithm, which solves the linear assignment problem. The association cost used by the algorithm is computed from 2 metrics: the Mahalanobis distance between the predicted Kalman states and the newly arrived detections, and the smallest cosine distance between a detection's appearance descriptor and the appearance descriptors stored for a track.
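As a rough sketch of the assignment step, SciPy's `linear_sum_assignment` solves the same linear assignment problem on a cost matrix whose rows are tracks and whose columns are detections. The helper name `associate`, the `max_cost` threshold for rejecting poor matches, and the numbers in the cost matrix are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost, max_cost):
    """Match tracks (rows) to detections (columns) at minimum total cost."""
    rows, cols = linear_sum_assignment(cost)
    # Keep only the pairs whose cost is acceptable.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if cost[r, c] <= max_cost]
    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_tracks = [r for r in range(cost.shape[0])
                        if r not in matched_rows]
    unmatched_detections = [c for c in range(cost.shape[1])
                            if c not in matched_cols]
    return matches, unmatched_tracks, unmatched_detections

# Two tracks, three detections (made-up costs).
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
])
matches, unmatched_tracks, unmatched_detections = associate(cost, max_cost=0.5)
# matches == [(0, 0), (1, 1)]; detection 2 is left unmatched
```

The unmatched detections are the ones that would start new tentative tracks, and the unmatched tracks are the ones whose missed-frame counter keeps growing.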

*Mahalanobis distance:*

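$$
d^{(1)}(i,j) = (d_j - y_i)^{\top} S_i^{-1} (d_j - y_i)
$$

where $d_j$ is the $j$th detection, and $y_i$ and $S_i$ are the mean and covariance matrix of the $i$th track distribution. This is the squared form of the distance.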
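A minimal NumPy sketch of the squared Mahalanobis distance between a detection `d_j` and a track with mean `y_i` and covariance `S_i` might look like this. The function name and the example numbers are assumptions; `np.linalg.solve` is used instead of an explicit matrix inverse.

```python
import numpy as np

def squared_mahalanobis(d_j, y_i, S_i):
    """Squared Mahalanobis distance of detection d_j from track (y_i, S_i)."""
    diff = d_j - y_i
    # Solve S_i @ v = diff rather than forming the inverse of S_i.
    return float(diff @ np.linalg.solve(S_i, diff))

# Made-up track mean and covariance in (u, v, a, h) measurement space.
y_i = np.array([100.0, 50.0, 0.5, 80.0])
S_i = np.diag([4.0, 4.0, 0.01, 9.0])
d_j = np.array([102.0, 50.0, 0.5, 83.0])

dist = squared_mahalanobis(d_j, y_i, S_i)
# 2**2 / 4 + 3**2 / 9 = 2.0
```

Because each offset is divided by the track's own uncertainty, the same pixel offset costs more for a track with a tight covariance than for an uncertain one.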

*Cosine distance:*

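$$
d^{(2)}(i,j) = \min \left\{ 1 - r_j^{\top} r_k^{(i)} \;\middle|\; r_k^{(i)} \in \mathcal{R}_i \right\}
$$

where $r_j$ is the appearance descriptor of the $j$th detection, $r_k^{(i)}$ is the appearance descriptor of the $k$th member of the $i$th track, and $\mathcal{R}_i$ is the set of descriptors stored for that track. The descriptors are assumed to have unit length.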
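The smallest cosine distance between one detection descriptor and the descriptors stored for a track can be sketched as follows. The function name and the `gallery` argument (one stored descriptor per row) are assumptions, and the sketch normalizes its inputs so it also works for descriptors that are not already unit length. Two-dimensional vectors stand in for real descriptors to keep the example readable.

```python
import numpy as np

def min_cosine_distance(r_j, gallery):
    """Smallest cosine distance between r_j and any row of gallery."""
    r_j = r_j / np.linalg.norm(r_j)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    # For unit vectors, cosine similarity is just the dot product.
    return float(np.min(1.0 - gallery @ r_j))

# Two stored descriptors for one track (toy values).
gallery = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
])

same = min_cosine_distance(np.array([2.0, 0.0]), gallery)     # 0.0
between = min_cosine_distance(np.array([1.0, 1.0]), gallery)  # 1 - 1/sqrt(2)
```

Taking the minimum over the stored descriptors means a detection only has to resemble one past appearance of the track to get a low cost.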

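Both metrics are combined into a final weighted sum:

$$
c_{i,j} = \lambda \, d^{(1)}(i,j) + (1 - \lambda) \, d^{(2)}(i,j)
$$

where $d^{(1)}(i,j)$ is the Mahalanobis distance, $d^{(2)}(i,j)$ is the cosine distance, and $\lambda \in [0, 1]$ controls the influence of each metric.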
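As a small illustration, the weighted sum can be computed for all track-detection pairs at once by combining the two distance matrices. The function name, the weight name `lam`, and the example values are assumptions.

```python
import numpy as np

def combined_cost(d_motion, d_appearance, lam):
    """Weighted sum of the Mahalanobis and cosine distance matrices."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * d_motion + (1.0 - lam) * d_appearance

# One track, two detections (made-up distances).
d_motion = np.array([[2.0, 10.0]])     # Mahalanobis distances
d_appearance = np.array([[0.1, 0.6]])  # cosine distances

cost = combined_cost(d_motion, d_appearance, lam=0.5)
# [[1.05, 5.3]]
```

With `lam=0` only appearance is used, and with `lam=1` only motion is used. The resulting matrix is what the assignment step takes as its cost.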

The purpose of adding the appearance descriptor is to keep a temporary occlusion from ending an ongoing track.

The authors first built a classifier over the dataset, trained it until it achieved reasonably good accuracy, and then stripped the final classification layer. Assuming a classical architecture, we are left with a dense layer that produces a single feature vector, waiting to be classified. That feature vector becomes our “appearance descriptor” of the object.

The “Dense 10” layer shown in the figure above will be our appearance feature vector for the given crop. Once the network is trained, we just need to pass the crop of each detected bounding box from the image through it to obtain the 128 × 1 dimensional feature vector.

## Conclusion

Thank you for your interest in this blog. Please leave comments, feedback, and suggestions if you have any.

Full code on my GitHub repo *here*.