This assignment had two objectives:
- Deploy a pre-trained model to predict the Monocular Human Pose in an image.
- Read this paper and write a summary of the model architecture and the loss function used.
To run the pose estimation script, follow the steps below:
- Install requirements:
  pip install -r requirements.txt
- Download the pre-trained pose estimation model (pose_resnet_50_256x256.pth.tar) from here and move it to the models directory.
- Run the pose estimation script:
  python3 pose_estimation.py --image <path to image>
To convert the model to ONNX, add the --convert flag; the script then saves the model in the ONNX format.
python3 pose_estimation.py --image <path to image> --convert
The model deployed on AWS Lambda gave the following results
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
Pose Tracking is the task of estimating multi-person human poses in videos and assigning unique instance IDs for each keypoint across frames. Accurate estimation of human keypoint-trajectories is useful for human action recognition, human interaction understanding, motion capture and animation.
Human pose estimation is a challenging task, and it has come a long way, from about 80% PCKh@0.5 to more than 90%. The leading methods on the MPII benchmark differ considerably in their details but only slightly in accuracy, which makes it difficult to tell which details are crucial. The above paper proposes a simple paradigm for the problem: it provides baseline methods for both pose estimation and tracking that are quite simple but surprisingly effective.
A ResNet backbone with three deconvolutional layers added over its last convolution stage, called C5, is used for pose estimation. This structure was chosen because it is the simplest way to generate heatmaps from deep, low-resolution features.
- Deconvolutional layers with batch normalization and ReLU activation are used.
- Each layer has 256 filters with a 4×4 kernel.
- The stride is 2.
- A 1×1 convolutional layer is added at the end to generate the predicted heatmaps for all k keypoints.
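As a rough check on the head described above, the sketch below computes the spatial size of the feature map after each deconvolutional layer, using the standard output-size formula for a transposed convolution. The kernel size and stride come from the list above. The 256×256 input (suggested by the checkpoint name), the overall stride of 32 at C5, the padding of 1 and the output padding of 0 are assumptions, not values stated in this summary.

```python
def deconv_out_size(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output size of a transposed convolution with dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def head_sizes(input_size=256, backbone_stride=32, num_deconv=3):
    """Feature-map side length at C5 and after each deconvolutional layer."""
    sizes = [input_size // backbone_stride]  # assumed: C5 downsamples by 32
    for _ in range(num_deconv):
        sizes.append(deconv_out_size(sizes[-1]))
    return sizes


print(head_sizes())  # [8, 16, 32, 64]
```

Under these assumptions each layer doubles the resolution, so the head ends at 64×64, which matches the heatmap shape shown in the loss code comments further down.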
The idea behind multi-person pose tracking is to assign a unique identification number (id) to each estimated human pose in a frame and then track these poses across the other frames.
$I^k$: the $k$-th frame
$P = (J, id)$: a human instance, where $J = \{j_i\}_{1:N_J}$ is the coordinate set of $N_J$ body joints and $id$ is the tracking id.
When processing frame $I^k$, we have the already-processed human instance set $\mathcal{P}^{k-1} = \{P_i^{k-1}\}_{1:N^{k-1}}$ in frame $I^{k-1}$ and the instance set $\mathcal{P}^k = \{P_i^k\}_{1:N^k}$ in frame $I^k$, whose ids are still to be assigned. Here $N^{k-1}$ and $N^k$ are the numbers of instances in frames $I^{k-1}$ and $I^k$. If an instance $P_j^k$ in the current frame $I^k$ is linked to an instance $P_i^{k-1}$ in frame $I^{k-1}$, then $id_i^{k-1}$ is propagated to $id_j^k$; otherwise a new id is assigned to $P_j^k$, indicating a new track.
In simple words, if a human pose detected in the current frame matches a human pose already detected in the previous frame, the tracking id allotted to the earlier pose is passed on to the pose in the current frame; if not, a new tracking id is assigned to it.
To achieve this, the authors introduced the following ideas, which set their approach apart from other state-of-the-art methods.
They found that applying a detector designed for single images to video can lead to missed and false detections, because a video frame can be blurred or contain occlusions. To address this, they proposed generating boxes for the frame being processed from nearby frames, using the temporal information expressed in optical flow.
They proposed that the joint coordinates of a human instance in a frame can be estimated from the previous frame. More specifically, for each joint location $(x, y)$ in $J_i^{k-1}$, the propagated joint location is $(x + \delta x, y + \delta y)$, where $\delta x$ and $\delta y$ are the flow field values at joint location $(x, y)$.
From the propagated joints, a new bounding box is computed for the human pose in the current frame. That box is expanded to some extent (15% in experiments) and used as the candidate box for pose estimation.
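The propagation step above can be sketched in plain Python as follows. The function names, the `flow[y][x] -> (dx, dy)` layout and the way the 15% is split evenly between the two sides of the box are assumptions made for illustration; the summary does not specify them.

```python
def propagate_joints(joints, flow):
    """Shift each (x, y) joint by the flow vector sampled at that joint.

    `flow[y][x]` is assumed to hold (dx, dy) for pixel (x, y); joints are
    assumed to lie inside the flow field.
    """
    propagated = []
    for x, y in joints:
        dx, dy = flow[int(round(y))][int(round(x))]
        propagated.append((x + dx, y + dy))
    return propagated


def expanded_box(joints, expand=0.15):
    """Tight box around the joints, enlarged by `expand` of its width and height."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    pad_x = (x1 - x0) * expand / 2  # assumed: half of the growth on each side
    pad_y = (y1 - y0) * expand / 2
    return (x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y)


# Toy example: a uniform flow field moving everything by (2, 1) pixels.
flow = [[(2.0, 1.0)] * 20 for _ in range(20)]
moved = propagate_joints([(0, 0), (10, 10)], flow)
candidate_box = expanded_box(moved)
```

The resulting `candidate_box` plays the role of the candidate box that is passed on to pose estimation.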
Measuring the similarity between instances in different frames is a major challenge in this use case.
- Bounding-box IoU (Intersection-over-Union) is problematic as a similarity metric when an instance moves fast, so that its boxes do not overlap across frames, and in crowded scenes, where boxes may not correspond one-to-one with instances.
- Pose similarity can also be problematic, because the pose of the same person may change across frames.
They proposed a flow-based pose similarity metric: the body joints of one instance are first propagated to the other frame using optical flow, and the Object Keypoint Similarity (OKS), which measures the distance between the body joints of two human instances, is then computed:
$S_{Flow}(J_i^k, J_j^l) = OKS(\hat{J}_i^l, J_j^l)$
where OKS denotes the Object Keypoint Similarity between two human poses, and $\hat{J}_i^l$ denotes the joints $J_i^k$ propagated from frame $I^k$ to frame $I^l$ using optical flow.
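A minimal sketch of the flow-based similarity described above is given below. It uses the COCO-style form of OKS, $\exp(-d_i^2 / (2 s^2 k_i^2))$ averaged over joints, and treats every joint as labelled. The function names, the per-joint flow list and the example constants are assumptions for illustration, not the authors' implementation.

```python
import math


def oks(joints_a, joints_b, scale, kappas):
    """Object Keypoint Similarity between two poses with the same joint order.

    scale:  object scale s (COCO uses the square root of the object area).
    kappas: per-joint constants k_i that control how fast similarity decays.
    Visibility flags are omitted: every joint is treated as labelled.
    """
    total = 0.0
    for (xa, ya), (xb, yb), k in zip(joints_a, joints_b, kappas):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2 * scale ** 2 * k ** 2))
    return total / len(joints_a)


def flow_similarity(joints_k, joints_l, flow_k_to_l, scale, kappas):
    """OKS between joints_k propagated by flow and the joints in the other frame.

    `flow_k_to_l` is assumed to be one (dx, dy) vector per joint.
    """
    propagated = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(joints_k, flow_k_to_l)]
    return oks(propagated, joints_l, scale, kappas)


pose_k = [(10.0, 10.0), (20.0, 15.0)]
pose_l = [(11.0, 10.0), (21.0, 15.0)]   # same person, moved one pixel right
flow = [(1.0, 0.0), (1.0, 0.0)]         # flow explains the motion exactly
similarity = flow_similarity(pose_k, pose_l, flow, scale=50.0, kappas=[0.1, 0.1])
```

In this toy case the flow accounts for all of the motion, so the propagated joints coincide with `pose_l` and the similarity reaches its maximum of 1, whereas plain OKS on the unpropagated joints would be lower.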
This algorithm combines the two ideas above: joint propagation using optical flow and flow-based pose similarity.
- Estimation: The input frame is processed to produce bounding boxes from
  - the human detector
  - joint propagation from the previous frames using optical flow
These boxes are combined using a Non-Maximum Suppression (NMS) operation. The resulting boxes are then cropped and resized, and the human pose in each is estimated using the proposed pose estimation network (1.c).
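The merging step can be illustrated with a standard greedy NMS over the union of both box sources. This is a generic sketch, not the authors' code: the box format `(x0, y0, x1, y1)`, the scores given to propagated boxes and the IoU threshold of 0.5 are all assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


detector_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
propagated_boxes = [(1, 1, 11, 11), (80, 80, 90, 90)]  # second one was missed by the detector
boxes = detector_boxes + propagated_boxes
scores = [0.9, 0.8, 0.7, 0.6]                           # assumed confidence values
kept = nms(boxes, scores)
print(kept)  # [0, 1, 3]
```

Here the propagated box that duplicates a detection is suppressed, while the propagated box with no matching detection survives and fills in for the missed detection.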
- Tracking
The instances are tracked in a double-ended queue with fixed length $L_Q$, denoted as
$Q = [\mathcal{P}^{k-1}, \mathcal{P}^{k-2}, \ldots, \mathcal{P}^{k-L_Q}]$
where $\mathcal{P}^{k-i}$ is the set of tracked instances in the previous frame $I^{k-i}$, and the queue length $L_Q$ indicates how many previous frames are considered when performing matching.
- For frame $I^k$, the flow-based pose similarity matrix $M_{sim}$ between the current untracked set of body-joint instances $\mathcal{J}^k$ (id is none) and the previous instance sets in $Q$ is calculated.
- An id is assigned to each body-joint instance $J$ in $\mathcal{J}^k$ by greedy matching on $M_{sim}$, which gives the instance set $\mathcal{P}^k$.
- Finally, $Q$ is updated by adding the $k$-th frame's instance set $\mathcal{P}^k$.
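The tracking loop above can be sketched with `collections.deque`. This is a simplified illustration: the `similarity` callback stands in for the flow-based pose similarity, the toy distance-based similarity and the `min_sim` threshold below which a new track is started are assumptions, and all names are hypothetical.

```python
from collections import deque
from itertools import count


def assign_ids(poses, queue, similarity, next_id, min_sim=0.5):
    """Greedily match current poses to tracked instances in `queue`, then update it.

    queue: deque of frames, each a list of (joints, id) pairs, newest first.
    """
    # Every (current pose, tracked instance) pair, most similar first.
    candidates = sorted(
        (
            (similarity(pose, joints), i, pid)
            for i, pose in enumerate(poses)
            for frame in queue
            for joints, pid in frame
        ),
        key=lambda c: c[0],
        reverse=True,
    )
    ids, used = {}, set()
    for score, i, pid in candidates:
        if score >= min_sim and i not in ids and pid not in used:
            ids[i] = pid          # propagate the existing id
            used.add(pid)
    # Unmatched poses start a new track.
    instances = [(pose, ids[i] if i in ids else next(next_id)) for i, pose in enumerate(poses)]
    queue.appendleft(instances)   # with maxlen set, the oldest frame is dropped
    return instances


def toy_similarity(a, b):
    """Stand-in for the flow-based similarity: decays with total joint distance."""
    dist = sum(abs(xa - xb) + abs(ya - yb) for (xa, ya), (xb, yb) in zip(a, b))
    return 1.0 / (1.0 + dist)


queue = deque(maxlen=3)           # L_Q = 3 previous frames
next_id = count()
frame1 = assign_ids([[(0, 0)], [(10, 10)]], queue, toy_similarity, next_id)
frame2 = assign_ids([[(10.5, 10)], [(0, 0.5)], [(50, 50)]], queue, toy_similarity, next_id)
print([pid for _, pid in frame2])  # [1, 0, 2]
```

Matching on ids rather than on queue positions keeps one id from being handed to two poses when the same person appears in several queued frames.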
JointsMSELoss is the mean, taken over all joints (body points), of the MSE loss between the predicted heatmap and the target heatmap of each joint.

    # Intuition: JointsMSELoss
    from torch.nn.functional import mse_loss

    loss = 0
    for idx in range(num_joints):           # for each joint
        heatmap_pred = heatmaps_pred[idx]   # predicted heatmap
        heatmap_gt = heatmaps_gt[idx]       # target heatmap
        # MSE between the predicted and target heatmap of this joint, summed over joints
        loss += 0.5 * mse_loss(heatmap_pred, heatmap_gt)
    joints_mse_loss = loss / num_joints     # average loss over all joints

The target heatmap for each joint is generated by applying a 2D Gaussian centered on the joint's ground-truth location.
target_weight can be used to give different weightage to different joints while calculating the final JointsMSELoss.
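To make the target heatmap concrete, the sketch below builds one in plain Python as an unnormalized 2D Gaussian with peak value 1 at the joint location. The 64×64 size follows the shapes in the loss code comments; the value of `sigma` and the function name are assumptions.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Heatmap with an unnormalized 2D Gaussian centered on joint (cx, cy)."""
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Target for a joint located at column 20, row 30 of a 64x64 heatmap.
target = gaussian_heatmap(64, 64, cx=20, cy=30)
peak = max(max(row) for row in target)
```

The peak sits at `target[30][20]`, and values fall off smoothly with distance, which gives the MSE loss a useful gradient even when the prediction is a few pixels off.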
Original JointsMSELoss code:

    import torch.nn as nn

    class JointsMSELoss(nn.Module):
        def __init__(self, use_target_weight):
            super(JointsMSELoss, self).__init__()
            # size_average is deprecated in recent PyTorch; it is equivalent to reduction='mean'
            self.criterion = nn.MSELoss(size_average=True)
            self.use_target_weight = use_target_weight

        def forward(self, output, target, target_weight):  # e.g. output.shape = torch.Size([8, 16, 64, 64])
            batch_size = output.size(0)  # 8
            num_joints = output.size(1)  # 16
            # Tuple of 16 tensors, each of shape torch.Size([8, 1, 4096])
            heatmaps_pred = output.reshape((batch_size, num_joints, -1)).split(1, 1)
            heatmaps_gt = target.reshape((batch_size, num_joints, -1)).split(1, 1)
            loss = 0
            # Calculate the MSE loss between prediction and ground truth for each joint
            for idx in range(num_joints):
                heatmap_pred = heatmaps_pred[idx].squeeze()  # torch.Size([8, 4096])
                heatmap_gt = heatmaps_gt[idx].squeeze()
                if self.use_target_weight:
                    loss += 0.5 * self.criterion(
                        heatmap_pred.mul(target_weight[:, idx]),
                        heatmap_gt.mul(target_weight[:, idx])
                    )
                else:
                    loss += 0.5 * self.criterion(heatmap_pred, heatmap_gt)  # sum the per-joint losses
            return loss / num_joints  # average loss over joints




