It was very comprehensive, thanks a lot Soroush
- 0:00: The video discusses the ViTPose paper, which currently leads 2D pose estimation on the MS COCO dataset.
- 0:13: Previous attempts to use Transformers for 2D pose estimation include TransPose and TokenPose.
- 0:26: TransPose uses a CNN backbone to extract local features from the input image and a Transformer encoder to reason about the skeleton keypoints.
- 0:58: TokenPose takes a similar approach but adds extra tokens to represent missing or occluded keypoints.
- 1:33: Another attempt, HRFormer, combines Transformer blocks and convolutional blocks for downsampling and upsampling.
- 2:11: ViTPose simplifies the pipeline by using only Transformers, making the problem easier to handle.
- 2:21: ViTPose uses a Transformer encoder to turn an input image into tokens.
- 3:50: ViTPose offers two decoder options: a classic decoder and a simple decoder.
- 6:15: ViTPose allows multi-dataset training, using a different decoder for each dataset.
- 7:03: The video presents the ViTPose variants, base, large, huge, and gigantic, which differ in the number of layers and the channel size.
- 7:27: The video discusses the simplicity and scalability of ViTPose.
- 8:33: The video discusses the influence of pre-training data on the performance of ViTPose.
- 10:11: The video discusses the influence of input resolution on the performance of ViTPose.
- 11:32: The video discusses the influence of attention type on the performance of ViTPose.
- 14:55: The video discusses the influence of partial fine-tuning on the performance of ViTPose.
- 16:02: The video discusses the influence of multi-dataset training on the performance of ViTPose.
- 16:21: The video discusses the use of knowledge distillation to improve the generalizability of the model.
- 21:12: The video presents the results of ViTPose in comparison with other models for 2D pose estimation on the MS COCO dataset.
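The encoder/decoder pipeline summarized above (2:21 and 3:50) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual configuration: the depth, embedding dimension, and head count are made-up small values, and the "simple decoder" is approximated as bilinear upsampling plus a single convolution.

```python
import torch
import torch.nn as nn

class SimpleViTPose(nn.Module):
    """Minimal sketch of the ViTPose idea: a plain ViT-style encoder turns
    image patches into tokens; a lightweight decoder maps the token grid to
    keypoint heatmaps. Dimensions here are illustrative assumptions."""

    def __init__(self, img_size=(256, 192), patch=16, dim=256, depth=4,
                 heads=8, num_keypoints=17):
        super().__init__()
        # Patch embedding: one strided conv splits the image into tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # "Simple decoder" stand-in: upsample the feature map, then predict
        # one heatmap channel per keypoint.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, num_keypoints, kernel_size=3, padding=1),
        )

    def forward(self, x):
        tokens = self.patch_embed(x)                # (B, dim, H/16, W/16)
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens)
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(feat)                   # (B, K, H/4, W/4)

model = SimpleViTPose()
heatmaps = model(torch.randn(1, 3, 256, 192))
print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])
```

The point of the sketch is the simplicity the video emphasizes: there is no CNN backbone and no convolutional down/upsampling stages inside the encoder, just patchify, attend, and decode.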
Positive Learnings:
- ViTPose simplifies 2D pose estimation by using only Transformers.
- Using a Transformer encoder to create tokens from an input image has proven effective.
- The different variants, base, large, huge, and gigantic, scale up the performance of ViTPose.
- Pre-training data can improve the performance of ViTPose.
- Knowledge distillation can improve the generalizability of the model.
Negative Learnings:
- Previous attempts to use Transformers for 2D pose estimation, such as TransPose and TokenPose, had limitations.
- The use of a CNN backbone in TransPose limits its effectiveness.
- TokenPose's use of extra tokens to represent missing or occluded keypoints is not the most efficient approach.
- HRFormer's combination of Transformer blocks and convolutional blocks for downsampling and upsampling makes it complicated.
- Partial fine-tuning can negatively affect the performance of ViTPose.
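The multi-dataset training mentioned at 6:15 can be pictured as one shared backbone with a separate lightweight decoder per dataset. The sketch below is a hypothetical illustration: the dataset names and keypoint counts are assumptions, and the backbone is a trivial stand-in for the real Transformer encoder.

```python
import torch
import torch.nn as nn

class MultiDatasetPose(nn.Module):
    """Sketch of the multi-dataset idea: a single shared feature extractor,
    plus one small decoder head per dataset (each skeleton may have a
    different number of keypoints)."""

    def __init__(self, backbone, dim=256, keypoints_per_dataset=None):
        super().__init__()
        self.backbone = backbone  # shared encoder producing (B, dim, H, W)
        # One decoder per dataset, selected by name at forward time.
        self.decoders = nn.ModuleDict({
            name: nn.Conv2d(dim, k, kernel_size=3, padding=1)
            for name, k in (keypoints_per_dataset or {}).items()
        })

    def forward(self, x, dataset):
        feat = self.backbone(x)
        return self.decoders[dataset](feat)  # heatmaps for that skeleton

# Stand-in backbone: a single strided conv instead of a real ViT encoder.
backbone = nn.Conv2d(3, 256, kernel_size=16, stride=16)
model = MultiDatasetPose(backbone,
                         keypoints_per_dataset={"coco": 17, "mpii": 16})
out = model(torch.randn(1, 3, 256, 192), "mpii")
print(out.shape)  # torch.Size([1, 16, 16, 12])
```

During training, batches from each dataset would be routed through the matching head while gradients from all datasets update the shared backbone.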
Congratulations, a perfect and neat job
Thank you for your videos! They are a great way to keep up with the state of the art.
Glad you enjoyed it
Hey, I'm wondering how to train ViTPose myself. Have you happened to train it yourself? If so, could you share your experience?
Great job!
Thank you for this interesting video. It would be interesting to see bottom-up pose estimation using Transformers, like ED-Pose. ViTPose is top-down, so (a) inference time increases with the number of people, and (b) it cannot handle overlapping-person scenarios well.
Thanks for the feedback. I didn't know about ED-Pose; I will surely read it soon.
That was great. Thanks.
Thanks for the feedback
nice job
Thanks
How do they detect poses from heatmaps for, say, k people?
nevermind it doesn't detect multiple poses
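For context on the exchange above: a top-down model like ViTPose predicts one set of heatmaps per detected person crop, and each keypoint is typically read off its heatmap by taking the peak location. A minimal sketch of that decoding step (my own illustration, not code from the paper):

```python
import torch

def heatmaps_to_keypoints(heatmaps):
    """Decode (B, K, H, W) heatmaps into (x, y) coordinates per keypoint by
    taking each channel's argmax. Top-down pipelines run this once per
    detected person crop, so k people means k forward passes."""
    b, k, h, w = heatmaps.shape
    flat = heatmaps.view(b, k, -1)
    conf, idx = flat.max(dim=-1)                       # peak value + flat index
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    return torch.stack([xs, ys], dim=-1), conf         # (B, K, 2), (B, K)

# Toy example: one 17-keypoint heatmap stack with a single planted peak.
hm = torch.zeros(1, 17, 64, 48)
hm[0, 0, 10, 20] = 1.0  # keypoint 0 peaks at x=20, y=10
coords, conf = heatmaps_to_keypoints(hm)
print(coords[0, 0])  # tensor([20., 10.])
```

Real implementations usually refine the integer argmax with sub-pixel offsets and map the crop coordinates back into the full image, but the core idea is this per-channel peak search.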
Please give me the code.