ViTPose: 2D Human Pose Estimation

  • Published on Oct 13, 2024

Comments •

  • @amirhosseinmohammadi4731 · 2 months ago

    It was very comprehensive, thanks a lot Soroush

  • @wolpumba4099 · 1 year ago

    - 0:00: The video discusses the ViTPose paper, which currently leads 2D pose estimation on the MS COCO dataset.
    - 0:13: Previous attempts to use Transformers for 2D pose estimation include TransPose and TokenPose.
    - 0:26: TransPose uses a CNN backbone to extract local information from the input image and a Transformer encoder to understand the skeleton keypoints in the image.
    - 0:58: TokenPose uses a similar approach but adds randomly initialized tokens to represent missing or occluded keypoints.
    - 1:33: Another attempt, HRFormer, combines Transformer blocks and convolutional blocks for downsampling and upsampling.
    - 2:11: ViTPose simplifies the pipeline by using only Transformers, making the problem easier to deal with.
    - 2:21: ViTPose uses a Transformer encoder to create tokens from an input image.
    - 3:50: ViTPose has two decoder options: a classic decoder and a simple decoder.
    - 6:15: ViTPose allows multi-dataset training, enabling a different decoder per dataset.
    - 7:03: The video presents the ViTPose variants (base, large, huge, and gigantic), which differ in the number of layers and channel size.
    - 7:27: The video discusses the simplicity and scalability of ViTPose.
    - 8:33: The video discusses the influence of pre-training data on the performance of ViTPose.
    - 10:11: The video discusses the influence of input resolution on the performance of ViTPose.
    - 11:32: The video discusses the influence of attention type on the performance of ViTPose.
    - 14:55: The video discusses the influence of partial fine-tuning on the performance of ViTPose.
    - 16:02: The video discusses the influence of multi-dataset training on the performance of ViTPose.
    - 16:21: The video discusses the use of knowledge distillation to improve the generalizability of the model.
    - 21:12: The video presents the results of ViTPose compared with other models for 2D pose estimation on the MS COCO dataset.
    Positive Learnings:
    - ViTPose simplifies 2D pose estimation by using only Transformers.
    - Using a Transformer encoder to create tokens from an input image has proven effective.
    - The variants (base, large, huge, and gigantic) let performance scale with model size.
    - Pre-training data can improve the performance of ViTPose.
    - Knowledge distillation can improve the generalizability of the model.
    Negative Learnings:
    - Previous Transformer-based attempts at 2D pose estimation, such as TransPose and TokenPose, had limitations.
    - The CNN backbone in TransPose limits its effectiveness.
    - TokenPose's randomly initialized tokens for missing or occluded keypoints are not the most efficient approach.
    - HRFormer's combination of Transformer and convolutional blocks for downsampling and upsampling makes it complicated.
    - Partial fine-tuning can negatively affect the performance of ViTPose.
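
    The encoder-plus-simple-decoder pipeline summarized above (image → patch tokens → upsampled feature map → one heatmap per keypoint) can be sketched at the shape level in NumPy. This is a hypothetical illustration, not the official ViTPose code: `patchify` and `simple_decoder` are made-up names, the weights are random stand-ins, and nearest-neighbour upsampling stands in for the decoder's learned upsampling; the 256×192 input, 16×16 patches, and 17 keypoints do match a common COCO setup.

    ```python
    import numpy as np

    def patchify(img, patch=16):
        """Split an image (H, W, C) into flattened patch tokens (N, patch*patch*C)."""
        H, W, C = img.shape
        gh, gw = H // patch, W // patch
        tokens = (img.reshape(gh, patch, gw, patch, C)
                     .transpose(0, 2, 1, 3, 4)
                     .reshape(gh * gw, patch * patch * C))
        return tokens, (gh, gw)

    def simple_decoder(tokens, grid, num_keypoints=17, upsample=4):
        """Reshape tokens back into a 2D feature map, upsample it, then apply
        a per-pixel linear map (a 1x1-conv stand-in) to get one heatmap per
        keypoint. In ViTPose the upsampling and projection are learned."""
        gh, gw = grid
        dim = tokens.shape[1]
        fmap = tokens.reshape(gh, gw, dim)
        fmap = fmap.repeat(upsample, axis=0).repeat(upsample, axis=1)
        rng = np.random.default_rng(0)
        w = rng.standard_normal((dim, num_keypoints)) * 0.01  # random stand-in weights
        return fmap @ w  # (gh*upsample, gw*upsample, num_keypoints)

    img = np.zeros((256, 192, 3), dtype=np.float32)   # one person crop
    tokens, grid = patchify(img)                      # 16x12 grid -> 192 tokens
    heatmaps = simple_decoder(tokens, grid)
    print(tokens.shape, heatmaps.shape)               # (192, 768) (64, 48, 17)
    ```

    The point of the sketch is the shape bookkeeping: a 256×192 crop yields a 16×12 token grid, and the decoder only has to map tokens back to a 64×48×17 heatmap stack, which is why the "simple decoder" can be so small.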

  • @mjalali3109 · 1 year ago

    Congratulations, a perfect and neat job

  • @francisferri2732 · 1 year ago

    Thank you for your videos! They are a great way to keep up with the state of the art.

  • @mrraptorious8090 · 6 months ago

    Hey, I'm wondering how to train ViTPose myself. Did you happen to train it yourself? If so, could you share your experience?

  • @rohollahhosseyni8564 · 1 year ago

    Great job!

  • @nikhilchhabra · 1 year ago

    Thank you for this interesting video. It would be interesting to see bottom-up pose estimation using Transformers, like ED-Pose. ViTPose is top-down, so (a) inference time increases with the number of people, and (b) it cannot handle overlapping-human scenarios.

    • @soroushmehraban · 1 year ago

      Thanks for the feedback. I didn't know about ED-Pose. I will surely read it soon.

  • @Fateme_Pourghasem · 1 year ago

    That was great. Thanks.

  • @alihadimoghadam8931 · 1 year ago

    nice job

  • @shklbor · 1 month ago

    How do they detect poses from the heatmaps for, say, k people?

    • @shklbor · 1 month ago · +1

      Never mind, it doesn't detect multiple poses.
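
      To answer the question above: for a single person crop, each keypoint is typically read off its heatmap channel by taking the argmax (the peak value serving as a confidence score); since ViTPose is top-down, k people means running k crops through the model, one heatmap stack each. A hedged NumPy sketch, where `heatmaps_to_keypoints` and the toy data are illustrative and not from the paper's code:

      ```python
      import numpy as np

      def heatmaps_to_keypoints(heatmaps):
          """Decode a (K, H, W) heatmap stack: per-channel argmax gives each
          keypoint's (x, y) pixel location; the peak value is its score."""
          K, H, W = heatmaps.shape
          flat = heatmaps.reshape(K, -1)
          idx = flat.argmax(axis=1)
          ys, xs = np.unravel_index(idx, (H, W))
          scores = flat.max(axis=1)
          return np.stack([xs, ys], axis=1), scores

      # toy example: 17 channels, with a peak for keypoint 0 at (x=10, y=5)
      hm = np.zeros((17, 64, 48))
      hm[0, 5, 10] = 1.0
      coords, scores = heatmaps_to_keypoints(hm)
      print(coords[0], scores[0])  # [10  5] 1.0
      ```

      Real decoders usually refine the integer argmax with sub-pixel adjustment and map coordinates back from the crop to the full image, but the per-channel argmax is the core idea.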

  • @ngtiens_dat · 11 days ago

    Please give me the code.