Very intuitive. The same video was uploaded yesterday too, but that one had fast audio. This one looks fine and plays normally. Thanks for the update, if this was the update.
Yeah some trouble with the upload, thanks for this!
9:36 The million-dollar question is: how do you fine-tune the CLIP model for a specific dataset? 😀
Fine-tune as you pre-train: create captions for your dataset, then fine-tune the model contrastively with those. There is evidence that even with class collision (similar captions within the same batch) the model still learns properly: openaccess.thecvf.com/content/CVPR2023/papers/Goyal_Finetune_Like_You_Pretrain_Improved_Finetuning_of_Zero-Shot_Vision_Models_CVPR_2023_paper.pdf
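A minimal sketch of the symmetric contrastive (InfoNCE) loss that "fine-tune like you pre-train" reuses at fine-tuning time. This is an illustrative NumPy version, not the paper's code: the function name, the 0.07 temperature, and the toy embeddings are my assumptions; in practice you would compute the same loss on image/text encoder outputs in your training framework.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of
    image/text embedding pairs (row i of each matrix is a pair)."""
    # L2-normalize each embedding, as CLIP does before the dot product.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def log_softmax(x, axis):
        # Numerically stable log-softmax.
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    idx = np.arange(n)
    # Image-to-text direction: each image should match its own caption...
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # ...and text-to-image: each caption should match its own image.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

# Toy check: matched orthogonal pairs give near-zero loss,
# mismatched pairs give a large loss.
matched = clip_contrastive_loss(np.eye(2), np.eye(2))
mismatched = clip_contrastive_loss(np.eye(2), np.eye(2)[::-1])
```

Captions collide ("class collision") when two rows in the batch describe the same class; the paper's observation is that the model still trains fine despite those diagonal labels being slightly noisy.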
Nice
Thanks for watching!
Hello Henry!
I don't know if it was in one of Yannic's videos or in yours, but there was a paper where the targets are 3D facial-expression vectors and the input is speech, for some kind of speech emotion recognition.
Can you help, please? ^^'
Sounds like one of Yannic's videos; I haven't read many papers with speech/audio data.
@connor-shorten I suspected so, but wasn't entirely sure. Thank you very much, though! :))