Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

  • Published on 1 Dec 2024

Comments • 153

  • @umarjamilai
    @umarjamilai 3 months ago +39

    One of my favorite pizzas is actually "Pizza with mozzarella di bufala", also known as "bufalina" in Italy 😆😋

    • @jgfreak
      @jgfreak 3 months ago +1

      Sir, thanks a lot for these awesome works. Being a student, this benefits me a lot, sir.

    • @shauryaomar5090
      @shauryaomar5090 3 months ago

      Your lectures are very helpful. Please increase the frequency of videos.

    • @TempleBridge
      @TempleBridge 3 months ago +1

      Would love to visit Venice (it's on my list of places to visit), gonna try it!
      BTW, very grateful for your lecture, got so much to learn and revise. I followed the Stable Diffusion one too, learnt a lot, and was able to correlate that lecture with this one.

    • @rajkachhadiya6192
      @rajkachhadiya6192 3 months ago +1

      thank you @umarjamilai, I love your video
      it's really connected with your previous videos and I have seen all the videos 😉
      thank you once again, TBH I don't have any words to explain my feeling

    • @Scaryder92
      @Scaryder92 7 days ago

      Bufala or Parmigiana ❤ Thanks for the amazing videos as always

  • @flavioferlin3127
    @flavioferlin3127 3 months ago +26

    You, Sir, are a source of pride to all of us Italian Computer Scientists. Best wishes! Thank you!

  • @zhuoranlu3858
    @zhuoranlu3858 3 months ago +25

    You are the best YouTuber on the internet, the best! Not one of! I have watched a bunch of programming videos and none of them are like yours; yours are so good, so up to date, so amazing

    • @umarjamilai
      @umarjamilai 3 months ago +1

      Thank you for the like!

  • @dinhluongnguyen9758
    @dinhluongnguyen9758 3 months ago +17

    You and Andrej are the two guys inspiring me a lot. Respect!

  • @harshwardhanfartale64
    @harshwardhanfartale64 3 months ago +21

    5 hours of top-tier content, and completely free too! Thank you so much! Please keep uploading such content

  • @mlloving
    @mlloving a day ago +1

    I appreciate your effort to make such a clear explanation. I spent the whole Thanksgiving week watching the video.

  • @eddie12369-o
    @eddie12369-o 2 days ago +1

    It's so satisfying to finally understand a lot of details that were once quite confusing. Thanks for such a great effort to offer this amazing lecture! It not only helps me get deeper into what VLMs are about, but also refreshes and connects the transformer part :)))

  • @TheNitroPython
    @TheNitroPython a month ago +2

    Saw a Twitter post about how this is an underrated channel. Bro, your videos are hardcore and I love it!

  • @nasirnr5518
    @nasirnr5518 3 months ago +11

    You're the best at explaining papers with code. Keep it up bro👏. I hope the next one is about *ControlNet* from scratch.

  • @Bbb78651
    @Bbb78651 3 months ago +4

    This channel and video is the real deal. Amazing quality. Can't wait to watch the whole thing. Can't believe it's completely free - we have no excuse! Keep up the great work and Assalamu Alaikum from Austin, TX!

  • @muhammed64648
    @muhammed64648 2 days ago +1

    I've been watching your videos for years, but this is the first time I've seen you. Look at that charisma! You're the king. Let me know if you come to Istanbul; I'll treat you to some kebab.

    • @umarjamilai
      @umarjamilai 2 days ago +1

      Thank you! 😋😋😋🥙🥙🥙

  • @hamzawi2752
    @hamzawi2752 3 months ago +25

    I have no words to thank you. Two weeks ago I was wondering why there are no books for VLMs like there are for LLMs, and today I found your comprehensive explanation video.

    • @tamineabderrahmane248
      @tamineabderrahmane248 3 months ago

      I faced the same problem

    • @vinc6966
      @vinc6966 3 months ago +3

      Same, I only found one paper: An Introduction to Vision-Language Modeling
      Maybe you will find it useful

    • @hamzawi2752
      @hamzawi2752 3 months ago

      @@vinc6966 I also found this today: Building and better understanding vision-language models: insights and future directions

  • @tamineabderrahmane248
    @tamineabderrahmane248 3 months ago +3

    The model is working, thank you Umar, you are a great person. Can you do more videos about training LMs and fine-tuning them?

  • @Dorchares
    @Dorchares 3 months ago +2

    Stable diffusion video was great. I bet this is even better. Nice to see your videos man welcome back.

  • @parthvadera1
    @parthvadera1 a month ago +1

    You are awesome, sir! Thank you so much for putting this together and sharing it so generously with us!! You and Andrej are a boon to the ML/AI community

  • @bhaweshs8461
    @bhaweshs8461 3 months ago +2

    Can't thank you enough. You are simply the best guy on YouTube in this field...

  • @kevinhu8057
    @kevinhu8057 27 days ago +1

    This is super super awesome! Thank you very much for this awesome work!

  • @danish5326
    @danish5326 3 months ago +3

    Man you are a saviour ...pls keep up the good work

  • @abdelwahadelmourabit8872
    @abdelwahadelmourabit8872 3 months ago +1

    from the bottom of my heart thank you. your explanations are exceptionally clear even for novices like me. We wish you the best

  • @muhammadharis5025
    @muhammadharis5025 3 months ago +2

    You are doing amazing work, and this will provide a path for all those who are interested in AI to do some amazing things. You are an inspiration for all the students who want to learn this field at a deep level. Thank you for what you are doing; it is helping us a lot to learn this field.

  • @shiwenshen4926
    @shiwenshen4926 3 months ago +1

    Fantastic contribution!! Best content I ever saw on YouTube

  • @matteo679
    @matteo679 3 months ago +1

    Priceless... You are my hero and a constant source of inspiration!

  • @කැලණිකුප්පි
    @කැලණිකුප්පි 3 months ago +9

    The Man is back 🤩🤩🤩🤩

  • @DanieleO.
    @DanieleO. 3 months ago +1

    Finished! Thanks a lot: I was very curious to learn how multimodal algorithms were even able to work, and it has been a very good challenge to follow the flow of information from input to output, one math operation at a time. Kudos!

  • @tejasvix
    @tejasvix a month ago

    Umar, I don't have words for this type of content. I started by watching Aladdin Persson on YouTube 4 years back, where he implemented papers, and that got me into ML, but he stopped uploading, and after so much time I found you. I am glad and would love to see many more videos like this

  • @Yo-rw7mq
    @Yo-rw7mq 3 months ago +1

    Before watching the video, I want to thank you for the great effort! Your videos always answer my questions!!

  • @linhvu407
    @linhvu407 3 months ago +1

    Super well commented and structured code, well explained video. Superb quality video with zero fee!

  • @arturstupa186
    @arturstupa186 3 months ago +1

    I wanted to spend a few days reading how multimodal LMs work, but your video broke all my plans 😅. As always, perfect timing and explanation, keep up the great work!

  • @ilanaizelman3993
    @ilanaizelman3993 11 hours ago +1

    Legend! Don't stop

  • @kumarprateek1279
    @kumarprateek1279 2 months ago +1

    Your content is top tier.

  • @machiniram
    @machiniram 3 months ago +1

    Thank youuu very much, I'm doing my master's thesis on Visual Language Models and this video is such an amazing resource to complete it. Excellent work!

  • @tamineabderrahmane248
    @tamineabderrahmane248 3 months ago +1

    You are the best ML engineer, bro. There is no other full explanation with PyTorch code for multimodal LMs on the whole of YouTube. May God preserve you.

  • @marsupilami125
    @marsupilami125 3 months ago +3

    I haven't seen the video yet, but I'm sure it's amazing, like all your videos, here's a little thank you

    • @umarjamilai
      @umarjamilai 3 months ago

      Thank you very much! Please reach out to me on LinkedIn if you have any questions or doubts.

  •  3 months ago +1

    What a video, with so much effort put in. Thanks.

  • @akramsalim9706
    @akramsalim9706 2 months ago +1

    what an excellent video. Well done bro

  • @hetanshpatel8521
    @hetanshpatel8521 3 months ago +1

    This is god level stuff. Thanks a lot man.

  • @bluebabboon
    @bluebabboon 3 months ago +5

    How is this still 39k subs? This is so alpha

    • @umarjamilai
      @umarjamilai 3 months ago +1

      Please share it with your network of friends, best way to help me

  • @roip429
    @roip429 a month ago

    Love it! Very good. BTW image “Normalization” is actually Standardization. Normalization is the scaling
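
The distinction drawn in the comment above can be sketched in a few lines of plain Python. This is only an illustration: the pixel values and the `mean=0.5`, `std=0.5` statistics are assumed for the example and are not taken from the video, and the two terms are used loosely and sometimes interchangeably across libraries.

```python
def rescale(pixels, max_value=255.0):
    """Scaling: map raw intensities in [0, max_value] to [0, 1]."""
    return [p / max_value for p in pixels]


def standardize(values, mean, std):
    """Standardization: subtract a mean and divide by a standard deviation."""
    return [(v - mean) / std for v in values]


# Hypothetical 8-bit pixel intensities for a single channel.
raw = [0, 128, 255]

scaled = rescale(raw)
# mean=0.5 and std=0.5 are assumed, illustrative statistics.
standardized = standardize(scaled, mean=0.5, std=0.5)

print(scaled)
print(standardized)
```

With these assumed statistics the scaled values end up in [0, 1] and the standardized values in [-1, 1]; real preprocessing pipelines typically use per-channel statistics.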

  • @danieldu5298
    @danieldu5298 a month ago +1

    This is gold. thank you.

  • @ujjwaltiwari7250
    @ujjwaltiwari7250 3 months ago +1

    Best guy on YouTube for this field

  • @AtanuChowdhury-d6o
    @AtanuChowdhury-d6o 3 months ago +1

    Thanks Umar Jamil. big fan of your work !

  • @TalkAItive
    @TalkAItive 12 days ago +1

    Thanks for the great material. Hope you enjoy the coffee.

  • @sagardesai1253
    @sagardesai1253 29 days ago

    Great video @umarjamilai, learned a lot, very helpful, thanks for your efforts.
    A detailed video on the Llama 3.2 architecture would be helpful.

  • @kumarlodha347
    @kumarlodha347 2 months ago +1

    Amazingly fabulous as always.
    Could you please cover an implementation of ControlNet from scratch using PyTorch? Would love to see that.
    Thanks again for the great content

  • @MukteshSinghRathore
    @MukteshSinghRathore 3 months ago +1

    Sir you are the biggest inspiration for me . Thank you for your guidance

  • @mahsakhoshnoodi2972
    @mahsakhoshnoodi2972 3 months ago

    Yeees, I am a fan of your coding from scratch videos🤩

  • @Dim-zt5ei
    @Dim-zt5ei 14 days ago

    Incredible work Umar! You really have a great talent for visualizing complex things. And doing all this work for free is impressive. Quick question for the experienced data scientists: is it normal to have soooo many layers (a function inside another function inside another function ...)? Is it efficient, and don't you lose the general view of what the code is doing?

  • @nomaverse
    @nomaverse 2 months ago +2

    Thanks man, you are the best.
    PLEASE do a video about FLUX models.
    Thanks

  • @Wing-sv6ps
    @Wing-sv6ps 3 months ago +1

    Love how you explain things in depth. Keep up the work!

  • @hariharpadhi2304
    @hariharpadhi2304 3 months ago +1

    Love ❤ your channel and learned a lot about machine learning

  • @噜啦啦噜啦啦-m3p
    @噜啦啦噜啦啦-m3p 2 months ago +2

    Love from Xi'an! Good explanation!

    • @umarjamilai
      @umarjamilai 2 months ago +1

      Then you must recognize the photo I used at the end 😄

  • @hajaani6417
    @hajaani6417 3 months ago +1

    Legend is back 🎉

  • @swastikgorai2332
    @swastikgorai2332 3 months ago +1

    Ah yes! Gonna try this in the weekend!!

  • @santiagopazbedoya
    @santiagopazbedoya 3 months ago +3

    💚Thank you!

  • @rajkachhadiya6192
    @rajkachhadiya6192 3 months ago +1

    Thank you so much @umarjamilai, I love your explanation method.
    I really like your method.
    I hope you will make a video on the Triton language and how it differs from regular CUDA.

  • @mohammad_mohammad.
    @mohammad_mohammad. 3 months ago +1

    Thank you for this detailed video.

  • @SurajPrasad-bf9qn
    @SurajPrasad-bf9qn 2 months ago +1

    Absolutely loving it. Could you please make a similar video on CoCoOp as well?

  • @santiagopazbedoya
    @santiagopazbedoya 3 months ago +1

    Love this content so much. I don't care if it takes 6 hours. Man, I appreciate the effort

  • @GhulamJilaniRaza
    @GhulamJilaniRaza a month ago

    Amazing lecture Umar! I hope and wish you keep doing this for a long, long time. BTW, which app do you use for making notes?

  • @marsupilami125
    @marsupilami125 3 months ago +4

    Thank you!

  • @en-iyi-benim
    @en-iyi-benim 2 months ago +1

    At 4:10:25 he shows a softmax after the linear layer; why didn't we add a softmax in the GemmaModel class?

  • @md.bayazidrahman4274
    @md.bayazidrahman4274 3 months ago

    Jazakallah Khairan ❤
    Please do federated learning from scratch next

  • @jiansiyong
    @jiansiyong 2 months ago +1

    Thanks for your selfless, awesome video, it is hard to describe how much I appreciate it! God bless you~~~~

  • @ProgramerSalar
    @ProgramerSalar 3 months ago

    Love From India, Sir

  • @mahmoudghareeb7124
    @mahmoudghareeb7124 3 months ago

    Welcome back!!❤❤❤

  • @老刘又被骂了
    @老刘又被骂了 a month ago

    Great content! Thanks for sharing!
    Do you have plans to cover the training process and fine-tuning of multimodal LLMs?

  • @typon1
    @typon1 2 months ago

    love the way you say "Pepperoni"

    • @umarjamilai
      @umarjamilai 2 months ago

      🫠🥰

  • @bear2053
    @bear2053 2 months ago

    Thanks very much for your excellent explanation, Umar. Could you also explain the recently popular Agents stuff, please? Hope you have a nice day.

  • @Makersdeaths
    @Makersdeaths 3 months ago +2

    You are a monster, my dream is to be like you

  • @paneercheeseparatha
    @paneercheeseparatha 3 months ago +1

    Have you made any video on tokenizers?

  • @jonsnow553
    @jonsnow553 2 months ago +1

    Could you tell me how I can train this model for specific data to improve its performance further?

    • @ahmadishaq703
      @ahmadishaq703 18 days ago

      Hello, Did you find any way to train this?

  • @007Paulius
    @007Paulius 3 months ago +4

    Thanks

  • @maleekabakhtawar3892
    @maleekabakhtawar3892 3 months ago

    I commented on your older video about DDP training. You explained everything soooo well. Can you please make a detailed video about Tensor Parallel training, just like the DDP one? It would be very helpful.

  • @TA-vf8yi
    @TA-vf8yi 3 months ago +1

    Geez. You deliver :)

  • @parthraut9020
    @parthraut9020 3 months ago

    @umarjamilai Thank you for this amazing video! At 1:13:35, why are skip connections done from before the LayerNorm vs after? Wouldn't this result in un-normalized values being added to the output of the MHA or MLP blocks? Is there some reason why the architecture looks like this?
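
For readers following this question, the two residual patterns being contrasted can be sketched in plain Python. This is a toy illustration only: `toy_sublayer` is a hypothetical stand-in for the attention or MLP block, and nothing here is the code from the video. A commonly cited motivation for the "pre-norm" layout is that the skip path stays an untouched identity, but treat that as background rather than as the author's answer.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or the MLP: scales inputs by 0.5."""
    return [0.5 * v for v in x]


def pre_norm_block(x, sublayer):
    # Skip connection is taken BEFORE normalization: the raw input x is added
    # to the sublayer output, so the residual path is an untouched identity.
    return [xi + yi for xi, yi in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x, sublayer):
    # Skip connection is added first, then the sum is normalized.
    return layer_norm([xi + yi for xi, yi in zip(x, sublayer(x))])


x = [10.0, 20.0, 30.0]
pre = pre_norm_block(x, toy_sublayer)
post = post_norm_block(x, toy_sublayer)

print(pre)
print(post)
```

In the pre-norm variant the output keeps the scale of the un-normalized input, which is exactly the behavior the question points at; in the post-norm variant the output is re-normalized after every block.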

  • @LightEnergyTrader
    @LightEnergyTrader 2 months ago

    Just out of curiosity, which PDF are you going over at 05:02:36? Can we have it?

  • @yijianyin
    @yijianyin 3 months ago +1

    Unbelievable!!!!!!!!

  • @tamineabderrahmane248
    @tamineabderrahmane248 3 months ago +1

    I think maybe the tie_weights method should flip the equality: I tried this implementation with GPT-2 from Karpathy and the loss was very, very big at the beginning; after flipping the equality, the loss was back in a reasonable range!

  • @onlyshorts6837
    @onlyshorts6837 a month ago

    @umarjamilai great video, is there a way to train it on an OCR task (like handwritten text in other languages)?

  • @TempleBridge
    @TempleBridge 3 months ago +1

    One word - (Legend)^100

  • @bugdary
    @bugdary 3 months ago

    Great video. Thank you. Can you attach the slides of the vision transformers ?

  • @chenlin7535
    @chenlin7535 2 months ago

    How about training a multimodal (vision) model for video?

  • @alphagenerativeai
    @alphagenerativeai 2 months ago

    It would be good if you instruct how to encrypt data for each task of VLM

  • @aamir122a
    @aamir122a 3 months ago

    Time index 3:11:40: I guess that, like in an LSTM, the more information we pack into a given token and the further out we go (e.g. the last token), the harder it gets to remember information about past tokens. I would love to get your views on why transformers can remember the initial sequence and the last few tokens but struggle to retrieve information from the middle.

  • @tamineabderrahmane248
    @tamineabderrahmane248 3 months ago

    I cannot clone the model even though I have a token; I can only download the tokenizer and the safetensors index. Is there any solution?

    • @umarjamilai
      @umarjamilai 3 months ago +1

      You're probably using the SSH endpoint to clone, use the HTTPS one.

    • @tamineabderrahmane248
      @tamineabderrahmane248 3 months ago

      @@umarjamilai I am actually downloading the safetensors files manually, and then I will put them in the same folder as the .json files that I had.

  • @jakubstrawa8629
    @jakubstrawa8629 3 months ago

    36:22
    What's the point of doing super().__init__() in the config? It will always be initialized properly without it being there. It is redundant.

  • @shauryaomar5090
    @shauryaomar5090 a month ago +1

    Please make a video on gpt and sora.

  • @zeamon4932
    @zeamon4932 2 months ago

    What's the pad he used? An iPad? And what's the app? I like that it can be mapped to Windows and streamed

    • @kabirkumar5815
      @kabirkumar5815 13 days ago

      Tldraw, maybe. Excalidraw is similar and I like it

  • @hosseinkhosravipour5545
    @hosseinkhosravipour5545 2 months ago

    Thanks a lot. Can anyone help me with the training part? I'm confused about it. Do we use the image logits in the loss function, or just the prompt tokens?

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 3 months ago +2

    Thanks!

  • @aamir122a
    @aamir122a 3 months ago

    Great work. Please share the software, iPad, etc. you used for the presentation, and if possible do a presentation on video LLMs

  • @mahmoudghareeb7124
    @mahmoudghareeb7124 3 months ago

    Why is Dropout not used here?
    Because Dropout is used during training, but not during inference.
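
The point in the comment above, that dropout only acts at training time, can be sketched in plain Python. This is a minimal illustration of "inverted" dropout (the same scaling convention PyTorch's `nn.Dropout` uses), not code from the video; the function name and its parameters are assumptions made for the example.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of floats (illustrative sketch, assumes 0 <= p < 1).

    In training mode each value is zeroed with probability p and the survivors
    are scaled by 1 / (1 - p). In inference mode the input passes through
    unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]

# Training: some entries are zeroed, the rest are scaled up.
print(dropout(activations, p=0.5, training=True))

# Inference: identity, so an inference-only implementation can omit the layer.
print(dropout(activations, p=0.5, training=False))
```

Because the inference branch is the identity, leaving dropout out of inference-only code does not change the outputs.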

  • @lakshay510
    @lakshay510 3 months ago

    Thank you so much for the video. Loved every bit of it. At one point you mentioned that they chose GELU based on heuristics and experimentation, but won't changing one of many hyperparameters increase the training cost by a huge margin? Is there any way to estimate that intuition? Maybe by looking at activations on TensorBoard / wandb? How does that work in industry?
    But thanks for the video again, had a great weekend watching it.

    • @umarjamilai
      @umarjamilai 3 months ago

      Read this paper by Noam Shazeer: "GLU Variants Improve Transformer". In the conclusion the author says "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
      Noam Shazeer is one of the fathers of the Transformer model. If he says so, who am I to argue? 😁

    • @lakshay510
      @lakshay510 3 months ago

      @@umarjamilai Hahah, true that, and I remember you mentioned this in one of your other videos, I think it was the Mistral explanation video.
      What I'm thinking about right now is how we can estimate which model to choose. There's moondream, CogVLM, MiniGPT-4, Florence-2, Phi-3, etc. I'm thinking of first seeing how the image encoder performs on certain tasks, then seeing which architecture uses that encoder, and fine-tuning further on that task.
      Anyway thanks again :)

  • @jakubstrawa8629
    @jakubstrawa8629 3 months ago

    32:16, you mean patch number 16 is always on the bottom right? :)

  • @theindianrover2007
    @theindianrover2007 3 months ago +1

    respect!😍

  • @theoleher5715
    @theoleher5715 3 months ago +1

    Thank you very much

  • @murugananthamboss9081
    @murugananthamboss9081 3 months ago

    Sir, could you create a video for pruning and knowledge distillation with llms?

  • @parikshitgehlaut9206
    @parikshitgehlaut9206 3 months ago

    Does anyone else get the error "Tokenizer class GemmaTokenizer does not exist or is not currently imported." when loading the model?

    • @ramanandr7562
      @ramanandr7562 3 months ago

      Bro, how is the tutorial? Can we understand things clearly? What's your view?

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 2 months ago

    Thanks!

  • @hariharpadhi2304
    @hariharpadhi2304 3 months ago

    Please upload more such videos on the channel; nobody else teaches with code