If LLMs are text models, how do they generate images? (Transformers + VQVAE explained)

  • Published 31 May 2024
  • In this video, I talk about Multimodal LLMs, Vector-Quantized Variational Autoencoders (VQ-VAEs), and how modern models like Google's Gemini, Parti, and OpenAI's DALL-E generate images together with text. I tried to cover a lot of bases, starting from the very basics (latent space, autoencoders) all the way to more complex topics (like VQ-VAEs, codebooks, etc.). A short illustrative code sketch of the codebook idea is included just after this description.
    Follow on Twitter: @neural_avb
    #ai #deeplearning #machinelearning
    To support the channel and access the Word documents/slides/animations used in this video, consider JOINING the channel on TH-cam or Patreon. Members get access to Code, project files, scripts, slides, animations, and illustrations for most of the videos on my channel! Learn more about perks below.
    Join and support the channel - www.youtube.com/@avb_fj/join
    Patreon - / neuralbreakdownwithavb
    Interesting videos/playlists:
    Multimodal Deep Learning - • Multimodal AI from Fir...
    Variational Autoencoders and Latent Space - • Visualizing the Latent...
    From Neural Attention to Transformers - • Attention to Transform...
    Papers to read:
    VAE - arxiv.org/abs/1312.6114
    VQ-VAE - arxiv.org/abs/1711.00937
    VQ-GAN - compvis.github.io/taming-tran...
    Gemini - assets.bwbx.io/documents/user...
    Parti - sites.research.google/parti/
    DallE - arxiv.org/pdf/2102.12092.pdf
    Timestamps:
    0:00 - Intro
    3:49 - Autoencoders
    6:16 - Latent Spaces
    9:50 - VQ-VAE
    11:30 - Codebook Embeddings
    14:40 - Multimodal LLMs generating images
  • Science & Technology
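To make the codebook idea mentioned in the description more concrete, here is a minimal, illustrative sketch of the VQ-VAE quantization step. It is not taken from the video's materials: the codebook size, feature dimensions, and the random 8x8 feature grid are all hypothetical, and only NumPy is assumed.

```python
# Minimal sketch of VQ-VAE quantization: continuous encoder features are
# snapped to their nearest codebook vectors, and only the integer indices
# ("image tokens") are kept. All sizes here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

codebook_size, dim = 512, 64                    # K codebook entries, each dim-dimensional
codebook = rng.normal(size=(codebook_size, dim))

# Pretend an encoder turned an image into an 8x8 grid of dim-dimensional features.
features = rng.normal(size=(8 * 8, dim))

# Nearest-neighbour lookup: squared Euclidean distance to every codebook entry.
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (64, K)
indices = dists.argmin(axis=1)                  # 64 integers in [0, K): the image tokens

# The decoder only ever sees the quantized vectors, never the raw features.
quantized = codebook[indices]                   # (64, dim)

print(indices[:10])
```

A transformer trained on these indices can then treat an image as a short sequence of discrete tokens, much like words in a sentence.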

Comments • 42

  • @MrHampelmann123 5 months ago +7

    Your videos are great. I like the switch of scenes from outside to the whiteboard, etc. Really professional and engaging. Keep it up.

    • @avb_fj 5 months ago

      Awesome! Thanks!

  • @teezzz20 10 days ago

    Great video, this helps me a lot when trying to understand multimodal AI. Hope you will keep doing this type of video!

  • @uniquescience7047 5 months ago +3

    Wow, making variational AEs and all the explanation so intuitive and easy to understand!

    • @avb_fj 5 months ago

      🙌🏼🙌🏼 Thanks! Glad it worked.

  • @user-jq1kc5lz1y 5 months ago +2

    My mind can't decode into words how grateful I am. Great video!

    • @avb_fj 5 months ago +1

      Haha this made my day. Merry Christmas! :)

  • @TanNguyen-nq9nj 5 months ago +2

    Nice explanation of the VQ-VAE. Thank you!

  • @lucamatteobarbieri2493 3 months ago +1

    You are very good at explaining, thanks!

    • @avb_fj 3 months ago

      Glad it was helpful!

  • @TP-ct7qm 5 months ago +1

    Awesome video! This had the right balance of technical and intuitive details for me. Keep em coming!

  • @eric-theodore-cartman6151 5 months ago +1

    Absolutely wonderful!

    • @avb_fj 5 months ago

      🙌🏼 Thanks!!

  • @RealAnthonyPeng a month ago

    Thanks for the great video! I'm curious if there is existing work/paper on this LLM+VQ-VAE idea.

  • @zakarkak a month ago

    thanks, love your videos

  • @fojo_reviews 5 months ago +2

    Learnt something before hitting the bed lol! Thanks for this... I finally know something about Gemini and how it works.

    • @avb_fj 5 months ago +2

      Thanks! Glad you enjoyed it.

  • @teleprint-me 5 months ago +1

    This is amazing! This is what science is supposed to be about!

  • @henkjekel4081 2 months ago +1

    brilliant explanation, thank you

  • @higherbeingX 5 months ago +4

    This video has the feel of the Feynman Lectures.

    • @avb_fj 5 months ago

      Wow... that's high praise, my friend! Appreciate it.

    • @higherbeingX 5 months ago +1

      @@avb_fj Keep them coming. As an SE and technologist, I like the way you are presenting complex facts in a simplified way.

  • @ChrisHow 5 months ago +1

    Here from Reddit.
    Great video, thought it might be over my head but not at all.
    Also love the style 🏆

    • @avb_fj 5 months ago

      Awesome to know man! Thanks!

  • @aneeshsathe2494 5 months ago +1

    Amazing video! Looking forward to training an LLM using VQ-VAE

    • @avb_fj 5 months ago

      Hell yeah! Let me know how it goes!

  • @Blooper1980 5 months ago +1

    GREAT VIDEO!

  • @NikhilKumar-fo5on 4 days ago

    very nice!!

  • @bikrammajhi3020 a month ago

    I LOVE HOW YOU START THE VIDEO

    • @avb_fj a month ago

      Haha thanks for the shoutout! I got that instant camera as a present the week I was working on the video, thought it’d be a perfect opportunity to play with it for the opening shot!

  • @sehajpasricha7231 4 months ago

    What if I fine-tune a Mistral 7B for next-frame prediction on a big dataset of 1500 hours? What do you recommend for next-frame prediction (videos of a similar kind)?

    • @avb_fj 4 months ago

      Sounds like a pretty challenging task, especially coz Mistral 7B afaik isn't a multimodal model. There might be a substantial domain shift in your finetuning data compared to the original text dataset it was trained on. If you want to use Mistral only, you may need to follow a VQ-VAE-like architecture (described in the video) to get a codebook-based image/video generation model that autoregressively generates visual content, similar to the original Dall-E. These are extremely compute expensive coz each video would require multiple frames (and each frame would require multiple tokens). Hard to suggest anything without knowing more about the project (mainly compute budget, whether it needs to be multimodal i.e. text+vision, whether it is purely an image-reader or will need to generate images, if videos then how long, etc.), as optimal answers may change accordingly: from VQ-VAE codebooks, to LLaVA-like models that only use image encoders & no image decoders/generation, to good old Conv-LSTM models that have huge memory benefits for video generation (but are hard to make multimodal), to hierarchical-attention-based models. I don't have any video that jumps to mind to share with you.

    • @sehajpasricha7231 4 months ago

      @@avb_fj The dataset videos are a minute long, 20 fps, 128 tokens per frame, so 1200*128 tokens per video. The videos are highway driving ones, and the model needs to generate the next frame the way a real driving video would look. Imagine synthetic data for self-driving.

    • @sehajpasricha7231 4 months ago

      Also, we can condition the model with commands like "move left" (a set of discrete commands) and it would generate the next frame as if the car is moving to the left. There are about 100,000 videos, so 1650+ hours of video.
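For a rough sense of the compute cost discussed in this thread, here is a back-of-the-envelope token budget using the numbers quoted above (1-minute clips at 20 fps, 128 codebook tokens per frame, about 100,000 videos). It is purely illustrative arithmetic, not a training script.

```python
# Token budget for the codebook-based video generation approach discussed above.
frames_per_video = 60 * 20            # 1-minute clip at 20 fps -> 1200 frames
tokens_per_frame = 128                # VQ-VAE codebook indices per frame
tokens_per_video = frames_per_video * tokens_per_frame
num_videos = 100_000

print(f"tokens per video:  {tokens_per_video:,}")                  # 153,600
print(f"tokens in dataset: {tokens_per_video * num_videos:,}")     # 15,360,000,000

# An autoregressive transformer has to predict each of those tokens one at a
# time, conditioned on everything before it (plus any control tokens such as
# "move left"), which is why the reply above calls this extremely compute
# expensive.
```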

  • @blancanthony9992 a month ago

    Why are diffusion models used more these days than VQ-VAE coupled with transformer regression?

    • @avb_fj a month ago +1

      Diffusion models have in general shown more success in producing high-quality and diverse images, so they are the architecture of choice for text-to-image models. However, diffusion models can't easily be used to produce text. VQ-VAE is a special architecture coz it can be trained with an LLM to make it understand user input images & generate images coupled with text.
      So… in short, if you want your model to input text + images AND output text + images, VQVAE+transformers are a great choice.
      If you want to input text and generate images (no text), use something like stable diffusion with a control net.
      Hope that’s helpful.
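As an illustration of the point in the reply above, the following sketch shows how text tokens and VQ-VAE codebook indices could share a single vocabulary so that one autoregressive model can emit both. The vocabulary sizes, the begin/end-of-image markers, and sample_next_token are hypothetical stand-ins, not any particular model's API.

```python
# One transformer, one token stream: text ids and image-codebook ids live in
# the same vocabulary, separated by an offset and begin/end-of-image markers.
import random

TEXT_VOCAB = 32_000          # ordinary text tokens: ids 0 .. 31_999
CODEBOOK = 8_192             # VQ-VAE image tokens:  ids 32_000 .. 40_191
BOI, EOI = 40_192, 40_193    # special begin/end-of-image markers

def sample_next_token(context):
    # Placeholder for a trained multimodal transformer's sampling step;
    # here it just draws a random id from the joint vocabulary.
    return random.randrange(TEXT_VOCAB + CODEBOOK + 2)

def split_stream(tokens):
    """Split a generated stream into text ids and image codebook ids."""
    text_ids, image_ids, in_image = [], [], False
    for t in tokens:
        if t == BOI:
            in_image = True
        elif t == EOI:
            in_image = False
        elif in_image:
            image_ids.append(t - TEXT_VOCAB)    # shift back to a codebook index
        else:
            text_ids.append(t)
    # image_ids would go to the VQ-VAE decoder to render pixels;
    # text_ids go to the ordinary text detokenizer.
    return text_ids, image_ids

stream = [sample_next_token(None) for _ in range(20)]
print(split_stream(stream))

# A trained model would only emit codebook-range ids between BOI and EOI;
# the random sampler here just exercises the plumbing.
```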

    • @blancanthony9992 a month ago +1

      @@avb_fj Yes, it helps a lot to understand. I tried to produce images with VQ-VAE + text embeddings but I can't get diversity; maybe a random layer in the first embedding before the image patches could be effective, I don't know. It seems that VQ-VAE can't produce good diversity. Maybe with PixelCNN; I will try.

  • @sehajpasricha7231 4 months ago

    Sir, could you refer me to some follow-up resources to learn more about this?

    • @avb_fj 4 months ago +1

      Hello! There are some papers and videos linked in the description as follow-up resources. I would also recommend searching Yannic Kilcher's channel to see if he has a video covering a topic you are interested in.

  • @FalahgsGate 5 months ago

    But I tested Gemini Vision, and it does not respond in real time... I have created several vision apps using the Gemini Vision API, and none of them respond in real time. I think the Google video is a trick.

  • @kat_the_vat 5 months ago

    I LOVE this video! My algorithm knows EXACTLY what I want, and to think I got it served less than an hour after it was posted 🥲 I feel so special.
    Shout out to the creator for making such a great video, and shout out to YouTube for bringing me here.

    • @avb_fj 5 months ago

      So glad! :)