If LLMs are text models, how do they generate images? (Transformers + VQVAE explained)
- Published May 31, 2024
- In this video, I talk about Multimodal LLMs, Vector-Quantized Variational Autoencoders (VQ-VAEs), and how modern models like Google's Gemini, Parti, and OpenAI's DallE generate images together with text. I tried to cover a lot of bases, starting from the very basics (latent spaces, autoencoders) all the way to more complex topics (VQ-VAEs, codebooks, etc.).
Follow on Twitter: @neural_avb
#ai #deeplearning #machinelearning
To support the channel and access the Word documents/slides/animations used in this video, consider JOINING the channel on TH-cam or Patreon. Members get access to Code, project files, scripts, slides, animations, and illustrations for most of the videos on my channel! Learn more about perks below.
Join and support the channel - www.youtube.com/@avb_fj/join
Patreon - / neuralbreakdownwithavb
Interesting videos/playlists:
Multimodal Deep Learning - • Multimodal AI from Fir...
Variational Autoencoders and Latent Space - • Visualizing the Latent...
From Neural Attention to Transformers - • Attention to Transform...
Papers to read:
VAE - arxiv.org/abs/1312.6114
VQ-VAE - arxiv.org/abs/1711.00937
VQ-GAN - compvis.github.io/taming-tran...
Gemini - assets.bwbx.io/documents/user...
Parti - sites.research.google/parti/
DallE - arxiv.org/pdf/2102.12092.pdf
Timestamps:
0:00 - Intro
3:49 - Autoencoders
6:16 - Latent Spaces
9:50 - VQ-VAE
11:30 - Codebook Embeddings
14:40 - Multimodal LLMs generating images
Your videos are great. I like the switch of scenes from being outside to the whiteboard etc. really professional and engaging. Keep it up.
Awesome! Thanks!
Great video, this helped me a lot in trying to understand multimodal AI. Hope you keep making this type of video!
wow, making variational AEs and all the explanation so intuitive and easy to understand!
🙌🏼🙌🏼 Thanks! Glad it worked.
My mind can't decode into words how grateful I am, Great video!
Haha this made my day. Merry Christmas! :)
Nice explanation of the VQ-VAE. Thank you!
You are very good at explaining, thanks!
Glad it was helpful!
Awesome video! This had the right balance of technical and intuitive details for me. Keep em coming!
Absolutely wonderful!
🙌🏼 Thanks!!
Thanks for the great video! I'm curious if there is existing work/paper on this LLM+VQ-VAE idea.
thanks, love your videos
Learnt something before hitting the bed lol! Thanks for this...I finally know something about Gemini and how it works.
Thanks! Glad you enjoyed it.
This is amazing! This is what science is supposed to be about!
brilliant explanation, thank you
This video has the feel of the Feynman Lectures
Wow... that's high praise, my friend! Appreciate it.
@@avb_fj Keep them coming. As an SE and technologist, I like the way you present complex facts in a simplified way.
Here from Reddit.
Great video, thought it might be over my head but not at all.
Also love the style 🏆
Awesome to know man! Thanks!
Amazing video! Looking forward to training an LLM using VQ-VAE
Hell yeah! Let me know how it goes!
GREAT VIDEO!
very nice!!
I LOVE how you start the video
Haha thanks for the shoutout! I got that instant camera as a present the week I was working on the video, thought it’d be a perfect opportunity to play with it for the opening shot!
what if i fine tune a mistral 7B for next frame prediction on a big dataset of 1500 hours? what do you recommend me for next frame prediction (videos of a similar kind)
Sounds like a pretty challenging task, especially since Mistral 7B, as far as I know, isn't a multimodal model. There might be a substantial domain shift between your fine-tuning data and the original text dataset it was trained on. If you want to use only Mistral, you may need to follow a VQ-VAE-like architecture (described in the video): a codebook-based image/video generation model that autoregressively generates visual content, similar to the original Dall-E. These are extremely compute-expensive, because each video requires multiple frames and each frame requires multiple tokens. It's hard to suggest anything without knowing more about the project (mainly compute budget; whether it needs to be multimodal, i.e. text + vision; whether it's purely an image-reader or will also need to generate images; if videos, how long; etc.), as the optimal answer changes accordingly: from VQ-VAE codebooks, to LLaVA-like models that use only image encoders and no image decoders/generation, to good old Conv-LSTM models that have huge memory benefits for video generation (but are hard to make multimodal), to hierarchical-attention-based models. No particular video jumps to mind that I could share with you.
The dataset videos are a minute long, 20 fps, 128 tokens per frame - so 1200*128 tokens per video. The videos are highway driving clips, and the model needs to generate the next frame so that it looks like a real driving video. Imagine synthetic data for self-driving@@avb_fj
We can also condition the model with discrete commands like "move left", and it would generate the next frame with the car moving to the left. There are about 100,000 videos, so 1650+ hours of video.
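For a sense of scale, here is a quick back-of-the-envelope check of the token budget implied by the numbers in this thread (one-minute videos at 20 fps, 128 tokens per frame, ~100,000 videos):

```python
# Figures taken from the comments above.
fps = 20
seconds = 60
tokens_per_frame = 128
num_videos = 100_000

frames_per_video = fps * seconds                      # 1200 frames per video
tokens_per_video = frames_per_video * tokens_per_frame
total_tokens = tokens_per_video * num_videos

print(tokens_per_video)   # 153600 tokens for a single one-minute video
print(total_tokens)       # 15360000000 (~15B visual tokens across the dataset)
```

Over 150k tokens per video is far beyond a typical 7B model's context window, which is why the reply above calls autoregressive video generation "extremely compute-expensive."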
Why are diffusion models used more these days than VQ-VAEs coupled with transformer regression?
Diffusion models have in general shown more success in producing high-quality and diverse images, so they are the architecture of choice for text-to-image models. However, diffusion models can't easily produce text. VQ-VAE is a special architecture because it can be trained with an LLM to make the LLM understand user input images and generate images coupled with text.
So… in short, if you want your model to input text + images AND output text + images, VQVAE+transformers are a great choice.
If you want to input text and generate images (no text), use something like stable diffusion with a control net.
Hope that’s helpful.
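To make the codebook idea in this thread concrete, here is a minimal sketch (not from the video; the sizes and names are illustrative) of the vector-quantization step at the heart of a VQ-VAE: each encoder output vector is snapped to its nearest codebook entry, and the resulting indices are the discrete "image tokens" a transformer can model autoregressively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a codebook of 512 embeddings, each 64-dim,
# and a 4x4 grid of encoder output vectors for one image.
codebook = rng.normal(size=(512, 64))     # learned jointly in a real VQ-VAE
encoder_out = rng.normal(size=(16, 64))   # z_e(x): continuous encoder outputs

def quantize(z, codebook):
    """Snap each vector in z to its nearest codebook entry (L2 distance)."""
    # (16, 512): squared distance from every vector to every codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)            # discrete image tokens
    z_q = codebook[indices]               # quantized vectors fed to the decoder
    return indices, z_q

tokens, z_q = quantize(encoder_out, codebook)
print(tokens.shape, z_q.shape)   # (16,) (16, 64)
```

An LLM trained on these token sequences (interleaved with text tokens) can then "generate an image" by emitting codebook indices, which the VQ-VAE decoder maps back to pixels.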
@@avb_fj Yes, that helps a lot. I tried to produce images with a VQ-VAE + text embeddings but I can't get diversity; maybe a random layer in the first embedding before the image patches could be effective, I don't know. It seems the VQ-VAE can't produce good diversity. Maybe with PixelCNN; I will try.
Sir, could you refer me to some follow-up resources to learn more about this?
Hello! There are some papers and videos linked in the description as follow-up resources. I would also recommend searching Yannic Kilcher's channel to see if he has a video covering a topic you are interested in.
But I tested Gemini Vision, and it does not respond in real time... I have built several vision apps using the Gemini Vision API, and none of them respond in real time. I think the Google video is a trick.
i LOVE this video! my algorithm knows EXACTLY what i want, and to think i got it served less than an hour after it was posted 🥲 i feel so special
shout out to the creator for making such a great video and shout out to youtube for bringing me here
So glad! :)