DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained
- Published Jun 26, 2024
- ❤️ Become The AI Epiphany Patreon ❤️ ► / theaiepiphany
In this video I cover DALL-E or "Zero-Shot Text-to-Image Generation" paper by OpenAI team.
They train a VQ-VAE to learn compressed image representations and then they train an autoregressive transformer on top of that discrete latent space and BPEd text.
The model learns to combine distinct concepts in plausible ways, image-to-image translation capabilities emerge, etc.
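The two-stage pipeline from the description can be illustrated with a toy sketch of the Stage 1 idea: snapping each encoder output to its nearest codebook entry so that an image becomes a grid of discrete tokens. This is hypothetical numpy code (vocabulary and grid sizes taken from the paper, the 64-dim latent width is arbitrary), not OpenAI's implementation:

```python
import numpy as np

# Toy sketch of the Stage 1 discrete-latent idea (not OpenAI's code):
# each encoder output vector is snapped to its nearest codebook entry,
# so a 32x32 grid of latents becomes 1024 integer tokens that the
# Stage 2 autoregressive transformer can model alongside BPE text tokens.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 64))        # 8192 codes (paper's image vocab), toy 64-dim
encoder_out = rng.normal(size=(32 * 32, 64))  # 32x32 latent grid -> 1024 vectors

# Nearest-neighbor lookup via the expanded squared-distance formula
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b  (avoids a huge 3D intermediate)
d2 = ((encoder_out ** 2).sum(1, keepdims=True)
      + (codebook ** 2).sum(1)
      - 2 * encoder_out @ codebook.T)
image_tokens = d2.argmin(axis=1)              # one discrete token per latent
print(image_tokens.shape)                     # (1024,)
```

Each image is thus compressed to 1024 integers in [0, 8192), which is the discrete sequence Stage 2 learns to generate.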
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ Paper: arxiv.org/abs/2102.12092
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 What is DALL-E?
03:25 VQ-VAE blur problems
05:15 transformers, transformers, transformers!
07:10 Stage 1 and Stage 2 explained
07:30 Stage 1 VQ-VAE recap
10:00 Stage 2 autoregressive transformer
10:45 Some notes on ELBO
13:05 VQ-VAE modifications
17:20 Stage 2 in-depth
23:00 Results
24:25 Engineering, engineering, engineering
25:40 Automatic filtering via CLIP
27:40 More results
32:00 Additional image to image translation examples
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► / theaiepiphany
One-time donation:
www.paypal.com/paypalme/theai...
Much love! ❤️
Huge thank you to these AI Epiphany patreons:
Eli Mahler
Petar Veličković
Zvonimir Sabljic
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► / aleksagordic
Twitter ► / gordic_aleksa
Instagram ► / aiepiphany
Facebook ► / aiepiphany
👨👩👧👦 JOIN OUR DISCORD COMMUNITY:
Discord ► / discord
📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► aiepiphany.substack.com/
💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
GitHub ► github.com/gordicaleksa
📚 FOLLOW ME ON MEDIUM:
Medium ► / gordicaleksa
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#dalle #openai #generativemodeling
very well explained!
Very well explained! Appreciate the content!
Awesome like always!
Appreciate it! 🔥🧠
Awesome videos!!!
Look forward to watching this, great paper to tackle!
Hahah dude I thought I scheduled it for tomorrow morning. 😂 My mistake haha.
Yup, definitely!
@@TheAIEpiphany can you try "vector quantized models for planning" paper next plss?
@@akashraut3581 I'll check it out!
@@TheAIEpiphany
Hello. Thank you for sharing your knowledge, that's awesome. I have two general questions, please answer. I'm 15 years old and I can't learn everything by myself without your help.
First, is there an implementation somewhere (maybe on GitHub) of the transformer as in the paper? VQ-GAN uses a similar transformer, but not the same one. I want to understand this kind of generation better, and if I find a transformer with the same architecture, I can play with it and understand things better.
Second, you said VQ-VAE outputs are blurry. Could we therefore use VQ-GAN in the first stage to get better results?
In general, I understood your explanation of DALL-E, thank you again!
github.com/gordicaleksa/pytorch-original-transformer
Great video, love the content.
Thanks!
Unexpected music ending 😎
great video!....did you do also DallE2?
Great!! thank you so much!!
Are they planning to release the code for this? The GitHub repo they have up doesn't allow text input (I'm not sure what the point of releasing it without any form of input is). I'd like to try this for myself.
In the video, you talk about VQ-VAE, but the paper mentions dVAE. Are those similar concepts, or is there a difference between them?
the concepts are similar in that both are autoencoders with discrete latent spaces, but they are trained a bit differently (he explains the differences very nicely from 13:00 on)
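To make that training difference concrete, here is a hypothetical toy contrast between the two tricks: VQ-VAE's hard nearest-code choice (which needs straight-through gradients) versus the gumbel-softmax relaxation that DALL-E's dVAE uses. Plain numpy, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=8192)   # encoder scores over the 8192 codebook entries

# VQ-VAE-style: a hard, non-differentiable argmax; gradients must be
# copied around it with the straight-through estimator
hard_code = int(logits.argmax())

# dVAE-style: add Gumbel noise and soften with a temperature tau, so the
# code choice stays differentiable; tau is annealed toward 0 during training
gumbel = -np.log(-np.log(rng.uniform(size=8192)))
tau = 0.5
y = (logits + gumbel) / tau
soft_code = np.exp(y - y.max())
soft_code /= soft_code.sum()     # a "soft one-hot" over the codebook
```

As tau shrinks, soft_code collapses toward a one-hot vector at the noisy sample's argmax, which is what lets the dVAE be trained end-to-end.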
Thank you! I have one misunderstanding: the text token embedding size (256) is smaller than the image token embedding size (8192). How can we combine them to pass through the transformer? I always thought the embedding sizes should be the same.
Doesn't the number '256' refer to the maximum number of text tokens, i.e., limiting a sequence to at most 256 tokens (truncating)? That's as far as I know!
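The reply is right: 256 is the maximum text sequence length, and 8192 is the image token vocabulary size; neither is an embedding width. Each token stream gets its own embedding table, but both tables project to the same model width, so the streams can be concatenated into one sequence. A hypothetical numpy sketch (d_model chosen arbitrarily here; the real model is much wider):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512                                   # toy width, not the paper's

# Separate embedding tables, but the SAME output dimension d_model
text_table = rng.normal(size=(16384, d_model))  # BPE text vocab (paper: 16384)
image_table = rng.normal(size=(8192, d_model))  # image codebook vocab (paper: 8192)

text_tokens = rng.integers(0, 16384, size=256)   # at most 256 text tokens
image_tokens = rng.integers(0, 8192, size=1024)  # 32x32 = 1024 image tokens

# Concatenate into one 1280-token sequence for the autoregressive transformer
seq = np.concatenate([text_table[text_tokens], image_table[image_tokens]])
print(seq.shape)  # (1280, 512)
```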
Hi! When explaining the VQ-VAE modifications, what does "pdf" stand for? Great video, thanks for it.
Usually that's the probability density function
@@benjaminstaar2974 thanks for the clarification 🙌🏻🤓
25:12 *underflow
Bro, I want to understand the math here but I couldn't; my math skills are quite basic. What topics do I need to know to understand it easily, please?
Can we also have a code explained video for this plz
Yup, I have it on the list!
VQGAN + CLIP is better than DALL-E, and it's strapped together very easily
Sure, but it can't use text in the original setup, that's the downside
@@TheAIEpiphany It's super easy, though, to import and load a CLIP model: txt = clipmodel.encode_text(clip.tokenize("a photo of a volcano made out of candy").cuda()).detach().clone(). Then encode a VQGAN-decoded image with img = clipmodel.encode_image(a resized 224x224 VQGAN image), take the cosine similarity: loss = 10 * -(torch.cosine_similarity(txt, img).mean()), then loss.backward(). VQGAN is VERY responsive and knows exactly what CLIP's similarity is trying to say; it responds immediately.
@@TheAIEpiphany It basically generates an almost complete image within the first 3 iterations with the learning rate set to 0.4. In our setup of DALL-E we still had to use CLIP's simple tokenizer and text encoder in order to use it. The discrete VAE they provide with DALL-E doesn't encode text either; only CLIP can encode both image and text, and I suspect that's what they used too. People will tell me they didn't simply use CLIP; I say they should prove it by releasing it lol
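The loop the commenter describes can be sketched in miniature. Below, a random linear map stands in for "VQGAN decode, then CLIP image-encode," and a fixed unit vector stands in for the CLIP text embedding; the latent is nudged by gradient ascent on cosine similarity, which is the same principle as the real VQGAN+CLIP setup. Toy numpy, all names and sizes hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
decoder = rng.normal(size=(16, 64))     # stand-in for decode-then-CLIP-embed
text_feat = rng.normal(size=64)
text_feat /= np.linalg.norm(text_feat)  # stand-in CLIP text embedding (unit norm)

def cosine_and_grad(z):
    """Cosine similarity between the 'decoded image' features and the
    text features, plus its analytic gradient w.r.t. the latent z."""
    u = z @ decoder
    nu = np.linalg.norm(u)
    c = (u @ text_feat) / nu                   # text_feat already unit norm
    grad_u = text_feat / nu - c * u / nu ** 2  # d(cosine)/du
    return c, decoder @ grad_u                 # chain rule back to z

z = rng.normal(size=16)                 # latent being optimized
init_cos, _ = cosine_and_grad(z)
lr = 0.1                                # small step; the comment used 0.4
for _ in range(200):
    _, grad = cosine_and_grad(z)
    z += lr * grad                      # ascend the CLIP-style similarity

final_cos, _ = cosine_and_grad(z)       # similarity improves over init_cos
```

Steering the latent by similarity to the text embedding is exactly why the commenter sees a recognizable image after only a few iterations.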