DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

  • Published on 26 Jun 2024
  • ❤️ Become The AI Epiphany Patreon ❤️ ► / theaiepiphany
    In this video I cover DALL-E, the "Zero-Shot Text-to-Image Generation" paper by the OpenAI team.
    They train a VQ-VAE to learn compressed image representations and then train an autoregressive transformer on top of that discrete latent space and the BPE-encoded text.
    The model learns to combine distinct concepts in plausible ways, image-to-image capabilities emerge, etc.
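    To make the two-stage setup concrete, here is a minimal PyTorch sketch (my own illustration, not OpenAI's code): Stage 1 compresses each image into a 32x32 grid of discrete codes, and Stage 2 models the concatenated text+image token stream autoregressively. Only the vocabulary and sequence sizes below come from the paper; everything else (width, depth, names) is an assumption kept tiny for readability.

    import torch
    import torch.nn as nn

    # From the paper: 16384 BPE text tokens (up to 256 per caption),
    # an 8192-entry image codebook, and 32x32 = 1024 image codes per image.
    TEXT_VOCAB, IMG_VOCAB = 16384, 8192
    TEXT_LEN, IMG_LEN = 256, 32 * 32
    D_MODEL = 512  # illustrative width, far smaller than the real model

    class ToyPrior(nn.Module):
        """Stage 2: decoder-only transformer over the joint text+image stream."""
        def __init__(self):
            super().__init__()
            # Both token types are embedded into the same D_MODEL-dim space.
            self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
            self.img_emb = nn.Embedding(IMG_VOCAB, D_MODEL)
            self.pos_emb = nn.Embedding(TEXT_LEN + IMG_LEN, D_MODEL)
            layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(D_MODEL, IMG_VOCAB)  # next image-code logits

        def forward(self, text_tokens, img_codes):
            x = torch.cat([self.text_emb(text_tokens), self.img_emb(img_codes)], dim=1)
            x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
            # Causal mask: each position attends only to earlier positions.
            causal = torch.triu(
                torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device),
                diagonal=1)
            return self.head(self.blocks(x, mask=causal))

    # In the real pipeline img_codes come from the frozen Stage 1 encoder.
    prior = ToyPrior()
    text = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
    codes = torch.randint(0, IMG_VOCAB, (1, IMG_LEN))
    logits = prior(text, codes)  # shape: (1, 256 + 1024, 8192)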
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    ✅ Paper: arxiv.org/abs/2102.12092
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    ⌚️ Timetable:
    00:00 What is DALL-E?
    03:25 VQ-VAE blur problems
    05:15 transformers, transformers, transformers!
    07:10 Stage 1 and Stage 2 explained
    07:30 Stage 1 VQ-VAE recap
    10:00 Stage 2 autoregressive transformer
    10:45 Some notes on ELBO
    13:05 VQ-VAE modifications
    17:20 Stage 2 in-depth
    23:00 Results
    24:25 Engineering, engineering, engineering
    25:40 Automatic filtering via CLIP
    27:40 More results
    32:00 Additional image to image translation examples
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
    If these videos, GitHub projects, and blogs help you,
    consider helping me out by supporting me on Patreon!
    The AI Epiphany ► / theaiepiphany
    One-time donation:
    www.paypal.com/paypalme/theai...
    Much love! ❤️
    Huge thank you to these AI Epiphany patreons:
    Eli Mahler
    Petar Veličković
    Zvonimir Sabljic
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    💡 The AI Epiphany is a channel dedicated to simplifying the field of AI through creative visualizations and, in general, a stronger focus on geometric and visual intuition rather than algebraic and numerical "intuition".
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    👋 CONNECT WITH ME ON SOCIAL
    LinkedIn ► / aleksagordic
    Twitter ► / gordic_aleksa
    Instagram ► / aiepiphany
    Facebook ► / aiepiphany
    👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
    Discord ► / discord
    📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
    Substack ► aiepiphany.substack.com/
    💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
    GitHub ► github.com/gordicaleksa
    📚 FOLLOW ME ON MEDIUM:
    Medium ► / gordicaleksa
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    #dalle #openai #generativemodeling

Comments • 33

  • @TheAssassin74
    @TheAssassin74 2 years ago +1

    very well explained!

  • @yirushen6460
    @yirushen6460 4 months ago

    Very well explained! Appreciate the content!

  • @PurpleOnion
    @PurpleOnion 2 years ago +1

    Awesome like always!

  • @williamberriosrojas595
    @williamberriosrojas595 2 years ago

    Awesome videos!!!

  • @connorshorten6311
    @connorshorten6311 2 years ago +5

    Look forward to watching this, great paper to tackle!

    • @TheAIEpiphany
      @TheAIEpiphany 2 years ago +2

      Hahah dude I thought I scheduled it for tomorrow morning. 😂 My mistake haha.
      Yup, definitely!

    • @akashraut3581
      @akashraut3581 2 years ago +1

      @@TheAIEpiphany can you try the "Vector Quantized Models for Planning" paper next, please?

    • @TheAIEpiphany
      @TheAIEpiphany 2 years ago +1

      @@akashraut3581 I'll check it out!

    • @user-ek1qb2qo2u
      @user-ek1qb2qo2u 2 years ago +1

      @@TheAIEpiphany
      Hello. Thank you for sharing your knowledge. That's awesome. I have two general questions, please answer. I'm 15 years old and I need help, I can't learn everything by myself.
      Firstly, is there a transformer somewhere (maybe on GitHub) like the one in the paper? VQ-GAN uses a similar transformer, but not the same one. I want to understand this kind of generation better, and if I find a similar transformer with the same architecture, I can play with it and understand things better.
      Second, you said VQ-VAE outputs are blurry, so could we use VQ-GAN in the first stage to get better results?
      In general, I understand your explanation of DALL-E, thank you again!

    • @TheAIEpiphany
      @TheAIEpiphany 2 years ago

      github.com/gordicaleksa/pytorch-original-transformer

  • @zbaker0071
    @zbaker0071 2 years ago +1

    Great video, love the content.

  • @ai4popugai
    @ai4popugai 7 months ago

    Unexpected music ending 😎

  • @galkim1
    @galkim1 1 year ago

    Great video! Did you also cover DALL-E 2?

  • @fly-code
    @fly-code 2 years ago +1

    Great!! thank you so much!!

  • @erikbmyname
    @erikbmyname 2 years ago

    Are they planning to release the code for this? The GitHub repo they have up doesn't allow text input (not sure what the point of having it without any form of input is). I'd like to try this for myself.

  • @mikhaildoroshenko2169
    @mikhaildoroshenko2169 2 years ago

    In the video, you talk about VQ-VAE, but the paper mentions dVAE. Are those similar concepts, or is there a difference between them?

    • @butterkaffee910
      @butterkaffee910 2 years ago

      The concepts are similar in that both are autoencoders with discrete latent spaces, but they are trained a bit differently (he explains the differences very nicely from 13:00 on).
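    To make the distinction concrete, here is a small illustrative PyTorch sketch (my own, not from either paper): VQ-VAE quantizes with a hard nearest-neighbor lookup and copies gradients straight through, while DALL-E's dVAE relaxes the discrete choice with Gumbel-softmax so it stays differentiable. The codebook size is taken from the paper; all other numbers are arbitrary.

    import torch
    import torch.nn.functional as F

    codebook = torch.randn(8192, 64, requires_grad=True)  # 8192 codes; 64-dim is arbitrary
    z = torch.randn(16, 64)                               # a batch of encoder outputs

    # VQ-VAE style: hard nearest-neighbor lookup + straight-through gradients.
    dists = torch.cdist(z, codebook)                      # (16, 8192) distances
    z_q = codebook[dists.argmin(dim=1)]                   # pick the nearest code
    z_st = z + (z_q - z).detach()                         # backward acts as if z_q were z

    # dVAE style: logits over the codebook, relaxed one-hot sample, soft lookup.
    logits = -dists                                       # stand-in for encoder logits
    soft_onehot = F.gumbel_softmax(logits, tau=1.0)       # differentiable, near-one-hot sample
    z_soft = soft_onehot @ codebook                       # weighted mix of codebook rows

    During dVAE training the temperature tau is annealed toward a small value, so the relaxed sample approaches hard quantization.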

  • @user-co6pu8zv3v
    @user-co6pu8zv3v 2 years ago

    Thank you! I have one point of confusion: the text token embedding size (256) is smaller than the image token embedding size (8192). How can we combine them to pass through the transformer? I always thought the embedding sizes should be the same.

    • @Abdulazizab2
      @Abdulazizab2 2 years ago

      Doesn't the number '256' refer to the maximum number of text tokens, i.e., a sequence is truncated to at most 256 tokens? That's as far as I know!
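    A quick shape check supports this reading (only 256, 1024, 8192, and 16384 come from the paper; the shared width of 512 is an arbitrary illustration): 256 is a maximum sequence length, 8192 is a vocabulary (codebook) size, and both token types are mapped to one shared embedding width before the streams are concatenated.

    import torch
    import torch.nn as nn

    d_model = 512                                  # shared embedding width (illustrative)
    text_emb = nn.Embedding(16384, d_model)        # text BPE vocab -> d_model
    img_emb = nn.Embedding(8192, d_model)          # image codebook -> d_model

    text_ids = torch.randint(0, 16384, (1, 256))   # at most 256 text tokens
    img_ids = torch.randint(0, 8192, (1, 1024))    # 32x32 = 1024 image codes
    seq = torch.cat([text_emb(text_ids), img_emb(img_ids)], dim=1)
    print(seq.shape)  # torch.Size([1, 1280, 512]) -- one stream, one width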

  • @AMauricioRepetto
    @AMauricioRepetto 1 year ago

    Hi! When explaining the VQ-VAE modifications, what does "pdf" stand for? Great video, thanks for it.

    • @benjaminstaar2974
      @benjaminstaar2974 1 year ago +1

      Usually that's the probability density function.

    • @AMauricioRepetto
      @AMauricioRepetto 1 year ago

      @@benjaminstaar2974 thanks for the clarification 🙌🏻🤓

  • @mrtnetchart
    @mrtnetchart 1 year ago

    25:12 *underflow

  • @mohamedredha9586
    @mohamedredha9586 several months ago

    Bro, I want to understand the math here but I couldn't, my math skills are very basic. What topics do I need to know to understand it easily, please?

  • @st3843
    @st3843 2 years ago +1

    Can we also have a code-explained video for this, please?

    • @TheAIEpiphany
      @TheAIEpiphany 2 years ago +1

      Yup, I have it on the list!

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 2 years ago +2

    VQGAN + CLIP is better than DALL-E, and it's strapped together very easily.

    • @TheAIEpiphany
      @TheAIEpiphany 2 years ago +1

      Sure, but it can't use text in the original setup; that's the downside.

    • @fast_harmonic_psychedelic
      @fast_harmonic_psychedelic 2 years ago +1

      @@TheAIEpiphany It's super easy though to import and load a CLIP model: txt = clipmodel.encode_text(clip.tokenize("a photo of a volcano made out of candy").cuda()).detach().clone(). Then encode a VQGAN-decoded image with img = clipmodel.encode_image(a resized 224x224 VQGAN image), find the cosine similarity: loss = 10 * -(torch.cosine_similarity(txt, img).mean()), then loss.backward(). VQGAN is VERY responsive and knows exactly what CLIP's similarity signal is trying to say; it responds immediately. (A cleaned-up sketch of this loop appears after this thread.)

    • @fast_harmonic_psychedelic
      @fast_harmonic_psychedelic 2 years ago +1

      @@TheAIEpiphany It basically generates an almost-complete image in the first 3 iterations with the lr set to 0.4. In our setup of DALL-E we still had to use CLIP's simple tokenizer and text encoder in order to use it. The discrete VAE they provide in DALL-E doesn't encode text either; only CLIP can encode both image and text, and I suspect that's what they used too. People will tell me they didn't simply use CLIP; I say they should prove it by releasing it lol
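    For anyone who wants to try the loop described in the thread above, here is a cleaned-up, self-contained sketch. It assumes the open-source clip package (github.com/openai/CLIP); the VQGAN decoder is replaced by a toy stand-in module, since the real decoder's API varies across notebooks, so treat this as the shape of the method rather than a faithful reproduction.

    import torch
    import torch.nn.functional as F
    import clip  # pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clipmodel, _ = clip.load("ViT-B/32", device=device)
    clipmodel = clipmodel.float()  # keep everything in fp32 for simplicity

    prompt = "a photo of a volcano made out of candy"
    txt = clipmodel.encode_text(clip.tokenize(prompt).to(device)).detach()

    # Toy stand-in for a real VQGAN decoder so the sketch runs end to end.
    decoder = torch.nn.Sequential(
        torch.nn.ConvTranspose2d(256, 3, kernel_size=16, stride=16),
        torch.nn.Sigmoid(),
    ).to(device)

    # Latent grid optimized directly by gradient descent (illustrative shape).
    z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.4)  # the lr the commenter mentions

    for step in range(100):
        img = decoder(z)                                        # (1, 3, 256, 256)
        img224 = F.interpolate(img, size=224, mode="bilinear")  # CLIP input size
        img_feat = clipmodel.encode_image(img224)
        # Maximize image-text cosine similarity (scaled by 10, as in the comment).
        loss = 10 * -torch.cosine_similarity(img_feat, txt).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()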