VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

  • Published May 31, 2024
  • ❤️ Become The AI Epiphany Patreon ❤️ ► / theaiepiphany
    In this video I cover VQ-VAEs papers:
    1) Neural Discrete Representation Learning
    2) Generating Diverse High-Fidelity Images with VQ-VAE-2 (the only difference is the existence of a hierarchical structure of latents and priors)
    Many novel interesting AI papers such as DALL-E and Jukebox from OpenAI as well as VQ-GAN build off of VQ-VAEs, so it's fairly important to have a good grasp of how they work.
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    ✅ VQ-VAE1 paper: arxiv.org/abs/1711.00937
    ✅ VQ-VAE2 paper: arxiv.org/abs/1906.00446
    ✅ PyTorch code: colab.research.google.com/git...
    ✅ ELBO explained: mbernste.github.io/posts/elbo/
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    ⌚️ Timetable:
    00:00 Intro
    01:10 A tangent on autoencoders and VAEs
    07:50 Motivation behind discrete representations
    08:25 High-level explanation of VQ-VAE framework
    11:20 Diving deeper
    13:05 VQ-VAE loss
    16:20 PyTorch implementation
    23:30 KL term missing
    25:50 Prior autoregressive models
    28:50 Results
    32:20 VQ-VAE-2
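For reference, the loss discussed at 13:05 combines three terms: reconstruction, codebook, and commitment. A minimal PyTorch sketch (hypothetical function and tensor names, not the notebook's exact code):

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(x, x_recon, z_e, z_q, beta=0.25):
    """Sketch of the three VQ-VAE loss terms.

    x, x_recon : input image and decoder output
    z_e        : encoder output (before quantization)
    z_q        : nearest codebook vectors (after quantization)
    beta       : commitment-loss weight (0.25 in the paper)
    """
    recon_loss = F.mse_loss(x_recon, x)            # reconstruction (likelihood) term
    codebook_loss = F.mse_loss(z_q, z_e.detach())  # moves codebook toward encoder outputs
    commitment_loss = F.mse_loss(z_e, z_q.detach())  # keeps encoder close to the codebook
    return recon_loss + codebook_loss + beta * commitment_loss
```

The `detach()` calls implement the stop-gradient operators from the paper: each term updates only one side (codebook or encoder).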
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
    If these videos, GitHub projects, and blogs help you,
    consider helping me out by supporting me on Patreon!
    The AI Epiphany ► / theaiepiphany
    One-time donation:
    www.paypal.com/paypalme/theai...
    Much love! ❤️
    Huge thank you to these AI Epiphany patrons:
    Eli Mahler
    Petar Veličković
    Zvonimir Sabljic
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    💡 The AI Epiphany is a channel dedicated to simplifying the field of AI through creative visualizations and, in general, a stronger focus on geometric and visual intuition rather than algebraic and numerical "intuition".
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    👋 CONNECT WITH ME ON SOCIAL
    LinkedIn ► / aleksagordic
    Twitter ► / gordic_aleksa
    Instagram ► / aiepiphany
    Facebook ► / aiepiphany
    👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
    Discord ► / discord
    📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
    Substack ► aiepiphany.substack.com/
    💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
    GitHub ► github.com/gordicaleksa
    📚 FOLLOW ME ON MEDIUM:
    Medium ► / gordicaleksa
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    #vqvae #discretelatents #generativemodeling

Comments • 74

  • @DCentFN 5 days ago +1

    Combining the code with the paper explanation helps the understanding immensely. Allows for concept as well as application. Thank you

  • @skymanaditya 2 years ago +1

    Loved the explanation, especially the part where you covered all the important aspects and showed them in the code. Subscribed and looking forward to more of this content!

  • @pawnagon4874 2 years ago +14

    I wish I had one of these videos for every paper I read, awesome work

    • @TheAIEpiphany 2 years ago

      Glad to hear that man, thanks!

  • @stefanmai9879 1 year ago

    You're a great teacher! Glad you came back to this paper and love the format with the code walkthroughs. Very thorough!

  • @user-hv2xy2zt1k 2 years ago +28

    Great explanation! Especially useful explanation of the code! Please keep doing the code part! You are a life saver!

    • @TheAIEpiphany 2 years ago

      Super valuable, thanks! I'll consider doing a walk-through of some code, feel free to suggest something!

    • @iceinmylean3947 1 year ago

      @@TheAIEpiphany I'd be really interested in anything related to the autoregressive model part also mentioned in this video, maybe something like training a transformer?

  • @user-sz1hf9rv1u 2 years ago +7

    I truly appreciate your explanations, especially the PyTorch implementation part, which reduces the gap between concepts and real-world implementations. Finding this channel is like finding treasure to me; I've recommended this channel to all my friends. Looking forward to your weekly updates, thanks :)

    • @TheAIEpiphany 2 years ago +2

      Thanks man! 🙏 Yup, I am getting back on track with YouTube, I had a weird period over the last month. 😄

  • @ShravanKumar147 1 year ago +2

    Thank you for such a great explanation, adding code into this format is really helpful to digest the concepts more intuitively. Please keep them coming the same way.

  • @Prashantserai 8 months ago

    Fantastic in every way, including the code explanation as well!

  • @alexijohansen 2 years ago

    Thank you. Love the code, love the in depth explanation! Explaining the math is also great for a beginner like me.

  • @manuobelleiro7711 2 years ago

    Hello, great video! I had a question regarding the token prediction training. Can this be used to generate images from a text description? If so, where in the code is this implemented? I'm having trouble understanding this last part

  • @artikeshari5441 2 years ago +3

    Thanks for explaining it very clearly. Code explanation makes the concept more robust.

  • @hoanglinh96nl 5 months ago

    00:01 VQ-VAE is a crucial model for AI research and used in various novel works.
    02:02 Variational autoencoders use a stochastic bottleneck layer.
    05:58 VQ-VAEs impose structure into the latent space for continuous and meaningful interpolation.
    07:50 Discrete representations are a natural fit for many modalities and enable complex reasoning and predictive learning.
    11:40 Using l2 norm to find closest vector and approximate posterior
    13:33 The likelihood assumption and the loss terms in VQ-VAEs
    17:18 Conversion of bchw tensor to standard representation
    18:59 VQ-VAEs use flat input and an embedding table to find distance to codebook vectors.
    22:27 Explanation of implementing straight through gradient
    24:09 The approximate posterior z given x is a deterministic function.
    27:37 The model is an autoregressive token predictor for generating novel images.
    29:11 VQ-VAEs compress data to a discrete space with code size k=512.
    32:24 VQ-VAE v2 has hierarchical structure for better reconstructions
    34:04 VQ-VAEs capture high-resolution images with some distortion
    Crafted by Merlin AI.
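The straight-through gradient mentioned at 22:27 can be sketched in essentially one line of PyTorch (a hypothetical helper, not the notebook's exact code): the forward pass outputs the quantized vectors, while the backward pass copies the gradient straight to the encoder output as if quantization were the identity.

```python
import torch

def straight_through(z_e, z_q):
    """Straight-through estimator: forward value equals z_q, but the
    gradient flows to z_e because the (z_q - z_e) residual is detached."""
    return z_e + (z_q - z_e).detach()
```

This is exactly why the decoder's reconstruction loss can still train the encoder even though argmin itself is non-differentiable.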

  • @elbayo421 4 months ago

    Amazing explanation! Thank you very much. I was a bit puzzled about how this model can be used to generate new images, but after reading around I think I get it now

  • @jasdeepsinghgrover2470 2 years ago +1

    Thanks for the amazing video... You can make them longer and more detailed if needed... Really fun to watch

  • @mehmetonur7925 2 years ago +6

    The code part is pretty good. It has made the paper clearer.

    • @TheAIEpiphany 2 years ago

      Awesome thanks for that feedback man!

  • @sarathmohan3143 2 years ago

    Thanks a lot, sir.
    Simple and concise explanation, covering the related basics as well.

  • @eranjitkumar11 2 years ago

    Hi, thank you for your work. Can you explain how they incorporate PixelCNN (or WaveNet)?

  • @gougenot 1 year ago

    Very nice code part. Truly helped me to understand what is happening

  • @user-co6pu8zv3v 2 years ago +1

    Great explanation!!! Thank you!

  • @christiannowak7094 2 years ago +1

    Brilliant, I never got this close to understanding what's going on. Really well done

  • @terryr9052 2 years ago

    I have been thinking about VQ-VAE for generating music, and it seems to me that one large limitation of quantizing your latent vectors is that you lose the ability to see interesting results that lie between clusters of latent vectors. For example, say I train my model on both reggae and death metal songs and the resulting latent space shows two clusters. It would be nice to then hear songs that interpolate between the two clusters, but it seems that the quantizing step will force any new vectors (our desired hybrid) to adopt the established codebook vectors, which are only representative of the "pure" songs. Am I correct in this line of thinking? Has anyone seen any more info on this at all?

  • @yimingqu2403 2 years ago +7

    Appreciate your work! Both paper and code parts are very helpful.
    Two suggestions to make the code more concise:
    - PyTorch has a built-in function to calculate pairwise distances, `torch.cdist`.
    - Directly using `index_select` to get the quantized matrix may be more convenient.
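A minimal sketch of the two suggestions above (hypothetical shapes, not the notebook's actual code):

```python
import torch

# Hypothetical sizes: N flattened encoder vectors, K codebook entries of dim D.
flat_input = torch.randn(16, 64)   # (N, D) encoder outputs, one vector per position
codebook = torch.randn(512, 64)    # (K, D) embedding table

# Pairwise L2 distances in one call, replacing the expanded (a - b)^2 formula.
distances = torch.cdist(flat_input, codebook)        # (N, K)
indices = distances.argmin(dim=1)                    # nearest code per vector

# Gather the quantized vectors directly from the table.
quantized = torch.index_select(codebook, 0, indices)  # (N, D)
```

`argmin` over `cdist` gives the same indices as the manually expanded squared-distance formula, since the square root is monotonic.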

    • @TheAIEpiphany 2 years ago

      Not my implementation, but I agree: why not reuse existing library code when possible.

    • @kyde8392 2 years ago

      Your suggestions are really neat 👌

  • @modyngs1256 5 months ago

    Hi,
    What application are you using to write on the PDF? I mean the way you write notes alongside the original PDF, in the black area next to it?

  • @evgenydyshlyuk5604 2 years ago +1

    Great choice of article, thank you, it was very interesting!

  • @vladimirtchuiev2218 1 year ago

    Finally a good explanation on how the autoregressive prior part works :X

  • @michelspeiser5789 2 months ago +1

    Instead of argmin on the distance to the closest embedding, couldn't we just use a softmax instead?
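One known relaxation along these lines is a soft assignment: weight the codebook entries by a softmax over negative distances (Gumbel-softmax, used e.g. in DALL-E's dVAE, is a related trick). A minimal sketch, not from the paper or the video; the trade-off is that the output is no longer a single discrete code, though a temperature tau → 0 recovers the hard argmin:

```python
import torch

flat_input = torch.randn(16, 64)   # hypothetical encoder vectors (N, D)
codebook = torch.randn(512, 64)    # hypothetical codebook (K, D)
tau = 1.0                          # temperature: lower -> closer to hard argmin

# Softmax over negative distances gives a differentiable soft assignment.
weights = torch.softmax(-torch.cdist(flat_input, codebook) / tau, dim=1)  # (N, K)
soft_quantized = weights @ codebook                                       # (N, D)
```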

  • @TuanNguyen-su5ty 4 months ago

    This video is invaluable. Thank you

  • @djabort 1 year ago

    thank you a lot. i like the format with code

  • @sarvagyagupta1744 2 years ago

    This is a good explanation of VQVAE. I do have a question though. OpenAI's Jukebox is based on VQVAE and they pass gradients through the latent space in their loss function. So is there any difference or what do you think is going on?

  • @kirtipandya4618 2 years ago +1

    Nice video. Please do more videos like this. 👍🏻

  • @KarimaKadaoui 2 years ago +1

    Thank you so much for the explanation! I wanted to ask how you get to understand some of the details that are not mentioned in the paper, like how the KL Div ends up being equal to log K?

    • @TheAIEpiphany 2 years ago

      🙏 Well, analyzing these I bring in my understanding and background from elsewhere to better understand what is going on in this particular paper.

  • @hassenzaayra5419 1 year ago

    Thank you very much for this explanation.
    I would like to know how the codebook is created.

  • @bdennyw1 2 years ago

    Love the PyTorch code!

  • @MuhammadAli-mi5gg 2 years ago

    Thanks a lot, it was an awesome explanation.
    And yes the code part is necessary as far as I think, and would highly recommend that.
    Moreover, it would be great if you can also make some content regarding these distributions, because I have tried to understand them, but still, they sound quite fuzzy to me.
    Thanks again!

  • @drtristanbehrens 1 year ago

    A great video! Thanks for sharing!

  • @letianwang5141 7 months ago +1

    best explanation ever, unbiased comment

  • @user-my6yf1st8z 2 years ago +1

    THANK YOU BROTHER AMAZING

  • @mathkernel5136 1 year ago

    How do we generate new images from the VQ-VAE model? Can you do a tutorial on the pix2pix model for generating new image samples? Thanks

  • @apollozou9809 1 year ago

    Overall great explanation. One thing I find confusing, though: in the paper, loss 2 and loss 3 are between the codebook (embedding vectors) and the encoder output after the CNN. However, in the code, they are between the quantized encoder output and the encoder output. Can you explain why these are the same thing?

  • @yinghaohu8784 2 months ago

    You mentioned the posterior and prior; can you provide some reference for why they model it this way?

  • @amonkotaro1723 1 year ago +1

    At 19:30, the distance looks like (a - b)^2

  • @user-tt7mp4dk9w 9 months ago

    One thing that confused me is -> why do they convert BCHW to BHWC and then combine BHW x C => (16K, 64)? Should the quantization be done per image in the batch? It seems the entire batch is merged and quantized instead.
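For what it's worth, the flattening is safe because quantization acts independently on each C-dimensional vector, so merging the batch and spatial axes does not mix information across images. A minimal round-trip sketch (hypothetical shapes):

```python
import torch

B, C, H, W = 2, 64, 4, 4
z_e = torch.randn(B, C, H, W)  # encoder output in BCHW layout

# Each spatial position holds one C-dim vector; quantization treats every
# vector independently, so collapsing B*H*W into one axis is just bookkeeping.
flat = z_e.permute(0, 2, 3, 1).contiguous().view(-1, C)  # (B*H*W, C)

# ... each row of `flat` would be matched against the codebook here ...

# Undo the flattening to recover the per-image layout exactly.
restored = flat.view(B, H, W, C).permute(0, 3, 1, 2)     # back to (B, C, H, W)
```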

  • @IgorAherne 7 months ago

    @TheAIEpiphany Man, that's such an epic explanation. Thank you so much for your help!
    One thing that I am struggling with is 28:00 - by tweaking the prior, does that mean that we can trick the model about what "was" in the image? (what is expected).
    The concept of predicting the next token is easy for me, - but what are we predicting? a next discrete-embedding vector from the table? But these vectors weren't guaranteed to be in any order...
    Or are we predicting the next word? In that case, how do we associate word token to the discrete-embedding vector?
    During teaching this autoregressive model, how do we know which one is the target/correct vector, that we want to be predicted?

    • @IgorAherne 7 months ago

      If anyone else has this question: the autoregressive model is an addition which doesn't "improve" the quality of the VQ-VAE.
      But we can swap it in for the encoder+codebook and use it to produce new images. So basically, "autoregressive model + decoder".
      You have to remember that once the VQ-VAE is trained, the codebook vectors are frozen forever. They will not be shuffled etc.
      So, when deployed into production, the autoregressive model doesn't care what the encoder does.
      Instead, the autoregressive model has learned to look at a few codebook indices (which we pick arbitrarily) and to generate the remaining codebook indices that it thinks are most likely.
      For example, if we gave it an index describing sky, it might decide that a following index describing a cloud is more likely than, say, one describing a fish.
      Once the autoregressive model has produced all the needed indices, we feed the chosen codebook vectors into the decoder.
      This allows us to generate images.
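The sampling procedure described above can be sketched as follows. The real prior is a PixelCNN; here it is replaced by a hypothetical placeholder function that only illustrates the interface (map the partial index grid to logits over the K codes, one position at a time):

```python
import torch

K, H, W, D = 512, 8, 8, 64
codebook = torch.randn(K, D)  # stands in for the frozen, trained codebook

def sample_indices(prior_logits_fn, h=H, w=W):
    """Sample a grid of codebook indices one position at a time (raster order).
    `prior_logits_fn` stands in for a trained autoregressive prior."""
    grid = torch.zeros(h, w, dtype=torch.long)
    for i in range(h):
        for j in range(w):
            logits = prior_logits_fn(grid, i, j)              # (K,) logits
            grid[i, j] = torch.multinomial(logits.softmax(-1), 1)
    return grid

# Placeholder "prior": uniform over codes. A real PixelCNN conditions on the
# already-sampled context, which is what makes the samples coherent.
uniform_prior = lambda grid, i, j: torch.zeros(K)
indices = sample_indices(uniform_prior)
z_q = codebook[indices]   # (H, W, D) grid of vectors -> feed to the decoder
```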

  • @hernanperez8427 2 years ago

    thanks!! I really liked it, very useful!

  • @MrMIB983 2 years ago +3

    Great, we also need VQ-GAN, TransGAN and GANsformer

    • @TheAIEpiphany 2 years ago +2

      VQ-GAN coming soon as well as DALL-E. I'll add the other 2 to my list. 😂 Thanks!

    • @varunsai9736 2 years ago +2

      Can you also do CLIP + VQGAN?

    • @TheAIEpiphany 2 years ago

      @@varunsai9736 Sure I'll see whether I can cram it into VQGAN video

  • @sahilgoyal3811 5 months ago

    very helpful!

  • @kirtipandya4618 2 years ago +1

    Which software are you using for the paper review? The paper on one side, and you can draw and put code next to it.

  • @zongtaowang7840 1 year ago

    Thank you for your explanation and code. When running the code there is an error: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'. It seems there is no requirements.txt file there.

  • @abdelrahmanwaelhelaly1871 2 years ago

    Thank you

  • @user-gz5ym6lb4l 1 year ago

    Thanks for your amazing & simple explanation. It's really helpful.
    In some papers based on VQ-VAE, they use perplexity as a measurement, but I can't understand what perplexity means in the VQ-VAE model. So if you are not busy, could you explain what perplexity means in VQ-VAE?
    Thanks again for your wonderful explanation!
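For context, perplexity in VQ-VAE implementations usually measures codebook usage: the exponential of the entropy of how often each code is selected. It ranges from 1 (only one code ever used, i.e. codebook collapse) to K (all codes used equally). A minimal sketch (hypothetical function name):

```python
import torch

def codebook_perplexity(indices, K=512):
    """Perplexity of codebook usage: exp(entropy) of the empirical
    distribution over selected code indices. Higher means more of the
    codebook is actually being used."""
    counts = torch.bincount(indices.flatten(), minlength=K).float()
    probs = counts / counts.sum()
    entropy = -(probs * torch.log(probs + 1e-10)).sum()  # epsilon avoids log(0)
    return entropy.exp()
```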

  • @srinathtangudu4899 1 year ago

    awesome

  • @peterkonig9537 2 years ago +1

    cool video

  • @deep.extrospection 2 years ago +1

    Very good explanation. And with an implementation to support it.
    Thanks a lot!

  • @redone9553 2 years ago +1

    Code is nice

  • @johnpope1473 2 years ago +1

    You’re smashing it. Take some pauses. Pacing conveys a lot / gives space to digest content. Consider you want to cause people to have a light bulb moment. You can’t give people the answer so quickly. I’m looking forward to pytorch stuff. Maybe do some meditation before you record / stillness. Pause.

    • @TheAIEpiphany 2 years ago

      Thanks for the feedback! I agree, I need to work on being less hectic haha.

  • @djaym7 2 years ago

    +1 on code part

  • @razvanrotaru2285 2 years ago

    i love you