Why Does Diffusion Work Better than Auto-Regression?

  • Published May 27, 2024
  • Have you ever wondered how generative AI actually works? Well, the short answer is: in exactly the same way as regular AI!
    In this video I break down the state of the art in generative AI - Auto-regressors and Denoising Diffusion models - and explain how this seemingly magical technology is all the result of curve fitting, like the rest of machine learning.
    Come learn the differences (and similarities!) between auto-regression and diffusion, why these methods are needed to perform generation of complex natural data, and why diffusion models work better for image generation but are not used for text generation.
    The following generative models were featured as demos in this video:
    Images: Adobe Firefly (www.adobe.com/products/firefl...)
    Text: ChatGPT (chat.openai.com)
    Audio: Suno.ai (suno.ai)
    Code: Gemini (gemini.google.com/app)
    Video: Lumiere (Lumiere-video.github.io)
    Chapters:
    00:00 Intro to Generative AI
    02:40 Why Naïve Generation Doesn't Work
    03:52 Auto-regression
    08:32 Generalized Auto-regression
    11:43 Denoising Diffusion
    14:19 Optimizations
    14:30 Re-using Models and Causal Architectures
    16:35 Diffusion Models Predict the Noise Instead of the Image
    18:19 Conditional Generation
    19:08 Classifier-free Guidance

Comments • 153

  • @doku7335
    @doku7335 3 days ago +61

    At first I thought "oh, another random video explaining the same basics and not adding anything new", but I was so wrong. It's an incredibly clear explanation of diffusion, and starting with the basics makes the full picture much clearer. Thank you for the video!

  • @jupiterbjy
    @jupiterbjy 5 days ago +59

    Kinda sorry to my professors and seniors, but this is the single best explanation of the logic behind each of these models. A dozen-minute video > 2 years of confusion in university.

  • @algorithmicsimplicity
    @algorithmicsimplicity  3 months ago +142

    Next video will be on Mamba/SSM/Linear RNNs!

    • @benjamindilorenzo
      @benjamindilorenzo 2 months ago

      Great! Also maybe think about the tradeoff between scaling and incremental improvements, in case your perspective is that LLMs also just approximate the data set and therefore memorize, rather than showing any "emergent capabilities", so that ChatGPT also does "only" curve fitting.

    • @harshvardhanv3873
      @harshvardhanv3873 9 days ago +2

      I am a student pursuing a degree in AI, and we want more of your videos on even the simplest concepts in AI. Trust me, this channel will be a huge deal in the near future. Good luck!!

    • @QuantenMagier
      @QuantenMagier 2 hours ago

      Well take my subscription then!!1111

  • @user-my3dd4lu2k
    @user-my3dd4lu2k 1 month ago +99

    Man, I love the fact that you present the fundamental idea with an intuitive approach, and then discuss the optimizations.

  • @pseudolimao
    @pseudolimao 3 days ago +10

    this is insane. I feel bad for getting this level of content for free

  • @user-fh7tg3gf5p
    @user-fh7tg3gf5p 3 months ago +34

    This genius only makes videos occasionally, and they are not to be missed.

  • @pw7225
    @pw7225 5 days ago +10

    The way you tell the story is fantastic! I am surprised that all AI/ML books are so terrible at didactics. We should always start at the intuition, the big picture, the motivation. The math comes later when the intuition is clear.

  • @yqisq6966
    @yqisq6966 11 days ago +47

    The clearest and most concise explanation of diffusion model I've seen so far. Well done.

  • @Veptis
    @Veptis 3 days ago +6

    This is a great explanation of how image decoders work. I haven't seen this approach and narrative direction yet.
    This is now my reference for explaining it to people who have no idea!

  • @jasdeepsinghgrover2470
    @jasdeepsinghgrover2470 15 days ago +31

    This is a much better explanation than the diffusion paper itself. They just went all around variational inference to get the same result!

  • @rafa_br34
    @rafa_br34 13 days ago +19

    Such an underrated video, I love how you went from the basic concepts to complex ones and didn't just explain how it works but also the reason why other methods are not as good/efficient.
    I will definitely be looking forward to more of your content!

  • @Jack-gl2xw
    @Jack-gl2xw 9 days ago +12

    I have trained my own diffusion models and it required me to do a deep dive of the literature. This is hands down the best video on the subject and covers so much helpful context that makes understanding diffusion models so much easier. I applaud your hard work, you have earned a subscriber!

  • @RicardoRamirez-dr6gc
    @RicardoRamirez-dr6gc 10 days ago +10

    This is seriously one of the best explainer videos I've ever seen. I've spent a long time trying to understand diffusion models, and not a single video has come close to this one.

  • @benjamindilorenzo
    @benjamindilorenzo 2 months ago +7

    Very good job.
    My suggestion is that you explain more about how it actually works, that the model learns to understand complete sceneries just from text prompts.
    This could fill its own video.
    Also it would be very nice to have a video about Diffusion Transformers, like OpenAI's Sora probably is.
    Also it could be great to have a Video about the paper "Learning in High Dimension Always Amounts to Extrapolation".
    best wishes

    • @algorithmicsimplicity
      @algorithmicsimplicity  2 months ago +5

      Thanks for the suggestions, I was planning to make a video about why neural networks generalize outside their training set from the perspective of algorithmic complexity. That paper "Learning in High Dimension Always Amounts to Extrapolation" essentially argues that the interpolation vs extrapolation distinction is meaningless for high dimensional data, and I agree, I don't think it is worth talking about interpolation/extrapolation at all when explaining neural network generalization.

    • @benjamindilorenzo
      @benjamindilorenzo 2 months ago +2

      @@algorithmicsimplicity Yes, true. It would be great also because this links back to the LLM discussions: whether scaling up transformers actually brings "emergent capabilities", or whether this is more simply and less magically explainable by extrapolation.
      Or in other words: people tend to believe either that deep learning architectures like transformers only approximate their training data set, or that seemingly unexplainable or unexpected capabilities emerge while scaling.
      I believe that extrapolation alone explains really well why LLMs work so well, especially when scaled up, AND that LLMs "just" approximate their training data (curve fitting). This is why I brought this up ;)

  • @Frdyan
    @Frdyan 2 days ago +2

    I have a graduate degree in this shit and this is by far the clearest explanation of diffusion I've seen. Have you thought about doing a video running over the NN Zoo? I've used that as a starting point for lectures on NN and people seem to really connect with that paradigm

  • @HD-Grand-Scheme-Unfolds
    @HD-Grand-Scheme-Unfolds 15 days ago +6

    You truly understand how to simplify... to engage our imagination... to employ naive thoughts and ideas to make comparisons that bring across deeper, more core principles and concepts, making the subject far easier to grasp and get an intuition for. Algorithmic Simplicity indeed... thank you for your style of presentation and teaching. Love it, love it... you make me know what question I want to ask but didn't know I wanted to ask. YouTube needs your contribution in ML education. Please don't forget that.

  • @justanotherbee7777
    @justanotherbee7777 3 months ago +3

    A person with very little background can understand what he describes here. Commenting so YouTube recommends it to others.
    Wonderful video! Really a good one.

  • @karlnikolasalcala8208
    @karlnikolasalcala8208 8 days ago +4

    This channel is gold, I'm glad I've randomly stumbled across one of your vids

  • @CodeMonkeyNo42
    @CodeMonkeyNo42 7 days ago

    Great video. Love the pacing and how you distilled the material into such an easy-to-watch video. Great job!

  • @MeriaDuck
    @MeriaDuck 1 day ago

    This must be one of the best and most concise explanations I've seen!

  • @jcorey333
    @jcorey333 3 months ago +7

    This is an amazing quality video! The best conceptual video on diffusion in AI I've ever seen.
    Thanks for making it!
    I'd love to see you cover RNNs.

  • @Matyanson
    @Matyanson 6 days ago +2

    Thank you for the explanation. I already knew a little bit about diffusion, but this is exactly the way I'd hope to learn: start from the simplest examples (usually historical) and progressively advance, explaining each optimisation!

  • @anthonybernstein1626
    @anthonybernstein1626 20 days ago +3

    I had a good idea how diffusion models work but I still learned a lot from this video. Thanks!

  • @banana_lemon_melon
    @banana_lemon_melon 7 days ago +1

    bruh, I loved your content. Other channels/videos usually explain general knowledge that can easily be found on the internet. But you go deeper into the intrinsic aspects of how the stuff works. This video, and your video about transformers, are really good.

  • @mrdr9534
    @mrdr9534 5 days ago +1

    Thanks for taking the time and effort of making and sharing these videos and Your knowledge.
    Kudos and best regards

  • @JordanMetroidManiac
    @JordanMetroidManiac 6 days ago +1

    I finally understand how models like Stable Diffusion work now! I tried understanding them before but got lost at the equation (17:50), but this video describes that equation very simply. Thank you!

  • @ecla141
    @ecla141 2 days ago +1

    Awesome video! I would love to see a video about graph neural networks

  • @xaidopoulianou6577
    @xaidopoulianou6577 10 days ago +1

    Very nicely and simply explained! Keep it up

  • @iestynne
    @iestynne 5 days ago +1

    Wow, fantastic video. Such clear explanations. I learned a great deal from this. Thank you so much!

  • @RobotProctor
    @RobotProctor 10 days ago +2

    I like to think of ML as a funky calculator. Instead of a calculator where you give it inputs and an operation and it gives you an output, you give it inputs and outputs and it gives you an operation.
    You said it's like curve fitting, which is the same thing, but I like thinking of the words "funky calculator", because why not.
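That "inputs and outputs in, operation out" view can be made concrete with ordinary curve fitting. A minimal NumPy sketch (the linear data here is made up purely for illustration):

```python
import numpy as np

# Example (input, output) pairs sampled from the "operation" y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Least-squares curve fitting recovers the operation from the pairs:
# give it inputs and outputs, get back the parameters of a function.
slope, intercept = np.polyfit(x, y, 1)  # recovers 2.0 and 1.0
```

Deep learning does the same thing, just with far more flexible functions than a line.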

  • @abdelhakkhalil7684
    @abdelhakkhalil7684 7 days ago +1

    This was a good watch, thank you :)

  • @tkimaginestudio
    @tkimaginestudio 1 day ago +1

    Great explanations, thank you!

  • @1.4142
    @1.4142 3 months ago +4

    Some2 really brought out some good channels

  • @user-yj3mf1dk7b
    @user-yj3mf1dk7b 8 days ago +1

    Nice explanations. Although I already knew about diffusion, the examples building from the simplest model to the final diffusion model were a really nice touch.

  • @sanjeev.rao3791
    @sanjeev.rao3791 2 days ago +1

    Wow, that was a fantastic explanation.

  • @iancallegariaragao
    @iancallegariaragao 3 months ago +2

    Great video and amazing content quality!

  • @akashmody9954
    @akashmody9954 3 months ago +2

    Great video....already waiting for your next video

  • @ShubhamSinghYoutube
    @ShubhamSinghYoutube 1 day ago +1

    Love the conclusion

  • @anatolyr3589
    @anatolyr3589 1 month ago +1

    Great explanation!👍👍 I personally would like to see a video covering all major types of neural nets with their distinctions, specifics, advantages, disadvantages, etc. The author explains very well 👏👏

  •  7 days ago +2

    I think it would help to mention that the auto-regressors may be viewing the image as a sequence of pixels (RGB vectors). Overall excellent video, extremely intuitive.

    • @algorithmicsimplicity
      @algorithmicsimplicity  7 days ago +1

      In general, auto-regressors do not view images as a sequence. For example, PixelCNN uses convolutional layers and treats inputs as 2d images. Only sequential models such as recurrent neural networks would view the image as a sequence.

    •  6 days ago

      @@algorithmicsimplicity of course, but I feel mentioning it may help with intuition as you’re walking through pixel by pixel image generation

  • @user-er9pw4qh6j
    @user-er9pw4qh6j 17 days ago +2

    Soooo Good!!! Thanks for making it!!!!

  • @Mhrn.Bzrafkn
    @Mhrn.Bzrafkn 11 days ago +3

    It was so easy to understand 👌🏻👌🏻

  • @paaabl0.
    @paaabl0. 6 days ago

    Great video! Focus on the right elements.

  • @vijayaveluss9098
    @vijayaveluss9098 9 days ago +1

    Great explanation

  • @RobotProctor
    @RobotProctor 10 days ago +1

    Thank you. This video is wonderful

  • @zephilde
    @zephilde 12 days ago +3

    Great visualisation! Good job!
    Maybe next video on LoRA or ControlNet ?

  • @marcinstrzesak346
    @marcinstrzesak346 15 days ago +1

    Very good video. Thank you

  • @khangvutien2538
    @khangvutien2538 9 days ago

    Thank you very much.
    I enjoyed the first part, the first 10 seconds.
    After that, there are too many shortcuts in the explanations, and I struggled to understand well enough to explain it again to myself. Still, I subscribed.
    As for suggestions for other videos, I'll check whether you have explained the U-Net already. If not, I'd appreciate the same kind of explanation about it.

  • @psl_schaefer
    @psl_schaefer 21 hours ago

    Amazing video!

  • @joaosousapinto3614
    @joaosousapinto3614 13 days ago +1

    Great video, congrats.

  • @ollie-d
    @ollie-d 2 days ago +1

    Solid video!

  • @mojtabavalipour
    @mojtabavalipour 7 days ago +1

    Well done!

  • @demohub
    @demohub 8 days ago +1

    Just subscribed. Great video

  • @meanderthalensis
    @meanderthalensis 9 days ago +1

    Great video!

  • @AurL_69
    @AurL_69 10 days ago +1

    thanks for explaining

  • @johnbolt2686
    @johnbolt2686 6 days ago

    I would recommend reading about active inference to possibly understand the role of generative models in intelligence.

  • @hmmmza
    @hmmmza 3 months ago +3

    what a great rare content!

  • @ArtOfTheProblem
    @ArtOfTheProblem 13 days ago +1

    great work

  • @pon1
    @pon1 9 days ago +1

    Still feels like magic to me 🙌🙌

  • @oculuscat
    @oculuscat 10 days ago +3

    Diffusion doesn't necessarily work better than auto-regression. The "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" paper introduces an architecture they call VAR that upscales noise using an AR model and this currently out-performs all diffusion models in terms of speed and accuracy.

  • @winstongraves8321
    @winstongraves8321 9 days ago +1

    Great video

  • @ChristProg
    @ChristProg 15 days ago +1

    Thank you so much, Sir. Really interesting video. But I would like you to create a video on how the generative model uses the text prompt during training. Thank you, Sir. I subscribed!😊

  • @mallow610
    @mallow610 9 days ago +2

    Video is a banger

  • @infographie
    @infographie 8 days ago +1

    Excellent.

  • @aydr5412
    @aydr5412 6 days ago

    Thank you for the video. IMO curve fitting is an oversimplification; it distracts us from the real problem: what is being optimized, and how. Also, there is a different perspective on cases where we prefer computational efficiency over training quality: with efficiency you can train the model on more data and for a longer time using the same amount of computational resources, which actually results in a better model.

    • @johnmorrell3187
      @johnmorrell3187 5 days ago

      Curve fitting is optimization so I'd say the two explanations are equivalent.
      While it's true that a more efficient method -> longer training -> better behavior, it's also true that if compute and time really were not a limiting factor then these less efficient methods would give better final performance.

  • @kubaissen
    @kubaissen 3 months ago +1

    Nice vid thx

  • @craftydoeseverything9718
    @craftydoeseverything9718 7 hours ago

    This was genuinely such a great video. I honestly feel like I could come away from this video and implement an image generator myself :) /gen

  • @zacklee5787
    @zacklee5787 5 days ago +1

    Not sure I agree with some of your analysis here. The strength of diffusion models doesn't come from the lower dependence of the objects/pixels the model generates at once. In fact, as you mention, the model actually predicts a whole image, in practice, at every step. Even when you use the trick of predicting the noise, the noise is unintuitively not random; that is, not randomly generated, but actually depends completely on the noise (or lack thereof) in the input. It is, after all, equivalent to predicting the whole image.
    The real strength comes from the incremental nature: a step of the model further down the line can "fix" a mistake it made previously by interpreting the previous generation as noise.
    In the space of all, say, 1024x1024 pixel value combinations, there is a manifold (essentially a subset of close-together images) of all target images we want to generate. The diffusion model learns to take incremental steps toward that subset of "reasonable" images from any random starting point.

    • @algorithmicsimplicity
      @algorithmicsimplicity  4 days ago

      The noise is absolutely randomly generated. The reason the model can predict the noise (or equivalently the image) is because it receives both the noise and the image as input.
      If it were the case that the incremental nature helped, then I would expect diffusion models to generate higher quality outputs than auto-regressors, but this isn't the case. Auto-regressors generate higher quality outputs (e.g. arxiv.org/abs/2205.13554 ), they just take longer to run. If it were the case that NNs are unable to give correct predictions on the first go, we would see the opposite: that diffusion models can correct previous generations and thereby achieve higher quality. Also see LLMs, which have no difficulty generating perfect outputs in one pass.
      Diffusion models only learn to take steps toward the data distribution starting from the standard normal distribution (origin).
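The objective being debated here can be sketched in a few lines of NumPy. This is a toy illustration, not the video's exact setup: `toy_model` is a hypothetical stand-in for the network, and the noise schedule is simplified.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(noisy_image, t):
    # Hypothetical stand-in for the neural network: any function that
    # produces a noise estimate from the noisy image and step index.
    return noisy_image - noisy_image.mean()

def diffusion_training_loss(image, num_steps=1000):
    """One training example: noise the image, regress onto the noise."""
    t = int(rng.integers(1, num_steps))      # random noise level
    alpha_bar = 1.0 - t / num_steps          # simplified noise schedule
    noise = rng.normal(size=image.shape)     # randomly generated noise
    # The model's input mixes the image and the noise together...
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * noise
    # ...so predicting the noise is equivalent to predicting the image.
    predicted_noise = toy_model(noisy, t)
    return float(np.mean((predicted_noise - noise) ** 2))

loss = diffusion_training_loss(rng.normal(size=(8, 8)))
```

The key point from the reply above is visible in the last two lines of the function: the noise is random, but the model sees it (mixed with the image), which is what makes it predictable.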

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 8 days ago

    A funny thing about watching a video like this is that you see an artificial neural network produce an image and then you have another layer of neural network in the brain that tries to figure out if it was a good match or not. The so called “blurry noise” could in principle look like a good match to someone and a bad match to someone else depending on how their own categorization works.
    It could also be good for everyone and bad for everyone of course or some arbitrary mix along that scale.
    The point is that "looks like blurry noise" risks being a rather subjective statement.
    I mean, people see images in the clouds and so on.

  • @frommarkham424
    @frommarkham424 10 days ago +2

    That was exactly how I guessed they did it.

  • @IceMetalPunk
    @IceMetalPunk 9 days ago +1

    And the newest/upcoming models seem to be tending more towards diffusion Transformers, which from my understanding is effectively a Transformer autoencoder with a diffusion model plugged in, applying diffusion directly to the latent space embeddings. Is that correct?

  • @Blooper1980
    @Blooper1980 11 days ago +1

    Finally I understand!

  • @MilesBellas
    @MilesBellas 10 days ago

    via Pi
    "Diffusion models and auto-regressive (AR) models are two popular approaches for generating images and other types of data. They differ in their fundamental techniques, generation time, and output quality. Here's a brief comparison:
    **Diffusion Models:**
    * Approach: Diffusion models are based on the idea of denoising images iteratively, starting from a noisy input and gradually refining it into a high-quality output.
    * Generation Time: Diffusion models are generally faster than AR models for image generation, especially when using optimizations like "asymmetric step" or Cascade models.
    * Output Quality: Diffusion models are known for generating high-quality and diverse images, especially when trained on large datasets like Stable Diffusion or DALL-E 2. They can capture various styles and generate coherent images with intricate details.
    **Auto-Regressive (AR) Models:**
    * Approach: AR models generate images pixel by pixel, conditioning each new pixel on previously generated pixels. This sequential approach makes AR models computationally expensive, especially for large images.
    * Generation Time: AR models tend to be slower than diffusion models due to their sequential nature. The generation time can be significantly longer for high-resolution images.
    * Output Quality: While AR models can produce high-quality images, they may struggle with capturing diverse styles or maintaining coherence across different image regions. They might require additional techniques, like classifier-free guidance or super-resolution, to achieve better results.
    In summary, diffusion models generally offer faster generation times and better output quality compared to AR models. However, both approaches have their strengths and limitations, and the choice between them depends on the specific use case, available computational resources, and desired generation speed and output quality."

  • @yk4r2
    @yk4r2 1 day ago +2

    Hey, could you kindly recommend more on causal architectures?

    • @algorithmicsimplicity
      @algorithmicsimplicity  1 day ago

      I haven't seen any material that covers them really well. There are basically 2 types of causal architectures, causal CNNs and causal transformers, with causal transformers being much more widely used in practice now. Causal transformers are also known as "decoder-only transformers" ("encoders" use regular self-attention layers, "decoders" use causal self-attention). If you search for encoder vs decoder-only transformers you should find some resources that explain the difference.
      Basically, to make a self-attention layer causal you mask the attention scores (i.e. set some to 0), so that words can only attend to words that came before them in the input. This makes it so that every word's vector only contains information from before it. This means you can use every word's vector to predict the word that comes after it, and it will be a valid prediction because that word's vector never got to attend to (i.e. see) anything after it. So, it is as if you had applied the transformer to every subsequence of input words, except you only had to apply it once.
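The masking described in that reply can be sketched in a few lines of NumPy. This is illustrative only: the usual query/key/value projections and multi-head machinery are omitted for brevity.

```python
import numpy as np

def causal_self_attention(x):
    """x: (seq_len, dim) word vectors. Projections omitted for brevity."""
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)                 # raw attention scores
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                        # mask attention to later words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over allowed positions
    return weights @ x                              # word i mixes only words 0..i

x = np.random.default_rng(0).normal(size=(4, 8))
out = causal_self_attention(x)
# Changing a later word leaves earlier outputs untouched, which is why
# each output vector is a valid basis for predicting the next word.
```

Setting the masked scores to negative infinity before the softmax is what makes the corresponding attention weights exactly zero.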

  • @muhammadaneeqasif572
    @muhammadaneeqasif572 4 days ago

    Can you please share the code that you used for generating the images in the demo? It would be very helpful.

  • @recklessroges
    @recklessroges 2 days ago

    Could you explain why the YOLO image classifier is/was so effective? Thank you.

  • @hjups
    @hjups 3 months ago +1

    Do you have a citation that supports your claim for eps vs x0 prediction?
    It's true that the first sampling step with x0 tends to produce a blurry / averaged result, but that's a result of the loss function used when training DDPMs. If you were to use something more complex or another NN, then you'd have a GAN, which don't produce blurry or averaged results on a single forward pass.
    Also, if you examine the output of x0 = noise - eps for the first step, it's both mathematically and visually equivalent to the first x0 prediction sample - a blurry / averaged result. The same thing is also true when predicting velocity, but velocity is arguably harder for a network to predict due to the phase transition.

  • @alex65432
    @alex65432 3 months ago +1

    Can you make a video about the loss landscape? Like, what effects do different weight inits, optimizers, or architectures like ResNet have?

    • @algorithmicsimplicity
      @algorithmicsimplicity  3 months ago

      Thanks for the interesting suggestion! I was already planning to do a video about why neural networks generalize outside of their training set, I should be able to talk about the loss landscape in that video.

  • @IsaOzer-lx7sn
    @IsaOzer-lx7sn 11 hours ago +1

    I want to learn more about the causal architecture idea for auto regressors, but I can't seem to find anything about them anywhere. Do you know where I can read more about this topic?

    • @algorithmicsimplicity
      @algorithmicsimplicity  11 hours ago

      I haven't seen any material that covers them really well. There are basically 2 types of causal architectures, causal CNNs and causal transformers, with causal transformers being much more widely used in practice now. Causal transformers are also known as "decoder-only transformers" ("encoders" use regular self-attention layers, "decoders" use causal self-attention). If you search for encoder vs decoder-only transformers you should find some resources that explain the difference.
      Basically, to make a self-attention layer causal you mask the attention scores (i.e. set some to 0), so that words can only attend to words that came before them in the input. This makes it so that every word's vector only contains information from before it. This means you can use every word's vector to predict the word that comes after it, and it will be a valid prediction because that word's vector never got to attend to (i.e. see) anything after it. So, it is as if you had applied the transformer to every subsequence of input words, except you only had to apply it once.

  • @alirezaghazanfary
    @alirezaghazanfary 1 day ago +1

    Thanks for the very good video.
    I have a question:
    Couldn't we make a model that decreases the resolution of a picture (for example, a 4x4 picture to a 2x2, and then to a 1x1 picture) and run it in reverse (generating a 2x2 from the 1x1, and a 4x4 from the 2x2)?
    Would this model work?

    • @algorithmicsimplicity
      @algorithmicsimplicity  17 hours ago +1

      Yes you absolutely could, and according to this paper: arxiv.org/abs/2404.02905v1 it works pretty well.

  • @iwaniw55
    @iwaniw55 7 hours ago

    Hi @algorithmicsimplicity, I am curious which papers/material you referenced for the generalized auto-regressor? I cannot seem to find any info on using randomly spaced-out pixels to predict the next batch of pixels. Any help would be appreciated. Also, great videos!!!

    • @algorithmicsimplicity
      @algorithmicsimplicity  7 hours ago

      It is more widely known as "any-order autoregression", see e.g. this paper arxiv.org/abs/2205.13554

    • @iwaniw55
      @iwaniw55 6 hours ago

      @@algorithmicsimplicity Thank you so much! This is exactly what I was missing.

  • @EricPham-gr8pg
    @EricPham-gr8pg 7 days ago

    Using a lens projector and -zoom will save all the mathematical brain picking
    In video we use the CCD cell in the camera to instantly illuminate the LED pixel, then zoom it down to a tiny dot, then send it to RAM and display it on the monitor by a zoom factor corresponding to the resolution allowed, and zoom it back down when storing it in the timeline of each coordinate, and add it all up with address and time; then when unfolding, all we need is the tiny-dot first frame and last frame, then start from the last frame, unfold into a buffer, subtract time (but adjusted to the phase angle of time closest to the last frame) and just let time drive at the appropriate speed of each time axis, so the memory is so small

  • @quickdudley
    @quickdudley 4 days ago

    My brain misinterpreted the title as "Why diffusers work better than autoencoders" (I believe because the noising process works rather like data augmentation)

  • @duytdl
    @duytdl 9 days ago +2

    So why isn't diffusion better for text? Also are you saying that auto-regression is only bad because it's expensive to do (serially)? Or is diffusion fundamentally better for images?

    • @algorithmicsimplicity
      @algorithmicsimplicity  9 days ago +2

      Auto-regression is only bad because it is slow, it produces better generations for both text and images. For text, there aren't that many tokens that you need to generate, so you can just use auto-regression: it gives better results. For images, you are forced to use something faster, and diffusion is much faster while producing nearly as good generations.

  • @turhancan97
    @turhancan97 11 days ago +1

    Is the idea at the beginning of the video (auto-regressive image generation) self-supervised learning?

    • @algorithmicsimplicity
      @algorithmicsimplicity  11 days ago +1

      Technically yes: self-supervised learning just means that the labels used to train the model were created automatically from the data itself, instead of by a human. So yes, both auto-regression and diffusion are self-supervised learning, since they automatically create masked/noised inputs and use the clean image as labels. Though usually when people refer to self-supervised learning they specifically mean self-supervised but not generative, so things like SimCLR or contrastive learning.
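A small sketch of that point: both objectives manufacture (input, label) pairs directly from a raw image, with no human labeling involved. The values below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((4, 4))               # the raw data, no human labels

# Auto-regression: hide some pixels; the labels are the hidden clean pixels.
hidden = rng.random(image.shape) < 0.5
ar_input = np.where(hidden, 0.0, image)
ar_labels = image[hidden]

# Diffusion: add noise; the label (the noise, or equivalently the clean
# image) is again derived automatically from the data itself.
noise = rng.normal(size=image.shape)
diff_input = image + 0.5 * noise
diff_label = noise
```

In both cases the supervision signal is recoverable from `image` alone, which is exactly what makes these objectives self-supervised.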

    • @turhancan97
      @turhancan97 11 days ago +1

      @@algorithmicsimplicity I understand. Thanks a lot :)

  • @hamzaumair7909
    @hamzaumair7909 1 month ago +1

    I love your explanations, especially transformers. Although this one IMO could have been better; I think you are missing some ideas that should have been explained.

    • @algorithmicsimplicity
      @algorithmicsimplicity  1 month ago +1

      Thanks for the feedback, any ideas in particular that you think should have been explained?

  • @agustinbs
    @agustinbs 8 days ago +1

    This video is better than going to MIT for a machine learning degree. Man, this is gold, thank you so much.

  • @akashmody9954
    @akashmody9954 3 months ago +1

    Can you recommend some sources that I can follow if I want to go deeper into diffusion models and transformers?

    • @akashmody9954
      @akashmody9954 3 months ago

      I tried to go through the research papers but the math is overwhelming

    • @algorithmicsimplicity
      @algorithmicsimplicity  3 months ago +3

      ​@@akashmody9954 If you just want to learn how to train/use them, I'd highly recommend the fast.ai course by Jeremy Howard, it will give you practical experience using them. If you want to do research/develop new methods then I'm afraid there isn't any better option than just reading the papers. Although if code is available I sometimes find it easier to just read the code than the paper lol.

    • @akashmody9954
      @akashmody9954 3 months ago

      @@algorithmicsimplicity alright.....thanks a lot man, and loving your videos as always

  • @joshjohnson259
    @joshjohnson259 3 days ago +1

    If this explanation is too advanced for me how would you recommend I learn enough to be able to grasp these concepts? Can you direct me to some content that is one level down in complexity so I can see if that would be my starting point in understanding how these models work? I don’t really have any CS background.

    • @algorithmicsimplicity
      @algorithmicsimplicity  2 days ago

      If you just want to learn how to train/use these models, I would highly recommend the fast.ai course by Jeremy Howard (course.fast.ai/ ). You can also look at 3blue1brown's videos on neural networks and transformers which are aimed at a general audience, and Andrej Karpathy's videos on implementing a transformer from scratch for a more detailed walkthrough of the models.

  • @klaushermann6760
    @klaushermann6760 7 days ago

    Now we know they're not only predictors.

  • @akashmody9954
    @akashmody9954 3 months ago +1

    Can you make a video on how SORA by OpenAI works and what kind of architecture it follows?

    • @algorithmicsimplicity
      @algorithmicsimplicity  3 months ago +2

      Unfortunately OpenAI does not publicly release details on their architectures, they only said it was a transformer based diffusion model. This thread had some speculation on the exact architecture though: threadreaderapp.com/thread/1758433676105310543.html

  • @sichengmao4038
    @sichengmao4038 6 days ago

    Can you explain why, for diffusion models, there's no causal architecture? 16:26

    • @algorithmicsimplicity
      @algorithmicsimplicity  6 days ago

      Basically it's because NN layers accumulate information from multiple input features into one feature's vector. By making each layer only take in information from features before it in the AR order, you get a causal architecture with the same size as the original model.
      For diffusion, you could in principle make a causal architecture, but you would need a feature vector for every feature at every step of the noising process, i.e. the size of the model would need to be increased by a factor equal to the number of denoising steps, which isn't practical.
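A toy numpy sketch of that idea (my own illustration, not code from the video): a causal mask forces each position's attention weights onto positions at or before it, so no information flows backwards from later features in the AR order.

```python
import numpy as np

n = 5  # sequence length
scores = np.random.default_rng(0).standard_normal((n, n))  # raw attention scores

# Causal mask: position i may only attend to positions j <= i,
# so information never flows from later features to earlier ones.
mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Softmax over each row; masked (future) positions get weight 0.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.triu(weights, k=1).max())  # 0.0 -- no weight on any future position
```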

    • @sichengmao4038
      @sichengmao4038 6 days ago

      @@algorithmicsimplicity don't quite understand why "the model size is increased by the number of denoising steps". What I imagine is, if we make an analogy to language model like Transformer, we now have a series of tokens (where each token is indeed a noisy image in the noising process), then we can still parallelize along the sequence dimension, isn't it?

    • @algorithmicsimplicity
      @algorithmicsimplicity  5 days ago

      @@sichengmao4038 You could do that, the problem is how you convert the entire image into a token. Usually in order to convert an image into a feature vector, you need to apply a full-sized neural network. So to get your noisy image tokens you need to apply a NN for each noising step.

  • @assgoblin3981
    @assgoblin3981 2 months ago

    Assgoblin approves of this content

  • @JoeJoeTater
    @JoeJoeTater 20 hours ago +2

    18:10 This is wrong. The average of a bunch of noisy images is a less-noisy image. (See "regression towards the mean") You'd have to normalize that averaged image.

    • @algorithmicsimplicity
      @algorithmicsimplicity  17 hours ago +1

      Right, I should have been more careful with my usage of the word "noisy". If you average a bunch of samples from a normal distribution, the result is a sample with less variance (i.e. less noisy). What I meant to say was the probability of the average under the normal distribution is higher (i.e. the result is closer to the origin). So the average still lies within the data manifold (as opposed to images, where the average moves outside the data manifold).
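A quick numpy check of that point (my own illustration): averaging independent Gaussian samples gives another Gaussian sample, but with variance shrunk by the number of samples, so the average concentrates near the density peak rather than drifting off the noise distribution.

```python
import numpy as np

# Average 16 independent samples from N(0, 1) in each of 100k trials.
# The average is itself Gaussian, but with variance 1/16 -- "less noisy",
# and its values concentrate near the origin (the density peak).
rng = np.random.default_rng(0)
samples = rng.standard_normal((100_000, 16))
averages = samples.mean(axis=1)

print(samples.var())   # close to 1.0
print(averages.var())  # close to 1/16 = 0.0625
```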

    • @fayezsalka
      @fayezsalka 4 hours ago

      Yes, that was very confusing to me too. The average of a bunch of random noise samples is 0.5, which is the mean. You would literally get a smooth grey image, not a “noise” image as shown in the video.

  • @craftydoeseverything9718
    @craftydoeseverything9718 7 hours ago

    17:58 btw, you wrote "nose", instead of "noise"

    • @algorithmicsimplicity
      @algorithmicsimplicity  7 hours ago +1

      So I did. Surprised no-one else mentioned it yet lol.

  • @dubfather521
    @dubfather521 6 days ago

    So denoising models work by predicting the clean image, and then to get the next step you noise its already clean output??? That doesn't make any sense. If it predicts the final image already, why do you have to keep predicting?

    • @algorithmicsimplicity
      @algorithmicsimplicity  6 days ago

      The first time it predicts the clean image, it will not produce a good image, it will produce a blurry mess (because it will average over all of the training images). You then add noise to this blurry mess and you get an image that is almost pure noise, with a little bit of structure from the original blurry mess. Then you use that as input and predict a clean image again, this time the produced image will be slightly sharper, because now the model is only averaging over all inputs which are consistent with the blurry structure from the first step. You repeat this many times, at each step the produced image gets sharper because more detail is left from the previous step.
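A toy sketch of that loop (my own illustration, not the video's code): `predict_clean` is a stand-in for the trained denoiser, and the linear noise schedule is an assumption of this sketch. The point is just the structure: predict a clean image, re-noise it to the current noise level, and repeat until the noise level hits zero.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 8)  # stand-in for "the clean image the data implies"

def predict_clean(x_t, t):
    # Placeholder for the trained denoiser: a real model would predict the
    # clean image from the noisy input; here we just blend toward `target`.
    return 0.5 * x_t + 0.5 * target

steps = 50
x = rng.standard_normal(8)          # start from pure noise
for t in reversed(range(steps)):
    x0_hat = predict_clean(x, t)    # predict the clean image
    noise_level = t / steps         # linear schedule (an assumption of this sketch)
    # re-noise the prediction back up to the current step's noise level
    x = (1 - noise_level) * x0_hat + noise_level * rng.standard_normal(8)

print(np.round(x, 2))  # ends close to `target` once the noise level reaches 0
```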

    • @dubfather521
      @dubfather521 6 days ago

      @@algorithmicsimplicity ohhhhhh

  • @glaubherrocha2935
    @glaubherrocha2935 11 hours ago

    a fixed pixel with random color wouldn't make it work?

    • @algorithmicsimplicity
      @algorithmicsimplicity  11 hours ago

      I'm not sure what you are asking, can you elaborate?

  • @cognitive-carpenter
    @cognitive-carpenter 9 days ago

    Enjoyed I think is the wrong output

  • @chadarmstrong7458
    @chadarmstrong7458 12 days ago +1

    I didn't understand why you would predict the noise rather than the clean image. Your explanation didn't seem to be related to the problem...

    • @chadarmstrong7458
      @chadarmstrong7458 12 days ago +1

      "You get a blurry mess again" Why is that a problem in the early iterations?

    • @chadarmstrong7458
      @chadarmstrong7458 12 days ago +3

      "The advanage of doing it this way is that now the model output is uncertain at the later stages of the generation process" Why is that valuable? Why is that relevant to this other problem with the early stages that you are supposedly trying to solve?

    • @chadarmstrong7458
      @chadarmstrong7458 12 days ago +2

      "The average of a bunch of different noise samples which is still valid noise" Why does that matter?

    • @cakep4271
      @cakep4271 11 days ago

      I think 🤔 the main points are: 1. Predicting a clean image directly is slow, not creative, and expensive. 2. So instead of predicting an image outright, just learn to "un-blur", and run it a bunch of times, because that's a fast process.
      So now, you tell it a pic of random noise is a cat and ask it to unblur the cat, thereby making the noise slightly more like a cat. Repeat over and over again. Eventually you have a clean image of a cat.

    • @banana_lemon_melon
      @banana_lemon_melon 7 days ago

      noisy image = clean image + noise

      Now the NN is given a noisy image as input, and outputs/predicts the pure noise. Then we can do:
      clean image = noisy image (input) - noise (prediction output)

      Predicting noise is easier than predicting the image directly, maybe because the noise has a Gaussian/normal distribution (not explained in this video, but we know regression can perform better if the target label has a Gaussian/normal distribution). I'm not sure about the distribution of pixel values in the images though.
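The arithmetic in that comment, sketched in numpy (my own illustration; the "perfect" noise predictor is an assumption of the sketch, since a trained network only approximates the noise):

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(4, 4))   # toy "image" with pixel values in [0, 1]
eps = rng.standard_normal((4, 4))            # Gaussian noise
noisy = clean + eps

# Pretend we have a perfect noise predictor (a trained network only
# approximates this); subtracting its output recovers the clean image.
eps_pred = eps
recovered = noisy - eps_pred

print(np.allclose(recovered, clean))  # True
```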

  • @alexanderbrown-dg3sy
    @alexanderbrown-dg3sy 5 days ago

    I just came here to hit on diffusion models 😂. AR all day…who wants smoke? Paper for paper? given a few mods..AR is superior.