Diffusion models explained in 4-difficulty levels

  • Published on 12 May 2024
  • In this video, we will take a close look at diffusion models. Diffusion models are used in many domains, but they are most famous for image generation. You might have seen diffusion models at work through DALL-E 2 and Imagen.
    Let's look into how diffusion models learn and manage to create high-resolution, realistic images.
    Check out the blog post for a more detailed look at diffusion models. www.assemblyai.com/blog/diffu...
    Get your Free Token for AssemblyAI Speech-To-Text API 👇www.assemblyai.com/?...
    ▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬
    🖥️ Website: www.assemblyai.com
    🐦 Twitter: / assemblyai
    🦾 Discord: / discord
    ▶️ Subscribe: th-cam.com/users/AssemblyAI?...
    🔥 We're hiring! Check our open roles: www.assemblyai.com/careers
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    #MachineLearning #DeepLearning

Comments • 128

  • @pi5549
    @pi5549 1 year ago +44

    "3/4/5 levels" looks like a very powerful way of explaining concepts. I'd like to see the higher levels be longer and really drill down into the heart of the matter, so that the final level communicates at an expert level. +1 / subbed.

  • @malikfahadsarwar2281
    @malikfahadsarwar2281 7 months ago +8

    It would be good if you also explained the reverse process in as much detail as the forward process.

  • @yricktube
    @yricktube 1 year ago +19

    The way you describe it is how to get the "original" picture back. But all the generated content is new (in that combination). I was waiting for an explanation of the step that describes how a new combination (so, new content) is generated into existence from the latent space through diffusion, not only the method for getting the starting picture back from the noise...

    • @polyfoxgames9006
      @polyfoxgames9006 1 year ago +3

      You pair it with CLIP, which takes in a text string and an image and outputs the distance between them. You denoise while lowering this distance.
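A rough sketch of the guidance idea in this reply, with a toy squared distance to a `target` array standing in for CLIP's learned text-image distance (the `guided_steps` helper and all values are illustrative, not CLIP itself):

```python
import numpy as np

def guided_steps(x, target, steps=50, step_size=0.1):
    """Guidance sketch: at each step, move the image in the direction that
    lowers a distance score (here a toy squared distance, not real CLIP)."""
    for _ in range(steps):
        grad = 2.0 * (x - target)         # gradient of ||x - target||^2
        x = x - step_size * grad          # lower the distance a little each step
    return x

out = guided_steps(np.zeros((4, 4)), np.ones((4, 4)))
```

In real guided diffusion the gradient comes from backpropagating through CLIP's score at each denoising step; the loop structure is the same.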

    • @I77AGIC
      @I77AGIC 1 year ago +6

      The video did explain this very briefly. It's as simple as creating a random noise image and feeding it into the same model you used during training. It will turn it into a real image exactly the same way it happened during training. You just get rid of the part that turns an image into noise and use the part that turns noise into an image. You don't have to use CLIP or any text at all. That's a whole other ball game
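The sampling loop this reply describes can be sketched roughly as a DDPM-style reverse process (the `model` argument, the toy zero-noise predictor, and the schedule values are all illustrative):

```python
import numpy as np

def sample(model, shape, timesteps, betas, rng=np.random.default_rng(0)):
    """DDPM-style sampling sketch: start from pure Gaussian noise and
    repeatedly remove the noise the model predicts."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)            # pure noise, the starting point
    for t in reversed(range(timesteps)):
        eps = model(x, t)                     # the network's noise prediction
        # simplified DDPM mean update: remove the predicted noise component
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                             # re-inject a little noise except at the end
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# toy stand-in "model" that predicts zero noise, just to show the loop runs
betas = np.linspace(1e-4, 0.02, 10)
img = sample(lambda x, t: np.zeros_like(x), (8, 8), 10, betas)
```

As the reply says, no text conditioning is needed for this loop; a trained noise-prediction network in place of the lambda would turn the noise into an image.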

  • @uquantum
    @uquantum 3 months ago +2

    Thanks so much for a useful presentation…what a good idea to present in several levels!

  • @cosmingurau
    @cosmingurau 1 year ago +7

    Sorry, but I don't understand something very important. WHY would you add the noise and then subtract the noise? Correct me if I'm wrong, but the rightmost noise image in this example is basically an encoded image of the original dog image that can be decoded deterministically with the neural network, in multiple steps. That's nice and dandy. And I do understand that the noise image is not like a RAR archive which, were it to be slightly modified, would just yield corruption errors; instead, the modified noise image would still generate... an image. NOW.
    1. How do you get from the user's text prompt to the noise image of what the user WANTED, which will THEN be denoised (decoded)?
    2. How is it that not every OTHER noise result from the text prompt (except previously deterministically encoded images, like this dog image) outputs just a garbled mess? And yes, I know that is sometimes the case; I use Stable Diffusion daily.

  • @kaushiks7303
    @kaushiks7303 11 months ago

    Thank you so much for the elegant explanation.

  • @zhaoyufei9096
    @zhaoyufei9096 1 year ago +2

    Really good video! I had checked a few blogs explaining how diffusion models work and still could not understand. But after watching your video once, I have a better understanding of how diffusion actually works! Thanks a lot!

    • @AssemblyAI
      @AssemblyAI  1 year ago

      That's great to hear, Zhao! Thank you for watching. :)

  • @sotasearcher
    @sotasearcher 5 months ago

    Such a great video to dive in! I'm live streaming learning about Diffusion, right now!

  • @Democracy_Manifest
    @Democracy_Manifest 11 months ago

    This is an excellent video. Love the format. Well done, more please!

    • @AssemblyAI
      @AssemblyAI  11 months ago

      Thanks, will do!

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +19

    This was so helpful. Love this format of starting easier and adding layers of explanation.

    • @AssemblyAI
      @AssemblyAI  1 year ago +1

      Great to hear, thanks!

  • @Kaleubs
    @Kaleubs 3 months ago

    Thanks for this video, it was very insightful. I still have a lot to learn about this topic, which will revolutionize our world so much.

  • @synthoelectro
    @synthoelectro 1 year ago +15

    now that's some quantum technology, man... Being one of the beta testers of Stable Diffusion helps me understand this even more.

  • @talktovipin1
    @talktovipin1 1 year ago +2

    Thanks for the nice explanation. I'd appreciate a similar explanation comparing DDPM vs. DDIM.

    • @AssemblyAI
      @AssemblyAI  1 year ago

      You're very welcome Vipin! Noted your recommendation!

  • @paramino
    @paramino 1 year ago +2

    This is a very good intro for a quick understanding of the concept 👍

    • @AssemblyAI
      @AssemblyAI  1 year ago

      Glad it was helpful!

  • @MrAlextorex
    @MrAlextorex 1 year ago +3

    Diffusion models actually predict a bit of noise to remove from the input noisy image at inference time. The noise is added to images just to produce training data.
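The point in this comment, that noise is added only to manufacture training pairs, can be sketched with the closed-form forward step (the linear schedule and all values are illustrative):

```python
import numpy as np

def add_noise(x0, t, alpha_bars, rng):
    """Forward diffusion: jump straight to step t by mixing the clean
    image with Gaussian noise, producing one training pair."""
    eps = rng.standard_normal(x0.shape)   # the noise the net will be asked to predict
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps                        # (noisy image, target noise)

betas = np.linspace(1e-4, 0.02, 1000)     # illustrative linear schedule
alpha_bars = np.cumprod(1.0 - betas)
x0 = np.zeros((8, 8))                     # stand-in for a normalized image
xt, eps = add_noise(x0, t=500, alpha_bars=alpha_bars, rng=np.random.default_rng(0))
```

At inference time only the learned noise predictor is used; this forward step exists purely to generate (noisy image, noise) pairs for training.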

  • @OpuYT
    @OpuYT 1 year ago +4

    Thank you for your explanation!

  • @alirezaakhavi9943
    @alirezaakhavi9943 9 months ago

    really amazing video thank you very much! subbed! :)

  • @BartoszBielecki
    @BartoszBielecki 1 year ago +5

    Regarding level 3: Is every single pixel diffused at each step, or is there a randomly chosen subset? Is the sampling separate for every pixel, or do we take one value and multiply it by each pixel? Subsequent diffusions work on the already diffused value, I guess (we don't try to remember the mean of the original pixel, but just use the new one)?

  • @inetmiguel
    @inetmiguel 5 months ago

    Nice explanation! I feel like the video title is misleading: it is really one explanation going deeper, incomplete without the deeper levels, and it differs a lot from other videos that restart the explanation from zero at each level. This is more like 4 shades of Diffusion :D Thanks for sharing!

  • @shashankshekharsingh2912
    @shashankshekharsingh2912 1 month ago

    Now, that's a great explanation for Diffusion Models.

  • @sinsernadeesoyo
    @sinsernadeesoyo 5 months ago

    This video was awesome! Well done :) and thank you

  • @yousufmamsa
    @yousufmamsa 11 months ago

    Great explanation of diffusion models. Thank you.

    • @AssemblyAI
      @AssemblyAI  10 months ago

      Glad it was helpful!

  • @hamidzemirline7318
    @hamidzemirline7318 2 months ago

    thanks for this great presentation

  • @soulaymanal-abdallah6410
    @soulaymanal-abdallah6410 1 year ago

    AMAZING !! Thanks so much!

  • @dandogamer
    @dandogamer 1 year ago +1

    This was a great explanation! I tried to read the blog first but the maths notation was way over my head

    • @AssemblyAI
      @AssemblyAI  1 year ago

      Thank you Chewie :)

  • @AIMLDLNLP-TECH
    @AIMLDLNLP-TECH 6 months ago

    Appreciate your explanation skill.
    Q. What is a diffusion model?
    Ans. Let's say you tell your best friend, Sarah, about this amazing new flavor. Sarah gets excited and tells her friend, Tom. Then Tom tells his cousin, Emily. Emily, in turn, tells her family, and the news keeps spreading from person to person, creating a chain reaction. This process of your ice cream flavor information spreading from one person to another is like how a drop of ink spreads in water. At first, it's just a small spot, but then it spreads out and covers more and more area as time goes on.
    In the diffusion model, experts study how things, whether it's information, ideas, or products like your ice cream flavor, spread through a community of people. They try to understand how fast it spreads, how many people it reaches, and what factors influence its spread. By understanding these patterns, they can learn a lot about how people share and adopt new things!

  • @whentheinternetwasgood8049
    @whentheinternetwasgood8049 1 year ago +1

    so much good info! Thank you!!!

    • @AssemblyAI
      @AssemblyAI  1 year ago

      You're very welcome!

  • @rasmustoivanen2709
    @rasmustoivanen2709 1 year ago +5

    Can you post a follow-up explaining how text-conditioned generation works? Imagen, for example, I guess used T5, but how does that text embedding actually affect the generated image, and how is it trained? In the end we have a system where the "noise" is generated with some text embedding, so I am curious how that process works.

    • @AssemblyAI
      @AssemblyAI  1 year ago +3

      Thank you for this suggestion Rasmus, noted!

    • @MrAlextorex
      @MrAlextorex 1 year ago

      The text is transformed into embeddings (visual concepts) using another model, and it is fed to the diffusion model alongside the noisy image. The other model is CLIP, which was trained separately on images with associated descriptions.

  • @randomaccessofshortvideos6214
    @randomaccessofshortvideos6214 5 months ago

    ❤🎉 amazing lecture

  • @jhanolaer8286
    @jhanolaer8286 1 year ago

    Beautiful❤

  • @faridalaghmand4802
    @faridalaghmand4802 2 months ago

    Excellent:)

  • @JanMatusiewicz
    @JanMatusiewicz 1 year ago +2

    Thanks for clear explanations and link to the blog!

    • @AssemblyAI
      @AssemblyAI  1 year ago

      You're very welcome!

  • @alaad1009
    @alaad1009 5 months ago

    Thank You !

  • @Arrogan28
    @Arrogan28 2 months ago

    Wow, this is really great; it definitely helped me understand how these models work.
    However, I did have one question. Your explanation of how Gaussian noise is created for an image confused me a bit. I have had to generate an image of pure noise following a Gaussian distribution before, but in those cases I just generated it by, for each pixel, calling a function that returns a random number drawn from a Gaussian distribution, usually centered so that 0.5 is the zero value for that distribution, i.e. remapping the -1 to 1 range to 0 to 1 via Xnew = (X/2) + 0.5. Hopefully that makes sense. But the way you described it sounded like the noise was created by placing a sort of splat on the image following a Gaussian distribution, and then placing subsequent splats in positions based on the previous splat's position. I guess this is needed so you can generate all the in-between time steps from image to pure noise, rather than just the final image. But I didn't quite get exactly how you create the noise. For example, are you actually splatting a Gaussian distribution that spans several pixels at each position, or does it affect just one pixel? I could see it happening both ways and wasn't certain from your explanation which one it was. That is, do you pick a position and then, on that pixel, create one value drawn from the Gaussian distribution? Or are you placing a splat that is brightest at its center and falls off to zero following a Gaussian curve? If the latter, how wide is it, i.e. what is its radius in pixels? And in either case, how is the noise mixed with the image? Do you multiply the image by the noise value at that pixel, or do you blend between them?
    I doubt anyone will read this, as it's quite a long comment/question for a YouTube video, but I thought it wouldn't hurt to try, as I am very interested in how these models work and the under-the-hood details...

  • @John-eq8cu
    @John-eq8cu 6 months ago

    I want to understand diffusion models so I can understand how it's possible for artificial intelligence to produce an image. Your explanation helps. A bit.

  • @jayseb
    @jayseb 1 year ago

    Great explanation, easy to follow. So in essence, the first step is fixed, and then the decoding is variable, if I understand it right?

    • @MrAlextorex
      @MrAlextorex 1 year ago

      The first step is just to generate training data: clean images with corresponding noisy images and the number of steps used to add the noise.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +1

    Wow this is so helpful.

  • @0xeb-
    @0xeb- 1 year ago

    Good job.

  • @andikafaishal2230
    @andikafaishal2230 1 year ago +1

    my brain cannot handle this

  • @akrammekbal8936
    @akrammekbal8936 1 year ago +1

    So the diffusion model can add noise to image1 and then, in the reverse process, make a different image (not the same one)?????? Please respond!

  • @chaneydw
    @chaneydw 1 year ago

    A very confusing, yet somehow great explanation of diffusion models. Thank you!

    • @AssemblyAI
      @AssemblyAI  1 year ago +10

      A confusing but somehow positive feedback. :D Thank you!

  • @al-aminibrahim1394
    @al-aminibrahim1394 1 year ago

    thanks for this

  • @BogdanEchoMilosevic
    @BogdanEchoMilosevic 1 year ago +2

    Having just watched 5 videos on this, umm, "topic?" I feel as if I have been in a coma for 25 years. I am looking for the simplest possible explanation on how this whole AI thing works, yet there don't seem to be any videos that can explain that without using already established terminology that, to me, is completely foreign. Your video is obviously well made, and you are good at explaining this, especially with the example of a drop of paint in water, but I am obviously so far from even beginning anything beyond. Apart from understanding "noise", I have no clue as to what "diffusion", or "model" or anything means. I could always watch videos on any topic, i.e. quantum physics, rocket science, robotics, or anything, and get the basic idea, but this time I feel like I'm years behind... If you could make a video explaining this as if you would explain it to someone in kindergarten, I would definitely come back and watch

  • @ahmedsinger9435
    @ahmedsinger9435 1 year ago

    Tysm

    • @AssemblyAI
      @AssemblyAI  1 year ago

      You're very welcome. :)

  • @automatalearninglab
    @automatalearninglab 1 year ago

    Great Video! B)

  • @Fsh98
    @Fsh98 3 months ago

    NICE

  • @adammason1587
    @adammason1587 1 year ago

    @AssemblyAI
    Why 255 (probability density graph)? Does it have to do with binary? Network engineer here, and I am trying to draw correlations between IP address ranges being 255, subnet ranges being 255, and the graph you displayed. They all have binary masks in common, hence why I am asking.

    • @vyndecimibd
      @vyndecimibd 1 year ago

      It has to do with binary indeed. 0-255 just represents all possible 8-bit values. With standard colored pixels we have an 8-bit value for each of the red, green, and blue channels. A 24-bit value per pixel is simply the standard; it already gives 16+ million unique colors for a pixel. en.wikipedia.org/wiki/List_of_monochrome_and_RGB_color_formats#24-bit_RGB

  • @Grifter
    @Grifter 1 year ago

    Fascinating stuff, Great explanation.

  • @ONDANOTA
    @ONDANOTA 1 year ago +1

    What role do the images in the training set play? Are diffusion models violating copyright or not?

    • @abail7010
      @abail7010 1 year ago

      This depends on the data the model is trained on! There is not only one specific diffusion model; you can train as many models as you like. If your training data contains copyrighted images you won't be able to use the model for commercial purposes, but there are many open-source, non-copyrighted datasets out there!

  • @audiogus2651
    @audiogus2651 1 year ago +4

    Anyone else see a horse in this drop of paint? 1:00

  • @harshadmane8785
    @harshadmane8785 6 months ago

    Great

  • @cevxj
    @cevxj 1 year ago

    Thanks

    • @AssemblyAI
      @AssemblyAI  1 year ago

      You're very welcome!

  • @xgalarion8659
    @xgalarion8659 10 months ago

    Good explanation, but I do hate it when papers add needless maths and physics that are tangential at best, when they should be describing their model in a simple way.

  • @thobeycampion5387
    @thobeycampion5387 11 months ago

    wow someone finally pulled this off

  • @vasudevankannan9823
    @vasudevankannan9823 1 month ago

    Can diffusion models be used for denoising audio? If yes, how?

  • @Gurugurustan
    @Gurugurustan 1 year ago

    Can someone explain why we need to know the initial position of the ink in the water if we already knew where the ink was first introduced?

    • @AssemblyAI
      @AssemblyAI  1 year ago

      Think of it not as the position but as the shape of the ink. We're trying to reach the initial shape right after the moment it was dropped.

  • @targetdexter
    @targetdexter 1 year ago

    Great explanation!

    • @AssemblyAI
      @AssemblyAI  1 year ago

      Glad you think so!

  • @PhilipRittscher
    @PhilipRittscher 1 year ago +1

    "Full noise" that contains a message is not "white noise". These input "white noise" images are just a puzzle containing info for a computer algorithm to solve. I would not, at this point, want to bet our future - or even crossing the street - on "advanced AI".

    • @MrAlextorex
      @MrAlextorex 1 year ago

      "Full noise" is just used for the AI to see patterns in it, like a kid sees shapes on a noisy TV screen. It's just a way to give the AI imagination. To get what you want, you can guide the generation using text prompts.

  • @trentkuhn
    @trentkuhn 1 year ago

    would you say the process is fractal?

    • @generichuman_
      @generichuman_ 1 year ago +1

      It most definitely is not fractal

  • @yaruuvva
    @yaruuvva 10 months ago

    I need level 5/6/7 of explanations

  • @zenchiassassin283
    @zenchiassassin283 6 months ago

    Any level >= 5 ?

  • @S.Mullen
    @S.Mullen 1 year ago +27

    The explanation of adding noise was well done, but the reverse process--by far the strangest process--was not really explained at all. You introduced, but did not explain, some learning process. This unexplained process "somehow" gets back the image. Every "explanation" of SD always skips over this step! Why? (Also skipped, how the text prompt is "combined" with the image. Folks mumble about CL??, but never clearly explain it.) You are a very very good presenter. Please take 15 minutes to "explain" SD.

    • @RTukka
      @RTukka 1 year ago +6

      Yeah, I am having this frustration as well, except I think I may have understood the concepts even more poorly than you.
      Regarding how the Level 4 part "somehow" gets back to the image: it could be because U-Nets are just really complicated, so it almost has to be handwaved? Every explanation I've seen of them (which is not many) immediately descends into highly technical language. It's evidently a stepwise process, but I don't really understand anything about what is happening in each of those steps, and what data is used during them.
      I also don't think I understand the point of _gradually_ adding noise to the original image if you just end up with 100% noise at the end, and that's where the denoising process starts. Exactly when and how are the partially noised images used? In the U-Net? If so, either the explanations of U-Nets I've seen are missing that info, or they're explaining it in a way that I completely fail to comprehend.
      In addition, the explanations I've seen tend to use a single image as an example of how the model is trained. But I understand these models are trained on many images. So the steps laid out in this video are repeated on thousands of images to train the model to generate an image of a dog (or any image??), but how is information from repeating that process combined into the algorithm or latent space or whatever? Do you start with a virgin model, or some generalized model or latent space, which then gets modified when you train it on the first image, and carry those modifications over when you train on the next image? It seems like that ought to be how it works, but if it is, a great explanation of this stuff would make that explicit.
      And then, yeah, how do text prompts work? Both at a basic level, with a single-word prompt like "dog," but also: how are complicated multi-word prompts managed? (I imagine many of the common "mistakes" of diffusion models might be illustrative.)

    • @abail7010
      @abail7010 1 year ago +6

      A U-Net is a standard deep-learning model that takes an image as input and produces another image (with the same dimensions) as output. It is trained the conventional way, with the gradient descent algorithm, to minimize a least-squares error loss. In this case, the model aims to predict a mask of the image representing the noise that was added in the previous step, so that we can simply subtract that noise from the noisy image to get back toward the original image.
      I hope that was at least somewhat helpful? :)
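The least-squares objective this reply describes can be sketched as follows (the `diffusion_loss` helper and all values are illustrative; in practice the prediction would come from the U-Net):

```python
import numpy as np

def diffusion_loss(predicted_noise, true_noise):
    """Least-squares objective: compare the network's noise prediction
    against the noise that was actually added to the image."""
    return np.mean((predicted_noise - true_noise) ** 2)

true_eps = np.random.default_rng(1).standard_normal((8, 8))
perfect = diffusion_loss(true_eps, true_eps)   # a perfect prediction gives zero loss
bad = diffusion_loss(np.zeros((8, 8)), true_eps)
```

Gradient descent pushes the network's output toward the `perfect` case, pixel by pixel, across many images and noise levels.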

    • @jocke8277
      @jocke8277 11 months ago

      @@abail7010 UNet predicts the noise, and a scheduler removes the noise from the image right?

    • @abail7010
      @abail7010 11 months ago +1

      @@jocke8277 On a high level, yes that is true! :)

  • @angelxiii3181
    @angelxiii3181 1 year ago

    I wish my brain was smart enough to understand!

  • @jwithy
    @jwithy 1 year ago +3

    “OK level one… non-equilibrium thermodynamics” 🥴

    • @bayesianlee6447
      @bayesianlee6447 1 year ago +1

      level 0 - annealed Langevin dynamics

    • @AssemblyAI
      @AssemblyAI  1 year ago +3

      Hahah I understand the frustration :D But it's just what Diffusion Models are based on so you don't actually have to understand non-equilibrium thermodynamics. :D -Mısra

  • @dcodeai369
    @dcodeai369 1 year ago

    I have one question. I hate maths but I love to train models. I tried to learn math but god, it's 😵😵. Any advice?

    • @ujjalkrdutta7854
      @ujjalkrdutta7854 1 year ago +3

      Stick to applied ML then. In that case you can make use of existing frameworks and libraries to implement models for solving problems, without knowing the workings under the hood.
      1. But if you do want to understand the math, the only way is to refer to better learning resources and keep trying iteratively. Often it's not the math alone, but the way it is taught, that makes a whole lot of difference in one's understanding. For example, back in grad school I used to refer to Salman Khan's math videos to get a real understanding of linear algebra concepts (which I could not attain even after reading a few standard books).
      2. Having said that, each of us has to maintain a trade-off between math deep-dives and actual implementation. No one knows everything 100%.

    • @dcodeai369
      @dcodeai369 1 year ago +1

      @@ujjalkrdutta7854 What you said about math is true. I'm sticking with applied ML for now; there is a lot to explore there. Thank you for your time.

  • @jonathaningram8157
    @jonathaningram8157 1 year ago

    The whole language part is missing.

  • @truejim
    @truejim 11 months ago

    In the level 1 explanation, what’s the point of introducing the phrase “thermodynamic equilibrium”? Most lay people understand what it means when we say food coloring diffuses into clear water. Reminding the viewer why that happens from a physics standpoint makes the level 1 explanation less clear, not more clear.

  • @kartikpodugu
    @kartikpodugu 1 year ago

    Learnt a lot of new things from this video:
    Why it's called a U-Net.
    Why it's called a diffusion model.
    What a diffusion model does and how it does it.
    Thanks!

  • @abraruralam3534
    @abraruralam3534 1 year ago

    This feels like it shouldn't be possible... then again, it's not too different from us humans imagining faces in the clouds. Computers just take this hallucination to the next level.

  • @terjeoseberg990
    @terjeoseberg990 6 months ago +1

    Wow! There's another video of yours below this one, and your hair is so different that I didn't recognize that it's you.

  • @retroathlete5814
    @retroathlete5814 1 year ago +1

    Fine, you add noise to an image and then restore it. VERY simple concepts (even if very hard in practice). But the magic of DALL-E, Midjourney & Stable Diffusion is the creation of NEW images. This is the third video I'm watching that explains the same trivial diffusion concept. Guess I'll have to ask ChatGPT instead.

    • @curvingorbit8262
      @curvingorbit8262 2 months ago

      Exactly! I've watched and read numerous explanations of diffusion models, but not one so far has told me how the process ends with an image DIFFERENT from the one with which it began.

  • @saraebrahimi3795
    @saraebrahimi3795 4 months ago

    that was aweeeeessssommmmmmeeeee

  • @chrisyoutube8488
    @chrisyoutube8488 4 months ago

    I came to the comments to see if this was Mandy Moore.

  • @kaiboshvanhortonsnort359
    @kaiboshvanhortonsnort359 1 year ago

    I dunno about all that, I just type in 'boobs' and the thing delivers. Whatever math those silicon wafers decide to subject themselves to, that's on them.

  • @bunnystrasse
    @bunnystrasse 3 months ago

    Who is the lady? Her @

  • @ivuvu4065
    @ivuvu4065 1 year ago +2

    6 minutes explaining nothing and at the end.. blablabla super fast about convolution... and nothing clear :/

  • @resurrection355
    @resurrection355 4 months ago

    You are beautiful

  • @potrishead
    @potrishead 1 year ago +2

    Sorry, but this video is very frustrating. Nothing was explained in terms of either the technique for reversing or how it relates to new image creation when prompting, which is obviously what we are mostly interested in.

    • @rae1220
      @rae1220 13 days ago

      Then this just isn't the video for you. This was purely explaining the concept.
      Helped me a lot.

  • @MistereXMachina
    @MistereXMachina 1 year ago

    Can we take a moment to appreciate how silly it is to say, "we're gonna explain this in 4 levels - 1 being the easiest, 4 being the hardest,"
    and then immediately start level 1 with: "diffusion models were inspired by non-equilibrium thermodynamics from physics, and as you can understand from the name, this field deals with systems that are not in thermodynamic equilibrium."
    Next time ask ChatGPT to write it for you lmao. Imagine going up to a five-year-old and being like,
    "Hey kid, you're familiar with thermodynamic equilibrium, right? Well, the area of machine learning concerned with image generation using diffusion models takes that principle, but is inspired by its inverse."

    • @franzmkrumenacker2519
      @franzmkrumenacker2519 1 year ago

      Please note that what she referenced in Level 1 is secondary-school stuff. The authors obviously assumed *this* basis to build upon, not that of a five-year-old kid.

  • @cipherxen2
    @cipherxen2 10 months ago

    She might not have a technical background. No technical person would mispronounce variance as variation.

  • @milesgreb3537
    @milesgreb3537 4 months ago

    This stuff just sucks man

  • @Adityak1997
    @Adityak1997 2 months ago

    Who else thinks she's AI-generated??