I am a fan of your work. I read your "Grokking Machine Learning", and it's awesome. I am totally impressed. I stopped watching other AI videos and now follow you for most of this stuff. Simple and practical explanations. Thanks a lot, and I'm grateful to you for spreading the knowledge.
These videos are always incredibly helpful, informative, and understandable. Very grateful
I am sharing this video with my students here in India. Excellent work, Luis!🎉
Serrano you are a genius bro your channel is so underrated
Really incredible job of stepping through the HELLO WORLD of image generation, especially how the video compresses the key output to a 4x4 pixel grid and clearly hand-computes each step along the way!
Always impressed with how understandable, but detailed your videos are. Thank you!
Amazing, I hope to truly understand the mechanism of stable diffusion through this video!
Arguably the greatest teacher alive
Thank you :)
Superb, so elegant explanation. Big thanks Sir!
excellent explanation - thank you so much
I respect your concise explanation
Amazing!! Thanks for this high level overview. It was really helpful and fun 👍
Really amazing work, easy to understand and grasp. You're doing a great deal for the community. Thanks a lot!
You are the best explainer ever. You are amazing.
Great video, it gives good intuition to deep network architecture. Thanks
So can we just use the diffusion model to denoise low-quality or nighttime shots?
Yes, absolutely, they can be used to denoise already existing images.
thank you for your amazing educational videos!
I have a question though: are there any transformers (plus attention mechanisms) involved in the text-to-image generator (the diffusion model)?
If not, then how are the semantics of the text captured?
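For what it's worth, text-to-image diffusion models like Stable Diffusion typically do inject the text through transformer-style cross-attention: the prompt is encoded into token embeddings, and the image features attend over those tokens at each denoising step. Here is a toy, dependency-free sketch of that one attention computation, using made-up 2-dimensional vectors (real models use hundreds of dimensions and learned projection matrices):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Each image-patch query attends over the text tokens (keys/values)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product scores between this query and every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output = attention-weighted average of the token values.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: 2 image-patch queries, 3 text-token keys/values.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values  = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

mixed = cross_attention(queries, keys, values)
```

Each query ends up pulling most of its information from the text token it is most similar to, which is how the prompt's semantics steer the image features.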
In the intermediate result, it is said that after the sigmoid we will not get a sharp image of the ball and bat. How can there be fractional pixel values? Since the image is monochromatic, each pixel should be either 0 or 1, right? Rounding off to the nearest integer would give the same result as before the sigmoid. Even if it's not monochrome, pixels can't be fractions, right?
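On the fractional-pixel point: the sigmoid output is usually kept as a grayscale intensity in (0, 1) rather than rounded, which is exactly why the intermediate image looks fuzzy instead of crisp black-and-white. A minimal sketch with hypothetical logit values (the logits here are made up for illustration):

```python
import math

def sigmoid(x):
    """Squash a raw network output (logit) into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw network outputs for four pixels of a tiny image.
logits = [2.0, -1.5, 0.3, -4.0]

# The fractional outputs are read as grayscale intensities
# (0 = fully off, 1 = fully on), not snapped to a binary mask;
# that's why the intermediate image looks fuzzy.
intensities = [round(sigmoid(z), 3) for z in logits]
# intensities == [0.881, 0.182, 0.574, 0.018]

# Rounding to the nearest integer would indeed recover a hard 0/1 image,
# but it throws away the model's confidence in each pixel.
binary = [round(v) for v in intensities]   # [1, 0, 1, 0]
```

So the question is right that rounding reproduces a monochrome image; the point is that the un-rounded values carry extra information that later denoising steps can use.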
Muy BALL-issimo 😄 Loved the puns!!!!!😋😋😋
Thank you for such a wonderful visualization that conveys an overview of complex mathematical concepts.
Can you please do a video detailing the underlying architecture of the neural network that forms the diffusion model?
Also, are Generative Adversarial Networks (GANs) not used anymore for image generation?
Serrano Academy: The art of Understanding
Luis Serrano: The GOD of Understanding
Thank you so much, what an honour! :)
@@SerranoAcademy Thank you, the honour is ours! :)
Amazing deep dismantling of complex structures. That's real ML/AI democratization.
Could it be that the diffusion model is trained to learn the amount of noise that has to be removed from the input image, rather than the image with less noise? That is what I understood from other sources, because they say that is easier for the model. Thank you, and good video, very enlightening.
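The commenter is describing the standard DDPM-style objective: the network is usually trained to predict the noise ε that was added, and the cleaner image then falls out algebraically. A dependency-free sketch under those assumptions, with a hypothetical "perfect" model standing in for the real network:

```python
import math
import random

random.seed(0)

# A clean training "image": a flattened toy 4x4 grid of pixel values.
x0 = [random.random() for _ in range(16)]

# Forward process: blend the clean image with Gaussian noise.
alpha_bar = 0.6  # how much of the original signal survives at this step
eps = [random.gauss(0, 1) for _ in range(16)]  # the noise that was added
x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1 - alpha_bar) * e
       for a, e in zip(x0, eps)]

# Epsilon-prediction: the network learns to output the noise eps itself,
# not the less-noisy image. Simulate a perfect prediction here.
eps_pred = list(eps)
loss = sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)  # MSE

# With the noise predicted, the clean image is recovered algebraically:
x0_rec = [(xt - math.sqrt(1 - alpha_bar) * p) / math.sqrt(alpha_bar)
          for xt, p in zip(x_t, eps_pred)]
```

Predicting ε and predicting the denoised image are related by exactly this algebra, which is part of why the noise-prediction parameterization trains more stably in practice.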
Thanks for teaching Mr Luis! I still remember fondly you teaching me machine learning basics over drinks in SF
Thanks Jon!!! Great to hear from you! How’s it going?
Amazing as always!
Hi Luis. Your videos are very informative and I love them. Thank you so much for sharing your knowledge with us.
I wanted to know if "Fourier Transforms in AI" is in your pipeline. Could you please give some intuition around that in a video? Thanks in advance.
Thanks for the suggestion! It's definitely a great idea. In the meantime, 3blue1brown has great videos on Fourier transformations, take a look!
Hello Serrano, is there a paper for Stable Diffusion like "Attention Is All You Need"?
Good question, I'm not fully sure. There's this, but I'm not 100% sure if it's the original: stability.ai/news/stable-diffusion-public-release
I always use this explanation as a reference; there may be some good leads there: jalammar.github.io/illustrated-stable-diffusion/
thanks @@SerranoAcademy 🙂
(At 17:25), in the image on the right, the baseball and bat should have 3 gray squares, right? Very nice channel, I just subscribed.
Thank you! Yes, the ball and bat should be three gray or black squares. Since these images are not so exact, there could also be dark gray or some variations.
Finally the diffusion penny dropped for me, many thanks
Thank you so much!!!
Thanks ❤
This is wonderful…
Perhaps the best low-level description of the diffusion process I’ve seen….
But discrete images of bats and balls, represented as single pixels, are a long way away from a PHOTO-REALISTIC pirate standing on a ship at sunrise.
What I can't get my head around is how these discrete images (which actually exist in the multi-dimensional dataset space) are combined, really grafted together (parts pulled from each existing image), into a single image with correct composition, scaling, coloring, shadows, etc.
Even if I lay specifically chosen (by the NN) bat and ball pictures over each other to produce a "fuzzy" combined image (composition), and then use another NN to sharpen the fuzzy image into a crisp composition with all the attributes defined in the prompt and pointed to by the embeddings, there's still too much magic inside the DIFFUSION black box which I just don't understand, even after understanding the denoising and self-attention processes.
I guess what I have not been able to determine, after watching maybe 30-35 hours of diffusion videos, is specifically how the black box COMPOSES a complicated scene BEFORE the process begins that "tightens" the image up by removing noise between the given and target in successive passes of the decoder.
I get the fact (one) that the prompts correspond to embeddings, and the embeddings point to some point in multi-dimensional space which contains all sorts of related info and perhaps a close image representation of the prompted request….. or perhaps not.
I get the fact (two) that the diffusion process is able to generate virtually any complicated scene starting from random noise when gently persuaded to a target by the prompt….
What I don’t understand is how the black box builds a complicated FUZZY image once the various “parts” of the composition are identified.
Does the composing process start with a single image if available in the dataset and scale individual attributes to correspond with the prompt…?
-or-
Does the composing process start with segmented attributes, scale all appropriately, and combine into a single image…?
A closer look at how the scene COMPOSITION works would be a great addition to your very helpful library of vids, thnx.
Ok… for those with the same “problem…”
The missing part, at least for me, is the “classifier” portion of the model which I have NOT seen explained in the high-level Diffusion explanation vids.
This tripped me up…
Here is a good vid and corresponding paper which help in understanding the "feature" set extraction within the image convolution process, which ultimately creates an "area/segment-aware" dataset (image) that can be directed to include the visual requirements described in a text prompt.
th-cam.com/video/N15mjfAEPqw/w-d-xo.htmlsi=6sZxibtFvjrVNHeE
In a nutshell… the features extracted from each image are MUCH more descriptive than I had pictured, allowing for much better interpolation, composition, and reconstruction of multiple complex forms in each image.
Of course the cues to build these complex images all happen as the model interpolates its learned data, converging on the visual representation of the text prompt somewhere in the multi-dimensional space which we cannot comprehend… so in a sense it's still all a black box.
I don't pretend to understand it all… but it does give the gist of how certain abstract features within the model's convolutional layers blow themselves up into full-blown objects.
Another good short vid which shows how diffusion accomplishes image COMPOSITION:
th-cam.com/video/xtlxCz349WU/w-d-xo.htmlsi=PJl_vWueiQdZxLn1
Another good vid which gets into composition:
th-cam.com/video/3b7kMvrPZX8/w-d-xo.htmlsi=AwNQJAjABKn-iV4F
Another good set of vids which get into IMAGE COMPOSITION:
th-cam.com/video/vyfq3SgXQyU/w-d-xo.htmlsi=ShiOXaQH_0baU8Z-
Especially helpful is the last vid, URL posted above.
thank you
🙏