Can you explain how the (text) guidance with CLIP works? I can't find any information other than that CLIP is used to influence the UNet during training through attention layers (as also indicated by the "famous" LDM figure). However, it is not mentioned how the CLIP embeddings are aligned with the latents used by the UNet or VAE. I suppose it must be involved in the training process somehow? Otherwise the embeddings could not be compatible, right?
@@christopherhornle4513 Hi Christopher! I'm preparing a video on how to code Stable Diffusion from zero, without using any external library except for PyTorch. I'll explain how the UNet works and how CLIP works (with Classifier Guidance and Classifier-Free Guidance). I'll also explain advanced topics like score-based generative models and k-diffusion. The math is very hard, but I'll try to explain the concepts behind the math rather than the proofs, so that people who have little or no math background can understand what's going on even if they don't understand every detail. Since time is limited and the topic is vast, it will take me some more time before the video is ready. Please stay tuned!
@@umarjamilai That sounds awesome, thank you! I know pretty much how CLIP and the UNet work independently of each other. Cross attention is also clear. I am just wondering how the text embeddings are compatible with the UNet if they come from a separate model (CLIP). I guess the UNet is trained by feeding in CLIP text via attention to reproduce CLIP images (frozen VAE). Just strange that it's not mentioned in the places I looked.
@@christopherhornle4513 Let me simplify it for you: the UNet is a model that is trained to predict the noise added to a noisy image at a particular time of a time schedule, so given X + Noise and the time step T, the UNet has to predict the Noise (from which X can be recovered). During training, we not only provide X + Noise and T, but we also provide the CLIP embeddings (that is, the embeddings of the caption associated with the image), so when training we provide the UNet with X (image) + Noise + T (time step) + CLIP_EMBEDDINGS (embeddings of the image's caption). When T = 1000, the image is completely noisy, i.e. pure Gaussian noise. When you sample (generate an image), you start from complete noise (T = 1000). Since it is complete noise, the model could output any image when denoising, because it doesn't know which image the noise corresponds to. To "guide" the denoising process, the UNet needs some "help", and that guidance is your prompt. Since CLIP's text embeddings capture the meaning of the caption, if the model was trained on an image captioned "red car with man driving" and you use the prompt "red car with woman driving", CLIP's embeddings will tell the UNet how to denoise the image so as to produce something close to your prompt. So, summarizing: the UNet and CLIP are connected because CLIP's embeddings (extracted by encoding the caption associated with the image being trained upon) are fed into the UNet's layers (via cross attention) during training, and the CLIP embeddings of your prompt are used as input to the UNet to help it denoise when generating the image. I hope this clarifies the process. In my next video, which hopefully will come within two weeks, I'll explain everything in detail.
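To make the training part of that explanation concrete, here is a minimal PyTorch sketch of one training step of a text-conditioned diffusion model. It is not the code from the linked repo: `unet`, `text_encoder`, `alphas_cumprod`, and the tensor shapes are placeholders I'm assuming for illustration, but the structure (noisy image + timestep + caption embeddings in, predicted noise out) is what the comment describes.

```python
import torch
import torch.nn.functional as F

def training_step(unet, text_encoder, images, caption_tokens,
                  alphas_cumprod, num_train_timesteps=1000):
    """One hypothetical training step: predict the noise added to `images`,
    conditioned on the CLIP embeddings of their captions."""
    batch = images.shape[0]

    # 1) Encode the caption with the (frozen) text encoder, e.g. CLIP's text tower.
    with torch.no_grad():
        text_emb = text_encoder(caption_tokens)          # (batch, seq_len, dim)

    # 2) Pick a random timestep T and add the corresponding amount of noise.
    t = torch.randint(0, num_train_timesteps, (batch,), device=images.device)
    noise = torch.randn_like(images)
    a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)       # cumulative noise schedule
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise

    # 3) The UNet sees (noisy image, T, text embeddings) and predicts the noise.
    noise_pred = unet(noisy, t, text_emb)

    # 4) Simple MSE between predicted and true noise.
    return F.mse_loss(noise_pred, noise)
```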
Thank you very much!! Now I understand: during training the UNet learns to predict the noise given the text embedding (plus the timestep and other conditioning, if provided). So it learns which (text) embeddings are associated with specific features of the images and with the noise predictions for those images. During sampling we start with pure noise (no encoded image) and provide an embedding; the model uses it as guidance to denoise along the features it has learned to associate with that embedding.
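And for the sampling side, here is a rough sketch of a DDPM-style reverse loop with the classifier-free guidance mentioned above. Again, `unet`, `prompt_emb`, `empty_emb`, the schedules, and the guidance scale are assumed placeholders, not the actual API of the shared code.

```python
import torch

@torch.no_grad()
def sample(unet, prompt_emb, empty_emb, alphas, alphas_cumprod,
           shape=(1, 3, 64, 64), num_steps=1000, guidance_scale=7.5):
    """Hypothetical sampling loop: start from pure noise and denoise step by
    step, guided by the CLIP embedding of the prompt."""
    device = prompt_emb.device
    x = torch.randn(shape, device=device)                  # pure noise at T = 1000

    for t in reversed(range(num_steps)):
        ts = torch.full((shape[0],), t, device=device, dtype=torch.long)

        # Classifier-free guidance: predict the noise with and without the
        # prompt, then push the prediction towards the conditional one.
        eps_cond = unet(x, ts, prompt_emb)
        eps_uncond = unet(x, ts, empty_emb)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # Standard DDPM update: remove the predicted noise for this step.
        a_t, a_bar_t = alphas[t], alphas_cumprod[t]
        x = (x - (1 - a_t) / (1 - a_bar_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            x = x + (1 - a_t).sqrt() * torch.randn_like(x)  # re-inject noise

    return x
```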
Full code and PDF slides available at: github.com/hkproj/pytorch-ddpm
This is a great conceptual breakdown of diffusion models, thank you!
So cool! Thank you for your explanations
I would love a video of you breaking down the math :)
Hi! A new video is coming soon :) stay tuned!
Thank you for the clear explanation!
Dear Sir, please make a video with a detailed explanation of the diffusion model code. It will be helpful. Thanks for understanding, and for the valuable video.
Great!!!!
I can only say: the most genius guy ever
I came here from your VAE video. After that, should I be doing the 5-hour-long Stable Diffusion one or this one? What do you suggest?
I watched the 5-hour one first, then came to this. Now I would say I know how to train the model, thanks to Umar.
Can you do code for inpainting in a diffusion model, please?
If you can implement an example in the next tutorial, like you made for the Transformer, it would be great! 😊
You can start by browsing the code I've shared, because it's a full working code to train a diffusion model. I'll try to make a video explaining each line of code as well
@@umarjamilai Okay, thanks guys!
Can you do a BERT coding video?
Thanks for the suggestion, I'll try my best
Hi! My new video on BERT is out: th-cam.com/video/90mGPxR2GgY/w-d-xo.html