*Github Implementation* : github.com/explainingai-code/StableDiffusion-PyTorch
*Stable Diffusion Part II* : th-cam.com/video/hEJjg7VUA8g/w-d-xo.html
*DDPM Implementation Video* : th-cam.com/video/vu6eKteJWew/w-d-xo.html
*Diffusion Models Math Explanation* : th-cam.com/video/H45lF4sUgiE/w-d-xo.html
Hands down the best underrated channel right now. I hope you get millions of subscribers soon. The amount of effort and the level of detail all in one video is amazing! Thank you!
Thank you so much for saying this. As long as the channel consistently improves the knowledge of its viewers, I am super happy :)
Bro explained VQ-VAE in full in the meantime. What an absolute legend!
Wow! I can't thank you enough for this. Your explanations are a game-changer.
Thank you for the appreciation :)
Fantastic video! You’re undoubtedly on your way to becoming one of the top lecturers in Generative AI. I’m excited to see more of your work in the future!
Thank you so much for your words of encouragement and support :)
Don't have enough words to describe it. This is presented and explained so beautifully. Thanks, Legend.
Thank you for these kind words Vikram :)
This is insanely underrated.
best tutorial i have ever seen
Thank you :)
this was such an amazing end to end tutorial. thank you so much!
Happy that it was helpful to you :)
Amazing explanation and Implementation. Thanks you so much.
Thank you!
Amazing channel to understand the basic and the advanced concepts, keep up the good work bro!!!
Thank you for the appreciation :)
It was great! Good luck!
Nice one, can't wait for the conditional diffusion model.
Thank you!
keep it up, these videos of yours are very helpful! 🔥
Thank You :)
Just like compounding alpha_t, consider this a compounding thank you. This series is just awesome 🔥🔥🔥. Hope you create more tutorials, Tushar.
Thank you so much for this appreciation. Am truly humbled.
If you look at the latent space images at 37:26, you can't believe the decoder can regenerate the original image from them, as there is simply a lot of missing information. Any explanation of how it does it? At first I thought this was due to original image information leaking through skip-connections between the down and up blocks, but we are not using those in the auto-encoder.
Thank you for such a wonderful explanation.😃
Really Glad it was helpful!
Great work once again! Thanks for the great explanation :)
@himanshurai6481 Thank you so much for the appreciation :)
A denoising-papers channel, I could say! Thanks a lot TK 🙏🙏
Thank you! Though I am currently working on expanding the library to have other topics (other than diffusion) :) Let's see how that goes.
Hello, in the encoder part you said "since the image we deal with has a higher resolution, we don't need self-attention". I don't understand this; can you explain further why we abandon self-attention in the down block and up block?
I am assuming you are referring to the autoencoder part? If yes, then it's not that we don't 'need' self-attention; rather, we are deciding not to use it, and the only reason for not using attention in the downblock (and upblock) is to reduce compute cost (memory as well as time).
During the downblocks we would be working with large image sizes (for example 512x512), which means the self-attention computation (which grows quadratically with the number of spatial tokens) would be very costly (leading to out-of-memory issues), hence I just avoid it. That is why I use attention only in the midblocks, once the spatial size is reduced to a smaller value like 32x32.
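To make the placement concrete, here is a minimal sketch (illustrative, not the exact repo code) of a down block without attention and a mid block that applies self-attention only once the feature map is small:

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Down/up blocks work at large resolutions (e.g. 512x512, 256x256),
    so only convolutions are used here - no self-attention."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.GroupNorm(8, in_ch), nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        self.down = nn.Conv2d(out_ch, out_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        return self.down(self.conv(x))

class MidBlock(nn.Module):
    """Mid block runs at a small spatial size (e.g. 32x32 -> 1024 tokens),
    where the quadratic cost of self-attention is affordable."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.norm(x).reshape(b, c, h * w).transpose(1, 2)   # B x HW x C
        out, _ = self.attn(tokens, tokens, tokens)
        return x + out.transpose(1, 2).reshape(b, c, h, w)           # residual

feat = torch.randn(1, 64, 128, 128)      # large feature map: convolutions only
feat = DownBlock(64, 128)(feat)          # -> 1 x 128 x 64 x 64
feat = DownBlock(128, 128)(feat)         # -> 1 x 128 x 32 x 32
feat = MidBlock(128)(feat)               # attention only at the 32x32 stage
print(feat.shape)
```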
@@Explaining-AI Thank you. This totally solves my doubt.
Great video. Very well explained
why do we do retain_graph=True in the VAE training script?
Actually, here it's not needed. Earlier I was passing fake to the discriminator rather than fake.detach() in Line 70, in which case it was needed. Then I added detach() but forgot to remove the retain_graph. I don't think it's needed anymore. Thank you for this!
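For reference, here is a minimal sketch of the two updates once fake.detach() is used (stand-in modules so it runs, not the exact repo code); the detach cuts the graph going back into the autoencoder, which is why retain_graph is no longer needed:

```python
import torch
import torch.nn as nn

# Stand-in modules just so the sketch runs; the real models are the VAE/VQVAE
# and the patch discriminator from the video.
autoencoder = nn.Conv2d(3, 3, kernel_size=3, padding=1)
discriminator = nn.Conv2d(3, 1, kernel_size=4, stride=2, padding=1)
mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

real = torch.randn(2, 3, 64, 64)
fake = autoencoder(real)                              # "reconstruction"

# Autoencoder / generator step: reconstruction loss + fooling the discriminator.
pred_fake = discriminator(fake)
g_loss = mse(fake, real) + bce(pred_fake, torch.ones_like(pred_fake))
opt_g.zero_grad()
g_loss.backward()                                     # no retain_graph needed
opt_g.step()

# Discriminator step: fake.detach() means this backward pass never needs the
# autoencoder's graph, so retain_graph=True is unnecessary.
pred_real = discriminator(real)
pred_fake = discriminator(fake.detach())
d_loss = 0.5 * (bce(pred_real, torch.ones_like(pred_real)) +
                bce(pred_fake, torch.zeros_like(pred_fake)))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()
```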
@@Explaining-AI Okay, great to know. I was able to 3x the batch size with retain_graph=False.
Perfect. Will soon make the change in the repo as well. Thank you!
Awesome, I just have one question: why not use only the adversarial loss? (Or maybe first train the VAE using normal L1/L2 and then fine-tune it using the adversarial loss.) I feel like using all of them at once would just end up consuming more resources.
Or would it cause the VAE to pretty much inherit all the problems GANs have (mode collapse, vanishing gradients, etc.)?
Thank you. So at the start they aren't all used at once.
I set the disc_step_start parameter @34:44 (the step at which to start adding the adversarial loss) to a value such that by that step the autoencoder has already learnt to reconstruct well, and then the adversarial loss component is added to improve the quality of those reconstructions. So technically, like you said, we are kind of first training the VAE and then fine-tuning using the adversarial loss (except we keep the autoencoder loss even after the adversarial loss kicks in, so as not to lose the latent space properties from the vqvae/vae that we have learnt).
And besides, since the discriminator is just a 3-4 conv layer model, the additional computation from adding the adversarial loss is not that significant.
Regarding the point of using just the adversarial loss, I am not sure as I haven't experimented with it, but I would suspect that while it should work, it would bring up the same problems that GAN training has, like you mentioned.
By training with a combination of both the autoencoder and adversarial losses, we get the increased sample quality (like without blurriness) and keep the training more stable (because of first training using the vae/vqvae losses).
In fact, when I was going through issues in the latent-diffusion and vqgan repos, I saw a couple where folks mentioned that if disc_start is set to a lower value or 0, the model does not converge well. Empirically they found that adding the adversarial loss ONLY after the autoencoder reconstructions are decent leads to better results.
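As a rough sketch of how that scheduling looks inside the training loop (stand-in modules and illustrative numbers so it runs; not the exact repo code):

```python
import torch
import torch.nn as nn

# Tiny stand-ins so the sketch runs; the real models are the VQVAE/VAE and
# the small 3-4 layer patch discriminator from the video.
autoencoder = nn.Conv2d(3, 3, kernel_size=3, padding=1)
discriminator = nn.Conv2d(3, 1, kernel_size=4, stride=2, padding=1)
recon_criterion = nn.MSELoss()
adv_criterion = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)

disc_step_start = 5      # illustrative; in the actual training this is thousands of steps
adv_weight = 0.5         # illustrative weighting of the adversarial term

for step in range(10):
    images = torch.randn(2, 3, 64, 64)               # stand-in batch
    recon = autoencoder(images)

    # The reconstruction loss is always kept, even after the GAN loss kicks in,
    # so the latent-space properties learnt by the VAE/VQVAE are not lost.
    loss = recon_criterion(recon, images)

    # The adversarial component is only added once reconstructions are decent.
    # (The discriminator's own update, not shown here, also starts at this step.)
    if step >= disc_step_start:
        pred = discriminator(recon)
        loss = loss + adv_weight * adv_criterion(pred, torch.ones_like(pred))

    opt.zero_grad()
    loss.backward()
    opt.step()
```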
@@Explaining-AI Very interesting, I will be looking forward to more videos by you, they're very helpful and well made.
Hi Tushar... Amazing lecture, thank you... ❤❤❤ Lots of love for this video... looking forward to more advanced content in text-to-image generation... I am researching making some combo of GANs and existing SD model weights such that we can generate a low-quality image or latent on low-end devices like a CPU and then upscale the low-res image using a GAN... Trying to implement something like MobileDiffusion... Please suggest some good resources to move forward 😅
Hi Divya,
Thank you so much for the kind words :)
And yes, the latest developments in the space of combining diffusion and GANs are indeed very exciting.
Unfortunately, I have not gone deep into this (yet), but to me it feels like MobileDiffusion is UFOGen + distillation plus a whole lot of architectural changes (this is again from just a cursory look at the MobileDiffusion paper).
And I think in UFOGen the authors initialize the GAN with the SD model weights.
I am assuming you might already know about that but if you don't, maybe give that paper a read and use that as a starting point?
Sorry for not being of much help on this. Once I have gone through a few papers on this myself, maybe after that I can be of better help.
Thank you for the tutorial, it was really helpful! I'm wondering about the values of the discriminator loss for the first few epochs. I'm trying to test this architecture on another dataset and the discriminator loss is high and stable from the first epoch.
Hello, the suggestion that I have seen in discussions over the web is that invoking the discriminator (and the discriminator loss) should be delayed till the autoencoder reconstructs the best images that it can (they will still be blurry).
Only then should we add the discriminator loss component to the training.
Here is an issue which provides some discussion on discriminator loss which I think might be helpful for you. github.com/CompVis/taming-transformers/issues/93
@@Explaining-AI thank you !
Thanks a lot! Can you explain why you didn't use SA in the AE implementation? You mentioned that it's related to the images' high resolution, why? And why do you use SA in the DDPM implementation?
Primarily just to save on computational cost (both time and memory).
Say our feature map is CxNxN. Inside the self-attention layer, the autoencoder would compute attention between every pair of feature map cells; with N^2 cells that is on the order of N^4 operations, i.e. quadratic in the number of spatial tokens. The main reason I did not try attention on any layer of the autoencoder is to save on this computation cost, as typically N would be 512/256/128 for the autoencoder blocks. But in the diffusion model part, since we are working with encoder outputs where N is smaller (32/16/8), this cost is much lower and hence SA can be added to all diffusion model blocks.
The official implementation usually has attention in the last or last two blocks of the autoencoder as well (github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/models/first_stage_models/vq-f8/config.yaml#L21). But there are also configurations which do not have attention at all, so I tried without SA in the autoencoder and the results were fine, so I kept it that way. Let me know if this clarifies things for you or if you have any further doubts.
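To put rough numbers on that quadratic growth, a tiny back-of-the-envelope calculation (not repo code) of the attention matrix size per head in float32 at different resolutions:

```python
# Rough size of the pairwise attention matrix at different spatial resolutions
# (one head, float32, ignoring activations, projections and the batch dimension).
for n in (256, 128, 32, 16):
    tokens = n * n                        # one token per spatial position
    entries = tokens ** 2                 # full pairwise attention matrix
    print(f"{n}x{n} -> {tokens} tokens -> ~{entries * 4 / 1e9:.4f} GB per head")
```

At 256x256 this is already ~17 GB per head, while at 32x32 it is a few megabytes, which is why attention is kept for the small latent resolutions.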
@@Explaining-AI thanks!
Another question: you said that there are some differences between your implementation and the official one; can you mention the important ones?
Apart from the efficiency-related ones that you mentioned.
@@nadavpotasman I am sorry but I don't remember all the differences. However, I took another look at my code and the official implementation, and these are the differences that I could find.
1. The official repo has LayerNorm in the attention blocks, and after cross-attention they also have an MLP layer.
github.com/CompVis/latent-diffusion/blob/main/ldm/modules/attention.py#L203
I use GroupNorm everywhere and I don't have any MLP after cross-attention.
2. Upsampling:
For upsampling, the official implementation uses interpolation followed by a convolution (github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/ldm/modules/diffusionmodules/openaimodel.py#L116) whereas I use a transposed convolution.
3. Different numbers of attention heads, batch size and channels in each block (github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/configs/latent-diffusion/celebahq-ldm-vq-4.yaml#L36C9-L36C26)
That's all that I could find as of now.
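For point 2, a quick runnable comparison of the two upsampling choices (illustrative shapes and kernel sizes, not the exact code of either repo):

```python
import torch
import torch.nn as nn

channels = 64
x = torch.randn(1, channels, 16, 16)

# Interpolation followed by a 3x3 convolution (the official repo's approach).
interp_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
y1 = interp_conv(nn.functional.interpolate(x, scale_factor=2, mode="nearest"))

# Transposed convolution (the approach used in the video's implementation).
transposed_up = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)
y2 = transposed_up(x)

print(y1.shape, y2.shape)   # both: torch.Size([1, 64, 32, 32])
```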
Can we train our dataset with vq-gan (without transformer) and then use it in train_ddpm_vqve?
Hello, I might be misunderstanding your question, so do let me know if that's the case.
But for stable diffusion we don't need the transformer.
In the repo (github.com/explainingai-code/StableDiffusion-PyTorch?tab=readme-ov-file#training), what you are mentioning is exactly what I implemented.
Train VQVAE + perceptual loss + discriminator (the same as VQ-GAN without the transformer, which is only needed for generating new latent images) on a dataset.
Once the auto-encoder part of the VQGAN is trained, we save the latent representations of all the training images using the trained encoder of the VQGAN.
Finally, use these latent representations to train the LDM. We don't need to train the transformer, as that is for generating new latent images, for which we are using DDPM.
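For the "save the latent representations" step, a minimal sketch (stand-in encoder and data so it runs; the repo has its own script and model for this):

```python
import os
import torch
import torch.nn as nn

# Stand-ins so the sketch runs; in practice `encoder` is the trained, frozen
# VQGAN/VQVAE encoder and the loop runs over the real training DataLoader.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)        # 256x256 -> 32x32 "latent"
fake_loader = [torch.randn(2, 3, 256, 256) for _ in range(3)]

os.makedirs("latents", exist_ok=True)
encoder.eval()
with torch.no_grad():                                      # encoder stays frozen
    for idx, images in enumerate(fake_loader):
        latents = encoder(images)                          # B x 4 x 32 x 32
        torch.save(latents.cpu(), f"latents/batch_{idx:05d}.pt")

# The LDM training stage then loads these cached latents instead of images.
```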
@@Explaining-AI Thank you so much for your kind reply. Yeah, you are right.
Thank you for such a detailed video. It cleared up so many things.
Can anyone point out the structural steps for implementing LDM, like if I am doing it in Jupyter? 1st step, second step and so on......
1: dataset
2: VAE or noise scheduler
3: Diff Model
and so on?????
Glad you found the video helpful.
Regarding the different steps, what you mentioned is pretty much the correct order. Below I have added the steps in more detail.
1. Create the dataset class
2. Build the VAE/VQVAE model
3. Implement the training script for the VAE and train it
4. Create the diffusion model
5. Implement the diffusion noise scheduler (a minimal sketch of this step is added after this list)
6. Write the code for diffusion model training using the trained VAE model and the noise scheduler
7. Implement the sampling script using the trained diffusion model (as well as the trained VAE) and the noise scheduler
Do let me know if you need specific details on any step or if you face any issues in implementation.
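For step 5, here is a minimal sketch of a DDPM-style linear noise scheduler (forward/noising part only, illustrative names; the actual scheduler in the repo also implements the reverse sampling step):

```python
import torch

class LinearNoiseScheduler:
    """Minimal DDPM-style scheduler: only the forward (noising) step."""
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cum_prod = torch.cumprod(self.alphas, dim=0)

    def add_noise(self, original, noise, t):
        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
        sqrt_acp = self.alpha_cum_prod[t].sqrt().view(-1, 1, 1, 1)
        sqrt_one_minus = (1.0 - self.alpha_cum_prod[t]).sqrt().view(-1, 1, 1, 1)
        return sqrt_acp * original + sqrt_one_minus * noise

scheduler = LinearNoiseScheduler()
latents = torch.randn(4, 4, 32, 32)                 # stand-in latent batch
noise = torch.randn_like(latents)
t = torch.randint(0, 1000, (4,))
noisy_latents = scheduler.add_noise(latents, noise, t)
```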
AWESOME!
Thank you for sharing.
What was the compute you used to train this? And how long did it take? Great video btw!
Thanks! For the diffusion model I used a single Nvidia V100, which took around 15 mins per epoch, and as far as I remember I trained for about 50 epochs (to get these outputs, and I stopped at that point; ideally one should train for much longer to get better quality outputs).
@@Explaining-AI Thank you for your prompt reply! I am building something similar for generating synthetic galaxy images to learn about LDMs. Your videos are a lifesaver.
@@Explaining-AI Can we also use Flash Attention instead of normal attention?
@@HardikBishnoi Yes, I have not used it myself, but I remember reading about some implementation where diffusers (Hugging Face) + FlashAttention gave a 3x speedup.
@@HardikBishnoi Glad these were helpful to you!
Does this video explain the training of the entire model in which I can enter the prompts to generate images?
Hello, this video is actually just for training an unconditional LDM. Conditional (text/class/mask) LDM is covered in the second part here - th-cam.com/video/hEJjg7VUA8g/w-d-xo.html
Thanks for the explanation. I have a question: you said the autoencoder takes the image in pixel space and generates the latent representation, and the decoder does the reverse. If we train the diffusion model, it takes a noise sample in the latent space and passes it through the UNet, which works as a noise predictor and removes the noise from the noisy latent-space images coming from the encoder side.
My question is: how and from where will it add noise to the latent space, and how will the UNet do the diffusion process to remove the noise, because you also said that we don't need the timestep information?
Kindly address these doubts if you can. Thanks for making a video on this topic!
Hello, the diffusion process here is the same as what happens in DDPM. The only difference is that rather than training the diffusion model directly on a dataset of images, we first train an autoencoder on our dataset and then train the diffusion model on the latent representations generated by that autoencoder.
So after the autoencoder is trained, we have this set of steps:
1. Take the dataset image (pixel space)
2. Encode it using the trained and frozen autoencoder (to get the latent image)
3. Sample noise and a timestep
4. Add the noise to the latent image (from step 2)
5. Train the UNet to take the noisy image from step 4 and predict the original noise (from step 3) that was added
Then at inference, generate random noise in the latent space, have the UNet denoise it iteratively from t=T to t=1, and after that feed the t=1 denoised latent image to the decoder of the autoencoder to get the generated image (pixel space). Do let me know if this does not clear things up and there is still some doubt.
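A compact sketch of one training iteration over those steps (stand-in modules so it runs; the real encoder is the frozen trained VAE/VQVAE and the real noise predictor is the time-conditioned UNet):

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)   # stand-in for the frozen VAE encoder
unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)     # stand-in for the (time-conditioned) UNet
opt = torch.optim.Adam(unet.parameters(), lr=1e-4)

# Pre-computed cumulative products of (1 - beta_t) for a linear schedule.
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)

images = torch.randn(2, 3, 256, 256)                 # Step 1: batch of dataset images
with torch.no_grad():
    latents = encoder(images)                        # Step 2: encode with the frozen encoder
noise = torch.randn_like(latents)                    # Step 3: sample noise ...
t = torch.randint(0, 1000, (latents.shape[0],))      # ... and timesteps

a = alpha_bar[t].view(-1, 1, 1, 1)                   # Step 4: x_t = sqrt(a_bar) x_0 + sqrt(1-a_bar) eps
noisy = a.sqrt() * latents + (1 - a).sqrt() * noise

pred = unet(noisy)                                   # Step 5: predict the noise that was added
loss = nn.functional.mse_loss(pred, noise)           # (the real UNet also receives t as input)
opt.zero_grad()
loss.backward()
opt.step()
```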
@@Explaining-AI Thanks, I have another question: why don't you keep the self-attention in the downsampling part of the encoder? How does it keep attention on the image features in the rest of the blocks?
@@paktv858 The encoder deals with very large image sizes (for example 256x256) compared to the LDM (32x32), which means the self-attention computation would be very costly, hence I just avoid it. If you really want to add it and experiment with that, I would still suggest trying it only in the last downblock layer (at 32x32 resolution). The official repo also does not add it for all variants - github.com/CompVis/stable-diffusion/blob/main/models/first_stage_models/vq-f4/config.yaml
Last question: here the self-attention block is used as a module, right? And inside the self-attention module it is used as a layer with a norm? @@Explaining-AI
@@paktv858 Yes, I just use normalization and PyTorch's MultiheadAttention module (this is for both self-attention and cross-attention).
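For reference, a minimal sketch of that pattern (illustrative, not the exact repo module): GroupNorm followed by nn.MultiheadAttention, acting as self-attention when no context is passed and as cross-attention when a context sequence (e.g. text embeddings) is given:

```python
import torch
import torch.nn as nn

class AttnBlock(nn.Module):
    """GroupNorm + nn.MultiheadAttention over flattened spatial tokens.
    With context=None it is self-attention, otherwise cross-attention."""
    def __init__(self, channels, context_dim=None, num_heads=4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True,
                                          kdim=context_dim, vdim=context_dim)

    def forward(self, x, context=None):
        b, c, h, w = x.shape
        q = self.norm(x).reshape(b, c, h * w).transpose(1, 2)   # B x HW x C
        kv = q if context is None else context                  # B x L x context_dim
        out, _ = self.attn(q, kv, kv)
        return x + out.transpose(1, 2).reshape(b, c, h, w)      # residual

feat = torch.randn(2, 128, 16, 16)
text_ctx = torch.randn(2, 77, 512)                  # e.g. text embeddings as context
self_out = AttnBlock(128)(feat)                     # self-attention
cross_out = AttnBlock(128, context_dim=512)(feat, text_ctx)   # cross-attention
```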
Thanks for the video, it is really great. I have fine-tuned a Stable Diffusion v1.5 model and now I am trying to build a Stable Diffusion model from scratch without using any pretrained ckpts and run it locally. So is it possible to train the model without using any pretrained checkpoint?
Hello, yes it's definitely possible. Though depending on your dataset and image resolution you might have to use a lot of compute time; and also, if your pre-trained checkpoint was trained on images similar to your task, then your generation results (without pre-training) would be of lower quality (than with pre-training).
Try a version with an AI voice for clarity?
Haven't given the AI voice option any thought until now, but as a viewer, was the clarity of the audio that bad for you? And for the entire video or some specific part?
@@Explaining-AI
It's just an idea......
😊👍👍
@@Explaining-AI
Maybe you could do a test with an American voice, maybe even female, to see how it impacts view count?