Stable Diffusion from Scratch in PyTorch | Unconditional Latent Diffusion Models

  • Published 19 Jun 2024
  • In this video, we'll cover everything from the building blocks of Stable Diffusion to its implementation in PyTorch, and see how to build and train Stable Diffusion from scratch.
    This is Part I of the tutorial, where I explain latent diffusion models, specifically unconditional latent diffusion models. We dive deep into what latent diffusion is, how it works, and which components and losses are used in training latent diffusion models, and then finally implement and train a latent diffusion model.
    The second part will cover conditional latent diffusion models, and we will transition to Stable Diffusion. The series will be a Stable Diffusion guide from scratch, and you will be able to code Stable Diffusion in PyTorch yourself by the end of it.
    ⏱️ Timestamps
    00:00 Intro
    00:56 Recap of Diffusion Models
    03:31 Why Latent Diffusion Models
    04:31 Introduction to Latent Diffusion Models
    05:55 Review of VAE: Variational Autoencoder
    06:29 Review of VQVAE: Vector Quantised Variational Autoencoder
    07:36 Issue with L2 Reconstruction Loss for Latent Diffusion Models
    08:50 Perceptual Loss
    13:44 LPIPS Implementation
    16:40 Adversarial Loss in Latent Diffusion Models
    19:38 Autoencoder Architecture in Latent Diffusion Models
    23:04 VAE Implementation vs VQVAE Implementation
    24:22 Autoencoder Implementation for Latent Diffusion Models
    32:23 Training the Autoencoder for Latent Diffusion Models
    32:37 Discriminator for Latent Diffusion Models
    36:33 Results of Autoencoder Training
    37:40 VQGAN = VQVAE + LPIPS + Discriminator
    38:26 Latent Diffusion Model Architecture
    39:37 Training the Diffusion Model in Latent Diffusion Models
    41:11 Latent Diffusion Model Results
    41:45 What's Next
    42:18 Outro
    Paper - tinyurl.com/exai-latent-diffus...
    Implementation - tinyurl.com/exai-stable-diffus...
    🔔 Subscribe :
    tinyurl.com/exai-channel-link
    📌 Keywords:
    #stablediffusion #stable_diffusion

Comments • 47

  • @Explaining-AI
    @Explaining-AI 4 months ago +1

    *Github Implementation* : github.com/explainingai-code/StableDiffusion-PyTorch
    *Stable Diffusion Part II* : th-cam.com/video/hEJjg7VUA8g/w-d-xo.html
    *DDPM Implementation Video* : th-cam.com/video/vu6eKteJWew/w-d-xo.html
    *Diffusion Models Math Explanation* : th-cam.com/video/H45lF4sUgiE/w-d-xo.html

  • @PoojaShetty-lf5rr
    @PoojaShetty-lf5rr 15 days ago

    Great video. Very well explained

  • @vikramsandu6054
    @vikramsandu6054 17 days ago

    Don't have enough words to describe it. This is presented and explained so beautifully. Thanks, Legend.

    • @Explaining-AI
      @Explaining-AI 16 days ago

      Thank you for these kind words Vikram :)

  • @winterknight1159
    @winterknight1159 3 months ago +2

    Hands down the best underrated channel right now. I hope you get millions of subscribers soon. The amount of effort and the level of detail all in one video is amazing! Thank you!

    • @Explaining-AI
      @Explaining-AI 3 months ago +1

      Thank you so much for saying this. As long as the channel consistently improves the knowledge of its viewers, I am super happy :)

  • @Janamejaya.Channegowda
    @Janamejaya.Channegowda 4 months ago

    Thank you for sharing.

  • @alexijohansen
    @alexijohansen 4 months ago

    AWESOME!

  • @bhai_ki_clips3648
    @bhai_ki_clips3648 2 months ago

    Amazing explanation and implementation. Thank you so much.

  • @ActualCode0
    @ActualCode0 4 months ago

    Nice one, can't wait for the conditional diffusion model.

  • @himanshurai6481
    @himanshurai6481 4 months ago

    Great work once again! Thanks for the great explanation :)

    • @Explaining-AI
      @Explaining-AI 4 months ago

      @himanshurai6481 Thank you so much for the appreciation :)

  • @AniketKumar-dl1ou
    @AniketKumar-dl1ou 4 months ago

    Thank you for such a wonderful explanation. 😃

    • @Explaining-AI
      @Explaining-AI 4 months ago

      Really glad it was helpful!

  • @chickenp7038
    @chickenp7038 4 months ago +2

    Best tutorial I have ever seen.

  • @computing_T
    @computing_T 3 months ago +1

    The denoising-papers channel, I could say! Thanks a lot TK 🙏🙏

    • @Explaining-AI
      @Explaining-AI 3 months ago +1

      Thank you! Though I am currently working on expanding the library to cover other topics (other than diffusion) :) Let's see how that goes.

  • @_divya_shakti
    @_divya_shakti 4 months ago +1

    Hi Tushar, amazing lecture, thank you ❤❤❤ Lots of love for this video. Looking forward to more advanced content on text-to-image generation. I am researching a combo of GANs and existing SD model weights so that we can generate a low-quality image or latent on low-end devices like CPUs and then upscale the low-res image using a GAN. Trying to implement something like MobileDiffusion. Please suggest some good resources to move forward 😅

    • @Explaining-AI
      @Explaining-AI 4 months ago +1

      Hi Divya,
      Thank you so much for the kind words :)
      And yes, the latest developments in the space of combining diffusion and GANs are indeed very exciting.
      Unfortunately, I have not gone deep into this (yet), but to me it feels like MobileDiffusion is UFOGen + distillation plus a whole lot of architectural changes (this is again from just a cursory look at the MobileDiffusion paper).
      And I think in UFOGen the authors initialize the GAN with the SD model weights only.
      I am assuming you might already know about that, but if you don't, maybe give that paper a read and use it as a starting point?
      Sorry for not being much help on this. Once I have gone through a few papers on this myself, maybe I can be of better help.

  • @Swatisd97
    @Swatisd97 a month ago

    Thanks for the video, it is really great. I have fine-tuned a Stable Diffusion v1.5 model and now I am trying to build a stable diffusion model from scratch without using any pretrained ckpts, running it locally. So is it possible to train the model without using any pretrained checkpoint?

    • @Explaining-AI
      @Explaining-AI a month ago

      Hello, yes, it's definitely possible. Though depending on your dataset and image resolution you might need a lot of compute time, and if a pre-trained checkpoint was trained on images similar to your task, your generation results without pre-training will be of lower quality than with pre-training.

  • @HardikBishnoi
    @HardikBishnoi 2 months ago

    What was the compute you used to train this? And how long did it take? Great video btw!

    • @Explaining-AI
      @Explaining-AI 2 months ago +1

      Thanks! For the diffusion model I used a single Nvidia V100, which took around 15 mins per epoch, and as far as I remember I trained for about 50 epochs (to get these outputs; I stopped at that point, but ideally one should train for much longer to get better quality outputs).

    • @HardikBishnoi
      @HardikBishnoi 2 months ago

      @@Explaining-AI Thank you for your prompt reply! I am building something similar for generating synthetic galaxy images to learn about LDMs. Your videos are a lifesaver.

    • @HardikBishnoi
      @HardikBishnoi 2 months ago

      @@Explaining-AI Can we also use Flash Attention instead of normal attention?

    • @Explaining-AI
      @Explaining-AI 2 months ago

      @@HardikBishnoi Yes. I have not used it myself, but I remember reading about an implementation where diffusers (Hugging Face) + FlashAttention gave a 3x speedup.
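
      As a side note on where such speedups come from: PyTorch 2.x ships a fused attention kernel, torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention implementation on supported GPUs. A minimal sketch (shapes here are illustrative, not from the repo) comparing it with naive attention:

      import torch
      import torch.nn.functional as F

      q = torch.randn(2, 8, 1024, 64)   # (batch, heads, tokens, head_dim)
      k = torch.randn(2, 8, 1024, 64)
      v = torch.randn(2, 8, 1024, 64)

      # Naive attention: materializes the full (tokens x tokens) score matrix.
      naive = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v

      # Fused kernel: same math, but avoids materializing the score matrix.
      fused = F.scaled_dot_product_attention(q, k, v)

      print(torch.allclose(naive, fused, atol=1e-5))  # True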

    • @Explaining-AI
      @Explaining-AI 2 months ago

      @@HardikBishnoi Glad these were helpful to you!

  • @jaszczurix7652
    @jaszczurix7652 3 months ago

    Does this video explain the training of the entire model in which I can enter the prompts to generate images?

    • @Explaining-AI
      @Explaining-AI 3 months ago

      Hello, this video is actually just about training an unconditional LDM. Conditional (text/class/mask) LDM is covered in the second part here - th-cam.com/video/hEJjg7VUA8g/w-d-xo.html

  • @chickenp7038
    @chickenp7038 4 months ago +2

    Why do we do retain_graph=True in the VAE training script?

    • @Explaining-AI
      @Explaining-AI 4 months ago +1

      Actually, it's not needed here. Earlier I was passing fake to the discriminator rather than fake.detach() on line 70, in which case it was needed. Then I added detach() but forgot to remove the retain_graph. I don't think it's needed anymore. Thank you for this!

    • @chickenp7038
      @chickenp7038 4 months ago +1

      @@Explaining-AI Okay, great to know. I was able to 3x the batch size with retain_graph=False.

    • @Explaining-AI
      @Explaining-AI 4 months ago +1

      Perfect. Will soon make the change in the repo as well. Thank you!
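
      To make the detach() point in this exchange concrete, here is a minimal runnable sketch of the two updates, with stand-in modules (the actual repo uses a full VAE and a PatchGAN discriminator; all names here are illustrative):

      import torch
      import torch.nn as nn

      vae = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))             # stand-in "VAE"
      disc = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1))  # stand-in discriminator
      bce = nn.BCEWithLogitsLoss()
      opt_g = torch.optim.Adam(vae.parameters(), lr=1e-4)
      opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

      images = torch.randn(2, 3, 64, 64)

      # Generator/autoencoder step: reconstruction loss + adversarial loss.
      recon = vae(images)
      fake_logits = disc(recon)
      g_loss = nn.functional.mse_loss(recon, images) \
               + 0.5 * bce(fake_logits, torch.ones_like(fake_logits))
      opt_g.zero_grad()
      g_loss.backward()                 # frees the autoencoder graph
      opt_g.step()

      # Discriminator step: recon.detach() cuts the link to the (already freed)
      # autoencoder graph, so backward() works without retain_graph=True.
      d_real = disc(images)
      d_fake = disc(recon.detach())
      d_loss = 0.5 * (bce(d_real, torch.ones_like(d_real))
                      + bce(d_fake, torch.zeros_like(d_fake)))
      opt_d.zero_grad()
      d_loss.backward()
      opt_d.step()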

  • @paktv858
    @paktv858 3 months ago

    Thanks for the explanation. I have a question: you said the autoencoder takes the image in pixel space and generates the latent representation, and the decoder does the reverse. When we train the diffusion model, it takes a noisy sample in the latent space and passes it through the U-Net, which works as a noise predictor and removes the noise from the noisy latent images coming from the encoder side.
    My question is: how and from where does it add noise to the latent space, and how does the U-Net do the diffusion process to remove the noise, given that you also said we don't need the timestep information?
    Kindly address these doubts. Thanks for making a video on this topic!

    • @Explaining-AI
      @Explaining-AI 3 months ago +1

      Hello, the diffusion process here is the same as in DDPM. The only difference is that rather than training diffusion on a dataset of images, we first train an autoencoder on our dataset and then train diffusion on the latent representations that autoencoder generates.
      So after the autoencoder is trained, we have this set of steps (see the sketch below):
      1. Take the dataset image (pixel space)
      2. Encode it using the trained and frozen autoencoder (to get the latent image)
      3. Sample noise and a timestep
      4. Add noise to the latent image (from step 2)
      5. Train the U-Net to take the noisy image from step 4 and predict the original noise (from step 3) that was added
      Then at inference, generate random noise in the latent space, have the U-Net denoise it iteratively from t=T to t=1, and then feed the t=1 denoised latent image to the decoder of the autoencoder to get the generated image (pixel space). Do let me know if this does not clear things up and there is still some doubt.
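
      A minimal runnable sketch of steps 1-5, using dummy stand-ins for the trained autoencoder and the U-Net (the real ones come from the video's repo; the noise schedule is the standard DDPM linear schedule):

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class DummyEncoder(nn.Module):          # stands in for the frozen, trained encoder
          def forward(self, x):
              return F.avg_pool2d(x, 8)       # pixel space 256x256 -> latent 32x32

      class DummyUnet(nn.Module):             # stands in for the time-conditioned U-Net
          def __init__(self):
              super().__init__()
              self.conv = nn.Conv2d(3, 3, 3, padding=1)
          def forward(self, x, t):            # t would normally be embedded and injected
              return self.conv(x)

      T = 1000
      betas = torch.linspace(1e-4, 0.02, T)
      alpha_bars = torch.cumprod(1.0 - betas, dim=0)

      encoder, unet = DummyEncoder(), DummyUnet()
      optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

      images = torch.randn(4, 3, 256, 256)            # 1. dataset images (pixel space)
      with torch.no_grad():
          latents = encoder(images)                   # 2. frozen encoder -> latent images
      noise = torch.randn_like(latents)               # 3. sample noise ...
      t = torch.randint(0, T, (latents.shape[0],))    #    ... and a timestep per sample
      ab = alpha_bars[t].view(-1, 1, 1, 1)
      noisy = ab.sqrt() * latents + (1 - ab).sqrt() * noise   # 4. add noise to the latents
      loss = F.mse_loss(unet(noisy, t), noise)        # 5. predict the added noise
      optimizer.zero_grad(); loss.backward(); optimizer.step()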

    • @paktv858
      @paktv858 3 months ago

      @@Explaining-AI Thanks. I have another question: why don't you keep the self-attention in the downsampling of the encoder? How does it keep attention on the image features in the rest of the blocks?

    • @Explaining-AI
      @Explaining-AI 3 months ago +1

      @@paktv858 The encoder deals with very large image sizes (for example 256x256) compared to the LDM (32x32). Self-attention cost grows quadratically with the number of tokens, and 256x256 gives 65,536 tokens versus 1,024 at 32x32, so the computation would be very costly and I just avoid it. If you really want to add it and experiment, I would still suggest trying it only in the last down-block layer (at 32x32 resolution). The official repo also does not add it for all variants - github.com/CompVis/stable-diffusion/blob/main/models/first_stage_models/vq-f4/config.yaml

    • @paktv858
      @paktv858 3 months ago

      Last question: here the self-attention block is used as a module, right? And inside the self-attention module it is used as a layer with a Norm? @@Explaining-AI

    • @Explaining-AI
      @Explaining-AI 3 months ago

      @@paktv858 Yes, I just use normalization and PyTorch's MultiheadAttention module (this is for both self-attention and cross-attention).
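
      A minimal sketch of such a block, assuming group normalization plus nn.MultiheadAttention with a residual connection (channel counts and head counts here are illustrative, not the repo's exact values):

      import torch
      import torch.nn as nn

      class SelfAttentionBlock(nn.Module):
          def __init__(self, channels, num_heads=4, num_groups=8):
              super().__init__()
              self.norm = nn.GroupNorm(num_groups, channels)
              self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

          def forward(self, x):
              b, c, h, w = x.shape
              out = self.norm(x)
              out = out.reshape(b, c, h * w).transpose(1, 2)  # (B, HW, C) token sequence
              out, _ = self.attn(out, out, out)               # self-attention: q = k = v
              out = out.transpose(1, 2).reshape(b, c, h, w)
              return x + out                                  # residual connection

      x = torch.randn(1, 32, 16, 16)
      print(SelfAttentionBlock(32)(x).shape)  # torch.Size([1, 32, 16, 16])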

  • @MilesBellas
    @MilesBellas a month ago

    Try a version with an AI voice for clarity?

    • @Explaining-AI
      @Explaining-AI a month ago +1

      Haven't given the AI voice option any thought until now, but as a viewer, was the clarity of the audio that bad for you? The entire video, or some specific part?

    • @MilesBellas
      @MilesBellas a month ago

      @@Explaining-AI
      It's just an idea......
      😊👍👍

    • @MilesBellas
      @MilesBellas a month ago

      ​@@Explaining-AI
      Maybe you could do a test with an American voice, maybe even a female one, to see how it impacts view count?