Efficient Text-to-Image Training (16x cheaper than Stable Diffusion) | Paper Explained

  • Published on 13 Jan 2025

Comments • 79

  • @outliier
    @outliier  1 year ago +5

    Join our Discord for Generative AI: discord.com/invite/BTUAzb8vFY

    • @qiaozhaohui
      @qiaozhaohui 1 year ago +1

      Good job!

    • @NoahElRhandour
      @NoahElRhandour 1 year ago +1

      and this link is really not a virus?

  • @xiaolongye-y4g
    @xiaolongye-y4g 1 year ago +6

    You are definitely the most detailed and understandable person I have ever seen.

  • @ml-ok3xq
    @ml-ok3xq 11 months ago +8

    Congrats on stable cascade 🎉

  • @dbender
    @dbender 11 months ago +2

    Super nice video which explains the architecture behind Stable Cascade. Stage B was nicely visualized, but I still need a bit more time to fully grasp it. Well done!

  • @SaraKangazian
    @SaraKangazian months ago

    Thank you for your wonderful explanation. Yes, I am very interested in learning about diffusion models, especially text to image.

  • @jeanbedry3941
    @jeanbedry3941 1 year ago +2

    This is great; models that are intuitive to understand are the best ones, I find. Great job of explaining it as well.

  • @NedoAnimations
    @NedoAnimations 10 months ago +1

    Hi! If it's not a secret, where do you get the datasets for training text2img models? Great video!

  • @ratside9485
    @ratside9485 1 year ago +3

    We need more Würstchen! 🙏🍽️

  • @dbssus123
    @dbssus123 1 year ago +2

    awesoooom!!! I always wait for your videos

  • @EvanSpades
    @EvanSpades 8 months ago

    Love this - what a fantastic achievement!

  • @hayhay_to333
    @hayhay_to333 1 year ago +1

    Damn, you're so smart thanks for explaining this to us. I hope you'll make millions of dollars.

    • @outliier
      @outliier  1 year ago

      Haha thank you!

  • @e.galois4940
    @e.galois4940 1 year ago +3

    Tks very much

  • @mtolgacangoz
    @mtolgacangoz 8 months ago

    Brilliant work!

  • @omarei
    @omarei 1 year ago +2

    Awesome

  • @macbetabetamac8998
    @macbetabetamac8998 1 year ago +1

    Amazing work mate ! 🙏

  • @mik3lang3lo
    @mik3lang3lo 1 year ago +1

    Great job as always

  • @eswardivi
    @eswardivi 1 year ago +3

    Amazing work. I am wondering how this video was made, i.e. the editing process and the cool animations.

    • @outliier
      @outliier  1 year ago +3

      Thank you a lot! I edit all videos in Premiere Pro, and some of the animations, like the GPU-hour comparison between Stable Diffusion and Würstchen, were made with Manim (the library from 3Blue1Brown).
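
(For anyone curious how that kind of comparison chart is put together: a minimal sketch assuming Manim Community Edition. The scene name and values are illustrative placeholders; only the roughly 16x ratio from the video title is taken from the source, not actual GPU-hour figures.)

```python
from manim import Scene, BarChart, Text, UP

class TrainingCostComparison(Scene):
    """Hypothetical scene: relative training compute, Stable Diffusion vs. Würstchen."""

    def construct(self):
        # Placeholder values expressing the ~16x ratio from the video title,
        # not the actual GPU-hour numbers reported in the paper.
        chart = BarChart(
            values=[16, 1],
            bar_names=["Stable Diffusion", "Würstchen"],
            y_range=[0, 16, 4],
        )
        title = Text("Relative training compute").scale(0.6).to_edge(UP)
        self.add(title, chart)
```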

  • @flakky626
    @flakky626 11 months ago +1

    Can you please tell us where you studied all of ML/deep learning? (Courses?)

  • @adrienforbu5165
    @adrienforbu5165 1 year ago +1

    Amazing explanations, good job

  • @arpanpoudel
    @arpanpoudel 1 year ago +2

    thanks for the awesome content.

  • @leab.6600
    @leab.6600 1 year ago +2

    Super helpful

  • @jonmichaelgalindo
    @jonmichaelgalindo 1 year ago

    Awesome work and great insights! ❤

  • @TheAero
    @TheAero 1 year ago

    Why use a second encoder? Isn't that what the VQGAN is supposed to do?

    • @outliier
      @outliier  1 year ago +1

      Yes, but the VQGAN can only do a certain amount of spatial compression; beyond that it gets really bad. That's why we introduce a second one.

    • @TheAero
      @TheAero 1 year ago +1

      @@outliier So can we replace the GAN encoder with a better pre-trained encoder and reduce the expense of using 2 encoders instead of one? So fundamentally, start with a simple encoder, then replace it with a better pre-trained one and continue training, so that you also improve the decoder?

  • @NoahElRhandour
    @NoahElRhandour 1 year ago +3

    🔥🔥🔥

  • @factlogyofficial
    @factlogyofficial 1 year ago +1

    good job guys !!

  • @lookout816
    @lookout816 1 year ago +1

    Great video 👍👍

  • @mohammadaljumaa5427
    @mohammadaljumaa5427 1 year ago +1

    Amazing job, and I really love the idea of reducing the size of the models, since it just makes so much sense to me!! I have a small question: what GPUs did you use for training? Did you use a cloud provider, or do you have your own local workstation? If the latter, I'm interested to know which hardware components you have. Just curious, because I'm trying to decide between using cloud providers for training vs. buying a local workstation 😊

    • @outliier
      @outliier  1 year ago

      Hey there. We were using the stability cluster.

    • @outliier
      @outliier  1 year ago

      Local would be much more expensive, I guess. What GPUs are you thinking of buying/renting, and how many?

  • @xyzxyz324
    @xyzxyz324 1 year ago

    well explained, thank you!

  • @timeTegus
    @timeTegus 1 year ago +1

    I love the video :) and I would love more detail 😮😮😮😮

    • @outliier
      @outliier  1 year ago +1

      Noted. In the case of Würstchen, you can take a look at the paper: arxiv.org/abs/2306.00637

  • @jollokim1948
    @jollokim1948 1 year ago

    Hi Dominic,
    This is some great work you have accomplished, and definitely a step in the right direction for democratizing the diffusion method.
    I have some questions, and a little bit of critique, if that would be okay.
    You say you achieve a compression rate of 42x; however, is this a fair statement when that vector is never decompressed into an actual image?
    It looks more like your Stage C can create some sort of feature vectors of images in a very low-dimensional space using the text descriptions, which are then used to guide the actual image creation, along with the embedded text, in Stage B.
    In my opinion it looks more like you have used Stage C to learn a feature-vector representation of the image, which is used as a condition, similar to how language-free text-to-image models might use the image itself as guidance during training.
    However, I don't believe this to be a 42x image compression without the decompression. Have you tried connecting a decoder onto the vectors coming out of Stage C?
    (I would believe that vector might not be big enough to create a high-resolution image because of its dimensionality.)
    I hope you can answer some of my questions or clear up any misunderstandings on my part.
    I'm currently doing my thesis on fast diffusion models and found your concept of extreme compression very compelling. Directions on where to go next regarding this topic are also very much appreciated :)
    Best of luck with further research.
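
(For context on the 42x figure discussed above, a rough back-of-the-envelope sketch of how the compression factors compose, assuming the resolutions reported in the Würstchen paper: 1024x1024 pixel images, an f4 VQGAN, and 24x24 Stage C latents. It only illustrates the spatial ratio; as the comment notes, the Stage C latent is never decoded to pixels directly.)

```python
# Rough arithmetic behind the "42x" spatial-compression figure.
# Resolutions are taken as assumptions from the Würstchen paper.
pixel_side = 1024             # input image side length (1024x1024)
vqgan_side = pixel_side // 4  # Stage A/B latent side for an f4 VQGAN (256x256)
stage_c_side = 24             # Stage C latent side (24x24)

print(f"VQGAN spatial compression:   {pixel_side / vqgan_side:.0f}x per side")    # 4x
print(f"Stage C spatial compression: {pixel_side / stage_c_side:.1f}x per side")  # ~42.7x

# Stage C's latent conditions Stage B, which produces the VQGAN latent that
# Stage A then decodes back to pixels; Stage C itself is never decoded directly.
```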

  • @JT-hg7mj
    @JT-hg7mj 1 year ago

    Did you use the same dataset as SDXL?

  • @nexyboye5111
    @nexyboye5111 5 months ago

    good job guyz!

  • @davidyang102
    @davidyang102 1 year ago +1

    Why do you still do Stage A? Would it be possible to just do Stage B directly from the image? I assume the issue is that Stage A is cheaper than Stage B to train?

    • @outliier
      @outliier  1 year ago +1

      Yea, you can. We actually even tried that out. But it takes longer to learn, and as of now we didn't achieve quite the same results with a single compression stage. The VQGAN is just really neat and provides a free compression already, which simplifies things for Stage B a lot, I think. But definitely more experiments could be made here :D

    • @davidyang102
      @davidyang102 1 year ago +1

      ​@@outliier Really cool work. Is the use of diffusion models to compress data in this way a generic technique that can be used anywhere? For example could I use it to compress text?

    • @pablopernias
      @pablopernias 1 year ago

      @@davidyang102 The only issue with text is its discrete nature. If you're OK with having continuous latent representations for text instead of discrete tokens, then I think it could theoretically work, although we haven't properly tried it with anything other than RGB images. The important thing is having a powerful enough signal so the diffusion model can rely on it and only needs to fill in missing details instead of having to make a lot of information up.

  • @darrynrogers204
    @darrynrogers204 1 year ago

    I very much like the image you are using at the opening of the video: the glitchy 3D graph that looks like an image generation gone wrong. How was it generated? Was it intentional, or a bit of buggy code?

    • @outliier
      @outliier  1 year ago

      Hey, which glitchy 3d graph? Could you give the timestamp?

    • @darrynrogers204
      @darrynrogers204 1 year ago

      @@outliier The one at 0:01, right at the start. It says "outlier" at the bottom in mashed-up AI text. It's also the same image that you are using for your YouTube banner on your channel page.

  • @digiministrator
    @digiministrator 1 year ago

    Hello,
    How do I make a seamless pattern with Würstchen? I tried a few prompts, but the edges are always problematic.

    • @outliier
      @outliier  1 year ago

      Someone on the Discord was talking about circular padding on the convolutions. Maybe you can try that.
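
(A minimal sketch of that circular-padding trick in PyTorch, assuming the pipeline exposes its networks as nn.Module objects; the attribute names in the usage comment are guesses, not a documented API. The idea is simply to make every Conv2d wrap around at the borders so outputs tile more seamlessly.)

```python
import torch.nn as nn

def make_convs_circular(model: nn.Module) -> None:
    """Switch every Conv2d in `model` to circular padding so feature maps wrap
    around at the edges, which encourages tileable (seamless) outputs."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            module.padding_mode = "circular"

# Usage sketch (pipeline/attribute names are hypothetical, not a documented API):
# make_convs_circular(pipeline.decoder)  # e.g. the Stage B / VQGAN decoder
# make_convs_circular(pipeline.prior)    # and/or the Stage C prior
```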

  • @muhammadrezahaghiri
    @muhammadrezahaghiri 1 year ago

    That is a great project; I am excited to test it.
    Out of curiosity, how is it possible to fine-tune the model?

    • @outliier
      @outliier  1 year ago +1

      Hey, there is no official code for that yet. If you are interested, you can give it a shot yourself. With the diffusers release in the next few days, this should become much easier, I think.

    • @swannschilling474
      @swannschilling474 1 year ago +1

      This is very interesting!! 😊

  • @MiyawMiv
    @MiyawMiv 1 year ago +1

    awesome

  • @krisman2503
    @krisman2503 1 year ago

    Hey, does it recover from noise or from an encoded x_T during inference?

    • @outliier
      @outliier  1 year ago

      During inference you start from pure noise and begin denoising; after every denoising step, you noise the image again, then denoise again, then noise again, and so on.
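
(To make that denoise/re-noise loop concrete, here is a minimal generic sketch in PyTorch. It is not the exact Würstchen sampler; the model signature and the noise schedule are assumptions for illustration only.)

```python
import torch

@torch.no_grad()
def sample(model, shape, alphas_cumprod, device="cuda"):
    """Generic ancestral-style sampling: start from pure noise, estimate the clean
    image, re-noise it to the next (lower) noise level, and repeat until t = 0.
    `alphas_cumprod` is assumed to be a 1-D tensor of cumulative alphas."""
    T = len(alphas_cumprod)
    x = torch.randn(shape, device=device)                       # x_T: pure noise
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                 # predicted noise (assumed signature)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()          # denoised estimate
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * noise    # re-noise to level t-1
    return x
```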

  • @davidgruzman5750
    @davidgruzman5750 1 year ago

    Thank you for the explanations! I am a bit puzzled: why do we call the state in the inner layers of the AE "latent", since we can actually observe it?

    • @outliier
      @outliier  1 year ago

      which "state" are you referring to? The ones from Stage B?

    • @davidgruzman5750
      @davidgruzman5750 1 year ago

      @@outliier I am referring to the one you mention at 1:27 in the video. It is probably Stage A.

    • @outliier
      @outliier  1 year ago +1

      @@davidgruzman5750 Ah, got it. Well, you can observe it, but you can't really understand it, right? If you print or visualise the latents, they are not really meaningful. There are strategies to make them more meaningful, though. But just by themselves it's hard to understand them. That's what we usually call latents, I would say.

  • @saulcanoortiz7902
    @saulcanoortiz7902 1 year ago

    How do you create the dynamic videos of NNs? I want to create a YouTube channel explaining theory & code in Spanish. Best regards.

  • @KienLe-md9yv
    @KienLe-md9yv 8 months ago

    At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized to discrete latents (the discrete latents are also chosen from the codebook, by finding which codebook vector is nearest to each vector of the continuous latents). But the output of Stage B is continuous latents, and that output goes directly into Stage A... is that right? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and the Würstchen paper; it is not clear. Please help me with that. Thank you.

    • @outliier
      @outliier  8 months ago

      The VQGAN decoder can also decode continuous latents. It's as easy as that.
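
(For anyone stuck on the same point, a minimal sketch of the standard VQ lookup versus skipping it, assuming a plain codebook tensor and a convolutional decoder; variable names are illustrative, not the Würstchen API.)

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Standard VQGAN quantization: replace each latent vector in z (B, C, H, W)
    with its nearest codebook entry (codebook: (K, C))."""
    B, C, H, W = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, C)      # (B*H*W, C)
    idx = torch.cdist(flat, codebook).argmin(dim=1)  # nearest code per vector
    return codebook[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)

# During VQGAN training the decoder sees quantized latents:
#   image = decoder(quantize(encoder(image), codebook))
# But the decoder is just a stack of convolutions, so, as the reply above says,
# it can also be fed Stage B's continuous latents directly:
#   image = decoder(z_continuous)
```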

  • @jeffg4686
    @jeffg4686 9 months ago

    Nice !

  • @ChristProg
    @ChristProg 8 months ago

    Thank you so much. But please, I would prefer that you go into the maths and operations of training Würstchen in more detail 🎉🎉 thank you

  • @aiartbx
    @aiartbx 1 year ago +1

    Looks very interesting. Depending on how fast the generation is, real-time diffusion seems closer than expected.
    Btw, is there any Hugging Face Space demo where we can try this?

    • @outliier
      @outliier  1 year ago +1

      Hey thank you! The demo is available here: huggingface.co/spaces/warp-ai/Wuerstchen

  • @streamtabulous
    @streamtabulous 1 year ago

    What about decompression times? Are they faster, and would they use fewer resources on older systems?
    Curious if the models from this would benefit users; i.e., most still use the 1.5 and v2 models of SD because the decompression times of SDXL models take so long.

    • @outliier
      @outliier  1 year ago +1

      Hey, we have a comparison of inference times against SDXL in the blog post here: huggingface.co/blog/wuerstchen
      And I think the model should be comparable to SD1.X in terms of speed.

    • @streamtabulous
      @streamtabulous 1 year ago

      @@outliier I thought those were compression-only times, not decompression times; that's awesome to read.
      People like you are heroes to me.

    • @outliier
      @outliier  1 year ago +2

      @@streamtabulous Hey, those bar charts are for full sampling times, from feeding in the prompt until you receive the final image in pixel space. That is so kind of you, I appreciate it a lot. But people like Pablo, the HF team, and the other people helping us out are the real reason this was possible. And I promise this is only the start.

    • @streamtabulous
      @streamtabulous 1 year ago

      @@outliier The whole team are a godsend. Myself, I am on a disability pension (neuromuscular), so I can't afford to pay to use something like Adobe Firefly; it's a tick-off that they charge while using Stable Diffusion.
      Being disabled, I game, so I have a GTX 1070 and an RTX 3060 in another system.
      But one of the things I miss doing is art and helping people by restoring their photos for free. I have Stable Diffusion on my PCs, and I love that it lets me do stuff I could not do before, including photo restorations; it makes my life better, as it gives me joy doing that stuff.
      Knowing from work like yours and your team's that in the near future I will be able to do not just better art but better, faster, higher-quality photo restorations for people with my hardware means a massive amount to me.
      I'm doing a video tomorrow to help teach people how I use SD and models to help restore photos. I only found SD a few weeks ago, but I am working out how to use it in ways to help people with damaged old photos.

  • @KienLe-md9yv
    @KienLe-md9yv 8 months ago

    So, apparently, it sounds like Würstchen is essentially Stage C. Am I right?

    • @outliier
      @outliier  8 months ago

      What do you mean exactly?

  • @beecee793
    @beecee793 1 year ago

    If I need X time for inference with SD on a given example GPU, what do I need, and how fast would inference with this be in the same environment? Will it run on my toaster?

    • @outliier
      @outliier  1 year ago +1

      Hey. Take a look at the blog post. It has an inference-time bar chart: huggingface.co/blog/wuertschen

    • @beecee793
      @beecee793 1 year ago

      @@outliier Thank you
