Efficient Text-to-Image Training (16x cheaper than Stable Diffusion) | Paper Explained

  • Published on Sep 10, 2023
  • Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Other works usually use a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme: through its novel design, we achieve a 42x spatial compression. (A quick sketch of the size arithmetic is given just below this description.)
    If you want to dive even deeper into Würstchen, here are the links to the paper & code:
    Arxiv: arxiv.org/abs/2306.00637
    Huggingface: huggingface.co/docs/diffusers...
    Github: huggingface.co/dome272/wuerst...
    We also created a community Discord for people interested in Generative AI:
    discord.com/invite/BTUAzb8vFY
  • วิทยาศาสตร์และเทคโนโลยี
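
    To make the compression claim concrete, here is a small back-of-the-envelope sketch in Python. The 42x figure and the 4x - 8x range come from the description above; equating cost with the number of latent positions is a simplification for illustration, not the paper's exact cost model.

        # How many latent positions does a 1024x1024 image become?
        image_side = 1024
        for factor, label in [(8, "typical latent diffusion (top of the 4x-8x range)"),
                              (42, "Würstchen")]:
            latent_side = image_side // factor          # spatial compression
            ratio = (latent_side / image_side) ** 2     # fraction of positions left
            print(f"{label}: {image_side}x{image_side} -> "
                  f"{latent_side}x{latent_side} latent "
                  f"({ratio:.2%} of the original positions)")

        # typical latent diffusion (top of the 4x-8x range): 1024x1024 -> 128x128 latent (1.56% ...)
        # Würstchen: 1024x1024 -> 24x24 latent (0.05% ...)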

Comments • 81

  • @outliier
    @outliier 8 months ago +5

    Join our Discord for Generative AI: discord.com/invite/BTUAzb8vFY

    • @qiaozhaohui
      @qiaozhaohui 8 months ago +1

      Good job!

    • @NoahElRhandour
      @NoahElRhandour 8 months ago +1

      and this link is really not a virus?

  • @ml-ok3xq
    @ml-ok3xq 3 months ago +8

    Congrats on Stable Cascade 🎉

  • @user-rw3xm8nv7u
    @user-rw3xm8nv7u 6 months ago +4

    You are definitely the most detailed and understandable person I have ever seen.

  • @jeanbedry3941
    @jeanbedry3941 8 months ago +2

    This is great; models that are intuitive to understand are the best ones, I find. Great job of explaining it as well.

  • @macbetabetamac8998
    @macbetabetamac8998 8 months ago +1

    Amazing work mate ! 🙏

  • @mtolgacangoz
    @mtolgacangoz 6 days ago

    Brilliant work!

  • @dbender
    @dbender 3 months ago

    Super nice video which explains the architecture behind Stable Cascade. Stage B was nicely visualized, but I still need a bit more time to fully grasp it. Well done!

  • @jonmichaelgalindo
    @jonmichaelgalindo 7 months ago

    Awesome work and great insights! ❤

  • @adrienforbu5165
    @adrienforbu5165 8 months ago +1

    Amazing explainations, good job

  • @mik3lang3lo
    @mik3lang3lo 8 months ago +1

    Great job as always

  • @dbssus123
    @dbssus123 8 months ago +2

    awesoooom !!! I always wait for your videos

  • @EvanSpades
    @EvanSpades 4 days ago

    Love this - what a fantastic achievement!

  • @lookout816
    @lookout816 8 months ago +1

    Great video 👍👍

  • @ratside9485
    @ratside9485 8 months ago +3

    We need more Würstchen! 🙏🍽️

  • @e.galois4940
    @e.galois4940 8 months ago +3

    Thanks very much

  • @arpanpoudel
    @arpanpoudel 8 months ago +2

    thanks for the awesome content.

  • @xyzxyz324
    @xyzxyz324 7 months ago

    well explained, thank you!

  • @factlogyofficial
    @factlogyofficial 8 months ago +1

    good job guys !!

  • @leab.6600
    @leab.6600 8 months ago +2

    Super helpful

  • @mohammadaljumaa5427
    @mohammadaljumaa5427 8 months ago +1

    Amazing job, and I really love the idea of reducing the size of the models, since it just makes so much sense to me!! I have a small question: what GPUs did you use for training? Did you use a cloud provider, or do you have your own local station? If the second, I'm interested to know which hardware components you have. Just curious, because I'm trying to decide between using cloud providers for training vs. buying a local station 😊

    • @outliier
      @outliier 8 months ago

      Hey there. We were using the Stability cluster.

    • @outliier
      @outliier 8 months ago

      Local would be much more expensive, I guess. What GPUs are you thinking of buying/renting, and how many?

  • @omarei
    @omarei 8 months ago +2

    Awesome

  • @glazastik_original
    @glazastik_original 2 months ago +1

    Hi! If it's not a secret, where do you get datasets for training text2img models? Really great video!

  • @timeTegus
    @timeTegus 8 months ago +1

    I love the video. :) And I would love more detail 😮😮😮😮

    • @outliier
      @outliier 8 months ago +1

      Noted! In the case of Würstchen, you can take a look at the paper: arxiv.org/abs/2306.00637

  • @hayhay_to333
    @hayhay_to333 8 months ago +1

    Damn, you're so smart. Thanks for explaining this to us. I hope you'll make millions of dollars.

    • @outliier
      @outliier 8 months ago

      Haha thank you!

  • @jeffg4686
    @jeffg4686 1 month ago

    Nice !

  • @jollokim1948
    @jollokim1948 5 months ago

    Hi Dominic,
    This is some great work you have accomplished, and definitely a step in the right direction for democratizing the diffusion method.
    I have some questions, and a little bit of critique, if that would be okay.
    You say you achieve a compression rate of 42x; however, is this a fair statement when that vector is never decompressed into an actual image?
    It looks more like your Stage C can create some sort of feature vectors of images in a very low-dimensional space using the text descriptions, which are then used to guide the actual image creation, along with embedded text, in Stage B.
    In my opinion it looks more like you have used Stage C to learn a feature-vector representation of the image, which is used as a condition similar to how language-free text-to-image models might use the image itself as guidance in training.
    However, I don't believe this to be a 42x image compression without the decompression. Have you tried connecting a decoder onto the vectors coming out of Stage C?
    (I would believe that vector might not be big enough to create high-resolution images because of its dimensionality.)
    I hope you can answer some of my questions or clear up any misunderstandings on my part.
    I'm currently doing my thesis on fast diffusion models and found your concept of extreme compression very compelling. Directions on where to go next regarding this topic are also very much appreciated :)
    Best of luck with further research.

  • @NoahElRhandour
    @NoahElRhandour 8 months ago +3

    🔥🔥🔥

  • @ChristProg
    @ChristProg 10 days ago

    Thank you so much. But please, I would prefer that you go through the maths and operations of training Würstchen in more detail 🎉🎉 thank you

  • @truck.-kun.
    @truck.-kun. 4 months ago

    This needs more reach!

  • @MiyawMiv
    @MiyawMiv 8 months ago +1

    awesome

  • @eswardivi
    @eswardivi 8 months ago +3

    Amazing work. I am wondering how this video was made, i.e. the editing process and the cool animations

    • @outliier
      @outliier 8 months ago +3

      Thank you a lot! I edit all videos in Premiere Pro, and some of the animations, like the GPU-hours comparison between Stable Diffusion and Würstchen, were made with Manim (the library from 3Blue1Brown).

  • @muhammadrezahaghiri
    @muhammadrezahaghiri 8 months ago

    That is a great project; I am excited to test it.
    Out of curiosity, how is it possible to fine-tune the model?

    • @outliier
      @outliier 8 months ago +1

      Hey, there is no official code for that yet. If you are interested, you can give it a shot yourself. With the diffusers release in the next few days, this should become much easier, I think.

    • @swannschilling474
      @swannschilling474 8 months ago +1

      This is very interesting!! 😊

  • @flakky626
    @flakky626 3 months ago +1

    Can you please tell us where you studied ML/deep learning? (Courses?)

  • @darrynrogers204
    @darrynrogers204 8 months ago

    I very much like the image you use at the opening of the video: the glitchy 3D graph that looks like an image generation gone wrong. How was it generated? Was it intentional, or a bit of buggy code?

    • @outliier
      @outliier 8 months ago

      Hey, which glitchy 3D graph? Could you give the timestamp?

    • @darrynrogers204
      @darrynrogers204 8 months ago

      @@outliier The one at 0:01, right at the start. It says "outlier" at the bottom in mashed-up AI text. It's also the same image you're using for your YouTube banner on your channel page.

  • @hipy-tz3qt
    @hipy-tz3qt 8 months ago +1

    Awesome! I have a question: who decided to call it "Würstchen" and why? I am German and just wondering

    • @akashdutta6235
      @akashdutta6235 8 months ago

      Man who loves hot dogs😂

    • @outliier
      @outliier 8 months ago +1

      We called it Würstchen because Pablo is from Spain and we called our first model Paella. And I'm from Germany, so I thought let's call the next model after something German lol

  • @JT-hg7mj
    @JT-hg7mj 7 months ago

    Did you use the same dataset as SDXL?

  • @davidgruzman5750
    @davidgruzman5750 8 months ago

    Thank you for the explanations! I am a bit puzzled: why do we call the state in the inner layers of an autoencoder "latent", since we can actually observe it?

    • @outliier
      @outliier 8 months ago

      which "state" are you referring to? The ones from Stage B?

    • @davidgruzman5750
      @davidgruzman5750 8 months ago

      @@outliier I mean the one you mention at the 1:27 point of the video. It is probably Stage A.

    • @outliier
      @outliier 8 months ago +1

      @@davidgruzman5750 Ah, got it. Well, you can observe it, but you can't really understand it, right? If you print or visualise the latents, they are not really meaningful. There are strategies to make them more meaningful, though. But just by themselves they are hard to understand. That's what we usually call latents, I would say.
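
      A runnable toy illustration of this point (a minimal sketch; the conv stack below is a random stand-in for a real learned encoder, so only the shapes are meaningful):

          import torch
          import torch.nn as nn
          import matplotlib.pyplot as plt

          # Stand-in for a VQGAN/VAE encoder: strided convs mapping an RGB
          # image to a (4, 64, 64) latent. A real encoder is learned, but the
          # takeaway is the same: latent channels plot as abstract feature
          # maps, not as anything a human reads as "the image".
          encoder = nn.Sequential(
              nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
              nn.Conv2d(16, 4, kernel_size=4, stride=2, padding=1),
          )

          image = torch.rand(1, 3, 256, 256)   # stand-in for a real photo
          latent = encoder(image)              # shape (1, 4, 64, 64)

          fig, axes = plt.subplots(1, latent.shape[1], figsize=(12, 3))
          for c, ax in enumerate(axes):
              ax.imshow(latent[0, c].detach(), cmap="gray")
              ax.set_title(f"channel {c}")
          plt.show()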

  • @aiartbx
    @aiartbx 8 months ago +1

    Looks very interesting. Depending on how fast the generation is, real-time diffusion seems closer than expected.
    Btw, is there any Hugging Face Space demo where we can try this?

    • @outliier
      @outliier 8 months ago +1

      Hey thank you! The demo is available here: huggingface.co/spaces/warp-ai/Wuerstchen
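
      For local experimentation, the diffusers integration linked in the description can be used roughly like this (a sketch following the Hugging Face blog; the warp-ai/wuerstchen checkpoint id and a CUDA GPU with enough memory for fp16 inference are assumed):

          import torch
          from diffusers import AutoPipelineForText2Image

          # Loads the full Würstchen stack: the Stage C prior plus the
          # Stage B/A decoder that turns its latents back into pixels.
          pipeline = AutoPipelineForText2Image.from_pretrained(
              "warp-ai/wuerstchen", torch_dtype=torch.float16
          ).to("cuda")

          image = pipeline(
              "Anthropomorphic cat dressed as a firefighter",
              height=1024,
              width=1024,
          ).images[0]
          image.save("wuerstchen_sample.png")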

  • @KienLe-md9yv
    @KienLe-md9yv 20 days ago

    At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized into discrete latents (each discrete latent is chosen from the codebook as the codebook vector nearest to the continuous latent vector). But the output of Stage B is continuous latents, and it is fed directly into Stage A... is that right? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and the Würstchen paper; this is not clear. Please help me with that. Thank you

    • @outliier
      @outliier 20 days ago

      The VQGAN decoder can also decode continuous latents. It's as easy as that.
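
      A toy sketch of why this works: in a VQGAN, quantization is a codebook lookup that sits between the encoder and the decoder, while the decoder itself is an ordinary convnet, so it accepts any latent of the right shape, whether snapped to codebook entries or not. The shapes and layer sizes below are illustrative assumptions, not the real architecture.

          import torch
          import torch.nn as nn

          codebook = nn.Embedding(8192, 4)   # 8192 codes of dimension 4
          decoder = nn.Sequential(           # stand-in for the VQGAN decoder
              nn.ConvTranspose2d(4, 16, kernel_size=4, stride=2, padding=1),
              nn.ReLU(),
              nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),
          )

          continuous = torch.randn(1, 4, 64, 64)   # e.g. Stage B's output

          # Discrete path: snap each latent vector to its nearest codebook entry.
          flat = continuous.permute(0, 2, 3, 1).reshape(-1, 4)
          nearest = torch.cdist(flat, codebook.weight).argmin(dim=1)
          discrete = codebook(nearest).reshape(1, 64, 64, 4).permute(0, 3, 1, 2)

          # Both decode without complaint; the decoder never touches the codebook.
          img_from_discrete = decoder(discrete)      # (1, 3, 256, 256)
          img_from_continuous = decoder(continuous)  # (1, 3, 256, 256)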

  • @digiministrator
    @digiministrator 5 months ago

    Hello,
    How do I make a seamless pattern with Würstchen? I tried a few prompts; the edges are always problematic.

    • @outliier
      @outliier 5 months ago

      Someone on the Discord was talking about circular padding on the convolutions. Maybe you can try that.
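
      A minimal sketch of that circular-padding idea in PyTorch (which modules to patch, and whether this suffices on its own, are assumptions rather than an official recipe):

          import torch.nn as nn

          def make_convs_tileable(model: nn.Module) -> None:
              # Circular padding makes feature maps wrap around at the borders,
              # so opposite edges are computed from each other and the decoded
              # image tiles seamlessly.
              for module in model.modules():
                  if isinstance(module, nn.Conv2d):
                      module.padding_mode = "circular"

          # Hypothetical usage on a loaded pipeline's image-producing parts:
          # make_convs_tileable(pipeline.decoder)
          # make_convs_tileable(pipeline.vqgan)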

  • @krisman2503
    @krisman2503 8 months ago

    Hey, does it recover from pure noise or from an encoded x_T during inference?

    • @outliier
      @outliier 8 months ago

      During inference you start from pure noise and begin denoising; after every denoising step, you noise the image again (to the next, lower noise level), then denoise again, then noise, and so on.
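
      A minimal sketch of that denoise/re-noise loop (assumes a hypothetical model that predicts the clean latent and a simple linear noise schedule; real samplers differ in both the schedule and the update rule):

          import torch

          @torch.no_grad()
          def sample(model, steps: int = 20, shape=(1, 4, 24, 24)):
              x = torch.randn(shape)                        # x_T: pure noise
              levels = torch.linspace(1.0, 0.0, steps + 1)  # noise levels 1 -> 0
              for t, t_next in zip(levels[:-1], levels[1:]):
                  x0_pred = model(x, t)                     # denoise step
                  noise = torch.randn_like(x)
                  # Re-noise the prediction down to the next (lower) noise level.
                  x = (1.0 - t_next) * x0_pred + t_next * noise
              return x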

  • @streamtabulous
    @streamtabulous 8 months ago

    What about decompression times? Are they faster, and would they use fewer resources on older systems?
    Curious whether the models from this would benefit users; i.e., most people still use the SD 1.5 and v2 models because sampling with SDXL models takes so long.

    • @outliier
      @outliier 8 months ago +1

      Hey, we have a comparison of inference times against SDXL in the blog post here: huggingface.co/blog/wuerstchen
      And I think the model should be comparable to SD 1.x in terms of speed.

    • @streamtabulous
      @streamtabulous 8 months ago

      @@outliier I thought those were compression-only times, not decompression times; that's awesome to read.
      People like you are heroes to me

    • @outliier
      @outliier 8 months ago +2

      @@streamtabulous Hey, those bar charts are for full sampling times, from feeding in the prompt until you receive the final image in pixel space. That is so kind of you; I appreciate it a lot. But people like Pablo, the HF team, and the other people helping us out are the real reason this was possible. And I promise this is only the start.

    • @streamtabulous
      @streamtabulous 8 months ago

      @@outliier The whole team is a godsend. I myself am on a disability pension (neuromuscular), so I can't afford pay-to-use tools like Adobe Firefly; it's a tick-off that they charge while using Stable Diffusion.
      Being disabled, I game, so I have a GTX 1070, with an RTX 3060 in another system.
      One of the things I miss doing is art, and helping people by restoring their photos for free. I have Stable Diffusion on my PCs and I love that it lets me do things I couldn't before, including photo restorations; it makes my life better, as that kind of work gives me joy.
      Knowing from work like yours and your team's that in the near future I'll be able to do not just better art but better, faster, higher-quality photo restorations for people with my hardware means a massive amount to me.
      I'm doing a video tomorrow to help teach people how I use SD and models to restore photos. I only found SD a few weeks ago, but I am working out how to use it in ways to help people with damaged old photos.

  • @davidyang102
    @davidyang102 8 months ago +1

    Why do you still do Stage A? Would it be possible to just do Stage B directly from the image? I assume the issue is that Stage A is cheaper to train than Stage B?

    • @outliier
      @outliier 8 months ago +1

      Yea, you can. We actually even tried that out. But it takes longer to learn, and as of now we didn't achieve quite the same results with a single compression stage. The VQGAN is just really neat and already provides a free compression, which simplifies things for Stage B a lot, I think. But definitely more experiments could be done here :D

    • @davidyang102
      @davidyang102 8 months ago +1

      ​@@outliier Really cool work. Is the use of diffusion models to compress data in this way a generic technique that can be used anywhere? For example could I use it to compress text?

    • @pablopernias
      @pablopernias 8 months ago

      @@davidyang102 The only issue with text is its discrete nature. If you're OK with having continuous latent representations for text instead of discrete tokens, then I think it could theoretically work, although we haven't properly tried anything other than RGB images. The important thing is having a powerful enough signal that the diffusion model can rely on it and only needs to fill in missing details, instead of having to make a lot of information up.

  • @beecee793
    @beecee793 8 months ago

    If I need X time for inference with SD on a given example GPU, what do I need, and how fast would inference with this be in the same environment? Will it run on my toaster?

    • @outliier
      @outliier 8 months ago +1

      Hey. Take a look at the blog post. It has an inference-time bar chart: huggingface.co/blog/wuerstchen

    • @beecee793
      @beecee793 8 months ago

      @@outliier Thank you

  • @saulcanoortiz7902
    @saulcanoortiz7902 4 months ago

    How do you create the dynamic videos of NNs? I want to create a YouTube channel explaining theory & code in Spanish. Best regards.

  • @TheAero
    @TheAero 8 months ago

    Why use a second encoder? Isn't that what the VQGAN is supposed to do?

    • @outliier
      @outliier 8 months ago +1

      Yes, but the VQGAN can only do a certain amount of spatial compression; beyond that, it gets really bad. That's why we introduce a second one.

    • @TheAero
      @TheAero 8 months ago

      @@outliier So can we replace the GAN encoder with a better pre-trained encoder and reduce the expense of using two encoders instead of one? So fundamentally: start with a simple encoder, then replace it with a better pre-trained one and continue training, so that you also improve the decoder?

  • @KienLe-md9yv
    @KienLe-md9yv 20 days ago

    So, apparently, it sounds like Würstchen is essentially Stage C. Am I right?

    • @outliier
      @outliier 20 days ago

      What do you mean exactly?

  • @lawtonkovac4215
    @lawtonkovac4215 7 months ago

    💘 promo sm