If it needs other people's intellectual property to work, then it is a legal concern for those whose work it being exploited without consent. This is why Getty images is suing, and why Adobe is building their AI Firefly off of licensed work, and making guidelines to compensate those who stock are being used by their AI.
@@Isaac-wr8et Why do ignorant people like you comment? This is literally an explanation of how these types of AI work, no different than explaining other AI models. Nothing about this is abstract mumbo jumbo BS, you just don't understand it, lmao.
@@youssefabusamra3142 the way I see it, people aren't happy when their private information or content is downloaded/stolen without permission by the government or AI. The source that these AIs are using too "learn", is doing just that.
@@Isaac-wr8et "the government or AI" my brother in christ they uploaded these photos for public recognition agreeing to the site's terms and conditions what the AI is doing is the equivalent of "looking" at these photos and recognizing features and patterns. If you're so dead set on the stealing narrative then pick a real painting that was used for training and try to "steal" it by recreating it with prompts
Stable diffusion doesn't actually actually apply noise to images, it uses a compressed low dimensional latent representation of the image and applies noise to that. The model is running in this abstract latent space, and then the autoencoder recreates the image afterwards.
Great point. Yes I skipped over this mainly for the sake of the length of the video. This also explains the slightly odd brown noise we see in the video, which is actually a low noise latent passed back through the VAE decoder.
I once used an overtrained network to store several images, then manipulated the low dimensional part to give some trippy image tweening (a few inputs reperesenting this is image 1,2,3,... etc. and then getting the in-between images). Unfortunately very low resolution, and took ages, guess a web-browser isn't the place for running neural nets...
@@threeMetreJim That’s always a fascinating experiment with VAEs. Encode two items to two latent points, take the midpoint of the two latent points, and then decode that latent midpoint to see what the resulting item is. I tried this with music and it was interesting to hear a transition from Beethoven to Schubert.
@@rayankhan12 sure i will send it in parts here is some code in pytorch (i personally know only tensorflow but i still got the gist of how i would go about doing it)
@@rayankhan12 this is abit mathematicly involved. Key point it shows is that the limiting behivior is not what makes these work. They r simply autoencoders with some extra whistles
Came here by accident and man, aren't you the gifted one? I was engrossed in the video knowing barely anything about the technologies and techniques uses, and I don't feel dumber -- that's an achievement :) Thanks again, will pop here often.
Finally! Ever since Stable Diffusion was released I was looking for an explainer on how it worked that wasn't "Oh it generates images from noise" or something that went too deep into technicals that I didn't understand. Very beautifully explained Dr. Mike Pound! Hope you do another video where you dive into the code where we can see the parts which were visualized here. One thing that's still unclear to me is how was the network trained to relate text with images and how does it utilize this information when actually producing images?
@@thebirdhasbeencharged I don't understand why people answer questions they don't know the answer to. He's asking how the diffusion model which starts from purely random noise, uses the text embedding generated from clip to guide the diffusion. "A.I. is just fancy pattern matching" is about as unhelpful an answer as you could imagine.
I'm a little unclear on that myself, but the best understanding I can manage is that the CLIP (language model) embeddings must be included in the diffusion network's training. So while it's learning how to predict the noise on a picture of, say, a bunny, it's also given the text description of the bunny, which means it's learning how the descriptions affect the noise at the same time as it's learning how the underlying picture does. I think. As I said, not 100% clear on that, so don't take my word for it 😅
They asked humans, "Is this a frog? Yes or no." They took that data to develop an AI that could be asked, "Is this a frog? Yes or no." They did the same with "stilts". They did the same with "on". The difference being that they used a variety of known "objects" to determine whether they were "on" something or not. They also probably classified "on" as a verb, rather than a noun. This makes it a union of two objects. A union associated with "proximity" or something like that. Like he said at the end, they need an intact frog and intact stilts as a requirement in the "frog on stlits" image. So they look like "frog feet with proximity to stilt objects" etc. I would assume human objects on stilts strongly guided their classification of frog objects on stilts.
@@dialecticalmonist3405 I guess that's a decent high-level explanation, but I would clarify that it's actually not using any kind of classification system. Classifiers are an entirely different family of neural networks. The guidance system used here is a transformer-based language model, which is less like asking "is this a frog (y/n)?" and more like asking "here's an image, describe what it is".
It's strikingly similar, except a sculptor starts with a goal image in mind, but AI image generation doesn't; it just has general "knowledge" of "associations" between the words of the prompt and parts of images.
The synthesized-speech scad (scam advert) that I received after watching this video reminded me a little too much about how all of our advancements will eventually be weaponized against us. I'm both filled with joy for the beautiful engineering that led to stable diffusion, and a sense of overwhelming dread for how it will eventually be utilized commercially.
Don't worry friend, just do what I do: 1. Assume everything on the internet is fake (including other people) 2. Retreat from society into a cave 3. Starve to death It's kinda like Plato's Cave, but in reverse. Anyway, it's a pretty solid solution 🙂
In fact it is already being done for artists, the LAION database should not be used for commercial use, and many IAs are actually using it in that way, not to mention that this database has images protected by copyright, so sell or publishing these resulting images is a clear violation of copyright
Add noise to images and train a model to undo that addition.. then you have something that maps from noise to images. One thing I find so impressive about these researchers.. is that they would try this. It’s so bizarre.. just because, from a distance, it’s not at all clear that such a task is doable.
Right, it sounds like a completely non-intuitive way of going about it, and yet, that's what ended up working. They must have iterated on a gazillion different ideas before they landed on this one.
The idea of adding something, do a transformation and substract to get only the transformation of the data or the something is actually quite common in math and control theory. Real hit from AI was to get "the" transformation from data, something and output into an algorithm. This general function of transformation is what allows the image generation. We give it data or something that is slightly off, and amplify the error by the transformation. 😅 it's some weird combination of the chicken egg paradox and a rock paper scissor but with data, something, algorithm and output.
It's why science works better by being public and not subject to short/mid-term revenue. There were already teams training models to undo noise, there was already GPT to interpret text, there was already a database of millions of text-to-image pairings and there were already models trying to feed text to image-based neural networks. Kinda like smartphones, all it took is someone putting the pieces in the right order for the right purpose to make something more useful than the sum of its parts.
@@Alex-ye8qp To me this is the most bizarre part. How can a network even be trained to do that? So utterly bizarre to me. You're telling me "green cow grazing on mars" has a deterministic noise profile??
I couldn't agree more! Since the release of Stable Diffusion, I've been searching for an explanation that strikes the right balance between simplicity and technicality. Your video did an excellent job of providing a clear understanding without overwhelming us with excessive technical details. Dr. Mike Pound, you have a remarkable talent for explaining complex topics in a beautifully straightforward manner!
I followed some of that.. but some of that also sounded a lot like Michelangelo's "start with the block of marble and carve away everything that doesnt look like "X." I will come back to watch this again after the first watching settles! Thank you for providing this.
Would have been nice hear a bit more about the "gpt-style transformer embedding". Wouldn't those classifications have to be included in the training data already?
This is basically what CLIP does. CLIP learns from a massive amount of image-description pairs using GPT-style (Transformers) encoding so that it can map texts and images. CLIP data are not classification labels. Then the difference between the texts and the generated images can be calculated and minimized.
Key word is embeddings. Initial feature space of text has two bad properties: it has big dimensionality (each token is it's own dimension essentially) and sparsity. By using Transformers you compress representation of this object in more compact and dense form, so it's easier to work with.
Been listening to house music in the background (on the low down) when the odd watching computerphile / numberphile for quite a while now. Thought it was time to fess up. Vibing it is probably just me on this tip.
Yep! It's a bit like apophenia, like looking at random clouds and seeing coherent shapes in them, but with some priming about what you "should" be seeing :)
No, that's not how it works at all. His explanation is highly inaccurate and misleading, which is throwing you off. Try reading the actual papers on the subject, or going through the code.
No, that's not how it works at all. His explanation is highly inaccurate and misleading, which is throwing you off. Try reading the actual papers on the subject, or going through the code.
But sculptors start already with a finished image in mind, while AI image generators, the way I understand it, makes it up as it goes along. It's less sculpting and more slapping clay into shape for a person that requests a clay sculpture, but he doesn't specify what exactly he wants, but he checks every time to see if the shape makes him happy.
Wow! Had not seen listing paper since my dad was trying to teach me basic on a commodore 64. Had no idea it was still a thing. Big jump from having to read code on paper to make sense of it to this.
I watched Dr Mike Pound's video on Convolutional Neural Networks when it first came out and it got me into machine learning. Now I'm doing undergrad computer vision research with CNNs. It's honestly kind of crazy to think about how much this channel has affected my life.
@@emmafountain2059 yup. I started watching this channel when I was a bored IT audit intern who hated doing work papers. I literally sat in the bathroom or went on walks outside and just watched. Now I’m a pentester. The reach of high quality TH-cam channels like Computerphile are hard to measure but I don’t think I’m unique.
This is how DALL-E works in a nutshell: "Read user prompt. Decide it's against their arbitrary moral codex. Emit error." Excellent vid btw. Explained something complex in a very easy way.
Great explanation. Just complicated enough to understand for someone who keeps up with this stuff on the surface level, but isn't interested in reading the papers. Thanks.
@@jeremiahweaver4677The usual setup of stable diffusion with automatic1111, or the rather simpler and easier (but not less powerful) fooocus, it’s not a typo, it’s fooocus with three “o”. Or comfyui if you like node based workflows.
Stable Diffusion is actually runable (in inference mode, i.e. for generation - this is different from training) on a regularish computer. The main factor is time but if you have a reasonably modern graphics card, you probably can run stable diffusion in principle. It just might take minutes rather than seconds for a single image. Somebody ran a variant of it on a not even that new iphone. It did take like half an hour iirc so it's not a thing most people would *want* to do, but one of the big selling poitns of stable diffusion is, that it's never the less *possible.* Stuff like Dall-E 2 or Imagen actually really does need a beefy computer with lots of specialized hardware (in particular, above-consumer-hardware VRAM) to get things done. Some of the oldest methods, though, can also work on a regular computer. I'm directly optimizing a version of CLIP towards some image for instance. It's not nearly as good as stable diffusion, but it's basically how all this madness of arbitrary images from text began
The biggest limit is vram not gpu power as much, on a rtx 3000 series gpu 2-10s for 512^2 20 steps and 15-60 secs for 1024^2 image. Slow down starts due to vram not being able to hold everything. So you can get silly things like the 3060 being better then a 3080ti for some uses of SD.
@@asdf30111 yeah VRAM is always the main issue with AI. Gotta store huge matrices in memory. I wonder if, going forward, hardware providers will bump up VRAM on their high end consumer cards due to increasing consumer demand... Or perhaps decently sized tensor cores will become more commonplace
I would love to hear more about the process. Like how does it recognize that the image now looks like a frog on stilts? Seems to me like that's where the real complexity is.
Same, I understood the noise subtraction bit, but I can't quite understand how the subtraction can lead to a picture of a frog, was the IA trained with "words vs images"? So it can relate what a frog would look like. Also, what the initial input picture (12:30) looks like? Is it just random generated noise?
@@skirtsonsale yes a labeled dataset (image-text pairs, LAION dataset) was used to train the network. That is why it is called guided diffusion. The text guides the diffusion process not to a random image, but conditioned on the text (again the pairs were used for training). During training, it sample randoms noise from a random t according to the noise schedule (such that during training it is learned for all t). The input image on 12:30 is such image corrupted using noise from a random t. So somewhere between noise and an image.
@@tristanstevens6162 But why doesn't that just produce some incohesive amalgamation of the training data? How does it know to specifically put the bunny ears on the frog's head? Is that where the magic of having a large amount training data comes in, in that it better understands the correlation between the label and the image?
@@jonatansexdoer96 Yep. CLIP is the language model used in these, and it's seen enough examples of things labeled "bunny" that look different from each other to abstract the idea of where bunny ears are located in any given underlying image.
I remember rewatching Brows Held High's episode on the movie Blue. In it, there are clips from the film. It was just a blank blue screen. On film. Sot there was some noise from the grain. It was from a DVD rip (I think). And it had therefore, by the time it got from the film to my computer screen on youtube, gone through numerous re-encodings. Encodings that expect visual interests and details to compress.. but those had none. So I noticed that the artifacting was picked up as not-noise. And it tried to encode it as if it was normal video. And through the generations of transfers, the blue blank screen was now... Filled with random shapes of blue tones that had gotten enhanced over time. I joked then that we were basically seeing the encoders hallucinations. Little did I know, that a few years later, seceral image processors would spring up that essentially used that method, but guided. And they would be able to hallucinate pretty high resolution images... From noise ..
@@tristanstevens6162 That dataset accounts for understanding text too? Like, if you have sets of images of cats and sets of images of frogs and very few sets of different animals being "fused" (e.g. maybe one capybara that looks like a dog), how would the neural network get to the interpretation of what a cat-frog means as an image? Is LAION that big? Or does the GPT neural network somehow bridge that gap?
@@ekki1993 a lot of the job of "understanding" is being done by the embedding network, which was trained on a very large corpus of words. So while the training set for stable diffusion might not have any examples of a frog-bunny fusion, CLIP is able to take the phrase "frog-bunny fusion" and transform it into a vector that encodes something about the meaning of the phrase. Stable diffusion was trained conditioned on this embedding, so it generally has learned to take concepts from this embedding and include them in the image. The hope is that stable diffusion is able to generalize across all concepts that can be represented by the embedding, so that even if it hasn't seen this specific thing before, it has seen similar stuff and is able to still produce a reasonable image that matches the concepts in the embedding.
@@ekki1993 It's possible for all/any of the components to contribute to the result working - like, even if it hasn't seen any pictures by an artist x, "style of artist x" may still work as a prompt if it's seen text describing them. This is an issue for artists who've been complaining that image generators can reproduce their style in some ways. It means that nothing can be done to prevent this; a base model might still understand them if the images aren't in the set. "Worse", fine-tuning seems to work well enough that you can add in new concepts and styles at home even if they're not in the model originally.
Gaming laptop with a 3060 (6GB) here working great for Stable Diffusion, I'm using the Automatic1111 web UI distro bundle which made setup incredibly easy. I'm still learning how to use it to get "what I mean" results, but it is quite amazing.
I'm using a 3060 Ti. But inference is *always* more efficient than training, so my hardware can't handle Dreambooth, and certainly would never come close to handling a full initial training.
This is EXACTLY the detailed, nuts and bolts video I was hoping to find on AI art. I'm fascinated that random noise seems to be key. I have been working in 3D generated art for many many years and random noise is so powerful in creating imagery and textures in 3D. Such a fascinating, enigmatic concept - noise is nothing but at the same time everything. Even further fascinating to ponder that we can introduce chemicals into the human brain to create random noise, resulting in random infinite hallucinations which likewise have been used for millennia to generate art as well.
Two things amaze me... First, the AI-aspect which I will need more time to study (it's new to me). Second: Mate... they still make folded printer paper like that? It's been decades since I last saw it. You're near a mainframe ammirite?
Very insightful, although the one thing he didn't do a great job explaining is why it's better to ask it to go from T to T0 and then add back noise equal to some lower T instead of just going from T to T-1 to T-2 and so on. He sort of touched on it vaguely but didn't really explain why it's actually better
When you say you remove the noise and then add most of it back, do you mean you add back the predicted noise, or you add back newly generated noise? And both approaches seem plausible, so what is the reasoning?
He explains the "add noise - estimate noise - substract noise - add most noise back - repeat" loop about four times, but then when it comes to how any of this relates to producing images of the actual prompts instead of random noise, it's just "oh GPT embedding", as if that's self-explanatory. Somewhat in the category of that 'draw the rest of the owl' meme for me I'm afraid.
15:15 Why would we amplify the difference and not the parts that stay the same? Might be in the wrong here but in my understanding we are trying to use the parts that both the text guided and unguided noise produced to get the best output since the parts they both produced will be the best fit. Why then use the difference
Curiously, this is the same process as creating a stone sculpture: you start with a block of stone with no shape and gradually take away all parts of the stone that are not shaped like the thing you are sculpting.
Awesome video. That’s the clearest explanation I’ve seen. The hand drawn explanation explained it so perfectly. Would love to see a follow up video that goes through the code. Also would be awesome to include examples when talking about the muppets in the kitchen, etc.
stable diffusion is the best thing since sliced bread. make this a series! it's hard to understand how this thing works, but it's more useful than ever to understand, because it's open source and runs on consumer graphics cards and everyone can hack on it!!
@@stephenkamenar I can tell you that in-painting and out-painting are both kinda similar and straightforward. You start with your image, replace your "empty" pixels with fresh random noise, and run your cycles again, but this time with much lesser starting t, so that the network tries to incorporate existing image because it assumes that there is not that much noise
Using pure noise and "returning it to the noise free original" reminds me of a quote: "It is easy. You just chip away the stone that doesn’t look like David."
@@stephenkamenar I believe img2img is the same process, it just uses your supplied input image plus noise and a lower t-value as the first input to the diffusion network. Which lets it keep some of the structure of your input, the same way during training it learned to keep the structure of the bunny.
Could you add foreign language subtitles to these videos? Right now I'd need German subtitles. I've always loved your explainers but AI image generators is the first time where I'd need to show a video of yours to someone who's not fluent in English. TH-cam offers options to crowdsource subtitles BTW. Thanks a lot, keep up the good work! ☺
Great video!! Would be interesting to see the differences between dall-e ans stable Diffusion. And how the last one requires less training and compute power
I feel the GPT embedding is skipped over quite easily. With my limited knowledge on how these text Transformers work there is a vector representation of the description that in stead of representing the words represent the 'meaning' of the text. However how does this transform to an image that does exactly that? You would still need a lot of training data that confirms whether an image matches a description, right? Did they use alt descriptions or something similar for this, eg (publicly) available image descriptions? I guess my point is that i do not see how this 'knowledge' of what's in the text is transferred to the 'knowledge' of what's in the image, apart from there being a mechanism of steering it towards the 'knowledge' but where does this knowledge come from?
The network is basically implicitly imbued with this understanding through the training process: For the training process of these conditional models, you take a labeled dataset {(image, label)}, and the label is a natural-language description text *describing that image* . For a particular data sample, the network will then see a noisy version of this image, along with the embedding of the description of the original clean image. The training loss then rewards the network for reproducing this image given these two bits of information, and punishes it for reproducing *any* other image. I hope it's clear that the training will therefore steer the network into a direction of 'understanding' these descriptions in some abstract way. That this understanding of the input text is *abstract* is actually what you hope allows the model to extrapolate, so it can generate new images conditioned on new texts not seen during training. That this extrapolation works ridiculously well is one of the amazing things about these models.
A glib native English speaker, and that's OK, however, its crucial to define many bits of jargon at the outset, e.g., "NOISE", "IMAGE", etc....All of this technology has adequate well-understood jargon that originated in the 1960s or earlier, and as a result, the wheel is continually re-invented, probably imperfectly, as well! Computer "scientists" don't appear to exploit earlier software art very well....
On a side note: Is there some supply back room full of that old continuous stationary somewhere on campus? Good job recycling it, since there is no other use for it anymore.
Very interesting, the amount of areas where machine learning algorithms have improved to such a degree is getting more every day. Glad you could explain the process behind stable diffusion
Stable Diffusion is very easy to run on your own computer. Used a version with an easy to use web ui and it takes maybe 10-15 sec to produce one 512x512px image. My pc is an around $1000-1200 gaming pc. I imagine with even a basic modern laptop you can get to under a minute.
Identifying objects within images is now more accurate thanks to sophisticated recognition systems. SmythOS incorporates AI for image recognition, enhancing features like photo management and security surveillance.
Glad to have finally found someone I can actually listen to about AI, someone that doesn't hype things up and isn't trying to sell me something.
If it needs other people's intellectual property to work, then it is a legal concern for those whose work it being exploited without consent.
This is why Getty images is suing, and why Adobe is building their AI Firefly off of licensed work, and making guidelines to compensate those who stock are being used by their AI.
@@Isaac-wr8et Why do ignorant people like you comment? This is literally an explanation of how these types of AI work, no different than explaining other AI models. Nothing about this is abstract mumbo jumbo BS, you just don't understand it, lmao.
@@Isaac-wr8et salty credit obsessed artist spotted
@@youssefabusamra3142 the way I see it, people aren't happy when their private information or content is downloaded/stolen without permission by the government or AI. The source that these AIs are using too "learn", is doing just that.
@@Isaac-wr8et "the government or AI" my brother in christ they uploaded these photos for public recognition agreeing to the site's terms and conditions
what the AI is doing is the equivalent of "looking" at these photos and recognizing features and patterns. If you're so dead set on the stealing narrative then pick a real painting that was used for training and try to "steal" it by recreating it with prompts
Stable diffusion doesn't actually actually apply noise to images, it uses a compressed low dimensional latent representation of the image and applies noise to that. The model is running in this abstract latent space, and then the autoencoder recreates the image afterwards.
Great point. Yes I skipped over this mainly for the sake of the length of the video. This also explains the slightly odd brown noise we see in the video, which is actually a low noise latent passed back through the VAE decoder.
@@michaelpound9891 if you tell AI to run for zero steps you can look at the noise.
I once used an overtrained network to store several images, then manipulated the low dimensional part to give some trippy image tweening (a few inputs reperesenting this is image 1,2,3,... etc. and then getting the in-between images). Unfortunately very low resolution, and took ages, guess a web-browser isn't the place for running neural nets...
@@threeMetreJim That’s always a fascinating experiment with VAEs. Encode two items to two latent points, take the midpoint of the two latent points, and then decode that latent midpoint to see what the resulting item is. I tried this with music and it was interesting to hear a transition from Beethoven to Schubert.
fatter and older
A deep dive on the google colab code would be amazing!
☝☝
There r videos that do that.
Also there r alot of papers explaining the system in detail.
@@nevokrien95 Can you provide links to those videos please?
@@rayankhan12 sure i will send it in parts here is some code in pytorch (i personally know only tensorflow but i still got the gist of how i would go about doing it)
@@rayankhan12 this is abit mathematicly involved. Key point it shows is that the limiting behivior is not what makes these work. They r simply autoencoders with some extra whistles
Came here by accident and man, aren't you the gifted one? I was engrossed in the video knowing barely anything about the technologies and techniques uses, and I don't feel dumber -- that's an achievement :)
Thanks again, will pop here often.
Finally! Ever since Stable Diffusion was released I was looking for an explainer on how it worked that wasn't "Oh it generates images from noise" or something that went too deep into technicals that I didn't understand.
Very beautifully explained Dr. Mike Pound! Hope you do another video where you dive into the code where we can see the parts which were visualized here.
One thing that's still unclear to me is how was the network trained to relate text with images and how does it utilize this information when actually producing images?
@@thebirdhasbeencharged I don't understand why people answer questions they don't know the answer to. He's asking how the diffusion model which starts from purely random noise, uses the text embedding generated from clip to guide the diffusion. "A.I. is just fancy pattern matching" is about as unhelpful an answer as you could imagine.
I'm a little unclear on that myself, but the best understanding I can manage is that the CLIP (language model) embeddings must be included in the diffusion network's training. So while it's learning how to predict the noise on a picture of, say, a bunny, it's also given the text description of the bunny, which means it's learning how the descriptions affect the noise at the same time as it's learning how the underlying picture does.
I think. As I said, not 100% clear on that, so don't take my word for it 😅
They asked humans, "Is this a frog? Yes or no."
They took that data to develop an AI that could be asked, "Is this a frog? Yes or no."
They did the same with "stilts".
They did the same with "on". The difference being that they used a variety of known "objects" to determine whether they were "on" something or not.
They also probably classified "on" as a verb, rather than a noun. This makes it a union of two objects. A union associated with "proximity" or something like that.
Like he said at the end, they need an intact frog and intact stilts as a requirement in the "frog on stlits" image. So they look like "frog feet with proximity to stilt objects" etc.
I would assume human objects on stilts strongly guided their classification of frog objects on stilts.
@@dialecticalmonist3405 I guess that's a decent high-level explanation, but I would clarify that it's actually not using any kind of classification system. Classifiers are an entirely different family of neural networks. The guidance system used here is a transformer-based language model, which is less like asking "is this a frog (y/n)?" and more like asking "here's an image, describe what it is".
@@generichuman_ haha I'm assuming they deleted their unhelpful message, as I don't see it. Great call out. =)
Can't believe Mike can effortlessly make that shape with his hand (little finger) at 5:37
Oh i DEFINITELY want to see mike's deep dive into the code!
the explanation sounds like magic. It is like a sculptor saying he just chips away pieces of the stone until he finds the horse hidden inside.
It's strikingly similar, except a sculptor starts with a goal image in mind, but AI image generation doesn't; it just has general "knowledge" of "associations" between the words of the prompt and parts of images.
@@grafzeppelin4069so essentially it just pieces together the image based on what the prompt says, and on what it already knows?
The synthesized-speech scad (scam advert) that I received after watching this video reminded me a little too much about how all of our advancements will eventually be weaponized against us. I'm both filled with joy for the beautiful engineering that led to stable diffusion, and a sense of overwhelming dread for how it will eventually be utilized commercially.
Don't worry friend, just do what I do:
1. Assume everything on the internet is fake (including other people)
2. Retreat from society into a cave
3. Starve to death
It's kinda like Plato's Cave, but in reverse.
Anyway, it's a pretty solid solution 🙂
In fact it is already being done for artists, the LAION database should not be used for commercial use, and many IAs are actually using it in that way, not to mention that this database has images protected by copyright, so sell or publishing these resulting images is a clear violation of copyright
@@Hagaren333 Begun, the data wars have
We are getting ever closer to living in a dystopian Cyberpunk universe.
Just abolish capitalism, then.
I tried to guess how these things work. Now I'm taking the difference between my guess and this explanation and feeding it to my neurons. Thanks!
Insane how much progress was made in just 2 years, looking back at how the images used to look vs now is incredible
Add noise to images and train a model to undo that addition.. then you have something that maps from noise to images.
One thing I find so impressive about these researchers.. is that they would try this. It’s so bizarre.. just because, from a distance, it’s not at all clear that such a task is doable.
And with the intention to guess how much noise a picture would have only by text description as an input haha
Right, it sounds like a completely non-intuitive way of going about it, and yet, that's what ended up working. They must have iterated on a gazillion different ideas before they landed on this one.
The idea of adding something, do a transformation and substract to get only the transformation of the data or the something is actually quite common in math and control theory.
Real hit from AI was to get "the" transformation from data, something and output into an algorithm. This general function of transformation is what allows the image generation. We give it data or something that is slightly off, and amplify the error by the transformation.
😅 it's some weird combination of the chicken egg paradox and a rock paper scissor but with data, something, algorithm and output.
It's why science works better by being public and not subject to short/mid-term revenue. There were already teams training models to undo noise, there was already GPT to interpret text, there was already a database of millions of text-to-image pairings and there were already models trying to feed text to image-based neural networks. Kinda like smartphones, all it took is someone putting the pieces in the right order for the right purpose to make something more useful than the sum of its parts.
@@Alex-ye8qp To me this is the most bizarre part. How can a network even be trained to do that? So utterly bizarre to me. You're telling me "green cow grazing on mars" has a deterministic noise profile??
I couldn't agree more! Since the release of Stable Diffusion, I've been searching for an explanation that strikes the right balance between simplicity and technicality. Your video did an excellent job of providing a clear understanding without overwhelming us with excessive technical details. Dr. Mike Pound, you have a remarkable talent for explaining complex topics in a beautifully straightforward manner!
I followed some of that.. but some of that also sounded a lot like Michelangelo's "start with the block of marble and carve away everything that doesnt look like "X." I will come back to watch this again after the first watching settles! Thank you for providing this.
Would have been nice hear a bit more about the "gpt-style transformer embedding". Wouldn't those classifications have to be included in the training data already?
This is basically what CLIP does. CLIP learns from a massive amount of image-description pairs using GPT-style (Transformers) encoding so that it can map texts and images. CLIP data are not classification labels. Then the difference between the texts and the generated images can be calculated and minimized.
Key word is embeddings. Initial feature space of text has two bad properties: it has big dimensionality (each token is it's own dimension essentially) and sparsity. By using Transformers you compress representation of this object in more compact and dense form, so it's easier to work with.
Been listening to house music in the background (on the low down) when the odd watching computerphile / numberphile for quite a while now.
Thought it was time to fess up.
Vibing it is probably just me on this tip.
So stable diffusion is just the AI version of that sculpting joke: Start with a big block and take away the parts that dont fit
Yep! It's a bit like apophenia, like looking at random clouds and seeing coherent shapes in them, but with some priming about what you "should" be seeing :)
"I saw the angel in the marble and carved until I set him free. ”
- Michelangelo
No, that's not how it works at all. His explanation is highly inaccurate and misleading, which is throwing you off. Try reading the actual papers on the subject, or going through the code.
No, that's not how it works at all. His explanation is highly inaccurate and misleading, which is throwing you off. Try reading the actual papers on the subject, or going through the code.
But sculptors start already with a finished image in mind, while AI image generators, the way I understand it, makes it up as it goes along. It's less sculpting and more slapping clay into shape for a person that requests a clay sculpture, but he doesn't specify what exactly he wants, but he checks every time to see if the shape makes him happy.
would like to see more details but the explanation was superb for an introduction
Wow! Had not seen listing paper since my dad was trying to teach me basic on a commodore 64. Had no idea it was still a thing. Big jump from having to read code on paper to make sense of it to this.
Pounding that like button! You guys have inspired me to start an undergraduate degree in Cyber Security - thank you for all of your videos!
I watched Dr Mike Pound's video on Convolutional Neural Networks when it first came out and it got me into machine learning. Now I'm doing undergrad computer vision research with CNNs. It's honestly kind of crazy to think about how much this channel has affected my life.
@@emmafountain2059 Well played! Are you enjoying yourself doing the computer vision research?
@@emmafountain2059 yup. I started watching this channel when I was a bored IT audit intern who hated doing work papers. I literally sat in the bathroom or went on walks outside and just watched. Now I’m a pentester. The reach of high quality TH-cam channels like Computerphile are hard to measure but I don’t think I’m unique.
@@emmafountain2059 That's amazing! Wish you the best!
I love that he's doing all of this on 1980s printer paper. Proper geek
"I saw the angel in the marble and carved until I set him free. ” - Michelangelo
This is how DALL-E works in a nutshell:
"Read user prompt. Decide it's against their arbitrary moral codex. Emit error."
Excellent vid btw. Explained something complex in a very easy way.
Great explanation. Just complicated enough to understand for someone who keeps up with this stuff on the surface level, but isn't interested in reading the papers. Thanks.
Dr Mike Pound with pen and paper can make me understand any topic.
My favorite part is where he explains AI while drawing on printer paper from 1989 XD
As i understand it, it's somehow like an subtractive synthesizer in audio,the A.I. has the role of the filter.
Less than 2 years later and it's so widespread, plus it's so easy to generate images locally with a decent GPU
But what software would you use?
@@jeremiahweaver4677stable diffusion webui is great for that
@@jeremiahweaver4677The usual setup of stable diffusion with automatic1111, or the rather simpler and easier (but not less powerful) fooocus, it’s not a typo, it’s fooocus with three “o”.
Or comfyui if you like node based workflows.
Stable Diffusion is actually runable (in inference mode, i.e. for generation - this is different from training) on a regularish computer. The main factor is time but if you have a reasonably modern graphics card, you probably can run stable diffusion in principle.
It just might take minutes rather than seconds for a single image.
Somebody ran a variant of it on a not even that new iphone. It did take like half an hour iirc so it's not a thing most people would *want* to do, but one of the big selling poitns of stable diffusion is, that it's never the less *possible.*
Stuff like Dall-E 2 or Imagen actually really does need a beefy computer with lots of specialized hardware (in particular, above-consumer-hardware VRAM) to get things done.
Some of the oldest methods, though, can also work on a regular computer. I'm directly optimizing a version of CLIP towards some image for instance. It's not nearly as good as stable diffusion, but it's basically how all this madness of arbitrary images from text began
The biggest limit is vram not gpu power as much, on a rtx 3000 series gpu 2-10s for 512^2 20 steps and 15-60 secs for 1024^2 image. Slow down starts due to vram not being able to hold everything. So you can get silly things like the 3060 being better then a 3080ti for some uses of SD.
@@asdf30111 yeah VRAM is always the main issue with AI. Gotta store huge matrices in memory. I wonder if, going forward, hardware providers will bump up VRAM on their high end consumer cards due to increasing consumer demand...
Or perhaps decently sized tensor cores will become more commonplace
I would love to hear more about the process. Like how does it recognize that the image now looks like a frog on stilts? Seems to me like that's where the real complexity is.
Same, I understood the noise subtraction bit, but I can't quite understand how the subtraction can lead to a picture of a frog, was the IA trained with "words vs images"? So it can relate what a frog would look like.
Also, what the initial input picture (12:30) looks like? Is it just random generated noise?
@@skirtsonsale yes a labeled dataset (image-text pairs, LAION dataset) was used to train the network. That is why it is called guided diffusion. The text guides the diffusion process not to a random image, but conditioned on the text (again the pairs were used for training).
During training, it sample randoms noise from a random t according to the noise schedule (such that during training it is learned for all t). The input image on 12:30 is such image corrupted using noise from a random t. So somewhere between noise and an image.
@@skirtsonsale The concept of transformation in graphics will help you understand this.
@@tristanstevens6162 But why doesn't that just produce some incohesive amalgamation of the training data? How does it know to specifically put the bunny ears on the frog's head? Is that where the magic of having a large amount training data comes in, in that it better understands the correlation between the label and the image?
@@jonatansexdoer96 Yep. CLIP is the language model used in these, and it's seen enough examples of things labeled "bunny" that look different from each other to abstract the idea of where bunny ears are located in any given underlying image.
I remember rewatching Brows Held High's episode on the movie Blue. In it, there are clips from the film. It was just a blank blue screen. On film. Sot there was some noise from the grain. It was from a DVD rip (I think). And it had therefore, by the time it got from the film to my computer screen on youtube, gone through numerous re-encodings. Encodings that expect visual interests and details to compress.. but those had none. So I noticed that the artifacting was picked up as not-noise. And it tried to encode it as if it was normal video. And through the generations of transfers, the blue blank screen was now... Filled with random shapes of blue tones that had gotten enhanced over time.
I joked then that we were basically seeing the encoders hallucinations. Little did I know, that a few years later, seceral image processors would spring up that essentially used that method, but guided. And they would be able to hallucinate pretty high resolution images... From noise ..
The best compsci content on the internet, period.
i am still a bit confused about the process but loved the ending!
12:58 I'd like to hear more about that GPT-style transformer embedding of text. Was text part of the training set?
yes they used image-text pairs dataset (LAION) to train the guided diffusion model
@@tristanstevens6162 That dataset accounts for understanding text too? Like, if you have sets of images of cats and sets of images of frogs and very few sets of different animals being "fused" (e.g. maybe one capybara that looks like a dog), how would the neural network get to the interpretation of what a cat-frog means as an image? Is LAION that big? Or does the GPT neural network somehow bridge that gap?
@@ekki1993 a lot of the job of "understanding" is being done by the embedding network, which was trained on a very large corpus of words. So while the training set for stable diffusion might not have any examples of a frog-bunny fusion, CLIP is able to take the phrase "frog-bunny fusion" and transform it into a vector that encodes something about the meaning of the phrase. Stable diffusion was trained conditioned on this embedding, so it generally has learned to take concepts from this embedding and include them in the image. The hope is that stable diffusion is able to generalize across all concepts that can be represented by the embedding, so that even if it hasn't seen this specific thing before, it has seen similar stuff and is able to still produce a reasonable image that matches the concepts in the embedding.
@@DontThinkSo11 Thanks for the answer!
@@ekki1993 It's possible for all/any of the components to contribute to the result working - like, even if it hasn't seen any pictures by an artist x, "style of artist x" may still work as a prompt if it's seen text describing them.
This is an issue for artists who've been complaining that image generators can reproduce their style in some ways. It means that nothing can be done to prevent this; a base model might still understand them if the images aren't in the set. "Worse", fine-tuning seems to work well enough that you can add in new concepts and styles at home even if they're not in the model originally.
I'm running Stable Diffusion locally on a 3080 Ti, works fine.
Gaming laptop with a 3060 (6GB) here working great for Stable Diffusion, I'm using the Automatic1111 web UI distro bundle which made setup incredibly easy. I'm still learning how to use it to get "what I mean" results, but it is quite amazing.
I've ran it on a 2060 fine, though I think its the lowest supported card
I'm using a 3060 Ti. But inference is *always* more efficient than training, so my hardware can't handle Dreambooth, and certainly would never come close to handling a full initial training.
"That next video" sounds like exactly the video I want for these networks!
This is EXACTLY the detailed, nuts and bolts video I was hoping to find on AI art. I'm fascinated that random noise seems to be key. I have been working in 3D generated art for many many years and random noise is so powerful in creating imagery and textures in 3D. Such a fascinating, enigmatic concept - noise is nothing but at the same time everything. Even further fascinating to ponder that we can introduce chemicals into the human brain to create random noise, resulting in random infinite hallucinations which likewise have been used for millennia to generate art as well.
🤯
Two things amaze me... First, the AI-aspect which I will need more time to study (it's new to me).
Second: Mate... they still make folded printer paper like that? It's been decades since I last saw it. You're near a mainframe ammirite?
I went looking on amazon for it after watching. LOL
Please do a walk-through of the Stable Diffusion code.
So Stable diffusion is a for loop at the end of the day, impressive 😌
Eagerly waiting for the deep dive video👍👍👍
Very insightful, although the one thing he didn't do a great job explaining is why it's better to ask it to go from T to T0 and then add back noise equal to some lower T instead of just going from T to T-1 to T-2 and so on. He sort of touched on it vaguely but didn't really explain why it's actually better
When you say you remove the noise and then add most of it back, do you mean you add back the predicted noise, or you add back newly generated noise? And both approaches seem plausible, so what is the reasoning?
He explains the "add noise - estimate noise - substract noise - add most noise back - repeat" loop about four times, but then when it comes to how any of this relates to producing images of the actual prompts instead of random noise, it's just "oh GPT embedding", as if that's self-explanatory. Somewhat in the category of that 'draw the rest of the owl' meme for me I'm afraid.
Follow-up video is going to be great!
"The image is already complete inside the noise, before I start my work. It is already there, I just have to remove the superfluous noise"
The suggested follow up with details of the program would be great!
15:15 Why would we amplify the difference and not the parts that stay the same? Might be in the wrong here but in my understanding we are trying to use the parts that both the text guided and unguided noise produced to get the best output since the parts they both produced will be the best fit. Why then use the difference
very interesting, great explication, thanks
Curiously, this is the same process as creating a stone sculpture: you start with a block of stone with no shape and gradually take away all parts of the stone that are not shaped like the thing you are sculpting.
perfect explanations
nice
My brain hurts.
Can we start a gofundme for Mike as his talks are so good and I wouldn't want to see him spend more money on Google dev access?
i am a simple man. i see Mike Pound, i click
Thank you, now I can actually understand what the methods behind these generators are all about
I wish I could understand this. You must be a genius!
Where can I find his other videos where he is delving into the code etc? He explains very good.
@@ProxyAuthenticationRequired what a smart guy
it seems to me, this video you are looking for is not out yet, but I guess it will come on this channel soon :)
There's another video on the channel titled "Stable Diffusion in Code (AI Image Generation) - Computerphile" - am presuming that's the one you mean
Awesome video. That’s the clearest explanation I’ve seen. The hand drawn explanation explained it so perfectly. Would love to see a follow up video that goes through the code. Also would be awesome to include examples when talking about the muppets in the kitchen, etc.
stable diffusion is the best thing since sliced bread.
make this a series! it's hard to understand how this thing works, but it's more useful than ever to understand, because it's open source and runs on consumer graphics cards and everyone can hack on it!!
there's a lot more to cover. like model finetuning, textual inversion and dreambooth
img2img, txt2video, inpainting, outpainting, etc etc etc
@@stephenkamenar I can tell you that in-painting and out-painting are both kinda similar and straightforward.
You start with your image, replace your "empty" pixels with fresh random noise, and run your cycles again, but this time with much lesser starting t, so that the network tries to incorporate existing image because it assumes that there is not that much noise
Using pure noise and "returning it to the noise free original" reminds me of a quote:
"It is easy. You just chip away the stone that doesn’t look like David."
@@stephenkamenar I believe img2img is the same process, it just uses your supplied input image plus noise and a lower t-value as the first input to the diffusion network. Which lets it keep some of the structure of your input, the same way during training it learned to keep the structure of the bunny.
Thank you so much for talking about this topic! Great and very enjoyable!
Love that he draws on old dot matrix paper.
Yay new Mike Pound video
As others have said, I'd love to see another video that elaborates on how the text prompt is factored into the model.
New camera/processing? looks good!
It is amazing that something like that works
Love that he’s drawing on dotmatrix printer paper !!
Could you add foreign language subtitles to these videos? Right now I'd need German subtitles. I've always loved your explainers but AI image generators is the first time where I'd need to show a video of yours to someone who's not fluent in English. TH-cam offers options to crowdsource subtitles BTW. Thanks a lot, keep up the good work! ☺
TH-cam has a function of automatic subtitles.
Computer Phile, this is very Good and intuitive 😊
Great video!! Would be interesting to see the differences between dall-e ans stable Diffusion. And how the last one requires less training and compute power
Dr. Pound be teaching me CS since 2015.
Just starts trowing loose change at him at the end😂😂
Jokes aside, loved the video, I finally somewhat grasp how this black magic works
I feel the GPT embedding is skipped over quite easily. With my limited knowledge on how these text Transformers work there is a vector representation of the description that in stead of representing the words represent the 'meaning' of the text. However how does this transform to an image that does exactly that? You would still need a lot of training data that confirms whether an image matches a description, right? Did they use alt descriptions or something similar for this, eg (publicly) available image descriptions? I guess my point is that i do not see how this 'knowledge' of what's in the text is transferred to the 'knowledge' of what's in the image, apart from there being a mechanism of steering it towards the 'knowledge' but where does this knowledge come from?
The network is basically implicitly imbued with this understanding through the training process:
For the training process of these conditional models, you take a labeled dataset {(image, label)}, and the label is a natural-language description text *describing that image* . For a particular data sample, the network will then see a noisy version of this image, along with the embedding of the description of the original clean image. The training loss then rewards the network for reproducing this image given these two bits of information, and punishes it for reproducing *any* other image. I hope it's clear that the training will therefore steer the network into a direction of 'understanding' these descriptions in some abstract way.
That this understanding of the input text is *abstract* is actually what you hope allows the model to extrapolate, so it can generate new images conditioned on new texts not seen during training. That this extrapolation works ridiculously well is one of the amazing things about these models.
concise explanation Sir but why do we add noise to the original image?
Amazing explanation. Thank you!
Hiii Mr. Pound, hope you've been doing well. Love you!
Is that Professor Brian Cox asking the questions, excellent show. ..
Please do that next video!!! You guys are great.
A glib native English speaker, and that's OK, however, its crucial to define many bits of jargon at the outset, e.g., "NOISE", "IMAGE", etc....All of this technology has adequate well-understood jargon that originated in the 1960s or earlier, and as a result, the wheel is continually re-invented, probably imperfectly, as well! Computer "scientists" don't appear to exploit earlier software art very well....
On a side note: Is there some supply back room full of that old continuous stationary somewhere on campus? Good job recycling it, since there is no other use for it anymore.
Thank you for covering this topic!
Interesting to see (literally). Thanks for sharing.
Great video. Is there a name for the iterative denoising procedure ?
You did a great job explaining how the process works and provided visual examples. Nice work with this video.
Waiting for the follow up video!
it looks to me like he's tweaking a bit with that shoulder :)
Great video, btw! Thanks
I definitely want to see the follow up video.
Love this channel, it even helped me with IT certifications. Diffie Hellman for the win!!!
Nice explanation
I really liked the explanation. Clear and easy to grasp. Thanks!
Fantastic explanation!
That's why you don't denoise too much if you just want to "beautified" the original image. The more noise, the more random image SD will generate.
amazing explanation!
Always nice to listen to a lefty :)
Very interesting, the amount of areas where machine learning algorithms have improved to such a degree is getting more every day.
Glad you could explain the process behind stable diffusion
"It is easy. You just chip away the stone that doesn’t look like David."
Great explanation. Thanks.
TH-cam just stopped recommending me computerphile and numberphile videos, do you know how badly I wanted to know this info 4mo ago?
Stable Diffusion is very easy to run on your own computer. Used a version with an easy to use web ui and it takes maybe 10-15 sec to produce one 512x512px image. My pc is an around $1000-1200 gaming pc. I imagine with even a basic modern laptop you can get to under a minute.
Identifying objects within images is now more accurate thanks to sophisticated recognition systems. SmythOS incorporates AI for image recognition, enhancing features like photo management and security surveillance.
Excellent!!
Really hard to follow, but one thing i did learn. Lots of noise.