GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  • Published on 15 Jun 2024
  • #glide #openai #diffusion
    Diffusion models learn to iteratively reverse a noising process that is applied step by step during training. The resulting model can be used for conditional generation as well as various other tasks such as inpainting. OpenAI's GLIDE builds on recent advances in diffusion models and combines text-conditional diffusion with classifier-free guidance and upsampling to achieve unprecedented quality in text-to-image samples.
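
    As a minimal sketch of the classifier-free guidance trick mentioned above (my own illustration, not OpenAI's code; guided_eps, model, text_emb, and null_emb are hypothetical names for a noise-predicting network and its caption/empty-caption conditioning):

        def guided_eps(model, x_t, t, text_emb, null_emb, guidance_scale=3.0):
            # Predict the noise once with the caption and once with the
            # empty caption (PyTorch-style tensors assumed).
            eps_cond = model(x_t, t, cond=text_emb)
            eps_uncond = model(x_t, t, cond=null_emb)
            # Extrapolate from the unconditional prediction toward the
            # conditional one; a scale > 1 trades diversity for fidelity.
            return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
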
    Try it yourself: huggingface.co/spaces/valhalla/glide-text2im
    OUTLINE:
    0:00 - Intro & Overview
    6:10 - What is a Diffusion Model?
    18:20 - Conditional Generation and Guided Diffusion
    31:30 - Architecture Recap
    34:05 - Training & Result metrics
    36:55 - Failure cases & my own results
    39:45 - Safety considerations
    Paper: arxiv.org/abs/2112.10741
    Code & Model: github.com/openai/glide-text2im
    More diffusion papers:
    arxiv.org/pdf/2006.11239.pdf
    arxiv.org/pdf/2102.09672.pdf
    Abstract:
    Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at this https URL.
    Authors: Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    TH-cam: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    LinkedIn: / ykilcher
    BiliBili: space.bilibili.com/2017636191
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Comments • 75

  • @YannicKilcher
    @YannicKilcher  2 years ago +8

    OUTLINE:
    0:00 - Intro & Overview
    6:10 - What is a Diffusion Model?
    18:20 - Conditional Generation and Guided Diffusion
    31:30 - Architecture Recap
    34:05 - Training & Result metrics
    36:55 - Failure cases & my own results
    39:45 - Safety considerations
    Paper: arxiv.org/abs/2112.10741
    Code & Model: github.com/openai/glide-text2im
    Try it yourself: huggingface.co/spaces/valhalla/glide-text2im
    More diffusion papers:
    arxiv.org/pdf/2006.11239.pdf
    arxiv.org/pdf/2102.09672.pdf

    • @LOFIENJOYTIME
      @LOFIENJOYTIME 1 year ago

      I would like to ask about the FID and IS values of the experiments with classifier-free guidance and CLIP guidance on MS-COCO 64×64 in Figure 6 of the paper. What are their respective values? Do they give different results because of the different scales?

  • @chickenp7038
    @chickenp7038 2 years ago +8

    This is the best day!! Elon talking ML with Lex, and then a GLIDE review!!!!!

  • @theodorosgalanos9663
    @theodorosgalanos9663 2 years ago +3

    As a designer, the progressive generation of an artifact from this paper is what impressed me the most. Really great stuff!

  • @Vikram-wx4hg
    @Vikram-wx4hg 2 years ago

    That was a fantastic review/tutorial. Thanks Yannic, enjoyed it!

  • @MrStarchild3001
    @MrStarchild3001 2 years ago +3

    Super helpful! Thank you.

  • @paulcurry8383
    @paulcurry8383 2 years ago +2

    Love your videos! Can’t tell how cherry picked these outputs are, but still impressive!

  • @jackshi7613
    @jackshi7613 2 years ago +1

    Very cool and very impressive ...

  • @AiveanZ
    @AiveanZ 2 years ago +33

    Is there a community effort to replicate the large model? Based on the examples it's amazingly capable, and I believe everyone would benefit from having access to it (as happened with GPT-J from EleutherAI).

    • @democratizing-ai
      @democratizing-ai 2 years ago +5

      Yes, we at LAION are creating an image-text dataset with billions of samples, and someone from Eleuther is working on the code. It also seems that we can get sufficient compute donations, but we will have to see what the next year brings :)

    • @ZedaZ80
      @ZedaZ80 2 years ago +5

      Just make an AI do it, sheesh :P

    • @Neptutron
      @Neptutron 2 years ago +1

      I wonder this too... that being said, ruDALL-E took absolutely forever to come out...

    • @mgostIH
      @mgostIH 2 years ago +1

      @@Neptutron It took less than a year!

    • @HoriaCristescu
      @HoriaCristescu 2 years ago +2

      @@ZedaZ80 I'd love to see an AI that can reimplement a model from Yannic's review video.

  • @alpers.2123
    @alpers.2123 2 years ago +6

    Maybe mixing models would help: generate an image with DALL-E, then feed it to GLIDE to get a better and bigger version.

  • @makakogordo
    @makakogordo 2 years ago

    Thanks for another excellent video bb

  • @albertwang5974
    @albertwang5974 2 years ago

    This is a magic model!

  • @justfoundit
    @justfoundit 2 years ago +1

    Why not use CLIP on the final image and roll the gradient back through all t steps? Would it explode/vanish?

  • @sapiranimations
    @sapiranimations 2 years ago

    Unlike DALL-E, which really seemed to draw semantic meaning from text prompts, even ambiguous ones, this model seems to be more or less copy-pasting existing/common concepts to fit the text prompt. So when asked for something that doesn't exist, like triangular wheels or a mouse running after a lion, it seems to struggle: it just pastes a wheel and something triangular on top of it. This is why I still have more faith in the sequential transformer approach. It seems to really force the model to understand the relation between text and image.

  • @aaron6807
    @aaron6807 1 year ago

    We have come a long way

  • @paulcurry8383
    @paulcurry8383 2 years ago +11

    Thank god they didn’t release the full model, otherwise a gang of 256x256 images would kill my family

    • @brun301
      @brun301 1 year ago

      Ooh, this did not age well 😂 Are you still alive?

  • @jonginkim941
    @jonginkim941 2 years ago

    Hi, thank you for the great video! Can I ask what app you use to annotate PDFs in your videos? It looks nice, as it adds extra margins to write on.

  • @paxdriver
    @paxdriver 2 years ago +8

    Video game asset designers are going to be professional narrators 5 years from now. We'll have come full circle from stone-age bonfires and mythologies to creating religions out of GAN/diffusion hybrid models and burning GPUs to punish apostates lol

  • @correctionquest4075
    @correctionquest4075 2 years ago +1

    5:42 Old Gregg!

  • @erikpaul8647
    @erikpaul8647 2 years ago

    I mean - that treasure is definitely well hidden. Can't argue with that.

  • @shubham-pp4cw
    @shubham-pp4cw 2 years ago +2

    I have two questions for Yannic:
    1) How do you select which ML research papers to cover, and how long does it take you to read a paper? (Any advice so that I can also read ML papers completely in one go?)
    2) After reading a paper, do you grasp its intuition right away, or do you read it again and pull in references from elsewhere? (What are the steps to understand a paper's meaning and its implementation?)

  • @Phenix66
    @Phenix66 2 years ago +5

    The fucking future is here... And I'm sitting there, being happy that I can disentangle MNIST...

    • @vsiegel
      @vsiegel 2 years ago +1

      Yes. And it came quicker than expected. Exponential AI development feels like time contraction.

  • @tempdeltavalue
    @tempdeltavalue 2 years ago

    1. What is a diagonal Gaussian?
    2. What does it mean to predict the noise? (Does the model generate an image, at the resolution of the original, that is just the noise?)

    • @YannicKilcher
      @YannicKilcher  2 years ago

      1. A diagonal Gaussian is a multivariate Gaussian distribution with a diagonal covariance matrix.
      2. I'm not sure what you mean here. I might not have been accurate in the video.
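
      A minimal numerical illustration of point 1 (a sketch, not from the video):

          import numpy as np

          # Diagonal Gaussian: each dimension has its own variance and
          # there are no cross-correlations between dimensions.
          mean = np.zeros(3)
          cov = np.diag([0.5, 1.0, 2.0])  # diagonal covariance matrix
          samples = np.random.multivariate_normal(mean, cov, size=1000)

          # Equivalent and cheaper: sample each dimension independently,
          # scaling unit noise by the per-dimension standard deviations.
          samples_fast = mean + np.sqrt(np.diag(cov)) * np.random.randn(1000, 3)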

  • @alexijohansen
    @alexijohansen 2 years ago

    They don't say how many noising/denoising steps they use?

  • @philosophicalgamer2564
    @philosophicalgamer2564 2 years ago

  • @JTMoustache
    @JTMoustache 2 years ago +2

    Weights & Biases is on holiday!
    👏🏾
    Great video 👏🏼
    👏🏿
    Precision/recall curves are drunk

  • @CharlesVanNoland
    @CharlesVanNoland 2 years ago

    Years ago, I concluded that this is roughly how video games would be designed at some point in the future.

  • @GeneralKenobi69420
    @GeneralKenobi69420 2 years ago +8

    Weren't there a couple of hobbyists trying to replicate DALL-E for the last year? And then this comes out and annihilates it lol. F for them I guess

    • @YannicKilcher
      @YannicKilcher  2 years ago +6

      There are still differences, for example this model takes quite a while (15 seconds on an A100) to produce a sample.

    • @alexnichol3138
      @alexnichol3138 2 years ago

      @@YannicKilcher DALL-E is slower, I believe, and requires expensive CLIP reranking as well

    • @dreadfulbodyguard7288
      @dreadfulbodyguard7288 2 years ago

      Commenting after the release of DALL-E 2 lmao.

  • @rb8049
    @rb8049 2 years ago +2

    Pretty impressive. Next is to develop some useful tools for artists and engineers. One could clearly develop subject-specific models.

  • @nostradamus9132
    @nostradamus9132 2 years ago

    Can somebody recommend a publicly available image generation model?

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 2 years ago +2

    'If the noise is normally distributed, it will end up normally distributed'; the central limit theorem would say that's a limiting belief.

    • @YannicKilcher
      @YannicKilcher  2 years ago

      Wise words!

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 2 years ago

      @@YannicKilcher Not untrue, mind; just limiting. Was that a Swiss accent, or did you actually say "iff", though?
      Anyway, these results are amazing. I was pretty hyped about diffusion models before... but after this I'd go as far as to say image generation is 'solved'. Most of these are well past the uncanny valley, which isn't something I'd say of any GAN except in rather limited domains.
      Good point about the sensitivity to text; it does seem the discriminative training is superior in that regard. But then why does the CLIP loss not deliver? Could the apparent textual intelligence be due to the sampling/filtering step that DALL-E does at the end?

  • @phaze7272
    @phaze7272 2 years ago

    5:41

  • @vsiegel
    @vsiegel 2 years ago +3

    The million-dollar question: what is the stock ticker symbol of the first company generating convincing personalized erotic photography?
    And two further areas of work became redundant where people _really_ did not expect it.
    It is hard enough to convince people that AI can be creative at all.
    To me, this is in the category of things that are hard to believe exist even when you know how they work.

    • @killers31337
      @killers31337 2 years ago

      I think we are currently in an awkward stage when it comes to commercialization:
      1. Models are still so simple that even one person can create a significant improvement, and a small team can achieve best-in-the-world practical results.
      2. New, better models come out every year.
      So the first company to do something is unlikely to be the one that wins in the end.
      E.g. company X makes the best porn generation model of 2022. In 2023 company Y makes a better model. And in 2024 yet another company makes an even better one.

    • @vsiegel
      @vsiegel 2 years ago

      @@killers31337 Yes, that is similar to the beginning of Facebook. Many people knew that something like it had become possible, and over time various factors change, the potential user base for example. It is not clear when the right moment is. It gets easier over time, but it is not known how that happens. And a company that tries it at the right moment succeeds.

    • @vsiegel
      @vsiegel 2 years ago

      @@killers31337 An interesting aspect is that a company seriously attempting porn generation could easily find investors and could afford huge amounts of computing power, realistically spending tens of millions of USD on it. Starting now would have good chances, I guess. Anybody here who thinks being rich makes you happy? That's your chance! [nobody is interested]

    • @vsiegel
      @vsiegel 2 years ago

      @@killers31337 On 1: You could say it is a young science with many open questions. But steps of improvement are often the idea of a single person; that is the normal case. I would say there are still low-hanging fruits available for progress.
      For practical use, I think there are two types of AI problems in general. The ones that are simply solved: anybody can use the algorithm and train a model.
      And the ones that depend on large amounts of training data, compute, or memory. The algorithm is known, but to use it you need to come up with the training data, which may be difficult and expensive.

    • @sapiranimations
      @sapiranimations 2 years ago

      @@killers31337 1. I don't think that is true anymore. OpenAI spends millions of dollars' worth of electricity and GPU equipment to train every single one of these multi-billion-parameter models. No small team can afford to do that. And many of these "clever" models only show their true potential when run at this scale.

  • @qeter129
    @qeter129 2 years ago

    There are probably over a million well-annotated images on some of the large r34 websites. Why not train such a model on them? The economic value of a well-functioning system would be unquestionable.

  • @thegistofcalculus
    @thegistofcalculus 2 years ago +2

    Hmm... could it take in a badly photoshopped image and realify it?

    • @YannicKilcher
      @YannicKilcher  2 years ago +5

      Yes, probably, if you put some noise onto it and then let the model denoise it.
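
      A rough sketch of that idea (my own; the q_sample/p_sample names mirror OpenAI's guided-diffusion code but are assumptions here):

          import torch

          def realify(image, diffusion, model, t_start=250):
              # Forward process: jump straight to an intermediate timestep,
              # destroying fine detail (including the photoshop artifacts).
              t = torch.tensor([t_start])
              x_t = diffusion.q_sample(image, t, noise=torch.randn_like(image))
              # Reverse process: denoise from t_start back to 0, letting the
              # model repaint the destroyed detail realistically.
              for step in reversed(range(t_start)):
                  x_t = diffusion.p_sample(model, x_t, torch.tensor([step]))["sample"]
              return x_t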

    • @vsiegel
      @vsiegel 2 years ago +2

      You ask a science fiction question, and get a science answer less than an hour later. [mind blown]

    • @thegistofcalculus
      @thegistofcalculus 2 years ago +1

      @@vsiegel It was not a question pulled out of a Gaussian question distribution.

    • @thegistofcalculus
      @thegistofcalculus 2 years ago

      @@YannicKilcher Thank you for the clarifying answer

    • @laurenpinschannels
      @laurenpinschannels 2 years ago

      @@thegistofcalculus It wasn't? Not even context-conditioned? ;)

  • @fredrik241
    @fredrik241 2 years ago

    Who's checking whether the majority of the 'generated' images aren't mostly copied from a training input image?

  • @blengi
    @blengi 2 years ago +1

    I don't know, seems intuitively analogous in some limit and makes me more convinced my own myopic reality is just an incidental byproduct of an ex nihilo emergent, dreaming hyper dimensional Boltzmann like entity, reordering and compactifying lower entropy information configurations like its looking for a compelling reason to keep these things extant lol...

  • @brandonmckinzie2737
    @brandonmckinzie2737 2 years ago

    wait am I first?

  • @herp_derpingson
    @herp_derpingson 2 years ago +3

    38:10 The treasure is hidden! That's why you can't see it.
    .
    It's like teaching people that Photoshop exists. ML research is expensive; there should be no shame in making back the money from it.
    Damn, I wish someone trained this on hen... I mean anime.

    • @YannicKilcher
      @YannicKilcher  2 years ago +3

      Oh no, I'm all for making money, but don't tell me some BS about safety 😁

  • @twobob
    @twobob 1 year ago

    This is the same reason given by voice cloning repositories. Safety. Whatever. Sure...

  • @MrBcool88
    @MrBcool88 1 year ago

    "No one used GPT-2 to spread fake news" ("wait, why has no one?") Yannic goes on to build GPT-4chan

  • @xl0xl0xl0
    @xl0xl0xl0 2 years ago

    The small model is pretty helpless.

  • @justinwhite2725
    @justinwhite2725 2 years ago

    AI Dungeon came under fire because users were creating very detailed child porn with GPT-2. They had to overhaul their system to detect and block such activities.

    • @thebunfromouterspace
      @thebunfromouterspace 2 years ago +4

      That was GPT-3, which, by the way, was entirely under OpenAI's control at the time. Yet they still took over a year to even see it as an issue.