OpenAI presents Sora, a text-to-video model for generating high-quality video from text prompts. In this video, we give a high-level overview of how Sora works.
Honestly it'd be really nice to see the open source community catch up to this scale of operation one day! They are the backbone of AI progress, but they rarely manage to innovate with actual public models to use.
I completely agree. Most of these models are closed-source, so it's mainly showcasing their R&D capability, but as you said, the public doesn't benefit much from these. I also hope the open source community can catch up soon.
Thank you for such a concise explanation.
Thanks for watching!
Thank you so much for posting this, that is really neat. I can't wait to try it out; I heard that the release is due in either March or April 2024. Now I have a few questions: do you have any idea of Sora's capacity to generate fast-paced action scenes? Let's say a kung-fu training fight? How fast will the movements be? I saw that Land Rover video and the mountain biking one too, but they did lack speed in my humble opinion. Now I know it's in beta testing and it's already a massive leap in the generative video world, but I can't tell you the amount of frustration I have had using other platforms like Runway, so I am expecting Sora to be a huge step forward, to say the least. The time wasted to get what you need with Runway (if you ever do, lol) is insane, so I can't wait to try out Sora. Oh, and one last question you may have the answer to: will Sora have the capability to reuse the same characters in different videos? That's an important one. I hope so! Thank you again for posting, and I hope you can bring some clarity to these questions.
RE: release date
- I don't know when this will be released. Hopefully soon!
RE: fast movement
- Yes, it would be very interesting. I believe the model should have that capability, but we may need some prompt engineering to steer the model toward what we have in mind (similar to ChatGPT based on GPT-3.5). It probably won't be an issue in future iterations.
RE: Consistent character
- That would be awesome for many applications. I am pretty sure it is feasible with existing techniques. Consistent characters, customization, and more fine-grained controllability are all great directions Sora can bring to the market. I believe it's going to have a deep impact on many industries.
Thank you for taking the time to reply, Professor. Generative AI is evolving so rapidly that we can definitely expect some major changes before the last quarter of 2024. I truly hope they improve the fluidity and speed of movement and the reuse of characters, so that we can at least create our own generative AI short movies. Another point: it would be great to include a feature where we can make the characters articulate sentences, along with ambient background audio. I truly have faith in OpenAI for this; maybe integrating with Eleven Labs could be a way to go.
Thank you again for your response, dear Professor.
@@julienjames7216 Yes, I am very excited about the rapid progress as well! Exciting times ahead!
The cat photograph used in the video belongs to Tombili ("chubby" in Turkish), a street cat from Istanbul. Tombili was known for its stylish sitting pose.
I love that cat! I think they even made a sculpture of Tombili after the cat died!
@@jbhuang0604 Yeah true, the statue was stolen and found after a month 😄
Hi Professor, thank you for your explanation.
However, I think that at 1:03 in the video, the up-sampling of the image is performed by the 'decoder', not by the diffusion model. The animation seems to suggest that the diffusion model produces the high-resolution images.
Thanks for your time.
Sorry for the confusion. I introduced two mechanisms for high-resolution generation: 1) cascade diffusion models and 2) latent diffusion models. In cascade-based approaches, the upsampling is done via a super-resolution diffusion model. Sora is likely using only a video decoder that upsamples the denoised clean latent into high-resolution images/videos.
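A rough sketch of the two pipelines, just to make the distinction concrete. All shapes and functions here are illustrative stand-ins, not the actual Sora architecture:

```python
import numpy as np

def cascade_pipeline():
    # 1) a base diffusion model generates a low-resolution RGB video
    low_res = np.random.rand(16, 64, 64, 3)           # (frames, H, W, RGB)
    # 2) a super-resolution *diffusion* model upsamples in pixel space
    #    (nearest-neighbor repeat is a stand-in for that model)
    high_res = np.repeat(np.repeat(low_res, 4, axis=1), 4, axis=2)
    return high_res                                    # (16, 256, 256, 3)

def latent_pipeline():
    # 1) diffusion denoises a compressed spatiotemporal latent
    z_0 = np.random.rand(4, 32, 32, 8)                 # (t, h, w, channels)
    # 2) a learned *decoder* (not a diffusion model) maps the latent to RGB
    t, h, w, c = z_0.shape
    return np.random.rand(t * 4, h * 8, w * 8, 3)      # (16, 256, 256, 3)
```

Both end at the same output resolution; they differ in whether the upsampling is done by another diffusion model (cascade) or by a single decoder (latent).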
@@jbhuang0604
Thanks for your explanation.
I checked the Imagen paper; they use a text-to-image diffusion model plus super-resolution diffusion models to produce the high-resolution image, which corresponds to the decoder's output in a latent diffusion model. I used to think the main difference between cascade and latent diffusion models was just that one uses low-resolution images and the other uses latent representations, with both employing an encoder-diffusion-decoder pipeline. In Imagen, it seems that the diffusion model can also serve in the 'decoder' role. Am I right?
Wonderful explanation... thanks, Prof.
Thanks a lot!
Thank you for your contributions Prof~
You are very welcome! Happy that you like it!
Do I understand it correctly that the latent z at train time for Latent Diffusion Models is lower resolution than the z^_0 used at test time (while the decoder is kept the same)?
The spatiotemporal latent z is a compressed version of the RGB video.
The training and inference of diffusion models happen in the latent space. To generate video, we start with a pure Gaussian noise z_T and progressively denoise it (using a denoising network conditioned on the text prompt) to get a clean latent z_0. Then, we can use the same trained decoder to convert the clean spatiotemporal latent back to an RGB video.
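The loop described above can be sketched in a few lines. The denoiser and decoder here are trivial stand-ins (the real ones are large trained networks), and all shapes are illustrative:

```python
import numpy as np

def denoise_step(z_t, t, text_emb):
    # stand-in for the denoising network conditioned on the text prompt
    return z_t - 0.1 * z_t

def decode(z_0):
    # stand-in for the trained video decoder: spatiotemporal latent -> RGB
    t, h, w, c = z_0.shape
    return np.zeros((t * 4, h * 8, w * 8, 3))  # (frames, H, W, RGB)

T = 50
text_emb = np.zeros(512)                # embedding of the text prompt
z = np.random.randn(4, 32, 32, 8)       # z_T: pure Gaussian noise in latent space
for t in range(T, 0, -1):
    z = denoise_step(z, t, text_emb)    # progressively remove noise -> z_0
video = decode(z)                       # same decoder at train and test time
```

The key point is that every diffusion step operates on the small latent `z`; only the final `decode` call brings it back to full-resolution RGB.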
Nice explanation!
Glad it was helpful!
1:07 Can you provide more info on this point please?
What exactly is the latent space?
And how does the VAE work?
No problem. You can check out the video on diffusion models here: th-cam.com/video/i2qSxMVeVLI/w-d-xo.htmlsi=jz27-Gf4BcjSSptV&t=21, where I talked about the training based on maximum likelihood and the training objective of VAE.
So if I give it a prompt to make a cartoon animation video, can it do that too? Is there a limit on the prompt length? Can it follow a script?
Yup, I think Sora can already do that. It has demonstrated the ability to synthesize videos in various styles (e.g., Minecraft). It now supports generating videos up to one minute long. It's very likely that with the next iteration we will see videos of several minutes following a long script.
It doesn't use keyframes for long video gen.
They most likely train by cutting chunks out of the space-time latent and having the model predict the missing frames.
What I meant is that they likely used a cascade diffusion method. The "keyframes", of course, are not RGB images. Sora demonstrated great results on interpolating between two frames/videos, so I guess that's probably how they handle long video generation.
The patches are space only (x, y) or space-time(x,y,t)?
I believe that the “patches” are spatiotemporal patches. I used 2D just for visualization.
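To illustrate what a spatiotemporal patch looks like, here is a toy sketch that chops a small (time, height, width, channels) latent into space-time blocks and flattens each into a token. The patch sizes and latent shape are made up for illustration:

```python
import numpy as np

# Toy spatiotemporal latent: (time, height, width, channels)
z = np.random.rand(8, 16, 16, 4)

pt, ph, pw = 2, 4, 4   # illustrative patch size along (t, y, x)

patches = []
for t0 in range(0, z.shape[0], pt):
    for y0 in range(0, z.shape[1], ph):
        for x0 in range(0, z.shape[2], pw):
            block = z[t0:t0+pt, y0:y0+ph, x0:x0+pw]  # a (2, 4, 4, 4) space-time block
            patches.append(block.reshape(-1))        # flatten into one token

tokens = np.stack(patches)   # (num_patches, patch_dim): the transformer's input sequence
```

Each token covers a little cube of space *and* time, which is why the same sequence format handles images (a single frame) and videos uniformly.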
Thank you for sharing.
You are welcome!
It's so over bruh
Hahahah
Could video game graphics be processed in this way?
Certainly, this is just the first step. I am sure this will have a great impact on the game industry very soon.
Great Video!
Glad you enjoyed it
so when will sora be available to use?
We don't know. Hopefully soon!
Explained in depth yet accessibly, clear and concise. Great!
Thanks!
When can we use this?
Hopefully soon! But OpenAI definitely needs to put many guardrails in place to avoid misuse of such technology.
Sora = THE MATRIX pre alpha version 0.1
The progress is incredible!
imagine it in the next 5 years
Yup, the rate of progress is incredible. Can’t imagine what this will look like…
The world and my brain after this: 💀
Indeed!
Still have no clue. I don't think I ever will though.
You can do it bruh!
Matrix incoming
Indeed!!
The people feel a bit too stiff, kind of robotic.
Yup, that's true! But it's just a start. I can't imagine what the results will look like one year from now.
So in other words, you feed it copyrighted videos and it spits those copyrighted videos out again? So much for "AI".
I believe they would need to make sure that the training videos are either synthetically generated (e.g., via Unreal Engine) or licensed.
It won't generate the exact same video found in the training data.
Each person's essence is etched in their dreams and daily pursuits, especially in their chosen profession. Yet, the looming shadow of AI replacing human roles casts doubt on our future. It's confounding why some fervently champion innovations that jeopardize our fundamental right to earn a living. Beware, the allure of novelty can exact a devastating toll. Remember.
It's baffling why those in power don't halt technologies that threaten lives, and why courts or human rights groups don't step in. Ordinary folks uphold the world's wealth structure, yet their job security is at risk. It's the duty of every government to protect their people's livelihoods.
That’s deep bro
Too much blah blah. How can I make my own video? That's the point.
I hope we can have access soon! It’s super exciting!
Oh god this is horrible
Like having your dad do your homework. Grow up and start creating yourself. So many fake artists are incoming with this crap.
I don't think this will entirely replace human creativity. In fact, it will allow more people with limited artistic skills to express themselves visually.