OpenAI presents Sora, a text-to-video model for generating high-quality video from text prompts. In this video, we give a high-level overview of how Sora works.
Honestly it'd be really nice to see the open source community catch up to this scale of operation one day! They are the backbone of AI progress, but they rarely manage to innovate with actual public models to use.
I completely agree. Most of these models are closed-source, so it's mainly showcasing their R&D capability, but as you said, the public doesn't benefit much from these. I also hope the open source community can catch up soon.
Thank you for such a concise explanation.
Thanks for watching!
Thank you so much for posting this, that is really neat. I can't wait to try it out; I heard that the release is due in either March or April 2024. Now I have a few questions: do you have any idea of Sora's capacity to generate fast-paced action scenes? Let's say a kung-fu training fight? How fast will the movements be? I saw that Land Rover video and the mountain biking one too, but they did lack speed in my humble opinion. Now I know it's in beta testing and it's already a massive leap in the generative video world, but I can't tell you the amount of frustration I have had using other platforms like Runway, so I am expecting Sora to be a huge step forward, to say the least. The time wasted to get what you need with Runway (if you ever do, lol) is insane, so I can't wait to try out Sora. Oh, and one last question you may have the answer to: will Sora have the capability to reuse the same characters in different videos? That's an important one. I hope so! Thank you again for posting, and I hope you can bring some clarity to these questions.
RE: release date
- I don't know when this will be released. Hopefully soon!
RE: fast movement
- Yes, it would be very interesting. I believe the model should have that capability, but we may need some prompt engineering to steer the model toward what we have in mind (similar to ChatGPT based on GPT-3.5). It probably won't be an issue in future iterations.
RE: Consistent character
- That would be awesome for many applications. I am pretty sure it is feasible with existing techniques. Consistent characters, customization, and more fine-grained controllability are all great directions Sora can bring to the market. I believe it's going to have a deep impact on many industries.
Thank you for taking the time to reply, Professor. Generative AI is evolving so rapidly that we can definitely expect some major changes before the last quarter of 2024. I truly hope they improve the fluidity and speed of movement and the reuse of characters, so that we can at least create our own generative AI short movies. Another point: it would be great to include a feature where we can make the characters articulate sentences, along with ambient background audio. I truly have faith in OpenAI for this; maybe integrating with Eleven Labs could be a way to go.
Thank you again for your response, dear Professor.
@@julienjames7216 Yes, I am very excited about the rapid progress as well! Exciting times ahead!
The cat photograph used in the video belongs to Tombili ("chubby" in Turkish), a street cat from Istanbul. Tombili was known for its stylish sitting pose.
I love that cat! I think they even made a sculpture of Tombili after the cat died!
@@jbhuang0604 Yeah true, the statue was stolen and found after a month 😄
Hi Professor, thank you for your explanation.
However, I think that at 1:03 in the video, the up-sampling of the image is performed by the 'decoder', not by the diffusion model. The animation seems to suggest that the diffusion model produces the high-resolution images.
Thanks for your time.
Sorry for the confusion. I introduced two mechanisms for high-resolution generation: 1) cascade diffusion models and 2) latent diffusion models. In cascade-based approaches, the upsampling is done via a super-resolution diffusion model. Sora is likely using only a video decoder that upsamples the denoised clean latent into high-resolution images/videos.
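A rough sketch of the two pipelines, just to make the distinction concrete. All shapes and functions here are illustrative stand-ins, not the actual Sora architecture:

```python
import numpy as np

def cascade_pipeline():
    # 1) a base diffusion model generates a low-resolution RGB video
    low_res = np.random.rand(16, 64, 64, 3)           # (frames, H, W, RGB)
    # 2) a super-resolution *diffusion* model upsamples in pixel space
    #    (nearest-neighbor repeat is a stand-in for that model)
    high_res = np.repeat(np.repeat(low_res, 4, axis=1), 4, axis=2)
    return high_res                                    # (16, 256, 256, 3)

def latent_pipeline():
    # 1) diffusion denoises a compressed spatiotemporal latent
    z_0 = np.random.rand(4, 32, 32, 8)                 # (t, h, w, channels)
    # 2) a learned *decoder* (not a diffusion model) maps the latent to RGB
    t, h, w, c = z_0.shape
    return np.random.rand(t * 4, h * 8, w * 8, 3)      # (16, 256, 256, 3)
```

Both end at the same output resolution; they differ in whether the upsampling is done by another diffusion model (cascade) or by a single decoder (latent).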
@@jbhuang0604
Thanks for your explanation.
I checked the Imagen paper; they use a text-to-image diffusion model plus super-resolution diffusion models to produce the high-resolution image, which corresponds to the decoder's output in a latent diffusion model. I used to think the main difference between cascade and latent diffusion models was just that one uses low-resolution images and the other uses latent representations, with both employing an encoder-diffusion-decoder pipeline. In Imagen, it seems that the diffusion model can also serve in the 'decoder' role. Am I right?
Wonderful explanation... thanks, Prof.
Thanks a lot!
Thank you for your contributions Prof~
You are very welcome! Happy that you like it!
Do I understand it correctly that the latent z at train time for Latent Diffusion Models is lower resolution than the z^_0 used at test time (while the decoder is kept the same)?
The spatiotemporal latent z is a compressed version of the RGB video.
The training and inference of diffusion models happen in the latent space. To generate video, we start with a pure Gaussian noise z_T and progressively denoise it (using a denoising network conditioned on the text prompt) to get a clean latent z_0. Then, we can use the same trained decoder to convert the clean spatiotemporal latent back to an RGB video.
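The loop described above can be sketched in a few lines. The denoiser and decoder here are trivial stand-ins (the real ones are large trained networks), and all shapes are illustrative:

```python
import numpy as np

def denoise_step(z_t, t, text_emb):
    # stand-in for the denoising network conditioned on the text prompt
    return z_t - 0.1 * z_t

def decode(z_0):
    # stand-in for the trained video decoder: spatiotemporal latent -> RGB
    t, h, w, c = z_0.shape
    return np.zeros((t * 4, h * 8, w * 8, 3))  # (frames, H, W, RGB)

T = 50
text_emb = np.zeros(512)                # embedding of the text prompt
z = np.random.randn(4, 32, 32, 8)       # z_T: pure Gaussian noise in latent space
for t in range(T, 0, -1):
    z = denoise_step(z, t, text_emb)    # progressively remove noise -> z_0
video = decode(z)                       # same decoder at train and test time
```

The key point is that every diffusion step operates on the small latent `z`; only the final `decode` call brings it back to full-resolution RGB.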
Nice explanation!
Glad it was helpful!
1:07 Can you provide more info on this point please?
What exactly is the latent space?
And how does the VAE work?
No problem. You can check out the video on diffusion models here: th-cam.com/video/i2qSxMVeVLI/w-d-xo.htmlsi=jz27-Gf4BcjSSptV&t=21, where I talked about the training based on maximum likelihood and the training objective of VAE.
So if I give it a prompt to make a cartoon animation video, can it do that too? Is there a limit on the prompt length? Can it follow a script?
Yup, I think Sora can already do that. It has demonstrated the ability to synthesize videos in various styles (e.g., Minecraft). It now supports generating videos up to one minute long. It's very likely that with the next iteration we will see videos of several minutes following a long script.
It doesn't use keyframes for long video gen.
They most likely train by cutting chunks out of the space-time latent and having the model predict the missing frames.
What I meant is that they likely used a cascade diffusion method. The "keyframes", of course, are not RGB images. Sora demonstrated great results on interpolating between two frames/videos, so I guess that's probably how they handle long video generation.
The patches are space only (x, y) or space-time(x,y,t)?
I believe that the “patches” are spatiotemporal patches. I used 2D just for visualization.
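To illustrate what a spatiotemporal patch looks like, here is a toy sketch that chops a small (time, height, width, channels) latent into space-time blocks and flattens each into a token. The patch sizes and latent shape are made up for illustration:

```python
import numpy as np

# Toy spatiotemporal latent: (time, height, width, channels)
z = np.random.rand(8, 16, 16, 4)

pt, ph, pw = 2, 4, 4   # illustrative patch size along (t, y, x)

patches = []
for t0 in range(0, z.shape[0], pt):
    for y0 in range(0, z.shape[1], ph):
        for x0 in range(0, z.shape[2], pw):
            block = z[t0:t0+pt, y0:y0+ph, x0:x0+pw]  # a (2, 4, 4, 4) space-time block
            patches.append(block.reshape(-1))        # flatten into one token

tokens = np.stack(patches)   # (num_patches, patch_dim): the transformer's input sequence
```

Each token covers a little cube of space *and* time, which is why the same sequence format handles images (a single frame) and videos uniformly.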
Thank you for sharing.
You are welcome!
It's so over bruh
Hahahah
Could video game graphics be processed in this way?
Certainly, this is just the first step. I am sure this will have a great impact on the game industry very soon.
Great Video!
Glad you enjoyed it
so when will sora be available to use?
We don't know. Hopefully soon!
Explained in depth yet accessibly, clear and concise. Great!
Thanks!
When can we use this?
Hopefully soon! But OpenAI definitely needs to put many guardrails in place to avoid misuse of such technology.
Sora = THE MATRIX pre alpha version 0.1
The progress is incredible!
imagine it in the next 5 years
Yup, the rate of progress is incredible. Can’t imagine what this will look like…
The world and my brain after this: 💀
Indeed!
Still have no clue. I don't think I ever will though.
You can do it bruh!
Matrix incoming
Indeed!!
The people feel a bit too stiff, kind of robotic.
Yup, that's true! But it's just a start. I can't imagine what the results will look like one year from now.
So in other words, you feed it copyrighted videos and it spits those copyrighted videos out again? So much for "AI".
I believe they would need to make sure that the training videos are either synthetically generated (e.g., via Unreal Engine) or licensed.
It won't generate the exact same video found in the training data.
Each person's essence is etched in their dreams and daily pursuits, especially in their chosen profession. Yet, the looming shadow of AI replacing human roles casts doubt on our future. It's confounding why some fervently champion innovations that jeopardize our fundamental right to earn a living. Beware, the allure of novelty can exact a devastating toll. Remember.
It's baffling why those in power don't halt technologies that threaten lives, and why courts or human rights groups don't step in. Ordinary folks uphold the world's wealth structure, yet their job security is at risk. It's the duty of every government to protect their people's livelihoods.
That’s deep bro
Too much blah blah. How can I make my own video? That's the point.
I hope we can have access soon! It’s super exciting!
Oh god this is horrible
Like having your dad do your homework. Grow up and start creating yourself. So many fake artists are incoming with this crap.
I don't think this will entirely replace human creativity. In fact, it will allow more people with limited artistic skills to express themselves visually.