ASL to English might be more useful, but I think the idea is great. The tricky part would be the dataset. ASL uses more than the hands, so you'd probably need different types of clothing to train it on, as well as different skin tones, etc.
@@johnsensebe3153 once the basic data set is created you can create the skeletal system, apply that to the person being interpreted, and it should be fine. The various datasets you could get from news broadcasts; the transcripts are usually in the CC/subtitles even if there's an interpreter on screen.
A really interesting idea. I had a similar idea some months ago but I couldn't do it myself. I think maybe you should focus on the link between words in order to create a meaningful sentence, like the TH-cam subtitle algorithm which can correctly transcribe audio to text most of the time. Combining that kind of algorithm with your lip reading idea, it might be good lip reading instead.
Diordnas Darkunn he’s mentioned the possibility of doing that before to save time on animation, though I personally think a more hard coded approach would work better than a neural network
@@agentstache135 I thought for sure that would be what this video was about. I would love to see how well that works, either for generating actual video from audio or for using it to animate the character's lips.
@@jacobfeinland7878 My idea for how to do it copied from my comment from the video where Cary mentions it (the dance one) because it's an essay I'm not rewriting: Why would you need an AI for animating the lips? Why not just write (or use, I’m sure it already exists) an algorithm that takes a transcript (handwritten or using existing speech recognition (which I know is probably still technically an AI)) of what you’re saying as input and then move the mouth? I’m sure there are some parts that you’d have to manually do, eg screaming, but it’d be a lot more reliable and robust than an AI based on the audio. If I were to code it I’d mine a dictionary for the International Phonetic Alphabet (or some other pronunciation respelling) representation of each word. Then just figure out what mouth shape you make and how long you make it for each sound and put it all together into an animation. Obviously you’d probably still need to tweak it some more, depending on how time-accurate your transcript is, and that might be where an AI could help. But, I still don’t think an AI would be robust enough for the whole process, especially for a pretty discrete animation where if it picks the wrong mouth shape it’s pretty noticeable. Whereas if you were to just use it to help with temporal alignment, it being wrong would only show up as a small offset, less noticeable.
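The dictionary-lookup pipeline described above could be sketched like this. Note the pronunciation dictionary and the phoneme-to-mouth-shape mapping below are tiny hand-made stand-ins for illustration; a real version would mine CMUdict (or an IPA dictionary) as the comment suggests:

```python
# Sketch of the transcript -> timed mouth-shape pipeline described above.
# PRONUNCIATIONS and VISEME are tiny hand-made stand-ins, not real CMUdict data.

# word -> list of ARPAbet-style phonemes (hypothetical mini-dictionary)
PRONUNCIATIONS = {
    "bee":   ["B", "IY"],
    "movie": ["M", "UW", "V", "IY"],
}

# phoneme -> one of a handful of animator mouth shapes (hypothetical mapping)
VISEME = {"B": "closed", "M": "closed", "IY": "wide",
          "UW": "round", "V": "teeth-on-lip"}

def mouth_shapes(transcript, shape_duration=0.1):
    """Return a list of (start_time, mouth_shape) keyframes for the transcript."""
    frames, t = [], 0.0
    for word in transcript.lower().split():
        for phoneme in PRONUNCIATIONS[word]:
            frames.append((round(t, 2), VISEME[phoneme]))
            t += shape_duration
    return frames

print(mouth_shapes("bee movie"))
# e.g. [(0.0, 'closed'), (0.1, 'wide'), (0.2, 'closed'), ...]
```

As the comment says, the fragile part in practice would be timing, not shape selection, which is where a learned aligner could still help.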
CMUdict strikes again! Looked to me like some successes here. Now you got me wondering if you'd go even further weighting words / word neighborhoods by commonness, or by taking morphosyntax into account. Oh, and so much yes to the sinking smiles at 3:54 - that slow letdown of throwing out a hopeful spike solution and watching it fail.
1:38 if you make the words "heaven high poop push" the opposite you get "hell low pee pull" which sounds like "hello people". Wow, I'm surprised I noticed that.
Not necessarily. He already incorporated the IPA (International Phonetic Alphabet), which is more accurate than any native spelling system: each sound truly maps to only one symbol. A language with fewer distinctive sounds and fewer allophones would be ideal. Italian has about 30 phonemes, which is a nicely low number, and 7 of them are vowels. And you want more vowels (I'd think) because vowels are created by shaping the airflow differently with the same tone: an A and an O do the same thing with your vocal cords, but for A you stretch your lips and for O you round them. The last part is the tongue, which you don't see, but your lips do move slightly as you go through the vowels. (Also, yes, I know Italian has A E I O U, so just 5 written vowels, but phonetic vowels have a different realization than vowels in written language. They also include diphthongs (vowels that merge into each other) and vowels with slightly different qualities; in English, "bad" and "bat" have different forms of A, for example.) So if he wants to make the system more accurate he would need a language with fewer allophones and fewer sounds distinguished only by voicing. E.g. G/K and T/D and many more are basically the same sound, one with your vocal cords vibrating, the other without. You can check this by looking in a mirror and placing a finger on your throat: say ATA, then ADA. While you say ATA you will feel nothing, but saying ADA you will feel vibration. Yet both look exactly the same, and in phonetics both are basically considered the same gesture. There are many more examples of this in English, and they carry meaning: like "tick" and "Dick", that's a massive one, or simply "dog" and "dock". It basically guts a sentence. So this project is basically doomed to fail by looking only at the lips. The tongue is so very important. Lip reading is hard, and it works by guessing words.
In a sentence some words do not make sense, so they are tossed out, but the AI cannot differentiate between a sensible utterance and a non-sensible one. It can, though, guess which word was said, and maybe from that one could extrapolate a probable sentence.
@@Womenooo In the Italian alphabet a letter is pronounced the same way every time, even if it has a specific letter before or after it. In English, for example, T is read one way while TH is read another way, and they have different sounds. In Italian we don't have this problem: the letter E is always said the same way even if it has a G or an F before or after it. (Sorry for my English but, as you can tell, I'm Italian.)
@@bananogamer6972 No, you don't understand. It is not about how true the phonetics of a language are to its alphabet; it is about how simple the phonetics are. I have a basic knowledge of Italian at best, but an example of a problem would probably be G and J: "geco" and "Julia" would both look the same at the onset of the word. I am not certain about the example, though. It is really a problem of many European languages that they have many phonemes that are realized in the mouth and not on the lips, so it is impossible to read them without contextualization.
Thanks for helping my project: a video that an AI makes. I need it to read a transcript and create an accurate voice and face. It then creates the video from images of faces it has seen on the internet.
I think that to a large degree this whole thing was flawed simply due to the angle you are recording your face from. People don't normally look at a person from below. This makes issues with some standard datasets one might normally use, I would think.
Decimating twice does not mean dropping by twenty percent. Decreasing by 10% two times leaves you with 90% x 90% = 81% of what you started with, meaning a decrease of 19%. Nerd.
Dear carykh, your AI was not wrong: the word 'of' should be pronounced as 'ov', while 'off' is pronounced as 'oph'. Your success rate was higher than you thought.
BFDI references 3:44, 5:27 "yeah i know she was so surprised" is the first line spoken in bfdi (by match) 12:40 flower's announcer crusher brief 15:39 "take the plunge" is the bfdi 1a name (yes i did watch the whole video four times [twice with captions], so what?)
I think the AI works pretty well for the amount of information it has. I guess you could only improve it by choosing the correct words based on grammar and context and which words most likely appear next to each other. Also, an additional system to turn the output back into audio, using a network trained to combine lip movement and the detected phonemes into input for a network (an easily trained autoencoder) that outputs your voice, would make the project complete. Would loooove to see that.
Dude. I just started watching your videos. I don't know what job you have, but you're a genius. You're literally improving computer programming extremely. I don't know the actual terminology. But you're gonna be making huge money someday, if not already. You're gonna be the reason robots become a reality
now we need the full bee movie uploaded, but with the actual audio replaced by your dramatic reading of the script...
omg I have the 70 minute video of my voice on my iPhone, I suppose I have no choice but to upload it!
check back in 1 hour. I bet somebody will edit it all together
@@carykh please
@@carykh I would watch this :D Great work on the project btw, love your videos.
Please I still want this
If I need to I will volunteer as tribute
I love these videos.
OOF IVE FOUND YOU
@@UmMeAmberE same
Wow
Hello!
So thats how i found your channel
Holy hecc this is useful for animation
YES
Only about 40% of words can be made out by even the best lip readers. The rest of the words are assumed based on context. So this project has huge limitations to start with.
@Eric Lee you like cereals>:)?
@Eric Leeyou like mum buy cereal type >:) ?
@Eric Lee ohhhh children school they give milk like teachers to student. it good because I can eat cereal with milk it free. So teacher give milk to children. Okeh?
Okay not is it
@@GaJ42 you like cereals>:)?
"We just need to pick the right transcript"
Me: It's going to be the Bee Movie, isn't it?
"I read the entire Bee movie script on camera"
NAILED IT.
It just HAD to be the Bee Movie script, I cheered so hard when he said it.
th-cam.com/video/AJCfgXhA5fc/w-d-xo.html here is his bee movie script video
thought the exact same thing
NAILY
I only guessed it because i have watched the video before
Cary: Read the lips of this guy.
Computer: *S U M M O N S S A T A N*
WHO SUMMONED ME
Cary:ME
God: Let me introduce myself
666 likes I'm not gonna ruin that
Still 666 likes
"so how tough are you?"
"I read the entire bee movie script"
"yeah, so?"
"I read it in front of my camera"
"come right in, sorry for the wait"
You got a bottle of ketchup?
yeah
*Fails at opening ketchup cap
Could I run this in some hot water?
Kolio Pulio
Why doesnt anyone know the last line?
@@azadanzans5359 , no no
AND SUBMITTED IT FOR A COLLEGE CLASS
Funny thing is...
I actually correctly guessed “Have you got a moment?”
same
LIAR
I guessed it was a question, but that’s it
I guessed are you being helpful?
That was the only one I got
"Or rather, I should say *OUR* lip reading A.I"
*SOVIENT ANTHEM STARTS PLAYING*
Antonio Sustaita ah, the sovieNt union
SPOTILA NAVEKI VELIKAYA RUS
Yes
Soviet
@@vvg_lol *YES!!!*
*OUR* LIP READING AI
_Soviet anthem begins_
Good job comrade we need you in the soviet union
mobile.twitter.com/unusualvideos/status/1069136310600777729
Sounds like
*_COMMUNIST PROPAGANDA_*
But ok
@@benos1799 too bad
The Soviet Union has been gone for almost 30 years
Daily reminder that communism doesn't work.
13:00 super easy
I memorized the bee movie script
Overfitting in real life :D
I actually read "have you got a moment" easily. The AI needs more training in phrases.
Your profile pic says it all
12:51 Jokes on you! I memorized the whole bee movie script!!!
what did he say then
Vannesa pull yourself together
According
@@Fuley-la-joo to
@@Crystal_500 all
Reverse the program to animate the mouth movements
EDIT: If Cary still has the animation files for some of his videos I don't think it'd be too hard to rip the mouth data from them (as a one dimensional matrix representing different mouth positions) and then use that with the audio from those videos
that's what China did with the news anchoring AI
IIRC Adobe Animate recently released a feature that would assist in lip syncing, but I'm not sure if it's anything like the logic used here.
You could also reverse the purpose of the AI: give it the original transcript and have it swap real words with similar-looking words. Limit it to only a few words per sentence, give it an oddly specific dictionary for substitutions, and you'd have truly automated the bad lip reading channel.
Maybe that's what I'll do for my senior project.
That would be incredibly useful to the anime industry. And with decent enough cgi, to the entire film dubbing industry.
Lol he did that
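The word-substitution idea a few comments up could be prototyped with a hand-made "looks-alike" dictionary. Everything below (the word pairs, the swap limit) is invented for illustration; real visual confusability would come from comparing viseme sequences:

```python
import random

# Hypothetical dictionary of visually similar word pairs (made up for illustration)
LOOKALIKES = {
    "mark": ["bark", "park"],
    "vote": ["boat", "float"],
    "ship": ["chip", "jib"],
}

def badly_lip_read(sentence, max_swaps=2, seed=None):
    """Swap up to max_swaps words for look-alike words, keeping the rest intact."""
    rng = random.Random(seed)
    words = sentence.split()
    swappable = [i for i, w in enumerate(words) if w.lower() in LOOKALIKES]
    for i in rng.sample(swappable, min(max_swaps, len(swappable))):
        words[i] = rng.choice(LOOKALIKES[words[i].lower()])
    return " ".join(words)

print(badly_lip_read("mark will vote on the ship", seed=0))
```

Limiting `max_swaps` keeps the output mostly intelligible, which is what makes bad lip readings funny rather than pure noise.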
14:14 interesting, so this is what being insane feels like.
I'm pretty sure it's more like 3:53
*Uses headphone*...ow
My right headphone is broken
Which makes me sane, i guess
"For example, after the word 'the' there should always be a noun"
adjectives
The cat = The bad cat
@@pinkman_ Gerunds (-ing) are nouns, so you're using a noun there.
Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.
~HAL 9000
I was looking for this.
A Space Odyssey fans unite!
I’m afraid I can’t do that, Dave.
You said what we were all thinking, thank you
Most disappointed that there was no 2001: A Space Odyssey reference to HAL9000's decision to murder the crew based on lip reading evidence.
3:53 **Their smiles slowly turning into giant frowns**
• The takeaway from this video is to give deaf people lots of kudos.
• Decimating twice isn't 20% off, it's 19% off: ((N×0.9)×0.9) Close but no zikal (I think I need more practice lip-reading).
• Dubbing words onto politician's mouths has already been done. It's the audio counterpart of deep-fakes (and BadLipReading).
I was going to comment that but I'm not even sure what the "correct" term is. Sure you could say "20%" but does "bi-decimate" work?
What an absolute madlad! He actually read the whole Bee Movie script!
I hope he likes jazz...
as a linguist, I feel for you, you took on a task way harder than you expected. good job regardless. unfortunately we cannot see inside the mouth of someone speaking, and that is where so much of speech happens. you can also consider the following: if you have the same vowel after 3 different consonants, your lips will be in a different position each time, thus some sounds don't have unique lip positions at all. real life lip reading is mostly context and being able to tell where those highly distinguishable consonants are.
You had a video of you reading the bee movie script for 10 months? And you didn’t post it?
- respect.
Please use this to translate Jojo Siwa so we know what she’s trying to say
Also, don’t worry about the project’s accuracy. I have a Deaf sibling and when they talk to me it’s fine because I learned sign language growing up with them. But they hate lip reading because it’s so hard to read lips. Apparently opinions/studies sort of agree that lip reading is an awful way to communicate cause some sounds look the same. A pretty infamous one is “Olive juice” looking like “I love you”. They say only 30% of words can be read accurately. Pretty weird right?
It's pretty obvious if you actually stop to think about it. (To quote Wikipedia for briefness) "Organs used for speech include the lips, teeth, alveolar ridge, hard palate, velum (soft palate), uvula, glottis and various parts of the tongue." Out of all of that, the only thing "lip reading" gets you information about is the lips and very occasionally the tip of the tongue; all of the rest of that critical information is invisible from the outside. It's remarkable that anybody ever thought lip reading was effective, really. Did they never stop to consider what their own mouth and throat are doing?
Badly Drawn Turtle Exactly! Sounds like Fa and Va look exactly the same. As well as Ga and Ka. The whole point of lip reading is that it’s just the shape of the mouth. You don’t have context or the sounds. In ASL we mouth words on most signs, but that’s just cause. If you do the sign for twins and mouth “twins”, no one is going to think you said “wins” because there is that context. But lip reading by itself (when my sibling tries to understand someone who isn’t signing) they struggle so much.
@@caseygreyson4178 yeah, there are around 40 phonemes in most languages, but traditional 2D animators use only 10 mouth shapes. E.g. M, B and P all use the same shape, and there is one neutral-looking shape that is used for about a quarter of the other sounds.
and, “Alligator food” looks like, “I love you”
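That many-to-one collapse is easy to demonstrate: map phonemes onto a few viseme classes and whole words become indistinguishable. The grouping below is a simplified illustration, not a standard animation chart:

```python
# Simplified, illustrative phoneme -> viseme grouping (not a standard table)
VISEME_CLASS = {
    "B": "lips-together", "M": "lips-together", "P": "lips-together",
    "F": "lip-teeth", "V": "lip-teeth",
    "AE": "open", "T": "tongue-ridge", "D": "tongue-ridge",
}

def visemes(phonemes):
    """Collapse a phoneme sequence to what the lips alone would show."""
    return [VISEME_CLASS[p] for p in phonemes]

# "bat", "mat" and "pat" differ only in their first phoneme...
assert visemes(["B", "AE", "T"]) == visemes(["M", "AE", "T"]) == visemes(["P", "AE", "T"])
# ...so on the lips they all look like the same word.
```

This is exactly why a lip-reading model has to lean on context: the visual signal alone simply doesn't contain the distinction.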
If the computer got 47% right, then it's pretty good.
I love the Conway's Game of Life reference "bring out the big guns" lmao
For anyone wondering, the picture he slams on the table is a glider gun: it produces an endless stream of gliders.
timestamp?
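For anyone curious what those gliders actually are: a minimal Game of Life step fits in a few lines, and you can watch a glider translate itself one cell diagonally every four generations (this is a generic sketch, not code from the video):

```python
from collections import Counter
from itertools import product

def step(live):
    """One Game of Life generation over a set of live (x, y) cells."""
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx, dy in product((-1, 0, 1), repeat=2)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next generation with exactly 3 neighbors,
    # or with 2 neighbors if it is already alive.
    return {c for c, n in neighbor_counts.items() if n == 3 or (n == 2 and c in live)}

glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
cells = glider
for _ in range(4):
    cells = step(cells)

# After 4 generations the glider has moved one cell down-right
assert cells == {(x + 1, y + 1) for x, y in glider}
```

A glider *gun* like the Gosper gun in the video is a larger pattern that emits one of these gliders at a fixed period forever.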
Roses are read
Violets are blue
AI can read
Can Cary too?
A traditional-to-simplified Chinese character converter would be amazing. If you guys want to try that project again, I suggest trying to identify radicals and translate those instead of the characters themselves. Most differences between simplified and traditional are in the radicals.
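The naive character-level baseline is just a dictionary lookup. The five mappings below are real traditional→simplified pairs, but a usable table needs thousands of entries, and the radical-level approach the comment suggests would additionally need character decomposition data:

```python
# Tiny traditional -> simplified table; a real converter needs a full mapping
# (and, for the radical idea above, character decomposition data).
T2S = {"愛": "爱", "國": "国", "學": "学", "體": "体", "龍": "龙"}

def to_simplified(text):
    """Replace known traditional characters, passing everything else through."""
    return "".join(T2S.get(ch, ch) for ch in text)

assert to_simplified("我愛國") == "我爱国"  # 我 is identical in both scripts
```

The pass-through default matters because most characters are the same in both scripts; only the mapped minority changes.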
I, myself, am hard of hearing. As long as I have the tiniest bit of sound, I can read lips. And with dramatic wording, like yours, I read it just fine! So hah!
We need the entire movie but with the AI instead of the actual audio
EDIT: woah that’s a lot of likes
AI writes the music for the score for the Bee Movie, AI writes the script for the Bee Movie, AI animates the Bee Movie, AI makes a bad lip reading of the AI written Bee Movie, AI takes the bad lip reading of the AI written Bee Movie and writes a script to contextualize the random things, AI animates the contextualized script based on the bad lip reading of the AI written Bee Movie and animates it, and so _ad nauseam_
The Gosper Glider Gun (4:20) is one of the smallest guns in Conway’s Game of Life. Like I’m not saying you needed to show a HBK Gun or anything, but at least show a Cordership Gun or something
not enough pixels in a TH-cam video! And hey at least it's bigger than a queen bee
lol 420
Thanks for the random knowledge, stranger!
*420 blaze it*
I seriously thought he said “I love bobbies” 13:35
Honestly, and I'm not sure if this is how TH-cam does their captions, but I feel like a combination of lip reading and word recognition together would make very accurate captions, especially if it's tuned to be just right.
That causes an issue: it won't know if it sees lips or not. It could just see, as an example, a Fortnite character's lips. A lot of gameplay channels don't have webcams. It may see the wrong thing as lips; issues like that may screw up subtitles.
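Combining the two signals could look like a weighted log-probability fusion over candidate words. All the scores here are invented for illustration; in a real system they would come from the audio and lip models, and as the reply above notes you would first need a detector to confirm real lips are even present:

```python
import math

def fuse(audio_probs, lip_probs, lip_weight=0.3):
    """Pick the word maximizing a weighted sum of log-probabilities.

    audio_probs / lip_probs: {word: probability} from each model.
    Missing words get a tiny floor probability instead of log(0).
    """
    def score(word):
        return ((1 - lip_weight) * math.log(audio_probs.get(word, 1e-9))
                + lip_weight * math.log(lip_probs.get(word, 1e-9)))
    candidates = set(audio_probs) | set(lip_probs)
    return max(candidates, key=score)

# Invented example: noisy audio can't decide, but the lips clearly start closed
audio = {"pat": 0.5, "cat": 0.5}
lips = {"pat": 0.9, "cat": 0.1}
assert fuse(audio, lips) == "pat"
```

The low `lip_weight` reflects that lip reading is the weaker signal; it should break ties rather than override confident audio.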
When you didn't understand anything but you still enjoyed the video.
*THIS IS AMAZING! SO COOL!*
Few mins later...
*WHAT DOES THAT MEAN? WATEVER!*
3:54 when I try to talk/listen to someone talking in a dream
What a nice way to start off the year! Finding _yet another_ awesome channel I'm gonna be enjoying for a pretty long time, I think!
Person: Read My Lips
Cary: Say No More
decimated twice would be 19% off
100-(100/10)=90
90-(90/10)=81
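The compounding is easy to check in one line:

```python
# Decimating (removing 10%) twice compounds multiplicatively, not additively
remaining = 100 * 0.9 * 0.9
assert round(remaining, 10) == 81.0  # so the total decrease is 19%, not 20%
```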
You should try to make a similar program that converts audio into little animated mouth movements for animators
06:08 I swear I was expecting he was gonna read the Bee Movie script… AND HE DID!
I'm like "YESS!"
"Automate their entire channel."
That's another hint for your next channel: lazykh
There’s a Cosmo article about the video used titled “TH-camr had one night stand with a woman, she lied afterwards about being pregnant with twins” if anyone wants to know the context of the video
Agent Stache what the fuck?
th-cam.com/video/_J7dEhYttbQ/w-d-xo.html
Plug this into the Wayback Machine to actually watch the video. Asshole decided to delete it.
@@43Jodo How does that make him an asshole? Like it's something kinda personal and he probably just wanted it to be more as an update about why he wasn't gonna be a father to those who were following him at the time instead of a video for everyone to be able to see forever
what is going on lol
Agent Stache
What are you talking about
3:53 me trying to have a normal conversation with someone
Edit: Woah, that's a lot of likes...
Same here
this sounds exactly like me when I haven't slept in 24 hours but still have a lot to say
WhEn LibEarLs sPeAk tO mE tHeY sOuNd LIke ThaT XD XD WOW they ThInk Their so Gr8 :0)
😂😂😂😂😂😂😂
13:39 I ACTUALLY GOT IT RIGHT OMGGG!
So this is what ultra instinct feels like?
YEAH HAHA SAME
"Both of you did terrible"
Your neck moves too when you make certain syllables. Maybe you should incorporate that?
I think angles matter too. If he had done 2 angles, it probably would have been able to look at the movements more precisely and see where it went wrong. Then, having 3-5 people read the same thing, both over-enunciating and normally, it should figure it out pretty quick.
Predated O exactly
carykh: *"On March the 11th, 2018, at 11 PM, I did the unthinkable."*
Me: oh no, please tell me he didn't read the entire bee movie scri-
carykh: *"I read the entire Bee Movie script on camera"*
Lmao
My technical mind: "This is pretty interesting."
My linguistic mind, watching the section on the algorithm guessing syllables: "Please, for the love of everything, use the IPA! Ahhhhhhhh!"
(To be clear, this is mostly a joke. At least he is using a standardized format for syllables. I just have this little part of my brain that's been spoiled by the IPA's unambiguous nature and figured there's probably someone else out there who'll get it.)
Just saying I can lip read and the reason I can’t tell what ur saying is because no one talks like that
caluppy he was over pronouncing words and that made the AI confused I think.
@Radium X I was able to get Vanessa
this video is so last year
*C o m e d y*
Yeah I like the cool stuff from 2019 like the sequel to the Logan Paul suicide forest video and a sequel to fortnight
IN AN HOUR BOI
Izzy Pin TIMEZONES BOI
Way to start new year with a dad joke
I should say *OUR* lip reading AI.
Stalin approves
If you could increase or decrease the score of words based on context, you could probably reduce the number of errors that occur. That scoring can also be trained on separate material, in the form of text transcripts from other sources, making it easier to see whether it hurts or helps.
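That reranking could be trained exactly as described, by counting word pairs in unrelated transcripts. The training text, candidates, scores, and the 0.5 weight below are all toy values for illustration:

```python
from collections import Counter

# Toy bigram counts "trained" on a text transcript from some other source
training_text = "the bee flew to the flower the bee made honey".split()
bigrams = Counter(zip(training_text, training_text[1:]))

def rescore(prev_word, candidates):
    """candidates: {word: lip_reading_score}. Boost words that fit the context."""
    return max(candidates,
               key=lambda w: candidates[w] + 0.5 * bigrams[(prev_word, w)])

# Lip reading alone slightly prefers "pea", but context after "the" rescues "bee"
assert rescore("the", {"pea": 1.1, "bee": 1.0}) == "bee"
```

Because the language model trains on plain text, you can tune it on any corpus without recollecting lip-reading video, which is the "separate material" point above.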
AI should learn to animate so Cary could be able to upload more often
When you said you would use the transcript of a movie i was getting very excited. When you were talking about doing the unthinkable, i knew it had to be it.
When you said you read the entire bee movie script on camera, i literally started clapping before i could care about my family being in the same room.
I respect you so much for this, you really made a big sacrifice.
I live in Germany. It's Silvester (New Year's Eve). I am drunk. It's 6 am. I am watching Carykh. I hope I spell3d everything right. Happy new year!!!!
best ai
hsppy new yere
Happy New Year! (Frohes Neues!)
As soon as you said, "Which movie to pick," I instantly went, "It's Bee Movie, isn't it?"
I've never seen Bee Movie to be honest.
I got the “have you got a moment” right 😃
COMP: LAUREL
AI: YANNY
I HEARD "THE EARTH IS NOT FLAT"!!!
You shouldn't reveal that you are deaf!
I hear covfefe
I heard commit order 66
We went all of 2018 without a TWOW vid
*That wasn’t very cash money of you*
i have 39 minutes left tho
just kidding, twow 24a coming january i hope
carykh come on if you fly to a different time zone you could still get it out by 2018!
Wait twow is actually not just cancelled? No way!
carykh You told me the next episode would be between Christmas and New Year's Eve. I am more disappointed at that than at the fact that nobody made one of their TWOW submissions "The the the the the the the the the the"
@@carykh When do you think we'll have season 2? I missed the beginning of the first season and can't stop bingeing the series.
the blurry voice actually sounds great. i would turn that into music so fast
I am actually currently trying to do the opposite. Using the Google speech recognition API and gentle (which I found thanks to your vid, so thanks) I am creating a lip-syncing program that will take audio from the mic, convert it into phonemes, then animate a character. Now that itself isn’t too hard, but I want to do it live (live audio), so I am kind of struggling.
is the project on github?
Isnt that animoji
@@blasttrash no not yet
@@npric2883 Animoji takes your picture and maps your muscle movements to a 3D model on a screen. Their project is to skip the camera part and map the audio alone to a character on a screen.
Like vrchat?
It's not the A.I fault, it's ping is too high.
Vsus what the hell
. . . I want to punch you so bad. The AI is run locally, meaning it's sub-instant reading.
@@TheNerdBird_ no its ping
Look buddy, it's is short for "it is" but if you want to signify possession it's "its", not "it's", okay?
@@jeeeves If it is run locally, the ping would be less than a millisecond. It is run locally. It's quite annoying when people who don't understand technical and networking terms at all try to make statements to sound smart.
0:08
Cary: Or I should say OUR lip reading AI
*Soviet anthem starts playing*
0:09 cause I'm communist
Edit: 2:32 he uses the URSS to convert it to a spectrogram
two communist references in one video
URSS = ur SS
@@Kitulous mein leben
Welp this is what I’m watching for the first vid of 2019
Fancy Spider same
Same bruhh
I think you need to train the network not only on lips, but on the throat too, because a lot of sounds come from the vocal cords only
9:45 Uhh... What's that censor bar supposed to be covering? Because I don't think it did what it was supposed to do.
ToHellWithReality their emails, I think.
@@prokaryotesys I know that, but I didn't want to spell it out for two reasons. First, I didn't want to make it obvious for people looking for that kind of info. Second, comedic effect.
@@ToHellWithReality just r/woosh them
@@krucible4889 oof i got wooshed
thats one of my life goals tho
@krucible r/itswooooshwithfouros
How about a sign language recognition AI? And maybe translation, ASL to UK sign language, etc.?
ASL to English might be more useful, but I think the idea is great. The tricky part would be the dataset. ASL uses more than the hands, so you'd probably need different types of clothing to train it on, as well as different skin tones, etc.
@@johnsensebe3153 Just train it on a black and white dataset
@@elllieeeeeeeeeeeeeeeeeeeeeeeee You're still going to have a variety of shades, short sleeves, long sleeves, no sleeves, frilly cuffs, etc.
like the one in unfriended 2?
@@johnsensebe3153 Once the basic dataset is created you can build the skeletal system, apply that to the person being interpreted, and it should be fine. You could get a varied dataset from news broadcasts; the transcripts are usually in the CC/subtitles even when there's an interpreter on screen.
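One way to make that skeletal-keypoint idea robust to different signers, positions, and camera zooms is to normalize the keypoints before classifying them. This is a minimal sketch with invented coordinates; a real system would get its points from a pose estimator:

```python
# Normalize 2D skeletal keypoints for position- and scale-invariant features.

def normalize_keypoints(points):
    """Center keypoints on their mean and scale to unit spread, so the same
    sign produces similar features regardless of position or camera zoom."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    spread = max(max(abs(x - cx) for x in xs),
                 max(abs(y - cy) for y in ys)) or 1.0
    return [((x - cx) / spread, (y - cy) / spread) for x, y in points]

# The same hand shape at two positions/scales maps to (nearly) the same features.
small = [(10, 10), (12, 10), (11, 14)]
big = [(200, 300), (220, 300), (210, 340)]  # same shape, shifted and 10x larger
a, b = normalize_keypoints(small), normalize_keypoints(big)
print(all(abs(ax - bx) < 1e-9 and abs(ay - by) < 1e-9
          for (ax, ay), (bx, by) in zip(a, b)))  # → True
```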
14:16
Carykh: Quiet I want to talk!
AI: LET ME TALK FIRST
Carykh: Let me talk first, please
*And then you loop this
A really interesting idea. I had a similar idea some months ago but I couldn't do it myself. I think you should focus on the links between words in order to build a meaningful sentence, like the YouTube subtitle algorithm, which can correctly transcribe audio to text most of the time. Combining that kind of algorithm with your lip reading idea might turn it into good lip reading instead.
hey what about an AI to play Super Mario as fast as possible? that may break the WR.
@@kamaljotsingh6675 I've already made one, that can complete SMB almost as fast as the WR. th-cam.com/video/pSiQgHZhKjk/w-d-xo.html
Holy Moly you are that super mario TAS man
@@nyroysa Hi, nice to meet you here.
HAPPY NEW YEAR!!!
TofuMaster83 Happy new year!!! (In 1 hour for me)
and happy birthday bfdi!
4:04 Don't blame the poor computer, he is just trying to summon Satan, nothing special.
I was so proud when I guessed "do you have a moment"
Do it the other way - generate lip shapes from the audio! Automated lip sync!
Diordnas Darkunn he’s mentioned the possibility of doing that before to save time on animation, though I personally think a more hard coded approach would work better than a neural network
@@agentstache135 I thought for sure that would be what this video was about. I would love to see how well that works, either for generating actual video from audio or for using it to animate the character's lips.
@@jacobfeinland7878 My idea for how to do it copied from my comment from the video where Cary mentions it (the dance one) because it's an essay I'm not rewriting:
Why would you need an AI for animating the lips? Why not just write (or use, I’m sure it already exists) an algorithm that takes a transcript (handwritten or using existing speech recognition (which I know is probably still technically an AI)) of what you’re saying as input and then move the mouth? I’m sure there are some parts that you’d have to manually do, eg screaming, but it’d be a lot more reliable and robust than an AI based on the audio. If I were to code it I’d mine a dictionary for the International Phonetic Alphabet (or some other pronunciation respelling) representation of each word. Then just figure out what mouth shape you make and how long you make it for each sound and put it all together into an animation. Obviously you’d probably still need to tweak it some more, depending on how time-accurate your transcript is, and that might be where an AI could help. But, I still don’t think an AI would be robust enough for the whole process, especially for a pretty discrete animation where if it picks the wrong mouth shape it’s pretty noticeable. Whereas if you were to just use it to help with temporal alignment, it being wrong would only show up as a small offset, less noticeable.
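A toy version of that dictionary-driven pipeline might look like the sketch below. The mini pronunciation dictionary, viseme names, and frame counts are all invented placeholders; a real version would mine CMUdict or an IPA dictionary as the comment suggests:

```python
# Transcript -> phonemes -> (mouth shape, duration) keyframes, no AI needed.

PRONUNCIATIONS = {          # word -> phoneme list (hypothetical entries)
    "bee":   ["B", "IY"],
    "movie": ["M", "UW", "V", "IY"],
}
VISEME = {                  # phoneme -> mouth shape drawing
    "B": "lips_closed", "M": "lips_closed",
    "IY": "wide_smile", "UW": "rounded", "V": "teeth_on_lip",
}
DURATION = {"B": 2, "M": 2, "V": 2, "IY": 4, "UW": 4}  # frames per phoneme

def lip_sync(transcript):
    """Turn a transcript into a list of (mouth_shape, frame_count) keyframes."""
    timeline = []
    for word in transcript.lower().split():
        for phoneme in PRONUNCIATIONS[word]:
            timeline.append((VISEME[phoneme], DURATION[phoneme]))
    return timeline

print(lip_sync("bee movie"))
```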
Using those mouths from bfdi.
Wait, animators actually lip-sync the characters? It feels so dumb. I mean, who's going to care?
[IN TENNIS BALL VOICE]: James-
14:04 oh good, I have mono audio setting on.
What else does K and H mean hmm...
Cary
*K* omments
*H* ere
dangit i was close.
also this is my first comment from 2019, hehe
LIES!
BGGAMING Deluxe komrades
he's a time traveler huzzah
Cary Killed Him
Must be nice to be in 2019
Wow the video of you saying the bee movie script was recorded on my birthday. Best present ever!
0:03 It's The Captain from SpongeBob "are you ready kids, aye-aye captain! I can't hear you! AYE-AYE CAPTAIN! OHHHHHHHHHH!"
CMUdict strikes again! Looked to me like some successes here. Now you got me wondering if you'd go even further weighting words / word neighborhoods by commonness, or by taking morphosyntax into account. Oh, and so much yes to the sinking smiles at 3:54 - that slow letdown of throwing out a hopeful spike solution and watching it fail.
OH MY GOSH I GOT THE LIP READING RIGHT!!! BOTH OF THEM!!
I am *GOD*
wait why isn't your channel verified
1:38 if you make the words "heaven high poop push" the opposite, you get "hell low pee pull", which sounds like "hello people". wow, I'm surprised I noticed that.
Ok, Hannah, stop messing with Cary. Hannah, calm down.
Hannah?
Hannah!
You could have used the video of the longest word, the full chemical name of titin. Instant 3 hours with a full transcript available everywhere.
It would contain the same 5 samples of "words," though.
"Tower owe wheat and sought owe-induced height eight of lamb late"
In Italian that would be easier because every letter has a sound
Not necessarily. He already incorporated the IPA (International Phonetic Alphabet), which is more accurate than any native orthography: each sound maps to exactly one symbol. A language with fewer distinctive sounds and fewer allophones would be ideal. Italian has about 30, which is a decently low number, but 7 of them are vowels, and you want fewer vowels (I'd think) because vowels are created by shaping the airflow differently while the vocal cords produce the same tone. An A and an O are the same at the vocal cords; for A you stretch your lips and for O you round them, and the rest is the tongue, which you can't see, though your lips do move slightly as you go through the vowels. (Also, yes, I know Italian has just the 5 written vowels A E I O U, but phonetic vowels are realized differently from the vowels of written language; they also include diphthongs, vowels that merge into each other, and vowels with slightly different articulation. In English, "bad" and "bat" have different forms of A, for example.)
So if he wants to make the system more accurate he would need a language with fewer allophones and without a voiced/unvoiced distinction. E.g. K/G and T/D (and many more pairs) are basically the same sound, one with the vocal cords vibrating and one without. You can check this by looking in a mirror and placing a finger on your throat: while you say ATA you will feel nothing, but saying ADA you will feel vibration, yet both look exactly the same. In phonetics both are basically considered the same articulation, and there are many more examples of this in English. And the difference carries meaning: "tick" vs. "Dick" is a massive one, or simply "dog" vs. "dock". It can gut a sentence. So this project is basically doomed to fail by looking only at the lips; the tongue is very important. Lip reading is hard, and it works by guessing words: in a sentence some words don't make sense, so they are tossed out, but the AI cannot distinguish a sensible utterance from a nonsensical one. It can guess which word was said, though, and maybe from that one could extrapolate a probable sentence.
@@Womenooo In the Italian alphabet a letter is pronounced the same way every time, even if it has a specific letter before or after it. In English, for example, T is read one way while TH is read another way, and they have different sounds. In Italian we don't have this problem: the letter E is always said the same way even if it has a G or an F before or after it. (Sorry for my English, but as you can tell, I'm Italian.)
@@bananogamer6972 no you don't understand. It is not about how true a language's phonetics are to its alphabet; it is about how simple the phonetics are. I have only a basic knowledge of Italian at best, but an example of a problem would probably be G and J: geco and Julia would both look the same at the onset of the word. I am not certain of the example, though. It is really a problem of many European languages that have many phonemes realized in the mouth and not on the lips, making them impossible to read without context.
@@Womenooo now I understand thanks
Emanuele Bonandrini or Finnish
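The voiced/unvoiced ambiguity described in this thread (T/D, K/G, "dog" vs. "dock") can be made concrete by collapsing phonemes into viseme classes. The groupings below are a simplified illustration, not a standard viseme inventory:

```python
# Phonemes that differ only inside the mouth collapse to the same viseme class.

VISEME_CLASS = {
    "T": "tongue_alveolar", "D": "tongue_alveolar",  # differ only by voicing
    "K": "tongue_velar",    "G": "tongue_velar",
    "P": "lips_closed",     "B": "lips_closed", "M": "lips_closed",
    "F": "teeth_on_lip",    "V": "teeth_on_lip",
}

def looks_identical(word_a, word_b):
    """True if two phoneme sequences are indistinguishable from lips alone."""
    return [VISEME_CLASS.get(p, p) for p in word_a] == \
           [VISEME_CLASS.get(p, p) for p in word_b]

print(looks_identical(["D", "AA", "G"], ["D", "AA", "K"]))  # "dog" vs "dock" → True
```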
0:08 *COMMUNISM INTENSIFIES*
Stalin wants to know your location!
It's socialism not communism!
@@marlon.8051 Similar thing...
@@d0nnyr0n socialism tries convince the population that communism is great and communism dont
@@marlon.8051 That is not correct. See this *www.investopedia.com/video/play/difference-between-communism-and-socialism/* .
thumbnail: automate their entire channel.
NO THANK YOU.
I'm really curious how it would sound if the raw phoneme data was pushed into sound output instead of trying to match it up to specific words.
maybe it would play every phoneme simultaneously, and the more confident it was in a phoneme, the louder it would be.
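A minimal sketch of that confidence-weighted mix: every phoneme plays at once, with loudness proportional to the network's confidence. The per-phoneme tone frequencies are arbitrary stand-ins for real phoneme synthesis:

```python
# Drive audio directly from per-phoneme confidences instead of decoded words.
import math

PHONEME_TONE = {"AA": 220.0, "IY": 440.0, "UW": 330.0}  # Hz, made up

def frame_samples(confidences, n=4, rate=8000):
    """Mix one short audio frame: each phoneme is a sine tone scaled by its
    (normalized) confidence, so uncertain guesses are quiet, not discarded."""
    total = sum(confidences.values()) or 1.0
    return [
        sum(c / total * math.sin(2 * math.pi * PHONEME_TONE[p] * t / rate)
            for p, c in confidences.items())
        for t in range(n)
    ]

samples = frame_samples({"AA": 0.7, "IY": 0.2, "UW": 0.1})
print(len(samples), all(-1.0 <= s <= 1.0 for s in samples))  # → 4 True
```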
I actually got the "have you got a moment" one right.
Thanks for helping my project: a video that an AI makes. I need it to read a transcript and create an accurate voice and face. It then creates the video by looking at images of faces from the internet.
I think that to a large degree this whole thing was flawed simply due to the angle you are recording your face from. People don't normally look at a person from below. I'd think that causes issues with some of the standard datasets one might normally use.
Decimating twice does not mean dropping by twenty percent. Decreasing by 10% two times leaves you with 90% x 90% = 81% of what you started with, meaning a decrease of 19%.
Nerd.
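For anyone who wants to check the arithmetic above in a couple of lines:

```python
# Two successive 10% reductions compound, they don't add.
remaining = round(0.9 * 0.9, 2)
print(remaining, f"{1 - remaining:.0%} total reduction")  # → 0.81 19% total reduction
```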
No one is gonna talk about how he used a different animation style in the beginning?
The audio reminds me of the creepy audio captchas
At 6:06 I could tell it would be the bee movie script
Wow Cary, ORIGINAL
I HAVE NO FUCKING IDEA WHAT HALF OF THESE WORDS MEAN, BUT I LIKE IT
Dear carykh, your AI was not wrong: the word 'of' should be pronounced as 'ov', while 'off' is pronounced as 'oph'.
Your success rate was higher than you thought.
/əv/ is of
/ôf,äf/ is off
HOLY SHIT I KNEW IT WAS BEE MOVIE, kinda obvious though ... love it!
BFDI references
3:44, 5:27 "yeah i know she was so surprised" is the first line spoken in bfdi (by match)
12:40 flower's announcer crusher
brief 15:39 "take the plunge" is the bfdi 1a name
(yes i did watch the whole video four times [twice with captions], so what?)
I think the AI works pretty well for the amount of information it has. I guess you could only improve it by choosing words based on grammar, context, and which words most likely appear next to each other. Also, an additional system to convert the output back to audio, using a network trained to combine the lip movement and the detected phonemes into input for an (easily trained) autoencoder that outputs your voice, would make the project complete. Would loooove to see that.
I strongly agree with the word choosing idea!
Decimated twice means -19% not -20% *ugh*
Dude. I just started watching your videos. I don’t know what job you have, but you’re a genius. You’re literally advancing computer programming immensely. I don’t know the actual terminology, but you’re gonna be making huge money someday if not already. You’re gonna be the reason robots become a reality
lip smacking is the equivalent of fingernails on a chalkboard.