Let’s Create a Speech Synthesizer (C++17) with Finnish Accent!

  • Published on Jan 7, 2025

Comments •

  • @Bisqwit
    @Bisqwit  6 years ago +86

    This is LPC (Linear Predictive Coding): yₙ = eₙ − ∑(ₖ₌₁..ₚ) (bₖ yₙ₋ₖ) where
    ‣ y[] = output signal, e[] = excitation signal (buzz, also called predictor error signal), b[] = the coefficients for the given frame
    ‣ p = number of coefficients per frame, k = coefficient index, n = output index
    Compare with FIR (Finite Impulse Response): yₙ = ∑(ₖ₌₁..ₚ) (bₖ xₙ₋ₖ) where
    ‣ x[] = input signal
    The similarities between the two are striking. FIR is used in applications like low-pass filtering, high-pass filtering, band-pass filtering, band-stop filtering, etc. It is an almost magical type of mathematics that is used to generate these filters. For LPC, there are several different algorithms, many of which are implemented in Praat, the software that I used in this video to create my LPC files.
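    The difference equation above translates to C++ almost line by line. A minimal sketch (function and variable names are mine, not code from the video):

    ```cpp
    #include <cstddef>
    #include <vector>

    // One LPC synthesis step: y[n] = e[n] - sum_{k=1..p} b[k-1] * y[n-k].
    // `history` holds the p most recent outputs, history[0] = y[n-1].
    double lpcStep(double excitation, const std::vector<double>& b,
                   std::vector<double>& history)
    {
        double y = excitation;
        for (std::size_t k = 0; k < b.size(); ++k)
            y -= b[k] * history[k];
        // Shift the history: the newest output goes to the front.
        for (std::size_t k = history.size(); k-- > 1; )
            history[k] = history[k - 1];
        history[0] = y;
        return y;
    }
    ```

    Feeding this a buzz (excitation) sample by sample, with `b` swapped per frame, is the essence of the synthesis loop.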

    • @huyvole9724
      @huyvole9724 6 years ago +2

      I met that formula when I learned the Signals & Systems module (my school calls it Digital Signal Processing)

    • @BichaelStevens
      @BichaelStevens 6 years ago +1

      16-17 minutes in:
      Please next time lower the audio or give a warning. The popping killed my hearing

    • @Rennu_the_linux_guy
      @Rennu_the_linux_guy 6 years ago

      uhhh

    • @a1k0n
      @a1k0n 6 years ago +2

      In fact it's identical to an IIR filter, which has coefficients for both x and y, and your x coefficient is 1 and all your y coefficients are negated.

    • @RazorM97
      @RazorM97 6 years ago +1

      How to stop prison radicalization

  • @KatzRool
    @KatzRool 6 years ago +94

    I was going to make a joke about how you already sound like speech synthesis when speaking English, but your English gets better every single video. Keep it up man!

  • @framegrace1
    @framegrace1 6 years ago +153

    I think the clicks are because the program is cutting/pasting at random waveform values. This produces discontinuities in the waveform, which generate those clicks. I think the simple way to solve it is to just wait until the value of the sample crosses the 0 line to perform the cut of the audio, and wait again for a 0 crossing to introduce the next one.
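    The zero-crossing idea from this comment can be sketched in a few lines of C++ (illustrative only; `nextZeroCrossing` is my name, not code from the video):

    ```cpp
    #include <cstddef>
    #include <vector>

    // Find the first index at or after `start` where the waveform crosses
    // zero (sign change between consecutive samples). Cutting and splicing
    // at such points avoids step discontinuities that are heard as clicks.
    // Returns the buffer size if no crossing is found.
    std::size_t nextZeroCrossing(const std::vector<float>& samples,
                                 std::size_t start)
    {
        for (std::size_t i = start + 1; i < samples.size(); ++i)
            if ((samples[i - 1] <= 0.0f) != (samples[i] <= 0.0f))
                return i;
        return samples.size();
    }
    ```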

    • @KuraIthys
      @KuraIthys 6 years ago +10

      Interesting theory.
      That actually matches advice mentioned in the SNES manual in relation to audio samples. What it is trying to say exactly is ambiguous, but it warns against discontinuities in the waveform, which would result in clicking sounds.
      Of course, given the ADPCM coding, discontinuities on block boundaries would easily result if you're not careful. (since the samples within a block are all expanded using the same parameters, but across block boundaries the parameters change.)

    • @idk-bv3iw
      @idk-bv3iw 6 years ago +14

      What about a simple fade-out/fade-in between the samples?

    • @TheBcoolGuy
      @TheBcoolGuy 6 years ago +4

      @@idk-bv3iw That's the method used in video editing.

    • @crimsun7186
      @crimsun7186 6 years ago +2

      You also have to determine a rhythmic pattern dependent on the language and overall delivery, as words are not spoken at a constant pace.

    • @a1k0n
      @a1k0n 6 years ago +5

      I don't think that will work, because of all the excitation signal history in the bp[] array. Instantaneously changing the filter coefficients can lead to instability. One thing that might help, or might make it worse (I'm not sure) is to try implementing the transposed version where the bp[] array isn't just past output samples, but partially computed future samples. See the notes here: docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html

  • @greasyfingers9250
    @greasyfingers9250 6 years ago +110

    "Yes, I use PHP. Because a programming language that you know is much more efficient than one that you don't know."
    This is the truest statement I have ever heard.

    • @greasyfingers9250
      @greasyfingers9250 6 years ago +1

      @Michael Smith You can debug it line by line with Xdebug, but C# or Java are usually better for that kind of work.

    • @Kitulous
      @Kitulous 6 years ago +5

      in order to debug PHP you have to var_dump every single variable because the stack trace in PHP is a real mess.

    • @HermanWillems
      @HermanWillems 5 years ago

      Short term yes, long term no.

  • @jlewwis1995
    @jlewwis1995 3 years ago +4

    Finally a video that actually shows how to ACTUALLY MAKE a TTS voice from scratch, almost everything online about "how to make a text to speech synthesizer from scratch" is just "use this function to call the os TTS library lul"

  • @educate9946
    @educate9946 6 years ago +115

    Now I can have Robot Bisqwit wake me up every morning.

    • @thefoolishgmodcube2644
      @thefoolishgmodcube2644 6 years ago +12

      Imagine having “SHALOM! SHALOM!” as a wake-up alarm

    • @kkeanie
      @kkeanie 5 years ago

      @David Plays Stuff I really need that. it would stop my depression

    • @PantsYT
      @PantsYT 5 years ago +2

      "Hyvää huomenta"

    • @sindavmi
      @sindavmi 4 years ago

      robot bisqwit is a pleonasm

  • @x0j
    @x0j 6 years ago +216

    This doesn't fool me, I know you have a much more advanced synthesizer that you use for your videos. A nice coverup attempt though

  • @BichaelStevens
    @BichaelStevens 6 years ago +112

    We have reached peak AI revolution - machines making machines
    A voice synth making a voice synth

    • @akj7
      @akj7 6 years ago +3

      Haha

    • @huyvole9724
      @huyvole9724 6 years ago +4

      -6.4°C

    • @Bisqwit
      @Bisqwit  6 years ago +14

      Actually the truth was like -22. I just happened to do the recording a month earlier...

    • @imlxh7126
      @imlxh7126 1 year ago

      Uberduck has a neural-network-based simulation of Microsoft Sam. Talk about overengineered lmao

  • @tomh6339
    @tomh6339 4 years ago +10

    Dude. I haven't used Praat since University, was hit by waves of nostalgia in the most unexpected place. Your videos are the best, you're quite the renaissance man.

  • @magicstix0r
    @magicstix0r 5 years ago +15

    The input signal can't be a pure sine wave because:
    1.) The vocal cords don't emit pure sine waves; they emit something more like a buzz.
    2.) A pure sine wave would be almost unaffected by the LPC filters because it's a single frequency.
    A buzz is extremely rich in harmonics, and the human ear keys off the presence or absence of those harmonics in determining what was said. That's why if you look at voice data in a spectrogram, you tend to see lots of streaks that move together or widen/shrink based on what's being said.
    In a sort of philosophical explanation, the input signal is "sampling" your LPC filters. A single sine wave would result in sampling just a single data point. You need a lot of sine waves to get enough of a picture of the LPC filter to see what it looks like, which is what your brain is keying on to make sense of your words.
    Think of it kind of like an image. The sine waves are the pixels that you're building a picture of the LPC filter with. A single sine wave is like a single pixel; it doesn't tell you much. A buzz is loaded with lots of sine waves, so analogously it's loaded with a lot of pixels, so it can give you a better picture of the LPC filter, and thus a better picture of the formant it represents.

    • @Bisqwit
      @Bisqwit  5 years ago +5

      Great explanation! Not an ELI5 though :-) But I would have settled for that.

  • @metadaat5791
    @metadaat5791 6 years ago +10

    I always liked the implication of GSM using LPC, that technically you're not hearing someone's actual voice, but a reconstruction made of a buzzer with a filter and hisses and pops from filtered noise. So, you're actually listening to a speech synthesizer's reconstruction of the other person's voice! :-) :-)

  • @chooha
    @chooha 5 years ago +31

    Hi Bisqwit, I don't know if you realize this, but you are an inspiration for many of the viewers here, like a hero. So could you make a video about how you reached this insane level of skill, what your journey was like, and maybe some tips on how one can be as good as you? Thanks for all the amazing content ^_^

  • @magicstix0r
    @magicstix0r 5 years ago +9

    The constant clicks and pops are due to discontinuities at the frame boundaries. With an algorithm like this, they usually fix it using overlap-add. The gist of OLA is that your frames overlap and are weighted by a windowing function, then you sum them together where they overlap.
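    A hedged sketch of the overlap-add idea in C++ (the function name and the Hann window choice are mine; the video does not implement OLA). With hop = frameLen / 2, the Hann windows sum to a constant, so a steady signal passes through unchanged:

    ```cpp
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Overlap-add (OLA): weight each frame by a Hann window and sum the
    // frames into the output at hop-sized offsets, so frame boundaries no
    // longer produce hard discontinuities.
    std::vector<double> overlapAdd(const std::vector<std::vector<double>>& frames,
                                   std::size_t hop)
    {
        if (frames.empty()) return {};
        const std::size_t frameLen = frames[0].size();
        const double pi = std::acos(-1.0);
        std::vector<double> out((frames.size() - 1) * hop + frameLen, 0.0);
        for (std::size_t f = 0; f < frames.size(); ++f)
            for (std::size_t i = 0; i < frameLen; ++i)
            {
                // Periodic Hann window: 0.5 - 0.5*cos(2*pi*i/N)
                double window = 0.5 - 0.5 * std::cos(2 * pi * i / frameLen);
                out[f * hop + i] += window * frames[f][i];
            }
        return out;
    }
    ```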

  • @MissNorington
    @MissNorington 6 years ago +14

    Really outstanding video! Great work Bisqwit!

  • @oresteszoupanos
    @oresteszoupanos 6 years ago +10

    Joel, regarding your question at 8:05, we cannot use a sine wave because it only has audio energy in 1 frequency, whereas to synthesise human speech, we need energies in "all" frequencies, so we can have base pitches and formants happening at the same time. Buzzers have a better spread of frequencies, compared to the more "pure" sine wave. Hope I made sense ^_^

    • @Bisqwit
      @Bisqwit  6 years ago +3

      Good explanation, but not really an ELI5 :-)
      I understand the situation as indicated elsewhere in the video, but I was having trouble explaining in layman terms without referring to things like frequency spectrum; I wrote that request for the benefit of audience.

    • @oresteszoupanos
      @oresteszoupanos 6 years ago +7

      @@Bisqwit Aha, I'd never heard the term ELI5 (Explain Like I'm 5) before! Here is my second attempt :-)
      Voice sounds are slightly complicated. Sine wave sounds are simple. Buzzers are super-complicated. We cannot use 1 simple sine wave, filter it, and get a complex voice sound. We have to start with a super-complex buzzer, then filter out some things, to be left with a less-complex voice sound.

    • @frisosmit8920
      @frisosmit8920 6 years ago +1

      That's actually a very good explanation. Your first explanation made me understand it. But then again, I'm not 5 years old.

    • @noneofyourbeeswax3460
      @noneofyourbeeswax3460 5 years ago

      But you could superimpose sine waves to get all the frequencies?

    • @Bisqwit
      @Bisqwit  5 years ago

      Yes, and in fact all waveforms can be represented as a sum of sine waves. That is what e.g. the Fourier transform is about, or the discrete cosine transform.

  • @wallaguest1
    @wallaguest1 5 years ago +4

    I can't understand how you have so much knowledge, it's crazy

  • @OverSeasMedia
    @OverSeasMedia 5 years ago +5

    Bisqwit was the inspiration to write my own tools whenever I need one. Great video.

  • @DudeWatIsThis
    @DudeWatIsThis 4 years ago

    Bisqwit you fucking legend man. This is the way to handle the banter. Throw it straight back at them!
    Genius stuff. You win again, good sir!

  • @pixelflow
    @pixelflow 5 years ago +10

    Finally! A Bisqwit Vocaloid :3

  • @prizmarvalschi1319
    @prizmarvalschi1319 4 years ago +4

    This is kinda like how UTAU users create voicebanks.
    Except we sing in 5-syllable strings for Japanese, sometimes more for other languages. And sometimes recorded in three or more pitches.

  • @shivisuper
    @shivisuper 6 years ago +1

    These videos make me respect you even more. You're very knowledgeable!

  • @kapiltyagi4639
    @kapiltyagi4639 6 years ago +4

    The solution for the clicking in the sound is to simply fade out some of the frequency from the very end of the sample, because LPC is just converting the audio samples into a simple, low-resolution waveform: just a bunch of float values and a gain.

  • @davidcuny7002
    @davidcuny7002 4 years ago +1

    The red lines in Praat indicate formants, not the overtones. The vocal cords produce pulses, which have a fundamental frequency (pitch) as well as overtones (multiples of the pitch). The tongue forms a series of "tubes" in the mouth, which causes the pulses to resonate at frequencies proportional to the length of those various chambers. The resonating frequencies of these "tubes" are formants, and different mouth shapes create different sets of resonating frequencies.

  • @noname-rr7hk
    @noname-rr7hk 2 years ago

    I was searching for this video for half a year. Thank you...

  • @mattg5461
    @mattg5461 5 years ago

    Brilliant. I find this video a week after handing in my dissertation on vocal synthesis... This would have changed everything

    • @Bisqwit
      @Bisqwit  5 years ago

      How so?

    • @mattg5461
      @mattg5461 5 years ago

      There's just a lot of things you've covered in here that I wasn't able to find much concrete information about - things like accents and dialects especially. Lots of things like that which I knew from common sense but couldn't find actual written documentation to back up.

  • @BeeBaux
    @BeeBaux 5 years ago +1

    Great job, bro! Thanks for making a complex thing easier.

  • @DrSid42
    @DrSid42 6 years ago +1

    Just had an idea: I will make my own speech synth. I wondered if there is some nice example that is low-level enough. And guess what, this guy had the same idea, just in time to have it done now. Great job!

  • @DynamicFortitude
    @DynamicFortitude 6 years ago +2

    8:00 Buzzer cannot be pure sine, because then the filtering of the frequencies would make no sense - there would be only one frequency in buzzer to start with. Buzzer needs to have rich frequency spectrum, but at the same time it needs to be harmonic (i.e. all frequencies are natural multiples of some base frequency = there is a defined pitch).
    You could use any function of the form A*sin(x) + B*sin(2x) + C*sin(3x) + ..., but of course the easiest way to produce a signal like that is to use 1) a square wave, 2) a sawtooth wave (as you did), 3) a triangle wave, 4) exp(sin(x)), etc.
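    For illustration, one period of a band-limited sawtooth "buzz" built exactly as that sum of sines (a sketch; `sawtoothPeriod` is my name, not code from the video):

    ```cpp
    #include <cmath>
    #include <vector>

    // One period of a sawtooth-like buzz as a harmonic series:
    // x(n) = sum_{h=1..H} sin(h * 2*pi*n/N) / h
    // Every harmonic is present, so a formant filter has a full spectrum
    // to carve vowels out of.
    std::vector<double> sawtoothPeriod(int N, int harmonics)
    {
        const double pi = std::acos(-1.0);
        std::vector<double> out(N, 0.0);
        for (int n = 0; n < N; ++n)
            for (int h = 1; h <= harmonics; ++h)
                out[n] += std::sin(h * 2 * pi * n / N) / h;
        return out;
    }
    ```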

    • @Bisqwit
      @Bisqwit  6 years ago

      Good explanation, but not an ELI5. I had trouble explaining it in layman terms without invoking mathematics and frequency spectrums... That's why I wrote the annotation.

    • @DynamicFortitude
      @DynamicFortitude 6 years ago +1

      @@Bisqwit Vocal cords are the buzzers. Air goes through a buzzer, then through a tube (the vocal tract) which amplifies some frequencies (formants) and dampens others. If the buzzer sound were just one sine wave, then the tube would just make it louder or quieter, nothing more. The tube cannot create new frequencies; it acts as a filter only. So the aim for the buzzer is to generate many frequencies, so the tube (vocal tract) has something to choose from. White noise (during whispering) has all frequencies, so it is OK. A pitched sound is also OK, since it has many sine waves in it, as long as its base frequency is not too high (it's easier to understand bass singing than soprano singing!). A high pitch has fewer sine waves in the formant frequency range (~300-3000 Hz). Try changing VoicePitch to ~1046 Hz (a soprano's high C), and you won't be able to distinguish the vowels o from u from a, or e from i.

  • @thetastefultoastie6077
    @thetastefultoastie6077 6 years ago +25

    I've never seen `++i %= max` before. That's pretty cool.
    Edit: it seems this only works in C++ but not in C, Java or Javascript

    • @Bisqwit
      @Bisqwit  6 years ago +20

      In C++, operator++() returns a reference to the object being modified. This is not the case in C. This has nothing to do with C++17 or about sequence points. If the expression was `i++ %= max`, it would be a different story. `++i %= max` is completely unambiguous in its meaning. The reason it does not work in C is because `++i` returns a non-lvalue copy of the variable in C, not a reference to it. (C does not have references.)
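      A self-contained illustration of that point (the `advance` helper is mine, not from the video):

      ```cpp
      // In C++, ++i yields an lvalue referring to i itself, so it can be
      // assigned to. `++i %= max` increments i, then reduces it modulo max:
      // a one-expression rolling index. The same line is ill-formed in C,
      // where ++i is a non-lvalue copy.
      unsigned advance(unsigned& i, unsigned max)
      {
          ++i %= max;
          return i;
      }
      ```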

    • @thetastefultoastie6077
      @thetastefultoastie6077 6 years ago

      @@Bisqwit Thanks for the explanation!
      I used an online compiler to quickly try all versions of C++ and indeed it worked in all of them.

    • @Smaxx
      @Smaxx 6 years ago

      @@shaurz I'd just write a tiny inline function with a speaking name instead. ;) Like `incmod(v, m)`

    • @DrSid42
      @DrSid42 6 years ago +5

      @@shaurz It seems weird to you because of a different background. Finnish folk did it like this for centuries.

    • @noneofyourbeeswax3460
      @noneofyourbeeswax3460 5 years ago

      @@DrSid42 I don't think computers have been around for centuries

  • @RamLaska
    @RamLaska 5 years ago +2

    I did something like this in the early nineties. I recorded my voice on my Mac SE and wrote a HyperCard stack to play the correct sounds together. It didn't translate English into phonemes; you had to write out your own phonemes, but that wasn't quite so unusual at that time.
    I also only made one recording per phoneme, because ain't nobody got time to record every possible phoneme pair 😂

  • @tomaszx7760
    @tomaszx7760 4 years ago +2

    I remember playing with the "Say" speech synthesizer from the Workbench 1.3 OS (on the Amiga 500 computer)

  • @d3ibit
    @d3ibit 6 years ago

    Joel, always a pleasure to watch a C++ (related in some way) video. Keep the good work!

  • @stennisrl
    @stennisrl 6 years ago +1

    Wow, what a cool video to wake up to. Excellent work!

  • @moth.monster
    @moth.monster 5 years ago +3

    Now we need to record the speech synth speaking and use that to make another synth

  • @miszczklasykuw3025
    @miszczklasykuw3025 6 years ago +1

    The music in the background adds a nice atmosphere to the video, as always x)

  • @skilz8098
    @skilz8098 6 years ago +1

    Once again; another great video!

  • @JokerCat-x2t
    @JokerCat-x2t 7 months ago

    I know this is 5 years old, but it's still cool to listen to.

  • @MrGoatflakes
    @MrGoatflakes 5 years ago +5

    6:34 if you say this five times into a mirror at night you will summon a Bisqwit :P

  • @yukimoe
    @yukimoe 6 years ago +19

    So you're basically teaching us how to make Vocaloid-like software? Nice.

    • @ceablue8037
      @ceablue8037 6 years ago

      @jj zun Yesssssssssssssssssssssssss

  • @AT-zr9tv
    @AT-zr9tv 3 years ago

    Your videos are fantastic.
    This one particularly.

  •  5 years ago +1

    Super interesting article. Thanks!

  • @adam7868
    @adam7868 6 years ago

    I think I remember asking about this at one point, glad to see a video done on it

  • @alexhauptmann298
    @alexhauptmann298 5 years ago +1

    ELI5 explanation for why you can't use a sine wave: the human voice is essentially a subtractive synthesizer. Most commercial music synthesizers can do some form of this. It's the same sort of "buzzer in a tube" model, except the tube is generally way simpler (unless you're Plogue, but that's another story).
    The reason a sine wave can't be used is because subtractive synthesis works by taking away frequencies from a harmonically-rich (i.e. complex waveform) sound. Any given wave can be recreated by an arbitrary number of sine waves, but a sine wave can't be broken down into something simpler. So essentially, a sine wave can't be used because it's not enough data. It mathematically cannot be subtracted from any further.
    This is...more complex than I was intending but oh well lmao

    • @Bisqwit
      @Bisqwit  5 years ago +2

      Good explanation, but definitely not something that works for five-year-olds :)

    • @alexhauptmann298
      @alexhauptmann298 5 years ago +1

      @@Bisqwit Haha, I figured. Is that a QRIO in the thumbnail btw? I wanted one SO BAD as a little kid and was thoroughly impressed with how realistic the synthesized speech sounded. Of course, now I know (from experience, even) that Japanese is a MUCH easier language to synthesize than English.
      Also while watching your video on Finnish phonetics, I found it interesting how it's sort of similar to Japanese (vowels with singular pronunciation, lengthened vowels and consonants). I wonder if that would make it technically easier to synthesize than English (at least, native-speaker English)...at the very least, it would make the plaintext dictionary rules much easier :P

    • @Bisqwit
      @Bisqwit  5 years ago +2

      It’s a Nao, not Qrio. And yes, as a Finnish person who knows the basics of Japanese, I find Japanese much easier and familiar in many aspects compared to English.

  • @farteryhr
    @farteryhr 5 years ago

    virtual singer Bisqwitoid confirmed (slap
    have you played with UTAU (singing synthesis software) in which it's very easy to make your own voicebank (and get quality high)? looking forward to that soooo much~
    it's just wonderful to find another common interest of you and me.. phonology and speech/singing synthesizing!
    (but yes to get high quality it needs deeper understanding of singing in timing, rhythm, grammar, and much time to fine-tune pitch, volume, breathiness envelopes for songs)

  • @GibusWearingMann
    @GibusWearingMann 5 years ago +13

    I'm starting to become curious how to stop prison radicalization.

  • @gandolfphoenix1363
    @gandolfphoenix1363 5 years ago +2

    You used the speech synthesizer that you made to give the Tutorial!

    • @Bisqwit
      @Bisqwit  5 years ago +1

      Yes, I used it in the first few seconds of this video.

  • @gero9307
    @gero9307 3 years ago +1

    I created a CVVC- and VCV-type voicebank for UTAU, and while watching this video I experienced deja vu)

  • @codeninja1832
    @codeninja1832 6 years ago +1

    This is interesting as a programmer, as someone who's trying to learn another language (Old English; a dead language, sure, but fun), and as someone who asked you how to trill about a month ago, haha.
    Still can't trill, but I'm on my way.

    • @Bisqwit
      @Bisqwit  6 years ago

      Thanks for posting!

  • @edo9k
    @edo9k 5 years ago +4

    I wish I had seen this video when I was researching for the master's degree.

    • @Bisqwit
      @Bisqwit  5 years ago +2

      What did you write about?

  • @Thebasicmaker
    @Thebasicmaker 4 years ago

    I also made a speech synthesizer using the same procedure, but my language was BASIC! And the voice was mine too, pronouncing a word and then cutting out the part that I needed. The program just had to load the sounds and play them one after the other to speak a phrase I give to an input instruction.

  • @XTpF4vaQEp
    @XTpF4vaQEp 5 years ago +6

    13:15 accidentally used the whisper effect

  • @uxxlabrute
    @uxxlabrute 6 years ago +3

    Earthbound music in the background FeelsgoodMan

  • @themcc1879
    @themcc1879 6 years ago +6

    Sample voice frame to C code... the Lisp lover in me says you should have used Lisp: code as data and data as code. Either way, this was beyond interesting. I like your accent, but to be honest, everyone who speaks English has an accent. The voice speaking with an accent was definitely something I wasn't expecting this Monday.

  • @pedropereirapt
    @pedropereirapt 5 years ago

    So inspiring! Thanks for this video, you got a new sub!

  • @Catbangin
    @Catbangin 6 years ago +1

    Cheers, Bisqwit! Almost near to a guitar effects tutorial!

  • @dgmsstuff
    @dgmsstuff 6 years ago +3

    I'm speechless. No pun intended.

  • @zeppy13131
    @zeppy13131 5 years ago +1

    I can't speak for anyone else, but I was glad when this was Finnished.

  • @robertboran6234
    @robertboran6234 6 years ago

    Great Project. Thanks for sharing.

  • @krank3869
    @krank3869 5 years ago

    I always thought these videos were sped up, but then I looked at the clock

  • @ruadeil_zabelin
    @ruadeil_zabelin 6 years ago

    Note that std::wstring_convert is deprecated in C++17, so if you want to be standard conforming, you should replace it with something else.

    • @Bisqwit
      @Bisqwit  6 years ago +1

      Noted. I used it for 1) its brevity and 2) because I couldn’t figure out a concise replacement that is not deprecated.

    • @ruadeil_zabelin
      @ruadeil_zabelin 6 years ago

      @@Bisqwit Unfortunately there isn't a standard way anymore. The standards committee has said that they're working on a replacement, but will only re-add it if it's fully compliant with the Unicode standards (apparently this one didn't work in all cases). The only way seems to be to fully implement it yourself (UTF-8 decoding isn't very hard, luckily), or use a library like iconv or libicu.
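      For reference, a bare-bones UTF-8 to UTF-32 decoder of the kind this comment suggests writing yourself. This is a sketch with no validation (malformed input produces garbage code points rather than errors), not production code:

      ```cpp
      #include <cstddef>
      #include <string>

      // Decode UTF-8 bytes into UTF-32 code points by reading the lead
      // byte to find the sequence length, then folding in continuation
      // bytes six bits at a time. No error checking.
      std::u32string utf8ToU32(const std::string& s)
      {
          std::u32string out;
          for (std::size_t i = 0; i < s.size(); )
          {
              unsigned char c = s[i];
              char32_t cp;
              int extra;
              if      (c < 0x80) { cp = c;        extra = 0; }  // ASCII
              else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }  // 2-byte
              else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }  // 3-byte
              else               { cp = c & 0x07; extra = 3; }  // 4-byte
              ++i;
              for (; extra > 0 && i < s.size(); --extra, ++i)
                  cp = (cp << 6) | (s[i] & 0x3F);
              out.push_back(cp);
          }
          return out;
      }
      ```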

  • @akj7
    @akj7 6 years ago +1

    Question: At 9:17, why do you have: constexpr unsigned maxOrder? What is the purpose of the constexpr here? Won't the compiler evaluate what maxOrder is without the constexpr? Why didn't you use const?

    • @Bisqwit
      @Bisqwit  6 years ago +3

      For integers, there is not much difference between const and constexpr. I just like to document the intention. The primary target audience of source code is people, after all. When I write “constexpr”, I mean “this should be a compile-time constant, and something probably depends on the fact”. Here, MaxOrder _needs_ to be a compile-time constant, because it is used as an array dimension. When I write “const”, I mean “this is immutable; it should be only read, not written to”. For example, the constant “rate” is not intended to be changed, but I don’t necessarily need it to be a compile-time constant, even though it happens to be.
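      A tiny illustration of the distinction. The names echo this thread; the values are placeholders, not the video's actual numbers. Note that for integral types a `const` with a constant initializer is itself usable as an array bound, so `constexpr` here mainly documents intent:

      ```cpp
      constexpr unsigned MaxOrder = 99;  // must be compile-time: used as an array dimension
      const unsigned rate = 44100;       // merely immutable by intent

      float bp[MaxOrder];                // requires a constant expression
      ```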

  • @JoLiKMC
    @JoLiKMC 6 years ago +2

    I, for one, welcome our new, Finnish robot overlords. _Hail Roboisqwit!_
    Seriously, though, this is neat-as-hell. It's also kind of… heartbreaking, in a way. I never considered how speech synthesis works, and now that I know? The magic… is gone. :(

  • @firemaniac10010
    @firemaniac10010 5 years ago +4

    I'm guessing the "buzz" can't be a pure sine wave because a pure sine wave has no harmonics; it's a pure tone. In other words, there's nothing to filter out except for one single frequency.

  • @minecrafttheobjectno541
    @minecrafttheobjectno541 5 years ago +1

    Did I hear a turret say "Weeee" when he said "thumbs up the video"?

  • @GabrielCrowe
    @GabrielCrowe 6 years ago +1

    Awesome stuff.

  • @Darksoulmaster
    @Darksoulmaster 6 years ago +5

    Wow, I don't know what you are even talking about, but it's cool.

    • @Bisqwit
      @Bisqwit  6 years ago +10

      Speech synthesis

  • @aprilliac
    @aprilliac 6 years ago

    Rolling index, why didn't I think of that... Thanks for the excellent video. :)

    • @Bisqwit
      @Bisqwit  6 years ago

      Yeah, a rolling index is a bit neater solution than doing a copy-backwards-by-1 loop after each iteration. On the other hand, the rolling index makes SIMD optimizations impossible, so it’s a tradeoff.
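      The two bookkeeping strategies being compared might look like this (an illustrative sketch, not the video's code):

      ```cpp
      #include <cstddef>
      #include <vector>

      // (a) Shift everything back by one each sample; simple and
      // SIMD-friendly, but O(p) memory traffic per sample.
      void pushShift(std::vector<double>& hist, double y)
      {
          for (std::size_t k = hist.size(); k-- > 1; )
              hist[k] = hist[k - 1];
          hist[0] = y;
      }

      // (b) Rolling index: overwrite the oldest slot and advance a cursor;
      // O(1) per sample, but the wraparound defeats SIMD.
      void pushRolling(std::vector<double>& hist, std::size_t& offset, double y)
      {
          hist[offset] = y;
          offset = (offset + 1) % hist.size();
      }
      ```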

  • @Sturmtreiben
    @Sturmtreiben 5 years ago +1

    Which graphics software do you use for creating pictures like the one in 3:00? They somehow look really good.

    • @Bisqwit
      @Bisqwit  5 years ago +1

      Thanks. I use LibreOffice Impress. I also do some postprocessing in kdenlive; basically all _animations_ are done in the video editor.

    • @Sturmtreiben
      @Sturmtreiben 5 years ago

      Thanks, Joel!

  • @j5679
    @j5679 1 year ago

    Very interesting video. I may have missed it but it seems like you are not incorporating stress accent into your synthesis, right?
    Algorithmically figuring out where the stress lies may be a bit of a challenge depending on the language (or be downright impossible), but the English Wiktionary actually provides this data and they also offer regular HTML dumps that contain IPA transcriptions. Finnish actually happens to be one of the best covered languages on the English Wiktionary, so if you ever decide to do a v2 of this project, incorporating Wiktionary's IPA data might be an idea.
    I'm not sure how much you know about phonetics but please be aware that IPA does not fully capture how words are pronounced. Phonemic transcriptions don't capture it by a long shot but even a narrow phonetic transcription can be slightly inaccurate (vowel qualities are a continuum, the different durations are on a continuum etc.). This all is to say that even if you use IPA data, the rest of the pipeline still needs to be tailored to a specific language and can't produce accurate output language-agnostically.

    • @Bisqwit
      @Bisqwit  1 year ago +1

      From Wikipedia: ”Since stress can be realised through a wide range of phonetic properties, such as loudness, vowel length, and pitch (which are also used for other linguistic functions), it is difficult to define stress solely phonetically.”
      In the Finnish language (this synth aims to speak like Finnish speakers do), emphasis (stress) is always on the first syllable. In my speech synthesizer, it is realized by using a slightly higher pitch for stressed phonemes.
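      That rule is simple enough to sketch in code (the pitch factor is an illustrative guess, not the synthesizer's actual value):

      ```cpp
      #include <cstddef>

      // Finnish stress always falls on the first syllable; realize it as a
      // slight pitch boost on that syllable's phonemes. The 1.1 factor is
      // a placeholder.
      double stressedPitch(std::size_t syllableIndex, double basePitch)
      {
          return syllableIndex == 0 ? basePitch * 1.1 : basePitch;
      }
      ```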

  • @Bleenderhead
    @Bleenderhead 6 years ago +2

    I want to hear it sing Space Oddity.

  • @ivanbogdasaebersold4690
    @ivanbogdasaebersold4690 5 years ago +1

    This will be my COVAS in Elite Dangerous...

  • @fisu51
    @fisu51 5 years ago +2

    Kyllä

  • @Embedonix
    @Embedonix 6 years ago +1

    +1 for using 'goto' in your code :)

  • @Smaxx
    @Smaxx 6 years ago +2

    Your failing lowercase conversion for umlauts is a pretty nasty trap I fell into in the past as well. It looks like you're doing everything correct, it should work, yet it somehow doesn't. Unfortunately, it's not as easy as imbuing/passing the correct locale. It might be, but that's not guaranteed.
    Even when using the UTF-8 locale, you might just walk char by char and for whatever reason ignore UTF-8 sequences… So far for me it always worked when using the wide character version instead (i.e. `wchar_t` over `char` or possibly `uint32_t`), although I've heard even that fails for some. Guess it's not totally unexpected I've heard stuff about dropping the `codecvt` header from the standard…
    So in your case I'd just do the `std::u32string` conversion first, then unify character casing after that.
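    A sketch of that order of operations, assuming the text is already decoded to `std::u32string`. Whether non-ASCII letters actually fold depends on the active locale, so this is illustrative rather than portable:

    ```cpp
    #include <cwctype>
    #include <string>

    // Case-fold whole code points rather than raw UTF-8 bytes, so
    // std::towlower sees e.g. U+00C4 ('Ä') as one unit. ASCII always
    // folds; other letters fold only under a suitable locale.
    std::u32string lowercase(std::u32string s)
    {
        for (char32_t& c : s)
            c = static_cast<char32_t>(std::towlower(static_cast<std::wint_t>(c)));
        return s;
    }
    ```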

  • @Eodese
    @Eodese months ago

    1:49 this is exactly the process to make a UTAU voicebank

    • @Bisqwit
      @Bisqwit  months ago

      Interesting. Is there a video about that?

  • @gazehound
    @gazehound 6 years ago +1

    I'm early this time. Awesome video!

    • @Bisqwit
      @Bisqwit  6 years ago +1

      Thank you!

  • @cmyk8964
    @cmyk8964 3 years ago

    ELI5 of why a sine wave won’t work:
    Formants are a result of our mouth shape causing resonance. Resonance causes certain frequencies of the original sound wave to be emphasized, which forms the thick stripes in the spectrogram.
    Pure sine waves are too pure to resonate with the varied frequencies that human vowels require, because it’s just made of 1 frequency, while a human voicebox generates a mix of a bunch of frequencies.

  • @clearz3600
    @clearz3600 6 years ago

    Interesting as always.

  • @primodernious
    @primodernious 4 years ago

    This is not how I would have done it; my method is much simpler. I would speak each letter one by one, run a waveform analysis on each letter spoken, then cut and paste the single waveform shape of each letter and store them in a binary array with spaces between them to indicate a blank. Then I only need to find out how to generate repetitions of each letter's waveform to mimic the spelling of each letter, and increase or decrease the amplitude of each waveform's repetitions until it sounds like what I originally recorded, and just program some algorithms to repeat these arrangements of repetitions and amplitudes automatically. Then I would use another array to pile up these vocal waveforms from the text interpreter in such a way that it completes a sentence in binary, then send the binary string of numbers to the ports and let the computer speak. The program would assign the specific waveform of each letter to the actual letters that make up the sentence, so the program reads a letter, uses its copy in waveform format, guesses the repetitions with the other letters, and piles up the string of vocal letters before sending it to the speakers. The holy grail is in the waveform of each letter, and it can be separated. The way speech works is like a computer program: the structure is programmed to begin with; the rest is just some randomness and imperfection. The frequency of speech goes up and down as well. Voice has to be seen as shapes in sequence, where geometry makes up the sound.

  • @ShotgunLlama
    @ShotgunLlama 2 years ago

    In this video in Praat, he uses LPC (Burg). Would this work with LPC using covariance or autocorrelation?

  • @HerrRussoTragik
    @HerrRussoTragik 6 years ago

    Ohhh, in the past I made a pseudo "TTS" using winmm from windows.h and the PlaySound function...

  • @clementpoon120
    @clementpoon120 4 years ago

    Pipe a chatbot into it and give it a GLaDOS voice, and you've got yourself a GLaDOS.

  • @jfkd2812
    @jfkd2812 5 years ago

    11:01 Hey, it's imgui! Very nice to use

  • @smallgoodwoodoodaddy
    @smallgoodwoodoodaddy 6 years ago +2

    I always liked your accent. So I liked it 👍 :D

  • @yohvh
    @yohvh 5 years ago +1

    When you find a problem after playing back the audio, do you just think of a solution in real time and code it right there, at that speed?

  • @videogamemusicandfunstuff4873
    @videogamemusicandfunstuff4873 6 years ago

    11:01 This program looks really nice. What GUI library did you use?

    • @Kellykellamster
    @Kellykellamster 6 years ago

      Looks like imgui to me.

    • @Bisqwit
    @Bisqwit  6 years ago

      Yep, correct. Imgui it is.

  • @oo8dev
    @oo8dev 6 years ago +1

    Amazing!!

  • @DanieleMarchei
    @DanieleMarchei 6 years ago +11

    Yes, but now we want to listen to the synthesizer's voice

  • @juniorsilvabroadcast
    @juniorsilvabroadcast 5 years ago

    Bisqwit, can you help me with something? I'm looking for an advanced audio clipper built on the VST architecture: some type of clipper that doesn't let any small peaks through to the output. 4x oversampling would help a lot, but even that is badly implemented in the traditional audio clippers available on the internet. I have an FM audio processor made with VST technology using some VST plug-ins available on the internet, and the big issue is the clipper. It lets out small peaks, which makes the processing difficult because I need an ISP-protected limiter at the end, and that makes the sound go down when high-frequency material is played.

    • @Bisqwit
    @Bisqwit  5 years ago

      I am not sure what exactly it is you want. It kind of sounds like you want a soft limiter, though. I don’t particularly have experience with VST plugins, aside from trying to install them for use in Audacity at some point, getting it working, and then at some later point noticing that the plugins are no longer there and being too indifferent to study further why.
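      (A soft limiter, as mentioned, can be as simple as a tanh waveshaper. The sketch below is a minimal illustration, not anything from the video; the `drive` parameter is an invented knob. It guarantees the output never exceeds ±1 regardless of input peaks, while staying nearly linear for quiet material.)

      ```cpp
      #include <cassert>
      #include <cmath>

      // Soft limiter: smoothly squashes the signal into (-1, 1).
      // Nearly transparent for small inputs; never lets a peak through.
      static double softLimit(double x, double drive = 1.0)
      {
          return std::tanh(drive * x);
      }

      int main()
      {
          // Output is strictly bounded, even for huge input peaks.
          assert(std::abs(softLimit(1000.0))  < 1.0);
          assert(std::abs(softLimit(-1000.0)) < 1.0);
          // Quiet material passes almost unchanged (tanh(x) ≈ x for small x).
          assert(std::abs(softLimit(0.05) - 0.05) < 0.001);
          return 0;
      }
      ```

      (A limiter that is truly safe against inter-sample peaks would additionally oversample before clipping, as the question notes; that is beyond this sketch.)
    
    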

  • @skilz8098
    @skilz8098 6 years ago

    I'm wondering if the technology that is used to transfer data from Vinyl Record Albums into mp3 files would be of any assistance... Then just filter out the background music until you have pure voice. Then you can have a singing speech synthesizer.

    • @Bisqwit
    @Bisqwit  6 years ago

      As for the first sentence, I fail to see the relevance. As for the second sentence, what kind of solutions do you have for “filtering out background music”? Even on YouTube* it depends on correctly identifying the original recording (with or without lyrics), leaving only the added commentary and sound effects, and even then the resulting audio sounds quite hollow.
      *) YouTube has a tool that allows video creators to remove a song that infringes copyright, once YouTube has identified the infringement using ContentID. Often it results in simply muting that region of the video, but sometimes it successfully removes the song, leaving only the commentary.

  • @victorprokop2240
    @victorprokop2240 4 years ago

    3:16 Mongolian throat singing!!! lmao

    • @Bisqwit
    @Bisqwit  4 years ago

      I’m not sure if you are mocking, but the principle is actually similar. The purpose is to enunciate different subtones while keeping the primary tone unchanged.

  • @jamescumbria4499
    @jamescumbria4499 5 years ago +1

    Are you going to make this speech synthesizer a TTS voice for Windows?

    • @Bisqwit
    @Bisqwit  5 years ago

      I don’t deal with Windows.

  • @MESYETI
    @MESYETI 5 years ago

    Wow!
    I might try to make one; it seems hard, though

  • @arcnorj
    @arcnorj 6 years ago

    Can you explain just a bit what you did to generate the LPC sample from David Woods? I guess manually editing the pitch curve with Praat?

    • @Bisqwit
    @Bisqwit  6 years ago +1

      I dumped the soundtrack of the video into a wav file using MPlayer. Then I opened the soundtrack in Audacity, and cropped it into just those three seconds or so, saved it into a new wav file. (Or maybe I dumped only three seconds from the soundtrack in the first place, using -ss and -endpos options. I don’t remember.) Then I opened the wav file in Praat, and did nothing else but synthesized the LPC from it (Analyze spectrum → To LPC (burg) → Save).
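      (Praat's "To LPC (burg)" step corresponds to Burg's recursion. The following is a rough self-contained sketch of that analysis, not Praat's actual implementation, which adds windowing, framing, and pre-emphasis. The resulting coefficients a[1..p] are the bₖ of the LPC synthesis formula yₙ = eₙ − ∑ bₖ yₙ₋ₖ. As a sanity check, a clean sinusoid is perfectly predicted by an order-2 model with a₁ = −2cos(ω), a₂ = 1.)

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <vector>

      // Burg's method: estimate LPC coefficients a[1..p] (with a[0] == 1) so that
      // x[n] ≈ -a[1]x[n-1] - … - a[p]x[n-p], minimizing forward+backward error.
      static std::vector<double> burg(const std::vector<double>& x, int p)
      {
          const int N = int(x.size());
          std::vector<double> a(p + 1, 0.0), f(x), b(x);  // forward/backward errors
          a[0] = 1.0;
          double D = 0.0;
          for (double v : x) D += 2 * v * v;
          D -= x[0]*x[0] + x[N-1]*x[N-1];
          for (int k = 0; k < p; ++k)
          {
              double num = 0.0;
              for (int n = k + 1; n < N; ++n) num += f[n] * b[n-1];
              double mu = -2.0 * num / D;                  // reflection coefficient
              for (int n = 0; n <= (k + 1) / 2; ++n)       // Levinson-style update of a[]
              {
                  double t1 = a[n] + mu * a[k+1-n], t2 = a[k+1-n] + mu * a[n];
                  a[n] = t1; a[k+1-n] = t2;
              }
              for (int n = N - 1; n > k; --n)              // update error sequences
              {
                  double t1 = f[n] + mu * b[n-1], t2 = b[n-1] + mu * f[n];
                  f[n] = t1; b[n-1] = t2;
              }
              D = (1 - mu*mu) * D - f[k+1]*f[k+1] - b[N-2-k]*b[N-2-k];
          }
          return a;
      }

      int main()
      {
          // x[n] = sin(w·n) satisfies x[n] = 2cos(w)x[n-1] - x[n-2],
          // so Burg with p = 2 should find a ≈ {1, -2cos(w), 1}.
          const double w = 0.3;
          std::vector<double> x(400);
          for (int n = 0; n < 400; ++n) x[n] = std::sin(w * n);
          auto a = burg(x, 2);
          assert(std::abs(a[1] + 2 * std::cos(w)) < 0.01);
          assert(std::abs(a[2] - 1.0) < 0.01);
          return 0;
      }
      ```
    
    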

  • @1st_ProCactus
    @1st_ProCactus 5 years ago +1

    Awesome !!!

  • @Armadurapersonal
    @Armadurapersonal 6 years ago +4

    Perfect for spurdo memes

  • @pencrows
    @pencrows 5 years ago +1

    Was all the audio in this video speech-synthesized?
    Edit: it's not

  • @alejandroduarte5245
    @alejandroduarte5245 6 years ago

    Great video :)

  • @siddharthkalantri5076
    @siddharthkalantri5076 5 years ago

    Thank you. I use the DECtalk synth with my screen reader and always search for a Hindi-language voice. Perhaps one day I will get my wish; looking at your Finnish voice synth, it seems possible.

  • @vegardertilbake1
    @vegardertilbake1 6 years ago

    Ha! This was so much fun!