Let’s Create a Speech Synthesizer (C++17) with Finnish Accent!

  • Published on 27 Jan 2019
  • In this tool-assisted education video, we create a speech synthesizer in modern C++, with a Finnish accent. The video deconstructs speech and phonemes and explores Linear Predictive Coding (LPC). The open source programs Praat and Audacity are featured.
    All downloadable materials, including source code: github.com/bisqwit/speech_syn...
    Become a member: th-cam.com/users/Bisqwitjoin
    My links:
    Twitter: / realbisqwit
    Liberapay: liberapay.com/Bisqwit
    Steady: steadyhq.com/en/bisqwit
    Patreon: / bisqwit (Other options at bisqwit.iki.fi/donate.html)
    Twitch: / realbisqwit
    Homepage: iki.fi/bisqwit/
    Credits, in order of appearance:
    - Music: original composition :: untitled :: Joel Yliluoma
    - Music: Duke Nukem 3D :: Shop n Bag :: Lee Jackson
    - Music: Earthbound :: Twoson :: Akio Ōmuri and others (converted into MIDI and played through OPL3 emulation through homebrew software)
    - Meme: The Fairly OddParents :: If I Had One :: Butch Hartman
    - Music: Chrono Trigger :: Ending :: Yasunori Mitsuda (SPC-OPL3 conversion)
    - Presentation: Haifa University :: Linear Predictive Coding :: Nimrod Peleg 2009
    - Music: original composition :: Space :: Joel Yliluoma
    - Music: Final Fantasy V :: Airship :: Nobuo Uematsu (SPC-OPL3 conversion)
    - Music: Tales of Phantasia :: Freeze :: Motoi Sakuraba (SPC-OPL3 conversion)
    - Music: Aryol :: Warmup :: Kyohei Sada (SPC-OPL3 conversion)
    How to stop prison radicalization: • Video
    You can contribute subtitles: th-cam.com/users/timedtext_vid... or to any of my videos: th-cam.com/users/timedtext_cs_...
    #speechsynthesis #bisqwit #programming

Comments • 297

  • @Bisqwit
    @Bisqwit 5 years ago +87

    This is LPC (Linear Predictive Coding): yₙ = eₙ − ∑(ₖ₌₁..ₚ) (bₖ yₙ₋ₖ) where
    ‣ y[] = output signal, e[] = excitation signal (buzz, also called predictor error signal), b[] = the coefficients for the given frame
    ‣ p = number of coefficients per frame, k = coefficient index, n = output index
    Compare with FIR (Finite Impulse Response): yₙ = ∑(ₖ₌₁..ₚ) (bₖ xₙ₋ₖ) where
    ‣ x[] = input signal
    The similarities between the two are striking. FIR is used in applications like low-pass filtering, high-pass filtering, band-pass filtering, band-stop filtering, etc. It is an almost magical type of mathematics that is used to generate these filters. For LPC, there are several different algorithms, many of which are implemented in Praat, the software that I used in this video to create my LPC files.
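A minimal C++17 sketch of the LPC recurrence in this comment, assuming a single frame with fixed coefficients and an all-zero history before the first sample. The name `lpc_synthesize` and its parameters are illustrative and are not taken from the video's source code.

```cpp
#include <cstddef>
#include <vector>

// Direct-form LPC synthesis: y[n] = e[n] - sum_{k=1..p} b[k] * y[n-k].
// coeffs[0] holds b[1], coeffs[1] holds b[2], and so on (p = coeffs.size()).
// Output samples before index 0 are treated as 0 (assumption for this sketch).
std::vector<double> lpc_synthesize(const std::vector<double>& excitation,
                                   const std::vector<double>& coeffs)
{
    std::vector<double> y(excitation.size(), 0.0);
    for (std::size_t n = 0; n < excitation.size(); ++n)
    {
        double sum = 0.0;
        for (std::size_t k = 1; k <= coeffs.size() && k <= n; ++k)
            sum += coeffs[k - 1] * y[n - k];
        y[n] = excitation[n] - sum;
    }
    return y;
}
```

Feeding a single impulse as the excitation shows the feedback at work: with one coefficient of -0.5, each output sample is half of the previous one.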

    • @huyvole9724
      @huyvole9724 5 years ago +2

      I met that formula when I took the Signals & Systems module (my school calls it Digital Signal Processing)

    • @BichaelStevens
      @BichaelStevens 5 years ago +1

      16-17 minutes in:
      Please next time lower the audio or give a warning. The popping killed my hearing

    • @Rennu_the_linux_guy
      @Rennu_the_linux_guy 5 years ago

      uhhh

    • @a1k0n
      @a1k0n 5 years ago +3

      In fact it's identical to an IIR filter, which has coefficients for both x and y, and your x coefficient is 1 and all your y coefficients are negated.

    • @RazorM97
      @RazorM97 5 years ago +1

      How to stop prison radicalization

  • @KatzRool
    @KatzRool 5 years ago +93

    I was going to make a joke about how you already sound like speech synthesis when speaking English, but your English gets better every single video. Keep it up man!

  • @educate9946
    @educate9946 5 years ago +117

    Now I can have Robot Bisqwit wake me up every morning.

    • @thefoolishgmodcube2644
      @thefoolishgmodcube2644 5 years ago +13

      Imagine having “SHALOM! SHALOM!” as a wake-up alarm

    • @kkeanie
      @kkeanie 5 years ago

      @David Plays Stuff I really need that. it would stop my depression

    • @PantsYT
      @PantsYT 4 years ago +2

      "Hyvää huomenta"

    • @ktaleentkekma5777
      @ktaleentkekma5777 3 years ago

      robot bisqwit is a pleonasm

  • @greasyfingers9250
    @greasyfingers9250 5 years ago +111

    "Yes, I use PHP. Because a programming language that you know is much more efficient than one that you don't know."
    This is the truest statement I have ever heard.

    • @greasyfingers9250
      @greasyfingers9250 5 years ago +1

      @Michael Smith You can debug it line by line with Xdebug, but C# or Java are usually better for that kind of work.

    • @Kitulous
      @Kitulous 5 years ago +5

      in order to debug PHP you have to var_dump every single variable because the stack trace in PHP is a real mess.

    • @HermanWillems
      @HermanWillems 4 years ago

      Short term yes, long term no.

  • @BichaelStevens
    @BichaelStevens 5 years ago +113

    We have reached peak AI revolution - machines making machines
    A voice synth making a voice synth

    • @akj7
      @akj7 5 years ago +3

      Haha

    • @huyvole9724
      @huyvole9724 5 years ago +4

      -6.4°C

    • @Bisqwit
      @Bisqwit 5 years ago +13

      Actually the truth was like -22. I just happened to do the recording a month earlier...

    • @imlxh7126
      @imlxh7126 1 year ago

      Uberduck has a neural-network-based simulation of Microsoft Sam. Talk about overengineered lmao

  • @x0j
    @x0j 5 years ago +214

    This doesn't fool me, I know you have a much more advanced synthesizer that you use for your videos. A nice coverup attempt though

  • @framegrace1
    @framegrace1 5 years ago +151

    I think the clicks are because the program is cutting/pasting at random waveform values. This produces discontinuities in the waveform, which generate those clicks. I think the simple way to solve it is to wait until the value of the sample crosses the 0 line before cutting the audio, and then wait for a 0 crossing again before introducing the next one.
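The zero-crossing idea in this comment could be sketched as follows. This is only an illustration on raw sample buffers; `next_rising_zero_crossing` is a hypothetical helper, and it does not address the filter-history concern raised further down in this thread.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first sample at or after `start` where the signal
// crosses zero going upward (previous sample < 0, current sample >= 0).
// Returns samples.size() if no such crossing exists.
std::size_t next_rising_zero_crossing(const std::vector<float>& samples,
                                      std::size_t start)
{
    for (std::size_t i = (start == 0 ? 1 : start); i < samples.size(); ++i)
        if (samples[i - 1] < 0.0f && samples[i] >= 0.0f)
            return i;
    return samples.size();
}
```

Cutting the outgoing clip and starting the incoming clip at indices found this way keeps the sample value near zero on both sides of the join, although the slope can still change abruptly.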

    • @KuraIthys
      @KuraIthys 5 years ago +10

      Interesting theory.
      That actually matches advice mentioned in the SNES manual in relation to audio samples. What it is trying to say exactly is ambiguous, but it warns against discontinuities in the waveform, which would result in clicking sounds.
      Of course, given the ADPCM coding, discontinuities on block boundaries would easily result if you're not careful. (since the samples within a block are all expanded using the same parameters, but across block boundaries the parameters change.)

    • @idk-bv3iw
      @idk-bv3iw 5 years ago +14

      What about a simple fade-out/fade-in between the samples?
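A fade-out/fade-in between two clips amounts to a crossfade. The sketch below assumes a linear ramp over a caller-chosen overlap; `crossfade_join` is an illustrative name, not something from the video's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Joins `a` and `b`, blending the last `overlap` samples of `a` with the
// first `overlap` samples of `b` using a linear ramp.
std::vector<float> crossfade_join(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t overlap)
{
    assert(overlap <= a.size() && overlap <= b.size());
    const auto ov = static_cast<std::ptrdiff_t>(overlap);
    std::vector<float> out(a.begin(), a.end() - ov);
    for (std::size_t i = 0; i < overlap; ++i)
    {
        // Weight of `b` rises from just above 0 to just below 1.
        const float t = static_cast<float>(i + 1) / static_cast<float>(overlap + 1);
        out.push_back((1.0f - t) * a[a.size() - overlap + i] + t * b[i]);
    }
    out.insert(out.end(), b.begin() + ov, b.end());
    return out;
}
```

The result is `overlap` samples shorter than the two clips laid end to end, which matters if phoneme timing is computed from clip lengths.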

    • @TheBcoolGuy
      @TheBcoolGuy 5 years ago +4

      @@idk-bv3iw That's the method used in video editing.

    • @crimsun7186
      @crimsun7186 5 years ago +2

      You also have to determine a rhythmic pattern dependent on the language and overall delivery, as words are not spoken at a constant pace.

    • @a1k0n
      @a1k0n 5 years ago +5

      I don't think that will work, because of all the excitation signal history in the bp[] array. Instantaneously changing the filter coefficients can lead to instability. One thing that might help, or might make it worse (I'm not sure) is to try implementing the transposed version where the bp[] array isn't just past output samples, but partially computed future samples. See the notes here: docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html
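For reference, here is a sketch of the transposed structure this comment refers to, specialised to an all-pole filter (numerator 1, denominator 1, b₁..bₚ). It follows the update rule described in the linked scipy `lfilter` notes; the function name is hypothetical, and whether this form actually reduces the clicks when coefficients change between frames is left open, as the comment itself says.

```cpp
#include <cstddef>
#include <vector>

// Transposed all-pole filter. The state z[i] holds partially computed
// future output rather than past output samples.
// coeffs[0] is b[1], coeffs[1] is b[2], and so on.
std::vector<double> allpole_transposed(const std::vector<double>& excitation,
                                       const std::vector<double>& coeffs)
{
    const std::size_t p = coeffs.size();
    std::vector<double> z(p + 1, 0.0);  // z[p] stays 0; it simplifies the loop
    std::vector<double> y;
    y.reserve(excitation.size());
    for (double e : excitation)
    {
        const double out = e + z[0];
        // Ascending order, so z[i + 1] is still the value from the previous step.
        for (std::size_t i = 0; i < p; ++i)
            z[i] = z[i + 1] - coeffs[i] * out;
        y.push_back(out);
    }
    return y;
}
```

With constant coefficients this produces the same output as the direct recurrence yₙ = eₙ − ∑ bₖ yₙ₋ₖ; the two forms only behave differently when the coefficients are swapped mid-stream.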

  • @magicstix0r
    @magicstix0r 5 years ago +14

    The input signal can't be a pure sine wave because:
    1.) The vocal cords don't emit pure sine waves; they emit something more like a buzz.
    2.) A pure sine wave would be almost unaffected by the LPC filters because it's a single frequency.
    A buzz is extremely rich in harmonics, and the human ear keys off the presence or absence of those harmonics in determining what was said. That's why if you look at voice data in a spectrogram, you tend to see lots of streaks that move together or widen/shrink based on what's being said.
    In a sort of philosophical explanation, the input signal is "sampling" your LPC filters. A single sine wave would result in sampling just a single data point. You need a lot of sine waves to get enough of a picture of the LPC filter to see what it looks like, which is what your brain is keying on to make sense of your words.
    Think of it kind of like an image. The sine waves are the pixels that you're building a picture of the LPC filter with. A single sine wave is like a single pixel; it doesn't tell you much. A buzz is loaded with lots of sine waves, so analogously it's loaded with a lot of pixels, so it can give you a better picture of the LPC filter, and thus a better picture of the formant it represents.

    • @Bisqwit
      @Bisqwit 5 years ago +5

      Great explanation! Not an ELI5 though :-) But I would have settled for that.

  • @tomh6339
    @tomh6339 4 years ago +8

    Dude. I haven't used Praat since University, was hit by waves of nostalgia in the most unexpected place. Your videos are the best, you're quite the renaissance man.

  • @metadaat5791
    @metadaat5791 5 years ago +10

    I always liked the implication of GSM using LPC, that technically you're not hearing someone's actual voice, but a reconstruction made of a buzzer with a filter and hisses and pops from filtered noise. So, you're actually listening to a speech synthesizer's reconstruction of the other person's voice! :-) :-)

  • @chooha
    @chooha 5 years ago +31

    Hi bisqwit I don't know if you realize this but you are an inspiration for many of the viewers here, like a hero. So could you make a video about how you reached this insane level of skill, what your journey was like, and maybe some tips on how one can be as good as you ? Thanks for all the amazing content ^_^

  • @magicstix0r
    @magicstix0r 5 years ago +8

    The constant clicks and pops are due to discontinuities at the frame boundaries. With an algorithm like this, they usually fix it using overlap-add. The gist of OLA is that your frames overlap and are weighted by a windowing function, then you sum them together where they overlap.
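A rough sketch of overlap-add as described in this comment, assuming equal-length frames, a periodic Hann window and a caller-chosen hop size. The name `overlap_add` and the choice of window are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Overlap-add: each frame is weighted by a periodic Hann window and summed
// into the output at multiples of `hop`. With hop equal to half the frame
// length, overlapping Hann windows sum to 1.
std::vector<double> overlap_add(const std::vector<std::vector<double>>& frames,
                                std::size_t hop)
{
    if (frames.empty()) return {};
    const std::size_t len = frames[0].size();
    const double pi = std::acos(-1.0);
    std::vector<double> out(hop * (frames.size() - 1) + len, 0.0);
    for (std::size_t f = 0; f < frames.size(); ++f)
    {
        assert(frames[f].size() == len);
        for (std::size_t i = 0; i < len; ++i)
        {
            const double phase = static_cast<double>(i) / static_cast<double>(len);
            const double w = 0.5 - 0.5 * std::cos(2.0 * pi * phase);
            out[f * hop + i] += w * frames[f][i];
        }
    }
    return out;
}
```

In a synthesizer, each frame would be rendered with its own LPC coefficients over a span longer than the frame step, so that neighbouring frames blend instead of switching abruptly.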

  • @jlewwis1995
    @jlewwis1995 2 years ago +1

    Finally a video that actually shows how to ACTUALLY MAKE a TTS voice from scratch, almost everything online about "how to make a text to speech synthesizer from scratch" is just "use this function to call the os TTS library lul"

  • @pixelflow
    @pixelflow 5 years ago +10

    Finally! A Bisqwit Vocaloid :3

  • @MissNorington
    @MissNorington 5 years ago +14

    Really outstanding video! Great work Bisqwit!

  • @prizmarvalschi1319
    @prizmarvalschi1319 3 years ago +4

    This is kinda like how utau users create voicebanks
    Except we sing in 5 syllable strings for Japanese, sometimes more for other languages. And sometimes recorded in three or more pitches.

  • @d3ibit
    @d3ibit 5 years ago

    Joel, always a pleasure to watch a C++ (related in some way) video. Keep up the good work!

  • @shivisuper
    @shivisuper 5 years ago +1

    These videos make me respect you even more. You're very knowledgeable!

  • @wallaguest1
    @wallaguest1 5 years ago +4

    i cant understand how you have so much knowledge, its crazy

  • @OverSeasMedia
    @OverSeasMedia 5 years ago +5

    bisqwit was the inspiration to write my own tools whenever i need one, Great video.

  • @DudeWatIsThis
    @DudeWatIsThis 3 years ago

    Bisqwit you fucking legend man. This is the way to handle the banter. Throw it straight back at them!
    Genius stuff. You win again, good sir!

  • @stennisrl
    @stennisrl 5 years ago +1

    Wow, what a cool video to wake up to. Excellent work!

  • @adam7868
    @adam7868 5 years ago

    I think I remember asking about this at one point, glad to see a video done on it

  • @skilz8098
    @skilz8098 5 years ago +1

    Once again; another great video!

  • @miszczklasykuw3025
    @miszczklasykuw3025 5 years ago +1

    music in background adds nice atmosphere to video as always x)

  • @esteveslisboeta
    @esteveslisboeta 4 years ago

    So inspiring! Thanks for this video, you got a new sub!

  • @BeeBaux
    @BeeBaux 4 years ago +1

    Great job, bro! Thanks for making complex things easier.

  •  5 years ago +1

    Super interesting article. Thanks!

  • @noname-rr7hk
    @noname-rr7hk 2 years ago

    I was searching for this video for half a year. Thankyou...

  • @robertboran6234
    @robertboran6234 5 years ago

    Great Project. Thanks for sharing.

  • @AT-zr9tv
    @AT-zr9tv 3 years ago

    Your videos are fantastic.
    This one particularly.

  • @DrSid42
    @DrSid42 5 years ago +1

    Just had an idea that I would make my own speech synth. I wondered if there was some nice example that is low-level enough. And guess what: this guy had the same idea just in time to have it done now. Great job!

  • @user-ql1hd2my3y
    @user-ql1hd2my3y several months ago

    I know this is 5 years old, but it's still cool to listen to.

  • @Catbangin
    @Catbangin 5 years ago +1

    Cheer bisqwit! Almost near to guitar effects tutorial!

  • @oo8dev
    @oo8dev 5 years ago +1

    Amazing!!

  • @GabrielCrowe
    @GabrielCrowe 5 years ago +1

    Awesome stuff.

  • @kapiltyagi4639
    @kapiltyagi4639 5 years ago +4

    The solution for the clicking in the sound is to simply fade out some of the frequency content at the very end of the sample, because LPC just converts the audio samples into a simple, low-resolution representation of the waveform: just a bunch of float values and a gain.

  • @RamLaska
    @RamLaska 5 years ago +2

    I did something like this in the early nineties. I recorded my voice on my Mac SE, and wrote a HyperCard stack to play the correct sounds together. It didn't translate English into phonemes; you had to write out your own phonemes, but that wasn't so unusual at that time.
    I also only made one recording per phoneme, because ain't nobody got time to record every possible phoneme pair 😂

  • @clearz3600
    @clearz3600 5 years ago

    Interesting as always.

  • @moth.monster
    @moth.monster 5 years ago +3

    Now we need to record the speech synth speaking and use that to make another synth

  • @yukimoe
    @yukimoe 5 years ago +19

    So you're basically teaching us how to make Vocaloid-like software? Nice.

    • @ceablue8037
      @ceablue8037 5 years ago

      @jj zun Yesssssssssssssssssssssssss

  • @oresteszoupanos
    @oresteszoupanos 5 years ago +10

    Joel, regarding your question at 8:05, we cannot use a sine wave because it only has audio energy in 1 frequency, whereas to synthesise human speech, we need energies in "all" frequencies, so we can have base pitches and formants happening at the same time. Buzzers have a better spread of frequencies, compared to the more "pure" sine wave. Hope I made sense ^_^

    • @Bisqwit
      @Bisqwit 5 years ago +3

      Good explanation, but not really an ELI5 :-)
      I understand the situation as indicated elsewhere in the video, but I was having trouble explaining in layman terms without referring to things like frequency spectrum; I wrote that request for the benefit of audience.

    • @oresteszoupanos
      @oresteszoupanos 5 years ago +7

      @@Bisqwit Aha, I'd never heard the term ELI5 (Explain Like I'm 5) before! Here is my second attempt :-)
      Voice sounds are slightly complicated. Sine wave sounds are simple. Buzzers are super-complicated. We cannot use 1 simple sine wave, filter it, and get a complex voice sound. We have to start with a super-complex buzzer, then filter out some things, to be left with a less-complex voice sound.

    • @frisosmit8920
      @frisosmit8920 5 years ago +1

      That's actually a very good explanation. Your first explanation made me understand it. But then again, I'm not 5 years old.

    • @noneofyourbeeswax3460
      @noneofyourbeeswax3460 4 years ago

      But you could superimpose sine waves to get all the frequencies?

    • @Bisqwit
      @Bisqwit 4 years ago

      Yes, and in fact all waveforms can be represented as a sum of sinewaves. That is what e.g. the Fourier transform is about, or the discrete cosine transform.

  • @procactus9109
    @procactus9109 5 years ago +1

    Awesome !!!

  • @edo9k
    @edo9k 5 years ago +4

    I wish I had seen this video when I was researching for the master's degree.

    • @Bisqwit
      @Bisqwit 5 years ago +2

      What did you write about?

  • @GibusWearingMann
    @GibusWearingMann 5 years ago +14

    I'm starting to become curious how to stop prison radicalization.

  • @gazehound
    @gazehound 5 years ago +1

    I'm early this time. Awesome video!

    • @Bisqwit
      @Bisqwit 5 years ago +1

      Thank you!

  • @tomaszx7760
    @tomaszx7760 4 years ago +2

    I remember playing with the "Say" speech synthesizer from Workbench 1.3 (on the Amiga 500)

  • @mattg5461
    @mattg5461 5 years ago

    Brilliant. I find this video a week after handing in my dissertation on vocal synthesis... This would have changed everything

    • @Bisqwit
      @Bisqwit 5 years ago

      How so?

    • @mattg5461
      @mattg5461 5 years ago

      There's just a lot of things you've covered in here that I wasn't able to find much concrete information about - things like accents and dialects especially. Lots of things like that which I knew from common sense but couldn't find actual written documentation to back up.

  • @alejandroduarte5245
    @alejandroduarte5245 5 years ago

    Great video :)

  • @thetastefultoastie6077
    @thetastefultoastie6077 5 years ago +25

    I've never seen `++i %= max` before. That's pretty cool.
    Edit: it seems this only works in C++ but not in C, Java or JavaScript

    • @Bisqwit
      @Bisqwit 5 years ago +20

      In C++, operator++() returns a reference to the object being modified. This is not the case in C. This has nothing to do with C++17 or about sequence points. If the expression was `i++ %= max`, it would be a different story. `++i %= max` is completely unambiguous in its meaning. The reason it does not work in C is because `++i` returns a non-lvalue copy of the variable in C, not a reference to it. (C does not have references.)
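A small illustration of the expression discussed here, wrapped in a hypothetical helper so the behaviour is easy to check. It assumes C++17, the standard the video targets; the function name is not from the video's code.

```cpp
#include <cassert>

// Advances a rolling index and wraps it into [0, max).
// In C++, ++i yields an lvalue referring to i, so it can be the left
// operand of %=. The effect matches i = (i + 1) % max.
unsigned advance_wrapped(unsigned& i, unsigned max)
{
    assert(max > 0);
    return ++i %= max;
}
```

Starting from 0 with `max` equal to 3, successive calls yield 1, 2, 0, 1, and so on.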

    • @thetastefultoastie6077
      @thetastefultoastie6077 5 years ago

      @@Bisqwit Thanks for the explanation!
      I used an online compiler to quickly try all versions of C++ and indeed it worked in all of them.

    • @Smaxx
      @Smaxx 5 years ago

      @@shaurz I'd just write a tiny inline function with a speaking name instead. ;) Like `incmod(v, m)`

    • @DrSid42
      @DrSid42 5 years ago +5

      @@shaurz It seems weird to you because of a different background. Finnish folk did it like this for centuries.

    • @noneofyourbeeswax3460
      @noneofyourbeeswax3460 4 years ago

      @@DrSid42 I don't think computers have been around for centuries

  • @uxxlabrute
    @uxxlabrute 5 years ago +3

    Earthbound music in the background FeelsgoodMan

  • @vegardertilbake1
    @vegardertilbake1 5 years ago

    Ha! This was so much fun!

  • @davidcuny7002
    @davidcuny7002 4 years ago +1

    The red lines in Praat indicate formants, not the overtones. The vocal cords produce pulses, which have a fundamental frequency (pitch) as well as overtones (multiples of the pitch). The tongue forms a series of "tubes" in the mouth, which causes the pulses to resonate at frequencies determined by the lengths of those various chambers. The resonating frequencies of these "tubes" are formants, and different mouth shapes create different sets of resonating frequencies.

  • @farteryhr
    @farteryhr 4 years ago

    virtual singer Bisqwitoid confirmed (slap
    have you played with UTAU (singing synthesis software) in which it's very easy to make your own voicebank (and get quality high)? looking forward to that soooo much~
    it's just wonderful to find another common interest of you and me.. phonology and speech/singing synthesizing!
    (but yes to get high quality it needs deeper understanding of singing in timing, rhythm, grammar, and much time to fine-tune pitch, volume, breathiness envelopes for songs)

  • @gandolfphoenix1363
    @gandolfphoenix1363 5 years ago +2

    You used the speech synthesizer that you made to give the Tutorial!

    • @Bisqwit
      @Bisqwit 5 years ago +1

      Yes, I used it in the first few seconds of this video.

  • @alexhauptmann298
    @alexhauptmann298 5 years ago +1

    ELI5 explanation for why you can't use a sine wave: the human voice is essentially a subtractive synthesizer. Most commercial music synthesizers can do some form of this. It's the same sort of "buzzer in a tube" model, except the tube is generally way simpler (unless you're Plogue, but that's another story).
    The reason a sine wave can't be used is because subtractive synthesis works by taking away frequencies from a harmonically-rich (i.e. complex waveform) sound. Any given wave can be recreated by an arbitrary number of sine waves, but a sine wave can't be broken down into something simpler. So essentially, a sine wave can't be used because it's not enough data. It mathematically cannot be subtracted from any further.
    This is...more complex than I was intending but oh well lmao

    • @Bisqwit
      @Bisqwit 5 years ago +2

      Good explanation, but definitely not something that works for five-year-olds :)

    • @alexhauptmann298
      @alexhauptmann298 5 years ago +1

      @@Bisqwit Haha, I figured. Is that a QRIO in the thumbnail btw? I wanted one SO BAD as a little kid and was thoroughly impressed with how realistic the synthesized speech sounded. Of course, now I know (from experience, even) that Japanese is a MUCH easier language to synthesize than English.
      Also while watching your video on Finnish phonetics, I found it interesting how it's sort of similar to Japanese (vowels with singular pronunciation, lengthened vowels and consonants). I wonder if that would make it technically easier to synthesize than English (at least, native-speaker English)...at the very least, it would make the plaintext dictionary rules much easier :P

    • @Bisqwit
      @Bisqwit 5 years ago +2

      It’s a Nao, not Qrio. And yes, as a Finnish person who knows the basics of Japanese, I find Japanese much easier and familiar in many aspects compared to English.

  • @Thebasicmaker
    @Thebasicmaker 3 years ago

    I also made a speech synthesizer using the same procedure, but my language was BASIC! The voice was mine too: I pronounced a word and cut out the part that I needed, and the program just had to load the sounds and play them one after the other to read aloud a phrase that I gave to an INPUT instruction.

  • @MrGoatflakes
    @MrGoatflakes 5 years ago +6

    6:34 if you say this five times into a mirror at night you will summon a Bisqwit :P

  • @XTpF4vaQEp
    @XTpF4vaQEp 5 years ago +7

    13:15 accidentally used the whisper effect

  • @ddream296
    @ddream296 5 years ago +1

    whoah nice!

  • @yohvh
    @yohvh 5 years ago +1

    When you find a problem after you played the audio do you just in real time think of a solution and code it right there at that speed?

  • @themcc1879
    @themcc1879 5 years ago +6

    Sample voice frame to C code... the Lisp lover in me says you should have used Lisp, code as data and data as code. Either way this was beyond interesting. I like your accent but to be honest everyone who speaks English has an accent. The voice speaking with an accent was definitely something I wasn't expecting this 月曜日。

  • @smallgoodwoodoodaddy
    @smallgoodwoodoodaddy 5 years ago +2

    I always liked your accent. So I liked it 👍 :D

  • @krank3869
    @krank3869 4 years ago

    I always thought these videos were sped up but then i looked at the clock

  • @counterculturecocks
    @counterculturecocks 5 years ago

    You are amazing.

  • @codeninja1832
    @codeninja1832 5 years ago +1

    This is interesting as a programmer, as someone who's trying to learn another language (old english, dead language sure, but fun), and as someone who asked you how to trill about a month ago haha.
    Still can't trill, but I'm on my way.

    • @Bisqwit
      @Bisqwit 5 years ago

      Thanks for posting!

  • @aprilliac
    @aprilliac 5 years ago

    Rolling index, why didn't I think of that... Thanks for the excellent video. :)

    • @Bisqwit
      @Bisqwit 5 years ago

      Yeah, a rolling index is a bit neater solution than doing a copy-backwards-by-1 loop after each iteration. On the other hand, the rolling index makes SIMD optimizations impossible, so it’s a tradeoff.
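The rolling-index approach mentioned here could look roughly like the sketch below. `RollingHistory` is a hypothetical class written for illustration; the video's code keeps its history in a plain array.

```cpp
#include <array>
#include <cstddef>

// Keeps the last N samples in a ring. push() overwrites the oldest slot
// instead of shifting every element back by one position.
template <std::size_t N>
class RollingHistory
{
    std::array<double, N> data{};
    std::size_t head = 0;  // slot that holds the most recent sample
public:
    void push(double v)
    {
        ++head %= N;
        data[head] = v;
    }
    // ago = 0 is the newest sample, ago = 1 the one before it, and so on.
    double at(std::size_t ago) const
    {
        return data[(head + N - ago % N) % N];
    }
};
```

Every read pays for an index calculation with a modulo, and the logical order of samples no longer matches their order in memory, which is the tradeoff against the copy-backwards loop that the reply points out.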

  • @gero9307
    @gero9307 2 years ago +1

    I created a voicebank CVVC and VCV type for utau, and while watching this video I experienced deja vu)

  • @Embedonix
    @Embedonix 5 years ago +1

    +1 for using 'goto' in your code :)

  • @akj7
    @akj7 5 years ago +1

    Question: At 9:17, why do you have: constexpr unsigned maxOrder? What is the purpose of the constexpr here? Won't the compiler evaluate what maxOrder is without the constexpr? Why haven't you used const?

    • @Bisqwit
      @Bisqwit 5 years ago +3

      For integers, there is not much difference between const and constexpr. I just like to document the intention. The primary target audience of source code is people, after all. When I write “constexpr”, I mean “this should be a compile-time constant, and something probably depends on the fact”. Here, MaxOrder _needs_ to be a compile-time constant, because it is used as an array dimension. When I write “const”, I mean “this is immutable; it should be only read, not written to”. For example, the constant “rate” is not intended to be changed, but I don’t necessarily need it to be a compile-time constant, even though it happens to be.
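The distinction drawn in this reply can be illustrated with a short sketch. The values 32 and 44100 and the names `CoefficientFrame` and `clamp_order` are assumptions made for the example; only `MaxOrder` and `rate` are names mentioned in the comment.

```cpp
#include <array>

constexpr unsigned MaxOrder = 32;  // hypothetical value; must be a compile-time constant
const unsigned rate = 44100;       // hypothetical value; the intent is only "read, not written"

// MaxOrder can serve as an array dimension because it is a constant expression.
using CoefficientFrame = std::array<double, MaxOrder>;
static_assert(CoefficientFrame{}.size() == MaxOrder, "dimension is fixed at compile time");

// `order` below is const (immutable) but is computed at run time, so it
// could not be used as an array dimension or template argument.
unsigned clamp_order(unsigned requested)
{
    const unsigned order = requested < MaxOrder ? requested : MaxOrder;
    return order;
}
```

Writing `constexpr` makes the compiler reject the declaration if the initialiser ever stops being a compile-time constant, which a plain `const` would silently accept.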

  • @MESYETI
    @MESYETI 4 years ago

    Wow!
    I might try to make one, it seems hard though

  • @jfkd2812
    @jfkd2812 5 years ago

    11:01 Hey, it's imgui! Very nice to use

  • @ShotgunLlama
    @ShotgunLlama 2 years ago

    In this video in Praat, he uses LPC (Burg). Would this work with LPC using covariance or autocorrelation?

  • @dgmsstuff
    @dgmsstuff 5 years ago +3

    I'm speechless. No pun intended.

  • @smkyone
    @smkyone 5 years ago +1

    kiitos

  • @zeppy13131
    @zeppy13131 5 years ago +1

    I can't speak for anyone else, but I was glad when this was Finnished.

  • @juniorsilvabroadcast
    @juniorsilvabroadcast 4 years ago

    Bisqwit, can you help me with something? I'm looking for an advanced audio clipper in VST format, some type of clipper that doesn't let any small peaks through on the output. 4x oversampling would help a lot, but even that is badly implemented in the traditional VST audio clippers available on the internet. I have an FM audio processor built from some VST plug-ins available on the internet, and the big issue is the clipper. It lets out small peaks that make the processing difficult, because I need to put an ISP-protected limiter on the end, which makes the sound level drop when high-frequency material is played.

    • @Bisqwit
      @Bisqwit 4 years ago

      I am not sure what exactly it is you want. It kind of sounds like you want a soft limiter, though. I don’t particularly have experience about VST plugins, aside from trying to install them for use in Audacity, at some point, getting it working, and then at some later point, noticing that the plugins are no longer there and being too indifferent to study further why.

  • @arcnorj
    @arcnorj 5 years ago

    Can you explain just a bit what you did to generate the LPC sample from David Woods? I guess manually editing the pitch curve with Praat?

    • @Bisqwit
      @Bisqwit 5 years ago +1

      I dumped the soundtrack of the video into a wav file using MPlayer. Then I opened the soundtrack in Audacity, and cropped it into just those three seconds or so, saved it into a new wav file. (Or maybe I dumped only three seconds from the soundtrack in the first place, using -ss and -endpos options. I don’t remember.) Then I opened the wav file in Praat, and did nothing else but synthesized the LPC from it (Analyze spectrum → To LPC (burg) → Save).

  • @fisu51
    @fisu51 5 years ago +2

    Kyllä

  • @Darksoulmaster
    @Darksoulmaster 5 years ago +5

    Wow, i dont know what are you even talking about, but its cool.

    • @Bisqwit
      @Bisqwit 5 years ago +10

      Speech synthesis

  • @HerrRussoTragik
    @HerrRussoTragik 5 years ago

    Ohhh in the past I've made a pseudo "TTS" using the winmm from windows.h and PlaySound function...

  • @JoLiKMC
    @JoLiKMC 5 years ago +2

    I, for one, welcome our new, Finnish robot overlords. _Hail Roboisqwit!_
    Seriously, though, this is neat-as-hell. It's also kind of… heartbreaking, in a way. I never considered how speech synthesis works, and now that I know? The magic… is gone. :(

  • @Bleenderhead
    @Bleenderhead 5 years ago +2

    I want to hear it sing Space Oddity.

  • @daneru
    @daneru 5 years ago

    Music: th-cam.com/video/Url3QHHNKSA/w-d-xo.html

  • @sebudrsappu6098
    @sebudrsappu6098 3 years ago

    Earthbound music in Background:)

  • @firemaniac10010
    @firemaniac10010 5 years ago +4

    I'm guessing the "buzz" can't be a pure sine wave because a pure sine wave has no harmonics; it's a pure tone. In other words, there's nothing to filter out except for one single frequency.

  • @generic_programmer
    @generic_programmer 5 years ago

    I like this

  • @ivanbogdasaebersold4690
    @ivanbogdasaebersold4690 5 years ago +1

    This will be my COVAS in Elite Dangerous...

  • @skilz8098
    @skilz8098 5 years ago

    I'm wondering if the technology that is used to transfer data from Vinyl Record Albums into mp3 files would be of any assistance... Then just filter out the background music until you have pure voice. Then you can have a singing speech synthesizer.

    • @Bisqwit
      @Bisqwit 5 years ago

      As for the first sentence, I fail to see the relevance. As for the second sentence, what kind of solutions do you have for “filtering out background music”? Even on YouTube* it depends on correctly identifying the original recording (with or without lyrics) leaving only the added commentary and sound effects, and even then the resulting audio sounds quite hollow.
      *) YouTube has a tool that allows video creators to remove a song that infringes copyright, when YouTube has first identified the infringement using ContentID. Often it results in simply muting that region of the video, but sometimes it successfully removes the song leaving only commentary.

  • @JakubSkowron
    @JakubSkowron 5 years ago +2

    8:00 Buzzer cannot be pure sine, because then the filtering of the frequencies would make no sense - there would be only one frequency in buzzer to start with. Buzzer needs to have rich frequency spectrum, but at the same time it needs to be harmonic (i.e. all frequencies are natural multiples of some base frequency = there is a defined pitch).
    You could use any function in form A*sin(x) + B*sin(2x) + C*sin(3x) +..., but of course the easiest way to produce signal like that is to use 1) square wave, 2) sawtooth wave (as you did), 3) triangle wave, 4) exp(sin(x)), etc.
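The sum A*sin(x) + B*sin(2x) + C*sin(3x) + ... from this comment can be generated directly. The sketch below assumes amplitudes of 1/k for the k-th harmonic, which is the Fourier series of a falling sawtooth, (π − x)/2 on 0 < x < 2π; the function name and parameters are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Additive buzz: sum over k = 1..numHarmonics of sin(k * x) / k,
// where x advances by 2*pi*pitchHz/sampleRate per sample.
// The caller should keep numHarmonics * pitchHz below sampleRate / 2
// to avoid aliasing (not checked in this sketch).
std::vector<double> harmonic_buzz(double pitchHz, double sampleRate,
                                  std::size_t numHarmonics, std::size_t numSamples)
{
    assert(pitchHz > 0.0 && sampleRate > 0.0);
    const double pi = std::acos(-1.0);
    std::vector<double> out(numSamples, 0.0);
    for (std::size_t n = 0; n < numSamples; ++n)
    {
        const double x = 2.0 * pi * pitchHz * static_cast<double>(n) / sampleRate;
        for (std::size_t k = 1; k <= numHarmonics; ++k)
            out[n] += std::sin(static_cast<double>(k) * x) / static_cast<double>(k);
    }
    return out;
}
```

With `numHarmonics` set to 1 this degenerates into the pure sine wave that the comment argues against; raising it adds the overtones that the LPC filter can then shape.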

    • @Bisqwit
      @Bisqwit 5 years ago

      Good explanation, but not an ELI5. I had trouble explaining it in layman terms without invoking mathematics and frequency spectrums... That's why I wrote the annotation.

    • @JakubSkowron
      @JakubSkowron 5 years ago +1

      @@Bisqwit Vocal cords are the buzzers. Air goes through a buzzer, then through a tube (the vocal tract), which amplifies some frequencies (formants) and dampens others. If the buzzer sound were just one sine wave, the tube would only make it louder or quieter, nothing more. The tube cannot create new frequencies; it acts as a filter only. So the aim of the buzzer is to generate many frequencies, so that the tube (vocal tract) has something to choose from. White noise (during whispering) has all frequencies, so it is OK. Pitched sound is also OK, since it has many sine waves in it, as long as its base frequency is not too high (bass singing is easier to understand than soprano singing!). A high pitch has fewer sine waves in the formant frequency range (~300-3000 Hz). Try changing VoicePitch to ~1046 Hz (soprano's high C), and you won't be able to distinguish the vowel o from u from a, or e from i.
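The point about high pitches can be checked with simple arithmetic: count how many multiples of the pitch land inside the quoted 300-3000 Hz range. `harmonics_in_band` is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Counts the harmonics k * pitchHz (k = 1, 2, ...) that fall inside [loHz, hiHz].
int harmonics_in_band(double pitchHz, double loHz, double hiHz)
{
    assert(pitchHz > 0.0 && loHz <= hiHz);
    const int first = static_cast<int>(std::ceil(loHz / pitchHz));
    const int last = static_cast<int>(std::floor(hiHz / pitchHz));
    const int lowest = first < 1 ? 1 : first;
    return last >= lowest ? last - lowest + 1 : 0;
}
```

At 1046 Hz only two harmonics (1046 Hz and 2092 Hz) fall in the range, while a 100 Hz voice places 28 harmonics there, so the low voice samples the formant shape far more densely.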

  • @adraxcz
    @adraxcz 5 years ago +1

    Hey Bisqwit!
    May I ask which editing software you use? Thanks!

    • @Bisqwit
      @Bisqwit  5 years ago +1

      For which type of content?

  • @AlexVasiluta
    @AlexVasiluta 5 years ago +1

    Nice

  • @Sturmtreiben
    @Sturmtreiben 5 years ago +1

    Which graphics software do you use for creating pictures like the one at 3:00? They somehow look really good.

    • @Bisqwit
      @Bisqwit  5 years ago +1

      Thanks. I use LibreOffice Impress. I also do some postprocessing in kdenlive; basically all _animations_ are done in the video editor.

    • @Sturmtreiben
      @Sturmtreiben 4 years ago

      Thanks, Joel!

  • @mrswassduck7544
    @mrswassduck7544 5 years ago

    Dope

  • @Smaxx
    @Smaxx 5 years ago +2

    Your failing lowercase conversion for umlauts is a pretty nasty trap I fell into in the past as well. It looks like you're doing everything correctly and it should work, yet it somehow doesn't. Unfortunately, it's not as easy as imbuing/passing the correct locale. That might work, but it's not guaranteed.
    Even when using a UTF-8 locale, the conversion might just walk char by char and, for whatever reason, ignore UTF-8 sequences… So far it has always worked for me when using the wide character version instead (i.e. `wchar_t`, or possibly `uint32_t`, over `char`), although I've heard even that fails for some. I guess that's not totally unexpected, given that I've heard talk about dropping the `codecvt` header from the standard…
    So in your case I'd just do the `std::u32string` conversion first, then unify character casing after that.
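A minimal sketch of the "convert first, then unify casing" idea, assuming the text is already a `std::u32string`. The names `lower_char` and `lower_string` are made up for illustration. The mapping is locale-independent and covers only ASCII and the Latin-1 capitals (enough for ä, ö and å), so it is not a general Unicode case folding.

```cpp
#include <string>

// Hypothetical helper: lower-cases a single code point without consulting any
// locale. Handles only A-Z and the Latin-1 capitals U+00C0..U+00DE, skipping
// U+00D7 (the multiplication sign); every other code point is returned as-is.
constexpr char32_t lower_char(char32_t c)
{
    if (c >= U'A' && c <= U'Z')              return static_cast<char32_t>(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return static_cast<char32_t>(c + 0x20);
    return c;
}

// Applies lower_char to every code point of an already-decoded string.
std::u32string lower_string(std::u32string s)
{
    for (char32_t& c : s) c = lower_char(c);
    return s;
}
```

Because each element of a `std::u32string` is a whole code point, walking it element by element cannot split a multi-byte sequence, which is the failure described above for `char`-based conversion.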

  • @coolbrotherf127
    @coolbrotherf127 5 years ago

    Did you know this stuff before starting the project, or did you learn it as you went? I've wanted to start projects like this, but I get overwhelmed by everything I'd have to learn to finish them.

    • @Bisqwit
      @Bisqwit  5 years ago

      I studied it while making this project. Reading example code, reading articles that describe how LPC works, exploring outputs, trial and error until I got the first LPC-to-WAV converter working. Some basic principles I had already learned years ago don’t-remember-where. And of course, the principles of phoneme-based speech synthesis were already familiar to me since the 1990s when I studied how Dr. Sbaitso works.

  • @siddharthkalantri5076
    @siddharthkalantri5076 5 years ago

    Thank you. I always use the DECtalk synth with my screen reader, and I am always searching for one for the Hindi language. Perhaps one day I will get my wish; looking at your Finnish voice synth, it seems possible.

  • @ruadeil_zabelin
    @ruadeil_zabelin 5 years ago

    Note that std::wstring_convert is deprecated in C++17, so if you want to be standard conforming, you should replace it with something else.

    • @Bisqwit
      @Bisqwit  5 years ago +1

      Noted. I used it for 1) its brevity and 2) because I couldn’t figure out a concise replacement that is not deprecated.

    • @ruadeil_zabelin
      @ruadeil_zabelin 5 years ago

      @Bisqwit Unfortunately there isn't a standard way anymore. The standards committee has said that they're working on a replacement, but will only re-add it if it's fully compliant with the Unicode standards (apparently this one didn't work in all cases). The only way seems to be to implement it fully yourself (UTF-8 decoding isn't very hard, luckily), or to use a library like iconv or libicu.
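As a rough idea of what "implement it yourself" could look like, here is a hand-written UTF-8 decoder that avoids `std::wstring_convert`. It is a sketch, not the code used in the video. The name `utf8_to_u32` is made up, and replacing malformed input with U+FFFD is just one possible error policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical replacement for a wstring_convert-based conversion: decodes
// UTF-8 bytes into code points. Malformed, overlong, surrogate or out-of-range
// sequences are replaced with U+FFFD (an assumed policy, chosen for simplicity).
std::u32string utf8_to_u32(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); )
    {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0, min_cp = 0;   // min_cp rejects overlong encodings
        int extra = 0;                 // number of continuation bytes expected
        if      (lead < 0x80)           { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else { out.push_back(0xFFFD); continue; }   // stray continuation or invalid lead

        bool ok = true;
        for (int k = 0; k < extra; ++k)
        {
            // A missing or non-continuation byte ends the sequence early; that
            // byte is not consumed, so it is examined again as a new lead byte.
            if (i >= in.size() || (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80)
            { ok = false; break; }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (!ok || cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        out.push_back(cp);
    }
    return out;
}
```

For well-formed input such as the bytes `C3 A4` this yields the single code point U+00E4 (ä). A library such as iconv or ICU is still the safer choice when normalization or other encodings are needed.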

  • @victorprokop2240
    @victorprokop2240 3 years ago

    3:16 Mongolian throat singing!!! lmao

    • @Bisqwit
      @Bisqwit  3 years ago

      I’m not sure if you are mocking, but the principle is actually similar. The purpose is to enunciate different subtones while keeping the primary tone unchanged.

  • @jamescumbria4499
    @jamescumbria4499 5 years ago +1

    Are you going to make this speech synthesizer a TTS voice for Windows?

    • @Bisqwit
      @Bisqwit  5 years ago

      I don’t deal with Windows.