Speech features intro 3: Mel-scale spectrogram

  • Published on 7 Nov 2024

Comments • 23

  • @na50r24
    @na50r24 1 day ago

    17:35
    What confuses me about this is: can we use the comparison to figure out whether the same word is in both signals, or whether both signals came from the same speaker? (IIRC the algorithm used for this is called DTW, which is very similar to the edit distance algorithm.)
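
For reference on the DTW mention above, here is a minimal sketch of dynamic time warping in plain Python. It assumes each signal has already been turned into a sequence of feature vectors (for example, one log-mel vector per frame) and uses Euclidean distance between frames; the function name `dtw_distance` and the toy sequences are made up for illustration. It only measures how cheaply two sequences can be aligned. Whether a low cost points to the same word or the same speaker depends on the features and the setup, which this sketch does not address.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Return the DTW alignment cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] holds the cheapest alignment of the first i frames of seq_a
    # with the first j frames of seq_b (same table layout as edit distance).
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # seq_b frame j-1 is reused for another seq_a frame
                cost[i][j - 1],      # seq_a frame i-1 is reused for another seq_b frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Toy one-dimensional "features": the second sequence is a slowed-down copy.
word = [(1.0,), (2.0,), (3.0,)]
word_slow = [(1.0,), (2.0,), (2.0,), (3.0,)]
other = [(2.0,), (3.0,), (4.0,)]

print(dtw_distance(word, word_slow))  # time-stretched copy aligns at no cost
print(dtw_distance(word, other))
```

The recurrence mirrors edit distance: instead of counting insertions, deletions and substitutions, each cell adds a frame-to-frame distance to the cheapest of the three neighbouring cells.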

  • @doranoon10
    @doranoon10 1 year ago +1

    Hey Herman!! Just wanted to say thanks for your videos! They've helped me a bunch in my dissertation section on timbre similarity analysis, and they're clear enough that I, a musician, can understand them!

  • @OussemaGuerriche
    @OussemaGuerriche 4 months ago

    Your way of explaining is very good.

  • @santiagoguisasola1834
    @santiagoguisasola1834 11 months ago

    Really amazing set of videos, thank you Herman! You have a great presentation style.
    Could another way to think about the Mel scale involve harmonics? Since each frequency, when doubled (or halved), is the same underlying note (e.g. A is 440 Hz, but it is also 220 Hz and 880 Hz), the space between the same note in adjacent octaves gets bigger and bigger as frequency increases. For example, a low A is 27.5 Hz. If we double it, we get the next A at 55 Hz. The difference is only 27.5 Hz. Going higher, we have an A at 3520 Hz. The next A is all the way up at 7040 Hz. If we add only 27.5 Hz to 3520 Hz, we go up to only 3547.5 Hz, which isn't even A#!!!! (which is at 3729.310 Hz). So the Mel scale adjusts for the growing space between the same notes as frequency goes up.
    If so, I wonder why the Mel scale isn't rooted in harmonics and equal temperament (instead of experimental data).
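
The octave arithmetic above can be checked against a mel formula. The sketch below assumes one common analytic variant, m = 2595 log10(1 + f/700); the video may use a different one, and the helper name `hz_to_mel` is just for illustration. It prints the size of each octave from A at 27.5 Hz to A at 7040 Hz in both Hz and mel.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using one common formula (an assumption, other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


# The note A in nine successive octaves: 27.5 Hz, 55 Hz, ..., 3520 Hz, 7040 Hz.
notes_a = [27.5 * 2 ** k for k in range(9)]
mels = [hz_to_mel(f) for f in notes_a]

# Size of each octave step on the mel axis.
mel_steps = [hi - lo for lo, hi in zip(mels, mels[1:])]

for f_lo, f_hi, step in zip(notes_a, notes_a[1:], mel_steps):
    print(f"{f_lo:7.1f} -> {f_hi:7.1f} Hz: +{f_hi - f_lo:7.1f} Hz, +{step:6.1f} mel")
```

Under this formula an octave is not a fixed number of mels. The scale is close to linear at low frequencies and close to logarithmic at high frequencies, so the octave steps keep growing on the mel axis and only level off near the top of the range. A purely equal-tempered scale would give every octave the same step size.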

  • @alfredoalarconyanez4896
    @alfredoalarconyanez4896 3 years ago +1

    Thank you very much for this awesome video, very well explained!

    • @kamperh
      @kamperh 3 years ago

      Very happy you enjoyed it! :)

  • @Kotpaz
    @Kotpaz 1 year ago

    You are awesome! Thank you so much, you were extremely helpful in my project.

  • @nedzadhadziosmanovic3785
    @nedzadhadziosmanovic3785 3 years ago +4

    I simply cannot believe that you have so few views and likes. To be honest, your video on this topic is the best there is on the internet. I hope you make videos as a side thing and make a lot of money in the meantime, because man, you know your stuff. All the best and cheers :D

    • @kamperh
      @kamperh 3 years ago +1

      Very happy you found this so helpful! It's very encouraging! :)

  • @entertain8768
    @entertain8768 2 years ago

    @15:28 the shape of log_mel_spec is (40, 161), but the plot of it doesn't seem to have the same dimensions. Why?

  • @waisyousofi9139
    @waisyousofi9139 2 years ago

    Thanks Herman!
    Can you share the GitHub link for this playlist's code?

  • @eastchun2635
    @eastchun2635 2 years ago

    Where can I download your example audios (siren.wav, dress_start.wav and where_were_you.wav)?

  • @mohamadhamoudy8232
    @mohamadhamoudy8232 3 years ago

    Dear Professor Herman, could you please post some videos on wavelets and scalograms in speech signal processing? Thanks.

    • @kamperh
      @kamperh 3 years ago +1

      I wish I had more time, Mohamad!!

  • @emrekulkul4784
    @emrekulkul4784 1 year ago

    Hey man, hope you're doing good :) I have one part that still eludes me: when we obtain the vectors of each STFT frame, what exactly are the values inside the vectors? I don't understand what people mean by "features". What type of features do these values represent? Also, why are the filters shaped like a triangle? What is the reasoning behind that? Thanks a lot in advance, luv ur channel :)

    • @kamperh
      @kamperh 1 year ago +1

      Thanks for the good questions!
      The features inside each STFT window are typically a modification of values coming from a discrete Fourier transform. Without getting into all the details here, you can think of the first value in this vector as telling you something about the lowest frequency content in that little snippet of audio; the last value in the vector, at the highest dimension, tells you something about the highest frequency content.
      Your question about the triangular window is also good, especially since I don't actually know the answer! There are some good reasons that you want some tapering off of that window on the sides, but I don't know why we specifically use a triangular window. It might be that this window is one of the easiest tapered windows to actually implement. Hope that helps!
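
To make the two points in this reply concrete, here is a rough sketch of how one frame's spectrum is usually reduced to mel features with triangular filters. It is not the code from the video: the function names, the formula m = 2595 log10(1 + f/700), and the settings (40 filters, 512-point FFT, 16 kHz audio) are all assumptions for illustration, and it is written in plain Python so it runs without extra libraries.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Return n_filters triangular filters, each a list of n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    # n_filters + 2 edge frequencies, equally spaced in mel from 0 Hz to Nyquist.
    mel_max = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    # Centre frequency of every DFT bin in Hz.
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for i in range(n_filters):
        left, centre, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)      # 0 at left edge, 1 at centre
            falling = (right - f) / (right - centre)   # 1 at centre, 0 at right edge
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filterbank(bank, power_spectrum):
    """Weighted sum of the power spectrum under each triangle: one value per filter."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


bank = mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000)
flat_spectrum = [1.0] * (512 // 2 + 1)   # stand-in for one frame's power spectrum
log_mel = [math.log(e) for e in apply_filterbank(bank, flat_spectrum)]
print(len(log_mel))  # one feature per filter for this frame
```

Each output value summarises the energy in one frequency band, with the first filter covering the lowest band and the last the highest. The triangles overlap, so a bin near the edge of one filter contributes mostly to its neighbour, which is one way to get the tapering mentioned in the reply.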

  • @entertain8768
    @entertain8768 2 years ago

    Great explanation. Please share the notebook

  • @SH-ee2hs
    @SH-ee2hs 3 years ago

    Hi sir, can you also add the speech topics linear predictive coding, GMM, and EM?

    • @kamperh
      @kamperh 3 years ago +1

      I wish I had 60 hours every day to just make videos... :( But hopefully in the future!!

    • @SH-ee2hs
      @SH-ee2hs 3 years ago

      @kamperh Can I have your email contact?

    • @kamperh
      @kamperh 3 years ago

      @SH-ee2hs I don't want to post it here on YouTube, but you should be able to find it from my home page. Hope that helps!

  • @kumar707ful
    @kumar707ful 2 years ago

    Where can I find the Python code?

  • @mrstanton81
    @mrstanton81 10 months ago

    Note.