Some people are born to teach, and you are one of them. It feels like finishing a semester course on signal processing in a couple of minutes.
Thank you!
I can't tell you how helpful this series has been in my current project applying CNN\LSTM to audio data. You've filled in all the gaps that are hard to find information about, as so many applications of CNN\LSTM are to image classification, not audio. Thank you!
You made the concepts of frames and the need for overlapping clear, and resolved multiple doubts I had.
I am so HAPPY I've found your video. No book could have explained this better, probably not even many professors or tutors. I'm gonna check your other videos!
I am doing my research in Signal processing along with ML. Thank you for clearly explaining the concepts
Thank you so much for this series! I love how you hyperlinked the series by making references back to older videos to refresh material. Extremely useful.
Just a small note: I was reimplementing the Hann window in NumPy for the STFT and noticed that k should start at 0 and go to K-1.
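To back up that note, here is a minimal NumPy sketch of the symmetric Hann window with k running from 0 to K-1 (the function name is mine, purely for illustration):

```python
import numpy as np

def hann_window(K):
    """Symmetric Hann window: w[k] = 0.5 * (1 - cos(2*pi*k / (K - 1))),
    with k running from 0 to K-1, as noted above."""
    k = np.arange(K)  # k = 0, 1, ..., K-1
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * k / (K - 1)))

w = hann_window(9)
# Endpoints taper to zero; the center sample reaches 1
print(w[0], w[-1], w[4])
# Matches NumPy's built-in symmetric Hann window
print(np.allclose(w, np.hanning(9)))  # True
```

NumPy's `np.hanning` uses this same K-1 denominator (the symmetric form), which is why the endpoints land exactly on zero.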
Thanks! Great explanation of windowing and overlapping frames. I had to rewatch it a couple of times to understand how overlapping frames solve the windowing problem, but it was so much easier and more fun than reading a book!
Great series! You give a good overview of the process, the problems which we face and solutions while keeping the video duration short and optimum so one can keep watching the series. If one is curious enough, they can also read up more on the concepts as well. Great work, thank you! :)
Thank you!
Fantastic description. Loved the clarity of the concepts.
I just started yesterday and I am loving it, as it is helping me a great deal with understanding the audio concepts for R&D on my ongoing work.
I just love these videos. They clearly show the theoretical part and simplify it into language that non-specialists (laymen) understand. I am by no means an audio engineer, though my field overlaps with sound/audio engineering: I work with machinery vibrations, which uses basically every tool that you mentioned in this series. I am currently working on my thesis and have already used some of these functions, along with the wavelet packet transform (which makes my work so easy to do, and it took me quite some time to get the hang of it). Great work, may your work help more people like myself for eternity (or till climate change takes us all out)!
Thank you again, and God bless!
A perfect explanation, which I could not have found in months of searching blog posts, etc. Thanks!
Thank you!
Such a great series!
You are an awesome teacher Velardo
I appreciate you sooo much :)
You're an amazing teacher, thanks for the videos! They are really helpful to me!
Thank you very much for these videos.
It is very generous of you to make them.
I think there is another aspect of spectral leakage: if the signal harmonics do not fall exactly "on top of" the analysis frequencies, then the energy of the harmonics gets spread out to ALL the analysis frequencies, with most of the energy leaking to the frequencies around the closest one.
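A quick NumPy sketch of what that comment describes — a sinusoid landing exactly on an FFT bin stays in one bin, while one landing between bins spreads into many (the values and helper name are just illustrative):

```python
import numpy as np

N = 1024          # samples per frame
sr = 1024         # assumed sample rate (Hz) -> bin spacing of exactly 1 Hz
t = np.arange(N) / sr

on_bin = np.sin(2 * np.pi * 100.0 * t)    # integer number of periods per frame
off_bin = np.sin(2 * np.pi * 100.5 * t)   # frequency falls between two bins

def nonzero_bins(x, tol=1e-6):
    """Count FFT bins whose magnitude exceeds tol relative to the peak."""
    mag = np.abs(np.fft.rfft(x))
    return int(np.sum(mag / mag.max() > tol))

print(nonzero_bins(on_bin))   # energy concentrated in a single bin
print(nonzero_bins(off_bin))  # energy leaks into many surrounding bins
```

The leakage is strongest near the true frequency and falls off with distance, exactly as described above.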
That's exactly what I was looking for, you made my day thanks☺
Amazing intuition about why windowing mitigates spectral leakage. Great series of videos!
Excellent explanation, solved all my problems. Thank you, sir!
Overlapping, and the signal loss caused by Hann windows, were well explained. Thanks!
Thanks!
I like your work. I have worked with DL and images for quite some time. Your series showed me all I need to know about sound. Thanks, and keep doing this!
Glad you like this!
This is an awesome explanation. Now I understand why you can't use a digital filter that's just rectangular.
I can feel this is highly inspired by Müller's book about digital signal processing :D I used it for my thesis and I recognized a lot of the material — it felt like I'd heard (read) this somewhere before :D But it's described very well there, and you explained/summarized it perfectly ;)
That is a great book, that I love to refer people to for more info/details.
Best teacher ever 🙏🙏
Very nice!! Thanks a lot!! Just an observation: in the middle of the video, I was surprised by a Slack ring, hahaha... I couldn't find any Slack notification on my phone, and it took me a while to discover that the sound came from the video. That's all. Minute 12:42.
Thank you for the awesome explanation. Very clear and understandable! :)
Thanks!
OMG. You and your channel are unique. Amazing. Thanks
Thank you very much, Valerio!
If you have overlapping frames, and given that you apply the Hann window function, wouldn't you create some form of amplitude modulation that is not present in the original signal?
Enjoyed as usual.... and waiting for the next
I really enjoyed your explanations
Incredibly helpful! It would be great if you could post a video of the code for framing and windowing an audio file, with an explanation.
If overlapping frames are introduced to compensate for the loss of signal caused by windowing in the frequency domain, why do we have overlapping frames in the time domain?
Does the discontinuity (spectral leakage) happen only at the start or end of the frame? Is it only relevant for the frequency domain?
You teach so well.
I have a question. I'm going to detect phonological features of African American English in some interviews, and I want to make the detection process automated. What do you recommend?
Big thumbs up, solid video brother!
How do you choose the overlap length? If it were more than required, wouldn't part of the signal repeat?
If I understood correctly, it doesn't matter, as we are extracting features from the STFT of each frame. Is that right?
Also, your videos are really good.
Super helpful, thank you for the great work.
Thanks!
11:56 I'm kinda lost. What do you mean by "not an integer number of periods"?
Hi sir, we apply the overlapping because of the spectral leakage that occurs when we apply the FFT for the frequency domain. But in the time domain, this is not the case.
Then why do we still apply the overlapping in the time domain? Thank you, sir.
Highly comprehensive and informative series!
I had a question : How can we directly input Low Level Descriptors to Keras models?
You can check out my "DL for Audio with Python" series for that. In this series, I don't use Keras, and focus only on audio features.
According to the sampling theorem, a signal must be sampled at least twice as fast as the highest frequency component in the signal, but in practice what are the recommended sampling frequencies?
Wow! This is so good. Wish I had seen this before. Merci! :)
Thanks!
Thank you for this awesome series. Please, what is the difference between frame size and window size? These two terms confuse me a lot.
Thank you! I hope I will get a better rank with this information :)
This is hands down the best tutorial series!
Where did you learn so much?
Could you recommend some books or websites?
Thank you!
Thank you — though I didn't understand at first and read other articles as well to learn it, you helped me a lot. I never realized there was leakage in the FFT before. LOL
Very good, thanks for this video!
Great content! With windowing (and overlapping) you can minimize the "lost period", but it does not seem to solve the problem of losing signal. To me, by applying the Hann function, the edges of the original signal are lost anyway.
So I am not fully sure what we have truly achieved with overlapping, except minimizing the useless/lost parts of the signal? The frequency domain will still be missing bars for those overlapped portions, won't it?
I assume spectral leakage simply means a "click/pop" from clipping when the sample is cut off abruptly?
I'm a bit confused about the denominator in the Hann windowing function. Shouldn't it just be K, i.e. the frame size, or number of samples in a frame, instead of K-1?
Thank you for your explanation! Sorry, but may I ask which book you are referencing for this video?
So if I'm understanding correctly, the main purpose of windowing is to deal with spectral leakage. But it sounds like spectral leakage only occurs at the beginning and end of the entire signal. So why bother windowing every frame? It seems like we could save a lot of time and effort by just clipping the erroneous bits off the ends of the signal. I'm guessing there's a reason, but what is it?
As per my understanding, your suggestion will work only for static waveform. But in order to calculate Spectrogram (STFT) we need to calculate FFT at every frame and collectively we will get a nice spectrogram to work with.
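A rough NumPy sketch of that idea — window every frame, take its FFT, and stack the results into a spectrogram (parameter values and names here are just illustrative, not librosa's internals):

```python
import numpy as np

def naive_stft(signal, frame_size=1024, hop_length=512):
    """Apply a Hann window and an FFT to every frame, then stack
    the per-frame spectra into a (naive) spectrogram."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop_length):
        frame = signal[start:start + frame_size] * window
        frames.append(np.fft.rfft(frame))
    return np.array(frames).T  # shape: (frame_size // 2 + 1, num_frames)

# 1 second of a 440 Hz sine at a 22050 Hz sample rate
signal = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
spec = naive_stft(signal)
print(spec.shape)  # (513, 42)
```

This is why windowing happens on every frame: each frame's FFT suffers leakage at its own edges, not just at the ends of the whole signal.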
Around 19:39 you explain what hop length and frame size are.
But confusingly, the librosa package uses different parameter names, including a new one, n_fft.
Could you clear up what n_fft means and how they differ?
Great video mate, I've seen librosa use 2048 for frame size and 512 for hop size as the default for many of their functions (e.g. melspectrogram). Any recommendation on what to use as a general reference? I'm not sure how librosa does its windowing, though.
A usual ratio between frame size and hop length is 2:1. Beyond this it's difficult to provide a general rule. For some problems you want higher temporal resolution, hence you should use a hop size of 512. For others, 768 or 1024 is totally fine. You'll have to treat the hop length as a hyperparameter that needs to be optimised empirically during training.
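To make the trade-off concrete, here's a small sketch of how frame size and hop length determine the number of frames (parameter names mirror librosa's, but nothing here calls librosa, and the no-padding formula is a simplification):

```python
def num_frames(n_samples, frame_size, hop_length):
    """Number of full frames that fit in the signal (no padding)."""
    return 1 + (n_samples - frame_size) // hop_length

sr = 22050
n_samples = sr * 2  # a 2-second signal

print(num_frames(n_samples, 2048, 1024))  # 2:1 ratio -> 42 frames
print(num_frames(n_samples, 2048, 512))   # smaller hop -> 83 frames,
                                          # i.e. higher temporal resolution
```

Halving the hop length roughly doubles the number of frames, which is exactly the temporal-resolution knob described above.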
@ValerioVelardoTheSoundofAI Thanks! I've noticed that it varies indeed, so I'll keep playing around with it.
Very good stuff. Thank you.
Glad you like it!
Hello, my friend, I didn't get the point and I want to ask you.
Let's say I have extracted 30 overlapping windows from the audio signal. While extracting features, should I treat these as 30 different windows with the same target, or as 30 windows = 1 feature x 30 times = 30 features for my audio?
Which do you think is correct?
as usual, amazing stuff.
Hello sir, can you help me with some links on windowing for picking up speech or audio from a microphone for preprocessing? And can you make a tutorial on this soon, please?
Thanks.
That's something I'd like to cover in the future. Stay tuned :)
@@ValerioVelardoTheSoundofAI OK, I am your student. Thanks!
Great content delivered! Keep going !!!!!
Thanks Misha!
@@ValerioVelardoTheSoundofAI Looking forward to more videos.
Thanks for your help, I can understand audio analysis better now.
@12:00: What exactly does it mean that the "processed signal isn't an integer number of periods"? I get that if you loop the sound frame shown on the slide, it does not result in a continuous wave — it jumps from around -3 amplitude directly to around +4, which introduces high frequencies. But I can't really get what is meant by "processed signal isn't an integer number of periods". Can someone help please? :)
@Andrei Thanks, Andrei. I realised what confused me is that I didn't know a period was an actual feature of the sound wave. I thought it was more abstract than that.
At 3:51, could you explain how you got this number again? I tried to do the calculation, but I'm not getting the same thing. Is this because of A/D conversion? Is that single-sample number based on a sample-and-hold technique, or is it just the normal duration of one sample within a one-second stretch?
1/44100Hz = 0.0000227s = 0.0227ms
Hi, please, what is the difference between short-term and mid-term feature extraction? Is mid-term feature extraction like averaging the short-term frames?
17:56 was the aha moment for me, loved it. Thanks for making the video!
Is there any videos where you show how to extract and measure silent pauses with Librosa or another Python library? I would appreciate your help. Thanks.
I don't have such a video. But I would use a feature like RMSE for that.
Hello Valerio. Thank you for another great video. I would like to ask for your advice for my project and already emailed you about it. Could you take some time and give your thoughts on it please? Would really appreciate it. Thank you
Hi, great video. Are time-domain and frequency-domain audio features extracted at the same time, or one after the other?
Usually, you'll extract them at different times. In audio processing libraries like Librosa, there are different functions for different audio features, that you can apply sequentially. Or, perhaps, you may just want to extract one!
@ValerioVelardoTheSoundofAI Thank you.
Valerio, I'm using your code on my github python notebook to process some machine sounds and I want to give credit where credit is due so I'm going to mention you, the book you're working on, and the YT series and a link to the YT series. Is that good enough for you? Would you like me to add anything else? I'll make a boilerplate that I will put at the top of the code.
That is very kind of you :) It's perfect!
Can you please explain spectral leakage in layman's terms here?
Can you tell me the whole feature extraction pipeline for audio data? I want to make a feature extraction class which takes an audio signal. Then what do I have to do?
I covered audio preprocessing pipelines in this video: th-cam.com/video/O04v3cgHNeM/w-d-xo.html
Thank you for making this video with the basic information...
Hello, my name is Lucas. I'm Brazilian and I'm trying to make an algorithm that differentiates one noise from another.
For example: depending on the sound the rain makes as it hits the ground, can I determine how much water is falling? Things like that, always involving noise.
Is anyone familiar with this who can help me with some directions?
Thanks
But I noticed an error in a slide title: first you titled them as time domain (2:49 example), then at the end you titled them frequency domain (20:41).
Thank you for pointing that out!
@@ValerioVelardoTheSoundofAI I re-watched the video; this is not a mistake.
Super!
Thanks!
PS: I think you need nerdy music in the intro. Mozart, I think :)
Can you please tell me how to extract features from an audio signal using MATLAB?
I don't use MATLAB; I prefer to work in Python.
This is all very cute, but how do we actually implement these things? It seems like everybody talks about them, but nobody shows it.
perfect!
please make a video list
This video is already part of a playlist.
Spectral leakage. Spectral contamination is more like it.
Hello sir, SaiGeeta here from India. Sir, please suggest a good laptop specification for text-to-speech synthesis using deep learning algorithms. Please!
Most of us are Indian, and I am too.