All the videos in this series are very helpful, well made and well explained, thank you so much!
Such a comprehensive yet easy to understand series, hats off
The best explanation on the whole internet! Thanks, man.
Thank you for making these amazing videos and putting in so much effort. This series has cleared up my doubts better than any other signal processing and audio processing videos. I have one question: in the plot at 12:48, you compute the spectrogram from the square of the amplitude; if I plot without squaring the amplitude, I can still visualize the frequencies. In the previous videos in the series, we computed the Fourier transform from the amplitude (when computing the full Fourier transform).
The amplitude squared gives some sense of the energy in the wave, which is a better representation, since we perceive waves through energy. It also lets us take a logarithm of that energy to get a decibel scale, which matches how we perceive loudness.
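As a sketch of that pipeline in librosa (the file path is a placeholder):

```python
import numpy as np
import librosa

signal, sr = librosa.load("scale.wav")                # placeholder path
S = librosa.stft(signal, n_fft=2048, hop_length=512)  # complex STFT

power = np.abs(S) ** 2                 # squared magnitude ~ energy per bin
power_db = librosa.power_to_db(power)  # log scale in dB, closer to perceived loudness
```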
You are going to put so many lecturers, top universities included, out of a job!
Thank you so much for your fascinating course.
At about 7:33, when you explain how to get #frames, which is 342 here, I cannot reproduce it with the formula from the last video:
#frames = ((#samples(of scale array) - FRAME_SIZE) / HOP_SIZE) + 1 = ((174943 - 2048) / 512) + 1 = 338.68, not 342.
Can you please clarify?
Thank you
Zahra
Is there any possibility of recreating the audio after we get the log-amplitude spectrogram? Of course, first we would convert dB back to power, for which there is a function in librosa, but what then? How do we invert the "np.abs(S_scale) ** 2" part back to audio?
Once you've taken the magnitude, you unfortunately lose the phase information, and therefore lose information in general. You would have to have stored the phase somewhere.
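A sketch of both options, assuming a placeholder file path: keep the complex STFT around for exact inversion, or fall back to Griffin-Lim (available in recent librosa versions) when only the magnitude survived:

```python
import numpy as np
import librosa

signal, sr = librosa.load("scale.wav")                        # placeholder path
S_complex = librosa.stft(signal, n_fft=2048, hop_length=512)  # keep this around!
power = np.abs(S_complex) ** 2                                # what gets plotted; phase is gone here

# Exact reconstruction, because the complex STFT (with phase) was stored:
reconstructed = librosa.istft(S_complex, hop_length=512)

# Approximate reconstruction from magnitude only (phase estimated iteratively):
approximation = librosa.griffinlim(np.abs(S_complex), hop_length=512)
```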
THANKS A LOT BRO, THIS HELPED WITH MY ASSIGNMENT
Extremely clear explanation!
Thanks a lot!
Is it possible to get a music transcription of a piano recording using this?
Can you please explain the similarity matrix, if possible with Python?
Hello! The series is extremely interesting. Thank you for creating the channel and sharing the knowledge with both theory and hands-on examples. By the way, I want to report a tiny error: in this video, around 3:15, the audio track played via IPython was not audible. The same problem happened when you played the same audio in one of your previous videos. Thank you.
Thank you. I think the issue with audio has to do with YT's copyright.
Hi Valerio, in the above spectrograms there is always a strong and constant low frequency component. What does it depend on? Is it relevant or is it just an artefact? Thank you
This helps so much with my final project idea, thank you!
Great video!
What is the advantage of power magnitude vs. magnitude, i.e., why use np.abs(S_scale) ** 2 rather than np.abs(S_scale)?
Why does the spectrogram's time axis only span a few seconds while the raw audio is a few hours long?
I believe that's a bug when displaying audio in Jupyter notebooks.
Hi Valerio,
Thanks for this video. Very useful.
At 7:30 of this video, displaying the shape of the stft output matrix, it seems that #frames is calculated as (samples / hop_size) + 1 in my example code. I understand the equation from your previous video, but the librosa output is slightly different.
In my example, samples=220100, Frame_size = 512, Hop_size = 160
Output STFT matrix has second dimension: 1379
Can you please clarify?
The librosa source code says that pad_mode is 'reflect' and center is True if unspecified:
if center:
    y = np.pad(y, int(n_fft // 2), mode=pad_mode)
so the n_frames calculation Valerio gave and librosa's output agree. Your samples + padding = 220100 + (256 * 2) = 220612, and n_frames = (samples - frame_size) / hop_size + 1 = (220612 - 512) / 160 + 1 = 1376, so I think your **second dimension: 1379** may be a typo. My computation gives (257, 1376).
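A quick numeric check of this (a sketch with a synthetic signal standing in for the real audio):

```python
import numpy as np
import librosa

SAMPLES, FRAME_SIZE, HOP_SIZE = 220100, 512, 160
signal = np.random.randn(SAMPLES).astype(np.float32)  # stand-in audio

# center=True (the default) pads FRAME_SIZE // 2 samples on each side
padded = SAMPLES + 2 * (FRAME_SIZE // 2)
expected_frames = (padded - FRAME_SIZE) // HOP_SIZE + 1   # 1376

S = librosa.stft(signal, n_fft=FRAME_SIZE, hop_length=HOP_SIZE)
print(S.shape)  # (257, 1376): FRAME_SIZE // 2 + 1 bins, 1376 frames
assert S.shape == (FRAME_SIZE // 2 + 1, expected_frames)
```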
Sir, is it also possible to save such a spectrogram to an image file?
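One way to do it (a sketch, not from the video; the paths are placeholders) is to draw the spectrogram with librosa.display.specshow and save the figure with matplotlib:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

signal, sr = librosa.load("scale.wav")                # placeholder path
S_db = librosa.power_to_db(np.abs(librosa.stft(signal, n_fft=2048,
                                               hop_length=512)) ** 2)

plt.figure(figsize=(10, 5))
librosa.display.specshow(S_db, sr=sr, hop_length=512,
                         x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.savefig("spectrogram.png", dpi=150, bbox_inches="tight")
```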
I notice in the librosa docs they don't square the magnitude but then use amplitude_to_db — does anyone know if this makes any difference to the final results? Guess I'll have to try to understand everything properly later! Great video!
Why do I get this error when I try to run "Ipd.Audio(scale_file)": ValueError: rate must be specified when data is a numpy array or list of audio samples?
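For anyone hitting this: the error suggests the variable holds raw samples (a numpy array) rather than a file path (an assumption, since the notebook isn't shown), so IPython needs the sample rate. A minimal sketch with placeholder names:

```python
import IPython.display as ipd
import librosa

# If you pass a file path string, IPython reads the rate from the file:
ipd.Audio("audio/scale.wav")                  # placeholder path

# If you already loaded the samples with librosa, pass the rate explicitly:
signal, sr = librosa.load("audio/scale.wav")
ipd.Audio(signal, rate=sr)
```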
Quick question: what is the purpose of doing Y_scale = S_scale ** 2? Why 2 and not a different number? What effect does this power parameter have on the generated spectrogram?
Why doesn't the Debussy sound file look like a copy from the center?
Hey Valerio, I tried to run it on an audio file of mine but I get an error; it says that "figsize" is not defined... can you please help :)
Hi Valerio, I want to extract features for a set of videos stored in a folder; there are 200 audio files. How can I load the audio files and apply the feature extraction to all of them at once? What would the code that processes all the audio files look like?
I am also looking for the same thing, to extract spectrograms from a large number of audio files. Did you find a solution that I can follow?
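A minimal sketch of one way to batch-process a folder (the folder name and parameters are hypothetical, assuming .wav files):

```python
import os
import numpy as np
import librosa

AUDIO_DIR = "audio_files"   # hypothetical folder holding the ~200 files
spectrograms = {}

for filename in os.listdir(AUDIO_DIR):
    if not filename.endswith(".wav"):
        continue                                   # skip non-audio files
    signal, sr = librosa.load(os.path.join(AUDIO_DIR, filename))
    S = np.abs(librosa.stft(signal, n_fft=2048, hop_length=512)) ** 2
    spectrograms[filename] = librosa.power_to_db(S)
```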
Was there a reason why you changed from using a continuous colour scale in the first (non-log) plot to a diverging scale for the log plot?
I suspect it demonstrates harmonics better than the continuous scale.
Thank you for this video, you're a real hero.
Hi, how do we do this if we have 150 audio files?
Can you make videos on voice cloning?
Thank you for the suggestion! I haven't planned to cover this topic soon, but I'll put it in my backlog as it's quite interesting.
Hi, great video again. Could anyone explain to me why we square after np.abs(Y)? Not using it doesn't change the result much, since we'll use logarithmic scales anyway, but is it correct to actually use it?
Dear Valerio, how can I convert from the magnitude spectrum (therefore, with the ^2 coefficient) back to audio? The istft requires complex numbers, but we lose track of them when taking the magnitude.
Just save the variable before taking the magnitude.
Are we only able to work with .wav files in librosa? I've been using only .wav so far.
If needed, I'll convert mp3s to .wav using pydub. I'll try that next week.
You can load mp3 files with librosa if you have ffmpeg installed.
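For instance (a sketch; the path is a placeholder, and ffmpeg or another audioread backend must be installed on the system):

```python
import librosa

# librosa uses soundfile first and falls back to audioread/ffmpeg
# for compressed formats such as mp3
signal, sr = librosa.load("song.mp3")
print(signal.shape, sr)
```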
@@ValerioVelardoTheSoundofAI ah, ok. Thanks!
Hi, I got a bit confused. In the DL series we compute the spectrogram with spectrogram = np.abs(stft) and log_spectrogram = librosa.amplitude_to_db(spectrogram). Is there any difference from the way we compute it in this video (spectrogram = np.abs(stft)**2 and Y_log_scale = librosa.power_to_db(Y_scale))?
In the first case, we have an amplitude spectrogram. In the second, a power spectrogram.
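To make the distinction concrete, a sketch with a placeholder path; note that after the dB conversion the two variants coincide with librosa's default parameters, so the choice mainly matters when a model is fed the linear (non-dB) spectrograms:

```python
import numpy as np
import librosa

signal, sr = librosa.load("scale.wav")            # placeholder path
S = librosa.stft(signal, n_fft=2048, hop_length=512)

amp_db = librosa.amplitude_to_db(np.abs(S))       # DL series: amplitude -> dB
pow_db = librosa.power_to_db(np.abs(S) ** 2)      # this video: power -> dB

# 20*log10(|S|) == 10*log10(|S|**2), so these match with default refs:
print(np.allclose(amp_db, pow_db))                # True
```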
@@ValerioVelardoTheSoundofAI Thank you! But I still can't understand when we use the first case and when the second one, since you call both of them spectrograms.
@@Kyrios_X it depends on the task. Sometimes, amplitude spectrograms work better than power spectrograms. Sometimes, it's the reverse. Unfortunately, there isn't a "rule". You'll have to try both of them and see which representation works best for your problem.
@@ValerioVelardoTheSoundofAI OK, got it! Since I am new to speech recognition and still practicing on datasets, would it be better to use scipy.signal.spectrogram() instead?
Thank you so so much for your dedication.
Which piece by Debussy was it?
Great content
Thank you Sandipan :)
How can I get your code?
From the GitHub link in the description box.
Incredible!!!
Thanks for your video, very helpful and well explained!
You're welcome!
How do I save this to a CSV file after processing?
You could use Pandas for that. It has a super convenient to_csv function.
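A minimal sketch of that (the path is a placeholder; rows are frequency bins, columns are frames):

```python
import numpy as np
import pandas as pd
import librosa

signal, sr = librosa.load("scale.wav")                        # placeholder path
S_db = librosa.power_to_db(np.abs(librosa.stft(signal)) ** 2)

# One row per frequency bin, one column per frame
pd.DataFrame(S_db).to_csv("spectrogram.csv", index=False)
```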
@@ValerioVelardoTheSoundofAI Can you make a video covering all of this? As you know, there are no resources on the internet, so it would be a great reference. Thanks a lot.
Building a dataset from audio from scratch and saving it as a CSV file would be great. I'd appreciate that effort.
@@iioiggtrt9085 I'll put this in my backlog! Thank you for the suggestion :)
very helpful. Thank you very much!
Please do one on MFCCs as well. Up to now there's no good resource on them 🙏
I've planned to cover MFCCs (theory + code) over the next few weeks. Stay tuned!
@@ValerioVelardoTheSoundofAI Thank you, waiting for the video!
Excellent, thanks for this video.
Glad you liked it!
Hi, Valerio, do you speak italian ? ;-)
Yes, I'm Italian
@@ValerioVelardoTheSoundofAI Perfect! I'd like to ask you a million things! :-D In the meantime, I'll start watching your videos, which interest me very much!
Hi Valerio, I have a question: if we have 100 genres with 100 training examples each, how do we compute the spectrograms and store them? (Or is there a way to generate the image data and feed it in at run time?) Each spectrogram will have varying dimensions, so how do we get uniform input for the network to train on? Would using rectangular windows of the spectrogram be better for training? Can you suggest some links for reading more about audio augmentation?
Is Y_scale (a 2-D array) enough for training, or do we need the RGB version of it? Which is more efficient?
Please have a look at this Kaggle competition, it might interest you. For bird audio recognition, what should the input be: spectrogram or mel spectrogram?
www.kaggle.com/c/birdsong-recognition
Achal, you've asked a lot of good questions! I have a couple of videos in "DL for Audio with Python" that cover a similar use case to yours, i.e., 100 samples in 10 musical genres. You can refer to those videos.
It's important that you always have the same input shape. For that, you can segment the songs so they have the same number of samples. Usually, for music genre classification, 15-30 seconds' worth of audio should be good.
If by "rectangular windows of the spectrogram" you mean applying a Mel filter bank, you're on the right path. Mel Spectrograms, or even better CQT, are valuable approaches when dealing with music data.
For training, you'll need the equivalent of a grayscale image, in case you decide to go with CNN architectures -- which I suggest you do. In other words, you'll have to add a 3rd dimension. Once again, you can refer back to my videos in DL for Audio to see how to do that.
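In code, adding that 3rd "channel" dimension is a one-liner (a sketch with a stand-in array):

```python
import numpy as np

# Stand-in for a real 2-D spectrogram of shape (n_bins, n_frames)
spectrogram = np.random.rand(120, 120)

# Add a channel axis so a CNN sees it as a grayscale image
cnn_input = spectrogram[..., np.newaxis]
print(cnn_input.shape)   # (120, 120, 1)
```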
@@ValerioVelardoTheSoundofAI
Hi, I have been watching the series (DL for Audio with Python) and referring to it a lot, thank you for this channel.
So for a CNN we have to set the third dimension to 1, right? Is that what you meant, like (120, 120, 1), adding the extra dimension? By windowing I meant: if there is a 1-minute bird call, taking a 5-second input, then the next 5 seconds, and so on. Is there a way to do that with the spectrogram?
@@achalcharantimath5603 (120, 120, 1) works. You can use the whole minute's worth of audio, or segment it into, say, 15-second chunks.
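Segmenting is easiest on the raw sample array, before computing the spectrogram; a minimal sketch for 5-second chunks (the path is a placeholder):

```python
import librosa

signal, sr = librosa.load("bird_call.wav")   # placeholder path
CHUNK_SECONDS = 5
chunk_len = CHUNK_SECONDS * sr

# Non-overlapping 5-second segments; a trailing partial chunk is dropped
chunks = [signal[i:i + chunk_len]
          for i in range(0, len(signal) - chunk_len + 1, chunk_len)]
```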
Thank you so much for the great content!
I followed every video up to now and learned so much.
This is the first time I've hit a problem I can't solve:
In Sublime Text, when I use the plot_spectrogram function, no spectrogram window pops up as usual. If I wrap the call in a print() (I don't know if that's the right way to check), the output shows "None". Apart from that, no errors occur. Does anyone know how to visualize the spectrogram in Sublime Text?
Hope that someone knows the solution to that. Thanks in advance :)
Getting the same issue, did you end up solving it? Thanks
Ah, found it: add plt.show() at the end of the function!
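For reference, the fixed helper would look roughly like this (a sketch of the video's plot_spectrogram function as I understand it; the exact body may differ):

```python
import librosa.display
import matplotlib.pyplot as plt

def plot_spectrogram(Y, sr, hop_length, y_axis="linear"):
    plt.figure(figsize=(25, 10))
    librosa.display.specshow(Y, sr=sr, hop_length=hop_length,
                             x_axis="time", y_axis=y_axis)
    plt.colorbar(format="%+2.f")
    plt.show()  # required outside Jupyter (e.g. scripts run from Sublime Text)
```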
Thank you sir🥳🙌🙌
Thank you :)
nice hair style :)
Thank you thank you thank you!!!!!!!!
Very, very useful, thank you!
You're welcome!
0:16 debussy
you were too young with new hair :))
LOL