11- Preprocessing audio data for Deep Learning

  • Published Dec 16, 2024

Comments • 226

  • @ValerioVelardoTheSoundofAI  4 years ago +29

    I now have a full series called "Audio Signal Processing for Machine Learning", which develops the concepts introduced here in greater detail. You can check it out at
    th-cam.com/video/iCwMQJnKk2c/w-d-xo.html

    • @subramanyabhattm4626 3 years ago +1

      Sir, how do we interpret the MFCCs? Which coefficients should we keep, and which should we leave out?

  • @alexey7249 2 years ago +36

    For anyone taking this course in 2022: the function "waveplot" was renamed to "waveshow".
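
  A minimal sketch of the rename mentioned above, assuming librosa >= 0.10 and a hypothetical file path (the call is otherwise a drop-in replacement):

      import librosa
      import librosa.display
      import matplotlib.pyplot as plt

      # waveshow replaces the removed waveplot
      signal, sr = librosa.load("blues.00000.wav", sr=22050)
      librosa.display.waveshow(signal, sr=sr)
      plt.xlabel("Time (s)")
      plt.ylabel("Amplitude")
      plt.show()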

  • @kaushilkundalia2197 4 years ago +28

    I'm so glad I found this series. Great quality content (Y)

  • @casafurix 3 years ago +3

    This is so well explained and helps me enormously with the project I'm working on! I can never thank you enough for making all these videos, you deserve the best!

  • @eriklee1131 4 years ago +3

    Great video! I like how you stepped through everything and the code in the video works

  • @SubtreX 4 years ago +8

    Finally found exactly what I was looking for. Great explanations! ❤

  • @iliasp4275 2 years ago +4

    1:47 that song hits HARD

    • @yusufcan1304 7 months ago

      hahaha :D

    • @sanzhik_sanziu 6 months ago

      Is there actually sound there, or is it just a wav file with no audio?

  • @SabriCanOkyay 1 year ago

    The series is awesome. And 😭at 1:50 . Love you bro!

  • @WarshaKiringoda 4 years ago +4

    This channel is a Gem!! Thank you for putting out these tutorials. Keep going!

  • @abhishekdileep5950 2 years ago +1

    awesome series, deserves more recognition !!!

  • @javadmahdavi1151 3 years ago

    The best instructional video I've ever seen, even better than college ❤❤❤❤

  • @jaychen1116 4 years ago +3

    Thank you for the wonderful work. If you can make a series of audio signal processing, that would be great. Have a nice day!

    • @ValerioVelardoTheSoundofAI  4 years ago

      Thank you for the feedback!

    • @maddonotcare 4 years ago +2

      i second this!!

    • @meetgandhi8782 4 years ago

      Yeah, it would be very helpful if you made a video series explaining these different DSP methods.

  • @VishwaAbeywardana 2 years ago

    Hello, what Python version do you use in this tutorial?

  • @mohammadareebsiddiqui5739 4 years ago +2

    The series so far has been very well explained and paced, but personally I would have liked a more detailed explanation of MFCCs, since they are the most important thing we are going to feed into the NN, right? If there are any resources you can recommend, it'd be really appreciated!

    • @ValerioVelardoTheSoundofAI  4 years ago +3

      Thank you for the feedback! I get your point. But I made the choice not to get into the algorithmic/mathematical details of MFCCs because it's quite a complicated topic that would probably derail too much from the focus on deep learning. As I mentioned in the videos, if I see enough interest I may create a whole series on audio DSP. There, I'll definitely go into the nitty gritty of MFCCs and the Fourier transform. On this point, would you be interested in a series on audio digital signal processing?
      As for this course, I don't think using MFCCs as a black box is going to be detrimental for DL applications.
      As for extra resources on MFCCs, I suggest you take a look at this article: practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/ It's a friendly intro to the concept. Hope this helps :)

    • @mohammadareebsiddiqui5739 4 years ago

      @@ValerioVelardoTheSoundofAI I would definitely be interested in a series on audio DSP, although, as a DL enthusiast, I would love it if the topics covered in that series somehow circled back to their significance in DL.
      Also thank you so much for the article!

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      @@mohammadareebsiddiqui5739 that could be interesting...

  • @aavash_bhattarai 11 months ago +3

    If anybody hits an error on the librosa.feature.mfcc() line (mfcc takes 0 positional arguments but 1 positional argument...),
    make sure you add "y=" before signal, that is:
    MFCCs = librosa.feature.mfcc(y=signal, n_fft=n_fft, hop_length=hop_length, n_mfcc=13)
    Hope this helps

    • @Tvisha-kt3oy 4 months ago

      Thanks for the help! Was a little confused when code wasn't working!

    • @monkedelufi6106 4 months ago

      thanks a lot bud

    • @cromie_ 13 days ago

      thanks dude!
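
  A fuller sketch of the fix from the thread above, assuming librosa >= 0.10 (where the arguments to mfcc() are keyword-only) and a hypothetical file path:

      import librosa

      signal, sr = librosa.load("blues.00000.wav", sr=22050)
      # a positional signal argument raises a TypeError in recent librosa,
      # so every parameter is passed by keyword
      MFCCs = librosa.feature.mfcc(y=signal, sr=sr, n_fft=2048,
                                   hop_length=512, n_mfcc=13)
      print(MFCCs.shape)  # (13, number_of_frames)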

  • @frekehkhouri 3 years ago

    You are amazing! A real music and deep learning wizard!

  • @aendnouseforalastname8318 4 years ago +1

    So glad I found this! You do an amazing job!

  • @rainerzufall1868 4 years ago +2

    Incredible channel! Please keep going!!!

  • @joeyg4008 3 years ago

    20:16 With this graph, how would you display the number of seconds on the x-axis and the range of frequencies on the y-axis?

  • @ashokdhingra4 3 years ago

    Your wonderful videos are helping me in my PhD on Indian Vocal Music. But alas, no videos on Indian Classical Vocals.

  • @manpreetkaur8587 3 years ago

    This is helping me in my capstone masters project. Thank you so much.

    • @sahiljain3083 3 years ago

      Same here; I was searching for these concepts and then this channel showed up.

  • @evicluk 4 years ago +1

    I think I am finishing my master's degree with you 😂, thank you for your amazing job!

    • @rekreator9481 4 years ago

      Bachelor's degree here, but same.. I guess I'm low on time given the complexity though :DD When is your due date and how far along are you? :DDD

    • @ValerioVelardoTheSoundofAI  4 years ago +2

      I'm happy the videos can help :)

    • @evicluk 4 years ago

      There's no rush. I tried to build a music recommender system, and music classification is the first step. I tried several models and tried to optimize them.

    • @rekreator9481 4 years ago

      @@evicluk Do you use tensorflow or pytorch?

    • @evicluk 4 years ago

      @@rekreator9481 tensorflow

  • @javadmahdavi1151 3 years ago

    This is so good. What do you suggest for implementing these things?
    I'm very excited to read that book and learn about sound in deep learning.
    Thank you so much

  • @annasultubo 3 years ago

    I'm writing my thesis thanks to you, I owe you a dinner! Amazing job

  • @ahmedkhateeb8178 3 years ago

    Thank you for spreading the knowledge. I have a question though: if I want to build a source-separation kind of application, should I use mel-scale spectrograms, or would I be better off with other time-series representations like Gramian matrices and Markov transitions?

  • @captcannoli7293 4 years ago

    Great series! I have learned much more from this than from other courses that have cost me a lot of money. I have one question, if you could help: what is the number of features you are extracting to use in the NNs in the series? It wasn't very clear in the videos.

  • @birukabera465 4 years ago

    Thank you for this wonderful video lecture. I am working on lung sound analysis. Would you also show us how to implement wavelet analysis, particularly the discrete wavelet transform, as you did for FFT, STFT, and MFCCs?

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      Glad you liked it Biruk! I'm planning to start a whole series on audio/music processing over the next few weeks. Stay tuned :)

  • @EliRifle 4 years ago +1

    Extremely helpful! You are the best

  • @apoorvwatsky 4 years ago

    Amazing series.
    A question: frequency and magnitude are numpy arrays, each of size > 661,000.
    But while plotting, the x-axis (denoting frequency) scales itself to the sample rate, which is 22050. Why so? I'm talking about the spectrum plot here.
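
  For context, a minimal sketch of the spectrum plot under discussion, following the tutorial's approach and assuming a hypothetical file path. The x-axis values come from np.linspace spanning 0..sr, not from the FFT array length, which is why the plot tops out at the sample rate:

      import numpy as np
      import librosa
      import matplotlib.pyplot as plt

      signal, sr = librosa.load("blues.00000.wav", sr=22050)
      magnitude = np.abs(np.fft.fft(signal))          # > 661,000 bins
      frequency = np.linspace(0, sr, len(magnitude))  # but spanning only 0..22050 Hz

      half = len(frequency) // 2                      # keep bins up to Nyquist (sr/2)
      plt.plot(frequency[:half], magnitude[:half])
      plt.xlabel("Frequency (Hz)")
      plt.show()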

  • @lindascoon4652 4 years ago

    Thank you so much for ur videos. I have a question regarding processing of audio. If I want to classify a bell that rings for less than a sec then stops for some time, do I have to collect the audio of the individual rings and cut out the silences or can I use a longer audio of the bell ringing and stopping ?

  • @tolouamirifar1913 4 years ago +1

    Hi Valerio, thank you so much for your amazing videos. I am doing an emergency vehicle siren detection by deep learning, I divided my data into emergency and non-emergency and used band-pass filter to remove the noise. Now I have doubt that should I implement this filter on just emergency audio files or on all the data (emergency and non-emergency). I would be grateful if you could guide me on this.

  • @consultingprestig2096 1 year ago

    Thanks for these tutorials! I want to ask a question: why do you not use a notebook? I'm using notebooks with VS Code (".ipynb" file extension); it's just practical, no? Good luck

  • @alua6916 3 years ago

    Hello, thank you very much for this tutorial. What if I have problems with numpy? My IDE says there is an error.

  • @markosklonis 3 months ago

    I'm having an issue with librosa missing the _soundfile_data module when I try to load the song.

  • @yusufcan1304 7 months ago

    We couldn't hear the song, but it is a super cool video.

  • @jawadmansoor6064 2 years ago

    8:30 The length of 'signal', 'magnitude' and 'frequency' is the same.
    Why is frequency increasing from the beginning towards the end? It should be increasing and decreasing at different times, not only increasing over time. What am I missing?
    9:29 By lower frequency I understand the left-most area of the graph. But we see the same height of energy/magnitude towards the end (right side) of the graph, yet you say that "the higher we go with frequency the less contribution they will give us". I don't understand the graph, it seems.

  • @LucasSilva30 3 years ago

    Valerio, first of all, congratulations on your excellent job! I am learning so much from you!
    Secondly, can you explain how to load mp3 files with librosa? From what I read in the documentation, installing ffmpeg should solve it, but it did not.
    Thank you!

  • @jaehwlee 4 years ago

    Thank you for posting this wonderful video.
    I'm working on a toy project where I search for music by humming. Is it right to use a Mel spectrogram? I don't know if a CQT would be more appropriate. I would appreciate your reply.

    • @ValerioVelardoTheSoundofAI  4 years ago +2

      You can definitely give Mel spectrograms a try. Try to focus on intervals instead of absolute pitches, as people without absolute pitch (i.e., the overwhelming majority) can hum the intervals which make up a melody, but not necessarily in the right key. Focus only on monophonic music (i.e., a vocal melody); generalising beyond that is a much harder problem. Hope this helps!

  • @kacemichakdi3048 2 years ago

    Hi sir,
    thanks for this video.
    I just want to know how we can play the audio in Python and listen to it from this form (signal, sr = l.load(file)).
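
  One way to audition the loaded array, sketched with the third-party sounddevice package (an assumption; the tutorial itself doesn't cover playback, and any player that accepts float arrays works):

      import librosa
      import sounddevice as sd  # pip install sounddevice

      signal, sr = librosa.load("blues.00000.wav", sr=22050)  # hypothetical path
      sd.play(signal, sr)  # stream the array to the default output device
      sd.wait()            # block until playback finishes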

  • @JamesSmith-vu7io 3 years ago

    Hello Valerio, have you ever extracted ivector from audio clips? I am trying to find documentation on it but am struggling. Your advice would be greatly appreciated

  • @hafizhashshiddiqi2988 2 years ago

    WOW! This lesson is really good. Are we done with the audio preprocessing at this point? I want to build a speaker recognition system and need to learn how to build a model, and one of the steps is preprocessing the audio, so this video is very helpful if the preprocessing is done. What do we do after this? And what if we want to represent the result numerically rather than visualizing it? Thank you

  • @saranyaasuresh5710 2 years ago

    Hello Valerio,
    Your videos are very helpful for learning about audio signal processing in AI. I am learning about AI, and the theory you explain is easy to grasp. Thank you for such great lessons.
    I have a doubt: as input you have been using a .wav file, which is uncompressed, so the file size is large. Can you tell me what method can be used to process the audio file with the best quality and without losing information?

    • @ValerioVelardoTheSoundofAI  2 years ago

      Thank you! There isn't an ideal solution to compress audio files and not lose information. WAV (lossless) is the best. Many AI music applications won't be affected negatively if you use MP3s instead.

  • @jecellepaculba8673 2 years ago

    Hello, do you know how to classify features extracted from audio with MFCC using an SVM?

  • @latchireddiekalavya4683 3 years ago

    I have downloaded that audio file, but it is still showing an error:
    FileNotFoundError: [Errno 2] No such file or directory: 'blues.00000.wav'
    Sir, a solution please?

  • @michalkmicikiewicz4391 3 years ago

    Shouldn't we multiply the magnitude by 2 when narrowing the power spectrum plot to the Nyquist frequency?

  • @yannickpezeu3419 3 years ago

    Amazing clear content. Thanks a lot !

  • @bogdandaniel5426 3 years ago

    How can i fix this error "UserWarning: PySoundFile failed. Trying audioread instead.
    warnings.warn("PySoundFile failed. Trying audioread instead.")"?

  • @kiran23500 4 years ago

    Great series man, thank you. Can we differentiate human voices by using the mel spectrums? If yes, can you please tell me how? Your reply would be helpful.

    • @ValerioVelardoTheSoundofAI  4 years ago

      Yes, you can use MFCCs for speaker identification. The process is similar to the one I've used for genre recognition in the following videos. Check those out!

  • @sidvlognlifestyle 2 years ago

    You help me as if God sent you to help me .... My submission is in 3 days and I'm up to preprocessing 😅 Thank you so much .... Please make a video on how to build an accurate model for audio signals ❤️

  • @greg73049 4 years ago

    Hi, thanks a lot for these videos, they are very useful.
    I was just wondering if it would be beneficial to represent the frequency scale logarithmically, as humans interpret sound in this way (since musical intervals/harmonics are represented by multiples of a frequency rather than an absolute difference). Are deep learning algorithms not trained with this scale, since it mimics human hearing more closely?

    • @ValerioVelardoTheSoundofAI  4 years ago +2

      Great intuition! You can take the logarithm of the spectrogram, or, apply Mel filterbanks, and arrive at the so called Mel Spectrogram. I have another series called "Audio Signal Processing for ML" that dives deep into all of these topics, if you're interested.
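
  A minimal sketch of the Mel spectrogram mentioned in the reply above, assuming a hypothetical file path; the mel filterbank makes the frequency axis roughly perceptual, and power_to_db log-compresses the amplitudes:

      import librosa

      signal, sr = librosa.load("blues.00000.wav", sr=22050)
      mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=2048,
                                           hop_length=512, n_mels=128)
      log_mel = librosa.power_to_db(mel)  # shape: (128, number_of_frames)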

  • @midhunsatheesan5717 4 years ago

    This was a brilliant video. I have a query which I would like to shoot. I don't know if it's answered in the next set of videos.
    Does it matter if the time span of each clip is different in the dataset?
    Do the same principles applied here apply to any audio, e.g. animal sounds, scream detection?
    How to deal with noise?

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      1- If you're using a CNN architecture you need to have all data samples with the same duration. To obtain this, you should segment clips with different durations (e.g., into 10-sec clips).
      2- Yes, you can transfer the same approach used here to other audio domains.
      3- If you're using a DL approach, the network should be able to learn to deal with noise automatically.
      If you'd like to learn more about these topics, I suggest you check out my series "Audio Signal Processing for ML".

    • @midhunsatheesan5717 4 years ago

      @@ValerioVelardoTheSoundofAI Thanks Valerio. I'm watching the signal processing series too..! Another query I have.. There is another library called kapre. That one seems like it's built upon Keras. How do you think it compares with librosa? Kapre seemed very easy with just additions of layers to the model. I'm not sure if it can do everything that librosa can.

    • @ValerioVelardoTheSoundofAI  4 years ago

      @@midhunsatheesan5717 Kapre is great if you want to extract spectrograms computing FT on GPU. However, it can't do many things that librosa can. So if you plan to use basic audio features used in DL go with Kapre. Otherwise, go with librosa :)
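
  A minimal sketch of the fixed-duration slicing suggested earlier in this thread, assuming 10-second segments and a hypothetical file path:

      import librosa

      signal, sr = librosa.load("recording.wav", sr=22050)
      segment_samples = 10 * sr  # 10-second slices for a fixed-size CNN input

      # drop the shorter tail so every training sample has equal duration
      n_segments = len(signal) // segment_samples
      segments = [signal[i * segment_samples:(i + 1) * segment_samples]
                  for i in range(n_segments)]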

  • @lakshaysharma1888 2 years ago

    How do I use that blues.00000.wav file to run the code? Many errors are coming up.

  • @nmdhawale 1 year ago

    While plotting the power spectrum we take only half of the data (the left half) after doing the FFT. Then how come we don't do the same while plotting data based on the STFT? 😮

  • @stipan11 4 years ago

    Hi there, I'm interested to know how I can clean my audio dataset (Google Speech Commands) if it contains faulty audio. For example, I should hear the word "three" but there is too much noise, or the word is cut in the middle of pronunciation so it just says "thh.."
    Any idea how to get rid of those audio files and clean my dataset without doing it manually?

  • @3arabs4 4 years ago

    Just amazing content, you are a life saver.

  • @abhipanchal5681 4 years ago

    Hi there,
    I want to do music semantic segmentation (intro, chorus, verse, etc.). Could you please suggest how I should label my audio data, and what features I should use for that?

    • @ValerioVelardoTheSoundofAI  4 years ago +2

      The task you're referring to is called "music segmentation" or "music structure analysis". I'm assuming you want to work with audio (e.g., WAV) not symbolic data (e.g., MIDI). There's a lot of literature on this topic. The techniques that work best are based on music processing algorithms which don't involve machine learning. The high-level idea is to extract a chromagram, manipulate it, and use a self-similarity matrix to identify similar parts of a song. The book "Fundamentals of Music Processing" has a chapter that discusses music segmentation in detail. Here's a slide presentation that summarises that book chapter: s3-us-west-2.amazonaws.com/musicinformationretrieval.com/slides/mueller_music_structure.pdf Hope this helps :)

  • @emirkuralkocer1280 3 years ago

    Hello Valerio, I have 3 folders (go, yes, no) containing 30 .wav files in total; each folder has 10 wav files. How can I run this code over the 30 different wav files?

  • @Engineer_Keith 4 years ago

    If I understand correctly, the hop_length being smaller than n_fft means that there's an overlap of the graphed data, right?
    Or does having n_fft = 2,048 mean that all the graphed data was only (661,500 / 2,048) = 322.998 or 1/323rd of the 30 second clip?

    • @ValerioVelardoTheSoundofAI  4 years ago

      That's correct. You can check out my video on extracting audio feature pipelines in the "Audio processing for ML" series to get a detailed explanation of the process. I hope I mentioned the right video :D

    • @Engineer_Keith 4 years ago

      @@ValerioVelardoTheSoundofAI Thanks so much! You're a fantastic teacher.

    • @ValerioVelardoTheSoundofAI  4 years ago

      @@Engineer_Keith thanks :)
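
  A small check of the overlap discussed in this thread, assuming the tutorial's parameters: frames start hop_length samples apart but each spans n_fft samples, so consecutive windows overlap by n_fft - hop_length samples and the whole clip is still covered:

      import librosa

      signal, sr = librosa.load("blues.00000.wav", sr=22050)  # hypothetical path
      n_fft, hop_length = 2048, 512

      stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
      # with librosa's default centering, the frame count is 1 + len(signal) // hop_length
      print(stft.shape[1], 1 + len(signal) // hop_length)  # the two counts agree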

  • @gs1619 4 years ago

    Nice work broo! I have a question tho. Is it alright to have negative MFCCs? Btw I am using RAVDESS dataset.

    • @ValerioVelardoTheSoundofAI  4 years ago

      It's totally fine to get negative MFCCs. Stay tuned for my coming videos in the "Audio Processing for ML" series on MFCCs to learn more ;)

  • @fatimahalqadheeb 2 years ago

    I have an audio dataset, each audio file consists of letters that are spoken all at one time. How can I prepare these audio files for machine learning? I would like to have each letter in an audio file. if anyone has an idea please help.

  • @arbigobiaalu 1 year ago

    Can you update the code for the current versions of matplotlib and librosa?

  • @rishirajput6691 3 years ago

    Valerio, could you please give some code for removing silence from the whole audio file? Please guide

  • @AshwaniKumar04 4 years ago

    Thanks for the video.
    One question: Should we always use SR = 22050?

    • @ValerioVelardoTheSoundofAI  4 years ago

      It depends on the problem. Most of the time sr = 16K is OK for sound/music classification problems.

    • @AshwaniKumar04 4 years ago

      @@ValerioVelardoTheSoundofAI Thanks for the reply. Does having a higher value increase the model accuracy?

    • @ValerioVelardoTheSoundofAI  4 years ago

      @@AshwaniKumar04 not necessarily. If most of the patterns for classification are in the lower frequencies having a high sr can actually be counterproductive.

  • @jessicabustos1262 4 years ago

    Hi Valerio, greetings from Colombia. I've been watching some of your videos about MFCCs, but as you'll understand, my English is a humble almost-B1, so I've been turning on subtitles for your videos; however, that wasn't possible here :C because the option doesn't appear. I'd love for this video to have the subtitle option; I'd be very grateful. I'd like to know what comes after obtaining the MFCCs: what should be implemented in Python so that it finally makes the decision to classify a sound as X or Y? I'm very grateful for your help.

    • @marianofares9694 7 months ago

      Hi, I'm interested in what you're saying, Jessica!

  • @achmadarifmunaji3320 4 years ago

    Can we use a full 10-minute wav file as a sample, or do we need to cut the file into pieces during preprocessing?

  • @massimomontanaro 4 years ago

    Hi Valerio, I've been looking for resources on how to deal with deep learning and audio for some time without too many results, so I'm really grateful to you for sharing these videos! I would like to ask you if it is possible and how to recover the original signal from the spectrogram. I tried to use the inverse functions like librosa.db_to_amplitude and librosa.core.istft, but the output signal seems very bad. I think this happens because we truncate complex numbers for the construction of the spectrogram. Can you suggest me the right way?

    • @ValerioVelardoTheSoundofAI  4 years ago

      You're absolutely right! The issue with istft is that we ignore the phase. The audio result is somewhat problematic. Reconstructing audio from a power spectrogram is a major problem still actively researched. There isn't a simple solution I'm afraid :(

    • @massimomontanaro 4 years ago

      @@ValerioVelardoTheSoundofAI Yeah, I found the same answer in a research paper I was reading just now. Do you think a well-trained LSTM autoencoder could approximate a better result? I mean, if we use these corrupted istft outputs as input and the original waveforms as output, could we obtain a neural net that can reconstruct a better waveform? Or do you think it's only a waste of time? Thanks in advance for your attention!

    • @ValerioVelardoTheSoundofAI  4 years ago

      @@massimomontanaro mmh... this is a highly dimensional problem. You'll need a MASSIVE dataset to try to get something decent. It may be worth an experiment, but I wouldn't be super confident.

  • @matrixlukan 3 years ago

    Now I know why Fourier transforms were added to my degree syllabus.

  • @Gileadean 4 years ago +1

    Could you make a video on the inverse functions? signal->stft->istft->signal works fine, but signal->stft->amplitude_to_db->db_to_amplitude->istft->signal results in a distorted signal. Same with inverse.mfcc_to_audio.

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      This is a somewhat more advanced topic in DSP. I'm thinking of creating a series on audio DSP / music processing. I'll definitely cover the inverse functions in that series. Before engaging in the implementation, I'd like to dig deeper in the math behind FT/MFCC. You're totally right re the reconstruction of the signal from MFCCs. It's a long shot, and the result isn't that great.

    • @Gileadean 4 years ago

      @@ValerioVelardoTheSoundofAI I kinda get why we are losing information if we convert our spectrogram to a mel-spectrogram, but why are we already losing information when using amplitude_to_db on the stft? Isn't it "just" a log-function?

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      @@Gileadean excellent question! I'm glad you've been playing around with these interesting DSP concepts :) Now, on to the answer. The STFT outputs a matrix of complex numbers. To arrive at the spectrogram, we calculate the absolute value of each complex number. This process removes the imaginary part of the complex values, which carries information about the phase of the signal. At this point you've already lost information! When you try to reconstruct the signal, the inverse STFT can't rely on phase information anymore. Hence, the somewhat distorted sound. As you correctly hinted at in your question, the conversion back and forth from amplitude to dB doesn't lose any additional vital info. I hope this helps!

    • @Gileadean 4 years ago

      @@ValerioVelardoTheSoundofAI Thanks for your quick replies! I somehow missed the np.abs(stft) and the warning message that occurs when calling amplitude_to_db on a complex input (phases will be discarded)
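
  One practical workaround for the phase loss discussed in this thread is Griffin-Lim, which iteratively estimates a plausible phase from the magnitudes; a minimal sketch, assuming a hypothetical file path:

      import numpy as np
      import librosa

      signal, sr = librosa.load("blues.00000.wav", sr=22050)
      spectrogram = np.abs(librosa.stft(signal, n_fft=2048, hop_length=512))  # phase discarded here

      # usually sounds noticeably better than an inverse STFT with a made-up phase
      reconstructed = librosa.griffinlim(spectrogram, n_fft=2048, hop_length=512)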

  • @carlraywairata5291 4 years ago

    Hi Valerio, I can't display the waveform image. Can you help me?

  • @achmadarifmunaji3320 4 years ago

    I'm trying to use a wav file containing someone's conversation as a sample file. When I want to display a waveform containing the magnitude and frequency, it produces an asymmetrical waveform. Is it still necessary to divide the frequency and magnitude?

    • @ValerioVelardoTheSoundofAI  4 years ago

      What do you mean by "I want to display a waveform containing the magnitude and frequency"? A waveform doesn't display information about either. It's a time-domain representation with air pressure as a function of time.

    • @achmadarifmunaji3320 4 years ago

      @@ValerioVelardoTheSoundofAI sorry, i mean spectrum

  • @barshalamichhane6761 4 years ago

    Hello sir, where can I get the audio file that you used here? Would you please provide the link?

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      If I remember correctly, it comes from the Marsyas genre dataset (marsyas.info/downloads/datasets.html). I may have mentioned this in a previous video.

    • @barshalamichhane6761 4 years ago

      @@ValerioVelardoTheSoundofAI thank you :)

  • @achalcharantimath5603 4 years ago

    Hi Valerio, if we have training data in mp3 format, is it important to convert the mp3 files to wav files for training? Will it improve performance?

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      Don't worry about mp3 files. With Librosa you can directly load them, without the need to convert them to wav files first.
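
  A minimal sketch of the direct mp3 load mentioned in the reply above, with a hypothetical path; librosa delegates decoding to its soundfile/audioread backends, and installing ffmpeg typically supplies the missing codec when loading fails:

      import librosa

      signal, sr = librosa.load("track.mp3", sr=22050)  # no manual wav conversion needed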

  • @rutiksansare3644 3 years ago

    Sir, I'm making a project on an attendance system using voice. Which Python modules should I use? Which algorithms should I use?

  • @swapnilbhabal5289 3 years ago

    Can anyone send the link for the music dataset of popular/hit songs, please?

  • @ronerortega3804 4 years ago

    This is what i've been looking for.

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      Thanks Roner!

    • @ronerortega3804 4 years ago

      @@ValerioVelardoTheSoundofAI Is it really hard to manage music data to make animations or reactions with it? I'm really, really new to this whole music spectrum and ML thing.

    • @ValerioVelardoTheSoundofAI  4 years ago

      @@ronerortega3804 not really. As long as you have audio parameters (e.g., loudness, chroma, beat), you can map them to different elements of an animation.

  • @phamvandan760 4 years ago

    I think that the magnitude of the frequency in the FFT is the modulus, which is calculated by np.absolute(), not np.abs().

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      np.absolute() and np.abs() are completely identical. You can use either one.

    • @phamvandan760 4 years ago

      @@ValerioVelardoTheSoundofAI Yeah, I see. Thanks.

  • @oblivion962 2 years ago

    Please make a tutorial on Anomaly detection on raw sound

  •  4 years ago

    Great videos man! Is there a way to make a database from audio files' metadata, like labeling each file with BPM, key, etc., but automatically? Building a database from scratch is going to take longer than the coding itself lol

    • @ValerioVelardoTheSoundofAI  4 years ago

      Thanks! There are algorithms for extracting Key, BPM automatically. You'll then need to implement a DB and populate it with the metadata. The algorithms aren't perfect. They are also genre-dependent.

    •  4 years ago

      @@ValerioVelardoTheSoundofAI I want to automatically sort my samples and obviously experiment with NNs and Python. What approach would you recommend?

  • @achmadarifmunaji3320 4 years ago

    What hop_length value should be used for voice recognition?

  • @alexandergeorgiev2631 1 year ago +2

    Did anyone get plt.colorbar to work? I get an AttributeError that says "AttributeError: module 'matplotlib' has no attribute 'axes'. Did you mean: 'axis'?"

    • @tobir693 1 year ago

      Same here. Tried everything. I read somewhere that downgrading matplotlib to 3.6 or 3.6.3 makes it work again. Haven't tried it though.
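
  One workaround that may help with the error above (an assumption, not a confirmed fix): pass the mappable returned by specshow to plt.colorbar instead of relying on matplotlib's current-image state:

      import numpy as np
      import librosa
      import librosa.display
      import matplotlib.pyplot as plt

      signal, sr = librosa.load("blues.00000.wav", sr=22050)  # hypothetical path
      log_spec = librosa.amplitude_to_db(np.abs(librosa.stft(signal)))

      img = librosa.display.specshow(log_spec, sr=sr, x_axis="time", y_axis="log")
      plt.colorbar(img)  # explicit mappable instead of the implicit current image
      plt.show()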

  • @JustinMitchel 4 years ago +1

    Nice work! Keep it up.

  • @blaze-pn6fk 4 years ago

    really great series !!

  • @thebinayak 3 years ago

    Hi Valerio, thank you for your detailed explanation. I am sure that, like me, thousands of others are benefitting from your videos. I understood everything in your video; however, I have one query: can we use the log spectrogram for deep learning instead of MFCCs? In other words, why do we only use MFCCs in deep learning? One more concern: I have audio data recorded at 44100 Hz; can I use a sample rate of 44100 instead of the 22050 you are using in this tutorial? Thank you in advance.

    • @ValerioVelardoTheSoundofAI  3 years ago +1

      Mel spectrograms are the feature of choice in DL. Of course, you can use an SR of 44.1K.

  • @Fil0125 3 years ago

    Ok bro, but what do I need to pass as input to train my neural network?

  • @Matusravas 4 years ago

    Many thanks for the great tutorial. I have one question. When calculating the FFT, you calculate the magnitude with the abs function, but you take only one half of the frequency spectrum; hence each magnitude value in the left side of the spectrum should be multiplied by 2, and afterwards each of them should be divided by the number of samples. That is how I understood the theory of the FFT. If I am wrong, please let me know. Many thanks for any response. :)

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      I take the first half, because the second half is symmetrical and doesn't bring any additional info. I have a few in-depth videos re FT in the "Audio signal processing for ML" if you'd like to dig more.

    • @Matusravas 4 years ago

      @@ValerioVelardoTheSoundofAI Thank you very much for your prompt response. But the theory I saw on calculating the magnitudes of the frequencies said: if you slice the spectrum into two halves and take only the left one, then you need to multiply each magnitude on the left side by two, and additionally divide each magnitude by the number of samples. I tried it on a simple sine wave with a frequency of 10 Hz and an amplitude of 5. Trying your approach first, I did not get the right magnitudes plotted (the frequencies are correct but the magnitudes are not). When I multiplied the magnitudes by two after taking just the first half, and then divided each by the number of samples, I got the correct plot. I am not arguing, but I want to understand it. Anyway, you do a brilliant job; the videos are amazing and very helpful.
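
  A small check of the scaling described in this thread, using a synthetic sine so the expected amplitude is known. The one-sided bins are doubled and everything is divided by the number of samples (strictly, the DC and Nyquist bins should not be doubled):

      import numpy as np

      sr = 1000
      t = np.arange(0, 1, 1 / sr)
      signal = 5 * np.sin(2 * np.pi * 10 * t)  # 10 Hz sine, amplitude 5

      fft = np.fft.fft(signal)
      half = len(signal) // 2
      magnitude = 2 * np.abs(fft[:half]) / len(signal)
      peak_bin = magnitude.argmax()
      print(peak_bin * sr / len(signal), magnitude[peak_bin])  # ~10.0 Hz, ~5.0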

  • @sumanths3856 3 years ago

    I am trying .mp3 files and it is showing
    audioread.exceptions.NoBackendError in the line ==> signal, sample_rate = librosa.load(file_path, sr=SAMPLE_RATE)
    I have installed ffmpeg.
    Please help me out

  • @arnabmukherjee9939 4 years ago

    NEED HELP !!
    What is the best way to extract features from variable-length audio (15 sec - 4 min)? I am doing bird song classification. Since MFCC will give different dimensions for different-duration audio, how can I feed that to a neural network? Should I extract features with a different technique? Please help

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      I would suggest keeping a fixed input length. If you do so, then you should pre-process the audio files of variable length, slicing them so as to have several samples of the same length. For bird song classification, it's preferable to use CNN-based architectures, which accept a fixed-size input. You can try RNNs if, for some reason, you'd prefer using audio files of different lengths.

    • @arnabmukherjee9939 4 years ago +1

      @@ValerioVelardoTheSoundofAI Thanks for replying. Will definitely try that.

    • @arnabmukherjee9939 4 years ago

      @@ValerioVelardoTheSoundofAI HEY !
      Do you think there might be some misjudgement because a feature might become divided into several windows since it can be replicated at any time instead of local segments?

    • @ValerioVelardoTheSoundofAI  4 years ago +2

      @@arnabmukherjee9939 If you identify a reasonable length, this shouldn't be an issue, as the sample should have most of the features that distinguish one species from another. A rule of thumb for deciding on the length of a sample: how much time would an expert need to recognise a species from its bird song?

    • @ValerioVelardoTheSoundofAI  4 years ago

      @@arnabmukherjee9939 No I don't think so!

  • @rafiyandi4654 2 years ago

    What version of Python are you using?

  • @alonalon8794 4 years ago

    Great content. Btw, how can I download the wav file?

    • @alonalon8794 4 years ago

      It seems as if it can't be downloaded from the GitHub link you published. Is there another place to download it from?

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      Thanks! I think I used a piece classified as blues from the GTZAN dataset. You can search for the file with the same name in the dataset. I provided the link to download GTZAN in a previous video in the series.

  • @oroneki 4 years ago

    Again! This is just awesome!

  • @missmudassarnizam5158 4 years ago

    Thank you so much ... I'm so happy you made this video. You make my work easier ...

    • @ValerioVelardoTheSoundofAI  4 years ago

      Glad I've been useful! Have you seen my new series Audio Signal Processing for ML? It goes very deep into these and more topics in audio processing.

  • @krishnachauhan2850 4 years ago

    But what if the data is either in pickle format or raw, like wav files? Then how do I proceed?
    Please, someone help

    • @ValerioVelardoTheSoundofAI  4 years ago +1

      You can "unpickle" pickle files. You can load wav files directly with librosa (using librosa.load). I have numerous videos which explain how to do that.

  • @giostechnologygiovannyv.ri489 1 year ago

    1:47 Mmmm, no audio played ^^'' Valerio, issues with YT and song copyright? hehe It's always annoying XD haha, but ok

    • @ValerioVelardoTheSoundofAI  1 year ago +1

      Yep, that's why...

    • @giostechnologygiovannyv.ri489 1 year ago

      @@ValerioVelardoTheSoundofAI :(... sometimes they make me think I did something to my laptop... A trick some YTers use is to reduce the volume of the song a lot; it doesn't always work, but sometimes ;)... Btw, I enjoyed your videos! They are super explanatory; keep going like that! :D

  • @MattsThe1991 4 years ago

    Thanks man! Wonderful videos

  • @chrisd4504 4 years ago

    This is a fantastic series but I can’t download any of the files from the repository. Is there a reason for that, or is it just a problem on my end (it probably is).

    • @ValerioVelardoTheSoundofAI  4 years ago

      Hi Chris, I'm glad you like the series! The repo should work fine. Have you tried downloading it in zip format instead of using Git?

    • @chrisd4504 4 years ago

      @@ValerioVelardoTheSoundofAI Embarrassingly enough I hadn't tried that. I tried just cloning it from terminal and using the github desktop app, but not downloading it in zip format. I just successfully downloaded it in zip format. Thanks for getting back to me!

    • @ValerioVelardoTheSoundofAI  4 years ago

      @@chrisd4504 I'm happy you found a workaround :)

  • @tejakayyala5795 3 years ago

    Love you so much sir....No words...

  • @dannyrodin1151 3 years ago +1

    There's no sound when you play that blues file.

  • @achmadarifmunaji3320 4 years ago

    How long should the wav files used in preprocessing be?

    • @ValerioVelardoTheSoundofAI  4 years ago

      That really depends on the problem you're working on and your dataset. Let me give you a couple of examples. In music processing, we usually use 15'' or 30'' of a song to analyse it. In keyword spotting systems, you would often have 1-second long clips.

    • @achmadarifmunaji3320 4 years ago

      @@ValerioVelardoTheSoundofAI I'm working on ML for voice recognition using a dataset that contains conversations in the form of wav files. Is there any suggestion you can give for a good .wav duration for my problem?
      Thank you for the advice

  • @frederiksidenius 1 year ago

    Thanks a lot for the great videos!
    They are well-made and very informative, however, I can’t help noticing that your frequency axis is wrong. It’s a minor inaccuracy but the way you define frequency with ‘np.linspace(0, sr, len(magnitude))’ is wrong. The frequency resolution should be ‘sr/len(magnitude)’ which you can achieve for example either with ‘np.linspace(0, sr, len(magnitude) + 1)[:-1]’ or ‘np.linspace(0, sr - sr/len(magnitude), len(magnitude))’. In short, the DFT returns N frequency bins from 0 to N-1 and therefore no bin is equal to the sample rate.
    As I said it is a minor error, however, this would be the accurate way to do it.
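
  An equivalent way to get a correct axis, complementing the fix in the comment above, is numpy's one-sided helpers, sketched here with a hypothetical file path; rfftfreq yields exactly one frequency per one-sided bin, from 0 up to the Nyquist frequency, so no bin equals the sample rate:

      import numpy as np
      import librosa

      signal, sr = librosa.load("blues.00000.wav", sr=22050)
      magnitude = np.abs(np.fft.rfft(signal))
      frequency = np.fft.rfftfreq(len(signal), d=1 / sr)
      assert len(frequency) == len(magnitude)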

  • @antunmodrusan828 3 years ago

    Thanks for the great content! :)