Hi Jon. Great presentation. I am absolutely new to machine learning and found your talk really clear and useful. Thanks for sharing.
Hi Jon, I am doing a final year undergraduate project on bioacoustics, I am new to signal processing as well as your channel! I was just wondering - do you have a paper covering some of the stuff you've talked about, which I could reference?
Hi! Yes, this work is mostly in my master's thesis. If you search Google Scholar for "Environmental sound classification on microcontrollers using Convolutional Neural Networks" you should find it. I would give you a link, but YouTube tends to shadowblock messages with links...
Perfect, bro. Can you share some ideas on how to prepare a dataset?
Hey, hope you are doing well!
Can you please help me? How can we use speech recognition to detect falls in elderly people?
One more question: how can audio be combined with images to implement fall detection?
Thank you
Respected Sir...
My project is to cancel the noise from audio. How can I train an ML model for this, and how should I proceed? Please help me.
Thank you. A very good presentation. Is the Keras model code you showed (i.e. "block_1", "block_2", etc.) on a couple of your slides available in one of your GitHub repositories?
Thank you Michael. Yes, all the Keras models I tested in my thesis are in the following repo/folder. The one in question is probably in "strided.py" or "sbcnn.py"
github.com/jonnor/ESC-CNN-microcontroller/tree/0d3a1231831d3ee61c22a4f8b461a7511fae3de7/microesc/models
Interesting Presentation !
Hi... great work! Thank you for uploading this video. If you had the exact frequency vs time data for a particular sample in text or CSV format, how could it be used to improve the accuracy of a CNN? Can image data be correlated with the corresponding frequency data to get more accurate predictions?
Also, is data augmentation (time shift, pitch shift, etc.) manual, or is there an automated process for achieving this?
Hi Jay. The spectrograms contain basically all the time versus frequency data. But if you have some additional information available, there are ways to incorporate that. If the data is always available (both at training time and at prediction time), then you can use it as an additional input to the neural network.
Data augmentation is basically always automated, either as a pre-processing batch job or done on-the-fly while training the neural network. This post shows the code for common audio augmentations, medium.com/@makcedward/data-augmentation-for-audio-76912b01fdf6
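To make the on-the-fly variant concrete, here is a minimal NumPy sketch. It is only an illustration: the function names (`random_time_shift`, `random_gain`, `augmented_batches`), the shift range and the gain range are assumptions made up for this example, not code from the talk or from the linked post.

```python
import numpy as np

def random_time_shift(audio, max_shift, rng):
    """Circularly shift the waveform by up to max_shift samples in either direction."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(audio, shift)

def random_gain(audio, low_db, high_db, rng):
    """Scale the waveform by a random gain drawn uniformly in decibels."""
    gain_db = rng.uniform(low_db, high_db)
    return audio * (10.0 ** (gain_db / 20.0))

def augmented_batches(clips, labels, batch_size, rng):
    """Yield batches forever, applying fresh random augmentations each time."""
    n = len(clips)
    while True:
        idx = rng.choice(n, size=batch_size, replace=False)
        batch = np.stack([
            random_gain(random_time_shift(clips[i], 1600, rng), -6.0, 6.0, rng)
            for i in idx
        ])
        yield batch, labels[idx]

rng = np.random.default_rng(0)
clips = rng.standard_normal((8, 16000))   # 8 fake one-second clips at 16 kHz
labels = np.arange(8) % 2                 # fake binary labels
batches = augmented_batches(clips, labels, batch_size=4, rng=rng)
x, y = next(batches)
print(x.shape, y.shape)
```

Because the augmentations are drawn again for every batch, the network rarely sees exactly the same waveform twice. The batch-job variant would instead write the augmented clips to disk once, before training.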
Hi, can you please explain how to convert an MP3 audio file into a WAV file?
For a single file, use Audacity. For multiple files, you can use ffmpeg and a shell script. To do it from Python, use librosa.load and soundfile.write
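A rough sketch of the Python route could look like the following. The helper names (`wav_path_for`, `convert_mp3_to_wav`, `convert_folder`) are hypothetical. It assumes that librosa and soundfile are installed and that the installed audio backend can decode MP3, which depends on your setup.

```python
from pathlib import Path

def wav_path_for(mp3_path, out_dir):
    """Return the output path: same file stem, .wav suffix, placed inside out_dir."""
    return Path(out_dir) / (Path(mp3_path).stem + ".wav")

def convert_mp3_to_wav(mp3_path, out_dir):
    """Decode one MP3 with librosa and write it as WAV with soundfile."""
    # Imported here so the path helper above works without the audio libraries.
    import librosa
    import soundfile

    # sr=None keeps the original sample rate; mono=False keeps all channels.
    audio, sample_rate = librosa.load(mp3_path, sr=None, mono=False)
    out_path = wav_path_for(mp3_path, out_dir)
    # librosa returns (channels, samples); soundfile expects (samples, channels).
    # For mono audio the array is 1-D and the transpose is a no-op.
    soundfile.write(str(out_path), audio.T, sample_rate)
    return out_path

def convert_folder(in_dir, out_dir):
    """Convert every .mp3 file found directly inside in_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return [convert_mp3_to_wav(p, out_dir) for p in sorted(Path(in_dir).glob("*.mp3"))]
```

Calling `convert_folder("recordings", "recordings_wav")` (folder names made up here) would then write one WAV file per MP3 file.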
Thank you very much for your very informative presentation. I have a question regarding one of your slides, specifically on aggregating analysis windows. Could you please explain it further (possibly with an example)? For instance, is windows = 6 the number of segments extracted from the audio signal, or is it the length of the window (6*sampling_rate)? And what about bands=32?
Moreover, regarding the base model: is it the model that you presented on the slide before (the 3-layer CNN)? So the logic is that we take the audio signal, convert it into a sequence of windows, pass each window through SB-CNN, apply average pooling over time, and feed the output of the average pooling to the softmax to make the prediction. Is this logic correct?
Thank you in advance for your consideration.
Thank you for the great presentation. I have a question:
How can one compare one person's voice with another's?
Search for "speaker recognition". I recommend looking into pretrained models based on X-vectors or I-vectors
@@Jononor ok thanks
Thank you so much for sharing the presentation with us! I'm new to machine learning and I have some questions. Where could I download audio datasets to use for my project? Thank you in advance!
A good overview of environmental audio datasets can be found at www.cs.tut.fi/~heittolt/datasets
Interesting talk! In the example you showed, lots of the sounds are quite different from each other, e.g. the children playing, a siren, and a jackhammer. Does it also work for sounds that are very similar? For example different crow calls or different types of chimpanzee sounds?
Hi Ramon. Yes, the same basic approach can be used in such a case. Whether good results can be achieved depends on how hard the task is and how good the data is.
Great stuff.
How's the job market for this type of knowledge and skills? I am an old EE just starting a DS masters and I've turned my attention to audio classification.
Hi Chac. For audio, image, video etc type of processing - the kind of companies that before would hire for Digital Signal Processing skills are today hiring for Machine Learning. If you have an EE background with skills around embedded systems, that is a very good complement for many such companies. At the moment the demand for ML engineers is high - many are trying to build new ML-based products and functionality - and there is a lack of skilled people. So pretty good I would say - but you need to go for the places that match your skill profile. A master's degree will set you apart from the large number of self-learners, in terms of demonstrated qualifications.
@@Jononor Thank you very much for that. Much appreciated.
Jon, hate to bug you again, but I am actually kinda serious about this. My DS program is not really geared toward 'TinyML', so I need to supplement it with other learning. What online program or set of courses would you recommend to get into 'TinyML'?
@@chacmool2581 There is a TinyML book. I have not read it, but it is probably a good start. The TinyML YouTube channel has many good talks, but they are on bleeding edge research - not a pedagogical resource. But apart from the usual embedded/DSP topics, the main part of TinyML is computationally efficient and small models. So focus on understanding how to choose and optimize for such models. For CNNs my master's thesis has some pointers on that
@@chacmool2581 Also, do a few practical projects. Get an ESP32 board and build something fun (does not have to be useful)
I'm new to machine learning and I feel like I have watched so many audio machine learning videos, and the tips & tricks section at the end of this one is the most practical and unique stuff I've seen. Thanks! Does the simple audio recognition tutorial by TensorFlow still exist? I can't seem to find it. Also, in the audio augmentation slide you talk about adding noise to your data for the benefit of the model, but in the Q&A you talk about how de-noising is helpful. Could you clarify the different cases where you would use each?
Hi Peter. The TensorFlow simple audio tutorial still exists, but they keep moving it around and renaming it. Currently it is called "Simple audio recognition: Recognizing keywords" at www.tensorflow.org/tutorials/audio/simple_audio
Training with noise via data augmentation is almost always beneficial (a possible exception is if one of your classes is very noise-like). Given sufficient data, this will work well, and it is the simplest solution. However, if 1) one has a small amount of data and 2) there are well-known denoising methods that work well for the case, denoising may be worth a try. One use case where I have seen a denoising step work well is bird audio spotting in remote monitoring (forests etc.) - here it is often very quiet and the noise floor can be significant. The noise may be that of the microphones and electronics themselves, which is near constant and relatively simple to denoise.
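For the noise augmentation side, one common way to control how much noise is added is to mix at a chosen signal-to-noise ratio. The sketch below is only an illustration of that idea; the function name `add_noise_at_snr`, the 440 Hz test tone and the 10 dB value are assumptions for this example, not settings from the talk.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Mix noise into signal so the result has the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the noise power so that signal_power / noise_power == 10^(snr_db / 10).
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return signal + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)        # one second of a 440 Hz tone
noise = rng.standard_normal(clean.shape)     # white noise; could be a recorded background
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```

In practice the SNR would typically be drawn at random per training example, and the noise could come from recordings of the target environment rather than a random generator.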
Hello Jon, you did a great presentation. Thanks for sharing.
I am working on my master's thesis, specifically on lung sound classification using a CNN.
I am using MFCC features. I am getting about 88% accuracy.
Do you think that a mel-spectrogram can give a higher accuracy than 88%?
Hi Idrisse! Thank you. Yes, I think that using a mel-spectrogram instead of MFCC might give you a slight increase in performance for your use case. At least it is worth trying out!
@@Jononor thank you
@@Jononor thanks sir,
I would like to ask something, please bear with me.
Step 1: The original dataset has 177 samples (3 classes, each class has 59 audio files).
Because of the small size of the data, I did data augmentation.
Step 2: After data augmentation, I extracted MFCC features from the audio files, with their respective labels, in order to create a useful dataset.
Step 3: I split the new dataset into training, validation and testing sets.
Step 4: I fed the CNN with the training and validation sets for the training process.
Step 5: I evaluated the CNN with the testing set, and we are able to reach an accuracy of around 90-93%.
Is it correct (logical) to test the model with the testing data that I got in step 3? Or should I split the data into training and testing sets before doing the data augmentation? Doing so, I got an accuracy of around 40-43%.
Thanks a lot for replying to me.
@@idrisseahamadiabdallah7669 The testing set should be kept unmodified. Data augmentation should only be applied to the training set. It sounds like your data augmentation may have introduced bigger changes than planned. Check the statistics of the data: they should still be very similar between augmented train and original train/test, otherwise you will run into trouble.
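The order of operations can be sketched as follows: split first, then augment only the training part. Everything here is a stand-in for illustration. The random arrays replace real recordings, `augment` is a placeholder gain change, and the 20% test fraction and 4 augmented copies are arbitrary assumptions.

```python
import numpy as np

def split_indices(n_items, test_fraction, rng):
    """Shuffle the item indices and split them into disjoint train and test sets."""
    order = rng.permutation(n_items)
    n_test = int(round(n_items * test_fraction))
    return order[n_test:], order[:n_test]

def augment(clip, rng):
    """Placeholder augmentation (small random gain); swap in real audio augmentations."""
    return clip * rng.uniform(0.8, 1.2)

rng = np.random.default_rng(0)
clips = rng.standard_normal((177, 1000))   # 177 fake recordings, as in the question
labels = np.repeat(np.arange(3), 59)       # 3 classes with 59 files each

# 1) Split the ORIGINAL recordings first.
train_idx, test_idx = split_indices(len(clips), 0.2, rng)

# 2) Augment only the training recordings.
copies = 4
augmented = [np.stack([augment(c, rng) for c in clips[train_idx]]) for _ in range(copies)]
train_x = np.concatenate([clips[train_idx]] + augmented)
train_y = np.tile(labels[train_idx], copies + 1)

# 3) The test set stays untouched.
test_x, test_y = clips[test_idx], labels[test_idx]
print(train_x.shape, test_x.shape)
```

If the split is done after augmentation instead, augmented copies of the same original recording can land in both train and test, which inflates the test accuracy.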
@@Jononor okay, I understand, thanks a lot.
One other question. Do you think that the 177 WAV files may be enough to train a CNN model effectively?
I was quite surprised that for classification you didn't feed the feature embeddings of the windows to an RNN and instead just used a post-processing trick. Wouldn't an RNN work better? What about a transformer? Also, I know that mel spectrograms work better than just feeding raw audio, but how much better? Is it like +5% accuracy or is it game changing?
nvm 😅 both of these questions were answered at the end. Another question that came to mind though is: what about speech recognition models or something similar, are spectrogram-based models still dominating or is it a different story?
Temporal aggregation using mean or majority voting is simple and works pretty well. It can be done with an RNN, or AutoPool, or an attention function - and it can increase performance a bit
Whether mel-spectrogram or raw audio works best depends on the task and dataset. It is much more challenging, and more data intensive, to make a system that learns from raw audio - but it sometimes performs better once it works. Though combining both tends to work the best. Not always worth the complexity though
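To illustrate the mean and majority-voting aggregation mentioned above, here is a small NumPy sketch. The probabilities are made-up numbers for 6 analysis windows and 3 classes, and the function names are hypothetical.

```python
import numpy as np

def mean_aggregate(window_probs):
    """Average the per-window class probabilities, then pick the most likely class."""
    return int(np.argmax(window_probs.mean(axis=0)))

def majority_vote(window_probs):
    """Let each window vote for its top class, then pick the class with the most votes."""
    votes = np.argmax(window_probs, axis=1)
    counts = np.bincount(votes, minlength=window_probs.shape[1])
    return int(np.argmax(counts))

# Rows are analysis windows of one clip, columns are classes (made-up numbers).
window_probs = np.array([
    [0.40, 0.35, 0.25],
    [0.45, 0.40, 0.15],
    [0.05, 0.90, 0.05],
    [0.50, 0.30, 0.20],
    [0.10, 0.85, 0.05],
    [0.45, 0.35, 0.20],
])

print(mean_aggregate(window_probs))   # 1
print(majority_vote(window_probs))    # 0
```

The two methods disagree on this example: four windows weakly prefer class 0, while two windows are very confident about class 1. Mean aggregation takes that confidence into account, while voting only counts the winners.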
@@Jononor jesus, that was quick XD
thank you so much for the reply! I really appreciate it. and that was great presentation btw. It was very easy to follow.
I hope you have a nice day ma, cheers :D.
@@xXDarQXx Thank you :) Happy learning, have a nice day!
I really like your presentation. Thank you very much. Since I'm trying to classify sound for my project now, could I ask you some more questions?
Just ask here, or create Stack Overflow questions and link them here. Then I can respond :)
Could you explain the mel spectrogram in more detail, more mathematically?
@@tranthanh3060 here is a good intro, haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
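As a small taste of the math, the core of the mel spectrogram is the mel scale, which warps frequency so that bands are narrow at low frequencies and wide at high frequencies. The sketch below uses one common (HTK-style) formula; other libraries may use a slightly different variant, and the function names and the 32 bands / 8000 Hz values are just assumptions for this example.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_bands, f_min, f_max):
    """Center frequencies (Hz) of n_bands filters spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_bands)]

centers = mel_band_centers(n_bands=32, f_min=0.0, f_max=8000.0)
spacing = [b - a for a, b in zip(centers, centers[1:])]
print(round(centers[0]), round(centers[-1]))
```

A mel spectrogram is then obtained by summing the power spectrum of each short-time frame under (typically triangular) filters placed at these center frequencies, usually followed by a logarithm.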
@@Jononor Thank you so much for your prompt response, this is exactly what I need. Hope you have a nice day!
i am here again, one question. Why don't you upload audio processing videos weekly ? Thanks !!!!!!
Several reasons. But the main one is that I do not have the time right now. It takes around 10 hours to make a 10 minute lecture with solid content.
@@Jononor you are right! It's hard and sometimes a headache haha, anyway loved the old content!
Sir can you share the code of your model?
Hi Saleem. You can find the code here, github.com/jonnor/ESC-CNN-microcontroller
@@Jononor thank you so much sir
Fantastic!!!! **O** GrEAT insight! Thank you!