Great talk. Thanks!
Thanks for the video; it was excellent and I learned a great deal. I'd suggest, though, that you split out the test data _before_ you apply the under/over sampling algorithm (to the train data only). That would give a much better comparison of the algorithms, showing how they perform on the unmodified test data.
Thanks. The train/test split is the first step (see cell number 2 in the notebook) and none of the under/over sampling methods are applied to the test set. The performance comparison is indeed on the unmodified test data.
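For anyone reading along, here is a minimal sketch of that workflow (this is not the actual notebook code; the dataset and classifier are just placeholders): split first, resample the training set only, then score on the untouched test set.

```python
# Sketch only: split -> resample train only -> evaluate on unmodified test data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Placeholder imbalanced dataset (roughly 5% positives).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Step 1: hold out the test set before any resampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Step 2: oversample the minority class in the training data only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Step 3: fit on the resampled train data, compare on the unmodified test data.
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```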
I just noticed that as I was going through your notebook on GitHub (thanks for uploading!) and was going to edit my comment. Yes, that makes perfect sense. What initially confused me was that the graphs show the decision boundary on the train data (I was thinking it was the test data).
I do like the graphs showing the decision boundary on the train data, since they show how the under/over sampling algorithms modify the data. I forked the notebook and am going to add plots of the decision boundary on the test data as well.
Yes, the idea was to show how the changes in the data distribution affect the decision boundary.
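In case it is useful for the fork mentioned above, here is a rough sketch of a decision boundary plot (purely illustrative; the helper name is my own, not from the notebook, and it assumes two-dimensional features so the grid can be drawn).

```python
# Illustrative helper: shade the classifier's predicted regions over a grid
# and overlay the points, so the same fitted model can be plotted against
# the resampled train data and the unmodified test data.
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(clf, X, y, title):
    # Build a grid covering the data and predict the class at every grid point.
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
    plt.title(title)
    plt.show()

# With a fitted classifier `clf` and 2-D data from the split/resampling step:
# plot_decision_boundary(clf, X_res, y_res, "Decision boundary on resampled train data")
# plot_decision_boundary(clf, X_test, y_test, "Decision boundary on unmodified test data")
```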
Thank you for your presentation! Could you please upload the code, for example as a notebook file?
Thanks! Here is a link to the notebook github.com/irreducible/PyData-Resampling/blob/master/PyData-Resampling-nb.ipynb and the slides www.slideshare.net/AjinkyaMore3/python-resampling
I found it with a bit of digging:
github.com/irreducible/PyData-Resampling/blob/master/PyData-Resampling-nb.ipynb
LOL . . . I should have refreshed the comments before posting my comment. :-)
Many thanks to you, Mr. Ajinkya More.
Great presentation!
Great video, Thanks.
Where can I get the slides you used?
(I found your paper on arXiv, but it doesn't have the code)
Thanks! Here is a link to the notebook github.com/irreducible/PyData-Resampling/blob/master/PyData-Resampling-nb.ipynb and the slides www.slideshare.net/AjinkyaMore3/python-resampling
Optimizing an arbitrary metric is rather useless for business. In particular, what is the business meaning of optimizing for precision of the normal cases? Something like alarms per month may well be meaningful, but that would be Recall(pos)/Prec(pos).
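For what it's worth, a back-of-the-envelope sketch of how precision and recall on the positive class can be turned into an alarms-per-month estimate (all volumes here are made-up numbers, and the predictions are placeholders standing in for a model's output on the untouched test set):

```python
# Sketch: translate Prec(pos)/Recall(pos) into an expected alarm volume.
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Placeholder labels/predictions just to make the snippet runnable.
y_test = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

precision_pos = precision_score(y_test, y_pred, pos_label=1)
recall_pos = recall_score(y_test, y_pred, pos_label=1)

cases_per_month = 10_000   # hypothetical monthly case volume
base_rate = 0.05           # hypothetical fraction of cases that are real anomalies

# Real anomalies caught = volume * base rate * recall; total alarms = caught / precision.
expected_true_alarms = cases_per_month * base_rate * recall_pos
expected_total_alarms = expected_true_alarms / precision_pos
print(f"~{expected_total_alarms:.0f} alarms/month, of which ~{expected_true_alarms:.0f} are real")
```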