What an amazing video! I'll have to re-watch it to understand all the concepts and read the code many times but I still like your explanation.
Wow this lady sure explains this well!
Wow, thanks so much for uploading both the video and code. It's so helpful I really wish my professors would explain this well!
Where is the code? Can you give me the code?
@@AkashVerma-em1gk this is the code: github.com/aprilypchen/depy2016
Where is the code? I couldn't find it either in the description or the comments.
Thanks for the video! Please publish the steps in a Databricks notebook, so that it would be more useful for practice.
Such a gem 💎😍
Thank you so much!
If you use a scikit-learn version higher than 0.21:
the import changed from sklearn.preprocessing import Imputer to from sklearn.impute import SimpleImputer,
and its missing_values parameter now takes np.nan instead of 'NaN'.
These are the problems I have faced so far.
this worked for me:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
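And for anyone who wants the full round trip, a minimal usage sketch on a made-up toy frame (the column names are illustrative, not from the talk):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing entries.
df = pd.DataFrame({'age': [25, np.nan, 40], 'hours': [40, 38, np.nan]})

imp = SimpleImputer(missing_values=np.nan, strategy='median')
# fit_transform learns each column's median and fills the NaNs with it.
df[['age', 'hours']] = imp.fit_transform(df[['age', 'hours']])
print(df)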
Very nicely presented. Nice job April.
I would also like to take a look at the dataset used, if possible, to do some tweaking as well! Thanks for the presentation, though.
datahub.io/machine-learning/adult#resource-adult
Question (16:52): I think using either the interquartile range or the standard deviation to detect outliers rests on an assumption of normality, since they are interchangeable through the z table. Why does the IQR not assume normality while the SD does?
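For reference, here are the two rules side by side in a minimal sketch on made-up data. Computing quartiles makes no distributional assumption, although the 1.5 multiplier is Tukey's convention, calibrated so the fences sit near ±2.7 standard deviations under normality; the |z| > 3 rule leans on normality more directly for its "rareness" interpretation:

import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 10, 1000), [150, -40])  # two planted outliers

# IQR fence (Tukey): quartiles need no normality to compute.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# z-score rule: "rare beyond 3 sigma" is a normal-distribution statement.
z = (x - x.mean()) / x.std()
sd_outliers = x[np.abs(z) > 3]

print(iqr_outliers, sd_outliers)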
Thanks for all this information, very useful. Please, how can I access the first part of your presentation? Thanks again.
April: pre-modelling doesn't get enough ATTENTION
Deep learners: hmm, interesting
Thank you for sharing the notebook, it really helps a lot for a beginner like me.
Hi, do you know where the link for this notebook is? Thanks
Thank you, a brilliant, well-explained presentation; it helped me massively.
Great presentation! Thank you!
Wow that helped me sooooooooooooo much, thank you!!
For multivariate outlier detection with mixed variables you should use the Mahalanobis distance instead of boxplots...
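A minimal sketch of what that looks like, on made-up numeric data (the 97.5% chi-squared cutoff is a common but illustrative choice):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)

# Squared Mahalanobis distance of each row from the centroid.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Under multivariate normality, d2 follows a chi-squared with p dof.
cutoff = chi2.ppf(0.975, df=X.shape[1])
outliers = X[d2 > cutoff]
print(len(outliers))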
Notebook for this tutorial : github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb
Thank you :)
Shouldn't we have imputed missing values BEFORE dummying up the data? I'm assuming once it is dummied, the imputer takes the median/mean of the 0s and 1s, but will not impute the "true" mean. I am not sure. Can someone please elaborate?
I have this same question.
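One way to sidestep the worry: impute the numerics first, then dummy. A minimal sketch on a toy stand-in for the adult data (the column names are illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'workclass': ['Private', None, 'State-gov', 'Private'],
})

# 1) Impute numeric columns first, so the median comes from real ages,
#    never from 0/1 dummy columns.
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])

# 2) Give missing categories an explicit label...
cat_cols = df.select_dtypes(exclude=[np.number]).columns
df[cat_cols] = df[cat_cols].fillna('Missing')

# 3) ...and only then dummy up.
df = pd.get_dummies(df, columns=list(cat_cols))
print(df)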
Why do you multiply the IQR by 1.5? 17:00
Is the Jupyter notebook available somewhere?
Thanks for this talk.
Is there any access to the notebook?
found: github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb
@@SiphesihleY thanks a lot
@@SiphesihleY Thank you!
Thank you so much for the notebook.
Thanks a lot for such an explanation, it really helped me. Can you please share the link for the dataset?
you had me at df[income] = [0 if x
nice vid thanks a lot
the explanation and the code are simply great
At 27:34, n=10 is chosen arbitrarily. But what are the limits on n? For example, it cannot be larger than the number of features, right?
You wouldn't ideally want to exceed the number of features if your goal is dimensionality reduction.
It cannot be larger, yes. These are hyperparameters, and as far as I know choosing them is a trial-and-error method. If you come across any other way, please post it as a reply :) Thanks
Each component explains some of the variability of the data. If a dataset with 10 features has a lot of correlation between features, such that only 2 components can account for most of the variation in the data (say 90%), then you would be happy to set PCA(n_components=2) and lose the 10% variability in exchange for shrinking the dataset by 8 dimensions.
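To make that concrete, a minimal sketch that picks n_components from the cumulative explained-variance ratio instead of guessing (the 90% cutoff and the digits data are illustrative):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Fit once with all components to see how much variance each explains.
cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# Smallest n that retains at least 90% of the variance.
n = int(np.searchsorted(cum_var, 0.90) + 1)
X_reduced = PCA(n_components=n).fit_transform(X)
print(n, X_reduced.shape)

(Newer scikit-learn also accepts a float, e.g. PCA(n_components=0.90), and picks n for you.)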
What will na_values = ['#NAME?'] do in pd.read_csv?
In the adult dataset, missing values are represented by " ?" (note the space before the ?). So does ['#NAME?'] indicate some kind of regular expression matching particular values related to "?"?
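It's not a regex; na_values entries are literal strings, and any cell that exactly matches one of them is read in as NaN ('#NAME?' is just an Excel error string that some exports contain). A minimal self-contained sketch in the adult file's style:

import io
import pandas as pd

# Toy rows: fields carry a leading space, and a missing value is ' ?'.
csv = 'age,workclass\n39, State-gov\n50, ?\n'

# Any cell exactly equal to ' ?' becomes NaN; no pattern matching involved.
df = pd.read_csv(io.StringIO(csv), na_values=[' ?'])
print(df)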
Great overview!
Thanks a lot!!!! this is very useful!!!
Amazing video , thanks
Thank you! Going through a data science bootcamp now; let's just say their explanation of this was less than adequate, but yours is not!
Very nice tutorial... but I have a question: is there a way to automatically check whether a feature's unique categories are overly imbalanced? In your example, "native_country" had imbalanced categories, with "United States" outnumbering all the others, but the code already assumed "native_country" has that problem. I want a program that checks every feature for imbalance in its category counts and handles it at the same time (e.g., changing low-frequency categories to "Others"). Thanks in advance...
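Not built in as far as I know, but a small helper gets close; a minimal sketch, with the 10% threshold purely illustrative:

import pandas as pd

def lump_rare_categories(df, min_frac=0.1, other_label='Others'):
    # Replace categories rarer than min_frac of rows with other_label,
    # in every object-typed (categorical) column.
    df = df.copy()
    for col in df.select_dtypes(include=['object']).columns:
        freq = df[col].value_counts(normalize=True)
        rare = freq[freq < min_frac].index
        df[col] = df[col].where(~df[col].isin(rare), other_label)
    return df

# Toy example: 'United-States' dominates, the rest get lumped together.
toy = pd.DataFrame({'native_country': ['United-States'] * 18 + ['India', 'Peru']})
print(lump_rare_categories(toy)['native_country'].value_counts())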
The link in the description is not working; can you please provide a GitHub or Jupyter notebook link as well?
Thanks April, very explanatory and very easy to understand, even for a non-native speaker. Plus, you are so pretty.
@April Chen Do you have any suggestions on how to handle multiclass classification problems?
Let's say we have 14 different items to be predicted from 200 unique combinations of items in another column of the same dataset. I'd appreciate your suggestion.
Sorry to say, there's no solution in ML; I think you can do that with a CNN.
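For what it's worth, most scikit-learn classifiers do handle multiclass targets out of the box; a minimal sketch on made-up data shaped like the question (200 rows, 14 possible labels):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # stand-in features
y = rng.integers(0, 14, size=200)      # 14 classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))           # accuracy across all 14 classes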
very useful demo ... thanks
Is the dataset publicly available?
Thank you for this talk. This is very helpful.
@April Chen thank you very much! hope you have a great day!
datahub.io/machine-learning/adult#resource-adult
great clarity
The first video is amazing; it has cleared up a lot of confusion. Can anyone share the link to the second video as well?
Did you find the link for the second video? If yes, please share it.
this is amazing
Good explanation, thanks )
Excellent one
Lots of help! Thanks! Anything else you can teach me/us?
Does anyone have this project's Git repo?
The notebook cannot be accessed; can you please provide a valid link? Thanks
github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb
Thanks a lot!
Could anyone please share the Github for this notebook. Thank you.
link is in the description...
Hi, I get the following error when I visit the link in the description:
Internal Error
Ticket issued
Please help.
Link: github.com/aprilypchen
Excellent
How do I get the dataset?
datahub.io/machine-learning/adult#resource-adult
Normalize leaving the speaker's GitHub and LinkedIn accounts in the description.
@12:50 she said "along the columns" for axis=0; is that correct??
Yes, it is. With axis=0 the (old) Imputer computes its statistic down each column, so she will be imputing along the columns.
notebook can be found here: github.com/aprilypchen/depy2016
Nice code
So helpful, thank you!
source code link?
github.com/aprilypchen/depy2016
Where's the data?
code link:
github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb
22:04 Rule 1: NEVER join the measured points with lines.
In this data set, why not? I mean, I understand that it's looking at a distribution of points per category... it just feels like something to help the viewer understand the trends.
@@avatar098 Oh, it's a visualization aid; I thought the relationship between the points was linear.
Why is nobody covering REUSABILITY??? How am I going to reuse this dummy encoding on a new dataset? Why is everyone overlooking this?
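One common answer: keep the training dummy columns and reindex any new data onto them; a minimal sketch on toy frames (scikit-learn's OneHotEncoder with handle_unknown='ignore' is the more robust route):

import pandas as pd

train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private']})
new = pd.DataFrame({'workclass': ['Private', 'Never-worked']})  # unseen level

train_d = pd.get_dummies(train)

# Reuse the training encoding: unseen categories drop out,
# and dummy columns the new data lacks are re-created as all zeros.
new_d = pd.get_dummies(new).reindex(columns=train_d.columns, fill_value=0)
print(new_d)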
That uptalk is so annoying.
Good features and a well-chosen example, but not a user-friendly program (you have to write lines of code rather than choose functions) and a very poor teaching technique. She is knowledgeable but does not transfer knowledge well to non-experts. A conference should assume non-expert participants.
She doesn't do a good job of explaining why she is dummying everything up at the beginning, nor does she explain well what she's doing along the way. I just started my ML course, having finished my intro to analytics. I will keep watching, but I keep having to backtrack every 2 minutes because she doesn't explain the steps well.
Dude, the woman seems to have some anxiety problems; it's coming through in her voice.