Pre-Modeling: Data Preprocessing and Feature Exploration in Python

  • Published on 31 Dec 2024

Comments • 95

  • @JOKBO1
    @JOKBO1 7 years ago +6

    What an amazing video! I'll have to re-watch it to understand all the concepts and read the code many times but I still like your explanation.

  • @prafulmaka7710
    @prafulmaka7710 5 years ago +3

    Wow this lady sure explains this well!

  • @ausseriridische976
    @ausseriridische976 7 years ago +5

    Wow, thanks so much for uploading both the video and code. It's so helpful I really wish my professors would explain this well!

    • @AkashVerma-em1gk
      @AkashVerma-em1gk 6 years ago +1

      where is the code? can you give me code?

    • @issahsamori3398
      @issahsamori3398 5 years ago +1

      @@AkashVerma-em1gk this is the code: github.com/aprilypchen/depy2016

    • @Aliraza-fb5jx
      @Aliraza-fb5jx 1 year ago

      Where is the code? I couldn't find it either in the description or the comments.

  • @sanketdash7292
    @sanketdash7292 5 years ago +2

    Thanks for the video! Please publish the steps in a Databricks notebook so that it would be more useful for practice.

  • @kennethOdoh1
    @kennethOdoh1 9 months ago

    Such a gem 💎😍
    Thank you so much!

  • @mustafaismaileyi
    @mustafaismaileyi 4 years ago +7

    If you use a scikit-learn version higher than 0.21,
    the import from sklearn.preprocessing import Imputer has changed to from sklearn.impute import SimpleImputer,
    and the missing_values parameter of this class takes np.nan instead of 'NaN'.
    These are the problems that I have faced so far.

    • @RyanJamesMcCall
      @RyanJamesMcCall 3 years ago +3

      this worked for me:
      import numpy as np
      from sklearn.impute import SimpleImputer
      imp = SimpleImputer(missing_values=np.nan, strategy='median')
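
To see the replacement class end to end, here is a minimal sketch on a made-up matrix (the data and variable names are assumptions for illustration, not taken from the talk's notebook). It fits a median imputer and fills each gap with the median of its own column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (made up for illustration); gaps are marked with np.nan.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# Replace each missing value with the median of its own column.
imp = SimpleImputer(missing_values=np.nan, strategy="median")
X_imputed = imp.fit_transform(X)

print(imp.statistics_)  # per-column medians learned during fit
print(X_imputed)
```

The fitted `imp` object can be reused with `imp.transform(...)` on new data so that the same medians are applied there.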

  • @koumospecial
    @koumospecial 6 years ago +1

    Very nicely presented. Nice job April.

  • @muyanjassenyonga3398
    @muyanjassenyonga3398 6 years ago +1

    I would also like to take a look at the dataset used, if possible, to do some tweaking as well! Thanks for the presentation, though.

    • @Nehmaiz
      @Nehmaiz 5 years ago

      datahub.io/machine-learning/adult#resource-adult

  • @joehsiao6224
    @joehsiao6224 4 years ago

    Question (16:52): I think using either the interquartile range or the standard deviation to detect outliers assumes normality, since the two are interchangeable through the z-table. Why does the IQR not assume normality while the SD does?
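
One way to explore this question is to run both rules on the same small sample. The sketch below is an illustration under assumptions: the data are made up, the function names are hypothetical, and the cut-offs (1.5 times the IQR, 3 standard deviations) are the conventional defaults rather than values confirmed from the talk.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def sd_outliers(x, z=3.0):
    """Flag points more than z standard deviations from the mean."""
    return np.abs(x - x.mean()) > z * x.std()

# Made-up sample: nine ordinary values and one extreme value.
x = np.array([10, 11, 12, 12, 13, 13, 14, 15, 16, 100], dtype=float)

print(iqr_outliers(x).sum())  # number of points flagged by the IQR rule
print(sd_outliers(x).sum())   # number of points flagged by the SD rule
```

On this sample the IQR rule flags the value 100, while the 3-SD rule flags nothing: the extreme value inflates the mean and the standard deviation that the rule itself depends on, whereas the quartiles barely move.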

  • @0AlHidaya0
    @0AlHidaya0 4 years ago +2

    Thanks for all this information, very useful. How can I access the first part of your presentation, please? Thanks again.

  • @hamman_samuel
    @hamman_samuel 3 years ago

    April: pre-modelling doesn't get enough ATTENTION
    Deep learners: hmm, interesting

  • @sanacme
    @sanacme 7 years ago

    Thank you for sharing the notebook, it really helps a lot for a beginner like me.

    • @rainbowdu509
      @rainbowdu509 6 years ago +1

      Hi, do you know where the link for this notebook is? Thanks

  • @sineadgillespie-mccracken140
    @sineadgillespie-mccracken140 4 years ago

    Thank you, a brilliant, well-explained presentation. It helped me massively.

  • @wsgsantos
    @wsgsantos 6 years ago +3

    Great presentation! Thank you!

  • @joseolivio2348
    @joseolivio2348 6 years ago +4

    Wow that helped me sooooooooooooo much, thank you!!

  • @mdazizulislam9653
    @mdazizulislam9653 4 years ago +1

    For multivariate outlier detection you should use the Mahalanobis distance for mixed variables instead of using a boxplot.
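
As a rough illustration of the suggestion above, the sketch below computes Mahalanobis distances with NumPy. It is a minimal example under assumptions: the data are synthetic, the function name is hypothetical, and it handles numeric features only (it does not address categorical or mixed columns).

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the sample mean, scaled by the covariance."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    # pinv tolerates a (near-)singular covariance matrix.
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    # Row-wise quadratic form diff_i^T * inv_cov * diff_i.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
# Synthetic data: two strongly correlated features...
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200)])
# ...plus one point that is ordinary on each axis but breaks the correlation.
X = np.vstack([X, [1.5, -1.5]])

d = mahalanobis_distances(X)
print(d.argmax())  # index of the most unusual row
```

The appended point would pass a per-feature boxplot check, since 1.5 and -1.5 are unremarkable values on their own, but it stands out once the correlation between the two features is taken into account.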

  • @modakad
    @modakad 4 years ago +5

    Notebook for this tutorial : github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb

    • @KoLMiW
      @KoLMiW 3 years ago +1

      Thank you :)

  • @fatimak6440
    @fatimak6440 2 years ago +1

    Shouldn't we have imputed missing values BEFORE dummying up the data? I'm assuming that once it is dummied, the imputer takes the median/mean of the 0s and 1s but will not impute the "true" mean. I am not sure. Can someone please elaborate?

    • @leassis91
      @leassis91 2 years ago +1

      i have this same question
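
A small experiment can help with this question. The sketch below uses a made-up frame (column names borrowed from the adult dataset as an assumption) to show what `pd.get_dummies` does with a missing category. The thread does not confirm whether the talk's notebook relies on this behavior.

```python
import numpy as np
import pandas as pd

# Made-up frame: one categorical and one numeric column, each with a gap.
df = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "age": [39.0, 50.0, np.nan, 28.0],
})

# By default get_dummies encodes a missing category as an all-zero row,
# so no NaN survives in the dummy columns.
dummies = pd.get_dummies(df["workclass"], prefix="workclass")

# dummy_na=True adds an explicit indicator column for missing values instead.
dummies_with_na = pd.get_dummies(df["workclass"], prefix="workclass", dummy_na=True)

# The numeric column still holds NaN, so median imputation applies there.
age_filled = df["age"].fillna(df["age"].median())

print(dummies)
print(dummies_with_na)
print(age_filled)
```

Under these assumptions, an imputer run after dummy encoding only has work to do on the numeric columns; the dummy columns contain no missing values left to fill. To impute a category itself (for example with the most frequent level), the imputation has to happen before encoding.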

  • @tanteriaaa
    @tanteriaaa 2 years ago

    Why do you multiply the IQR by 1.5? 17:00
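
The multiplier 1.5 is a convention (Tukey's fences) rather than something derived from the data. One way to get a feel for it is to ask where the fences would fall if the data happened to be normally distributed. The sketch below does that arithmetic with the standard library; the normality here is an assumption made only for the comparison.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of a standard normal distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Tukey's upper fence with the conventional multiplier of 1.5.
upper_fence = q3 + 1.5 * iqr

# Share of a normal population that falls outside both fences.
outside = 2 * (1 - std_normal.cdf(upper_fence))

print(round(upper_fence, 2))  # about 2.7 standard deviations
print(round(outside, 4))      # about 0.007, i.e. roughly 0.7% of points
```

So for normal data the 1.5 rule sits between a 2-SD and a 3-SD cut-off, flagging roughly 7 points in 1,000.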

  • @tonyhathuc
    @tonyhathuc 3 years ago +1

    is the jupyter notebook available somewhere?

  • @MahmoudOuf
    @MahmoudOuf 5 years ago +2

    Thanks for this talk,
    Is there any access to the notebook ?

    • @SiphesihleY
      @SiphesihleY 5 years ago +9

      found: github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb

    • @jyotirmayavasaniwal5
      @jyotirmayavasaniwal5 5 years ago

      @@SiphesihleY thanks a lot

    • @DHAiRYA2801
      @DHAiRYA2801 4 years ago

      @@SiphesihleY Thank you!

    • @Aliraza-fb5jx
      @Aliraza-fb5jx 1 year ago

      Thank u so much for the notebook

  • @akankshamishra27
    @akankshamishra27 5 years ago +2

    Thanks a lot for such an explanation, it really helped me. Can you please share the link for the dataset?

  • @danielprytz
    @danielprytz 2 years ago +1

    you had me at df[income] = [0 if x

  • @Muhammadalbasrawe
    @Muhammadalbasrawe 7 years ago

    nice vid thanks a lot
    the explanation and the code are simply great

  • @sdoken
    @sdoken 7 years ago

    At 27:34, n=10 is chosen arbitrarily. But what are the limits on n? For example, it cannot be larger than the number of features, right?

    • @prince.harshan
      @prince.harshan 7 years ago +1

      You wouldn't ideally want to exceed the number of features if your goal is dimensionality reduction.

    • @TheDeltatoGrowth
      @TheDeltatoGrowth 7 years ago +2

      It cannot be larger, yes. These are hyperparameters passed and as far as I know, it is a trial-error based method. If you come across any other way, please post it as a reply :) Thanks

    • @tulasijamun3234
      @tulasijamun3234 6 years ago +1

      Each component explains some of the variability in the data. If a dataset with 10 features has so much correlation between features that only 2 components account for most of the variation (say 90%), then you would be happy to set PCA(n_components=2) and lose 10% of the variability in exchange for reducing the dataset by 8 dimensions.
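
The trade-off described in this thread can be inspected directly through scikit-learn's `explained_variance_ratio_`. The sketch below is an illustration on synthetic data (the data-generating setup and the 90% threshold are assumptions); it fits PCA with all components and finds how many are needed to reach the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by only 2 underlying signals.
signals = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = signals @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit with every component to see how the variance is distributed.
pca = PCA(n_components=X.shape[1]).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)

print(cumulative.round(3))
print(n_needed)
```

In scikit-learn, `n_components` can be at most `min(n_samples, n_features)`. Passing a fraction, as in `PCA(n_components=0.90, svd_solver="full")`, asks the library to pick the number of components that reaches that share of variance.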

  • @statchaitya6139
    @statchaitya6139 8 years ago

    What will na_values = ['#NAME?'] do in pd.read_csv?
    In the adult dataset, missing values are represented by " ?" (note the space before the ?). So does ['#NAME?'] indicate some kind of regular expression matching particular values related to ?
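
In pandas, `na_values` is a list of exact strings to treat as missing in addition to the defaults; it is not a regular expression. The sketch below illustrates this on made-up CSV text in the style of the adult dataset (the rows are invented, and reading " ?" via `skipinitialspace=True` is one possible approach, not necessarily what the talk's notebook does).

```python
import io
import pandas as pd

# Made-up CSV text: fields are separated by ", " and a missing value
# appears either as "#NAME?" or as "?".
raw = io.StringIO(
    "age, workclass, occupation\n"
    "39, State-gov, #NAME?\n"
    "50, ?, Exec-managerial\n"
)

df = pd.read_csv(
    raw,
    na_values=["#NAME?", "?"],  # exact strings to treat as missing, not patterns
    skipinitialspace=True,      # drop the space after each comma so " ?" reads as "?"
)

print(df.isna().sum())
```

With only `na_values=['#NAME?']`, a cell containing "?" would be kept as an ordinary string, so it has to be listed separately if it should count as missing.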

  • @chaeralove
    @chaeralove 5 years ago +1

    Great overview!

  • @madelcamp
    @madelcamp 7 years ago +1

    Thanks a lot!!!! this is very useful!!!

  • @raedtabani7669
    @raedtabani7669 4 years ago

    Amazing video , thanks

  • @martinrussell1404
    @martinrussell1404 6 years ago +5

    Thank you! Going through a data science bootcamp now. Let's just say their explanation of this was less than adequate; yours is not!

  • @edgarpanganiban9339
    @edgarpanganiban9339 6 years ago

    Very nice tutorial. But I have a question: is there a way to automatically check whether the unique categories of a feature are overly imbalanced? In your example, "native_country" had imbalanced categories, with "United States" outnumbering the others. But your code already assumed that "native_country" had that problem. I want a program that checks every feature for imbalance in its categories and handles it at the same time (e.g. changing low-frequency categories to "Others"). Thanks in advance.
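
One possible shape for such a helper is sketched below. It is a hypothetical function, not code from the talk: the name `collapse_rare_categories`, the 5% threshold, and the "Other" label are all assumptions, and it treats every non-numeric column as a plain text categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def collapse_rare_categories(df, min_share=0.05, other_label="Other"):
    """Return a copy where infrequent levels of each text column become other_label."""
    out = df.copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            continue  # only text-like columns are treated as categorical here
        # Relative frequency of each level in this column.
        shares = out[col].value_counts(normalize=True)
        rare = shares[shares < min_share].index
        # Keep frequent levels, replace rare ones with the catch-all label.
        out[col] = out[col].where(~out[col].isin(rare), other_label)
    return out

# Made-up sample dominated by one country.
df = pd.DataFrame({
    "native_country": ["United-States"] * 95 + ["Mexico"] * 3 + ["India"] * 2,
    "age": range(100),
})

cleaned = collapse_rare_categories(df)
print(cleaned["native_country"].value_counts())
```

The threshold is a judgment call: a higher `min_share` merges more levels and yields fewer dummy columns, at the cost of losing detail about the smaller groups.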

  • @akhwandabdulkareem2120
    @akhwandabdulkareem2120 4 years ago

    The link in the description is not working. Can you please provide a GitHub or Jupyter notebook link as well?

  • @brunoreggio3896
    @brunoreggio3896 6 years ago

    Thanks April, very explanatory and very easy to understand even for non-native speakers. Plus you are so pretty.

  • @TheFreezwater
    @TheFreezwater 6 years ago

    @April Chen Do you have any suggestions on how to handle multiclass classification problems?
    Let's say we have 14 different items to be predicted from 200 unique combinations of items in another column in the same dataset. I'd appreciate your suggestions.

    • @mitrabhanuroutkali
      @mitrabhanuroutkali 5 years ago

      Sorry to say there is no solution in ML; I think you can do that with a CNN.

  • @vilkoos
    @vilkoos 7 years ago

    very useful demo ... thanks

  • @michaelmolter6180
    @michaelmolter6180 8 years ago +5

    Is the dataset publicly available?

    • @jerwinsamuel
      @jerwinsamuel 6 years ago

      Thank you for this talk. This is very helpful.

    • @ArgaDaneshwara
      @ArgaDaneshwara 6 years ago

      @April Chen thank you very much! hope you have a great day!

    • @Nehmaiz
      @Nehmaiz 5 years ago

      datahub.io/machine-learning/adult#resource-adult

  • @hridayborah9750
    @hridayborah9750 4 years ago

    great clarity

  • @sannigupta4042
    @sannigupta4042 6 years ago

    The first video is amazing; it has cleared up a lot of confusion. Can anyone share the link to the second video as well?

    • @suhasnayak4704
      @suhasnayak4704 5 years ago +1

      Did you find the link to the second video? If yes, please share it.

  • @djlivestreem4039
    @djlivestreem4039 3 years ago

    this is amazing

  • @vasylcf
    @vasylcf 6 years ago

    Good explanation, thanks )

  • @vatsaldesai6965
    @vatsaldesai6965 6 years ago

    Excellent one

  • @refocusedadhd
    @refocusedadhd 5 years ago

    Lots of help! Thanks! Anything else you can teach me/us?

  • @pedroagrodrigues-k4y
    @pedroagrodrigues-k4y 4 years ago +1

    Does anyone have the Git repo for this project?

  • @Aristotin
    @Aristotin 6 years ago

    The notebook cannot be accessed. Can you please provide a valid link? Thanks

    • @NikhilKumar-pz3uz
      @NikhilKumar-pz3uz 6 years ago

      github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb

  • @susmitvengurlekar
    @susmitvengurlekar 5 years ago

    Thanks a lot!

  • @rajkiranveldur4570
    @rajkiranveldur4570 7 years ago

    Could anyone please share the GitHub link for this notebook? Thank you.

    • @tulasijamun3234
      @tulasijamun3234 6 years ago

      link is in the description...

    • @rajkiranveldur4570
      @rajkiranveldur4570 6 years ago

      Hi, I get the following error when I visit the link in the description:
      Internal Error
      Ticket issued
      Please help.

    • @tulasijamun3234
      @tulasijamun3234 6 years ago

      Link: github.com/aprilypchen

  • @nomso8370
    @nomso8370 7 years ago

    Excellent

  • @vikkysingh4857
    @vikkysingh4857 5 years ago

    How do I get the dataset?

    • @Nehmaiz
      @Nehmaiz 5 years ago

      datahub.io/machine-learning/adult#resource-adult

  • @yashasaveekesarwani2551
    @yashasaveekesarwani2551 3 years ago +1

    Normalize leaving the GitHub and LinkedIn accounts of the speaker.

  • @NikhilKumar-pz3uz
    @NikhilKumar-pz3uz 6 years ago

    At 12:50 she said "along columns" for axis=0. Is that correct?

    • @MahatiSuvvari
      @MahatiSuvvari 6 years ago

      Yes, it is. She will be imputing along the columns.
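
For readers unsure about the axis convention, the NumPy sketch below shows it on a made-up matrix (the values are invented for illustration). With `axis=0` a statistic is computed down the rows, producing one value per column, which is what "imputing along the columns" refers to.

```python
import numpy as np

# Made-up matrix: rows are samples, columns are features.
X = np.array([
    [1.0, 100.0],
    [np.nan, 200.0],
    [3.0, np.nan],
])

# axis=0 collapses the row dimension, giving one statistic per column.
col_means = np.nanmean(X, axis=0)
# axis=1 collapses the column dimension, giving one statistic per row.
row_means = np.nanmean(X, axis=1)

# Fill each gap with the mean of the column it sits in.
X_filled = np.where(np.isnan(X), col_means, X)

print(col_means)
print(row_means)
print(X_filled)
```

The newer `SimpleImputer` has no `axis` parameter and always works per column, matching the `axis=0` behavior of the legacy `Imputer`.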

  • @mkanalysis
    @mkanalysis 6 years ago

    notebook can be found here: github.com/aprilypchen/depy2016

  • @vishalnanote7942
    @vishalnanote7942 2 years ago

    Nice code

  • @jottedpro7408
    @jottedpro7408 7 years ago

    so helpful thank uu

  • @alimrahardian109
    @alimrahardian109 5 years ago +2

    source code link?

    • @issahsamori3398
      @issahsamori3398 5 years ago +2

      github.com/aprilypchen/depy2016

  • @quantstyle6448
    @quantstyle6448 5 years ago

    Where's the data?

  • @alaahesham250
    @alaahesham250 5 years ago

    code link:
    github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb

  • @Gruemoth
    @Gruemoth 5 years ago

    22:04 Rule 1: NEVER join the measured points with lines.

    • @avatar098
      @avatar098 5 years ago

      In this data set, why not? I understand that it's looking at a distribution of points per category. It just feels like something to help the viewer understand the trends.

    • @Gruemoth
      @Gruemoth 5 years ago

      @@avatar098 Oh, it is a visualization aid; I thought the relationship between the points was linear.

  • @yasmineu-d4384
    @yasmineu-d4384 2 years ago

  • @emilfilipov169
    @emilfilipov169 7 years ago

    Why is nobody covering REUSABILITY??? How am i going to re-use this dummy encoding in a new dataset? Why is everyone overlooking this shit?
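
One common way to reuse a dummy encoding on new data is to align the new frame to the columns seen during training. The sketch below is an illustration with made-up frames, not the approach from the talk; the column and level names are assumptions.

```python
import pandas as pd

# Made-up training and new data; the new data lacks one level and adds another.
train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
new = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

train_dummies = pd.get_dummies(train, columns=["workclass"])

# Encode the new data, then force it onto the training columns:
# columns missing from the new data are added as zeros,
# and levels never seen in training are dropped.
new_dummies = pd.get_dummies(new, columns=["workclass"]).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(list(new_dummies.columns))
print(new_dummies)
```

An alternative with the same goal is scikit-learn's `OneHotEncoder(handle_unknown="ignore")`, which is fitted once on the training data and then applied to any new data with `transform`.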

  • @bhargav7476
    @bhargav7476 4 years ago

    That UpTalk is so annoying

  • @zeljkorakic950
    @zeljkorakic950 4 years ago

    Good features, well-chosen example, but not a user-friendly program (you need to write lines rather than choose functions), and very poor teaching technique. She is knowledgeable but does not transfer knowledge well to non-experts. A conference can assume non-expert participants.

  • @ca177
    @ca177 4 years ago

    She doesn't do a good job of explaining why she is dummying everything up at the beginning, nor does she explain well what she's doing along the way. I just started my ML course, having finished my intro to analytics. I will keep watching, but I have to backtrack every 2 minutes because she doesn't explain the steps well.

  • @MrCaglar1993
    @MrCaglar1993 4 years ago

    woman dude has some over anxiety problems that comin from her voice