Kaggle's 30 Days Of ML (Day-12 Part-2): Handling Categorical Variables

  • Published on 27 Oct 2024

Comments • 40

  • @abhishekkrthakur
    @abhishekkrthakur  3 years ago +16

    If you like the videos, please do consider subscribing. It helps me keep motivated to make awesome videos like this one. :)

    • @sameeralamynwa
      @sameeralamynwa 3 years ago +1

      Inside OHE, can you please explain why we are using transform on X_valid for the values that we fit on the X_train? Should it not be done like separate fit_transform on X_train and X_valid because both of these are completely different data?
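
A minimal sketch of why the encoder is fit on X_train and only applied to X_valid. The toy frames and the `handle_unknown="ignore"` setting are assumptions for illustration, not necessarily what the video's notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "green" appears only in the validation split.
X_train = pd.DataFrame({"color": ["red", "blue", "red"]})
X_valid = pd.DataFrame({"color": ["blue", "green"]})

# Fit once on the training data: the categories learned here define the output columns.
ohe = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe.fit_transform(X_train).toarray()

# Reuse the fitted encoder so validation rows get the same columns in the same order.
valid_encoded = ohe.transform(X_valid).toarray()

# For contrast: a separate fit on X_valid learns a different column layout.
separate = OneHotEncoder(handle_unknown="ignore").fit(X_valid)
print(list(ohe.categories_[0]), list(separate.categories_[0]))
```

With a separate fit, column 0 would mean "blue" in one split and something else in another, so a model trained on the first layout could not read the second.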

  • @sarthaksingh2175
    @sarthaksingh2175 3 years ago +3

    It's like learning to swim from Phelps. Thanks Abhishek

  • @fatimak6440
    @fatimak6440 2 years ago +1

    haha - it's embarrassing to admit but it took me ONE WEEK to get through a 55 minute video. good stuff abhi! i have a few concerns in some places, but i need to now go through the code WITHOUT your video, for clearer understanding, and may revert back with questions/concerns.

    • @rohan-o5w
      @rohan-o5w 11 months ago +1

      thanks for the encouragement that i am not the only one finding this difficulty 😄

  • @tahsinulislam3147
    @tahsinulislam3147 3 years ago +2

    Thank you for the videos @Abhishek Thakur! I think instead of looping to select object columns, you can use X_train.select_dtypes(exclude=['object']), which returns a DataFrame without the object-dtype columns. A bit easier to filter, I suppose.
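
A small sketch of that suggestion. The column names and values are made up for illustration, and the string column is given an explicit object dtype so the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one object (string) column.
X_train = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": pd.Series(["Pave", "Grvl"], dtype="object"),
    "YearBuilt": [2003, 1976],
})

# Keep everything except object-dtype columns, without writing a loop.
numeric_X_train = X_train.select_dtypes(exclude=["object"])

# The same method with include= gives the names of the categorical columns.
object_cols = X_train.select_dtypes(include=["object"]).columns.tolist()
print(list(numeric_X_train.columns), object_cols)
```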

  • @harish00784
    @harish00784 3 years ago +2

    You are my savior 🙏🏼. I was able to perform other exercise, but this categorical variable part I was actually stuck. This helped me ✌🏼

    • @prathamnikam5451
      @prathamnikam5451 1 year ago

      Yes, I am also stuck on 83%. What to do in the categorical exercise?

  • @pranjal86able
    @pranjal86able 3 years ago +1

    The only benefit of ordinal over label encoding is that OrdinalEncoder can be used on 2D data, (n_samples, n_features), whereas with LabelEncoder you need to loop through the n_features. Small difference.

    • @jiggeling
      @jiggeling 3 years ago

      Please note that this is not quite true. Both methods are only equal for models which do not care about the order of the categories, which is why it is discouraged to use labelencoder on input data.
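
A short sketch of the API difference discussed in this thread, on made-up data. With default settings both encoders assign integers by sorted category order, so the outputs match here; OrdinalEncoder additionally accepts an explicit order through its `categories` parameter, which LabelEncoder does not offer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical feature matrix with two categorical columns.
X = pd.DataFrame({"size": ["S", "M", "L"], "color": ["red", "blue", "red"]})

# OrdinalEncoder is meant for features: it encodes every column in one call.
ordinal = OrdinalEncoder()
X_ordinal = ordinal.fit_transform(X)

# LabelEncoder is meant for a 1D target, so features need one encoder per column.
X_label = X.copy()
for col in X.columns:
    X_label[col] = LabelEncoder().fit_transform(X[col])

print(X_ordinal)
print(X_label)
```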

  • @1993anhnguyen
    @1993anhnguyen 3 years ago +1

    Thank you for a great video. I have learned a lot.

  • @nishanttailor4786
    @nishanttailor4786 2 years ago

    To the point explanation!!

  • @amanpatkar7009
    @amanpatkar7009 3 years ago +7

    I recommend everyone buy the "Approaching almost any machine learning problem" book

  • @code4u941
    @code4u941 3 years ago +2

    Hi, you said that with Random Forest there is no need to do one-hot encoding, and that label encoding is fine. Is this true only for Random Forest, or for any tree-based model like XGBoost, GBM, AdaBoost, etc.?
    Thanks.

  • @fatimak6440
    @fatimak6440 2 years ago

    Why is the original notebook splitting X so early on? Then you have to perform all functions TWICE, unnecessarily. Apply all functions/operations, then split: wouldn't this be better? Seriously asking if there is a valid reason for splitting so early.

  • @swayamsingh4650
    @swayamsingh4650 3 years ago

    Do we have to learn which model accepts which type of encoding? I mean, as you said, we can use label encoding with decision trees, RF and GBM, but with logistic regression and linear regression you prefer binarization.
    Also, one-hot encoding is a type of binarization, but we still used it with Random Forest.

  • @mithilesh03
    @mithilesh03 3 years ago

    Hello Abhishek sir, I hope all is well with you. Just a quick question. In the Categorical Variables Exercise, the columns with missing values are dropped from X_test based on which columns have missing values in the X dataset. However, I noticed that X_test has missing values in some additional columns. Can you please advise how to handle this? My OneHotEncoder is fit on X_train, and I cannot use it to transform X_test because this leads to an error. Should I handle the missingness in X_test using axis=0 on selected columns? Thanks

  • @pradumn3850
    @pradumn3850 3 years ago

    While working on an AirQuality data set I get this error when I run ypred = model.predict(X_train)
    'numpy.float64' object has no attribute 'predict'
    what is my mistake?
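
One common way to hit this error is reusing the name `model` for a score, so the fitted estimator is replaced by a float. This is only a guess, since the commenter's code is not shown; the data and the overwriting line below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy regression data (hypothetical).
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 4.0, 6.0])

model = LinearRegression().fit(X_train, y_train)

# Hypothetical mistake: storing a score under the same name replaces the estimator.
model = np.mean(np.abs(y_train - model.predict(X_train)))

try:
    model.predict(X_train)
except AttributeError as err:
    message = str(err)
    print(message)

# Keeping separate names avoids the problem.
estimator = LinearRegression().fit(X_train, y_train)
score = np.mean(np.abs(y_train - estimator.predict(X_train)))
ypred = estimator.predict(X_train)
```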

  • @NYARRAMSETTIPAVANSAIKONDALARAO
    @NYARRAMSETTIPAVANSAIKONDALARAO 3 years ago +1

    If categorical features have missing values in the test set, what will OneHotEncoder do?

  • @divyanshugera3187
    @divyanshugera3187 1 year ago

    Why do we create a new label encoder in the loop for every column?
    Can't we just define it once outside the loop?
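
A small sketch of what happens with a single shared encoder, on made-up columns. If each column is only passed through `fit_transform` once and the encoder is never reused, one instance would also work; a separate encoder per column matters when the learned mappings are needed later, for example to transform validation data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns.
colors = ["red", "blue", "red"]
sizes = ["S", "M", "L"]

# One shared instance: every call to fit discards what was learned before.
shared = LabelEncoder()
shared.fit(colors)
shared.fit(sizes)
shared_classes = list(shared.classes_)  # only the mapping for sizes is left

# One encoder per column keeps each mapping available for later transforms.
encoders = {
    "color": LabelEncoder().fit(colors),
    "size": LabelEncoder().fit(sizes),
}
red_code = encoders["color"].transform(["red"]).tolist()
print(shared_classes, red_code)
```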

  • @paddynsubuga9857
    @paddynsubuga9857 3 years ago +1

    Hello, I want to make a final model and then predict using the X_test data set. Please find time and make for some of us the video for this. It will be of great help too. Thanks for the content though.

    • @abhishekkrthakur
      @abhishekkrthakur  3 years ago +1

      many videos in this series have this part. if you go through them, you will find what you are looking for.

    • @paddynsubuga9857
      @paddynsubuga9857 3 years ago +1

      @@abhishekkrthakur oooh.. okay. Thanks

  • @sravya4189
    @sravya4189 3 years ago +1

    I am trying to run the notebook but it says "Draft session starting" and gets stuck there. I updated my Chrome browser, restarted it, and flipped accelerators while running, but nothing helped. Can you please suggest ways to overcome this issue?

    • @abhishekkrthakur
      @abhishekkrthakur  3 years ago +1

      you need to run a cell. if it still doesnt work after a few mins, contact kaggle support :)

    • @sravya4189
      @sravya4189 3 years ago +1

      @@abhishekkrthakur Thanks for response. I tried running as suggested, but still not working. Raised help ticket in Kaggle. Thanks.

  • @啸剑王
    @啸剑王 3 years ago

    But what about the order of ordinal encoder? It seems our code never defines whether "never" should be 1 or 2 or 100
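
A minimal sketch of how the order can be set explicitly, assuming scikit-learn's OrdinalEncoder and a made-up "frequency" column. By default the categories are sorted, so the codes follow alphabetical order; the `categories` parameter takes one ordered list per column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature.
X = pd.DataFrame({"frequency": ["never", "often", "rarely", "never"]})

# Default: categories are sorted, giving never=0, often=1, rarely=2.
default_codes = OrdinalEncoder().fit_transform(X).ravel()

# Explicit order: never=0, rarely=1, often=2.
explicit = OrdinalEncoder(categories=[["never", "rarely", "often"]])
explicit_codes = explicit.fit_transform(X).ravel()

print(default_codes, explicit_codes)
```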

  • @appunram6881
    @appunram6881 3 years ago +1

    What is the use of putting axis = 1 ?

    • @abhishekkrthakur
      @abhishekkrthakur  3 years ago +4

      axis=0 => rows, axis=1 => columns

    • @appunram6881
      @appunram6881 3 years ago +1

      @@abhishekkrthakur Thank you sir

  • @iwrestling4020
    @iwrestling4020 3 years ago

    sorry dumb questions, why is axis=1?

    • @grow-with-abi
      @grow-with-abi 3 years ago

      Axis=1 refers to column and axis=0 refers to rows
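
A quick sketch of what the axis argument changes in pandas `drop` and `dropna`, using a made-up frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "b" has a missing value in row 0.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 4.0], "c": [5.0, 6.0]})

dropped_col = df.drop("a", axis=1)  # axis=1: "a" is looked up among the column labels
dropped_row = df.drop(0, axis=0)    # axis=0: 0 is looked up among the row labels

no_nan_cols = df.dropna(axis=1)     # drops every column that contains a NaN
no_nan_rows = df.dropna(axis=0)     # drops every row that contains a NaN

print(list(dropped_col.columns), list(no_nan_cols.columns), list(no_nan_rows.index))
```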

  • @pranjal86able
    @pranjal86able 3 years ago

    What is the free book link at the print comment? (13:50)

    • @abhishekkrthakur
      @abhishekkrthakur  3 years ago +1

      bit.ly/approachingml

    • @pranjal86able
      @pranjal86able 3 years ago

      ​@@abhishekkrthakur thank you, I see it now in the description. Coming from core python background, the masking step ```s[s]``` looked very foreign. I had to spend a lot of time on my own to understand it. (20:30)

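A small sketch of the `s[s]` masking step mentioned above, on a made-up frame. Here `s` is a boolean Series indexed by column name, so indexing it with itself keeps only the entries that are True. The string column is given an explicit object dtype so the comparison behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical frame with one numeric and one object column.
X_train = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": pd.Series(["Pave", "Grvl"], dtype="object"),
})

# Boolean Series: index = column names, value = whether that column is object dtype.
s = X_train.dtypes == "object"

# Boolean indexing keeps only the True entries; their index holds the column names.
object_cols = list(s[s].index)
print(s.tolist(), object_cols)
```
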
  • @AdityaJha1
    @AdityaJha1 3 years ago +1

    My man Abhishek to the rescue 🔥🔥🔥🔥

  • @NYARRAMSETTIPAVANSAIKONDALARAO
    @NYARRAMSETTIPAVANSAIKONDALARAO 3 years ago

    If the test set contains NaN values in the columns that were not dropped, it will give an error. How to deal with this? Should we impute these values in the test set?
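
One possible approach, sketched under assumptions: impute the categorical columns with an imputer fit on the training data, then apply the already fitted encoder. The data, the `most_frequent` strategy and `handle_unknown="ignore"` are illustrative choices, not the notebook's confirmed method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the test set has a NaN that the training set does not.
X_train = pd.DataFrame({"Street": pd.Series(["Pave", "Grvl", "Pave"], dtype="object")})
X_test = pd.DataFrame({"Street": pd.Series(["Grvl", np.nan], dtype="object")})

# Learn the fill value (the most frequent category) from the training data only.
imputer = SimpleImputer(strategy="most_frequent")
ohe = OneHotEncoder(handle_unknown="ignore")

train_encoded = ohe.fit_transform(imputer.fit_transform(X_train)).toarray()

# Apply the same fitted imputer and encoder to the test data.
test_encoded = ohe.transform(imputer.transform(X_test)).toarray()
print(test_encoded)
```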