Encoding Categorical Values in Pandas for Keras (2.2)

  • Published on 3 Feb 2025

Comments • 19

  • @japedr (3 years ago, +1)

    12:29 Shouldn't "df" be passed as an argument of the function (or be replaced by "df1")? Otherwise, the function depends on a definition of "df" outside its scope, which I think was not intended.
    Thanks for the awesome work.
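
A minimal sketch of the fix suggested above, assuming a smoothed target-encoding helper. The function name `calc_smooth_mean_explicit`, its parameters, and the toy data are illustrative and are not taken from the video's notebook. The only point is that the DataFrame is passed in as an argument rather than read from an enclosing scope.

```python
import pandas as pd


def calc_smooth_mean_explicit(df, cat_name, target, weight):
    """Smoothed per-category mean of `target`, using only the `df` argument."""
    global_mean = df[target].mean()
    # Per-category row count and mean of the target
    agg = df.groupby(cat_name)[target].agg(["count", "mean"])
    counts = agg["count"]
    means = agg["mean"]
    # Blend each category mean with the global mean; `weight` behaves like
    # a number of pseudo-observations placed at the global mean
    smooth = (counts * means + weight * global_mean) / (counts + weight)
    # Return one encoded value per row of the input
    return df[cat_name].map(smooth)


# Hypothetical toy data, invented for this example
toy = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "dog", "dog"],
    "y": [0, 1, 1, 1, 1, 0],
})
encoded = calc_smooth_mean_explicit(toy, "animal", "y", weight=2)
```

Because nothing inside the function refers to a name defined elsewhere, it can be applied to any DataFrame without depending on what `df` happens to mean in the notebook.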

  • @obsidiansiriusblackheart (5 years ago, +2)

    6:44 What if you had 10 categories? Or 100? Would you still have to create these dummies, or is there some more efficient way to convert the categories to numerical values?

    • @HeatonResearch (5 years ago, +1)

      If you can find a way to order them, then you can reduce it to a single index number. You can also see if some of the dummy variables are not important (by creating them and running a feature importance report) and drop the unimportant ones.

    • @obsidiansiriusblackheart (5 years ago)

      @HeatonResearch thank you
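
The "single index number" idea from the reply above could look roughly like this in pandas. This is a sketch under the assumption that the categories have a meaningful order; the column name `size` and its levels are invented for illustration.

```python
import pandas as pd

# Hypothetical column with a natural order (invented for this example)
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
order = ["small", "medium", "large"]

# An ordered Categorical assigns integer codes by position in `order`
df["size_code"] = pd.Categorical(
    df["size"], categories=order, ordered=True
).codes

# For comparison: dummy variables need one column per category
dummies = pd.get_dummies(df["size"])
```

With 100 ordered categories this still yields one integer column, whereas `pd.get_dummies` would produce 100 columns. The integer encoding only makes sense when the order carries real meaning.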

  • @leassis91 (2 years ago)

    Hi, in the calc_smooth_mean function, how do you define the weight? Did you choose it randomly? Thanks in advance.
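
For context on what the weight does, smoothed target encoding is commonly written in the additive form below. This is an assumption about what `calc_smooth_mean` computes, not a transcription of the video's code. Here $n_c$ is the number of rows in category $c$, $\bar{y}_c$ is the target mean within that category, $\bar{y}$ is the global target mean, and $m$ is the weight.

```latex
\[
\hat{\mu}_{c} \;=\; \frac{n_{c}\,\bar{y}_{c} + m\,\bar{y}}{n_{c} + m}
\]
```

Under this form, $m = 0$ gives the plain category mean, and a larger $m$ pulls small categories toward the global mean. The weight is typically treated as a hyperparameter and tuned on validation data rather than derived from the data directly.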

  • @Kanakapallianurag (4 years ago)

    For the dogs and cats at 10:16, we got the value {'cat': 0.2, 'dog': 0.8}. Is that because {'cat': ((no. of cats with y = 1) * 2) / (len(index) + 1), 'dog': ((no. of dogs with y = 1) * 2) / (len(index) + 1)}? Please do reply.

  • @bobdowling6932 (4 years ago)

    A question about Z-scores: Is there any advantage in using the mean and standard deviation over, say, the median and inter-quartile range? Would your networks be vulnerable if one or more columns of initial data came from a distribution with ill-defined mean and std.var.? (Classic maths example is p.d.f.(x) = (1/π)/(1+x²).)
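
To make the comparison in the question concrete, here is a small sketch that scales the same values once with mean and standard deviation and once with median and inter-quartile range. The helper names `zscore` and `robust_scale` and the sample values are made up for this example; it illustrates the two formulas and is not a statement about which one the course recommends.

```python
import numpy as np


def zscore(x):
    """Center on the mean and divide by the (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def robust_scale(x):
    """Center on the median and divide by the inter-quartile range."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)


# Five ordinary values plus one extreme outlier (invented data)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])
z = zscore(values)
r = robust_scale(values)

# Spread of the five ordinary values after each kind of scaling
z_spread = z[:5].max() - z[:5].min()
r_spread = r[:5].max() - r[:5].min()
```

In this toy case the outlier inflates the standard deviation, so the five ordinary values end up squeezed into a very narrow z-score range, while the median and IQR version keeps them spread out. For heavy-tailed inputs like the distribution mentioned in the question, that difference is the main argument for the robust variant.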

  • @hanserj169 (5 years ago)

    What if I have just a label with 2 categories? Should I use dummy variables, or set 0 and 1 directly in the dataset in a single (binary) column? Sorry if the question is too basic, but I'm a beginner. Could I use LabelEncoder() or pd.factorize?
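
As a sketch of the single-column option raised in the question, the snippet below encodes a two-category column with an explicit mapping and with `pd.factorize`. The column name `label` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical two-category column (invented for this example)
df = pd.DataFrame({"label": ["yes", "no", "no", "yes"]})

# Explicit mapping: you decide which class becomes 1
df["label_map"] = df["label"].map({"no": 0, "yes": 1})

# pd.factorize: codes follow the order of first appearance in the data
codes, uniques = pd.factorize(df["label"])
df["label_factorized"] = codes
```

Both give one 0/1 column. The explicit mapping keeps the meaning of 0 and 1 fixed, while `pd.factorize` assigns 0 to whichever value appears first, so the assignment can change from one dataset to another.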

  • @hellaxxable (5 years ago, +1)

    Great video Jeff, thank you! One question about target encoding: is it possible to use target encoding when your target value is a multiclass feature? Let's say it's not only 1's and 0's but there are some 2's as well. If yes, would this change how the encoding is applied?

    • @HeatonResearch (5 years ago)

      There are several ways to do that, but the potential for target leakage is so much higher. I've tried a couple of different techniques on my own but always got fairly bad overfitting. If I ever have a case where I push it deeper, I may post something on target encoding for a categorical target.

  • @nobodyeverybody8437 (4 years ago)

    Dear Jeff, shouldn't we set ddof in the zscore function to 1? The default value is 0, and I think it affects the results a little bit. What would you suggest?
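
The size of the effect asked about above can be checked directly. This sketch uses plain NumPy rather than the `zscore` function from the video, and the sample values are invented; `ddof=0` divides by n and `ddof=1` divides by n - 1.

```python
import numpy as np

# Invented sample of n = 8 values
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_population = x.std(ddof=0)  # divides by n
std_sample = x.std(ddof=1)      # divides by n - 1

z_population = (x - x.mean()) / std_population
z_sample = (x - x.mean()) / std_sample

# The two sets of z-scores differ only by a constant factor
ratio = np.sqrt((len(x) - 1) / len(x))
```

The two versions differ by the constant factor sqrt((n - 1) / n), which is about 0.935 for n = 8 and approaches 1 as n grows, so on a dataset with thousands of rows the choice changes the scaled values only marginally.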

  • @AlokKumar-jh8wp (5 years ago)

    Jeff, could you suggest resources for a foundation in statistics and how to apply stats in machine learning?

    • @HeatonResearch (5 years ago)

      For a book, I really like this one: openstax.org/details/books/introductory-statistics

    • @HeatonResearch (5 years ago)

      Not free, but pretty much a bridge between stats and ML: www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576

  • @liweigao4755 (5 years ago)

    Great video, just one question: how do you handle the case where too many dummy variables get created? For example, if the column holds phone models, there might be thousands of them, which consumes a lot of memory. Thanks!

    • @paulchristian1244 (4 years ago)

      Look at sklearn hashing and binary encoding
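
The hashing idea mentioned in the reply can be sketched in a few lines of plain Python. This is a hand-rolled stand-in for illustration, not scikit-learn's implementation (scikit-learn provides `FeatureHasher` in `sklearn.feature_extraction` for real use); the function name `hash_encode`, the bucket count, and the phone model strings are all hypothetical.

```python
import hashlib


def hash_encode(value, n_buckets=8):
    """Map a category string to a fixed-length vector with a single 1."""
    # md5 is used only as a stable hash, not for security
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    vec = [0] * n_buckets
    vec[bucket] = 1
    return vec


# Hypothetical phone models (invented for this example)
phones = ["model-a", "model-b", "model-c", "model-a"]
encoded = [hash_encode(p) for p in phones]
```

The output width is fixed at `n_buckets` no matter how many distinct phone models appear, which is what saves memory. The trade-off is that different categories can collide in the same bucket, so the bucket count has to be chosen with that in mind.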

  • @forvm2051 (5 years ago, +1)

    I got lost from "Target Encoding for Categoricals" at 8:38 onward; I don't know what the code is doing.

    • @HeatonResearch (5 years ago)

      That code is basically just building up a test data set. I could have also just generated the test dataset as a CSV and had the Jupyter notebook load that, but this keeps it all in one compact notebook.

  • @8eck (3 years ago)

    I lost the thread somewhere between the cars and the dogs and cats... The explanations are very confusing and go over important steps too quickly. With so many columns dropped, what is the point of that data then?