Advanced missing-value imputation technique to supercharge your training data.

  • Published Dec 11, 2024

Comments • 25

  • @soccerdadsg • 1 year ago

    Absolutely love this library!

  • @akmalmir8531 • 1 year ago

    Danil, thank you for sharing, interesting library. One idea: next time it would be great to compare, for example:
    1) mean imputation
    2) dropping
    3) ML imputation
    and then fit and predict any model on the data, so at the end we can see which imputation method yields the minimum RMSE (see the sketch below).
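
A sketch of that suggested comparison: mask values we actually know, impute them back with each method, and score by RMSE. The `benchmark_imputation` name, the RandomForest stand-in for the ML imputer, and the 20% mask are illustrative assumptions, not from the video:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def benchmark_imputation(df: pd.DataFrame, target: str,
                         mask_frac: float = 0.2, seed: int = 0) -> None:
    """Hide known values of `target`, impute them back, report RMSE per method."""
    df = df.dropna(subset=[target]).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < mask_frac  # values we pretend are missing
    y_true = df.loc[mask, target]

    # 1) Mean imputation: every hidden value gets the same constant
    mean_pred = np.full(int(mask.sum()), df.loc[~mask, target].mean())

    # 2) Dropping: nothing to score -- those rows are simply lost
    print(f"dropping discards {int(mask.sum())} of {len(df)} rows")

    # 3) ML imputation: model the hidden column from the other numeric features
    X = df.drop(columns=[target]).select_dtypes("number")
    X = X.fillna(X.median())  # crude fill for feature columns, benchmark only
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X[~mask], df.loc[~mask, target])
    ml_pred = model.predict(X[mask])

    print(f"mean imputation RMSE: {np.sqrt(mean_squared_error(y_true, mean_pred)):.4f}")
    print(f"ML imputation RMSE:   {np.sqrt(mean_squared_error(y_true, ml_pred)):.4f}")
```

Run on any dataset with a complete numeric column, this gives a concrete number per strategy, which is exactly the comparison proposed above.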

    • @lifecrunch • 1 year ago

      I've done such a comparison many times. It depends a lot on the data, but on average ML-based missing value imputation yields better results.

    • @akmalmir8531 • 1 year ago

      @@lifecrunch Yes, agreed. That's why I'm writing: show your viewers that your idea works better than simple imputation. You're giving them gold, so it would be even better with a comparison at the end.

    • @lifecrunch • 1 year ago

      Agree, this would be a great illustration of the concept.

  • @akshu7832 • 5 days ago

    Informative

  • @likhithp9934 • 7 months ago

    Nice work, man

    • @lifecrunch • 7 months ago

      Thanks 🔥

  • @nawaz_haider • 1 year ago +1

    I'm learning Data Science, and most tutorials just use the mean value. This didn't make any sense to me. I was wondering how on earth their model works in the real world with all these wrong values that have been used during training. Now I see what pros do.

    • @lifecrunch • 1 year ago +1

      Yeah, the naive (mean) approach just works technically. It's used to fill in the blanks so that models which can't handle NaNs can train at all. But the volume of incorrectly filled missing values will directly affect the model's generalization (see the sketch below).
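
For reference, a minimal sketch of that naive fill in pandas, on toy data (not from the video):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan, 31.0]})

# Every missing age becomes the column mean (32.0 here), which lets
# NaN-intolerant models train but ignores all other row context.
df["age"] = df["age"].fillna(df["age"].mean())
```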

  • @mkaya4677 • 2 months ago

    Hi,
    First of all, your video provides very useful information, and I want to thank you for that. I have a question I would like to ask you.
    I am analyzing air pollution in a city in my country. For this purpose, I created a dataset from air pollution data and meteorological data and organized it into hourly intervals. However, I ran into a problem: the dataset contains null values, and in some parts they appear consecutively. For example, in the first 3000 rows there are approximately 2500 null values for the NO2, NOX, and NO pollutants, while the rest of the dataset has very few. In addition, there are rows where data for all pollutants are missing, but these cover short consecutive periods; I believe this might be because workers turn off the devices after working hours on certain days.

    I have previously trained a few models to fill in these missing values, but I did not achieve good results. I would like to ask for your guidance: in these two cases, should I fill in the missing data or exclude it from the dataset? And what would be the most accurate method to complete these missing values?

    • @lifecrunch • 2 months ago

      In the first case (the long run of consecutive missing values at the top), I would just drop those rows.
      As for the NaNs in the middle, since your data is a time series, I would use something like a rolling window or the nearest observed values to fill in the blank spots (see the sketch below).
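
A sketch of both suggestions in pandas, on a toy hourly series standing in for a column like NO2 (window sizes and gap limits are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy hourly series with short gaps, standing in for the real NO2 data
idx = pd.date_range("2024-01-01", periods=12, freq="h")
no2 = pd.Series([21.0, 22.5, np.nan, np.nan, 24.0, 25.5,
                 np.nan, 26.0, 27.5, np.nan, np.nan, 28.0], index=idx)

# Rolling window: fill each NaN with the mean of a centered 5-hour
# window computed over whatever neighboring values are present
rolling_mean = no2.rolling(window=5, center=True, min_periods=1).mean()
filled_rolling = no2.fillna(rolling_mean)

# Nearest observed value in time (requires scipy), bridging at most
# 3 consecutive missing hours so that long outages stay NaN
filled_nearest = no2.interpolate(method="nearest", limit=3)
```

The `limit` argument is the lever for the second case in the question: short device-off gaps get bridged, while long outages remain NaN and can be dropped.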

  • @prestonryan3734 • 1 month ago

    Absolute mad lad

  • @tnwu4350 • 7 months ago

    Hi there, this is an awesome approach to imputation. How would you go about validating it, though? It would be helpful to demonstrate that it's more accurate than methods like the simple or iterative imputer.

    • @lifecrunch • 7 months ago

      I have benchmarked this approach against the iterative imputer along with all the statistical methods. Every time, verstack.NaNImputer gave better results, especially compared to the statistical methods. And there's really no magic: a sophisticated model like LightGBM is the gold standard when it comes to tabular data.
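
For anyone who wants to try it, a minimal usage sketch, assuming the verstack NaNImputer API (`NaNImputer().impute(df)`); the toy frame is illustrative:

```python
import numpy as np
import pandas as pd
from verstack import NaNImputer  # pip install verstack

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33, np.nan, 31],
    "income": [50_000, 62_000, np.nan, 58_000, 71_000, 45_000],
    "city":   ["NY", "LA", "NY", np.nan, "SF", "LA"],
})

# Each column containing NaNs is treated as a prediction target and
# modeled from the remaining columns, with LightGBM under the hood
imputer = NaNImputer()
df_imputed = imputer.impute(df)
```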

  • @anmolchhetri3033 • 1 month ago +1

    Very helpful, thanks. But is it required to do hyperparameter tuning of the LightGBM models?

    • @lifecrunch • 1 month ago

      For the purpose of missing value imputation, it's not necessary. Tuning can give a subtle accuracy improvement, and it's justified for an actual prediction model, but I wouldn't do it for a data processing step.

  • @yolomc2 • 8 months ago

    Is it possible to get a copy of the code to study, sir? Thanks in advance 👌👍

    • @lifecrunch • 8 months ago +1

      Unfortunately, I didn't save the code from this video... You can code along; the script is not very complicated.

    • @yolomc2 • 8 months ago

      @@lifecrunch 👍

  • @AlexErdem-lo5rz • 6 months ago

    Thank you!

    • @lifecrunch • 6 months ago

      Welcome!

  • @kalyanchatterjee8624 • 5 months ago

    Great, but I am not the right audience. Too fast.

    • @lifecrunch • 5 months ago +1

      You’ll get there…