Advanced missing values imputation technique to supercharge your training data.
- Published Sep 3, 2023
- Get the most out of your data for machine learning by adopting this advanced data preprocessing trick.
verstack package documentation - verstack.readthedocs.io/en/la...
Absolutely love this library!
Thank you!
Welcome!
Nice Work man
Thanks 🔥
Hi there, this is an awesome approach to imputation. How would you go about validating it, though? It would be helpful to demonstrate that it's more accurate than methods like SimpleImputer or IterativeImputer.
I have benchmarked this approach against IterativeImputer along with all the statistical methods. Every time verstack.NaNImputer gave better results, especially compared to the statistical methods. And there's really no magic: a sophisticated model like lightgbm is the gold standard when it comes to tabular data.
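The core idea behind model-based imputation is simple: treat the column with NaNs as a prediction target, fit a model on the rows where it is observed, and predict the missing entries. A minimal sketch of that mechanic, using a plain least-squares linear model in place of lightgbm purely to keep it dependency-free (the `ml_impute_column` helper is my own illustration, not verstack's API, and it assumes the other columns have no NaNs):

```python
import numpy as np

def ml_impute_column(X, col):
    """Fill NaNs in X[:, col] by regressing it on the other columns.

    A least-squares linear model stands in for a gradient-boosted
    model here just to keep the sketch dependency-free. Assumes the
    remaining columns are fully observed.
    """
    X = X.copy()
    mask = np.isnan(X[:, col])
    features = np.delete(X, col, axis=1)
    # Fit on the rows where the target column is observed.
    A = np.column_stack([features[~mask], np.ones((~mask).sum())])
    coef, *_ = np.linalg.lstsq(A, X[~mask, col], rcond=None)
    # Predict the missing entries from the same features.
    B = np.column_stack([features[mask], np.ones(mask.sum())])
    X[mask, col] = B @ coef
    return X

# Toy data: column 1 is a linear function of column 0.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X = np.column_stack([x0, 2 * x0 + 1])
X[:10, 1] = np.nan
filled = ml_impute_column(X, col=1)
```

Because the toy relationship is exactly linear, the filled values recover `2 * x0 + 1`; on real tabular data a tree-based model captures nonlinear relationships the same way.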
Danil, thank you for sharing, interesting library. One idea: it would be best if next time we could compare, say:
1) mean imputation
2) dropping
3) ML
and then fit and predict any model on the data; at the end we can compare which imputation gives the minimum RMSE.
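The comparison proposed above can also be run directly on the imputed values: hide entries whose true values are known, impute them back, and score each strategy by RMSE against the hidden ground truth. A minimal sketch under those assumptions, again with a least-squares linear model standing in for a full ML imputer:

```python
import numpy as np

# Synthetic data where y is predictable from x.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 3 * x + rng.normal(scale=0.1, size=500)

hide = rng.random(500) < 0.2          # hide ~20% of y as "missing"
truth = y[hide]

# 1) mean imputation: fill with the observed mean.
mean_pred = np.full(hide.sum(), y[~hide].mean())

# 2) model-based imputation: fit y ~ x on the observed rows.
A = np.column_stack([x[~hide], np.ones((~hide).sum())])
coef, *_ = np.linalg.lstsq(A, y[~hide], rcond=None)
ml_pred = np.column_stack([x[hide], np.ones(hide.sum())]) @ coef

def rmse(pred):
    return np.sqrt(np.mean((pred - truth) ** 2))

print(f"mean RMSE: {rmse(mean_pred):.3f}  model RMSE: {rmse(ml_pred):.3f}")
```

On data like this, where the missing column correlates with the others, the model-based RMSE comes out far below the mean-imputation RMSE; when columns are independent the two converge, which is why the result is data-dependent.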
I've done such comparisons many times. It is very much dependent on the data, but on average ML-based missing-value imputation yields better results.
@@lifecrunch Yes, agree. That's why I'm writing: to show your viewers that your idea works better than simple imputation. You're giving them gold; it would be better if you gave the comparison at the end.
Agree, this would be a great illustration of the concept.
Is it possible to get a copy of the code to study, sir? Thanks in advance 👌👍
Unfortunately I didn't save the code from this video... You can code along; the script is not very complicated.
@@lifecrunch 👍
I'm learning Data Science, and most tutorials just use the mean value. This didn't make any sense to me. I was wondering how on earth their model works in the real world with all these wrong values that have been used during training. Now I see what pros do.
Yeah, the naive (mean) approach only works in a technical sense: it fills in the blanks so that models which can't handle NaN can train. But the volume of incorrectly filled missing values directly affects the model's generalization.
Great, but I am not the right audience. Too fast.
You’ll get there…