Hands-on Handling missing value using Mean Median mode with Python | Data Cleaning Tutorial 8

  • Published on 8 Sep 2024
  • During data cleaning for machine learning, you will often need to check whether the data set has missing values and, if so, decide how to deal with them. In this video I demonstrate how to handle missing values with simple statistics: the mean, median, and mode. The video covers only the hands-on explanation in Python:
    1. We impute missing data for a quantitative attribute with the mean or median, and for a qualitative attribute with the mode.
    2. Generalized imputation: we calculate the mean or median of all non-missing values of the variable, then replace each missing value with that mean or median.
    3. Similar case imputation: we calculate the mean of the non-missing values separately for each group defined by another variable, then replace each missing value with the mean of its own group.
    Python Notebook : github.com/atu...
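
The three points above can be sketched in pandas roughly as follows. This is a minimal illustration, not the code from the linked notebook: the DataFrame and its column names (`gender`, `grade`, `salary`) are hypothetical, and the mean is used where the median would work equally well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "grade": ["a", "b", "a", None, "a", "b"],
    "salary": [10.0, 20.0, 15.0, np.nan, np.nan, 30.0],
})

# Generalized imputation: one statistic computed from all non-missing values.
salary_mean = df["salary"].mean()      # NaN values are skipped by default
salary_median = df["salary"].median()
generalized = df["salary"].fillna(salary_mean)

# Qualitative attribute: fill with the most frequent value (the mode).
grade_mode = df["grade"].mode()[0]
grade_filled = df["grade"].fillna(grade_mode)

# Similar case imputation: the mean is computed separately per gender,
# and each missing salary receives the mean of its own group.
group_mean = df.groupby("gender")["salary"].transform("mean")
similar = df["salary"].fillna(group_mean)
```

With this toy data the generalized fill uses 18.75 for both missing salaries, while the similar case fill uses 25.0 for the missing "F" row and 12.5 for the missing "M" row.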

Comments • 13

  • @chiomaobiajulu4363 8 months ago

    This is a very helpful video, I must admit. Nice work. I'd love to ask, though: what do we do with the NaN values that remain after using the groupby function? I mean, how can we replace them with a reasonable value afterwards?
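
One possible way to handle this, sketched here as an assumption rather than as the video author's answer: a NaN can survive group-wise imputation when every value in a group is missing, so the group mean itself is NaN. A common fallback is to fill what remains with the overall mean. The data and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "O" has no observed salary at all.
df = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "O"],
    "salary": [10.0, np.nan, 20.0, 30.0, np.nan],
})

# Step 1: group-wise fill. The mean of group "O" is NaN, so its row stays NaN.
group_mean = df.groupby("gender")["salary"].transform("mean")
step1 = df["salary"].fillna(group_mean)

# Step 2: fall back to the overall mean of the observed values.
overall_mean = df["salary"].mean()
step2 = step1.fillna(overall_mean)
```

Whether the overall mean is a reasonable value depends on the data; the median, or a coarser grouping, could be used as the fallback in the same way.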

  • @sakinaaali2696 1 year ago

    It really helped me. Thank you.

  • @phuocnguyenngoc2197 2 years ago

    Thanks a lot

  • @sreeramsaravanan8132 3 years ago +1

    How do we use group-by imputation if we don't have domain knowledge of that particular dataset?

    • @AtulPatelds 3 years ago +1

      Hi, in reality data science is not as easy as it seems, or as the hype created by many bloggers and training institutes suggests. In the real world, data is very messy to work with. I would say that without domain knowledge you can build a model, but the probability that it will work in production is very low, because it all depends on how well you understand your data and how much good information you can extract from it to make a good model. So if you don't have domain knowledge, you can only use a trial-and-error strategy, and with trial and error you will never be confident that you are going in the right direction; it will also take a lot of time. I have seen many production deployments fail due to a lack of domain knowledge in data science projects. As you know, we spend about 70% of our time on data and feature engineering, because creating a model is not too hard, and even a fresher can do it; the main problem is the quality of the data we feed into the model. If we feed in scrap data, we will get a scrap model, so I hope you understand the importance of domain knowledge.
      It is always advisable to have a domain expert who can guide us during feature engineering if we are not strong in the respective domain. So if you lack domain knowledge, you can get help from a domain expert or from any senior member of your team.

  • @terryterry3733 2 years ago

    In similar case imputation you took (10 + 15) / 2 = 12.5. Where is this 2 coming from? Is it because you have only 2 values, 10 and 15?
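
A small sketch of the arithmetic, assuming (as the comment suggests) that the group holds two observed values, 10 and 15, plus one missing value: the divisor is the number of non-missing values in the group, not the number of rows.

```python
# Hypothetical group: two observed values and one missing value.
values = [10, 15, None]

# Only the observed values take part in the mean.
observed = [v for v in values if v is not None]
group_mean = sum(observed) / len(observed)  # (10 + 15) / 2
```

Here `len(observed)` is 2, which is where the 2 in the division comes from.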

  • @harshavardhan6368 1 year ago

    Why are you doing this before the train/test split?
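
For context on this question: a common practice is to compute the imputation statistic on the training rows only and reuse it for the test rows, so that no information from the test set leaks into preprocessing. The sketch below illustrates that idea with hypothetical data and a hypothetical fixed split; it is not taken from the video.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing ages.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0, 60.0, np.nan, 100.0]})

# Hypothetical fixed split: first four rows for training, last two for testing.
train = df.iloc[:4].copy()
test = df.iloc[4:].copy()

# The mean is learned from the training rows only: (20 + 40 + 60) / 3.
train_mean = train["age"].mean()

# The same training statistic fills the gaps in both parts.
train["age"] = train["age"].fillna(train_mean)
test["age"] = test["age"].fillna(train_mean)

# For comparison, the mean over all rows would also include the test value 100.
full_mean = df["age"].mean()
```

In this toy case the training mean is 40.0, while the mean over all rows is 55.0, which shows how imputing before the split lets test values influence the filled training data.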

  • @terryterry3733 3 years ago

    Hi: in section 7, why did you use 0 after mode? mode()[0]

    • @AtulPatelds 3 years ago +1

      Here we want the mode of a DataFrame column. When we calculate the mode of a column with several rows, mode() can return more than one value, because several values can share the highest frequency. That's why I used mode()[0]: it selects the first of those values. We use mode()[0] by default so that we always get a single most frequent value.
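
The behaviour described in this reply can be seen in a small sketch. The Series below is hypothetical; it contains a tie so that `mode()` returns more than one value.

```python
import pandas as pd

# Hypothetical column in which "red" and "blue" are tied for most frequent.
s = pd.Series(["red", "blue", "red", "blue", "green", None])

# mode() returns a Series with every tied value, in sorted order;
# missing values are ignored by default.
modes = s.mode()

# [0] picks the first entry, giving a single scalar to fill with.
first_mode = modes[0]
filled = s.fillna(first_mode)
```

Because the result is sorted, `mode()[0]` picks "blue" here even though "red" is equally frequent, so the choice among tied values is arbitrary with respect to the data.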

  • @shashankathawale7002 3 years ago

    In line no. 7, why did you write 0 before mode? Could you please tell us about it?

    • @AtulPatelds 3 years ago +1

      Here we want the mode of a DataFrame column. When we calculate the mode of a column with several rows, mode() can return more than one value, because several values can share the highest frequency. That's why I used mode()[0]: it selects the first of those values. We use mode()[0] by default so that we always get a single most frequent value.