Missing Data? No Problem!

  • Published on Apr 7, 2023
  • 5 Ways Data Scientists deal with Missing Values.
    Check out my other videos:
    Data Pipelines: Polars vs PySpark vs Pandas: • The BEST library for b...
    Polars for Data Science: • Polars: The Next Big P...
    Speed up Pandas Dataframes: • This INCREDIBLE trick ...
    Avoid These Pandas Mistakes: • 25 Nooby Pandas Coding...
    Links to my stuff:
    * YouTube: youtube.com/@robmulla?sub_con...
    * Discord: / discord
    * Twitch: / medallionstallion_
    * Twitter: / rob_mulla
    * Kaggle: www.kaggle.com/robikscube
  • Science & Technology

Comments • 144

  • @mapo499
    @mapo499 4 months ago +228

    This is data fabrication which (at least in physics) is academic malpractice and can lead to your data seeming more precise than it is

    • @101McAvoy101
      @101McAvoy101 4 months ago +53

      this is usually used to make the most out of a given data set that is then used to make a predictive model which in and of itself is not 100% accurate anyways. it is not an academically perfect solution but a workable one, that is needed in data science regularly.

    • @mapo499
      @mapo499 4 months ago +23

      @101McAvoy101 Ah, this makes sense, I guess, so long as the graph isn't being passed off as actual measurements. Thanks for clarifying.

    • @user-vn4jw3ch8w
      @user-vn4jw3ch8w 4 months ago +7

      @mapo499 your comment just shows you don't know shxt about data science. Quite shocked, coming from someone with a STEM physics background

    • @user-cx1bc5er7y
      @user-cx1bc5er7y 4 months ago +46

      @user-vn4jw3ch8w Why so hostile? He just didn't know, and he isn't wrong about what he said about his own domain of experience either 😮

    • @KCM25NJL
      @KCM25NJL 4 months ago +1

      @mapo499 your statement is broadly accurate, but there are very few specific instances where what you state will ever be true. The reason is that, to be sued for malpractice, the evidence of your malpractice would need to show a clear and obvious deviation that would require outlier data points. There are also very few niche topics that would be the subject of malpractice, and most of them relate to the health and safety, legal, and finance professions. Its use in professional practice must be significantly flawed or misapplied to rise to the level of malpractice.

  • @michael_bryant
    @michael_bryant 10 months ago +127

    Didn’t know about the built in interpolate, back fill, and forward fill functions. Thanks!
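For readers who have not seen these three pandas methods, here is a minimal sketch of how they differ. The series values and the name `temps` are made up for illustration and are not the data from the video.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with a two-row gap in the middle.
temps = pd.Series([20.0, np.nan, np.nan, 26.0], name="temp_c")

forward = temps.ffill()        # carry the last known value forward
backward = temps.bfill()       # pull the next known value backward
linear = temps.interpolate()   # straight line between the known neighbours

print(pd.DataFrame({"ffill": forward, "bfill": backward, "interpolate": linear}))
```

Forward and backward fill repeat an existing reading, while the default linear interpolation spreads the change evenly across the gap.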

  • @HPOfficeJetProAll-In-One
    @HPOfficeJetProAll-In-One 5 months ago +63

    “We make it up!”

  • @stevenkovacs3113
    @stevenkovacs3113 5 months ago +9

    If you want to use that to train a neural net or a support vector machine, you may want to add a new column indicating that the original value was missing ...
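One way to do what this comment suggests, sketched with made-up data. The column names `temp_c` and `temp_c_was_missing` are illustrative, and interpolation is just one possible fill; the key point is to record the flag before filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp_c": [20.0, np.nan, 24.0, np.nan, 28.0]})

# Record which rows were missing BEFORE filling, so a model can tell
# measured values apart from imputed ones.
df["temp_c_was_missing"] = df["temp_c"].isna().astype(int)

# Fill afterwards (linear interpolation here, any fill method would do).
df["temp_c"] = df["temp_c"].interpolate()

print(df)
```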

  • @theaerogr
    @theaerogr 8 months ago +40

    Please never drop samples with missing values. This will create a bias in your analysis and will break your pipeline on new data. Mean/Mode Imputation is a great and fast method to handle the problem. If you have non-linear predictive models MM is a great imputation method that doesn't affect the performance.

    • @larkohiya
      @larkohiya 4 months ago +3

      Thank you. This guy gets it.

    • @tomaszubiri4588
      @tomaszubiri4588 4 months ago +2

      Actually, it's fine. Just reduces your sample size. Why would it break anything?

    • @Blacky95Geh
      @Blacky95Geh 3 months ago +1

      What do you mean by "predictive models MM"?

    • @tomaszubiri4588
      @tomaszubiri4588 3 months ago

      @Blacky95Geh predictive models, Mean/Mode

    • @Blacky95Geh
      @Blacky95Geh 3 months ago +1

      @tomaszubiri4588 mmhm thanks - I just can't wrap my head around the purpose of using mode here. "Mean / rolling mean" I get. Personally I like splining better, as it uses a cubic function which in turn will work better when generating values for missing temperature peaks.
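The mean/mode imputation discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the commenter's actual pipeline: the frames `train` and `new` and their columns are invented. The mean is used for the numeric column and the mode for the categorical one, which is the usual reason to reach for the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with one numeric and one categorical column.
train = pd.DataFrame({
    "temp_c": [20.0, np.nan, 24.0, 22.0],
    "sky": ["clear", "cloudy", None, "clear"],
})
# A new row arriving later with both values missing.
new = pd.DataFrame({"temp_c": [np.nan], "sky": [None]})

# Learn the fill values from the training data only.
fill_values = {
    "temp_c": train["temp_c"].mean(),     # mean of the observed numbers
    "sky": train["sky"].mode().iloc[0],   # most frequent observed category
}

train_filled = train.fillna(fill_values)
new_filled = new.fillna(fill_values)      # new data still gets a usable row

print(train_filled)
print(new_filled)
```

Because the fill values are stored, rows with missing values in new data can still be scored, whereas a pipeline that only ever dropped such rows has no answer for them.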

  • @johnmo1111
    @johnmo1111 1 year ago +87

    Excellent. Can't remember the last time I learned so many things in such a short time. 👍

    • @robmulla
      @robmulla  1 year ago

      Glad you found it helpful!

  • @MrNummularius
    @MrNummularius 1 year ago +26

    Awesome! More shorts please!

    • @robmulla
      @robmulla  1 year ago +2

      Glad you like it!

  • @pbaby6813
    @pbaby6813 4 months ago +3

    Love me a lil casual data manipulation

  • @sumangorkhali5748
    @sumangorkhali5748 11 months ago +6

    So informative, clear, precise, and easy to understand

  • @BrunetteViking
    @BrunetteViking 4 months ago

    These shorts are awesome; thank you for sharing 😀.

  • @ezhankhan1035
    @ezhankhan1035 6 months ago

    This is awesome! I had no idea you could interpolate like that. Thanks Rob!

  • @wilsonsantosmarrola1251
    @wilsonsantosmarrola1251 1 year ago +1

    Tks Rob! Great content!

    • @robmulla
      @robmulla  1 year ago

      Thanks for watching!

  • @kirillsvc
    @kirillsvc 1 year ago +1

    Great content. Keep it up man!

  • @stevecti
    @stevecti 5 months ago

    Great video! Wish i knew this when i was working with time series data at a past job... those NaNs gave me a couple of headaches for sure 😂

  • @molmock
    @molmock 1 year ago +11

    I definitely should read the f..ing pandas manual 😅

    • @robmulla
      @robmulla  1 year ago +5

      Or just watch my shorts! 😊

  • @Rose-ec6he
    @Rose-ec6he 4 months ago +10

    Please don't use NAN and null interchangeably it will probably confuse people. They are different concepts that have different implications. Null generally means there was an unreported error or there's a serious bug in the program.
    NAN generally means either there was nothing collected, or the value calculated/measured is not expressible as a floating point number. This can suggest a flaw in the math performed on floating point numbers.
    Null only is possible with numbers that are stored as integers of some kind and NAN is only possible on specifically floating point numbers. The closest thing to null that python has is `None` but it's not identical

    • @Howtheheckarehandleswit
      @Howtheheckarehandleswit 4 months ago +3

      While I certainly agree with your main point that null and NaN are completely different concepts that shouldn't be conflated, a lot of the details you gave are incorrect or misleading.
      Null and integers don't really have anything to do with each other at all. Sometimes, null can be indicated with a "sentinel value" for integers, but this is generally considered bad practice because it is very error prone.
      Instead, in a robust system, null is indicated outside the actual value of something. You can sort of think of this as wrapping your type SomeType in another type NullableSomeType, where NullableSomeType stores a boolean variable indicating whether it is null, and also an instance of SomeType which is only valid to read from if the boolean has a specific value. Rust's Option type is a good example of how this works.
      Null does not necessarily mean there is a bug. It can be a completely sensible return value for some functions. For example, if you have a set of survey responses, where some questions could be skipped if the respondent elected not to answer, the only sensible thing to return for what that respondent entered would be null by some name (you might call it None or NotAnswered, but it's serving the same purpose in this context).
      NaN is part of IEEE 754, the standard that defines how floating point numbers work. The actual value of a float can be NaN, without needing to wrap the fundamental value type in something like Option. Typically, NaN is returned for floating point operations that have no sensible value, such as inf - inf, 0÷0, sqrt(-1), etc. NaN actually does usually indicate that something has gone wrong, because it means that there *was* a value, but at some point, something was done to it that was invalid.
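A small Python sketch of the distinction drawn in this reply, and of why the two terms get blurred in pandas. The variable names are illustrative.

```python
import math

import numpy as np
import pandas as pd

nan = float("inf") - float("inf")   # an IEEE 754 operation with no sensible result
print(math.isnan(nan))              # True
print(nan == nan)                   # False: NaN never compares equal, even to itself
print(None is None)                 # True: None is a single Python object, not a float

# pandas treats both as "missing", which is why the words are often used loosely.
s = pd.Series([1.5, None, np.nan])
print(s.dtype)                      # float64: None was converted to NaN
print(s.isna().tolist())            # [False, True, True]
```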

  • @kingki1953
    @kingki1953 6 months ago

    Thank you, sir, for the help with time series analysis

  • @thebreath6159
    @thebreath6159 11 months ago +3

    Actually the best data science channel. These tips help you choose a solution for missing info

  • @oludelehalleluyah6723
    @oludelehalleluyah6723 1 year ago +1

    I love the interpolate method

  • @koen._.
    @koen._. 5 months ago +2

    Why is having gaps in time data a bad thing? If we drop all NaN values, I don't see why this would cause problems. Can anyone please explain?

    • @Chrisallengallery
      @Chrisallengallery 3 months ago

      With a temperature graph, for example, deleting time slices will affect the overall average temp. Interpolating the missing temps is fine to do in time-series datasets. Not perfect but neither are the prediction models.
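A toy illustration of the point in this reply, with invented numbers: if the gap falls in the warm part of the day, dropping those rows pulls the average down, while interpolating keeps the warm period represented (as estimates, not measurements).

```python
import numpy as np
import pandas as pd

# Hypothetical equally spaced readings; two warm readings are missing.
temps = pd.Series([10.0, 10.0, 10.0, 10.0, 20.0, np.nan, np.nan, 20.0])

dropped_mean = temps.dropna().mean()        # gap rows simply vanish
interp_mean = temps.interpolate().mean()    # gap rows are estimated as 20.0

print(round(dropped_mean, 2), interp_mean)
```

Neither number is the true average; the sketch only shows that dropping rows is not a neutral choice for time-indexed data.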

  • @hieu8276
    @hieu8276 5 months ago +5

    Side note: blindly filling is NOT acceptable. It is best to estimate the missing values based on existing information. Interpolation is a good option.

  • @hownottoplay393
    @hownottoplay393 4 months ago

    Absolutely perfect.

  • @bradleym494
    @bradleym494 3 months ago

    There is a sixth way, tell the data engineer to build a better pipeline

  • @repeatbot
    @repeatbot 2 months ago

    The true approach:
    1. Determine why the data was lost: defective measuring equipment, data corruption, etc.
    No point in data if the measurement is bad.
    2. Think thoroughly about the topic/domain of research.
    For instance, the temperature of what: air in a specific city? Then you could probably find the missing values online from the forecast.
    An experiment? You'd probably have to repeat it anyway.
    3. If you can't find other sources, only then should you choose these methods. And still think about what is measured and why the data went missing.

  • @rizkamilandgamilenio9806
    @rizkamilandgamilenio9806 10 months ago

    Great video, sir! Which one do you think is the best solution for time series data?

  • @user-vv2yz2ht4l
    @user-vv2yz2ht4l 1 year ago

    the last one is very useful!!

  • @jaideepsingh564
    @jaideepsingh564 1 year ago +1

    Doing awesome job.... Thanx a lot....

    • @robmulla
      @robmulla  1 year ago

      Thanks for watching 🙏

  • @blender_wiki
    @blender_wiki 5 months ago

    Be careful: in many situations you must write your own interpolation algorithm if you have to fill more than one interval in the data set. And remember, in 2023 you can easily generate meaningful synthetic data to fill gaps in a dataset.

  • @killthem7414
    @killthem7414 4 months ago

    I'll use this some day

  • @bradleyfrueh2761
    @bradleyfrueh2761 11 months ago

    Great video! Is there a way of doing multiple imputation methods in python? Is that what interpolate is doing?

  • @parthibank280
    @parthibank280 1 year ago +2

    Can you mention the scenarios in which interpolate is used in real-life projects?

    • @robmulla
      @robmulla  1 year ago +3

      In developing a predictive model on sensor data that has missing data.

  • @niklasln
    @niklasln 1 year ago +1

    Useful for the pog champs sleep dataset 😅

    • @robmulla
      @robmulla  1 year ago

      Could be! But trust cross validation.

  • @and_rotate69
    @and_rotate69 3 months ago

    In my field of study (thermal engineering), we don't really have any data scientists or even Python gurus, just some script kiddies who can barely copy or understand code given by AI. One master's thesis used AI to determine the best configuration for the required humidity of a room. After collecting the data, the student and the supervisor didn't really preprocess it and did the unforgivable: zeroing all the NaNs. When they fed the data into an ANN, it gave the worst results ever, and neither of them knew why, so they ended up dropping the idea of using AI because 'humans can't be replaced yet'.

  • @izzythetechguy7365
    @izzythetechguy7365 5 months ago

    Awesome content

  • @jaigoyal7970
    @jaigoyal7970 1 year ago

    which method is best... or which method should be used in which case... it always confuses me.. pls answer..
    nice content btw... keep hustling!

  • @ifeanyinwobodo8530
    @ifeanyinwobodo8530 11 months ago

    Amazing content as usual.
    How do we use interpolation to fill missing values in panel data?

    • @ifeanyinwobodo8530
      @ifeanyinwobodo8530 11 months ago

      Please, can you reply my comments?🙏🙏🙏
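The panel-data question above has no reply in the thread. One common pandas pattern, sketched here under assumptions (the frame `panel` and its columns `city`, `hour`, and `temp_c` are invented), is to interpolate within each entity so values never leak across the boundary between two entities.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: two cities, three hourly readings each.
panel = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "hour": [0, 1, 2, 0, 1, 2],
    "temp_c": [10.0, np.nan, 14.0, np.nan, 32.0, 30.0],
})
panel = panel.sort_values(["city", "hour"])

# Naive: interpolating the whole column mixes city A's last value into city B.
naive = panel["temp_c"].interpolate()

# Grouped: each city is interpolated on its own.
grouped = panel.groupby("city")["temp_c"].transform(lambda s: s.interpolate())

print(pd.DataFrame({"naive": naive, "grouped": grouped}))
```

In the grouped result the first reading for city B stays missing, because there is no earlier value in that city to interpolate from; whether to back-fill it is a separate decision.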

  • @siqueirapaty
    @siqueirapaty 1 year ago

    So useful 🎉

  • @souravbarua3991
    @souravbarua3991 1 year ago +1

    Very useful information. Thank 🙏 u

    • @robmulla
      @robmulla  1 year ago

      Glad you found it helpful.

  • @kinghezzy
    @kinghezzy 11 months ago +1

    Your channel is totally different. You bring to light things that people don't even know exist in many libraries. Big ups to you.

    • @robmulla
      @robmulla  11 months ago +1

      Thanks for telling me that. That’s exactly what I’m trying to do. Thanks!

  • @TildAlice
    @TildAlice 9 months ago

    Thank you very much 😊

  • @Lawh
    @Lawh 2 months ago

    Is there a difference with making your own system to handle a task or using a function that is inherent in the language?

  • @soccer9199
    @soccer9199 10 months ago

    In which situation should we use interpolate?
    Let's say I have 15% missing values in a column. Also, which one is better: interpolate or KNNImputer?

  • @sanfera5644
    @sanfera5644 10 months ago +2

    I would just use interpolation, to be honest. I work with sensor data, and the values rarely jump. Looking through generic data streams and daily records in general, it usually becomes easy to simply interpolate the data. It is easy, versatile, and works for waaaay more things than just missing data in a graph.

  • @Shack263
    @Shack263 4 months ago

    How do we account for Greenland? Serious question.

  • @aviliocarcamo6454
    @aviliocarcamo6454 1 year ago +1

    Hey! What if we're working with hourly data and there's a full month missing? (PS: thanks for your content)

    • @robmulla
      @robmulla  1 year ago +1

      Ouch. That’s a lot of missing data. It depends on how many years you have and if you feel comfortable imputing values from previous years. Also, what are you using the data for- model building or descriptive analysis? Lots of context needed to answer your question.

  • @arnabmukherjee3129
    @arnabmukherjee3129 8 months ago +1

    Awesome

  • @joaovmlsilva3509
    @joaovmlsilva3509 2 months ago

    Skip > . You just fill up data when there's a recognized pattern

  • @K-mk6pc
    @K-mk6pc 1 year ago +1

    Awesome interpolate.

  • @user-ug6kk5ux5q
    @user-ug6kk5ux5q 1 month ago

    When he says interpolate, does he refer to regression?

  • @MarcioSouza1
    @MarcioSouza1 1 month ago

    Missing data? Make it up! :D

  • @nanolog522
    @nanolog522 3 months ago

    How are missing time values worse than fabricating new data points?

  • @poisonza
    @poisonza 5 months ago

    interpolate ... use loess for temperature kinda data

  • @Oldmoney.777
    @Oldmoney.777 6 months ago

    How to do the interpolation in R?

  • @JackTheAwesomeKnot
    @JackTheAwesomeKnot 3 months ago

    The data is taken every hour, so interpolation will be fine. But personally, i would just drop the NaNs.

  • @ardalankhalil
    @ardalankhalil 1 year ago

    why not using interpolate all the time, wouldn’t it solve every case you mentioned?

  • @weregoat529
    @weregoat529 3 months ago

    6. They just go home.

  • @ElinLiu0823
    @ElinLiu0823 1 year ago +1

    Do ffill and bfill only work with linear variables?

    • @robmulla
      @robmulla  1 year ago

      What do you mean by linear? It works for all data types, I believe.

    • @ElinLiu0823
      @ElinLiu0823 1 year ago

      @robmulla Situations such as linear regression data, or tasks like stock prediction that need an ARMA model and timestamps, sir.

  • @emilwandel
    @emilwandel 4 months ago +1

    why is there a need to fill missing data?

  • @harisjaved1379
    @harisjaved1379 4 months ago

    Man I just do fillna(“it’s too hot”)

  • @JonathanBiemond
    @JonathanBiemond 1 year ago

    You mean this whole time I've been taking the mean of ffill and bfill, but I could have just used interpolate!!??
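A quick check of this idea on made-up numbers: averaging `ffill` and `bfill` matches the default linear `interpolate` for a single missing row, but the two diverge once a gap spans more than one row.

```python
import numpy as np
import pandas as pd

# One single-row gap, then one two-row gap.
s = pd.Series([10.0, np.nan, 20.0, np.nan, np.nan, 50.0])

averaged = (s.ffill() + s.bfill()) / 2   # every row in a gap gets the same midpoint
linear = s.interpolate()                 # rows in a gap follow a straight line

print(pd.DataFrame({"mean_of_fills": averaged, "interpolate": linear}))
```

For the two-row gap the averaged version gives 35.0 twice, while interpolation gives 30.0 and then 40.0.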

  • @brayanyamiddelvallemazo6614
    @brayanyamiddelvallemazo6614 5 months ago

    What about knn imputation?

  • @djstacktrace
    @djstacktrace 3 months ago

    This gets the job done, but obviously it’s inaccurate, as the missing value could’ve been a zero, or could’ve been orders of magnitude higher and still valid. We will never know because the data is missing.

  • @jahuuuuuuu4608
    @jahuuuuuuu4608 4 months ago +1

    That's the difference between a „data scientist“, a computer scientist, and a real data scientist, a statistician. You can't just fill missing data with surrounding data, no matter how well you interpolate it. If it's missing randomly, just delete it, and if not, you're fucked. Because if it's missing because it's a certain value and you still analyse the data, there won't be a valid result

  • @idiot528
    @idiot528 4 months ago

    Python devs be like: so we import pandas to do all our work for us.

    • @Chrisallengallery
      @Chrisallengallery 3 months ago

      Why do you think it's that simple? Because it is not.

  • @kr-sd3ni
    @kr-sd3ni 4 months ago +1

    or just u know... remove it from the data and plot the rest?

  • @jaredthomas9246
    @jaredthomas9246 2 months ago

    This helps a lot in a real business situation

  • @iettptt
    @iettptt 10 months ago

    I have so much to learn yet just can’t :/

  • @NapsterCEO007
    @NapsterCEO007 2 months ago

    🎉

  • @addictedyounoob3164
    @addictedyounoob3164 4 months ago +1

    Or you could leave the data as is and not make it seem better than it is. Why are values missing in the first place?

  • @larkohiya
    @larkohiya 4 months ago +1

    Don't fabricate data. If no data was available, leave the empty spots: that's relevant and says something different than if you fabricate a trend

    • @robmulla
      @robmulla  4 months ago

      In what context? In machine learning, imputation is common and sometimes required, depending on the algorithm.

    • @MrGeometres
      @MrGeometres 4 months ago

      @robmulla If you have missing data, don't use an algorithm that can't deal with missing data, simple as that.

  • @Ramog1000
    @Ramog1000 4 months ago

    Isn't NaN "not a number", and not null?

  • @method341
    @method341 1 year ago +1

    What's the interpolate code to get the average of the previous and next values? Or is the default method basically it?

    • @robmulla
      @robmulla  1 year ago

      You can select from different interpolation methods. The default is linear.
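To make the reply concrete, here is a small sketch comparing the default `method="linear"` with `method="time"` on made-up, irregularly spaced timestamps. With the default, a single missing row lands exactly halfway between its neighbours, which is the average the question asks about.

```python
import numpy as np
import pandas as pd

# Hypothetical readings at 00:00, 01:00, and 04:00; the middle one is missing.
idx = pd.to_datetime(["2023-04-07 00:00", "2023-04-07 01:00", "2023-04-07 04:00"])
s = pd.Series([10.0, np.nan, 18.0], index=idx)

by_position = s.interpolate()             # default "linear": treats rows as equally spaced
by_time = s.interpolate(method="time")    # weights by the actual time between rows

print(by_position.iloc[1])   # halfway between 10 and 18
print(by_time.iloc[1])       # one quarter of the way, since 01:00 is 1h into a 4h gap
```

pandas offers further methods (several of them require SciPy), so which one fits depends on the data.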

  • @shapelessed
    @shapelessed 4 months ago

    Well, technically, if you were to use JS, then... You know... NaN would be a number, not null... And null, ugh... Would be an object...

  • @awsomegadgetguy7191
    @awsomegadgetguy7191 1 month ago

    Alright this confuses me, why are we screwing with the data again?

  • @dipankarnandi7708
    @dipankarnandi7708 1 year ago +1

    Rob, quick question. If consecutive rows have missing values, should we use ffill or interpolate?
    I think interpolate would be better. What are your thoughts?

    • @robmulla
      @robmulla  1 year ago

      It depends on the data and what you are trying to do. If the values are continuous then interpolation can help

  • @NeArMe.
    @NeArMe. 1 year ago +1

    So when we use the best?

    • @robmulla
      @robmulla  1 year ago

      It depends on the data and what you are trying to do.

    • @NeArMe.
      @NeArMe. 1 year ago

      @robmulla Can you explain each one?

  • @dominicduncan9895
    @dominicduncan9895 4 months ago

    is this in pandas?

  • @yellowrose0910
    @yellowrose0910 4 months ago +1

    "Five Ways to Cook Your Data". Highly NOT recommended.
    Oh and never believe math from someone who uses 'x' both as a variable and a multiplication sign.

  • @krypton_17
    @krypton_17 1 year ago +1

    Is it only me who can't see gaps in the plot?

    • @robmulla
      @robmulla  1 year ago

      Yea. They are kind of hard to see I agree.

  • @TheScriptPunk
    @TheScriptPunk 5 months ago

    Don't have gaps. Ez

  • @zachmanifold
    @zachmanifold 2 months ago

    You need to understand the NULL values before you do anything with them. Are they TRULY outliers or is there a specific reason they're there? In some cases there's a real reason they're present and can drastically change your analysis or how you end up modelling something. I'm experiencing this right now with a project at work, coincidentally.

  • @codywohlers2059
    @codywohlers2059 2 months ago

    Let me just insert some missing values.... Look! climate change!

  • @AryanPatel-wb5tp
    @AryanPatel-wb5tp 5 months ago

    imputation methods

  • @nguyenphithoaihung
    @nguyenphithoaihung 5 months ago

    Always interpolate. Got it!

  • @arthurs5099
    @arthurs5099 3 months ago

    Just learn 17th century maths ->interpolation. Otherwise crazy skills to call this function

  • @theslaygamer1784
    @theslaygamer1784 4 months ago

    Awesome, but NaN is not NULL

  • @redandgreenchristmasblues23
    @redandgreenchristmasblues23 4 months ago

    This is malpractice

  • @minymaker
    @minymaker 4 months ago

    I hate this video and represents everything wrong with people dumbing down data science. If you aren’t able to figure out this stuff by simply looking at the documentation for fillna, then you are not a data scientist. Any data scientist knows what interpolation means, you don’t need to flash the equation on the screen to make it seem complicated

  • @nicolaslamour8712
    @nicolaslamour8712 11 months ago

    newdf = df.dropna() problem solved 😂😂

    • @robmulla
      @robmulla  11 months ago +1

      Sometimes, yes.

    • @nicolaslamour8712
      @nicolaslamour8712 11 months ago

      @robmulla thanks for your short, bite-size lectures that stick insanely well :)

  • @user-co6ww2cm9k
    @user-co6ww2cm9k 4 months ago

    this doesn't feel very scientific!

  • @timothyhoytbsme
    @timothyhoytbsme 5 months ago +1

    Ah yes, how to bastardise science.

    • @Chrisallengallery
      @Chrisallengallery 3 months ago

      How to not understand data science.