Real World Data Cleaning in Python Pandas (Step By Step)

  • Published on 18 Jun 2023
  • In this video, I show you how to clean up data with Python Pandas in a Jupyter notebook. This Python tutorial is great for those trying to get into Data Analytics or Data Science.
    Cricket Data: www.espncricinfo.com/records/...
    Everything is coded in Python Pandas inside a Jupyter notebook.
    Interested in discussing a Data or AI project? Feel free to reach out via email or simply complete the contact form on my website.
    📧 Email: ryannolandata@gmail.com
    🌐 Website & Blog: ryannolandata.com/
    🍿 WATCH NEXT
    Python for Data Analyst and Scientists Playlist: • Python Tutorials
    Python Groupby: • The Complete Guide to ...
    Python Pandas Interview Questions: • 23 Python Pandas Codin...
    Python Lambda Functions: • Python Pandas Lambda F...
    MY OTHER SOCIALS:
    👨‍💻 LinkedIn: / ryan-p-nolan
    🐦 Twitter: / ryannolan_
    ⚙️ GitHub: github.com/RyanNolanData
    🖥️ Discord: / discord
    📚 *Practice SQL & Python Interview Questions: stratascratch.com/?via=ryan
    WHO AM I?
    As a full-time data analyst/scientist at a fintech company specializing in combating fraud within underwriting and risk, I've transitioned from my background in Electrical Engineering to pursue my true passion: data. In this dynamic field, I've discovered a profound interest in leveraging data analytics to address complex challenges in the financial sector.
    This YouTube channel serves as both a platform for sharing knowledge and a personal journey of continuous learning. With a commitment to growth, I aim to expand my skill set by publishing 2 to 3 new videos each week, delving into various aspects of data analytics/science and Artificial Intelligence. Join me on this exciting journey as we explore the endless possibilities of data together.
    *This is an affiliate program. I may receive a small portion of the final sale at no extra cost to you.
  • Science & Technology

Comments • 79

  • @ArmanKHAN-bj9iv 1 year ago +3

    Fantastic tutorial! Your step-by-step guide on data cleaning in Python Pandas was excellent. Clear explanations and practical examples made it easy to follow along. Looking forward to more of your uploads. Keep up the great work!

    • @RyanNolanData 1 year ago

      Thank you! I’ll have another Python video up this week as well as more coming soon!

  • @nickdaboss03 1 year ago +7

    you work super hard and put out really good content. Keep it up man, I'm looking forward to watching you grow!

    • @RyanNolanData 1 year ago

      Thank you! Have another video ready to go later this week as well as 90% done with another Python interview question video.

  • @AJAY7509 2 months ago

    This video really helped me, man. I was trying to learn about pandas, and it popped up in my notifications. Thanks for the video.

    • @RyanNolanData 2 months ago

      No problem. Check out my other pandas vids; I have a full playlist.

  • @koo5867 3 months ago +2

    Now that's some cool content. This is exactly what I wanted. Thanks, bro 🙏🏼 Keep helping poor students like us! 😌

  • @Al-Ahdal 1 month ago

    @Ryan Nolan: Excellent Video. Very clearly explained. I'm looking forward to watching you grow!

  • @tapspasi2319 2 months ago

    Amazing! Very good presentation

  • @Tewhi69 16 days ago

    Mistakes always help me learn because it forces me to recall new/old knowledge. Depending on how common the mistake was (>3) I end up retaining it and auto check, rarely do I see that mistake again.

  • @tianbowen721 1 month ago

    Pretty Amazing :) and I'd say it's some dense content to fit in 40 mins ~~I learned a lot

  • @yankoshuan6225 8 months ago +1

    I guess by watching your videos while preparing my own portfolio, I am halfway there. Thanks a lot.

    • @RyanNolanData 8 months ago

      No problem. My first batch of classification vids is done; working on regression now.

  • @far3582 5 months ago

    I am trying to move away from R, and this is a great video. Thanks Ryan!

    • @RyanNolanData 5 months ago

      No problem, best of luck.

  • @pradeeppadeliya 1 year ago +1

    This is the best tutorial .... 👍👍👍👍👍👍👍👍👍👍👍👍👍👍

  • @nagamanickam6604 8 months ago +1

    Thank you, Ryan Nolan.

  • @prathmesh_jadhav8930 2 months ago +1

    Brother, you're doing awesome. Upload more videos related to data analysis.

    • @RyanNolanData 2 months ago +1

      I have a full playlist of 70ish vids! Working on more though

  • @lewismurigi3623 4 months ago

    This was so helpful. Thanks, man.

  • @satishharijan7280 7 months ago +1

    Nice lecture, bro. Thanks for this; it is a useful video for me.

  • @user-iu5nz2gy6l 3 months ago

    Thanks. I appreciate this tutorial. Just one question on Q5: why is the result already a DataFrame there, while we have to use to_frame for Q4? Thanks

  • @loydteds3944 1 month ago

    Your video is very helpful! One question though: how do you remove duplicates in high-dimensional data, let's say with 500 duplicates? Thanks
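
A minimal sketch for the duplicates question above, not taken from the video. The toy DataFrame and its column names are assumptions for illustration; `duplicated()` and `drop_duplicates()` are standard pandas methods and work the same way regardless of how many rows or columns are involved.

```python
import pandas as pd

# Hypothetical toy data; the real dataset's columns may differ.
df = pd.DataFrame({
    "Player": ["A", "B", "A", "C", "B"],
    "Runs": [100, 50, 100, 75, 50],
})

# duplicated() flags each row that repeats an earlier row, so sum() counts them.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)

# To decide duplicates from selected columns only, pass subset=.
deduped_by_player = df.drop_duplicates(subset=["Player"])
```

With many columns, `subset=` is usually the part worth thinking about, since it decides which columns define "the same row".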

  • @benayawilly6536 11 months ago

    good work. keep it up

    • @RyanNolanData 10 months ago

      Thank you! I just uploaded a new video

  • @Al-Ahdal 1 month ago

    @Ryan Nolan: Your videos are great indeed. It is requested to have a comprehensive series on "Data Analytics & Visualization". Thanks

    • @RyanNolanData 1 month ago

      I have a full data analyst playlist; check it out.

    • @Al-Ahdal 1 month ago

      @RyanNolanData, could you please link or point me to it? Thanks

    • @RyanNolanData 1 month ago

      @Al-Ahdal th-cam.com/play/PLcQVY5V2UY4JrrKi2bW7DdOD08shTs4QQ.html

  • @yvonnemukhono3566 1 month ago

    Very helpful.

  • @SuccessGossips 1 month ago +1

    The star means the batsman was not out when he made his highest score; you don't need to remove it.

  • @CaptionThisChallenge_ 14 days ago

    The * in Highest_Inns_Score means the player was not out in that inning.
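
One way to act on this point, sketched under assumptions: rather than discarding the `*`, record it in a separate boolean column before converting the score to a number. The `Not_Out` column name and the sample scores are hypothetical; `Highest_Inns_Score` is the column named in the comment above.

```python
import pandas as pd

# Hypothetical sample values; a trailing "*" marks a not-out innings.
df = pd.DataFrame({"Highest_Inns_Score": ["334", "400*", "299*", "270"]})

# Record the not-out information first (Not_Out is an assumed column name).
df["Not_Out"] = df["Highest_Inns_Score"].str.endswith("*")

# Then remove the literal "*" and convert the score to an integer.
df["Highest_Inns_Score"] = (
    df["Highest_Inns_Score"].str.replace("*", "", regex=False).astype(int)
)
```

This keeps the column numeric for calculations without losing what the star meant.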

  • @ArhamZaiem 27 days ago

    In the highest innings score, why didn't you use rstrip to remove the * instead of split?
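
A small comparison of the two approaches mentioned here, using made-up sample scores. Both `Series.str.split` and `Series.str.rstrip` are real pandas string methods; for values where the `*` only appears at the end, they should give the same result.

```python
import pandas as pd

# Hypothetical sample scores with a trailing "*".
scores = pd.Series(["334", "400*", "299*"])

# Approach in the video: split on "*" and keep the first piece.
via_split = scores.str.split(pat="*").str[0]

# Alternative from the comment: strip trailing "*" characters.
via_rstrip = scores.str.rstrip("*")

same_result = via_split.equals(via_rstrip)
```

The two only differ if a `*` appears in the middle of a value: split would cut everything after it, while rstrip would leave it in place.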

  • @mehrantavakoli6816 18 days ago

    👏👏👏❤❤

  • @user-bf9lq6bb5s 3 months ago

    Overall it was a great effort, and your hard work is much appreciated. I would like to know how to remove or drop null values from the columns.
    Thanks in advance

    • @RyanNolanData 3 months ago +1

      Look up dropna.

    • @user-bf9lq6bb5s 3 months ago

      @RyanNolanData Cheers, man. Any advice on how to remove the year from a column? For instance, if a column has both numeric and year values and I want to remove only the year (in a format like 2004).
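
For the null-value question that opens this thread, a minimal sketch of `DataFrame.dropna`. The toy data and column names are assumptions, not the video's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values.
df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Runs": [100, np.nan, 75],
    "Avg": [50.0, 40.0, np.nan],
})

# Drop every row that has a null in any column.
all_complete = df.dropna()

# Drop rows only when specific columns are null.
runs_present = df.dropna(subset=["Runs"])

# Drop columns (axis=1) that contain any null.
no_null_columns = df.dropna(axis=1)
```

`dropna` returns a new DataFrame, so assign the result back if the cleaned version should replace the original.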

  • @salfrat55 1 year ago

    "FS Jackson played for Cambridge University, Yorkshire and England. He spotted the talent of Ranjitsinhji when the latter, owing to his unorthodox batting and his race, was struggling to find a place for himself in the university side, and as captain was responsible for Ranji's inclusion in the Cambridge First XI and the awarding of his Blue. According to Alan Gibson this was "a much more controversial thing to do than would seem possible to us now". He was named a Wisden Cricketer of the Year in 1894.
    He captained England in five Test matches in 1905, winning two and drawing three to retain The Ashes. Captaining England for the first time, he won all five tosses and topped the batting and bowling averages for both sides, with 492 runs at 70.28 and 13 wickets at 15.46. These were the last of his 20 Test matches, all played at home as he could not spare the time to tour."

    • @RyanNolanData 1 year ago +1

      Didn't know this; it's a really cool story. Like Branch Rickey in baseball.

  • @khan07700 1 month ago

    Sir, when importing the data from the site into a table, I'm not getting the Table 0 option shown at 1:54. What's the solution?

  • @davideschreiber2821 8 months ago

    Lots of good stuff here, but I finally gave up at 31:24. If you're confused about what's happening, imagine how confused we learners are as you bounce around from cell to cell copying-pasting-deleting-trying again, trying to figure things out.

    • @RyanNolanData 8 months ago

      Bugs are part of programming and no one is perfect. I show how it’s solved and why it happens

  • @pavankalyan_297 8 months ago

    The star in the Highest score column means they were not out till the end of the match. Great tutorial, Ryan. Would it be possible for you to attach the notebook file here?

    • @RyanNolanData 8 months ago +1

      Thank you, and I can look at adding the code to GitHub this weekend.

  • @nessim.liamani 9 days ago

    Hi Ryan,
    I'd like to understand how you would have treated a file with millions or tens of millions of lines to spot those "*", "-" and "+" characters.
    You spotted them here manually by eye.
    Can anyone help me figure that out?
    Thanks
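
One possible answer to the question above, sketched with assumed sample data: instead of scanning by eye, let pandas report which values are not purely numeric and which stray characters occur. The `Inns` column name comes from a traceback elsewhere in these comments; the values are made up.

```python
import pandas as pd

# Hypothetical sample; in practice this would be the full column.
df = pd.DataFrame({"Inns": ["80", "52*", "-", "70+", "33"]})

# True where the whole value consists of digits only.
is_clean = df["Inns"].str.fullmatch(r"\d+", na=False)

# Rows that need attention.
suspect_rows = df[~is_clean]

# Tally every non-digit character found in the column.
stray_counts = (
    df["Inns"].str.findall(r"[^\d]").explode().dropna().value_counts()
)
```

The same checks are vectorized, so they scale to large columns. For files too big for memory, `pd.read_csv` accepts a `chunksize` argument so the tally can be built up chunk by chunk.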

  • @MrFravallec 4 months ago

    Great tutorial. I got this error at the data types step:
    AttributeError Traceback (most recent call last)
    Cell In[11], line 1
    ----> 1 df['Inns']= df["Inns"].str.split(pat = '*').str[0]
    File ~\anaconda3\Lib\site-packages\pandas\core\generic.py:5902, in NDFrame.__getattr__(self, name)
    5895 if (
    5896 name not in self._internal_names_set
    5897 and name not in self._metadata
    5898 and name not in self._accessors
    5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
    5900 ):
    5901 return self[name]
    -> 5902 return object.__getattribute__(self, name)
    File ~\anaconda3\Lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180 # we're accessing the attribute of the class, i.e., Dataset.geo
    181 return self._accessor
    --> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)
    File ~\anaconda3\Lib\site-packages\pandas\core\strings\accessor.py:181, in StringMethods.__init__(self, data)
    178 def __init__(self, data) -> None:
    179 from pandas.core.arrays.string_ import StringDtype
    --> 181 self._inferred_dtype = self._validate(data)
    182 self._is_categorical = is_categorical_dtype(data.dtype)
    183 self._is_string = isinstance(data.dtype, StringDtype)
    File ~\anaconda3\Lib\site-packages\pandas\core\strings\accessor.py:235, in StringMethods._validate(data)
    232 inferred_dtype = lib.infer_dtype(values, skipna=True)
    234 if inferred_dtype not in allowed_types:
    --> 235 raise AttributeError("Can only use .str accessor with string values!")
    236 return inferred_dtype
    AttributeError: Can only use .str accessor with string values!

    • @Muhammad.Kashif31 3 months ago +1

      Your column may contain integer data; that's why you are getting the error.
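
A sketch of the situation described in this thread, assuming the `Inns` column was read in as integers (the sample values are made up). The `.str` accessor only works on string data, so one workaround is to convert the column to strings first.

```python
import pandas as pd

# Hypothetical case: the column arrived as integers, not strings.
df = pd.DataFrame({"Inns": [80, 52, 70]})

# Calling df["Inns"].str.split(...) here would raise:
# AttributeError: Can only use .str accessor with string values!
if not pd.api.types.is_string_dtype(df["Inns"]):
    # Note: astype(str) also turns missing values into the text "nan".
    df["Inns"] = df["Inns"].astype(str)

# Now the line from the traceback runs; convert back to int afterwards.
df["Inns"] = df["Inns"].str.split(pat="*").str[0].astype(int)
```

If the column is already numeric and contains no `*`, the split step is not needed at all; checking `df.dtypes` first shows which columns actually require cleaning.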

  • @kadircalloglu2848 9 days ago

    Why didn't we use SQL after the types were changed?

  • @user-vl3hm9hv3x 1 month ago

    Bro, you should have used the replace method with regex to clean characters such as * and + from the columns.

    • @RyanNolanData 1 month ago

      I used regex in my latest project and have a video coming out on it soon, funny enough.
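
A minimal sketch of the regex approach suggested in this thread, not the code from the video. The sample values are assumptions; `Series.str.replace` with `regex=True` is the pandas method being illustrated.

```python
import pandas as pd

# Hypothetical sample values containing stray "*" and "+" characters.
df = pd.DataFrame({
    "Highest_Inns_Score": ["400*", "334"],
    "Inns": ["80+", "52"],
})

# One character class removes every "*" and "+" wherever it appears.
for column in ["Highest_Inns_Score", "Inns"]:
    df[column] = df[column].str.replace(r"[*+]", "", regex=True)
```

Inside a character class the `*` and `+` are treated as literal characters, so no escaping is needed.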

  • @hemantsharma-xf3ub 3 months ago

    Where can I get the notes?

  • @VladislavShishkin11 8 months ago +2

    I completed the project, but when I reopened it today all the code was still there, yet typing df showed the old, uncleaned table. How do I make sure this doesn't happen again?

    • @RyanNolanData 8 months ago

      I'll add my code to GitHub this weekend.

    • @marcus.the.younger 1 month ago

      save the cleaned data
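
A sketch of what saving the cleaned data could look like. Notebook variables live only in the running kernel, so after reopening a notebook `df` holds whatever the re-run cells produce. The file name and toy data below are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned DataFrame at the end of the notebook.
df = pd.DataFrame({"Player": ["A", "B"], "Runs": [100, 50]})

# Hypothetical output path; any writable location works.
path = os.path.join(tempfile.mkdtemp(), "cleaned_cricket_data.csv")

# index=False avoids writing the row index as an extra column.
df.to_csv(path, index=False)

# In a later session, load the cleaned file instead of the raw one.
reloaded = pd.read_csv(path)
```

The other option is to re-run all cells from the top after reopening, which rebuilds the cleaned `df` from the raw data.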

  • @tasmisa6778 22 days ago

    How am I supposed to know that all those letters stand for the names you just gave them?

  • @taha5754 23 days ago

    Can you share the notebook used in this tutorial? @RyanNolanData

    • @RyanNolanData 23 days ago

      I need to make a website article on this. It’ll have the code in there

  • @rajareddyraju6773 3 months ago

    19:09

  • @sachinnambiar 4 months ago +1

    It's a dictionary, right? Not a list.
    #rename multiple columns in a dictionary
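
The commenter's point can be illustrated with a short sketch: `DataFrame.rename` takes a dictionary that maps old column names to new ones. The abbreviated source names below are assumptions; only `Highest_Inns_Score` appears elsewhere in these comments.

```python
import pandas as pd

# Hypothetical abbreviated column names as they might come from the source table.
df = pd.DataFrame({"Mat": [52], "NO": [10], "HS": [334]})

# rename multiple columns using a dictionary: {old_name: new_name}
df = df.rename(columns={"NO": "Not_Outs", "HS": "Highest_Inns_Score"})
```

Columns not listed in the dictionary, such as `Mat` here, keep their original names.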

  • @dogzrgood 8 months ago

    Star * means the batsman was not out 😊

    • @RyanNolanData 8 months ago

      I appreciate it. Didn’t know

    • @Al-Ahdal 1 month ago

      @RyanNolanData, yes, * means the batsman was not out, but it won't affect any calculations. Great work indeed.

  • @salfrat55 1 year ago

    Headley @4 min mark 😂😁

    • @RyanNolanData 1 year ago +1

      Haha one day I’ll buy your dup

  • @yutomidorya459 19 days ago

    lol on excel is better lol