How do I find and remove duplicate rows in pandas?

  • Published Sep 5, 2024

Comments • 233

  • @fredcalo
    @fredcalo 7 years ago +2

    I spent hours trying to figure this stuff out through reading chapters and chapters in Python books. Then I come here, and everything I was trying to figure out was explained in 9 minutes. This was IMMENSELY helpful, thanks!

    • @dataschool
      @dataschool 7 years ago

      Awesome!! That's so great to hear!

  • @mea97905
    @mea97905 8 years ago +26

    I like your concise and precise videos. I really appreciate your efforts.

    • @dataschool
      @dataschool 8 years ago +3

      Thanks, I appreciate your comment!

  • @jordyleffers9244
    @jordyleffers9244 4 years ago +5

    lol, just when I felt you wouldn't handle the exact subject I was looking for: there came the bonus! Thanks!

  • @reubenwyoung
    @reubenwyoung 5 years ago +3

    Thanks so much for this! You helped me combine 629 files and remove 250k duplicate rows!
    You're the man! *Subscribed*

    • @dataschool
      @dataschool 5 years ago +1

      Great to hear! 😄

  • @shashwatpaul3330
    @shashwatpaul3330 4 years ago +1

    I have watched a lot of your videos, and I must say that the way you explain is really good. Just to inform you, I am new to programming, let alone Python.
    I want to learn a new thing from you. Let me give you a brief. I am working on a dataset to predict App Rating from Google Play Store. There is an attribute by name "Rating" which has a lot of null values. I want to replace those null values using a median from another attribute by name "Reviews". But I want to categorize the attribute "Reviews" in multiple categories like:
    1st category would be for the reviews less than 100,000,
    2nd category would be for the reviews between 100,001 and 1,000,000,
    3rd category would be for the reviews between 1,000,001 and 5,000,000 and
    4th category would be for the reviews anything more than 5,000,000.
    Although, I tried a lot, I failed to create multiple categories. I was able to create only 2 categories using the below command:
    gps['Reviews Group'] = [1 if x
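The comment is cut off, but the four-way binning it describes can be sketched with pd.cut rather than a chained list comprehension. The `gps` DataFrame and its values below are hypothetical stand-ins for the Google Play Store data mentioned:

```python
import pandas as pd

# Hypothetical stand-in for the Google Play Store "Reviews" column.
gps = pd.DataFrame({'Reviews': [50_000, 500_000, 2_000_000, 9_000_000]})

# pd.cut assigns every review count to one of the four categories at once.
# Bins are right-inclusive by default: (0, 100000], (100000, 1000000], etc.
bins = [0, 100_000, 1_000_000, 5_000_000, float('inf')]
gps['Reviews Group'] = pd.cut(gps['Reviews'], bins=bins, labels=[1, 2, 3, 4])
```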

  • @hongyeegan733
    @hongyeegan733 4 years ago

    wow! you are already teaching data science in 2014 when it is not even popular! Btw, your videos are really good, you speak slow and clear, easy to understand and for me to catch. Kudos to you!

    • @dataschool
      @dataschool 4 years ago

      Thanks very much for your kind words!

  • @Beny123
    @Beny123 6 years ago +3

    Thank you! here is a way to extract the non-duplicate rows df=df.loc[~df.A.duplicated(keep='first')].reset_index(drop=True)

    • @dataschool
      @dataschool 6 years ago

      Thanks for sharing!

  • @emanueleco7363
    @emanueleco7363 4 years ago

    You are the greatest teacher in the world

  • @MrTheAnthonyBielecki
    @MrTheAnthonyBielecki 6 years ago +1

    Exactly what I needed! Why not set up a Patreon so we can show some love?

    • @dataschool
      @dataschool 6 years ago

      Thanks for the suggestion! I am planning to set one up soon, and will let you know when it's live :)

    • @dataschool
      @dataschool 6 years ago +1

      I just launched my Patreon campaign! I'd love to have your support: www.patreon.com/dataschool/overview

  • @cyl1040
    @cyl1040 4 years ago +1

    I can now remove the duplicate data from my CSV file~~~ Thank you.
    However, I suggest you could do more in this video: show the resulting list after the deletion, such as:
    >> new_data=df.drop_duplicates(keep='first')
    >> new_data.head(24898)
    If you add that, I think this video will be even more perfect~~~

  • @tushargoyaliit
    @tushargoyaliit 5 years ago +1

    I'm from Punjab, studying at IIT, and even then I got my understanding of pandas only from your videos. Thanks!
    Please provide everything you've covered in text format, like a tutorial.

    • @dataschool
      @dataschool 5 years ago +1

      Is this what you are looking for?
      nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

  • @randyle2511
    @randyle2511 7 years ago

    I like the way you explain things... it's very clear and precise. My problem is a little more complex: I want to remove the entire row when it meets the following conditions.
    If any rows in Latitude column that has the same value as previous row (-1) AND the same row in the Longitude column that has the same values as previous row THEN remove the whole entire row that duplicated. Basically we have to compare two consecutive ROWS and COLUMNS and IF both conditions are met then remove the entire row. Let's say if there are 15 rows have the same values(i.e, If Lat[1,1] == Lat[0,1] & Lon[1,2] ==Lon [0,2] then remove, else skip, # Lat = Col1, Long = Col2) in both Latitude and Longitude columns then remove them all except keep one.
    Hope you got my points... :-). Looking forward to see your code.

    • @dataschool
      @dataschool 7 years ago

      Glad you like the videos! It's not immediately obvious to me how I would approach this problem, but I think that the 'shift' function from pandas might be useful. Good luck! Sorry that I can't provide any code.
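For what it's worth, the `shift` idea from the reply can be sketched like this. Column names come from the question; the data is hypothetical (rows 1 and 2 repeat row 0's coordinates):

```python
import pandas as pd

# Hypothetical GPS log with consecutive repeated coordinates.
df = pd.DataFrame({'Latitude':  [10.0, 10.0, 10.0, 11.0],
                   'Longitude': [20.0, 20.0, 20.0, 21.0]})

# A row is a consecutive duplicate when BOTH columns equal the previous row.
same_as_prev = (df['Latitude'].eq(df['Latitude'].shift()) &
                df['Longitude'].eq(df['Longitude'].shift()))
deduped = df[~same_as_prev].reset_index(drop=True)
```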

  • @minaha9213
    @minaha9213 2 years ago +1

    Just found your channel, watched this as my first of your videos, and pressed subscribe!!! Your explanation of the idea as a whole is very remarkable 😃 Thanks a lot.

  • @cradleofrelaxation6473
    @cradleofrelaxation6473 1 year ago

    This is so helpful!
    Pandas has the best duplicates handling. Better than spreadsheets and SQL.

  • @cablemaster8874
    @cablemaster8874 3 years ago +1

    Really, your teaching method is very good and your videos give so much knowledge. Thanks, Data School

    • @dataschool
      @dataschool 3 years ago

      You're very welcome!

  • @rashayahya
    @rashayahya 4 years ago +2

    I always find what I need in your channel.. and more... Thank you

    • @dataschool
      @dataschool 4 years ago +1

      Great to hear!

  • @supa.scoopa
    @supa.scoopa 5 months ago +1

    THANK YOU for the keep tip, that's exactly what I was looking for!

    • @dataschool
      @dataschool 5 months ago

      Great to hear!

  • @deki90to
    @deki90to 3 years ago +1

    HOW DO YOU KNOW WHAT I NEED? YOU ARE MY FAV TEACHER FROM NOW

    • @dataschool
      @dataschool 3 years ago

      Ha! Thank you! 😊

  • @balajibhaskarraokondhekar1823
    @balajibhaskarraokondhekar1823 3 years ago

    You have done a very good job of making DataFrames easy to understand, and you make it especially easy for people who work in Excel.
    Best wishes from me

  • @mariusnorheim
    @mariusnorheim 6 years ago

    How can I remove duplicate rows based on 2 column values?
    I want to drop a row if two column values are the same. E.g. I have one column with Country = [USA, USA, Canada, USA] and an income column with values = [1000, 900, 900, 900]. I only want to drop the duplicate where both the country AND the income is 900. While if one row has country = Canada and income = 900 and second row has USA with income 900 I want to keep them both. Answers appreciated!
    Your videos are really helpful for learning pandas. Keep up the good work!

    • @dataschool
      @dataschool 6 years ago

      Sorry, I'm not quite clear on what the rules are for when a row should be kept and when it should be dropped.
      Perhaps you could think of this task in terms of filtering the DataFrame, rather than using the drop duplicates functionality?

    • @mariusnorheim
      @mariusnorheim 6 years ago

      Thanks for the reply! I managed to improve my code to avoid the duplicates in the first place. Keep up your great work with the videos, really helpful for improving my skills!

    • @dataschool
      @dataschool 6 years ago

      Great to hear! :)
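For readers with the same question: drop_duplicates with a subset= list appears to cover the case described, because the listed columns are compared together as a combination. The data below is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'Canada', 'USA'],
                   'Income':  [1000, 900, 900, 900]})

# A row is dropped only when BOTH Country and Income repeat an earlier
# combination, so Canada/900 survives even though income 900 appeared before.
deduped = df.drop_duplicates(subset=['Country', 'Income'])
```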

  • @Kristina_Tsoy
    @Kristina_Tsoy 1 year ago +1

    Kevin your videos are super helpful! thank you!!!

    • @dataschool
      @dataschool 1 year ago

      You're very welcome!

  • @dhananjaykansal8097
    @dhananjaykansal8097 5 years ago

    I didn't find much in Duplicates. Thanks so much sir. I can't thank u enough.

  • @ranveersharma1666
    @ranveersharma1666 4 years ago

    love u brother . u r changing so many lives, thanku ....the best teacher award goes to Data school.

    • @dataschool
      @dataschool 4 years ago

      Thanks very much for your kind words!

  • @anthonygonsalvis121
    @anthonygonsalvis121 3 years ago

    Very methodical explanation

  • @chandrapatibhanuprakashap1862
    @chandrapatibhanuprakashap1862 2 years ago

    It helped me a lot. Can you explain how we get the count of each duplicated value?
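One way to get the count of each duplicated value, sketched on hypothetical data: group on all columns, count each group, and keep the groups that occur more than once.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame with duplicate rows.
users = pd.DataFrame({'age': [25, 25, 30, 25, 30, 40]})

# Group on every column and count the size of each group; groups larger
# than 1 are the duplicated values, with their total occurrence counts.
counts = users.groupby(list(users.columns)).size()
dup_counts = counts[counts > 1]
```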

  • @oeb5542
    @oeb5542 4 years ago +2

    A very much appreciated efforts. Thanks a million for sharing with us your python knowledge. It has been a wonderful journey with your precise explanation. keep the hard work! Warm regards.

    • @dataschool
      @dataschool 4 years ago

      Thanks very much! 😄

  • @jessicafletcher0610
    @jessicafletcher0610 1 year ago

    OMG I WANT TO THANK YOU SOOOO MUCH 😊 I've been on this problem for days, and the way you explain it makes it so much easier than how I learned it in class. I was so happy not to see that error message 😂 Thank you

    • @dataschool
      @dataschool 1 year ago

      You're so very welcome! Glad I could help!

  • @goldensleeves
    @goldensleeves 4 years ago

    At the end are you saying that "age" + "zip code" must TOGETHER be duplicates? Or are you saying "age" duplicates and "zip code" duplicates must remove their individual duplicates from their respective columns? Thanks
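As I read the video, it is the former: subset columns are evaluated together as one combination. A tiny demo on hypothetical data:

```python
import pandas as pd

users = pd.DataFrame({'age':      [25, 25, 25],
                      'zip_code': ['111', '111', '222']})

# subset= treats the listed columns TOGETHER: a row is a duplicate only
# when the whole (age, zip_code) pair repeats, not when either column
# repeats on its own (all three ages here are 25).
dupes = users.duplicated(subset=['age', 'zip_code'])
```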

  • @cafdo
    @cafdo 3 years ago

    Great video. This helped me tremendously.
    How would you go about finding duplicates "case insensitive" with a certain field?

  • @imad_uddin
    @imad_uddin 3 years ago +1

    Thanks a lot. It was a great help. Much appreciated!

  • @oasisgod1421
    @oasisgod1421 3 years ago

    Great video. But I'd like just to find a duplicate column and then go to another column and find the duplicate and go to another column and find the duplicate and remain only one row with certain information.

  • @alishbakhan1084
    @alishbakhan1084 1 year ago

    Thank you so much💕 your videos are really amazing...can you tell how to read any csv(without header on first line) and set first row with non null values as header...

  • @ravinduabeygunasekara833
    @ravinduabeygunasekara833 5 years ago +1

    Great video! Btw, how do you know all these stuff? Do you take classes or read books?

    • @dataschool
      @dataschool 5 years ago +6

      Work experience, reading documentation, trying things out, teaching, reading tutorials, etc.

  • @lindafl2528
    @lindafl2528 3 years ago

    hello, thank you for the video, I'm wondering if you can make some tutorials about the API requests

    • @dataschool
      @dataschool 3 years ago

      Thanks for your suggestion!

  • @deltatv9335
    @deltatv9335 5 years ago

    Hey Buddy, You are amazing and you remind me of Sheldon Cooper (BBT) because of the way you talk and also both of you are super smart. :-)
    One request- Please cover outliers sometime. Thanks.

    • @dataschool
      @dataschool 5 years ago

      Ha! Many people have commented something similar :) And, thanks for your topic suggestion!

  • @rationalindian5452
    @rationalindian5452 3 years ago +1

    Brilliant video .

  • @narbigogul5723
    @narbigogul5723 6 years ago

    That's exactly what I was looking for, great explanation, thanks for sharing!

  • @omgthisana10
    @omgthisana10 4 months ago

    very well explained ty !

    • @dataschool
      @dataschool 4 months ago

      You're very welcome!

  • @ItsWithinYou
    @ItsWithinYou 2 years ago

    If I have a dataframe with a million rows and 15 columns, how do I figure out if any column in my dataframe has mixed data types?
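One way to check for mixed-type columns, sketched on a tiny hypothetical frame: count the distinct Python types among each column's values.

```python
import pandas as pd

df = pd.DataFrame({'mixed': [1, 'two', 3.0], 'clean': [1, 2, 3]})

# For each column, map every value to its Python type and count distinct
# types; more than one type means the column holds mixed data.
type_counts = df.apply(lambda col: col.map(type).nunique())
mixed_cols = type_counts[type_counts > 1].index.tolist()
```

On a million rows this scans every value, so it may take a few seconds, but it is a one-off check.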

  • @brianwaweru9764
    @brianwaweru9764 3 years ago

    Wait Kevin, keep='first' means the rows marked as duplicated are the ones towards the bottom, i.e. they have a higher index. So keep='last' means...?? Oh man, I'm getting mixed up. Could someone please explain it to me? Kevin, please?
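For anyone else mixing these up, the three keep options can be checked directly on a tiny Series:

```python
import pandas as pd

s = pd.Series([1, 1, 1])

# keep='first': the first occurrence is kept (marked False), later ones True.
# keep='last':  the last occurrence is kept, earlier ones marked True.
# keep=False:   every member of a duplicate set is marked True.
first = s.duplicated(keep='first').tolist()
last = s.duplicated(keep='last').tolist()
neither = s.duplicated(keep=False).tolist()
```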

  • @JoshKelson
    @JoshKelson 5 years ago

    Trying to figure out how to replace values above/below a threshold with the mean or median. If I find values that are skewing the data from a column, but don't want to exclude the whole row and drop the row, I just want to replace the value in one of the columns with a mean/median value. Can't figure out how to do this! IE: I want to replace all values in column 'age' that are above 130 (erroneous data), with the mean age of all the other values in 'age' column.

    • @dataschool
      @dataschool 5 years ago

      I'm sorry, I don't know the code for this off-hand. However, this would be a great question to ask during one of my monthly live webcasts with Data School Insiders: www.patreon.com/dataschool (join at the "Classroom Crew" level to participate)
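One way to do what the question asks, sketched on hypothetical data (not necessarily the only approach): compute the mean of the plausible values, then overwrite the outliers with .loc.

```python
import pandas as pd

df = pd.DataFrame({'age': [20.0, 30.0, 40.0, 500.0]})

# Compute the mean of the plausible values only, then overwrite anything
# above the threshold with that mean.
valid_mean = df.loc[df['age'] <= 130, 'age'].mean()
df.loc[df['age'] > 130, 'age'] = valid_mean
```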

  • @arpitmittal7865
    @arpitmittal7865 4 years ago

    very useful videos.. can you please tell me how to find duplicate of just one specific row?

    • @dataschool
      @dataschool 4 years ago

      Sorry, I don't fully understand. Good luck!

  • @jeffhale739
    @jeffhale739 6 years ago +1

    Great video, Kevin! Super useful!

  • @rajoptional
    @rajoptional 3 years ago

    Amazing and thanks bro , the right place for data queries

  • @KaiZergTV
    @KaiZergTV 1 year ago

    Thank you so much, you made my day. Finally i found the row of code, that i really needed to finish my task:)(Code Line 17)

    • @dataschool
      @dataschool 1 year ago

      Glad I could help!

  • @halildurmaz7827
    @halildurmaz7827 3 years ago

    Clean and informative !

  • @mansoormujawar1279
    @mansoormujawar1279 7 years ago

    Because of your quality panda series I started following you. @duplicate - in my use case instead of drop duplicate I would like to keep 1st instance and just remove other duplicate values from specific column, so shape will remain same after removing duplicate values from column. Really appreciate if you got some time to answer this, thanks.

    • @dataschool
      @dataschool 7 years ago

      Glad you like the series! I'm not sure I understand your question - perhaps the documentation for drop_duplicates will help? pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

  • @asadghnaim2332
    @asadghnaim2332 3 years ago

    When I use the parameter keep=False I get a number of rows less than the first and last combined what is the reason of that??

  • @mahdibouaziz5353
    @mahdibouaziz5353 4 years ago

    you're amazing we need more videos in your channel

    • @dataschool
      @dataschool 4 years ago

      I do my best! I've got 20+ hours of additional videos available to Data School Insiders at various levels: www.patreon.com/dataschool

  • @MrMukulpandey
    @MrMukulpandey 1 year ago

    love to have more videos like this

    • @dataschool
      @dataschool 1 year ago +1

      Thanks for your support!

  • @VNTHOTA
    @VNTHOTA 5 years ago

    You should have used sort_values option with users.loc[users.duplicated(keep=False)].sort_values(by='age')

    • @dataschool
      @dataschool 5 years ago

      Thanks for your suggestion!

  • @antonyjoy5494
    @antonyjoy5494 3 years ago

    This is case of complete duplicates. So what should we do when we have to deal with incomplete duplicates..Ex age,gender and occupation same but zip is different..
    could you also make a video on that please..

  • @robind999
    @robind999 5 years ago

    simple and useful. thanks Kevin.

  • @ayatbadayatbad7688
    @ayatbadayatbad7688 5 years ago

    Thank you for this useful tutorial. Quick question, how do you check whether a value in column A is present in column B or not; not necessarily on the same row. It is like the samething that VLOOKUP function looks for in Excel. Many thanks for your feed-back!

    • @dataschool
      @dataschool 5 years ago

      I'm not sure I understand your question, I'm sorry!
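If the question is "which values of column A appear anywhere in column B", isin seems to cover it, much like a VLOOKUP existence check. The data below is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z'],
                   'B': ['y', 'q', 'x']})

# isin checks membership against the whole other column, regardless of row.
present = df['A'].isin(df['B'])
```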

  • @chandramohanbettadpura4993
    @chandramohanbettadpura4993 5 years ago

    I have some missing dates in my dataset and want to add the missing dates to the dataset. I used isnull() to track these dates but I don't know how to add those dates into my dataset..Can you please help.Thanks

    • @dataschool
      @dataschool 5 years ago

      You might be able to use fillna and specify a method: pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
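Another option for the missing-dates question: reindex against a complete date_range, which inserts the missing dates as NaN, then fill them (here with a forward-fill, as the fillna link suggests). The dates below are hypothetical:

```python
import pandas as pd

# Hypothetical daily series with Jan 2 missing.
s = pd.Series([1.0, 3.0],
              index=pd.to_datetime(['2024-01-01', '2024-01-03']))

# reindex inserts the missing date as NaN; ffill then fills it forward.
full = s.reindex(pd.date_range('2024-01-01', '2024-01-03')).ffill()
```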

  • @prakmyl
    @prakmyl 4 years ago +1

    Awesome videos Kevin. Thanks a to for the knowledge share.

  • @asifsohail5900
    @asifsohail5900 3 years ago

    How can we efficiently find near duplicates from a dataset?

  • @reazahmed7004
    @reazahmed7004 3 years ago

    How do I access iPython Jupyter Notebook link? it is not available in the github repository.

    • @dataschool
      @dataschool 3 years ago

      Is this what you were looking for? nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

  • @prakmyl
    @prakmyl 4 years ago

    I get an error when I run users.drop_duplicates(subset=['age','zip_code']).shape: "'bool' object is not callable". I get the same error even when I run users.duplicated().sum()

    • @dataschool
      @dataschool 4 years ago +1

      Remove the .shape, and see what the results look like. Also, compare your code against mine in this notebook: nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

  • @syyamnoor9792
    @syyamnoor9792 5 years ago +1

    you are a hero...

    • @dataschool
      @dataschool 5 years ago

      That's very kind of you! :)

  • @mmarva3597
    @mmarva3597 3 years ago

    Thank you for this content! I have a question : how can we handle quasi redundant values in different columns ? (Imagine two different columns each containing similar values ​​at 80%). Thanks a lot

    • @dataschool
      @dataschool 3 years ago

      When you say "handle", what is your goal? If you want to identify close matches, you can do what is called "fuzzy matching". Here's an example: pbpython.com/record-linking.html Hope that helps!

    • @mmarva3597
      @mmarva3597 3 years ago

      ​@@dataschool Merci beaucoup for the reply. Let me explain my question : I have two variables/features named categories (milk, snack,pasta,oil,etc) and categories_en(en:milk , en:snack, en: pasta). My goal is to keep only one feature since both features share the same information. It was suggested that running a chi square test would help me decide which feature to keep but it seems silly to me :( ( I have almost 2millions records)

    • @dataschool
      @dataschool 3 years ago +1

      It probably doesn't matter which feature you keep, if they contain roughly the same information.

  • @harshitagrwal9975
    @harshitagrwal9975 10 months ago

    The user ids are not the same, so how can the rows be duplicated?

  • @anantgosai8884
    @anantgosai8884 2 years ago +1

    That was so accurate, thanks a lot genius!

    • @dataschool
      @dataschool 2 years ago +1

      You're very welcome!

  • @somantalha4888
    @somantalha4888 2 years ago +1

    beneficial videos. ❤

  • @jamesdoone3516
    @jamesdoone3516 7 years ago

    Really great job. Thank you very much!!

  • @artistz1831
    @artistz1831 6 years ago

    Hey Kevin, I am confused for the drop duplicates here: the number of duplicated age and zipcode is 14; but after your drop the duplicates, the shape is 927. The total shape is 943, so the correct shape should be 943 - 14 = 929? Thanks a lot for your help!!!

    • @dataschool
      @dataschool 6 years ago

      I disagree with your statement "the number of duplicated age and zipcode is 14"... could you explain how you came to that conclusion? Thanks!

  • @DimasAnggaFM
    @DimasAnggaFM 4 years ago +1

    great video!!

  • @emilyyyjw
    @emilyyyjw 4 years ago +1

    Hi, I am wondering whether you could identify an issue that I am having whilst cleaning a dataset with the help of your tutorials. I will post the commands that I have used below:
    df["is_duplicate"]= df.duplicated() # make a new column with a mark of if row is a duplicate or not
    df.is_duplicate.value_counts()
    -> False 25804
    True 1591
    df.drop_duplicates(keep='first', inplace=True) #attempt to drop all duplicates, other than the first instance
    df.is_duplicate.value_counts() #
    -> False 25804
    True 728
    I am struggling to identify why there are still some duplicates that are marked 'True'?
    Kind regards,

    • @dataschool
      @dataschool 4 years ago +1

      That's an excellent question! The problem is that by adding a new column called "is_duplicate", you actually reduce the number of rows which are duplicates of one another! Instead of adding that column, you should first check the number of duplicates with df.duplicated().sum(), then drop the duplicates, then check the number of duplicates again. Hope that helps!

  • @harneetlamba9512
    @harneetlamba9512 5 years ago

    Hi, In the above video, at 1:12 minutes - the pandas DataFrame is displayed in Tabular form, with all the variables separated by vertical line. But in latest jupyter notebook, we get a single line below variable name. Can we get the same display as earlier, with new Jupyter version ?

    • @dataschool
      @dataschool 5 years ago

      There's probably a way, but it's probably not easy. I'm sorry!

  • @killaboody7889
    @killaboody7889 5 years ago +3

    you are amazing.
    thank you ever much

    • @dataschool
      @dataschool 5 years ago

      You're very welcome!

  • @subuktageenshaikh2041
    @subuktageenshaikh2041 7 years ago

    Hi, I have a doubt how do i remove duplicates from rows which are text or sentences like in RCV1 data set.

    • @dataschool
      @dataschool 7 years ago

      The same process shown in the video will work for text data, as long as the duplicates are exact matches. Does that answer your question?

  • @moremirinplease
    @moremirinplease 3 years ago +2

    i love you, sir.

  • @benogidan
    @benogidan 7 years ago

    cheers for this :) will definitely consider purchasing the package

    • @dataschool
      @dataschool 7 years ago

      You're very welcome! The pandas library is open source, so it's free!

    • @benogidan
      @benogidan 7 years ago

      sorry i meant on your website, the course ;)

    • @dataschool
      @dataschool 7 years ago

      Awesome! Let me know if you have any questions about the course. More information is here: www.dataschool.io/learn/

  • @harshindublin
    @harshindublin 3 years ago

    Thanks for the video

  • @peekayji
    @peekayji 6 years ago

    Great! Very well explained.

  • @sagarbhadani1932
    @sagarbhadani1932 5 years ago

    Hi, I need help. Suppose we have a table of transactions where a transaction can contain several items, like the one below. How do I find which transactions contain Coffee?
    Transaction Item
    1 Tea
    2 Cookies
    2 Coffee
    3 cookies
    4 Bread
    4 Cookies
    4 Coffee

    • @dataschool
      @dataschool 5 years ago

      I'm not sure off-hand, good luck!
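One way to approach the question, using the table from the comment: find the transaction ids that contain Coffee, then select all rows for those ids.

```python
import pandas as pd

df = pd.DataFrame({'Transaction': [1, 2, 2, 3, 4, 4, 4],
                   'Item': ['Tea', 'Cookies', 'Coffee', 'cookies',
                            'Bread', 'Cookies', 'Coffee']})

# First find the transaction ids that contain Coffee, then select every
# row belonging to one of those transactions.
coffee_ids = df.loc[df['Item'] == 'Coffee', 'Transaction'].unique()
coffee_txns = df[df['Transaction'].isin(coffee_ids)]
```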

  • @srincrivel1
    @srincrivel1 5 years ago +1

    you're doing god's work son!

  • @zma314125
    @zma314125 3 years ago

    Thank you!

  • @engineeringlife2775
    @engineeringlife2775 1 year ago

    Bonus Question 7:55

  • @KimmoHintikka
    @KimmoHintikka 7 years ago

    I had a weird error with this one. Setting the index column with index_col='user_id' does not work for me; it raises KeyError: 'user_id'. Instead I had to run users = pd.read_table('bit.ly/movieusers', sep='|', header=None, names=user_cols) first, and then users.set_index('user_id'), for this tutorial to work

    • @dataschool
      @dataschool 7 years ago

      Interesting! I'm not sure why that would be. But thanks for mentioning the workaround!

  • @muralikrishnapolipallivenk2572
    @muralikrishnapolipallivenk2572 6 years ago

    Hi, I am a big fan of your work and have learned a lot from the videos. Can you please help me with how to do Excel's VLOOKUP in pandas?

    • @dataschool
      @dataschool 6 years ago

      This might help: medium.com/importexcel/common-excel-task-in-python-vlookup-with-pandas-merge-c99d4e108988
      Good luck!
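The core idea behind the linked article (VLOOKUP as a left merge) can be sketched like this; the two tables below are hypothetical:

```python
import pandas as pd

# A left merge on the shared key behaves like Excel's VLOOKUP: each row of
# `orders` pulls the matching price from `lookup`.
orders = pd.DataFrame({'item': ['tea', 'coffee', 'tea']})
lookup = pd.DataFrame({'item': ['tea', 'coffee'], 'price': [2, 3]})
result = orders.merge(lookup, on='item', how='left')
```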

  • @abdulazizalsuayri4908
    @abdulazizalsuayri4908 6 years ago

    full of useful info. Thanx man

    • @dataschool
      @dataschool 6 years ago

      You're very welcome! :)

  • @jatinshetty
    @jatinshetty 4 years ago

    Yo! You are a superb teacher!

  • @da_ta
    @da_ta 5 years ago

    thanks for tips and bonus ideas

  • @Ishkatan
    @Ishkatan 2 years ago

    Good lesson, but the datatype has to match. I found I had to process my pandas tables with .astype(str) before this worked.

  • @zhaoqilong1994
    @zhaoqilong1994 8 years ago

    Is there any simple tutorial available on regular expressions in Python?

    • @dataschool
      @dataschool 8 years ago +1

      For learning regular expressions, I like these two resources:
      developers.google.com/edu/python/regular-expressions
      www.pythonlearn.com/html-270/book012.html

  • @Animesh19007
    @Animesh19007 4 years ago

    How do I keep rows that contain null values in any column and remove the complete rows?

    • @dataschool
      @dataschool 4 years ago

      Does this help? th-cam.com/video/fCMrO_VzeL8/w-d-xo.html
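If the goal is to keep only the rows that contain at least one null, a boolean mask over isnull() appears to do it. The data below is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, None, 3], 'b': [4, 5, None]})

# isnull().any(axis=1) flags rows with at least one missing value;
# indexing with that mask keeps them and drops the fully complete rows.
with_nulls = df[df.isnull().any(axis=1)]
```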

  • @dandixon9466
    @dandixon9466 7 years ago

    Great work man!

  • @krzysztofszeremeta1125
    @krzysztofszeremeta1125 6 years ago

    What is the best way to compare data from two files (with the same schema)?

    • @dataschool
      @dataschool 6 years ago

      I don't know if there's one right way to do this... it depends on the details. Sorry I can't give you a better answer!

  • @sherlocksu1131
    @sherlocksu1131 7 years ago

    Hi, when you mention "inplace" in the video, I am happy that pandas has this parameter for experimenting, but a problem comes up: should I remember all the methods that have the inplace parameter, and which methods affect the original DataFrame, in case I use a DataFrame that was already changed while doing a calculation?
    It's a huge job to memorize which methods have an 'inplace' parameter and which don't, isn't it..... TOT

    • @dataschool
      @dataschool 7 years ago

      The 'inplace' parameter is just for convenience. I do recommend trying to memorize when that parameter is available. But if you forget, that's fine, because you can always write code like this:
      ufo = ufo.drop('Colors Reported', axis=1)
      ...instead of this:
      ufo.drop('Colors Reported', axis=1, inplace=True)

    • @sherlocksu1131
      @sherlocksu1131 7 years ago

      Are all inplace arguments False by default?
      My problem is that sometimes a method changes the original DataFrame (via the inplace parameter) and sometimes it does not,
      so I get confused about when it affects the original DataFrame, since a wrong judgement might lead to a bad conclusion.

    • @dataschool
      @dataschool 7 years ago

      I think that 'inplace' is always False (by default) for all pandas functions.

  • @Anastasia-wy1uj
    @Anastasia-wy1uj 3 years ago

    Jeez you just saved me so much work for a seemingly unsolvable project 🙏☕

    • @dataschool
      @dataschool 3 years ago

      That's awesome to hear!

    • @Anastasia-wy1uj
      @Anastasia-wy1uj 3 years ago

      @@dataschool hey Kevin, I wonder if there's a way of grouping the results in groups that contain the found duplicate rows 🤔 I'm just thinking of a use case where some products (rows/index) with the same values (numerical and categorical) in features (columns) could be put into a product group so that a customer doesn't need to look through thousands of similar products but through much fewer product groups. The idea is of course implying product group feature selection beforehand and adding product variants afterwards (i.e. further product features that could differ among the products of one product group). I'd really appreciate your thoughts or advice on this 🙏 thanks 💙

  • @maheshaknur
    @maheshaknur 7 years ago

    Thanks for this video :)
    How can we remove duplicates, delete columns, delete rows, and insert new columns using a Python script?

    • @dataschool
      @dataschool 7 years ago

      Glad you liked the video! This video shows how to remove rows or columns: th-cam.com/video/gnUKkS964WQ/w-d-xo.html
      Does that help to answer your question?

  • @bharatin1331
    @bharatin1331 4 years ago

    How do I remove leading and trailing spaces in a data frame?
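Not duplicate-specific, but for the whitespace question, str.strip handles one string column at a time. The column name below is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'city': ['  Boston', 'Austin  ', ' Denver ']})

# str.strip removes leading and trailing whitespace from every value.
df['city'] = df['city'].str.strip()
```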

  • @SahibzadaIrfanUllahNaqshbandi
    @SahibzadaIrfanUllahNaqshbandi 7 years ago

    Thanks for the good channel, I like it very much.
    I have a query.
    I am working on tweets: I have to remove duplicate tweets, as well as tweets that differ in at most one word.
    I can do the first part. Will you please guide me on how to do the second part? Thanks

    • @dataschool
      @dataschool 7 years ago

      That's probably beyond the scope of what you can do with pandas. Perhaps you can take advantage of a fuzzy string matching library.

    • @SahibzadaIrfanUllahNaqshbandi
      @SahibzadaIrfanUllahNaqshbandi 7 years ago

      Thanks...I will look into it.

  • @johnsonburgundypants
    @johnsonburgundypants 6 years ago

    very clear, very concise!! :)

    • @dataschool
      @dataschool 6 years ago

      Thanks! Glad you liked it!

  • @duckthatgivesafuk8471
    @duckthatgivesafuk8471 5 years ago

    I really need help, guys.
    I have a table with a column named "Neighbourhood".
    This column has 10 names that are each repeated a LOT of times.
    My question is:
    I NEED HELP CREATING A SEPARATE COLUMN SPECIFYING HOW MANY TIMES EACH ELEMENT IN "NEIGHBOURHOOD" HAS BEEN COUNTED.
    Please help me if anyone can.

    • @dataschool
      @dataschool 5 years ago

      I'm not positive this would work, but I might start by creating a dictionary out of value_counts, and then use that as a mapping for the new column. Anyway, I hope you were able to figure out a solution!
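The value_counts-as-mapping idea from the reply, sketched on hypothetical data (groupby().transform('size') is another option):

```python
import pandas as pd

df = pd.DataFrame({'Neighbourhood': ['A', 'B', 'A', 'A', 'B']})

# map against value_counts attaches each name's total count as a new column.
df['count'] = df['Neighbourhood'].map(df['Neighbourhood'].value_counts())
```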

  • @Drivebyeasy
    @Drivebyeasy 7 years ago

    Hello, I want to understand the concept of resampling. Please help.

    • @dataschool
      @dataschool 7 years ago

      I'm sorry, I don't have any resources to offer you. Good luck!

  • @ashishacharya8427
    @ashishacharya8427 7 years ago

    How do I replace similar duplicate values with one of the values? How do I solve that?

    • @dataschool
      @dataschool 7 years ago

      I think the process would depend a lot on the particular details of the problem you are trying to solve.

  • @ajithtolroy5441
    @ajithtolroy5441 6 years ago

    This is what I want, thanks for sharing :)

  • @hiericzhu
    @hiericzhu 6 years ago

    Hi, I have a question. I want to mark consecutive duplicate values: given [1,1,1,0,2,3,2,4,2], my expected result is [True,True,True,False,False,False,False,...].
    But pandas duplicated(keep=False) returns
    [True,True,True,False,True,False,True,False,True]: it treats the 2s in the 2,x,2,y,2,z,2 sequence as duplicated, but that is not what I want. How do I avoid that? I just want to mark the 1,1,1 run as True. Thanks.

    • @dataschool
      @dataschool 6 years ago

      How about just using code like this:
      df.columnname == 1
      Does that help?
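A possible alternative for marking only consecutive runs, using shift instead of duplicated (data taken from the question):

```python
import pandas as pd

s = pd.Series([1, 1, 1, 0, 2, 3, 2, 4, 2])

# A value belongs to a consecutive run when it equals its immediate
# neighbour on either side; the scattered 2s are never adjacent to
# another 2, so they stay False.
in_run = s.eq(s.shift()) | s.eq(s.shift(-1))
```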