How to work with big data files (5gb+) in Python Pandas!

  • Published on Aug 1, 2024
  • In this video, we quickly go over how to work with large CSV/Excel files in Python Pandas. Instead of trying to load the full file at once, you should load the data in chunks. This is especially useful for files that are a gigabyte or larger. Let me know if you have any questions :). (A short sketch of this chunked-reading pattern is included after the video timeline below.)
    Source code on Github:
    github.com/KeithGalli/Data-Sc...
    Raw data used (from Kaggle):
    www.kaggle.com/datasets/mkech...
    I want to start uploading data science tips & exercises to this channel more frequently. What should I make videos on??
    -------------------------
    Follow me on social media!
    Instagram | / keithgalli
    Twitter | / keithgalli
    TikTok | / keithgalli
    -------------------------
    If you are curious to learn how I make my tutorials, check out this video: • How to Make a High Qua...
    Practice your Python Pandas data science skills with problems on StrataScratch!
    stratascratch.com/?via=keith
    Join the Python Army to get access to perks!
    TH-cam - th-cam.com/channels/q6X.html...
    Patreon - / keithgalli
    *I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.
    -------------------------
    Video timeline!
    0:00 - Overview
    1:25 - What not to do
    2:16 - Python code to load in large CSV file (read_csv & chunksize)
    8:00 - Finalizing our data
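
    For reference, a minimal sketch of the chunked-loading pattern described in the video. File name, chunk size, and the grouping columns are placeholders, not necessarily the exact ones used on screen:

        import pandas as pd

        # Read the large CSV a million rows at a time instead of all at once.
        reader = pd.read_csv("ecommerce_events.csv", chunksize=1_000_000)

        parts = []
        for chunk in reader:
            # Shrink each chunk to a small per-chunk summary before keeping it.
            summary = (chunk.groupby(["brand", "category_code", "event_type"])
                            .size()
                            .reset_index(name="count"))
            parts.append(summary)

        # Combine the per-chunk summaries, then re-aggregate so counts that were
        # split across chunk boundaries are added together.
        output = (pd.concat(parts)
                    .groupby(["brand", "category_code", "event_type"], as_index=False)["count"]
                    .sum())
        output.to_csv("ecommerce_summary.csv", index=False)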

Comments • 49

  • @mjacfardk
    @mjacfardk 2 years ago +4

    During my 3 years in the field of data science, this is the best course I've ever watched.
    Thank you brother, keep it up.

  • @Hossein118
    @Hossein118 2 years ago +6

    The end of the video was so fascinating to see how that huge amount of data was compressed to such a manageable size.

  • @CaribouDataScience
    @CaribouDataScience 1 year ago +10

    Since you are working with Python, another approach would be to import the data into a SQLite db. Then create some aggregate tables and views ...
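
    A rough sketch of that SQLite route, assuming the same chunked pd.read_csv; file, table, and column names are made up for illustration:

        import sqlite3
        import pandas as pd

        conn = sqlite3.connect("events.db")

        # Stream the CSV into a SQLite table chunk by chunk.
        for chunk in pd.read_csv("ecommerce_events.csv", chunksize=1_000_000):
            chunk.to_sql("events", conn, if_exists="append", index=False)

        # Let SQL do the aggregation instead of holding everything in pandas.
        summary = pd.read_sql(
            "SELECT brand, category_code, event_type, COUNT(*) AS n "
            "FROM events GROUP BY brand, category_code, event_type",
            conn,
        )
        conn.close()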

  • @michaelhaag3367
    @michaelhaag3367 2 years ago +6

    glad you are back my man, I am currently in a data science bootcamp and you are way better than some of my teachers ;)

  • @JADanso
    @JADanso 2 years ago +1

    Very timely info, thanks Keith!!

  • @fruitfcker5351
    @fruitfcker5351 1 year ago +5

    If (and only if) you only want to read a few columns, just specify the columns you want to process from the CSV by adding *usecols=["brand", "category_code", "event_type"]* to the *pd.read_csv* function. Took about 38 seconds to read on an M1 MacBook Air.
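
    For example, something along these lines (file name is a placeholder; the column list is the one from the comment above):

        import pandas as pd

        # Only parse the three needed columns; pandas skips the rest of each row,
        # which cuts both memory use and parse time.
        df = pd.read_csv(
            "ecommerce_events.csv",
            usecols=["brand", "category_code", "event_type"],
        )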

  • @ahmetsenol6104
    @ahmetsenol6104 1 year ago

    It was quick and straight to the point. Very good one thanks.

  • @jacktrainer4387
    @jacktrainer4387 2 years ago +2

    No wonder I've had trouble with Kaggle datasets! "Big" is a relative term. It's great to have a reasonable benchmark to work with! Many thanks!

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +2

      Definitely, "big" very much means different things to different people and circumstances.

    • @Nevir202
      @Nevir202 1 year ago

      Ya, I've been trying to process a book in Sheets; processing 100k words, so a few MB, in the way I'm trying to is already too much lol.

  • @dhssb999
    @dhssb999 2 years ago +10

    Never used chunk in read_csv before, it helps a lot! Great tip, thanks

  • @firasinuraya7065
    @firasinuraya7065 1 year ago +1

    OMG..this is gold..thank you for sharing

  • @lesibasepuru8521
    @lesibasepuru8521 1 year ago +1

    You are a star my man... thank you

  • @abhaytiwari5991
    @abhaytiwari5991 2 years ago +2

    Well-done Keith 👍👍👍

  • @elu1
    @elu1 2 years ago +1

    great short video! nice job and thanks!

  • @rishigupta2342
    @rishigupta2342 1 year ago +1

    Thanks Keith. Please do more videos on EDA in Python.

  • @andydataguy
    @andydataguy 1 year ago +1

    Great video! Hope you start making more soon

  • @spicytuna08
    @spicytuna08 1 year ago +2

    Thanks for the great lesson! Wondering what the performance difference would be between output = pd.concat([output, summary]) vs output.append(summary)?
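
    For what it's worth, a hedged comparison sketch: both forms copy the already-accumulated rows on every iteration, so collecting the per-chunk summaries in a list and concatenating once is usually faster, and it also works on pandas 2.x, where DataFrame.append was removed. The summarize helper below is hypothetical:

        import pandas as pd

        def summarize(chunk):
            # Hypothetical per-chunk aggregation, just to make the comparison concrete.
            return (chunk.groupby(["brand", "category_code", "event_type"])
                         .size()
                         .reset_index(name="count"))

        reader = pd.read_csv("ecommerce_events.csv", chunksize=1_000_000)

        # Slower pattern (and output.append(summary) no longer exists in pandas 2.x):
        #     output = pd.DataFrame()
        #     for chunk in reader:
        #         output = pd.concat([output, summarize(chunk)])

        # Usually faster: one concat at the end.
        output = pd.concat([summarize(chunk) for chunk in reader], ignore_index=True)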

  • @AshishSingh-753
    @AshishSingh-753 2 years ago +2

    Pandas has capabilities I didn't know about - secret Keith knows everything

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +1

      Lol I love the nickname "secret keith". Glad this video was helpful!

  • @agnesmunee9406
    @agnesmunee9406 1 year ago +2

    How would I go about it if it was a jsonlines (jsonl) data file?
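
    Not shown in the video, but pandas can stream JSON Lines files in a similar way; a sketch with placeholder file name, chunk size, and column:

        import pandas as pd

        # lines=True treats each line as one JSON record; chunksize makes read_json
        # return an iterator of DataFrames instead of loading the whole file.
        reader = pd.read_json("events.jsonl", lines=True, chunksize=100_000)

        parts = []
        for chunk in reader:
            parts.append(chunk.groupby("event_type").size().reset_index(name="count"))

        summary = pd.concat(parts).groupby("event_type", as_index=False)["count"].sum()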

  • @oscararmandocisnerosruvalc8503
    @oscararmandocisnerosruvalc8503 1 year ago +1

    Cool videos bro.
    Can you address load and dump for JSON please :)?

  • @DataAnalystVictoria
    @DataAnalystVictoria 8 months ago

    Why and how do you use 'append' with a DataFrame? I get an error when I do the same thing. Only if I use a list instead, and then concat all the dfs in the list, do I get the same result as you do.
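
    For context, DataFrame.append was deprecated and then removed in pandas 2.0, which is most likely why the append call from the video errors on newer versions; the list-then-concat approach described in this comment is the usual replacement, roughly:

        import pandas as pd

        parts = []
        for chunk in pd.read_csv("ecommerce_events.csv", chunksize=1_000_000):
            parts.append(chunk[["brand", "category_code", "event_type"]])

        # One concat at the end replaces the removed DataFrame.append calls.
        output = pd.concat(parts, ignore_index=True)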

  • @rokaskarabevicius
    @rokaskarabevicius several months ago

    This works fine if you don't have any duplicates in your data. Even if you de-dupe every chunk, aggregating it makes it impossible to know whether there are any dupes between the chunks. In other words, do not use this method if you're not sure whether your data contains duplicates.
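
    A rough sketch of one way around that, tracking keys already seen in earlier chunks before aggregating; the key columns and file name are assumptions, and the seen set itself can grow large:

        import pandas as pd

        seen = set()
        parts = []
        for chunk in pd.read_csv("ecommerce_events.csv", chunksize=1_000_000):
            # Build a row key from the columns that define a duplicate (assumed here).
            key = (chunk["event_time"].astype(str) + "|"
                   + chunk["product_id"].astype(str) + "|"
                   + chunk["user_id"].astype(str))
            fresh = chunk[~key.isin(seen) & ~key.duplicated()]  # rows not seen before
            seen.update(key)
            parts.append(fresh.groupby(["brand", "category_code", "event_type"])
                              .size().reset_index(name="count"))

        summary = (pd.concat(parts)
                     .groupby(["brand", "category_code", "event_type"], as_index=False)["count"]
                     .sum())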

  • @CS_n00b
    @CS_n00b 10 months ago

    Why not groupby.size() instead of groupby.sum() on the column of 1's?
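
    They give the same counts; a tiny illustration with made-up data:

        import pandas as pd

        chunk = pd.DataFrame({
            "brand": ["apple", "apple", "samsung"],
            "event_type": ["view", "view", "cart"],
        })

        # size() counts rows per group directly...
        by_size = chunk.groupby(["brand", "event_type"]).size().reset_index(name="count")

        # ...while the column-of-1's approach sums an added column per group.
        chunk["count"] = 1
        by_sum = chunk.groupby(["brand", "event_type"])["count"].sum().reset_index()

        print(by_size.equals(by_sum))  # True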

  • @manyes7577
    @manyes7577 2 years ago +2

    I have an error message on this one. It says 'DataFrame' object is not callable. Why is that and how do I solve it? Thanks
    for chunk in df:
        details = chunk[['brand', 'category_code', 'event_type']]
        display(details.head())
        break

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +1

      How did you define "df"? I think that's where your issue lies.
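
      For reference, the loop in the question expects df to be the chunk iterator that read_csv returns when chunksize is set, not an already-loaded DataFrame (calling a DataFrame with parentheses somewhere is what typically raises 'DataFrame' object is not callable). A sketch with a placeholder file name and chunk size:

          import pandas as pd

          # df here is a chunk iterator (TextFileReader), not a DataFrame.
          df = pd.read_csv("ecommerce_events.csv", chunksize=1_000_000)

          for chunk in df:
              details = chunk[["brand", "category_code", "event_type"]]
              print(details.head())
              break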

  • @lukaschumchal6676
    @lukaschumchal6676 2 years ago +1

    Thank you for the video, it was really helpful. But I am still a little confused. Do I have to process every big file with chunks because it's necessary, or is it just a quicker way of working with large files?

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  2 years ago +1

      The answer really depends on the amount of RAM that you have on your machine.
      For example, I have 16gb of RAM on my laptop. No matter what, I would never be able to load in a 16gb+ file all at once because I don't have enough RAM (memory) to do that. Realistically, my machine is probably using about half the RAM for miscellaneous tasks at all times, so I wouldn't even be able to open up an 8gb file all at once.
      If you are on Windows, you can open up your task manager --> performance to see details on how much memory is available. You could technically open up a file as long as you have enough memory available for it, but performance will decrease as you get closer to your total memory limit. As a result, my general recommendation would be to load in files in chunks basically any time the file is greater than 1-2gb in size.
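
      If it's useful, available memory can also be checked from Python itself; a small sketch using the third-party psutil package (an addition here, not something from the video):

          import psutil

          mem = psutil.virtual_memory()
          print(f"Total RAM:     {mem.total / 1e9:.1f} GB")
          print(f"Available RAM: {mem.available / 1e9:.1f} GB")

          # Rule of thumb from the reply above: if the file size is anywhere near
          # the available figure, read it in chunks rather than all at once.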

    • @lukaschumchal6676
      @lukaschumchal6676 2 years ago

      @@TechTrekbyKeithGalli Thank you very much. I cannot even describe how helpful this is to me :).

  • @vickkyphogat
    @vickkyphogat 1 year ago

    What about .SAV files?
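
    Not covered in the video, but SPSS .sav files are usually read via the pyreadstat package (pandas.read_spss wraps it), and as far as I recall pyreadstat can also read in chunks; a sketch worth checking against the pyreadstat docs, with placeholder file name and chunk size:

        import pyreadstat

        # Yields (DataFrame, metadata) pairs, chunksize rows at a time.
        reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, "survey.sav",
                                                chunksize=100_000)

        for chunk_df, meta in reader:
            print(chunk_df.shape)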

  • @machinelearning1822
    @machinelearning1822 1 year ago +1

    I have tried and followed each step, however it gives this error:
    OverflowError: signed integer is greater than maximum

  • @konstantinpluzhnikov4862
    @konstantinpluzhnikov4862 2 years ago +1

    Nice video! Working with big files when the hardware is not at its best means there is plenty of time to make a cup of coffee, discuss the latest news...

  • @dicloniusN35
    @dicloniusN35 5 months ago

    But the new file only has 100,000 rows, not all the info. Are you ignoring the other data?

  • @oscararmandocisnerosruvalc8503
    @oscararmandocisnerosruvalc8503 1 year ago

    Why did you use the count there?

    • @TechTrekbyKeithGalli
      @TechTrekbyKeithGalli  1 year ago

      If you want to aggregate data (make it smaller), counting the number of occurrences of events is a common method to do that.
      If you are wondering why I added an additional 'count' column and summing, instead of just doing something like value_counts(), that's just my personal preferred method of doing it. Both can work correctly.
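
      A small illustration of the two equivalent options with made-up data:

          import pandas as pd

          chunk = pd.DataFrame({
              "brand": ["acer", "acer", "acer", "sony"],
              "event_type": ["view", "view", "cart", "view"],
          })

          # Video's approach: add a column of 1's and sum it per group.
          chunk["count"] = 1
          by_sum = chunk.groupby(["brand", "event_type"])["count"].sum().reset_index()

          # Alternative: value_counts over the grouping columns gives the same numbers.
          by_vc = chunk[["brand", "event_type"]].value_counts().reset_index(name="count")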

    • @oscararmandocisnerosruvalc8503
      @oscararmandocisnerosruvalc8503 1 year ago

      @@TechTrekbyKeithGalli Thanks a lot for your videos, bro !!!!