You might never need Pandas again...

  • Published on 30 Jun 2024
  • In a world where Python is getting Rusty, even the industry giants are not immune from oxidation. Pandas has a Rust-based challenger, and I've put them head to head. Let's just say it might be time for a new king of the bears...
    Check out the code:
    github.com/isaacharrisholt/yo...
    Resources:
    Polars: pola.rs
    NYC Taxi dataset: www.nyc.gov/site/tlc/about/tl...
    IMDb dataset: www.kaggle.com/datasets/laksh...
    Polars benchmarks: pola.rs/posts/benchmarks/
    __________________________________________
    Check out my other socials!
    🎮 Discord ▶ discordapp.com/invite/bWrctJ7
    🐦 Twitter ▶ / isaacharrisholt
    🖥️ Portfolio ▶ ihh.dev
    📝 Blog ▶ isaacharrisholt.com
    __________________________________________
    Timestamps:
    00:00 - Introduction
    00:20 - Why Pandas has been essential
    01:32 - Polars
    02:32 - Performance
    03:00 - NYC Taxi benchmarks
    04:36 - IMDb benchmarks
    04:58 - Industry standard benchmarks
    #python #softwareengineer
  • Science & Technology

Comments • 55

  • @houstonbova3136 2 months ago +20

    Stopped using pandas ~2 years ago for Polars, and it's only gotten better since.

    • @IsaacHarrisHolt 2 months ago +1

      I'm glad you're a fan! I absolutely love it

  • @GoldenBeholden 2 months ago +15

    I have been sleeping on this one because the data scientists at my job and in academia wouldn't feel comfortable switching for what are ostensibly "software engineering" reasons. Maybe this is the wake-up call I need to just go for it.

    • @IsaacHarrisHolt 2 months ago +1

      Do it! Force them into it 😅

  • @crossbow170 2 months ago

    Great video: very informative and well put together! The only suggestion I have is about the GIFs used. They seem a bit too dynamic and short, making them jarring and hard to process before they loop again. Maybe try using longer, smoother GIFs to enhance the viewing experience without distracting from the content. Everything else, especially your writing style, is spot on. Keep it up!

    • @IsaacHarrisHolt 2 months ago

      Thanks for the feedback! I'll take it into account :)

  • @BabakFiFoo 2 months ago +4

    Until there is a geopolars with proper functionality, I have to stay in the pandas realm.

    • @IsaacHarrisHolt 2 months ago

      Totally fair!

    • @zyelidarmeggedon2111 2 months ago +2

      There is geopolars. Not "production" ready, but around technically

    • @KManAbout 2 months ago

      Can't you just do most of the stuff with polars and then port them over to pandas?

    • @BabakFiFoo 2 months ago

      @KManAbout Then there is no reason to do the stuff in Polars. Spatial data manipulation is different. Aggregations, summaries and manipulations are not part of pandas, but they are part of geopandas.

    • @BabakFiFoo 2 months ago

      @zyelidarmeggedon2111 I know! But you know how nasty spatial data is. Even production-ready tools still have a lot of trouble.

  • @Sairysss1 2 months ago

    Dude, I almost got epilepsy from all those GIFs.

    • @IsaacHarrisHolt 2 months ago

      Apologies! Not everyone will enjoy this style of video, and I appreciate the feedback!

  • @user-et9by5uu9e 2 months ago +3

    I'm no data scientist, but can someone tell me why the 20 ms or so saved matters that much? If the time for such large sets of data is already so low, does it really matter?

    • @IsaacHarrisHolt 2 months ago +2

      These aren't particularly massive datasets, and 20ms doesn't really matter for exploratory stuff, but if you're doing loads of transformations regularly, cleaning up data etc., perhaps running in a production setting, it can make a big difference.
      Also, Polars allows you to analyse datasets that won't fit into memory, meaning you don't necessarily need to rent a high-memory cloud machine, which is always expensive.

    • @ElFlor95 2 months ago +4

      In my last project I had to wrangle 5GB worth of unstructured data into a tabular format. My initial pandas attempt broke down due to running out of memory one hour into it. Polars completed it all in 15 minutes, without breaking a sweat.

    • @IsaacHarrisHolt 2 months ago

      Nice! I do think Polars is great

    • @robosergTV 2 months ago

      What? You measure the difference in %, not absolute time. Polars here was 2x to 5x faster. wtf.
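
The two views in this thread (relative speedup versus absolute time saved) can be put side by side with a little arithmetic. This is only a sketch: the 40 ms and 20 ms timings and the run count are hypothetical values chosen to match the "20ms saved" and "2x" figures mentioned above, not measurements from the video, and the function names are made up for illustration.

```python
def speedup(baseline_s: float, new_s: float) -> float:
    """Relative speedup: how many times faster the new run is."""
    return baseline_s / new_s


def total_time_saved(baseline_s: float, new_s: float, runs: int) -> float:
    """Absolute seconds saved when the same step is repeated `runs` times."""
    return (baseline_s - new_s) * runs


# Hypothetical timings for one transformation (not measurements from the video).
baseline_s = 0.040  # 40 ms
new_s = 0.020       # 20 ms, i.e. 20 ms saved per run

print(f"speedup: {speedup(baseline_s, new_s):.1f}x")
saved_hours = total_time_saved(baseline_s, new_s, 1_000_000) / 3600
print(f"saved over 1,000,000 runs: {saved_hours:.1f} hours")
```

A saving that looks negligible for one exploratory query can add up once the same step runs many times in a pipeline, which is the point both replies are making from different directions.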

  • @pedrogorilla483 2 months ago

    What about RAPIDS cuDF? Does it support it? Would be interested to see how it compares if you use GPU for processing data.

    • @IsaacHarrisHolt 2 months ago

      I'm not 100% sure, but I don't think Pandas does either, right? Not out of the box

    • @marco_gorelli 2 months ago

      Nvidia and Polars are collaborating on GPU support, check the Polars blog, there was an announcement there

    • @EarlZMoade 2 months ago +1

      Polars have announced they are integrating cudf support directly.

    • @IsaacHarrisHolt 2 months ago

      Awesome, thanks!

  • @takudzwamakusha5941 2 months ago

    What are the specs of your computer?

    • @IsaacHarrisHolt 2 months ago

      The CPU is a Ryzen 9 3900X :)

  • @samarnagar9699 2 months ago

    How long do you think it will take for this new tech to be adopted in the market at scale?

    • @IsaacHarrisHolt 2 months ago +2

      Great question! Lots of tooling is built around pandas, and it's a slightly different way of thinking, but it's already in production use in a few places (including where I work!) so it's definitely on track

    • @samarnagar9699 2 months ago +1

      @IsaacHarrisHolt This is the kind of question I've come to ask myself when looking at the web dev world: new runtime, compiler, bundler, framework, version, whatever, but will it affect your personal work experience? It's good that it's in use where you work. The Python world is very different from web dev, which is a good thing in this scenario. Another point to note is that all the tutorials for this new tech will assume you know pandas, which will be a migraine for new programmers.

    • @IsaacHarrisHolt 2 months ago +2

      Potentially, but Polars and pandas are quite different in my opinion, so there's always going to have to be some explanation of the basics.
      People should probably learn SQL before learning either anyway

    • @samarnagar9699 2 months ago +1

      @IsaacHarrisHolt Really? SQL before both?

  • @risebio 2 months ago +1

    Where were all these useful tools a few years back when I was doing my diploma work? I had 2 sets of chem structures: one set with 500k elements (21 GB) and another with 150k. The script took 4 to 7 hours to finish. And if I wrote something wrong (I did quite often, and got stupid results), I needed to spend another 15 to 30 minutes finding the mistakes in my code and then wait another 4-7 hours for my laptop to finish the calculations. I was so done with Python at the time that I was eager to do the same in Go, but the result was almost the same. Then I accepted my fate: started the script, went to bed, woke up after 7 hours, and that was like I finished my work XD

    • @IsaacHarrisHolt 2 months ago

      I know the pain! One of my physicist friends had the same issue, so we ended up using multiprocessing to parallelise the work and run it on my 12 core desktop instead of her 4 core laptop. Took the runtime down from 17 hours to 40 minutes or something ridiculous
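
The multiprocessing approach mentioned in this reply can be sketched roughly as follows, using only the standard library. This is a minimal illustration under assumptions, not the code from that project: `process_chunk`, `split_into_chunks`, and `run_parallel` are hypothetical names, and the sum of squares is a stand-in for whatever expensive, independent calculation each chunk needs.

```python
from multiprocessing import Pool


def process_chunk(chunk):
    """Stand-in for an expensive, independent per-chunk calculation."""
    return sum(x * x for x in chunk)


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_parallel(items, workers):
    """Process each chunk in its own worker process and combine the results."""
    chunks = split_into_chunks(items, workers)
    with Pool(processes=workers) as pool:
        partial_results = pool.map(process_chunk, chunks)
    return sum(partial_results)


if __name__ == "__main__":
    data = list(range(100_000))
    print(run_parallel(data, workers=4))
```

This only helps when the chunks are independent and the per-chunk work is heavy enough to outweigh the cost of starting processes and passing data between them.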

  • @DreySF 2 months ago +1

    So basically it's R's data.frame or data.table in dplyr and the tidyverse, but with Rust in Python 😊

    • @IsaacHarrisHolt 2 months ago

      I'm not familiar with any of those, but very possibly!

    • @ArnabAnimeshDas 2 months ago +1

      dplyr but better in every way

  • @samarnagar9699 2 months ago +4

    Ofc it's made in rust

    • @IsaacHarrisHolt 2 months ago +1

      All the best things are 😉

    • @_tsu_ 2 months ago

      BLAZINGLY FAST

  • @romankovalful 2 months ago +5

    I'm not a data scientist but I occasionally use Pandas. What I've always wondered though is why don't people just load their data into a normal Postgres database and query it with plain SQL? You get a lot of features with minimal work.

    • @IsaacHarrisHolt 2 months ago

      Some data isn't necessarily suited to that, or you may need to present it e.g. in a Jupyter Notebook. Also, many data scientists aren't familiar with setting up DBs etc.

    • @moolavar9452 2 months ago

      @IsaacHarrisHolt SQLite is sufficient.
      The reason is that we need the data object in hand to play with it easily in Jlabs 😂.

    • @robosergTV 2 months ago

      What? No, that's not how it works for DS

    • @Bozebo months ago

      Lack of knowledge and a need to use buzzword tools as much as possible. Basically everything I do now is just postgres with some decoration to make it do what I'm needing xD For almost anything with data postgres already has the answer and is optimal.

    • @IsaacHarrisHolt months ago

      @Bozebo Postgres is good, but there's the upfront setup cost, and it can be more difficult for people who aren't familiar with SQL. For a lot of things though, it's great. Especially when you add a columnar engine
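
For readers wondering what the "just use SQL" route from this thread looks like with no server to set up, here is a minimal sketch using Python's built-in `sqlite3` module, as one reply suggests SQLite may be sufficient. The `trips` table, its columns, and the sample rows are hypothetical and exist only for illustration; a real workflow would load actual data and might well use Postgres instead.

```python
import sqlite3

# Hypothetical sample rows: (pickup_zone, fare). Not taken from any real dataset.
rows = [("Midtown", 12.5), ("Midtown", 8.0), ("Harlem", 15.0)]

# An in-memory database avoids any upfront setup; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_zone TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)

# A typical group-by aggregation, expressed in plain SQL.
query = """
    SELECT pickup_zone, COUNT(*) AS n_trips, AVG(fare) AS avg_fare
    FROM trips
    GROUP BY pickup_zone
    ORDER BY pickup_zone
"""
result = conn.execute(query).fetchall()
conn.close()

for zone, n_trips, avg_fare in result:
    print(zone, n_trips, avg_fare)
```

The result comes back as a list of plain tuples, which hints at the trade-off raised above: the query is easy, but you do not get a dataframe object to keep exploring in a notebook without an extra conversion step.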

  • @fabricehategekimana5350 2 months ago +1

    👏👏👏

    • @IsaacHarrisHolt 2 months ago

      Thanks! Hope you found it helpful

  • @user-gj3kz7cm3x months ago +1

    Switched to Polars about 19 months ago and never looked back. Pandas is awful. Spark is sometimes a necessary evil, but I believe that the folks behind Polars are looking to release a distributed compute framework soon.

    • @IsaacHarrisHolt months ago

      Oh that's interesting! I didn't know that

    • @user-gj3kz7cm3x months ago +1

      @IsaacHarrisHolt Yeah, they raised $4m from Bain IIRC.

    • @IsaacHarrisHolt months ago

      Awesome!

  • @dhaval1489 2 months ago +2

    My first choice is Polars; only if I get stuck do I switch to Pandas, because you are sure to find a solution on the net with Pandas.

    • @IsaacHarrisHolt 2 months ago +1

      This is true. Community support is much better for pandas atm

  • @Sebastian-gu6wr 2 months ago +1

    Yeah! I want Pandas 🐼 extinct

    • @IsaacHarrisHolt 2 months ago +1

      Sounds like there's some trauma there 👀