Do these Pandas Alternatives actually work?

  • Published on 27 Aug 2024

Comments • 88

  • @robmulla
    @robmulla  1 year ago +5

    If you enjoyed this video you should also check out my video about Polars, a pandas alternative that I didn't cover in this video: th-cam.com/video/VHqn7ufiilE/w-d-xo.html&feature=shares

  • @joaomurilopalonefauvel942
    @joaomurilopalonefauvel942 2 years ago +30

    I wonder how polars performs. It seems like the fastest pandas alternative from my research.

    • @robmulla
      @robmulla  2 years ago +13

      Thanks for mentioning polars! I haven’t heard of it before but just read the GitHub page and it looks promising. Maybe next video I’ll cover it!

    • @robmulla
      @robmulla  1 year ago +2

      @Charles I made a video about polars! Check it out here! th-cam.com/video/VHqn7ufiilE/w-d-xo.html&feature=shares

  • @CedricDeBoom
    @CedricDeBoom 1 year ago +7

    Would have liked more focus on memory and cpu usage. Especially in contexts with big datasets but limited resources, this is crucial, and it would have been nice to see and compare the effects of lazy evaluation here.

    • @glitchaddict99
      @glitchaddict99 4 months ago

      yeah this barely touched on the real reason I use dask, to do out of core data operations when I can’t use pandas anymore

  • @N147185
    @N147185 1 year ago +9

    There is an important subtlety that is being missed about Vaex - it lazily reads the parquet file each time you do an operation. That means, you add the time to read (stream) the data in addition to the time it takes to actually do the math. That makes it all the more impressive (to me anyways). Other libraries read and hold the data in memory, so it is ready to be used. Vaex is more "memory safe", which is especially useful if you work with datasets that are much larger than ram.
    Anyway, very nice video - keep up the good work!

    • @robmulla
      @robmulla  1 year ago

      Great point. I didn’t know much about Vaex going into this. Interesting that it’s memory safe.

  • @zhenliu6596
    @zhenliu6596 1 year ago +3

    Just want to say thank you for saving the time for us.

    • @robmulla
      @robmulla  1 year ago

      I’m happy to! Thanks for watching 😀

  • @Maric18
    @Maric18 1 year ago +3

    Hm, I am not that happy with this comparison, as it doesn't add anything that just naively trying these things out doesn't already do.
    pandas (to my knowledge) already uses numpy under the hood, so it runs in parallel on your local machine,
    so everything else doing the exact same thing as pandas will be less efficient.
    Most of these are optimizing for datasets that cannot fit in RAM; dask is (to my knowledge) for clustering, at least that's how I am using it.
    So a bit more research, trying to get ray to work for example, actually using dask features, doing some applies and so on would have been nice.
    Otherwise, a clickbait-compatible title for this video could be something like "Can these libraries be a direct drop-in improvement over pandas?" or something.

  • @gingerjiang666
    @gingerjiang666 1 year ago +2

    Great video. Thank you very much. Quick question though: how did you change your JupyterLab theme? It looks great.

  • @terusensei_japones
    @terusensei_japones 2 years ago +4

    Very interesting. It seems that I will keep mostly using pandas 🤣🤣 thanks for sharing the experiment!

    • @robmulla
      @robmulla  2 years ago +1

      When I started making this video I thought that each library would outperform pandas in a different way. I was surprised by the results. I'm sure there are situations where they are better alternatives to pandas, but for the time being I too will be mostly sticking with pandas.

    • @igormriegel
      @igormriegel 2 years ago +1

      Try Polars, it is awesome and has way better results than what was shown in the video.

  • @PalataoArmy
    @PalataoArmy 2 years ago +3

    I love modin the most because it is backed by dask, ray or omnisci and is compatible with the pandas API. If those can't handle big data, pyspark it is.

    • @robmulla
      @robmulla  2 years ago

      Thanks for sharing. Is there a backend for modin you typically use more?

    • @bcak611
      @bcak611 2 years ago

      Try Vaex for Big Data!

  • @CNW21
    @CNW21 1 year ago +4

    This is interesting because from my understanding pandas can only use 1 CPU core, whereas some or all of those alternatives *should* be able to use some or all of your Threadripper cores, which theoretically would drastically improve performance. Either way, from the looks of it I'd rather spend a few seconds/minutes waiting in pandas than reading the documentation for pandas alternatives.

    • @robmulla
      @robmulla  1 year ago +3

      You have the same thought that I did! While Python inherently uses only a single core, the vectorized numpy and pandas functions are written in a lower-level language that can do multithreading. So that's why straight pandas is hard to beat when the data can fit into memory.

    • @bennri
      @bennri 1 year ago +2

      @@robmulla yes, but when I look at the task manager, I don't see multiple cores running.
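
The exchange above contrasts Python-level loops with the vectorized routines in pandas and numpy. Here is a minimal sketch of that contrast, assuming only that pandas and numpy are installed. The helper `python_loop_sum` and the series size are made up for illustration. It shows that both paths give the same answer while the vectorized call stays in compiled code. Whether a given operation actually uses more than one core depends on the operation and the build, which this sketch does not measure.

```python
import time

import numpy as np
import pandas as pd


def python_loop_sum(series):
    """Sum a Series one element at a time in the Python interpreter."""
    total = 0
    for value in series:
        total += int(value)
    return total


# Hypothetical data: 100,000 consecutive integers.
s = pd.Series(np.arange(100_000, dtype=np.int64))

start = time.perf_counter()
loop_total = python_loop_sum(s)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized_total = int(s.sum())  # the reduction runs in compiled code
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s  vectorized: {vectorized_seconds:.4f}s")
```

The timings will vary by machine, so none are quoted here.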

  • @JorgeRodriguez-ck6cy
    @JorgeRodriguez-ck6cy 1 month ago

    Great video. Kudos. Question: what do you think of DuckDB?

  • @wayneh7067
    @wayneh7067 5 months ago

    Tbh if you have that much CPU memory, there’s really no need to consider any Pandas alternative. Maybe do some memory heavy tasks like joining large dataframes, which I usually use Dask for.

  • @585ghz
    @585ghz 1 year ago +1

    In dask, you can split on the index, so it can aggregate by the index much faster.

    • @robmulla
      @robmulla  1 year ago

      That’s true. I just wanted to compare as a drop in replacement

  • @rafaeel731
    @rafaeel731 1 year ago +4

    Would be useful knowing the specs of your machine. I think Dask makes sense when you have much larger data and clusters of executors, more like Spark situation.

    • @robmulla
      @robmulla  1 year ago

      My machine has a 32 thread ryzen CPU. There may be situations where it performs better but my main goal was to show how it performs on a single machine- most of the time pandas alone is the best.

    • @rafaeel731
      @rafaeel731 1 year ago +1

      @@robmulla which is expected as Pandas can parallelise on a single machine and other options try to build on top of it, or some of them do. Thanks anyway!

  • @DarthJarJar10
    @DarthJarJar10 2 years ago +4

    Was a tad surprised you didn't explore a Dask Delayed object, or try out some of Dask's concurrency features but love your videos! Would have loved to have seen Polars in the mix. For the missing cumsum in Vaex, I want to check if a reduce plus lambda combo would not have worked...

    • @robmulla
      @robmulla  2 years ago +3

      Thanks for the feedback. I realize that this video only scratches the surface of what’s possible with each library. I felt like it wouldn’t be fair to use more than just the base API- but you make a fair point. Maybe I need to make a follow up video.

    • @DarthJarJar10
      @DarthJarJar10 2 years ago +2

      @@robmulla, it was a pleasure! Your videos are great regardless!
      I'm immersing myself in this stuff bit by bit but having your content whilst working from home has been amazing!

    • @bennri
      @bennri 1 year ago

      Doesn't dask.dataframe run on multiple cores concurrently by default?

    • @bennri
      @bennri 1 year ago

      @Javis_Lumu Doesn't dask.dataframe run on multiple cores concurrently by default?

    • @DarthJarJar10
      @DarthJarJar10 1 year ago

      @@bennri, I speak under correction - it may be specified by default that dask.dataframe is run concurrently, but my understanding was that whether this is the case, and the number of threads used, is actually a setting, and moreover that Dask utilises concurrency most efficiently using the delayed interface.
      You're likely correct.
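
The "reduce plus lambda" idea raised at the top of this thread can be sketched in plain Python. This is not Vaex code, and whether the same pattern can be applied to a Vaex column is left open here. The function name `cumsum_reduce` is hypothetical. The sketch only shows that a cumulative sum can be expressed as a fold over the input.

```python
from functools import reduce
from itertools import accumulate


def cumsum_reduce(values):
    """Cumulative sum written as a fold: each step appends the running total."""
    return reduce(
        lambda acc, x: acc + [acc[-1] + x] if acc else [x],
        values,
        [],
    )


running = cumsum_reduce([1, 2, 3, 4])
print(running)
```

Because each step builds a new list, this version does quadratic work in the input length. `itertools.accumulate` from the standard library produces the same running totals in a single pass.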

  • @kv1kv
    @kv1kv 1 year ago +1

    vaex is the only package that provides working out-of-core functionality
    you can process and explore the data that just does not fit into the memory at all on your desktop or laptop
    this is its purpose, it works when pandas just does not work at all
    it is an awesome package that I use on almost everyday basis
    it can be a little slower sometimes cause it does not load full dataset into memory and tries to use multiple cores so there is some expected overhead
    and it really misses some functionality so you sometimes need to convert data pieces into pandas which can be done easily

    • @robmulla
      @robmulla  1 year ago +1

      Really cool. I haven't used vaex much outside of in this video. Seems similar to polars, which I made a different video on.

  • @sawekb8102
    @sawekb8102 1 year ago +3

    In order to speed up dask you can configure the client to use all cores or pass .compute(scheduler="processes")

    • @robmulla
      @robmulla  1 year ago

      Good to know. I didn’t want to add too much difference to the packages because I wanted to compare apples to apples.

  • @LeandroGessner
    @LeandroGessner 5 months ago +1

    I missed DuckDB
    In my tests, it is, by far, faster than these in the video (not sure about pandas)

  • @RockieYang
    @RockieYang 1 year ago +1

    Did you by chance test with the arrow format with vaex as well? As vaex uses memory mapping, it still needs to load the whole thing with a parquet file, while it might avoid the whole load if using arrow.

    • @robmulla
      @robmulla  1 year ago

      That's a good point. I just ran each package using the default with no modifications. If there is a way to change the backend format let me know how it might be done.
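
Memory mapping, which the comment above refers to, can be illustrated without Vaex. The sketch below uses `numpy.memmap` on a raw binary file and is an assumption-laden stand-in, not how Vaex or Arrow are implemented. The file name, array size, and the helper `mean_from_memmap` are made up.

```python
import os
import tempfile

import numpy as np


def mean_from_memmap(path, n_values, dtype=np.float64):
    """Map the file into memory and reduce it without an explicit read() call."""
    mapped = np.memmap(path, dtype=dtype, mode="r", shape=(n_values,))
    return float(mapped.mean())


# Hypothetical data: write 1,000 float64 values to a temporary raw file.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "values.bin")
np.arange(1_000, dtype=np.float64).tofile(path)

result = mean_from_memmap(path, 1_000)
print(result)
```

The operating system pages data in as it is touched, which is the property the comment is pointing at for uncompressed, directly mappable formats.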

  • @gokulakrishnanm
    @gokulakrishnanm 2 years ago +2

    Which processor are you using? Is it an Intel processor? From what I've heard, modin is good at running on Intel CPUs. Please share your system specs.

    • @robmulla
      @robmulla  2 years ago +1

      I’m using a ryzen chip. Maybe that’s the issue.

    • @gokulakrishnanm
      @gokulakrishnanm 2 years ago +1

      @@robmulla share your test code with data. I have an 11th gen i5; I'll benchmark and share the result.

  • @kayderl
    @kayderl 1 year ago +1

    Your notebook looks really nice. Is that jupyter notebook with a theme or something else?

    • @robmulla
      @robmulla  1 year ago +1

      Thanks! Jupyterlab with the solarized dark theme.

  • @p.v.h.8659
    @p.v.h.8659 7 months ago

    Tbh the comparison is like comparing pears and apples. You start off with a pd df which can fit in your memory, so you are obviously faster running it in pandas because you have less overhead. But when you have datasets which just can't fit into your RAM, pandas starts to get useless and one has to switch to alternatives, for example dask, especially when you run computationally heavy stuff like bootstrapping etc. on a cluster, where dask supports the proper allocation of resources while pandas normally lacks this support.

  • @jmoz
    @jmoz 1 year ago +2

    I spent days testing dask and couldn’t find any benefits or even get it to work for what I was trying. I had a large 450M row dataset that I needed to pivot, and it simply wouldn’t work. It maxed out memory and HDD space. Had to use standard pandas and some clever iterating.

    • @robmulla
      @robmulla  1 year ago +1

      I’ve been in the exact same situation a bunch of times before too! That’s partly why I wanted to make this video. Thanks for watching.
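
One possible reading of "standard pandas and some clever iterating" is a chunked pivot. The sketch below is a guess at the general pattern, not the commenter's actual code. The function `chunked_pivot_sum` and the column names are hypothetical, and the approach only works for aggregations such as sums and counts that can be combined from partial results.

```python
import pandas as pd


def chunked_pivot_sum(chunks, index, columns, values):
    """Pivot with a sum aggregation, one chunk at a time."""
    total = None
    for chunk in chunks:
        # Aggregate inside the chunk first, so only small partial results are kept.
        part = chunk.groupby([index, columns])[values].sum()
        total = part if total is None else total.add(part, fill_value=0)
    # Move the second index level into columns to get the pivoted layout.
    return total.unstack(fill_value=0)


# Hypothetical toy data, split into two chunks by hand.
df = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u1", "u2", "u2"],
        "color": ["red", "blue", "red", "red", "blue", "red"],
        "n": [1, 2, 3, 2, 5, 1],
    }
)
result = chunked_pivot_sum([df.iloc[:3], df.iloc[3:]], "user", "color", "n")
print(result)
```

With a file on disk, the chunks could come from `pd.read_csv(path, chunksize=...)`, which yields DataFrames of at most that many rows.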

  • @jti107
    @jti107 2 years ago +1

    nice! the big thing with pandas is the amount of resources when learning and debugging. any thoughts on Julia?

    • @robmulla
      @robmulla  2 years ago

      Agreed the resources and documentation surrounding pandas makes it hard to beat. I don’t have any experience with Julia- do you recommend it?

    • @jti107
      @jti107 2 years ago +1

      @@robmulla I work in aerospace, so a lot of my colleagues started using it, but I love Python too much so I haven't used it yet. I started with MATLAB and it took a lot of effort to transition to Python, so the switching cost is pretty high. When I have some time, I'd like to at least try some tutorials to see what the hype is about. Love your channel by the way, I've learned so much!!

  • @soren-1184
    @soren-1184 1 year ago

    Pandas 2 with arrow backend would be interesting here as well.

  • @nishantkumar-lw6ce
    @nishantkumar-lw6ce 1 year ago +1

    How do we add existing list comprehension functions in pyspark?

    • @robmulla
      @robmulla  1 year ago

      I'm not sure. I haven't used pyspark in a long time :D

  • @FabioRBelotto
    @FabioRBelotto 1 year ago +1

    You should share the notebook and the data (if it's available somewhere). It would be interesting to explore why tools such as modin or dask get such bad results.

    • @robmulla
      @robmulla  1 year ago +1

      Thanks! I did provide the code to the people at modin and they were looking into how to speed it up, but I haven't heard anything about it lately.

  • @jorgetimes2
    @jorgetimes2 1 year ago +1

    Hi, @Rob Mulla, would you please share the link to the data parquet, so that we could replicate your results and dig a bit deeper to tell where the actual problems lie? Thanks for your video!

    • @robmulla
      @robmulla  1 year ago

      The data is a combination of the parquet files in this dataset: www.kaggle.com/datasets/robikscube/reddit-place-2022-official-canvas-history
      Good luck!

  • @GiasoneP
    @GiasoneP 2 years ago +2

    Interesting video. I’ve been working on a problem to break up a 10 GB CSV file into multiple parquet files grouped by date. I’ve attempted to do it via Pandas chunksize= and Dask. Using Dask to read into a Pandas data frame (compute()) has yielded the fastest method. However, I think there are better methods. Moving on to pyspark next. The research continues…

    • @joaomurilopalonefauvel942
      @joaomurilopalonefauvel942 2 years ago +2

      Have you taken a look at polars?

    • @GiasoneP
      @GiasoneP 2 years ago

      @@joaomurilopalonefauvel942 i have not, but will check it out 👍

    • @robmulla
      @robmulla  2 years ago +1

      Thanks for the feedback. As I mentioned in the video every dataset may respond differently. I didn’t know how fast each would perform going into the video - and was a bit surprised by the results.

    • @igormriegel
      @igormriegel 2 years ago +2

      @@GiasoneP I'm sure Polars will shine for you. I'm using it on 40 GB datasets and it is pretty snappy.

    • @GiasoneP
      @GiasoneP 2 years ago +1

      @@igormriegel I’ll check it out this weekend. Thanks for sharing.

  • @lucasbraesch805
    @lucasbraesch805 1 year ago +2

    What about polars? This one beats everything else hands down in all the benchmarks that I have seen.

    • @robmulla
      @robmulla  1 year ago +1

      I've heard a lot of good things about polars and need to check it out.

  • @riptorforever2
    @riptorforever2 1 year ago +1

    A suggestion: add the pyarrow lib if you do an update video about this :) The presentation 'PyArrow and the future of data analytics' (id: 6aWX9bZizu4) by EuroPython Conference impressed me.

    • @robmulla
      @robmulla  1 year ago +1

      Oh wow. I need to check that out. Doing a review of polars soon. It’s really good!

  • @ajaypranav1390
    @ajaypranav1390 1 year ago +1

    Try with polars

    • @robmulla
      @robmulla  1 year ago

      I did! Check out my channel I have two new videos about it

  • @fizipcfx
    @fizipcfx 1 year ago +1

    How about cudf?

    • @robmulla
      @robmulla  1 year ago

      I didn’t cover it in this video but maybe in the future.

  • @CaribouDataScience
    @CaribouDataScience 2 years ago +2

    You misspelled "Tidyverse" 😂

  • @MichaelMantion
    @MichaelMantion 1 year ago +1

    My butt puckers whenever I see people use "dd". It is not an urban legend that people have lost a lot of data misusing dd in bash.

    • @robmulla
      @robmulla  1 year ago

      lol. That thought has never crossed my mind but it’s pretty funny.

  • @FabioRBelotto
    @FabioRBelotto 1 year ago +1

    If you are an experienced user and have issues with some libs, imagine what happens to a beginner lol

    • @robmulla
      @robmulla  1 year ago +1

      true! But this is a good thing to learn as a beginner too.

  • @barelmishal9668
    @barelmishal9668 1 year ago +1

    Hi, try polars. It is the best of them, much better than pandas by far.

    • @robmulla
      @robmulla  1 year ago

      Absolutely! I made a whole video about it. Check it out here. th-cam.com/video/VHqn7ufiilE/w-d-xo.html&feature=shares

  • @tashfinbashar1943
    @tashfinbashar1943 9 months ago

    Great video. Can you do one on Polars? @robmulla