Polars: The Next Big Python Data Science Library... written in RUST?

  • Published on 12 Jan 2025

Comments • 246

  • @rahuldev2380
    @rahuldev2380 2 years ago +341

    Polars is built on top of Apache Arrow which pandas supports. So you can easily convert your polars dataframe to pandas with almost zero overhead. I use polars to do the hard part and jump back to pandas for the visualization stuff

    • @cryptoworkdonkey
      @cryptoworkdonkey 2 years ago +11

      Only if you use pyarrow in the first place. Pandas converts Arrow data to its internal representation (NumPy arrays managed by the BlockManager) and back. It is not zero cost.

    • @rahuldev2380
      @rahuldev2380 2 years ago +2

      @cryptoworkdonkey Ah my bad. I thought they had updated their internals from numpy

    • @jakobullmann7586
      @jakobullmann7586 1 year ago +4

      Same here. There are some things where Pandas is more convenient, but for most stuff I strongly prefer Polars. It’s not just execution performance, but also the speed of writing the code.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago +11

      @cryptoworkdonkey good news, the Pandas 2.0 release candidate can now use pyarrow as a backend. Polars to Pandas conversions should be close to zero cost for Arrow-backed data.

  • @bigphab7205
    @bigphab7205 1 year ago +36

    10000 points for printing the version. Every tutorial video should do that.

    • @robmulla
      @robmulla 1 year ago +7

      Thanks! I forget to do it on all of my videos but your comment is going to remind me to do it in the future.

  • @brd5548
    @brd5548 2 years ago +175

    Our team tried to integrate polars into our analytics pipeline last year, and the result was kinda on and off. To be honest, the performance of pandas is not that bad: we spent some time on fine-tuning, like rewriting key bottlenecks with our own native modules or with vectorized pandas methods, and the result turned out just ok. On the other hand, integrating polars did require some major revamping and refactoring, due to API gaps and implementation differences between the two. However, the performance gains didn't seem to justify the effort. What's worse, while pandas does come with pitfalls and caveats here and there, polars is a relatively young project and it came with bugs in basic text manipulation operations.
    But don't get me wrong, that was my experience last year. I do think polars has the potential. It has a much more robust and modern architecture than pandas in my opinion. Its API style is cleaner and more consistent. And it comes with a query optimization engine, which many users can appreciate if you are familiar with tools like apache spark or some databases. Given time, I think polars should become another powerful player in the future. So, definitely give it a try if you're building something new!

    • @robmulla
      @robmulla 2 years ago +14

      Thanks for sharing! I haven't used polars in production yet, so it's interesting to hear about your experience. I guess there are limitations I didn't consider in this video. I totally agree it's worth giving a try.

    • @BiologyIsHot
      @BiologyIsHot 2 years ago +5

      This is the major bit. Who is bottlenecked by Pandas? I think the bottlenecks happen with ML or other modeling libraries which work with the data in the form of NumPy arrays.

    • @leventelajos5078
      @leventelajos5078 1 year ago +2

      "Its API style is cleaner" Really? I think Pandas is much more pythonic.

    • @incremental_failure
      @incremental_failure 1 year ago

      @leventelajos5078 Agree. Column assignment in Pandas seems more pythonic.

    • @konstagold
      @konstagold 1 year ago +5

      @BiologyIsHot When you're working with large data sizes, you will be bottlenecked by pandas in no time. Typically at that point, you switch to spark, which has its advantages, but also downsides. Polars looks to be a good middle ground between the two, which is what dask was trying to achieve.

  • @jakobullmann7586
    @jakobullmann7586 1 year ago +26

    13:20 Regarding learning the syntax… It’s worth mentioning that Polars syntax is very similar to PySpark, so it’s really two birds with one stone.

    • @robmulla
      @robmulla 1 year ago +8

      That’s a good point. Thanks for pointing it out. I really need to do a spark vs polars comparison video.

  • @Joselias156216
    @Joselias156216 2 years ago +15

    Nice video. Very interesting to see how Polars works; hope to see it more frequently in your future streams to learn more about its practical use.

    • @robmulla
      @robmulla 2 years ago +3

      Thanks Jose! I appreciate the feedback. I'm definitely going to give it a try in a future stream. I just need to find a good dataset for it.

  • @tmb8807
    @tmb8807 1 year ago +2

    I'm blown away by how fast this is. Sure there are some things it can't do, but man, even for just reading large data sets it's absolutely blazing.

  • @calum.macleod
    @calum.macleod 2 years ago +16

    Thanks for a good explanation of how Polars could benefit people who use Pandas and need more speed. In my project we already have a heavy emphasis on multiprocessing and fast inter-process communication, so I am especially interested to see a Pandas vs Polars single-core performance comparison for group and join. I hope that someone does the comparison and posts it to YouTube.

    • @robmulla
      @robmulla 2 years ago +2

      Glad it was helpful! If you look in the polars repo they have some queries that they benchmark. H2o also has a benchmark comparison of a few different libraries.

    • @calum.macleod
      @calum.macleod 2 years ago +1

      @robmulla Thanks for the reply. I will look into the benchmarks and h2o.

  • @santiagoperman3804
    @santiagoperman3804 2 years ago +7

    Great timing, I was looking to start playing with Polars since Mark Tenenholtz mentioned it some days ago. I went back to Pandas because I couldn't find the assign() and astype() equivalents in Polars. I thought they were lacking, but they seem to be with_columns() and cast(). Now I will resume more persistently.

    • @robmulla
      @robmulla 2 years ago +1

      Glad you found this video helpful. It does seem like polars may be worth the time investment now that it's becoming more established.

  • @gregharvey8574
    @gregharvey8574 2 years ago +34

    Thanks for bringing this to my attention, I think I might include polars in some productionalization processes. For data exploration, typically I only use parts of dataframes for plotting or investigation. Given that you can convert a polars dataframe to pandas, it seems like a good approach would be to have the full dataset in polars and then filter into a pandas dataframe and plot.

    • @robmulla
      @robmulla 2 years ago +7

      That's a good point about how you can convert the dataframe to pandas when you need to do exploration. I'll have to think about how to use this in my EDA pipelines.

    • @headbangingidiot
      @headbangingidiot 2 years ago +1

      @robmulla you can pass polars columns into plotting libs like plotly

    • @BiologyIsHot
      @BiologyIsHot 2 years ago

      The question though is do you save much time when doing this? Instantiation of Numpy arrays and Pandas dataframes themselves isn't the fastest. I guess if you have multiple "slow" actions to perform on the data you might have some benefits? Or if you really are working at such a massive scale with many many users that saving compute time is really valuable.

  • @scraps7624
    @scraps7624 1 year ago +2

    I saw some tweets about Polars but seeing it in action is something else
    Also, I can't believe it took me this long to find your channel, subbed!

    • @robmulla
      @robmulla 1 year ago

      That’s awesome! Glad you found my channel. Feel free to share with others!

  • @GiasoneP
    @GiasoneP 2 years ago +30

    Like PySpark AND Pandas. Second half mirrors PySpark. Due to the speed, and out of the box parallelization, I wonder how it stacks up against Spark and how its functionality compares to a cluster of machines. Take AWS for example, can it be applied to an EMR cluster? As a side note, I’m super excited about Rust and its future in data.

    • @cryptoworkdonkey
      @cryptoworkdonkey 2 years ago +6

      There are some Apache Arrow based Spark competitors (still too young), like Ballista (distributed DataFusion, written in Rust).
      We "buy" Spark for the "Resilient" in the RDD abbreviation. Polars can process 50 GB on one machine where Spark manages 35 GB, because of Spark's less efficient row-based abstraction (a trade-off of being distributed), the memory blow-up of Scala case classes, etc., versus the lean Rayon runtime in Polars.
      The Ray platform has the same Arrow format backend and is more efficient than Spark, but can't do streaming (yet).
      In the Polars repo, the polars-dask integration is empty.

    • @pabtorre
      @pabtorre 2 years ago +1

      Yeah the syntax is very similar to pyspark
      Wonder how well it'll run on a spark cluster...

    • @robmulla
      @robmulla 2 years ago +3

      Good question. I don’t think polars is meant as a replacement for pyspark because from what I can tell it doesn’t distribute computation across nodes.

    • @AWest-ns3dl
      @AWest-ns3dl 2 years ago +5

      I can confirm, Polars does not use nodes.

    • @RyanApplegatePhD
      @RyanApplegatePhD 1 year ago +2

      @robmulla With ever-improving compute, I think Polars could be in a sweet spot between Spark and Pandas. I know when I was parsing very raw large datasets in pandas I sometimes felt constrained and moved to Spark; however, there is a lot of overhead in using Spark effectively and this might split the difference.

  • @rackstar2
    @rackstar2 1 year ago +1

    I recently decided to fully transition over to using polars instead of pandas for a data pipeline project.
    The primary reason I'm liking polars over pandas is not just the speed (the speed is nice, don't get me wrong) but the space usage!
    Almost all of my operations entailed working with data larger than memory.
    One of the operations I have to do is pivoting a dataframe. My end result has thousands of columns!
    My kernel never seems to hold steady when doing this with pandas, but polars is really doing the trick for me.
    One small problem I did face, though, is exporting the results of the pipeline.
    I still have to resort to something like pyarrow and use its writer to do the export in chunks.
    This might just be because of how low my system memory is. Regardless, polars seems to be an excellent option for data processing and manipulation, and if you do want to showcase your data, you can always convert back and forth with pandas!

  • @jcbritobr
    @jcbritobr 2 years ago +3

    Nice stuff. Polars seems like a killer tool. Thank you for sharing.

    • @robmulla
      @robmulla 2 years ago

      Thanks for watching. It does seem promising.

  • @nikjs
    @nikjs 1 year ago +1

    3:35 - some audio interference starts from around this point, pls check the video

    • @robmulla
      @robmulla 1 year ago

      Thanks for the heads up. I noticed that when editing. Sorry about it.

  • @BiologyIsHot
    @BiologyIsHot 2 years ago +16

    I think the big problem is that it isn't interoperable with NumPy-based libraries. I'm honestly struggling to think of many cases where Pandas is too slow. Some of the features like a lazy/eager API could be nice, but I think most of the slow computations people are doing are within libraries that are going to require conversion to NumPy arrays anyway.

    • @robmulla
      @robmulla 2 years ago +3

      Yea, I guess it really depends on your use case. I've run across a few recently where polars was helpful.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago +2

      You can convert to and from Pandas very easily. Now that Pandas 2.0 can use pyarrow as a backend, that conversion should be close to zero cost for Arrow-backed data.

  • @curlyman_
    @curlyman_ 2 years ago +3

    This is my little trick for hyper optimizing data processing haha. Pivots are insanely fast in polars

    • @robmulla
      @robmulla 2 years ago

      Ohh. Never tried pivots in it.

  • @nikjs
    @nikjs 1 year ago +4

    For the python library developers : Pls create a wrapper lib that does this job of converting regular pandas syntax into the wee-bit more complicated polars syntax. I can see that not all ops would be readily convertible, but there's definitely some low-hanging fruit here, which would cover a lot of simple use cases.

    • @robmulla
      @robmulla 1 year ago

      That would be nice. But I also think it’s nice to have it different to make it clear it’s not the same.

  • @juan.o.p.
    @juan.o.p. 2 years ago +2

    Thanks for the recommendation, I will definitely give it a try 😊

    • @robmulla
      @robmulla 2 years ago +1

      Please do and let me know what you think. There might be negatives about it that I'm not aware of.

  • @AaronWoodrow1
    @AaronWoodrow1 2 years ago +2

    I don't fully get why it's geared more toward data pipelining rather than data exploration (as mentioned @ 13:33) if the data needs to be contained to a single host. Even with parallelization across multiple CPUs, there's still a data size cap limited by available memory. A tool such as PySpark (or Dask) seems better suited for pipelining, which ultimately consumes larger amounts of data.

    • @robmulla
      @robmulla 2 years ago +1

      Yea. I see your point. Sometimes you have data in between or just want a faster pipeline for a small job you run on a regular basis. Either way, if it was identical to pandas and faster then people would use it for sure!

    • @AaronWoodrow1
      @AaronWoodrow1 2 years ago +1

      @robmulla True, just a minor nit. Great video btw!

  • @rohitnair4268
    @rohitnair4268 2 years ago

    as usual rob nice video i have learned a lot from you

  • @sonnix31
    @sonnix31 2 years ago +2

    This is fantastic. Thank you

    • @robmulla
      @robmulla 2 years ago

      You're very welcome!

  • @bubbathemaster
    @bubbathemaster 2 years ago +7

    Extremely interesting. It’ll be hard to dethrone pandas due to the huge community support but I really like the lib.

    • @robmulla
      @robmulla 2 years ago +2

      I agree pandas is too entrenched at this point to be easily dethroned.

  • @MaavBR
    @MaavBR 1 year ago +1

    7:10 Quick correction, SAN is San Diego, not San Francisco
    San Francisco airport's code is SFO

    • @robmulla
      @robmulla 1 year ago

      Doh! Good catch.

  • @ChaiTimeDataScience
    @ChaiTimeDataScience 2 years ago +4

    DataTable is also pretty legendary, you might also find it super awesome.
    Thanks again for your amazing videos, I have watched and learned from every one of them. I hope I'll interview you about your 100k celebration sometime next year 🙏

    • @robmulla
      @robmulla 2 years ago +1

      Thanks Sanyam! I need to check it out. Hopefully 100k will come next year, but maybe 2024! Talk soon.

  • @AlexanderHyll
    @AlexanderHyll 2 years ago +4

    As a btw. If you want to plot smth quick, converting to a pandas is super fast (if ofc a bit mem inefficient). Can also just pass columns to plt. Just my 2 cents.

    • @robmulla
      @robmulla 2 years ago

      Good point, I do use df.plot() a lot though so it would take some getting used to.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago

      Now that Pandas 2.0 can use pyarrow as a backend, conversions should be close to zero cost for Arrow-backed data.

  • @Mari_Selalu_Berbuat_Kebaikan
    @Mari_Selalu_Berbuat_Kebaikan 2 years ago +2

    Let's always do good and encourage more people to do the same 🙏

  • @gabrielperfumo1122
    @gabrielperfumo1122 2 years ago +1

    Great channel!! Thanks for sharing. I'll check it out for sure!

    • @robmulla
      @robmulla 2 years ago

      Thanks Gabriel!

  • @The-KP
    @The-KP 2 years ago +2

    @Rob Mulla Nice that Polars can perform RDBMS-like ops, but what about the computation libs that bind to Pandas dataframes, like numpy, scipy, scikit-learn? If it can be used with those, or somehow replaces them, I'm in! Hopefully Polars is not an island.

    • @robmulla
      @robmulla 2 years ago

      I know you can easily convert from polars back to a pandas dataframe, and both can work with Apache Arrow.

  • @pimziengs2900
    @pimziengs2900 2 years ago +1

    Thanks for this video! I am a data scientist always looking for some new techniques xD.
    Cheers from the Netherlands!
    PS: There is some background noise in your video around 3:30.

    • @robmulla
      @robmulla 2 years ago

      Welcome! Glad to have a viewer from the Netherlands. Sorry about the noise at 3:30 - I didn't notice it until after I was done editing and then it was too late.

  • @patrickonodje1428
    @patrickonodje1428 2 years ago +2

    I love your work. You should have a course on data science.. for folks like us just learning

    • @robmulla
      @robmulla 2 years ago +3

      Maybe one day! Thanks for watching Patrick!

    • @patrickonodje1428
      @patrickonodje1428 2 years ago +1

      @robmulla Looking forward

  • @ApeWithPants
    @ApeWithPants 2 years ago +5

    Pandas has some strange quirks that always bothered me. Strange syntax or unintuitive copy/not copy behavior. Glad to see more competitors

    • @robmulla
      @robmulla 2 years ago +1

      I’m a big fan. But also think polars and others like it have good potential. Thanks for watching! Are you a kraken fan? Go Caps!

  • @cryptoworkdonkey
    @cryptoworkdonkey 2 years ago +4

    I think Polars should replace Pandas in ETL tasks. But it has some rough edges when it comes to constructing Exprs comfortably.
    And in the Arrow universe there is the DataFusion project as an alternative.

    • @robmulla
      @robmulla 2 years ago

      I agree. I haven't fully tested out the expressions to notice what I use in pandas that polars is missing. What is the DataFusion project? I'm not familiar with it.

    • @cryptoworkdonkey
      @cryptoworkdonkey 2 years ago

      @robmulla, DataFusion is the more "Arrow community" project (it is part of the Apache Arrow project), positioned as a Spark/Hive/MR challenger. It is designed to be more modular, with SQL and DataFrame APIs. It can be used as a library (it positions itself as a query engine for Arrow) by higher-level projects.
      Polars positions itself as a challenger to classical DataFrame libraries. But both can be used as a SQL CLI. Both have plan optimizers, Rayon parallelism, SIMD optimizations, etc.
      Both are cool. I don't know about the larger-than-memory capabilities of DataFusion. DataFusion is the foundation of the Blaze/Ballista distributed computing engines. The Polars Dask integration repo is currently not active.

  • @neronjp9909
    @neronjp9909 1 year ago

    How come every time you click the column name, it gets copied into the code you're typing? Is there a hotkey for that? My company's raw data column names are very long and contain _ / space / dot, so I always get slowed down when typing code with those column names. May I know how you do that at 8:07? Thanks

  • @samstanton-cook1419
    @samstanton-cook1419 2 years ago +6

    Great video thanks Rob! Our data science teams use polars a lot. For long timeseries aggregation queries (100M+ rows) we use the pykx python package to access the q/kdb+ language for still higher performance than pandas and polars. Have you seen it?
    kx.q.qsql.select(qtab, columns={'minCol2': 'min col2', 'medCol3': 'med col3'},
                     by={'groupCol1': 'col1'},
                     where=['col30.7']
                     )

    • @robmulla
      @robmulla 2 years ago

      I need to check that out. Pykx… first time hearing of it. Sounds cool though. Thanks for watching.

  • @tonik2558
    @tonik2558 2 years ago +3

    The usage in Python seems to mirror a lot of the standard Rust iterator API. Looks like it would be even better if used directly in Rust. Thanks for making a video about this.

    • @brainsniffer
      @brainsniffer 2 years ago +1

      I think that there is so much for data that is built in python that it’s easier to use an abstraction like this than to do things in rust, especially for interactions. It’s an interesting idea.

    • @robmulla
      @robmulla 2 years ago +1

      I have learning RUST on my todo list. Will you teach me? 😝

    • @tonik2558
      @tonik2558 2 years ago +3

      @robmulla The Book is an amazing starting resource. It's how I learned Rust, and it's probably the fastest way to get started with the language

    • @shadowangel8005
      @shadowangel8005 2 years ago

      @robmulla Google just posted a small course a week or so back

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 1 year ago +1

    Thanks!

    • @robmulla
      @robmulla 1 year ago

      Thanks so much 🙌

  • @mutley11
    @mutley11 1 year ago +2

    Very compelling presentation; many thanks. I would have liked to see an example of how user-friendly the error messages are. Rust error messages are surprisingly good in general and I was wondering if that is true of polars. You missed at least one opportunity to illustrate a typo. 😊

    • @robmulla
      @robmulla 1 year ago +1

      Glad it was helpful! Next time I'll try to throw more errors :D

  • @simplemanideas4719
    @simplemanideas4719 1 year ago +1

    Speed is always a priority, because it is equal to resource optimization. However, this leads to the question: how efficient are both libs per core?

    • @robmulla
      @robmulla 1 year ago

      Good question. I'd guess polars is faster on all fronts but it would depend on a lot of things.

  • @aminehadjmeliani72
    @aminehadjmeliani72 2 years ago +1

    Hi @rob, I think it's a good approach to diversify our tools these days, especially when it comes to dealing with memory (sometimes I find myself running out of time with pandas)

    • @robmulla
      @robmulla 2 years ago +2

      Absolutely! Well said.

  • @hensonjhensonjesse
    @hensonjhensonjesse 2 years ago +2

    It looks surprisingly similar to pyspark. Especially the lazy implementation. Pretty cool stuff!

    • @robmulla
      @robmulla 2 years ago +1

      Yea, a lot of similarities to pyspark!

  • @PlatinumDragonProductions999
    @PlatinumDragonProductions999 2 years ago +4

    I love Pandas, but I prefer Spark. This looks very Spark-like to me; I'm eager to make it my goto dataframe processor. :-)

    • @robmulla
      @robmulla 2 years ago

      If you prefer spark I’m guessing this will be a great package for you.

  • @chris_kouts
    @chris_kouts 1 year ago +1

    You should do a benchmarking video i was waiting for you to tell me if i should start using it

    • @robmulla
      @robmulla 1 year ago

      I made a video about it just yesterday! Check it out on my channel.

  • @bazoo513
    @bazoo513 1 year ago +1

    I wonder why authors of these tabular data manipulation libraries didn't adopt relational algebra terminology (or even SQL as a, if not the, manipulation language). For example, why isn't choosing only some columns called "projection"?
    Subtle syntax (and _especially_ semantics) differences between libraries designed to do essentially the same tasks make users' lives unnecessarily more difficult.

    • @robmulla
      @robmulla 1 year ago +1

      That’s a good point. Some libraries (like spark) do have the ability to write SQL directly on flat files like this.

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 2 years ago +6

    The rust community really produces brilliant stuff. Very impressive!
    Did you find any areas where polars is lacking vs pandas?
    Btw, have you checked out nu-shell? It’s essentially a new shell language designed to do the Unix-philosophy but with data frames for data flow. At least as far as I understand it. Written in rust of course.
    It’s in pretty early development but it feels pretty great to play around with and can probably produce some nice workflows.

    • @robmulla
      @robmulla 2 years ago +1

      Never heard of nu-shell but I’ll check it out. I am not too familiar with the Rust community but this package is pretty solid. As people have mentioned the syntax is much more verbose and it lacks some of the built-in pandas features.

  • @akhil-menon
    @akhil-menon 1 year ago

    Hi Rob, thank you for this super informative video! In one of your takeaways, you mentioned that Polars is a good fit if we have some really heavy data processing work. Would you be able to share some insight on how Polars would stack up against Pandas when having to perform heavy NumPy-specific computations? (Think linear and vector algebra, trigonometry, matrix operations)
    I read on SO that it is imperative to not kill the parallelization that Polars provides by using Python specific code, so it is my intuition that applying NumPy operations on Polars columns could result in a loss of parallelization. It would be great if you could share your thoughts on this. Thank you again for the amazing content you produce!

  • @jackychan4640
    @jackychan4640 2 years ago +1

    Happy New Year 2023

    • @robmulla
      @robmulla 2 years ago

      Same to you Jacky! 🎆

  • @K-mk6pc
    @K-mk6pc 1 year ago +1

    I am working on large data in pandas. But it's not a problem for me. Pandas is doing fine, in a few minutes.

  • @JordiRosell
    @JordiRosell 2 years ago +2

    For plotting polars, I think plotnine is a good option.

    • @robmulla
      @robmulla 2 years ago +1

      I have a video all about my favorite plotting libraries (including plotnine): th-cam.com/video/4O_o53ag3ag/w-d-xo.html&feature=shares

  • @CaribouDataScience
    @CaribouDataScience 2 years ago

    Good stuff!!

    • @robmulla
      @robmulla 2 years ago

      Glad you enjoyed it

  • @chintansawla
    @chintansawla 2 years ago +3

    The library feels like it's based off the syntax/methods of pyspark. A lot of the methods used are similar to how RDDs are converted to DataFrames in pyspark

    • @robmulla
      @robmulla 2 years ago +2

      Yes, definitely a lot of similarities between pyspark and polars. Pyspark has always been much slower for me when running on a single node.

    • @chintansawla
      @chintansawla 2 years ago

      @robmulla that's a bit shocking! Both seem to be performing in a similar fashion theoretically (lazy evaluation, parallel computing). Going to try and compare polars soon. Thanks

    • @jordanfox470
      @jordanfox470 2 years ago +1

      @robmulla have you tried pandas on spark? Databricks has that running.

    • @robmulla
      @robmulla 2 years ago

      @jordanfox470 no. Have you? How does it compare?

  • @Pedro_Israel
    @Pedro_Israel 1 year ago +1

    Hey Rob can you do a video about automatic EDA libraries? I used them and they blew my mind. I am amazed I didn't know about them earlier.

    • @robmulla
      @robmulla 1 year ago

      That's a good suggestion. What libraries have you used that you like? The main one I've seen is pandas profiling.

  • @bryanwilly4086
    @bryanwilly4086 1 year ago

    Perfect, thank you!

  • @두두-b2d
    @두두-b2d 2 years ago +1

    OMG.. thank you!!

  • @bazoo513
    @bazoo513 1 year ago +1

    "Split, apply, combine" approach sounds like it could employ massively parallel processing of graphics cards. Is there a CUDA implementation?

    • @robmulla
      @robmulla 1 year ago +1

      Yes! It’s called rapids. I need to make a video about it.

    • @bazoo513
      @bazoo513 1 year ago

      @robmulla Thanks!

  • @georgiyveter6391
    @georgiyveter6391 2 years ago +1

    Using Python 3.10.
    Created a dictionary:
    d = {'a': [1,2,3], 'b': [4, -5, 6]}
    Created a dataframe:
    df = pl.DataFrame(d)
    print(type(df))
    print(df)
    It all works. But if I change any number in dictionary d to a float, for example 6.8, then print(type(df)) still shows it's a dataframe, but the next print silently does nothing, like 'pass', and the script ends. Why?

    • @robmulla
      @robmulla 2 years ago

      That’s a great question. Is it only with 3.10?

  • @JustinGarza
    @JustinGarza 2 years ago +1

    I like this, but I wish it covered graphs. Does this use matplotlib or something to make graphs and charts?

    • @robmulla
      @robmulla 2 years ago

      It doesn’t. But you can always convert it back to a pandas data frame to plot.

    • @JustinGarza
      @JustinGarza 2 years ago

      @robmulla umm maybe I’ll wait til it gets more graphic/chart support or until pandas gets updated

  • @Myektaie
    @Myektaie 1 year ago +1

    Hi, thanks for this great video! It looks like Polars is very similar to Spark, do you know how they compare?

    • @robmulla
      @robmulla 1 year ago

      Thanks for the comment. They are very similar. Check out my most recent video where I compare the two.

  • @JohannPetrak
    @JohannPetrak 2 years ago +2

    Your timeit presentation includes the time to read the data which might not be such a good idea.

    • @robmulla
      @robmulla 2 years ago

      Nice catch, but I actually did that intentionally because data I/O is one area where polars can be much faster.

    • @JohannPetrak
      @JohannPetrak 2 years ago +1

      @robmulla it is just very bad practice to do this, and there are other issues which may totally distort the measurements, like the OS caching read data in buffers from a previous read.

    • @robmulla
      @robmulla 2 years ago

      @JohannPetrak that’s a good point. Any idea how I could properly compare the read time in a way that wouldn’t be messed up by the caching?

    • @JohannPetrak
      @JohannPetrak 2 years ago

      @robmulla I think there is no way to avoid it, but it may be possible to reduce the effect by loading files that are much larger than what the OS might use for caching, and also by loading a sequence of many different files for a single benchmarking run, then repeating this several times and taking the average (and stdev). Also maybe check how much the external storage is the bottleneck by also loading from SSDs or memcached files.
      With HDDs this will be A LOT slower than the CPU based benchmarks, so I would argue to separate these benchmarks from each other.
      But even with the CPU based ones, running on larger data structures (on a computer that has even larger RAM) may give better results as the impact of other OS, memory management, (JIT) interpreter etc optimizations gets reduced.
      Sorry, I do not want to claim I know how to do proper benchmarks, but I do know (from experience) it is easy to not do it properly :)
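
The suggestions above can be sketched with the standard library alone. This is an illustration of the measurement pattern, not a benchmark of Polars or pandas: `load_rows` and `compute` are hypothetical stand-ins for reading a file and for the group/aggregate step, and the row and repeat counts are arbitrary. The idea is to time loading and computation separately, repeat each measurement, and report mean and standard deviation.

```python
import statistics
import timeit


def load_rows(n):
    """Hypothetical stand-in for reading a dataset from disk."""
    return [(i % 10, float(i)) for i in range(n)]


def compute(rows):
    """Hypothetical stand-in for the computation being benchmarked."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0.0) + value
    return totals


def summarize(times):
    """Return (mean, standard deviation) of a list of timings in seconds."""
    return statistics.mean(times), statistics.stdev(times)


N_ROWS = 50_000
REPEATS = 5

# Time the load step on its own.
load_times = timeit.repeat(lambda: load_rows(N_ROWS), repeat=REPEATS, number=1)

# Load once outside the timer, then time only the computation.
rows = load_rows(N_ROWS)
compute_times = timeit.repeat(lambda: compute(rows), repeat=REPEATS, number=1)

for label, times in (("load", load_times), ("compute", compute_times)):
    mean, stdev = summarize(times)
    print(f"{label}: {mean:.4f}s ± {stdev:.4f}s over {len(times)} runs")
```

This does not address OS file caching, which would need the larger files or rotating inputs described in the comment.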

  • @fredgavin
    @fredgavin 2 years ago +2

    Tried Polars multiple times, and felt that it was too verbose. Just cannot give up R's data.table, which is the best data manipulation package in the data science world, no competitor at all.

    • @robmulla
      @robmulla 2 years ago

      Yea. Definitely more verbose than pandas. I haven’t used R in years but don’t remember it ever being the fastest.

  • @user-fv1576
    @user-fv1576 10 months ago

    Looks a bit like SQL with the select. Newbie question, why not just use pandasql library?

  • @ArnabAnimeshDas
    @ArnabAnimeshDas 2 years ago +1

    I would import another plotting library which produces a better plot anyways.

    • @robmulla
      @robmulla 2 years ago

      Yep, that's totally reasonable. Thanks for watching.

    • @ArnabAnimeshDas
      @ArnabAnimeshDas 2 years ago +1

      @robmulla also you can convert polars dataframe to pandas if you want to

  • @张世濠-j8e
    @张世濠-j8e 2 years ago +2

    somehow it's very similar to Spark on AWS Glue ?

    • @robmulla
      @robmulla 2 years ago +1

      Yes, very similar but I think polars is intended for a single machine vs. spark which can be distributed across nodes.

  • @ankan650
    @ankan650 2 years ago +1

    Wow. It looks like Apache Spark might be obsolete soon. Can you also compare the Ray packages with Polars? I think Ray is not exactly for data processing but rather for more compute-intensive tasks. Thanks.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      I benchmark ray in a different video if you want to check it out.

  • @suvidani
    @suvidani 2 ปีที่แล้ว +1

    How does the performance compare to pyspark? The syntax is very similar to pyspark.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Good question. I might need to test it out. Haven’t used spark in years and had some bad experiences but it’s probably gotten better since then.

  • @donnillorussia
    @donnillorussia ปีที่แล้ว +1

    Isn't this "split-apply-combine" approach similar to map-reduce? Just curious 😉

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      Yes! Exactly. Map reduce (like in spark) is very similar. Polars only runs single node, and map reduce I believe can be done across nodes.
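
For readers curious about the comparison above, here is a rough single-machine sketch of split-apply-combine written in plain Python, with the "apply" step expressed as a reduce. It is not how Polars or Spark implement it internally; the sample `rows` and all names are made up for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (key, value) records
rows = [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("a", 5)]

# Split: bucket the values by key
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Apply: reduce each group independently (here, a sum)
applied = {
    key: reduce(lambda total, v: total + v, values)
    for key, values in groups.items()
}

# Combine: gather the per-group results into one sorted result
combined = sorted(applied.items())
print(combined)  # [('a', 9), ('b', 30)]
```

Because each group is reduced independently, the apply step is the part that can be spread across threads on one machine or across nodes in a cluster.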

  • @Matias-eh2pn
    @Matias-eh2pn 2 ปีที่แล้ว +1

    How did you configure that theme on jupyter?

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      I have a whole video on my setup. Check it out here: th-cam.com/video/TdbeymTcYYE/w-d-xo.html

  • @valuetraveler2026
    @valuetraveler2026 ปีที่แล้ว +1

    URLError:

    • @robmulla
      @robmulla  ปีที่แล้ว

      Strange. Did you get this error when trying to pip install? Otherwise polars shouldn't be using anything to connect to the internet.

  • @praveenmogilipuri4524
    @praveenmogilipuri4524 2 ปีที่แล้ว +1

    Hi, can anyone help me connect polars with snowflake? I can do it through pandas, but I don't want to use pandas.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      I’ve never done anything like that before but maybe others will know how.

  • @JonLikesStats
    @JonLikesStats 4 หลายเดือนก่อน

    Why do we compare polars to pandas instead of polars to dask? I dabble in Rust myself, so I'm interested in polars. But the comparison most people make seems inherently unfair because of multithreading.

  • @FabioRBelotto
    @FabioRBelotto 6 หลายเดือนก่อน

    You should have tested polars with the same test as you did with dask, modin and vaex

  • @akshaydushyanth9720
    @akshaydushyanth9720 2 ปีที่แล้ว +1

    Is it similar to pyspark? Whats the difference between both?

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      Only runs on a single node. Much faster than pyspark when working with data that can fit in memory.

  • @yayasssamminna
    @yayasssamminna 10 หลายเดือนก่อน

    Please make a tutorial on Dask!!!

  • @AyahuascaDataScientist
    @AyahuascaDataScientist ปีที่แล้ว

    Polars doesn’t have a .info() method? I can’t use it…

  • @jay_wright_thats_right
    @jay_wright_thats_right 2 หลายเดือนก่อน

    Orders of magnitude faster? What does that even mean?
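
As a rough answer to the question above: one order of magnitude is a factor of 10, so the number of orders of magnitude in a speedup is the base-10 logarithm of the ratio of the two run times. A small sketch, where the timings are invented for illustration and are not measurements from the video:

```python
import math


def orders_of_magnitude(slow_seconds, fast_seconds):
    """Return log10 of the speedup; 1.0 means 10x faster, 2.0 means 100x faster."""
    return math.log10(slow_seconds / fast_seconds)


# Hypothetical timings: 50 s for the slow library vs 0.5 s for the fast one
speedup_orders = orders_of_magnitude(50.0, 0.5)
print(round(speedup_orders, 2))  # 2.0
```

So "orders of magnitude faster" loosely means "at least around 100x faster", as opposed to, say, 2x or 3x.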

  • @JayRodge
    @JayRodge ปีที่แล้ว +1

    Have you tried RAPIDS cuDF?

    • @robmulla
      @robmulla  ปีที่แล้ว

      A little bit. It can be really fast but requires that your data is small enough to fit into your GPU memory.

  • @Capsaicinophile
    @Capsaicinophile 2 ปีที่แล้ว +1

    Unless you need to run your scripts over and over, I believe Polars cannot replace Pandas, as it takes more effort to write a simple aggregation. 2 seconds of faster execution is not worth 20 seconds of writing a line for every aggregation column and giving it an alias.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      Yea. For quick scripts on small data and EDA, I’m sticking with pandas.

  • @rhard007
    @rhard007 2 ปีที่แล้ว

    Is it not possible to use Matplotlib or Seaborn with Polars?

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      It probably is possible. It's just not built into the dataframe as methods like it is in pandas. Just one additional step or you can convert the final data to pandas after processing.

  • @rolandheinze7182
    @rolandheinze7182 ปีที่แล้ว

    Polars syntax seems very similar to pyspark, and in my opinion therefore hurts readability vs pandas

  • @michaeldeleted
    @michaeldeleted ปีที่แล้ว +2

    OMG I just completely replaced pandas with polars and all the regular pandas commands worked

    • @robmulla
      @robmulla  ปีที่แล้ว

      Wait, what? I think the syntax should be very different. Unless they released a new version that I don't know about. Can you show an example?

    • @michaeldeleted
      @michaeldeleted ปีที่แล้ว +1

      Oops, didn't change all my pd to pl. LOL was still using pandas

    • @robmulla
      @robmulla  ปีที่แล้ว

      @@michaeldeleted oh! That explains it.

  • @XavierSoriaPoma
    @XavierSoriaPoma 2 ปีที่แล้ว +1

    So why should we use polars instead of pandas?

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      Did you watch the video? 😂 speed is the main reason.

    • @XavierSoriaPoma
      @XavierSoriaPoma 2 ปีที่แล้ว +1

      @@robmulla yeah but still I'm not convinced, it's like tensorflow or pytorch they are not as fast as Flux, but we still use them in python

  • @mishmohd
    @mishmohd 2 ปีที่แล้ว +1

    Can we suggest they change the name to Polaris

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Why do you suggest that?

  • @EircWong
    @EircWong 2 ปีที่แล้ว +1

    Noise at 3:29, for about 10 seconds

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      Yes! I noticed that. I forgot to put my phone further away from the mic. I tried to edit it out as much as possible. Hopefully it wasn't too distracting.

  • @leonidgrishenkov
    @leonidgrishenkov 2 ปีที่แล้ว +1

    In some cases Polars syntax seems like PySpark

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      I've been hearing that a lot :D

    • @leonidgrishenkov
      @leonidgrishenkov 2 ปีที่แล้ว +1

      @@robmulla ahaha sorry, I’m just a captain obvious 😂

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      @@leonidgrishenkov No it's a good point that I didn't realize until people pointed it out. I personally don't use pyspark a ton. Thanks for watching.

  • @hanabimock5193
    @hanabimock5193 2 ปีที่แล้ว +1

    I already see books and videos about polars. The same as with pandas. It is like come on, who needs a book for pandas? Are you kidding me ?

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Why do you dislike the fact that there are books about it? Honestly curious. Thanks for watching!

  • @richardbennett4365
    @richardbennett4365 2 ปีที่แล้ว +1

    That's the problem with people who use pandas. By and large, they don't know about polars. But why? Is it the polars creator's fault for not promoting his product, or laziness on the part of pandas users who just don't look for something better?
    Also, if one writes import polars as pd, then one doesn't need to rewrite code written for pandas. Or, one can import polars as po. I never understood why people import this package as pl. That would be for a package called plank, like the dock replacement.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Importing as pl makes the most sense to me and it’s what their docs recommend.

  • @commonsense1019
    @commonsense1019 2 ปีที่แล้ว +1

    Well the core of pandas can also be changed using RUST no big deal

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      It can. But will it?

  • @AWest-ns3dl
    @AWest-ns3dl 2 ปีที่แล้ว +1

    Polars syntax is similar to spark

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      I’ve been hearing that 😃

  • @BillyT83
    @BillyT83 ปีที่แล้ว +1

    So... Pandas + Dask = Polars?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Kinda… but it’s really just its own thing.

  • @grabani
    @grabani 2 ปีที่แล้ว +1

    Interesting.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Glad you think so!

  • @vzmaster
    @vzmaster 2 ปีที่แล้ว

    I'm running into a different problem when I try to speed up pandas (or dask): they eat up memory really fast.
    In a JupyterLab environment, I load ~3-5 MB of data, use the pandas .extractall() function on a string field, and then compare the results with int fields (counting the matches).
    In a single thread it takes several weeks to calculate. If I use multiprocessing, then when comparing results with df.loc it eats up 200 GB+ of memory.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      That doesn't sound right. If your data is 3-5Mb I can't imagine any sort of processing needing to take a week to calculate. I'm thinking it's probably something in your code and not an issue with pandas or dask.

    • @vzmaster
      @vzmaster 2 ปีที่แล้ว

      @@robmulla I actually have 2 tables, a big one and a small one. The big one has the data (~2M entries, ~150 MB), the small one has the patterns (~10k entries, ~2 MB). I need to run each pattern on all the data. That's why it may take long.
      But the primary problem is not speed, it's memory consumption. That's why I take small chunks of the big table, ~5-30 MB. But even with 5 MB I get a memory overflow at 200 GB.
      Here is the code:
      import numpy as np
      import pandas as pd
      import sqlalchemy as sql
      sql_engine = sql.create_engine('mysql+mysqlconnector://.......................')
      df = pd.read_sql_query("..........................", sql_engine)  # big table
      patterns = pd.read_sql_query("..........................", sql_engine)  # small table
      ----------------------------------------------jupyterlab block separation--------------------------------------------------------
      def findoccurancesofpatterns(pat):
          (idx, row) = pat
          # Run one regex pattern against every row of the big table
          res = df.summarynormalized.str.extractall(row.pattern)
          numvaluesstats = pd.DataFrame(columns=['pattern', 'numvalueorder', 'param1', 'param2', 'param3'])
          if len(res) > 0:
              numvaluecount = len(res.columns)
              # Attach the int fields of the source row to each match
              res = pd.merge(res.reset_index(), df[['param1', 'param2', 'param3']], how='left', left_on='id', right_index=True)
              for i in range(numvaluecount):
                  # Count how often capture group i equals each int field
                  numvaluesstats.loc[len(numvaluesstats)] = [row.pattern, i, (res['param1'] == res[i].astype('Int64')).sum(), (res['param2'] == res[i].astype('Int64')).sum(), (res['param3'] == res[i].astype('Int64')).sum()]
          return numvaluesstats
      from multiprocessing.pool import Pool
      pool = Pool(50)
      allnumvaluesstats = []
      for numvaluesstats in pool.imap_unordered(findoccurancesofpatterns, patterns.iterrows()):
          allnumvaluesstats.append(numvaluesstats)

  • @cradleofrelaxation6473
    @cradleofrelaxation6473 2 ปีที่แล้ว +1

    Is it just me, or is the syntax a bit more complicated than pandas whenever they differ?!

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Yes. I agree, it ends up being more verbose.

  • @richardbennett4365
    @richardbennett4365 2 ปีที่แล้ว +1

    What??? Polars is supposed to give the same result as pandas. Duh. Polars is a pandas replacement.

  • @whitebai6367
    @whitebai6367 2 ปีที่แล้ว +1

    Okay, I'd like to use rust directly.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      You can do it! Polars has a rust API too. Try it out and let me know what you think.

  • @NickWindham
    @NickWindham ปีที่แล้ว +1

    Just use Julia instead of Python. Then you can do all this with speed similar to Rust, in one language that has even simpler syntax than Python.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Oh really? I haven’t had a chance to need to use Julia but I know it’s popular to use with spark.

  • @ryanwhite7887
    @ryanwhite7887 ปีที่แล้ว +1

    At 6:59 in the video, you can clearly hear him or her say "fifteen", but he or she types a 10 and continues without acknowledging his or her mistake. This is the sign of unambiguous processing and clearly his or her words can only be taken at face value. This has totally discredited all tutorials produced by this channel and I (they/them) will be withdrawing the like that I (they/them) had previously awarded the video.

  • @ErikS-
    @ErikS- ปีที่แล้ว +1

    Just take a huge amount of RAM.
    I did that also...

    • @robmulla
      @robmulla  ปีที่แล้ว

      I used polars on a live stream and crashed my computer during it because it ate all my memory. There is a way to limit the amount it uses, I think.

  • @richardbennett4365
    @richardbennett4365 2 ปีที่แล้ว +1

    He said 15, but he wrote 10 at 7min 05s.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Good catch!

  • @ibekweobinna3514
    @ibekweobinna3514 2 ปีที่แล้ว +1

    Rob, can I add you to a website as one of the best tutors of data science? Man, you are good. Funnily enough, I am still learning pandas, and then boom, along came polars.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Thanks Ibekwe. Never stop learning!

  • @rahulrjb
    @rahulrjb 11 หลายเดือนก่อน

    Very PySpark-like syntax

  • @nitinkumar29
    @nitinkumar29 ปีที่แล้ว +1

    I will let it mature before dealing with this.

    • @robmulla
      @robmulla  ปีที่แล้ว

      That’s a fair approach. Adopting things too early can be problematic.