The BEST library for building Data Pipelines...

  • Published Jun 5, 2024
  • Building data pipelines with #python is an important skill for data engineers and data scientists. But what's the best library to use? In this video we look at three options: pandas, polars, and spark (pyspark).
    Timeline:
    00:00 Data Pipelines
    01:11 The Data
    02:32 Pandas
    04:34 Polars
    06:15 PySpark
    09:15 Spark SQL
    Follow me on twitch for live coding streams: / medallionstallion_
    My other videos:
    Speed Up Your Pandas Code: • Make Your Pandas Code ...
    Intro to Pandas video: • A Gentle Introduction ...
    Exploratory Data Analysis Video: • Exploratory Data Analy...
    Working with Audio data in Python: • Audio Data Processing ...
    Efficient Pandas Dataframes: • Speed Up Your Pandas D...
    * YouTube: youtube.com/@robmulla?sub_con...
    * Discord: / discord
    * Twitch: / medallionstallion_
    * Twitter: / rob_mulla
    * Kaggle: www.kaggle.com/robikscube
    #python #polars #spark #dataengineering
  • Science & Technology
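The kind of pipeline the video compares across the three libraries can be sketched in a few lines of pandas (a toy stand-in; the column names here are made up, the real video uses the flight-delay dataset):

```python
import pandas as pd

# Toy stand-in for the flight-delay data used in the video.
flights = pd.DataFrame({
    "airline": ["AA", "AA", "DL", "DL", "UA"],
    "dep_delay": [5.0, 15.0, 0.0, 30.0, 10.0],
})

# The same shape of pipeline shown for all three libraries:
# filter -> group -> aggregate -> sort.
summary = (
    flights[flights["dep_delay"] >= 0]
    .groupby("airline")["dep_delay"]
    .mean()
    .sort_values(ascending=False)
)
print(summary)
```

Polars and PySpark express the same filter/group/aggregate steps with their own syntax, which is the comparison the video walks through.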

Comments • 136

  • @robmulla
    @robmulla  1 year ago +10

    If you enjoyed this video please consider subscribing and check out some of my videos on similar topics:
    - Polars Tutorial: th-cam.com/video/VHqn7ufiilE/w-d-xo.html&feature=shares
    - Pandas Alternatives Benchmarking: th-cam.com/video/LEhMQhCv3Kg/w-d-xo.html&feature=shares
    - Speed up Pandas: th-cam.com/video/SAFmrTnEHLg/w-d-xo.html&feature=shares

  • @anchyzas
    @anchyzas 11 months ago +2

    These are phenomenal; I especially like these short 10-15 min videos. Thanks a lot for sharing all these relevant and up-to-date topics!

  • @riessm
    @riessm 1 year ago +18

    One thing you said implicitly is quite important: the footprint of polars is waaayyyy smaller than pandas, which makes polars feel like a good choice for edge or serverless computing. In those cases I often refrain from using pandas because of the resources needed and the startup time. I then end up doing funny stuff with dicts, classes, tuples… I'm considering exploring polars for that.

    • @robmulla
      @robmulla  1 year ago +5

      Very good points! I need to start using polars more honestly.

  • @joseortiz_io
    @joseortiz_io 1 year ago +2

    Great video! Always curious about Spark and this gave a great overview of these 3 tools! 💡

    • @robmulla
      @robmulla  1 year ago +1

      Thanks for watching Jose!

  • @tonyle7562
    @tonyle7562 1 year ago

    Thanks for such awesome content. I love polars and have been trying it since your video came out; it would be nice to see you use it for a data exploration video :D

  • @fee-f1-foe-fum
    @fee-f1-foe-fum 1 year ago +1

    Another great video! Thanks Rob! Looking forward to the next stream

    • @robmulla
      @robmulla  1 year ago

      Thanks for watching. Glad you liked it!

  • @shivayshakti6575
    @shivayshakti6575 1 year ago +2

    Hey Rob, huge fan of your work, keep rolling😀

    • @robmulla
      @robmulla  1 year ago +1

      Thanks. Will do!

  • @prashlovessamosa
    @prashlovessamosa 1 year ago +2

    I like these types of videos as they clear up
    all the confusion.

    • @robmulla
      @robmulla  1 year ago

      Glad you like them!

  • @lumieraartabima231
    @lumieraartabima231 1 year ago +2

    Really useful for me, thank you rob

    • @robmulla
      @robmulla  1 year ago +1

      Glad you found it useful Lumiera. Thanks for watching.

  • @aminehadjmeliani72
    @aminehadjmeliani72 1 year ago +1

    Thanks for the educational content Rob

    • @robmulla
      @robmulla  1 year ago +1

      My pleasure!

  • @wilsonsantosmarrola1251
    @wilsonsantosmarrola1251 1 year ago +1

    Great content Rob! TKS

    • @robmulla
      @robmulla  1 year ago

      Glad you like it!

  • @DarthJarJar10
    @DarthJarJar10 1 year ago +2

    Rob, thank you! It's almost as if you read minds! This video sort of went above-and-beyond here! I'd been toying with trying a local session of Spark, and thanks to you, now have the impetus to give it a go!

    • @robmulla
      @robmulla  1 year ago

      Awesome! The problem I've always run into for personal projects with spark is that the data I'm using is small enough not to warrant it. But it's a great skill to brush up on if you intend to work at a large company.

  • @somerset006
    @somerset006 1 month ago

    Thanks for the great video! I'd like to see a comparison with other distributed Python libraries, such as Modin. Thanks!

  • @aabbassp
    @aabbassp 1 year ago +4

    I really like your content. Absolutely grade A+

    • @robmulla
      @robmulla  1 year ago

      Glad you enjoy it!

  • @arturabizgeldin9890
    @arturabizgeldin9890 1 year ago

    Great introduction video! Thank you!
    It looks like most of the PySpark time went to initializing the session itself; as far as I understand, the session is created once and then reused by later getOrCreate() calls. But anyway, for bigger pipelines Spark will work faster.

  • @peterluo1776
    @peterluo1776 3 months ago

    Excellent, great content.
    Thanks for sharing.

  • @TheSiddhaartha
    @TheSiddhaartha 1 year ago +3

    It was a great video and very useful. Adding Spark to the mix was just awesome! For a next video, covering duckdb and its benefits vs polars, or maybe duckdb alongside polars, would be great! The founder of duckdb said that for most companies it is enough, so testing and discussing that claim would also be great. Duckdb is said to use vectorized execution; a discussion of how that is faster or better would also be great. Thanks!

    • @robmulla
      @robmulla  1 year ago

      Great tip! I've been hearing a lot about duckdb lately so I need to check that out. I think I saw the twitter thread you are talking about. Interesting that they can be combined.

  • @jorislimonier5530
    @jorislimonier5530 1 year ago

    Hi Rob and thanks for the excellent work, I enjoy each of your videos!
    I would be interested in a video explaining how to put several machine learning libraries pulled from GitHub in a row, for example: Object detection + Keypoints estimation + Person identification. Also, how to manage compatible library versions for all these repos that have different (incompatible) requirements.
    Thanks!

  • @Micro-bit
    @Micro-bit 1 year ago +1

    Thanks!!! Great JOB!

    • @robmulla
      @robmulla  1 year ago

      Glad you liked it!

  • @orlandogarcia885
    @orlandogarcia885 7 months ago

    Hi! Thank you for your video!
    A question: which version of pandas were you using? I see that you are not using the "arrow" dtype when reading the parquet file with pandas.

  • @chillvibe4745
    @chillvibe4745 1 year ago +4

    Great video! I have a Junior Data Engineer interview coming up and I'm stressed. I don't have any previous working experience in this field. I feel somewhat confident in SQL and Pandas and have been practicing on Strata Scratch. I absolutely hate the Data Structures and Algorithms type of questions like the ones on leetcode and I can't even answer the easy ones. I'm worried that my interview will have those kinds of coding problems. My initial goal was to become a Data Analyst but decided to apply for Data Engineer since it is a junior position.

    • @robmulla
      @robmulla  1 year ago +1

      Thanks for the feedback. I hope your interview goes well. It sounds like you are well prepared and will do great! Do let me know how it goes.

    • @chillvibe4745
      @chillvibe4745 1 year ago +1

      ​@@robmulla Thanks for the reply! I just had the interview but it was just talking with a recruiter, nothing technical. Hopefully, if they proceed with me I'm going to have to solve coding questions in a week or so. I just hope the coding questions are going to be like the ones on Strata Scratch and not the ones on Leetcode. If they proceed with me and I get the coding questions and a technical interview, I'm definitely going to share how it went.

    • @ErikS-
      @ErikS- 1 year ago

      "junior data engineer"
      You need a few years of education for that, plus quite a bit of math, statistics and whatnot... Programming is a really different animal from statistics.
      Companies hiring programmers for this will only risk producing wrong analyses.

  • @anupambayen5554
    @anupambayen5554 1 year ago +1

    Thanks for this great video

    • @robmulla
      @robmulla  1 year ago

      Glad you liked it!

  • @bahamutffxii
    @bahamutffxii 1 year ago +1

    Hello Rob. In your video, you said that you use Anaconda for environment management, but you install all packages through pip. Could you tell me how to make PyPI the main channel in anaconda and reinstall all packages from it? I currently have an anaconda setup with channels: 'conda-forge' , 'defaults' , 'pandas'. How do I rearrange all installed packages from pip respecting all dependencies?

  • @seniorpeepers
    @seniorpeepers 1 year ago

    Good stuff

  • @Alexander-pk1tu
    @Alexander-pk1tu 1 year ago +1

    Very good video. Can you please make more advanced polars videos? I have started switching from pandas to polars and I really want to learn how to do more advanced things with it.

    • @robmulla
      @robmulla  1 year ago

      Sure. I need to find some good examples to show. The polars docs have some nice ones.

  • @alejandroramirez6761
    @alejandroramirez6761 1 year ago +1

    Rob, thank you so much!

    • @robmulla
      @robmulla  1 year ago +1

      Absolutely Alejandro!

  • @josho225
    @josho225 1 year ago +1

    Kinda beginner-intermediate learner here, but how do you manage units in these data frames/sets? Datatypes are good and all (ints, floats, booleans), but how do you keep track of units like seconds, hours, kilometers, miles, degrees, etc.? Would you just add the units in the header, e.g. "max_delay_minutes"? Sorry if this question is trivial.
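For what it's worth, one common convention (an assumption on my part, not something the video states) is exactly what the comment guesses: encode the unit in the column name so it travels with the data. A pandas sketch with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"max_delay": [90, 120], "distance": [450.0, 980.0]})

# Put the unit in the header so it survives joins, exports, etc.
df = df.rename(columns={
    "max_delay": "max_delay_minutes",
    "distance": "distance_km",
})
print(df.columns.tolist())
```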

  • @steve_dunlop
    @steve_dunlop 1 year ago +10

    Hey Rob, this was a great video - clear and concise. Could you explain how you would set up an analysis that would run regularly as the data changed? For example, the flight data you used in this example, let's say that was updated once a week and you needed to update the aggregate stats, and maybe even track the aggregates over time. Thanks!

    • @robmulla
      @robmulla  1 year ago +5

      That's a great question. I'm sure others could answer it better, but in my experience you can solve this with: 1) a batch process that runs your aggregations at set intervals, like daily, and stores them out to summary files/tables; 2) streaming options that I'm not at all experienced with, like: spark.apache.org/docs/latest/streaming-programming-guide.html
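Option 1 from this reply can be sketched as a small batch job; the `run_weekly_summary` helper and all the paths and column names here are hypothetical, not from the video:

```python
import tempfile
from pathlib import Path

import pandas as pd

def run_weekly_summary(df: pd.DataFrame, out_dir: Path, run_date: str) -> Path:
    """Recompute the aggregates for one batch run and write a dated summary file."""
    summary = df.groupby("airline")["dep_delay"].mean().reset_index()
    summary["run_date"] = run_date  # lets you track the aggregates over time
    out_path = out_dir / f"summary_{run_date}.csv"
    summary.to_csv(out_path, index=False)
    return out_path

flights = pd.DataFrame({"airline": ["AA", "DL"], "dep_delay": [10.0, 20.0]})
out = run_weekly_summary(flights, Path(tempfile.mkdtemp()), "2024-06-05")
print(out.name)
```

A scheduler (cron, Airflow, etc.) would call this on each new data drop; the dated files are the "aggregates over time" the question asks about.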

  • @legisam1754
    @legisam1754 1 year ago +2

    Nice job, Rob. Keep them coming 👍

    • @robmulla
      @robmulla  1 year ago +1

      I'll try my best!

  • @TheKick32
    @TheKick32 1 year ago +3

    Great video! Do you have any thoughts on duckDB?

    • @robmulla
      @robmulla  1 year ago +1

      I've never used it but people seem to keep mentioning it so I need to take a closer look! I started using polars after it was mentioned in the comments of my previous videos.

    • @TheKick32
      @TheKick32 1 year ago

      @@robmulla I didn't hear about it till this week. I think it's relatively new; I couldn't find anything about it older than a month.

  • @JordiRosell
    @JordiRosell 1 year ago +2

    Awesome. What do you think about ibis? It can act as a frontend for Pandas, Polars, Spark, etc
    :O

    • @robmulla
      @robmulla  1 year ago

      Never heard of it before but will def check it out.

  • @manjeetkumaryadav4377
    @manjeetkumaryadav4377 10 months ago

    Excellent

    • @robmulla
      @robmulla  10 months ago

      Thank you so much 😀

  • @Arkantosi
    @Arkantosi 1 year ago +1

    Hi Rob, wonderful video as always! Can you make a video on how to deploy a trained machine learning model (maybe the XGBoost forecaster you made) using Docker?

    • @robmulla
      @robmulla  1 year ago

      Thanks for the suggestion. I really need to make a video about MLops but I'm not the most experienced in it. Thanks for the idea I'll keep it in mind.

  • @serbiansuperliga1339
    @serbiansuperliga1339 1 year ago +1

    As of a couple of days ago, you can use SQL with Polars as well

  • @hunghai6378
    @hunghai6378 7 months ago

    That's a great video

  • @kevinoudelet
    @kevinoudelet 10 months ago

    thx!

  • @DarkShine101
    @DarkShine101 1 year ago +3

    Great work! It would be cool to see how you can use Spark with ML. I have been using Pandas to do a lot of ML work recently, but my data grew too large to fit in my RAM. I need to swap to PySpark, but I know my scikit-learn pipelines won't work with it.

    • @robmulla
      @robmulla  1 year ago +1

      Good suggestion. I've done some ML with spark, but that was many years back. Usually with deep learning you can train on batches so having all the data in memory is not important. I believe spark tries to follow similar syntax to sklearn pipelines.

    • @DarkShine101
      @DarkShine101 1 year ago

      @@robmulla thanks Rob! I thought the data needed to be in memory all at once to do training. It'll be way easier to split my data and train in chunks.

    • @casota272
      @casota272 1 year ago

      @@DarkShine101 You can also leverage the Pandas API in Spark to run your Pandas training code as a UDF in the spark environment.

  • @sebastianarias9790
    @sebastianarias9790 7 months ago

    Hi Rob,
    What do you recommend me to do if I want to access a 30+ GB sqlite3 database table to access information to display on suppose, a web app or a jupyter notebook?

  • @Medina980
    @Medina980 1 year ago +2

    Your videos are so nice Rob, I really love them. Could you please share the dataset or indicate us where to find it? Thx

    • @radek_osmulski
      @radek_osmulski 1 year ago +1

      I second this, would love to play with the data myself!

    • @robmulla
      @robmulla  1 year ago +3

      Thanks guys. The dataset is on kaggle here: www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022
      Upvote if you like it!

    • @radek_osmulski
      @radek_osmulski 1 year ago

      @@robmulla Thanks a lot, appreciate it! 🙂🙂

  • @jonan.gueorguiev
    @jonan.gueorguiev 1 year ago +1

    That's a great comparison, and very relevant. What about 'dask'? Isn't it quite a mature replacement for Spark as well?

    • @robmulla
      @robmulla  1 year ago +1

      Great question, I actually cover dask in my "pandas alternatives" video; you should check it out.

  • @585ghz
    @585ghz 1 year ago +1

    Polars is so fast! Great video

    • @robmulla
      @robmulla  1 year ago

      It sure is! Appreciate the feedback.

  • @juan.o.p.
    @juan.o.p. 1 year ago +1

    I think polars could replace pandas in the future once it matures a bit and the community and support grows. Great video as usual! 👌

    • @robmulla
      @robmulla  1 year ago +1

      I think it's possible. However people are slow to adopt and speed isn't really the main issue for most people writing pandas code right now.

    • @mattizzle81
      @mattizzle81 9 months ago

      There's a bit of a chicken vs the egg problem there. Pandas is mature, tried and true. Polars can only mature if it is compelling enough to switch, but to be compelling, it needs the user base.

  • @yassinealaeeddin2229
    @yassinealaeeddin2229 1 year ago

    Hi Mulla, where can I download the flight file? Can you post the URL please?

  • @pierrefaraut8341
    @pierrefaraut8341 1 year ago +1

    How do you explain that spark is slower than polars? Theoretically it should be better, right?
    Maybe we would see better results with spark on larger datasets, but Polars aims to be good at that too.

    • @robmulla
      @robmulla  1 year ago +2

      Spark is useful when you can't fit the data in memory, but all the overhead makes it slower when running on medium-sized datasets. I try to mention that in the video. For the demo I just wanted to show how the syntax works, but if the data were HUGE I wouldn't have been able to even open it in pandas.

  • @nadavnesher8641
    @nadavnesher8641 1 year ago +1

    Love it

    • @robmulla
      @robmulla  1 year ago +1

      Thanks for watching Nadav!

  • @efrainsoto2719
    @efrainsoto2719 1 year ago +1

    Hi, I recently found your channel and it's amazing, the best thing I've found. I want to ask if you know of a game or a site where I can find data-cleaning exercises.

    • @robmulla
      @robmulla  1 year ago

      Glad you found my channel. Do you mean something like leetcode but for data science? I think there are a few out there but I've never used any of them.

  • @danielfm123
    @danielfm123 1 year ago +1

    R is an amazing tool for data pipelines. Its native object is a dataframe, and it has dplyr, which is fast and makes the code easy to read.

    • @robmulla
      @robmulla  1 year ago +1

      I agree, but haven't used R in a long time. How does it compare in terms of speed? I thought R was generally slow.

  • @Panucci75
    @Panucci75 1 year ago +1

    Lovely

  • @dataflex4440
    @dataflex4440 1 year ago

    Please create a video on GANs creating artificial images

  • @lekalotte2825
    @lekalotte2825 1 year ago

    Could you maybe do a similar video comparing polars with datatable? Thanks a lot!

  • @113swaruppatil5
    @113swaruppatil5 1 year ago

    Which machine learning project should I do for MAAANG companies?

  • @ashutoshtiwari4398
    @ashutoshtiwari4398 1 year ago +1

    Bro, please create a playlist on Polars from beginner to expert for faster processing.

  • @Mvobrito
    @Mvobrito 1 year ago +1

    Can spark be useful if I'm running on a single machine? (like my personal computer)
    Let's say my PC has 8gb of RAM and I need to work with a 20gb dataset. Can spark split the data somehow and make it work?

    • @robmulla
      @robmulla  1 year ago +1

      It should but I would instead 1) Try splitting the data manually and working on it in chunks with pandas or 2) Try polars streaming to see if it would work.
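Option 1 from this reply (manual chunking with pandas) might look like this sketch; the file and column names are invented, and the small CSV just stands in for a dataset too big for RAM:

```python
import tempfile

import pandas as pd

# Write a small CSV to stand in for a 20 GB dataset.
path = tempfile.mktemp(suffix=".csv")
pd.DataFrame({
    "airline": ["AA"] * 4 + ["DL"] * 4,
    "dep_delay": [1, 2, 3, 4, 10, 20, 30, 40],
}).to_csv(path, index=False)

# Stream the file a few rows at a time, keeping only partial sums,
# so only one chunk is ever in memory.
totals = None
for chunk in pd.read_csv(path, chunksize=3):
    part = chunk.groupby("airline")["dep_delay"].agg(["sum", "count"])
    totals = part if totals is None else totals.add(part, fill_value=0)

mean_delay = totals["sum"] / totals["count"]
print(mean_delay)
```

With a real 20 GB file you would raise `chunksize` to something like a few million rows and keep only the per-group partials between chunks.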

  • @gustavomezzovilla7248
    @gustavomezzovilla7248 1 year ago +1

    You should definitely cover Kedro pipeline!

    • @robmulla
      @robmulla  1 year ago +1

      Never heard of Kedro before but I'll give it a look for sure!

    • @gustavomezzovilla7248
      @gustavomezzovilla7248 1 year ago

      @@robmulla They have a demo on their website of a graphical pipeline for a full project (starting from the input data, the filters applied, the model created, and the analysis). It works so that project documentation is built alongside the development of the project. It is perfect for recurring projects that many people will look at independently, without you being there to explain how it works.

  • @Dmaster247
    @Dmaster247 1 year ago

    Can you do a tutorial for building a data pipeline using industry standard tools?

  • @macfrag574
    @macfrag574 1 year ago +1

    Where can we get datasets like the one you just showed in the video?

    • @robmulla
      @robmulla  1 year ago +1

      The airline dataset is on kaggle here: www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022

  • @carlo6195
    @carlo6195 11 months ago

    Hi there! Is it possible to request the file for practice purposes? Thank you!

  • @danielfischer4079
    @danielfischer4079 1 year ago +4

    3:40 couldn't you solve the memory issue by processing the file in chunks?

    • @Blaze098890
      @Blaze098890 1 year ago +1

      With a lot of operations it's not obvious how you do that. Let's say you want to sort a column but you can't load the dataset. Getting the sorted result of each chunk is not enough.
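Right, a sorted result per chunk isn't the final answer on its own; the classic fix is a k-way merge of the sorted chunks. A pure-Python sketch of that external-sort idea (the in-memory lists stand in for on-disk chunks):

```python
import heapq

# Each inner list stands in for an on-disk chunk that fits in memory.
chunks = [[5, 1, 9], [7, 3], [8, 2, 6]]

# Step 1: sort each chunk independently.
sorted_chunks = [sorted(c) for c in chunks]

# Step 2: heapq.merge streams the sorted chunks into one globally
# sorted output, holding only one element per chunk at a time.
fully_sorted = list(heapq.merge(*sorted_chunks))
print(fully_sorted)
```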

    • @robmulla
      @robmulla  1 year ago

      That is true, but also depends on the operation you are working with. Something like standard deviation requires the entire dataset to compute. Obviously if you are doing a groupby std you could chunk the data. Essentially that's what these libraries are attempting to do for you.
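To illustrate what "these libraries are attempting to do for you" can look like under the hood: even a global statistic like standard deviation can be accumulated chunk by chunk with running sums. A simplified (and numerically naive) sketch:

```python
import math

def chunked_std(chunks):
    """Population standard deviation from running count/sum/sum-of-squares."""
    n = s = sq = 0.0
    for chunk in chunks:          # each chunk could come off disk
        for x in chunk:
            n += 1
            s += x
            sq += x * x
    mean = s / n
    return math.sqrt(sq / n - mean * mean)

print(chunked_std([[1.0, 2.0], [3.0, 4.0]]))
```

Production engines use more numerically stable update formulas, but the one-pass-over-chunks structure is the same.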

  • @dataflex4440
    @dataflex4440 1 year ago +1

    Best channel by a Grandmaster

    • @robmulla
      @robmulla  1 year ago

      Thank you sir!

  • @sylarfx
    @sylarfx 1 year ago +1

    I would also add dask to the comparison

    • @robmulla
      @robmulla  1 year ago +1

      I compare dask on my pandas alternatives video!

  • @teejin
    @teejin 1 year ago +2

    Nice MKBHD shirt!

  • @soren-1184
    @soren-1184 1 year ago

    What about dask?

  • @markokafor7432
    @markokafor7432 1 year ago +1

    Polars is Rust-based, which explains the speed

    • @robmulla
      @robmulla  1 year ago

      Yep! I have a whole video on polars/rust you should check out.

  • @guocity
    @guocity 2 months ago

    Pandas works much better on unclean data.
    pyarrow causes a lot of headaches with data conversion errors:
    ArrowInvalid: Could not convert '230' with type str: tried to convert to double
    That makes many downstream steps unusable:
    to_parquet()
    converting pandas to polars
    opening a csv in Data Wrangler and
    saving it as parquet in Data Wrangler
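That ArrowInvalid error usually means a column that should be numeric arrived as strings. One hedged workaround (a sketch, not the only fix) is to coerce the column before converting or saving:

```python
import pandas as pd

# A "numeric" column that actually holds strings, like the '230' in the error.
df = pd.DataFrame({"speed": ["230", "275", "bad"]})

# Coerce to numbers; anything unparseable becomes NaN instead of raising.
df["speed"] = pd.to_numeric(df["speed"], errors="coerce")
print(df["speed"].tolist())
```

After this, `to_parquet()` or a polars conversion sees a float column and no longer trips over the mixed types.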

  • @Ant1-y
    @Ant1-y 1 year ago +1

    Thanks a lot, but Spark is a nightmare for me to install on my Windows PC

    • @robmulla
      @robmulla  1 year ago

      Oh man. I can't help you there. Why not install an Ubuntu dual boot?

    • @bbrother92
      @bbrother92 2 months ago

      @@robmulla How did you install it?

  • @harikrishnanb7273
    @harikrishnanb7273 1 year ago +1

    Have you ever tried ibis?

    • @robmulla
      @robmulla  1 year ago

      I have not. Others have mentioned it, and duckDB.

  • @jorge1869
    @jorge1869 1 year ago +1

    I use Dask instead of PySpark.

    • @robmulla
      @robmulla  1 year ago +2

      I've used dask in previous videos with poor performance on a single machine. But it is an option for distributed. Check out this video: th-cam.com/video/LEhMQhCv3Kg/w-d-xo.html

  • @JeetJhaveriD
    @JeetJhaveriD 1 year ago +3

    TLDR; Polars was the fastest and Pandas was the slowest

    • @robmulla
      @robmulla  1 year ago

      What about spark?

    • @JeetJhaveriD
      @JeetJhaveriD 1 year ago

      In the middle? That's what the video says, isn't it?

  • @tacorevenge87
    @tacorevenge87 1 year ago +1

    Koalas is good too

    • @robmulla
      @robmulla  1 year ago

      Whoa! First time I've heard of this but googled and it looks cool. Pandas API on spark... I need to check it out more.

    • @tacorevenge87
      @tacorevenge87 1 year ago

      @@robmulla It's really good. It runs on top of pyspark. Have you also tried dask?

  • @valueray
    @valueray 1 year ago +1

    You're using Python 3.8? Seriously, update to 3.11 and test again

    • @robmulla
      @robmulla  1 year ago

      Why?

    • @valueray
      @valueray 1 year ago +1

      @@robmulla Performance in 3.11 is really much better

  • @MissMagicAriel
    @MissMagicAriel 1 year ago

    Hello! How can I contact you directly via email or telegram for business inquiries?

  • @iProxySupport
    @iProxySupport 1 year ago

    Hello! How can I contact you directly via telegram or email?