Make Your Pandas Code Lightning Fast

  • Published on Oct 3, 2024

Comments • 328

  • @hasijasanskar
    @hasijasanskar 2 years ago +104

    Whoa.. 3500 times difference. Vectorised is even faster than apply, will give it a try next time for sure. Awesome video as always.

    • @robmulla
      @robmulla 2 years ago +5

      Thanks Sanskar. Yes, using vectorized functions is always much faster. In some cases it's not possible but then there are other ways to speed it up. I might show that in another video if this one is popular.

    • @amazingdude9042
      @amazingdude9042 7 months ago

      @@robmulla can you make a video on how to make pandas resample faster ?

  • @miaandgingerthememebunnyme3397
    @miaandgingerthememebunnyme3397 2 years ago +279

    That’s my husband! He’s so cool.

    • @robmulla
      @robmulla 2 years ago +35

      Love you boo. 😘

    • @FilippoGronchi
      @FilippoGronchi 2 years ago +6

      Fully agree!

    • @sauloncall
      @sauloncall 2 years ago +6

      Aww! This is wholesome!

    • @rahulchoudhary1024
      @rahulchoudhary1024 2 years ago

      I've been watching your videos since last one week non stop! And enjoy comments from your SO!!! Lovely!

    • @mohammedgt8102
      @mohammedgt8102 1 year ago

      He is awesome. Taking time out of his day to share knowledge 👏

  • @kip1272
    @kip1272 1 year ago +29

    Also, a way to speed it up is to not use & and | for 'and' and 'or' but just use the words 'and' and 'or'. These words are made for boolean expressions and thus work faster. & and | are bitwise operators and are made for integers. Using these will force Python to treat the booleans as integers, do the bitwise operation, and then cast the result back to a boolean. This doesn't take much time if you do it once, but in a test scenario inspired by this video it was roughly 45% slower.

    • @robmulla
      @robmulla 1 year ago +6

      Nice tip! I didn’t know that.

    • @kazmkazm9676
      @kazmkazm9676 1 year ago +2

      I made the experiment. It is ready to run. What you have suggested is coded below. It is approximately 20 percent faster.
      import timeit
      setup = 'import random; random_list = random.sample(range(1,101),100)'
      # with or
      first_code = '''\
      result_1 = [rand for rand in random_list if (rand > 75) or (rand ... 75) | (rand ...

    • @kip1272
      @kip1272 1 year ago

      @@kazmkazm9676 the difference was even bigger between & and 'and', if I remember correctly.

    • @A372575
      @A372575 1 year ago

      Great, never realized that. Will start using 'and' and 'or' from now on.
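
A hedged sketch of the point discussed in this thread. On plain Python scalars, `and`/`or` short-circuit while `&`/`|` always evaluate both sides, so the word forms can be cheaper in row-wise code. On whole pandas Series, only `&`/`|` work element-wise; `and`/`or` raise a `ValueError`. The thresholds 75 and 25 are assumptions chosen for illustration, because the second threshold in the truncated benchmark above cannot be recovered.

```python
import random
import timeit

import pandas as pd

random.seed(0)
random_list = random.sample(range(1, 101), 100)  # a shuffle of 1..100

# Scalars: both spellings select the same values.
with_or = [r for r in random_list if (r > 75) or (r < 25)]
with_pipe = [r for r in random_list if (r > 75) | (r < 25)]
same_result = with_or == with_pipe

# Time both spellings; the numbers vary by machine and Python version.
t_or = timeit.timeit(
    lambda: [r for r in random_list if (r > 75) or (r < 25)], number=2_000
)
t_pipe = timeit.timeit(
    lambda: [r for r in random_list if (r > 75) | (r < 25)], number=2_000
)
print(f"or: {t_or:.4f}s   |: {t_pipe:.4f}s")

# Series: element-wise logic needs the bitwise operators.
s = pd.Series(random_list)
mask = (s > 75) | (s < 25)

try:
    (s > 75) or (s < 25)  # asks for the truth value of a whole Series
    series_or_works = True
except ValueError:
    series_or_works = False
```

So the tip applies inside row-wise functions (for example a function passed to `apply`), not to the vectorized boolean masks shown in the video.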

  • @nathanielbonini8951
    @nathanielbonini8951 1 year ago +2

    This is spot on. I had a filter running that was going to take 2 days to complete on a 12M line CSV file using iteration - clearly not good. Now it takes 6 seconds.

  • @Zenoandturtle
    @Zenoandturtle 11 months ago +4

    That is unbelievable. Astounding time difference. I was recently watching a presentation on a candlestick algorithm, and the presenter used a vectorised method and I was confused (I am new to Python), but this video made it all clear. Fantastic presentation.

    • @robmulla
      @robmulla 11 months ago

      Glad you found it interesting. Thanks for watching!

  • @jti107
    @jti107 2 years ago +90

    I didn’t realize you could write 10k as 10_000. I work with astronomical units so makes variables more readable. Great video!

    • @robmulla
      @robmulla 2 years ago +11

      Thanks! Yes, underscores in numeric literals were introduced in Python 3.6, and they really help make numbers more readable.

    • @kailashlate6348
      @kailashlate6348 7 months ago

      😊😊😊

    • @kailashlate6348
      @kailashlate6348 7 months ago

      😊
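
A minimal illustration of the underscore syntax mentioned in this thread (Python 3.6 and later). The variable names are made up for the example.

```python
# Underscores in numeric literals are ignored by the interpreter.
ten_thousand = 10_000
print(ten_thousand == 10000)

# The same separator is accepted when parsing strings...
parsed = int("10_000")

# ...and can be requested when formatting integers.
formatted = f"{ten_thousand:_}"
print(formatted)
```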

  • @deepakramani05
    @deepakramani05 2 years ago +47

    As I work with Pandas and large datasets, I often come across code that uses iterrows. Most developers just don't care about time or come from various programming backgrounds that prohibit them from using efficient methods. I wish more people used vectorization.

    • @robmulla
      @robmulla 2 years ago +5

      Thanks. That’s exactly why I wanted to make this video. Hopefully people will find it helpful.

    • @pr0skis
      @pr0skis 2 years ago +5

      Some of the biggest bottlenecks are from IO... especially when trying to read then concat multiple large Excel files. Shaving a few seconds in the algos just isnt gonna make much of a difference

    • @allenklingsporn6993
      @allenklingsporn6993 1 year ago +5

      @@pr0skis Hard to say that definitively, though, right? You have no idea how anyone is using pandas. If they have slow algos running iteratively, it can very easily become much slower than I/O functions. I've seen some pretty wild pandas use in my business, and a lot of it is really terrible at runtime, especially anything that is wrapped in a GUI (sometimes even with multiprocessing...).

    • @nitinkumar29
      @nitinkumar29 1 year ago +2

      @@pr0skis you can convert the Excel files to CSV and then use the CSV files, because CSV I/O is faster.

    • @jaimeduncan6167
      @jaimeduncan6167 1 year ago +5

      It's the same with a relational database, we call them the cursor kids. They loop and loop and loop when they can use a set operation to go hundreds of times faster and often with less code.
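
For readers who have not seen the three approaches side by side, here is a small sketch. The column names and the reward condition are taken from the snippet quoted in a comment further down; `get_data` here is a hypothetical stand-in, since the video's own generator is not reproduced in this thread, and the food names are invented.

```python
import numpy as np
import pandas as pd


def get_data(size=10_000, seed=0):
    """Hypothetical stand-in for the video's data generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.integers(0, 100, size),
        "time_in_bed": rng.integers(0, 9, size),
        "pct_sleeping": rng.random(size),
        "favorite_food": rng.choice(["pizza", "taco", "ice-cream"], size),
        "hate_food": rng.choice(["broccoli", "candy corn", "eggs"], size),
    })


def reward_calc(row):
    """Row-wise rule: plain Python `and`/`or` on scalar values."""
    if (row["pct_sleeping"] > 0.5 and row["time_in_bed"] > 5) or row["age"] > 90:
        return row["favorite_food"]
    return row["hate_food"]


df = get_data(1_000)

# Level 1: explicit loop over rows with iterrows (slowest).
loop_result = pd.Series(
    [reward_calc(row) for _, row in df.iterrows()], index=df.index
)

# Level 2: apply the same function along axis=1.
apply_result = df.apply(reward_calc, axis=1)

# Level 3: vectorized boolean mask over whole columns.
mask = ((df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5)) | (df["age"] > 90)
df["reward"] = df["hate_food"]
df.loc[mask, "reward"] = df["favorite_food"]
```

All three produce the same column; only the amount of Python-level work per row differs.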

  • @nirbhay_raghav
    @nirbhay_raghav 1 year ago +3

    My man made a df out of the time diff to plot them!! Really useful video. Will definitely keep this in mind from now.

    • @robmulla
      @robmulla 1 year ago

      Haha. Thanks Nirbhay!

  • @i-Mik
    @i-Mik 1 year ago +1

    Thanks for the great video! I have a project with some calculations. They take some minutes through the loops. I'm going to use the vectorized way, so I'll write another comment with a comparison later. Some days later... I rewrote a significant part of my code. Made it vectorized, and I got fantastic results. One example: old code - 1m 3s, new code - 6s. One more: old code - 14m 58s, new code - 11s. Awesome!

    • @robmulla
      @robmulla 1 year ago

      So awesome! It's really satisfying when you are able to improve the speed of code by orders of magnitude.

  • @robertjordan114
    @robertjordan114 2 years ago +2

    Man where have you been all my Python-Life!?!? Thank you so much for this! Outstanding!!!

    • @robmulla
      @robmulla 2 years ago +1

      Thanks Robert for watching. Glad you found it helpful!

    • @robertjordan114
      @robertjordan114 2 years ago +1

      The problem I'm dealing with is that I am looping through some poorly designed tables and building a SQL statement to be applied and then appending the output to a list. Not sure if a vectorized approach will work since I have that SQL call, but the apply might save me from needing to recreate the df prior to appending every time.

    • @robmulla
      @robmulla 2 years ago +1

      @@robertjordan114 Interesting. Not sure what your data is like, but a lot of the time it can be better to write a nice SQL statement that puts the data in the correct format first. That way you put the processing demands on the SQL server, and it can usually optimize really well.

    • @robertjordan114
      @robertjordan114 2 years ago +1

      Oh you have no idea, my source table has one column with the name of the column in my lookup table and another with the value that I need to filter on in that lookup table. The loop creates the where clause based on the number of related rows in the initial dataset, and then I'm executing that SQL statement to return the values to a Python data frame which I then convert to a pandas data frame and append. Like I said, amateur hour! 🤣

  • @OPPACHblu_channel
    @OPPACHblu_channel 1 year ago +1

    Somehow I met the vectorized method first, right at the beginning of my Python and pandas journey. Thanks for sharing your experience, lightning fast

    • @robmulla
      @robmulla 1 year ago

      It’s a great thing to learn early!

  • @FilippoGronchi
    @FilippoGronchi 2 years ago +8

    That's another awesome video....extremely useful in the real world work. Thanks again Rob

    • @robmulla
      @robmulla 2 years ago

      Thanks for watching Filippo!

  • @OktatOnline
    @OktatOnline 1 year ago +2

    I'm over here as a newbie data scientist, copying the logic step-by-step in order to have good coding habits in the future lmao. Thanks for the video, really valuable!

    • @robmulla
      @robmulla 1 year ago

      Glad you found it helpful!

  • @LimitlesslyUnlimited
    @LimitlesslyUnlimited 2 years ago +3

    Haha coincidentally I'd been raving about vectorized to my friends the last few months. It's soo good. The moment I saw your title I figured you're probably talking about vectorize too haha. Awesome video and great content!!

    • @robmulla
      @robmulla 2 years ago +1

      You called it! Thanks for the positive feedback. Hope to create more videos like it soon.

    • @robertnolte519
      @robertnolte519 1 year ago

      Same! Still hasn't worked on picking up chicks at the bar, but I'm not giving up.

  • @alexandremachado1014
    @alexandremachado1014 2 years ago +3

    Hey man, nice video! Kudos from reddit!

    • @robmulla
      @robmulla 2 years ago +1

      Glad you enjoyed it. So cool that the reddit community liked this video so much. Hopefully my next one will be as popular.

  • @colmduffy2272
    @colmduffy2272 1 year ago +2

    There are several videos on pandas vectorization. This is the best.

    • @robmulla
      @robmulla 1 year ago +1

      I appreciate you saying that! Thanks for watching.

  • @gabriel-mckee
    @gabriel-mckee 1 year ago +7

    Great video! I wish I had known not to loop over my array for my machine learning project... going to go improve my code now!

    • @robmulla
      @robmulla 1 year ago +1

      Glad you learned something new!

  • @anoopbhagat13
    @anoopbhagat13 2 years ago +5

    Wow! That's an excellent way to speed up the code.

    • @robmulla
      @robmulla 2 years ago +1

      Thanks Anoop. Hope your future pandas code is a bit faster because of this video :D

  • @hussamcheema
    @hussamcheema 2 years ago +4

    Wow amazing. Please keep making more videos like this.

    • @robmulla
      @robmulla 2 years ago +1

      Thanks for the feedback. I’ll try my best.

  • @craftydoeseverything9718
    @craftydoeseverything9718 10 months ago

    Hey, I just thought I'd mention, I really appreciate that you use really huge test datasets, since a lot of the time, test datasets used in tutorials are quite small and don't show how code will scale. This video does it perfectly, though!

  • @artemqqq7153
    @artemqqq7153 1 year ago +1

    Dude, that row[column] thing was a shock to me, thanks!

    • @robmulla
      @robmulla 1 year ago

      Glad you learned something!

  • @ajaybalakrishnan5208
    @ajaybalakrishnan5208 1 year ago +1

    Awesome. Thanks Rob for introducing this concept to me.

    • @robmulla
      @robmulla 1 year ago

      Happy it helped!

  • @LaHoraMaker
    @LaHoraMaker 1 year ago +1

    I loved that you used Madrid Python user group for the pandas logo :)

    • @robmulla
      @robmulla 1 year ago

      I did?! I didn't even realize. What's the timestamp where I show that logo?

  • @sphericalintegration
    @sphericalintegration 1 year ago +1

    Thank you for this, Rob. This video made me subscribe because in 10 minutes you solved one of my biggest problems.
    And your Boo is right - you are pretty cool. Thanks again, sir.

    • @robmulla
      @robmulla 1 year ago

      That's awesome that I was able to help you out. Check my other videos where I go over similar tips! Glad you agree with my Boo

  • @Vonbucko
    @Vonbucko 1 year ago +2

    Awesome video man! Appreciate the tips, I'll definitely be subscribing!

    • @robmulla
      @robmulla 1 year ago

      I appreciate that a ton. Share with a friend too!

  • @MrValleMilton
    @MrValleMilton 5 months ago

    Thank you very much for this video Rob. It is very helpful for beginners like me. Have a great day.

  • @balajikrishnamoorthy5464
    @balajikrishnamoorthy5464 1 year ago +1

    I am a beginner and admire your sound knowledge of Pandas

    • @robmulla
      @robmulla 1 year ago

      Thanks for watching. Hope you learned some helpful stuff.

  • @thebreath6159
    @thebreath6159 1 year ago +1

    Ok this channel is great for data science, I’ll follow

    • @robmulla
      @robmulla 1 year ago

      Thanks for subbing!

  • @pietraderdetective8953
    @pietraderdetective8953 2 years ago +8

    I have always struggled to understand how vectorization works... this video of yours is the one that made it crystal clear for me.
    What a great video!
    Can you please do more of these efficient pandas videos and use some stock market data? Thanks!

    • @robmulla
      @robmulla 2 years ago +1

      Thanks for the feedback. I’m so happy you found this useful. I’ll try my best to do a future video related to stock market data.

  • @prodmanaiml9317
    @prodmanaiml9317 2 years ago +8

    More video tips for pandas would be excellent!

    • @robmulla
      @robmulla 2 years ago

      Great suggestion. I'll try to keep the pandas videos coming.

  • @blogmaster7920
    @blogmaster7920 1 year ago +1

    This can be really helpful, when moving data from one source to another through Internet.

    • @robmulla
      @robmulla 1 year ago

      Absolutely, compressing can make any data transfer faster.

  • @kingj5983
    @kingj5983 4 months ago

    Wow, awesome video, thanks! Although it takes time to figure out how to turn my limit conditions into logical calculation and return a bool dataframe

  • @GregZoppos
    @GregZoppos 1 year ago +1

    Wow, thanks! I'm a beginner in data science, this is really interesting to me.

    • @robmulla
      @robmulla 1 year ago

      Great to hear! Good luck in your data science journey.

  • @RichieStockholm
    @RichieStockholm 2 years ago +2

    I expect a video about moped gangs in the future, Rob.

    • @robmulla
      @robmulla 2 years ago

      That’s a great idea Richie! I practically majored in moped gangs in college. 😂

  • @bgotura
    @bgotura 11 months ago

    I love how that Pandas logo has cannibalized the city of Madrid (Spain) logo

  • @alysmtech3683
    @alysmtech3683 1 year ago

    Jesus, I'm over here blowing up my laptop. Had no idea, thank you!

    • @robmulla
      @robmulla 1 year ago +1

      Hah. My name is Rob. But glad you learned something new.

  • @mic9657
    @mic9657 1 year ago +1

    great tips! and very well presented

    • @robmulla
      @robmulla 1 year ago

      Glad you like it. Thanks for watching.

  • @spicytuna08
    @spicytuna08 2 years ago +2

    oh my!!! awesome. thanks!!!

    • @robmulla
      @robmulla 2 years ago

      Thanks 🙏

  • @ledestonilo7274
    @ledestonilo7274 1 year ago +1

    Interesting. Thank you will try it.

    • @robmulla
      @robmulla 1 year ago

      Awesome! Let me know how it goes.

  • @djangoworldwide7925
    @djangoworldwide7925 8 months ago

    As an R user we use vectorization using mutate without even thinking about the other methods for such task. R is so much more suitable for data science and wrangling

  • @ersineser7610
    @ersineser7610 2 years ago +3

    Thank you very much for great video.

    • @robmulla
      @robmulla 2 years ago

      Glad you liked it! Thanks for the feedback.

  • @bm647
    @bm647 1 year ago +1

    Great video! very useful

    • @robmulla
      @robmulla 1 year ago

      Glad you found it useful!

  • @FabioRBelotto
    @FabioRBelotto 1 year ago +2

    Great video. I am working on a df with millions of rows and pandas apply was struggling. I solved it using a vectorized solution as shown here. Much, much better.
    Could you imagine a situation where vectorization would be not possible?

    • @robmulla
      @robmulla 1 year ago +1

      Glad this helped! As for examples where vectorization is not possible:
      For example, if each row's result depends on the result already computed for the previous row (branching on a running state), vectorization may not be possible. In this case, you would need to use a loop or some other non-vectorized approach to perform the operation. Simple per-row branching, such as selecting different values based on some condition, can still be vectorized with a boolean mask or np.where.
      Another example where vectorization may not be possible is when working with datasets that have varying lengths or shapes. In this case, it may not be possible to perform operations on the entire dataset using vectorized methods.
      Hope that helps.

  • @FF-ct5dr
    @FF-ct5dr 1 year ago +2

    The Pandas docs literally tell you that iterrows is slow and should be avoided lol. As for vectorization, Pandas uses numpy arrays (slightly tweaked so they can hold different types), which are stored in contiguous memory blocks... So of course vectorization will be faster than apply/map.

    • @robmulla
      @robmulla 1 year ago

      Yep. This is obvious to a seasoned veteran, but as I mentioned in the video, for many newbies who haven't read the docs and aren't fully aware of the backend, they don't know that iterrows is a bad idea.

    • @Geza_Molnar_
      @Geza_Molnar_ 1 year ago

      @@robmulla Maybe, when you have time for that, you could publish a video that describes to newbies what "RTFM" means, and what is the benefit of that. You are popular, a role model for some 🙂
      (in this case "M" -> docs)

  • @SillyLittleMe
    @SillyLittleMe 1 year ago +2

    Hey, this is a great video and truly shows the benefit of vectorisation.
    I would like to point out that always remembering the vectorized way of writing is hard. Fortunately, the NumPy module provides a neat function called "vectorize" that vectorizes your non-vectorized function.
    An example (from the docs):
    import numpy as np
    ## this is the function
    def myfunc(a, b):
        "Return a-b if a>b, otherwise return a+b"
        if a > b:
            return a - b
        else:
            return a + b
    ## vectorising the function and then applying it
    vfunc = np.vectorize(myfunc)
    vfunc([1, 2, 3, 4], 2)
    # array([3, 4, 1, 2])
    This works on DataFrames as well.
    Do note, though, that this is not true vectorisation; because of that, in some cases it performs similarly to functions like "apply". However, for the most part it does a tremendous job and has significantly increased the speed of my functions.
    The reasons for why it is not "true vectorisation" are mentioned in this thread : stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c

    • @robmulla
      @robmulla 1 year ago +1

      Thanks! I've used that before and it does come in handy. Also using things like jit/numba can compile numpy operations.
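
To make the caveat in this thread concrete, here is a small sketch that applies `np.vectorize` to DataFrame columns and compares it with an array-level `np.where`. The NumPy documentation describes `np.vectorize` as a convenience that is essentially a for loop, so the `np.where` form is the one that avoids per-element Python calls. The DataFrame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd


def myfunc(a, b):
    """Return a-b if a>b, otherwise return a+b."""
    return a - b if a > b else a + b


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 2, 2, 2]})

# np.vectorize: convenient wrapper, still calls myfunc once per element.
df["via_vectorize"] = np.vectorize(myfunc)(df["a"], df["b"])

# np.where: the same branching expressed as whole-array operations.
df["via_where"] = np.where(df["a"] > df["b"], df["a"] - df["b"], df["a"] + df["b"])

print(df)
```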

  • @sweealamak628
    @sweealamak628 1 year ago +1

    I'm kicking myself now for not finding your video 10 months ago. I'm near the completion of my code and resorted to a mix of iterating For loops and small scale vectorisation by declaring new columns after applying some logic. I seriously need to adopt your methods and redo my code because mine is just not fast enough!

    • @robmulla
      @robmulla 1 year ago

      I totally feel you. It took me years before I really understood how important it is to avoid iterating over rows. Once you learn it, all your pandas code will be much faster though.

    • @sweealamak628
      @sweealamak628 1 year ago

      @@robmulla I just altered one of my for loops and used your vectorized approach! Not only is it faster, I did it in just 3 lines of code and the syntax is much easier to read! I feel so embarrassed for myself because it's much more straightforward than I thought!
      Now the tricky thing is, I work on a time series dataset where I compare previous rows of data to the current row to get the "result". I assume I can use the "shift" method to look back at a previous row of data. If it works, I'm gonna Vectorize everything! THANKS SO MUCH!

  • @kevincannon2269
    @kevincannon2269 1 year ago +1

    i _am_ excited! show the solution in machine code next pls thx

    • @robmulla
      @robmulla 1 year ago

      Working on it…

  • @incremental_failure
    @incremental_failure 1 year ago

    Vectorization is the whole point of Pandas. But there are cases where vectorization is impossible and you need to process row-by-row, in that case it's best to switch to numba for a precompiled function.

  • @abdulkadirguven1173
    @abdulkadirguven1173 1 year ago

    Thank you very much Rob.

  • @Sinke_100
    @Sinke_100 1 year ago +1

    Cool, for really large dataset and when conditions aren't too complicated that vectorized method is amazing, apply is nice alternative cause you can write function, there should be a module that converts normal functions in this vectorized syntax cause it's quite complicated to write

    • @robmulla
      @robmulla 1 year ago

      Glad it was helpful! There are some packages that compile functions (such as numba with its jit decorator), and there is also np.vectorize.

    • @Sinke_100
      @Sinke_100 1 year ago

      @@robmulla I tried playing with it a bit. Pandas is similar to numpy, and I have worked with numpy quite a bit. I tried putting a function bool_calculation with 3 distinct dfs for the age condition, pct_sleeping and time in bed, and the final return value was the final condition. df.loc supports putting a function directly in its statement, so I did that. Finally I compared the dfs created with both methods, and they are the same.
      My suggestion is that you should explain the more complex stuff in more depth.

  • @dreamdeckup
    @dreamdeckup 2 years ago +6

    I had to do the same thing in my first internship lol. The script went from 4 hours to like 10 minutes to run

    • @robmulla
      @robmulla 2 years ago

      Yea, when I learned this it 100% changed the way I write pandas code.

  • @BILALAHMAD-cz9gu
    @BILALAHMAD-cz9gu 1 year ago +1

    this man is amazing but i'm poor with english ...... but i will learn english definetly bcz of this man

    • @robmulla
      @robmulla 1 year ago

      Thanks. So glad it helped even though it’s not your native tongue!

  • @YuanYuan-uk8sz
    @YuanYuan-uk8sz 1 year ago

    thank you your very extremely perfect video,so so helpful for me,love you so much

    • @robmulla
      @robmulla 1 year ago

      I'm so glad! Share it with a friend or two who you think might also appreciate it.

  • @kennethstephani692
    @kennethstephani692 1 year ago +2

    Great video!!!

    • @robmulla
      @robmulla 1 year ago

      Thank you!!

  • @Graham_Wideman
    @Graham_Wideman 1 year ago +1

    1:19 "a random integer between one and 100." I believe that should be from 0 to 99 (ie: inclusive at both ends). In case nobody else mentioned it.

    • @robmulla
      @robmulla 1 year ago

      Good catch! I think you are the first to point that out.
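
A quick check of the bounds in question, assuming the data was generated with something like `np.random.randint(0, 100, size)` (an assumption; the exact call is not shown in this thread). The upper bound of `randint` is exclusive, so the values run from 0 to 99. For an inclusive range of 1 to 100, the newer `Generator.integers` accepts `endpoint=True`.

```python
import numpy as np

np.random.seed(0)

# Legacy API: low is inclusive, high is exclusive.
ages = np.random.randint(0, 100, size=100_000)
print(ages.min(), ages.max())

# Generator API: endpoint=True makes the upper bound inclusive.
rng = np.random.default_rng(0)
inclusive = rng.integers(1, 100, size=100_000, endpoint=True)
print(inclusive.min(), inclusive.max())
```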

  • @rahulkmail
    @rahulkmail 3 months ago

    Thanks for sharing a nice information

  • @moodiiie
    @moodiiie 1 year ago +1

    That’s all I do at work, vectorize is the way to go. I was able to do some complex logic with them.

  • @beastmaroc7585
    @beastmaroc7585 1 year ago +1

    Thank you so much for these game-changing tips ....

    • @robmulla
      @robmulla 1 year ago

      Thanks for watching!

  • @kanishkpareek6650
    @kanishkpareek6650 1 year ago

    your teaching style is awesome. where can i find your videos in a structured manner??

  • @ivanrubnenkov919
    @ivanrubnenkov919 1 month ago

    now instead of two actions + .loc in the third example use np.where for oneliner and it will be even faster

  • @MrJak3d
    @MrJak3d 2 years ago +2

    Damn, I knew lvl 2 but lvl 3 was awesome!

    • @robmulla
      @robmulla 2 years ago +1

      Thanks Jake! Yea, vectorized functions are super fast. If you can't vectorize then there are other ways to make it faster (like chunking and multiprocessing)... I might make a video about that next!

  • @agoodwin-8127
    @agoodwin-8127 2 years ago +6

    I typically use numpy where in this situation (mainly because I like the syntax better!), so I was curious about the speed vs. the level 3 solution. Where ran a little faster (~15-20% for datasets sized 10K - 50M records).
    %%timeit
    # level 3 - vectorized
    df = get_data(10_000)
    df['reward'] = df['hate_food']
    df.loc[((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5))
           | (df['age'] > 90), 'reward'] = df['favorite_food']
    # 3.74 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    %%timeit
    # level 4 - where
    df = get_data(10_000)
    df['reward'] = np.where(((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5))
                            | (df['age'] > 90),
                            df['favorite_food'],
                            df['hate_food'])
    # 3.15 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    • @robmulla
      @robmulla 2 years ago +1

      I love that you ran that experiment! I'm actually surprised the numpy version isn't even faster. Thanks for sharing.

    • @georgebrandon7696
      @georgebrandon7696 1 year ago +1

      @@robmulla I'll throw another wrench into this. df.at[] vs df.loc[]. df.at[] is considerably faster than df.loc[]. But I've never run conditionals with df.at[]. I'm also an np.where() user. :)

  • @cbritton27
    @cbritton27 2 years ago +18

    I had a similar situation creating a new column based on conditions. My data set has 520,000 records so the apply was very slow. I got good results with using the select function from numpy. I'm curious how that would compare to the vectorization in your case.
    Edit: in my case, the numpy select is slightly faster than the vectorization.

    • @robmulla
      @robmulla 2 years ago +2

      Thanks for sharing. It would be cool to see an example code snippet similar to what I used in this video for comparison.

    • @linkernick5379
      @linkernick5379 1 year ago

      Polars lib is quite fast with my 1 million dataset, I recommend to try.
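
Since an example snippet was requested in this thread, here is a rough sketch of what a `np.select` version could look like when there are more than two outcomes. The column names follow the video's dataset as quoted elsewhere in the comments; the three labels and the sample values are invented for illustration. `np.select` checks the conditions in order and uses the first one that matches.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 95, 40, 70],
    "pct_sleeping": [0.6, 0.2, 0.4, 0.9],
    "time_in_bed": [6, 3, 8, 7],
})

# Conditions are evaluated in order; the first True one wins for each row.
conditions = [
    df["age"] > 90,
    (df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5),
]
choices = ["senior_reward", "good_sleeper_reward"]  # hypothetical labels

df["tier"] = np.select(conditions, choices, default="no_reward")
print(df)
```

This avoids nesting several `np.where` calls when a column can take three or more values.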

  • @alexisdebrand6209
    @alexisdebrand6209 1 year ago +1

    so usefull thank you !!!!!

    • @robmulla
      @robmulla 1 year ago

      You're welcome! Thanks for commenting.

  • @andrew3068
    @andrew3068 1 year ago +1

    Super awesome video.

    • @robmulla
      @robmulla 1 year ago

      I appreciate that. Thanks for commenting!

  • @chndrl5649
    @chndrl5649 1 year ago +1

    Could also use query instead of loc

    • @robmulla
      @robmulla 1 year ago

      Not sure that would work for this case because we aren’t straight filtering.

  • @ehsankiani542
    @ehsankiani542 1 year ago +1

    Thanks Rob

    • @robmulla
      @robmulla 1 year ago +1

      Thanks for watching!

  • @PeterSeres
    @PeterSeres 2 years ago +12

    Nice video! Thanks for detailed explanation. My only problem with this is that I often have to apply functions that depend on sequential time data and a loop setup makes the most sense since the next time step depends on the previous time steps.
    Are there some advanced methods on how to set up more complex vectorized functions that don't fit into a one-liner expression?

    • @robmulla
      @robmulla 2 years ago +5

      Yes there are! I think I'll probably make a few more videos on the topic considering how interested people seem in this. But I'd suggest if you can do any of your processing that goes across rows in groups - first do a `groupby()` and then you can multiprocess the processing of each group on a different CPU thread. If you have 8 or 16 CPU threads you can speed things up a lot!

    • @DrewLevitt
      @DrewLevitt 2 years ago +2

      Pandas has a lot of useful time series methods, but without knowing exactly what you're trying to do, it'd be hard to suggest any specific functions. But if you only need to refer to step (n-1) when processing step n, you can use df.shift() to store step n-1 IN the row for step n. Hope this helps!

  • @danielbrett247
    @danielbrett247 1 year ago +1

    Not everything can be vectorized, commonly when processing time series data. For these, a great tool to know about is numba's njit decorator.

    • @robmulla
      @robmulla 1 year ago +1

      Agreed. Njit / numba can be great when you need to JIT-compile Python code.
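
A minimal sketch of the numba suggestion, under a few assumptions: numba is an optional third-party package, the smoothing recurrence is only a stand-in for "each step depends on the previous result", and the function and variable names are invented. If numba is not installed, the fallback decorator lets the same loop run uncompiled.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs (uncompiled) without numba."""
        return func


@njit
def running_smooth(values, alpha):
    """Exponential smoothing: each output depends on the previous output."""
    out = np.empty(values.shape[0])
    out[0] = values[0]
    for i in range(1, values.shape[0]):
        out[i] = alpha * values[i] + (1.0 - alpha) * out[i - 1]
    return out


prices = np.array([10.0, 12.0, 11.0, 13.0])
smoothed = running_smooth(prices, 0.5)
print(smoothed)
```

With a DataFrame, one would typically pass `df["col"].to_numpy()` into such a function and assign the returned array back to a new column.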

  • @Levy957
    @Levy957 2 years ago +2

    Apply is the way it works for me, but good to know vectorized functions exist

    • @robmulla
      @robmulla 2 years ago

      Thanks Levy. Yes, when working with really big datasets vectorized can save a lot of time.

    • @georgebrandon7696
      @georgebrandon7696 1 year ago

      Most people won't use datasets with millions of rows. For those that don't, apply will work well. But if one is doing some sort of say, timeseries forecasting, one wouldn't get away with a mere few thousand rows. Not enough training data. Enter in np.where(). Easier to write and understand over vectorized. If I am not mistaken, np.where() is already vectorized in the background for you.

  • @rockwellshabani5180
    @rockwellshabani5180 2 years ago +4

    Would vectorization also be faster than an np.where statement with multiple conditions?

    • @robmulla
      @robmulla 2 years ago

      Great question! I think someone tested it out in the reddit thread where I posted it and found maybe a slight speed increase over the vectorized version.

    • @georgebrandon7696
      @georgebrandon7696 1 year ago +1

      np.where() is what I use almost exclusively. However, it tends to be a little unreadable if you need to use additional if statements to go from binary (either or) to 3 or more possible values. Of course, one could also nest np.where() statements too. :)

  • @justsayin...1158
    @justsayin...1158 1 year ago

    It's a great tip, but I don't feel like, I understood, what vectorized means, or how I make a function vectorized. Is it just creating the boolean array by applying the conditions to the whole data frame in this way, or are there other ways to vectorize as well?

  • @diegoalmeida2221
    @diegoalmeida2221 1 year ago +2

    Nice video, though in some cases we want to use a specific complex function from a library. The apply method works fine for that case. But is there a way to use it with vectorization?

    • @robmulla
      @robmulla 1 year ago +1

      You can try to vectorize using something like numba. But it depends on the complexity of the function.

  • @ruthwik081
    @ruthwik081 1 year ago

    It can apparently run even faster if you chain your vectorizations

  • @GroWithUmar
    @GroWithUmar 2 years ago +2

    Amazing video

    • @robmulla
      @robmulla 2 years ago

      Thanks for the feedback!

  • @onganxiety8719
    @onganxiety8719 1 month ago

    This is the basic way all the code should be written for pandas, but if you really need it to scale for high memory consumption and number of calculations use njit processing for numerics or even easier just use Polars for everything .

  • @Chuukwudi
    @Chuukwudi 1 year ago +2

    The apply was faster than the vectorised version.
    While computing the time it took for "apply", you added the time it took to get data. When it was time to use the vectorised version, you removed the time it took to get data.
    Why did you do that? Your experiment is biased!
    If I'm really concerned about speed, I'll use np.where() instead.

    • @robmulla
      @robmulla 1 year ago

      Uh oh. I didn't do anything like that intentionally, but vectorized functions are definitely faster than using apply. That's why they exist!
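
One way to settle a timing question like this is to build the data once, outside the timed region, and time only the operation under test. The sketch below does that with `timeit`; the DataFrame, the simplified age-only condition, and the function names are assumptions for illustration, not the video's benchmark.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Built once, so data generation is not part of either measurement.
base = pd.DataFrame({
    "age": rng.integers(0, 100, n),
    "favorite_food": rng.choice(["pizza", "taco"], n),
    "hate_food": rng.choice(["broccoli", "eggs"], n),
})


def with_apply(df):
    """Row-wise version using apply."""
    return df.apply(
        lambda row: row["favorite_food"] if row["age"] > 90 else row["hate_food"],
        axis=1,
    )


def with_where(df):
    """Array-level version using np.where."""
    values = np.where(df["age"] > 90, df["favorite_food"], df["hate_food"])
    return pd.Series(values, index=df.index)


t_apply = timeit.timeit(lambda: with_apply(base), number=3)
t_where = timeit.timeit(lambda: with_where(base), number=3)
print(f"apply: {t_apply:.4f}s   np.where: {t_where:.4f}s")
```

The printed numbers depend on the machine, so no expected values are given here.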

  • @adamleon8504
    @adamleon8504 1 year ago

    in these cases it is easy to vectorize but how can you vectorize when the process or the function that needs the df as input is more complex? For example can you vectorize a procedure that uses specific rows and not one column based on a condition and then use these elements to perform calculations with step and not on the same row for example df.loc[i,"A"] - df.loc[i-1,"B"]?
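
For the specific pattern asked about here, `df.loc[i, "A"] - df.loc[i-1, "B"]`, one possible vectorized form uses `shift`, which several other commenters also mention. This is a sketch on a toy DataFrame with a default integer index; the values are invented.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]})

# Loop version: current row of A minus previous row of B.
loop_diff = [df.loc[i, "A"] - df.loc[i - 1, "B"] for i in range(1, len(df))]

# Vectorized version: shift(1) moves B down one row, so row i sees B[i-1].
df["diff"] = df["A"] - df["B"].shift(1)
print(df)
```

The first row has no previous value, so it becomes NaN in the vectorized result; it can be filled or dropped depending on the use case.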

  • @lucianodomingues2290
    @lucianodomingues2290 ปีที่แล้ว +1

    Very useful!!! Thanks for sharing.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thanks for watching Luciano!

  • @elgoogffokcuf
    @elgoogffokcuf ปีที่แล้ว +1

    What about Numba? If it can bring some more optimization, it would be nice if you made a video on it.

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      Numba/jit is great for speeding up more complex operations. I've had limited experience with it, but in every case it really sped things up. Doing it as a video is a good idea.

  • @vinitjha_
    @vinitjha_ 10 หลายเดือนก่อน

    Which font do you use? That's an awesome font and color scheme.

  • @demosthenessss7850
    @demosthenessss7850 ปีที่แล้ว +1

    代码写得好顺滑啊,佩服啊!

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thanks for your comment. Translation I think is "The code is written so smoothly, I admire it!" I'm glad you liked it!

    • @demosthenessss7850
      @demosthenessss7850 ปีที่แล้ว

      @@robmulla Yes, translation is correct. I commented in Chinese because I want to have more Chinese voices here. 谢谢你的分享,我会继续关注:) [Thank you for sharing, I will keep following :)]

  • @mbcebrix
    @mbcebrix ปีที่แล้ว +1

    Is vectorization applicable to huge datasets? Millions of rows, for example?

    • @robmulla
      @robmulla  ปีที่แล้ว

      If it can fit in your computer’s memory then yes!
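
A quick way to see how much memory a frame takes is `DataFrame.memory_usage`. The sketch below uses a made-up one-million-row frame; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one million rows.
n = 1_000_000
df = pd.DataFrame({"age": np.arange(n) % 100, "score": np.zeros(n)})

# memory_usage(deep=True) reports bytes per column, including the
# contents of object/string columns.
total_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{total_mb:.1f} MB")
```

Comparing that figure with the available RAM gives a rough idea of whether the whole dataset can be processed in one go.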

  • @krishnapullak
    @krishnapullak ปีที่แล้ว +1

    Nice tip

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thx for watching.

  • @deniszhuravlev9874
    @deniszhuravlev9874 ปีที่แล้ว +1

    This is cool!
    This is a way to think differently👍
    I always use '.loc', but I have never thought of using it instead of 'for'.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Glad you liked it and learned something new!

  • @Atlas92936
    @Atlas92936 ปีที่แล้ว +1

    Nice glasses! where are they from?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Haha. Thanks! 🤓 - They are from Warby Parker, but I accidentally broke this pair :(

  • @johnidouglasmarangon
    @johnidouglasmarangon ปีที่แล้ว +1

    Great video Bob, thanks.
    I'm curious, which interface for Jupyter Notebook are you using?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Glad you liked it. This is JupyterLab with the solarized dark theme. Check out my full video on Jupyter where I go into detail about it.

    • @johnidouglasmarangon
      @johnidouglasmarangon ปีที่แล้ว

      @@robmulla Tks Bob ✌️

    • @robmulla
      @robmulla  ปีที่แล้ว

      @@johnidouglasmarangon no problem. Jane!

  • @dh00mketu
    @dh00mketu ปีที่แล้ว +1

    Why didn't you remove the get rewards function from the other run times?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Oops. Did I do it incorrectly? Can you share the timestamp?

  • @A372575
    @A372575 ปีที่แล้ว

    Thanks. One question about vectorizing: which would be faster, np.where or the method you mentioned?
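
For reference, here is a small sketch of the two approaches side by side, so they can be timed on real data (for example with `%timeit` in Jupyter). The columns are hypothetical, and no claim is made here about which one wins.

```python
import numpy as np
import pandas as pd

# Hypothetical columns for illustration.
df = pd.DataFrame({
    "age": [95, 30, 50],
    "favorite_food": ["pizza", "taco", "sushi"],
    "hate_food": ["broccoli", "eggs", "liver"],
})
mask = df["age"] >= 90

# Option 1: start from the default and overwrite the masked rows with .loc.
df["reward_loc"] = df["hate_food"]
df.loc[mask, "reward_loc"] = df["favorite_food"]

# Option 2: np.where picks from one column or the other in a single call.
df["reward_where"] = np.where(mask, df["favorite_food"], df["hate_food"])

print(df[["reward_loc", "reward_where"]])
```

Both produce the same column, so the choice comes down to readability and whatever the timings show on the data at hand.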

  • @jrwkc
    @jrwkc ปีที่แล้ว +1

    When you vectorize with loc, don't you have to mask the right side of the assignment too? df['favorite_food'] is not masked. It's the whole array, right? So you are setting the reward to the first N values of df['favorite_food'], where N is the length of the mask.

    • @robmulla
      @robmulla  ปีที่แล้ว

      I don't think so because pandas will use the index when populating. But I'm also not 100% sure.

    • @jrwkc
      @jrwkc ปีที่แล้ว

      @@robmulla Make GitHub repos so we can test! That would be great.
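
The alignment question in this thread can be checked with a tiny frame. The sketch below uses made-up data and a non-default index so that label-based and position-based assignment would give visibly different results.

```python
import pandas as pd

# Hypothetical frame with a non-default index to make alignment visible.
df = pd.DataFrame(
    {
        "age": [30, 95, 40, 92],
        "favorite_food": ["pizza", "taco", "sushi", "ramen"],
        "hate_food": ["broccoli", "eggs", "liver", "kale"],
    },
    index=["a", "b", "c", "d"],
)
mask = df["age"] >= 90  # True for rows "b" and "d"

df["reward"] = df["hate_food"]
# The right-hand side is the full, unmasked column. pandas matches it to
# the selected rows by index label, not by position.
df.loc[mask, "reward"] = df["favorite_food"]
print(df["reward"].tolist())
```

Row "b" receives "taco" (its own favorite) rather than "pizza" (the first value of the column), which is what label alignment predicts.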

  • @PowerYAuthority
    @PowerYAuthority ปีที่แล้ว +1

    Never seen anyone use your level 1 methodology

    • @robmulla
      @robmulla  ปีที่แล้ว

      You are lucky!

  • @lucienjaegers2028
    @lucienjaegers2028 ปีที่แล้ว +1

    Nice trick, but what if you code it completely in C / C++ / Rust? Literature says those are 50 - 80 times faster?

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      I have a whole video on polars, which is written in rust. It’s faster for sure. But keep in mind pandas backend is just C code.

  • @Chris_87BC
    @Chris_87BC ปีที่แล้ว

    Great video! I am currently looping through a data frame column for each customer and printing the data to PDF. Is there a vectorized version that would be much faster?

  • @blakeedwards3582
    @blakeedwards3582 ปีที่แล้ว +1

    What theme are you using to get your Jupyter Notebook to look like that?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Solarized dark theme. I have a whole video about my jupyter setup

  • @alanhouston5874
    @alanhouston5874 ปีที่แล้ว +1

    Can you save lists using Parquet?
    Or is it only applicable to dataframes?

  • @bilalbayrakdar7100
    @bilalbayrakdar7100 ปีที่แล้ว +1

    So SQL-like logical filtering is the real deal, dude.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Not sure if I follow. I do love SQL though!

  • @ErikS-
    @ErikS- ปีที่แล้ว +1

    3.5 seconds for a for loop with only 10k rows...
    Is this done in a Docker container or another VM(-like) environment?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Just done locally on my fairly beefy machine.

  • @co.n.g.studios5710
    @co.n.g.studios5710 ปีที่แล้ว +2

    Nice vid. Wouldn't it be even faster if you used .values for the columns? Is this even applicable in the case presented in the example? Looking forward to your answer, cheers

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thanks for the comment. Yes, using .values could be faster. Thanks for pointing that out. I'm not sure about the specific part in this video, but it's worth a try.
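
As a sketch of what the `.values` suggestion might look like, the example below builds the mask from plain NumPy arrays. The columns are hypothetical, and whether this is actually faster would need to be measured.

```python
import pandas as pd

# Hypothetical columns for illustration.
df = pd.DataFrame({
    "age": [95, 30, 50],
    "time_in_bed": [2, 7, 3],
})

# .to_numpy() (or the older .values) hands back the underlying NumPy
# array, so the comparisons skip pandas' index handling.
age = df["age"].to_numpy()
time_in_bed = df["time_in_bed"].to_numpy()
mask = (age >= 90) | (time_in_bed > 5)

# Assigning the array back relies on position, so the row order must
# not have changed in between.
df["rewarded"] = mask
print(df)
```

The trade-off is that NumPy arrays carry no index, so any alignment pandas would normally do by label is lost.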

  • @rickk3658
    @rickk3658 ปีที่แล้ว +1

    3500 times faster is all well and good, but I'd like to know your speed up magic at 2:31. You were turning create dataset into a function. As you type the colon, the rest of the code in the cell became properly indented. My version of JupyterLab does not do that. What's the secret?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Oh. I’m using the black auto formatter with nb_black. It’s really helpful to keep your code clean in jupyter.

  • @TeXiCiTy
    @TeXiCiTy ปีที่แล้ว +1

    For looping over big datasets I switch to polars when speed becomes an issue.

    • @robmulla
      @robmulla  ปีที่แล้ว

      I have an entire video on my channel about polars. It’s great! Check it out.