Today we will be looking at how sets can be used to significantly optimize your Python code. ▶ Become job-ready with Python: www.indently.io ▶ Follow me on Instagram: / indentlyreels
As one of you was kind enough to point out, the speed_difference calculation was completely off. It should have been obvious to me immediately, but sometimes when you're writing tutorials these details pass you by. In the example I provided, sets are MUCH faster than just 99%. Sorry for that bad snippet of silly math :)
If it makes you feel any better, GitHub has been using the same incorrect calculation for their Copilot efficiency gains for months now, saying their product increases efficiency by 55% (presumably coming from the fact that tasks took 55% less time), when in reality it increases efficiency by 127%
That's a hilarious little piece of trivia, thanks for sharing 😂
I've thought about it and I'm still quite certain that the math in the video is correct. My reason for believing this is that the 100% represents the time the list took to complete. That means 100% faster would mean the set took 100% less time to complete (which would be 0 s). However, if you compared it the other way around (the list took x% longer to complete), then it would be quite a large number; for example, at 3:00 it would be correct to say that the list was 2,848,073% slower than the set, or that the list took 2,848,073% more time to complete. But the other way around, it can't go over 100%, because it would go into negatives that way.
@@Dent42 Isn't it true though? They're comparing to the old value, and the new value is 45% of the original. That means the old value was reduced/improved by 55%. Their efficiency is based on the time it took, and the time can't be 127% less than the original time, because nothing takes less than 0 s to complete.
Edit: Never mind, when talking about an efficiency/performance increase it would be 122% I think (I tried 3 different formulas, all gave 122%, not 127%).
I like the idea of something taking 100% less time to complete
To add to this, when a list checks whether an element is there, it uses the '==' operation on each member, which is why it takes so long. A set, however, takes the hash of the element (hash(element)) and checks whether that hash is in the table. In your own classes you can modify this behaviour by writing your own implementations of the magic methods __eq__() and __hash__().
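For example, in code that might look something like this (a minimal sketch; the Point class and its fields are made up purely for illustration):

```python
class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __eq__(self, other: object) -> bool:
        # a list's "in" relies on calling this for every member;
        # a set only calls it after the hash has narrowed things down
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self) -> int:
        # a set's "in" uses this first to jump straight to the right bucket
        return hash((self.x, self.y))


points = {Point(0, 0), Point(1, 2)}
print(Point(1, 2) in points)  # True: hash() locates the bucket, __eq__ confirms
```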
I'd suggest that for many applications, using a dict() is what you want. Not only do you get the speed of a set, you can store a value as well. Dict or list are the go-tos, while set can be used if you really need the speed of a dict and you really don't want to store anything.
If you want to handle more info (like counts, index in an array, order, etc.) you can use dicts: the keys of the dict behave practically the same as a set, and you can store the additional info in the values.
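A small sketch of that idea (the word-count example is just illustrative):

```python
# membership checks on dict keys are hashed, just like a set,
# but each key can carry extra information in its value
counts: dict[str, int] = {}
for word in ["spam", "eggs", "spam", "ham"]:
    counts[word] = counts.get(word, 0) + 1

print("spam" in counts)  # True, hashed lookup on the keys, like a set
print(counts["spam"])    # 2, the extra info a plain set couldn't hold
```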
The fact that sets are implemented as hash tables, and that the performance is O(1) are not two separate “reasons”: they are the exact same reason expressed in two different ways.
I love your videos, man. They are simple, effective and very useful. Hope you get the recognition you deserve.
I'm new to programming and I'm, as always, learning something new from your short videos. However, there is something wrong with the calculation of the speed increase. As the result for 'set' approaches zero, the calculation approaches 100% and never goes beyond it. I think you need to use the result of the set as the base. Thanks for the videos!
That's absolutely my fault for not double checking that bit, I used ChatGPT for inspiration and didn't notice that until you brought it up. I need to be more strict on the code it generates.
Thanks for bringing it up!
@@Indently Not sure if it was GPT-3 or GPT-4 that you used, because 3 is notorious for math mistakes. In any case your video was very good: straightforward and to the point.
It approached zero because the list was the base, and going above 100% would mean that the set took negative seconds to complete. If he used the set as the base, he would have to write that the list was slower by 2,848,073%, for example. In the video it stated that the set was faster by…
While I did know that sets were a lot faster than lists, I didn't know by how much. This was very interesting and informative.
However, what I didn't know and just saw in your video was PEP 515; being able to separate digits with underscores as a visual aid seems awesome!
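For anyone who hasn't seen it, a quick illustration of those PEP 515 underscore separators (the values are arbitrary):

```python
# underscores in numeric literals are purely visual (PEP 515, Python 3.6+)
population = 1_000_000
mask = 0xFF_FF_00_00

print(population == 1000000)  # True: the underscores change nothing
print(f"{population:_}")      # 1_000_000: the same grouping works in f-strings
```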
Thanks so much! Your videos are the best on the internet.
And he speaks the clearest, most understandable English in the whole world!!
That's too kind, Winfried :)
@@Indently What I said is absolutely true. I'm assuming you're not British born or something, because of your first name. I also bought your Udemy course last weekend, even though I know Python well by now, but it is a pleasure to work through and follow. And thank you for your great videos, and hopefully many more to come on YouTube!!!
I'm, as always, learning something new from your short videos. Great content, keep it up 👌👌
Wow bro, every time I see one of your videos I immediately know I'll learn something new, thank you
One thing sets do, as opposed to lists, is eat twice the RAM. It's not twice of a lot of RAM, since lists tend to store pointers and those are int-sized, but it's still more.
Definitely interesting :)
There are always trade offs, and I am interested in hearing some more stuff about lists vs sets
@Infinite Planes
There are generally two types of lists used commonly.
Lists you see by default in OOP languages are generally Array-backed - literally an array of a size similar to the list, but rounded to some pre-set number. They're more expensive to edit, since you need to make a new array every time, but can be accessed by index O(1) (it's a sum operation of array's beginning pointer and the index). They are generally the best if you need an array you would want to edit later.
However, in Functional languages, the default is a Linked List - where every list member has a pointer to the next one. This means editing it is easy, since the next member can be anywhere in memory and you just have to change the pointer; however, finding a specific member is very difficult since you literally have to iterate over the entire piecemeal list. Also, since they store one extra pointer, they're slightly bigger. They are better for FP-like usages of lists, though.
@Infinite Planes There are two types of construction for tables (the video talks about sets, but values don't have to be unique and the difference in speed is very minor - in fact, many languages implement sets as tables without a value, or with the value being the actual value and key being hashed).
Note that while calculating space taken by tables is relatively easy (they, like array lists, are slightly bigger than their elements put together), explaining them isn't quite as easy. They have a lot of optimizations related to the fact that they're meant for really fast lookups by keys. Hell, in a perfect world, hashmaps are O(1) (if you use an array indexed by the hash - obviously that is nowhere near space efficient).
@@laz272727 Array-backed lists are not necessarily or even usually more expensive to edit: it all depends on how you edit it. Further, linked lists offer asymptotic removal/addition benefits only if you can obtain the corresponding pointer for free or very cheaply, which is not usually the case. It is a common misconception, though, because people don't realize the cost of traversing the linked list to obtain the pointer.
You also seem to mix up random (index-based) access with finding a specific element in the collection. These are completely different operations, and the latter is O(n) for both (unsorted) arrays and linked lists.
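To make the distinction concrete in Python terms, a tiny sketch (the Node class is made up for illustration; Python's built-in list is array-backed):

```python
from typing import Optional


class Node:
    """Minimal singly linked list node, purely for illustration."""
    def __init__(self, value: int, next: Optional["Node"] = None) -> None:
        self.value = value
        self.next = next


# array-backed: Python's list gives O(1) access by index
arr = list(range(100_000))
print(arr[99_999])  # one pointer-arithmetic step under the hood

# linked list: reaching the same position means walking every node, O(n)
head: Optional[Node] = None
for v in reversed(range(100_000)):
    head = Node(v, head)

node = head
for _ in range(99_999):
    node = node.next  # one hop per element
print(node.value)
```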
My first thought was that you were going to talk about a list vs tuple comparison, and I was confused about how immutability could make iteration faster.
Very good explanation, thank you
W video as always, thanks for the great content
You should search for zero instead of 999_999. Or in other words, not that your conclusion isn't correct, but it's almost like some marketing department came up with your example: the worst possible case from the point of view of lists.
That is the point though. You should compare worst cases (or at least average), not best cases. Obviously if he searched for 0, it would come up immediately for his list. Instead, he should shuffle the list and search for multiple random numbers and make an average.
@@Efebur Indeed. What I learned is that sarcasm is a difficult sport.
The amount of friends I've lost through using sarcasm in text form 😂
The thumbnail showed the word "why",
so I thought this video would explain the reason why a set is faster to search for a specific value than a list.
3:35
Thanks for this valuable practical example, I had only studied this in books.
Is there any other way to find the execution time of a program, so that I can work out which programs are more efficient?
Maybe this will help: th-cam.com/video/qXLh5sZLpHE/w-d-xo.html&ab_channel=Indently
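For reference, a minimal sketch of timing a membership check with the standard-library timeit module (the container sizes and the searched value are just assumptions for the demo):

```python
from timeit import timeit

# build both containers once, outside the timed statement
setup = """
items_list = list(range(1_000_000))
items_set = set(items_list)
"""

# time 100 membership checks in each container
list_time = timeit("999_999 in items_list", setup=setup, number=100)
set_time = timeit("999_999 in items_set", setup=setup, number=100)

print(f"list: {list_time:.6f} s")
print(f"set:  {set_time:.6f} s")
print(f"the set was roughly {list_time / set_time:,.0f}x faster here")
```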
Just use ctypes. Don't be lazy, C/C++ is about 100-200 times faster (it takes about 0.2 s to load a C++ function in Python, but even considering this it will be at least 10 times faster).
We use Python because we are lazy.
@@Indently python is also lazy so that sums into a very poor performance :)
@@Smbrine Poor performance, big salary 😎
Which IDE are you using in this video?
im curious too
@@mailonbido I got it. It's PyCharm. It looks different because of the latest UI update in the 2023 version, I guess.
@@tanmaypatel4152 oh nice, thanks!
@@mailonbido You're welcome
Good info. Thanks.
Funny. It is O(1) because sets are hash tables :)
Very cool!
I believe you made a mistake with the percentage code and result. Your result doesn't show how many times faster the fast code was. That would be determined by 9.16 / (3.2×10⁻⁵) = 286,250, so about a quarter of a million times faster. Basically the fast code could run a quarter of a million times in the same time the slow code could do it once.
I recognize your way of calculating it to be the way to find percentage error, where 100% seems like a reasonable result, but it's not what you're looking for I presume.
Was just about to say exactly this: the limit of his approach is 100% faster, which is accurate but not the best way IMO to show how much faster something is.
The math is silly, I was fooled by the shiny ChatGPT-4 model. I need to be much more harsh on it.
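A tiny sketch of the two ways of expressing the difference (the timings are the rough figures quoted in this thread, used purely as placeholders):

```python
# rough timings quoted above, used only as placeholder values
list_time = 9.16    # seconds
set_time = 3.2e-5   # seconds

# "percent of the list's time saved" can never exceed 100%
percent_saved = (list_time - set_time) / list_time * 100

# the ratio (and the matching "% slower" for the list) is unbounded
ratio = list_time / set_time
percent_slower = (list_time - set_time) / set_time * 100

print(f"the set saves {percent_saved:.4f}% of the list's time")
print(f"the set is {ratio:,.0f}x faster; the list is {percent_slower:,.0f}% slower")
```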
Sets seem far more efficient than lists. Nice that you can convert them back and forth to get items sorted.
What about tuples compared to lists?
Somewhat slow... Just a lil bit....
Tuples are faster to create and index, and very similar to search through. Lists are useful for collections that need to be edited.
Hey @Indently, what if I am given a very long list and asked to check whether an element exists in it? In this case, which is optimal: checking in the list, or converting the list to a set and checking in the set? In other words, is it better to convert the list into a set first?
Converting a list to a set takes N hashing calls, so you don't win in efficiency. Either store your data in a set from the start, if order doesn't matter, or just use a list.
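A minimal sketch of that trade-off (the sizes are arbitrary): converting pays the hashing cost up front, so it only starts to pay off once you do many lookups against the same data.

```python
from timeit import timeit

setup = "data = list(range(100_000))"

# a single membership check: converting first just moves the O(n) scan into set()
one_direct = timeit("99_999 in data", setup=setup, number=10)
one_converted = timeit("99_999 in set(data)", setup=setup, number=10)

# 1,000 membership checks: the one-time conversion cost is amortised away
many_direct = timeit(
    "for i in range(1_000): (i * 99) in data",
    setup=setup, number=1,
)
many_converted = timeit(
    "s = set(data)\nfor i in range(1_000): (i * 99) in s",
    setup=setup, number=1,
)

print(f"1 lookup:     direct {one_direct:.4f}s   via set() {one_converted:.4f}s")
print(f"1000 lookups: direct {many_direct:.4f}s   via set() {many_converted:.4f}s")
```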
Python 3.10.4
Windows 8.1
I don't have the library timeit, and when I try to install it through pip I get the error:
No matching distribution found for timeit
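For anyone hitting the same thing: timeit isn't on PyPI because it ships with the standard library, so no install is needed, just import it (quick sketch):

```python
# timeit is part of the standard library, so no pip install is required
import timeit

elapsed = timeit.timeit("sum(range(1_000))", number=10_000)
print(f"{elapsed:.4f} s for 10,000 runs")
```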
Doesn't that mean that, since sets are unordered, the number we are finding might be at the 2nd index or in the middle, whereas in the list we know the thing we are looking for is always at the last index, so the set finds it quicker than the list? 😮
No. If it were just a matter of position, an element in the middle would (on average) take about half the time of scanning the whole list; instead it is really fast, always. That's because sets use a completely different algorithm than lists (hashing).
Looking at this makes me wonder if I can optimize my function by making a set instead of a dict for speed. Or whether a dict is faster than a list but slower than a set.
From an implementation perspective, a set is often implemented as a simple hashmap (dict) where the keys are the set elements and the values are all void (empty). So the performance of indexing into a dictionary and checking whether an element is in a set is pretty much the same under the hood.
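A quick sketch that compares the two lookups (sizes are arbitrary; exact numbers vary by machine):

```python
from timeit import timeit

setup = """
keys = range(1_000_000)
as_set = set(keys)
as_dict = dict.fromkeys(keys)  # same keys, with None as a placeholder value
"""

set_time = timeit("999_999 in as_set", setup=setup, number=1_000_000)
dict_time = timeit("999_999 in as_dict", setup=setup, number=1_000_000)

# both go through a hash-table probe, so the timings land very close together
print(f"set membership:  {set_time:.3f} s")
print(f"dict membership: {dict_time:.3f} s")
```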
Love from India bro 🇮🇳
How about dicts?
is a frozenset() faster than a set()?
They take slightly less ram, since they don't need to be mutable.
@@laz272727 Thanks 🧑💻
Is there a way to generate pixel-by-pixel smooth, colorful images incredibly fast with set()?
You could have a set for every integer value of a color channel. So, one for 0 red, one for 1 red, one for 2 red, etc. And the same for green and blue channels. Then, in each set, store the pixel indices for that color value.
what ide is this
👍