OMG. It's exactly what I needed for my project!! Thank you for this vid!! At first I had giant problem with understanding why the code is correct but after the proofs and banging my head, now I got it. Thank you so much 🙂
Isn't this useful for minibatch creation? I mean mini batch SGD, Adam etc. and other optimization algorithms expect each mini batch to be a "unbiased random" sample for the convergence to occur, useful especially when training is happening on a live stream of data. The question makes sense for ML Engineering or Research roles, probably not for developers though Gr8 content and explanation!
Given a pool of N lottery tickets, where the lottery tickets dispatched within X stores (each store might get different numbers of lottery tickets). Does the probability of winning the lottery by purchasing ONE ticket change if we buy lottery ticket from different stores?
11:50 why do we replace the elements that we have picked? doesn't it mean that if i picked element of already existing k elements that it has "won" and should stay?
Hey great question, I hope you've already figured it out since this is a late reply, but it's a great question and something I was stuck on too. Here's a simple answer that I can elaborate on further: "We replace the already picked elements, because we want every element to have an equal probability of being chosen. If we were to keep the existing k elements that have 'won' so to speak, we give an unfair advantage to the elements that came first". It's easy to answer your question with a proof by contradiction. Let's say you just flipped a coin and with probability p=0.5, if the coin lands on heads then it was put into the reservoir of k elements, and said if an element was picked, it stays in the reservoir. Well then, every element until you fill up the reservoir has a probability 1/2 of being put into the reservoir (seems equal right?). Let's say k=5, N=100, and you flipped your coin 5 times and it ended up HHHHH; well then, the rest of the n - k elements (95 elements) in this scenario would have 0% probability of being chosen. Hence, the probability of each element being chosen was not equal to each other 0 != p. The question implies you're only allowed one-pass over a seemingly infinite stream and that you must do something to guarantee an equal probability to each element being chosen, so the point of replacing is making sure that as you increase the size of the elements you've seen to N, that you have given each element an equal probability of being chosen (1/N). You realize once you've chosen to place something in the reservoir, the probability of that staying in the reservoir has to then decrease as you're iterating! Because as you iterate, the probability of any element being chosen needs to be 1, 1/2, 1/3, 1/4, 1/5 .... 1/N when you stop. Hope this helped!
@@tryandcatchjeeves i did actually figure out why that's the case as the only way to have equal probability is to replace but now i have another question which is: how's the probability of each element getting picked is 1/n, shouldn't it be k/n?
I have a question about the code. What if i=2 and k=5, the random_index must be smaller than k. So, does it mean that you will have the same element in the reservoir twice? one by appending, one by replacing?
@@benlucas8109 Bullocks! You're right! yeah the array could also be initialized with the first k elements and then continue the algorithm. Maybe the proofs would have to be adjusted too to show that this indeed correct.
@Amitav Hello there! Well you'd be the first to request something, and I'd be happy to do it. If you have any questions in particular that would help make something tailored to what would be most helpful to you and people like you with similar questions, then please let me know!
Very organized video! Can you please help me understand why your heap solution would be O(n + nlogk)? Wouldn't it simply be O(nlogk), since for each of the n elements in the stream, we are maintaining a heap with k elements (which takes log(k) each for each element of the stream)? Assigning each element of the stream a random key (random.random()) should be an O(1) operation, AFAIK.
Thank you! Yes you're correct (and so am I?) I apologize if it caused you any confusion. I left it as O(n + nlogk) to simply show (and also separate in my mind) that you are indeed iterating over the entire stream of size N and each time you do so you are do a log K operation (insertion into a fixed-size heap) a grand total of N times. That said, the asymptotic running time of this algorithm is supposed to be linear O(n), because depending on the context, it is safe to assume k is a small number independent of n (like the integer 5 haha). You can safely write it off the operation. In other contexts, if K was sufficiently large something like k=N/3, because you actually wanted a decent sample size or something, then O(N log K) seems more appropriate like you said.
@@tryandcatchjeeves Perfect, that makes sense! Yes I also wonder in what scenarios this algorithm is actually implemented and what typical values of K would be. But in terms of big O, I am on board with everything you said. Thank you for clarifying!
Sorry to hear that and you're not alone! haha same thing happened to me. Imho, it's a terrible interview question, doesn't showcase any relevant skills, and can leave the candidate dejected. I hope you have better luck and crush your upcoming job interviews!
OMG. It's exactly what I needed for my project!! Thank you for this vid!! At first I had giant problem with understanding why the code is correct but after the proofs and banging my head, now I got it. Thank you so much 🙂
Bit late here but your explanation at 11:05 and on helped me hella. Hope you make more videos on confusing topics like this in the future
Great video dude, you laid out all the background really well and explained everything super clearly, thanks!
Thank you so much! Good luck on your studying and/or interviews!
more coding content? i really like your thought process and personally think there is good things to learn from you
Thanks, Jeeves! Needed to find a good explanation for this :)
Nice video, I really like your thought process of this not being a fair question.
Thanks for explaining the intuition part for solving this problem. That was really helpful!
Thank you! I'm glad it helped, I was struggling with it too haha
Isn't this useful for minibatch creation? I mean mini batch SGD, Adam etc. and other optimization algorithms expect each mini batch to be a "unbiased random" sample for the convergence to occur, useful especially when training is happening on a live stream of data. The question makes sense for ML Engineering or Research roles, probably not for developers though
Gr8 content and explanation!
Given a pool of N lottery tickets, where the lottery tickets dispatched within X stores (each store might get different numbers of lottery tickets). Does the probability of winning the lottery by purchasing ONE ticket change if we buy lottery ticket from different stores?
look, how cool is this guy.
Hi Jeeves. Great explanation. I have a doubt. If the stream is too large to fit in the memory, how is the heapq.nlargest() thing going to work?
this is great!!
11:50 why do we replace the elements that we have picked? doesn't it mean that if i picked element of already existing k elements that it has "won" and should stay?
Hey great question, I hope you've already figured it out since this is a late reply, but it's a great question and something I was stuck on too. Here's a simple answer that I can elaborate on further: "We replace the already picked elements, because we want every element to have an equal probability of being chosen. If we were to keep the existing k elements that have 'won' so to speak, we give an unfair advantage to the elements that came first".
It's easy to answer your question with a proof by contradiction. Let's say you just flipped a coin and with probability p=0.5, if the coin lands on heads then it was put into the reservoir of k elements, and said if an element was picked, it stays in the reservoir. Well then, every element until you fill up the reservoir has a probability 1/2 of being put into the reservoir (seems equal right?). Let's say k=5, N=100, and you flipped your coin 5 times and it ended up HHHHH; well then, the rest of the n - k elements (95 elements) in this scenario would have 0% probability of being chosen. Hence, the probability of each element being chosen was not equal to each other 0 != p.
The question implies you're only allowed one-pass over a seemingly infinite stream and that you must do something to guarantee an equal probability to each element being chosen, so the point of replacing is making sure that as you increase the size of the elements you've seen to N, that you have given each element an equal probability of being chosen (1/N). You realize once you've chosen to place something in the reservoir, the probability of that staying in the reservoir has to then decrease as you're iterating! Because as you iterate, the probability of any element being chosen needs to be 1, 1/2, 1/3, 1/4, 1/5 .... 1/N when you stop.
Hope this helped!
@@tryandcatchjeeves i did actually figure out why that's the case as the only way to have equal probability is to replace but now i have another question which is:
how's the probability of each element getting picked is 1/n, shouldn't it be k/n?
I have a question about the code. What if i=2 and k=5, the random_index must be smaller than k. So, does it mean that you will have the same element in the reservoir twice? one by appending, one by replacing?
I think in the first if state, where i < k, "continue" should be added to avoid executing code for i > k
@@benlucas8109 Bullocks! You're right! yeah the array could also be initialized with the first k elements and then continue the algorithm. Maybe the proofs would have to be adjusted too to show that this indeed correct.
Hey Jeeves, I am not sure if you would take any requests but I would like to see a video on algorithm L for reservoir sampling.
@Amitav Hello there! Well you'd be the first to request something, and I'd be happy to do it. If you have any questions in particular that would help make something tailored to what would be most helpful to you and people like you with similar questions, then please let me know!
@@tryandcatchjeeves Algorithm R is subtle to develop to an intuition for. Algorithm L 's random number generation part was a bit unclear.
Very organized video!
Can you please help me understand why your heap solution would be O(n + nlogk)? Wouldn't it simply be O(nlogk), since for each of the n elements in the stream, we are maintaining a heap with k elements (which takes log(k) each for each element of the stream)? Assigning each element of the stream a random key (random.random()) should be an O(1) operation, AFAIK.
Thank you! Yes you're correct (and so am I?) I apologize if it caused you any confusion. I left it as O(n + nlogk) to simply show (and also separate in my mind) that you are indeed iterating over the entire stream of size N and each time you do so you are do a log K operation (insertion into a fixed-size heap) a grand total of N times.
That said, the asymptotic running time of this algorithm is supposed to be linear O(n), because depending on the context, it is safe to assume k is a small number independent of n (like the integer 5 haha). You can safely write it off the operation.
In other contexts, if K was sufficiently large something like k=N/3, because you actually wanted a decent sample size or something, then O(N log K) seems more appropriate like you said.
@@tryandcatchjeeves Perfect, that makes sense! Yes I also wonder in what scenarios this algorithm is actually implemented and what typical values of K would be. But in terms of big O, I am on board with everything you said. Thank you for clarifying!
@@EpiKuhL Did you come here from leetcode's december challenge?
@@anandapoudel5767 yes I did haha! I expect you did too? I had never heard of reservoir sampling prior to today’s problem
Great Video!
Thank you!
i just bombed my interview with this question and here i am. and cool, it's just math/statistics. i don't feel so bad anymore. why was i asked this :(
Sorry to hear that and you're not alone! haha same thing happened to me. Imho, it's a terrible interview question, doesn't showcase any relevant skills, and can leave the candidate dejected. I hope you have better luck and crush your upcoming job interviews!
great video
Thanks so much Jayesh! Glad you found the video useful. Good luck on your interviews!
#AskJeeves :)
Gotten that all my life - never gets old haha