This playlist is 100 times better and offers an in-depth explanation than all those paid system design courses.
agreed lol
Thanks for the video, found some new info in this video.
The gaps I found:
- Linage is not explained, and Linage is how Spark recovers from worker failures
- 8:29 is not correct: as a user you can trigger a write to disk to make it easier to recover from failures in a long Linage, but Spark does not do that automatically
- The API difference is not mentioned; that is also a very big benefit of Spark over MR
Guessing you mean lineage here.
1) Which part specifically am I not explaining about how we recover from worker failures? We either hope the node comes back up, or we restore state from the previous checkpoint onto another node and recompute any work that we need to.
2) Good to know, thank you!
3) Which API benefit? It seems to me what you're mentioning is being able to use arbitrary operators rather than just a mapper + reducer, which I do believe I mentioned, but maybe I forgot this time around.
@jordanhasnolife5163
1) I meant that you could just define an RDD's lineage/query plan, explicitly stating that the sequence of upstream RDD transformations since the last checkpoint is maintained and is used for failure recovery if needed.
3) Spark provides a much richer and more powerful API, and after working with Spark, engineers don't want to go back to using bare-bones MR.
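For anyone following this thread, here is a rough sketch of the lineage idea in plain Python. It is not Spark's API: `ToyRDD`, `checkpoint_data`, and the trace strings are made-up names for illustration. The assumption is simply that each dataset remembers its parent and its transformation, and that recovery replays those steps starting from the nearest user-triggered checkpoint.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores its lineage, not its data."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source          # rows, only set on the root dataset
        self.parent = parent          # upstream ToyRDD (the lineage link)
        self.fn = fn                  # transformation applied to the parent's rows
        self.checkpoint_data = None   # filled only when the user asks for it

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def checkpoint(self):
        # Explicit, user-triggered materialization: recovery stops walking back here.
        self.checkpoint_data = self.compute()
        return self

    def compute(self, trace=None):
        # Rebuild this dataset by replaying the lineage; `trace` records each step taken.
        trace = [] if trace is None else trace
        if self.checkpoint_data is not None:
            trace.append("read checkpoint")
            return list(self.checkpoint_data)
        if self.parent is None:
            trace.append("read source")
            return list(self.source)
        rows = self.parent.compute(trace)
        trace.append("recompute step")
        return self.fn(rows)


base = ToyRDD(source=[1, 2, 3, 4])
doubled = base.map(lambda x: x * 2).checkpoint()
result = doubled.filter(lambda x: x > 4).map(lambda x: x + 1)

# Pretend the worker holding `result` died: replay the lineage to rebuild it.
trace = []
recovered = result.compute(trace)
print(recovered, trace)
```

The replay starts at the checkpoint rather than at the source, which is the reason a user might checkpoint in the middle of a long lineage.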
Ur explanations are so clear, thanks a ton!! Also 10k coming soon 💪
Woohoo! @recursion may have actually botted me some subs...
Thanks for the content. With Spark there is no need to store the entire dataset in memory; we can also persist part or all of it on disk, but that will make things slower. Also, the data is loaded by partition in the executors, so we don't need to fit the entire dataset in memory unless the input (file) format doesn't support reading in chunks (rare). I hope I am not missing something.
Yep! Right, by partition fair point, and totally true with regards to not needing to fit it all, things just get slower haha
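A small plain-Python sketch of the "by partition" point, under the assumption that the input can be read in chunks. `read_partitions` and `sum_by_partition` are hypothetical helpers, not Spark functions; the aim is only to show that an aggregate can be built while holding one partition at a time.

```python
from typing import Iterable, Iterator, List, Tuple


def read_partitions(rows: Iterable[int], partition_size: int) -> Iterator[List[int]]:
    """Yield one partition at a time; only that partition is held in memory."""
    partition: List[int] = []
    for row in rows:
        partition.append(row)
        if len(partition) == partition_size:
            yield partition
            partition = []
    if partition:
        yield partition  # last, possibly smaller, partition


def sum_by_partition(rows: Iterable[int], partition_size: int) -> Tuple[int, int]:
    """Return the total and the largest number of rows held at once."""
    total = 0
    peak_rows_in_memory = 0
    for partition in read_partitions(rows, partition_size):
        peak_rows_in_memory = max(peak_rows_in_memory, len(partition))
        total += sum(partition)
    return total, peak_rows_in_memory


# range() is lazy, so the million rows are never all in memory together.
total, peak = sum_by_partition(range(1_000_000), 10_000)
print(total, peak)
```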
I'm having trouble following how Spark addresses a wide-dependency failure (8:15).
Is the solution to assume the wide dependency succeeds and then write the results to disk?
How does this address a failed wide dependency? For instance, using your example diagram, what if the top node failed after {a: 3} and never got {a: 6}? Writing a partial result to disk here wouldn't be entirely useful.
You'd have to redo the computation from the last checkpoint up to this point.
In the example you provided that would mean the bottom node went down: we'd spin up another and have it redo that local computation.
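A toy version of that recovery in plain Python, assuming two upstream nodes whose local counts are {a: 3} and {a: 6}, as in the question. The node names, `shuffle_files`, and the helper functions are invented for illustration and are not how Spark names anything.

```python
from collections import Counter

# Input partitions each upstream node can re-read after a failure (assumed data).
input_partitions = {
    "top": ["a"] * 3,
    "bottom": ["a"] * 6,
}

shuffle_files = {}   # node name -> partial counts written after the local step
recomputed = []      # which nodes had to redo their local computation


def run_local_step(node):
    # Narrow, per-node work: count keys within this node's own partition.
    shuffle_files[node] = Counter(input_partitions[node])


def run_wide_step():
    # Wide step: needs the partial counts of every upstream node.
    result = Counter()
    for node in input_partitions:
        if node not in shuffle_files:
            # Output lost with the failed node: redo only that node's local work.
            run_local_step(node)
            recomputed.append(node)
        result.update(shuffle_files[node])  # Counter.update adds counts
    return result


run_local_step("top")
run_local_step("bottom")
del shuffle_files["bottom"]   # simulate the bottom node dying before {a: 6} arrives

final_counts = run_wide_step()
print(dict(final_counts), recomputed)
```

Only the lost node's local work is redone; the surviving node's {a: 3} is reused as written.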
Drunk vids are the best. Thanks for recording this!
Excellent explanation and summary. Thanks!
really curious what you think of lakesail's pysail bro. built on rust and supposedly 4x the speed and 90% less hardware cost than spark. pretty recent project but looks cool
I have yet to hear of it, but I'll try taking a look at some point if possible! If it has a similar API to Spark and is a lot faster, I imagine it would gain a lot of traction!
Writing intermediate state to disk on a wide dependency is like MapReduce 😅 what changed?
MapReduce treats everything as a "wide dependency", even when it isn't one (purely map jobs or something like that). Also, chaining together MapReduce jobs materializes intermediate state to a distributed file system.
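A deliberately simplified model of that difference in plain Python. It assumes each logical step is written as its own MapReduce job, which is a worst case, and it lumps the wide step into the stage it closes. The step names and `split_into_stages` are hypothetical.

```python
def split_into_stages(steps):
    """Group consecutive narrow steps together; a wide step closes the current stage."""
    stages, current = [], []
    for name, kind in steps:
        current.append(name)
        if kind == "wide":
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages


pipeline = [
    ("parse", "narrow"),
    ("filter", "narrow"),
    ("count_by_key", "wide"),
    ("format", "narrow"),
]

# Pipelined model: data is only handed off where a wide dependency forces it.
pipelined_stages = split_into_stages(pipeline)

# Chained-jobs model: every step is its own job and writes its output out in full.
chained_jobs = [[name] for name, _ in pipeline]

print(len(pipelined_stages) - 1, "handoff between stages")
print(len(chained_jobs) - 1, "intermediate outputs written between jobs")
```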
When I want to know how something hyped works, I usually look for someone who has beef with it due to having been there done that.
For MapReduce, it's Stonebraker (he's literally like Schmidhuber but in the world of DBs), who wrote a paper "Why MapReduce is a dumb hype and sucks big time" (well, he later renamed it to "MapReduce: A major step backwards"), where he nicely shts on MR, with references.
So yeah, MR is bs, and if it weren't for Google, nobody would've even touched it, but 2011 was the year when resume-driven development exploded big time, and most architects used every opportunity to prepare for an interview at Google at the expense of pointy-haired managers.
HOWEVER, Jeff Dean isn't dumb, and IIRC MR was never developed to be efficient and whatnot. Rather, there was a massive underutilization problem: power-saving tech hadn't entered the picture yet, so if a shtty piece of commodity hardware wasn't used 100%, it would fail for nothing, leaving behind only an electricity bill. MR was an attempt to run SOMETHING as low-priority jobs which, if the machine was needed, would be killed with no remorse (i.e. spot instances). Thus, however dumb an idea was (like that famous ML project discovering that the major eigenvalues of all YouTube videos in the world look like kittens), running it at Google scale was considered a good opportunity for the 20% projects. Those were good times, kids, and don't ask me what Google Wave was, we don't talk about it in decent societies.
On the other hand, blitzscaling was entering the picture, with the IQ of CS grads dropping like a stone, faster than the expectations of team leads, and map/fold (functor/monoid) was chosen as a simple enough concept for anyone to operate on, basically the microwave of the FP world. So Jeff and his team wrote the difficult parts (shuffling, coordinating), and dumbed down the exposed parts as much as possible. Stragglers? Who cares! Skewed reducers, all outputs mapped to the same key? You go girl!
Don't quote me on this, I only heard this story from unreliable sources who probably lied.
Better listen to Stonebraker and watch Andy Pavlo.
Really funny you commented this, I read the article last night. Yeah it's an interesting one, but funny to see MR's popularity nonetheless.
Andy Pavlo is great as well!
With Flink, why do we still want to use Spark?
Flink is only useful for processing data as it comes in. With Spark, the goal is to take existing data on disk and output more data on disk.
@jordanhasnolife5163 but with Spark Streaming, wouldn't that allow it to perform the same role as Flink?
Why do you only have 10k subscribers!!!
Haha you guys gotta tell your friends about the channel - randomly having a nice day subs wise here though
You gotta start showing some skin if you want more subs😂
@navdeepredhu4081 incoming toes next video
@jordanhasnolife5163 instead of a facecam, how about a feetcam
@user-se9zv8hq9r Now these are the ideas I'm looking for, you're hired
this old intro xD
Dude what are these jokes??? 😂😂
I wish I could answer
MapReduce has already been dead for over a decade
Golly!