From reading the comments, I think what some people are missing is this: the value of MapReduce as it was implemented at Google (and in Hadoop) is that it is a framework for doing these sorts of jobs at scale. Furthermore, the framework lets you plug in your map and reduce functions independently of everything else; that is why the shuffle is important, for example. People often see the word count example and want to just skip the shuffle and do the count right there. That might be better for this particular example, but the point of the framework/algorithm is that it does not "know" at the shuffle phase that you are going to count. I think more examples would really help people understand it better.
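For instance, here is a minimal, purely hypothetical sketch in Python of how the plug-in idea works: the "framework" part (map, shuffle/group-by-key, reduce) stays generic, and only the user-supplied map_fn and reduce_fn know anything about counting.

    from collections import defaultdict

    # User-supplied functions: only these know we are counting words.
    def map_fn(line):
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        return word, sum(counts)

    # Generic "framework": map phase, shuffle (group values by key), reduce phase.
    def map_reduce(lines, map_fn, reduce_fn):
        groups = defaultdict(list)
        for line in lines:
            for key, value in map_fn(line):   # map
                groups[key].append(value)      # shuffle: group by key
        return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce

    print(map_reduce(["a b a", "c b a"], map_fn, reduce_fn))
    # [('a', 3), ('b', 2), ('c', 1)]

Swap map_fn and reduce_fn for different functions and the same framework computes something entirely different, which is the whole point.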
Professor Baclawski developed the search engine technology now called MapReduce in 1993. Northeastern University patented this technology, and the patent has not been successfully challenged. Northeastern University sued Google for patent infringement, and Google awarded a settlement to Northeastern University.
Thanks for the info
You can actually use k-means with MapReduce. It basically works by splitting the data into sets and running k-means on different nodes with the same cluster values. After that, the new means are calculated and broadcast to all nodes for the next iteration.
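A rough single-iteration sketch of that in Python (hypothetical, 1-D points, one process standing in for the nodes): map assigns each point to its nearest centroid, reduce computes the new mean per cluster, and those means are what would be broadcast for the next round.

    from collections import defaultdict

    def nearest(point, centroids):
        # map phase: index of the nearest centroid for this point
        return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))

    def kmeans_iteration(points, centroids):
        groups = defaultdict(list)
        for p in points:                          # map: emit (cluster_id, point)
            groups[nearest(p, centroids)].append(p)
        # reduce: new mean per cluster (broadcast these for the next iteration)
        return [sum(ps) / len(ps) for _, ps in sorted(groups.items())]

    points = [1.0, 1.5, 9.0, 10.0]
    print(kmeans_iteration(points, centroids=[2.0, 8.0]))   # -> [1.25, 9.5]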
map and reduce also just happen to be the two most commonly used functions in functional programming, and have been for decades
this tech isn't new, it's just being applied at a large scale
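For example, in plain Python, using only the built-in map and functools.reduce (purely illustrative):

    from functools import reduce

    words = "a b a c b a".split()
    pairs = map(lambda w: (w, 1), words)             # classic map
    counts = reduce(
        lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
        pairs, {})                                    # classic reduce/fold
    print(counts)   # {'a': 3, 'b': 2, 'c': 1}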
Glad I'm not the only one bothered by the opening of this video
I remember learning about Hadoop Map Reduce in 2013 and being completely confused about it. Thanks!
it's really refreshing to hear someone explain this in a straightforward way ... kudos to that young lady!
I admit I'm an old fogy programmer... I have a massive problem with crediting this to Google. They *may* have been the ones to put a name on it, but that doesn't mean this isn't an exceedingly basic strategy of dividing work and collating the results that was being used by a lot of people for decades.
I don't get it. So you shuffle the keys between the nodes so that each node is counting all the instances of the same key? Wouldn't that shuffling have been the perfect time to just do the count and get it done with?
This feels like the most impenetrable way of explaining this...
Is it just me or was that not a particularly good example?
It's the classic map reduce example but you need to imagine 10TB of text, not 4 words a b c d
Perhaps not, but word counting is the typical Hello World project for MapReduce so it makes sense to use it.
The word count is a good example. (Actually the best I came across.) She just should have chosen sentences with actual words, and put it in the context of a search engine.
It was confusing until I found a Computerphile video.
I don't understand what a, b, and c are in the example. I thought they were words, but then the presenter calls them letters. In any case, given the example case of distributing a word count over parallel processors, why isn't the answer simply splitting the whole document/data into chunks, and asking the processors to count the words of one chunk each? Afterwards we can just sum the results. What am I missing?
In this example why wouldn't you just have each server count the words in its data and then send the results back to the operator's machine which adds them up?
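A toy sketch of that "count locally, then sum" idea in Python (hypothetical; in Hadoop terms, that local pre-aggregation is roughly what the optional combiner step does before the shuffle):

    from collections import Counter

    chunks = ["a b a", "c b a", "b c c"]        # each chunk lives on one server

    local_counts = [Counter(chunk.split()) for chunk in chunks]   # count on each server
    total = sum(local_counts, Counter())                          # sum on the operator's machine
    print(total)   # Counter({'a': 3, 'b': 3, 'c': 3})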
Hadoop as an approach to MapReduce is so over-engineered and complicated that most people avoid it. Also, unless you're at really big scale, the overhead of Hadoop is so big that any decent laptop can outperform it. That's specifically a Hadoop criticism, not a criticism of MR in general. Alternatives like Dask (simplistically: distributed Python pandas) are much simpler to manage for most purposes... Not quite the same thing, but ultimately you pick your poison... And as noted by another commenter, the thing Google "pioneered" was _distributed_ MR. MR itself has been a part of many languages for many years... And the good explanation given in this video works just as well for those as it does for distributed versions thereof. :)
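A word count with Dask's bag API might look roughly like this (just a sketch, assuming a standard Dask install and small in-memory data; not benchmarked):

    import dask.bag as db

    lines = db.from_sequence(["a b a", "c b a", "b c c"], npartitions=3)
    counts = lines.map(str.split).flatten().frequencies()   # (word, count) pairs
    print(dict(counts.compute()))   # {'a': 3, 'b': 3, 'c': 3}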
I just had to do a MapReduce implementation with gRPC for my Advanced Operating Systems class that was due last night, wish this had come out earlier -_-
Is "reduce" the same as the fold function in Haskell?
It is
Yes
Yes! Fold with the added benefit of directionality.
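In Python terms, functools.reduce is effectively a left fold; a small illustration of the directionality point (hypothetical example values):

    from functools import reduce

    # Left fold: ((((0 - 1) - 2) - 3) - 4) = -10
    print(reduce(lambda acc, x: acc - x, [1, 2, 3, 4], 0))   # -10
    # A right fold would associate the other way: (1 - (2 - (3 - (4 - 0)))) = -2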
This is a good video, no idea wtf the rest are yapping about. She actually goes through the main bit of how data shuffles and moves to different nodes
Why does this allow duplicate keys on the same node instead of just one key per word with the value being the total count of the word on that node? Seems like that would speed up both the shuffle and reduce phases?
When will computer scientists stop putting new fancy names on decade old concepts?
Can you do a segment on the MPI protocol?
When I hear about map/reduce I think about the corresponding commands in perl and that confuses me because they don't seem to be much related.
Could you do a video on concurrency and possible solutions/history video? I'm surprised you haven't yet touched on sequential and concurrent execution. Great vid!
They have covered concurrency within a CPU core, unless maybe you mean parallel execution.
Map reduce gives me google interview flashbacks...
Why do you always write on that particular type of paper in your videos?
They get that for free
Consistency. Computerphile uses dot matrix paper while Numberphile uses brown paper towels.
it's called "continuous form" and was historically used with daisy wheel-type printers (fixed typeface, size and line height) to ease reading.
Because computers.
@feldinho Yeah, I guess they have a bunch of it lying around from that era. At home we also have (or had; I don't know if my father has thrown it away yet) a stack of similar paper from back when we had a dot matrix printer. And I imagine that, at a university, they have tons of that stuff left.
if the mapping creates a single record for each occurrence (e.g. a 1, b 1, c 1), why not just use an array (e.g. a, b, a), since the 1 is implied?
Very useful. Thank you very much!
Needed this a week ago for my exam :(
How would you implement this if you want to leave your data untouched? Mirror your data?
is it just me, or is everything complex just "break it down -> bring it back together"?
forget the nasty comments, I found this fascinating, and so did 94% of people who rated.
Thank you, amazing video! Any chance you’ll talk about Spark next? 😄
My previous project was with Apache Spark, which uses Hadoop and MapReduce, and we used it for batch jobs.
select count(...) group by ...
Query plan: parallelism, stream aggregate
But sure, tell me more about how SQL is outdated.
no love for filter :( it's map-reduce-filter (at least for me)
filter can be part of either phase, depending on how you see it.
filter p l = foldl (++) [] $ map (\x -> if p x then [x] else []) l
Thanks!
Does the way this is done have any similarity to how array.prototype.map() and array.prototype.reduce() works in ES6 JS?
Edit: I mean at the level of dividing the map process across different processes/threads/tasks to speed up computation.
map is not computed concurrently but sequentially in JS.
I love computerphile ❤️
Why are there duplicate keys? Maybe it's cause I only know Python, but I'm really confused.
Like for the first line of text, I would write the output of the map phase as a list of tuples, [('a', 1), ('b', 1), ('a', 1)], but wouldn't it make more sense as a dict, {'a': 2, 'b': 1} ?
I'm not seeing the practicality of this; shuffling seems to make it pointless. It's almost like seeing a Huffman tree being formed, but then introducing it to some meth.
She didn't explain it very well...
brilliant example of a practical application
Yeah, that was a pretty bad example and quite boring.
too much talking and not enough visual examples
the sound of felt pens makes me cringe
anyone knows who the genius is?
What's better MapReduce or CUDA?
That's like comparing Apache Webserver to Fortran. They are completely different things.
I don't think these two are directly comparable. I'm pretty sure there are GPGPU implementations of MapReduce that make use of CUDA.
MapReduce is a generic algorithm paradigm. There's a CUDA implementation for numerical purposes. This is a description of a cluster-sized problem rather than a GPU-sized problem.