Learn MapReduce with Playing Cards

Let's go deeper into the Hadoop MapReduce implementation. As the name implies, there are two phases to MapReduce: the map phase and the reduce phase. The first phase in a MapReduce program is the map phase. The map, or mapper's, job is to create or process the input data. As shown in the diagram, this is usually a file or directory. The file is stored on HDFS and is passed into the mapper function, usually line by line. The mapper processes this input data to create as few or as many outputs as needed; there is no API contract requiring a certain number of outputs for a given input. That is completely up to the programmer. The mapper's output is in turn passed to the reduce phase. The reduce, or reducer's, job is to process the data from the mapper into something usable. Every single value from the mapper is passed into the reduce function, and the reducer creates new output values based on that input. The new output values from the reducer are saved to HDFS.

There is another part to MapReduce that the API gives us. The technical name for it is the shuffle and sort phase; I prefer to call it the magic. We'll talk about what happens during this magic part later on.

As I mentioned in the introduction, we're going to use playing cards to understand MapReduce. I like using cards to show MapReduce because they have colors and shapes for suits. The pack has four suits: spades, hearts, diamonds, and clubs. There are some other cards that aren't from these four suits: the jokers. I strongly encourage you to follow along with me as I do the exercise, or watch me do it and then do it yourself. Either way, you'll understand the concepts better by moving the cards around with your own hands.

Looking at the cards, we can see that some of them have numbers and others don't. The non-numeric cards are the ace, king, queen, and jack, plus the jokers in the pack. We're going to pretend that the non-numeric cards are bad data, or data that we want to exclude; we'll simply throw them out as we find them. Our goal, or program, is to add up the card numbers for a single suit. This will give us the sum of the numbers for a particular suit. We're going to be acting like a Hadoop cluster.

I have a three of diamonds. That's numeric, so I'm going to keep it and place it in the diamonds pile. We have a jack of spades. That's non-numeric, so I'm going to throw it out; we're pretending that this jack is bad data that we want to exclude. We have a three of hearts: numeric, so it goes in the hearts pile. A two of hearts: numeric, hearts pile. An ace of hearts: non-numeric, thrown out. A joker: non-numeric, thrown out. I'm going to speed this up now and quickly sort the remainder of the cards: queen of diamonds, three of spades, six of hearts, ten of spades, eight of clubs, seven of hearts, four of spades, six of diamonds, three of diamonds, eight of hearts, seven of diamonds, four of clubs, seven of diamonds, nine of spades, seven of clubs, five of spades, ten of diamonds, king of diamonds.

Looking at our result, we have four stacks of cards, and each stack makes up one suit's worth of cards. Remember that our objective here is to sum up the card numbers for a particular suit. That's much easier now that we've sorted the cards into piles made up of only one suit: we can just pick up a stack and know we hold all of the cards for that suit. From there we start summing things up. Our algorithm is pretty simple now because we've separated things out. We've manually done what would be done in MapReduce. We did it by going through the data and saying that each piece of data is of a particular type; in our case, the card's suit identifies the type of our data. We'll see in a minute why that typing is so useful.

Hadoop wouldn't be getting any notoriety if it could only process a few playing cards; it needs to handle huge amounts of playing cards. Hadoop handles these massive amounts of cards by distributing the load across multiple nodes. Let's see how Hadoop scales, using playing cards again.

I'm holding a stack of playing cards, and I have two pieces of paper representing two nodes running together: on the left is node 1 and on the right is node 2. I'm going to take the cards and split them in half; one half will go to one node and the other half to the other. Even though I'll be doing the algorithm one node at a time, imagine that both nodes are running simultaneously. I'm going to run the same algorithm as before: I'll simply take one node's stack of cards and sort them by suit. Now I'm going to run the algorithm on the second node's cards. Looking at things, I have two nodes with four stacks of cards each.

Remember that MapReduce has two phases. We just did the mapper phase; now we have to run the reduce phase. Do you remember the magic part of MapReduce that I spoke about earlier? I'm going to manually perform part of that magic. First, I'm going to combine the clubs cards from both nodes on node 1. Next, I'm going to combine the diamonds cards from both nodes on node 1. Then I'm going to combine the hearts cards from both nodes on node 2. Finally, I'm going to combine the spades cards from both nodes on node 2.

Let's look at what this magic has done before we continue. Node 1 has a stack of every clubs card and every diamonds card, in separate piles. Node 2 has a stack of every hearts card and every spades card, in separate piles. The magic has allowed several nodes to process data independently and then combine the results; it brought the data from all of these nodes together so that a single node can process an entire suit at once.

This is how a Hadoop cluster works in real life. The cluster is made up of tens, hundreds, or even thousands of nodes, all connected by a network. A job is broken up into smaller parts to run on each node; these smaller parts are usually based on the amount and size of the data being processed. The mapper on a node operates on one of those smaller parts. The magic takes the mappers' data from every node and brings it together on nodes all around the cluster. The reducer runs on a node and knows it has access to everything with the same key.

Let's imagine a stack of playing cards that goes all the way up into the stratosphere. There are two problems with this scenario as we scale it that high. One problem is storing that many cards. Remember that Hadoop breaks up the data into smaller parts. What would happen if you were to take a section of cards from the bottom of the tower? The entire tower of cards would fall. You really want to break up the single very tall tower into lots of shorter towers; that way you can work on them much more easily. Hadoop does the same thing.

We talked briefly about HDFS as the distributed storage mechanism. Let's say we were to store a one-terabyte file in HDFS. That file would be physically broken up into much smaller chunks. By default these chunks, or blocks, are 64 megabytes; it's often recommended that the blocks be 128 megabytes. The file continues to be logically one terabyte in size. We get several benefits by breaking up the terabyte file into smaller blocks. When a mapper is operating on the terabyte file, it's actually operating on a block and not the entire file. The terabyte file is being worked on by various nodes in the cluster all at once; the various nodes are just operating on different chunks. Once the mappers are all done, the magic takes over: all of the data in the terabyte file is combined based on the key, and the reducers run on different keys.

We get a big benefit out of breaking up a problem this way: we're able to efficiently break up even a large file into smaller pieces. Having ten nodes running on chunks of the same problem is much more efficient than a single node running the same problem. Hadoop scales in a virtually linear fashion, so ten nodes would take about one tenth of the time a single node would. Hadoop's ability to scale is what sets it apart from other systems.
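The mapper/reducer contract described in the transcript can be sketched in a few lines of Python. This is a simplified toy model, not Hadoop's real API (which is Java); the string encoding of a card as "rank suit" is invented for illustration:

```python
# Toy model of the MapReduce contract using the card exercise.
# Input records are strings like "3 diamonds" or "jack spades".

def mapper(record):
    """A mapper may emit zero, one, or many (key, value) pairs per record."""
    rank, suit = record.split()
    if rank.isdigit():           # non-numeric cards (jack, queen, ace...)
        yield (suit, int(rank))  # are "bad data" and emit nothing at all

def reducer(suit, numbers):
    """A reducer sees every value emitted for one key and combines them."""
    return (suit, sum(numbers))
```

Note how the mapper is free to emit nothing for a bad record, exactly as the jacks and jokers were thrown out by hand: there is no contract forcing one output per input.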
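The by-hand exercise of sorting cards into suit piles and then summing each pile is, in effect, a group-by followed by an aggregate. A sketch, using the first few cards from the exercise (the rest of the deck would be processed the same way):

```python
from collections import defaultdict

# The first few cards from the exercise, as (rank, suit) pairs.
cards = [("3", "diamonds"), ("jack", "spades"), ("3", "hearts"),
         ("2", "hearts"), ("ace", "hearts"), ("queen", "diamonds")]

piles = defaultdict(list)          # one pile per suit
for rank, suit in cards:
    if rank.isdigit():             # throw out jacks, queens, kings, aces
        piles[suit].append(int(rank))

totals = {suit: sum(pile) for suit, pile in piles.items()}
print(totals)                      # {'diamonds': 3, 'hearts': 5}
```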
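The "magic" (shuffle and sort) can be modeled as deterministic routing: every node sends a given key to the same destination, so one reducer ends up holding every value for that key. A sketch with two nodes (the node count, the hash function, and the mapper outputs below are all illustrative, not Hadoop's actual partitioner):

```python
from collections import defaultdict

def partition(suit, num_nodes):
    """Stable hash so every node routes a given suit to the same place."""
    return sum(ord(ch) for ch in suit) % num_nodes

# Hypothetical mapper output from two nodes that each sorted half the deck.
node1_out = [("diamonds", 3), ("hearts", 3)]
node2_out = [("hearts", 2), ("spades", 10)]

shuffled = defaultdict(list)       # node index -> (suit, value) pairs
for suit, value in node1_out + node2_out:
    shuffled[partition(suit, 2)].append((suit, value))

# Both "hearts" pairs now sit on one node, ready for a single reducer,
# just like combining the hearts piles from both sheets of paper.
```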
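The HDFS block splitting described above comes down to simple arithmetic: a 1 TB file stored in 64 MB blocks becomes 16,384 independent chunks, each of which a mapper can work on separately. The helper below is a hypothetical illustration, not an HDFS API:

```python
BLOCK_SIZE = 64 * 1024 * 1024          # the classic HDFS default of 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block a file would be stored as."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

one_tb = 1024 ** 4
blocks = split_into_blocks(one_tb)
print(len(blocks))                     # 16384 physical blocks, while the
                                       # file stays logically one terabyte
```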

25 thoughts on “Learn MapReduce with Playing Cards”

  1. An innovative idea to use a pack of cards to explain the concept. Getting fundamentals right with an example is great ! Thank you

  2. When you say nodes and clusters, does an input file of 1TB should definitely be run in more than one computer or we can install hadoop in a single laptop and virtually create nodes and clusters ?

  3. 4:51 – – i'm kind of lost. so you said two papers as two sets of nodes.
    left is node1 and right is node2.
    then you said, "I have two nodes, where each node has 4 stacks of cards".
    I also understood that you are merging two varieties of cards in node1 and another two varieties of cards in node2.
    " a cluster is made of tens, hundreds or even thousands of nodes all connected by a network".
    so in this example, let's say two papers(nodes) are one cluster.
    the part I get confused is , when you say " the mapper on a node operates on that smaller part. the magic takes the mapper data from every node and brings it together on nodes all around the cluster. the reducer runs a node and knows it has access to everything with same key ".

    So if there are two nodes A and B that has mapper data, then the reducer part will happen on two other nodes C and D. I'm confused when you say "on nodes all around the cluster".

  4. The 'scalability' of hadoop has to do with the fact that the data being processed CAN be broken up and processed in parallel in chunks and then the results can be tallied by key. It's not an inherent ability of the tech other than HDFS itself.
    Like most technology, or jobs for that matter, the actual 'process' is simple; it's wading through the industry-specific terminology that makes it unnecessarily complicated. Hell, you can make boiling an egg or making toast complicated too if that's your intent.

  5. Thanks Jesse! This is a wonderful video! I have 2 doubts.
    1. Instead of sum, if it is a sort function, how will splitting it into nodes work? Because then every data point should be treated in one go.

    2. The last part on scaling, how will different nodes working on a file and then combining based on key, be more efficient than one node working on one file?

    I am new to this and would appreciate some guidance and help on the same.

  6. Why did they come up with such a terribly unintuitive name as "MapReduce" ??? It's basically just "bin by attribute, then process each bin in parallel". BinProcess.
