Monday, March 16, 2015

MapReduce


It's a programmable framework for pulling data out of the cluster in parallel.
It is low-level programming. It uses a job tracker (on the master) and task trackers (on the slaves).

Job client: the one who submits the job (the terminal, i.e. the programmer who writes the commands)
Job: A job contains the compiled binaries that contain the mapper and reducer functions, plus some configuration that drives the job
Job Tracker: The master of the task trackers, similar to the name node in HDFS. It coordinates jobs, comes up with an execution plan, schedules the jobs across the task trackers, and does phase coordination.
Task Tracker: The one that breaks the job down into tasks (map and reduce tasks)
Every task tracker has slots on it. It takes the map and reduce functions out of the binaries and drops them into the slots, which actually do the execution. Task trackers also report their progress back to the job tracker. If something fails, the job tracker reschedules that part of the job on other task trackers.

Source: https://www.youtube.com/watch?v=wAZGawXhf8s


Steps:

  1. Job client submits a job to Job Tracker
    • In the background, it copies the binaries that contain the code (mapper, reducer, configuration) into HDFS, close to the task trackers; when they need that code, they download it.
    • It is cheaper to copy code across the network than to copy data across the network, so the code is brought to the machines that already hold the data and runs there (a driver sketch after this section shows what the submitted job looks like in code).
  2. The job tracker queries the name node and gets the locations and names of the nodes that contain the data for the job.
  3. After that, the job tracker will talk to the task trackers and form an execution plan.
    • It communicates with the task trackers in an order based on the distance between each task tracker and the data (and checks whether they have available slots).
    • If the task trackers closest to the data don't have available slots, it falls back to rack locality and looks for task trackers in the same rack with available slots.
    • So the order is data local (the same node) -> rack local (task trackers in the same rack) -> off-rack (across racks).
    • Then the job tracker will create an execution plan.
  4. Then the task trackers execute the tasks and push the data through the pipeline of phases.
  5. Task trackers will send status reports (heartbeats) to the job tracker. 
    • Information about the task and its current phase (split, map, or reduce), so the job tracker can do phase coordination (all the map tasks need to finish before the job can move on to the reduce part).
    • They also send information about task status (success or failure)
    • As well as slot availability (so the job tracker can send more tasks if slots are available, or map tasks from other jobs)
  6. The job tracker manages the phases.
  7. The job tracker updates its status to success after all phases are successfully completed. It notifies the client about the status, and the client can get the information.

We can go to the web UI to see everything the job has done: the data nodes used, statistical information, runtime information, etc.
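
To make the steps above concrete, here is a minimal driver sketch using the classic org.apache.hadoop.mapred API from the JobTracker/TaskTracker era. The names (WordCount, WordCountMapper, WordCountReducer, the input/output paths) are illustrative assumptions; the mapper and reducer classes themselves are sketched under the phases below.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Driver: what the "job client" runs. It bundles the mapper/reducer classes and
// configuration into a job and hands it to the job tracker (step 1), which then
// plans and coordinates the map and reduce tasks (steps 2-7).
public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);            // key/value types the job emits
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCountMapper.class);    // sketched in the Map phase below
    conf.setCombinerClass(WordCountReducer.class); // optional local reducer (Combine phase)
    conf.setReducerClass(WordCountReducer.class);  // sketched in the Reduce phase below

    conf.setInputFormat(TextInputFormat.class);    // Split phase: line-by-line input
    conf.setOutputFormat(TextOutputFormat.class);  // reduce output written back to HDFS

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Blocks until the job tracker reports overall success or failure (step 7).
    JobClient.runJob(conf);
  }
}

Packaged into a jar, this would be submitted with something like "hadoop jar wordcount.jar WordCount <input path> <output path>" (paths are placeholders).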


Phases:
Split: uses the input format to bring data off the disk/out of HDFS and splits it so that it can be sent to mappers.

  • The default input format is the text input format, which breaks the data up line by line. Each line is sent to a mapper. If you have a large file with lots of lines, you can have many mappers running in parallel against the data.
  • Other input formats: binary, database, etc. (see the sketch after this list).
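
As a rough sketch (old org.apache.hadoop.mapred API), the input format is just a configuration choice in the driver; TextInputFormat and SequenceFileInputFormat are standard Hadoop classes, and the helper class here is only for illustration.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

// The Split phase is driven by the chosen input format: TextInputFormat (the
// default) hands each line of the file to a mapper, while SequenceFileInputFormat
// reads Hadoop's binary key/value "sequence files".
public class InputFormatChoices {
  static void useTextInput(JobConf conf) {
    conf.setInputFormat(TextInputFormat.class);          // one record per line
  }

  static void useBinaryInput(JobConf conf) {
    conf.setInputFormat(SequenceFileInputFormat.class);  // binary key/value records
  }
}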


Map (Mapper): transforms the input splits into key/value pairs based on user-defined code.
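
A minimal mapper sketch in the old org.apache.hadoop.mapred API, assuming a word-count job (the class name WordCountMapper is an illustrative choice):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper: called once per input record (here, one line of text from the split).
// It emits (word, 1) pairs; partitioning, sorting and grouping happen later.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);   // key/value pair: (word, 1)
    }
  }
}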

Combine phase: used for optimization. It can save network bandwidth by running a so-called local reducer: the same reduce code runs on the local task slots' map output before it is sent across the network.
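
In Hadoop this is usually just the reduce class registered a second time as the combiner; a small sketch, assuming the word-count reducer shown further down:

import org.apache.hadoop.mapred.JobConf;

// Combine phase: run the reduce code as a "local reducer" over each task
// tracker's map output before it crosses the network. This is safe for word
// count because addition is associative and commutative, so combining partial
// sums does not change the final result.
public class CombinerConfig {
  static void enableCombiner(JobConf conf) {
    conf.setCombinerClass(WordCountReducer.class);
  }
}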

Shuffle & Sort (intermediate phase): It takes all the nodes that are part of the job into account, shuffles the map output (partitioning and grouping), sorts it, and then sends it on to the reducers.
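
The partitioning half of shuffle & sort decides which reducer each key goes to; Hadoop's default HashPartitioner uses the key's hash modulo the number of reduce tasks. A hypothetical custom partitioner in the old API, purely for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes words to reducers by their first character instead of by hash,
// so each reducer receives a contiguous, sorted slice of the keys.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // No extra configuration needed for this sketch.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString().toLowerCase();
    if (word.isEmpty()) {
      return 0;
    }
    // Non-negative value modulo numPartitions keeps the result in range.
    return (word.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered in the driver with conf.setPartitionerClass(FirstLetterPartitioner.class).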

Reducer: Its work requires the sorted data; it aggregates all the results and uses the output format to write them into HDFS.
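
A matching reducer sketch in the old org.apache.hadoop.mapred API (again assuming the word-count example):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer: receives each key once, together with all the values shuffle & sort
// grouped under it, aggregates them, and writes the result via the output format.
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));  // e.g. ("hadoop", 42)
  }
}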

The output of each phase is the input of the next phase, and the data keeps getting smaller and smaller.

Source: https://www.youtube.com/watch?v=wAZGawXhf8s


Functional Programming: A programming paradigm that treats computation as the evaluation of mathematical functions. It avoids state and mutable data (data that can be updated in place): no state, no objects, only functions. Functional programs do not modify existing data structures; they always create new ones, making copies of the old data to represent the updated form. The original data still exists in its unmodified form, so we don't need to synchronize access to it. The data flow is implicit in the program design.
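
A small Java illustration of the functional style (Java 9+ for List.of); the computation builds new values instead of changing existing ones:

import java.util.List;

// Functional style: nothing is mutated. Each step produces new values and the
// original list is left untouched, so no synchronization is needed.
public class FunctionalSum {
  public static void main(String[] args) {
    List<Integer> numbers = List.of(1, 2, 3, 4);   // immutable list

    int sumOfSquares = numbers.stream()
        .map(n -> n * n)                           // new values; old ones untouched
        .reduce(0, Integer::sum);                  // fold into a single result

    System.out.println(sumOfSquares);              // 30
    System.out.println(numbers);                   // [1, 2, 3, 4] -- unchanged
  }
}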

Imperative Programming: A programming paradigm that describes computation in terms of statements that change program state. It represents data structures and variables in memory and updates them in place.
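
The same computation in an imperative style, for contrast; the loop works by repeatedly updating program state in place:

import java.util.ArrayList;
import java.util.List;

// Imperative style: statements change program state (the list, the loop
// variable, and the accumulator) step by step.
public class ImperativeSum {
  public static void main(String[] args) {
    List<Integer> numbers = new ArrayList<>();
    numbers.add(1);
    numbers.add(2);
    numbers.add(3);
    numbers.add(4);

    int sumOfSquares = 0;                 // mutable accumulator
    for (int n : numbers) {
      sumOfSquares += n * n;              // state changes on every iteration
    }
    System.out.println(sumOfSquares);     // 30
  }
}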

