Tuesday, February 2, 2016

MapReduce framework (WIP)



Hadoop MapReduce:
It is a batch-oriented, distributed programming framework in the Hadoop Big Data ecosystem for large-scale data processing. The approach is inspired by the map and reduce functions of functional programming (though it is not the same as functional programming). The MapReduce framework performs its work on the nodes of an HDFS cluster hosted on racks of commodity servers. MapReduce works natively with Hadoop and uses the same language: since Hadoop is written in Java, map and reduce functions can be written in Java.
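As a quick illustration, here is a minimal sketch of a map function written against Hadoop's Java MapReduce API, using the classic word-count example. The class name WordCountMapper is an illustrative assumption, not something defined in this post.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: for each input line, emit a (word, 1) key/value pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line's text.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}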



MapReduce decomposes work submitted by a client into small, parallelized map and reduce workers. The map and reduce constructs used in MapReduce are borrowed from those found in the Lisp functional programming language, and use a shared-nothing model to remove any parallel execution interdependencies that could add unwanted synchronization points or state sharing.


A shared-nothing architecture is a distributed computing concept that represents the notion that each node is independent and self-sufficient. 


These are the main characteristics of the MapReduce framework:

  • Scheduling: Breaks jobs into individual tasks for the map and the reduce portions of the application.
  • Synchronization: Executes multiple processes concurrently and keeps track of tasks and their timings.
  • Code/Data Co-location: Places the code and the data it operates on onto the same node before execution, so that processing happens close to the related data.
  • Fault/Error Handling: Handles errors and faults efficiently across the nodes of a Hadoop cluster.
The following process explains, at a high level, how MapReduce performs its tasks:
  • Input: Start with a large data file containing many records as the input to process.
  • Splitting/Mapping: Split the large input source into multiple pieces of data. Use the map function to extract something of interest. Create master and worker processes and execute the workers remotely. Iterate over the data and create an output list. The MapReduce framework uses key/value pairs (KVPs) as both input and output; the map function extracts the data of interest and presents it in key/value pair format (as in the mapper sketch above).
  • Shuffling: Organize the map output list to optimize it for further processing. Different map tasks work simultaneously, each reading its own piece of the data.
  • Reducing: Apply the reduce function to compute a set of results from the key/value pairs coming out of the mapping process. The reduce function receives a list of key/value pairs as input and returns another list of key/value pairs as output, but the keys of the reduce function are often different from those of the map function (see the reducer sketch after this list).
  • The reduce workers then begin their work. The reduce function is invoked once for every unique key.
  • Produce the final output and transfer control back to the user program. The input key/value pairs are read from stdin (standard input) and the output key/value pairs are written to stdout (standard output).
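As referenced in the reducing step above, here is the matching reduce-side sketch of the same word-count example. Hadoop sorts and groups the map output by key, so the reduce method is invoked once per unique key with all of that key's values; the class name WordCountReducer is again only an illustrative assumption.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: called once per unique key with all of that key's values.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();      // add up the 1s emitted by the mappers
        }
        total.set(sum);
        context.write(key, total);   // emit (word, total count)
    }
}

Note that the output key type here happens to match the map output key, but in general, as noted above, the reduce function may emit keys of a different type than the map function.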

The MapReduce execution process (technical): the following are its main components.

Driver: Initializes a MapReduce job and defines job-specific configuration. The driver can also retrieve the status of the job execution.
Context: Available at any point of MapReduce execution, it provides a convenient mechanism for exchanging required system and job-wide information.
Input Data: The data for a MapReduce task is initially stored here.
InputFormat: This defines how input data is read and split.
InputSplit: Defines a unit of work for a single map task in a MapReduce program.

OutputFormat: Governs the way the job output is written.

RecordReader: The RecordReader class actually reads the data from its source, converts it into key/value pairs suitable for processing by the mapper, and delivers them to the map method.

Mapper: The mapper performs the user-defined work of the first phase of the MapReduce program.


Partition: The Partitioner class determines which reducer a given key/value pair will go to. The default Partitioner computes a hash value for the key and assigns the partition based on this result.
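To make the default behaviour concrete, the sketch below is a custom Partitioner that mimics what the default hash-based partitioner does: hash the key and take the result modulo the number of reduce tasks. The class name WordPartitioner is an assumption made for this illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each key to a reducer based on a hash of the key,
// mirroring the behaviour of the default hash partitioner.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}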


Shuffle: The process of moving map outputs to the reducers is known as shuffling.


RecordWriter: A RecordWriter defines how individual output records are written.


Sort: The set of intermediate key/value pairs for a given reducer is automatically sorted by Hadoop before they are presented to the reducer.
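Tying these components together, the following is a minimal driver sketch for the word-count example above: it initializes the job, wires in the mapper, reducer and partitioner sketched earlier, and declares the InputFormat and OutputFormat that control how records are read and written. Class names such as WordCountDriver are assumptions made for this illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver: initializes the MapReduce job and defines job-specific configuration.
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper, reducer and (optional) custom partitioner.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setPartitionerClass(WordPartitioner.class);

        // InputFormat/OutputFormat define how input is read and output is written.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Types of the final output key/value pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The driver can get back the status of the job execution here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be launched with something like "hadoop jar wordcount.jar WordCountDriver /input/path /output/path", where the jar name and paths are placeholders.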






