
Saturday, February 13, 2016

The modes Hadoop can be run in...

Hadoop can run in three modes:
1. Standalone Mode: 
Default mode of Hadoop; it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml, core-site.xml, and hdfs-site.xml files. It is much faster than the other modes.

2. Pseudo-Distributed Mode (Single Node Cluster): 
In this mode you need to configure all three files mentioned above. All daemons run on a single node, so the Master and the Slave node are the same machine (a sample configuration is sketched after this list).

3. Fully Distributed Mode (Multi-Node Cluster): 
This is the production mode of Hadoop (what Hadoop is known for), where data is distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
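
For reference, here is a minimal sketch of the kind of overrides Pseudo-Distributed Mode needs in core-site.xml and hdfs-site.xml (the property names follow Hadoop 2.x; the localhost address, port 9000 and replication factor of 1 are assumptions for a generic single-node setup, not values from any particular cluster):

<!-- core-site.xml: point the default file system at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node cannot hold multiple replicas, so use a replication factor of 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

A corresponding edit to mapred-site.xml configures the MapReduce framework settings for the single-node cluster.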

Tuesday, February 2, 2016

MapReduce framework (WIP)



Hadoop MapReduce:
It is a batch-oriented, distributed programming framework for large-scale Big Data processing on Hadoop. The approach is inspired by the map and reduce functions of functional programming (though it is not the same as functional programming). The MapReduce framework performs its work on the nodes of an HDFS cluster hosted on racks of commodity servers. MapReduce works with Hadoop and natively uses the same language: map and reduce functions can be written in Java, since Hadoop itself is written in Java.



MapReduce decomposes work submitted by a client into small parallelized map and reduce workers. The map and reduce constructs used in MapReduce are borrowed from those found in the Lisp functional programming language, and use a shared-nothing model to remove any parallel-execution interdependencies that could add unwanted synchronization points or state sharing.


A shared-nothing architecture is a distributed computing concept in which each node is independent and self-sufficient. 


These are the MapReduce framework's main characteristics:

  • Scheduling: Breaks jobs into individual tasks for the map and the reduce portions of the application.
  • Synchronization: Executes multiple processes concurrently by keeping track of tasks and their timings.
  • Code/Data Co-location: Places the code and its related data on the same node prior to execution, so that processing happens where the data lives.
  • Fault/Error Handling: Handles errors and faults efficiently across the nodes in a Hadoop cluster.
The following process explains, at a high level, how MapReduce performs its tasks:
  • Input: Start with a large data file containing many records as the input to process.
  • Splitting/Mapping: Split the large input source into multiple pieces of data and use the map function to extract something of interest. The framework creates a master and workers and executes the worker processes remotely. Each map task iterates over its piece of the data and produces an output list. The MapReduce framework uses key/value pairs (KVPs) as input and output; the map function extracts the data of interest and presents it in key/value pair format.
  • Shuffling/Sorting: Organize the map output to optimize it for further processing. Different map tasks work simultaneously, each reading its own piece of the data.
  • Reducing: Apply the reduce function to compute a set of results from the key/value pairs coming out of the mapping process. The reduce function receives this list of key/value pairs as input and returns another list of key/value pairs as output; the keys of the reduce function are often different from those of the map function (a minimal Java sketch of a map and reduce pair follows this list).
  • The reduce workers then begin their work; the reduce function is invoked once for every unique key.
  • Output: Produce the final output and transfer control back to the user program. (When Hadoop Streaming is used, the input key/value pairs are read from stdin, standard input, and the output key/value pairs are written to stdout, standard output.)
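
To make the key/value flow above concrete, here is a minimal sketch of a word-count style map and reduce pair written against the Hadoop Java API (the WordCount, WordCountMapper and WordCountReducer names and the whitespace tokenization are illustrative assumptions, not part of any particular application):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    /* Map: for each input record (byte offset, line of text), emit a (word, 1) pair per word. */
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // key/value pair leaving the map phase
                }
            }
        }
    }

    /* Reduce: invoked once for every unique key with all of its values; emit (word, total). */
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));   // key/value pair leaving the reduce phase
        }
    }
}

For an input line such as "to be or not to be", the map phase would emit (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); after shuffling and sorting, the reducer is invoked once per unique key and emits (be,2), (not,1), (or,1), (to,2).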

The MapReduce execution process (technical): The following are the main components of the MapReduce execution process:

• Driver: It initializes a MapReduce job and defines job-specific configurations. The driver can also get back the status of the job execution.
• Context: Available at any point of MapReduce execution, it provides a convenient mechanism for exchanging required system and job-wide information.
• Input Data: The data for a MapReduce task is initially stored here.
• InputFormat: This defines how input data is read and split.
• InputSplit: Defines a unit of work for a single map task in a MapReduce program.
• OutputFormat: OutputFormat governs the way the job output is written.
• RecordReader: The RecordReader class actually reads the data from its source, converts it into key/value pairs suitable for processing by the mapper, and delivers them to the map method.
• Mapper: The mapper performs the user-defined work of the first phase of the MapReduce program.
• Partitioner: The Partitioner class determines which reducer a given key/value pair will go to. The default Partitioner computes a hash value for the key and assigns the partition based on this result.
• Shuffle: The process of moving map outputs to the reducers is known as shuffling.
• RecordWriter: A RecordWriter defines how individual output records are written.
• Sort: The set of intermediate key/value pairs for a given reducer is automatically sorted by Hadoop before they are presented to the reducer.
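
To show where the Driver, InputFormat, OutputFormat and Partitioner from the list above plug in, here is a minimal sketch of a driver for the word-count classes sketched earlier (WordCountDriver is an illustrative name; TextInputFormat, TextOutputFormat and HashPartitioner are Hadoop's defaults, set explicitly here only to make the wiring visible):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

/* Driver: initializes the MapReduce job and defines job-specific configuration. */
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and Reducer from the earlier sketch
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);

        // InputFormat/OutputFormat: how input is read and split, and how output is written
        job.setInputFormatClass(TextInputFormat.class);    // Hadoop default, shown for clarity
        job.setOutputFormatClass(TextOutputFormat.class);  // Hadoop default, shown for clarity

        // Partitioner: decides which reducer each intermediate key/value pair goes to
        job.setPartitionerClass(HashPartitioner.class);    // Hadoop default, shown for clarity

        // Types of the output key/value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input data location and output folder, taken from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The driver can also get back the status of the job execution
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The RecordReader and RecordWriter come from the chosen InputFormat and OutputFormat, so they do not need to be configured separately.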







Thursday, January 14, 2016

Apache Pig Scripts and examples 1



Apache Pig Scripts and examples:
I am listing a few Pig scripting examples with their results, to help build a better understanding of Apache Pig coding.

1. Total record count with Apache Pig script:
In this example we read a plain text file, produce the total record count as output on the console, and also store the output on disk.

@ Source data file details:
Name: filesgrads-2.txt
Delimiter: tab
File Header: CDS_CODE ETHNIC GENDER GRADS UC_GRADS YEAR
Total records: 20,747

File data sample:

...


@ Running Pig grunt shell in local mode:
Execute the following at the Unix prompt:

$ pig -x local



@ Output criteria:
Finding the total record count in the file.

@ Pig script:
Please execute the following script at the grunt shell prompt and review the final output.

/* load the data file into a Pig relation (reference variable) */
data = load '/Users/Neo/Downloads/filesgrads-2.txt' using PigStorage('\t');

/* group all records of the file into a single group and store in a Pig relation */
grp_stud_all = GROUP data ALL;

/* apply the COUNT function to the grouped data and store in a Pig relation */
total_stud_count = FOREACH grp_stud_all GENERATE COUNT(data);

/* store or display the final output on the console */
dump total_stud_count;

@ Final output review:
Please note that with the dump statement, the output is displayed on the console of the grunt shell.
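
Since the source file is described above as having 20,747 records, the dump should print a single tuple along the lines of the following (if the header line is loaded as an ordinary record, the count will be one higher):

(20747)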



@ Storing the results:
To have the output written to a file on disk, we use the store command with an appropriate folder location. Please issue the following at the grunt prompt.

grunt> store total_stud_count into '/Users/xxx/tot_stud_recs.txt';

You can review the console messages to see whether the command was successful or not.



@ Review results from the stored file:
Please note that the above command creates a folder named tot_stud_recs.txt; the actual output is stored inside it in a file called part-r-00000, and not in a file named tot_stud_recs.txt.



Thanks!

@ Reference/s:

  • https://pig.apache.org
  • http://pig.apache.org/docs/r0.14.0/


Wednesday, January 6, 2016

Cloudera Training VM on VirtualBox Quick installation

Cloudera Training VM on VirtualBox Quick installation:    

This post is for VirtualBox-5.0.12 and the Cloudera Training VM cloudera-quickstart-vm-4.4.0. I believe the steps will be the same, or only minor additional steps will be required, for future versions of the VM and VirtualBox.
You need the following:
  1. The latest VirtualBox software: Visit www.virtualbox.org, download it if you do not have it already, and install it on your local system.
  2. The latest Cloudera Training VM: Visit http://www.cloudera.com/content/www/en-us/downloads.html. The Cloudera Training VM typically comes as a bundled archive which needs to be extracted into a directory on the local disk.
  3. Unbundle the VM from the command line:
    1. On a Unix-based OS (e.g. Linux or OS X), run the extraction command that matches the archive type; for the .7z bundle named here that would be:
7z x cloudera-quickstart-vm-4.4.0-1-vmware.7z
(tar xjf would be the equivalent for a .tar.bz2 bundle.)
    2. Alternatively, a GUI-based extractor can be used, if available. The archive extracts into a directory containing 3 files.
4. Start VirtualBox.
5. Press the New button in the icon bar at the top.
6. Configure the new VirtualBox VM as follows:
  • Name: any reasonable name indicating it is a Cloudera Training VM and its version.
  • Type: Linux
  • Version: Red Hat (64 bit) (Note: the VM runs CentOS, which is a Red Hat derivative)
  • Memory Size: for basic operations 1024 MB is enough; for a full application development cycle it can be expanded, depending on the additional memory available on your system.
  • Hard disk: We have already extracted the Training VM into a local directory, so choose the option “Use an existing virtual hard disk file” and select the location of the .vmdk file.
  • Press the Create button to create the VM. You are all set here for basic operations.

7. Once the VM is created, the Oracle VM VirtualBox Manager will display it in the VM list.
 
8. Now you are ready to start the VM. Press the Start button and VirtualBox will boot into CentOS.


9. Once the booting process finishes, you will see the hosted desktop of the Cloudera Training VM.

Monday, July 20, 2015

BigData Hadoop Frameworks (technology ecosystem)




List of the Big Data Hadoop Ecosystem/technology Frameworks


Ambari:            Deployment, configuration and monitoring

Avro:                Data serialization system
Chukwa:          Data collection system for monitoring large distributed systems
Flume:             Collection and import of log and event data
HBase:            Column-oriented database scaling to billions of rows
HCatalog:        Schema and data type sharing over Pig, Hive and MapReduce
HDFS:             Distributed redundant file system for Hadoop
Hive:                Data warehouse with SQL-like access
Mahout:           Library of machine learning and data mining algorithms
MapReduce:    Parallel computation on server clusters 
Pig:                  High-level programming language for Hadoop computations
Oozie:              Orchestration and workflow management
Sqoop:             Imports data from relational databases
Spark:              Fast and general compute engine for Hadoop data
Tez:                 Generalized data-flow programming framework built on Hadoop YARN
Whirr:              Cloud-agnostic deployment of clusters
Zookeeper:      Configuration management and coordination