List of the Big Data Hadoop Ecosystem/Technology Frameworks
Ambari: Deployment, configuration and monitoring
Avro: Data serialization system
Chukwa: Data collection system for monitoring large distributed systems
Flume: Collection and import of log and event data
HBase: Column-oriented database scaling to billions of rows
HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
HDFS: Distributed redundant file system for Hadoop
Hive: Data warehouse with SQL-like access
Mahout: Library of machine learning and data mining algorithms
MapReduce: Parallel computation on server clusters
Pig: High-level programming language for Hadoop computations
Oozie: Orchestration and workflow management
Sqoop: Imports data from relational databases
Spark: A fast and general compute engine for Hadoop data
Tez: A generalized data-flow programming framework built on Hadoop YARN
Whirr: Cloud-agnostic deployment of clusters
Zookeeper: Configuration management and coordination
Ambari:
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
https://ambari.apache.org/
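As a quick illustration of that REST API, here is a minimal Java sketch that lists the clusters an Ambari server manages. The host, port, and admin credentials are placeholders, so treat this as a sketch rather than a drop-in client.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClustersSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ambari server and credentials; replace with your own.
        URL url = new URL("http://ambari-host:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Ambari's REST API uses HTTP Basic authentication.
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("X-Requested-By", "ambari");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON listing of the clusters Ambari manages
            }
        }
    }
}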
Avro:
Avro is a data serialization system. It provides functionality similar to systems such as Protocol Buffers and Thrift, along with some other significant features: rich data structures, a compact and fast binary data format, a container file for persistent data, an RPC mechanism, and simple integration with dynamic languages. Best of all, Avro can easily be used with MapReduce, Hive, and Pig. Avro uses JSON for defining schemas and data types.
Tip: Use Avro when you want to serialize your big data with good flexibility. (ref: apache.org)
http://avro.apache.org/
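To make that concrete, here is a minimal Java sketch using Avro's generic API: a schema is defined in JSON, a record is built against it, and the record is written to an Avro container file. The field names and file path are just examples.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // Avro schemas are plain JSON; this one describes a simple "User" record.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"clicks\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("clicks", 42);

        // Write it to a compact binary container file that MapReduce, Hive, or Pig can read.
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, new File("users.avro"));
            fileWriter.append(user);
        }
    }
}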
Hadoop:
Hadoop is basically two things: a distributed file system (HDFS), which constitutes Hadoop's storage layer, and a distributed computation framework (MapReduce), which constitutes its processing layer. Go for Hadoop when your data is very large and your needs are offline, batch-oriented; Hadoop is not suitable for real-time work. You set up a Hadoop cluster on a group of commodity machines connected over a network (called a cluster), store huge amounts of data in HDFS, and process that data by writing MapReduce programs (or jobs). Being distributed, HDFS is spread across all the machines in the cluster, and MapReduce processes this scattered data locally on each machine, so you don't have to relocate the gigantic amount of data. (ref: apache.org)
http://hadoop.apache.org
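The classic way to see HDFS and MapReduce working together is the word-count job. The sketch below follows that standard pattern; the input and output HDFS paths are placeholders, and the cluster configuration is picked up from the classpath.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // The mapper runs locally on each block of input and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reducer sums the counts for each word after the shuffle phase.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Example HDFS paths; replace with your own input and output locations.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}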
HBase:
HBase is a distributed, scalable big data store modeled after Google's BigTable. It stores data as key/value pairs. It is basically a database, a NoSQL database, and like any other database its biggest advantage is that it gives you random read/write capabilities. As mentioned earlier, Hadoop is not very good for real-time needs, so you can use HBase for that purpose: if you have data that you want to access in real time, store it in HBase. HBase has its own set of very good APIs that can be used to push and pull data. Not only that, HBase integrates seamlessly with MapReduce, so you can run bulk operations such as indexing and analytics.
Tip: You could use Hadoop as the repository for your static data and HBase as the datastore for data that will probably change over time after some processing. (ref: apache.org)
http://hbase.apache.org/
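To illustrate the random read/write API, here is a minimal Java sketch using the HBase client. The table name, column family, and row key are example values, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

            // Random write: one row keyed by user id, one cell in column family "info".
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_page"), Bytes.toBytes("/home"));
            table.put(put);

            // Random read: fetch the same row back in real time.
            Result result = table.get(new Get(Bytes.toBytes("user123")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_page"));
            System.out.println("last_page = " + Bytes.toString(value));
        }
    }
}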
HCatalog:
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Pig, MapReduce) to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored: RCFile format, text files, SequenceFiles, or ORC files.
HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
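A rough sketch of how a MapReduce job can read through HCatalog's table abstraction instead of raw HDFS paths is shown below. The database and table names are placeholders, and the exact setInput signature can vary between HCatalog versions, so take it as a sketch only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatalogReadSketch {
    // The mapper sees HCatRecord objects regardless of whether the table is stored
    // as RCFile, text, SequenceFile, or ORC underneath.
    public static class PrintMapper
            extends Mapper<WritableComparable, HCatRecord, WritableComparable, HCatRecord> {
        public void map(WritableComparable key, HCatRecord value, Context context) {
            System.out.println(value); // placeholder: inspect columns by position or schema
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcatalog read sketch");
        job.setJarByClass(HCatalogReadSketch.class);
        // Point the job at a Hive/HCatalog table instead of an HDFS path.
        HCatInputFormat.setInput(job, "default", "page_views");
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapperClass(PrintMapper.class);
        job.setNumReduceTasks(0);
        // Map-only job that writes nothing, so no output directory is needed.
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}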
Hive:
Originally developed by Facebook, Hive is basically a data warehouse. It sits on top of your Hadoop cluster and provides an SQL-like interface to the data stored in the cluster. You can then write SQL-ish queries in Hive's query language, called HiveQL, and perform operations such as store, select, join, and much more. It makes processing a lot easier, since you don't have to do lengthy, tedious coding: write simple Hive queries and get the results. RDBMS folks will definitely love it. Simply map HDFS files to Hive tables and start querying the data. Not only that, you can map HBase tables as well and operate on that data.
Tip: Use Hive when you have warehousing needs, are comfortable with SQL, and don't want to write MapReduce jobs. Hive queries are converted into corresponding MapReduce jobs in the background, which run on your cluster and give you the result. Not every problem can be solved using HiveQL; sometimes, if you need really fine-grained and complex processing, you may have to use the MapReduce framework directly. (ref: apache.org)
http://hive.apache.org/
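One common way to run HiveQL from a program is through HiveServer2's JDBC driver. The sketch below issues a simple aggregate query; the server address, credentials, and table name are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the host, port, user, and table are example values.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hadoop", "");
             Statement stmt = con.createStatement()) {

            // A HiveQL query: behind the scenes this is compiled into MapReduce work.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}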
Pig:
Pig was originally developed at Yahoo, which used it extensively. It is a data-flow language that lets you process enormous amounts of data very easily and quickly by repeatedly transforming it in steps. It has two parts: (a) the Pig interpreter and (b) the language, Pig Latin.
Like Hive queries, Pig Latin queries also get converted into MapReduce jobs and give you the result. You can use Pig very conveniently for data stored both in HDFS and HBase. Just like Hive, Pig is really efficient at what it is meant to do: it saves a lot of effort and time by letting you skip writing MapReduce programs and do the operation through straightforward Pig queries.
Tip: Use Pig when you want to do a lot of transformations on your data and don't want to take the pain of writing MapReduce jobs. (ref: apache.org)
http://pig.apache.org/
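Pig Latin scripts are usually run from the Grunt shell or a script file, but they can also be embedded in Java through PigServer. The sketch below groups log records by page and counts them; the input path and field names are made up.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedSketch {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE submits to the cluster; LOCAL is handy for trying things out.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each statement is one transformation step; together they form the data flow.
        pig.registerQuery("logs = LOAD 'input/access_log.txt' USING PigStorage('\\t') "
                + "AS (uid:chararray, page:chararray);");
        pig.registerQuery("by_page = GROUP logs BY page;");
        pig.registerQuery("counts = FOREACH by_page GENERATE group AS page, COUNT(logs) AS hits;");

        // Materialize the result; in MapReduce mode these steps compile to MapReduce jobs.
        pig.store("counts", "output/page_counts");
    }
}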
Sqoop:
Sqoop is a tool that lets you transfer data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase, and Sqoop can likewise export data from the cluster back into a relational database.
Tip: Use Sqoop when you have lots of legacy data that you want stored and processed on your Hadoop cluster, or when you want to incrementally add data to your existing storage. (ref: apache.org)
http://sqoop.apache.org/
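Sqoop is normally driven from its command line. Purely as an illustration, the sketch below shells out to the sqoop binary (assumed to be on the PATH) with a hypothetical MySQL connection string and an incremental import keyed on an id column.

import java.util.Arrays;
import java.util.List;

public class SqoopImportSketch {
    public static void main(String[] args) throws Exception {
        // Example-only connection string, table, and paths; adjust for your database.
        List<String> command = Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:mysql://db-host/sales",
                "--username", "etl_user",
                "--table", "orders",
                "--target-dir", "/user/hadoop/orders",
                // Incremental mode: only rows with id greater than the last imported value.
                "--incremental", "append",
                "--check-column", "id");

        Process process = new ProcessBuilder(command)
                .inheritIO() // stream Sqoop's own progress output to the console
                .start();
        System.exit(process.waitFor());
    }
}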
Oozie:
Now you have everything in place and want to do the processing, but starting the jobs and managing the workflow manually all the time is crazy, especially when you need to chain multiple MapReduce jobs together to achieve a goal. You would like some way to automate all of this. No worries, Oozie comes to the rescue. It is a scalable, reliable, and extensible workflow scheduler system. You just define your workflows (which are directed acyclic graphs) once and the rest is taken care of by Oozie. You can schedule MapReduce jobs, Pig jobs, Hive jobs, Sqoop imports, and even your own Java programs using Oozie.
Tip: Use Oozie when you have a lot of jobs to run and want an efficient way to automate everything based on time (frequency) and data availability. (ref: apache.org)
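Besides the workflow.xml definition itself, Oozie exposes a Java client for submitting and monitoring jobs. This sketch kicks off a workflow application that is already deployed to HDFS; the server URL, HDFS paths, and property names are placeholders.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Example Oozie server URL; 11000 is the usual default port.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties: where the workflow app (workflow.xml) lives, plus values it references.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hadoop/apps/etl-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("queueName", "default");

        // Submit and start the workflow, then check its status.
        String jobId = oozie.run(conf);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Submitted " + jobId + ", status: " + job.getStatus());
    }
}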
Flume / Chukwa:
Both Flume and Chukwa are data aggregation tools that let you aggregate data in an efficient, reliable, and distributed manner. You can pick up data from some place and dump it into your cluster. Since you are handling big data, it makes more sense to do this in a distributed and parallel fashion, which both of these tools are very good at. You just define your flows and feed them to these tools, and the rest is done automatically by them.
Tip: Go for Flume/Chukwa when you have to aggregate huge amounts of data into your Hadoop environment in a distributed and parallel manner. (ref: apache.org)
https://chukwa.apache.org/
https://flume.apache.org/
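For Flume specifically, applications usually push data into a flow through an Avro source. The sketch below uses Flume's RPC client API to send a single event; the host, port, and message are placeholders, and a matching Avro source must be listening on the agent.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcSketch {
    public static void main(String[] args) throws Exception {
        // Connects to a Flume agent whose Avro source listens on this host/port (example values).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            // One event; in practice you would batch many log lines with appendBatch().
            Event event = EventBuilder.withBody("127.0.0.1 GET /index.html", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}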