Published by BIWHIZ team
While much of the IT community is still awestruck by Hadoop, we thought we would bring you the fastest possible crash course on it and introduce some common tools and techniques.
And not at the expense of your precious time or a huge fee: quick, simple, and free. We hope you enjoy it!
Hadoop is the name of the umbrella project: open-source software for reliable, scalable, distributed computing is developed under it.
Think of it as a common platform for development.
Common utilities (set of code libraries) available to all Hadoop tools.
Hadoop Distributed File System (HDFS)
A distributed file system that provides high-throughput access to application data. Basically a file system for Hadoop tools.
A system for parallel processing of large data sets. In simple terms, it does the processing in two steps. Map: break the problem into smaller subsets and process them in a distributed environment. Reduce: collect the outputs of the Map step and combine them into one final output.
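The two steps above can be sketched with the classic word-count example. This is a minimal, single-machine illustration of the idea, not Hadoop's actual Java API; real Hadoop runs the map and reduce tasks across a cluster, with the framework handling the shuffle between them.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into one final output."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # "big" appears three times across the documents
```

The key point is that the Map and Reduce functions never see the whole data set at once, which is what lets Hadoop spread the work over many machines.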
A framework for job scheduling and cluster resource management.
A scalable, distributed database that supports structured data storage for large tables.
A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig & Pig Latin
A high-level data-flow language and execution framework for parallel computation. Its language, Pig Latin, is used to write programs for large data sets stored in Hadoop.
A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Apache Pig, Apache MapReduce, and Apache Hive) to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
A scalable machine learning and data mining library.
A scalable multi-master database with no single points of failure. Cassandra is essentially a hybrid between a key-value and a column-oriented (or tabular) database.
Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key.
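That key structure can be illustrated with a small in-memory sketch. The class and names below are hypothetical, not Cassandra's API; the point is just how the partition key selects a partition while the clustering columns keep rows sorted inside it.

```python
from bisect import insort

class PartitionedTable:
    """Toy model of Cassandra's row layout: a map from partition key to a
    list of rows kept sorted by the clustering key."""

    def __init__(self):
        self.partitions = {}  # partition key -> sorted list of (clustering key, row)

    def insert(self, partition_key, clustering_key, row):
        insort(self.partitions.setdefault(partition_key, []), (clustering_key, row))

    def scan(self, partition_key):
        """Read one partition; rows come back in clustering order."""
        return [row for _, row in self.partitions.get(partition_key, [])]

events = PartitionedTable()
# Hypothetical primary key (user_id, event_time): user_id is the partition
# key, event_time is the clustering column.
events.insert("alice", "2014-01-02", {"action": "logout"})
events.insert("alice", "2014-01-01", {"action": "login"})
events.insert("bob",   "2014-01-01", {"action": "login"})
print(events.scan("alice"))  # alice's rows, ordered by event_time
```

Because all rows for one partition key live together and stay sorted, reading a single user's history is a cheap, ordered scan, which is the access pattern Cassandra is optimized for.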
A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Spark runs MapReduce-style programs faster than Hadoop MapReduce and supports application development in Scala and Python as well as Java.
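Spark's expressiveness comes from chaining transformations over a collection instead of writing separate Map and Reduce classes. The plain-Python sketch below mimics that style locally; it is not the PySpark API, just an illustration of the programming model.

```python
from functools import reduce

# In Spark this pipeline would run over a distributed collection (an RDD);
# here the same chain of transformations runs on a local list.
data = [1, 2, 3, 4, 5]

# Transformation chain: keep even numbers, square them.
squared_evens = list(map(lambda x: x * x,
                         filter(lambda x: x % 2 == 0, data)))

# Final aggregation, analogous to a reduce action in Spark.
total = reduce(lambda a, b: a + b, squared_evens)

print(squared_evens)  # [4, 16]
print(total)          # 20
```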
A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Ambari also provides a dashboard for viewing cluster health (for example, heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.