June 09 2015
Top 50 Big Data and Hadoop Interview Questions!
Published by BIWHIZ team
- What is the difference between Big Data and Hadoop ?
Big Data is an umbrella term for data sets too large or varied for traditional systems to handle, together with the techniques used to process them. Hadoop is one Big Data tool (or software framework). It can be understood with an E-Commerce and PHP analogy: E-Commerce is a concept which can be implemented in PHP or other languages; similarly, Big Data is the concept and Hadoop is one implementation of it.
- Can we query the files on Hadoop similar to how we query a RDBMS table?
Yes, by using HIVE.
- Is it necessary in Hadoop that we create all programs in Java?
No. Hadoop Streaming allows mappers and reducers to be written in other languages, and tools such as HIVE and PIG avoid Java programming altogether.
- What is the best quality of Hadoop if we want to use it for File storage purpose?
Hadoop does not require us to define a schema before storing data on it. We can simply dump a lot of files on Hadoop and define a schema only when we want to read the data (schema-on-read). Data Lakes work on this concept.
- What are some of the core features of Hadoop?
- Can process huge amounts of data by breaking incoming files into smaller blocks, storing those blocks on multiple machines and processing all of them in parallel.
- Can be used as a huge storage place.
- Can process structured, semi-structured and un-structured data.
- Can be deployed on commodity hardware, hence it is cheaper.
- What is the difference between HIVE, PIG and MapReduce Java Programs?
HIVE provides a query language similar to SQL with which we can query the set of files stored on HDFS. PIG provides a scripting language which can be used to transform data. Both HIVE and PIG scripts are converted into Java MapReduce programs before they are submitted to Hadoop for processing. Java MapReduce programs can be written directly to customize input formats or to use functions available only in Java.
- What are the Data extraction tools in Hadoop?
Sqoop can be used to transfer data between an RDBMS and HDFS. Flume can be used to extract streaming data from social media, web logs, etc., and store it on HDFS.
- What is Hadoop framework?
Hadoop is an open-source framework written in Java under the Apache Software Foundation. It is a top-level Apache project with many sub-projects like HDFS, PIG, HIVE, etc. It is designed to process large amounts of data: it distributes tasks over a large cluster of thousands of computers (nodes) and runs them in parallel. The architecture which fuels this large-scale parallel processing is called the MapReduce architecture. Hadoop processes data reliably and in a fault-tolerant manner.
- On What concept the Hadoop framework works?
It works on MapReduce. This architecture was described by Google in a research paper published in 2004.
- What is MapReduce and why is it better than earlier MPP architecture?
MapReduce is a programming model for processing huge amounts of data in parallel. It divides a given task into Map and Reduce phases, runs many map tasks in parallel and sends their output to Reduce tasks, where the outputs from all map tasks are aggregated and returned to the client application. MapReduce works on the data-locality principle (moving computation to where the data is stored), hence it scales better than the earlier MPP (Massively Parallel Processing) architecture.
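The two phases can be illustrated with a toy word-count simulation in plain Python. This is a conceptual sketch of the model only, not the Hadoop API; in a real cluster the map output would be shuffled over the network to many parallel reduce tasks.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: aggregate all counts emitted for each key."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data big hadoop", "hadoop big"]
result = reduce_phase(map_phase(lines))
# result == {"big": 3, "data": 1, "hadoop": 2}
```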
- Name the most common Input Formats defined in Hadoop?
The most common input formats defined in Hadoop are TextInputFormat, KeyValueInputFormat and SequenceFileInputFormat.
- What is the difference between the TextInputFormat and KeyValueInputFormat classes?
TextInputFormat: reads lines of text files and provides the byte offset of each line as the key and the actual line as the value to the Mapper. KeyValueInputFormat: reads text files and parses each line into a (key, value) pair. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value.
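The difference in how each format derives its (key, value) pair can be sketched in a couple of Python functions; this mimics the splitting rules described above, not the actual Hadoop classes.

```python
def text_input_record(offset, line):
    # TextInputFormat rule: key = byte offset of the line, value = whole line
    return (offset, line)

def key_value_input_record(line):
    # KeyValueInputFormat rule: key = text up to the first tab,
    # value = remainder of the line (which may itself contain tabs)
    key, _, value = line.partition("\t")
    return (key, value)

key_value_input_record("user42\tclicked\tbutton")  # → ("user42", "clicked\tbutton")
```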
- What is InputSplit in Hadoop?
InputSplit is a logical unit of work, not of storage: it describes the data that one Mapper will process, and a Mapper is assigned to each InputSplit. HDFS breaks files into blocks and distributes these blocks to multiple nodes. A Mapper normally processes the data of one block; however, some records may spill over from one block into the next, and HDFS has to fetch this extra data from the other blocks during processing. All of this data together is called an InputSplit.
- How is file splitting done in the Hadoop framework?
File splitting is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (e.g. the FileInputFormat class).
- What is the purpose of Record Reader in Hadoop?
The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs which are then read by the Mapper. The RecordReader instance is defined by the InputFormat.
- What is Combiner?
The Combiner is a mini-reduce process which operates only on data generated by Mappers. The Combiner receives the input data emitted by the Mappers on a given node and combines it before it is sent to the Reducers. This reduces the amount of data transferred from Mapper to Reducer over the network.
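The saving can be shown with a toy Python simulation of local pre-aggregation (the word-count combiner case); this illustrates the idea only and is not the Hadoop Combiner API.

```python
from collections import defaultdict

def combine(node_output):
    """Mini-reduce on one node: pre-aggregate (word, count) pairs locally
    before they are sent over the network to the Reducers."""
    partial = defaultdict(int)
    for word, count in node_output:
        partial[word] += count
    return list(partial.items())

# Ten map-output records on one node shrink to two pairs before
# hitting the network.
node_output = [("big", 1)] * 7 + [("data", 1)] * 3
combined = combine(node_output)
assert len(node_output) == 10 and len(combined) == 2
```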
- What is a JobTracker in Hadoop 1.0?
JobTracker is the service within Hadoop that accepts requests for running MapReduce jobs and allocates those to various nodes.
- What is a TaskTracker in Hadoop 1.0?
TaskTracker is a daemon running on cluster nodes that performs Map, Reduce and Shuffle tasks assigned to it by the JobTracker.
- What is the difference between Hadoop 1.0 and Hadoop 2.0?
In Hadoop 2.0, the MapReduce resource-management architecture is replaced with YARN. There is no JobTracker and TaskTracker any more; instead we have the ResourceManager, NodeManager and ApplicationMaster.
- What is Speculative execution and how does it work in Hadoop?
Files are broken into blocks and replicated to multiple nodes in a Hadoop cluster, so the same data is available in several places. When Hadoop notices that a task is running much more slowly than the others (a straggler, e.g. on a slow or overloaded node), it launches a duplicate copy of that task on another node which holds a replica of the same data block. Running these redundant copies of slow tasks, so that a single slow machine does not delay the entire job, is called Speculative Execution.
When a task completes, it announces this fact to the JobTracker. Whichever copy of the task finishes first becomes the definitive one. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their output. The Reducer then receives its inputs from whichever Mapper completed successfully first.
- What is Hadoop Streaming?
Streaming is a generic API that allows programs written in languages other than Java to be used for Hadoop MapReduce implementations.
- What is DistributedCache in Hadoop?
If you want some files to be shared across all nodes in the Hadoop cluster, you can define them in the DistributedCache. The DistributedCache is configured through the Job configuration and distributes the read-only data files to all nodes in the cluster.
- What is the benefit of Distributed cache? Why can't we just have the file in HDFS and have the application read it?
Because reading data from the DistributedCache is much faster. Typically parameter or configuration files are shared through the DistributedCache. If every Mapper had to load these files again and again from HDFS, it would take a lot of time; it is better to give every node a local read-only copy.
- What is a counter in Hadoop?
Counters are global variables which can be used for counting purposes. Since a MapReduce job runs many Mappers in parallel, each in its own process, we cannot use an ordinary variable inside the map code: its value would be duplicated on each Mapper. Counters are therefore defined at the job (global) level and aggregated by the framework across all tasks.
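The aggregation idea can be sketched in Python: each simulated mapper counts locally, and a "framework" step sums the per-task counters into one job-level view. This is a conceptual illustration, not the Hadoop Counters API; the counter name `BAD_RECORDS` is an invented example.

```python
from collections import Counter

def run_mapper(records):
    """Each simulated mapper counts locally and reports its counters."""
    local = Counter()
    for r in records:
        if not r.strip():                # treat blank records as bad
            local["BAD_RECORDS"] += 1    # hypothetical counter name
    return local

# The framework aggregates the counters from every task into one
# global, job-level view.
splits = [["a", "", "b"], ["", "", "c"]]
job_counters = sum((run_mapper(s) for s in splits), Counter())
# job_counters["BAD_RECORDS"] == 3
```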
- Is it possible to provide multiple inputs to Hadoop? If yes then how can you give multiple directories as input to Hadoop Job?
Yes. For example, FileInputFormat.addInputPath() can be called multiple times on a job to add several directories as input to a Hadoop job.
- What interfaces need to be implemented to create a Mapper and Reducer for Hadoop?
In the old API, the org.apache.hadoop.mapred.Mapper and org.apache.hadoop.mapred.Reducer interfaces must be implemented. In the new API (org.apache.hadoop.mapreduce), we extend the Mapper and Reducer base classes instead.
- What are the methods in Mapper Interface?
The Mapper contains the run() method, which calls its own setup() method only once, then calls the map() method for each input record, and finally calls the cleanup() method. All of these methods can be overridden in our code.
- What happens if we do not override the Mapper methods and keep them as it is?
If we do not override any of these methods (leaving even map() as-is), the Mapper works as an identity function, emitting each input record unchanged, one by one.
- What is the use of Context Objects?
The Context object allows the Mapper to interact with the rest of the Hadoop system. It includes the configuration data for the job as well as the interfaces which allow the Mapper to emit output.
- How can you add arbitrary key/value pairs in your mapper?
You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with Context.getConfiguration().get("myKey"). This kind of functionality is typically done in the Mapper's setup() method.
- How does the Mapper's run() method work?
The Mapper.run() method calls map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit of a particular task. The input key and value are passed in as KeyInType and ValInType, and the output (key, value) pairs are emitted through the Context object.
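The lifecycle described above can be sketched as a small Python class. This mirrors the setup/map/cleanup sequence and the default identity behaviour, but it is only a sketch, not the real Hadoop Mapper; here the "context" is simplified to a plain list that collects emitted pairs.

```python
class Mapper:
    """Python sketch of the Hadoop Mapper lifecycle (not the real API)."""

    def setup(self, context):            # called once, before any input
        pass

    def map(self, key, value, context):  # called once per (key, value) pair
        context.append((key, value))     # identity behaviour by default

    def cleanup(self, context):          # called once, after all input
        pass

    def run(self, split, context):
        self.setup(context)
        for key, value in split:
            self.map(key, value, context)
        self.cleanup(context)

out = []
Mapper().run([(0, "first line"), (11, "second line")], out)
# out == [(0, "first line"), (11, "second line")]
```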
- What is next after Mapper or MapTask?
The outputs of the Mapper are sorted and partitioned. The number of partitions is equal to the number of Reducers.
- How can we ensure that a particular key goes to a specific Reducer?
Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
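The idea can be sketched in Python: the default behaviour hashes the key modulo the number of reducers, and a custom partitioner overrides that rule. This mimics Hadoop's HashPartitioner concept but is not the Hadoop API; the "route error keys to reducer 0" rule is an invented example.

```python
def default_partition(key, num_reducers):
    """Hash partitioning, analogous in spirit to Hadoop's HashPartitioner."""
    return hash(key) % num_reducers

def custom_partition(key, num_reducers):
    """A custom rule (hypothetical): send every key starting with 'err'
    to reducer 0, and spread all other keys over the remaining reducers."""
    if key.startswith("err"):
        return 0
    return 1 + hash(key) % (num_reducers - 1)

assert custom_partition("error42", 4) == 0     # always reducer 0
assert 1 <= custom_partition("okay", 4) <= 3   # never reducer 0
```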
- What is reducer used for?
Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values. The number of Reducers for a job can be set using Job.setNumReduceTasks(int).
- What are the Primary Phases of the Reducer?
Shuffle, Sort and Reduce
- Explain the Shuffle?
In the Shuffle phase, each Reducer fetches the relevant partition of the sorted output of every Mapper over HTTP, so that all values for a given key end up on the same Reducer.
- Explain the Reducer's Sort and Reduce phases?
The fetched data is sorted by key for each Reducer and then reduced (aggregated) by the reduce function. The number of output files is equal to the number of Reducers.
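The three phases can be chained in one short Python sketch: gather pairs from all mappers (shuffle), order them by key (sort), then aggregate each key group (reduce). Again this is a conceptual simulation, not the Hadoop implementation.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_sort_reduce(mapper_outputs):
    """Gather pairs from every mapper, sort by key, then sum each group."""
    pulled = [pair for output in mapper_outputs for pair in output]  # shuffle
    pulled.sort(key=itemgetter(0))                                   # sort
    return {key: sum(v for _, v in group)                            # reduce
            for key, group in groupby(pulled, key=itemgetter(0))}

result = shuffle_sort_reduce([[("b", 1), ("a", 1)], [("a", 2)]])
# result == {"a": 3, "b": 1}
```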
- What are two main parts of the Hadoop framework?
Hadoop consists of two main parts:
1. Hadoop distributed file system, a distributed file system with high throughput,
2. Hadoop MapReduce, a software framework for processing large data sets.
- How many instances of JobTracker can run on a Hadoop Cluster?
Only one. A Hadoop 1.0 cluster runs a single JobTracker, which is why it is a single point of failure.
- What is JobTracker and What it performs in a Hadoop Cluster?
JobTracker is a daemon service which submits and tracks the MapReduce tasks in a Hadoop cluster. It runs in its own JVM process, usually on a separate machine. The JobTracker was a single point of failure in Hadoop 1.0.
- Explain the use of TaskTracker in the Hadoop Cluster?
A TaskTracker is a slave-node daemon which accepts and executes tasks from the JobTracker. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set number of slots, which indicates the number of tasks it can accept. The TaskTracker starts a separate JVM process (called a Task Instance) to do the actual work; this ensures that a task failure does not take down the TaskTracker itself. The TaskTracker monitors these task instances, capturing their output and exit codes. When a task instance finishes, successfully or not, the TaskTracker notifies the JobTracker. TaskTrackers also send heartbeat messages to the JobTracker, usually every few seconds, to reassure it that they are still alive; these messages also tell the JobTracker how many slots are available.
- What do you mean by TaskInstance?
TaskInstances are the actual Map and Reduce tasks which run in their own JVMs on each slave node.
- How Many Daemon processes run on a Hadoop 1.0 Cluster?
Hadoop 1.0 comprises five separate daemons, each running in its own JVM.
The following 3 daemons run on master nodes:
1. NameNode - stores and maintains the metadata for HDFS.
2. Secondary NameNode - performs housekeeping functions for the NameNode.
3. JobTracker - manages MapReduce jobs and distributes individual tasks to the machines running TaskTrackers.
The following 2 daemons run on each slave node:
1. DataNode - stores the actual HDFS data blocks.
2. TaskTracker - responsible for instantiating and monitoring individual Map and Reduce tasks.
- What is the maximum number of JVMs that can run on a slave node?
One or multiple TaskInstances can run on each slave node, and each task instance runs as a separate JVM process. The number of TaskInstances can be controlled by configuration.
- What is Network File System (NFS) ?
It is a kind of file system where the data resides on one centralized machine and all cluster members read and write data from it. For large-scale parallel processing, HDFS's way of storing data is more efficient.
- How HDFS differs from NFS?
In HDFS, data blocks are distributed across the local drives of all nodes, whereas in NFS the data is stored on a dedicated machine. HDFS is designed to work with the MapReduce system, since computation is moved to where the data is stored; NFS is not suitable for MapReduce. HDFS runs on a cluster of machines and provides redundancy through its replication protocol; NFS is served by a single machine and therefore provides no data redundancy.
- How does a NameNode handle the failure of the data nodes?
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. The NameNode and DataNode are pieces of software designed to run on commodity machines. The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat from a DataNode for a certain amount of time, that DataNode is marked as dead. Since its blocks are now under-replicated, the system begins replicating the blocks that were stored on the dead DataNode to other DataNodes. This replication transfer happens directly between DataNodes; the data never passes through the NameNode.
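The dead-node detection step can be sketched in a few lines of Python: the NameNode keeps the time of each DataNode's last heartbeat and marks any node silent for too long as dead. This is a toy illustration; the timeout value below is an arbitrary example, not HDFS's actual default.

```python
DEAD_AFTER = 30  # seconds of silence before a node is marked dead (example value)

def find_dead_nodes(last_heartbeat, now):
    """NameNode-style check: nodes whose last heartbeat is too old are dead."""
    return {node for node, t in last_heartbeat.items() if now - t > DEAD_AFTER}

# dn1 and dn2 reported recently; dn3 last reported at t=60, and
# 110 - 60 = 50 > 30 seconds, so dn3 is marked dead.
heartbeats = {"dn1": 100, "dn2": 95, "dn3": 60}
dead = find_dead_nodes(heartbeats, now=110)
assert dead == {"dn3"}
```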
- Where is the Mapper's intermediate data stored?
The mapper output (intermediate data) is stored on the Local file system (NOT HDFS) of each individual mapper nodes. This is typically a temporary directory location which can be setup in configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop Job completes.
- What is the Hadoop MapReduce API contract for a key and value class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface. The value must implement the org.apache.hadoop.io.Writable interface.
- What are the IdentityMapper and IdentityReducer in MapReduce?
org.apache.hadoop.mapred.lib.IdentityMapper: implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default. org.apache.hadoop.mapred.lib.IdentityReducer: performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.