MapReduce interview questions
Hey! If you are looking for MapReduce interview questions, you have landed in the right place. This article is aimed at anyone pursuing job opportunities around Hadoop and MapReduce. It collects frequently asked MapReduce interview questions curated by experts, and by the end you should understand the important concepts behind them. Let's get started!
Q1) What do you know about MapReduce and how does it work?
Ans: Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step process: map and reduce. The input data is first divided into splits, which are analyzed by map tasks running in parallel across the cluster. In a word-count job, for example, the map phase counts the words in each document, while the reduce phase aggregates those counts across the entire collection of documents.
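The word-count flow can be sketched in plain Python. This is only an illustration of the map and reduce steps, not the Hadoop API; the function names `map_phase`, `reduce_phase`, and `word_count` are made up for this example, and each string in the input list stands in for one input split.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)

def word_count(documents):
    # Each document stands in for one input split handled by one map task.
    grouped = defaultdict(list)
    for document in documents:
        for word, count in map_phase(document):
            grouped[word].append(count)  # group the values by key
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(word_count(["the cat sat", "the dog sat down"]))
```

In a real cluster the map calls would run in parallel on different nodes; here they run one after another in a single process.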
Q2) What do you understand by the terms Shuffling and Sorting in MapReduce?
Ans: Shuffling and sorting are two major processes that run together between the map and reduce phases, once the mappers start producing output and the reducers start fetching it.
Shuffling: Shuffling is the process of transferring data from the mappers to the reducers. It is mandatory for the reducers, because the shuffled data serves as the input to the reduce tasks.
Sorting: The key-value pairs output by the map phase are automatically sorted by key before they move to the reducer. This is helpful in programs that need sorted data at some stage, and it saves the programmer time.
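A minimal single-process sketch of shuffling and sorting, assuming each mapper's output is just a Python list of key-value pairs. The helper name `shuffle_and_sort` is hypothetical; in Hadoop the framework does this work across the network.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_outputs):
    """Gather the pairs from all mappers, sort them by key, and group values per key."""
    merged = [pair for output in mapper_outputs for pair in output]  # shuffle: collect pairs
    merged.sort(key=itemgetter(0))                                   # sort: order by key
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]

mapper_1 = [("sat", 1), ("cat", 1)]
mapper_2 = [("dog", 1), ("sat", 1)]
reducer_input = shuffle_and_sort([mapper_1, mapper_2])
print(reducer_input)
```

Each `(key, [values])` entry in the result is what one call of the reduce function would receive.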
Q3) Briefly explain the Identity Mapper and Chain Mapper in MapReduce.
Identity Mapper: Identity Mapper is the default Mapper class provided by Hadoop. It runs when no other Mapper class is defined. It only writes its input to the output unchanged and performs no computation on the data. The class name is org.apache.hadoop.mapred.lib.IdentityMapper.
Chain Mapper: Chain Mapper is a Mapper implementation that chains a set of Mapper classes within a single map task. The output of the first mapper becomes the input of the second mapper, the output of the second becomes the input of the third, and so on until the last mapper. The class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper (org.apache.hadoop.mapred.lib.ChainMapper in the older API).
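The two behaviors can be imitated with plain Python generator functions. This is a sketch of the idea only; `identity_mapper`, `lowercase_mapper`, `tokenize_mapper`, and `run_chain` are illustrative names, not Hadoop classes.

```python
def identity_mapper(key, value):
    """Pass each input record through unchanged."""
    yield key, value

def lowercase_mapper(key, value):
    """Lower-case the value and keep the key."""
    yield key, value.lower()

def tokenize_mapper(key, value):
    """Emit one (word, 1) pair per word in the value."""
    for word in value.split():
        yield word, 1

def run_chain(mappers, records):
    """Feed the output of each mapper into the next one, in order."""
    for mapper in mappers:
        records = [out for key, value in records for out in mapper(key, value)]
    return records

records = [(0, "Hello World")]
print(run_chain([identity_mapper], records))
print(run_chain([lowercase_mapper, tokenize_mapper], records))
```

The whole chain runs inside one loop here, which mirrors the point that chained mappers execute within a single map task rather than as separate jobs.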
Q4) Illustrate the differences between HDFS block and InputSplit?
HDFS block: An HDFS block is a physical division of the data.
InputSplit: An InputSplit in MapReduce is a logical division of the input files.
The InputSplit also controls the number of mappers, and the split size can be defined by the user. The HDFS block size defaults to 64 MB in older Hadoop releases (128 MB in Hadoop 2 and later) and is configurable. With 64 MB blocks, 1 GB of data gives 1 GB / 64 MB = 16 blocks. If the user does not define the input split size, it defaults to the HDFS block size, so the same data also yields 16 splits.
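The arithmetic can be checked with a short Python sketch. The split-size rule used here, max(min size, min(max size, block size)), is modeled on how Hadoop's FileInputFormat picks a split size; the function names and default bounds are assumptions made for this example.

```python
import sys

MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=sys.maxsize):
    """Use the block size unless user-defined bounds override it."""
    return max(min_size, min(max_size, block_size))

def number_of_splits(file_size, split_size):
    """Ceiling division: every started chunk of split_size bytes is one split."""
    return -(-file_size // split_size)

default_split = compute_split_size(64 * MB)                  # no user setting
custom_split = compute_split_size(64 * MB, max_size=32 * MB)  # user caps splits at 32 MB

print(number_of_splits(1024 * MB, default_split))  # 1 GB with 64 MB splits
print(number_of_splits(1024 * MB, custom_split))   # 1 GB with 32 MB splits
```

Halving the split size doubles the number of splits, and therefore the number of mappers, without changing how the blocks are stored.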
Q5) Explain job scheduling through JobTracker.
Ans: The JobTracker communicates with the NameNode to identify where the data is located and submits the work to TaskTracker nodes. Each TaskTracker sends periodic heartbeat messages that reassure the JobTracker it is still alive, and it notifies the JobTracker of any task failure. The JobTracker then decides what action to take: it can resubmit the job, mark the specific record as one to avoid, or blacklist the TaskTracker as unreliable.
Q6) Why is MapReduce used?
Ans: Traditional enterprise systems use a centralized server to store and process data. This model is not suitable for processing very large, growing volumes of data, which standard database servers cannot accommodate. When multiple files are processed simultaneously, the centralized system becomes a bottleneck. Google addressed this issue with an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many different computers. The results are then collected in one place and integrated to form the result dataset. This makes data processing much easier than on traditional systems.
Q7) List the main components of MapReduce.
Ans: The main components of MapReduce are listed below:
- Main Class - The main (driver) class provides the main parameters for the job, such as the data files to be processed.
- Mapper Class - Mapping is done in this class. Its map method is executed.
- Reducer Class - The aggregated data is passed to the reducer class, where it is reduced.
Q8) Briefly list the configuration parameters required to run a job in the MapReduce framework.
Ans: The following configuration parameters are required to run a MapReduce job:
1. The input location of the job's data in the file system
2. The output location of the job's data in the file system
3. The input format of the data
4. The output format of the data
5. The class containing the map function
6. The class containing the reduce function
7. The JAR file containing all the mapper and reducer classes
Q9) What do you know about Input Format?
Ans: InputFormat is a MapReduce feature that defines the input specification of a job. It performs the following functions:
1. It divides the input files into logical instances called InputSplits. Each split is then assigned to an individual mapper.
2. It validates the input specification of the job.
3. It provides the RecordReader implementation, which extracts the input records from each split for the mapper to process.
Q10) Illustrate the differences between a reducer and a combiner?
Ans: A combiner performs a local reduce on the output of a map task. Like a reducer, it produces output that serves as the reducer's input. It is often used for network optimization, especially when the map phase generates a large number of outputs, because combining locally reduces the data that must be transferred. A combiner differs from a reducer in several ways. A reducer has no restriction on its types, but a combiner does: its input and output key and value types must match the output types of the mapper. A combiner can only be used with functions that are commutative and associative, because it operates on a subset of the keys and values. A combiner receives its input from a single mapper, while a reducer receives its input from multiple mappers.
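The effect of a combiner can be seen in a small Python simulation. It assumes a word-count style job where summing is commutative and associative; the names `map_words`, `combine`, and `reduce_counts` are invented for this sketch.

```python
from collections import Counter

def map_words(text):
    """Mapper: emit a (word, 1) pair per word."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner: sum the counts locally, on the output of a single mapper."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def reduce_counts(all_pairs):
    """Reducer: sum the counts per key across the output of all mappers."""
    totals = Counter()
    for key, value in all_pairs:
        totals[key] += value
    return dict(totals)

splits = ["a b a a", "b a b"]
raw = [map_words(split) for split in splits]   # one list of pairs per mapper
combined = [combine(pairs) for pairs in raw]   # combiner runs per mapper

records_without = sum(len(pairs) for pairs in raw)
records_with = sum(len(pairs) for pairs in combined)
print(records_without, records_with)
print(reduce_counts(pair for pairs in combined for pair in pairs))
```

The reducer produces the same totals either way; the combiner only shrinks the number of records that have to be shuffled to it.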
Q11) Explain what you understand by the term NameNode in Hadoop?
Ans: The NameNode is the node where Hadoop stores all the file location information for HDFS (Hadoop Distributed File System). In simple terms, the NameNode is the centerpiece of an HDFS file system. It keeps a record of all the files in the file system and tracks where the file data is kept across the machines of the cluster.
Q12) What do you think will happen when a data node fails?
Ans: When a data node fails, the following happens:
- The JobTracker and NameNode detect the failure.
- All the tasks that were running on the failed node are re-scheduled.
- The NameNode replicates the user's data to another node.
MapReduce programming questions
Q13) Illustrate the differences between MapReduce and PIG?
PIG: Pig is a data flow language whose main focus is managing the flow of data from an input source to an output store. As part of managing this flow, it handles the temporary storage of intermediate data and compresses and rearranges processing steps for faster processing. Pig works on top of MapReduce jobs, and common MapReduce-style operations such as grouping, ordering, and counting data are available in Pig processing.
MapReduce: MapReduce is a framework in which developers write code. It is a data processing paradigm that separates the concerns of two types of developers: those who write the algorithms and those who scale them.
Q14) Define the term speculative execution.
Ans: Speculative execution is a MapReduce feature that allows several copies of the same task to be launched on different nodes. If a task takes a long time to complete, the framework creates a duplicate copy of it on another node, and the result of whichever copy finishes first is used.
Q15) Define the term Text Input Format in MapReduce?
Ans: Text Input Format (TextInputFormat) is the default input format for text files. Each file is broken into lines. The value of each record is the text of the line, and the key is the position (byte offset) of that line within the file.
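A rough Python imitation of the records this format produces, assuming UTF-8 input held in memory. The function name `text_input_records` is hypothetical, and real splits and record readers are more involved than this.

```python
def text_input_records(data: bytes):
    """Yield (byte offset, line text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: offset of the line start. Value: the line without its terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

for key, value in text_input_records(b"hello world\nfoo\nbar"):
    print(key, value)
```

Note that the key is a byte position, not a line number, so it depends on the length of every preceding line.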
Q16) Define the term Partitioner and why is it used?
Ans: The Partitioner controls how the keys of the intermediate map output are partitioned, typically by applying a hash function to the key. This determines which reducer receives each key-value pair as input. The total number of partitions is equal to the total number of reducers.
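Hash partitioning can be sketched in Python as below. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; this sketch swaps in `zlib.crc32` as a stand-in for `hashCode()`, so the partition numbers will not match a real cluster. The function names are illustrative.

```python
import zlib

def hash_partition(key: str, num_reducers: int) -> int:
    """Map a key to a partition index in the range [0, num_reducers)."""
    digest = zlib.crc32(key.encode("utf-8"))  # stand-in for Java's hashCode()
    return digest % num_reducers

def partition_pairs(pairs, num_reducers):
    """Place each (key, value) pair into the partition chosen for its key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash_partition(key, num_reducers)].append((key, value))
    return partitions

pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1)]
for index, partition in enumerate(partition_pairs(pairs, 3)):
    print(index, partition)
```

Because the partition depends only on the key, every pair with the same key lands in the same partition and is therefore seen by the same reducer.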
Q17) What are the different job control options specified by MapReduce?
Ans: The MapReduce framework supports chained operations, in which the output of one job serves as the input of another, so job controls are needed to govern these complex operations.
The various job control options are:
Job.submit(): Submits the job to the cluster and returns immediately.
Job.waitForCompletion(boolean): Submits the job to the cluster and waits until it completes.
Q18) What do you understand by the term SequenceFileInputFormat?
Ans: SequenceFileInputFormat is the input format used for reading sequence files, which are a compressed binary file format. It extends FileInputFormat. Sequence files are used for passing data between MapReduce jobs, from the output of one job to the input of another.
Q19) What do you understand by the term MapReduce Combiner?
Ans: A MapReduce combiner is also known as a semi-reducer. It is an optional class that combines the map output records that share the same key. Its main function is to accept the output of the map class and pass the combined key-value pairs on to the reducer class.
Q20) What is meant by OutputCommitter?
Ans: An OutputCommitter describes how the output of a MapReduce task is committed. FileOutputCommitter is the default OutputCommitter class in MapReduce. It performs the following functions:
- It creates a temporary output directory for the job during the initialization stage.
- It sets up the temporary output for each task.
- It identifies whether a task requires a commit and applies the commit if it is required.
- It cleans up the job by removing the temporary output directory after the job completes.
JobSetup, JobCleanup, and TaskCleanup are the important tasks during output commit.
Q21) List out the different types of modes in which Hadoop can be run?
Ans: The three different modes in which Hadoop can be run are:
- Standalone (local) mode
- Pseudo-distributed mode
- Fully distributed mode
Q22) How many InputSplits are made by the Hadoop framework for input files of 64 KB, 65 MB, and 127 MB, assuming a 64 MB block size?
Ans: Hadoop will make 5 splits:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
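The count of 5 follows from a simplified model in which every started 64 MB block of a file becomes one split. The Python sketch below uses that model and a hypothetical helper `splits_for_file`; a real FileInputFormat may let the final split run slightly over the split size, which this model ignores.

```python
KB = 1024
MB = 1024 * KB

def splits_for_file(file_size, block_size=64 * MB):
    """Simplified model: one split per started block (ceiling division)."""
    return -(-file_size // block_size)

files = {"64 KB": 64 * KB, "65 MB": 65 * MB, "127 MB": 127 * MB}
per_file = {name: splits_for_file(size) for name, size in files.items()}
total_splits = sum(per_file.values())

print(per_file)
print(total_splits)
```

Splits are computed per file, so the small 64 KB file still costs a whole split (and a whole mapper) of its own.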
MapReduce coding interview questions
Q23) What is meant by a distributed cache in Hadoop?
Ans: The distributed cache is a facility provided by the MapReduce framework for caching files that are needed during the execution of a job. The framework copies the necessary files to the slave node before any task of the job executes on that node.
Q24) Explain what storage and compute nodes are.
Storage node: The storage node is the machine or computer where the file system resides to store the data being processed.
Compute node: The compute node is the machine or computer where the actual business logic is executed.
Q25) What is the process of writing a custom partitioner for a Hadoop job?
Ans: To write a custom partitioner for a Hadoop job, follow these steps:
- Create a new class that extends the Partitioner class.
- Override the getPartition method.
- In the wrapper that runs the MapReduce job, add the custom partitioner to the job programmatically using the setPartitionerClass method, or add it to the job as a config file.
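The steps above can be mirrored in a plain-Python sketch. The `Partitioner` base class, the `Job` wrapper, and `set_partitioner_class` here are hypothetical stand-ins named after their Hadoop counterparts; they are not the Hadoop API, and the routing rule (first letter of the key) is just an example.

```python
class Partitioner:
    """Stand-in base class: subclasses decide which partition a key goes to."""
    def get_partition(self, key, value, num_partitions):
        raise NotImplementedError

class FirstLetterPartitioner(Partitioner):
    """Send keys starting with a-m to partition 0 and all other keys to partition 1."""
    def get_partition(self, key, value, num_partitions):
        if num_partitions < 2:
            return 0
        first = key[:1].lower()
        return 0 if "a" <= first <= "m" else 1

class Job:
    """Tiny stand-in for the wrapper (driver) that configures and runs a job."""
    def __init__(self, num_reducers):
        self.num_reducers = num_reducers
        self.partitioner = None

    def set_partitioner_class(self, partitioner_class):
        self.partitioner = partitioner_class()

    def route(self, pairs):
        """Group the pairs into one bucket per reducer using the partitioner."""
        buckets = [[] for _ in range(self.num_reducers)]
        for key, value in pairs:
            index = self.partitioner.get_partition(key, value, self.num_reducers)
            buckets[index].append((key, value))
        return buckets

job = Job(num_reducers=2)
job.set_partitioner_class(FirstLetterPartitioner)
print(job.route([("apple", 1), ("zebra", 1), ("Mango", 1)]))
```

A custom partitioner like this is useful when the default hash distribution would not group the keys the way the job needs, for example when each reducer should own a contiguous key range.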
Q26) Explain what is WebDAV in Hadoop?
Ans: WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
Q27) Can MapReduce programs be written in any language other than Java?
Ans: Yes, MapReduce programs can be written in many programming languages, such as Java, C++, R, and scripting languages (Python, PHP). Any language that can read from stdin, write to stdout, and parse tab and newline characters will work. Hadoop Streaming (a Hadoop utility) lets users create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
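As an illustration, a streaming-style word count in Python might look like the sketch below. It follows the convention described above (tab-separated key and value, one record per line, reducer input sorted by key); the function names and the `run_stage` helper are assumptions for this example, and the demo at the bottom runs both stages locally on in-memory lines instead of a cluster.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_stage(role, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming script could call: read stdin, write stdout."""
    stage = mapper if role == "map" else reducer
    for record in stage(stdin):
        stdout.write(record + "\n")

# Local demo: sorted() plays the part of the framework's sort between the stages.
map_output = list(mapper(["the cat sat\n", "the dog sat\n"]))
reduce_output = list(reducer(sorted(map_output)))
print(reduce_output)
```

The mapper and reducer never import anything from Hadoop, which is the point of streaming: the contract is only the text format on stdin and stdout.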
Q28) What do you understand by the term Stragglers?
Ans: Stragglers are MapReduce tasks that take longer than expected to complete.
Q29) What is meant by a Task instance in Hadoop? Where does it run?
Ans: Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance), so that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. Multiple task instance processes can run on a slave node, depending on the number of slots configured on the TaskTracker. By default, a new task instance JVM process is spawned for each task.
Q30) How many Daemon processes can be run on a Hadoop system?
Ans: Hadoop comprises five separate daemons, each running in its own JVM. The following three daemons run on master nodes:
- NameNode: Stores and maintains the metadata for HDFS.
- Secondary NameNode: Performs housekeeping functions for the NameNode.
- JobTracker: Manages MapReduce jobs and distributes individual tasks to the machines running the TaskTracker.
The following two daemons run on each slave node:
- DataNode: Stores the actual HDFS data blocks.
- TaskTracker: Instantiates and monitors individual map and reduce tasks.
Hadoop is in high demand, and there are many job opportunities for people with Hadoop skills. Because MapReduce is one of the core concepts of Hadoop, it features prominently in these roles. The questions above, curated by experts, should help both freshers and experienced professionals prepare for the interview. It is worth reviewing them once more and preparing well.