• USA : +1 973 910 5725
  • INDIA: +91 905 291 3388
  • info@tekslate.com
  • Login

MapReduce Interview Questions and Answers

MapReduce Interview Questions

What is Hadoop Map Reduce ?

For processing large data sets in parallel across a hadoop cluster, Hadoop MapReduce framework is used.  Data analysis uses a two-step map and reduce process.

How Hadoop MapReduce works?

In MapReduce, during the map phase it counts the words in each document, while in the reduce phase it aggregates the data as per the document spanning the entire collection. During the map phase the input data is divided into splits for analysis by map tasks running in parallel across Hadoop framework.

Explain what is shuffling in MapReduce ?

The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle

Explain what is distributed Cache in MapReduce Framework ?

Distributed Cache is an important feature provided by map reduce framework. When you want to share some files across all nodes in Hadoop Cluster, DistributedCache  is used.  The files could be an executable jar files or simple properties file.

Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?

In Hadoop for submitting and tracking MapReduce jobs,  JobTracker is used. Job tracker run on its own JVM process

Hadoop performs following actions in Hadoop

– Client application submit jobs to the job tracker

– JobTracker communicates to the Namemode to determine data location

– Near the data or with available slots JobTracker locates TaskTracker nodes

– On chosen TaskTracker Nodes, it submits the work

– When a task fails, Job tracker notify and decides what to do then.

– The TaskTracker nodes are monitored by JobTracker

Explain what combiners is and when you should use a combiner in a MapReduce Job?

To increase the efficiency of MapReduce Program, Combiners are used.  The amount of data can be reduced with the help of combiner’s that need to be transferred across to the reducers. If the operation performed is commutative and associative you can use your reducer code as a combiner.  The execution of combiner is not guaranteed in Hadoop

Explain what is Speculative Execution?

In Hadoop during Speculative Execution a certain number of duplicate tasks are launched.  On different slave node, multiple copies of same map or reduce task can be executed using Speculative Execution. In simple words, if a particular drive is taking long time to complete a task, Hadoop will create a duplicate task on another disk.  Disk that finish the task first are retained and disks that do not finish first are killed.

Explain what are the basic parameters of a Mapper?

The basic parameters of a Mapper are

– LongWritable and Text

– Text and IntWritable

Explain what is the function of MapReducer partitioner?

The function of MapReducer partitioner is to make sure that all the value of a single key goes to the same reducer, eventually which helps evenly distribution of the map output over the reducers

Explain what is difference between an Input Split and HDFS Block?

Logical division of data is known as Split while physical division of data is known as HDFS Block

Explain what happens in textinformat ?

In textinputformat, each line in the text file is a record.  Value is the content of the line while Key is the byte offset of the line. For instance, Key: longWritable, Value: text

Mention what are the main configuration parameters that user need to specify to run Mapreduce Job ?

The user of Mapreduce framework needs to specify

– Job’s input locations in the distributed file system

– Job’s output location in the distributed file system

– Input format

– Output format

– Class containing the map function

– Class containing the reduce function

– JAR file containing the mapper, reducer and driver classes

Explain what is Sequencefileinputformat?

Sequencefileinputformat is used for reading files in sequence. It is a specific compressed binary file format which is optimized for passing data between the output of one MapReduce job to the input of some other MapReduce job.

Explain what does the conf.setMapper Class do ?

Conf.setMapperclass  sets the mapper class and all the stuff related to map job such as reading data and generating a key-value pair out of the mapper

What happens in case of hardware/software failure?  

MapReduce framework must be able to recover from both hardware (disk failures, RAM errors) and software (bugs, unexpected exceptions) errors. Both are common and expected.

Is it possible to start reducers while some mappers still run? Why?

No. Reducer’s input is grouped by the key. The last mapper could theoretically produce key already consumed by running reducer.

Define a straggler?

Straggler is either map or reduce task that takes unusually long time to complete.

What is speculative execution (also called backup tasks)? What problem does it solve?

Identical copy of the same task is executed on multiple nodes. Output of the fastest task used.
Speculative execution helps if the task is slow because of hardware problem. It does not help if the distribution of values over keys is skewed.

What does combiner do? 

Combiner does local aggregation of key/values pairs produced by mapper before or during shuttle and sort phase. In general, it reduces amount of data to be transferred between nodes.

The framework decides how many times to run it. Combiner may run zero, one or multiple times on the same input.

Explain mapper life cycle?

Initialization method is called before any other method is called. It has no parameters and no output.

Map method is called separately for each key/value pair. It process input key/value pairs and emits intermediate key/value pairs.

Close method runs after all input key/value have been processed. The method should close all open resources. It may also emit key/value pairs.

Explain reducer life cycle?

Initialization method is called before any other method is called. It has no parameters and no output.

Reduce method is called separately for each key/[values list] pair. It process intermediate key/value pairs and emits final key/value pairs. Its input is a key and iterator over all intermediate values associated with the same key.

Close method runs after all input key/value have been processed. The method should close all open resources. It may also emit key/value pairs.  

What is local aggregation and why is it used?

Either combiner or a mapper combines key/value pairs with the same key together. They may do also some additional preprocessing of combined values. Only key/value pairs produced by the same mapper are combined.

Key/Value pairs created by map tasks are transferred between nodes during shuffle and sort phase. Local aggregation reduces amount of data to be transferred.

If the distribution of values over keys is skewed, data pre-processing in combiner helps to eliminate reduce stragglers.

What is in-mapper combining? State advantages and disadvantages over writing custom combiner? 

Local aggregation (combining of key/value pairs) done inside the mapper.

Map method does not emit key/value pairs, it only updates internal data structure. Close method combines and preprocess all stored data and emits final key/value pairs. Internal data structure is initialized in init method.

Advantages:

– It will run exactly once. Combiner may run multiple times or not at all.

– We are sure it will run during map phase. Combiner may run either after map phase or before reduce phase. The latter case provides no reduction in transferred data.

– In-mapper combining is typically more effective. Combiner does not reduce amount of data produced by mappers, it only groups generated data together. That causes unnecessary object creation, destruction, serialization and deserialization.

Disadvantages:

– Scalability bottleneck: the technique depends on having enough memory to store all partial results. We have to flush partial results regularly to avoid it. Combiner use produce no scalability bottleneck.

Describe order inversion design pattern?

Order inversion is used if the algorithm requires two passes through mapper generated key/value pairs with the same key. The first pass generates some overall statistic which is then applied to data during the second pass. The reducer would need to buffer data in the memory just to be able to pass twice through them.

First pass result is calculated by mappers and stored in some internal data structure. The mapper emits the result in closing method, after all usual intermediate key/value pairs.

The pattern requires custom partitioning and sort. First pass result must come to the reducer before usual key/value pairs. Of course, it must come to the same reducer.

Describe reduce side join between tables with one-on-one relationship?

Mapper produces key/value pairs with join ids as keys and row values as value. Corresponding rows from both tables are grouped together by the framework during shuffle and sort phase.

Reduce method in reducer obtains join id and two values, each represents row from one table. Reducer joins the data.

Describe map side join between two database tables?

Map side join works only if following assumptions hold:
– both datasets are sorted by the join key,
– both datasets are partitioned the same way.

Mapper maps over larger dataset and reads corresponding part of smaller dataset inside the mapper. As the smaller set is partitioned the same way as bigger one, only one map task access the same data. As the data are sorted by the join key.

Describe memory backed join?

Smaller set of data is loaded into the memory in every mapper. Mappers loop over larger dataset and joins it with data in the memory. If the smaller set is too big to fit into the memory, dataset is loaded into memcached or some other caching solution.
 Our design of course tutorials and interview questions is practical and informative. At TekSlate, we offer resources to help you learn various IT courses. We avail both written material and demo video tutorials. For in-depth knowledge and practical experience explore Online MapReduce Training.

Summary
Review Date
Reviewed Item
Useful, MapReduce Interview Questions and Answers
Author Rating
5

“At TekSlate, we are trying to create high quality tutorials and articles, if you think any information is incorrect or want to add anything to the article, please feel free to get in touch with us at info@tekslate.com, we will update the article in 24 hours.”

0 Responses on MapReduce Interview Questions and Answers"

    Leave a Message

    Your email address will not be published. Required fields are marked *

    Site Disclaimer, Copyright © 2016 - All Rights Reserved.

    Support


    Please leave a message and we'll get back to you soon.

    I agree to be contacted via e-mail.