MapReduce Interview Questions and Answers
MapReduce Interview Questions
What happens in case of hardware/software failure?
Is it possible to start reducers while some mappers still run? Why?
Define a straggler.
A straggler is a map or reduce task that takes an unusually long time to complete.
What is speculative execution (also called backup tasks)? What problem does it solve?
Speculative execution helps if a task is slow because of a hardware problem. It does not help if the distribution of values over keys is skewed, because the backup task receives the same oversized input and is just as slow.
What does a combiner do?
The framework decides how many times to run it. A combiner may run zero, one, or multiple times on the same input.
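Because the framework chooses how often the combiner runs, the final result must not depend on that number. The following is a minimal sketch in plain Python, not the Hadoop API; `combine` and `reduce_sum` are hypothetical names, and summing is used as an example of an operation that is safe to apply repeatedly.

```python
from collections import defaultdict

def combine(pairs):
    """Sum values per key; the output has the same shape as the input."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def reduce_sum(pairs):
    """Final reduce step: one total per key."""
    return dict(combine(pairs))

map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# The combiner may run zero, one, or several times on the same input.
zero_runs = reduce_sum(map_output)
one_run = reduce_sum(combine(map_output))
two_runs = reduce_sum(combine(combine(map_output)))
```

All three variables hold the same totals, which is the property a combiner has to preserve.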
Explain the mapper life cycle.
The map method is called separately for each key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.
The close method runs after all input key/value pairs have been processed. It should close all open resources. It may also emit key/value pairs.
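The life cycle can be illustrated with a small simulation. This is a sketch in plain Python rather than real framework code: `LineLengthMapper` and `run_map_task` are hypothetical names, and the `init` step is assumed to run once before the first call to `map` (in Hadoop's Java API the corresponding methods are `setup`, `map`, and `cleanup`).

```python
class LineLengthMapper:
    """Hypothetical mapper: emits (line_length, 1) for each input line."""

    def init(self):
        # Assumed to run once, before any input is processed.
        self.records_seen = 0

    def map(self, key, value, emit):
        # Runs once per input key/value pair.
        self.records_seen += 1
        emit(len(value), 1)

    def close(self, emit):
        # Runs once after all input; it may still emit pairs.
        emit("records_seen", self.records_seen)


def run_map_task(mapper, records):
    """Drive one map task through its life cycle and collect the output."""
    output = []

    def emit(key, value):
        output.append((key, value))

    mapper.init()
    for key, value in records:
        mapper.map(key, value, emit)
    mapper.close(emit)
    return output


mapper_output = run_map_task(LineLengthMapper(), [(0, "ab"), (3, "abcd")])
```

The pair emitted from `close` appears last in `mapper_output`, after the pairs emitted by `map`.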
Explain the reducer life cycle.
The reduce method is called once for each key and its list of values. Its input is a key and an iterator over all intermediate values associated with that key. It processes these intermediate values and emits final key/value pairs.
The close method runs after all input key/value pairs have been processed. It should close all open resources. It may also emit key/value pairs.
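A comparable sketch for the reduce side, again in plain Python with hypothetical names (`SumReducer`, `run_reduce_task`). Sorting and grouping by key stand in for the framework's shuffle and sort, and the `init` step is an assumption.

```python
from itertools import groupby
from operator import itemgetter


class SumReducer:
    """Hypothetical reducer: sums all values that share a key."""

    def init(self):
        # Assumed to run once, before the first key is processed.
        self.keys_seen = 0

    def reduce(self, key, values, emit):
        # Runs once per key; `values` is an iterator over that key's values.
        self.keys_seen += 1
        emit(key, sum(values))

    def close(self, emit):
        # Runs once after all keys; it may still emit pairs.
        emit("keys_seen", self.keys_seen)


def run_reduce_task(reducer, intermediate):
    """Group intermediate pairs by key and drive the reducer life cycle."""
    output = []

    def emit(key, value):
        output.append((key, value))

    reducer.init()
    ordered = sorted(intermediate, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        reducer.reduce(key, (value for _, value in group), emit)
    reducer.close(emit)
    return output


reducer_output = run_reduce_task(SumReducer(), [("b", 1), ("a", 2), ("a", 3)])
```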
What is local aggregation and why is it used?
Key/value pairs created by map tasks are transferred between nodes during the shuffle and sort phase. Local aggregation reduces the amount of data to be transferred.
If the distribution of values over keys is skewed, pre-processing the data in a combiner helps to eliminate reduce stragglers.
What is in-mapper combining? State its advantages and disadvantages compared to writing a custom combiner.
The map method does not emit key/value pairs; it only updates an internal data structure. The close method combines and pre-processes all stored data and emits the final key/value pairs. The internal data structure is initialized in the init method.
Advantages:
- It runs exactly once. A combiner may run multiple times or not at all.
- It is guaranteed to run during the map phase. A combiner may run either after the map phase or before the reduce phase. The latter case provides no reduction in transferred data.
- In-mapper combining is typically more efficient. A combiner does not reduce the amount of data produced by mappers; it only groups the generated data together afterwards. That causes unnecessary object creation, destruction, serialization, and deserialization.
Disadvantages:
- Scalability bottleneck: the technique depends on having enough memory to store all partial results. To avoid running out of memory, partial results have to be flushed regularly. Using a combiner produces no such bottleneck.
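A minimal word-count sketch of in-mapper combining, in plain Python rather than a real MapReduce API. The class name and the `max_entries` threshold are hypothetical. The threshold illustrates the regular flush mentioned above, so in this variant `map` does emit pairs whenever the in-memory table grows too large.

```python
class InMapperCombiningWordCount:
    """Hypothetical mapper that aggregates counts in memory before emitting."""

    def __init__(self, max_entries=1000):
        # Hypothetical flush threshold that bounds memory use.
        self.max_entries = max_entries

    def init(self):
        # The internal data structure is created once per map task.
        self.counts = {}

    def map(self, key, line, emit):
        # Update the in-memory counts instead of emitting (word, 1).
        for word in line.split():
            self.counts[word] = self.counts.get(word, 0) + 1
        if len(self.counts) >= self.max_entries:
            self.flush(emit)

    def close(self, emit):
        # Emit whatever is still buffered at the end of the task.
        self.flush(emit)

    def flush(self, emit):
        for word, count in self.counts.items():
            emit(word, count)
        self.counts.clear()


emitted = []
mapper = InMapperCombiningWordCount(max_entries=1000)
mapper.init()
for offset, line in enumerate(["a b a", "b a"]):
    mapper.map(offset, line, lambda k, v: emitted.append((k, v)))
mapper.close(lambda k, v: emitted.append((k, v)))
```

Five input words produce only two emitted pairs here. With a very small `max_entries` the same word can be emitted more than once, which is still correct as long as the reducer sums the partial counts.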
Describe the order inversion design pattern.
The first-pass result is calculated by the mappers and stored in an internal data structure. Each mapper emits the result in its close method, after all the usual intermediate key/value pairs.
The pattern requires a custom partitioner and a custom sort order. The first-pass result must reach the reducer before the usual key/value pairs, and it must reach the same reducer that receives them.
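As an illustration, the sketch below applies the pattern to relative frequencies of word pairs, where the first-pass result is the total count per left word. It is plain Python under stated assumptions: the `MARGINAL` marker, the function names, and the character-sum partitioner are all hypothetical, and sorting a list stands in for the framework's shuffle and sort.

```python
from collections import defaultdict
from itertools import groupby

MARGINAL = "*"  # hypothetical marker for the first-pass result


def map_task(pairs):
    """Emit ((left, right), 1) per pair, then the marginals in the closing step."""
    output, marginals = [], defaultdict(int)
    for left, right in pairs:
        output.append(((left, right), 1))
        marginals[left] += 1
    # Closing step: the first-pass results are emitted last by the mapper.
    output.extend(((left, MARGINAL), n) for left, n in marginals.items())
    return output


def partition(key, num_reducers):
    """Custom partitioner: route by the left word only."""
    return sum(ord(c) for c in key[0]) % num_reducers


def sort_key(key):
    """Custom sort: the marginal precedes ordinary keys with the same left word."""
    left, right = key
    return (left, right != MARGINAL, right)


def reduce_task(intermediate):
    """Divide each pair count by the marginal that arrived just before it."""
    result, total = {}, None
    ordered = sorted(intermediate, key=lambda kv: sort_key(kv[0]))
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        count = sum(value for _, value in group)
        if key[1] == MARGINAL:
            total = count
        else:
            result[key] = count / total
    return result


pairs = [("dog", "barks"), ("dog", "runs"), ("dog", "barks"), ("cat", "runs")]
buckets = defaultdict(list)
for key, value in map_task(pairs):
    buckets[partition(key, 3)].append((key, value))

frequencies = {}
for bucket in buckets.values():
    frequencies.update(reduce_task(bucket))
```

Although the mapper emits the marginal last, the custom sort delivers it to the reducer first, so the reducer never has to buffer the ordinary pairs.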
Describe a reduce-side join between tables with a one-to-one relationship.
The reduce method obtains a join id and two values, each representing a row from one table. The reducer joins the data.
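A minimal sketch of such a join in plain Python. The table contents, the `id` column, and the function names are hypothetical. Tagging each row with its table name is an assumption made here so the reducer can tell the two values apart.

```python
from itertools import groupby
from operator import itemgetter


def map_table(rows, table_name):
    """Emit (join_id, (table_name, row)) for each row of one table."""
    return [(row["id"], (table_name, row)) for row in rows]


def reduce_join(join_id, tagged_rows):
    """One-to-one join: expects one row from each of the two tables."""
    by_table = dict(tagged_rows)
    if len(by_table) < 2:
        return None  # no matching row in the other table
    merged = {}
    for row in by_table.values():
        merged.update(row)
    return merged


users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
emails = [{"id": 2, "email": "bob@example.com"},
          {"id": 1, "email": "ann@example.com"}]

intermediate = map_table(users, "users") + map_table(emails, "emails")
intermediate.sort(key=itemgetter(0))  # stands in for shuffle and sort

joined = []
for join_id, group in groupby(intermediate, key=itemgetter(0)):
    row = reduce_join(join_id, [value for _, value in group])
    if row is not None:
        joined.append(row)
```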
Describe a map-side join between two database tables.
The mapper maps over the larger dataset and reads the corresponding part of the smaller dataset inside the map task. Because the smaller dataset is partitioned the same way as the larger one, each part of it is read by only one map task. Because the data are sorted by the join key, the two inputs can be joined with a single sequential merge.
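The per-partition work can be sketched as a merge over two sorted inputs. This is plain Python with hypothetical data and names; it assumes both datasets are split into matching partitions, each sorted by join key, and that keys in the smaller dataset are unique.

```python
def merge_join(larger, smaller):
    """Join two lists of (join_key, row) pairs that are sorted by join_key."""
    joined, j = [], 0
    for key, big_row in larger:
        # Advance through the smaller input until it catches up with `key`.
        while j < len(smaller) and smaller[j][0] < key:
            j += 1
        if j < len(smaller) and smaller[j][0] == key:
            joined.append((key, big_row, smaller[j][1]))
    return joined


# Partition i of the larger dataset covers the same key range as
# partition i of the smaller dataset.
larger_partitions = [
    [(1, "order-a"), (2, "order-b")],
    [(5, "order-c"), (7, "order-d")],
]
smaller_partitions = [
    [(1, "Ann"), (3, "Cid")],
    [(5, "Eve")],
]

# One map task per partition pair; no shuffle or reduce phase is needed.
map_outputs = [
    merge_join(big, small)
    for big, small in zip(larger_partitions, smaller_partitions)
]
```

Each map task reads its part of the smaller dataset once, front to back, which is what the sort order makes possible.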
Describe a memory-backed join.