MapReduce Tutorial

Welcome to the MapReduce tutorials. The intent of these tutorials is to provide a good understanding of MapReduce. In addition to these tutorials, we will also cover common issues, interview questions, and how-tos of MapReduce.

Introduction to MapReduce

MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, tasks like:

-Analyzing application logs

-Aggregating related data from external sources

-Transforming data from one format to another

-Exporting data for external analysis

A MapReduce job consists of two phases:

-Mapper: Consumes input data, analyzes it (usually with filter and sorting operations), and emits tuples (key-value pairs)

-Reducer: Consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data.

The framework groups all intermediate values that share a key into a list, and this list is given to a Reducer:
 
-There may be a single Reducer, or multiple Reducers
-All values associated with a particular intermediate key are guaranteed to go to the same Reducer
-The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
-This step is known as the ‘shuffle and sort’
 
The Reducer outputs zero or more final key/value pairs
 
-These are written to HDFS
-In practice, the Reducer usually emits a single key/value pair for each input key
 
Add up all the values associated with each intermediate key:

reduce(output_key, intermediate_vals):
  count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)
 
Reducer output
 
('aardvark',1)
('mat',1) ('on',2) ('sat',2) ('sofa',1) ('the',4)
 

Functional Programming Concepts

MapReduce programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily. The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale.

Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If in a mapping task you change an input (key, value) pair, it does not get reflected back in the input files; communication occurs only by generating new output (key, value) pairs which are then forwarded by the Hadoop system into the next phase of execution.

Python MapReduce Code

The “trick” behind Python MapReduce code is to use the Hadoop Streaming API (see also the corresponding wiki entry) to pass data between the Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We simply use Python’s sys.stdin to read input data and print our own output to sys.stdout. That is all we need to do, because Hadoop Streaming takes care of everything else.

Usage of MapReduce

The classical example of using MapReduce is logfile analysis. Big logfiles are split, and a mapper searches for the different web pages that are accessed. Every time a web page is found in the log, a key/value pair is emitted to the reducer, where the key is the web page and the value is "1". The reducers aggregate the counts for each web page, so the end result is the total number of hits for each web page. Another example is full-text indexing: the mapper would map every phrase or word in a document to that document, and the reducer would write these mappings to an index. Other applications are:
-Distributed grep

-Reverse web-link graph: the map function outputs (target URL, source) pairs from an input web page (source); the reduce function concatenates the list of all source URLs associated with a given target URL and returns (target, list(sources)) (see the sketch after this list)

-Word count in a number of documents
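
As an illustration of the reverse web-link graph, here is a minimal sketch against Hadoop's classic org.apache.hadoop.mapred API. The class names and the assumption that input records are tab-separated "source<TAB>target" lines are ours, not part of the original tutorial:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ReverseLinkGraph {

  // Mapper: reads "source<TAB>target" lines and inverts the link direction.
  public static class LinkMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] parts = line.toString().split("\t");
      if (parts.length == 2) {
        // Emit (target, source).
        output.collect(new Text(parts[1]), new Text(parts[0]));
      }
    }
  }

  // Reducer: receives all sources for one target and concatenates them.
  public static class SourceListReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text target, Iterator<Text> sources,
                       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      StringBuilder list = new StringBuilder();
      while (sources.hasNext()) {
        if (list.length() > 0) {
          list.append(",");
        }
        list.append(sources.next().toString());
      }
      // Emit (target, list(sources)).
      output.collect(target, new Text(list.toString()));
    }
  }
}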

MapReduce can also be applied to many other problems. For example, Google uses MapReduce to calculate its PageRank.

Word Count

-We have a large file of words, one word to a line

-Count the number of times each distinct word appears in the file

-Sample application: analyze web server logs to find popular URLs

Case 1: The entire file fits in memory.

Case 2: The file is too large for memory, but all <word, count> pairs fit in memory.

Case 3: The file is on disk, and there are too many distinct words to fit in memory:

sort datafile | uniq -c

To make it slightly harder, suppose we have a large corpus of documents and want to count the number of times each distinct word occurs in the corpus:

words(docs/*) | sort | uniq -c

where words takes a file and outputs the words in it, one to a line. The above captures the essence of MapReduce, and the great thing is that it is naturally parallelizable.

Word Count using MapReduce

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
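
For comparison, here is a minimal sketch of the same word count written against Hadoop's classic org.apache.hadoop.mapred (JobConf-based) Java API; the class and job names are illustrative and are not taken from this tutorial:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) pairs.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}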

MapReduce Architecture

MapReduce is a programming framework that abstracts away the complexity of parallel applications. The management architecture is based on the master/worker model, while slave-to-slave data exchange uses a peer-to-peer model. The simplified computational model for data handling means the programmer is shielded from the complexity of data distribution and management. There is still a considerable degree of underlying complexity because of the large number of data sets, scattered across hundreds or thousands of machines, and the need to keep computing runtimes low. The MapReduce architecture consists of a master machine that manages the other worker machines.


Advanced MapReduce features

The advanced MapReduce features cover execution and lower-level details. For ordinary MapReduce programming, knowing the APIs and their usage is sufficient to write applications, but understanding the inner details of MapReduce is necessary to see how it actually works and to gain confidence. Let us now discuss the advanced features in the following sections.

Custom Types (Data)

The data that passes through Mappers and Reducers is stored in Java objects. Writable interface: the Writable interface is one of the most important interfaces. Objects that can be marshaled to/from files and over the network use this interface, and Hadoop also uses it to transmit data in a serialized form. Some Writable implementations are mentioned below (a short usage sketch follows the list):

-Text (stores String data)

-LongWritable

-FloatWritable

-IntWritable

-BooleanWritable
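
Each of these classes wraps a plain Java value and knows how to serialize it. A minimal sketch of how they are used (the values here are arbitrary):

import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class BuiltInWritables {
  public static void main(String[] args) {
    Text word = new Text("hadoop");                 // wraps a String
    IntWritable count = new IntWritable(42);        // wraps an int
    LongWritable offset = new LongWritable(1024L);  // wraps a long
    BooleanWritable flag = new BooleanWritable(true);

    // get()/toString() return the wrapped values; set(...) replaces them.
    count.set(count.get() + 1);
    System.out.println(word + " " + count.get() + " " + offset.get() + " " + flag.get());
  }
}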

A custom data type can also be created by implementing the Writable interface; Hadoop is capable of transmitting any custom data type (whatever fits your requirement) that implements it. The Writable interface has two methods, readFields and write. The first method (readFields) initializes the fields of the object from the data contained in the ‘in’ binary stream. The second method (write) serializes the fields of the object to the binary stream ‘out’. The most important contract of the entire process is that reads and writes happen in the same field order on the binary stream.

Writable interface:

public interface Writable {
  void readFields(DataInput in) throws IOException;
  void write(DataOutput out) throws IOException;
}
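
For example, a custom value type might bundle two related fields. The class name and fields below are illustrative assumptions, not part of the original tutorial:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical custom value type: a page-view record with two fields.
public class PageViewWritable implements Writable {
  private long timestamp;
  private int hits;

  public PageViewWritable() {
    // Hadoop needs a no-argument constructor to instantiate the type via reflection.
  }

  public void write(DataOutput out) throws IOException {
    // Write the fields in a fixed order...
    out.writeLong(timestamp);
    out.writeInt(hits);
  }

  public void readFields(DataInput in) throws IOException {
    // ...and read them back in exactly the same order.
    timestamp = in.readLong();
    hits = in.readInt();
  }
}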

Custom Types (Key)

In Hadoop MapReduce, the Reducer processes keys in sorted order, so a custom key type needs to implement the interface called WritableComparable. Key types should also implement hashCode(). The WritableComparable interface, shown below, represents a Writable which is also Comparable:

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
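
A minimal sketch of a custom key type; the class name and its single field are illustrative assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key: a URL used as the intermediate key of a job.
public class UrlKey implements WritableComparable<UrlKey> {
  private String url;

  public UrlKey() {}                        // no-arg constructor required by Hadoop

  public UrlKey(String url) {
    this.url = url;
  }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(url);
  }

  public void readFields(DataInput in) throws IOException {
    url = in.readUTF();
  }

  public int compareTo(UrlKey other) {
    return url.compareTo(other.url);        // defines the sort order seen by the Reducer
  }

  @Override
  public int hashCode() {
    return url.hashCode();                  // used by the default partitioner
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof UrlKey && url.equals(((UrlKey) o).url);
  }
}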

How to use Custom Types

Now let us explore the mechanism Hadoop uses to work with these types. The JobConf object (which defines the job) has two methods, setOutputKeyClass() and setOutputValueClass(), which control the key and value data types of the job's output. If the Mapper produces intermediate types that do not match the Reducer's output types, JobConf's setMapOutputKeyClass() and setMapOutputValueClass() methods can be used to set the intermediate types expected by the Reducer.
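
A minimal sketch of such a configuration, assuming word-count-style types (the specific classes chosen here are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class TypeConfiguration {
  public static JobConf configure(JobConf conf) {
    // Types of the final (Reducer) output.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Types of the intermediate (Mapper) output; only needed when they
    // differ from the final output types set above.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(LongWritable.class);
    return conf;
  }
}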

Advantages of MapReduce

-Distributes data and computation. Keeping the computation local to the data prevents network overload.

-Linear scaling in the ideal case. It is designed for cheap, commodity hardware.

-Simple programming model. The end-user programmer only writes map and reduce tasks.

-HDFS stores large amounts of information.

-HDFS has a simple and robust coherency model.

-That is, it stores data reliably.

-HDFS is scalable and provides fast access to this information; it is also possible to serve a large number of clients by simply adding more machines to the cluster.

-HDFS integrates well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.

-HDFS provides streaming read performance.

-Data is written to HDFS once and then read several times.

-The overhead of caching is avoided: data can simply be re-read from the HDFS source.

-Fault tolerance by detecting faults and applying quick, automatic recovery

-Processing logic close to the data, rather than the data close to the processing logic

-Portability across heterogeneous commodity hardware and operating systems

-Economy by distributing data and processing across clusters of commodity personal computers

-Efficiency by distributing data and logic to process it in parallel on nodes where data is located

-Reliability by automatically maintaining multiple copies of data and automatically deploying processing logic in the event of failures

-HDFS is a block-structured file system: each file is broken into blocks of a fixed size, and these blocks are stored across a cluster of one or more machines with data storage capacity.

Disadvantages of MapReduce

-Rough edges: Hadoop MapReduce and HDFS are still rough in places because the software is under active development.

-The programming model is very restrictive: the lack of central, shared data can be limiting.

-Joins of multiple datasets are tricky and slow: there are no indices, and often the entire dataset gets copied in the process.

-Cluster management is hard: operations like debugging, distributing software, and collecting logs across the cluster are difficult.

-There is still a single master, which requires care and may limit scaling.

-Managing job flow isn’t trivial when intermediate data has to be kept.

-The optimal configuration of nodes is not obvious, e.g. the number of mappers and reducers and the memory limits.

Conclusion

In this discussion, we have covered the most important Hadoop MapReduce features. These features are helpful for customization. In practical MapReduce applications, the default implementations of the APIs are rarely used on their own; rather, the custom features (which are built on the exposed APIs) have a significant impact. All these customizations can be done easily once the concepts are clear.
