What is Hadoop?
Hadoop is a distributed computing platform written in Java. Its design is based on ideas from the Google File System and MapReduce.
What platform and Java version is required to run Hadoop?
Hadoop requires Java 1.6.x or higher, preferably from Sun. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X and Solaris are also known to work.
What kind of Hardware is best for Hadoop?
Hadoop can run on dual-processor or dual-core machines with 4-8 GB of ECC RAM. The best configuration depends on the needs of the workflow.
What are the most common input formats defined in Hadoop?
These are the most common input formats defined in Hadoop:
- TextInputFormat, which is the default input format.
What is InputSplit in Hadoop? Explain.
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.
How many InputSplits are made by the Hadoop framework?
For three input files of 64 KB, 65 MB and 127 MB, and assuming a 64 MB block size, Hadoop will make 5 splits as follows:
- One split for the 64 KB file
- Two splits for the 65 MB file, and
- Two splits for the 127 MB file
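The arithmetic behind these counts can be sketched in a few lines of Python. This is a simplified model that assumes one split per 64 MB block, rounded up. The function name `count_splits` and the 64 MB block size are assumptions made for illustration, and the real FileInputFormat applies additional rules when sizing the last split, so actual counts on a cluster can differ.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size: 64 MB


def count_splits(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the split count under a simple one-split-per-block model."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Integer ceiling division: a partially filled block still needs a split.
    return -(-file_size // block_size)


KB = 1024
MB = 1024 * KB

# The three file sizes from the answer above.
file_sizes = [64 * KB, 65 * MB, 127 * MB]
splits_per_file = [count_splits(size) for size in file_sizes]
total_splits = sum(splits_per_file)
print(splits_per_file, total_splits)
```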
What is the use of RecordReader in Hadoop?
An InputSplit defines a unit of work but does not describe how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key/value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
What is JobTracker in Hadoop?
JobTracker is a service within Hadoop that runs MapReduce jobs on the cluster.
What are the functionalities of JobTracker?
These are the main tasks of JobTracker:
- To accept jobs from clients.
- To communicate with the NameNode to determine the location of the data.
- To locate TaskTracker nodes with available slots.
- To submit the work to the chosen TaskTracker node and monitor the progress of each task.
A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
What is Map/Reduce job in Hadoop?
Map/Reduce is a programming paradigm that allows massive scalability across thousands of servers.
MapReduce actually refers to two separate and distinct tasks that Hadoop performs. The first is the map job, which takes a set of data and converts it into another set of data. The second is the reduce job, which takes the output of the map as its input and combines those data tuples into a smaller set of tuples.
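The two phases can be illustrated with a small in-memory sketch in Python. It is not Hadoop code: `run_map_reduce`, `map_words` and `reduce_counts` are hypothetical names, and the grouping step stands in for the shuffle that Hadoop performs between the map and reduce phases.

```python
from collections import defaultdict


def map_words(line):
    """Map step: turn one input line into (word, 1) tuples."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reduce step: collapse all values for one key into a single tuple."""
    return word, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle phase
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]
word_counts = run_map_reduce(lines, map_words, reduce_counts)
print(word_counts)
```

Seven mapped tuples go in and four reduced tuples come out, which is the "smaller set of tuples" the reduce step is meant to produce.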
What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run map/reduce jobs. It is a generic API that allows programs written in virtually any language to be used as the Hadoop mapper or reducer.
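A streaming mapper or reducer is simply a program that reads lines from standard input and writes tab-separated key/value lines to standard output; the reducer receives its input sorted by key. The sketch below shows that contract in Python. The function names are hypothetical, and the demo feeds the functions from an in-memory buffer rather than from a real job.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local demo: sorted() stands in for the sort the framework does between phases.
demo_input = io.StringIO("big data\nbig cluster\n")
mapped = list(mapper(demo_input))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

In a real job, each function would be wrapped in a script that iterates over `sys.stdin` and prints each result, and the two scripts would be passed to the streaming utility through its `-mapper` and `-reducer` options.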
What is a combiner in Hadoop?
A combiner is a mini-reduce process that operates only on the data generated by a single mapper. When the mapper emits data, the combiner receives it as input and sends its own output to the reducer.
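The effect of a combiner can be sketched as a local aggregation over one mapper's output before anything is sent across the network. The Python below is an illustration only; `map_words` and `combine` are hypothetical names, not Hadoop APIs.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's pairs before they are shuffled."""
    local_totals = Counter()
    for key, value in pairs:
        local_totals[key] += value
    return sorted(local_totals.items())


mapper_output = map_words("to be or not to be")
combined_output = combine(mapper_output)
print(len(mapper_output), "pairs before,", len(combined_output), "pairs after")
```

Here the reducer would receive four pairs instead of six. The saving grows with the amount of key repetition inside a single mapper's output.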
Is it necessary to know Java to learn Hadoop?
A background in any programming language such as C, C++, PHP, Python or Java is really helpful. If you have no Java experience, however, you will need to learn Java and also acquire a basic knowledge of SQL.
How to debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
- Using counters.
- Using the web interface provided by the Hadoop framework.
Is it possible to provide multiple inputs to Hadoop? If yes, explain.
Yes, it is possible. The input format class provides methods to add multiple directories as input to a Hadoop job.
What is the relation between job and task in Hadoop?
In Hadoop, a job is divided into multiple smaller parts known as tasks.
What is distributed cache in Hadoop?
Distributed cache is a facility provided by the MapReduce framework to cache files (text files, archives, etc.) needed by a job at execution time. The framework copies the necessary files to a slave node before any task executes on that node.
What commands are used to see all jobs running in the Hadoop cluster and kill a job in Linux?
hadoop job -list
hadoop job -kill jobID
What is the functionality of JobTracker in Hadoop? How many instances of a JobTracker run on a Hadoop cluster?
JobTracker is the service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of JobTracker in Hadoop:
- When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
- It locates TaskTracker nodes with available slots at or near the data.
- It assigns the work to the chosen TaskTracker nodes.
- The TaskTracker nodes notify the JobTracker when a task fails, and the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark that task as something to avoid.
How does the JobTracker assign tasks to the TaskTracker?
The TaskTracker periodically sends heartbeat messages to the JobTracker to confirm that it is alive. These messages also tell the JobTracker how many slots are available, so the JobTracker knows where it can schedule tasks.
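The heartbeat-driven hand-off can be modelled with a short Python sketch. It is a toy model under stated assumptions: the class `JobTrackerSketch`, its `heartbeat` method and the tracker and task names are all hypothetical, and real scheduling also takes data locality and failures into account.

```python
from collections import deque


class JobTrackerSketch:
    """Toy scheduler: hands out pending tasks when a tracker reports free slots."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.assignments = {}

    def heartbeat(self, tracker, free_slots):
        """Handle one heartbeat and return the tasks assigned to `tracker`."""
        assigned = []
        while free_slots > 0 and self.pending:
            assigned.append(self.pending.popleft())
            free_slots -= 1
        self.assignments.setdefault(tracker, []).extend(assigned)
        return assigned


job_tracker = JobTrackerSketch(["map-0", "map-1", "map-2"])
first = job_tracker.heartbeat("tracker-a", free_slots=2)
second = job_tracker.heartbeat("tracker-b", free_slots=0)  # busy: gets nothing
third = job_tracker.heartbeat("tracker-b", free_slots=1)
print(job_tracker.assignments)
```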
Is it necessary to write jobs for Hadoop in the Java language?
No, there are many ways to work with non-Java code. Hadoop Streaming allows any shell command to be used as a map or reduce function.
How is Hadoop different from other parallel computing systems?
Hadoop provides a distributed file system that lets you store and handle massive amounts of data on a cluster of machines while taking care of data redundancy. The primary benefit is that, because the data is stored across several nodes, it can be processed in a distributed manner: each node processes the data stored on it instead of spending time moving it over the network.
By contrast, a relational database system lets you query data in real time, but storing data in tables, records and columns becomes inefficient when the data is huge.
Hadoop also provides a way to build a column database with HBase, for run-time queries on rows.
In which modes can Hadoop be run?
Hadoop can run in three modes:
- Standalone Mode: The default mode of Hadoop. It uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. No custom configuration is required in the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
- Pseudo-Distributed Mode (Single-Node Cluster): This mode requires configuration in all three files mentioned above. All daemons run on one node, so the Master and Slave node are the same machine.
- Fully Distributed Mode (Multi-Node Cluster): This is the production mode of Hadoop (what Hadoop is known for), where data is distributed and processed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slave.
What is distributed cache and what are its benefits?
Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when needed. Once a file is cached for a specific job, Hadoop makes it available on each data node where map and reduce tasks are executing, both in the file system and in memory. Later, you can easily access and read the cached file and populate any collection (such as an array or hashmap) in your code.
Benefits of using distributed cache are:
- It distributes simple, read-only text/data files as well as complex types such as jars, archives and others. The archives are then un-archived at the slave node.
- It tracks the modification timestamps of the cached files, which signals that the files should not be modified while a job is executing.
What are the most common Input Formats in Hadoop?
The three most common input formats in Hadoop are:
- Text Input Format: the default input format in Hadoop.
- Key Value Input Format: used for plain text files in which each line is split into a key and a value.
- Sequence File Input Format: used for reading sequence files.
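The difference between the first two formats is in the records they hand to the mapper. The Python sketch below mimics that behaviour on a small byte string: the text format yields the byte offset of each line as the key, while the key/value format splits each line at the first tab. The function names are hypothetical, and the sequence file format is not modelled because it is a binary format.

```python
def text_input_records(data: bytes):
    """Text input style: key is the line's byte offset, value is the line."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


def key_value_records(data: bytes, separator: str = "\t"):
    """Key/value input style: split each line at the first separator."""
    for line in data.decode("utf-8").splitlines():
        key, _, value = line.partition(separator)
        yield key, value  # value is empty when the line has no separator


sample = b"id1\talice\nid2\tbob\n"
text_records = list(text_input_records(sample))
kv_records = list(key_value_records(sample))
print(text_records)
print(kv_records)
```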
Define DataNode. How does the NameNode handle DataNode failures?
A DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each DataNode sends a heartbeat message to the NameNode to signal that it is alive. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode dead or out of service and starts replicating the blocks that were hosted on it, so that they are hosted on other DataNodes. A BlockReport contains the list of all blocks on a DataNode, which tells the NameNode which blocks were stored on the dead DataNode.
The NameNode manages the replication of data blocks from one DataNode to another. During this process, the data is transferred directly between DataNodes and never passes through the NameNode.
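The detection and recovery logic can be sketched in Python as two small functions: one that flags nodes whose last heartbeat is older than the timeout, and one that plans a copy from a surviving replica to a node that does not yet hold the block. Node names, block names and function names are hypothetical, and the planner is deliberately naive (it picks the first candidate alphabetically).

```python
HEARTBEAT_TIMEOUT = 10 * 60  # seconds, matching the 10 minutes described above


def dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_re_replication(block_locations, dead, live_nodes):
    """Return {block: (source, target)} for every block that lost a replica."""
    plan = {}
    for block, holders in sorted(block_locations.items()):
        survivors = sorted(holders - dead)
        candidates = sorted(set(live_nodes) - holders)
        if len(survivors) < len(holders) and survivors and candidates:
            # Data moves DataNode to DataNode; the NameNode only issues the plan.
            plan[block] = (survivors[0], candidates[0])
    return plan


last_heartbeat = {"dn1": 0, "dn2": 900, "dn3": 950}  # seconds since start
now = 1000
dead = dead_nodes(last_heartbeat, now)
live = set(last_heartbeat) - dead
block_locations = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}}
plan = plan_re_replication(block_locations, dead, live)
print(dead, plan)
```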
What are the core methods of a Reducer?
The three core methods of a Reducer are:
- setup(): this method is called once at the start of the task and is used for configuring parameters such as input data size and the distributed cache.
public void setup(Context context)
- reduce(): the heart of the reducer; it is called once per key with the list of values associated with that key.
public void reduce(Key key, Iterable<Value> values, Context context)
- cleanup(): this method is called only once, at the end of the task, to clean up temporary files.
public void cleanup(Context context)
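The calling order of the three methods can be shown with a small Python model of a reduce task. It is an illustration rather than the Hadoop API: `SumReducer` and `run_reduce_task` are hypothetical names, and the input is assumed to be already sorted by key, as it would be after the shuffle.

```python
from itertools import groupby


class SumReducer:
    """Toy reducer that records the order in which its methods are called."""

    def setup(self):
        self.calls = ["setup"]
        self.output = {}

    def reduce(self, key, values):
        self.calls.append(f"reduce:{key}")
        self.output[key] = sum(values)

    def cleanup(self):
        self.calls.append("cleanup")


def run_reduce_task(reducer, sorted_pairs):
    """Call setup once, reduce once per key, then cleanup once."""
    reducer.setup()
    for key, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        reducer.reduce(key, (value for _, value in group))
    reducer.cleanup()
    return reducer.output


reducer = SumReducer()
totals = run_reduce_task(reducer, [("a", 1), ("a", 2), ("b", 5)])
print(reducer.calls, totals)
```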
What is SequenceFile in Hadoop?
Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:
- Uncompressed key/value records.
- Record compressed key/value records – only ‘values’ are compressed here.
- Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
What is the JobTracker's role in Hadoop?
The JobTracker's primary functions are resource management (managing the TaskTrackers), tracking resource availability, and task life cycle management (tracking task progress and handling fault tolerance).
- It is a process that usually runs on a separate node, not on a DataNode.
- It communicates with the NameNode to identify the data location.
- It finds the best TaskTracker nodes to execute the tasks.
- It monitors the individual TaskTrackers and reports the overall job status back to the client.
- It tracks the execution of MapReduce workloads local to the slave node.
What is the use of RecordReader in Hadoop?
Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a single record. For instance, if our input data is split like:
Row1: Welcome to
Row2: Intellipaat
It will be read as "Welcome to Intellipaat" using RecordReader.
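How a line that straddles two splits still ends up as one record can be sketched in Python. The model below follows the usual convention for line-oriented readers: a reader whose split does not start at offset 0 skips its first (possibly partial) line, and every reader finishes the last line it started even if that line runs past the end of its split. The function name `read_split` and the byte offsets are assumptions for illustration.

```python
def read_split(data: bytes, start: int, length: int):
    """Return the text lines owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the first line: the reader of the previous split owns it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        # Read the whole line, even if it runs past the end of the split.
        records.append(data[pos:stop].decode("utf-8"))
        pos = stop + 1
    return records


data = b"Welcome to Intellipaat\nHadoop\n"
# The boundary at byte 10 falls between "Welcome to" and " Intellipaat".
first_split = read_split(data, 0, 10)
second_split = read_split(data, 10, len(data) - 10)
print(first_split, second_split)
```

Every line is read exactly once: the first reader returns the full "Welcome to Intellipaat" record, and the second reader starts at the next complete line.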
What happens if you try to run a Hadoop job with an output directory that is already present?
Hadoop will throw an exception saying that the output directory already exists. To run the MapReduce job, you need to ensure that the output directory does not already exist in HDFS.
To delete the directory before running the job, you can use the shell:
hadoop fs -rmr /path/to/your/output/
Or via the Java API: FileSystem.get(conf).delete(outputDir, true);
How can you debug Hadoop code?
First, check the list of MapReduce jobs currently running. Next, check whether any orphaned jobs are running; if there are, you need to determine the location of the ResourceManager (RM) logs.
- Run "ps -ef | grep -i ResourceManager" and look for the log directory in the displayed result. Find the job ID in the displayed list and check whether there is any error message associated with that job.
- On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
- Log in to that node and run "ps -ef | grep -i NodeManager".
- Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.
How to configure Replication Factor in HDFS?
The hdfs-site.xml file is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml changes the default replication factor for all files placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Similarly, you can change the replication factor of all the files under a directory:
[training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir
How to compress mapper output but not the reducer output?
To achieve this compression, you should set:
What is the difference between a Map Side Join and a Reduce Side Join?
A map-side join is performed before the data reaches the map function, and it requires the input datasets to follow a strict structure. A reduce-side join (repartitioned join), on the other hand, is simpler than a map-side join because the input datasets need not be structured. However, it is less efficient, because the data has to go through the sort and shuffle phases, which adds network overhead.
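The trade-off can be illustrated with two small Python functions. Note the substitution: the map-side variant shown here is the in-memory lookup form, which assumes one dataset is small enough to be loaded by every mapper, rather than a merge of pre-sorted, identically partitioned inputs. All names and sample data are hypothetical.

```python
from collections import defaultdict


def map_side_join(orders, customers_by_id):
    """Join while mapping: look each order up in an in-memory customer table."""
    return [
        (cid, customers_by_id[cid], item)
        for cid, item in orders
        if cid in customers_by_id
    ]


def reduce_side_join(orders, customers):
    """Join while reducing: group both datasets by key, then pair them up."""
    groups = defaultdict(lambda: {"customer": [], "order": []})
    for cid, name in customers:
        groups[cid]["customer"].append(name)  # tagged as coming from customers
    for cid, item in orders:
        groups[cid]["order"].append(item)  # tagged as coming from orders
    joined = []
    for cid in sorted(groups):  # stands in for the sort and shuffle phases
        for name in groups[cid]["customer"]:
            for item in groups[cid]["order"]:
                joined.append((cid, name, item))
    return joined


customers = [(1, "Ann"), (2, "Bob")]
orders = [(1, "disk"), (2, "cable"), (1, "rack")]
map_joined = map_side_join(orders, dict(customers))
reduce_joined = reduce_side_join(orders, customers)
print(map_joined)
print(reduce_joined)
```

Both functions produce the same rows. The reduce-side version needs every record to be grouped by key first, which is the part that costs a shuffle on a real cluster.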
How can you transfer data from Hive to HDFS?
By writing a query such as:
hive> insert overwrite directory '/' select * from emp;
You can write the query for whatever data you want to export from Hive to HDFS. The output will be stored in part files in the specified HDFS path.
Which companies use Hadoop?
Yahoo! (the biggest contributor to the creation of Hadoop; the Yahoo search engine uses Hadoop), Facebook (which developed Hive for analysis), Amazon, Netflix, Adobe, eBay, Spotify and Twitter.