Welcome to Apache Flume Tutorials. The objective of these tutorials is to provide an in-depth understanding of Apache Flume.
In addition to the free Apache Flume tutorials, we will cover common interview questions, issues, and how-to's of Apache Flume.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. The project is highly reliable, without the risk of data loss. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for its agents.
The following components make up Apache Flume:
Event: A singular unit of data that is transported by Flume (typically a single log entry)
Source: The entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
Sink: The entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
Channel: The conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
Agent: Any physical Java virtual machine running Flume. It is a collection of sources, sinks, and channels.
Client: The entity that produces and transmits the Event to the Source operating within the Agent.
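To make the glossary concrete, here is a minimal sketch of an agent definition in Flume's properties-file format (the names `agent1`, `src1`, `ch1`, and `sink1` are illustrative, not prescribed):

```properties
# One agent (agent1) with one source, one channel, one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listens for raw text lines on a TCP port (netcat source)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer between the source and the sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: logs each event (useful while testing a flow)
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
```

Such a file is typically started with `flume-ng agent --conf conf --conf-file example.conf --name agent1`; each line a Client writes to port 44444 becomes one Event.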
Flume components interact in the following way:
-A flow in Flume starts from the Client.
-The Client transmits the Event to a Source operating within the Agent.
-The Source receiving this Event then delivers it to one or more Channels.
-One or more Sinks operating within the same Agent drain these Channels.
-Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange.
-When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows Sources to continue normal operation for the duration of the spike.
-The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
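As a sketch of such chaining (the hostname and port are assumptions), the first agent's Avro sink points at a second agent's Avro source:

```properties
# Agent 1: forwards events to agent 2 over Avro RPC
agent1.sinks.fwd.type = avro
agent1.sinks.fwd.hostname = agent2-host.example.com
agent1.sinks.fwd.port = 4545
agent1.sinks.fwd.channel = ch1

# Agent 2 (a separate process, usually on another host): receives the chained flow
agent2.sources.in.type = avro
agent2.sources.in.bind = 0.0.0.0
agent2.sources.in.port = 4545
agent2.sources.in.channels = ch1
```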
Apache Flume Architecture
Flume consists of three main components: Source, Channel, and Sink. A Flume agent is a JVM process containing these three components, through which data flow occurs. Given below is a representation of a simple Flume agent listening to a webserver and writing the data to HDFS.
The source component receives data from an external data source or from another Flume agent's sink. Data can arrive in different formats, so the source has to be configured to match the format of the input data. A source can also be configured to listen to several inputs, which helps in aggregating data from multiple origins into a single location.

The channel is temporary storage for events, acting as a queue between the source and the sink. The storage can be of two types: memory-based or disk-based. Memory-based storage achieves high throughput, but all buffered data is lost in case of a failure; disk-based storage offers lower throughput and greater reliability.
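The two channel types described above can be sketched in configuration as follows (directory paths and capacities are illustrative):

```properties
# Memory channel: high throughput, but buffered events are lost on agent failure
agent1.channels.memCh.type = memory
agent1.channels.memCh.capacity = 10000
agent1.channels.memCh.transactionCapacity = 100

# File channel: events persisted to disk, survive an agent restart
agent1.channels.fileCh.type = file
agent1.channels.fileCh.checkpointDir = /var/flume/checkpoint
agent1.channels.fileCh.dataDirs = /var/flume/data
```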
The sink retrieves events from the channel and either writes them to a file system or passes them to the next agent. The source and sink work asynchronously, which is why the channel is necessary. The architecture of a Flume agent is very flexible: the source can be designed to listen to various inputs, and the same goes for the channel and sink. Multiple sinks can be configured to send data to different destinations. We can build a multi-hop architecture for data transfer, a combination of Flume agents linked either to aggregate data from various sources or to distribute data to various destinations.
Flume follows a transactional approach to data transfer. The sink requests an event from the channel, the channel delivers it, and the event is removed from the channel only after the channel receives a confirmation from the sink. This enables reliable transfer and prevents data loss. The same mechanism is followed between a sink and the next source in a multi-hop system. Flume supports various data formats and mechanisms through its agents, including Avro, Thrift, Syslog, and Netcat.
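For example, a syslog-to-HDFS flow, combining two of the mechanisms mentioned above, might be sketched like this (the HDFS path and port are assumptions):

```properties
agent1.sources = sys1
agent1.channels = ch1
agent1.sinks = hdfs1

# Syslog source listening on TCP
agent1.sources.sys1.type = syslogtcp
agent1.sources.sys1.host = 0.0.0.0
agent1.sources.sys1.port = 5140
agent1.sources.sys1.channels = ch1

agent1.channels.ch1.type = memory

# HDFS sink writing events into date-partitioned directories
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/syslog/%Y-%m-%d
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs1.channel = ch1
```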
Features of Flume
Some of the notable features of Flume are as follows:
-Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase) efficiently.
-Using Flume, we can get the data from multiple servers immediately into Hadoop.
-Along with the log files, Flume is also used to import huge volumes of event data produced by social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and Flipkart.
-Flume supports a large set of source and destination types.
-Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
-Flume can be scaled horizontally.
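Contextual routing and fan-out flows are configured through channel selectors on a source. A hedged sketch using a multiplexing selector keyed on an assumed `region` event header (channel names are illustrative):

```properties
# Fan-out: one source feeding two channels, routed by event header value
agent1.sources.src1.channels = chUS chEU
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = region
agent1.sources.src1.selector.mapping.us = chUS
agent1.sources.src1.selector.mapping.eu = chEU
agent1.sources.src1.selector.default = chUS
```

Events whose `region` header is `eu` are routed to `chEU`; everything else falls through to the default channel, from which a separate sink drains each channel.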
Advantages of Flume
-Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
-When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between data producers and the centralized stores and provides a steady flow of data between them.
-Flume provides the feature of contextual routing.
-The transactions in Flume are channel-based: two transactions (one for the sender and one for the receiver) are maintained for each message, which guarantees reliable message delivery.
-Flume is reliable, fault tolerant, scalable, manageable, and customizable.