A Gentle Introduction to Apache Flume

Wenfei Yan
5 min read · Oct 29, 2020


Efficient data collection, aggregation and movement.

What is Apache Flume?

Apache Flume is a powerful distributed data integration tool that helps collect and aggregate large amounts of streaming data from various sources and then move it to a centralized data store, usually HDFS. While the most common use case for Apache Flume is handling huge volumes of log files, it can also transport event data generated by social media platforms such as Twitter.

Architecture

[Figure: basic architecture of a Flume agent, from the Flume 1.9.0 User Guide [1]]

The figure above shows the basic architecture of a Flume agent, which has three main components: source, channel, and sink.

  • A Flume source consumes event data from an external source, such as a web server. Various data formats are supported, including Avro, NetCat, Kafka, etc.
  • When an event is received by a Flume source, it is stored in one or more Flume channels. A channel acts as temporary storage for the event data and can be backed by the local filesystem, memory, etc.
  • A Flume sink removes event data from the channel and either writes it to an external destination such as HDFS or forwards it to the next Flume source in the flow.

There are a few other Flume components that support more complicated flows. A channel selector determines which channel(s) an event should be sent to, and an interceptor can modify or drop events on the fly, which is useful for simple ETL work, as sketched below.
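
As a rough illustration (the agent and component names a1, r1, c1, c2 below are our own placeholders, not part of the example later in this post), a timestamp interceptor and a multiplexing channel selector could be configured on a source like this:

# Interceptor i1 stamps each event with a "timestamp" header
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Route events to channels based on the value of the "state" header;
# events with no matching value fall back to the default channel
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CA = c1
a1.sources.r1.selector.mapping.NY = c2
a1.sources.r1.selector.default = c1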

Use Case in a Movie Streaming Scenario

In a movie streaming scenario, it is common to have a large number of web servers handling heavy traffic from streaming and recommendation requests. Flume can be used to collect logs from these different clients (i.e. web servers) and stream them to the same destination. The destination could be an HDFS cluster, where the log files are later consumed by the analytics team with tools like Spark to obtain business insights. The destination could even be an Elasticsearch cluster, so that events can be displayed in the Kibana graphical interface in near real time, as sketched below.
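
For the Elasticsearch variant, the sink configuration might look roughly like the following sketch, which follows the ElasticSearchSink options in the Flume user guide (the host, index, and cluster names are illustrative assumptions):

# Sketch of an Elasticsearch sink (names are placeholders)
a1.sinks.es1.type = elasticsearch
a1.sinks.es1.hostNames = 127.0.0.1:9300
a1.sinks.es1.indexName = movie_logs
a1.sinks.es1.indexType = log
a1.sinks.es1.clusterName = movie_cluster
a1.sinks.es1.batchSize = 500
a1.sinks.es1.channel = c1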

A Simple Example

Let’s look at a simple example where a Flume agent collects streaming log data from Kafka and stores it in an HDFS cluster. The data are synthetic logs from the movie streaming scenario. Assume we have a Kafka bootstrap server at localhost:9092, and that we have set up a pseudo-distributed HDFS cluster locally on port 9000.

We want the Flume agent to pull data from two different topics, movielog1 and movielog2, so we configure two sources with the same bootstrap server address localhost:9092 but different topics, as shown below. Note that every agent configuration must first declare its components by name.

# Name the components of this agent (required for the agent to start)
flume_example.sources = r1 r2
flume_example.channels = c1
flume_example.sinks = k1

# Source r1 reading movielog1
flume_example.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
flume_example.sources.r1.kafka.bootstrap.servers = localhost:9092
flume_example.sources.r1.kafka.topics = movielog1
flume_example.sources.r1.kafka.consumer.group.id = flume_example

# Source r2 reading movielog2
flume_example.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
flume_example.sources.r2.kafka.bootstrap.servers = localhost:9092
flume_example.sources.r2.kafka.topics = movielog2
flume_example.sources.r2.kafka.consumer.group.id = flume_example

The Flume agent should write to the HDFS cluster at hdfs://localhost:9000, and we configure the sink path so that logs from different topics are written to different directories. The Kafka source records each event's topic in the topic header, which the path below references via %{topic}.

# HDFS sink configuration
flume_example.sinks.k1.type = hdfs
# Partition output by topic (from the event header) and by date
flume_example.sinks.k1.hdfs.path = hdfs://localhost:9000/kafka/%{topic}/%y-%m-%d
# Roll files every 60 seconds; disable size- and count-based rolling
flume_example.sinks.k1.hdfs.rollInterval = 60
flume_example.sinks.k1.hdfs.rollSize = 0
flume_example.sinks.k1.hdfs.rollCount = 0
# Write raw events instead of the default SequenceFile format
flume_example.sinks.k1.hdfs.fileType = DataStream

Finally, we use a simple memory channel and bind the sources and the sink to it.

# Use a channel which buffers events in memory
flume_example.channels.c1.type = memory
flume_example.channels.c1.capacity = 10000
flume_example.channels.c1.transactionCapacity = 1000

# Bind the sources and sink to the channel
flume_example.sources.r1.channels = c1
flume_example.sources.r2.channels = c1
flume_example.sinks.k1.channel = c1
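
Assuming the configuration above is saved as flume_example.conf (the file name is our choice), the agent can be launched with the standard flume-ng command:

bin/flume-ng agent --conf conf --conf-file flume_example.conf --name flume_example -Dflume.root.logger=INFO,console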

If we run this agent for a while, we will see new date-stamped directories appear in HDFS under /kafka/movielog1 and /kafka/movielog2.

The FlumeData.* files inside these directories contain the logs of the movie streaming requests. Inspecting the first ten lines of /kafka/movielog2/FlumeData.1603933741968, we can see that each log line records a user requesting the mpg files of some movie, which means the user is currently watching that movie.

Note

This example is for demo purposes only, and what it shows is usually not the best configuration for production. In practice, a fan-in architecture is more desirable when there are many sources: ideally each source has its own agent, these agents forward events to a smaller number of aggregator agents, and the aggregator agents in turn forward the events to a final destination agent. This architecture is more scalable and allows simple failover, as sketched below.
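
As a minimal sketch of one hop in such a fan-in topology (the host, port, and agent names here are illustrative), each web-server agent forwards events through an Avro sink, and an aggregator agent receives them with a matching Avro source:

# On each web server: forward events to the aggregator over Avro
web_agent.sinks = fwd
web_agent.sinks.fwd.type = avro
web_agent.sinks.fwd.hostname = aggregator.example.com
web_agent.sinks.fwd.port = 4545
web_agent.sinks.fwd.channel = c1

# On the aggregator: receive events from all web-server agents
agg_agent.sources = in
agg_agent.sources.in.type = avro
agg_agent.sources.in.bind = 0.0.0.0
agg_agent.sources.in.port = 4545
agg_agent.sources.in.channels = c1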

Pros and Cons

Advantages

  • Flume fits the need of collecting log data in the scale-out architecture that is dominant in this cloud computing era. With the help of Flume, we can easily pull logs from many different sources.
  • Flume is easy to install and set up. The only software prerequisite for running Flume is a Java installation.
  • Flume supports many popular sources (e.g. Avro, Syslog, Kafka) and destination types (e.g. HDFS, Hive, Elasticsearch) out of the box. It is also open to customization of every component, including sources, sinks, and interceptors. Hence, it is fairly easy to integrate Flume with other data streaming tools.
  • Flume is well documented, with many good examples and classic topologies. The documentation helps new users get started with Flume quickly.

Limitations

  • Flume does not guarantee the order of events. That is, events may arrive at the sink in a different order than they were sent from the sources, which is undesirable in some cases.
  • The scalability and reliability of Flume depend heavily on the topology and channel type. If the system is not carefully designed, performance can degrade; for example, a memory channel can lose buffered events on machine failure. A durable file channel, sketched below, trades some throughput for reliability.
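
As one mitigation, Flume ships a file channel that persists events to local disk so they survive an agent crash. A minimal sketch (the agent name and directories are our own placeholders):

# Durable channel: events are checkpointed and stored on local disk
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data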

References

[1] Apache Flume 1.9.0 User Guide, https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
