An introduction to Flume

Flume is similar to Kafka in many ways. It takes data from its source and distributes it to its destination. The differentiation here is that Flume is designed for high-throughput log streaming into Hadoop (HDFS or HBase) – it is not designed to ship out to a large number of consumers.

Flume, like Kafka is distributed, reliable and highly available, however, it’s not quite as slick as the method adopted by Kafka.

Let’s say for example, we have an HTTP data source flowing into Flume. We want to make it highly available and ensure that if a Flume agent were to go down, we would still have access to the data. 

If we are concerned about losing events because the Flume agent fails, we can use the file channel which provides us with checkpointing functionality. That means, if we do experience a failure, we don’t lose any events & it continues where it left off.

Another way to do this in Flume would be to create two source machines that could be written to in parallel by the HTTP source. You’d then need to tag the data as it traverses through the Flume process so that we can de-duplicate the data in the sink. As you can imagine, the de-duplication phase of the Flume cycle adds some fairly unnecessary overheads but it shows a slightly more creative approach to high availability.

With Flume, we have plenty of tried and tested, out of the box connectors for source data, removing the development overhead that is present with Kafka.

So let’s look a little deeper into the inner workings of Flume. If we look at the below diagram, we can see that we have a number of data generators on the left hand side. Data generators are the applications, servers, etc.. that generate log files or events, which we need to handle in real time. 

These get pushed into Flume Agents, which we will describe in more detail next. From the Flume agent, the log files are pushed into the collector, which aggregates them up and sends them onto their destination.

The Flume agent is a Java process which receives events (logs) from the clients and forwards it to its destination. It has 3 principal components. Note: a Flume agent can have multiple sources, channels and sinks. 

The source component receives data from the data generators and transfers it to a Flume Channel. 

The Channel is a transient data store, which stores the log data, until a Sink consumes them.

Finally, the sink is the function that stores the data into a centralised store (HDFS or HBase). 

Now we know what all the components are, we can start looking at how they’re deployed in the real world.

As you can see from the below diagram, Flume agents sit with the datasource.