What’s all this about Flume + Kafka (Flafka)?

‘Flafka’ is the informal name for a pattern that combines Flume and Kafka, exploiting the strengths of both to create a more robust ingestion process.

By deploying Flafka, we take advantage of the out-of-the-box connectors available in Flume and use them to connect to all of our data sources. Flume then passes the data to a Kafka topic. As discussed earlier, Kafka replicates the topic partitions across a number of Kafka brokers (servers) and makes the data available to many consumers, which pull it from the topic. Flume is best at moving data somewhere in Hadoop (HDFS and HBase), so Kafka adds flexibility by serving many more kinds of consumers.
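As a rough illustration of the ingestion side, here is a minimal sketch of a Flume agent that tails a log file and publishes each line to a Kafka topic. The agent name (`ingest`), component names, file path, broker addresses and topic name are all placeholders invented for this example, and the property names assume the Kafka sink as documented for Flume 1.7 and later, so check them against the version you run.

```properties
# Hypothetical agent "ingest": one source, one channel, one sink.
ingest.sources = logSrc
ingest.channels = memCh
ingest.sinks = kafkaSink

# Exec source, one of Flume's built-in sources. The log path is a placeholder.
ingest.sources.logSrc.type = exec
ingest.sources.logSrc.command = tail -F /var/log/app/app.log
ingest.sources.logSrc.channels = memCh

# Memory channel buffers events between source and sink.
# Events held here are lost if the agent dies; a file channel trades speed for durability.
ingest.channels.memCh.type = memory
ingest.channels.memCh.capacity = 10000
ingest.channels.memCh.transactionCapacity = 1000

# Kafka sink. Broker addresses and topic name are placeholders.
ingest.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
ingest.sinks.kafkaSink.kafka.bootstrap.servers = broker1:9092,broker2:9092
ingest.sinks.kafkaSink.kafka.topic = app-logs
# -1 asks Kafka to wait for all in-sync replicas before acknowledging a write.
ingest.sinks.kafkaSink.kafka.producer.acks = -1
ingest.sinks.kafkaSink.channel = memCh
```

Assuming the file is saved as `ingest.properties`, an agent like this would typically be started with something along the lines of `flume-ng agent -n ingest -c conf -f ingest.properties`.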

Once Kafka has acknowledged a message from Flume and replicated it across brokers, the message survives the failure of an individual broker. So, Flafka combines the flexibility of Flume with the industrial-strength resilience of Kafka.

We can use Flume on one end of the ingestion process, or on both. In the above diagram, Flume is used to connect to data sources, and consumers then pull data from Kafka. However, as below, we could use Flume on both ends of the process (as a producer feeding Kafka and as a consumer reading from it), taking advantage not only of its connectors to data sources but also of its optimized connectors to HDFS and HBase.
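For the consuming end, a second Flume agent can read from the Kafka topic and write to HDFS. The sketch below is an assumption-laden example rather than a reference setup: the agent name (`deliver`), component names, broker addresses, topic, consumer group and HDFS path are placeholders, and the property names again assume the Kafka source as documented for Flume 1.7 and later.

```properties
# Hypothetical agent "deliver": reads from Kafka, writes to HDFS.
deliver.sources = kafkaSrc
deliver.channels = memCh
deliver.sinks = hdfsSink

# Kafka source. Brokers, topic and consumer group are placeholders.
deliver.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
deliver.sources.kafkaSrc.kafka.bootstrap.servers = broker1:9092,broker2:9092
deliver.sources.kafkaSrc.kafka.topics = app-logs
deliver.sources.kafkaSrc.kafka.consumer.group.id = flume-hdfs
deliver.sources.kafkaSrc.channels = memCh

# Memory channel between the Kafka source and the HDFS sink.
deliver.channels.memCh.type = memory
deliver.channels.memCh.capacity = 10000
deliver.channels.memCh.transactionCapacity = 1000

# HDFS sink. The namenode address and target directory are placeholders.
deliver.sinks.hdfsSink.type = hdfs
deliver.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/data/app-logs
# DataStream writes the raw event bodies instead of a SequenceFile.
deliver.sinks.hdfsSink.hdfs.fileType = DataStream
deliver.sinks.hdfsSink.channel = memCh
```

Because this agent joins Kafka as an ordinary consumer group, other consumers can read the same topic independently without affecting the HDFS delivery.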

To summarize:

  1. Flume provides a mechanism for writing streaming log files to HDFS or HBase.
  2. The tool comes with a number of pre-built connectors for data sources.
  3. Flume can be configured to be fault tolerant and is horizontally scalable.