An introduction to Kafka

Kafka is a low-latency messaging system. It takes data from one location (for example, application log files) and makes it available to other systems with very little delay (latency).

That may be useful in the example below, where Kafka makes our CRM and Google Analytics data available to Spark Streaming, which carries out some kind of data transformation before loading the result into HBase. From the diagram below, you can see that Kafka can have many producers and many consumers.

A producer is a system that produces data. It could be a data source, a log stream or even data input on a website.
A consumer sits at the other end of the pipe. It’s a system or service that consumes the messages.
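As an illustrative sketch (this is a toy model, not the real Kafka API), a topic can be thought of as an append-only log that producers write to and consumers read from at their own pace:

```python
# Toy model of a Kafka topic: an append-only log. Producers append;
# consumers read from a position (offset) of their choosing.
topic = []  # stands in for a single topic partition

def produce(log, message):
    """A producer only ever appends to the end of the log."""
    log.append(message)

def consume(log, offset):
    """A consumer reads everything from its offset onward and gets back
    the new offset to resume from next time."""
    return log[offset:], len(log)

produce(topic, {"event": "page_view"})
produce(topic, {"event": "purchase"})
messages, next_offset = consume(topic, 0)
```

Note that the consumer drives the read: the log just sits there, and the consumer decides when to pull and from where.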

Before we delve a little further into the details of Kafka, let’s talk about why it is a useful tool to have. 

If we look at the diagram below, you can see that it’s possible to have lots of data sources sending data directly into Hadoop. From there, the various systems can either pull the data or we can set up push functionality.

This presents a couple of problems:

  1. This kind of architecture lends itself to batch processing rather than the processing of streaming data, which leads to latency between ingestion and insight.
  2. Each additional consumer of the data (e.g. CRM, Tableau) has different connectivity requirements (some may not support log streaming, or they may use different protocols, languages and connection methods), which becomes very confusing as the number of consumers grows.
  3. If multiple consumers require the same data, multiple feeds would need to be configured, duplicating workload; with Kafka, a single topic could instead be subscribed to by multiple consumers.

If we take a look at the diagram below, we have the same data sources pushing to the same systems, but this time using Kafka. Each of the data sources publishes to its own Kafka topic, to which consumers can subscribe and from which they pull the data.

As we said, the different consumers have different capabilities. For example, System X may not be able to ingest streaming data and may require data to be loaded in batches. So, this particular topic could have a retention period of 7 days, with a weekly batch pull configured. Other topics may have a retention of just a few minutes, as their systems ingest the data in real time. Kafka retention can be set based on time or log size.
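These retention policies correspond to real topic-level configuration keys: `retention.ms` for time-based retention and `retention.bytes` for size-based retention (applied per partition). For example, the 7-day topic above could be configured with an override along these lines:

```properties
# Topic-level overrides (set at topic creation or via kafka-configs.sh)
retention.ms=604800000       # time-based: keep messages for 7 days
retention.bytes=1073741824   # size-based: cap each partition's log at ~1 GiB
```

Whichever limit is hit first wins; segments older or larger than the limits become eligible for deletion.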

NOTE: Kafka does not push data to consumers; a consumer must subscribe to a Kafka topic and pull data from it.

As a simpler example, you can see below that Kafka producers write to a Kafka topic: our CRM writes to the ‘Customers’ topic, while Google Analytics writes to a ‘Marketing’ topic. We can conceptually think of these topics as two pipelines within Kafka.

The consumers of the data can subscribe to one or many topics. So, in the diagram above, Spark only subscribes to the Customers topic, while the job that moves data to HDFS subscribes to both the Customers and Marketing topics.
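Continuing the toy model (again, illustrative rather than the real client API), the fact that each consumer keeps its own offset per topic is what lets Spark and the HDFS job read the same log independently, without interfering with each other:

```python
# Toy model: two topics, and a per-(consumer, topic) offset so each
# consumer progresses through the same log independently.
topics = {
    "customers": ["cust-msg-1", "cust-msg-2", "cust-msg-3"],
    "marketing": ["mkt-msg-1"],
}

# (consumer, topic) -> next unread position in that topic's log
offsets = {
    ("spark", "customers"): 0,
    ("hdfs-job", "customers"): 0,
    ("hdfs-job", "marketing"): 0,
}

def poll(consumer, topic):
    """Return all unread messages for this consumer and advance its offset."""
    start = offsets[(consumer, topic)]
    messages = topics[topic][start:]
    offsets[(consumer, topic)] = start + len(messages)
    return messages
```

Because offsets are tracked per consumer, the HDFS job reading a message does not "use it up" for Spark; each subscriber sees the full stream.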

Kafka topics are split into partitions and distributed among Kafka brokers (the name for the servers, or nodes, in a Kafka cluster). This enables us to parallelize topics: reads and writes happen in parallel, giving far more capacity. Kafka maintains a position in the log per consumer (which we call the offset), so we know which messages have been consumed by each consumer.
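Which partition a message lands on is decided by the producer. The default partitioner in the real clients hashes the message key (using murmur2) so that the same key always maps to the same partition; a rough sketch of the idea, using CRC32 purely for illustration:

```python
import zlib

def partition_for(key: bytes, num_partitions: int = 3) -> int:
    # Kafka's default partitioner hashes the message key (murmur2 in the
    # real clients; CRC32 here just for illustration), so messages with
    # the same key always land on the same partition and stay ordered
    # relative to each other.
    return zlib.crc32(key) % num_partitions
```

This is why per-key ordering is preserved: all messages for `customer-42` hash to the same partition, and within a partition Kafka keeps strict order.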

One topic can have several partitions, each of which can be placed on a separate broker. This enables multiple consumers to read the same topic from multiple machines in parallel, leading to very high message-processing throughput.

Let’s look at the above diagram for a second. As you can see, each Kafka broker can hold several partitions. The purple blocks are the leaders; the grey blocks are the replicas. Whenever a write happens to a topic, it’s written to the leader, which coordinates replication to the replicas. If a leader fails, one of the replicas takes over as leader.
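The failover behaviour can be sketched as follows; the dictionary shape and function here are illustrative only, not how Kafka actually represents partition metadata:

```python
# Toy sketch of leader failover: each partition has one leader and a set
# of replicas. If the leader's broker dies, a replica is promoted.
partition = {"leader": "broker-1", "replicas": ["broker-2", "broker-3"]}

def handle_broker_failure(partition, failed_broker):
    if partition["leader"] == failed_broker and partition["replicas"]:
        # promote the first surviving replica to leader
        partition["leader"] = partition["replicas"].pop(0)
    else:
        # a failed replica simply drops out of the replica set
        partition["replicas"] = [
            b for b in partition["replicas"] if b != failed_broker
        ]
    return partition
```

In real Kafka the replacement leader is chosen from the in-sync replicas (those fully caught up with the old leader), which is what makes the promotion safe.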

The above highlights that Kafka is highly available, resilient to node failures and supports automatic recovery, making it an excellent option and explaining why it has become so popular across the industry.

Key advantages of Kafka:

  • High throughput at low latency for high-velocity, high-volume data, with relatively low hardware requirements. This is achieved through parallelization across multiple machines for both producers and consumers.
  • Fault tolerance against node failure through partitioning and replication. If a node were to fail, a replica of its partitions on another node would be promoted to leader, maintaining availability.
  • Message durability – messages persist on disk for the life of the retention period you define.
  • Kafka scales horizontally, meaning you can add capacity for additional demand with relative ease, at low cost and without downtime.
  • High-concurrency writes and reads to a Kafka topic.
  • Allows us to integrate with multiple consumers, regardless of the language they’re written in.
  • Allows us to choose between at-least-once or at-most-once delivery.
  • If there is a network bottleneck, Kafka allows us to compress messages (using Gzip or Snappy) either in real time or in batches.
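Compression is controlled by the real producer configuration key `compression.type`; for example:

```properties
# Producer configuration: compress message batches before sending
compression.type=gzip   # 'snappy' is the other codec mentioned above
```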
  • Kafka uses ZooKeeper to know which brokers are alive and in sync (i.e. where replication has caught up across all nodes).
  • Consumer offsets (which messages a consumer has / hasn’t read) were historically tracked in ZooKeeper; modern clients store them in an internal Kafka topic.

So, what does Kafka give us? It gives us a highly resilient messaging platform with massive throughput potential, guaranteed message ordering (within a partition, not across a topic) and a guarantee that messages will not be lost, so long as at least one replica remains alive.