Kafka

Introduction

Kafka is a distributed event streaming platform. It is suited to low-latency processing of data streams.

Kafka uses a pull model: each Kafka consumer polls a Kafka topic for new messages rather than having the broker push them.

Kafka has 2 modes

  • ZooKeeper mode
  • Kafka Raft (KRaft) mode, which manages consensus inside Kafka itself
    • KRaft is the better choice since it is lighter and easier to scale
    • A new node joining the cluster doesn't need to reload state from ZooKeeper. The cluster metadata is already held in memory by KRaft
    • ZooKeeper is hard to scale because it is a separate cluster of its own that has to be run alongside Kafka.

Architecture

A producer can publish a message to a Kafka topic. It can publish to multiple topics.


Publishing model


When publishing a message, the producer should specify a key, which is hashed to select a partition. The producer keeps an internal metadata map of the topic's partitions, so it knows which partition to send to. The producer does the hashing itself and sends the message directly to the chosen partition.
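
A minimal sketch of the key-to-partition step, to make the hashing concrete. It is not Kafka's actual partitioner: Kafka's default partitioner hashes the key bytes with murmur2, while this sketch swaps in `zlib.crc32` as a stand-in. The `metadata` map, the topic name `orders`, and the function `partition_for_key` are hypothetical names used only for illustration.

```python
import zlib


def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index using a stable hash."""
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key) % num_partitions


# Hypothetical producer-side metadata map: topic name -> partition count.
metadata = {"orders": 3}

for key in (b"user-1", b"user-2", b"user-1"):
    partition = partition_for_key(key, metadata["orders"])
    print(key.decode(), "-> partition", partition)
```

The point of the sketch is that the same key always lands on the same partition, which is what keeps messages for one key in order.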

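If the producer does not specify a key:

  • Older Kafka (< 2.4): messages are distributed across partitions in round-robin order
  • Newer versions: the sticky partitioner is used, which sends a whole batch to one partition before moving on to another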
graph TD
    subgraph Round_Robin_Old[Round robin]
    P1[Producer] --> M1((M1))
    P1 --> M2((M2))
    P1 --> M3((M3))
    M1 --> Part0[Partition 0]
    M2 --> Part1[Partition 1]
    M3 --> Part2[Partition 2]
    end

    subgraph Sticky_New[Sticky partitioner]
    P2[Producer] --> B1[Batch: M1, M2, M3]
    B1 --> PartA[Partition 0]
    P2 -.-> B2[Next Batch]
    B2 -.-> PartB[Partition 1]
    end
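
The two strategies in the diagram can be sketched roughly as follows. This is a simplified illustration, not Kafka's implementation: the class names are hypothetical, a "batch" is modelled as a fixed message count (`batch_size`), and the sticky sketch moves to the next partition in order, whereas the real sticky partitioner picks its next partition differently and also switches when a batch is sent for other reasons (for example a linger timeout).

```python
from itertools import count


class RoundRobinPartitioner:
    """Keyless messages rotate through the partitions one by one."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._counter = count()

    def partition(self) -> int:
        return next(self._counter) % self.num_partitions


class StickyPartitioner:
    """Keyless messages stay on one partition until the batch is full."""

    def __init__(self, num_partitions: int, batch_size: int) -> None:
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self._current = 0   # partition the current batch is going to
        self._in_batch = 0  # messages already placed in the current batch

    def partition(self) -> int:
        if self._in_batch == self.batch_size:
            # Batch is full: start a new batch on the next partition.
            self._current = (self._current + 1) % self.num_partitions
            self._in_batch = 0
        self._in_batch += 1
        return self._current


rr = RoundRobinPartitioner(num_partitions=3)
sticky = StickyPartitioner(num_partitions=3, batch_size=3)
rr_result = [rr.partition() for _ in range(6)]
sticky_result = [sticky.partition() for _ in range(6)]
print("round robin:", rr_result)   # [0, 1, 2, 0, 1, 2]
print("sticky:     ", sticky_result)  # [0, 0, 0, 1, 1, 1]
```

With round robin, every batch ends up holding a single message per partition. With the sticky approach, batches fill up faster, so fewer and larger requests are sent.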

Consuming model


The same topic can be consumed by different consumers in a consumer group so that messages are processed in parallel. Each Kafka consumer stores its Kafka offset (the position of the next message to read) and increments it as it reads.
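
A toy in-memory sketch of the offset idea, assuming a partition behaves like an append-only list. `PartitionLog` and `Consumer` are hypothetical classes for illustration and do not correspond to a real Kafka client API.

```python
class PartitionLog:
    """Append-only list standing in for one topic partition."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def append(self, message: str) -> None:
        self.messages.append(message)

    def read(self, offset: int, max_records: int) -> list[str]:
        # Return up to max_records messages starting at the given offset.
        return self.messages[offset:offset + max_records]


class Consumer:
    """Pulls messages and remembers the offset of the next one to read."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 2) -> list[str]:
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)  # advance past what was just read
        return records


log = PartitionLog()
for message in ("a", "b", "c"):
    log.append(message)

consumer = Consumer(log)
first = consumer.poll()   # ['a', 'b'], offset is now 2
second = consumer.poll()  # ['c'], offset is now 3
third = consumer.poll()   # [], nothing new, offset stays at 3
print(first, second, third, consumer.offset)
```

Because the consumer owns its position, the log itself never changes when a message is read. A second consumer with its own offset could read the same messages independently.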

[!danger]
1 topic partition can be consumed by only 1 consumer within the same consumer group

[!tip]
1 consumer can also subscribe to multiple topics and multiple partitions
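
The two rules above can be illustrated with a small assignment sketch. It hands out partitions in round-robin order, which is only one possible strategy and is not claimed to be what a given Kafka client does by default. `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Give each partition to exactly one consumer in the group."""
    assignment: dict[str, list[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# More partitions than consumers: some consumers own several partitions.
many = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])
print(many)  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}

# More consumers than partitions: the extra consumer stays idle.
few = assign_partitions([0, 1], ["c1", "c2", "c3"])
print(few)   # {'c1': [0], 'c2': [1], 'c3': []}
```

The second case shows why adding consumers beyond the partition count does not add parallelism within a group.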

Kafka scaling
