Kafka Live Lock

Problem

Our team ran into a Kafka live lock. From observation, the symptoms were:

  • The consumer keeps disconnecting
  • The consumer keeps reading from the same offset (it can never commit successfully)
  • There is a huge message lag building up behind it
  • The produce rate is low but the consume rate is high (since the consumer keeps re-reading from the same offset)

Issue

From my investigation, the cause was max.poll.interval.ms, which by default is set to 300000 ms, i.e. 5 minutes.

We also had max.poll.records at its default of 500.

This means that a single Kafka consumer (not the consumer group) can, in the worst case, poll 500 records and is expected to finish all of them within 5 minutes. If it cannot, the consumer is considered inactive: it leaves the group and its partitions are handed to other consumers.
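The arithmetic behind this deadline can be sketched as follows; the numbers are the defaults described above, and the class and method names are illustrative only:

```java
// Rough per-record time budget implied by the default consumer settings.
public class PollBudget {
    // Average time each record in a full batch may take before the
    // consumer misses the poll deadline and is kicked from the group.
    static long perRecordBudgetMs(long maxPollIntervalMs, int maxPollRecords) {
        return maxPollIntervalMs / maxPollRecords;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // max.poll.interval.ms default (5 minutes)
        int maxPollRecords = 500;         // max.poll.records default
        // 300000 / 500 = 600 ms average per record
        System.out.println(perRecordBudgetMs(maxPollIntervalMs, maxPollRecords));
    }
}
```

So with the defaults, any batch averaging more than 600 ms per record will eventually trip the deadline.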

Because the consumer disconnects itself, and Spring by default uses AckMode.BATCH, a failure anywhere within the batch of up to 500 records means the offset is never committed.

Therefore the messages are stuck.

This was made worse by our exponential backoff retry, which in the worst case could take up to 2.5 minutes for a single event.
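To see why the retry backoff is fatal under these settings, consider this illustrative worst-case arithmetic (the 2.5-minute figure is from our retry policy above; everything else is assumed for the example):

```java
// Worst-case batch time when records hit the full retry backoff.
public class WorstCaseLag {
    // Total time for `records` records if each one takes `perRecordWorstMs`.
    static long worstCaseBatchMs(int records, long perRecordWorstMs) {
        return records * perRecordWorstMs;
    }

    public static void main(String[] args) {
        long perRecordWorstMs = 150_000; // 2.5 minutes of exponential backoff
        // Just two such records already consume the entire 5-minute
        // max.poll.interval.ms budget; a full batch of 500 is hopeless.
        System.out.println(worstCaseBatchMs(2, perRecordWorstMs));
        System.out.println(worstCaseBatchMs(500, perRecordWorstMs));
    }
}
```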

For more information: KafkaConsumer (kafka 0.10.2.1 API) (apache.org)

Solution

We decided to set the ack mode to AckMode.MANUAL_IMMEDIATE to make sure the listener commits each offset immediately after processing.
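A minimal sketch of that change, assuming Spring Kafka; the bean and type parameters are illustrative, not our actual configuration:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties.AckMode;

// Illustrative container factory configuration for manual, immediate acks.
@Configuration
public class KafkaAckConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Commit each record's offset as soon as the listener acknowledges it.
        factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
        return factory;
    }
}
```

With this mode the listener method takes an Acknowledgment parameter and calls acknowledge() once the record is processed, so each successful record is committed on its own instead of waiting for the whole batch.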

We tuned max.poll.records down to make sure we always pull just enough records to finish everything within the 5 minutes (otherwise we would have to retry). For our scenario, we set it to 3.
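The property change itself is small; here is a sketch of the consumer properties, where the broker address and group id are placeholders:

```java
import java.util.Properties;

// Illustrative consumer properties with the reduced batch size.
public class ConsumerTuning {
    static Properties tunedConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "example-group");           // placeholder group id
        // Pull only 3 records per poll so the batch always finishes
        // well inside the 5-minute max.poll.interval.ms budget.
        props.put("max.poll.records", "3");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedConsumerProps().getProperty("max.poll.records"));
    }
}
```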

We tuned our exponential backoff down so that the maximum retry delay for a single request is 30 seconds.
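A capped exponential backoff like ours can be sketched as follows; the 1-second initial delay is an assumed value for illustration, only the 30-second cap comes from our change:

```java
// Exponential backoff with a hard cap on the retry delay.
public class CappedBackoff {
    static long backoffMs(int attempt) {
        long initialMs = 1_000;  // assumed first retry delay
        long maxMs = 30_000;     // our cap: no retry waits longer than 30 s
        // Double the delay on every attempt, but never exceed the cap.
        long delay = initialMs << Math.min(attempt, 30);
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) {
        // Delays grow 1s, 2s, 4s, ... then flatten at 30s.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println(backoffMs(attempt));
        }
    }
}
```

Capping the delay bounds the worst-case time a single poisoned record can hold up the batch, instead of the 2.5 minutes it could take before.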