Kafka Live Lock
Problem
Our team ran into a Kafka livelock. From observation, the symptoms were:
- The consumer keeps disconnecting
- The consumer keeps reading from the same offset (it never commits successfully)
- A huge message lag builds up behind it
- The produce rate is low but the consume rate is high (since the consumer keeps re-reading the same offset)
Issue
From my investigation, the root cause was max.poll.interval.ms, which defaults to 300000 ms, i.e. 5 minutes. We also had max.poll.records at its default of 500.
This means that, in the worst case, a single Kafka consumer (not the whole consumer group) can poll 500 records and must finish processing all of them within 5 minutes. If it cannot, the consumer is considered inactive and is kicked out of the group, and its partitions are rebalanced to the other consumers.
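To make the time budget concrete, here is a small sketch (plain Java; the values are the Kafka defaults mentioned above) of how much processing time each record gets, on average, before the next poll() is overdue:

```java
public class PollBudget {
    public static void main(String[] args) {
        // Kafka consumer defaults discussed above
        int maxPollIntervalMs = 300_000; // max.poll.interval.ms = 5 minutes
        int maxPollRecords = 500;        // max.poll.records

        // Average time budget per record before the consumer is considered dead
        int budgetPerRecordMs = maxPollIntervalMs / maxPollRecords;
        System.out.println("Per-record budget: " + budgetPerRecordMs + " ms"); // 600 ms
    }
}
```

Any processing step that routinely exceeds ~600 ms per record (retries included) puts the whole batch at risk.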
The consumer leaves the group mid-batch, and Spring by default uses AckMode.BATCH, which commits offsets only after the whole batch is processed; if the batch of 500 records fails to finish, nothing is committed. Therefore the messages are stuck at the same offset.
This was compounded by our exponential backoff retry, which in the worst case could take up to 2.5 minutes for a single event.
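A quick back-of-the-envelope check (a sketch using the figures above) shows why the batch could never finish in time:

```java
public class WorstCase {
    public static void main(String[] args) {
        int maxPollIntervalSec = 300;    // max.poll.interval.ms = 5 minutes
        int maxPollRecords = 500;        // max.poll.records
        int worstRetrySecPerEvent = 150; // exponential backoff worst case: 2.5 minutes

        // Worst case: every record in the batch hits the full retry backoff
        long worstBatchSec = (long) maxPollRecords * worstRetrySecPerEvent;
        System.out.println("Worst-case batch: " + worstBatchSec
                + " s vs budget " + maxPollIntervalSec + " s"); // 75000 s vs 300 s
    }
}
```

In the worst case, just two such events (2 × 150 s) already exhaust the entire 5-minute budget, so the consumer is evicted long before it can commit.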
For more information: KafkaConsumer (kafka 0.10.2.1 API) (apache.org)
Solution
We decided to set the ack mode to AckMode.MANUAL_IMMEDIATE to make sure the listener commits each offset immediately when it acknowledges a record.
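As a sketch of what this looks like in Spring Kafka (bean, class, and topic names here are illustrative, not our actual code; requires spring-kafka on the classpath):

```java
@Configuration
public class KafkaConfig {
    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
        factory.setConsumerFactory(consumerFactory);
        // Commit the offset as soon as the listener calls acknowledge()
        factory.getContainerProperties()
               .setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        return factory;
    }
}

@Component
class ExampleListener { // hypothetical listener for illustration
    @KafkaListener(topics = "example-topic") // hypothetical topic name
    public void onMessage(String message, Acknowledgment ack) {
        process(message);
        ack.acknowledge(); // offset for this record is committed immediately
    }

    private void process(String message) { /* ... */ }
}
```

With MANUAL_IMMEDIATE, a crash or rebalance mid-batch only re-delivers the records that were never acknowledged, instead of the whole batch.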
We tuned max.poll.records down to make sure we always pull just few enough records to finish everything within 5 minutes (otherwise the batch has to be re-read). For our scenario, we set it to 3.
We tuned our exponential backoff down so that the worst-case retry time for a single request is 30 seconds.
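The same budget check as before (a sketch, using our tuned values) confirms the new settings fit comfortably inside the default 5-minute poll interval:

```java
public class TunedBudget {
    public static void main(String[] args) {
        int maxPollRecords = 3;       // tuned max.poll.records
        int maxRetrySec = 30;         // capped worst-case retry per request
        int maxPollIntervalSec = 300; // max.poll.interval.ms left at the 5-minute default

        // Even if every record in the batch hits the full 30 s retry cap
        int worstBatchSec = maxPollRecords * maxRetrySec; // 90 s
        System.out.println("Worst-case batch: " + worstBatchSec
                + " s (budget " + maxPollIntervalSec + " s)");
    }
}
```

90 seconds in the worst case leaves a wide margin before the consumer would be considered inactive again.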