- What are delivery guarantees in Kafka?
- What do in-sync replicas mean in Kafka, and what are the related configs?
- What happens if a consumer doesn't process a message in time?
- If the number of consumers is greater than the number of partitions, what will happen?
- There is increasing Kafka lag, but the service works as expected. What are possible options to solve the issue?
- Is it possible to change the number of partitions while downstream components are processing messages?
- How to calculate the optimal number of partitions per Kafka topic?
What are delivery guarantees in Kafka?
- At-most-once: Messages may be lost but are never redelivered (e.g., acks=0 on the producer, or offsets committed before processing on the consumer). Fastest, least reliable.
- At-least-once: Messages are never lost but may be duplicated (e.g., acks=all with retries, and offsets committed after processing). The usual choice for most production use cases.
- Exactly-once: Each message is delivered and processed exactly once, using idempotent producers and transactions. Strongest guarantee, highest overhead.
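On the consumer side, the difference between at-most-once and at-least-once comes down to whether the offset is committed before or after processing. The sketch below is a toy in-memory simulation, not Kafka client code; `run_consumer`, `commit_first`, `crash_at`, and the list standing in for a partition are hypothetical names chosen for illustration.

```python
def run_consumer(log, commit_first, crash_at):
    """Simulate a single consumer that crashes once and then restarts.

    log:          list of messages (stand-in for one partition)
    commit_first: True  = commit the offset before processing (at-most-once)
                  False = commit the offset after processing (at-least-once)
    crash_at:     offset of the message during which the crash happens
    Returns the messages that were actually processed, in order.
    """
    processed = []
    committed = 0      # next offset to read after a (re)start
    crashed = False

    while committed < len(log):
        offset = committed
        msg = log[offset]
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True   # crash after commit, before processing
                continue         # restart resumes past the lost message
            processed.append(msg)
        else:
            processed.append(msg)
            if offset == crash_at and not crashed:
                crashed = True   # crash after processing, before commit
                continue         # restart reads the same offset again
            committed = offset + 1
    return processed


lost = run_consumer(["a", "b", "c"], commit_first=True, crash_at=1)
duplicated = run_consumer(["a", "b", "c"], commit_first=False, crash_at=1)
```

Here `lost` misses message "b", while `duplicated` contains it twice. Exactly-once is not modeled: it additionally needs idempotent producers and transactions, or an idempotent sink.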
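What do in-sync replicas mean in Kafka, and what are the related configs?
In-sync replicas (ISR) are the replicas of a partition, the leader included, that are caught up with the leader: a follower stays in the ISR as long as it has not fallen behind for longer than replica.lag.time.max.ms. By default, only ISR members can become the new leader after a failure, which prevents data loss. Key configs: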
- min.insync.replicas – minimum ISR size required for a write with acks=all
- replica.lag.time.max.ms – max time a follower can lag before being removed from the ISR
- unclean.leader.election.enable=false – prevents non-ISR replicas from becoming leader (avoids data loss)
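How min.insync.replicas interacts with acks can be sketched as a small decision function. This is a simplified model of the broker's behavior, not broker code; the function names are hypothetical.

```python
def write_accepted(acks, isr_size, min_insync_replicas):
    """Return True if the leader would accept a produce request.

    Simplified model: min.insync.replicas is only enforced for acks=all;
    with acks=0 or acks=1 the leader alone is enough.
    """
    if acks == "all":
        return isr_size >= min_insync_replicas
    return isr_size >= 1   # the leader itself counts as an ISR member


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can drop out while acks=all writes still succeed."""
    return replication_factor - min_insync_replicas
```

With a replication factor of 3 and min.insync.replicas=2, one replica can fail without blocking acks=all producers; if the ISR shrinks to 1, those writes are rejected with a not-enough-replicas error until a follower catches up.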
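What happens if a consumer doesn't process a message in time?
The consumer is considered failed, is removed from the consumer group, and a rebalance reassigns its partitions to the remaining members. This is controlled by max.poll.interval.ms (default 300000 ms, i.e. 5 minutes): if poll() isn't called again within this time, the consumer leaves the group. Messages whose offsets were not committed are redelivered to the partition's new owner, so slow processing can also cause duplicates.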
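A quick way to reason about this timeout is to compare the worst-case time to process one batch with max.poll.interval.ms. The sketch below is plain arithmetic with hypothetical helper names; the per-record processing time is something you would have to measure, and the 50% headroom is an arbitrary assumption.

```python
def fits_poll_interval(max_poll_records, per_record_ms, max_poll_interval_ms=300_000):
    """Return True if a full batch can be processed before the next poll() is due."""
    worst_case_batch_ms = max_poll_records * per_record_ms
    return worst_case_batch_ms < max_poll_interval_ms


def max_safe_batch(per_record_ms, max_poll_interval_ms=300_000, headroom=0.5):
    """Largest max.poll.records that uses at most `headroom` of the interval."""
    return int(max_poll_interval_ms * headroom // per_record_ms)
```

For example, with the default max.poll.records of 500, a handler that takes 1000 ms per record needs 500 seconds per batch and would be kicked out of the group; either lower max.poll.records (here to about 150) or raise max.poll.interval.ms.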
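If the number of consumers is greater than the number of partitions, what will happen?
The extra consumers will remain idle because each partition can be assigned to at most one consumer in a group. They act as standbys and pick up partitions if an active consumer leaves the group.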
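The one-consumer-per-partition rule can be illustrated with a toy round-robin assignment. This is a stand-in for the group's partition assignor, not its real implementation; real assignors differ in detail but share the same constraint. All names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Assign each partition to exactly one consumer, cycling through consumers.

    Returns a dict mapping every consumer to its (possibly empty) partition list.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partitions."""
    return [c for c, parts in assignment.items() if not parts]


assignment = assign_round_robin([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
idle = idle_consumers(assignment)
```

With 3 partitions and 5 consumers, `idle` contains the two consumers that received nothing.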
There is increasing Kafka lag, but the service works as expected. What are possible options to solve the issue?
- Consumer group state → Is it Stable, Rebalancing, or Empty?
- Partition lag distribution → Even lag or hot partitions?
- Production rate → Did throughput spike on the source side?
- Consumer processing time → Where is the time spent (fetch, process, write)?
- Rebalance frequency → Are consumers bouncing?
- GC pauses → Is the JVM stalling?
- Sink health → Is the destination keeping up?
- Scale or fix → Based on the bottleneck: add consumers (up to the partition count), add partitions, tune batch and fetch sizes, or fix the slow sink
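The "partition lag distribution" check from the list above can be sketched as follows. The offsets are assumed to have been collected already (for example from a monitoring tool); the function names and the "3x the median" threshold are hypothetical choices, not a standard.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition = log end offset minus committed offset."""
    return {p: end_offsets[p] - committed_offsets[p] for p in end_offsets}


def hot_partitions(lag, factor=3.0):
    """Partitions whose lag exceeds `factor` times the median lag."""
    ordered = sorted(lag.values())
    median = ordered[len(ordered) // 2]
    return [p for p, v in sorted(lag.items()) if v > factor * max(median, 1)]


end = {0: 1000, 1: 1000, 2: 5000, 3: 1000}
committed = {0: 950, 1: 940, 2: 1000, 3: 960}
lag = partition_lag(end, committed)
hot = hot_partitions(lag)
```

If `hot` is non-empty, the lag is skewed (often a hot key or one slow consumer) and adding consumers alone is unlikely to help; if lag is even across partitions, scaling out or speeding up processing is the more likely fix.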
Is it possible to change the number of partitions while downstream components are processing messages?
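Yes, but the partition count can only be increased, never decreased. If you use message keys, per-key ordering may break: the default partitioner picks the partition as hash(key) mod the number of partitions, so after the change a key can map to a different partition than its earlier messages. Existing data is not redistributed.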
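The effect on keyed messages can be shown with a small sketch. CRC32 is used here only as a stand-in hash (Kafka's default partitioner uses murmur2), so the concrete partition numbers will not match a real cluster; the function names and the "user-N" keys are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition as hash(key) mod num_partitions.

    CRC32 is a stand-in; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count, new_count):
    """Keys whose partition changes when the partition count changes."""
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]


keys = [f"user-{i}" for i in range(100)]
moved = moved_keys(keys, old_count=4, new_count=6)
```

Every key in `moved` will have its new messages written to a different partition than its old ones, so a consumer can see them out of order relative to the earlier messages still sitting in the old partition.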
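How to calculate the optimal number of partitions per Kafka topic?
The optimal number balances producer/consumer throughput against per-partition overhead. A simple formula: max(t / p, t / c), where t is the target throughput of the topic, p is the measured producer throughput on a single partition, and c is the measured consumer throughput on a single partition.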
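The formula can be written out directly. This is a minimal sketch; the function name and the MB/s figures in the example are made up, and the per-partition rates are values you would measure in your own environment.

```python
import math


def partitions_needed(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Partition count from max(t / p, t / c), rounded up to a whole partition."""
    by_producer = target_mb_s / producer_mb_s_per_partition
    by_consumer = target_mb_s / consumer_mb_s_per_partition
    return math.ceil(max(by_producer, by_consumer))


# Target 100 MB/s, 20 MB/s per partition when producing, 10 MB/s when consuming.
count = partitions_needed(100, 20, 10)
```

Here the consumer side is the bottleneck, so it determines the result. Treat the number as a lower bound and leave room for growth, since the partition count cannot be reduced later.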