fix: align kafka consumer fetch ceiling to 64 MiB broker max

tonyalaribe · tonyalaribe · commit 0e53f9e30841 · 2026-06-22T16:14:37.000+02:00
librdkafka defaults max.partition.fetch.bytes to 1 MiB and fetch.max.bytes
to 50 MiB, both below the 64 MiB a DLQ re-publish can reach. An
oversized-but-valid record is accepted/stored by the broker yet trips
MSG_SIZE_TOO_LARGE on consume, wedging the partition in a re-seek loop
(observed: DLQ replay group at ~243 MiB in-flight, 0 committed partitions).

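Raise consumer fetch ceilings to match the producer/broker 64 MiB
max.message.bytes, and set receive.message.max.bytes to 100 MiB, since
librdkafka requires it to be at least fetch.max.bytes + 512.
Config-only; applies to both primary and _dlq groups.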
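
As a sanity check on the numbers, here is a minimal standalone sketch (not part of Pkg.Queue) that compares the librdkafka defaults and the patched values against a 64 MiB broker limit. `FetchLimits` and `checkLimits` are hypothetical names introduced only for illustration. The first two checks encode this commit's own rule that fetch ceilings should not sit below the broker max; the third is the documented librdkafka requirement that receive.message.max.bytes be at least fetch.max.bytes + 512.

```haskell
-- Sketch only: FetchLimits and checkLimits are illustrative names,
-- not part of Pkg.Queue or hw-kafka-client.
module Main (main) where

mib :: Int
mib = 1024 * 1024

data FetchLimits = FetchLimits
  { maxPartitionFetchBytes :: Int -- max.partition.fetch.bytes
  , fetchMaxBytes :: Int          -- fetch.max.bytes
  , receiveMessageMaxBytes :: Int -- receive.message.max.bytes
  }

-- | Returns the violated constraints; an empty list means the values agree.
checkLimits :: Int -> FetchLimits -> [String]
checkLimits brokerMaxMessageBytes l =
  [ "max.partition.fetch.bytes < broker max.message.bytes"
  | maxPartitionFetchBytes l < brokerMaxMessageBytes ]
    ++ [ "fetch.max.bytes < broker max.message.bytes"
       | fetchMaxBytes l < brokerMaxMessageBytes ]
    ++ [ "receive.message.max.bytes < fetch.max.bytes + 512"
       | receiveMessageMaxBytes l < fetchMaxBytes l + 512 ]

main :: IO ()
main = do
  let broker = 64 * mib
      -- librdkafka defaults: 1 MiB, 50 MiB, 100000000 bytes
      defaults = FetchLimits (1 * mib) (50 * mib) 100000000
      -- values set by this commit
      patched = FetchLimits 67108864 67108864 104857600
  print (checkLimits broker defaults)
  print (checkLimits broker patched)
```

Run with `runghc`, the defaults report the two fetch-ceiling violations and the patched values report none.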
diff --git a/src/Pkg/Queue.hs b/src/Pkg/Queue.hs
@@ -470,6 +470,14 @@ kafkaService appLogger appCtx tp role label kafkaTopics batchSize fn = checkpoin
         -- round-trip cost). 250ms stays well under session/poll timeouts.
         <> K.extraProp "fetch.min.bytes" "65536"
         <> K.extraProp "fetch.wait.max.ms" "250"
+        -- Consumer fetch ceilings MUST match the producer/broker 64 MiB max.message.bytes.
+        -- librdkafka defaults (max.partition.fetch.bytes 1 MiB, fetch.max.bytes 50 MiB) sit
+        -- below the 64 MiB a DLQ re-publish (esp. header-restamped) can reach: the broker
+        -- stores the record, but consuming it fails with MSG_SIZE_TOO_LARGE and re-seeks
+        -- forever. receive.message.max.bytes (100 MiB) must be >= fetch.max.bytes + 512.
+        <> K.extraProp "max.partition.fetch.bytes" "67108864"
+        <> K.extraProp "fetch.max.bytes" "67108864"
+        <> K.extraProp "receive.message.max.bytes" "104857600"
         <> K.extraProp "partition.assignment.strategy" "cooperative-sticky"
         <> K.extraProp "group.instance.id" clientId
         <> K.logLevel K.KafkaLogInfo