kafka_consumer: bound and refine estimated_consumer_lag (#24167)

piochelepiotr · claude · web-flow · commit 77b7cda32b44 · 2026-06-30T12:16:50.000Z
* kafka_consumer: bound and refine estimated_consumer_lag

  Cap left-extrapolation of the broker timestamp cache so a consumer offset older than the oldest cached sample cannot extrapolate more than 10 minutes past it, keeping estimated_consumer_lag bounded.

  Use max(consumer_offset, low_watermark) as the offset basis for lag-in-time when cluster monitoring is enabled: messages below the low watermark are out of retention and unreachable, so they should not inflate the time lag.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: add changelog entry for PR #24167

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: compact and prune the broker-timestamp cache

  Replace single-oldest eviction with batch compaction (Visvalingam-Whyatt) triggered when the cache reaches capacity: keep the oldest and newest samples and drop the points that least distort the offset/timestamp curve, so the cache spans a longer history at a coarsening resolution and high lag is interpolated rather than extrapolated.

  At the same trigger, prune samples below the earliest consumer offset (keeping one anchor) since no consumer will ever interpolate there.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: prune broker-timestamp cache by low watermark

  Use the partition low watermark as the prune floor when cluster monitoring is enabled (the physically meaningful "lowest readable offset"), falling back to the earliest committed consumer offset otherwise. The low watermark is now fetched before the cache update and reused for both pruning and the lag-in-time floor, so there is no extra broker call.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: fetch low watermark offsets once and share them

  Previously the log-start (low watermark) offsets were fetched twice per run when cluster monitoring and data streams were both enabled: once by the metadata collector for partition.size/topic.size/throughput, and again by the lag path for the lag-in-time and cache-pruning floor.

  Fetch them once in check(), gated on cluster monitoring, over all non-internal topic partitions, and share the result with both the data-streams lag path and the metadata collector. Removes the duplicate list_offsets(earliest) call and the divergent internal-topic handling.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: reuse _fetch_earliest_offsets instead of a parallel fetch

  Drop the PR-added Client.get_low_watermark_offsets and the _get_low_watermark_offsets wrapper, which duplicated the existing ClusterMetadataCollector._fetch_earliest_offsets. The check now calls _fetch_earliest_offsets once under cluster monitoring and shares the result with both the data-streams lag/pruning path and the topic-metadata collection, so the earliest offsets are still fetched only once per run.

  This reverts client.py to master and keeps the cluster_metadata.py change to a small signature tweak.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: use low_watermark_offsets directly in topic metadata

  Drop the redundant earliest_offsets alias and reference the passed-in low_watermark_offsets directly.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: address review feedback on lag bounding

  - Clarify that the left-extrapolation cap bounds lag-in-time regardless of cluster monitoring or the low-watermark floor, and document why there is no symmetric right-side clamp (the newest cached sample is the just-collected highwater, which the consumer offset can never exceed).
  - Promote ClusterMetadataCollector.fetch_earliest_offsets to a public method since KafkaCheck now calls it across the class boundary.
  - Log a debug line when the cache-prune floor falls back from the low watermark to the earliest consumer offset.
  - Extract the Visvalingam-Whyatt significance closure into a module-level _interpolation_error helper.
  - Parameterize the _visvalingam_whyatt tests; add direct tests for _earliest_consumer_offsets, _prune_below_anchor, and the left-extrapolation cap through report_consumer_offsets_and_lag without a low watermark.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: trim comments to a single note on the extrapolation cap

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: move extrapolation-cap comment to the clamp line

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: reuse fetched topic partitions in topic metadata collection

  Pass the topic-partition map computed in check() through collect_all_metadata into _collect_topic_metadata instead of fetching it again, so the cluster monitoring path makes the same number of get_topic_partitions calls as before.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: satisfy ruff formatting for collect_all_metadata call

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* kafka_consumer: clear full timestamp cache on reset, test pruning end-to-end

  When a reset is detected (any cached offset above the new highwater), clear the entire cache instead of only dropping entries above the highwater. The VW compactor always preserves the minimum cached offset as an endpoint, so old-generation low-offset entries would never age out and would poison lag interpolation indefinitely after a partial reset.

  Also replaces the direct private-method test for consumer-floor pruning with a dd_run_check test that exercises the full check() path, and adds tests for the new clear-on-reset behaviour.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: satisfy ruff formatting for new unit tests

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: shorten reset-detection comment

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: trim reset test comment

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: test timestamp compaction via dd_run_check instead of private method

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: replace _prune_below_anchor direct tests with dd_run_check tests

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: satisfy ruff formatting for prune_below_anchor replacement tests

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: replace private method tests with public method tests

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: test that lag accuracy is preserved after VW compaction

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* kafka_consumer: parametrize VW compaction test with 4 cases

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
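The two lag-bounding rules in the commit message above (the left-extrapolation cap and the low-watermark floor) can be illustrated with a small standalone sketch. This is not the check's code: `interpolated_timestamp` and `lag_seconds` are hypothetical names, and the choice of neighbouring samples is an assumption, since the diff only shows the tail of `_get_interpolated_timestamp`. Only the clamp and the `max(consumer_offset, low_watermark)` step are taken from the diff.

```python
import bisect

# Same value as the LAG_EXTRAPOLATION_LIMIT_SECONDS constant added in the diff.
LAG_EXTRAPOLATION_LIMIT_SECONDS = 600


def interpolated_timestamp(timestamps, offset):
    """Estimate a timestamp for `offset` from an {offset: timestamp} cache.

    Hypothetical stand-in for _get_interpolated_timestamp.
    """
    if offset in timestamps:
        return timestamps[offset]
    offsets = sorted(timestamps)
    if len(offsets) < 2:
        return None  # assumption: not enough samples to fit a line
    # Pick the pair of samples around `offset`, or the nearest pair if it is out of range.
    i = min(max(bisect.bisect_left(offsets, offset), 1), len(offsets) - 1)
    before, after = offsets[i - 1], offsets[i]
    slope = (timestamps[after] - timestamps[before]) / float(after - before)
    timestamp = slope * (offset - after) + timestamps[after]
    if offset < before:
        # Left-extrapolation cap: never go more than the limit past the oldest sample used.
        timestamp = max(timestamp, timestamps[before] - LAG_EXTRAPOLATION_LIMIT_SECONDS)
    return timestamp


def lag_seconds(timestamps, consumer_offset, producer_offset, low_watermark=None):
    """Lag in time, using the low watermark as a floor for the consumer offset."""
    effective = consumer_offset if low_watermark is None else max(consumer_offset, low_watermark)
    producer_ts = timestamps.get(producer_offset)
    consumer_ts = interpolated_timestamp(timestamps, effective)
    if producer_ts is None or consumer_ts is None:
        return None
    return producer_ts - consumer_ts


cache = {10000: 1000.0, 10100: 1100.0}
print(lag_seconds(cache, 0, 10100))                       # capped by the extrapolation limit
print(lag_seconds(cache, 0, 10100, low_watermark=10050))  # floored at the low watermark
```

With this cache, an uncapped extrapolation to offset 0 would yield a lag of more than 10,000 seconds. The cap limits the estimate to the span of the cache plus 600 seconds, and the low-watermark floor replaces extrapolation with interpolation inside the cached range.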
diff --git a/kafka_consumer/changelog.d/24167.fixed b/kafka_consumer/changelog.d/24167.fixed
@@ -0,0 +1 @@
+Improve the accuracy of ``estimated_consumer_lag`` for consumers that are far behind: cap interpolation for offsets older than the cached broker history, use the low watermark as a floor for the lag offset when cluster monitoring is enabled, and retain a longer broker-timestamp history by compacting the cache (Visvalingam-Whyatt) and pruning samples below the lowest readable offset (the low watermark, or the earliest consumer offset when cluster monitoring is disabled) instead of evicting the oldest one.
diff --git a/kafka_consumer/datadog_checks/kafka_consumer/cluster_metadata.py b/kafka_consumer/datadog_checks/kafka_consumer/cluster_metadata.py
@@ -204,7 +204,7 @@ def _parallel_fetch(self, fn: Callable[[str], Any], subjects: list[str], error_l
                     self.log.warning("Error fetching %s for %s: %s", error_label, subject, e)
         return results
 
-    def collect_all_metadata(self, highwater_offsets):
+    def collect_all_metadata(self, highwater_offsets, low_watermark_offsets, topic_partitions):
         try:
             shared_metadata = self.client.kafka_client.list_topics(timeout=self.config._request_timeout)
         except Exception as e:
@@ -217,7 +217,7 @@ def collect_all_metadata(self, highwater_offsets):
             self.log.error("Error collecting broker metadata: %s", e)
 
         try:
-            self._collect_topic_metadata(shared_metadata, highwater_offsets)
+            self._collect_topic_metadata(shared_metadata, highwater_offsets, low_watermark_offsets, topic_partitions)
         except Exception as e:
             self.log.error("Error collecting topic metadata: %s", e)
 
@@ -386,7 +386,7 @@ def _collect_broker_metadata(self, metadata=None):
                 "data-streams-message",
             )
 
-    def _fetch_earliest_offsets(self, topic_partitions):
+    def fetch_earliest_offsets(self, topic_partitions):
         """Batch-fetch log-start offsets via AdminClient.list_offsets(earliest).
 
         Uses ListOffsets with the EARLIEST_TIMESTAMP sentinel, which the broker
@@ -441,11 +441,9 @@ def _fetch_earliest_offsets(self, topic_partitions):
             )
         return result
 
-    def _collect_topic_metadata(self, metadata, highwater_offsets):
+    def _collect_topic_metadata(self, metadata, highwater_offsets, low_watermark_offsets, topic_partitions):
         self.log.debug("Collecting topic metadata")
 
-        topic_partitions = self.client.get_topic_partitions()
-
         cluster_id = self.config._kafka_cluster_id_override or (
             metadata.cluster_id if hasattr(metadata, 'cluster_id') else 'unknown'
         )
@@ -455,8 +453,6 @@ def _collect_topic_metadata(self, metadata, highwater_offsets):
 
         self.check.gauge('topic.count', len(topic_partitions), tags=self.config._get_tags(cluster_id))
 
-        earliest_offsets = self._fetch_earliest_offsets(topic_partitions)
-
         now_ts = time.time()
         prev_ts = None
         previous_partition_offsets = {}
@@ -496,7 +492,7 @@ def _collect_topic_metadata(self, metadata, highwater_offsets):
 
                 partition_metadata = topic_metadata.partitions.get(partition_id)
                 latest = highwater_offsets.get((topic_name, partition_id), 0)
-                earliest = earliest_offsets.get((topic_name, partition_id))
+                earliest = low_watermark_offsets.get((topic_name, partition_id))
 
                 if earliest is None:
                     have_all_earliest = False
diff --git a/kafka_consumer/datadog_checks/kafka_consumer/kafka_consumer.py b/kafka_consumer/datadog_checks/kafka_consumer/kafka_consumer.py
@@ -1,6 +1,7 @@
 # (C) Datadog, Inc. 2019-present
 # All rights reserved
 # Licensed under Simplified BSD License (see LICENSE)
+import heapq
 import json
 from collections import defaultdict
 from time import time
@@ -18,6 +19,8 @@
 
 MAX_TIMESTAMPS = 1000
 
+LAG_EXTRAPOLATION_LIMIT_SECONDS = 600
+
 
 class KafkaCheck(AgentCheck):
     __NAMESPACE__ = 'kafka'
@@ -67,6 +70,8 @@ def check(self, _):
         # Fetch the broker highwater offsets
         highwater_offsets = {}
         broker_timestamps = defaultdict(dict)
+        low_watermark_offsets = {}
+        topic_partitions = {}
         cluster_id = ""
         persistent_cache_key = "broker_timestamps_"
         consumer_contexts_count = self.count_consumer_contexts(consumer_offsets)
@@ -86,9 +91,17 @@ def check(self, _):
                             partitions.add((topic, partition))
                 # Expected format: ({(topic, partition): offset}, cluster_id)
                 highwater_offsets, cluster_id = self.get_highwater_offsets(partitions)
+                if self.config._cluster_monitoring_enabled:
+                    topic_partitions = self.client.get_topic_partitions()
+                    low_watermark_offsets = self.metadata_collector.fetch_earliest_offsets(topic_partitions)
                 if self._data_streams_enabled:
                     broker_timestamps = self._load_broker_timestamps(persistent_cache_key)
-                    self._add_broker_timestamps(broker_timestamps, highwater_offsets)
+                    if low_watermark_offsets:
+                        prune_floors = low_watermark_offsets
+                    else:
+                        self.log.debug("No low watermarks available; pruning cache by earliest consumer offset")
+                        prune_floors = self._earliest_consumer_offsets(consumer_offsets)
+                    self._add_broker_timestamps(broker_timestamps, highwater_offsets, prune_floors)
                     self._save_broker_timestamps(broker_timestamps, persistent_cache_key)
             else:
                 self.warning("Context limit reached. Skipping highwater offset collection.")
@@ -129,6 +142,7 @@ def check(self, _):
             reporting_limit - len(highwater_offsets),
             broker_timestamps,
             cluster_id,
+            low_watermark_offsets,
         )
 
         # Collect cluster metadata if enabled
@@ -137,7 +151,7 @@ def check(self, _):
             self._send_cluster_monitoring_heartbeat(total_contexts, cluster_id, connect_status)
 
             try:
-                self.metadata_collector.collect_all_metadata(highwater_offsets)
+                self.metadata_collector.collect_all_metadata(highwater_offsets, low_watermark_offsets, topic_partitions)
             except Exception as e:
                 self.log.error("Error collecting cluster metadata: %s", e)
 
@@ -274,22 +288,29 @@ def _load_broker_timestamps(self, persistent_cache_key):
             self.log.warning('Could not read broker timestamps from cache: %s', str(e))
         return broker_timestamps
 
-    def _add_broker_timestamps(self, broker_timestamps, highwater_offsets):
+    def _earliest_consumer_offsets(self, consumer_offsets):
+        """Return the lowest committed offset per (topic, partition) across all consumer groups."""
+        earliest = {}
+        for offsets in consumer_offsets.values():
+            for topic_partition, offset in offsets.items():
+                if topic_partition not in earliest or offset < earliest[topic_partition]:
+                    earliest[topic_partition] = offset
+        return earliest
+
+    def _add_broker_timestamps(self, broker_timestamps, highwater_offsets, prune_floors=None):
+        prune_floors = prune_floors or {}
         for (topic, partition), highwater_offset in highwater_offsets.items():
             timestamps = broker_timestamps["{}_{}".format(topic, partition)]
-            # If the highwater offset went backwards (topic recreated,
-            # retention wipe, or offset reset) any cached pair with a larger
-            # offset points to a now-nonexistent message and would poison
-            # interpolation. Drop those entries.
-            stale = [o for o in timestamps if o > highwater_offset]
-            for o in stale:
-                del timestamps[o]
+            # Reset detected: clear the whole cache. Low-offset survivors are from the
+            # previous generation and VW pins the minimum endpoint, so they'd never age out.
+            if any(o > highwater_offset for o in timestamps):
+                timestamps.clear()
             timestamps[highwater_offset] = time()
-            # If there's too many timestamps, we delete the oldest one (by
-            # timestamp, not by offset — evicting by min offset would discard
-            # the fresh post-reset entries and keep poisonous stale ones).
-            if len(timestamps) > self._max_timestamps:
-                del timestamps[min(timestamps, key=timestamps.get)]
+            if len(timestamps) >= self._max_timestamps:
+                prune_floor = prune_floors.get((topic, partition))
+                if prune_floor is not None:
+                    _prune_below_anchor(timestamps, prune_floor)
+                _visvalingam_whyatt(timestamps, max(2, self._max_timestamps // 2))
 
     def _save_broker_timestamps(self, broker_timestamps, persistent_cache_key):
         """Saves broker timestamps to persistent cache."""
@@ -312,9 +333,16 @@ def report_highwater_offsets(self, highwater_offsets, contexts_limit, cluster_id
         self.log.debug('%s highwater offsets reported', reported_contexts)
 
     def report_consumer_offsets_and_lag(
-        self, consumer_offsets, highwater_offsets, contexts_limit, broker_timestamps, cluster_id
+        self,
+        consumer_offsets,
+        highwater_offsets,
+        contexts_limit,
+        broker_timestamps,
+        cluster_id,
+        low_watermark_offsets=None,
     ):
         """Report the consumer offsets and consumer lag."""
+        low_watermark_offsets = low_watermark_offsets or {}
         reported_contexts = 0
         self.log.debug("Reporting consumer offsets and lag metrics")
         for consumer_group, offsets in consumer_offsets.items():
@@ -388,7 +416,9 @@ def report_consumer_offsets_and_lag(
                     timestamps = broker_timestamps["{}_{}".format(topic, partition)]
                     # The producer timestamp can be not set if there was an error fetching broker offsets.
                     producer_timestamp = timestamps.get(producer_offset, None)
-                    consumer_timestamp = _get_interpolated_timestamp(timestamps, consumer_offset)
+                    low_watermark = low_watermark_offsets.get((topic, partition))
+                    effective_offset = consumer_offset if low_watermark is None else max(consumer_offset, low_watermark)
+                    consumer_timestamp = _get_interpolated_timestamp(timestamps, effective_offset)
                     if consumer_timestamp is None or producer_timestamp is None:
                         continue
                     lag = producer_timestamp - consumer_timestamp
@@ -502,4 +532,58 @@ def _get_interpolated_timestamp(timestamps, offset):
     timestamp_after = timestamps[offset_after]
     slope = (timestamp_after - timestamp_before) / float(offset_after - offset_before)
     timestamp = slope * (offset - offset_after) + timestamp_after
+
+    if offset < offset_before:
+        # Cap how far past the oldest cached sample we extrapolate, so estimated lag stays bounded.
+        timestamp = max(timestamp, timestamp_before - LAG_EXTRAPOLATION_LIMIT_SECONDS)
     return timestamp
+
+
+def _prune_below_anchor(timestamps, floor):
+    below = [o for o in timestamps if o < floor]
+    if len(below) <= 1:
+        return
+    anchor = max(below)
+    for o in below:
+        if o != anchor:
+            del timestamps[o]
+
+
+def _visvalingam_whyatt(timestamps, target_count):
+    if len(timestamps) <= target_count:
+        return timestamps
+
+    offsets = sorted(timestamps)
+    prev = {o: (offsets[i - 1] if i > 0 else None) for i, o in enumerate(offsets)}
+    nxt = {o: (offsets[i + 1] if i < len(offsets) - 1 else None) for i, o in enumerate(offsets)}
+    alive = set(offsets)
+
+    current = {}
+    heap = []
+    for o in offsets:
+        if prev[o] is not None and nxt[o] is not None:
+            current[o] = _interpolation_error(o, prev, nxt, timestamps)
+            heap.append((current[o], o))
+    heapq.heapify(heap)
+
+    remaining = len(offsets)
+    while remaining > target_count and heap:
+        error, o = heapq.heappop(heap)
+        if o not in alive or error != current.get(o):
+            continue
+        before, after = prev[o], nxt[o]
+        alive.discard(o)
+        del timestamps[o]
+        remaining -= 1
+        nxt[before], prev[after] = after, before
+        for neighbor in (before, after):
+            if prev[neighbor] is not None and nxt[neighbor] is not None:
+                current[neighbor] = _interpolation_error(neighbor, prev, nxt, timestamps)
+                heapq.heappush(heap, (current[neighbor], neighbor))
+    return timestamps
+
+
+def _interpolation_error(o, prev, nxt, timestamps):
+    before, after = prev[o], nxt[o]
+    predicted = timestamps[before] + (timestamps[after] - timestamps[before]) * (o - before) / (after - before)
+    return abs(timestamps[o] - predicted)
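To see what the compaction and pruning helpers above do to a cache, the following standalone sketch may help. It is a simplified variant, not the code in the diff: `compact` rescans all interior points on every removal instead of using a heap with lazy invalidation, and the names `compact`, `prune_below_anchor`, and `interpolation_error` (with this signature) are hypothetical. The error measure, the vertical distance between a sample and the line through its two neighbours, follows `_interpolation_error`.

```python
def interpolation_error(offsets, i, timestamps):
    """Error introduced by dropping offsets[i] and interpolating between its neighbours."""
    before, o, after = offsets[i - 1], offsets[i], offsets[i + 1]
    predicted = timestamps[before] + (timestamps[after] - timestamps[before]) * (o - before) / (after - before)
    return abs(timestamps[o] - predicted)


def compact(timestamps, target_count):
    """Drop the least significant interior samples until target_count remain.

    The lowest and highest offsets are never candidates, so they always survive.
    """
    offsets = sorted(timestamps)
    while len(offsets) > max(2, target_count):
        i = min(range(1, len(offsets) - 1), key=lambda k: interpolation_error(offsets, k, timestamps))
        del timestamps[offsets.pop(i)]
    return timestamps


def prune_below_anchor(timestamps, floor):
    """Drop samples below `floor`, keeping the highest one as an interpolation anchor."""
    below = [o for o in timestamps if o < floor]
    if len(below) <= 1:
        return
    anchor = max(below)
    for o in below:
        if o != anchor:
            del timestamps[o]


# A steady production rate up to offset 300, then a pause before offset 400.
cache = {0: 0.0, 100: 10.0, 200: 20.0, 300: 30.0, 400: 90.0, 500: 100.0}
compact(cache, 4)
print(sorted(cache))  # the collinear samples 100 and 200 are dropped first

history = {0: 0.0, 100: 10.0, 200: 20.0, 300: 30.0}
prune_below_anchor(history, 250)
print(sorted(history))  # only the anchor at 200 is kept below the floor
```

Samples that sit on a straight line between their neighbours carry no extra information for interpolation, so they go first. The samples around a change in production rate are kept, which is what lets a half-size cache still reproduce the offset/timestamp curve.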
diff --git a/kafka_consumer/tests/test_unit.py b/kafka_consumer/tests/test_unit.py
