Branch: feature/watermarking → main
Distributed, Kafka-backed watermarking for time-windowed stream processing.
Global watermark = min(all partition watermarks) — signals "all data up to this point has arrived."
Enables windows to expire by time across all keys, not just when a new message arrives for the same key.
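The "min over all partitions" rule above can be sketched in a few lines of plain Python (the function name here is illustrative, not the library's API):

```python
# Sketch of the global-watermark rule: the global watermark is the
# minimum over all topic-partition watermarks, so the slowest
# partition gates window closure for everyone.

def global_watermark(partition_watermarks: dict) -> int:
    """partition_watermarks maps (topic, partition) -> latest watermark (ms)."""
    return min(partition_watermarks.values())

# Three partitions; ("t", 2) lags behind and holds the global watermark back.
wms = {("t", 0): 1_000, ("t", 1): 1_200, ("t", 2): 800}
assert global_watermark(wms) == 800  # gated by the slowest partition
```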
- `quixstreams/processing/watermarking.py` — `WatermarkMessage` TypedDict + `WatermarkManager` class
- `quixstreams/models/topics/manager.py` — `watermarks_topic()` creates internal Kafka topic `<consumer_group>.watermarks`
- `quixstreams/app.py` — integrates `WatermarkManager` into the run loop
- `quixstreams/dataframe/windows/time_based.py` — `expire_by_partition()`, before/after update callbacks
- `quixstreams/core/stream/stream.py` — `is_watermark` flag threading, sink blocking
- Every normal message → `watermark_manager.store(topic, partition, timestamp, default=True)`
- On idle loop → `watermark_manager.produce()` flushes buffered timestamps to the internal watermarks Kafka topic (rate-limited by `watermarking_interval`)
- All instances consume the watermarks topic → `watermark_manager.receive()` → global watermark = `min(all TPs)`
- When the global watermark advances → re-run the pipeline for each assigned partition with `value=None, key=None, timestamp=<watermark>, is_watermark=True`
- `apply`/`filter`/`update` ops → no-op, pass the watermark downstream
- `transform` with `on_watermark` → handler fires, emits real records; the watermark continues downstream
- Sink boundary → watermark signal dropped (never reaches sinks)
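The store/produce/receive cycle above can be simulated in-memory. This is a toy sketch: the method names mirror the description, but the real `WatermarkManager` routes timestamps through the internal Kafka topic rather than a local loopback:

```python
import time

class ToyWatermarkManager:
    """In-memory stand-in for the described store → produce → receive cycle."""

    def __init__(self, interval: float = 1.0):
        self._interval = interval   # plays the role of watermarking_interval
        self._buffer = {}           # (topic, partition) -> latest local timestamp
        self._received = {}         # global view shared by all instances
        self._last_flush = 0.0

    def store(self, topic, partition, timestamp, default=True):
        # Called for every normal message: keep the highest timestamp per TP
        tp = (topic, partition)
        self._buffer[tp] = max(self._buffer.get(tp, 0), timestamp)

    def produce(self, now=None):
        # Rate-limited flush of buffered timestamps (to Kafka in the real thing)
        now = time.monotonic() if now is None else now
        if now - self._last_flush < self._interval:
            return
        self._last_flush = now
        for tp, ts in self._buffer.items():
            self.receive(tp, ts)  # local loopback instead of a Kafka round-trip

    def receive(self, tp, timestamp):
        # Every instance consumes the watermarks topic and merges monotonically
        self._received[tp] = max(self._received.get(tp, 0), timestamp)

    @property
    def global_watermark(self):
        return min(self._received.values()) if self._received else None

wm = ToyWatermarkManager(interval=1.0)
wm.store("t", 0, 100)
wm.store("t", 1, 50)
wm.produce(now=5.0)                  # past the interval, so the flush runs
assert wm.global_watermark == 50     # min over both topic-partitions
```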
- `TimeWindow.final()`/`.current()` register `on_watermark` callbacks
- Watermark arrival triggers `expire_by_partition(transaction, timestamp_ms)` — sweeps ALL keys in the partition's state store
- Closes any window with `end_time ≤ watermark − grace`
- Major improvement: windows close even if no new messages arrive for a key
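An illustrative sweep implementing the rule above — close every window whose `end_time ≤ watermark − grace`, across all keys, including idle ones. The state layout here is assumed for the sketch and does not match the library's store internals:

```python
# Toy partition-wide expiration: store maps key -> {(start, end): aggregate}.

def expire_by_partition(store: dict, watermark_ms: int, grace_ms: int) -> list:
    """Pop and return all (key, window, aggregate) with end <= watermark - grace."""
    expired = []
    cutoff = watermark_ms - grace_ms
    for key, windows in store.items():
        closed = sorted(w for w in windows if w[1] <= cutoff)
        for w in closed:
            expired.append((key, w, windows.pop(w)))
    return expired

store = {
    "user-a": {(0, 10_000): 3, (10_000, 20_000): 1},
    "user-b": {(0, 10_000): 7},  # idle key: closes with no new message for it
}
out = expire_by_partition(store, watermark_ms=15_000, grace_ms=2_000)
# Windows ending at 10_000 close for BOTH keys, since 10_000 <= 13_000
assert sorted(out) == [("user-a", (0, 10_000), 3), ("user-b", (0, 10_000), 7)]
```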
| Parameter | Default | Description |
|---|---|---|
| `watermarking_default_assignor_enabled` | `True` | Auto-track Kafka message timestamps as watermarks |
| `watermarking_interval` | `1.0` | Seconds between watermark flushes to Kafka |
| `broker_availability_timeout` | `120.0` | Crash if Kafka is unreachable (triggers an orchestrator restart) |
- `before_update(current_value, new_value, key, timestamp_ms, headers) -> bool` — return `True` to emit the window BEFORE the new value is added
- `after_update(new_aggregated, new_value, key, timestamp_ms, headers) -> bool` — return `True` to emit the window AFTER the update
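The callback contract above can be demonstrated with a pure-Python simulation. The callback signatures match the description; the surrounding update loop is a stand-in for the window machinery, not the library's internals:

```python
# Toy demonstration of the before/after update callbacks: the window
# aggregate is emitted before and/or after an update depending on the
# booleans the callbacks return.

def before_update(current_value, new_value, key, timestamp_ms, headers) -> bool:
    # Example policy: snapshot the window just before a record flagged
    # "late" is folded in (the "late" flag is invented for this sketch).
    return new_value.get("late", False)

def after_update(new_aggregated, new_value, key, timestamp_ms, headers) -> bool:
    # Example policy: emit eagerly once the aggregate crosses a threshold.
    return new_aggregated >= 10

def apply_update(window_sum, record, key, ts, headers, emitted):
    if before_update(window_sum, record, key, ts, headers):
        emitted.append(("before", window_sum))
    window_sum += record["n"]
    if after_update(window_sum, record, key, ts, headers):
        emitted.append(("after", window_sum))
    return window_sum

emitted = []
total = 0
for rec in ({"n": 4}, {"n": 7}, {"n": 1, "late": True}):
    total = apply_update(total, rec, "k", 0, None, emitted)

assert total == 12
assert emitted == [("after", 11), ("before", 11), ("after", 12)]
```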
- `StreamingDataFrame.set_timestamp(fn)` — now also stores a non-default watermark (overrides the auto Kafka-timestamp watermark)
- `StreamingDataFrame.test(..., is_watermark=False)` — test watermark code paths
- `processed_offsets` recovery mechanism removed (replaced by the watermark-driven approach)
- Watermarks exported from `quixstreams/__init__.py`
- Global watermark gated by slowest partition — no incorrect early window closure
- Watermarks stored in Kafka — shared across all consumer group instances
- Watermarks transparent to non-windowed pipeline ops
- Watermarks blocked from reaching sinks (`_sink_wrapper` drops them)