Move live reconciliation real-time gates to the monotonic clock#4376
Conversation
|
Hi @folknor, Thanks for the PR, and for the differential paused-time tests: advancing one clock axis at a time is exactly the right way to pin these gates. The direction reads right to me, and it closes out the family as you described. I traced every converted site against the base: gate directions and retry/escalation thresholds are preserved, every A couple of things to fix before this lands:
Not blocking, your call:
Let me know if anything is unclear! |
|
I'll handle the two blockers, and two of the optionals here (2 and 4). Two follow-up PRs that stack on top of this will afterwards be filed for (1) the shared predicate / consolidation + renames and (2) closing the class / the sweep. I was thinking I can wait for this to land before filing the first followup. But if you prefer I file all 3 before anything lands, I can - just say the word. |
The continuous reconciliation checks measured their real-time settling windows on self.clock, which a clock factory can drive fast, anchor to a foreign epoch, or stall, so an N-second window stops meaning N real seconds. nautechsystems#4366 fixed this for the position grace; this moves the rest of the family onto the monotonic dst::time clock: - open-order local-activity grace (order_local_activity) - inflight submit and query timeout (InflightCheck) - recent-fills dedup TTL (recent_fills_cache) - shared open/inflight query cadence (ts_last_query) - inner reconciliation sub-check cadence gate (run_reconciliation_checks) self.clock is kept for every domain timestamp: generated event and command ts_event/ts_init, and venue lookback and purge cutoffs. The order-recency gate in handle_missing_order stays on domain time but is hardened from a panicking UnixNanos subtraction to a checked one that logs a warning and defers when a venue timestamp runs ahead of local time. Also drops the now-unused ExecutionManager::generate_timestamp_ns accessor. Tests migrate to tokio paused virtual time (test-util dev-dependency), with differential cases that advance one clock axis at a time to prove the split. Coded by an LLM.
b8400bb to
6954162
Compare
|
I've pushed the new commit, forwarded the branch, and updated the PR description. |
cjdsellers
left a comment
There was a problem hiding this comment.
Thanks for the follow-ups @folknor, good changes 👌
The live execution manager tracked four "mark now / within window? / prune older than TTL" gates as bare IndexMap<K, dst::time::Instant>: order_query_recency, order_local_activity, position_local_activity, and recent_fills_cache. Every call site had to remember to stamp from the monotonic dst::time clock rather than the trading self.clock, and to handle a future-marked instant in the safe direction by hand. Extract a RecencyMap<K> that owns dst::time::Instant::now() internally, so the monotonic-clock choice becomes structural: a call site can no longer reach for self.clock to build a recency gate. within/within_at, elapsed/elapsed_at, and prune_older_than all resolve a None checked_duration_since to the safe direction, matching each original site. Migrate the four maps onto it. Also apply the residual ts_*-named renames cjdsellers flagged in the nautechsystems#4376 review, since ts_* repo-wide means a domain UnixNanos but these hold monotonic instants: InflightCheck::ts_submitted -> submitted_at, last_query_ts -> last_query_at, the node scheduler's ts_last_* locals and state fields -> last_*_check, and the one map field whose ts_ prefix survived the type change, ts_last_query -> order_query_recency. Add focused RecencyMap unit tests for mark/contains/remove, monotonic expiry under paused time, pruning, and safe handling of a future-marked instant. Coded by an LLM.
`handle_missing_order` had a recency gate comparing the order's venue `ts_last` against `self.clock` now - a cross-axis compare that nautechsystems#4376 hardened against underflow but left on the trading clock. Under a custom live/sandbox clock factory the trading clock is not wall-paced (it can run accelerated or sit on a foreign epoch), so that window did not measure the real settling time it was meant to, and a corrupt far-future `ts_last` could stall the order's reconciliation. Drop the venue-`ts_last` gate. The missing-order settling window is now solely the monotonic `order_local_activity` recency gate (the `RecencyMap` from the recency-map consolidation), which measures real receipt-time elapsed at any clock speed. This also removes the warn-and-defer arm nautechsystems#4376 added for a far-future `ts_last`: with the cross-axis gate gone there is no longer a failure mode to warn about - the order simply reconciles once the real grace expires. Audit of the remaining live timers for the same class: the cache-purge intervals are deliberately left on the ExecutionEngine clock-timer path so they stay controlled by the injected Clock for custom-clock callers; a comment and a conversion test now pin that. Data-engine, order emulator, and core timing stay domain/deterministic, and `snapshot_positions_interval_secs` is left as-is (no live monotonic replacement). The missing-order test becomes a differential paused-time case: with a far-future venue `ts_last` present throughout, recent local activity defers, and after the monotonic grace expires reconciliation proceeds. Coded by an LLM.
The live execution manager tracked four "mark now / within window? / prune older than TTL" gates as bare IndexMap<K, dst::time::Instant>: order_query_recency, order_local_activity, position_local_activity, and recent_fills_cache. Every call site had to remember to stamp from the monotonic dst::time clock rather than the trading self.clock, and to handle a future-marked instant in the safe direction by hand. Extract a RecencyMap<K> that owns dst::time::Instant::now() internally, so the monotonic-clock choice becomes structural: a call site can no longer reach for self.clock to build a recency gate. within/within_at, elapsed/elapsed_at, and prune_older_than all resolve a None checked_duration_since to the safe direction, matching each original site. Migrate the four maps onto it. Also apply the residual ts_*-named renames cjdsellers flagged in the #4376 review, since ts_* repo-wide means a domain UnixNanos but these hold monotonic instants: InflightCheck::ts_submitted -> submitted_at, last_query_ts -> last_query_at, the node scheduler's ts_last_* locals and state fields -> last_*_check, and the one map field whose ts_ prefix survived the type change, ts_last_query -> order_query_recency. Add focused RecencyMap unit tests for mark/contains/remove, monotonic expiry under paused time, pruning, and safe handling of a future-marked instant. Coded by an LLM.
`handle_missing_order` had a recency gate comparing the order's venue `ts_last` against `self.clock` now - a cross-axis compare that nautechsystems#4376 hardened against underflow but left on the trading clock. Under a custom live/sandbox clock factory the trading clock is not wall-paced (it can run accelerated or sit on a foreign epoch), so that window did not measure the real settling time it was meant to, and a corrupt far-future `ts_last` could stall the order's reconciliation. Drop the venue-`ts_last` gate. The missing-order settling window is now solely the monotonic `order_local_activity` recency gate (the `RecencyMap` from the recency-map consolidation), which measures real receipt-time elapsed at any clock speed. This also removes the warn-and-defer arm nautechsystems#4376 added for a far-future `ts_last`: with the cross-axis gate gone there is no longer a failure mode to warn about - the order simply reconciles once the real grace expires. Audit of the remaining live timers for the same class: the cache-purge intervals are deliberately left on the ExecutionEngine clock-timer path so they stay controlled by the injected Clock for custom-clock callers; a comment and a conversion test now pin that. Data-engine, order emulator, and core timing stay domain/deterministic, and `snapshot_positions_interval_secs` is left as-is (no live monotonic replacement). The missing-order test becomes a differential paused-time case: with a far-future venue `ts_last` present throughout, recent local activity defers, and after the monotonic grace expires reconciliation proceeds. Coded by an LLM.
`handle_missing_order` had a recency gate comparing the order's venue `ts_last` against `self.clock` now - a cross-axis compare that nautechsystems#4376 hardened against underflow but left on the trading clock. Under a custom live/sandbox clock factory the trading clock is not wall-paced (it can run accelerated or sit on a foreign epoch), so that window did not measure the real settling time it was meant to, and a corrupt far-future `ts_last` could stall the order's reconciliation. Drop the venue-`ts_last` gate. The missing-order settling window is now solely the monotonic `order_local_activity` recency gate (the `RecencyMap` from the recency-map consolidation), which measures real receipt-time elapsed at any clock speed. This also removes the warn-and-defer arm nautechsystems#4376 added for a far-future `ts_last`: with the cross-axis gate gone there is no longer a failure mode to warn about - the order simply reconciles once the real grace expires. Making local activity the sole gate exposed an ordering bug in the `LiveNode` dispatch path: acknowledgement events (`Accepted` et al) stamped local activity and then immediately wiped it via `clear_recon_tracking`, so a just-accepted order omitted by a lagging venue report could be falsely rejected as NOT_FOUND_AT_VENUE. The per-order-event tracking now lives in `ExecutionManager::observe_order_event`, which clears first and stamps after - matching the ordering `observe_execution_report` already used - and the node's batch accept/cancel arms are reordered the same way. Audit of the remaining live timers for the same class: the cache-purge intervals are deliberately left on the ExecutionEngine clock-timer path so they stay controlled by the injected Clock for custom-clock callers; a comment and a conversion test now pin that. Data-engine, order emulator, and core timing stay domain/deterministic, and `snapshot_positions_interval_secs` is left as-is (no live monotonic replacement). The missing-order test becomes a differential paused-time case: with a far-future venue `ts_last` present throughout, recent local activity defers, and after the monotonic grace expires reconciliation proceeds. It tracks the accepted event through `observe_order_event` - the exact `LiveNode` call - and a dedicated regression test covers the just-accepted-order deferral end to end. Coded by an LLM.
…time (#4387) `handle_missing_order` had a recency gate comparing the order's venue `ts_last` against `self.clock` now - a cross-axis compare that #4376 hardened against underflow but left on the trading clock. Under a custom live/sandbox clock factory the trading clock is not wall-paced (it can run accelerated or sit on a foreign epoch), so that window did not measure the real settling time it was meant to, and a corrupt far-future `ts_last` could stall the order's reconciliation. Drop the venue-`ts_last` gate. The missing-order settling window is now solely the monotonic `order_local_activity` recency gate (the `RecencyMap` from the recency-map consolidation), which measures real receipt-time elapsed at any clock speed. This also removes the warn-and-defer arm #4376 added for a far-future `ts_last`: with the cross-axis gate gone there is no longer a failure mode to warn about - the order simply reconciles once the real grace expires. Making local activity the sole gate exposed an ordering bug in the `LiveNode` dispatch path: acknowledgement events (`Accepted` et al) stamped local activity and then immediately wiped it via `clear_recon_tracking`, so a just-accepted order omitted by a lagging venue report could be falsely rejected as NOT_FOUND_AT_VENUE. The per-order-event tracking now lives in `ExecutionManager::observe_order_event`, which clears first and stamps after - matching the ordering `observe_execution_report` already used - and the node's batch accept/cancel arms are reordered the same way. Audit of the remaining live timers for the same class: the cache-purge intervals are deliberately left on the ExecutionEngine clock-timer path so they stay controlled by the injected Clock for custom-clock callers; a comment and a conversion test now pin that. Data-engine, order emulator, and core timing stay domain/deterministic, and `snapshot_positions_interval_secs` is left as-is (no live monotonic replacement). The missing-order test becomes a differential paused-time case: with a far-future venue `ts_last` present throughout, recent local activity defers, and after the monotonic grace expires reconciliation proceeds. It tracks the accepted event through `observe_order_event` - the exact `LiveNode` call - and a dedicated regression test covers the just-accepted-order deferral end to end. Coded by an LLM.
The bug
#4366 fixed the position reconciliation grace - a real-time settling window measured on
self.clock, which a clock factory can drive fast, sit on a foreign epoch, or stall. As @cjdsellers noted in that review, the order-side siblings still have it. This is the rest of the family.Worth being precise about the runner loop, because there are two gates and only one was already right. The outer maintenance loop (
LiveNode::run) wakes on the monotonicdst::timeclock - but the inner per-check cadence insiderun_reconciliation_checksgates onself.clockviagenerate_timestamp_ns(). So even after #4366, a stalledself.clocknever marks the inflight/open/position checks due; they silently stop running, and the position grace #4366 fixed guards a check that never fires. An accelerated clock collapses the cadence to the outer minimum and burns inflight retries at clock speed. Same root cause.The fix
Everything measuring a real-time settling window moves onto monotonic
dst::time::Instant:order_local_activity)InflightCheck)recent_fills_cache)ts_last_query) - one map written by both paths, so converting only one axis would compare across clocksreconciliation_check_due)self.clockstays for every domain timestamp: generated event/commandts_event/ts_init, and venue lookback/purge cutoffs. The two query-emitting functions are split accordingly - monotonic for the gate,self.clockfor theQueryOrderts_init. The old plainUnixNanossubtractions in these windows panicked on underflow (a venue clock slightly ahead of local time was enough);Instantdurations remove that class outright.One hunk that isn't a pure clock swap
handle_missing_orderhas a second gate comparing the order's venuets_lastagainst local time - genuinely mixed-axis, no monotonic instant to move to, so it stays onself.clock. Its subtraction had the same underflow foot-gun, so it's hardened from a panicking-to a checkedduration_since: where a venue timestamp runs ahead of the local clock the old code panicked, the new code logs a warning and defers reconciliation, so a corrupted far-futurets_last(for example a double-scaled timestamp) is visible to operators rather than silently stalling that order's reconciliation forever. Flagging it as the one behavior fix rather than a pure refactor.Choices worth flagging
prune_recent_fills_cacheispub, so it must preserve the oldf64-to-u64cast's saturation exactly: negative/NaN saturated to0(prune all), and positive overflow /+infsaturated tou64::MAX(keep all).try_from_secs_f64returnsErrfor both, so the fallback branches on the sign -Err(_) if ttl_secs > 0.0 => Duration::MAX, elseDuration::ZERO- rather than collapsing everyErrto zero.ExecutionManager::generate_timestamp_ns; the cadence-gate change was its last caller.checked_duration_since(...).is_none_or(...), mirroring each original site rather thanelapsed().reconciliation_check_due(it no longer referencesself.clock), not a full integration harness.Tests
Affected inflight / open-order / recent-fills tests migrate to tokio paused virtual time (
tokiotest-utildev-dependency, matchingcrates/data), plus a position-grace expiry regression. The key cases are differential - advancing one clock axis at a time: monotonic-only to prove the gate fires on real time,self.clock-only to prove a regression back to it would be caught, so "advance both" can't mask a wrong-clock implementation.Follow-ups (deliberately not in this PR)
IndexMap<_, dst::time::Instant>gates with identical "mark now / within window? / prune older than TTL" semantics (ts_last_query,order_local_activity,position_local_activity,recent_fills_cache). A smallRecencyMap<K>that ownsInstant::now()internally would make the monotonic-clock choice structural rather than a per-call-site discipline, so a future call site could not reach for the trading clock again. It is also the destination the wider sweep below migrates onto.handle_missing_order's venue-ts_lastrecency gate - hardened against underflow in this PR, but still a cross-axis compare. A follow-up could re-base that on local receipt time (landing it on theRecencyMapabove) and audit the other timers/throttles (data-engine cadences, the order emulator, cache-purge intervals) for the same shape, moving the genuinely real-time ones ontodst::time.