
Move live reconciliation real-time gates to the monotonic clock #4376

Merged: cjdsellers merged 1 commit into nautechsystems:develop from folknor:fix-reconciliation-clock-gates on Jul 4, 2026
folknor (Collaborator) commented Jul 3, 2026

The bug

#4366 fixed the position reconciliation grace - a real-time settling window measured on self.clock, which a clock factory can drive fast, anchor to a foreign epoch, or stall. As @cjdsellers noted in that review, the order-side siblings still have the same problem. This is the rest of the family.

Worth being precise about the runner loop, because there are two gates and only one was already right. The outer maintenance loop (LiveNode::run) wakes on the monotonic dst::time clock - but the inner per-check cadence inside run_reconciliation_checks gates on self.clock via generate_timestamp_ns(). So even after #4366, a stalled self.clock never marks the inflight/open/position checks due; they silently stop running, and the position grace #4366 fixed guards a check that never fires. An accelerated clock collapses the cadence to the outer minimum and burns inflight retries at clock speed. Same root cause.

The fix

Everything measuring a real-time settling window moves onto monotonic dst::time::Instant:

  • open-order local-activity grace (order_local_activity)
  • inflight submit/query timeout (InflightCheck)
  • recent-fills dedup TTL (recent_fills_cache)
  • shared open/inflight query cadence (ts_last_query) - one map written by both paths, so converting only one axis would compare across clocks
  • inner reconciliation sub-check cadence gate (reconciliation_check_due)

self.clock stays for every domain timestamp: generated event/command ts_event/ts_init, and venue lookback/purge cutoffs. The two query-emitting functions are split accordingly - monotonic for the gate, self.clock for the QueryOrder ts_init. The old plain UnixNanos subtractions in these windows panicked on underflow (a venue clock slightly ahead of local time was enough); Instant durations remove that class outright.
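A minimal sketch of a monotonic cadence gate of the kind described above. It assumes std::time::Instant as a stand-in for dst::time::Instant; the name check_due, its signature, and the choice to treat a never-run check as due are illustrative assumptions, not the PR's code.

```rust
use std::time::{Duration, Instant};

/// Returns true when a periodic sub-check should run again.
/// Hypothetical signature: `last_run` is None until the check first runs.
fn check_due(last_run: Option<Instant>, now: Instant, interval: Duration) -> bool {
    match last_run {
        // Never run before: due immediately (assumed behavior).
        None => true,
        Some(last) => match now.checked_duration_since(last) {
            // Normal case: compare real elapsed time against the cadence.
            Some(elapsed) => elapsed >= interval,
            // `last` is later than `now`: treat as not yet due (assumed choice).
            None => false,
        },
    }
}
```

Because the function only sees Instants, a stalled or accelerated trading clock cannot change its answer; the domain timestamp for any emitted command would still be taken from the trading clock separately.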

One hunk that isn't a pure clock swap

handle_missing_order has a second gate comparing the order's venue ts_last against local time - genuinely mixed-axis, no monotonic instant to move to, so it stays on self.clock. Its subtraction had the same underflow foot-gun, so it's hardened from a panicking subtraction (-) to a checked duration_since: where a venue timestamp runs ahead of the local clock the old code panicked, the new code logs a warning and defers reconciliation, so a corrupted far-future ts_last (for example a double-scaled timestamp) is visible to operators rather than silently stalling that order's reconciliation forever. Flagging it as the one behavior fix rather than a pure refactor.
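The warn-and-defer shape can be sketched as follows, using plain u64 nanoseconds as a stand-in for UnixNanos. The Recency enum, the venue_recency function, and the eprintln! call (in place of a real logger) are hypothetical names chosen for illustration only.

```rust
/// Outcome of the venue-recency gate in this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
enum Recency {
    /// Venue timestamp is ahead of the local clock: warn and defer.
    VenueAhead,
    /// Order was updated within the threshold: defer.
    Recent,
    /// Order is old enough to reconcile.
    Settled,
}

fn venue_recency(now_ns: u64, ts_last_ns: u64, threshold_ns: u64) -> Recency {
    // A plain `now_ns - ts_last_ns` underflows when ts_last_ns > now_ns
    // (a panic in debug builds for u64); checked_sub returns None instead.
    match now_ns.checked_sub(ts_last_ns) {
        None => {
            eprintln!("venue ts_last {ts_last_ns} is ahead of local time {now_ns}; deferring");
            Recency::VenueAhead
        }
        Some(age) if age < threshold_ns => Recency::Recent,
        Some(_) => Recency::Settled,
    }
}
```

Keeping the venue-ahead case as its own variant makes the deferral observable, which is the point of the warning: a far-future timestamp no longer looks the same as an ordinary recent update.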

Choices worth flagging

  • prune_recent_fills_cache is pub, so it must preserve the old f64-to-u64 cast's saturation exactly: negative/NaN saturated to 0 (prune all), and positive overflow / +inf saturated to u64::MAX (keep all). try_from_secs_f64 returns Err for both, so the fallback branches on the sign - Err(_) if ttl_secs > 0.0 => Duration::MAX, else Duration::ZERO - rather than collapsing every Err to zero.
  • Removed ExecutionManager::generate_timestamp_ns; the cadence-gate change was its last caller.
  • Query-cadence sites use checked_duration_since(...).is_none_or(...), mirroring each original site rather than elapsed().
  • The stalled-clock regression is pinned by a pure-function unit test on reconciliation_check_due (it no longer references self.clock), not a full integration harness.
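The sign-based fallback from the first bullet can be sketched like this. The helper name ttl_from_secs is hypothetical; Duration::try_from_secs_f64, Duration::MAX, and Duration::ZERO are standard library items.

```rust
use std::time::Duration;

/// Converts a TTL in seconds to a Duration, saturating like an
/// `f64 as u64` cast would: negative/NaN -> zero, overflow/+inf -> max.
fn ttl_from_secs(ttl_secs: f64) -> Duration {
    match Duration::try_from_secs_f64(ttl_secs) {
        Ok(ttl) => ttl,
        // Positive overflow or +inf: keep every entry.
        Err(_) if ttl_secs > 0.0 => Duration::MAX,
        // Negative or NaN (NaN > 0.0 is false): prune every entry.
        Err(_) => Duration::ZERO,
    }
}
```

Collapsing every Err to Duration::ZERO would compile just as well, which is why the distinction is easy to miss: it only changes behavior for oversized or infinite TTLs.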

Tests

Affected inflight / open-order / recent-fills tests migrate to tokio paused virtual time (tokio test-util dev-dependency, matching crates/data), plus a position-grace expiry regression. The key cases are differential - advancing one clock axis at a time: monotonic-only to prove the gate fires on real time, self.clock-only to prove a regression back to it would be caught, so "advance both" can't mask a wrong-clock implementation.
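To illustrate why the differential cases matter, here is a small sketch without tokio: two gates measure the same window, one on the monotonic axis and one on trading-clock nanoseconds, and a helper advances each axis independently. All names (due_monotonic, due_trading, observe) are hypothetical, and explicit Instant arithmetic stands in for paused virtual time.

```rust
use std::time::{Duration, Instant};

/// Correct gate: measures the window on the monotonic axis.
fn due_monotonic(marked: Instant, now: Instant, window: Duration) -> bool {
    now.checked_duration_since(marked)
        .map_or(false, |elapsed| elapsed >= window)
}

/// Wrong-clock gate: measures the same window on trading-clock nanoseconds.
fn due_trading(marked_ns: u64, now_ns: u64, window: Duration) -> bool {
    now_ns
        .checked_sub(marked_ns)
        .map_or(false, |elapsed| u128::from(elapsed) >= window.as_nanos())
}

/// Advances each axis independently and reports
/// (monotonic gate result, trading-clock gate result).
fn observe(mono_advance: Duration, trading_advance: Duration, window: Duration) -> (bool, bool) {
    let marked = Instant::now();
    let marked_ns: u64 = 1_000_000_000;
    let trading_ns = u64::try_from(trading_advance.as_nanos()).unwrap_or(u64::MAX);
    (
        due_monotonic(marked, marked + mono_advance, window),
        due_trading(marked_ns, marked_ns.saturating_add(trading_ns), window),
    )
}
```

With a 5-second window, advancing both axes by 5 seconds makes both gates report due, so that case cannot tell the implementations apart. Advancing only the monotonic axis makes only due_monotonic fire, and advancing only the trading axis makes only due_trading fire, which is what lets a test pin the intended clock.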

Follow-ups (deliberately not in this PR)

  • Consolidate the recency-map pattern. This change leaves four IndexMap<_, dst::time::Instant> gates with identical "mark now / within window? / prune older than TTL" semantics (ts_last_query, order_local_activity, position_local_activity, recent_fills_cache). A small RecencyMap<K> that owns Instant::now() internally would make the monotonic-clock choice structural rather than a per-call-site discipline, so a future call site could not reach for the trading clock again. It is also the destination the wider sweep below migrates onto.
  • A wider sweep for the same class. "Real-time window measured on the trading clock" is a pattern, not a one-off. This PR closes the live-reconciliation instances, but one remains open even here: handle_missing_order's venue-ts_last recency gate - hardened against underflow in this PR, but still a cross-axis compare. A follow-up could re-base that on local receipt time (landing it on the RecencyMap above) and audit the other timers/throttles (data-engine cadences, the order emulator, cache-purge intervals) for the same shape, moving the genuinely real-time ones onto dst::time.
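A rough sketch of what such a RecencyMap<K> could look like, under stated assumptions: std's HashMap and Instant stand in for IndexMap and dst::time::Instant, the method names are illustrative, and a mark later than `now` is treated as still within the window (one possible "safe direction").

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// Hypothetical recency map: stamps come only from the monotonic clock,
/// so callers never supply a timestamp of their own when marking.
struct RecencyMap<K> {
    marks: HashMap<K, Instant>,
}

impl<K: Eq + Hash> RecencyMap<K> {
    fn new() -> Self {
        Self { marks: HashMap::new() }
    }

    /// Records activity for `key` at the current monotonic instant.
    fn mark(&mut self, key: K) {
        self.marks.insert(key, Instant::now());
    }

    /// True when `key` was marked less than `window` before `now`.
    /// A mark later than `now` counts as within the window (assumed).
    fn within_at(&self, key: &K, window: Duration, now: Instant) -> bool {
        self.marks.get(key).map_or(false, |marked| {
            now.checked_duration_since(*marked)
                .map_or(true, |elapsed| elapsed < window)
        })
    }

    /// Drops entries marked at least `ttl` before `now`.
    fn prune_older_than_at(&mut self, ttl: Duration, now: Instant) {
        self.marks.retain(|_, marked| {
            now.checked_duration_since(*marked)
                .map_or(true, |elapsed| elapsed < ttl)
        });
    }

    fn len(&self) -> usize {
        self.marks.len()
    }
}
```

Since mark takes no timestamp argument, there is no parameter through which a trading-clock value could be passed in, which is the structural guarantee the bullet describes.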

Separately from this change: cargo test -p nautilus-live hangs in tests/node.rs on develop under a nightly toolchain - I believe in serial_tests::test_error_log_triggers_graceful_shutdown, though I've not nailed it down yet (that test passes in isolation, so it may be a cross-test interaction). Not reproduced on the pinned stable 1.96.0, which is presumably why CI stays green. Worth a look if it shows up in a CI-adjacent setup.

cjdsellers (Member) commented

Hi @folknor,

Thanks for the PR, and for the differential paused-time tests: advancing one clock axis at a time is exactly the right way to pin these gates.

The direction reads right to me, and it closes out the family as you described. I traced every converted site against the base: gate directions and retry/escalation thresholds are preserved, every checked_duration_since None arm resolves to the safe direction, the generate_timestamp_ns removal is complete, and the tokio test-util dev-dependency cannot reach production builds.

A couple of things to fix before this lands:

  • handle_missing_order: the new venue-ahead branch defers silently. Bounded skew just delays escalation, but a corrupted far-future ts_last (for example a double-scaled timestamp) now stalls that order's reconciliation forever with no signal, where the old code at least failed loudly. I suggest we add a warning log on the None arm so operators can see it.
  • prune_recent_fills_cache: the comment says the unwrap_or(Duration::ZERO) fallback matches the old cast semantics, but that only holds for negative/NaN. The old cast saturated overflow (+inf, oversized TTLs) to u64::MAX, keeping every entry; the new fallback prunes every entry. Either saturate positive overflow to Duration::MAX or narrow the comment, since this is a pub fn.

Not blocking, your call:

  • The local-activity gate is now repeated at three sites, and two comparison idioms coexist in the same pass (.elapsed() per order vs checked_duration_since against a loop-top now). One small predicate taking the pass's now would settle both, and avoids resampling Instant::now() per order in the reconciliation loop.
  • dst::time::Duration is plain std::time::Duration (only Instant is feature-switched); both files already import Duration, and the long form reads as if it were part of the seam.
  • Several ts_*-named fields now hold monotonic Instants (ts_submitted, last_query_ts, ts_last_query, the node's ts_last_*). Renaming them would keep the axis split this PR establishes readable, since repo-wide ts_* means domain UnixNanos.
  • While you're in tests/manager.rs: the "); got {}" assertion message from #4366 (Measure position-reconciliation grace on the monotonic clock) could switch to "was {}" per repo style.

Let me know if anything is unclear!

folknor (Collaborator, Author) commented Jul 4, 2026

I'll handle the two blockers, and two of the optionals here (2 and 4).

Afterwards I'll file two follow-up PRs stacked on top of this one: (1) the shared predicate, consolidation, and renames, and (2) closing the class with the wider sweep.

I was thinking I can wait for this to land before filing the first followup. But if you prefer I file all 3 before anything lands, I can - just say the word.

The continuous reconciliation checks measured their real-time settling
windows on self.clock, which a clock factory can drive fast, anchor to a
foreign epoch, or stall, so an N-second window stops meaning N real seconds.
nautechsystems#4366 fixed this for the position grace; this moves the rest of the family
onto the monotonic dst::time clock:

- open-order local-activity grace (order_local_activity)
- inflight submit and query timeout (InflightCheck)
- recent-fills dedup TTL (recent_fills_cache)
- shared open/inflight query cadence (ts_last_query)
- inner reconciliation sub-check cadence gate (run_reconciliation_checks)

self.clock is kept for every domain timestamp: generated event and command
ts_event/ts_init, and venue lookback and purge cutoffs. The order-recency
gate in handle_missing_order stays on domain time but is hardened from a
panicking UnixNanos subtraction to a checked one that logs a warning and
defers when a venue timestamp runs ahead of local time. Also drops the
now-unused ExecutionManager::generate_timestamp_ns accessor.

Tests migrate to tokio paused virtual time (test-util dev-dependency), with
differential cases that advance one clock axis at a time to prove the split.

Coded by an LLM.
folknor force-pushed the fix-reconciliation-clock-gates branch from b8400bb to 6954162 on July 4, 2026 11:21
folknor (Collaborator, Author) commented Jul 4, 2026

I've pushed the new commit, forwarded the branch, and updated the PR description.

cjdsellers (Member) left a comment

Thanks for the follow-ups @folknor, good changes 👌

cjdsellers merged commit 12d05ac into nautechsystems:develop on Jul 4, 2026
32 checks passed
folknor deleted the fix-reconciliation-clock-gates branch on July 4, 2026 14:21
folknor added a commit to folknor/nautilus_trader that referenced this pull request Jul 4, 2026
The live execution manager tracked four "mark now / within window? /
prune older than TTL" gates as bare IndexMap<K, dst::time::Instant>:
order_query_recency, order_local_activity, position_local_activity,
and recent_fills_cache. Every call site had to remember to stamp from
the monotonic dst::time clock rather than the trading self.clock, and
to handle a future-marked instant in the safe direction by hand.

Extract a RecencyMap<K> that owns dst::time::Instant::now() internally,
so the monotonic-clock choice becomes structural: a call site can no
longer reach for self.clock to build a recency gate. within/within_at,
elapsed/elapsed_at, and prune_older_than all resolve a None
checked_duration_since to the safe direction, matching each original
site. Migrate the four maps onto it.

Also apply the residual ts_*-named renames cjdsellers flagged in the
nautechsystems#4376 review, since ts_* repo-wide means a domain UnixNanos but these
hold monotonic instants: InflightCheck::ts_submitted -> submitted_at,
last_query_ts -> last_query_at, the node scheduler's ts_last_* locals
and state fields -> last_*_check, and the one map field whose ts_
prefix survived the type change, ts_last_query -> order_query_recency.

Add focused RecencyMap unit tests for mark/contains/remove, monotonic
expiry under paused time, pruning, and safe handling of a
future-marked instant.

Coded by an LLM.
folknor added a commit to folknor/nautilus_trader that referenced this pull request Jul 4, 2026
`handle_missing_order` had a recency gate comparing the order's venue
`ts_last` against `self.clock` now - a cross-axis compare that nautechsystems#4376
hardened against underflow but left on the trading clock. Under a
custom live/sandbox clock factory the trading clock is not wall-paced
(it can run accelerated or sit on a foreign epoch), so that window did
not measure the real settling time it was meant to, and a corrupt
far-future `ts_last` could stall the order's reconciliation.

Drop the venue-`ts_last` gate. The missing-order settling window is now
solely the monotonic `order_local_activity` recency gate (the
`RecencyMap` from the recency-map consolidation), which measures real
receipt-time elapsed at any clock speed. This also removes the
warn-and-defer arm nautechsystems#4376 added for a far-future `ts_last`: with the
cross-axis gate gone there is no longer a failure mode to warn about -
the order simply reconciles once the real grace expires.

Audit of the remaining live timers for the same class: the cache-purge
intervals are deliberately left on the ExecutionEngine clock-timer path
so they stay controlled by the injected Clock for custom-clock callers;
a comment and a conversion test now pin that. Data-engine, order
emulator, and core timing stay domain/deterministic, and
`snapshot_positions_interval_secs` is left as-is (no live monotonic
replacement).

The missing-order test becomes a differential paused-time case: with a
far-future venue `ts_last` present throughout, recent local activity
defers, and after the monotonic grace expires reconciliation proceeds.

Coded by an LLM.
cjdsellers pushed a commit that referenced this pull request Jul 5, 2026
The live execution manager tracked four "mark now / within window? /
prune older than TTL" gates as bare IndexMap<K, dst::time::Instant>:
order_query_recency, order_local_activity, position_local_activity,
and recent_fills_cache. Every call site had to remember to stamp from
the monotonic dst::time clock rather than the trading self.clock, and
to handle a future-marked instant in the safe direction by hand.

Extract a RecencyMap<K> that owns dst::time::Instant::now() internally,
so the monotonic-clock choice becomes structural: a call site can no
longer reach for self.clock to build a recency gate. within/within_at,
elapsed/elapsed_at, and prune_older_than all resolve a None
checked_duration_since to the safe direction, matching each original
site. Migrate the four maps onto it.

Also apply the residual ts_*-named renames cjdsellers flagged in the
#4376 review, since ts_* repo-wide means a domain UnixNanos but these
hold monotonic instants: InflightCheck::ts_submitted -> submitted_at,
last_query_ts -> last_query_at, the node scheduler's ts_last_* locals
and state fields -> last_*_check, and the one map field whose ts_
prefix survived the type change, ts_last_query -> order_query_recency.

Add focused RecencyMap unit tests for mark/contains/remove, monotonic
expiry under paused time, pruning, and safe handling of a
future-marked instant.

Coded by an LLM.
folknor added a commit to folknor/nautilus_trader that referenced this pull request Jul 5, 2026
`handle_missing_order` had a recency gate comparing the order's venue
`ts_last` against `self.clock` now - a cross-axis compare that nautechsystems#4376
hardened against underflow but left on the trading clock. Under a
custom live/sandbox clock factory the trading clock is not wall-paced
(it can run accelerated or sit on a foreign epoch), so that window did
not measure the real settling time it was meant to, and a corrupt
far-future `ts_last` could stall the order's reconciliation.

Drop the venue-`ts_last` gate. The missing-order settling window is now
solely the monotonic `order_local_activity` recency gate (the
`RecencyMap` from the recency-map consolidation), which measures real
receipt-time elapsed at any clock speed. This also removes the
warn-and-defer arm nautechsystems#4376 added for a far-future `ts_last`: with the
cross-axis gate gone there is no longer a failure mode to warn about -
the order simply reconciles once the real grace expires.

Making local activity the sole gate exposed an ordering bug in the
`LiveNode` dispatch path: acknowledgement events (`Accepted` et al)
stamped local activity and then immediately wiped it via
`clear_recon_tracking`, so a just-accepted order omitted by a lagging
venue report could be falsely rejected as NOT_FOUND_AT_VENUE. The
per-order-event tracking now lives in
`ExecutionManager::observe_order_event`, which clears first and stamps
after - matching the ordering `observe_execution_report` already used -
and the node's batch accept/cancel arms are reordered the same way.

Audit of the remaining live timers for the same class: the cache-purge
intervals are deliberately left on the ExecutionEngine clock-timer path
so they stay controlled by the injected Clock for custom-clock callers;
a comment and a conversion test now pin that. Data-engine, order
emulator, and core timing stay domain/deterministic, and
`snapshot_positions_interval_secs` is left as-is (no live monotonic
replacement).

The missing-order test becomes a differential paused-time case: with a
far-future venue `ts_last` present throughout, recent local activity
defers, and after the monotonic grace expires reconciliation proceeds.
It tracks the accepted event through `observe_order_event` - the exact
`LiveNode` call - and a dedicated regression test covers the
just-accepted-order deferral end to end.

Coded by an LLM.
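The ordering bug described in that commit message can be illustrated with a small stand-in. The Tracking struct and its fields are hypothetical; only the clear_recon_tracking name and the "clear first, stamp after" ordering come from the text above.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical stand-in for the manager's per-order tracking state.
#[derive(Default)]
struct Tracking {
    local_activity: HashMap<String, Instant>,
    inflight: HashMap<String, u32>,
}

impl Tracking {
    /// Wipes all reconciliation state for the order, including local activity.
    fn clear_recon_tracking(&mut self, order_id: &str) {
        self.local_activity.remove(order_id);
        self.inflight.remove(order_id);
    }

    fn stamp_local_activity(&mut self, order_id: &str) {
        self.local_activity.insert(order_id.to_string(), Instant::now());
    }

    /// Buggy ordering: the stamp is erased by the clear that follows it.
    fn observe_stamp_then_clear(&mut self, order_id: &str) {
        self.stamp_local_activity(order_id);
        self.clear_recon_tracking(order_id);
    }

    /// Fixed ordering: clear stale state first, then stamp.
    fn observe_clear_then_stamp(&mut self, order_id: &str) {
        self.clear_recon_tracking(order_id);
        self.stamp_local_activity(order_id);
    }

    fn has_local_activity(&self, order_id: &str) -> bool {
        self.local_activity.contains_key(order_id)
    }
}
```

With the buggy ordering a just-accepted order has no local-activity stamp left, so a recency gate that relies solely on that stamp would not defer for it; with the fixed ordering the stamp survives and the grace window applies.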
cjdsellers pushed a commit that referenced this pull request Jul 5, 2026
…time (#4387)
