
feat(coprocessor): slow lane for dependent ops #1907

Merged
mergify[bot] merged 77 commits into main from codex/slow-lane-throttle
Feb 16, 2026

Conversation

@Eikix
Contributor

@Eikix Eikix commented Feb 4, 2026

What

  • Add binary scheduling priority on dependence_chain: 0 (fast) / 1 (slow).
  • Classify slow chains in host-listener ingest using unweighted dependent-op count (+1 per newly inserted, allowed TFHE op with dependencies) per chain per ingest pass.
  • Add --dependent-ops-max-per-chain (0 disables slow-lane classification).
  • Persist monotonic priority with GREATEST(existing, incoming) so concurrent host-listener writes cannot downgrade a chain.
  • Keep worker fast-first ordering in normal path (schedule_priority, then last_updated_at).
  • Keep oldest-first fallback (acquire_early_lock) as liveness escape hatch when no progress is possible.
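A minimal Rust sketch of the priority model described above, assuming the `schedule_priority` values from the PR text; the `merge` helper mirrors the SQL `GREATEST(existing, incoming)` behavior, so a slow mark is sticky:

```rust
/// Illustrative type, not the actual code: 0 = fast, 1 = slow.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum SchedulePriority {
    Fast = 0, // default lane
    Slow = 1, // throttled lane
}

impl SchedulePriority {
    // Mirrors GREATEST(existing, incoming): concurrent writers can only escalate.
    fn merge(existing: Self, incoming: Self) -> Self {
        existing.max(incoming)
    }
}

fn main() {
    // A later Fast write cannot downgrade an already-slow chain.
    assert_eq!(
        SchedulePriority::merge(SchedulePriority::Slow, SchedulePriority::Fast),
        SchedulePriority::Slow
    );
    assert_eq!(
        SchedulePriority::merge(SchedulePriority::Fast, SchedulePriority::Fast),
        SchedulePriority::Fast
    );
}
```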

Inheritance model

  • Add ingest-only parent metadata (inheritance_parents) to improve slow-lane inheritance without changing scheduling behavior.
  • Scheduling still uses dependencies (no-fork parallelism behavior unchanged).
  • Slow-lane inheritance now uses inheritance_parents so parallel splits are less likely to drop lineage for throttling decisions.
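The inheritance rule can be sketched as a breadth-first propagation over the in-batch graph; the `children` map (dependents keyed by the parent they list in `inheritance_parents`) and the function name are illustrative, not the actual code:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Propagate "slow" transitively through the batch's dependency graph.
/// `children` maps a chain id to the chains that list it as an inheritance parent.
fn propagate_slow(
    initially_slow: &HashSet<u64>,
    children: &HashMap<u64, Vec<u64>>,
) -> HashSet<u64> {
    let mut slow: HashSet<u64> = initially_slow.clone();
    let mut queue: VecDeque<u64> = initially_slow.iter().copied().collect();
    while let Some(chain) = queue.pop_front() {
        for &child in children.get(&chain).into_iter().flatten() {
            if slow.insert(child) {
                queue.push_back(child); // newly marked: keep walking transitively
            }
        }
    }
    slow
}

fn main() {
    // 1 -> 2 -> 3: marking 1 slow drags the whole lineage along.
    let mut children = HashMap::new();
    children.insert(1u64, vec![2u64]);
    children.insert(2u64, vec![3u64]);
    let slow0: HashSet<u64> = [1u64].into_iter().collect();
    let slow = propagate_slow(&slow0, &children);
    assert!(slow.contains(&1) && slow.contains(&2) && slow.contains(&3));
}
```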

Off mode (--dependent-ops-max-per-chain=0)

  • No new slow-lane decisions are made.
  • Startup promotes chains back to fast (schedule_priority=0) using advisory-lock serialized, batched updates.
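A hedged sketch of the bounded promotion pass; the chunking helper and batch size are assumptions, and the real code additionally serializes the batched updates behind a Postgres advisory lock:

```rust
/// Split the chains pending promotion into bounded batches, so each
/// startup update touches at most `batch_size` rows at a time.
fn promotion_batches(slow_chain_ids: &[u64], batch_size: usize) -> Vec<Vec<u64>> {
    slow_chain_ids
        .chunks(batch_size.max(1)) // guard against a zero batch size
        .map(|c| c.to_vec())
        .collect()
}

fn main() {
    let batches = promotion_batches(&[1, 2, 3, 4, 5], 2);
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2], vec![5]);
}
```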

Why

  • Isolate heavy dependent chains from normal traffic without dropping data.
  • Avoid backoff-style priority churn/inversions.
  • Improve protection against parent-slow/child-fast mismatches when chain lineage is partially split for parallelism.

How

  • DB migration adds schedule_priority and aligns pending-chain index with worker acquisition order.
  • Ingest computes per-chain dependent-op totals and marks over-cap chains as slow.
  • Ingest also:
    • inherits slow from known slow parents in DB,
    • propagates transitively to dependents in the current batch graph.
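The ingest-side classification above can be sketched as follows; the struct and field names are illustrative, and `cap == 0` reproduces the off-mode behavior:

```rust
use std::collections::HashMap;

/// Illustrative shape of one ingested op: it counts toward the chain's
/// total only if it is newly inserted, allowed, and has dependencies.
struct IngestOp {
    chain_id: u64,
    newly_inserted: bool,
    allowed: bool,
    has_dependencies: bool,
}

/// Return the chains whose dependent-op count exceeds the cap this pass.
fn over_cap_chains(ops: &[IngestOp], cap: u64) -> Vec<u64> {
    if cap == 0 {
        return Vec::new(); // off mode: never mark a chain slow
    }
    let mut counts: HashMap<u64, u64> = HashMap::new();
    for op in ops {
        if op.newly_inserted && op.allowed && op.has_dependencies {
            *counts.entry(op.chain_id).or_insert(0) += 1; // unweighted: +1 per op
        }
    }
    let mut slow: Vec<u64> = counts
        .into_iter()
        .filter(|&(_, n)| n > cap)
        .map(|(id, _)| id)
        .collect();
    slow.sort_unstable();
    slow
}

fn main() {
    let op = |chain_id, newly_inserted| IngestOp {
        chain_id,
        newly_inserted,
        allowed: true,
        has_dependencies: true,
    };
    let ops = vec![op(7, true), op(7, true), op(7, true), op(9, false)];
    assert_eq!(over_cap_chains(&ops, 2), vec![7]); // 3 > 2: chain 7 goes slow
    assert!(over_cap_chains(&ops, 0).is_empty()); // cap 0 disables classification
}
```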

Impact

  • Hot path remains in-memory for classification/propagation.
  • Priority persistence is monotonic and idempotent.
  • Off-mode promotion is bounded per batch to reduce DB pressure spikes.

Tracking

Validation

  • SQLX_OFFLINE=true cargo +1.91.1 clippy -p host-listener --all-targets -- -D warnings
  • SQLX_OFFLINE=true cargo +1.91.1 clippy -p tfhe-worker --all-targets -- -D warnings
  • SQLX_OFFLINE=true cargo +1.91.1 test -p host-listener --test host_listener_integration_tests test_slow_lane_threshold_matrix_locally -- --nocapture
  • SQLX_OFFLINE=true cargo +1.91.1 test -p host-listener --test host_listener_integration_tests test_slow_lane_cross_block_sustained_below_cap_stays_fast_locally -- --nocapture
  • SQLX_OFFLINE=true cargo +1.91.1 test -p host-listener --test host_listener_integration_tests test_slow_lane_cross_block_parent_lookup_finds_known_slow_parent_locally -- --nocapture
  • SQLX_OFFLINE=true cargo +1.91.1 test -p host-listener --test host_listener_integration_tests test_slow_lane_priority_is_monotonic_across_blocks_locally -- --nocapture
  • SQLX_OFFLINE=true cargo +1.91.1 test -p host-listener --test host_listener_integration_tests test_slow_lane_off_mode_promotes_all_chains_on_startup_locally -- --nocapture
  • SQLX_OFFLINE=true cargo +1.91.1 test -p tfhe-worker test_acquire_early_lock_ignores_priority -- --nocapture

@Eikix
Contributor Author

Eikix commented Feb 4, 2026

@claude /pr-review

@claude

claude bot commented Feb 4, 2026

Claude finished @Eikix's task in 5m 28s.


PR Review Complete

  • Pre-review validation
  • Gathering context (PR summary & guidelines)
  • Running parallel review agents
  • Validating findings
  • Posting final review

Summary

Approved - This PR adds a well-designed slow-lane scheduling mechanism for rate-limited dependent operations.

Verified as working correctly:

  • Token bucket implementation handles edge cases properly
  • SQL parameter binding order is correct after switching to sqlx::query()
  • Greylisted chains are properly cleaned up within block processing
  • Worker ordering correctly prioritizes fast lane (lane 0)
  • Migration uses IF NOT EXISTS for safety

Non-blocking design notes posted in review:

  1. Rate limiting is per-operation not per-chain (may be more aggressive than expected)
  2. release_all_owned_locks doesn't reset schedule_lane (likely intentional)

See the full review for details.


claude bot previously approved these changes Feb 4, 2026

@claude claude bot left a comment

[AUTOMATED]

Code Review Summary

This PR adds a well-designed slow-lane scheduling mechanism for rate-limited dependent operations. The implementation is sound with proper use of GREATEST for sticky lane escalation and correct lane reset on successful processing.

Verified as Working Correctly

  • Token bucket implementation handles edge cases properly (no division by zero, correct refill logic)
  • SQL parameter binding order is correct after switching to sqlx::query()
  • Greylisted chains are properly cleaned up within block processing
  • tokio::sync::Mutex doesn't poison on panic (by design)
  • Boolean to i16 conversion for schedule_lane is safe
  • Worker ordering ORDER BY schedule_lane ASC, last_updated_at ASC correctly prioritizes fast lane
  • The GREATEST SQL function ensures lanes only escalate, never de-escalate during updates
  • Migration uses IF NOT EXISTS for safety
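The acquisition order the review verified (`ORDER BY schedule_lane ASC, last_updated_at ASC`) can be reproduced in memory as a composite sort key; the struct here is illustrative, with timestamps simplified to integers:

```rust
#[derive(Debug, PartialEq)]
struct Chain {
    id: u64,
    schedule_lane: i16,    // 0 = fast, 1 = slow
    last_updated_at: u64,  // simplified timestamp
}

/// Fast lane first, then oldest-first within the same lane.
fn acquisition_order(chains: &mut Vec<Chain>) {
    chains.sort_by_key(|c| (c.schedule_lane, c.last_updated_at));
}

fn main() {
    let mut chains = vec![
        Chain { id: 1, schedule_lane: 1, last_updated_at: 10 }, // slow, older
        Chain { id: 2, schedule_lane: 0, last_updated_at: 20 }, // fast, newer
    ];
    acquisition_order(&mut chains);
    // The fast chain wins even though the slow one is older.
    assert_eq!(chains[0].id, 2);
}
```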

Design Considerations (Non-blocking)

  1. Rate limiting granularity: The limiter consumes tokens per-operation rather than per-chain. If a single chain has 100 dependent operations, it will consume ~100 tokens. This is more aggressive than limiting by chain count. This may be intentional (mentioned as "per-op" in PR description) but could be documented.

  2. release_all_owned_locks and schedule_lane: The release_all_owned_locks function (called on worker startup) doesn't reset schedule_lane to 0, unlike release_current_lock. If a worker crashes while processing a greylisted chain, that chain will remain in lane 1 after recovery until it's successfully processed. This is likely intentional (keeping previously-hot chains throttled), but worth confirming.

Tests

Good test coverage added:

  • test_acquire_next_lock_prefers_fast_lane - verifies lane ordering
  • dependent_ops_limiter_defers_after_burst - verifies token bucket behavior
  • dependent_ops_limiter_refills_over_time - verifies refill logic
  • dependent_ops_limiter_disabled_when_rate_zero - verifies rate=0 disables limiter

LGTM! Approving.

@mergify

mergify bot commented Feb 4, 2026

🧪 CI Insights

Here's what we observed from your CI run for 8b8927d.

🟢 All jobs passed!

But CI Insights is watching 👀

Contributor

@rudy-6-4 rudy-6-4 left a comment

If easy, this could be done in the ingest function instead of the database struct. (I understand the existing state database is handy for that.)

@Eikix Eikix marked this pull request as ready for review February 5, 2026 09:21
@Eikix Eikix requested a review from a team as a code owner February 5, 2026 09:21
@Eikix
Contributor Author

Eikix commented Feb 5, 2026

@claude /pr-review

@claude

claude bot commented Feb 5, 2026

Claude finished @Eikix's task in 6m 48s.


PR Review Complete

  • Pre-review validation
  • Gathering context (PR summary & guidelines)
  • Running parallel review agents
  • Validating findings
  • Posting final review

Summary

Approved - This PR implements a well-designed slow-lane scheduling mechanism for rate-limited dependent operations.

Verified as working correctly:

  • Token bucket implementation handles edge cases properly (debt accumulation, time-based refill)
  • SQL GREATEST function correctly ensures lanes only escalate (0→1), never de-escalate
  • Worker ordering schedule_lane ASC, last_updated_at ASC correctly prioritizes fast lane
  • Lane reset to 0 on successful processing is intentional design
  • Migration uses IF NOT EXISTS for safety
  • Limiter is properly disabled when rate_per_min == 0

Reviewer comments addressed:

  • Code uses sqlx::query! macro for compile-time verification ✓
  • Throttling is done per-chain (not per-op) as intended ✓

Non-blocking: Minor defensive dead code in consume() rate check (constructor already validates).

See the full review for details.


claude bot previously approved these changes Feb 5, 2026

@claude claude bot left a comment

[AUTOMATED]

Code Review Summary

This PR implements a well-designed slow-lane scheduling mechanism for rate-limited dependent operations. The implementation is sound and I found no critical bugs.

✅ Verified as Working Correctly

  • Token bucket implementation - Correctly implements a leaky bucket with debt accumulation and time-based refill
  • SQL GREATEST usage - Properly ensures lanes only escalate (0→1), never de-escalate during concurrent updates
  • Lane reset on processing - Intentionally resets to 0 when chain processing completes (correct design)
  • Worker ordering - ORDER BY schedule_lane ASC, last_updated_at ASC correctly prioritizes fast lane
  • Migration safety - Uses ADD COLUMN IF NOT EXISTS for idempotency
  • Limiter disabled correctly - Returns None from constructor when rate_per_min == 0
  • Duplicate event handling - Correctly skips update_dependence_chain when no new insertions (chains were already persisted)
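A deterministic sketch of the token-bucket behavior this review describes (time-based refill, limiter disabled when the rate is zero); the names, units, and injected clock are assumptions, not the actual limiter API:

```rust
/// Illustrative token bucket: tokens refill continuously over time,
/// capped at `capacity`. A rate of zero means no limiter at all.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate_per_sec: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    fn new(capacity: f64, rate_per_min: f64) -> Option<Self> {
        if rate_per_min == 0.0 {
            return None; // rate = 0 disables the limiter entirely
        }
        Some(Self {
            capacity,
            tokens: capacity,
            rate_per_sec: rate_per_min / 60.0,
            last_refill_secs: 0.0,
        })
    }

    /// `now_secs` is injected so the refill logic is deterministic to test.
    fn try_consume(&mut self, n: f64, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.capacity);
        self.last_refill_secs = now_secs;
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    assert!(TokenBucket::new(2.0, 0.0).is_none()); // disabled at rate 0
    let mut b = TokenBucket::new(2.0, 60.0).unwrap(); // 1 token/sec
    assert!(b.try_consume(2.0, 0.0));  // burst drains the bucket
    assert!(!b.try_consume(1.0, 0.0)); // immediately after: deferred
    assert!(b.try_consume(1.0, 1.0));  // one second later: refilled
}
```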

Reviewer Comments Addressed

Regarding @rudy-6-4's comments:

  1. The code does use sqlx::query! macro for compile-time verified queries in update_dependence_chain
  2. The throttling logic location in ingest.rs and persistence in update_dependence_chain appears appropriate for the design goals
  3. Throttling is correctly done per-chain (not per-op) - the limiter consumes tokens based on chain's total dependent op count

Non-Blocking Observations

Nit: The rate_per_sec variable calculation in consume() (line 167-169) is defensive dead code since the constructor already validates rate_per_min > 0. Could be simplified but doesn't affect correctness.

Test Coverage

Good test coverage added:

  • dependent_ops_limiter_defers_after_burst - verifies token bucket behavior
  • dependent_ops_limiter_refills_over_time - verifies refill logic
  • dependent_ops_limiter_disabled_when_rate_zero - verifies rate=0 disables limiter
  • test_acquire_next_lock_prefers_fast_lane - verifies lane ordering in worker

LGTM! Approving.

@Eikix
Contributor Author

Eikix commented Feb 5, 2026

I am still thinking about the mental model of the slow lane; it will take some time to come up with something fair and good.

@Eikix Eikix marked this pull request as draft February 5, 2026 11:10
@Eikix
Contributor Author

Eikix commented Feb 5, 2026

@claude /pr-review

@claude

claude bot commented Feb 5, 2026

Claude finished @Eikix's task in 4m 6s.


PR Review Complete

  • Pre-review validation
  • Gathering context (PR summary & guidelines)
  • Running parallel review agents
  • Validating findings
  • Posting final review

Summary

Approved - This PR implements a well-designed slow-lane scheduling mechanism for dependent operations.

Verified as working correctly:

  • SchedulePriority type correctly implements Fast=0, Slow=1
  • Weighted ops calculation matches HCU limit ratios (mul=7, div=5, rem=8)
  • SQL GREATEST ensures priority monotonically escalates (0→1 only)
  • Worker ordering correctly prioritizes fast lane with FIFO within same priority
  • Per-chain cap logic correctly disables throttling when set to 0
  • Migrations use safe IF NOT EXISTS / RENAME COLUMN
  • All SQL queries use sqlx::query! macro for compile-time verification

Test coverage:

  • test_acquire_next_lock_prefers_fast_lane validates core priority ordering

Non-blocking notes:

  • Theoretical integer overflow in weighted ops would require ~536M ops per chain per block (unrealistic)
  • Test coverage could be expanded for boundary conditions and weighted ops calculation

See the full review for details.


claude bot previously approved these changes Feb 5, 2026

@claude claude bot left a comment

[AUTOMATED]

Code Review Summary

This PR implements a well-designed slow-lane scheduling mechanism for dependent operations. The implementation correctly prioritizes chains based on their weighted dependent operation count.

✅ Verified as Working Correctly

  1. SchedulePriority type - Correctly implements Fast=0, Slow=1 with proper From traits for conversion

  2. Weighted ops calculation - Operations are weighted appropriately (mul=7, div=5, rem=8, others=1) matching HCU limit ratios

  3. SQL GREATEST usage - The GREATEST(dependence_chain.schedule_priority, EXCLUDED.schedule_priority) correctly ensures priority monotonically escalates (0→1 only, never de-escalates)

  4. Worker ordering - ORDER BY schedule_priority ASC, last_updated_at ASC correctly prioritizes fast lane, with FIFO within same priority

  5. Per-chain cap - Comparison stats.total > options.dependent_ops_max_per_chain with > 0 check correctly disables throttling when cap is 0

  6. Migration safety - Both migrations use IF NOT EXISTS / RENAME COLUMN for idempotency

  7. Parameterized SQL - All queries use sqlx::query! macro for compile-time verification and SQL injection protection
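The weighted count from item 2 can be sketched with the stated weights and `saturating_add`; the op-name strings are illustrative:

```rust
/// Weights from the review: mul=7, div=5, rem=8, everything else 1,
/// matching the HCU limit ratios.
fn op_weight(op: &str) -> u32 {
    match op {
        "mul" => 7,
        "div" => 5,
        "rem" => 8,
        _ => 1,
    }
}

/// Sum the per-op weights; saturating_add rules out integer overflow
/// even for pathologically long chains.
fn weighted_ops(ops: &[&str]) -> u32 {
    ops.iter().fold(0u32, |acc, op| acc.saturating_add(op_weight(op)))
}

fn main() {
    // mul (7) + div (5) + add (1) = 13
    assert_eq!(weighted_ops(&["mul", "div", "add"]), 13);
    assert_eq!(op_weight("rem"), 8);
}
```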

Test Coverage

test_acquire_next_lock_prefers_fast_lane validates the core priority ordering behavior

Coverage gaps (non-blocking):

  • Weighted ops calculation not directly unit tested
  • Per-chain cap threshold boundary conditions not tested
  • Monotonic priority behavior (GREATEST) not tested

Non-Blocking Design Notes

  1. Integer overflow in weighted ops - Theoretically possible but requires ~536M weighted ops per chain per block, which is unrealistic given Ethereum gas limits

  2. Slow lane starvation - By design, slow lane chains only process when fast lane is empty. This is documented behavior and acceptable for the use case.

LGTM! Approving.

@Eikix Eikix marked this pull request as ready for review February 5, 2026 15:58
@Eikix
Contributor Author

Eikix commented Feb 6, 2026

@claude /pr-review

@claude

claude bot commented Feb 6, 2026

Claude finished @Eikix's task in 5m 46s.


PR Review Complete

  • Pre-review validation
  • Gathering context (PR summary & guidelines)
  • Running parallel review agents
  • Validating findings
  • Posting final review

Summary

Approved - This PR implements a well-designed slow-lane scheduling mechanism for dependent operations.

Verified as working correctly:

  • SchedulePriority type correctly implements Fast=0, Slow=1 with proper conversions
  • Weighted ops calculation uses saturating_add() for safe arithmetic (mul=7, div=5, rem=8)
  • SQL GREATEST ensures priority monotonically escalates (0→1 only)
  • Worker ordering correctly prioritizes fast lane with FIFO within same priority
  • Per-chain cap disable (--dependent-ops-max-per-chain=0) resets all priorities and skips throttling
  • Migration safely adds column with IF NOT EXISTS and recreates index aligned with ORDER BY
  • All SQL queries use sqlx::query! macro for compile-time verification

Previous reviewer comments addressed:

  • @rudy-6-4: compile-time verified queries ✓
  • @antoniupop: index alignment, HCU weight provenance, disable mode, consolidated migrations ✓

Test coverage: test_acquire_next_lock_prefers_fast_lane validates core priority ordering

See the full review for details.


claude bot previously approved these changes Feb 6, 2026

@claude claude bot left a comment

[AUTOMATED]

Code Review Summary

This PR implements a well-designed slow-lane scheduling mechanism for dependent operations. After comprehensive review by multiple specialized agents, no bugs with confidence >= 80% were found.

✅ Verified as Working Correctly

  1. SchedulePriority type - Correctly implements Fast=0, Slow=1 with proper From traits for bidirectional conversion

  2. Weighted ops calculation - Operations weighted appropriately (mul=7, div=5, rem=8, baseline=1) matching HCU limit ratios. Uses saturating_add() to prevent integer overflow.

  3. SQL GREATEST usage - GREATEST(dependence_chain.schedule_priority, EXCLUDED.schedule_priority) correctly ensures priority monotonically escalates (0→1 only, never de-escalates)

  4. Worker ordering - ORDER BY schedule_priority ASC, last_updated_at ASC correctly prioritizes fast lane, with FIFO within same priority

  5. Per-chain cap disable - When --dependent-ops-max-per-chain=0:

    • slow_lane_enabled is false, skipping all throttling logic
    • reset_schedule_priorities() resets all priorities to 0 at startup
    • Correctly reverts to FIFO behavior
  6. Migration safety - Uses ADD COLUMN IF NOT EXISTS and recreates index with proper column ordering matching the ORDER BY clause

  7. Parameterized SQL - All queries use sqlx::query! macro for compile-time verification and SQL injection protection

✅ Previous Reviewer Comments Addressed

  • @rudy-6-4's comments: Code uses sqlx::query! macro for compile-time verification ✓
  • @antoniupop's comments:
    • Index now aligned with ORDER BY clause ✓
    • HCU weight provenance documented inline ✓
    • Feature can be fully disabled with --dependent-ops-max-per-chain=0
    • Migrations consolidated to single file ✓

Test Coverage

test_acquire_next_lock_prefers_fast_lane validates the core priority ordering behavior - verifies that a chain with schedule_priority=0 (fast) is acquired before one with schedule_priority=1 (slow), even when the slow chain has an older timestamp.

Design Notes (Non-blocking)

  1. Monotonic demotion - By design, slow chains stay slow. This is documented and mitigated by processing slow lanes when fast lane is empty.

  2. Metrics pattern - Uses .unwrap() for Prometheus metrics registration, which follows the established pattern throughout the codebase.

LGTM! Approving.

@Eikix Eikix force-pushed the codex/slow-lane-throttle branch from 7c8e71e to 4f05b54 on February 13, 2026 14:12
Use inserted-only per-chain counting to avoid underestimating required producer pressure.
Keep no_fork parallelism, but aggregate inserted-op pressure over split dependency closures for slow-lane classification.
@Eikix
Contributor Author

Eikix commented Feb 16, 2026

@mergify queue

@mergify

mergify bot commented Feb 16, 2026

Merge Queue Status

Rule: main


This pull request spent 2 hours 52 minutes 34 seconds in the queue, including 1 hour 51 minutes 6 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • check-success = run-e2e-tests / fhevm-e2e-test
  • any of [🛡 GitHub branch protection]:
    • check-success = common-pull-request/lint (bpr)
    • check-neutral = common-pull-request/lint (bpr)
    • check-skipped = common-pull-request/lint (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-success = coprocessor-cargo-listener-tests/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-success = coprocessor-cargo-test/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-test/cargo-tests (bpr)
    • check-skipped = coprocessor-cargo-test/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-success = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-neutral = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-skipped = coprocessor-dependency-analysis/dependencies-check (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-neutral = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-success = gateway-contracts-deployment-tests/sc-deploy (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = kms-connector-tests/test-connector (bpr)
    • check-neutral = kms-connector-tests/test-connector (bpr)
    • check-success = kms-connector-tests/test-connector (bpr)

mergify bot added a commit that referenced this pull request Feb 16, 2026
@mergify mergify bot merged commit e1734b9 into main Feb 16, 2026
64 checks passed
@mergify mergify bot deleted the codex/slow-lane-throttle branch February 16, 2026 19:49
@mergify mergify bot removed the queued label Feb 16, 2026
immortal-tofu added a commit that referenced this pull request Apr 8, 2026
Cover undocumented user-facing changes since v0.11.3:
- coprocessor: DB state revert runbook (operator workflow, #2122)
- coprocessor: slow lane for dependent ops — config + fundamentals (#1907)
- gateway-contracts: KMS context ID — new env var, event rename, errors (#IGatewayConfig)
- gateway-contracts: IDecryption breaking changes — isUserDecryptionReady overload,
  isDelegatedUserDecryptionReady param removal, new errors (#2137)
- library-solidity: isPublicDecryptionResultValid view function (#1987)
- library-solidity: FHE.fromExternal uninitialized handle support (#1969)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>