Skip to content

Collator waits for external messages until timeout#2180

Open
SpyCheese wants to merge 7 commits intotestnetfrom
simplex-work
Open

Collator waits for external messages until timeout#2180
SpyCheese wants to merge 7 commits intotestnetfrom
simplex-work

Conversation

@SpyCheese
Copy link
Member

No description provided.

@github-actions
Copy link

github-actions bot commented Mar 2, 2026

  1. High: duplicate callback messages can be incorrectly marked as bad and deleted from mempool.
    validator/impl/ext-message-pool.cpp:198 allows reinserting the same hash with higher priority, and validator/impl/ext-message-pool.cpp:212 still invokes collator callbacks for it.
    In collator, duplicates are treated as errors in validator/impl/collator.cpp:6614, and got_new_external_message pushes all such errors into bad_ext_msgs_ at validator/impl/collator.cpp:6663. Later, these hashes are sent as to_delete in validator/impl/collator.cpp:6549, so a valid message can be dropped as “bad” just because it was re-announced/upgraded.

  2. Low: do_collate() moves the error out of result and then moves status again, so perf logging loses the real error status.
    At validator/impl/collator.cpp:2386 result.move_as_error() is consumed by fatal_error, then validator/impl/collator.cpp:2388 calls result.move_as_status() again on already-moved state. This likely records a synthetic/internal status instead of the original failure reason.

@github-actions
Copy link

github-actions bot commented Mar 3, 2026

  1. High: first-block error path can get stuck in an infinite retry loop without ever starting a new collation. In validator/consensus/block-producer.cpp:137, any r_candidate.is_error() for is_first_block does continue before clearing block_generation_active. For non-timeout errors, the shared future is already resolved with error, so next iteration immediately hits the same branch again forever.
    Suggested fix: in that branch, only continue on timeout, or set block_generation_active = false before continue for non-timeout errors so a fresh collate_block request is started.

@github-actions
Copy link

github-actions bot commented Mar 3, 2026

  1. High: potential busy-loop on first-slot collation failures in new retry path.
    In validator/consensus/block-producer.cpp:144, on is_first_block error you do --slot, reset start_time, and continue without any backoff. If collate_block returns a deterministic non-timeout error quickly, this can spin retries very fast (log spam + manager pressure) until the leader window changes. Consider adding a minimal retry delay / capped retry rate for this branch.

  2. Medium: callback-based external-message waiting is silently disabled in disk/hardfork managers.
    validator/manager-disk.cpp:545 and validator/manager-hardfork.cpp:380 accept std::unique_ptr<ExtMsgCallback> callback but never store/use it. That means wait_externals_until behavior differs from main manager implementation and won’t receive post-fetch external messages in these modes. If this divergence is intentional, it should be explicit; otherwise wire callback delivery similarly to manager.cpp.

@github-actions
Copy link

github-actions bot commented Mar 4, 2026

  1. High: await_with_timeout has two detached tasks racing to complete the same Promise without coordination.
    SharedFuture.h:63, SharedFuture.h:65, SharedFuture.h:72, PromiseFuture.h:190
    The timeout worker and task worker both call set_* on the same Promise object. There is no winner arbitration or synchronization, so behavior near the deadline is nondeterministic (timeout can win even if result is ready), and on a multi-threaded executor this is a data race on Promise::promise_.

  2. Medium: callback path added to get_external_messages is ignored in manager-disk and manager-hardfork, so collator waiting for new externals cannot be woken in these modes.
    manager-disk.cpp:545, manager-hardfork.cpp:380, collator.cpp:4300
    When wait_externals_until is enabled, these manager implementations return initial messages but drop the live callback, so wait_for_external_message(...) only exits on timeout. This can add avoidable per-slot delay in those environments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant