@shane-moore commented Jan 20, 2026

Dependencies:

Scope: This PR focuses on the validator store changes only. Downstream changes to QBFT consensus and P2P message verification for committee-based partial signatures will be completed in a separate PR.

Issue Addressed

Required for committee-based partial signature collection. The full list of attesting and sync committee validators must be available at slot start before produce_selection_proof is called, which was not possible with the previous single-phase metadata approach.

Proposed Changes

  • Refactor MetadataService to use a phased approach that spans the full slot duration, rather than computing all metadata at a single point
    • Split metadata into three distinct phases: VotingAssignments at slot start, VotingContext at 1/3 slot, and AggregationAssignments at 2/3 slot
    • Add event-driven coordination between MetadataService and DutiesService using watch channels to ensure duties are polled before reading them (a minimal sketch of this pattern follows the list below)
    • Update lighthouse dependencies to a fork that exposes the new duties service polling events
    • Update produce_selection_proof and produce_sync_selection_proof to use committee-based partial signature collection after the Boole fork
      • These methods now request signatures from all committee operators rather than just the validator's operators
      • VotingAssignments provides the expected message count for committee-based batching
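
As a rough illustration of the watch-channel coordination mentioned above, here is a minimal, self-contained sketch using tokio's watch primitives. The channel name (attesters_poll_rx) mirrors the one discussed in the review below, but the surrounding types are simplified stand-ins rather than the actual DutiesService/MetadataService API.

use tokio::sync::watch;

type Slot = u64;

#[tokio::main]
async fn main() {
    // DutiesService side: publishes the highest slot whose duties have been polled.
    let (attesters_poll_tx, mut attesters_poll_rx) = watch::channel::<Slot>(0);

    // Simulated DutiesService loop: poll duties, then announce completion per slot.
    tokio::spawn(async move {
        for slot in 1..=3u64 {
            // ... poll attester duties for `slot` from the beacon node ...
            let _ = attesters_poll_tx.send(slot);
        }
    });

    // MetadataService Phase 1: block until duties for `slot` are known before
    // caching voting assignments. `wait_for` resolves once the published value
    // satisfies the predicate, or errors if the sender is dropped.
    let slot: Slot = 3;
    match attesters_poll_rx.wait_for(|&polled| polled >= slot).await {
        Ok(_) => {
            // Safe to read duties and cache VotingAssignments for `slot` here.
        }
        Err(_) => {
            // Channel closed: skip the slot rather than reading stale duties.
        }
    }
}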

Additional Info

The previous approach computed all metadata at 1/3 into the slot. This created a race condition where duties might not be available yet, and aggregation information wasn't accurate until after selection proofs were computed.

The new phased approach:

  • Phase 1 (slot start): Caches voting assignments immediately after DutiesService completes polling
  • Phase 2 (1/3 slot): Fetches beacon vote and builds voting context using cached assignments
  • Phase 3 (2/3 slot): Reads aggregator status after selection proofs are computed

This ensures each piece of metadata is available at the earliest point it can be accurately determined.
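
As a very rough sketch of that per-slot schedule (plain tokio sleeps against an assumed 12-second slot; the real service derives these offsets from its slot clock, and the helper names here are hypothetical):

use std::time::Duration;
use tokio::time::sleep;

const SLOT_DURATION: Duration = Duration::from_secs(12); // assumed slot length

async fn run_slot_phases(slot: u64) {
    // Phase 1 (slot start): duties were just polled; cache voting assignments.
    cache_voting_assignments(slot).await;

    // Phase 2 (1/3 slot): the beacon vote is available; build the voting context
    // from the assignments cached in Phase 1.
    sleep(SLOT_DURATION / 3).await;
    build_voting_context(slot).await;

    // Phase 3 (2/3 slot): selection proofs exist by now; read aggregator status.
    sleep(SLOT_DURATION / 3).await;
    compute_aggregation_assignments(slot).await;
}

// Hypothetical stand-ins for the real phase bodies.
async fn cache_voting_assignments(_slot: u64) {}
async fn build_voting_context(_slot: u64) {}
async fn compute_aggregation_assignments(_slot: u64) {}

#[tokio::main]
async fn main() {
    run_slot_phases(0).await;
}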

@shane-moore changed the title from "Metadata service refactor" to "refactor: phase metadata service across full slot duration" on Jan 20, 2026
@shane-moore (Author)

@claude, could you review the changes just for these two files: anchor/validator_store/src/lib.rs and anchor/validator_store/src/metadata_service.rs

claude-code-actions-sigp bot commented Jan 20, 2026

Claude finished @shane-moore's task in 1m 23s


Code Review: Metadata Service Refactoring

I've reviewed the two files and the phased metadata service refactoring looks solid overall. Here are my observations:


Architecture & Design ✅

Excellent phasing strategy: The three-phase approach elegantly solves the race condition problem:

  • Phase 1 (slot start): VotingAssignments cached after DutiesService polling completes
  • Phase 2 (1/3 slot): VotingContext built using cached assignments + beacon vote
  • Phase 3 (2/3 slot): AggregationAssignments computed after selection proofs exist

The event-driven coordination using watch channels (attesters_poll_rx, sync_poll_rx) ensures duties are available before reading them - this is a smart solution to the race condition.


Strengths

  1. Clear separation of concerns: Each phase has a distinct responsibility and timing
  2. Well-documented counting patterns: The VotingAssignments methods (selection_proof_count_for_committee vs committee_message_count_for_committee) clearly document the different counting semantics (see the sketch after this list)
  3. Comprehensive testing: The test suite covers multi-subnet validators, filtering, and edge cases
  4. Thread-safe coordination: Watch channels provide clean async coordination without blocking
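
To make the two counting patterns concrete, here is a small stand-alone sketch of what their semantics appear to be from this review (one selection proof per attesting validator plus one per covered sync subnet, versus one committee message per validator holding any duty). The struct and method names echo the PR, but the exact behavior shown is an assumption for illustration, not the actual implementation.

use std::collections::{HashMap, HashSet};

// Simplified stand-in for the PR's VotingAssignments (illustrative only).
struct VotingAssignments {
    attesting_validators: HashSet<usize>,
    sync_validators_by_subnet: HashMap<usize, Vec<u64>>,
}

impl VotingAssignments {
    // Selection proofs: one per attesting validator, plus one per sync subnet
    // covered by each sync-committee validator (assumed "+N for N subnets").
    fn selection_proof_count_for_committee(&self, in_committee: impl Fn(&usize) -> bool) -> usize {
        let attesting = self.attesting_validators.iter().filter(|&i| in_committee(i)).count();
        let sync: usize = self
            .sync_validators_by_subnet
            .iter()
            .filter(|&(i, _)| in_committee(i))
            .map(|(_, subnets)| subnets.len())
            .sum();
        attesting + sync
    }

    // Committee messages: one per validator with any duty, regardless of how
    // many sync subnets it covers (assumed semantics).
    fn committee_message_count_for_committee(&self, in_committee: impl Fn(&usize) -> bool) -> usize {
        let attesting = self.attesting_validators.iter().filter(|&i| in_committee(i)).count();
        let sync = self.sync_validators_by_subnet.keys().filter(|&i| in_committee(i)).count();
        attesting + sync
    }
}

fn main() {
    let assignments = VotingAssignments {
        attesting_validators: HashSet::from([1, 2]),
        sync_validators_by_subnet: HashMap::from([(2, vec![0, 3])]),
    };
    let committee: HashSet<usize> = HashSet::from([1, 2]);
    // Validator 2 attests and covers two sync subnets, validator 1 only attests.
    assert_eq!(assignments.selection_proof_count_for_committee(|i| committee.contains(i)), 4);
    assert_eq!(assignments.committee_message_count_for_committee(|i| committee.contains(i)), 3);
}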

Issues & Concerns

1. Error handling in metadata_service.rs lines 93-94 ⚠️

let _ = attesters_poll_rx.wait_for(|&s| s >= slot).await;
let _ = sync_poll_rx.wait_for(|&s| s >= slot).await;

Issue: Errors from wait_for are silently ignored. If the watch channel closes or returns an error, the code proceeds anyway and may read stale/incorrect duties.

Recommendation: Log errors and potentially skip the update:

if attesters_poll_rx.wait_for(|&s| s >= slot).await.is_err() {
    error!("Attesters poll channel closed or errored for slot {}", slot);
    continue;
}
if sync_poll_rx.wait_for(|&s| s >= slot).await.is_err() {
    error!("Sync poll channel closed or errored for slot {}", slot);
    continue;
}

2. Potential data race in lib.rs line 317 ⚠️

// Create the validator voting assignmentsrmation
let validator_duty = ValidatorDuty {

Issue: There's a typo in the comment ("assignmentsrmation" should be "assignments information"), but more importantly, this code doesn't use the new phased approach. The block proposal flow doesn't seem to leverage VotingAssignments at all - it constructs ValidatorDuty directly.

Question: Should block proposals also use the cached voting assignments, or is it intentional that they don't need them?

3. Missing bounds checking in lib.rs lines 1955-1972 ⚠️

fn create_test_voting_assignments(
    attesting_validators: Vec<usize>,
    sync_validators_by_subnet: Vec<(usize, Vec<u64>)>,
) -> VotingAssignments {

Issue: Test helper converts Vec<u64> to SyncSubnetId without validation. While this is test code, SyncSubnetId::new() might panic or wrap on invalid values.

Recommendation: Add a comment about valid ranges or use try_into() with proper error handling.

4. Subtle timing assumption in metadata_service.rs line 120 ⚠️

sleep(duration_to_next_slot + slot_duration / 3).await;

Issue: If the async runtime is under heavy load, the task might wake up late and miss the intended 1/3 slot timing. The comment mentions this is when beacon_vote becomes available, but there's no verification that we're actually within the correct time window.

Recommendation: After waking, verify the current slot is still correct:

sleep(duration_to_next_slot + slot_duration / 3).await;

let current_slot = self_clone_phase2.slot_clock.now().ok_or("Failed to read slot clock")?;
if current_slot != expected_slot {
    warn!("Phase 2 woke up late: expected slot {}, got {}", expected_slot, current_slot);
    continue; // Skip this slot
}

5. Documentation clarity in lib.rs lines 764-782 📝

The doc comments for VotingAssignments are excellent, but there's some redundancy:

/// Cached validator voting assignmentsrmation for a slot.
///
/// This struct caches voting assignments computed at slot start and reuses it at 1/3 slot,
/// eliminating redundant computation. It supports two different counting patterns:

Suggestion: Fix the typo "assignmentsrmation" → "assignments information" on line 764 and throughout the file.

6. Potential panic in lib.rs line 742 ⚠️

key_hex.strip_prefix("0x").unwrap_or(key_hex),

Context: This is inside decrypt_key_share which returns Result<_, ()>. While the unwrap_or here is safe (it just falls back to key_hex), the overall error handling in this function returns () instead of a descriptive error type.

Recommendation: Consider using a proper error type instead of () for better error context:

fn decrypt_key_share(
    operator_key: &Rsa<Private>,
    encrypted_private_key: [u8; ENCRYPTED_KEY_LENGTH],
    pubkey_bytes: PublicKeyBytes,
) -> Result<SecretKey, DecryptionError> {
    // ... with proper error enum
}

7. Race condition check needed in lib.rs lines 467-483 🔍

pub async fn get_voting_assignments(
    &self,
    slot: Slot,
) -> Result<Arc<VotingAssignments>, Error> {
    let Some(voting_assignments) = self
        .voting_assignments_tx
        .subscribe()
        .wait_for(|d| d.as_ref().is_some_and(|info| info.slot >= slot))
        .await
        .ok()
        .and_then(|info| info.clone())
    else {
        return Err(Error::SpecificError(SpecificError::MetadataChannelClosed));
    };

    if voting_assignments.slot == slot {
        Ok(voting_assignments)
    } else {
        Err(Error::SpecificError(SpecificError::MetadataSlotPassed))
    }
}

Issue: There's a subtle TOCTOU (time-of-check-time-of-use) issue here. Between wait_for completing and the final if check, the slot could advance further. While unlikely, this could cause the function to return an assignment for a newer slot when the caller expected the exact slot.

Analysis: Actually, looking at the logic more carefully - if wait_for returns a slot >= requested_slot, and then we check if it equals requested_slot, we correctly error if it's greater. This is correct behavior. However, the error message MetadataSlotPassed is slightly misleading - it suggests we're late, when in reality we might have just gotten a newer slot.

Suggestion: Consider clarifying the error variant name or the logic comment.


Code Quality Issues

8. Inconsistent error types in lib.rs lines 976-981 📝

/// Requested slot has already passed the current cached slot in VotingAssignments
MetadataSlotPassed,
/// Watch channel for VotingAssignments has been closed
MetadataChannelClosed,
/// Requested slot has already passed the current cached slot in AggregationAssignments
AggregatorInfoSlotPassed,
/// Watch channel for AggregationAssignments has been closed
AggregatorInfoChannelClosed,

Observation: You have separate error variants for VotingAssignments vs AggregationAssignments. This is actually good (specific errors are better than generic ones), but consider if you could use a more generic pattern with context:

enum MetadataError {
    SlotPassed { phase: MetadataPhase, requested: Slot, actual: Slot },
    ChannelClosed { phase: MetadataPhase },
}

This is a minor style point - the current approach is perfectly fine.


Performance & Efficiency

9. Potential optimization in lib.rs line 243 💡

let num_signatures_to_collect = voting_context_tx
    .voting_assignments
    .committee_message_count_for_committee(|idx| {
        committee_validator_indices.contains(idx)
    });

Observation: HashSet::contains is O(1), so this is efficient. However, the closure is called for every validator in the voting assignments. If you have many validators, consider inverting the logic:

// Count only validators that are both in voting_assignments and in committee
let num_signatures_to_collect = committee_validator_indices
    .iter()
    .filter(|idx| voting_context_tx.voting_assignments.attesting_validators.contains(idx) || 
                   voting_context_tx.voting_assignments.sync_validators_by_subnet.contains_key(idx))
    .count();

Actually, looking at the method implementation, the current approach is cleaner. The performance difference would be negligible. Keep as-is.


Testing

10. Test coverage is excellent

The tests in lib.rs lines 1948-2166 thoroughly cover:

  • Multi-subnet validators
  • Filtering with committee membership
  • Counting method differences
  • Overlapping attesting and sync validators
  • Watch channel behavior
  • Error conditions

Recommendation: Consider adding an integration test that exercises all three phases in sequence to verify the end-to-end timing and coordination.


Documentation

11. Missing phase timing documentation 📝

Observation: The phase timing is well-documented in metadata_service.rs comments, but lib.rs methods like get_voting_assignments don't mention when they become available.

Suggestion: Add timing context to the public API docs:

/// Get validator voting assignments, waiting if not yet available for this slot.
///
/// VotingAssignments become available at slot start (Phase 1) after DutiesService
/// completes polling. This method blocks until the assignments for the requested
/// slot are published.
///
/// # Timing
/// - Available: Slot start (after DutiesService polling)
/// - Used by: Selection proof generation, VotingContext (Phase 2)

Summary

Overall assessment: This is a well-designed refactoring that solves a real architectural problem. The phased approach is clean, the synchronization is correct, and the testing is thorough.

Priority fixes:

  1. ⚠️ Handle errors from wait_for in Phase 1 (metadata_service.rs:93-94)
  2. 📝 Fix typo "assignmentsrmation" throughout
  3. 🔍 Clarify the block proposal flow - should it use VotingAssignments?

Nice-to-haves:

  • Add slot timing verification in Phase 2/3
  • Improve decrypt_key_share error handling
  • Add integration test for full phase cycle

No blocking issues found - the code is safe, follows Rust best practices, and maintains thread safety correctly. Great work on the refactoring! 🎉


@shane-moore (Author)

@claude, could you review the changes just for these two files again: anchor/validator_store/src/lib.rs and anchor/validator_store/src/metadata_service.rs

for your points:

  1. updated
  2. updated
  3. ignore
  4. ignore
  5. updated
  6. ignore
  7. ignore
  8. ignore
  9. ignore
  10. ignore
  11. ignore

Additionally, I modified produce_selection_proof and produce_sync_selection_proof to be fork-gated and to use these new services, so please take a look at those changes in the last 3 commits as well.


claude-code-actions-sigp bot commented Jan 20, 2026

Claude finished @shane-moore's task in 1m 43s


Code Review: Follow-up on Metadata Service Changes

Reviewed the updated files and the fork-gated selection proof changes in the last 3 commits.

Todo List:

  • Check recent commits to understand changes
  • Read anchor/validator_store/src/lib.rs
  • Read anchor/validator_store/src/metadata_service.rs
  • Review fixes for points 1, 2, and 5
  • Review new fork-gated selection proof changes
  • Provide comprehensive feedback

Summary

The updates look excellent! You've addressed the critical error handling issues and the typo. The new fork-gated selection proof implementation is well-designed and follows a clean pattern. I found one critical issue and a few minor suggestions below.


Issues Addressed from Previous Review

✅ Point 1: Error handling for wait_for (metadata_service.rs:95-101)

Status: Fixed correctly

The error handling now properly logs and continues to the next slot, which is the right behavior:

if attesters_poll_rx.wait_for(|&s| s >= slot).await.is_err() {
    error!(%slot, "Attesters poll channel closed, skipping slot");
    continue;
}

This prevents signing with potentially stale duties if the poll channels close unexpectedly. Good fix!

✅ Point 2: Typo fixed

Status: Partially fixed - one instance remaining

The typo "assignmentsrmation" has been fixed in most places, but I found one remaining instance at lib.rs:764:

/// Cached validator voting assignmentsrmation for a slot.

Should be "assignments information" or just "assignments".

✅ Point 5: Documentation fixed

Status: Fixed

The struct and method documentation is now clear and consistent.


New Fork-Gated Selection Proof Implementation

The fork-gated approach for produce_selection_proof and produce_sync_selection_proof is well-implemented. Here's my analysis:

Strengths ✅

  1. Clean fork gating: Using self.fork_schedule.active_fork(epoch) >= Fork::Boole is the correct approach (a simplified gating sketch follows this list)
  2. Consistent batching: Both functions use the same SelectionProofBatchId::new(slot, committee_id) for base_hash, which ensures all selection proofs batch together into one P2P message per committee
  3. Correct counting method: Uses selection_proof_count_for_committee which counts +N for sync validators (N = subnet count), matching the semantics
  4. Divergent operator view handling: Building committee_validator_indices from the database state correctly handles cases where operators have different views of committee membership
  5. Backward compatibility: Preserves original single-validator behavior for pre-Boole forks
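
For reference, the gating pattern in point 1 roughly amounts to the dispatch below. This is a simplified, self-contained sketch: the enum variants, the CollectionScope type, and the numbers are hypothetical stand-ins, and the real methods also build the committee index set, wait on VotingAssignments, and compute the batch hash, none of which is shown here.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Fork {
    // Placeholder variants; only the ordering relative to Boole matters here.
    PreBoole,
    Boole,
}

// Who is expected to contribute partial signatures for a selection proof.
enum CollectionScope {
    // Pre-Boole: only the signing validator's own operators contribute.
    OwnOperators,
    // Boole onwards: every operator in the committee, with the expected count
    // taken from VotingAssignments::selection_proof_count_for_committee.
    Committee { expected_signatures: usize },
}

fn selection_proof_scope(active_fork: Fork, expected_committee_signatures: usize) -> CollectionScope {
    if active_fork >= Fork::Boole {
        // Committee-based collection, batched per (slot, committee) so that
        // attestation and sync selection proofs share one P2P message.
        CollectionScope::Committee {
            expected_signatures: expected_committee_signatures,
        }
    } else {
        // Legacy single-validator collection.
        CollectionScope::OwnOperators
    }
}

fn main() {
    // E.g. a committee with 3 attesters plus one sync validator on 2 subnets
    // would expect 3 + 2 = 5 distinct selection proofs per slot.
    match selection_proof_scope(Fork::Boole, 5) {
        CollectionScope::Committee { expected_signatures } => assert_eq!(expected_signatures, 5),
        CollectionScope::OwnOperators => unreachable!(),
    }
}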

Critical Issue ⚠️

produce_sync_selection_proof uses wrong PartialSignatureKind in Boole fork (lib.rs:1665)

self.collect_signature(
    PartialSignatureKind::AggregatorCommitteePartialSig,  // ← This
    Role::AggregatorCommittee,
    collection_mode,
    ...
)

Problem: In the Boole fork path, produce_sync_selection_proof uses PartialSignatureKind::AggregatorCommitteePartialSig, but this is for attestation aggregator selection proofs, not sync committee contribution proofs.

Expected behavior: Sync committee selection proofs should use a different PartialSignatureKind to distinguish them from attestation selection proofs, even though they batch together (same base_hash).

Looking at the pre-Boole code path, it uses PartialSignatureKind::ContributionProofs, which is correct for sync committee.

Question: Should the Boole fork path use a new enum variant like PartialSignatureKind::SyncAggregatorCommitteePartialSig or similar? Or is reusing AggregatorCommitteePartialSig intentional?

If they should be distinguished, you need to:

  1. Add a new variant to PartialSignatureKind for sync committee aggregator selection proofs in committee mode
  2. Update the Boole fork path to use this new variant
  3. Ensure downstream code (signature_collector, message handlers) handles both variants

If reusing is intentional, please add a comment explaining why sync and attestation selection proofs share the same PartialSignatureKind despite being different signature types.



Minor Suggestions

1. Code documentation clarity (lib.rs:1533)

The comment in produce_selection_proof could be clearer:

// Build a set of validator indices in this committee
// This handles divergent operator views, since we only count validators we have
// shares

Suggest:

// Build a set of validator indices in this committee.
// This handles divergent operator views, since we only count validators we have
// shares for in the database.

2. Consistent terminology (lib.rs:1550, 1666)

Both functions have similar comments about "how many selection proofs to collect", which is great for consistency. Consider using identical wording:

// Calculate how many selection proofs to collect using the selection proof counting
// method.

This is already good, no change needed.

3. Comment accuracy (lib.rs:1651)

The comment says:

// Compute deterministic base_hash for batching (SAME as attestation selection
// proofs). This ensures all selection proofs (attestation + sync) batch together
// into one P2P message per committee.

This is excellent documentation! It clearly explains the critical invariant that both use the same base_hash. Consider adding a reference to the attestation version:

// Compute deterministic base_hash for batching (SAME as attestation selection
// proofs at lib.rs:1557). This ensures all selection proofs (attestation + sync) 
// batch together into one P2P message per committee.

But this is a very minor suggestion - the current comment is already clear.


Remaining Items from Previous Review

You marked the following as "ignore", which is fine:

  • Point 3: Block proposal flow doesn't need VotingAssignments (reasonable)
  • Point 4: Timing assumption - acceptable given async runtime behavior
  • Point 6: decrypt_key_share error handling - not critical
  • Point 7: TOCTOU analysis - already correct as-is
  • Points 8-11: Style suggestions and test coverage

These are all acceptable trade-offs.


Testing Recommendations

The fork-gated logic should be tested to ensure:

  1. ✅ Pre-Boole fork uses single-validator collection (PartialSignatureKind::SelectionProof and PartialSignatureKind::ContributionProofs)
  2. ✅ Boole fork uses committee-based collection with correct counting
  3. ✅ Both paths produce valid signatures
  4. ⚠️ Correct PartialSignatureKind is used in Boole fork (pending resolution of the issue above)

Consider adding integration tests that cover the fork transition boundary.


Final Assessment

Overall: The refactoring is well-executed. The phased metadata approach solves the race condition elegantly, and the fork-gated selection proof implementation follows good patterns.

Critical: Verify the PartialSignatureKind usage in produce_sync_selection_proof Boole fork path (line 1665).

Minor: Fix the remaining "assignmentsrmation" typo at line 764.

Great work on this complex refactoring! 🎉


@shane-moore (Author)

@claude,

  1. good, done
  2. already resolved, maybe you looked at wrong commit or something

produce_sync_selection_proof uses wrong PartialSignatureKind in Boole fork

  • this is correct actually, should use PartialSignatureKind::AggregatorCommitteePartialSig for both types of selection proof

i'm ignoring all the minor suggestions


claude-code-actions-sigp bot commented Jan 20, 2026

Claude finished @shane-moore's task in 41s


Follow-up Review Verification

Todo List:

  • Verify point 2 (typo fix) in latest code
  • Acknowledge feedback and provide final assessment

Summary

Thank you for the clarifications! I've verified the current state:

✅ Point 1 (Error handling)

Confirmed fixed in commit dd1c357.

✅ Point 2 (Typo "assignmentsrmation")

You're correct - the typo has been completely fixed. I searched the entire file and found no remaining instances of "assignmentsrmation". The struct documentation at line 761-762 now correctly reads:

#[derive(Debug, Clone)]
pub struct VotingAssignments {

with proper field documentation below. My apologies for the confusion - I must have been looking at an older version.

📝 PartialSignatureKind Clarification

Thank you for clarifying that PartialSignatureKind::AggregatorCommitteePartialSig is intentionally used for both attestation and sync committee selection proofs in the Boole fork. This makes sense from a message batching perspective - using the same kind allows them to batch together into a single P2P message per committee (as documented in your code comments at lib.rs:1651-1653).

The key insight I was missing: the PartialSignatureKind is about message structure and batching, not about distinguishing the cryptographic signature type. The actual signature type is already distinguished by the Role parameter and the signing root being different.


Final Assessment

All critical issues have been resolved:

  • ✅ Error handling for watch channels (point 1)
  • ✅ Typo fixes (point 2)
  • ✅ Correct PartialSignatureKind usage (my misunderstanding)

The metadata service refactoring is in excellent shape. The phased approach elegantly solves the race condition, and the fork-gated selection proof implementation correctly uses committee-based collection with proper batching semantics.

No remaining issues. Great work on this complex refactoring! 🎉


@shane-moore force-pushed the metadata-service-refactor branch from dd1c357 to 73e9a82 on January 23, 2026 at 20:56
@shane-moore force-pushed the metadata-service-refactor branch from 8aba0c4 to 146af66 on January 24, 2026 at 01:20