Conversation

Contributor

@bkioshn bkioshn commented Dec 22, 2025

Description

Quick context:
Doc-sync messages weren't propagating between nodes. It turned out to be three issues caused by the new on-doc commit:

  1. Subscription conflicts - IPFS was auto-subscribing to topics that modules needed to control
  2. Wrong CID format in messages - was sending dag-pb CIDs when protocol expects dag-cbor
  3. Blocking operations - file operations were blocking the PubSub handler

What got fixed:
• Made the PubSub handler spawn async tasks so it doesn't block
• Fixed message format - now computes both CID types (dag-pb for storage, dag-cbor for protocol)
• Removed the auto-subscription logic that was causing conflicts
• Fixed the test detection (was using grep -q which was flaky, switched to wc -l)

Testing:
Run just test-pubsub-propagation from the p2p-testing dir. Should see all 5 nodes receive messages.

Related Issue(s)

These issues arose when trying to run the test created in
#704
so I'm pointing this PR at that issue

Please confirm the following checks

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream module

  WHAT THIS FIXES:
  - CID format: Use dag-cbor codec (0x51) instead of dag-pb (0x70) for protocol compliance
  - Message encoding: Publish CBOR-encoded payload::New with CID lists instead of raw document bytes
  - Subscription conflict: Remove auto-subscription that prevented doc-sync module from subscribing
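  The codec distinction above can be sketched in plain Rust bytes. This is a minimal std-only sketch of the CIDv1 layout, not the hermes_ipfs/multihash API used in the actual change, and the digest is a placeholder rather than a real hash:

  ```rust
  // Minimal sketch (std only) of the CIDv1 byte layout:
  // <version varint><codec varint><multihash>, where the multihash is
  // <hash-code varint><digest length><digest bytes>. Every constant
  // below fits in a single byte, so no varint encoding is needed here.
  const DAG_PB: u8 = 0x70; // codec used when storing the file in IPFS
  const DAG_CBOR: u8 = 0x51; // codec the doc-sync protocol expects
  const SHA2_256: u8 = 0x12; // multihash code for sha2-256

  /// Assemble CIDv1 bytes for a codec over a 32-byte digest.
  fn cid_v1(codec: u8, digest: &[u8; 32]) -> Vec<u8> {
      let mut cid = vec![0x01, codec, SHA2_256, 0x20];
      cid.extend_from_slice(digest);
      cid
  }

  fn main() {
      // Placeholder digest for illustration, NOT a real sha2-256 hash.
      let digest = [0u8; 32];
      let storage_cid = cid_v1(DAG_PB, &digest);
      let protocol_cid = cid_v1(DAG_CBOR, &digest);
      // Same content hash, different codec byte => different CIDs.
      assert_eq!(storage_cid[1], 0x70);
      assert_eq!(protocol_cid[1], 0x51);
      assert_ne!(storage_cid, protocol_cid);
  }
  ```

  The codec byte is the only difference between the two CIDs over identical content, which is why sending the storage (dag-pb) CID broke protocol-side lookups.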

  CHANGES:
  - hermes/bin/src/runtime_extensions/hermes/doc_sync/host.rs:
    * Import minicbor crate directly (not via cardano_chain_follower)
    * Compute both dag-pb CID (storage) and dag-cbor CID (protocol) in add_file()
    * Construct proper payload::New structure in publish()
    * Encode to CBOR before publishing to PubSub

  - hermes/bin/src/ipfs/mod.rs:
    * Remove auto-subscription to documents.new during bootstrap
    * Prevents subscription kind conflicts with doc-sync module

  - hermes/bin/src/ipfs/task.rs:
    * Add extensive documentation of blocking operations issue
    * Document root cause and recommended async solution

  REMAINING ISSUE:
  P2P propagation still fails due to blocking operations in async context.
  The doc_sync_topic_message_handler (added in 944360f) calls file_pin()
  and file_get() which use blocking_send()/blocking_recv(). This causes
  deadlock when called from async PubSub handler.

  NEXT STEPS:
  Convert file_pin() and file_get() to async versions. See detailed
  implementation guide in task.rs comments (lines 354-374).
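  The conversion described above amounts to moving the wait off the event loop: in the real code that means tokio::spawn plus send().await on tokio channels. A std-only analogue using threads and sync channels (all names here are hypothetical stand-ins, not the actual Hermes types) shows the shape:

  ```rust
  use std::sync::mpsc;
  use std::thread;

  // Hypothetical stand-in for the IPFS command enum.
  enum IpfsCommand {
      FileGet { cid: Vec<u8>, reply: mpsc::Sender<Vec<u8>> },
  }

  /// The handler hands each fetch to a spawned worker and returns at
  /// once, so the event loop is free to process the next message.
  fn handle_message(
      cmd_tx: mpsc::Sender<IpfsCommand>,
      cid: Vec<u8>,
  ) -> thread::JoinHandle<Vec<u8>> {
      thread::spawn(move || {
          let (reply_tx, reply_rx) = mpsc::channel();
          cmd_tx
              .send(IpfsCommand::FileGet { cid, reply: reply_tx })
              .expect("ipfs task gone");
          // Waiting here is fine: we're on a worker, not the event loop.
          reply_rx.recv().expect("no reply")
      })
  }

  fn main() {
      // Stub "IPFS task" that echoes the requested CID back as content.
      let (cmd_tx, cmd_rx) = mpsc::channel::<IpfsCommand>();
      thread::spawn(move || {
          for cmd in cmd_rx {
              let IpfsCommand::FileGet { cid, reply } = cmd;
              reply.send(cid).ok();
          }
      });

      // handle_message returns immediately; the fetch completes off-loop.
      let handle = handle_message(cmd_tx, vec![0x01, 0x51]);
      assert_eq!(handle.join().unwrap(), vec![0x01, 0x51]);
  }
  ```

  The deadlock in the unconverted code comes from doing that recv inline on the loop itself: the loop can't service the IPFS task's reply while it is blocked waiting for it.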

  TEST:
  cd p2p-testing && just test-pubsub-propagation
  Currently FAILS with "No Propagation" due to the blocking operations issue.
  Will PASS once async conversion is complete.

  Related: #691 (introduced blocking operations bug)
…e operations

  The doc_sync_topic_message_handler was blocking the PubSub event loop by using
  blocking_send/blocking_recv for IPFS file operations, preventing new messages
  from being processed while fetching content.

  Changes:
  - Add file_get_async() and file_pin_async() methods in ipfs/mod.rs that use
    send().await instead of blocking_send()
  - Update doc_sync_topic_message_handler() in ipfs/task.rs to spawn async task
    with tokio::spawn() for file operations
  - Add RECEIVED PubSub message with CID logging for test detection
  - Fix test-pubsub-propagation in p2p-testing/justfile to properly detect
    message reception using wc -l instead of grep -q

  Result:
  - PubSub handler no longer blocks - can process new messages while previous
    content is still being fetched
  - test-pubsub-propagation now passes with 100% success rate (5/5 nodes)
  - Resolves doc-sync P2P message propagation issue
@bkioshn bkioshn added the squad: hermetics Hermes Backend, System Development & Integration Team label Dec 22, 2025
@bkioshn bkioshn added this to Catalyst Dec 22, 2025
OnNewDocEvent::new(channel_name, &content),
crate::event::TargetApp::List(app_names.clone()),
crate::event::TargetModule::All,
// IMPORTANT: The message contains dag-cbor CIDs (CIDv1, codec 0x51), but IPFS storage
Collaborator:

Correct

Collaborator:

We should ONLY ever use 0x51 and no other CID codec. I see there is logic for both; keep only one.

fn publish(
ctx: &mut HermesRuntimeContext,
doc: DocData,
cid: &str,
Contributor Author:

I think this should be DocData so we don't need to construct the data here 🤔

Collaborator:

We should never be using cid as a string, and always use its binary equivalent. Otherwise we are constantly encoding and decoding it everywhere. It should ONLY ever be a printable string if we intend to print it.
If we are using a CID, it should be a CID type that holds it.
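The suggestion can be sketched as a newtype that owns the raw bytes and only produces a string through Display. This is a hypothetical sketch, not the hermes_ipfs::Cid API, and hex stands in for the real multibase encoding:

```rust
use std::fmt;

/// Hypothetical CID newtype: raw bytes internally, stringified only
/// at the point of display.
#[derive(Clone, PartialEq, Eq)]
struct Cid(Vec<u8>);

impl Cid {
    fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}

impl fmt::Display for Cid {
    // Hex here stands in for the real multibase encoding.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for byte in &self.0 {
            write!(f, "{byte:02x}")?;
        }
        Ok(())
    }
}

fn main() {
    let cid = Cid(vec![0x01, 0x51, 0x12]);
    // Code paths pass bytes around with no parse/format round-trips.
    assert_eq!(cid.as_bytes(), &[0x01, 0x51, 0x12]);
    // A string exists only when we actually print.
    assert_eq!(cid.to_string(), "015112");
}
```

Passing this type through publish() would avoid the repeated parse/to_string round-trips the comment objects to.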

Contributor:

@bkioshn, I think we only need the cid in this function.
DocData is already added to IPFS at this point here: https://github.com/input-output-hk/hermes/blob/fix/doc-sync-p2p-propagation/hermes/bin/src/runtime_extensions/hermes/doc_sync/host.rs#L185

@bkioshn bkioshn added the comment me Request for comments label Dec 22, 2025
@bkioshn bkioshn moved this from New to 🏗 In progress in Catalyst Dec 22, 2025
}

/// Handler function for topic message streams (Doc Sync).
/// Handler for Doc Sync PubSub messages on "*.new" topics.
Contributor Author:

Will add a TODO here, since this function should be able to handle multiple topics (as mentioned in // TODO: match the topic against a static list.)

// Messages contain CID lists, not document content - receivers fetch from IPFS.

// Step 1: Parse CID string
let cid_parsed = match cid.parse::<hermes_ipfs::Cid>() {
Collaborator:

No, because it should never be a string; it should always be its bytes, and we should only convert bytes to string for display, never "to use".


let multihash = multihash::Multihash::<64>::wrap(SHA2_256_CODE, &hash_digest)?;
let cbor_cid = hermes_ipfs::Cid::new_v1(CBOR_CODEC, multihash);
let cbor_cid_str = cbor_cid.to_string();
Collaborator:

Don't return a string, return its bytes.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to 🔖 Ready in Catalyst Dec 22, 2025
@no30bit (Contributor) left a comment:

LGTM! Great catches
