-
Notifications
You must be signed in to change notification settings - Fork 3
fix(hermes): doc sync p2p propagation #731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WHAT THIS FIXES:
- CID format: Use dag-cbor codec (0x51) instead of dag-pb (0x70) for protocol compliance
- Message encoding: Publish CBOR-encoded payload::New with CID lists instead of raw document bytes
- Subscription conflict: Remove auto-subscription that prevented doc-sync module from subscribing
CHANGES:
- hermes/bin/src/runtime_extensions/hermes/doc_sync/host.rs:
* Import minicbor crate directly (not via cardano_chain_follower)
* Compute both dag-pb CID (storage) and dag-cbor CID (protocol) in add_file()
* Construct proper payload::New structure in publish()
* Encode to CBOR before publishing to PubSub
- hermes/bin/src/ipfs/mod.rs:
* Remove auto-subscription to documents.new during bootstrap
* Prevents subscription kind conflicts with doc-sync module
- hermes/bin/src/ipfs/task.rs:
* Add extensive documentation of blocking operations issue
* Document root cause and recommended async solution
REMAINING ISSUE:
P2P propagation still fails due to blocking operations in async context.
The doc_sync_topic_message_handler (added in 944360f) calls file_pin()
and file_get() which use blocking_send()/blocking_recv(). This causes
deadlock when called from async PubSub handler.
NEXT STEPS:
Convert file_pin() and file_get() to async versions. See detailed
implementation guide in task.rs comments (lines 354-374).
TEST:
cd p2p-testing && just test-pubsub-propagation
Currently FAILS with No Propagation due to blocking operations issue.
Will PASS once async conversion is complete.
Related: #691 (introduced blocking operations bug)
…e operations
The doc_sync_topic_message_handler was blocking the PubSub event loop by using
blocking_send/blocking_recv for IPFS file operations, preventing new messages
from being processed while fetching content.
Changes:
- Add file_get_async() and file_pin_async() methods in ipfs/mod.rs that use
send().await instead of blocking_send()
- Update doc_sync_topic_message_handler() in ipfs/task.rs to spawn async task
with tokio::spawn() for file operations
- Add RECEIVED PubSub message with CID logging for test detection
- Fix test-pubsub-propagation in p2p-testing/justfile to properly detect
message reception using wc -l instead of grep -q
Result:
- PubSub handler no longer blocks - can process new messages while previous
content is still being fetched
- test-pubsub-propagation now passes with 100% success rate (5/5 nodes)
- Resolves doc-sync P2P message propagation issue
…e operations
The doc_sync_topic_message_handler was blocking the PubSub event loop by using
blocking_send/blocking_recv for IPFS file operations, preventing new messages
from being processed while fetching content.
Changes:
- Add file_get_async() and file_pin_async() methods in ipfs/mod.rs that use
send().await instead of blocking_send()
- Update doc_sync_topic_message_handler() in ipfs/task.rs to spawn async task
with tokio::spawn() for file operations
- Add RECEIVED PubSub message with CID logging for test detection
- Fix test-pubsub-propagation in p2p-testing/justfile to properly detect
message reception using wc -l instead of grep -q
Result:
- PubSub handler no longer blocks - can process new messages while previous
content is still being fetched
- test-pubsub-propagation now passes with 100% success rate (5/5 nodes)
- Resolves doc-sync P2P message propagation issue
…e operations
The doc_sync_topic_message_handler was blocking the PubSub event loop by using
blocking_send/blocking_recv for IPFS file operations, preventing new messages
from being processed while fetching content.
Changes:
- Add file_get_async() and file_pin_async() methods in ipfs/mod.rs that use
send().await instead of blocking_send()
- Update doc_sync_topic_message_handler() in ipfs/task.rs to spawn async task
with tokio::spawn() for file operations
- Add RECEIVED PubSub message with CID logging for test detection
- Fix test-pubsub-propagation in p2p-testing/justfile to properly detect
message reception using wc -l instead of grep -q
Result:
- PubSub handler no longer blocks - can process new messages while previous
content is still being fetched
- test-pubsub-propagation now passes with 100% success rate (5/5 nodes)
- Resolves doc-sync P2P message propagation issue
…e operations
The doc_sync_topic_message_handler was blocking the PubSub event loop by using
blocking_send/blocking_recv for IPFS file operations, preventing new messages
from being processed while fetching content.
Changes:
- Add file_get_async() and file_pin_async() methods in ipfs/mod.rs that use
send().await instead of blocking_send()
- Update doc_sync_topic_message_handler() in ipfs/task.rs to spawn async task
with tokio::spawn() for file operations
- Add RECEIVED PubSub message with CID logging for test detection
- Fix test-pubsub-propagation in p2p-testing/justfile to properly detect
message reception using wc -l instead of grep -q
Result:
- PubSub handler no longer blocks - can process new messages while previous
content is still being fetched
- test-pubsub-propagation now passes with 100% success rate (5/5 nodes)
- Resolves doc-sync P2P message propagation issue
…e operations
The doc_sync_topic_message_handler was blocking the PubSub event loop by using
blocking_send/blocking_recv for IPFS file operations, preventing new messages
from being processed while fetching content.
Changes:
- Add file_get_async() and file_pin_async() methods in ipfs/mod.rs that use
send().await instead of blocking_send()
- Update doc_sync_topic_message_handler() in ipfs/task.rs to spawn async task
with tokio::spawn() for file operations
- Add RECEIVED PubSub message with CID logging for test detection
- Fix test-pubsub-propagation in p2p-testing/justfile to properly detect
message reception using wc -l instead of grep -q
Result:
- PubSub handler no longer blocks - can process new messages while previous
content is still being fetched
- test-pubsub-propagation now passes with 100% success rate (5/5 nodes)
- Resolves doc-sync P2P message propagation issue
| OnNewDocEvent::new(channel_name, &content), | ||
| crate::event::TargetApp::List(app_names.clone()), | ||
| crate::event::TargetModule::All, | ||
| // IMPORTANT: The message contains dag-cbor CIDs (CIDv1, codec 0x51), but IPFS storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using cbor 0x51, not dag-cbor
https://github.com/multiformats/multicodec/blob/master/table.csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should ONLY every use 0x51 and no other CID. I see there is logic for both, only have 1.
| fn publish( | ||
| ctx: &mut HermesRuntimeContext, | ||
| doc: DocData, | ||
| cid: &str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be DocData so we don't need to construct the data here 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should never be using cid as a string, and always use its binary equivalent. Otherwise we are constantly encoding and decoding it everywhere. It should ONLY ever be a printable string if we intend to print it.
If we are using a CID, it should be a CID type that holds it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkioshn, I think we only need the cid in this function.
DocData is already added to IPFS at this point here: https://github.com/input-output-hk/hermes/blob/fix/doc-sync-p2p-propagation/hermes/bin/src/runtime_extensions/hermes/doc_sync/host.rs#L185
hermes/bin/src/ipfs/task.rs
Outdated
| } | ||
|
|
||
| /// Handler function for topic message streams (Doc Sync). | ||
| /// Handler for Doc Sync PubSub messages on "*.new" topics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add TODO here, since this function should be able to handle multiple topics (as mention // TODO: match the topic against a static list.)
| fn publish( | ||
| ctx: &mut HermesRuntimeContext, | ||
| doc: DocData, | ||
| cid: &str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should never be using cid as a string, and always use its binary equivalent. Otherwise we are constantly encoding and decoding it everywhere. It should ONLY ever be a printable string if we intend to print it.
If we are using a CID, it should be a CID type that holds it.
| // Messages contain CID lists, not document content - receivers fetch from IPFS. | ||
|
|
||
| // Step 1: Parse CID string | ||
| let cid_parsed = match cid.parse::<hermes_ipfs::Cid>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because it should never be a string, it should always be its bytes and we should only convert bytes to strong for display and never "to use".
|
|
||
| let multihash = multihash::Multihash::<64>::wrap(SHA2_256_CODE, &hash_digest)?; | ||
| let cbor_cid = hermes_ipfs::Cid::new_v1(CBOR_CODEC, multihash); | ||
| let cbor_cid_str = cbor_cid.to_string(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't return a string, return its bytes.
no30bit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great catches
Signed-off-by: bkioshn <[email protected]>
Signed-off-by: bkioshn <[email protected]>
Signed-off-by: bkioshn <[email protected]>
Signed-off-by: bkioshn <[email protected]>
Signed-off-by: bkioshn <[email protected]>
Signed-off-by: bkioshn <[email protected]>
Signed-off-by: bkioshn <[email protected]>
Signed-off-by: bkioshn <[email protected]>
Description
Quick context:
Doc-sync messages weren't propagating between nodes. Turned out to be three issues caused by the new on doc commit:
What got fixed:
• Made the PubSub handler spawn async tasks so it doesn't block
• Fixed message format - now computes both CID types (dag-pb for storage, dag-cbor for protocol)
• Removed the auto-subscription logic that was causing conflicts
• Fixed the test detection (was using
grep -qwhich was flaky, switched towc -l)Testing:
Run
just test-pubsub-propagationfrom thep2p-testingdir. Should see all 5 nodes receive messages.Related Issue(s)
The issues arise when trying to run test created from
#704
so I'm pointing this PR to this issue
Please confirm the following checks