fix(rust-sdk): replace unwrap() with poison-resilient lock handling #567

Closed
imran-siddique wants to merge 7 commits into microsoft:main from imran-siddique:fix/rust-sdk-unwrap-safety
Conversation

@imran-siddique
Member

Summary

Addresses CI review findings from PR #563 -- eliminates all non-test unwrap() calls in the Rust SDK to prevent cascading panics from poisoned locks.

Fix

  • audit.rs: 4 Mutex lock unwrap -> unwrap_or_else
  • policy.rs: 4 RwLock/Mutex unwrap -> unwrap_or_else
  • trust.rs: 7 RwLock unwrap -> unwrap_or_else
  • identity.rs: 3 try_into unwrap -> let-else returning false

All 31 tests pass. No behavioral changes -- only panic safety.
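
The lock-handling change can be illustrated with a minimal sketch. The `Counter` type and `poison` helper below are hypothetical stand-ins, not code from the SDK; only the `unwrap_or_else(|e| e.into_inner())` call mirrors what the PR describes.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for shared state such as the audit log.
pub struct Counter {
    inner: Mutex<u64>,
}

impl Counter {
    pub fn new() -> Self {
        Counter { inner: Mutex::new(0) }
    }

    /// Increments the counter even if the mutex has been poisoned.
    pub fn increment(&self) -> u64 {
        // lock() returns Err(PoisonError) when another thread panicked while
        // holding the guard. into_inner() hands back the guard anyway, so
        // this call does not panic in turn.
        let mut guard = self.inner.lock().unwrap_or_else(|e| e.into_inner());
        *guard += 1;
        *guard
    }

    pub fn is_poisoned(&self) -> bool {
        self.inner.is_poisoned()
    }
}

/// Poisons the mutex by panicking in a thread that holds the guard.
pub fn poison(counter: &Arc<Counter>) {
    let c = Arc::clone(counter);
    let _ = thread::spawn(move || {
        let _guard = c.inner.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
}
```

Recovering the guard this way keeps other threads running, but the protected value may have been left half-updated by the thread that panicked, so the pattern fits best where each critical section leaves the data valid at every step.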

Depends on #563

imran-siddique and others added 7 commits March 27, 2026 15:17
- mcp-proxy: shebang must be line 1 (TS18026)
- copilot, mcp-server: typescript ^6.0.2 → ^5.7.0 (eslint <6.0.0)
- NuGet: replace ESRP Sign+Release with NuGetCommand@2 push
  via NuGet.org service connection

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…crosoft#557, microsoft#558)

Upstream bug fixes from AzureClaw vendored agentmesh-sdk:

TypeScript SDK (identity.ts):
- Add stripKeyPrefix() — strips ed25519:/x25519: prefixes before
  base64 decoding (fixes key decode failures with typed key formats)
- Add safeBase64Decode() — convenience wrapper for safe prefix-aware
  decoding
- Update fromJSON() to use safeBase64Decode() so serialized keys
  with type prefixes round-trip correctly
- Export both functions from index.ts

Proto (registration.proto):
- Add GovernanceService with EvaluatePolicy, RecordAudit, GetTrustScore
  RPCs for language-agnostic governance integration
- Add PolicyRequest, PolicyDecision, AuditEntry, AuditAck, TrustQuery,
  TrustScoreResult message types

Tests:
- 12 new tests in key-prefix.test.ts covering prefix stripping,
  safe decode, and fromJSON round-trip with prefixed keys

Closes microsoft#557, Closes microsoft#558

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ration

Implements Phase 1 (P0) and Phase 2 (P1) of issue microsoft#556:

Phase 1 — Core governance (P0):
- policy.rs: YAML-based policy evaluation with 4-way decisions
  (allow/deny/requires-approval/rate-limit), wildcard patterns,
  condition matching, and scoped capability rules
- trust.rs: Trust scoring (0-1000, 5 tiers) with configurable
  reward/penalty, threshold checks, and optional JSON persistence
- audit.rs: SHA-256 hash-chain audit logging with tamper detection,
  integrity verification, and filtered queries
- types.rs: Shared types (PolicyDecision, TrustTier, TrustScore,
  AuditEntry, GovernanceResult)
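
As a rough illustration of the trust scoring described for trust.rs (a 0-1000 range split into 5 tiers, with rewards and penalties), here is a minimal sketch. Only "Untrusted" and "Trusted" appear as tier names on this page; the other tier names, the tier boundaries, and the function names are assumptions made for the example, not the SDK's actual values.

```rust
/// Five tiers. Names other than Untrusted and Trusted are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustTier {
    Untrusted,
    Low,
    Medium,
    Trusted,
    Verified,
}

pub const MAX_SCORE: u32 = 1000;

/// Maps a score to a tier using evenly spaced, assumed boundaries.
pub fn tier_for(score: u32) -> TrustTier {
    match score.min(MAX_SCORE) {
        0..=199 => TrustTier::Untrusted,
        200..=399 => TrustTier::Low,
        400..=599 => TrustTier::Medium,
        600..=799 => TrustTier::Trusted,
        _ => TrustTier::Verified,
    }
}

/// Applies a reward (positive) or penalty (negative), clamped to 0..=1000.
pub fn apply_delta(score: u32, delta: i32) -> u32 {
    let next = i64::from(score) + i64::from(delta);
    next.clamp(0, i64::from(MAX_SCORE)) as u32
}
```

Clamping in one place means callers cannot push a score outside the documented range, whatever reward or penalty they configure.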

Phase 2 — Identity (P1):
- identity.rs: Ed25519 agent identity via ed25519-dalek with DID
  generation, sign/verify, and JSON serialization

Unified client (lib.rs):
- AgentMeshClient combining all modules with execute_with_governance
  pipeline (policy → audit → trust update)
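
The ordering of that pipeline (policy, then audit, then trust update) might look roughly like the sketch below. Everything here is hypothetical: the `Client` type, its fields, the two-way `Decision` (the SDK describes four outcomes), and the reward and penalty values are simplifications for illustration, not the real `AgentMeshClient`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Hypothetical, single-threaded stand-in for a unified governance client.
pub struct Client {
    denied_actions: Vec<String>,
    audit_log: Vec<String>,
    trust_score: u32,
}

impl Client {
    pub fn new(denied_actions: &[&str]) -> Self {
        Client {
            denied_actions: denied_actions.iter().map(|a| a.to_string()).collect(),
            audit_log: Vec::new(),
            trust_score: 500, // assumed starting score
        }
    }

    pub fn execute_with_governance(&mut self, agent: &str, action: &str) -> Decision {
        // Step 1: policy evaluation (exact-match deny list only).
        let denied = self.denied_actions.iter().any(|d| d.as_str() == action);
        let decision = if denied { Decision::Deny } else { Decision::Allow };

        // Step 2: record the outcome before anything else changes.
        self.audit_log.push(format!("{} {} {:?}", agent, action, decision));

        // Step 3: trust update with assumed reward (+10) and penalty (-50).
        self.trust_score = match decision {
            Decision::Allow => (self.trust_score + 10).min(1000),
            Decision::Deny => self.trust_score.saturating_sub(50),
        };
        decision
    }

    pub fn trust_score(&self) -> u32 {
        self.trust_score
    }

    pub fn audit_len(&self) -> usize {
        self.audit_log.len()
    }
}
```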

Includes 30 unit tests + 1 doc-test, all passing.
Follows Go SDK patterns and aligns with AzureClaw governance.rs.

Closes microsoft#556

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace all non-test unwrap() calls on Mutex/RwLock with
unwrap_or_else(|e| e.into_inner()) to prevent cascading panics
if a thread panics while holding a lock.

Replace try_into().unwrap() in identity.rs verify() methods with
let-else patterns that return false on conversion failure instead
of panicking.
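
A minimal sketch of the let-else pattern described above, assuming a 64-byte Ed25519 signature. The `verify_with` function and its `check` closure are hypothetical; the real verify() methods in identity.rs are not shown on this page and would call ed25519-dalek instead of a closure.

```rust
use std::convert::TryFrom;

/// Ed25519 signatures are 64 bytes long.
pub const SIGNATURE_LEN: usize = 64;

/// Returns false instead of panicking when the slice has the wrong length.
/// `check` stands in for the actual cryptographic verification.
pub fn verify_with<F>(signature: &[u8], check: F) -> bool
where
    F: FnOnce(&[u8; SIGNATURE_LEN]) -> bool,
{
    // Before: signature.try_into().unwrap() panicked on a bad length.
    // After: let-else turns the failed conversion into an early `false`.
    let Ok(sig) = <[u8; SIGNATURE_LEN]>::try_from(signature) else {
        return false;
    };
    check(&sig)
}
```

Treating a malformed signature as a failed verification keeps untrusted input from taking down the caller.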

Changes across 4 files (22 lines):
- audit.rs: 4 Mutex::lock().unwrap() → unwrap_or_else
- policy.rs: 3 RwLock + 1 Mutex unwrap() → unwrap_or_else
- trust.rs: 7 RwLock unwrap() → unwrap_or_else
- identity.rs: 3 try_into().unwrap() → let-else pattern

All 31 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the documentation, tests, agent-mesh, and size/XL labels Mar 29, 2026
@github-actions

🤖 AI Agent: security-scanner — Security Review of PR: `fix(rust-sdk): replace unwrap() with poison-resilient lock handling`

This PR addresses a critical issue by replacing panic-prone unwrap() calls in the Rust SDK with unwrap_or_else, so that a poisoned lock no longer causes cascading panics. It also introduces new governance-related gRPC messages and services in the registration.proto file.


Security Findings

1. Potential Race Condition in Governance Policy Evaluation

Severity: 🔴 CRITICAL
Issue: The newly introduced EvaluatePolicy RPC in the GovernanceService does not specify how concurrent policy evaluations are handled. If multiple agents attempt to evaluate policies simultaneously, there is a risk of a Time-of-Check-to-Time-of-Use (TOCTOU) race condition. This could allow an attacker to exploit the time gap between policy evaluation and enforcement to bypass governance rules.
Attack Vector: An attacker could craft a sequence of rapid requests to exploit the time gap between policy evaluation and enforcement, potentially bypassing restrictions or gaining unauthorized access to resources.
Recommendation: Implement a mechanism to ensure atomicity between policy evaluation and enforcement. For example:

  • Use a transactional approach to lock resources during policy evaluation and enforcement.
  • Include a nonce or timestamp in the PolicyRequest and validate it during enforcement to ensure the request has not been tampered with or delayed.
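
One way to picture the "atomic evaluation and enforcement" recommendation, at least within a single process, is to make the check and the state change happen under one lock acquisition. The `Quota` type below is a hypothetical example, not part of the SDK or the proto, and it does not address gaps between separate RPC calls.

```rust
use std::sync::Mutex;

/// Hypothetical rate-limit budget shared between callers.
pub struct Quota {
    remaining: Mutex<u32>,
}

impl Quota {
    pub fn new(limit: u32) -> Self {
        Quota { remaining: Mutex::new(limit) }
    }

    /// Checks the budget and consumes one unit in the same critical section,
    /// so no other caller can act between the decision and its enforcement.
    pub fn try_consume(&self) -> bool {
        let mut remaining = self.remaining.lock().unwrap_or_else(|e| e.into_inner());
        if *remaining == 0 {
            return false;
        }
        *remaining -= 1;
        true
    }
}
```

Had the check and the decrement been two separate calls that each took the lock, two callers could both observe one remaining unit and both proceed, which is the time-of-check-to-time-of-use gap the finding describes.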

2. Trust Chain Weakness in VerifyPeerTrust RPC

Severity: 🟠 HIGH
Issue: The VerifyPeerTrust RPC in the AgentMeshIdentityService does not specify the mechanism for validating the trustworthiness of the peer agent's identity (e.g., verifying the agent's DID or SPIFFE/SVID). If the trust validation process is not robust, it could lead to trust chain weaknesses.
Attack Vector: An attacker could impersonate a trusted agent by exploiting weak or absent validation mechanisms, potentially gaining unauthorized access to sensitive resources or bypassing governance policies.
Recommendation: Ensure that the VerifyPeerTrust RPC includes robust mechanisms for validating the peer agent's identity. For example:

  • Implement SPIFFE/SVID validation for agent identities.
  • Use certificate pinning or other cryptographic techniques to ensure the authenticity of the peer agent.

3. Credential Exposure in Audit Logging

Severity: 🟠 HIGH
Issue: The AuditEntry message in the GovernanceService includes a metadata field that allows arbitrary key-value pairs. If sensitive information (e.g., credentials, tokens, or private keys) is inadvertently logged in this field, it could lead to credential exposure.
Attack Vector: If an attacker gains access to the audit logs, they could extract sensitive information from the metadata field and use it to compromise the system.
Recommendation: Implement strict validation and sanitization of the metadata field to prevent sensitive information from being logged. Additionally:

  • Use encryption for sensitive fields in the audit log.
  • Clearly document the expected content of the metadata field to avoid accidental inclusion of sensitive data.
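
A minimal sketch of the metadata sanitization idea, assuming the metadata is a string-to-string map. The deny-list of key fragments and the `redact_metadata` name are assumptions for illustration; a real deployment would need its own list and likely value-based checks as well.

```rust
use std::collections::HashMap;

/// Hypothetical deny-list of key fragments that suggest a secret.
const SENSITIVE_FRAGMENTS: [&str; 5] =
    ["password", "secret", "token", "private_key", "credential"];

/// Returns a copy of the metadata with suspicious values replaced.
pub fn redact_metadata(metadata: &HashMap<String, String>) -> HashMap<String, String> {
    metadata
        .iter()
        .map(|(key, value)| {
            let lower = key.to_lowercase();
            let sensitive = SENSITIVE_FRAGMENTS.iter().any(|f| lower.contains(*f));
            let safe_value = if sensitive {
                "[REDACTED]".to_string()
            } else {
                value.clone()
            };
            (key.clone(), safe_value)
        })
        .collect()
}
```

Key-name matching only catches secrets stored under obviously named keys, so it complements rather than replaces documenting what the field may contain.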

4. Deserialization Attack Risk in serde_yaml

Severity: 🟠 HIGH
Issue: The serde_yaml dependency is pinned at 0.9.34+deprecated in the Cargo.lock file. The +deprecated suffix marks the crate as no longer maintained, so any vulnerabilities found in it, including potential deserialization flaws, are unlikely to be patched upstream.
Attack Vector: An attacker could craft malicious YAML payloads to exploit vulnerabilities in the serde_yaml library, potentially leading to remote code execution or denial of service.
Recommendation: Replace the deprecated serde_yaml dependency with a maintained and secure alternative. If serde_yaml is still required, ensure that it is updated to the latest stable version and that input is validated before deserialization.


5. Potential Supply Chain Risk in Dependencies

Severity: 🟡 MEDIUM
Issue: The Cargo.lock file includes a large number of dependencies, some of which are outdated or have potential supply chain risks (e.g., serde_yaml, wit-bindgen, wit-parser).
Attack Vector: An attacker could exploit vulnerabilities in outdated dependencies or introduce malicious code through dependency confusion or typosquatting attacks.
Recommendation: Perform a thorough review of all dependencies in the Cargo.lock file. Specifically:

  • Update all dependencies to their latest stable versions.
  • Use tools like cargo-audit to identify known vulnerabilities in dependencies.
  • Consider using a dependency scanner to detect potential typosquatting or malicious packages.

Summary of Findings

| Finding | Severity | Recommendation |
| --- | --- | --- |
| Potential Race Condition in Governance Policy Evaluation | 🔴 CRITICAL | Implement atomicity in policy evaluation and enforcement. |
| Trust Chain Weakness in VerifyPeerTrust RPC | 🟠 HIGH | Add robust identity validation (e.g., SPIFFE/SVID, certificate pinning). |
| Credential Exposure in Audit Logging | 🟠 HIGH | Validate and sanitize metadata field; encrypt sensitive data in audit logs. |
| Deserialization Attack Risk in serde_yaml | 🟠 HIGH | Replace serde_yaml with a maintained alternative or update to the latest version. |
| Potential Supply Chain Risk in Dependencies | 🟡 MEDIUM | Audit and update dependencies; use tools to detect vulnerabilities and typosquatting. |

General Observations

  1. The replacement of unwrap() with unwrap_or_else is a positive change that improves the resilience of the Rust SDK against panics caused by poisoned locks. This is a good step toward improving the overall robustness of the system.
  2. The addition of governance-related gRPC messages and services is a significant enhancement, but it introduces new security risks that must be addressed before deployment.
  3. The Cargo.lock file includes a large number of dependencies, which increases the attack surface. Regular dependency audits are essential for maintaining the security of the project.

Action Items

  1. Address the race condition in EvaluatePolicy by implementing atomic policy evaluation and enforcement.
  2. Strengthen the trust validation mechanism in VerifyPeerTrust to prevent impersonation attacks.
  3. Sanitize and validate the metadata field in AuditEntry to prevent credential exposure.
  4. Replace or update the serde_yaml dependency to mitigate deserialization attack risks.
  5. Perform a full dependency audit and update outdated or deprecated libraries.



@github-actions github-actions bot left a comment

🤖 AI Agent: code-reviewer

Review Summary

This pull request addresses a critical issue by replacing panic-prone unwrap() calls in the Rust SDK with non-panicking alternatives, making the codebase more resilient to poisoned locks and failed conversions. It also introduces new governance-related gRPC services and messages in the registration.proto file. While the changes are generally positive, a few areas require further attention.


🔴 CRITICAL

  1. Tamper-Evident Audit Log Implementation:

    • The AuditEntry message in the registration.proto file includes fields for hash and previous_hash to create a tamper-evident chain. However, there is no information about how these hashes are generated, verified, or stored. Without a clear implementation, this feature might be vulnerable to tampering or replay attacks.
    • Actionable Recommendation: Provide documentation or implementation details about how the hash chain is constructed and verified. Ensure that cryptographic best practices are followed, such as using a collision-resistant hash function (e.g., SHA-256).
  2. Concurrency Safety in Rust SDK:

    • While replacing unwrap() with unwrap_or_else improves panic safety, it is unclear if the fallback logic in unwrap_or_else properly handles poisoned locks. If the fallback logic simply retries the lock acquisition without addressing the underlying issue, it may lead to undefined behavior or deadlocks.
    • Actionable Recommendation: Audit the fallback logic in unwrap_or_else to ensure it handles poisoned locks appropriately. Consider using std::sync::PoisonError to recover from poisoned locks safely.
  3. Trust Score Calculation:

    • The TrustScoreResult message includes an overall_score field (0-1000) and a tier field (e.g., "Untrusted", "Trusted"). However, there is no information about how these scores are calculated or whether they are tamper-proof.
    • Actionable Recommendation: Document the trust score calculation methodology and ensure that it is resistant to manipulation. If cryptographic techniques are used, ensure they are implemented securely.
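
For item 1 above, the general shape of a hash chain over `hash` and `previous_hash` can be sketched as follows. This is not the SDK's audit.rs: to stay within the standard library it swaps in `DefaultHasher`, which is NOT cryptographic, where the commit message describes SHA-256. The `Entry` type and function names are likewise hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone)]
pub struct Entry {
    pub action: String,
    pub previous_hash: u64,
    pub hash: u64,
}

/// Stand-in digest. A real audit log would use SHA-256 here.
fn digest(action: &str, previous_hash: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    previous_hash.hash(&mut hasher);
    action.hash(&mut hasher);
    hasher.finish()
}

/// Appends an entry whose hash covers its content and its predecessor's hash.
pub fn append(chain: &mut Vec<Entry>, action: &str) {
    let previous_hash = chain.last().map(|e| e.hash).unwrap_or(0);
    chain.push(Entry {
        action: action.to_string(),
        previous_hash,
        hash: digest(action, previous_hash),
    });
}

/// Recomputes every link; any edited or reordered entry breaks the chain.
pub fn verify(chain: &[Entry]) -> bool {
    let mut expected_previous = 0u64;
    for entry in chain {
        let recomputed = digest(&entry.action, entry.previous_hash);
        if entry.previous_hash != expected_previous || entry.hash != recomputed {
            return false;
        }
        expected_previous = entry.hash;
    }
    true
}
```

Because each hash covers the previous one, changing an old entry invalidates every entry after it unless the whole tail is recomputed, which is why the hash function needs to be cryptographic in practice.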

🟡 WARNING

  1. Breaking Changes in gRPC API:
    • The addition of the GovernanceService and its associated messages in registration.proto introduces new RPCs. While this does not break existing functionality, it is a significant change to the public API.
    • Actionable Recommendation: Clearly communicate these changes in the release notes and ensure that downstream consumers of the gRPC API are aware of the new services.

💡 SUGGESTIONS

  1. Test Coverage:

    • The PR mentions that all 31 tests pass, but it is unclear if new tests were added to cover the changes in lock handling and the new governance-related gRPC services.
    • Actionable Recommendation: Add unit tests to specifically validate the new unwrap_or_else logic for handling poisoned locks. Additionally, create integration tests for the new governance-related gRPC services.
  2. Error Handling in gRPC Services:

    • The GovernanceService RPCs (e.g., EvaluatePolicy, RecordAudit, GetTrustScore) do not specify how errors are communicated to clients. For example, what happens if the policy engine fails to evaluate a policy or if the audit log cannot be written?
    • Actionable Recommendation: Define error codes or messages in the gRPC API to handle failure scenarios gracefully. Consider using gRPC status codes (e.g., INVALID_ARGUMENT, INTERNAL, etc.) to standardize error handling.
  3. Documentation:

    • The new governance-related gRPC services and their associated messages lack detailed documentation. For example, the context field in PolicyRequest is described as "Additional context for policy evaluation," but it is unclear what kind of data is expected.
    • Actionable Recommendation: Provide detailed documentation for each field in the new gRPC messages and services. Include examples of typical usage scenarios to help developers understand how to use the API.
  4. Backward Compatibility for Rust SDK:

    • The PR states that there are no behavioral changes, but it is important to ensure that the changes to lock handling do not introduce subtle differences in behavior, especially in edge cases.
    • Actionable Recommendation: Perform a thorough review of the Rust SDK's public API to confirm that the changes do not unintentionally alter its behavior. Consider adding regression tests for edge cases.
  5. Code Style and Consistency:

    • While the PR replaces unwrap() with unwrap_or_else, it is important to ensure that the fallback logic is consistent across the codebase.
    • Actionable Recommendation: Use a centralized utility function for handling poisoned locks and failed conversions. This will ensure consistency and make future audits easier.
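
The centralized-helper suggestion could take a form like the sketch below. The function names are hypothetical and nothing here is taken from the SDK; the bodies simply wrap the `unwrap_or_else(|e| e.into_inner())` pattern named in the commit message so that every call site recovers from poisoning the same way.

```rust
use std::sync::{Mutex, MutexGuard, RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Locks a Mutex, recovering the guard if the lock was poisoned.
pub fn lock_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex.lock().unwrap_or_else(|e| e.into_inner())
}

/// Takes a read guard on an RwLock, recovering from poisoning.
pub fn read_recover<T>(lock: &RwLock<T>) -> RwLockReadGuard<'_, T> {
    lock.read().unwrap_or_else(|e| e.into_inner())
}

/// Takes a write guard on an RwLock, recovering from poisoning.
pub fn write_recover<T>(lock: &RwLock<T>) -> RwLockWriteGuard<'_, T> {
    lock.write().unwrap_or_else(|e| e.into_inner())
}
```

A regression test for the poisoned-lock path can then poison a lock on purpose, by panicking in a spawned thread that holds the write guard, and assert that these helpers still return usable guards.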

Final Assessment

The PR addresses a critical issue by improving panic safety in the Rust SDK and introduces useful governance-related gRPC services. However, there are critical concerns regarding the implementation of the tamper-evident audit log and concurrency safety in the Rust SDK. Additionally, the changes to the gRPC API should be clearly communicated to downstream consumers to avoid potential integration issues.

  • Approval Status: Changes are not approved until the critical issues are addressed.
  • Follow-Up: Address the critical issues, add tests for the new functionality, and provide detailed documentation for the gRPC API changes.

Labels

agent-mesh, documentation, size/XL, tests

1 participant