fix(dotnet): fix GovernanceMetrics test isolation#417

Merged
imran-siddique merged 5 commits into microsoft:main from imran-siddique:main
Mar 24, 2026
Conversation

@imran-siddique
Member

Fix `RecordDecision_DeniedIncrementsBlockedCounter` test that failed with `Expected 1, Actual 7`.

Root cause: MeterListener captured counter increments from other test classes before the baseline was set. Fix: flush the listener with `RecordObservableInstruments()` before capturing the baseline.

1 file, +2/-1

imran-siddique and others added 5 commits March 24, 2026 09:32
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mandatory review rules before merging any PR:
- Read actual diff (CI green is not sufficient)
- Dependency confusion scan on all install commands
- Verify __init__.py for new modules
- Verify dependencies declared in pyproject.toml
- No hardcoded secrets or plaintext config in pipelines
- Verify PR has actual changes (additions > 0)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…crosoft#410, microsoft#409, microsoft#400)

- agent_os.compat: NoOp fallbacks for optional toolkit dependency
- agent_os.policies.budget: BudgetPolicy + BudgetTracker for token/cost/tool limits
- agent_os.audit_logger: GovernanceAuditLogger with pluggable backends

16 tests passing. Closes microsoft#410, microsoft#409, microsoft#400.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…crosoft#398, microsoft#396, microsoft#395, microsoft#394)

- event_bus.py: GovernanceEventBus with pub/sub for cross-gate composition
- task_outcome.py: TaskOutcomeRecorder with severity scoring + recovery
- diff_policy.py: DiffPolicy for git change scope enforcement
- sandbox_provider.py: Pluggable SandboxProvider ABC + subprocess impl

22 tests passing. Closes microsoft#398, microsoft#396, microsoft#395, microsoft#394.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fore baseline

RecordDecision_DeniedIncrementsBlockedCounter failed with Expected 1,
Actual 7 because the MeterListener captured counter increments from
other test classes before the baseline was captured. Fix: call
RecordObservableInstruments() before setting the baseline so all
pre-existing measurements are flushed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@imran-siddique imran-siddique merged commit 1fdedcf into microsoft:main Mar 24, 2026
57 checks passed
@github-actions

🤖 AI Agent: breaking-change-detector — Summary

🔍 API Compatibility Report

Summary

This pull request introduces several new features and modules to the agent-os package, including new classes, methods, and utilities. No breaking changes were detected in existing APIs, as no existing public functions, classes, or methods were removed, renamed, or had their signatures altered. The changes are additive and expand the functionality of the package.

Findings

| Severity | Package | Change | Impact |
|---|---|---|---|
| 🔵 | agent-os | Added audit_logger.py module | New public API for audit logging functionality. |
| 🔵 | agent-os | Added compat.py module | New public API for optional toolkit dependency handling. |
| 🔵 | agent-os | Added diff_policy.py module | New public API for git change scope enforcement. |
| 🔵 | agent-os | Added event_bus.py module | New public API for governance event bus functionality. |
| 🔵 | agent-os | Added policies/budget.py module | New public API for budget policy rules and tracking. |
| 🔵 | agent-os | Added sandbox_provider.py module | New public API for sandbox execution isolation. |
| 🔵 | agent-compliance | Added pydantic>=2.4.0 dependency | New dependency added to pyproject.toml. |

Migration Guide

No migration is necessary as no breaking changes were identified. However, downstream users should be aware of the new features and modules introduced, as they may require documentation updates or additional dependencies (e.g., pydantic>=2.4.0).

Recommendations

  1. Documentation: Ensure that the new modules (audit_logger.py, compat.py, diff_policy.py, event_bus.py, policies/budget.py, sandbox_provider.py) are documented thoroughly, including examples for usage.
  2. Dependency Management: Downstream users should verify compatibility with pydantic>=2.4.0 if they intend to use the agent-compliance package.
  3. Testing: Add comprehensive tests for the new modules to ensure robustness and compatibility.

No breaking changes detected.

@github-actions github-actions bot added documentation Improvements or additions to documentation dependencies Pull requests that update a dependency file tests ci/cd CI/CD and workflows size/XL Extra large PR (500+ lines) labels Mar 24, 2026
@github-actions

🤖 AI Agent: docs-sync-checker — Issues Found

📝 Documentation Sync Report

Issues Found

  1. AuditEntry, AuditBackend, JsonlFileBackend, InMemoryBackend, LoggingBackend, GovernanceAuditLogger in packages/agent-os/src/agent_os/audit_logger.py — missing docstrings for some public methods and parameters.
  2. NoOpPolicyEvaluator, NoOpGovernanceMiddleware, get_evaluator in packages/agent-os/src/agent_os/compat.py — missing docstrings for public methods and parameters.
  3. DiffFile, DiffPolicyResult, DiffPolicy in packages/agent-os/src/agent_os/diff_policy.py — missing docstrings for some public methods and parameters.
  4. GovernanceEvent, GovernanceEventBus in packages/agent-os/src/agent_os/event_bus.py — missing docstrings for some public methods and parameters.
  5. BudgetPolicy, BudgetTracker in packages/agent-os/src/agent_os/policies/budget.py — missing docstrings for some public methods and parameters.
  6. SandboxConfig, SandboxResult, SandboxProvider, SubprocessSandboxProvider in packages/agent-os/src/agent_os/sandbox_provider.py — missing docstrings for some public methods and parameters.
  7. ⚠️ packages/agent-os/README.md — missing documentation for new modules and classes added (audit_logger.py, compat.py, diff_policy.py, event_bus.py, policies/budget.py, sandbox_provider.py).
  8. ⚠️ CHANGELOG.md — no entry for the addition of new modules and classes.

Suggestions

  • 💡 Add detailed docstrings for all public classes, methods, and functions, including descriptions of their purpose, parameters, return values, and exceptions.
  • 💡 Update packages/agent-os/README.md to include documentation for the new modules and classes, such as GovernanceAuditLogger, DiffPolicy, GovernanceEventBus, BudgetPolicy, and SandboxProvider.
  • 💡 Add a CHANGELOG entry summarizing the addition of new modules and their functionality.
  • 💡 Ensure all new public APIs have complete type annotations for parameters and return values.

Additional Notes

  • The new modules and classes added in packages/agent-os/src/ introduce significant functionality, such as governance auditing, policy evaluation, event handling, budget tracking, and sandboxing. These require proper documentation to ensure that users and developers can understand and use them effectively.
  • The example code in the new modules appears to be consistent with the API signatures, so no updates are needed for examples/ at this time.

Conclusion

The documentation is not fully in sync with the code changes. Please address the issues and suggestions listed above.

@github-actions github-actions bot left a comment

🤖 AI Agent: code-reviewer

Review Summary

This pull request introduces several new components to the agent-governance-toolkit repository, including audit logging, event bus, policy evaluation, budget tracking, and sandboxing functionality. While these additions are valuable, there are critical security concerns, potential breaking changes, and areas for improvement.


🔴 CRITICAL Issues

1. Sandbox Escape Vectors in SubprocessSandboxProvider

  • The SubprocessSandboxProvider implementation lacks proper security isolation. It explicitly states that it does not provide container isolation, making it vulnerable to sandbox escape attacks. For example:
    • The subprocess inherits the environment variables, which could leak sensitive information.
    • The subprocess runs with the same privileges as the parent process, so code executed in the sandbox can do anything the parent can.
    • No restrictions are placed on file system access, allowing the subprocess to modify or access sensitive files.
  • Recommendation: Replace SubprocessSandboxProvider with a container-based sandbox (e.g., Docker or Firecracker) that enforces strict resource limits, file system isolation, and network restrictions.
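Short of full container isolation, the environment-inheritance point above can be narrowed with the standard library alone. The following is a minimal sketch, not the `SubprocessSandboxProvider` implementation: the function name `run_with_minimal_env` and its return shape are hypothetical, and it assumes a POSIX host. It addresses only environment leakage and runaway execution time, not privileges or file system access.

```python
import os
import subprocess
import sys


def run_with_minimal_env(command, timeout_seconds=5.0, extra_env=None):
    """Run `command` (a list of arguments) without inheriting the parent's environment.

    Hypothetical helper for illustration. Returns (returncode, stdout, killed).
    """
    # Start from a near-empty environment instead of a copy of os.environ,
    # so secrets held by the parent process are not visible to the child.
    env = {"PATH": os.defpath}
    if extra_env:
        env.update(extra_env)
    try:
        completed = subprocess.run(
            command,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return -1, "", True
    return completed.returncode, completed.stdout, False


if __name__ == "__main__":
    os.environ["HYPOTHETICAL_SECRET"] = "s3cr3t"
    probe = "import os; print(os.environ.get('HYPOTHETICAL_SECRET'))"
    print(run_with_minimal_env([sys.executable, "-c", probe]))
```

The child still runs as the same user with the same file system view, so this reduces one leak rather than providing a sandbox.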

2. Unvalidated Input in DiffPolicy

  • The DiffPolicy class uses fnmatch for path matching but does not validate inputs. Malicious paths (e.g., ../../etc/passwd) could bypass policy checks.
  • Recommendation: Sanitize and normalize file paths before evaluating them against allowed_paths and blocked_paths.
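One way to read this recommendation is to normalize each path and reject anything that escapes the repository root before any `fnmatch` pattern is consulted. The sketch below is an assumption about how that could look, not code from `diff_policy.py`; `normalize_repo_path` and `is_blocked` are hypothetical names, and Windows drive-letter paths are not handled.

```python
import fnmatch
import posixpath


def normalize_repo_path(path):
    """Return a normalized repo-relative path, or None if it escapes the repo root."""
    candidate = path.replace("\\", "/")
    if candidate.startswith("/"):
        return None  # absolute paths are never repo-relative
    normalized = posixpath.normpath(candidate)
    if normalized == ".." or normalized.startswith("../"):
        return None  # traversal above the repo root
    return normalized


def is_blocked(path, blocked_patterns):
    """Fail closed: a path that cannot be normalized is treated as blocked."""
    normalized = normalize_repo_path(path)
    if normalized is None:
        return True
    return any(fnmatch.fnmatchcase(normalized, pattern) for pattern in blocked_patterns)
```

With this ordering, a path such as `src/../secrets/key.pem` is matched as `secrets/key.pem`, so a `secrets/*` rule cannot be sidestepped by adding `..` segments.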

3. Potential Denial of Service in GovernanceEventBus

  • The GovernanceEventBus stores event history in memory (self._history). Without proper limits, this could lead to memory exhaustion if a large number of events are published.
  • Recommendation: Enforce stricter limits on self._max_history and implement a mechanism to drop old events when the limit is exceeded.
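Dropping old events once a limit is reached is what `collections.deque` does when given a `maxlen`. A minimal sketch under that assumption follows; `BoundedHistory` is a hypothetical stand-in and does not reflect how `GovernanceEventBus` actually stores `self._history`.

```python
from collections import deque


class BoundedHistory:
    """Keeps only the most recent `max_history` events (hypothetical stand-in)."""

    def __init__(self, max_history=1000):
        if max_history <= 0:
            raise ValueError("max_history must be positive")
        # A deque with maxlen silently discards the oldest item on overflow.
        self._events = deque(maxlen=max_history)

    def record(self, event):
        self._events.append(event)

    def recent(self, limit=None):
        events = list(self._events)
        if limit is None:
            return events
        if limit <= 0:
            return []
        return events[-limit:]
```

Memory use is then bounded by `max_history` times the size of one event, regardless of how many events are published.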

4. Missing Cryptographic Validation in AuditEntry

  • The AuditEntry class does not include cryptographic integrity checks (e.g., HMAC or digital signatures). This could allow tampering with audit logs.
  • Recommendation: Add cryptographic signing to audit entries to ensure their integrity and authenticity.
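As an illustration of the HMAC option, the sketch below signs a canonical JSON rendering of an entry with HMAC-SHA256 from the standard library. It assumes an entry can be represented as a plain dict (the review mentions `AuditEntry.to_dict`); `sign_entry` and `verify_entry` are hypothetical helpers, and key storage and rotation are out of scope.

```python
import hashlib
import hmac
import json


def sign_entry(entry, key):
    """Return a hex HMAC-SHA256 signature over a canonical JSON form of `entry`."""
    # sort_keys and fixed separators make the serialization deterministic,
    # so the same logical entry always produces the same signature.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_entry(entry, signature, key):
    """Constant-time check that `signature` matches `entry` under `key`."""
    return hmac.compare_digest(sign_entry(entry, key), signature)
```

An HMAC only detects tampering by parties who do not hold the key; chaining each signature over the previous one would additionally be needed to detect deleted entries.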

5. Thread Safety in GovernanceEventBus

  • The GovernanceEventBus is not thread-safe. Concurrent access to self._handlers and self._history could lead to race conditions.
  • Recommendation: Use thread-safe data structures (e.g., queue.Queue) or synchronization primitives (e.g., threading.Lock) to ensure thread safety.
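The `threading.Lock` variant might look like the sketch below. `LockedEventBus` is a hypothetical, simplified stand-in for `GovernanceEventBus`: it keeps the `"*"` wildcard key mentioned in this review but omits history and error handling. Handlers are copied under the lock and invoked outside it, so a handler that subscribes during dispatch cannot deadlock.

```python
import threading
from collections import defaultdict


class LockedEventBus:
    """Minimal pub/sub bus whose handler registry is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        with self._lock:
            self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Snapshot the handlers while holding the lock, then release it
        # before calling user code.
        with self._lock:
            handlers = list(self._handlers.get(event_type, ()))
            handlers += list(self._handlers.get("*", ()))
        for handler in handlers:
            handler(payload)
        return len(handlers)
```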

🟡 WARNING: Potential Breaking Changes

1. Dependency Addition (pydantic>=2.4.0)

  • The addition of pydantic>=2.4.0 in pyproject.toml introduces a new dependency. If consumers of this library are using older versions of Pydantic, this could cause compatibility issues.
  • Recommendation: Clearly document the dependency change and test compatibility with older versions.

2. No-Op Fallbacks in agent_os/compat.py

  • The fallback mechanism (NoOpPolicyEvaluator, NoOpGovernanceMiddleware) changes behavior when agent-os-kernel is not installed. This could lead to unexpected behavior for consumers relying on strict policy enforcement.
  • Recommendation: Log warnings prominently when falling back to no-op implementations and provide a mechanism for consumers to opt-in explicitly.

💡 Suggestions for Improvement

1. Audit Logging Enhancements

  • Add support for structured logging formats (e.g., JSON) to facilitate integration with log aggregation tools.
  • Include a configurable log rotation mechanism for the JsonlFileBackend to prevent unbounded file growth.

2. Policy Engine Improvements

  • Enhance DiffPolicy to support regular expressions for more robust path matching.
  • Add support for policy chaining or composition to allow more complex policy definitions.

3. Budget Tracker

  • Add support for real-time monitoring and alerts when budget limits are close to being exceeded.
  • Include a mechanism to reset the tracker for long-running tasks.

4. Event Bus Wildcard Handling

  • The wildcard handler in GovernanceEventBus (self._handlers.get("*", [])) could lead to unintended behavior if misused. Add documentation and examples to clarify its use.

5. Testing

  • Add unit tests for edge cases, such as:
    • Sandbox execution with malicious commands.
    • DiffPolicy evaluation with edge-case paths.
    • GovernanceEventBus under high load or concurrent access.

6. Documentation

  • Expand documentation for new classes and methods, including usage examples and security considerations.
  • Add a security section to the README to highlight best practices for using sandboxing, audit logging, and policy evaluation.

Final Recommendations

  1. Address the 🔴 CRITICAL issues immediately, especially the sandbox escape vectors and thread safety concerns.
  2. Consider the 🟡 WARNING items to avoid breaking changes for consumers.
  3. Implement the 💡 SUGGESTIONS to improve functionality, usability, and security.

This pull request introduces valuable features but requires significant attention to security and backward compatibility before merging.

@github-actions
Copy link


🤖 AI Agent: test-generator — `audit_logger.py`

🧪 Test Coverage Analysis

audit_logger.py

  • Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
  • Missing coverage: The entire file lacks test coverage.
  • 💡 Suggested test cases:
    1. test_audit_entry_serialization — Verify that AuditEntry.to_dict and AuditEntry.to_json correctly serialize data, including edge cases like empty strings, special characters, and large numbers.
    2. test_jsonl_file_backend_write — Test that JsonlFileBackend.write correctly appends a JSONL entry to a file and handles file creation.
    3. test_in_memory_backend_storage — Verify that InMemoryBackend.write correctly stores entries in memory and that flush does not affect the stored entries.
    4. test_logging_backend_output — Ensure that LoggingBackend.write logs the correct message format and content.
    5. test_governance_audit_logger_integration — Test the integration of GovernanceAuditLogger with multiple backends, ensuring all backends receive and process the same audit entry.

compat.py

  • Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
  • Missing coverage: The entire file lacks test coverage.
  • 💡 Suggested test cases:
    1. test_no_op_policy_evaluator_allows_all — Verify that NoOpPolicyEvaluator.evaluate always returns an allowed decision regardless of input.
    2. test_no_op_governance_middleware_passthrough — Ensure that NoOpGovernanceMiddleware correctly passes through function calls without modification.
    3. test_get_evaluator_with_toolkit — Test that get_evaluator returns a real PolicyEvaluator when the toolkit is installed.
    4. test_get_evaluator_without_toolkit — Test that get_evaluator returns a NoOpPolicyEvaluator when the toolkit is not installed.

diff_policy.py

  • Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
  • Missing coverage: The entire file lacks test coverage.
  • 💡 Suggested test cases:
    1. test_diff_policy_max_files_violation — Test that DiffPolicy.evaluate correctly identifies violations when the number of changed files exceeds max_files.
    2. test_diff_policy_max_lines_violation — Test that DiffPolicy.evaluate correctly identifies violations when the total lines changed exceed max_lines.
    3. test_diff_policy_blocked_paths — Verify that DiffPolicy.evaluate flags files matching blocked_paths.
    4. test_diff_policy_allowed_paths — Verify that DiffPolicy.evaluate flags files not matching allowed_paths when allowed_paths is set.
    5. test_diff_policy_no_violations — Ensure that DiffPolicy.evaluate passes when no violations occur.

event_bus.py

  • Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
  • Missing coverage: The entire file lacks test coverage.
  • 💡 Suggested test cases:
    1. test_event_subscription_and_publication — Verify that subscribing to an event type and publishing an event triggers the correct handler.
    2. test_event_unsubscription — Test that unsubscribing a handler prevents it from being called when the event is published.
    3. test_event_history — Verify that get_history correctly retrieves recent events, optionally filtered by type.
    4. test_event_bus_wildcard_subscription — Test that subscribing to the wildcard event type (*) triggers the handler for all events.
    5. test_event_handler_exception_handling — Ensure that exceptions in event handlers are logged but do not prevent other handlers from being called.

policies/budget.py

  • Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
  • Missing coverage: The entire file lacks test coverage.
  • 💡 Suggested test cases:
    1. test_budget_tracker_exceeded_tokens — Verify that BudgetTracker.is_exceeded returns True when token usage exceeds the policy limit.
    2. test_budget_tracker_exceeded_tool_calls — Verify that BudgetTracker.is_exceeded returns True when tool call usage exceeds the policy limit.
    3. test_budget_tracker_remaining_resources — Test that BudgetTracker.remaining correctly calculates remaining resources.
    4. test_budget_tracker_utilization — Verify that BudgetTracker.utilization correctly calculates resource utilization percentages.
    5. test_budget_tracker_no_exceedance — Ensure that BudgetTracker.is_exceeded returns False when all usage is within policy limits.

sandbox_provider.py

  • Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
  • Missing coverage: The entire file lacks test coverage.
  • 💡 Suggested test cases:
    1. test_subprocess_sandbox_provider_success — Verify that SubprocessSandboxProvider.run correctly executes a command and captures output when successful.
    2. test_subprocess_sandbox_provider_timeout — Test that SubprocessSandboxProvider.run correctly handles timeouts and returns a SandboxResult with killed=True.
    3. test_subprocess_sandbox_provider_env_vars — Verify that environment variables are correctly passed to the subprocess.
    4. test_subprocess_sandbox_provider_memory_limit — Simulate and verify behavior when memory limits are exceeded (if applicable).
    5. test_sandbox_provider_availability — Test that is_available correctly reports the availability of the sandbox provider.

task_outcome.py

  • Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
  • Missing coverage: The entire file lacks test coverage.
  • 💡 Suggested test cases:
    1. test_task_outcome_success — Verify that a successful task outcome is correctly represented.
    2. test_task_outcome_failure — Test that a failed task outcome is correctly represented, including error messages and codes.
    3. test_task_outcome_serialization — Verify that task outcomes can be serialized and deserialized correctly.
    4. test_task_outcome_edge_cases — Test edge cases, such as extremely long error messages or unusual error codes.

Summary

  • Files with missing test coverage: All changed files are new and lack corresponding test files.
  • Total suggested test cases: 33
  • Priority: High — These files are new and require comprehensive test coverage to ensure reliability and robustness.

Would you like me to generate specific test code for any of these cases?

@github-actions
Copy link

🤖 AI Agent: security-scanner — Security Analysis of the Pull Request

Security Analysis of the Pull Request

1. Prompt Injection Defense Bypass

  • Risk: No direct evidence of prompt injection vulnerabilities in the changes. However, the audit_logger.py and sandbox_provider.py modules introduce new functionality that could potentially be exploited if user-controlled input is logged or executed without proper sanitization.
  • Rating: 🔵 LOW
  • Recommendation: Ensure that all user-provided inputs logged in audit_logger.py are sanitized to prevent injection attacks. Similarly, in sandbox_provider.py, validate and sanitize the command parameter to prevent command injection vulnerabilities.

2. Policy Engine Circumvention

  • Risk: The compat.py file introduces a NoOpPolicyEvaluator that allows all actions if the agent-os-kernel dependency is not installed. This could lead to a complete bypass of policy enforcement in environments where the dependency is missing.
  • Rating: 🔴 CRITICAL
  • Attack Vector: If an attacker can manipulate the environment to prevent the agent-os-kernel dependency from being installed or loaded, the NoOpPolicyEvaluator will be used, effectively disabling all policy checks.
  • Recommendation: Remove the NoOpPolicyEvaluator or ensure that the system fails securely (e.g., deny all actions) when the agent-os-kernel dependency is not available. Alternatively, log a critical error and halt execution if the dependency is missing.
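The fail-secure alternative can be sketched as a deny-all fallback chosen when the dependency cannot be found. This is an assumption about one possible shape, not the contents of `compat.py`: `DenyAllEvaluator` and `select_evaluator` are hypothetical names, the importable module name of the kernel package is passed in by the caller because it is not stated here, and the boolean return of `evaluate` is a simplification.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


class DenyAllEvaluator:
    """Fallback that refuses every action instead of allowing it."""

    def evaluate(self, action):
        return False


def select_evaluator(module_name, real_factory):
    """Return real_factory() if `module_name` is importable, else a deny-all fallback.

    `module_name` must be a top-level module name.
    """
    if importlib.util.find_spec(module_name) is None:
        logger.critical(
            "Policy dependency %r is not installed; denying all actions.", module_name
        )
        return DenyAllEvaluator()
    return real_factory()
```

Compared with a no-op evaluator, a missing dependency then shows up as blocked actions and a critical log line rather than as silently disabled enforcement.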

3. Trust Chain Weaknesses

  • Risk: No changes directly related to SPIFFE/SVID validation or certificate pinning were identified in this PR.
  • Rating: 🔵 LOW
  • Recommendation: No action required for this PR.

4. Credential Exposure

  • Risk: The audit_logger.py file logs events, including metadata, which could potentially include sensitive information. If the metadata contains sensitive data (e.g., API keys, tokens, or PII), it could be exposed in logs.
  • Rating: 🟠 HIGH
  • Attack Vector: If sensitive information is logged, an attacker with access to the logs could extract this data.
  • Recommendation: Implement a mechanism to sanitize or redact sensitive information from the metadata before logging it. Add a warning in the documentation to caution developers about including sensitive data in the metadata.
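A redaction step of the kind recommended here could be a key-name filter applied to the metadata before it reaches any backend. The sketch below is illustrative only: `redact_metadata` and the key pattern are assumptions, the match is on key names rather than values, and a broad pattern such as `token` will also redact harmless keys like `token_count`.

```python
import re

# Hypothetical list of key-name fragments treated as sensitive.
SENSITIVE_KEY_PATTERN = re.compile(
    r"(api[_-]?key|token|secret|password|authorization)", re.IGNORECASE
)


def redact_metadata(metadata):
    """Return a copy of `metadata` with values under sensitive-looking keys masked."""
    redacted = {}
    for key, value in metadata.items():
        if SENSITIVE_KEY_PATTERN.search(str(key)):
            redacted[key] = "[REDACTED]"
        elif isinstance(value, dict):
            # Recurse so nested metadata is filtered as well.
            redacted[key] = redact_metadata(value)
        else:
            redacted[key] = value
    return redacted
```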

5. Sandbox Escape

  • Risk: The SubprocessSandboxProvider in sandbox_provider.py does not provide true isolation. It explicitly states that it does not offer security isolation and is not suitable for production use.
  • Rating: 🔴 CRITICAL
  • Attack Vector: An attacker could exploit this lack of isolation to execute malicious commands or access resources outside the intended sandbox.
  • Recommendation: Clearly document that SubprocessSandboxProvider is for testing only and must not be used in production. Add runtime checks to prevent its use in production environments. Consider implementing a secure sandboxing mechanism (e.g., Docker or Firecracker) for production use.

6. Deserialization Attacks

  • Risk: No evidence of unsafe deserialization in the changes. The yaml.safe_load() guideline is followed in the copilot-instructions.md.
  • Rating: 🔵 LOW
  • Recommendation: No action required for this PR.

7. Race Conditions

  • Risk: The event_bus.py module introduces a publish/subscribe mechanism that could be vulnerable to race conditions if multiple threads or processes publish or handle events concurrently.
  • Rating: 🟡 MEDIUM
  • Attack Vector: Concurrent access to the _handlers dictionary or _history list could lead to data corruption or unexpected behavior.
  • Recommendation: Use thread-safe data structures (e.g., queue.Queue) or synchronization primitives (e.g., threading.Lock) to manage _handlers and _history and prevent race conditions.

8. Supply Chain

  • Risk: The addition of pydantic>=2.4.0 in pyproject.toml introduces a new dependency. While pydantic is a well-known library, its inclusion increases the attack surface.
  • Rating: 🟡 MEDIUM
  • Attack Vector: If an attacker compromises the pydantic package on PyPI, they could inject malicious code into the dependency.
  • Recommendation: Use dependency pinning to a specific version or hash to prevent supply chain attacks. Regularly audit dependencies for vulnerabilities.

Summary of Findings

| Issue | Rating | Recommendation |
|---|---|---|
| Prompt Injection Defense Bypass | 🔵 LOW | Sanitize user inputs in audit_logger.py and sandbox_provider.py. |
| Policy Engine Circumvention | 🔴 CRITICAL | Remove NoOpPolicyEvaluator or fail securely if agent-os-kernel is missing. |
| Trust Chain Weaknesses | 🔵 LOW | No action required. |
| Credential Exposure | 🟠 HIGH | Sanitize metadata in audit_logger.py to prevent sensitive data exposure. |
| Sandbox Escape | 🔴 CRITICAL | Restrict SubprocessSandboxProvider to testing environments only. |
| Deserialization Attacks | 🔵 LOW | No action required. |
| Race Conditions | 🟡 MEDIUM | Use thread-safe structures in event_bus.py. |
| Supply Chain | 🟡 MEDIUM | Pin pydantic dependency to a specific version or hash. |

Final Recommendation

The PR introduces critical security risks, particularly in the compat.py and sandbox_provider.py modules. These issues must be addressed before merging. Additionally, the high-risk credential exposure in audit_logger.py and medium-risk race conditions in event_bus.py should be mitigated.

Labels

ci/cd CI/CD and workflows dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation size/XL Extra large PR (500+ lines) tests

1 participant