fix(dotnet): fix GovernanceMetrics test isolation by imran-siddique · Pull Request #417 · microsoft/agent-governance-toolkit

imran-siddique · 2026-03-24T21:58:43Z

Fix `RecordDecision_DeniedIncrementsBlockedCounter` test that failed with `Expected 1, Actual 7`.

Root cause: MeterListener captured counter increments from other test classes before the baseline was set. Fix: flush the listener with `RecordObservableInstruments()` before capturing the baseline.

1 file, +2/-1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Mandatory review rules before merging any PR:
- Read actual diff (CI green is not sufficient)
- Dependency confusion scan on all install commands
- Verify __init__.py for new modules
- Verify dependencies declared in pyproject.toml
- No hardcoded secrets or plaintext config in pipelines
- Verify PR has actual changes (additions > 0)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…crosoft#410, microsoft#409, microsoft#400)
- agent_os.compat: NoOp fallbacks for optional toolkit dependency
- agent_os.policies.budget: BudgetPolicy + BudgetTracker for token/cost/tool limits
- agent_os.audit_logger: GovernanceAuditLogger with pluggable backends
16 tests passing. Closes microsoft#410, microsoft#409, microsoft#400.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…crosoft#398, microsoft#396, microsoft#395, microsoft#394)
- event_bus.py: GovernanceEventBus with pub/sub for cross-gate composition
- task_outcome.py: TaskOutcomeRecorder with severity scoring + recovery
- diff_policy.py: DiffPolicy for git change scope enforcement
- sandbox_provider.py: Pluggable SandboxProvider ABC + subprocess impl
22 tests passing. Closes microsoft#398, microsoft#396, microsoft#395, microsoft#394.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…fore baseline
RecordDecision_DeniedIncrementsBlockedCounter failed with Expected 1, Actual 7 because the MeterListener captured counter increments from other test classes before the baseline was captured. Fix: call RecordObservableInstruments() before setting the baseline so all pre-existing measurements are flushed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-03-24T21:59:17Z

🤖 AI Agent: breaking-change-detector — Summary

🔍 API Compatibility Report

Summary

This pull request introduces several new features and modules to the agent-os package, including new classes, methods, and utilities. No breaking changes were detected in existing APIs, as no existing public functions, classes, or methods were removed, renamed, or had their signatures altered. The changes are additive and expand the functionality of the package.

Findings

Severity	Package	Change	Impact
🔵	agent-os	Added `audit_logger.py` module	New public API for audit logging functionality.
🔵	agent-os	Added `compat.py` module	New public API for optional toolkit dependency handling.
🔵	agent-os	Added `diff_policy.py` module	New public API for git change scope enforcement.
🔵	agent-os	Added `event_bus.py` module	New public API for governance event bus functionality.
🔵	agent-os	Added `policies/budget.py` module	New public API for budget policy rules and tracking.
🔵	agent-os	Added `sandbox_provider.py` module	New public API for sandbox execution isolation.
🔵	agent-compliance	Added `pydantic>=2.4.0` dependency	New dependency added to `pyproject.toml`.

Migration Guide

No migration is necessary as no breaking changes were identified. However, downstream users should be aware of the new features and modules introduced, as they may require documentation updates or additional dependencies (e.g., pydantic>=2.4.0).

Recommendations

Documentation: Ensure that the new modules (audit_logger.py, compat.py, diff_policy.py, event_bus.py, policies/budget.py, sandbox_provider.py) are documented thoroughly, including examples for usage.
Dependency Management: Downstream users should verify compatibility with pydantic>=2.4.0 if they intend to use the agent-compliance package.
Testing: Add comprehensive tests for the new modules to ensure robustness and compatibility.

✅ No breaking changes detected.

github-actions · 2026-03-24T21:59:23Z

🤖 AI Agent: docs-sync-checker — Issues Found

📝 Documentation Sync Report

Issues Found

❌ AuditEntry, AuditBackend, JsonlFileBackend, InMemoryBackend, LoggingBackend, GovernanceAuditLogger in packages/agent-os/src/agent_os/audit_logger.py — missing docstrings for some public methods and parameters.
❌ NoOpPolicyEvaluator, NoOpGovernanceMiddleware, get_evaluator in packages/agent-os/src/agent_os/compat.py — missing docstrings for public methods and parameters.
❌ DiffFile, DiffPolicyResult, DiffPolicy in packages/agent-os/src/agent_os/diff_policy.py — missing docstrings for some public methods and parameters.
❌ GovernanceEvent, GovernanceEventBus in packages/agent-os/src/agent_os/event_bus.py — missing docstrings for some public methods and parameters.
❌ BudgetPolicy, BudgetTracker in packages/agent-os/src/agent_os/policies/budget.py — missing docstrings for some public methods and parameters.
❌ SandboxConfig, SandboxResult, SandboxProvider, SubprocessSandboxProvider in packages/agent-os/src/agent_os/sandbox_provider.py — missing docstrings for some public methods and parameters.
⚠️ packages/agent-os/README.md — missing documentation for new modules and classes added (audit_logger.py, compat.py, diff_policy.py, event_bus.py, policies/budget.py, sandbox_provider.py).
⚠️ CHANGELOG.md — no entry for the addition of new modules and classes.

Suggestions

💡 Add detailed docstrings for all public classes, methods, and functions, including descriptions of their purpose, parameters, return values, and exceptions.
💡 Update packages/agent-os/README.md to include documentation for the new modules and classes, such as GovernanceAuditLogger, DiffPolicy, GovernanceEventBus, BudgetPolicy, and SandboxProvider.
💡 Add a CHANGELOG entry summarizing the addition of new modules and their functionality.
💡 Ensure all new public APIs have complete type annotations for parameters and return values.

Additional Notes

The new modules and classes added in packages/agent-os/src/ introduce significant functionality, such as governance auditing, policy evaluation, event handling, budget tracking, and sandboxing. These require proper documentation to ensure that users and developers can understand and use them effectively.
The example code in the new modules appears to be consistent with the API signatures, so no updates are needed for examples/ at this time.

Conclusion

The documentation is not fully in sync with the code changes. Please address the issues and suggestions listed above.

github-actions

🤖 AI Agent: code-reviewer

Review Summary

This pull request introduces several new components to the agent-governance-toolkit repository, including audit logging, event bus, policy evaluation, budget tracking, and sandboxing functionality. While these additions are valuable, there are critical security concerns, potential breaking changes, and areas for improvement.

🔴 CRITICAL Issues

1. Sandbox Escape Vectors in `SubprocessSandboxProvider`

The SubprocessSandboxProvider implementation lacks proper security isolation. It explicitly states that it does not provide container isolation, making it vulnerable to sandbox escape attacks. For example:
- The subprocess inherits the environment variables, which could leak sensitive information.
- The subprocess runs with the same privileges as the parent process, so a malicious command can do anything the parent process can.
- No restrictions are placed on file system access, allowing the subprocess to modify or access sensitive files.
Recommendation: Replace SubprocessSandboxProvider with a container-based sandbox (e.g., Docker or Firecracker) that enforces strict resource limits, file system isolation, and network restrictions.
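
As a partial illustration of the environment-leak point above, here is a minimal sketch, assuming a hypothetical helper named `run_restricted` that is not part of the PR. It addresses only environment inheritance and runaway execution time; it does nothing about privileges or file system access, which still need container-level isolation.

```python
import subprocess


def run_restricted(command, timeout_seconds=5.0, env_allowlist=None):
    """Run a command with an explicit environment and a timeout.

    Hypothetical helper for illustration; the names and the returned
    dictionary shape are assumptions, not the toolkit's API.
    """
    # Pass only explicitly allowed variables instead of inheriting os.environ.
    env = dict(env_allowlist or {})
    try:
        completed = subprocess.run(
            command,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"exit_code": None, "stdout": "", "killed": True}
    return {
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "killed": False,
    }
```

On platforms where the interpreter or shell needs certain variables to start, those would have to be added to `env_allowlist` explicitly.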

2. Unvalidated Input in `DiffPolicy`

The DiffPolicy class uses fnmatch for path matching but does not validate inputs. Malicious paths (e.g., ../../etc/passwd) could bypass policy checks.
Recommendation: Sanitize and normalize file paths before evaluating them against allowed_paths and blocked_paths.
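
The normalization step could look roughly like the sketch below. The function names and the pattern arguments are assumptions for illustration and do not reflect the actual `DiffPolicy` signature; the point is only that traversal segments are resolved, and escaping paths rejected, before any `fnmatch` comparison.

```python
import fnmatch
import posixpath


def normalize_repo_path(path):
    """Return a normalized repo-relative path, or None if it escapes the root."""
    # Treat backslashes as separators so Windows-style input is handled the same way.
    candidate = posixpath.normpath(path.replace("\\", "/"))
    if candidate.startswith("/") or candidate == ".." or candidate.startswith("../"):
        return None
    return candidate


def is_path_allowed(path, allowed_patterns, blocked_patterns):
    """Check a path against glob patterns after normalization (hypothetical helper)."""
    normalized = normalize_repo_path(path)
    if normalized is None:
        # Absolute paths and paths that climb out of the repository are rejected.
        return False
    if any(fnmatch.fnmatchcase(normalized, p) for p in blocked_patterns):
        return False
    return any(fnmatch.fnmatchcase(normalized, p) for p in allowed_patterns)
```

Without normalization, a path such as `src/../secrets/key.pem` would match an allowed pattern like `src/*`, because `fnmatch` wildcards also match `/`.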

3. Potential Denial of Service in `GovernanceEventBus`

The GovernanceEventBus stores event history in memory (self._history). Without proper limits, this could lead to memory exhaustion if a large number of events are published.
Recommendation: Enforce stricter limits on self._max_history and implement a mechanism to drop old events when the limit is exceeded.
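
One common way to drop old events automatically is a fixed-size `collections.deque`. The class below is a standalone sketch with assumed names (`BoundedHistory`, `record`, `recent`), not the event bus implementation from the PR.

```python
from collections import deque


class BoundedHistory:
    """Keep only the most recent events (illustrative sketch)."""

    def __init__(self, max_history=1000):
        if max_history <= 0:
            raise ValueError("max_history must be positive")
        # A deque with maxlen discards the oldest item when a new one is appended.
        self._events = deque(maxlen=max_history)

    def record(self, event):
        self._events.append(event)

    def recent(self, limit=None):
        events = list(self._events)
        if limit is None:
            return events
        if limit <= 0:
            return []
        return events[-limit:]
```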

4. Missing Cryptographic Validation in `AuditEntry`

The AuditEntry class does not include cryptographic integrity checks (e.g., HMAC or digital signatures). This could allow tampering with audit logs.
Recommendation: Add cryptographic signing to audit entries to ensure their integrity and authenticity.
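
A minimal sketch of HMAC-based signing follows, assuming an audit entry can be represented as a JSON-serializable dictionary. The function names are hypothetical, and key management (storage, rotation) is out of scope here.

```python
import hashlib
import hmac
import json


def sign_entry(entry, key):
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of the entry."""
    # sort_keys and fixed separators make the encoding independent of key order.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_entry(entry, signature, key):
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_entry(entry, key), signature)
```

An HMAC only detects modification by parties without the key; chaining each signature into the next entry would additionally be needed to detect deleted entries.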

5. Thread Safety in `GovernanceEventBus`

The GovernanceEventBus is not thread-safe. Concurrent access to self._handlers and self._history could lead to race conditions.
Recommendation: Use thread-safe data structures (e.g., queue.Queue) or synchronization primitives (e.g., threading.Lock) to ensure thread safety.
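
The locking approach might be sketched as follows. `LockedEventBus` and its method names are assumptions made for this example; only the `"*"` wildcard key and the `_handlers` / `_history` attribute names come from the review text.

```python
import threading
from collections import defaultdict, deque


class LockedEventBus:
    """Pub/sub bus whose shared state is guarded by one lock (sketch)."""

    def __init__(self, max_history=1000):
        self._lock = threading.Lock()
        self._handlers = defaultdict(list)
        self._history = deque(maxlen=max_history)

    def subscribe(self, event_type, handler):
        with self._lock:
            self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        with self._lock:
            self._history.append((event_type, payload))
            # Copy the handler lists so iteration happens on a stable snapshot.
            handlers = list(self._handlers.get(event_type, []))
            handlers += list(self._handlers.get("*", []))
        # Call handlers outside the lock so a handler can publish or
        # subscribe without deadlocking on the non-reentrant lock.
        for handler in handlers:
            handler(event_type, payload)

    def history_size(self):
        with self._lock:
            return len(self._history)
```

Handlers themselves still need to be safe to call from multiple threads; the lock only protects the bus's own bookkeeping.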

🟡 WARNING: Potential Breaking Changes

1. Dependency Addition (`pydantic>=2.4.0`)

The addition of pydantic>=2.4.0 in pyproject.toml introduces a new dependency. If consumers of this library are using older versions of Pydantic, this could cause compatibility issues.
Recommendation: Clearly document the dependency change and test compatibility with older versions.

2. No-Op Fallbacks in `agent_os/compat.py`

The fallback mechanism (NoOpPolicyEvaluator, NoOpGovernanceMiddleware) changes behavior when agent-os-kernel is not installed. This could lead to unexpected behavior for consumers relying on strict policy enforcement.
Recommendation: Log warnings prominently when falling back to no-op implementations and provide a mechanism for consumers to opt-in explicitly.

💡 Suggestions for Improvement

1. Audit Logging Enhancements

Add support for structured logging formats (e.g., JSON) to facilitate integration with log aggregation tools.
Include a configurable log rotation mechanism for the JsonlFileBackend to prevent unbounded file growth.
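
For the rotation suggestion, the standard library's `logging.handlers.RotatingFileHandler` already provides size-based rotation. The sketch below shows how JSONL output could be routed through it; the helper names and defaults are assumptions, not how `JsonlFileBackend` is written.

```python
import json
import logging
import logging.handlers


def make_rotating_jsonl_logger(path, max_bytes=1_000_000, backup_count=3,
                               name="audit.jsonl"):
    """Create a logger that writes one JSON object per line and rotates by size."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Keep audit lines out of the root logger's handlers.
    logger.propagate = False
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8"
    )
    # Emit the message only, so each line stays valid JSON.
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def write_entry(logger, entry):
    logger.info(json.dumps(entry, sort_keys=True))
```

Calling the factory twice with the same `name` would attach a second handler, so a real backend would create the logger once and reuse it.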

2. Policy Engine Improvements

Enhance DiffPolicy to support regular expressions for more robust path matching.
Add support for policy chaining or composition to allow more complex policy definitions.

3. Budget Tracker

Add support for real-time monitoring and alerts when budget limits are close to being exceeded.
Include a mechanism to reset the tracker for long-running tasks.

4. Event Bus Wildcard Handling

The wildcard handler in GovernanceEventBus (self._handlers.get("*", [])) could lead to unintended behavior if misused. Add documentation and examples to clarify its use.

5. Testing

Add unit tests for edge cases, such as:
- Sandbox execution with malicious commands.
- DiffPolicy evaluation with edge-case paths.
- GovernanceEventBus under high load or concurrent access.

6. Documentation

Expand documentation for new classes and methods, including usage examples and security considerations.
Add a security section to the README to highlight best practices for using sandboxing, audit logging, and policy evaluation.

Final Recommendations

Address the 🔴 CRITICAL issues immediately, especially the sandbox escape vectors and thread safety concerns.
Consider the 🟡 WARNING items to avoid breaking changes for consumers.
Implement the 💡 SUGGESTIONS to improve functionality, usability, and security.

This pull request introduces valuable features but requires significant attention to security and backward compatibility before merging.

github-actions · 2026-03-24T21:59:35Z

🤖 AI Agent: test-generator — `audit_logger.py`

🧪 Test Coverage Analysis

`audit_logger.py`

✅ Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
❌ Missing coverage: The entire file lacks test coverage.
💡 Suggested test cases:
1. test_audit_entry_serialization — Verify that AuditEntry.to_dict and AuditEntry.to_json correctly serialize data, including edge cases like empty strings, special characters, and large numbers.
2. test_jsonl_file_backend_write — Test that JsonlFileBackend.write correctly appends a JSONL entry to a file and handles file creation.
3. test_in_memory_backend_storage — Verify that InMemoryBackend.write correctly stores entries in memory and that flush does not affect the stored entries.
4. test_logging_backend_output — Ensure that LoggingBackend.write logs the correct message format and content.
5. test_governance_audit_logger_integration — Test the integration of GovernanceAuditLogger with multiple backends, ensuring all backends receive and process the same audit entry.

`compat.py`

✅ Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
❌ Missing coverage: The entire file lacks test coverage.
💡 Suggested test cases:
1. test_no_op_policy_evaluator_allows_all — Verify that NoOpPolicyEvaluator.evaluate always returns an allowed decision regardless of input.
2. test_no_op_governance_middleware_passthrough — Ensure that NoOpGovernanceMiddleware correctly passes through function calls without modification.
3. test_get_evaluator_with_toolkit — Test that get_evaluator returns a real PolicyEvaluator when the toolkit is installed.
4. test_get_evaluator_without_toolkit — Test that get_evaluator returns a NoOpPolicyEvaluator when the toolkit is not installed.

`diff_policy.py`

✅ Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
❌ Missing coverage: The entire file lacks test coverage.
💡 Suggested test cases:
1. test_diff_policy_max_files_violation — Test that DiffPolicy.evaluate correctly identifies violations when the number of changed files exceeds max_files.
2. test_diff_policy_max_lines_violation — Test that DiffPolicy.evaluate correctly identifies violations when the total lines changed exceed max_lines.
3. test_diff_policy_blocked_paths — Verify that DiffPolicy.evaluate flags files matching blocked_paths.
4. test_diff_policy_allowed_paths — Verify that DiffPolicy.evaluate flags files not matching allowed_paths when allowed_paths is set.
5. test_diff_policy_no_violations — Ensure that DiffPolicy.evaluate passes when no violations occur.

`event_bus.py`

✅ Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
❌ Missing coverage: The entire file lacks test coverage.
💡 Suggested test cases:
1. test_event_subscription_and_publication — Verify that subscribing to an event type and publishing an event triggers the correct handler.
2. test_event_unsubscription — Test that unsubscribing a handler prevents it from being called when the event is published.
3. test_event_history — Verify that get_history correctly retrieves recent events, optionally filtered by type.
4. test_event_bus_wildcard_subscription — Test that subscribing to the wildcard event type (*) triggers the handler for all events.
5. test_event_handler_exception_handling — Ensure that exceptions in event handlers are logged but do not prevent other handlers from being called.

`policies/budget.py`

✅ Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
❌ Missing coverage: The entire file lacks test coverage.
💡 Suggested test cases:
1. test_budget_tracker_exceeded_tokens — Verify that BudgetTracker.is_exceeded returns True when token usage exceeds the policy limit.
2. test_budget_tracker_exceeded_tool_calls — Verify that BudgetTracker.is_exceeded returns True when tool call usage exceeds the policy limit.
3. test_budget_tracker_remaining_resources — Test that BudgetTracker.remaining correctly calculates remaining resources.
4. test_budget_tracker_utilization — Verify that BudgetTracker.utilization correctly calculates resource utilization percentages.
5. test_budget_tracker_no_exceedance — Ensure that BudgetTracker.is_exceeded returns False when all usage is within policy limits.
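
To show the shape such tests could take, here is a sketch against a stand-in tracker. `SketchBudgetPolicy` and `SketchBudgetTracker` are hypothetical classes written for this example; the real `BudgetPolicy` and `BudgetTracker` fields and method signatures are not shown in this PR page and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBudgetPolicy:
    """Stand-in policy with assumed field names."""
    max_tokens: int
    max_tool_calls: int


class SketchBudgetTracker:
    """Stand-in tracker; not the toolkit's BudgetTracker."""

    def __init__(self, policy):
        self.policy = policy
        self.tokens_used = 0
        self.tool_calls_used = 0

    def record(self, tokens=0, tool_calls=0):
        self.tokens_used += tokens
        self.tool_calls_used += tool_calls

    def is_exceeded(self):
        # Usage equal to the limit is treated as still within budget.
        return (self.tokens_used > self.policy.max_tokens
                or self.tool_calls_used > self.policy.max_tool_calls)

    def remaining(self):
        return {
            "tokens": max(0, self.policy.max_tokens - self.tokens_used),
            "tool_calls": max(0, self.policy.max_tool_calls - self.tool_calls_used),
        }


def test_budget_tracker_exceeded_tokens():
    tracker = SketchBudgetTracker(SketchBudgetPolicy(max_tokens=100, max_tool_calls=5))
    tracker.record(tokens=101)
    assert tracker.is_exceeded() is True


def test_budget_tracker_no_exceedance():
    tracker = SketchBudgetTracker(SketchBudgetPolicy(max_tokens=100, max_tool_calls=5))
    tracker.record(tokens=100, tool_calls=5)
    assert tracker.is_exceeded() is False
    assert tracker.remaining() == {"tokens": 0, "tool_calls": 0}
```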

`sandbox_provider.py`

✅ Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
❌ Missing coverage: The entire file lacks test coverage.
💡 Suggested test cases:
1. test_subprocess_sandbox_provider_success — Verify that SubprocessSandboxProvider.run correctly executes a command and captures output when successful.
2. test_subprocess_sandbox_provider_timeout — Test that SubprocessSandboxProvider.run correctly handles timeouts and returns a SandboxResult with killed=True.
3. test_subprocess_sandbox_provider_env_vars — Verify that environment variables are correctly passed to the subprocess.
4. test_subprocess_sandbox_provider_memory_limit — Simulate and verify behavior when memory limits are exceeded (if applicable).
5. test_sandbox_provider_availability — Test that is_available correctly reports the availability of the sandbox provider.

`task_outcome.py`

✅ Existing coverage: No corresponding test file found in tests/. The file is new, so no existing tests are expected.
❌ Missing coverage: The entire file lacks test coverage.
💡 Suggested test cases:
1. test_task_outcome_success — Verify that a successful task outcome is correctly represented.
2. test_task_outcome_failure — Test that a failed task outcome is correctly represented, including error messages and codes.
3. test_task_outcome_serialization — Verify that task outcomes can be serialized and deserialized correctly.
4. test_task_outcome_edge_cases — Test edge cases, such as extremely long error messages or unusual error codes.

Summary

Files with missing test coverage: All changed files are new and lack corresponding test files.
Total suggested test cases: 33
Priority: High — These files are new and require comprehensive test coverage to ensure reliability and robustness.

Would you like me to generate specific test code for any of these cases?

github-actions · 2026-03-24T21:59:36Z

🤖 AI Agent: security-scanner — Security Analysis of the Pull Request

1. Prompt Injection Defense Bypass

Risk: No direct evidence of prompt injection vulnerabilities in the changes. However, the audit_logger.py and sandbox_provider.py modules introduce new functionality that could potentially be exploited if user-controlled input is logged or executed without proper sanitization.
Rating: 🔵 LOW
Recommendation: Ensure that all user-provided inputs logged in audit_logger.py are sanitized to prevent injection attacks. Similarly, in sandbox_provider.py, validate and sanitize the command parameter to prevent command injection vulnerabilities.

2. Policy Engine Circumvention

Risk: The compat.py file introduces a NoOpPolicyEvaluator that allows all actions if the agent-os-kernel dependency is not installed. This could lead to a complete bypass of policy enforcement in environments where the dependency is missing.
Rating: 🔴 CRITICAL
Attack Vector: If an attacker can manipulate the environment to prevent the agent-os-kernel dependency from being installed or loaded, the NoOpPolicyEvaluator will be used, effectively disabling all policy checks.
Recommendation: Remove the NoOpPolicyEvaluator or ensure that the system fails securely (e.g., deny all actions) when the agent-os-kernel dependency is not available. Alternatively, log a critical error and halt execution if the dependency is missing.
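
A fail-closed fallback, as opposed to the allow-all no-op, could be sketched like this. `DenyAllPolicyEvaluator`, `get_evaluator_fail_closed`, the `load_real_evaluator` callable, and the dictionary returned by `evaluate` are all assumptions for illustration; the actual evaluator interface in `compat.py` is not shown here.

```python
import logging

logger = logging.getLogger("governance.compat")


class DenyAllPolicyEvaluator:
    """Fallback that denies every action (hypothetical)."""

    def evaluate(self, context):
        return {"allowed": False, "reason": "policy engine unavailable"}


def get_evaluator_fail_closed(load_real_evaluator):
    """Return the real evaluator, or a deny-all fallback if it cannot be imported.

    load_real_evaluator is a caller-supplied callable that imports and
    constructs the real evaluator, raising ImportError when the optional
    dependency is missing.
    """
    try:
        return load_real_evaluator()
    except ImportError:
        logger.critical("Policy engine not importable; denying all actions")
        return DenyAllPolicyEvaluator()
```

Whether to deny everything or halt outright is a deployment decision; the sketch only shows that a missing dependency need not silently disable enforcement.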

3. Trust Chain Weaknesses

Risk: No changes directly related to SPIFFE/SVID validation or certificate pinning were identified in this PR.
Rating: 🔵 LOW
Recommendation: No action required for this PR.

4. Credential Exposure

Risk: The audit_logger.py file logs events, including metadata, which could potentially include sensitive information. If the metadata contains sensitive data (e.g., API keys, tokens, or PII), it could be exposed in logs.
Rating: 🟠 HIGH
Attack Vector: If sensitive information is logged, an attacker with access to the logs could extract this data.
Recommendation: Implement a mechanism to sanitize or redact sensitive information from the metadata before logging it. Add a warning in the documentation to caution developers about including sensitive data in the metadata.
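
A simple key-based redaction pass might look like the sketch below. The key pattern and the `redact_metadata` name are assumptions; the pattern will over-match (for example, a key named `token_count`) and does not inspect values, so it is a starting point rather than a complete safeguard.

```python
import re

# Key names that suggest a secret; an assumed, deliberately broad pattern.
SENSITIVE_KEY_PATTERN = re.compile(
    r"(api[_-]?key|token|secret|password|authorization)", re.IGNORECASE
)
REDACTED = "[REDACTED]"


def redact_metadata(value):
    """Return a copy of value with sensitive-looking keys masked."""
    if isinstance(value, dict):
        return {
            key: (
                REDACTED
                if isinstance(key, str) and SENSITIVE_KEY_PATTERN.search(key)
                else redact_metadata(item)
            )
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_metadata(item) for item in value]
    # Scalars are returned unchanged; values are not scanned for secrets.
    return value
```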

5. Sandbox Escape

Risk: The SubprocessSandboxProvider in sandbox_provider.py does not provide true isolation. It explicitly states that it does not offer security isolation and is not suitable for production use.
Rating: 🔴 CRITICAL
Attack Vector: An attacker could exploit this lack of isolation to execute malicious commands or access resources outside the intended sandbox.
Recommendation: Clearly document that SubprocessSandboxProvider is for testing only and must not be used in production. Add runtime checks to prevent its use in production environments. Consider implementing a secure sandboxing mechanism (e.g., Docker or Firecracker) for production use.

6. Deserialization Attacks

Risk: No evidence of unsafe deserialization in the changes. The yaml.safe_load() guideline is followed in the copilot-instructions.md.
Rating: 🔵 LOW
Recommendation: No action required for this PR.

7. Race Conditions

Risk: The event_bus.py module introduces a publish/subscribe mechanism that could be vulnerable to race conditions if multiple threads or processes publish or handle events concurrently.
Rating: 🟡 MEDIUM
Attack Vector: Concurrent access to the _handlers dictionary or _history list could lead to data corruption or unexpected behavior.
Recommendation: Use thread-safe data structures (e.g., queue.Queue) or synchronization primitives (e.g., threading.Lock) to manage _handlers and _history and prevent race conditions.

8. Supply Chain

Risk: The addition of pydantic>=2.4.0 in pyproject.toml introduces a new dependency. While pydantic is a well-known library, its inclusion increases the attack surface.
Rating: 🟡 MEDIUM
Attack Vector: If an attacker compromises the pydantic package on PyPI, they could inject malicious code into the dependency.
Recommendation: Use dependency pinning to a specific version or hash to prevent supply chain attacks. Regularly audit dependencies for vulnerabilities.

Summary of Findings

Issue	Rating	Recommendation
Prompt Injection Defense Bypass	🔵 LOW	Sanitize user inputs in `audit_logger.py` and `sandbox_provider.py`.
Policy Engine Circumvention	🔴 CRITICAL	Remove `NoOpPolicyEvaluator` or fail securely if `agent-os-kernel` is missing.
Trust Chain Weaknesses	🔵 LOW	No action required.
Credential Exposure	🟠 HIGH	Sanitize metadata in `audit_logger.py` to prevent sensitive data exposure.
Sandbox Escape	🔴 CRITICAL	Restrict `SubprocessSandboxProvider` to testing environments only.
Deserialization Attacks	🔵 LOW	No action required.
Race Conditions	🟡 MEDIUM	Use thread-safe structures in `event_bus.py`.
Supply Chain	🟡 MEDIUM	Pin `pydantic` dependency to a specific version or hash.

Final Recommendation

The PR introduces critical security risks, particularly in the compat.py and sandbox_provider.py modules. These issues must be addressed before merging. Additionally, the high-risk credential exposure in audit_logger.py and medium-risk race conditions in event_bus.py should be mitigated.

imran-siddique and others added 5 commits March 24, 2026 09:32

fix: dep confusion + pydantic dependency (post-merge review)

f5854f0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

imran-siddique merged commit 1fdedcf into microsoft:main Mar 24, 2026
57 checks passed

github-actions bot added the documentation, dependencies, tests, ci/cd, and size/XL labels Mar 24, 2026
