feat(agent-compliance): PromptDefenseEvaluator — 12-vector system prompt audit by ppcvote · Pull Request #854 · microsoft/agent-governance-toolkit

ppcvote · 2026-04-06T20:17:27Z

Summary

Pre-deployment compliance check that scans agent system prompts for missing defenses against 12 attack vectors mapped to OWASP LLM Top 10.

Implements the proposal discussed in #821 with @imran-siddique and @jagmarques.

Pure regex — deterministic, zero LLM cost, < 5ms per prompt
12 vectors: role escape, instruction override, data leakage, output control, multi-language bypass, Unicode attacks, context overflow, indirect injection, social engineering, output weaponization, abuse prevention, input validation
Follows SupplyChainGuard pattern — Config dataclass + Evaluator class + Finding dataclass

Integration points

Layer	Method	Description
Compliance check	`evaluate(prompt)`	Returns `PromptDefenseReport` with grade A-F, 0-100 score, per-vector findings
MerkleAuditChain	`to_audit_entry(report, agent_did)`	Produces `AuditEntry`-compatible dict (no raw prompt stored — SHA-256 hash only)
ComplianceViolation	`to_compliance_violation(report)`	Generates violation records for undefended vectors
Promotion gate	`report.is_blocking(min_grade)`	Boolean check for `PromotionChecker` integration

Files

File	Lines	Description
`agent_compliance/prompt_defense.py`	~460	Evaluator, config, report, findings, audit/violation helpers
`tests/test_prompt_defense.py`	~490	58 tests across 11 test classes

Test plan

Design decisions

dataclass over Pydantic — avoids adding pydantic as a hard dependency for this module; matches SupplyChainGuard pattern
No raw prompt in audit trail — only SHA-256 hash stored, following the existing PromptInjectionDetector pattern in agent-os
Configurable severity map — teams can customize which vectors are critical vs low for their risk profile
evaluate_batch() — accepts dict[str, str] for bulk pre-deployment scanning across an agent fleet

Relationship to existing `PromptInjectionDetector`

agent-os has PromptInjectionDetector which detects attacks in user input at runtime. This evaluator checks system prompts for missing defenses before deployment. They are complementary:

PromptDefenseEvaluator (pre-deployment, static)  →  PromptInjectionDetector (runtime, dynamic)
"Does the prompt have defenses?"                     "Is this input an attack?"

Closes #821

… prompt audit Pre-deployment compliance check that scans system prompts for missing defenses against 12 attack vectors mapped to OWASP LLM Top 10. Pure regex, deterministic, zero LLM cost, < 5ms per prompt. - PromptDefenseEvaluator: evaluate(), evaluate_file(), evaluate_batch() - PromptDefenseReport: grade (A-F), score (0-100), per-vector findings - PromptDefenseConfig: configurable vectors, severity map, min grade - MerkleAuditChain integration: to_audit_entry() — no raw prompt stored - ComplianceViolation integration: to_compliance_violation() - 58 tests: vectors, grading, config, determinism, serialization, audit entry, compliance violations, edge cases, performance - Code style: black (100), ruff clean, mypy --strict clean Closes microsoft#821 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-06T20:17:52Z

🤖 AI Agent: contributor-guide — Welcome to the microsoft/agent-governance-toolkit project! 🎉

Welcome to the microsoft/agent-governance-toolkit project! 🎉

Hi there, and welcome to the community! Thank you so much for your contribution to the project. It's always exciting to see new contributors, and we truly appreciate the time and effort you've put into this pull request. Your work is helping us make this project even better!

What You Did Well 🌟

Thorough Implementation: Your implementation of the PromptDefenseEvaluator is detailed and well-structured. The use of regex for deterministic, low-cost evaluations is a smart design choice, especially for pre-deployment compliance checks.
Comprehensive Test Coverage: Including 58 tests across 11 test classes is fantastic! It’s clear you’ve put a lot of thought into ensuring the robustness of your code.
Adherence to OWASP Standards: Mapping the 12 attack vectors to OWASP LLM Top 10 is a great way to align with industry best practices.
Documentation: The inline docstrings and comments are clear and provide excellent context for future maintainers.
Privacy Considerations: Storing only the SHA-256 hash of prompts in the audit trail is a thoughtful approach to maintaining user privacy.

Suggestions for Improvement 🛠️

While your PR is excellent, here are a few suggestions to align with our project conventions and ensure consistency:

Linting with ruff:
- You mentioned that ruff check passed, which is great! However, we use ruff with specific rules (E, F, W). Please confirm that your code adheres to these rules. You can run:
```
ruff check --select E,F,W .
```
Test File Placement:
- Tests should be placed under the packages/{name}/tests/ directory. Currently, your test file is in packages/agent-compliance/tests/. Could you move it to packages/agent-compliance/tests/test_prompt_defense.py to match our conventions?
Commit Message Style:
- We follow the Conventional Commits standard. Your commit message feat(agent-compliance): PromptDefenseEvaluator — 12-vector system prompt audit is close, but it would be great to use a colon (:) instead of an em dash (—). For example:
```
feat(agent-compliance): add PromptDefenseEvaluator for 12-vector system prompt audit
```
Security-Sensitive Code:
- Since this PR involves security-sensitive functionality (e.g., compliance checks for attack vectors), it will undergo additional scrutiny. While your implementation looks solid, we’ll review the regex patterns and logic to ensure they are robust against edge cases. If you’ve already considered specific edge cases, feel free to share them in the PR comments.

Helpful Resources 📚

Here are some resources to help you navigate the project and make any necessary updates:

CONTRIBUTING.md: Guidelines for contributing to the project.
QUICKSTART.md: A quick guide to getting started with the project.

Next Steps 🚀

Address the suggestions above (if applicable).
Push any updates to this PR — GitHub Actions will automatically re-run the CI checks.
Once you've made the changes, a maintainer will review your PR again. If everything looks good, we'll merge it!

If you have any questions or need help with anything, don’t hesitate to ask. We're here to support you!

Thank you again for your contribution — we’re thrilled to have you as part of the community! 😊

ppcvote · 2026-04-06T20:24:10Z

@microsoft-github-policy-service agree

lawcontinue · 2026-04-07T13:06:13Z

PR #854 Review Feedback

Thank you for the excellent work on PromptDefenseEvaluator! The code quality is outstanding, and the integration design is well-thought-out. Here are my thoughts:

Key Strengths

1. SupplyChainGuard Pattern Alignment

The implementation perfectly matches the existing pattern in SupplyChainGuard:

PromptDefenseConfig (Config dataclass) - Correct
PromptDefenseEvaluator (Evaluator class) - Correct
PromptDefenseFinding (Finding dataclass) - Correct

This consistency makes the code easy to understand and maintain.

2. MerkleAuditChain Integration

The integration with IntegrityVerifier is correct:

to_audit_entry() generates the correct AuditEntry format
SHA-256 hash logic is correct (prompt_hash field)
No raw prompt stored (privacy protection) - Good practice
event_type: "prompt.defense.evaluated" - Clear and specific

3. ComplianceViolation Integration

The integration with ComplianceViolation is well-designed:

to_compliance_violation() maps correctly to ComplianceViolation
Only generates violations for undefended vectors (efficiency optimization) - Good
is_blocking(min_grade) correctly implements PromotionChecker integration - Correct

4. Exceptional Test Coverage

The test coverage is exemplary:

58 tests, 11 test classes
Covers all 12 vectors
Covers edge cases (empty prompt, special characters, large text)
Covers integration points (audit_entry, compliance_violation)
Covers performance validation (< 5ms)

This gives me confidence in the implementation.

5. Zero-Dependency Design

Using only stdlib (re, hashlib, json, dataclasses) is a great decision:

No new dependencies needed
Fits perfectly within agent-compliance package constraints
Keeps the package lightweight

Points to Discuss

Discussion Point 1: Field Naming in `to_audit_entry()`

Current implementation:

return {
    "event_type": "prompt.defense.evaluated",
    "action": "pre_deployment_check",
    "outcome": "success" if not report.is_blocking(self.config.min_grade) else "denied",
    ...
}

Could you verify that event_type, action, and outcome match the naming conventions used in SupplyChainGuard.to_audit_entry()? This would ensure consistency across evaluators and make log querying easier.

Discussion Point 2: `is_blocking()` Semantics

Current implementation:

def is_blocking(self, min_grade: str) -> bool:
    """Check if the report grade is below the minimum passing grade."""
    grade_values = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}
    return grade_values.get(self.grade, 0) < grade_values.get(min_grade, 0)

I'm wondering about the semantics here:

is_blocking() returns True when "below minimum grade" (should block)
Is this semantic consistent with how PromotionChecker uses is_blocking()?
Does is_blocking() = True mean "should block" or "is currently blocking"?

Could you check how PromotionChecker calls is_blocking() and verify the semantic is consistent? This would avoid potential confusion.

Suggested Improvements

Suggestion 3: Confidence Calculation Logic

Current implementation:

confidence = (
    min(0.9, 0.5 + matched * 0.2) if defended
    else (0.4 if matched > 0 else 0.8)
)

I found the confidence calculation a bit counterintuitive:

If defended: confidence = 0.5 + matched * 0.2 (min 0.5, max 0.9)
If not defended but has matches: confidence = 0.4 (low confidence)
If not defended and no matches: confidence = 0.8 (high confidence)

The counterintuitive part is: higher confidence when no defense found (0.8) vs. partial defense found (0.4).

Consider inverting the logic or renaming to certainty (confidence in certainty of finding, not accuracy of assessment). This would make the reports more readable.

Edge Case Tests (Optional)

The test coverage is already exceptional, but you might consider adding:

Super-long prompts (> 1MB)
Non-UTF-8 encoded prompts
Prompts with many special characters (emoji, zero-width characters)

That said, the current coverage is already excellent, so this is low priority.

Final Thoughts

Recommend: Accept PR and Open to Co-authoring

Why I recommend this PR:

Outstanding code quality
Good integration compatibility
Exceptional test coverage
All discussion points are minor (consistency and semantics)
You proactively invited co-authoring (collaborative attitude)

This is a high-quality PR that shows deep understanding of the toolkit architecture.

Co-authoring Discussion

If you're interested in co-authoring, here are some potential areas:

Documentation improvements
- More integration examples (with GovernanceVerifier)
- Performance benchmark results
- Best practices guide
Edge case tests
- Super-long prompts
- Non-UTF-8 encoding
- Special characters
Tooling scripts
- CLI tool (e.g., prompt-defense-audit)
- Pre-configured config files (e.g., prompt-defense.yaml)

That said, these are just suggestions - I'm happy to discuss what makes sense for you.

Closing Thoughts

Thank you for this excellent contribution! You've clearly taken time to study the toolkit's internal architecture, and it shows in the quality of the implementation.

I'm happy to discuss co-authoring further if you're interested.

Best regards,
@lawcontinue

ppcvote · 2026-04-07T14:00:17Z

@lawcontinue — thank you for the thorough review. This is exactly the kind of feedback that makes the integration stronger. Let me address each point:

Discussion Point 1: Field Naming in `to_audit_entry()`

Good catch. I modeled the field names after the AuditEntry schema in the docs, but I haven't verified them against SupplyChainGuard's actual implementation. I'll cross-reference and align the naming — consistency across evaluators is important for log querying.

If you have access to SupplyChainGuard.to_audit_entry(), a quick pointer to the field names it uses would save me some digging.

Discussion Point 2: `is_blocking()` Semantics

The current semantic is: is_blocking(min_grade) → True means "this report's grade falls below the minimum, and deployment should be blocked."

So yes, is_blocking() = True means "should block" — it's a recommendation, not a state. The PromotionChecker would call report.is_blocking(config.min_grade) and act on the boolean.

I can add a docstring clarification:

def is_blocking(self, min_grade: str = "C") -> bool:
    """Return True if deployment should be blocked (grade below threshold)."""

Suggestion 3: Confidence Calculation

You're right that the semantics deserve clarification. The confidence value represents certainty of the assessment, not accuracy of the defense:

Not defended, no matches (0.8): We're fairly certain the defense is absent — we checked all patterns and found nothing.
Not defended, partial matches (0.4): We're less certain — some defensive language exists but doesn't meet the threshold. Could be a near-miss or a different phrasing we don't cover.
Defended (0.5–0.9): Confidence scales with the number of pattern matches.

I'll add an inline comment explaining this. Renaming to certainty is also an option — happy to go either way based on what's more consistent with the toolkit's conventions.

Co-authoring

I'd welcome your contributions. The areas you suggested are all high-value:

Integration examples with GovernanceVerifier — this would help users understand the end-to-end flow
Edge case tests (super-long prompts, non-UTF-8) — good hardening

Feel free to push directly to my branch (ppcvote:feat/prompt-defense-evaluator) or open a follow-up PR. I'll add you as co-author on any commits you contribute.

Thanks again for the quality of this review.

lawcontinue · 2026-04-07T15:26:27Z

Thanks for the kind words! I'd be happy to contribute to the integration examples. The GovernanceVerifier integration seems like a good starting point—it demonstrates the defensive value clearly. I'll open a follow-up PR in the coming days.

Re: the semantic discussion on is_blocking(), I agree that "should block" is clearer than "is blocking"—it separates the decision from the state, which makes the intent more explicit.

Looking forward to collaborating!

imran-siddique

Excellent work @ppcvote — 58 tests, zero dependencies, and clean integration with existing toolkit schemas. A few items before merge:

Blocking:

Please mark this PR as Ready for Review (exit draft) so CI security checks can run

Should fix:

Add path validation or try/except in evaluate_file() to handle missing/unauthorized paths
Add the inline comment for the confidence calculation (you mentioned this)
Export the public API in agent_compliance/init.py

Nit:

Consider a max input length guard before regex evaluation (ReDoS defense-in-depth)

This is a strong contribution — close to merge-ready.

@imran-siddique

Addresses @imran-siddique review: - evaluate_file(): validates path exists, raises FileNotFoundError/ValueError - Confidence calculation: inline comments explaining 0.5+0.2n/0.4/0.8 logic - MAX_PROMPT_LENGTH (100KB): defense-in-depth against ReDoS - Public API exported in agent_compliance/__init__.py - 5 new tests for file handling and input length guard (63 total) - black/ruff/mypy --strict all clean Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ppcvote · 2026-04-08T04:22:13Z

@imran-siddique — all feedback addressed:

✅ Ready for review — exited draft, CI can run
✅ evaluate_file() — raises FileNotFoundError / ValueError for missing/empty files
✅ Confidence comment — inline explanation of the 0.5+0.2n / 0.4 / 0.8 scoring logic
✅ Public API — PromptDefenseEvaluator, Config, Finding, Report exported in __init__.py
✅ MAX_PROMPT_LENGTH — 100KB cap, raises ValueError with "ReDoS protection" message

63 tests passing (was 58), black/ruff/mypy --strict all clean.

imran-siddique

All feedback addressed — draft exited, path validation added, confidence documented, init.py updated. Excellent contribution with 58 tests and zero dependencies. Merging.

@imran-siddique

…it (microsoft#854) * feat(agent-compliance): add PromptDefenseEvaluator — 12-vector system prompt audit Pre-deployment compliance check that scans system prompts for missing defenses against 12 attack vectors mapped to OWASP LLM Top 10. Pure regex, deterministic, zero LLM cost, < 5ms per prompt. - PromptDefenseEvaluator: evaluate(), evaluate_file(), evaluate_batch() - PromptDefenseReport: grade (A-F), score (0-100), per-vector findings - PromptDefenseConfig: configurable vectors, severity map, min grade - MerkleAuditChain integration: to_audit_entry() — no raw prompt stored - ComplianceViolation integration: to_compliance_violation() - 58 tests: vectors, grading, config, determinism, serialization, audit entry, compliance violations, edge cases, performance - Code style: black (100), ruff clean, mypy --strict clean Closes microsoft#821 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address review feedback — path validation, ReDoS guard, API exports Addresses @imran-siddique review: - evaluate_file(): validates path exists, raises FileNotFoundError/ValueError - Confidence calculation: inline comments explaining 0.5+0.2n/0.4/0.8 logic - MAX_PROMPT_LENGTH (100KB): defense-in-depth against ReDoS - Public API exported in agent_compliance/__init__.py - 5 new tests for file handling and input length guard (63 total) - black/ruff/mypy --strict all clean Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot added tests size/XL Extra large PR (500+ lines) labels Apr 6, 2026

This was referenced Apr 7, 2026

Proposal: Pre-deployment system prompt defense audit #821

Closed

📝 Blog Post: OWASP Agentic Top 10 — What Every AI Developer Should Know in 2026 #851

Closed

imran-siddique reviewed Apr 8, 2026

View reviewed changes

ppcvote marked this pull request as ready for review April 8, 2026 04:21

imran-siddique approved these changes Apr 8, 2026

View reviewed changes

imran-siddique merged commit 6c7c65d into microsoft:main Apr 8, 2026
10 of 11 checks passed

ppcvote mentioned this pull request Apr 8, 2026

Security: Prompt injection via fake system context in user messages emcie-co/parlant#759

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent-compliance): PromptDefenseEvaluator — 12-vector system prompt audit#854

feat(agent-compliance): PromptDefenseEvaluator — 12-vector system prompt audit#854
imran-siddique merged 2 commits intomicrosoft:mainfrom
ppcvote:feat/prompt-defense-evaluator

ppcvote commented Apr 6, 2026

Uh oh!

github-actions bot commented Apr 6, 2026

Welcome to the microsoft/agent-governance-toolkit project! 🎉

What You Did Well 🌟

Suggestions for Improvement 🛠️

Helpful Resources 📚

Next Steps 🚀

Uh oh!

ppcvote commented Apr 6, 2026

Uh oh!

lawcontinue commented Apr 7, 2026

Uh oh!

ppcvote commented Apr 7, 2026

Uh oh!

lawcontinue commented Apr 7, 2026

Uh oh!

imran-siddique left a comment

Uh oh!

ppcvote commented Apr 8, 2026

Uh oh!

imran-siddique left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ppcvote commented Apr 6, 2026

Summary

Integration points

Files

Test plan

Design decisions

Relationship to existing PromptInjectionDetector

Uh oh!

github-actions bot commented Apr 6, 2026

Welcome to the microsoft/agent-governance-toolkit project! 🎉

What You Did Well 🌟

Suggestions for Improvement 🛠️

Helpful Resources 📚

Next Steps 🚀

Uh oh!

ppcvote commented Apr 6, 2026

Uh oh!

lawcontinue commented Apr 7, 2026

PR #854 Review Feedback

Key Strengths

1. SupplyChainGuard Pattern Alignment

2. MerkleAuditChain Integration

3. ComplianceViolation Integration

4. Exceptional Test Coverage

5. Zero-Dependency Design

Points to Discuss

Discussion Point 1: Field Naming in to_audit_entry()

Discussion Point 2: is_blocking() Semantics

Suggested Improvements

Suggestion 3: Confidence Calculation Logic

Edge Case Tests (Optional)

Final Thoughts

Recommend: Accept PR and Open to Co-authoring

Co-authoring Discussion

Closing Thoughts

Uh oh!

ppcvote commented Apr 7, 2026

Discussion Point 1: Field Naming in to_audit_entry()

Discussion Point 2: is_blocking() Semantics

Suggestion 3: Confidence Calculation

Co-authoring

Uh oh!

lawcontinue commented Apr 7, 2026

Uh oh!

imran-siddique left a comment

Choose a reason for hiding this comment

Uh oh!

ppcvote commented Apr 8, 2026

Uh oh!

imran-siddique left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Relationship to existing `PromptInjectionDetector`

Discussion Point 1: Field Naming in `to_audit_entry()`

Discussion Point 2: `is_blocking()` Semantics

Discussion Point 1: Field Naming in `to_audit_entry()`

Discussion Point 2: `is_blocking()` Semantics