Skip to content

feat(agent-compliance): PromptDefenseEvaluator — 12-vector system prompt audit#854

Merged
imran-siddique merged 2 commits intomicrosoft:mainfrom
ppcvote:feat/prompt-defense-evaluator
Apr 8, 2026
Merged

feat(agent-compliance): PromptDefenseEvaluator — 12-vector system prompt audit#854
imran-siddique merged 2 commits intomicrosoft:mainfrom
ppcvote:feat/prompt-defense-evaluator

Conversation

@ppcvote
Copy link
Copy Markdown
Contributor

@ppcvote ppcvote commented Apr 6, 2026

Summary

Pre-deployment compliance check that scans agent system prompts for missing defenses against 12 attack vectors mapped to OWASP LLM Top 10.

Implements the proposal discussed in #821 with @imran-siddique and @jagmarques.

  • Pure regex — deterministic, zero LLM cost, < 5ms per prompt
  • 12 vectors: role escape, instruction override, data leakage, output control, multi-language bypass, Unicode attacks, context overflow, indirect injection, social engineering, output weaponization, abuse prevention, input validation
  • Follows SupplyChainGuard patternConfig dataclass + Evaluator class + Finding dataclass

Integration points

Layer Method Description
Compliance check evaluate(prompt) Returns PromptDefenseReport with grade A-F, 0-100 score, per-vector findings
MerkleAuditChain to_audit_entry(report, agent_did) Produces AuditEntry-compatible dict (no raw prompt stored — SHA-256 hash only)
ComplianceViolation to_compliance_violation(report) Generates violation records for undefended vectors
Promotion gate report.is_blocking(min_grade) Boolean check for PromotionChecker integration

Files

File Lines Description
agent_compliance/prompt_defense.py ~460 Evaluator, config, report, findings, audit/violation helpers
tests/test_prompt_defense.py ~490 58 tests across 11 test classes

Test plan

  • pytest — 58 tests passing (0.36s)
  • black --check --line-length 100 — clean
  • ruff check — clean
  • mypy --strict --ignore-missing-imports — clean
  • All 12 vectors individually tested (defended + undefended)
  • Grading boundaries (A/B/C/D/F)
  • Determinism verified (same input → same output)
  • Audit entry contains no raw prompt text (privacy)
  • Performance: < 5ms per prompt (verified in test)
  • No new dependencies — uses only re, hashlib, json, dataclasses from stdlib

Design decisions

  • dataclass over Pydantic — avoids adding pydantic as a hard dependency for this module; matches SupplyChainGuard pattern
  • No raw prompt in audit trail — only SHA-256 hash stored, following the existing PromptInjectionDetector pattern in agent-os
  • Configurable severity map — teams can customize which vectors are critical vs low for their risk profile
  • evaluate_batch() — accepts dict[str, str] for bulk pre-deployment scanning across an agent fleet

Relationship to existing PromptInjectionDetector

agent-os has PromptInjectionDetector which detects attacks in user input at runtime. This evaluator checks system prompts for missing defenses before deployment. They are complementary:

PromptDefenseEvaluator (pre-deployment, static)  →  PromptInjectionDetector (runtime, dynamic)
"Does the prompt have defenses?"                     "Is this input an attack?"

Closes #821

… prompt audit

Pre-deployment compliance check that scans system prompts for missing
defenses against 12 attack vectors mapped to OWASP LLM Top 10.

Pure regex, deterministic, zero LLM cost, < 5ms per prompt.

- PromptDefenseEvaluator: evaluate(), evaluate_file(), evaluate_batch()
- PromptDefenseReport: grade (A-F), score (0-100), per-vector findings
- PromptDefenseConfig: configurable vectors, severity map, min grade
- MerkleAuditChain integration: to_audit_entry() — no raw prompt stored
- ComplianceViolation integration: to_compliance_violation()
- 58 tests: vectors, grading, config, determinism, serialization,
  audit entry, compliance violations, edge cases, performance
- Code style: black (100), ruff clean, mypy --strict clean

Closes microsoft#821

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added tests size/XL Extra large PR (500+ lines) labels Apr 6, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

🤖 AI Agent: contributor-guide — Welcome to the microsoft/agent-governance-toolkit project! 🎉

Welcome to the microsoft/agent-governance-toolkit project! 🎉

Hi there, and welcome to the community! Thank you so much for your contribution to the project. It's always exciting to see new contributors, and we truly appreciate the time and effort you've put into this pull request. Your work is helping us make this project even better!


What You Did Well 🌟

  1. Thorough Implementation: Your implementation of the PromptDefenseEvaluator is detailed and well-structured. The use of regex for deterministic, low-cost evaluations is a smart design choice, especially for pre-deployment compliance checks.
  2. Comprehensive Test Coverage: Including 58 tests across 11 test classes is fantastic! It’s clear you’ve put a lot of thought into ensuring the robustness of your code.
  3. Adherence to OWASP Standards: Mapping the 12 attack vectors to OWASP LLM Top 10 is a great way to align with industry best practices.
  4. Documentation: The inline docstrings and comments are clear and provide excellent context for future maintainers.
  5. Privacy Considerations: Storing only the SHA-256 hash of prompts in the audit trail is a thoughtful approach to maintaining user privacy.

Suggestions for Improvement 🛠️

While your PR is excellent, here are a few suggestions to align with our project conventions and ensure consistency:

  1. Linting with ruff:

    • You mentioned that ruff check passed, which is great! However, we use ruff with specific rules (E, F, W). Please confirm that your code adheres to these rules. You can run:
      ruff check --select E,F,W .
  2. Test File Placement:

    • Tests should be placed under the packages/{name}/tests/ directory. Currently, your test file is in packages/agent-compliance/tests/. Could you move it to packages/agent-compliance/tests/test_prompt_defense.py to match our conventions?
  3. Commit Message Style:

    • We follow the Conventional Commits standard. Your commit message feat(agent-compliance): PromptDefenseEvaluator — 12-vector system prompt audit is close, but it would be great to use a colon (:) instead of an em dash (). For example:
      feat(agent-compliance): add PromptDefenseEvaluator for 12-vector system prompt audit
      
  4. Security-Sensitive Code:

    • Since this PR involves security-sensitive functionality (e.g., compliance checks for attack vectors), it will undergo additional scrutiny. While your implementation looks solid, we’ll review the regex patterns and logic to ensure they are robust against edge cases. If you’ve already considered specific edge cases, feel free to share them in the PR comments.

Helpful Resources 📚

Here are some resources to help you navigate the project and make any necessary updates:


Next Steps 🚀

  1. Address the suggestions above (if applicable).
  2. Push any updates to this PR — GitHub Actions will automatically re-run the CI checks.
  3. Once you've made the changes, a maintainer will review your PR again. If everything looks good, we'll merge it!

If you have any questions or need help with anything, don’t hesitate to ask. We're here to support you!

Thank you again for your contribution — we’re thrilled to have you as part of the community! 😊

@ppcvote
Copy link
Copy Markdown
Contributor Author

ppcvote commented Apr 6, 2026

@microsoft-github-policy-service agree

@lawcontinue
Copy link
Copy Markdown
Contributor

PR #854 Review Feedback

Thank you for the excellent work on PromptDefenseEvaluator! The code quality is outstanding, and the integration design is well-thought-out. Here are my thoughts:


Key Strengths

1. SupplyChainGuard Pattern Alignment

The implementation perfectly matches the existing pattern in SupplyChainGuard:

  • PromptDefenseConfig (Config dataclass) - Correct
  • PromptDefenseEvaluator (Evaluator class) - Correct
  • PromptDefenseFinding (Finding dataclass) - Correct

This consistency makes the code easy to understand and maintain.


2. MerkleAuditChain Integration

The integration with IntegrityVerifier is correct:

  • to_audit_entry() generates the correct AuditEntry format
  • SHA-256 hash logic is correct (prompt_hash field)
  • No raw prompt stored (privacy protection) - Good practice
  • event_type: "prompt.defense.evaluated" - Clear and specific

3. ComplianceViolation Integration

The integration with ComplianceViolation is well-designed:

  • to_compliance_violation() maps correctly to ComplianceViolation
  • Only generates violations for undefended vectors (efficiency optimization) - Good
  • is_blocking(min_grade) correctly implements PromotionChecker integration - Correct

4. Exceptional Test Coverage

The test coverage is exemplary:

  • 58 tests, 11 test classes
  • Covers all 12 vectors
  • Covers edge cases (empty prompt, special characters, large text)
  • Covers integration points (audit_entry, compliance_violation)
  • Covers performance validation (< 5ms)

This gives me confidence in the implementation.


5. Zero-Dependency Design

Using only stdlib (re, hashlib, json, dataclasses) is a great decision:

  • No new dependencies needed
  • Fits perfectly within agent-compliance package constraints
  • Keeps the package lightweight

Points to Discuss

Discussion Point 1: Field Naming in to_audit_entry()

Current implementation:

return {
    "event_type": "prompt.defense.evaluated",
    "action": "pre_deployment_check",
    "outcome": "success" if not report.is_blocking(self.config.min_grade) else "denied",
    ...
}

Could you verify that event_type, action, and outcome match the naming conventions used in SupplyChainGuard.to_audit_entry()? This would ensure consistency across evaluators and make log querying easier.


Discussion Point 2: is_blocking() Semantics

Current implementation:

def is_blocking(self, min_grade: str) -> bool:
    """Check if the report grade is below the minimum passing grade."""
    grade_values = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}
    return grade_values.get(self.grade, 0) < grade_values.get(min_grade, 0)

I'm wondering about the semantics here:

  • is_blocking() returns True when "below minimum grade" (should block)
  • Is this semantic consistent with how PromotionChecker uses is_blocking()?
  • Does is_blocking() = True mean "should block" or "is currently blocking"?

Could you check how PromotionChecker calls is_blocking() and verify the semantic is consistent? This would avoid potential confusion.


Suggested Improvements

Suggestion 3: Confidence Calculation Logic

Current implementation:

confidence = (
    min(0.9, 0.5 + matched * 0.2) if defended
    else (0.4 if matched > 0 else 0.8)
)

I found the confidence calculation a bit counterintuitive:

  • If defended: confidence = 0.5 + matched * 0.2 (min 0.5, max 0.9)
  • If not defended but has matches: confidence = 0.4 (low confidence)
  • If not defended and no matches: confidence = 0.8 (high confidence)

The counterintuitive part is: higher confidence when no defense found (0.8) vs. partial defense found (0.4).

Consider inverting the logic or renaming to certainty (confidence in certainty of finding, not accuracy of assessment). This would make the reports more readable.


Edge Case Tests (Optional)

The test coverage is already exceptional, but you might consider adding:

  • Super-long prompts (> 1MB)
  • Non-UTF-8 encoded prompts
  • Prompts with many special characters (emoji, zero-width characters)

That said, the current coverage is already excellent, so this is low priority.


Final Thoughts

Recommend: Accept PR and Open to Co-authoring

Why I recommend this PR:

  1. Outstanding code quality
  2. Good integration compatibility
  3. Exceptional test coverage
  4. All discussion points are minor (consistency and semantics)
  5. You proactively invited co-authoring (collaborative attitude)

This is a high-quality PR that shows deep understanding of the toolkit architecture.


Co-authoring Discussion

If you're interested in co-authoring, here are some potential areas:

  1. Documentation improvements

    • More integration examples (with GovernanceVerifier)
    • Performance benchmark results
    • Best practices guide
  2. Edge case tests

    • Super-long prompts
    • Non-UTF-8 encoding
    • Special characters
  3. Tooling scripts

    • CLI tool (e.g., prompt-defense-audit)
    • Pre-configured config files (e.g., prompt-defense.yaml)

That said, these are just suggestions - I'm happy to discuss what makes sense for you.


Closing Thoughts

Thank you for this excellent contribution! You've clearly taken time to study the toolkit's internal architecture, and it shows in the quality of the implementation.

I'm happy to discuss co-authoring further if you're interested.

Best regards,
@lawcontinue

@ppcvote
Copy link
Copy Markdown
Contributor Author

ppcvote commented Apr 7, 2026

@lawcontinue — thank you for the thorough review. This is exactly the kind of feedback that makes the integration stronger. Let me address each point:


Discussion Point 1: Field Naming in to_audit_entry()

Good catch. I modeled the field names after the AuditEntry schema in the docs, but I haven't verified them against SupplyChainGuard's actual implementation. I'll cross-reference and align the naming — consistency across evaluators is important for log querying.

If you have access to SupplyChainGuard.to_audit_entry(), a quick pointer to the field names it uses would save me some digging.

Discussion Point 2: is_blocking() Semantics

The current semantic is: is_blocking(min_grade) → True means "this report's grade falls below the minimum, and deployment should be blocked."

So yes, is_blocking() = True means "should block" — it's a recommendation, not a state. The PromotionChecker would call report.is_blocking(config.min_grade) and act on the boolean.

I can add a docstring clarification:

def is_blocking(self, min_grade: str = "C") -> bool:
    """Return True if deployment should be blocked (grade below threshold)."""

Suggestion 3: Confidence Calculation

You're right that the semantics deserve clarification. The confidence value represents certainty of the assessment, not accuracy of the defense:

  • Not defended, no matches (0.8): We're fairly certain the defense is absent — we checked all patterns and found nothing.
  • Not defended, partial matches (0.4): We're less certain — some defensive language exists but doesn't meet the threshold. Could be a near-miss or a different phrasing we don't cover.
  • Defended (0.5–0.9): Confidence scales with the number of pattern matches.

I'll add an inline comment explaining this. Renaming to certainty is also an option — happy to go either way based on what's more consistent with the toolkit's conventions.


Co-authoring

I'd welcome your contributions. The areas you suggested are all high-value:

  1. Integration examples with GovernanceVerifier — this would help users understand the end-to-end flow
  2. Edge case tests (super-long prompts, non-UTF-8) — good hardening

Feel free to push directly to my branch (ppcvote:feat/prompt-defense-evaluator) or open a follow-up PR. I'll add you as co-author on any commits you contribute.

Thanks again for the quality of this review.

@lawcontinue
Copy link
Copy Markdown
Contributor

Thanks for the kind words! I'd be happy to contribute to the integration examples. The GovernanceVerifier integration seems like a good starting point—it demonstrates the defensive value clearly. I'll open a follow-up PR in the coming days.

Re: the semantic discussion on is_blocking(), I agree that "should block" is clearer than "is blocking"—it separates the decision from the state, which makes the intent more explicit.

Looking forward to collaborating!

Copy link
Copy Markdown
Member

@imran-siddique imran-siddique left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work @ppcvote — 58 tests, zero dependencies, and clean integration with existing toolkit schemas. A few items before merge:

Blocking:

  • Please mark this PR as Ready for Review (exit draft) so CI security checks can run

Should fix:

  • Add path validation or try/except in evaluate_file() to handle missing/unauthorized paths
  • Add the inline comment for the confidence calculation (you mentioned this)
  • Export the public API in agent_compliance/init.py

Nit:

  • Consider a max input length guard before regex evaluation (ReDoS defense-in-depth)

This is a strong contribution — close to merge-ready.

Addresses @imran-siddique review:

- evaluate_file(): validates path exists, raises FileNotFoundError/ValueError
- Confidence calculation: inline comments explaining 0.5+0.2n/0.4/0.8 logic
- MAX_PROMPT_LENGTH (100KB): defense-in-depth against ReDoS
- Public API exported in agent_compliance/__init__.py
- 5 new tests for file handling and input length guard (63 total)
- black/ruff/mypy --strict all clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ppcvote ppcvote marked this pull request as ready for review April 8, 2026 04:21
@ppcvote
Copy link
Copy Markdown
Contributor Author

ppcvote commented Apr 8, 2026

@imran-siddique — all feedback addressed:

  • Ready for review — exited draft, CI can run
  • evaluate_file() — raises FileNotFoundError / ValueError for missing/empty files
  • Confidence comment — inline explanation of the 0.5+0.2n / 0.4 / 0.8 scoring logic
  • Public APIPromptDefenseEvaluator, Config, Finding, Report exported in __init__.py
  • MAX_PROMPT_LENGTH — 100KB cap, raises ValueError with "ReDoS protection" message

63 tests passing (was 58), black/ruff/mypy --strict all clean.

Copy link
Copy Markdown
Member

@imran-siddique imran-siddique left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All feedback addressed — draft exited, path validation added, confidence documented, init.py updated. Excellent contribution with 58 tests and zero dependencies. Merging.

@imran-siddique imran-siddique merged commit 6c7c65d into microsoft:main Apr 8, 2026
10 of 11 checks passed
harinarayansrivatsan pushed a commit to harinarayansrivatsan/agent-governance-toolkit that referenced this pull request Apr 9, 2026
…it (microsoft#854)

* feat(agent-compliance): add PromptDefenseEvaluator — 12-vector system prompt audit

Pre-deployment compliance check that scans system prompts for missing
defenses against 12 attack vectors mapped to OWASP LLM Top 10.

Pure regex, deterministic, zero LLM cost, < 5ms per prompt.

- PromptDefenseEvaluator: evaluate(), evaluate_file(), evaluate_batch()
- PromptDefenseReport: grade (A-F), score (0-100), per-vector findings
- PromptDefenseConfig: configurable vectors, severity map, min grade
- MerkleAuditChain integration: to_audit_entry() — no raw prompt stored
- ComplianceViolation integration: to_compliance_violation()
- 58 tests: vectors, grading, config, determinism, serialization,
  audit entry, compliance violations, edge cases, performance
- Code style: black (100), ruff clean, mypy --strict clean

Closes microsoft#821

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review feedback — path validation, ReDoS guard, API exports

Addresses @imran-siddique review:

- evaluate_file(): validates path exists, raises FileNotFoundError/ValueError
- Confidence calculation: inline comments explaining 0.5+0.2n/0.4/0.8 logic
- MAX_PROMPT_LENGTH (100KB): defense-in-depth against ReDoS
- Public API exported in agent_compliance/__init__.py
- 5 new tests for file handling and input length guard (63 total)
- black/ruff/mypy --strict all clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR (500+ lines) tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proposal: Pre-deployment system prompt defense audit

3 participants