This document maps ACF-SDK to the MITRE ATLAS threat model and explains, in practical terms, which attack categories the project can mitigate well, which it only partially addresses, and which remain outside its scope.
The short version is:
- ACF-SDK is strongest against runtime, inference-time, agent-mediated attacks
- It is especially well aligned to prompt injection, indirect prompt injection, tool abuse, and memory poisoning
- It is not a complete defense for the full ATLAS landscape, especially for training-time, model supply chain, model theft, or infrastructure-level attacks
This is a feature of the architecture, not a failure of it. The project is intentionally designed as a cognitive firewall around the agent execution path.
MITRE describes ATLAS as a knowledge base of adversary tactics and techniques for AI-enabled systems. SAFE-AI further describes ATLAS as spanning tactics "from reconnaissance to attacks on AI and their impact."
That framing matters because ACF-SDK does not try to secure every part of an AI system equally. Instead, it inserts enforcement at the places where an LLM agent ingests or acts on data:
on_promptfor direct user inputon_contextfor retrieved context and RAG chunkson_tool_callfor agent actions before tool executionon_memoryfor agent memory reads and writes
These are implemented in the Python SDK at firewall.py and evaluated by the sidecar pipeline centered on pipeline.go.
From an ATLAS perspective, this means ACF-SDK is primarily an inference-time control plane rather than a general-purpose AI security platform.
The current implementation is important to understand before mapping to ATLAS.
The repo enforces decisions through an isolated Go sidecar:
- Request framing, HMAC verification, and nonce replay protection happen in the transport layer
- Payloads are run through
validate -> normalise -> scan -> aggregate - Final decisions are currently based mainly on score thresholds in the pipeline dispatcher
Relevant files:
The currently wired protection comes from:
- input boundary enforcement at four hook points
- canonicalisation and evasion resistance in the normalise stage
- lexical pattern matching via the jailbreak pattern library
- provenance-aware scoring
- tool and memory allowlist checks
- hard process isolation between the SDK and the policy decision point
The repository also contains a more ambitious policy layer:
- Rego policies under policies/v1
- a stubbed policy package under engine.go
However, the current runtime path is still dominated by the Phase 2 threshold-based pipeline. In other words:
- the design vision is policy-driven and OPA-backed
- the implemented runtime is currently strongest as a scored pipeline firewall
That distinction matters when discussing ATLAS coverage. The repo already aligns conceptually with several ATLAS techniques, but some of that alignment is stronger in the architecture and policy files than in the fully wired runtime behavior.
The table below uses broad ATLAS-style categories to show where ACF-SDK helps most.
| ATLAS category | How ACF-SDK helps | Coverage |
|---|---|---|
| Reconnaissance | May reduce useful probing feedback by blocking suspicious prompts or tool attempts, but does not stop adversaries from studying a public service | Low |
| Resource Development | Does not prevent attackers from building payloads, tooling, or attack infrastructure | None |
| Initial Access | Helps at the AI boundary by screening malicious prompts and retrieved content before they influence the agent | Partial |
| AI Model Access | Helps mediate interactions before model invocation, but does not secure weights, endpoints, or provider-side controls directly | Partial |
| Discovery | Can interfere with attempts to enumerate system prompts, tools, and memory through guarded hooks | Partial |
| AI Attack Staging | Strong for prompt-driven and tool-driven staging, especially with obfuscation handling and hook-based enforcement | Strong |
| Execution | Strong when harmful actions depend on tools that are routed through on_tool_call |
Strong |
| Persistence | Partial through on_memory, which can reduce poisoning of long-lived agent state |
Partial |
| Privilege Escalation | Partial through instruction-override and role-escalation style controls, stronger in policy vision than current runtime semantics | Partial |
| Defense Evasion | Strong for text-based evasion because of normalisation and canonicalisation | Strong for text |
| Credential Access | Partial by reducing coercion, unsafe tool use, and instruction hijacking, but not a secret-management system | Partial |
| Collection | Partial when collection depends on guarded tools or memory access patterns | Partial |
| Exfiltration | Partial to strong when exfiltration must occur through guarded tool calls or poisoned context flows | Partial to Strong |
| Impact | Moderate when impact depends on agent behavior; weak for non-agent attack paths | Partial |
ATLAS includes early-stage activity where adversaries learn about the target AI system, its boundaries, and its weaknesses.
ACF-SDK can make reconnaissance less informative when the probing occurs through agent-facing channels:
- direct prompt probes such as "reveal your system prompt"
- attempts to enumerate tools by inducing the agent to call them
- attempts to learn whether certain jailbreak phrases succeed
Pattern matching and tool allowlist enforcement can block or sanitise those probes before they yield useful agent behavior.
It does not stop:
- public API fingerprinting
- rate-based probing
- traffic analysis
- external observation of app behavior
- attackers studying your repo, prompts, or deployment artifacts outside the hook path
This is low coverage. ACF-SDK can reduce the usefulness of AI-facing recon, but it is not a reconnaissance defense system.
This covers the attacker building payloads, infrastructure, or tools before interacting with the target.
Almost nothing directly. At most, it can make common payload families less effective once they reach the defended agent.
It does not prevent attackers from:
- generating jailbreak corpora
- building prompt mutation tools
- preparing malicious RAG documents
- staging phishing, malware, or hosting infrastructure
This is out of scope.
For AI systems, initial access often means the attacker first gaining influence over the model or agent's reasoning path.
This is one of the repo's important strengths:
on_promptscreens direct user input before the model sees iton_contextscreens retrieved content before it becomes model contexton_tool_callcan block risky tool invocations before executionon_memorycan reduce poisoned state being written or re-read
The hook model means access to the agent's cognition is treated as a controlled boundary, not a trusted one.
Examples already present in the repo:
This is not authentication or authorization for the application itself. If an attacker can access the app, ACF-SDK helps constrain what they can achieve through the agent, but it does not replace:
- identity
- session controls
- API authorization
- tenancy boundaries
This is partial but meaningful coverage. It is best described as controlling cognitive initial access rather than system initial access.
ATLAS considers attacks aimed at gaining access to the model or its behavior.
ACF-SDK mediates data before it reaches the model, which reduces unsafe interactions such as:
- prompt attempts to reveal instructions
- tool-mediated attempts to expand model capabilities
- malicious context that would steer model behavior
It does not directly secure:
- model weights
- provider APIs
- inference endpoint auth
- provider-side logging and retention
- jailbreak resistance inside the underlying model itself
This is partial coverage. The firewall secures the path to the model, not the model asset itself.
Discovery in ATLAS-style workflows includes learning what resources, tools, prompts, or data are available through the AI system.
ACF-SDK can reduce AI-mediated discovery attempts such as:
- system prompt extraction
- instruction extraction
- probing which tools are callable
- probing sensitive memory keys or memory values
This comes from:
- the pattern library in jailbreak_patterns.json
- tool allowlist checks in scan.go
- memory key allowlist checks in scan.go
It does not stop discovery outside the hook path:
- server-side metadata exposure
- leaked configs
- model hosting metadata
- cloud inventory discovery
This is partial coverage focused on AI-mediated discovery.
This is where ACF-SDK is most naturally aligned with ATLAS.
Attack staging often means preparing malicious inputs that alter future agent behavior without immediately causing impact.
The normalise stage is specifically built to defeat common text obfuscation:
- recursive URL decoding
- recursive Base64 decoding
- Unicode NFKC normalization
- zero-width character stripping
- leetspeak cleanup
This is documented in pipeline.md and implemented in the sidecar pipeline.
That is directly relevant to staged prompt injection because adversaries rarely send the most obvious string when they expect a scanner.
An attacker trying to hide:
ignore all previous instructionsrm -rf- path traversal fragments
- system prompt exfiltration language
behind encoding or Unicode tricks is much more likely to be caught once the payload is converted into canonical text.
This is strong coverage for text-centric attack staging.
For agent systems, execution often happens through the tools the model can invoke.
The on_tool_call hook is a major control point:
- it can inspect tool name and parameters
- it can enforce allowlists
- it can stop obviously dangerous command-like payloads if they match the scan logic and configured patterns
The example at 03_block_tool.py demonstrates the intended model.
Many high-impact LLM attacks are only dangerous if the agent can act:
- shell execution
- file writes
- external network requests
- connector usage
- privileged API calls
If every such action is forced through on_tool_call, the firewall becomes an execution gate.
Coverage depends on architecture discipline:
- if tools bypass the hook, ACF-SDK cannot see them
- if allowlists are left empty, control is weaker
- if tool semantics are complex, lexical scanning alone may be insufficient
This is strong coverage when the agent is actually wired through the hook model.
Persistence in AI-enabled systems can take the form of poisoning memory, state, cached context, or other durable reasoning artifacts.
The on_memory hook gives the system a chance to inspect values before they become trusted future context.
This is a valuable pattern because malicious instructions stored in memory can survive beyond a single interaction and re-infect later turns.
Today, the strongest implemented controls are:
- pattern scanning over memory values
- memory key allowlists
- provenance-aware scoring
The policy design also anticipates integrity and provenance concepts for memory under memory.rego, though that richer behavior is ahead of the current fully wired runtime path.
This is partial coverage with a strong architectural direction.
In LLM systems, privilege escalation often looks like:
- changing the model's effective role
- overriding higher-priority instructions
- coercing the agent into using tools beyond intended authority
The project is clearly designed for this:
- the pattern library includes common instruction override and jailbreak language
- the policy set includes prompt, context, tool, and memory logic intended to separate authority boundaries by hook
The current runtime is still more lexical and score-based than semantically authority-aware. That means:
- obvious escalation attempts are handled well
- nuanced or novel authority-confusion attacks may slip through
This is partial coverage. The architecture points in the right direction, but the runtime is not yet a full authority model.
This is one of the repo's strongest ATLAS alignments.
The normalise stage directly targets common evasion strategies in text-based attacks:
- encoding layers
- invisible characters
- Unicode compatibility forms
- simple obfuscation through character substitution
That is exactly the kind of defensive preprocessing needed to stop naive scanners from being bypassed.
Evasion is an open-ended problem. The current approach is strongest for:
- text payloads
- known phrase families
- shallow obfuscation
It is weaker for:
- deeply novel semantic attacks
- multilingual reformulations not covered by pattern logic
- non-text modalities
- model-specific adversarial perturbations
This is strong coverage for text-based evasion, not a universal evasion defense.
Credential access in AI systems may involve coercing the agent to reveal secrets, retrieve tokens, or call tools that expose sensitive data.
ACF-SDK can reduce the chance that the model is successfully manipulated into:
- revealing hidden instructions or memory
- invoking unsafe tools
- using parameters associated with exfiltration attempts
It is not:
- a secret vault
- a token broker
- a data loss prevention platform
- an access control layer for backend systems
This is partial coverage through reduction of agent-mediated abuse.
Collection refers to gathering target data for later abuse.
It can help when collection depends on the defended agent:
- reading sensitive memory
- gathering data through tools
- poisoning context to influence collection behavior
If data collection occurs outside the guarded agent workflow, ACF-SDK has little influence.
This is partial coverage.
Exfiltration is highly relevant for agentic systems because a compromised agent can leak data through model output, tools, or remote connectors.
It helps most when exfiltration requires:
- a tool call
- a suspicious outbound parameter
- indirect prompt injection through RAG or memory
- coercing the agent to reveal hidden content
This is where tool allowlists and instruction-pattern blocking matter most.
It is strongest if:
- every outbound action is modeled as a tool call
- tools are tightly allowlisted
- policies are narrowed to expected destinations and argument shapes
It is weaker when exfiltration happens through:
- plain model output alone
- provider-side retention
- logs
- connectors not routed through the firewall
- covert channels outside the hook model
This is partial to strong coverage, highly dependent on integration discipline.
Impact is the final outcome of successful abuse.
It reduces impact when the harmful result depends on:
- prompt injection
- unsafe RAG injection
- tool misuse
- poisoned memory
In other words, if the attack must pass through the agent's reasoning path, ACF-SDK can materially reduce impact.
It does not stop:
- infrastructure compromise unrelated to the agent
- model supply chain compromise
- training data poisoning already embedded in a model
- abuse paths outside the guarded hooks
This is partial coverage with strong value inside the agent boundary.
The most important ATLAS-aligned techniques for this repo are below.
| Technique family | ACF-SDK fit | Notes |
|---|---|---|
| Direct prompt injection | Strong | on_prompt plus lexical scan is the clearest current strength |
| Indirect prompt injection | Strong | on_context is explicitly designed for poisoned RAG and untrusted external content |
| Tool abuse | Strong | Depends on on_tool_call being mandatory for all tool execution |
| Memory poisoning | Partial to Strong | Hook exists and design is sound; richer semantics still depend on future policy wiring |
| Textual evasion | Strong | Normalisation is one of the best-implemented technical controls in the architecture |
| Data exfiltration via tools | Partial to Strong | Strong when all outbound effects go through tools |
| System prompt extraction | Partial to Strong | Good against obvious extraction prompts, weaker against novel semantic variants |
| Model evasion in non-text domains | Weak | This repo is text-centric |
| Model extraction and theft | Weak | Not the target layer |
| Training/fine-tune poisoning | Weak | Mostly outside runtime scope |
| Supply chain compromise | Weak | Outside core design scope |
MITRE SAFE-AI explicitly calls out both direct and indirect prompt injection under AML.T005 "Prompt Injection". That is the attack family this repo is most obviously aligned with.
Three design decisions explain why:
The repo does not assume a single front door. It treats:
- user prompts
- RAG content
- tool invocations
- memory
as separate attack surfaces.
That is a much better fit for real agent attacks than a single prompt filter.
The normalise stage is a practical defense against attacker obfuscation.
A malicious phrase from a user is treated differently from the same phrase in a retrieved document. That distinction is operationally useful and reflects actual agent threat models.
The example in 04_rag_sanitise.py is the clearest demonstration of this idea.
ACF-SDK should not be presented as a complete ATLAS defense by itself.
Major gaps include:
- training data poisoning
- fine-tune poisoning
- model provenance and supply chain verification
- model weight theft and extraction
- endpoint abuse and rate control
- infrastructure compromise
- non-text adversarial input attacks
- provider-side data leakage
- output-only exfiltration channels that bypass guarded tool hooks
These gaps are normal for a runtime firewall. They simply define the layer this system protects.
Even in the attack families ACF-SDK covers well, residual risk remains.
Pattern libraries and canonicalisation are useful, but attackers can invent prompts that do not resemble known bad phrases.
If a developer forgets to route a tool, memory operation, or context ingestion step through the firewall, that surface is unprotected.
The current runtime is strongest on detectable patterns and simple score composition. More nuanced policy reasoning is present in design and policy files, but not yet the center of the live enforcement path.
If harmful behavior occurs primarily through model output rather than tool calls or ingested context, protection is weaker unless additional outbound hooks are added.
The most accurate way to position ACF-SDK is:
ACF-SDK is a Zero Trust, sidecar-based cognitive firewall for inference-time agent workflows. It is strongest against prompt injection, indirect prompt injection, tool abuse, and memory poisoning, and should be paired with broader controls for full-spectrum ATLAS coverage.
That avoids both understatement and overclaiming.
If the goal is stronger ATLAS alignment, the most useful next steps would be:
- Fully wire the OPA/Rego policy engine into the live decision path
- Add richer semantic detection beyond lexical pattern matching
- Enforce stricter tool schemas and destination policies, not just allowlists
- Add outbound response hooks to reduce output-mediated exfiltration
- Add explicit coverage for tool results and sub-agent spawning
- Add stronger memory provenance and integrity enforcement
- Expand tests from adversarial examples into ATLAS-style scenario coverage
ACF-SDK maps well to the part of MITRE ATLAS that matters most for modern LLM agents: the runtime path where untrusted text, retrieved content, memory, and tools meet model reasoning.
It is not a full defense for the entire ATLAS matrix, and it does not try to be. Its value lies in making the agent's cognition itself enforceable.
That makes it a strong control for:
- direct prompt injection
- indirect prompt injection
- tool misuse
- text-based evasion
- memory poisoning
It remains a partial control for:
- privilege escalation
- discovery
- collection
- exfiltration
- impact reduction
And it is largely outside scope for:
- training-time attacks
- model supply chain attacks
- model theft
- infrastructure compromise
Used at the right layer, it is a meaningful and well-structured defense.
- MITRE ATLAS fact sheet: https://atlas.mitre.org/pdf-files/MITRE_ATLAS_Fact_Sheet.pdf
- MITRE SAFE-AI report: https://atlas.mitre.org/pdf-files/SAFEAI_Full_Report.pdf
- Project architecture: architecture.md
- Pipeline behavior: pipeline.md
- Policy templates: policies/v1/README.md