Status: Draft
Date: 2026-02-27
License: Apache 2.0
Authors: Contributors
The Document Sovereignty Protocol & Privacy (DSSP) is an open standard that defines how sensitive documents are exposed, accessed, processed, and audited — without the documents ever leaving infrastructure controlled by the document owner.
DSSP enables organizations to share document processing capabilities with third-party tools, auditors, and agents while maintaining full sovereignty over their data. No document content or personally identifiable information (PII) crosses the owner's infrastructure boundary. Only structured extraction results, with provable redaction, leave the boundary.
Organizations in regulated industries (financial services, healthcare, legal, government) must frequently share sensitive documents with external parties for processing: audit firms extracting bank statement data, healthcare processors reading claims, legal teams reviewing contracts.
Current approaches force a choice between:
- Upload to third-party SaaS: Document content leaves the owner's control. Raises compliance, licensing, and "do you train on our data?" concerns.
- Manual exchange: Email, file shares, physical media. No audit trail, no access control, no revocability.
- Vendor-specific portals: Lock-in to a single provider's platform and terms.
DSSP eliminates this choice by defining a protocol where:
- Documents stay on the owner's infrastructure.
- Processing happens in attested compute environments.
- Only structured, PII-redacted results exit the boundary.
- Every operation is cryptographically auditable.
- Data Residency — Documents MUST NOT leave the owner's storage boundary.
- Processing Isolation — Document content MUST only be accessed inside attested compute (enclaves or equivalent).
- Result Sanitization — Only structured results exit the boundary — never raw content, never unredacted PII.
- Provable Integrity — Every operation MUST produce a cryptographic attestation.
- Owner Sovereignty — The document owner decides who can do what, and can revoke at any time.
- Defense in Depth — No single mechanism is trusted alone. Redaction rules, result scanning, enclave attestation, sidecar verification, and privacy budgets form overlapping defenses.
- AI-Aware by Default — The protocol explicitly addresses LLM/AI agent risks including free-text leakage, non-deterministic outputs, and prompt injection.
| Term | Definition |
|---|---|
| Owner | The organization that owns and controls the documents |
| Consumer | An external organization that needs to process documents (e.g., audit firm) |
| Agent | Software that runs inside an attested enclave to process documents |
| Agent Type | Processing model: deterministic, ml_structured, or llm_freeform |
| Gateway | The DSSP orchestration layer that manages manifests, contracts, and results |
| Manifest | Metadata describing available documents — never containing content |
| Contract | Policy defining what a consumer can do, enforced by the protocol |
| Result Envelope | Structured extraction output with attestation proof |
| Result Scanner | Independent process that inspects results for PII before they exit |
| Sidecar Verifier | Independent enclave co-process that monitors agent network/memory/syscalls |
| Audit Event | Immutable record of a protocol operation |
| Enclave / TEE | Trusted Execution Environment providing hardware-level isolation |
| Attestation | Cryptographic proof that code ran in an enclave with specific properties |
| PII | Personally Identifiable Information as defined by applicable regulation |
| PII-bearing field | A field that MAY contain embedded PII in free-text (e.g., transaction descriptions) |
| Privacy Budget | Quantitative limit on information extractable across sessions |
| Document Sanitization | Pre-processing that removes hidden content and injection patterns |
| Split-Knowledge | Architecture where no single party can reconstruct the full picture |
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
DSSP consists of five layers, each independently specifiable:
┌───────────────────────────────────────────────────┐
│ Layer 4: AUDIT LEDGER │
│ Immutable chain of operation records │
├───────────────────────────────────────────────────┤
│ Layer 3: RESULT ENVELOPE │
│ Structured extractions + attestation proofs │
│ + result scanning verdicts │
├───────────────────────────────────────────────────┤
│ Layer 2: PROCESSING CONTRACT │
│ Permissions, restrictions, attestation policy │
│ + AI-specific controls + privacy budget │
├───────────────────────────────────────────────────┤
│ Layer 1: DOCUMENT MANIFEST │
│ Metadata catalog — never content │
├───────────────────────────────────────────────────┤
│ Layer 0: STORAGE BINDING │
│ Abstract interface to any storage backend │
└───────────────────────────────────────────────────┘
Schema: storage-binding.schema.json
DSSP does NOT mandate any storage technology. It defines an abstract interface with four operations that any storage backend must implement:
- `list_documents`: Enumerate documents and produce manifests (no content exposure)
- `grant_access`: Issue scoped, time-limited tokens to attested enclaves
- `read_document`: Serve encrypted content only to verified enclaves
- `verify_integrity`: Confirm document hashes match stored content
Supported adapters include S3-compatible (MinIO, AWS S3, SeaweedFS, Ceph RGW), Azure Blob Storage, Google Cloud Storage, POSIX filesystems, NFS, and GlusterFS. Customers MAY implement custom adapters.
Security requirements:
- `read_document` MUST verify the caller's attestation token before serving content.
- Access tokens MUST be scoped to specific documents and operations.
- Access tokens MUST expire. Maximum TTL SHOULD be specified in the processing contract.
- TLS MUST be used for all storage communications in production deployments.
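The token requirements above can be sketched as follows. This is a minimal illustration, not part of the specification: the `AccessToken` shape, `issue_token`, and `is_allowed` are hypothetical names, and attestation verification and token signing are left out.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessToken:
    """Hypothetical token shape: scoped to documents and operations, always expiring."""
    document_ids: frozenset
    operations: frozenset
    expires_at: float  # Unix timestamp


def issue_token(document_ids, operations, ttl_seconds, max_ttl_seconds, now=None):
    """Issue a token whose TTL never exceeds the maximum taken from the contract."""
    now = time.time() if now is None else now
    ttl = min(ttl_seconds, max_ttl_seconds)
    return AccessToken(frozenset(document_ids), frozenset(operations), now + ttl)


def is_allowed(token, document_id, operation, now=None):
    """Allow a request only if the token is unexpired and covers both document and operation."""
    now = time.time() if now is None else now
    return (
        now < token.expires_at
        and document_id in token.document_ids
        and operation in token.operations
    )
```

A storage adapter would run a check like `is_allowed` on every `read_document` call, after the caller's attestation has been verified.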
Schema: manifest.schema.json
A manifest describes documents available for processing. It is the discovery mechanism — it tells consumers "here's what exists and what you can do with it."
Critical constraint: A manifest MUST NOT contain document content, text excerpts, image thumbnails, or any data from which PII could be inferred.
A manifest contains for each document:
- Opaque document ID (not the filename)
- Classification (e.g., `financial/bank-statement`)
- Sensitivity level
- Format (MIME type)
- Content hash (for integrity verification)
- Declared PII field types (tells processors what to expect)
- Allowed and denied operations
Schema: contract.schema.json
A processing contract is created by the document owner. It defines:
Permissions:
- Which operations the consumer can perform
- Which documents (by classification, tag, or explicit ID)
- Maximum session duration and document count
- Validity period
Restrictions:
- Network policy (deny all egress, or allow-list specific destinations)
- Storage policy (memory-only, encrypted ephemeral, or encrypted persistent)
- Result policy with PII redaction rules per field type
- Custom regex-based redaction patterns
- Result scanning requirements (§4.4)
- Document sanitization policy (§4.5)
- Privacy budget (§4.6)
- Numeric precision policy (§4.10) — anti-steganographic controls
- Gateway visibility controls (§5.3)
AI Agent Restrictions (§4.3):
- Agent type classification determines scanning rigor
- LLM-specific free-text policies
- Mandatory NER scanning for `llm_freeform` agents
- Sub-agent composition policy (§4.9): which sub-models are allowed, purposes, hash requirements
Attestation requirements:
- Accepted enclave types (SGX, SEV-SNP, TDX, Nitro, CCA)
- Required attestation claims
- Measurement signing authorities
- Attestation freshness window
- Runtime verification controls (§4.7)
Contracts are versioned. The owner can update, suspend, or revoke contracts at any time. Revocation MUST prevent new processing sessions immediately.
Contract revocation MUST propagate from the owner through the gateway to active enclaves within a bounded time window:
- The gateway MUST poll for contract status updates at least every `attestation_freshness_seconds` (default: 300s).
- Alternatively, implementations MAY use a push mechanism (webhook, gRPC stream, or WebSocket) for lower-latency propagation.
- Upon detecting a revocation, the gateway MUST reject new session requests for the revoked contract immediately.
- For running sessions: the gateway MUST send a termination signal to the enclave within `attestation_freshness_seconds` of detecting the revocation.
- The maximum revocation propagation delay is `2 × attestation_freshness_seconds` (one polling interval to detect the revocation plus one interval to terminate running sessions). Implementations SHOULD document their actual propagation latency.
- A `contract.revoked` audit event MUST be emitted when revocation is processed, followed by `session.terminated` events for any active sessions that were terminated as a result.
Schema: result.schema.json
The result envelope is the ONLY artifact that crosses from the owner's infrastructure to the outside world. It contains:
Structured extractions:
- Key-value fields (with PII redacted per contract rules)
- Tables (with column-level PII handling and `pii_bearing` annotations)
- Classification results with confidence scores
- Self-validation checks
Attestation proof:
- Enclave type and measurement
- Agent binary hash (proves which code ran)
- Input document hashes (proves which documents were read)
- Output result hash (proves result integrity)
- Network connection log (proves no unauthorized egress)
- Sub-agent chain (§4.9): ordered list of all sub-models used, with hashes
Result scan report (§4.4):
- Verdicts from each independent scanner (regex, NER, statistical)
- Fields modified by scanning (separate from agent-applied redaction)
- Overall pass/fail determination
End-of-session attestation (§4.7):
- Fresh enclave measurement at session end
- Proof that the measurement matches the start
PII handling report:
- Fields encountered vs. fields redacted
- Redaction methods applied
- Compliance status (compliant / violation_detected / unknown)
Schema: audit-event.schema.json
Every DSSP operation produces an audit event. Events form a Merkle chain: each event references the hash of the previous event, making the ledger tamper-evident.
All hash computations over JSON objects — including event_hash, previous_event_hash,
output_result_hash, and any other hash referenced in this specification — MUST use
RFC 8785 (JSON Canonicalization Scheme) to
produce a deterministic byte sequence before hashing.
This ensures that two independent implementations computing a hash over the same logical JSON object will always produce the same hash value, which is essential for Merkle chain interoperability and cross-implementation verification.
Implementations MUST NOT rely on property insertion order, whitespace formatting, or locale-specific number serialization. RFC 8785 defines canonical treatment of:
- Object key ordering (lexicographic by UTF-16 code unit)
- Number representation (no trailing zeros, no positive sign, no leading zeros)
- String escaping (minimal escaping)
- No whitespace between tokens
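The chaining and canonicalization rules can be sketched as follows. This is a minimal illustration under stated assumptions: `json.dumps` with sorted keys and compact separators only approximates RFC 8785 (it matches for ASCII keys with string and integer values, but not for floating-point numbers or keys outside the Basic Multilingual Plane), so a production implementation would need a real JCS library. The all-zero genesis hash and the rule that `event_hash` is computed over the event without its own `event_hash` field are assumptions of this sketch.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first event's previous hash


def canonical_bytes(obj):
    """Approximate RFC 8785: sorted keys, no whitespace, UTF-8 output."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def event_hash(event):
    """SHA-256 over the canonical form of the event, excluding its own hash field."""
    body = {k: v for k, v in event.items() if k != "event_hash"}
    return hashlib.sha256(canonical_bytes(body)).hexdigest()


def append_event(chain, event_type, payload):
    """Link a new event to the hash of the previous one."""
    prev = chain[-1]["event_hash"] if chain else GENESIS
    event = {"event_type": event_type, "payload": payload, "previous_event_hash": prev}
    event["event_hash"] = event_hash(event)
    chain.append(event)
    return event


def verify_chain(chain):
    """Recompute every hash and check every back-link."""
    prev = GENESIS
    for event in chain:
        if event["previous_event_hash"] != prev or event["event_hash"] != event_hash(event):
            return False
        prev = event["event_hash"]
    return True
```

Changing any field of an earlier event changes its hash, which breaks the `previous_event_hash` link of every later event. That is what makes the ledger tamper-evident.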
Event types cover the full lifecycle:
- Manifest creation, updates, expiry
- Contract creation, updates, suspension, revocation
- Session start, completion, failure, termination
- Document access and processing
- Result production, delivery, scanning outcomes
- Attestation verification (start, heartbeat, end-of-session)
- Document sanitization and injection detection
- Sidecar verifier anomaly detection
- Privacy budget consumption and exhaustion
- Violation detection and escalation
The audit ledger is stored on the owner's infrastructure. Events MAY be replicated to the DSSP Gateway for dashboard visibility, but replicated events MUST NOT contain PII.
                  DOCUMENT OWNER
                         │
                  trusts nothing
                    by default
                         │
     ┌───────────────────┼─────────────────┐
     ▼                   ▼                 ▼
DSSP Gateway      Processing Agent   Consumer App
     │                   │                 │
Can see:          Can see:           Can see:
manifests         documents          result envelopes
audit logs        (in enclave        manifests (filtered)
results           only)
(filtered)               │                 │
     │            Cannot see:        Cannot see:
Cannot see:       other docs         documents
documents         network            raw content
PII               owner keys         other consumers
raw text
- Agent boots inside a TEE (enclave).
- Platform provides hardware attestation (CPU-signed measurement).
- Agent presents attestation to the storage adapter via `grant_access`.
- Storage adapter verifies attestation against the contract's requirements.
- If verified, storage issues a scoped access token.
- Document sanitization runs before the agent processes content (§4.5).
- Agent processes documents, produces a result envelope.
- Result scanning independently inspects the result for PII leakage (§4.4).
- Result envelope includes the attestation proof and scan verdicts.
- End-of-session attestation proves enclave integrity throughout (§4.7).
- Gateway and owner can independently verify the attestation chain.
| Verifier | Can verify |
|---|---|
| Owner | Full audit trail, attestation proofs, result integrity, PII compliance, scan verdicts, privacy budget consumption |
| Consumer | Result integrity, attestation proof (proves their agent ran correctly) |
| Regulator | Audit chain integrity, PII handling compliance, data residency, privacy budget adherence |
| Gateway | Attestation validity, contract compliance, result schema conformance, scan pass/fail |
This section specifies the cryptographic algorithms and formats used throughout the protocol.
| Algorithm | Status | Use Case |
|---|---|---|
| Ed25519 | REQUIRED | Default signature algorithm. All implementations MUST support Ed25519. |
| ECDSA P-256 (secp256r1) | RECOMMENDED | Interoperability with existing PKI infrastructure. |
| RSA-2048+ | MAY | Legacy compatibility. Key size MUST be at least 2048 bits. |
Implementations MUST support Ed25519. Implementations SHOULD support ECDSA P-256. If multiple algorithms are supported, the attestation token MUST indicate which algorithm was used.
The input to any signature operation MUST be the RFC 8785
canonical JSON serialization of the object being signed, excluding the signature
field itself. Specifically:
- Remove the
signaturefield from the JSON object (if present). - Serialize the remaining object using RFC 8785 canonical form.
- Compute the signature over the resulting byte sequence.
This applies to AttestationToken.signature, end_of_session_attestation.signature,
SubAgentAttestation.separate_attestation.signature, and any other signature field
defined in this specification.
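The signature input and the signature encoding can be sketched as follows. This is a minimal illustration under stated assumptions: the `json.dumps` call only approximates RFC 8785 (adequate for ASCII keys with string and integer values), the `algorithm` field in the example is a hypothetical name, and the Ed25519 signing call itself is omitted because it needs a third-party library.

```python
import base64
import json


def signing_input(obj):
    """Drop the signature field, then serialize the rest in (approximate) canonical form."""
    unsigned = {k: v for k, v in obj.items() if k != "signature"}
    return json.dumps(
        unsigned, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def b64url_nopad(raw):
    """Base64url (RFC 4648 §5) with the trailing '=' padding removed."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Example: the bytes a signer would sign for a hypothetical attestation token
token = {"enclave_type": "sgx", "algorithm": "Ed25519", "signature": "placeholder"}
payload = signing_input(token)
```

A verifier rebuilds the same bytes from the received object, so the value of the `signature` field never influences what is being verified.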
| Data | Format |
|---|---|
| Signatures | Base64url encoding (RFC 4648 §5), no padding |
| Hash digests | Hex-encoded lowercase string (as defined in HashDigest) |
| Public keys (exchange) | JWK (RFC 7517) |
| Certificates | PEM-encoded X.509 |
| Algorithm | Status | Use Case |
|---|---|---|
| SHA-256 | REQUIRED | Baseline for Merkle chain (event_hash, previous_event_hash), document integrity, result integrity |
| SHA-384 | MAY | Higher security margin where required by policy |
| SHA-512 | MAY | Higher security margin where required by policy |
| BLAKE3 | MAY | Performance-optimized alternative for high-throughput scenarios |
The Merkle chain in the audit ledger MUST use SHA-256 as the baseline hash algorithm. Implementations MAY support additional algorithms but MUST always support SHA-256 for interoperability.
PII safety is not a feature — it is enforced at every protocol layer through multiple overlapping defenses.
| Layer | Protection |
|---|---|
| Manifest | Metadata only. No content, no snippets, no previews. |
| Contract | pii_redaction_rules force masking/hashing before results leave enclave |
| Sanitization | Documents stripped of hidden content and injection patterns before agent sees them |
| Agent | Applies redaction rules from contract |
| Result Scanner | Independent process re-checks results for PII leakage (especially free-text) |
| Result | pii_report + result_scan declare what was redacted and how. Machine-verifiable. |
| Privacy Budget | Statistical limits prevent re-identification across sessions |
| Audit | Events contain IDs and hashes — never content. Even filenames can be hashed. |
| Storage | Documents encrypted at rest with customer-held keys |
PII fields not explicitly listed in the contract's pii_redaction_rules with a
method of allow MUST be suppressed (removed entirely) from results.
| Method | Behavior |
|---|---|
| `allow` | Pass through unchanged (only for fields the owner explicitly permits) |
| `mask_last_4` | Replace all but last 4 characters with `*` |
| `mask_first_6` | Replace first 6 characters with `*` |
| `mask_all` | Replace entire value with `****` |
| `hash_sha256` | Replace with SHA-256 hash (allows cross-reference without revealing value) |
| `hash_blake3` | Replace with BLAKE3 hash |
| `round_thousands` | Round numeric value to nearest thousand |
| `round_millions` | Round numeric value to nearest million |
| `range_bucket` | Replace with a range (e.g., "$1M-$5M") |
| `suppress` | Remove entirely from output |
| `tokenize` | Replace with a reversible token (owner can de-tokenize) |
| `k_anonymize` | Apply k-anonymity transformation |
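A few of these methods, together with the default-suppress rule, can be sketched as follows. This is a minimal illustration, not the reference behavior: the `redact` function and the flat `field name -> method` rule mapping are hypothetical, every input field is treated as PII, and only a subset of methods is covered.

```python
import hashlib


def mask_last_4(value):
    """Keep the last 4 characters and replace everything before them with '*'."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


REDACTORS = {
    "allow": lambda v: v,
    "mask_last_4": mask_last_4,
    "mask_all": lambda v: "****",
    "hash_sha256": lambda v: hashlib.sha256(str(v).encode("utf-8")).hexdigest(),
    "round_thousands": lambda v: round(v / 1000) * 1000,
}


def redact(fields, rules):
    """Apply the contract rule for each field; fields without a rule are suppressed."""
    out = {}
    for name, value in fields.items():
        method = rules.get(name, "suppress")
        if method == "suppress":
            continue  # removed entirely from the result
        out[name] = REDACTORS[method](value)
    return out


fields = {"iban": "DE89370400440532013000", "name": "Jane Doe", "balance": 1247832.50}
rules = {"iban": "mask_last_4", "balance": "round_thousands"}
redacted = redact(fields, rules)
```

Because `name` has no rule, it does not appear in `redacted` at all, which mirrors the requirement that unlisted PII fields are removed rather than passed through.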
Agents MUST declare their processing model. The agent type determines minimum scanning requirements and privacy budget enforcement:
| Agent Type | Description | Minimum Scanning | Privacy Budget |
|---|---|---|---|
| `deterministic` | Rule-based extraction (regex, template matching). Outputs are predictable and type-safe. | `regex` scanner | RECOMMENDED |
| `ml_structured` | ML model that produces typed fields/tables only. No free-text output. | `regex` + `ner` | RECOMMENDED for `pii-high` |
| `llm_freeform` | LLM that MAY produce free-text output. Non-deterministic. Risk of PII in generated text. | `regex` + `ner` + `llm_output_filter` REQUIRED | REQUIRED |
LLM-specific risks:
- LLMs may memorize training data and reproduce PII in generated text.
- Free-text fields (descriptions, summaries) may contain embedded PII.
- Non-deterministic outputs make testing insufficient — runtime scanning is essential.
- Prompt injection via document content can manipulate LLM behavior.
Result scanning is a separate, independent process (not the agent itself) that inspects result content for PII leakage before it exits the enclave boundary.
Requirements:
- `result_scanning.enabled` MUST be `true` when `agent_type` is `llm_freeform`.
- RECOMMENDED for all other agent types.
- Each scanner runs independently and produces a verdict.
- If ANY scanner fails, the result MUST be handled per `scan_failure_action`: `block_result` (default), `flag_and_deliver`, or `quarantine`.
Scanner types:
| Scanner | Purpose | When required |
|---|---|---|
| `regex` | Pattern-based PII detection (IBANs, SSNs, credit cards) | All agent types |
| `ner` | Named Entity Recognition (persons, orgs, locations) | `ml_structured`, `llm_freeform` |
| `llm_output_filter` | Specialized model for PII in generated free-text | `llm_freeform` |
| `statistical` | Detects re-identification risk (uniqueness analysis) | RECOMMENDED for `pii-high`+ |
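The interaction between scanner verdicts and `scan_failure_action` can be sketched as follows. This is a minimal illustration: the two patterns are deliberately simplistic examples, not a usable PII rule set, and the verdict dictionary shape and the `"deliver"` return value are assumptions of this sketch.

```python
import re

# Minimum scanners per agent type, as listed in the agent type table
REQUIRED_SCANNERS = {
    "deterministic": ["regex"],
    "ml_structured": ["regex", "ner"],
    "llm_freeform": ["regex", "ner", "llm_output_filter"],
}

# Illustrative patterns only; a real scanner needs far broader coverage
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def regex_scan(result_fields):
    """Return a verdict listing every (field, pattern) pair that matched."""
    hits = []
    for name, value in result_fields.items():
        for label, pattern in PII_PATTERNS.items():
            if isinstance(value, str) and pattern.search(value):
                hits.append((name, label))
    return {"scanner": "regex", "passed": not hits, "hits": hits}


def apply_scan_policy(verdicts, scan_failure_action="block_result"):
    """Deliver only if every scanner passed; otherwise apply the contract's action."""
    if all(v["passed"] for v in verdicts):
        return "deliver"
    return scan_failure_action
```

The overall outcome depends on all verdicts together, so one failing scanner is enough to block, flag, or quarantine the result.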
Attestation requirement: Scanner binaries SHOULD be attested separately
from the agent. The contract MAY specify approved_scanner_hashes that the
scanner MUST match.
Document sanitization is a pre-processing layer that cleans documents before the agent processes them. It is the primary defense against prompt injection attacks where malicious content in documents manipulates LLM behavior.
Sanitization steps:
- Strip hidden text layers, white-on-white text, zero-width characters
- Strip JavaScript from PDFs
- Strip embedded files and attachments
- Normalize Unicode to NFC form (collapses canonically equivalent sequences; look-alike characters from other scripts still require separate detection)
- Detect and flag/remove known prompt injection patterns
- Truncate pages exceeding `max_text_length_per_page` (prevents token-stuffing)
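Three of the text-level steps above (zero-width character removal, NFC normalization, and per-page truncation) can be sketched as follows. This is a minimal illustration: `sanitize_page` is a hypothetical helper, the zero-width character list is not exhaustive, and PDF-level steps such as stripping JavaScript or embedded files need a document parser and are not shown.

```python
import unicodedata

# Common zero-width code points (not an exhaustive list), mapped to None for deletion
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def sanitize_page(text, max_text_length_per_page):
    """Strip zero-width characters, normalize to NFC, then truncate the page text."""
    text = text.translate(ZERO_WIDTH)
    text = unicodedata.normalize("NFC", text)
    return text[:max_text_length_per_page]


raw = "Inv\u200boice caf" + "e\u0301" + " " + "x" * 50
clean = sanitize_page(raw, 12)
```

Truncation runs last so that the length limit applies to the text the agent actually receives.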
Requirements:
- RECOMMENDED when `agent_type` is `llm_freeform`.
- The sanitizer binary SHOULD be attested (contract MAY specify `sanitizer_hash`).
- The contract's `attestation_requirements.must_include` SHOULD include `sanitizer_execution_proof` to prove sanitization actually ran.
The privacy budget prevents re-identification attacks where an adversary combines results from multiple sessions to identify individuals.
Controls:
| Control | Purpose |
|---|---|
| `epsilon` / `delta` | Differential privacy budget. Limits total information extractable. |
| `k_anonymity_min` | Result fields must be indistinguishable from k-1 others. |
| `max_unique_values_per_field` | Prevents exact figures from being unique identifiers. |
| `aggregation_minimum_records` | No field from fewer than N source documents. |
| `budget_window` | Time window for budget tracking (`per_session`, `per_day`, `per_contract`, `lifetime`). |
Requirements:
- REQUIRED when `agent_type` is `llm_freeform`.
- RECOMMENDED for all agent types processing `pii-high` or `pii-critical` documents.
- The gateway MUST track budget consumption across sessions and reject requests that would exceed the budget.
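Gateway-side budget tracking can be sketched as follows. This is a minimal illustration under stated assumptions: it tracks only `epsilon`, uses a `per_contract` window, and sums session costs (simple additive composition). The `PrivacyBudget` class and its method names are hypothetical.

```python
class PrivacyBudget:
    """Track epsilon consumption per contract and refuse requests that would exceed it."""

    def __init__(self, epsilon_total):
        self.epsilon_total = epsilon_total
        self.spent = {}  # contract_id -> epsilon consumed so far

    def remaining(self, contract_id):
        return self.epsilon_total - self.spent.get(contract_id, 0.0)

    def try_consume(self, contract_id, epsilon_cost):
        """Record the cost and return True, or return False without recording anything."""
        if epsilon_cost > self.remaining(contract_id):
            return False
        self.spent[contract_id] = self.spent.get(contract_id, 0.0) + epsilon_cost
        return True
```

A gateway would call `try_consume` before starting a session and reject the session request when it returns `False`.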
Attestation at session start proves the enclave booted correctly. But an agent could be compromised during processing. Runtime verification addresses this gap.
End-of-session attestation:
- Agent MUST produce a fresh hardware attestation at session end.
- The end measurement MUST match the start measurement. A mismatch indicates potential tampering and MUST trigger a `violation.detected` audit event.
- The result envelope carries the end-of-session attestation.
Periodic heartbeats:
- The contract MAY require attested heartbeats at a specified interval.
- Missing heartbeats MUST trigger session termination.
Sidecar verifier:
- An independent co-processor running in a separate enclave.
- Monitors network I/O, memory allocation, DNS queries, and syscalls.
- Independently logs all network activity and compares against the agent's self-reported claims.
- If the sidecar detects a mismatch (e.g., agent claims 0 network connections but the sidecar observed 5), a `sidecar.network_mismatch` event is emitted.
- The contract MAY require a sidecar (`sidecar_verifier.required: true`).
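The comparison between the agent's self-reported connections and the sidecar's own log can be sketched as follows. This is a minimal illustration: `check_network_claims`, the destination string format, and the `undeclared_destinations` field are assumptions of this sketch. Only the `sidecar.network_mismatch` event name comes from the text above.

```python
def check_network_claims(agent_claimed, sidecar_observed):
    """Return a mismatch event if the sidecar saw destinations the agent did not report."""
    undeclared = sorted(set(sidecar_observed) - set(agent_claimed))
    if undeclared:
        return {
            "event_type": "sidecar.network_mismatch",
            "undeclared_destinations": undeclared,
        }
    return None  # claims and observations agree


event = check_network_claims(
    agent_claimed=["10.0.0.5:443"],
    sidecar_observed=["10.0.0.5:443", "203.0.113.9:53"],
)
```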
Some table columns contain embedded PII in free-text even though the column itself is not a dedicated PII field. Common examples:
- `description` columns in bank statement transactions (contain counterparty names, addresses, invoice numbers)
- `memo` and `notes` fields
- `reference` columns with free-text references
Columns MUST declare pii_bearing: true when they MAY contain embedded PII.
Values in pii_bearing columns MUST be scanned by NER before leaving the enclave.
Real-world document processing agents are rarely a single model. A typical pipeline:
PDF → OCR (Tesseract) → Layout Detection (LayoutLM) → Table Extraction → Validation
Without explicit declaration of the full chain, a malicious orchestrator could:
- Substitute an untrusted model mid-pipeline
- Inject an undeclared LLM sub-agent into an `ml_structured` pipeline
- Use a sub-agent that phones home via a separate network path
Sub-Agent Policy (contract):
The contract's consumer.sub_agent_policy controls:
- Whether sub-agents are allowed at all
- Maximum pipeline steps
- Which purposes are permitted (OCR, classification, etc.)
- Whether LLM sub-agents are allowed (RECOMMENDED: `false` unless scanning covers the full pipeline)
- Whether cross-enclave invocation is permitted
- Approved sub-agent binary hashes
Sub-Agent Chain (result attestation):
The result envelope's attestation.claims.sub_agent_chain is an ordered array of
SubAgentAttestation records — one per pipeline step. Each declares:
- Agent type, hash, and version
- Purpose and input/output types
- Whether it ran in the same or a separate enclave
- Processing time for that step
Requirements:
- If `sub_agent_policy.require_sub_agent_hashes` is `true` (default), every entry MUST have a verifiable `agent_hash`.
- If `sub_agent_policy.llm_sub_agent_allowed` is `false` (default), entries MUST NOT have `agent_type: "llm_freeform"`.
- The `sub_agent_chain_declaration` claim in `must_include` proves the chain was fully declared. Undeclared sub-agents indicate a contract violation.
- Cross-enclave sub-agents MUST carry their own `separate_attestation`.
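A verifier's check of a declared chain against the policy can be sketched as follows. This is a minimal illustration: `validate_chain`, the `approved_hashes` policy key, and the `cross_enclave` flag are hypothetical names, and "verifiable" is reduced here to "present and, if an approved list exists, on that list".

```python
def validate_chain(chain, policy):
    """Return a list of (step_index, reason) violations for a declared sub-agent chain."""
    violations = []
    approved = set(policy.get("approved_hashes", []))
    for index, step in enumerate(chain):
        agent_hash = step.get("agent_hash")
        if policy.get("require_sub_agent_hashes", True) and not agent_hash:
            violations.append((index, "missing agent_hash"))
        elif approved and agent_hash not in approved:
            violations.append((index, "unapproved agent_hash"))
        if not policy.get("llm_sub_agent_allowed", False) and step.get("agent_type") == "llm_freeform":
            violations.append((index, "llm sub-agent not allowed"))
        if step.get("cross_enclave") and not step.get("separate_attestation"):
            violations.append((index, "missing separate_attestation"))
    return violations


chain = [
    {"agent_type": "deterministic", "agent_hash": "aa11", "purpose": "ocr"},
    {"agent_type": "llm_freeform", "agent_hash": "", "purpose": "summarize", "cross_enclave": True},
]
problems = validate_chain(chain, {})
```

With an empty policy the defaults from the requirements above apply, so the second step is flagged for all three reasons.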
A subtle exfiltration vector: an adversarial agent can encode information in the
trailing digits of numeric fields. For example, a balance of 1,247,832.50
contains no hidden information — but 1,247,832.5031742 encodes ~16 extra bits
in the trailing digits that no human would notice.
With hundreds of numeric fields per result (amounts, balances, running totals, confidence scores), an agent could exfiltrate thousands of bits per session.
Numeric Precision Policy (contract):
The contract's result_policy.numeric_precision_policy controls:
| Control | Purpose |
|---|---|
| `max_decimal_places` | Maximum decimals for currency fields (default: 2) |
| `max_significant_digits` | Maximum significant digits for any numeric field |
| `enforce_standard_rounding` | Require IEEE 754 round-half-to-even (prevents encoding via rounding direction) |
| `detect_entropy_anomaly` | Scanner checks trailing digits for non-natural distribution (Benford's law) |
| `currency_fields_integer_cents` | Force all currency as integer cents (eliminates decimals) |
How it works:
- The agent produces results with numeric fields.
- The result scanner (if `detect_entropy_anomaly` is enabled) runs a statistical test on trailing digits. Natural financial data follows Benford's law; artificially encoded data does not.
- The scanner also verifies all numeric fields comply with `max_decimal_places` and `max_significant_digits`. Non-compliant values are rounded.
- If `enforce_standard_rounding` is true, the scanner verifies rounding direction matches IEEE 754 round-half-to-even (no systematic bias exploitable for encoding).
Scope of application:
The numeric_precision_policy MUST apply to ALL numeric fields in the result
envelope, not only currency-related fields. This includes:
- `confidence` scores (e.g., `0.9700000000000001` MUST be rounded to `0.97` when `max_decimal_places` is 2)
- `processing_duration_ms` (timing channel)
- `memory_peak_bytes` (encoding channel)
- `row_count` and other count fields
- Any numeric value in `fields` or `tables` within extractions
Each of these fields represents a potential covert channel. An IEEE 754 double has 52 bits of mantissa — every unrestricted numeric field can leak up to 52 bits per value. With hundreds of fields per result, this is a significant exfiltration vector.
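The precision check and rounding step can be sketched as follows. This is a minimal illustration: `decimal_places` and `enforce_precision` are hypothetical helpers, and Python's `decimal` module is used so that round-half-to-even is applied to the decimal digits as written rather than to a binary approximation.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def decimal_places(value):
    """Count the digits after the decimal point in the value's shortest decimal form."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(-exponent, 0)


def enforce_precision(value, max_decimal_places=2):
    """Round to the allowed number of decimals using round-half-to-even."""
    quantum = Decimal(1).scaleb(-max_decimal_places)  # 0.01 for two decimal places
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)


suspicious = 1247832.5031742
over_limit = decimal_places(suspicious) > 2
rounded = enforce_precision(suspicious)
```

After rounding, the extra trailing digits from the balance example are gone, so they can no longer carry hidden bits.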
RECOMMENDED defaults for financial data:
{
"max_decimal_places": 2,
"max_significant_digits": 12,
"enforce_standard_rounding": true,
"detect_entropy_anomaly": true,
"currency_fields_integer_cents": false
}
DSSP REQUIRES hardware-attested enclaves for all non-public documents. The following enclave types are supported:
| Type | Platform | Notes |
|---|---|---|
| `sgx` | Intel SGX | Process-level isolation, smallest TCB |
| `sev-snp` | AMD SEV-SNP | VM-level isolation, full memory encryption |
| `tdx` | Intel TDX | VM-level isolation, trust domain extensions |
| `nitro` | AWS Nitro Enclaves | Cloud-specific, no persistent storage |
| `cca` | Arm CCA | Realm-based isolation |
The sandbox type is available for development and testing only:
- MUST NOT be used in production.
- Contracts using `sandbox` MUST set `sensitivity_max` to `public` or `internal`.
- Conformance tests MUST reject `sandbox` for `pii-low` or higher sensitivity.
Note: The enclave_type: "none" option was removed in v0.1.1. All processing
of non-public documents REQUIRES attested compute.
OWNER INFRASTRUCTURE CONSUMER
┌─────────────────────────────────────────┐ ┌──────────┐
│ │ │ │
│ Storage ──► Sanitizer ──► Agent (TEE) ─┼──► │ Result │
│ ▲ │ │ │ │ │ Envelope │
│ │ ▼ ▼ │ │ └──────────┘
│ Docs Audit Event Scanner │ │
│ │ │ │
│ ▼ ▼ │
│ Verdict EoS │
│ Attest │
└─────────────────────────────────────────┘
- Owner creates manifest and contract
- Agent boots in enclave, attests to storage adapter
- Storage issues scoped access token
- Sanitizer cleans documents (strips hidden content, injection patterns)
- Agent processes sanitized documents
- Agent produces result with PII redaction per contract rules
- Result scanner independently checks result for PII leakage
- Agent produces end-of-session attestation
- Result envelope (with scan verdicts + EoS attestation) exits boundary
- Audit events recorded for every step
The DSSP Gateway orchestrates the protocol but SHOULD NOT accumulate enough data to re-identify individuals or correlate results across engagements.
Gateway visibility is configurable per contract:
| Data Type | Visibility Options |
|---|---|
| Manifests | full, summary_only, none |
| Results | full, metadata_only, verdict_only, none |
| Audit Events | full, summary_only, none |
Cross-engagement correlation:
- `cross_engagement_correlation: false` (default) means the gateway MUST NOT correlate data across different engagements from the same owner.
- Implementations SHOULD use per-engagement encryption keys.
- This prevents the gateway from building a statistical profile of the owner's documents across multiple audit periods.
RECOMMENDED default for regulated data:
{
"manifests": "summary_only",
"results": "metadata_only",
"audit_events": "summary_only",
"cross_engagement_correlation": false
}
A DSSP processing session begins when the gateway creates a session context (binding a contract, agent, and set of documents) and ends when the agent produces a result or the session fails.
┌───────────┐
│ created │
└─────┬─────┘
│ agent attests
▼
┌───────────┐
│ active │◄── heartbeats
└─────┬─────┘
│
┌─────────┼──────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────────┐
│completed│ │ failed │ │ terminated │
└────────┘ └────────┘ └────────────┘
- created: Session allocated, waiting for agent attestation.
- active: Agent has attested and is processing documents.
- completed: Agent produced a result and end-of-session attestation.
- failed: Session ended due to agent crash or unrecoverable error.
- terminated: Session ended due to external action (contract revocation, missing heartbeat, sidecar anomaly, or session timeout).
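The session lifecycle can be sketched as a small state machine. This is a minimal illustration: the `Session` class is hypothetical, and allowing `created` to move directly to `failed` or `terminated` (for example, when the agent never attests) is an assumption of this sketch rather than something the diagram shows.

```python
# Allowed transitions; completed, failed, and terminated are terminal states
TRANSITIONS = {
    "created": {"active", "failed", "terminated"},
    "active": {"completed", "failed", "terminated"},
    "completed": set(),
    "failed": set(),
    "terminated": set(),
}


class Session:
    def __init__(self):
        self.state = "created"

    def transition(self, new_state):
        """Move to new_state, or raise if the lifecycle does not allow it."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


session = Session()
session.transition("active")      # agent attested
session.transition("terminated")  # e.g. contract revoked
```

Because terminal states have no outgoing transitions, a session that has ended can never become `active` again.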
Sessions are NOT resumable. A crashed or terminated session MUST be treated as failed. Specifically:
- A failed session MUST produce a `session.failed` audit event with the failure reason.
- Partial results from a failed session MUST NOT be delivered to the consumer. Any partial state MUST be discarded.
- The consumer MAY start a new session under the same contract (if still active and within session limits).
- The contract's `max_concurrent_sessions` limit counts sessions in `created` and `active` states. Sessions in `failed` or `terminated` state MUST be cleaned up within `attestation_freshness_seconds` to free the slot.
Result delivery is idempotent:
- The gateway assigns each result a unique `result_id`.
- The consumer MAY request the same result multiple times via `GET /result/{id}`.
- If the consumer does not acknowledge receipt, the gateway MAY retry delivery.
- The result MUST NOT change between retries (same content, same hash).
- Results MUST remain available for retrieval until at least the contract's `valid_until` timestamp.
- If a session exceeds `max_session_duration_seconds`, the gateway MUST transition it to `terminated` state.
- A `session.timeout` audit event MUST be emitted.
- The gateway MUST instruct the enclave to halt processing.
- If the agent produces a result after timeout, the result MUST be rejected.
| Level | Requirements |
|---|---|
| DSSP Core | Layers 1-4 schemas. Manifests, contracts, results, audit events. |
| DSSP Attested | DSSP Core + hardware attestation (Layer 2 attestation_requirements enforced) |
| DSSP Sovereign | DSSP Attested + customer-managed encryption keys + data residency enforcement |
| DSSP AI-Safe | DSSP Attested + mandatory result scanning + NER + privacy budget + document sanitization |
DSSP is designed to be adopted incrementally. Each conformance level builds on the previous one, and implementations can start simple and add capabilities over time.
- Install the `dssp-gateway` reference implementation.
- Create a manifest from a folder of sample documents.
- Create a processing contract with basic permissions.
- Run the demo agent against the sample documents.
- View the result envelope and audit trail.
See the reference/sandbox/ Docker Compose demo for a fully working example.
Requirements:
- Implement or integrate `dssp-gateway` with your storage backend.
- Use the MinIO adapter or write a custom storage adapter implementing the four required operations (list, grant, read, verify).
- Deploy a `deterministic` agent for structured extraction.
- Enable `regex` result scanning.
- Verify audit chain integrity with the conformance test suite.
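As a rough picture of what a custom storage adapter involves, the sketch below implements the four operations (list, grant, read, verify) over an in-memory dictionary. The method names and signatures are assumptions made for illustration; the authoritative interface is the storage binding schema, not this code.

```python
import hashlib


class InMemoryAdapter:
    """Toy adapter: list, grant, read, verify over a dict of documents."""

    def __init__(self, documents: dict[str, bytes]):
        self._documents = documents
        self._grants: set[tuple[str, str]] = set()  # (document_id, session_id)

    def list_documents(self) -> list[str]:
        """list: enumerate document IDs without exposing content."""
        return sorted(self._documents)

    def grant_access(self, document_id: str, session_id: str) -> None:
        """grant: allow one session to read one document."""
        if document_id not in self._documents:
            raise KeyError(document_id)
        self._grants.add((document_id, session_id))

    def read_document(self, document_id: str, session_id: str) -> bytes:
        """read: return content only if a matching grant exists."""
        if (document_id, session_id) not in self._grants:
            raise PermissionError(f"no grant for {document_id} in {session_id}")
        return self._documents[document_id]

    def verify_document(self, document_id: str, expected_sha256: str) -> bool:
        """verify: compare the stored content against an expected digest."""
        actual = hashlib.sha256(self._documents[document_id]).hexdigest()
        return actual == expected_sha256
```

The point of the shape is that `read_document` is unreachable without a prior `grant_access`, which mirrors the default-deny posture of the protocol.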
Requirements (in addition to DSSP Core):
- Set up a TEE environment (AWS Nitro Enclaves, Azure Confidential Computing, Intel SGX, or AMD SEV-SNP).
- Configure `attestation_requirements` in contracts with real enclave types (not `sandbox`).
- Deploy the agent inside the enclave with proper attestation.
- Enable end-of-session attestation to prove enclave integrity.
- Configure periodic heartbeats for long-running sessions.
Requirements (in addition to DSSP Attested):
- Integrate NER scanning (Microsoft Presidio or equivalent).
- Configure privacy budgets for contracts processing PII-high+ documents.
- Enable document sanitization (injection pattern detection, hidden text stripping).
- Set up the sidecar verifier for high-sensitivity workloads.
- Configure `numeric_precision_policy` for anti-steganographic controls.
- If using `llm_freeform` agents, add the `llm_output_filter` scanner.
Requirements (in addition to DSSP AI-Safe):
- Deploy with customer-managed encryption keys.
- Configure `data_residency` zones and verify documents never leave the zone.
- Enable split-knowledge gateway isolation (`metadata_only` or `verdict_only` visibility for results).
- Disable cross-engagement correlation.
- Conduct regular attestation audits with independent verifiers.
A conformance test suite (published separately) validates implementations against:
- Schema conformance — All messages validate against the JSON schemas.
- PII safety — Results MUST NOT contain PII fields that should be redacted.
- Result scan validation — Scan verdicts MUST be present when required by contract.
- Audit integrity — Event chain hashes MUST be correct and continuous.
- Contract enforcement — Operations outside the contract MUST be rejected.
- Attestation verification — Invalid attestations MUST be rejected.
- End-of-session attestation — Measurement MUST match start measurement.
- Privacy budget — Requests exceeding budget MUST be rejected.
- Enclave type constraints — `sandbox` MUST be rejected for pii-low+.
- Agent type scanning — `llm_freeform` MUST have NER scanning enabled.
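The audit integrity check in the list above amounts to recomputing each event's hash and confirming that every event points at its predecessor. The following sketch shows the idea; the field names `previous_event_hash` and `event_hash`, the all-zero genesis value, and the `json.dumps` serialization (a simplified stand-in for RFC 8785) are assumptions, not the schema's definitions.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical predecessor value for the first event


def canonical(obj):
    """Simplified canonical form: sorted keys, no insignificant whitespace."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def compute_event_hash(event):
    """Hash every field of the event except the stored hash itself."""
    body = {k: v for k, v in event.items() if k != "event_hash"}
    return hashlib.sha256(canonical(body)).hexdigest()


def append_event(chain, payload):
    """Link a new event to the current tail of the chain."""
    previous = chain[-1]["event_hash"] if chain else GENESIS
    event = dict(payload, previous_event_hash=previous)
    event["event_hash"] = compute_event_hash(event)
    chain.append(event)


def verify_chain(chain):
    """Return True if every hash is correct and the links are continuous."""
    previous = GENESIS
    for event in chain:
        if event.get("previous_event_hash") != previous:
            return False
        if event.get("event_hash") != compute_event_hash(event):
            return False
        previous = event["event_hash"]
    return True
```

Editing any field of an earlier event changes its recomputed hash, so `verify_chain` fails at that event even if later events are left untouched.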
| Threat | Mitigation |
|---|---|
| Malicious agent exfiltrates data via network | Contract network_policy with allow-list; sidecar verifier cross-checks |
| Agent stores data to disk and retrieves later | Contract storage_policy: memory_only; enclave prevents disk access |
| LLM embeds PII in free-text "description" fields | Result scanner with NER detects and redacts; pii_bearing column flag |
| Result contains hidden PII in "innocuous" fields | Multi-scanner pipeline (regex + NER + statistical) |
| Manifest reveals PII through document metadata | Manifest filenames are replaced with opaque IDs; no content fields |
| Audit ledger tampered to hide access | Merkle chain makes tampering detectable |
| Replay attack using old attestation | attestation_freshness_seconds limits token age |
| Enclave compromised during processing | End-of-session attestation detects measurement mismatch |
| Agent lies about network activity | Sidecar verifier independently monitors and cross-checks |
| Re-identification via aggregated results | Privacy budget (k-anonymity, differential privacy) limits extraction |
| Prompt injection via malicious document content | Document sanitization strips injection patterns before agent sees them |
| Gateway correlates results across engagements | Split-knowledge model with per-engagement keys; configurable visibility |
| Agent runs without enclave in production | enclave_type: "none" removed; sandbox rejected for pii-low+ |
| Steganographic exfiltration via numeric precision | numeric_precision_policy limits trailing digits; detect_entropy_anomaly flags non-natural distributions |
| Undeclared sub-model in multi-model pipeline | sub_agent_chain declaration required; sub_agent_policy restricts allowed sub-agents and purposes |
| Hidden LLM in a "deterministic" pipeline | sub_agent_policy.llm_sub_agent_allowed: false blocks LLM sub-agents; chain attestation makes substitution detectable |
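To illustrate the numeric precision mitigation in the table above, the sketch below rounds a value to a fixed number of decimal places so that trailing digits cannot carry hidden data. The parameter `max_decimal_places` is a hypothetical name, and banker's rounding is chosen here only as one plausible mode; the actual fields of `numeric_precision_policy` are defined by the contract schema.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def limit_precision(value, max_decimal_places):
    """Round value to at most max_decimal_places using banker's rounding."""
    # Decimal(1).scaleb(-2) is Decimal("0.01"), the quantum for two places.
    quantum = Decimal(1).scaleb(-max_decimal_places)
    # Going through str() avoids importing binary floating-point noise.
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
```

For example, `limit_precision("1234.56789", 2)` yields `Decimal("1234.57")`, discarding the three trailing digits an agent could otherwise have used as a covert channel.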
- Side-channel attacks on enclaves (Spectre, etc.) are hardware-level risks that DSP cannot fully mitigate. Implementations SHOULD follow platform-specific hardening guides.
- Result inference — Even with redacted PII and privacy budgets, sufficiently detailed structured results could allow re-identification in extreme cases. Owners SHOULD consider the sensitivity of allowed fields in aggregate.
- Agent trustworthiness — DSP proves the agent ran specific code in an enclave. It does not prove the code is free of bugs or malicious logic. Code auditing remains the owner's responsibility.
- NER model limitations — Named Entity Recognition is not perfect. Novel PII patterns, multilingual content, and domain-specific terminology may evade detection. The multi-scanner pipeline mitigates this but cannot guarantee zero false negatives.
- Sidecar overhead — Running a sidecar verifier in a separate enclave adds computational and operational overhead. This is acceptable for high-sensitivity workloads but may be excessive for `public` or `internal` documents.
- Privacy budget tracking — Budget enforcement requires a stateful component (typically the gateway) that tracks consumption across sessions. This introduces a single point of failure for budget enforcement.
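The stateful component mentioned in the last limitation can be pictured as a small ledger keyed by contract. The sketch below assumes a single cumulative epsilon-style budget per contract; `PrivacyBudgetTracker`, `total_epsilon`, and `consume` are hypothetical names and do not reflect the actual `privacy_budget` schema.

```python
class PrivacyBudgetTracker:
    """Tracks cumulative budget consumption per contract across sessions."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self._spent = {}  # contract_id -> budget consumed so far

    def remaining(self, contract_id):
        """Budget still available to the given contract."""
        return self.total_epsilon - self._spent.get(contract_id, 0.0)

    def consume(self, contract_id, epsilon):
        """Charge a request; return False (and charge nothing) if over budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining(contract_id):
            return False
        self._spent[contract_id] = self._spent.get(contract_id, 0.0) + epsilon
        return True
```

Because all state lives in `_spent`, losing this object loses the enforcement history, which is the single point of failure the limitation describes.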
| Regulation | DSP Feature |
|---|---|
| GDPR Art. 5(1)(f) | Enclave attestation, encryption at rest, result scanning |
| GDPR Art. 25 | Data protection by design: default-deny PII, privacy budget |
| GDPR Art. 28 | Processing contract defines exact scope |
| GDPR Art. 30 | Audit ledger with full chain |
| GDPR Art. 35 | Privacy budget supports DPIA for automated processing |
| GDPR Art. 44-49 | Documents never leave data_residency zone |
| HIPAA Security Rule | PHI covered by pii_redaction_rules + NER scanning |
| PCI DSS Req. 3 | Card numbers masked via mask_last_4, regex scanner catches stray PANs |
| SOC 2 Type II | Attestation proves control enforcement; sidecar verifies runtime |
| SOX Section 404 | Immutable audit trail for financial data |
| NIS2 (EU) | Customer retains full sovereignty; enclave-only processing |
| ISAE 3402 | Full audit trail per engagement |
| AI Act (EU) Art. 15 | Result scanning provides transparency for AI-generated outputs |
This specification uses semantic versioning:
- Major version: Breaking changes to schemas or protocol behavior.
- Minor version: Backward-compatible additions (new fields, new event types).
- Patch version: Clarifications and editorial corrections.
All messages in a DSSP session MUST use the same major version. Implementations SHOULD accept messages with the same major version but different minor versions.
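The compatibility rule can be checked with a few lines of code. This is a minimal sketch that assumes plain `MAJOR.MINOR.PATCH` version strings with no pre-release or build suffixes; `parse_version` and `is_compatible` are illustrative names.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(local, remote):
    """Same major version is required; minor and patch may differ."""
    return parse_version(local)[0] == parse_version(remote)[0]
```

Under this rule `is_compatible("0.1.3", "0.2.0")` is true, while `is_compatible("1.0.0", "0.1.3")` is false.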
| Version | Change |
|---|---|
| 0.1.1 | Added AgentType enum; enclave_type: "none" removed, replaced by sandbox for dev only |
| 0.1.1 | Added result scanning (result_scan) as required field in result envelope |
| 0.1.1 | Added end_of_session_attestation to result envelope |
| 0.1.1 | Added pii_bearing flag to ExtractedTable column definitions |
| 0.1.1 | Added document_sanitization to contract restrictions |
| 0.1.1 | Added privacy_budget to contract restrictions |
| 0.1.1 | Added gateway_visibility for split-knowledge model |
| 0.1.1 | Added sidecar_verifier and runtime_verification to attestation requirements |
| 0.1.1 | Added new audit event types: sanitization.*, sidecar.*, privacy_budget.*, attestation.end_of_session, result.scan_passed/failed |
| 0.1.2 | Added SubAgentAttestation definition and sub_agent_chain in result attestation claims |
| 0.1.2 | Added sub_agent_policy to contract consumer for multi-model composition controls |
| 0.1.2 | Added NumericPrecisionPolicy and numeric_precision_policy to contract result policy |
| 0.1.2 | Added sub_agent_chain_declaration to attestation must_include options |
| 0.1.3 | Added canonical serialization requirement (RFC 8785) for all hash computations (§2.5.1) |
| 0.1.3 | Added cryptographic requirements: signature algorithms, input format, encoding (§3.3) |
| 0.1.3 | Added revocation propagation mechanism with bounded delay (§2.3.1) |
| 0.1.3 | Added session lifecycle states and recovery semantics (§5.4) |
| 0.1.3 | Extended numeric_precision_policy scope to ALL numeric fields (§4.10) |
| 0.1.3 | Added progressive adoption path with implementation guides (§6.2) |
| 0.1.3 | Added wire protocol specification (OpenAPI 3.1) — spec/dssp-api-v0.1.yaml |
| 0.1.3 | Added interoperability test vectors — reference/test-vectors/ |
- RFC 2119 — Key words for RFCs
- RFC 4648 — Base16, Base32, Base64 encodings (§5: Base64url)
- RFC 7517 — JSON Web Key (JWK) format
- RFC 8785 — JSON Canonicalization Scheme (JCS)
- JSON Schema Draft 2020-12
- W3C Verifiable Credentials — Attestation model reference
- NIST SP 800-233 — Confidential computing guidelines
- TCG DICE — Device attestation
- Open Data Format — Metadata standardization reference
- OASIS CMIS — Content management interoperability
- Microsoft Presidio — PII detection and anonymization (reference NER scanner)
- OpenDP — Differential privacy framework (reference for privacy budget)
- OpenAPI 3.1 — API specification format for wire protocol
See examples/bank-statement-extraction/ for a complete example of:
- A manifest describing 3 documents for an annual audit engagement
- A processing contract granting DocuVerify access to bank statements (with AI agent restrictions, result scanning, document sanitization, privacy budget, and gateway isolation configured)
- A result envelope with extracted data, attestation proof, PII report, result scan verdicts, and end-of-session attestation
- An audit trail showing the complete chain of events (including sanitization, result scanning, and end-of-session attestation events)
- `schemas/common.schema.json` — Shared types and definitions (including AgentType, ResultScanVerdict, PrivacyBudget, GatewayVisibility, SubAgentAttestation, NumericPrecisionPolicy)
- `schemas/manifest.schema.json` — Layer 1: Document Manifest
- `schemas/contract.schema.json` — Layer 2: Processing Contract
- `schemas/result.schema.json` — Layer 3: Result Envelope
- `schemas/audit-event.schema.json` — Layer 4: Audit Ledger
- `schemas/storage-binding.schema.json` — Layer 0: Storage Binding Interface
The DSP wire protocol is defined in spec/dssp-api-v0.1.yaml (OpenAPI 3.1). It
specifies all HTTP endpoints, request/response schemas, authentication, error
handling, and version negotiation for DSP implementations.
Key endpoints:
- `POST /v0.1/manifests` — Create a document manifest
- `POST /v0.1/contracts` — Create a processing contract
- `POST /v0.1/sessions` — Start a processing session
- `POST /v0.1/sessions/{id}/complete` — Submit result and end session
- `GET /v0.1/audit/events` — Read the audit chain
- `GET /v0.1/.well-known/dssp-configuration` — Discovery and version negotiation
- `reference/gateway/` — Reference DSP gateway (Go). Implements the full wire protocol with in-memory storage, contract enforcement, audit chain management, and RFC 8785 canonical JSON.
- `reference/storage-adapters/minio/` — MinIO storage adapter (Go). Maps DSP storage operations to MinIO S3-compatible API.
- `reference/scanner/` — Reference result scanner (Python). Implements regex, NER (Presidio), statistical (Benford's law), and LLM output filter scanners.
- `reference/conformance/` — Conformance test suite (Python/pytest). Tests behavior across all four conformance levels.
- `reference/sandbox/` — Docker Compose demo. Full working DSP environment with gateway, MinIO, scanner, demo agent, and audit viewer.
Interoperability test vectors are in reference/test-vectors/:
- `canonical-json/` — RFC 8785 canonical JSON serialization test cases
- `hash-computation/` — SHA-256 hash computation over canonical JSON
- `merkle-chain/` — Three-event Merkle chain with computed hashes
- `pii-redaction/` — mask_last_4, hash_sha256, and suppress redaction methods
- `numeric-precision/` — Rounding, banker's rounding, significant digits
- `sub-agent-chain/` — Valid and invalid pipeline configurations
Each vector file contains input, expected_output, and description fields
for automated verification.
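A vector with these three fields can be checked by a very small harness. The sketch below assumes each vector is a single JSON object and uses an inline, made-up vector rather than a file from `reference/test-vectors/`. The `canonicalize` helper is a simplified approximation of RFC 8785 (it does not reproduce the RFC's number formatting or UTF-16 key ordering), so it is suitable only for simple inputs.

```python
import json


def canonicalize(value):
    """Approximate canonical JSON: sorted keys, no insignificant whitespace."""
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def run_vector(vector, function):
    """Apply function to vector['input'] and compare with expected_output."""
    actual = function(vector["input"])
    return {
        "description": vector["description"],
        "passed": actual == vector["expected_output"],
        "actual": actual,
    }


# Hypothetical vector for illustration; not taken from the repository.
sample_vector = {
    "description": "object keys are emitted in sorted order",
    "input": {"b": 1, "a": 2},
    "expected_output": '{"a":2,"b":1}',
}

report = run_vector(sample_vector, canonicalize)
```

Here `report["passed"]` is true; a failing implementation would leave its divergent output in `report["actual"]` for inspection.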
Is the agent rule-based (regex/template)?
├─ YES → agent_type: "deterministic"
│        Minimum scanning: regex
│        Privacy budget: RECOMMENDED
│
└─ NO → Does the agent produce free-text output?
        ├─ NO → agent_type: "ml_structured"
        │       Minimum scanning: regex + ner
        │       Privacy budget: RECOMMENDED for pii-high
        │
        └─ YES → agent_type: "llm_freeform"
                 Minimum scanning: regex + ner + llm_output_filter
                 Privacy budget: REQUIRED
                 Document sanitization: RECOMMENDED
                 Result scanning: REQUIRED
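The decision tree above translates directly into a small lookup function. The function name, its two boolean parameters, and the dictionary keys are illustrative choices; the `agent_type` values, scanner names, and requirement levels are copied from the tree.

```python
def classify_agent(rule_based, produces_free_text):
    """Map the two questions in the decision tree to an agent profile."""
    if rule_based:
        return {
            "agent_type": "deterministic",
            "minimum_scanning": ["regex"],
            "privacy_budget": "RECOMMENDED",
        }
    if not produces_free_text:
        return {
            "agent_type": "ml_structured",
            "minimum_scanning": ["regex", "ner"],
            "privacy_budget": "RECOMMENDED for pii-high",
        }
    return {
        "agent_type": "llm_freeform",
        "minimum_scanning": ["regex", "ner", "llm_output_filter"],
        "privacy_budget": "REQUIRED",
        "document_sanitization": "RECOMMENDED",
        "result_scanning": "REQUIRED",
    }
```

As in the tree, the free-text question is only consulted when the agent is not rule-based, so `classify_agent(True, True)` still returns the `deterministic` profile.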