Commit 2969f81

Create determinism-contract.md
1 parent 64ae5c4 commit 2969f81

1 file changed

Lines changed: 311 additions & 0 deletions

@@ -0,0 +1,311 @@
1+
# Determinism Contract
2+
3+
This document defines the determinism contract for SIGNIA: the explicit rules and guarantees that ensure identical inputs produce identical outputs (byte-for-byte), enabling reliable hashing, proofs, and independent verification.
4+
5+
Determinism is not a convenience feature in SIGNIA. It is a security property.
6+
7+
---
8+
9+
## 1) Contract statement
10+
11+
Given:
12+
- the same input artifact bytes (or pinned immutable reference)
13+
- the same SIGNIA version
14+
- the same plugin set and plugin versions
15+
- the same normalization policy and configuration
16+
- the same canonicalization and hashing specifications
17+
18+
SIGNIA must produce:
19+
- identical canonical `schema.json` bytes
20+
- identical canonical `manifest.json` bytes (for hashed fields)
21+
- identical `proof.json` (root and proof material derived from defined leaves)
22+
- identical schema hash, manifest hash (if used), and proof root
23+
24+
**Same input → same output (byte-for-byte).**
25+
26+
---
27+
28+
## 2) Determinism scope
29+
30+
Determinism applies to:
31+
32+
- Canonical bytes used for hashing:
33+
  - schema canonical bytes
34+
  - manifest canonical bytes (as specified)
35+
  - proof leaf encodings
36+
- Hashing:
37+
  - domain-separated hash definitions
38+
  - leaf hashing and Merkle root derivation
39+
- Ordering:
40+
  - every collection (maps, sets, lists) in hashed domains
41+
- Normalization:
42+
  - paths, line endings, encoding rules, timestamps
43+
- Plugin outputs:
44+
  - IR must be deterministic for the same normalized input
45+
46+
Determinism does not require:
47+
- identical performance metrics
48+
- identical logs
49+
- identical non-hashed metadata (unless explicitly specified)
50+
51+
---
52+
53+
## 3) Inputs: what must be pinned
54+
55+
### 3.1 Immutable references are required for reproducibility
56+
Acceptable pinning strategies include:
57+
- commit SHA (for VCS sources)
58+
- content checksum (for archives and files)
59+
- explicit versioned releases with checksums
60+
61+
Floating references, such as the following, are allowed only if they are resolved to pinned inputs at compile time:
62+
- branch names (e.g., `main`)
63+
- mutable URLs
64+
- “latest” tags
65+
66+
If a floating ref is used, the manifest must record the resolved immutable reference.
67+
68+
### 3.2 Network access policy
69+
Default:
70+
- no network access during compilation
71+
72+
If network access is enabled:
73+
- every fetched input must be content-addressed or pinned
74+
- caches must not change results
75+
- the manifest must record the resolved immutable identifiers
76+
77+
---
78+
79+
## 4) Normalization contract (input canonicalization)
80+
81+
Normalization removes environment variance before parsing.
82+
83+
### 4.1 Paths
84+
- All paths must be represented in normalized POSIX form using `/` separators in hashed domains.
85+
- Absolute host paths must never appear in hashed domains.
86+
- Input roots must be mapped to a logical root (e.g., `/` or `repo://`).
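
The path rules above could be sketched roughly as follows. This is a minimal illustration, not SIGNIA's implementation: the function name `normalize_path`, the `LOGICAL_ROOT` value, and the choices to treat backslashes as separators and to reject `..` segments are all assumptions made for the example.

```python
from pathlib import PurePosixPath, PureWindowsPath

LOGICAL_ROOT = "repo://"  # hypothetical logical root, taken from the example in 4.1


def normalize_path(raw: str) -> str:
    """Map a root-relative host path to a logical POSIX-style path."""
    # Assumption: backslashes are separators, so Windows-style input
    # normalizes to the same string as POSIX-style input.
    unified = raw.replace("\\", "/")
    if PurePosixPath(unified).is_absolute() or PureWindowsPath(raw).is_absolute():
        raise ValueError("absolute host paths are not allowed in hashed domains")
    parts = []
    for part in unified.split("/"):
        if part in ("", "."):
            continue  # drop empty and current-directory segments
        if part == "..":
            raise ValueError("parent traversal is not allowed")
        parts.append(part)
    return LOGICAL_ROOT + "/".join(parts)
```

Rejecting `..` outright, instead of collapsing it, keeps the mapping simple to audit; a real policy might choose differently as long as the choice is documented.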
87+
88+
### 4.2 Newlines and encoding
89+
- Text inputs must be normalized to LF (`\n`) for hashing domains.
90+
- UTF-8 is the canonical encoding.
91+
- If an input is not valid UTF-8 and the plugin expects text, the plugin must:
92+
  - reject with a deterministic error, or
93+
  - define a deterministic byte-to-text mapping strategy (must be documented).
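
As a sketch of the "reject with a deterministic error" option combined with LF normalization: the function name `normalize_text` and the error code `E_INVALID_UTF8` are hypothetical, and converting a lone `\r` to `\n` is an assumption that goes beyond what the contract states.

```python
def normalize_text(raw: bytes) -> bytes:
    """Decode strictly as UTF-8, then normalize CRLF and lone CR to LF."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # Report only the byte offset, which is the same on every host.
        raise ValueError(f"E_INVALID_UTF8 at byte {exc.start}") from None
    # Replace CRLF first so it does not turn into two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n").encode("utf-8")
```

The error carries a stable code and a byte offset rather than the interpreter's own message, so two hosts report the same failure for the same bytes.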
94+
95+
### 4.3 Timestamps and environment-derived values
96+
- Wall-clock timestamps must never influence hashed domains.
97+
- If timestamps exist in inputs (e.g., metadata files), plugins must:
98+
  - ignore them, or
99+
  - normalize them into a deterministic placeholder, or
100+
  - treat them as non-hashed metadata.
101+
102+
### 4.4 Symlinks
103+
Default recommended policy:
104+
- deny symlinks
105+
106+
If symlinks are allowed:
107+
- resolve only within the input root
108+
- validate canonical path containment
109+
- define deterministic resolution behavior and record policy version in the manifest
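
If symlinks are allowed, the containment check might look like the sketch below. The helper name `resolve_within_root` and the error code `E_SYMLINK_ESCAPE` are hypothetical; the example relies only on the standard `os.path.realpath` and `os.path.commonpath` functions.

```python
import os


def resolve_within_root(root: str, candidate: str) -> str:
    """Resolve symlinks and return a root-relative POSIX path, or fail."""
    real_root = os.path.realpath(root)
    # realpath follows symlinks and collapses ".." segments.
    real_path = os.path.realpath(os.path.join(real_root, candidate))
    # The resolved path must still sit under the resolved root.
    if os.path.commonpath([real_root, real_path]) != real_root:
        raise ValueError("E_SYMLINK_ESCAPE")
    return os.path.relpath(real_path, real_root).replace(os.sep, "/")
```

Comparing fully resolved paths, rather than checking string prefixes, avoids treating `/data/root-evil` as if it were inside `/data/root`.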
110+
111+
---
112+
113+
## 5) IR determinism contract
114+
115+
Plugins produce IR. IR is untrusted until validated and canonicalized.
116+
117+
### 5.1 IR must be deterministic
118+
For the same normalized input and plugin config, plugins must produce identical IR.
119+
120+
Plugins must not:
121+
- iterate using filesystem order without sorting
122+
- depend on locale/timezone
123+
- generate random IDs
124+
- use nondeterministic concurrency for ordering
125+
- include host-specific paths or usernames
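
For the first rule, a plugin could enumerate files in a fixed order instead of whatever order the filesystem returns. A minimal sketch, assuming a hypothetical helper named `list_files_sorted`:

```python
import os


def list_files_sorted(root: str) -> list[str]:
    """Return root-relative file paths in a host-independent order."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort fixes the traversal order
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    # Final sort by code point gives one total order on every platform.
    return sorted(found)
```

Sorting the final list of normalized paths, not just each directory listing, means the result does not depend on `os.sep` or on how the walk interleaves directories.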
126+
127+
### 5.2 Stable identities
128+
Every entity and edge must have a stable identity strategy documented by the plugin.
129+
130+
Examples:
131+
- entity ID derived from normalized path + kind
132+
- edge ID derived from (from_id, to_id, relation_type)
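
One way to derive such IDs is to hash the identifying fields with unambiguous framing. In this sketch the tags `signia:id:entity:v1` and `signia:id:edge:v1`, the 4-byte length prefix, and the use of SHA-256 are all assumptions for illustration, not part of the contract.

```python
import hashlib


def _frame(field: str) -> bytes:
    """Length-prefix a field so adjacent fields cannot run together."""
    data = field.encode("utf-8")
    return len(data).to_bytes(4, "big") + data


def entity_id(path: str, kind: str) -> str:
    # Hypothetical tag; the inputs are the normalized path and the kind.
    payload = _frame("signia:id:entity:v1") + _frame(path) + _frame(kind)
    return hashlib.sha256(payload).hexdigest()


def edge_id(from_id: str, to_id: str, relation_type: str) -> str:
    payload = (_frame("signia:id:edge:v1") + _frame(from_id)
               + _frame(to_id) + _frame(relation_type))
    return hashlib.sha256(payload).hexdigest()
```

Without the length prefix, `("ab", "c")` and `("a", "bc")` would concatenate to the same bytes and collide.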
133+
134+
### 5.3 Bounded outputs
135+
Plugins must enforce bounds:
136+
- maximum nodes/edges
137+
- maximum attribute sizes
138+
- maximum recursion depth
139+
140+
Bounds must produce deterministic failures.
141+
142+
---
143+
144+
## 6) Canonical JSON encoding (byte-level contract)
145+
146+
All hashed JSON documents must be serialized in a canonical way.
147+
148+
### 6.1 Key ordering
149+
- Object keys must be sorted lexicographically by Unicode code point.
150+
- No “insertion order” reliance.
151+
152+
### 6.2 Whitespace
153+
- No insignificant whitespace.
154+
- No trailing spaces.
155+
- Use `:` and `,` without extra spaces.
156+
157+
### 6.3 Numbers
158+
- Integers encoded in base-10 without leading zeros (except `0`).
159+
- Floats, if allowed, must follow a strict canonical format.
160+
- Recommended: avoid floats in hashed domains; represent as rational or string if needed.
161+
162+
### 6.4 Strings
163+
- Use JSON standard escaping.
164+
- Do not apply Unicode normalization (e.g., NFC or NFKC) at encoding time unless the spec explicitly defines it.
165+
166+
### 6.5 Null/boolean
167+
- Standard JSON literals: `null`, `true`, `false`.
168+
169+
### 6.6 UTF-8 output
170+
- Canonical bytes must be UTF-8 encoded.
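
The encoding rules in this section can be approximated with Python's standard `json` module, as in the sketch below. The function name `canonical_json` is hypothetical, and rejecting floats outright is one possible reading of the recommendation in 6.3, not a rule the contract mandates.

```python
import json


def _check(value) -> None:
    """Reject values this sketch does not allow in hashed documents."""
    if isinstance(value, float):
        raise TypeError("floats are not allowed in hashed domains")
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError("object keys must be strings")
            _check(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            _check(item)


def canonical_json(value) -> bytes:
    _check(value)
    text = json.dumps(
        value,
        sort_keys=True,          # Python compares str by code point
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # emit non-ASCII characters as UTF-8
        allow_nan=False,
    )
    return text.encode("utf-8")
```

For example, `canonical_json({"b": 1, "a": [True, None]})` yields the bytes `{"a":[true,null],"b":1}` regardless of the insertion order of the keys.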
171+
172+
---
173+
174+
## 7) Hashing contract
175+
176+
### 7.1 Hash function
177+
The hash function must be documented and stable for a given major version.
178+
179+
Recommended:
180+
- SHA-256 or BLAKE3 (choose one per spec; do not mix without domain separation)
181+
182+
The hash function is part of the determinism contract. Changing it requires a version bump.
183+
184+
### 7.2 Domain separation
185+
Every hash must include a domain tag prefix.
186+
187+
Examples (illustrative):
188+
- `signia:schema:v1`
189+
- `signia:manifest:v1`
190+
- `signia:proof:v1`
191+
- `signia:leaf:entity:v1`
192+
- `signia:leaf:edge:v1`
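
A domain-tagged hash could be computed as sketched below. The contract does not say how the tag is framed, so the 2-byte length prefix here is an assumption, as are the function name `domain_hash` and the choice of SHA-256.

```python
import hashlib


def domain_hash(tag: str, payload: bytes) -> bytes:
    """Hash canonical bytes under a domain tag such as 'signia:schema:v1'."""
    tag_bytes = tag.encode("utf-8")
    # Length-prefix the tag so the tag/payload boundary is unambiguous.
    prefix = len(tag_bytes).to_bytes(2, "big") + tag_bytes
    return hashlib.sha256(prefix + payload).digest()
```

With this framing, the same payload hashed under `signia:schema:v1` and `signia:manifest:v1` gives unrelated digests, so a hash from one domain cannot be replayed in another.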
193+
194+
### 7.3 Hash inputs
195+
Hashes must be computed over canonical bytes.
196+
197+
Rules:
198+
- never hash in-memory structures without canonical serialization
199+
- never hash debug outputs
200+
- never hash non-deterministic representations
201+
202+
---
203+
204+
## 8) Proof construction contract
205+
206+
### 8.1 Leaf set definition
207+
Proof leaves must be defined by the spec:
208+
- what constitutes a leaf
209+
- how leaves are encoded (canonical bytes)
210+
- how leaves are ordered
211+
212+
### 8.2 Leaf ordering
213+
Leaf ordering must be deterministic and stable:
214+
- sort by (leaf_type, stable_id) or an equivalent stable key
215+
- define a total ordering (no ties)
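
A sketch of such an ordering, assuming leaves are dictionaries with hypothetical `leaf_type` and `stable_id` string fields. Duplicate keys are rejected so that the order is total and does not depend on input order.

```python
def order_leaves(leaves: list[dict]) -> list[dict]:
    """Sort leaves by (leaf_type, stable_id); duplicate keys are an error."""
    keys = [(leaf["leaf_type"], leaf["stable_id"]) for leaf in leaves]
    if len(set(keys)) != len(keys):
        # A tie would make the order depend on the input order.
        raise ValueError("duplicate (leaf_type, stable_id) key")
    return sorted(leaves, key=lambda leaf: (leaf["leaf_type"], leaf["stable_id"]))
```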
216+
217+
### 8.3 Merkle tree construction
218+
Tree construction must be deterministic:
219+
- define whether odd leaves are duplicated, promoted, or padded
220+
- define node hashing domain and concatenation rules
221+
- define root representation
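
To make these choices concrete, here is one possible construction. It is a sketch under stated assumptions: an unpaired node is promoted to the next level, leaf and node hashes use the hypothetical tags `signia:leaf:v1` and `signia:node:v1`, and an empty leaf set is rejected. The actual spec may choose duplication or padding instead.

```python
import hashlib

LEAF_TAG = b"signia:leaf:v1\x00"  # hypothetical domain tags
NODE_TAG = b"signia:node:v1\x00"


def leaf_hash(encoded: bytes) -> bytes:
    return hashlib.sha256(LEAF_TAG + encoded).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Concatenation rule: left digest, then right digest.
    return hashlib.sha256(NODE_TAG + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute the root over already-ordered canonical leaf bytes."""
    if not leaves:
        raise ValueError("empty leaf set")
    level = [leaf_hash(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = [node_hash(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
    return level[0]
```

Using distinct tags for leaves and interior nodes prevents an interior node from being presented as a leaf.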
222+
223+
### 8.4 Proof material
224+
If inclusion proofs are included:
225+
- define sibling ordering
226+
- define direction markers (left/right) deterministically
227+
- encode proofs canonically (JSON canonical encoding or binary spec)
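
The following sketch shows inclusion proofs with explicit direction markers. It assumes the same illustrative tree rules as a promote-on-odd construction (hypothetical tags, SHA-256), and represents each step as a `(side, sibling)` pair where `side` is `"L"` or `"R"`. None of these names come from the SIGNIA spec.

```python
import hashlib

LEAF_TAG = b"signia:leaf:v1\x00"  # hypothetical domain tags
NODE_TAG = b"signia:node:v1\x00"


def _h(tag: bytes, data: bytes) -> bytes:
    return hashlib.sha256(tag + data).digest()


def build_proof(leaves: list[bytes], index: int):
    """Return (root, steps) for the leaf at `index`."""
    if not 0 <= index < len(leaves):
        raise IndexError("leaf index out of range")
    level = [_h(LEAF_TAG, leaf) for leaf in leaves]
    steps = []
    while len(level) > 1:
        sibling = index ^ 1
        if sibling < len(level):
            # "L" means the sibling sits to the left of the running hash.
            steps.append(("L" if sibling < index else "R", level[sibling]))
        # Otherwise this node is promoted and contributes no step.
        nxt = [_h(NODE_TAG, level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level, index = nxt, index // 2
    return level[0], steps


def verify_proof(leaf: bytes, steps, root: bytes) -> bool:
    acc = _h(LEAF_TAG, leaf)
    for side, sibling in steps:
        acc = _h(NODE_TAG, sibling + acc) if side == "L" else _h(NODE_TAG, acc + sibling)
    return acc == root
```

Recording the side explicitly means a verifier never has to infer ordering from the leaf index or from comparing digests.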
228+
229+
---
230+
231+
## 9) Error determinism contract
232+
233+
Failures must be deterministic:
234+
- same input → same error category and message class
235+
236+
Guidelines:
237+
- errors should include stable identifiers, not host-dependent paths
238+
- avoid embedding OS-specific errno strings in stable outputs
239+
- provide structured error codes for programmatic handling
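
As an illustration of these guidelines, a host I/O failure could be mapped to a structured, host-independent record. The error codes and the helper name `to_stable_error` are hypothetical.

```python
import errno

# Hypothetical stable codes keyed by errno value.
_CODES = {
    errno.ENOENT: "E_INPUT_NOT_FOUND",
    errno.EACCES: "E_INPUT_UNREADABLE",
}


def to_stable_error(exc: OSError, logical_path: str) -> dict:
    """Build an error record with no host path and no OS message text."""
    return {
        "code": _CODES.get(exc.errno, "E_IO"),
        "subject": logical_path,  # logical identifier, e.g. a repo:// path
    }
```

The OS-specific message and the host path stay in logs, which are outside the determinism scope; only the code and the logical subject appear in stable output.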
240+
241+
Non-goal:
242+
- exact byte-for-byte matching of logs across environments
243+
244+
---
245+
246+
## 10) Determinism testing requirements
247+
248+
### 10.1 Golden fixtures
249+
For each plugin and core pipeline:
250+
- commit at least one realistic fixture
251+
- commit expected canonical outputs
252+
- CI must validate byte-for-byte equality
253+
254+
### 10.2 Cross-run checks
255+
Run compilation twice in CI and compare:
256+
- schema bytes
257+
- schema hash
258+
- proof root
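
A cross-run check can be as small as the sketch below. Here `compile_fn` stands in for whatever entry point produces the canonical bytes; it, the helper name `check_cross_run`, and the error code are assumptions, not SIGNIA APIs.

```python
import hashlib
from typing import Callable


def check_cross_run(compile_fn: Callable[[bytes], bytes],
                    artifact: bytes, runs: int = 2) -> str:
    """Run compile_fn several times and require identical output bytes."""
    digests = {hashlib.sha256(compile_fn(artifact)).hexdigest()
               for _ in range(runs)}
    if len(digests) != 1:
        raise ValueError("E_NONDETERMINISTIC_OUTPUT")
    return digests.pop()
```

In CI the same helper could be applied separately to schema bytes, the schema hash, and the proof root.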
259+
260+
### 10.3 Cross-platform checks
261+
At minimum:
262+
- Linux and macOS builds validate determinism fixtures
263+
- Windows builds are recommended when Windows path handling is supported
264+
265+
### 10.4 Negative tests
266+
Mutate bundle files and ensure verification fails:
267+
- schema tampering
268+
- manifest tampering
269+
- proof tampering
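
A negative test might flip bits in a bundle file and assert that verification rejects every mutation. The sketch below is deliberately simplified: `verify_schema` checks a plain SHA-256 digest with no domain tag, and all function names are hypothetical.

```python
import hashlib


def verify_schema(schema_bytes: bytes, expected_hex: str) -> bool:
    """Simplified stand-in for real bundle verification."""
    return hashlib.sha256(schema_bytes).hexdigest() == expected_hex


def flip_bit(data: bytes, index: int) -> bytes:
    mutated = bytearray(data)
    mutated[index] ^= 0x01  # flip the lowest bit of one byte
    return bytes(mutated)


def tampering_is_detected(schema_bytes: bytes, expected_hex: str) -> bool:
    """True if a one-bit flip at every byte position fails verification."""
    return all(
        not verify_schema(flip_bit(schema_bytes, i), expected_hex)
        for i in range(len(schema_bytes))
    )
```

The same pattern applies to manifest and proof files, each checked against its own recorded hash or root.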
270+
271+
---
272+
273+
## 11) Change control and versioning
274+
275+
Changes that affect determinism require:
276+
- an explicit version bump where relevant:
277+
  - schema version
278+
  - manifest version
279+
  - proof version
280+
  - hash domain version
281+
- updated specs and JSON schemas
282+
- updated fixtures
283+
- documented migration notes
284+
285+
Security-sensitive changes include:
286+
- canonical JSON encoding rules
287+
- ordering rules
288+
- hashing domains
289+
- proof leaf definitions
290+
- path normalization policies
291+
292+
---
293+
294+
## 12) Consumer obligations
295+
296+
Consumers must:
297+
- verify bundles before trusting them
298+
- apply a publisher allowlist policy where one is required
299+
- avoid relying on non-hashed metadata for security decisions
300+
- pin inputs in CI and record immutable refs
301+
302+
---
303+
304+
## 13) Summary
305+
306+
SIGNIA’s determinism contract ensures:
307+
- stable canonical bytes
308+
- stable hashes and proof roots
309+
- independent verification without trusting operators
310+
311+
Determinism failures are treated as integrity vulnerabilities and must be fixed as a priority.
