Commit cd4c433

committed
add sans-ir
1 parent 9c1a5b3 commit cd4c433

12 files changed

Lines changed: 1257 additions & 3 deletions
Lines changed: 279 additions & 0 deletions
@@ -0,0 +1,279 @@
---
name: Sprint 1 IR Mutation Surface (Revised)
overview: Establish sans.ir as the single canonical editable IR. Remove all legacy/obsolete IR schema references. Implement deterministic normalization from IRDoc → sans.ir and sans.ir → expanded.sans printing via existing printer. Add run-ir CLI that executes sans.ir through the existing runtime core. No alternate schemas. No save splitting. Stable step handles derived from semantic anchors (not positional indices).
todos: []
isProject: false
---

# Sprint 1: IR as the Mutation Surface (Authoritative Version)

## Critical Clarification

There is **one canonical IR format going forward: `sans.ir`.**

Any pre-existing IR schema documentation (e.g. [docs/sans-ir-v0.1-schema](docs/sans-ir-v0.1-schema)) is obsolete and must either:

- be **removed**, or
- be **renamed** explicitly to describe `plan.ir.json` (execution IR with loc + transform ids)

There must be **no competing IR definitions** in the repo after this sprint.

---

## Core Principles

1. `sans.ir` is the mutation surface.
2. `sans.ir` contains *only semantic data*.
3. Execution continues to use in-memory `IRDoc`.
4. Both `run` (text) and `run-ir` (IR file) feed the same `execute_plan`.
5. **No positional step IDs.** No s1/s2 numbering; use semantic anchors only.

---

## 1. Define Canonical `sans.ir` Schema

**Location:** `sans/ir/` package.

Convert the existing [sans/sans/ir.py](sans/sans/ir.py) into a package:

```
sans/ir/
  __init__.py    # exports IRDoc + related types (move current ir.py content here or to doc.py)
  schema.py      # sans.ir schema definitions
  normalize.py   # IRDoc → sans.ir
  adapter.py     # sans.ir → IRDoc
```

**Do not add** `printer.py`; the printer is the existing `irdoc_to_expanded_sans(IRDoc)`.

---

### sans.ir Structure (Authoritative)

```json
{
  "version": "0.1",
  "datasources": {
    "lb": {
      "kind": "csv",
      "path": "lb.csv",
      "columns": {
        "USUBJID": "string",
        "VISITNUM": "int"
      }
    }
  },
  "steps": [
    {
      "id": "out:__t2__",
      "op": "identity",
      "inputs": ["__datasource__lb"],
      "outputs": ["__t2__"],
      "params": {}
    },
    {
      "id": "out:sorted_high",
      "op": "save",
      "inputs": ["sorted_high"],
      "outputs": [],
      "params": { "path": "sorted_high.csv" }
    }
  ]
}
```

---

### Hard Rules

**Excluded fields (must NOT appear in sans.ir):**

- transform_id
- transform_class_id
- step_id
- loc
- any runtime fingerprint
- registry ids

**Step identity rule (semantic anchors only):**

- If a step produces **one output table:** `id = "out:<output_table_name>"`
- For **datasource steps:** `id = "ds:<datasource_name>"`

This gives stable anchors and avoids cascade renumbering on insertion (mutation-friendly).

**Save steps:** Do **not** split saves into a top-level `saves[]`. `save` remains a normal step in `steps[]`. One semantic representation only.

**Canonical ordering:**

- datasources: sorted by key
- steps: topologically sorted
- params: keys sorted (recursively for nested structures)
- JSON serialization: deterministic; no UUIDs; no timestamps

---
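The step identity rule and the serialization rule under Hard Rules can be sketched roughly as follows. This is an illustration, not the code in `sans/ir/`: the helper names `semantic_step_id` and `canonical_dumps` are hypothetical, the `"datasource"` op name and `params["name"]` lookup are assumptions, and keying a `save` step by its single input only mirrors the `out:sorted_high` example above. Note that `sort_keys=True` also reorders top-level keys, which the plan does not specify.

```python
import json
from typing import Any


def semantic_step_id(step: dict[str, Any]) -> str:
    """Derive a step id from what the step reads or writes, never from its position."""
    op = step.get("op")
    outputs = step.get("outputs", [])
    inputs = step.get("inputs", [])
    if op == "datasource":
        # Assumption: datasource steps carry their name in params["name"].
        return f"ds:{step['params']['name']}"
    if len(outputs) == 1:
        return f"out:{outputs[0]}"
    if op == "save" and len(inputs) == 1:
        # Assumption: mirrors the example, where a save step is keyed by the table it saves.
        return f"out:{inputs[0]}"
    raise ValueError(f"no semantic anchor for step with op={op!r}")


def canonical_dumps(sans_ir: dict[str, Any]) -> str:
    """Serialize with sorted keys at every depth and a fixed layout."""
    return json.dumps(sans_ir, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Because the id depends only on table names, inserting a step between two existing ones leaves every other id unchanged.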

## 2. Normalization: IRDoc → sans.ir

**File:** [sans/ir/normalize.py](sans/ir/normalize.py)

- **Input:** IRDoc
- **Output:** canonical sans.ir dict

**Responsibilities:**

- Strip all execution-only fields.
- Generate **semantic** step ids per rule above (`out:<table>`, `ds:<name>`).
- Preserve step structure exactly.
- Do **not** collapse identity steps.
- Do **not** rewrite logic or perform semantic optimizations.
- Canonicalize param key ordering recursively (e.g. reuse [sans/sans_script/canon.py](sans/sans_script/canon.py) style).

Normalization must be **deterministic**.

---
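A minimal sketch of the stripping and key-ordering responsibilities listed for normalization, assuming each step is available as a plain dict (the real `IRDoc` holds `OpStep` objects, so an actual implementation would read attributes instead). `normalize_step` and `sort_keys_deep` are hypothetical names, and the excluded set lists only the field names the plan spells out.

```python
from typing import Any

# Field names taken from the excluded-fields list. Runtime fingerprints and
# registry ids would need their real key names added here.
EXECUTION_ONLY_KEYS = frozenset({"transform_id", "transform_class_id", "step_id", "loc"})


def sort_keys_deep(value: Any) -> Any:
    """Return a copy with dict keys sorted at every depth; list order is kept."""
    if isinstance(value, dict):
        return {key: sort_keys_deep(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [sort_keys_deep(item) for item in value]
    return value


def normalize_step(raw_step: dict[str, Any]) -> dict[str, Any]:
    """Drop execution-only fields from the step itself, then canonicalize key order."""
    kept = {key: val for key, val in raw_step.items() if key not in EXECUTION_ONLY_KEYS}
    return sort_keys_deep(kept)
```

Stripping is applied only at the top level of the step so that a user parameter that happens to be named `loc` would survive inside `params`; whether that is the desired behavior is a design choice the plan leaves open.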

## 3. Adapter: sans.ir → IRDoc

**File:** [sans/ir/adapter.py](sans/ir/adapter.py)

- **Input:** sans.ir dict
- **Output:** IRDoc

**Responsibilities:**

- Rebuild `OpStep` objects (and datasources, table_facts as needed).
- Provide **synthetic Loc** objects if runtime/validator require them.
- Preserve step order exactly as in sans.ir.
- No mutation or inference of logic.

Target invariant:

```
expanded.sans → compile → IRDoc
IRDoc → normalize → sans.ir
sans.ir → adapter → IRDoc
IRDoc → irdoc_to_expanded_sans → expanded.sans
```

must be lossless (byte-identical expanded.sans and semantically identical sans.ir).

---

## 4. Printer

**Do not** implement a new printer.

Use existing [sans/sans_script/expand_printer.py](sans/sans_script/expand_printer.py):

```
sans.ir → adapter → IRDoc → irdoc_to_expanded_sans(IRDoc)
```

No alternate formatting logic.

---

## 5. run-ir CLI

**Subcommand:**

```
sans run-ir input.sans.ir --out outdir
```

**Implementation:**

- Load sans.ir JSON from `input.sans.ir`.
- Convert via adapter → IRDoc.
- Pass IRDoc to existing `execute_plan` (same path as `sans run`).
- Produce identical bundle shape as `sans run` (report.json, artifacts, outputs).

Execution logic must **not** fork. Register in [sans/__main__.py](sans/sans/__main__.py).

---

## 6. Obsolete IR Schema Cleanup

- **Remove or rename** [docs/sans-ir-v0.1-schema](docs/sans-ir-v0.1-schema):
  - Either delete it, or
  - Rename and repurpose to document **plan.ir.json** (execution IR: loc, transform_id, step_id, etc.) so it is explicit that it is *not* the canonical mutation IR.

After this sprint there must be **exactly one** canonical IR definition (sans.ir); no competing docs.

---

## 7. Tests

### A. IR round-trip

**File:** [tests/test_ir_roundtrip.py](tests/test_ir_roundtrip.py)

**Pipeline:**

1. `expanded.sans` → compile → IRDoc → normalize → **sans.ir A**
2. **sans.ir A** → adapter → IRDoc → printer → **expanded.sans'**
3. **expanded.sans'** → compile → IRDoc → normalize → **sans.ir B**

**Assertions:**

- `expanded.sans == expanded.sans'` (byte-identical).
- `A == B` (canonical IR equality).

Both assertions are required.

---
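The round-trip pipeline in test A could be expressed as one helper that takes the four stages as callables. This is only a sketch: `check_roundtrip` is a hypothetical name, and in the real test the arguments would be the project's compile step, the normalizer, the adapter, and `irdoc_to_expanded_sans`, whose exact signatures are not shown in this plan.

```python
from typing import Any, Callable


def check_roundtrip(
    expanded: str,
    compile_fn: Callable[[str], Any],
    normalize_fn: Callable[[Any], Any],
    adapt_fn: Callable[[Any], Any],
    print_fn: Callable[[Any], str],
) -> Any:
    """Run the three pipeline legs and enforce both required assertions."""
    ir_a = normalize_fn(compile_fn(expanded))          # leg 1: text -> sans.ir A
    expanded_prime = print_fn(adapt_fn(ir_a))          # leg 2: sans.ir A -> text'
    ir_b = normalize_fn(compile_fn(expanded_prime))    # leg 3: text' -> sans.ir B
    assert expanded_prime == expanded, "expanded.sans is not byte-identical after round-trip"
    assert ir_a == ir_b, "canonical sans.ir changed after round-trip"
    return ir_a
```

Passing the stages in as parameters keeps the helper testable with toy stand-ins before the real normalizer and adapter exist.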

### B. Execution equivalence

**File:** [tests/test_ir_execution_equivalence.py](tests/test_ir_execution_equivalence.py)

- Run `sans run script.sans` and `sans run-ir script.sans.ir` (script.sans.ir produced from script.sans via normalize).
- Assert: report.json has identical status; output artifact hashes are identical.

---
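One way to compare output artifact hashes for test B, sketched under assumptions: both commands have already been run into separate output directories, and the outputs to compare are the CSV files beneath them (the `"*.csv"` pattern and both function names are illustrative choices, not part of the plan).

```python
import hashlib
from pathlib import Path


def artifact_hashes(out_dir: Path, pattern: str = "*.csv") -> dict[str, str]:
    """Map each matching file's relative path to the SHA-256 of its bytes."""
    out_dir = Path(out_dir)
    return {
        path.relative_to(out_dir).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(out_dir.rglob(pattern))
        if path.is_file()
    }


def assert_same_outputs(run_dir: Path, run_ir_dir: Path, pattern: str = "*.csv") -> None:
    """Fail if the two bundles differ in file set or in any file's content."""
    left = artifact_hashes(run_dir, pattern)
    right = artifact_hashes(run_ir_dir, pattern)
    assert left, f"no artifacts matching {pattern!r} under {run_dir}"
    assert left == right, "run and run-ir produced different output artifacts"
```

Hashing by relative path also catches a missing or extra output file, not only a content difference.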

### C. Canonicalization

**File:** [tests/test_ir_canonicalization.py](tests/test_ir_canonicalization.py)

- No `transform_id` in sans.ir.
- No `loc` in sans.ir.
- Deterministic JSON ordering.
- Multiple normalizations of same IRDoc produce identical sans.ir JSON.

---
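The canonicalization checks in test C might look like the helpers below. Both names are hypothetical, the forbidden set covers only the field names the plan spells out, and scanning every depth (including inside `params`) is an assumption about how strict the test should be.

```python
import json
from typing import Any, Callable

FORBIDDEN_KEYS = ("transform_id", "transform_class_id", "step_id", "loc")


def find_forbidden_keys(value: Any, path: str = "$") -> list[str]:
    """Return the JSON-style path of every forbidden key found at any depth."""
    hits: list[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                hits.append(child_path)
            hits.extend(find_forbidden_keys(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            hits.extend(find_forbidden_keys(child, f"{path}[{index}]"))
    return hits


def assert_deterministic(normalize_fn: Callable[[Any], Any], ir_doc: Any, runs: int = 3) -> None:
    """Normalize repeatedly and require identical JSON text each time."""
    # sort_keys is deliberately left off so key-order drift is detected too.
    rendered = {json.dumps(normalize_fn(ir_doc)) for _ in range(runs)}
    assert len(rendered) == 1, "normalization is not deterministic"
```

Reporting paths rather than a boolean makes a failing test point directly at the offending step.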

## 8. Exit Criteria

Sprint complete when:

- There is **exactly one** canonical mutation IR (`sans.ir`).
- No obsolete IR schema remains (removed or renamed to plan.ir.json only).
- Round-trip is **byte-stable** (expanded.sans) and preserves **canonical IR equality** (sans.ir A == B).
- run-ir produces **identical execution results** (status + output hashes) to run for the same logic.
- No execution-only artifacts (transform_id, loc, step_id, etc.) appear in sans.ir.

---
258+
259+
## 9. Non-Goals
260+
261+
- No optimizer logic.
262+
- No move system.
263+
- No IR rewriting or mapping logic.
264+
265+
This sprint establishes the foundation only. All mutation work depends on this being correct.
266+
267+
---

## 10. File-Level Summary

| Action | Path |
| ------------- | ---- |
| Refactor | `sans/ir.py` → `sans/ir/` package with `__init__.py`, `schema.py`, `normalize.py`, `adapter.py` |
| Modify | [sans/__main__.py](sans/sans/__main__.py): register `run-ir` subcommand |
| Remove/rename | [docs/sans-ir-v0.1-schema](docs/sans-ir-v0.1-schema): remove or rename to plan.ir.json documentation |
| Add tests | [tests/test_ir_roundtrip.py](tests/test_ir_roundtrip.py), [tests/test_ir_execution_equivalence.py](tests/test_ir_execution_equivalence.py), [tests/test_ir_canonicalization.py](tests/test_ir_canonicalization.py) |

sans/sans/__main__.py

Lines changed: 97 additions & 0 deletions
@@ -10,6 +10,9 @@
 from .validator_sdtm import validate_sdtm
 from .fmt import FMT_STYLE_ID, format_text, normalize_newlines
 from . import __version__ as _engine_version
+from .ir.adapter import sans_ir_to_irdoc
+from .ir.schema import validate_sans_ir
+from .sans_script import irdoc_to_expanded_sans
 
 
 def _parse_tables(tables_arg: str | None) -> set[str] | None:
@@ -119,6 +122,22 @@ def main(argv: list[str] | None = None) -> int:
     run_parser.add_argument("--lock-only", action="store_true", help="With --emit-schema-lock: generate lock only, do not execute (even if all datasources are typed)")
     run_parser.add_argument("--bundle-mode", choices=["full", "thin"], default="full", help="Bundle mode: full (embed datasource bytes) or thin (fingerprints only)")
 
+    run_ir_parser = subparsers.add_parser("run-ir", help="Execute a canonical sans.ir file")
+    run_ir_parser.add_argument("script", help="Path to the sans.ir file")
+    run_ir_parser.add_argument("--out", required=True, help="Output directory for plan/report and outputs")
+    run_ir_parser.add_argument("--tables", default="", help="Comma-separated table bindings name=path.csv")
+    run_ir_parser.add_argument("--format", default="csv", choices=["csv", "xpt"], help="Output format (csv, xpt)")
+    run_ir_parser.add_argument("--strict", dest="strict", action="store_true", default=True)
+    run_ir_parser.add_argument("--no-strict", dest="strict", action="store_false")
+    run_ir_parser.add_argument("--schema-lock", metavar="path", default=None, help="Path to schema.lock.json to enforce when ingesting datasources")
+    run_ir_parser.add_argument("--emit-schema-lock", metavar="path", default=None, help="After successful run, write schema.lock.json to this path")
+    run_ir_parser.add_argument("--lock-only", action="store_true", help="With --emit-schema-lock: generate lock only, do not execute (even if all datasources are typed)")
+    run_ir_parser.add_argument("--bundle-mode", choices=["full", "thin"], default="full", help="Bundle mode: full (embed datasource bytes) or thin (fingerprints only)")
+
+    ir_validate_parser = subparsers.add_parser("ir-validate", help="Validate sans.ir structure only")
+    ir_validate_parser.add_argument("script", help="Path to the sans.ir file")
+    ir_validate_parser.add_argument("--strict", action="store_true", default=False, help="Treat structural warnings as errors")
+
     schema_lock_parser = subparsers.add_parser("schema-lock", help="Generate schema.lock.json without execution (no --out required)")
     schema_lock_parser.add_argument("script", help="Path to the script file")
     schema_lock_parser.add_argument("--write", "-o", dest="write", default=None, metavar="path", help="Lock output path (default: <script_dir>/<script_stem>.schema.lock.json); relative paths resolved against script dir")
@@ -420,6 +439,84 @@ def main(argv: list[str] | None = None) -> int:
             print(f"ok: wrote schema lock to {emit_path}{suffix}")
         return int(report.get("exit_code_bucket", 50))
 
+    if args.command == "run-ir":
+        script_path = Path(args.script)
+        out_dir = Path(args.out)
+        bindings = {}
+        if args.tables:
+            for item in args.tables.split(","):
+                if not item.strip():
+                    continue
+                if "=" not in item:
+                    return _write_failed_report(out_dir, f"Invalid table binding '{item}'")
+                name, path = item.split("=", 1)
+                name = name.strip()
+                if name in bindings:
+                    return _write_failed_report(out_dir, f"Duplicate table binding for '{name}'")
+                bindings[name] = path.strip()
+        try:
+            sans_ir = json.loads(script_path.read_text(encoding="utf-8"))
+            validate_sans_ir(sans_ir)
+            ir_doc = sans_ir_to_irdoc(sans_ir, file_name=str(script_path))
+        except OSError as exc:
+            return _write_failed_report(out_dir, str(exc))
+        except Exception as exc:
+            return _write_failed_report(out_dir, f"Invalid sans.ir: {exc}")
+
+        # Reuse existing run path and runtime core by rendering canonical expanded.sans
+        # from IR, then executing as a normal .sans script.
+        expanded = irdoc_to_expanded_sans(ir_doc)
+        virtual_script_path = script_path.with_suffix(".expanded.sans")
+        schema_lock_path = Path(args.schema_lock) if args.schema_lock else None
+        emit_schema_lock_path = Path(args.emit_schema_lock) if args.emit_schema_lock else None
+        report = run_script(
+            text=expanded,
+            file_name=str(virtual_script_path),
+            bindings=bindings,
+            out_dir=out_dir,
+            strict=args.strict,
+            output_format=args.format,
+            include_roots=None,
+            allow_absolute_includes=False,
+            allow_include_escape=False,
+            legacy_sas=False,
+            schema_lock_path=schema_lock_path,
+            emit_schema_lock_path=emit_schema_lock_path,
+            lock_only=getattr(args, "lock_only", False),
+            bundle_mode=getattr(args, "bundle_mode", "full"),
+        )
+        status = report.get("status")
+        if status == "refused":
+            primary = report.get("primary_error") or {}
+            loc = primary.get("loc") or {}
+            loc_str = f"{loc.get('file')}:{loc.get('line_start')}" if loc else ""
+            print(f"refused: {primary.get('code')} at {loc_str}".rstrip())
+        elif status == "failed":
+            primary = report.get("primary_error") or {}
+            loc = primary.get("loc") or {}
+            loc_str = f"{loc.get('file')}:{loc.get('line_start')}" if loc else ""
+            print(f"failed: {primary.get('code')} at {loc_str}".rstrip())
+        else:
+            print("ok: wrote plan.ir.json report.json registry.candidate.json runtime.evidence.json")
+        return int(report.get("exit_code_bucket", 50))
+
+    if args.command == "ir-validate":
+        script_path = Path(args.script)
+        try:
+            sans_ir = json.loads(script_path.read_text(encoding="utf-8"))
+            warnings = validate_sans_ir(sans_ir, strict=bool(getattr(args, "strict", False)))
+        except OSError as exc:
+            print(f"invalid: {exc}", file=sys.stderr)
+            return 2
+        except Exception as exc:
+            print(f"invalid: {exc}", file=sys.stderr)
+            return 2
+        if warnings:
+            print(f"ok: valid sans.ir ({len(warnings)} warning(s))", file=sys.stderr)
+        else:
+            print("ok: valid sans.ir", file=sys.stderr)
+        return 0
+
     if args.command == "schema-lock":
         from .runtime import generate_schema_lock_standalone
         script_path = Path(args.script)
