feat(laurel): Seq<T> and Array<T> with bounds-checked subscript and diagnostics#1073
feat(laurel): Seq<T> and Array<T> with bounds-checked subscript and diagnostics#1073fabiomadge wants to merge 14 commits into
Conversation
589d615 to
4f5dd33
Compare
6424973 to
1f4b0ce
Compare
MikaelMayer
left a comment
There was a problem hiding this comment.
🤖 Well-structured PR. The separation between ValidateSubscriptUsage (pure diagnostics) and SubscriptElim (rewriting) is clean, and the centralized SeqOp.* / arrayCompositeName constants prevent name drift between passes. The highTypeTag approach for per-element-type Box constructors is a good solution to the deduplication collision. The bounds-precondition infrastructure (mkSeqBoundsPrecond + classifyPrecondition) integrates naturally with the existing PrecondElim pattern.
A few items below.
| | other => other | ||
| let program' := { program with | ||
| types := types' | ||
| staticProcedures := program.staticProcedures.map (elimProcedure model) } |
There was a problem hiding this comment.
🤖 subscriptElim processes types (instance procedures) and staticProcedures, but not program.constants. A constant with a subscript initializer (e.g. const x: int := someSeq[0]) would leave an un-eliminated Subscript node in the tree, triggering the "Subscript should have been eliminated" diagnostic in LaurelToCoreTranslator. Unlikely in practice today, but worth either handling or documenting as unsupported.
There was a problem hiding this comment.
Done — extended subscriptElim to map elimExpr model over each constant's initializer, alongside types and staticProcedures. Still latent today (no grammar path produces constants with subscript initializers), but defensive for any future pass that synthesises them.
| /-- Eliminate `Subscript` nodes and desugar `Array.length` across a program. | ||
| Conditionally injects the `$Array` synthetic composite when the program | ||
| uses `Array<T>` anywhere. -/ | ||
| public def subscriptElim (_model : SemanticModel) (program : Program) |
There was a problem hiding this comment.
🤖 Nit: _model is accepted and immediately discarded. Since the pipeline's LaurelPass.run signature requires Program → SemanticModel → ..., this is unavoidable at the call site, but the docstring could note why the caller's model is stale (it predates $Array injection) to save future readers from wondering whether this is a bug.
There was a problem hiding this comment.
Done — added a docstring on subscriptElim explaining that _model is accepted to satisfy LaurelPass.run's signature but is intentionally unused because the caller's model predates our $Array injection, so the pass rebuilds via resolve before dispatching to elimProcedure.
| /-- `Sequence.length(s) : int` — length of the sequence. -/ | ||
| def SeqOp.length := "Sequence.length" | ||
| /-- Name of the `$data` field on the synthetic `$Array` composite. -/ | ||
| def SeqOp.dataField := "$data" |
There was a problem hiding this comment.
🤖 Nit: SeqOp.dataField is semantically different from the other SeqOp.* entries — it's a field name on the synthetic composite, not a Sequence.* operation name. Consider a separate namespace or at least a comment distinguishing it from the operation names above.
There was a problem hiding this comment.
Done — renamed SeqOp.dataField to arrayDataField (moved out of the SeqOp.* namespace) and added a note distinguishing it from the Sequence.* operation names. Eight call sites in SubscriptElim.lean updated.
handleZeroaryOps fell back to logging an error and returning re.none() for any 0-ary op outside the regex set. That silently substituted a regex primitive for unrelated ops in VC printer output; users saw re.none() where e.g. Sequence.empty() was intended. Switch the fallback to mkGenericCall, matching how handleUnaryOps and handleBinaryOps already handle unknown ops. The printer now emits the op name as a free-variable reference, preserving the intent. Parseable Sequence.empty<T>() syntax is still a separate grammar-level feature; this commit only fixes the printer-side noise.
polyUneval is the combinator used to declare unevaluated polymorphic functions with axioms. Unlike unaryOp and binaryOp, it had no way to attach preconditions; callers had to hand-build the WFLFunc. Add a 'preconditions' parameter and the matching free-vars proof obligation (subset of the function's input names), defaulting to empty. No behavioral change for existing callers.
Sequence.select and Sequence.update now require `0 <= i < length(s)`; Sequence.take and Sequence.drop require `0 <= n <= length(s)`. PrecondElim picks these up and generates VC obligations at call sites, both in statement positions (via transformStmt) and in pure positions (via mkContractWFProc / mkFuncWFProc) — so requires/ensures/quantifier-body subscripts are also covered. Obligations carry the propertyType metadata "outOfBoundsAccess" (new MetaData constant) and flow through a new PropertyType.outOfBoundsAccess enum variant — with matching entries in the statement-eval / obligation-extraction / cmd-eval metadata-to-PropertyType conversion sites — to finally render as "out-of-bounds-access" in SARIF output, matching how divisionByZero and arithmeticOverflow are classified. Side effect: `propertyTypeToClassification` in SarifOutput.lean was previously dead code; `vcResultToSarifResult` never set `properties.propertyType` so the SARIF output defaulted every obligation to "assert". Wiring this up means divisionByZero and arithmeticOverflow obligations now also classify correctly in SARIF — a pre-existing bug this PR incidentally fixes.
New tests in StrataTest/Transform/PrecondElim.lean:
- Test 10: Sequence.select in a procedure body emits the bounds assert
(PrecondElim is unconditional — it inserts regardless of any
surrounding requires guard; the SMT solver discharges).
- Test 10c: Sequence.select inside a requires clause triggers the
$$wf-procedure path (mkContractWFProc).
- Test 10d: Sequence.select inside a function body triggers the
function-body $$wf path (mkFuncWFStmts).
- Test 11: collectPrecondAsserts attaches outOfBoundsAccess metadata
for all four partial ops and a nested call. Mirrors
OverflowCheckTest.lean. Also verifies Sequence.length emits
no obligation (it is total).
- Test 12: Sequence.empty printer regression for the commit-1 fix —
renders as a generic call, not re.none().
New property-classification tests in
StrataTest/Languages/Core/Tests/SarifOutputTests.lean cover all five
PropertyType variants, exercising the SARIF wiring fix in commit 3.
Collateral test updates for real behavioral changes:
- StrataTest/Languages/Core/Examples/Seq.lean: expected VC output
includes the new bounds obligations (all SMT-provable from the
surrounding context, except the pre-existing contains_yes unknown).
- StrataTest/Languages/Core/Tests/ProgramEvalTests.lean: Sequence func
signatures now render with the attached requires clauses.
- StrataTest/Languages/Core/Examples/Loops.lean: commit-1 printer fix
propagates (re.none() -> top, error message format updated).
…tions Sequences (immutable value types): - TSeq variant in HighType; Seq<T> grammar syntax - [1, 2, 3] sequence literals (desugared to Sequence.build chains) - s[i] subscript read and s[i := v] functional update - 9 external Sequence.* operations (empty, build, select, update, length, append, contains, take, drop) - Seq<T> translates to Core's polymorphic Sequence type Arrays (mutable heap-backed): - TArray variant in HighType; Array<T> grammar syntax - a[i] read and a[i] := v write with heap semantics (aliasing) - Seq literal to Array conversion: var a: Array<int> := [1, 2, 3] - Synthetic $Array composite with $data: Seq<T> field - Conditional injection — no $Array in programs that don't use arrays - Array<T> recognized as composite in modifies clauses - Array.length(a) desugared to Sequence.length(a.$data) Shared infrastructure: - Subscript AST node with type-aware SubscriptElim pass - Grammar productions: seqType, arrayType, subscript, seqLiteral Co-authored-by: Fabio Madge <fmadge@amazon.com>
…tics Add a Laurel-layer validator (ValidateSubscriptUsage) that runs alongside validateDiamondFieldAccesses and flags four common misuses with Dafny-style messages that suggest the correct syntax: 1. `a[i := v]` on `Array<T>` — arrays are mutable; use `a[i] := v` or declare `a` as `Seq<T>`. 2. `s[i] := v` on `Seq<T>` — sequences are immutable; use `s[i := v]` or declare `s` as `Array<T>`. 3. `Array.length(x)` where `x` is not an `Array<T>` — reports the actual argument type. 4. `Array<T>` where `T ≠ int` — flagged with a note about the current SMT limitation. Pipeline integration: - runLaurelPasses now returns a `skipCore : Bool` flag (true when the validator emitted diagnostics) so translateWithLaurel can skip Core translation and VC generation when the Laurel-layer diagnostic is the actionable error. This prevents confusing Core type-checking noise from stacking on top of the validator's helpful message. - SubscriptElim hardens a couple of edge cases so downstream passes don't stack follow-on errors when the validator has already flagged a misuse (no-op for Seq destructive update; LiteralInt 0 fallback for Array.length on a non-Array). Negative tests and docs: - T18_Sequences: negative case for diagnostic 2. - T19_Arrays: negative cases for diagnostics 1, 3, and 4. - docs/verso/LaurelDoc: new "Common mistakes" subsection with example snippets for each of the four validator diagnostics.
Adds the Sequence.fromArray(a) builtin for taking an immutable Seq<T> snapshot of an Array<T>'s current contents. SubscriptElim rewrites calls into the a#$data internal field, preserving the existing Array-layout convention. Validator gains a fifth diagnostic flagging Sequence.fromArray calls whose argument type is not an Array<T>. Tests: - Positive cases in T19 covering snapshot semantics (mutation to the array after extraction is not reflected in the captured sequence). - Negative case asserting the new validator diagnostic fires on a non-Array argument. - T19's inter-procedural setFirst now also ensures length-preservation, which is required under the Core-level bounds preconditions introduced by the preceding Core commits in this branch (fixes a failure that surfaced when bounds obligations became checks rather than no-ops). Docs: - Rewrites the out-of-bounds semantics note from 'unconstrained' to 'verification obligation', matching the Core preconditions introduced upstream in this branch. - New 'Array to sequence conversion' section covering Sequence.fromArray and why there is no implicit Array to Seq coercion. - New 'Common mistakes' entry for Sequence.fromArray on a non-Array argument.
The TSeq arms of boxConstructorName/boxDestructorName/boxConstructorDef returned a fixed 'BoxSeq' / 'Box..SeqVal!' string regardless of the sequence's element type. HeapParameterization deduplicates usedBoxConstructors by name, so two composites with Seq fields of different element types produced two BoxSeq entries with incompatible argument types; only one survived dedup and Core type-checking then failed on the other. Derive a per-element-type tag and append it to the base name: Seq<int> becomes BoxSeq_int, Seq<bool> becomes BoxSeq_bool, Seq<Seq<int>> becomes BoxSeq_Seq_int, etc. Mirrors the existing per-primitive approach (BoxInt, BoxBool, ...). The new highTypeTag helper is also prepared to handle TArray / TSet / TMap / Pure / TTypedField by recursion, so the same pattern extends if box constructors are ever needed for those. T22_MixedSeqFields regression exercises two composites with Seq<int> and Seq<bool> fields and writes/reads through both. Fixes the bug previously tracked in #1101 (now closed).
- Document `_model` parameter: accepted to satisfy LaurelPass.run's signature but unused because the caller's model predates $Array injection; SubscriptElim rebuilds via resolve. - Process `program.constants` in addition to types and staticProcedures so subscript access in constant initializers is eliminated too. Latent today (no grammar path produces it), defensive. - Rename `SeqOp.dataField` -> `arrayDataField" (moved out of `SeqOp.*` namespace) since it is a field name on the synthetic composite, not a Sequence.* operation. Addresses MikaelMayer's three inline comments on PR #1073.
1f4b0ce to
37649a2
Compare
MikaelMayer
left a comment
There was a problem hiding this comment.
🔍 LGTM — all comments addressed.
CI surfaced test failures after porting onto main's current tip: 1. Main PR #1076 disallows transparent non-function procedures. Added 'opaque' to all test procedures in T18/T19/T22 that lacked it. For procedures with 'requires' clauses, placed 'opaque' after them. 2. ValidateSubscriptUsage was emitting Diagnostic 1 ('a[i := v] on Array<T> unsupported') for legitimate statement-position 'a[i] := v' on Array<T>. Root cause: the top-level .Subscript arm fires Diag 1 regardless of position. Post-port, 'a[i] := v' is represented as '.Subscript a i (some v)' at statement position in a .Block, so it was hitting the expression-position diagnostic. Reworked the .Block arm to short-circuit statement-position .Subscript nodes entirely, emitting Diag 2 (Seq destructive-update) for Seq and no diagnostic for Array, without falling through to the top-level Subscript arm. 3. Opaque procedures modifying heap-backed state need explicit 'modifies' clauses under main's new rules. 'arrayLoop' in T19 (writes to locally-declared Array) needed 'modifies *'. 'touch' in T22 (writes to a parameter's composite fields) needed 'modifies x'. 4. Seq-destructive-update negative test's caret span widened from 4 to 10 chars to match the new diagnostic span (emitted on the whole '.Subscript t i (some v)' node rather than just the subscript LHS). 5. Added a LIMITATION comment in SubscriptElim noting that statement-position .Subscript detection only fires in .Block children; a bare 'while (...) a[i] := v' body is not covered. Tests don't exercise this today. An elimStmt helper refactor would close it.
Three cleanups surfaced by review of the AST port: 1. Explicit leaf arms with a '⚠ must add arm' warning in elimExpr and validateStmtExpr. Previously a '| _ => ...' catch-all would silently skip recursion into children of any new StmtExpr variant. Mirrors main's MapStmtExprM pattern. 2. Statement-vs-expression-position disambiguation for '.Subscript t i (some v)' is now first-class: elimStmt/elimStmtAsSingle (mutual with elimExpr) and validateStmt (mutual with validateStmtExpr). Called from every statement-position container - Block children, IfThenElse branches, While body. Closes a gap where a bare 'while (...) a[i] := v' body (single statement, no Block wrapper) would slip past the Block-only detection and be miscompiled as an expression-position functional update (also spuriously emitting Diagnostic 1 on Array). 3. Strengthened the validator-elim coupling comment for Seq destructive update: makes the invariant (validator runs before elim, non-Warning diag aborts pre-Core) explicit so a future refactor doesn't silently drop the error. Not done here (follow-up): restoring structural recursion on the Subscript/validator walkers. Still 'partial' because the mutual structure makes termination proofs non-trivial; main's MapStmtExprM provides a template if someone wants to land it.
Attempted to restore structural recursion on elimExpr/elimStmt/ elimStmtAsSingle and validateStmtExpr/validateStmt. Hit a termination- proof dead-end: the '.Assign targets value' arm recurses through 'List VariableMd' into '.Field subTarget _', requiring omega to prove sizeOf subTarget < sizeOf expr across five levels (subTarget -> t.val -> sizeOf t -> sizeOf targets -> sizeOf expr.val -> sizeOf expr) — a depth omega can't traverse with the standard hints main's MapStmtExprM uses. Leaving as 'partial' with explanatory TODO comments indicating the cross-type (Variable/StmtExpr) recursion as the blocker. Two viable paths for a follow-up: - Manually prove a sizeOf lemma connecting Variable.Field children to the enclosing VariableMd. - Extract an elimVariable helper to keep cross-type recursion at a single level.
…agnosis Second attempt at structural recursion tried tuple termination '(sizeOf expr, 0) / (sizeOf s, 1) / (sizeOf s, 2)' for the mutual elimExpr/elimStmt/elimStmtAsSingle. Cross-call ordering works, but 'decreasing_by' tactics (omega, simp_all, rename_i, Variable.sizeOf_* helpers) fail to close the inner '.Assign targets value' goals that require chaining sizeOf subTarget < sizeOf t < sizeOf targets < sizeOf expr through List.sizeOf_lt_of_mem + Variable.sizeOf_field_target_lt_of_eq. Comparing to working HeapParameterization.collectExpr, the structural difference is that its walker takes bare StmtExpr with a thin StmtExprMd wrapper, whereas mine takes StmtExprMd directly and does both .val peeling and pattern matching in one function. A refactor to split into 'elimExpr : StmtExprMd → StmtExprMd' wrapper + 'elimExprVal : StmtExpr → Option FileRange → StmtExprMd' worker would likely make termination tractable. Documented in the TODO comment.
MikaelMayer
left a comment
There was a problem hiding this comment.
🔍 LGTM — all comments addressed. subscriptElim now maps elimExpr model over program.constants initializers, closing the gap I flagged. The subsequent refactoring and documentation commits look clean.
Restore structural recursion on elimExpr and validateStmtExpr by:
1. Adding StmtExpr.sizeOf_subscript_{target,index,value}_lt and _of_eq
variants in Laurel.lean, mirroring the existing
Variable.sizeOf_field_target_lt_of_eq pattern. Also
StmtExpr.sizeOf_assign_value_lt_of_eq for the initExpr case.
2. Inlining the previously-mutual elimStmt / elimStmtAsSingle and
validateStmt helpers directly into the .Block, .IfThenElse, and
.While arms of their respective walkers. Removes the cross-call that
Lean's termination checker couldn't resolve (fallback
'| _ => [elimExpr model s]' went from same-size cross-call to
same-function recursion with a strictly smaller subterm).
3. Using named match hypotheses (match _h : expr.val, match _hsub : s.val,
match _htsub : t.val, match _hesub : e.val, match _hbsub : body.val,
match _htv : t.val) so decreasing_by can use the
..._lt_of_eq sizeOf lemmas to chain subterms to the outer expr.
Functionally equivalent to the previous partial version. Termination is
now mechanical; any future infinite-loop bug in these walkers is ruled
out at compile time. The three remaining 'partial' defs
(containsTArray, stmtExprUsesTArray, validateHighType) were already
partial in the pre-port baseline.
MikaelMayer
left a comment
There was a problem hiding this comment.
🔍 LGTM — all comments addressed. The new structural-recursion commit cleanly resolves the partial TODO by adding sizeOf lemmas and inlining the mutual helpers. Previous feedback (constants handling, _model docstring, arrayDataField separation) remains intact.
Adds
Seq<T>(immutable value sequences) andArray<T>(mutable heap-backed arrays) to Laurel, with type-aware desugaring, bounds-checked subscript, validator diagnostics for common misuses, and aSequence.fromArraysnapshot operation.Supersedes #787, which became stale against
main.Depends on #1100 (Core PR for the Sequence well-formedness infrastructure). This branch is based on
feat/core-sequence-wf, so its first four commits are #1100's. When #1100 merges, those commits become part ofmainand this PR collapses to five Laurel commits.scope
Sequences (
Seq<T>, immutable):Seq<T>types,[a, b, c]sequence literals,s[i]ands[i := v]subscript.Sequence.*primitives:empty,build,select,update,length,append,contains,take,drop.Sequence.Arrays (
Array<T>, mutable heap-backed):Array<T>types,a[i]read,a[i] := vdestructive write,Array.length(a)length.$Arraycomposite with a$data: Seq<int>field (theintelement type matches the currentArray<int>-only restriction; see the validator diagnostic below). The$prefix follows the existing convention for compiler-internal names.Array<T>don't get the synthetic composite.modifiesclauses.Sequence.fromArray(a)takes an immutable snapshot of anArray<T>'s current contents. Snapshot is independent of subsequent mutations.bounds checking
Handled by Core preconditions from #1100:
Sequence.select,Sequence.update:0 <= i < length(s)Sequence.take,Sequence.drop:0 <= n <= length(s)Core's
PrecondElimpass generates VC obligations at every call site — both in imperative code (via inserted asserts) and in pure positions likerequires,ensures, quantifier bodies, and function bodies (via synthetic$$wfprocedures). Errors are classified asoutOfBoundsAccessfor SARIF reporting, matching how division by zero is handled.Validator diagnostics
ValidateSubscriptUsageflags five syntactic misuses before verification runs:a[i := v]onArray<T>— functional update not supported on mutable arrays.s[i] := vonSeq<T>— destructive update not allowed on immutable sequences.Array.length(x)wherexis not anArray<T>.Array<T>whereT ≠ int(current SMT limitation).Sequence.fromArray(x)wherexis not anArray<T>.Validator errors cause Core translation to be skipped via the pipeline's generic "any non-Warning diagnostic → halt before Core" check, preventing follow-on type-checker noise from obscuring the helpful message.
BoxSeq per-element-type constructor names
Initial versions of this PR shared a single
BoxSeqconstructor across allSeq<T>field types. This collided inHeapParameterizationwhen a program had a composite withSeqfields of different element types: deduplication by constructor name kept one, and Core type-checking then failed on the other. Fixed by deriving a per-element-type tag (BoxSeq_int,BoxSeq_bool, …) matching the existing per-primitiveBoxInt/BoxBool/… approach.T22_MixedSeqFieldsregresses it.Out of scope
Array<T>forT ≠ int: rejected by diagnostic 4. Lifting this would require per-element-type$Arrayinjection (or similar) at the Laurel layer.Sequence.empty<T>()syntax in raw Core source — separate grammar-level design.Tests
StrataTest/Languages/Laurel/Examples/Fundamentals/T18_Sequences.lean,T19_Arrays.lean, andT22_MixedSeqFields.leancover:Sequence.*operations, contracts withrequires/ensures/opaque, quantifiers, nested sequences, aliasing, loops, inter-proceduralmodifies,Sequence.fromArraysnapshot semantics, one composite carrying twoSeqfields of different element types.Core-side tests for the bounds preconditions live in #1100.
Docs
docs/verso/LaurelDoc.leangains a# Sequences and Arrayssection covering literals, subscripts, operations,Array.length,Sequence.fromArraywith snapshot semantics, the verification-obligation treatment of OOB, a "Common mistakes" list tied to the five validator diagnostics, and internal representation.Design notes
Why
Subscriptis aStmtExprvariant, not aVariablevariant. After main'sVariablerefactor (#1077), write targets (Variable.Local,Variable.Field,Variable.Declare) and expression-position references are distinguished at the type level. A natural question is whethers[i]as a subscript should be aVariable.Subscriptvariant to unify read and write positions the wayVariable.Fielddoes.I considered this and kept
SubscriptonStmtExprfor now. The sticking point is thats[i := v]as an expression (functional update onSeq<T>) is a value, not a storage designator — it doesn't fit onVariable. A truly principled split would need three variants:StmtExpr.SubscriptRead,StmtExpr.SubscriptUpdate(functional), andVariable.Subscript(destructive). That's a larger refactor (one new arm in each of ~18 pre-existingVariable-matching files, plus an AST split) than is warranted here. Current design usesStmtExpr.Subscript target index (update : Option StmtExprMd)with position-based disambiguation: grammar translator lowersa[i] := vto.Subscript a i (some v)at statement position;SubscriptElim.elimExpr's.Blockarm detects statement-position subscripts before general recursion and rewrites accordingly.Dropped
skipCoreflag. An earlier iteration ofrunLaurelPassesreturned askipCore : Boolsignalling thatvalidateSubscriptUsagehad emitted a misuse diagnostic, to telltranslateWithLaurelto halt before Core translation. Main'stranslateWithLaurelnow has a generic "any non-Warning diagnostic → return early" check that subsumes this — the validator's diagnostics flow throughpassDiagsand trigger the same halt. Dropped the bespoke flag as part of the port.Review response
The most recent push addresses @MikaelMayer's three earlier comments:
subscriptElim's_modelparameter now carries a docstring explaining it is accepted to satisfyLaurelPass.run's signature but is intentionally unused (the caller's model predates$Arrayinjection; the pass rebuilds viaresolve).subscriptElimnow processesprogram.constantsalongsidetypesandstaticProcedures, so subscript access in a constant's initializer is eliminated too. Latent today (no grammar path produces such constants) but defensive.SeqOp.dataFieldrenamed toarrayDataFieldand moved out of theSeqOp.*namespace since it is a field name on the synthetic composite, not aSequence.*operation.