Skip to content

Refactor: Introduce intermediate JSON-matching AST for cleaner parser/translation separation #21

@charles-cooper

Description

@charles-cooper

Summary

The current vyperTestLib.sml mixes two concerns:

  1. Parsing JSON into HOL terms
  2. Making semantic decisions to fit the target vyperAST

This issue tracks refactoring to introduce an intermediate AST that mirrors the JSON structure, with a separate HOL translation to vyperAST.

Problem

Examples of semantic work currently buried in the parser:

  • Computing loop bounds from array types
  • Constant folding (target AST assumes constants folded)
  • Recognizing msg.sender from nested JSON structure
  • Deducing String vs Bytes bound from type field

Proposed Solution

JSON  →  jsonASTScript.sml (dumb 1:1 mapping)
              ↓
         jsonToVyperScript.sml (HOL translation, all semantic decisions)
              ↓
         vyperAST (clean, elaborated)

Agent-Optimized Implementation Plan

The following plan is written for an agent to execute. It contains detailed AST sketches, translation examples, and implementation phases.


# Implementation Plan: Intermediate JSON-matching AST

## Motivation

Currently `vyperTestLib.sml` does two things at once:
1. Parses JSON into HOL terms
2. Makes semantic decisions to fit the target `vyperAST`

Examples of semantic work in the parser:
- Computing loop bounds from array types
- Constant folding (target AST assumes constants folded)
- Recognizing `msg.sender` from nested JSON structure and emitting `Builtin (Env Sender) []`
- Deducing `String` vs `Bytes` bound from type field

This mixes concerns. The proposal: introduce an intermediate AST that mirrors JSON structure, then translate to `vyperAST` in HOL.

```
JSON  →  jsonASTScript.sml (dumb 1:1 mapping)

         jsonToVyperScript.sml (HOL translation, all semantic decisions)

         vyperAST (clean, elaborated)
```

## Intermediate AST Sketch

File: `jsonASTScript.sml`

```sml
Theory jsonAST
Ancestors
  string words

(* ===== Type Information ===== *)
(* Mirrors the "type" field in JSON *)

Datatype:
  json_typeclass
  = TC_integer
  | TC_bytes_m
  | TC_static_array
  | TC_dynamic_array
  | TC_struct
  | TC_flag
  | TC_tuple
  | TC_hashmap
  | TC_interface
  | TC_contract_function
  | TC_builtin_function
  | TC_module
  | TC_other string  (* catch-all *)
End

Datatype:
  json_type
  = JT_Named string                           (* name field only, e.g. "bool", "address" *)
  | JT_Integer num bool                       (* bits, is_signed *)
  | JT_BytesM num                             (* m for bytes1..bytes32 *)
  | JT_String num                             (* length *)
  | JT_Bytes num                              (* length *)
  | JT_StaticArray json_type num              (* value_type, length *)
  | JT_DynArray json_type num                 (* value_type, length *)
  | JT_Struct string                          (* name *)
  | JT_Flag string                            (* name *)
  | JT_Tuple (json_type list)                 (* member_types *)
  | JT_HashMap json_type json_type            (* key_type, value_type *)
  | JT_None                                   (* null type *)
End

(* ===== Binary/Unary Operators ===== *)
(* Direct mirror of ast_type for operators *)

Datatype:
  json_binop
  = JBop_Add | JBop_Sub | JBop_Mult | JBop_Div | JBop_FloorDiv
  | JBop_Mod | JBop_Pow
  | JBop_And | JBop_Or | JBop_BitAnd | JBop_BitOr | JBop_BitXor
  | JBop_LShift | JBop_RShift
  | JBop_Eq | JBop_NotEq | JBop_Lt | JBop_LtE | JBop_Gt | JBop_GtE
  | JBop_In | JBop_NotIn
End

Datatype:
  json_unaryop = JUop_USub | JUop_Not | JUop_Invert
End

Datatype:
  json_boolop = JBoolop_And | JBoolop_Or
End

(* ===== Expressions ===== *)
(* Only keep type info where needed for translation *)

Datatype:
  json_expr
  (* Literals - need type for int bounds, string/bytes length *)
  = JE_Int int json_type                               (* value, type for bounds *)
  | JE_Decimal string                                  (* value as string *)
  | JE_Str num string                                  (* length, value *)
  | JE_Bytes num string                                (* length, hex value *)
  | JE_Hex string                                      (* hex value for fixed bytes *)
  | JE_Bool bool                                       (* True/False *)

  (* Variables and access *)
  | JE_Name string                                     (* id *)
  | JE_Attribute json_expr string                      (* value, attr *)
  | JE_Subscript json_expr json_expr                   (* value, slice *)

  (* Operators *)
  | JE_BinOp json_expr json_binop json_expr            (* left, op, right *)
  | JE_BoolOp json_boolop (json_expr list)             (* op, values *)
  | JE_UnaryOp json_unaryop json_expr                  (* op, operand *)
  | JE_IfExp json_expr json_expr json_expr             (* test, body, orelse *)

  (* Compound - need type for array element type, tuple handling *)
  | JE_Tuple (json_expr list)                          (* elements *)
  | JE_List (json_expr list) json_type                 (* elements, type needed for elem type *)

  (* Calls - need type for builtins like concat/slice that embed return length *)
  | JE_Call json_expr (json_expr list) (json_keyword list) json_type
End
;
  json_keyword = JKeyword string json_expr             (* arg, value *)
End

(* ===== Statements ===== *)

Datatype:
  json_stmt
  = JS_Pass
  | JS_Break
  | JS_Continue
  | JS_Expr json_expr
  | JS_Return (json_expr option)
  | JS_Raise (json_expr option)
  | JS_Assert json_expr (json_expr option)             (* test, msg *)
  | JS_Log string (json_expr list)                     (* event name, args *)
  | JS_If json_expr (json_stmt list) (json_stmt list)  (* test, body, orelse *)
  | JS_For string json_type json_iter (json_stmt list) (* var, var_type, iter, body *)
  | JS_Assign json_expr json_expr                      (* target, value *)
  | JS_AnnAssign string json_type json_expr            (* var name, type, value *)
  | JS_AugAssign json_expr json_binop json_expr        (* target, op, value *)
  | JS_Append json_expr json_expr                      (* target, value *)
;
  (* Iterator - need type info for computing bounds *)
  json_iter
  = JIter_Range (json_expr list) (num option)          (* args, explicit bound if given *)
  | JIter_Array json_expr json_type                    (* array expr, array type for bound *)
End

(* ===== Top-level Declarations ===== *)

Datatype:
  json_decorator = JDec string                         (* decorator name: external, internal, view, etc *)
End

Datatype:
  json_arg = JArg string json_type                     (* arg name, type *)
End

Datatype:
  json_func_type = JFuncType (json_type list) json_type  (* argument_types, return_type *)
End

Datatype:
  json_toplevel
  = JTL_FunctionDef string (json_decorator list) (json_arg list) json_func_type (json_stmt list)
      (* name, decorators, args, func_type, body *)
  | JTL_VariableDecl string json_type bool bool bool (json_expr option)
      (* name, type, is_public, is_immutable, is_transient, value (for constants) *)
  | JTL_EventDef string (json_arg list)
  | JTL_StructDef string (json_arg list)
  | JTL_FlagDef string (string list)                   (* name, member names *)
  | JTL_InterfaceDef string                            (* ignored but parsed *)
End

(* ===== Module ===== *)

Datatype:
  json_module = JModule (json_toplevel list)
End
```

## Translation to vyperAST

File: `jsonToVyperScript.sml`

This is where semantic decisions live:

```sml
Theory jsonToVyper
Ancestors
  jsonAST vyperAST

(* ===== Type Translation ===== *)

Definition translate_type_def:
  translate_type (JT_Named "bool") = BaseT BoolT ∧
  translate_type (JT_Named "address") = BaseT AddressT ∧
  translate_type (JT_Named "decimal") = BaseT DecimalT ∧
  translate_type (JT_Integer bits F) = BaseT (UintT bits) ∧
  translate_type (JT_Integer bits T) = BaseT (IntT bits) ∧
  translate_type (JT_BytesM m) = BaseT (BytesT (Fixed m)) ∧
  translate_type (JT_String n) = BaseT (StringT n) ∧
  translate_type (JT_Bytes n) = BaseT (BytesT (Dynamic n)) ∧
  translate_type (JT_StaticArray vt len) = ArrayT (translate_type vt) (Fixed len) ∧
  translate_type (JT_DynArray vt len) = ArrayT (translate_type vt) (Dynamic len) ∧
  translate_type (JT_Struct name) = StructT name ∧
  translate_type (JT_Flag name) = FlagT name ∧
  translate_type (JT_Tuple tys) = TupleT (MAP translate_type tys) ∧
  translate_type JT_None = NoneT
Termination
  ...
End

(* ===== Expression Translation ===== *)
(* This is where msg.sender recognition etc. happens *)

Definition translate_expr_def:
  (* msg.sender → Builtin (Env Sender) [] *)
  translate_expr (JE_Attribute (JE_Name "msg") "sender") =
    Builtin (Env Sender) [] ∧

  (* msg.value → Builtin (Env ValueSent) [] *)
  translate_expr (JE_Attribute (JE_Name "msg") "value") =
    Builtin (Env ValueSent) [] ∧

  (* block.timestamp → Builtin (Env TimeStamp) [] *)
  translate_expr (JE_Attribute (JE_Name "block") "timestamp") =
    Builtin (Env TimeStamp) [] ∧

  (* block.number → Builtin (Env BlockNumber) [] *)
  translate_expr (JE_Attribute (JE_Name "block") "number") =
    Builtin (Env BlockNumber) [] ∧

  (* self.x → TopLevelName x *)
  translate_expr (JE_Attribute (JE_Name "self") attr) =
    TopLevelName attr ∧

  (* General attribute: e.attr *)
  translate_expr (JE_Attribute e attr) =
    Attribute (translate_expr e) attr ∧

  (* Name *)
  translate_expr (JE_Name id) = Name id ∧

  (* Integer literal *)
  translate_expr (JE_Int v ty) =
    Literal (IntL (int_bound_of_type ty) v) ∧

  (* ... etc ... *)
Termination
  ...
End

(* ===== Statement Translation ===== *)
(* Loop bounds computed here from iterator type *)

Definition get_iter_bound_def:
  (* Range with explicit bound *)
  get_iter_bound (JIter_Range args (SOME n)) = n ∧
  (* Range without bound: end - start *)
  get_iter_bound (JIter_Range [e] NONE) = ... ∧   (* 0 to e *)
  get_iter_bound (JIter_Range [s;e] NONE) = ... ∧ (* s to e *)
  (* Array: get length from type *)
  get_iter_bound (JIter_Array _ (JT_StaticArray _ len)) = len ∧
  get_iter_bound (JIter_Array _ (JT_DynArray _ len)) = len
End

Definition translate_iter_def:
  translate_iter var_ty (JIter_Range args _) =
    Range (translate_range_start var_ty args) (translate_range_end var_ty args) ∧
  translate_iter var_ty (JIter_Array e _) =
    Array (translate_expr e)
End

Definition translate_stmt_def:
  translate_stmt JS_Pass = Pass ∧
  translate_stmt JS_Break = Break ∧
  translate_stmt JS_Continue = Continue ∧

  translate_stmt (JS_For var ty iter body) =
    For var (translate_type ty) (translate_iter ty iter)
        (get_iter_bound iter) (MAP translate_stmt body) ∧

  translate_stmt (JS_Append tgt e) =
    Append (translate_base_target tgt) (translate_expr e) ∧

  (* ... etc ... *)
End
```

## Implementation Steps

### Phase 1: Define jsonAST (1-2 days)
1. Create `jsonASTScript.sml` with datatypes above
2. Keep it simple - mirror JSON structure exactly
3. Add `cv_auto_trans` for all types
4. Build and verify it compiles

### Phase 2: SML JSON Parser (2-3 days)
1. Create `jsonASTLib.sml` - simplified parser that just maps JSON → jsonAST terms
2. Much simpler than current `vyperTestLib.sml` - no semantic decisions
3. Each JSON `ast_type` maps to one constructor
4. Type field parsing is mechanical

### Phase 3: HOL Translation (3-4 days)
1. Create `jsonToVyperScript.sml`
2. `translate_type` - straightforward
3. `translate_expr` - this is where `msg.sender` etc. recognition goes
4. `translate_stmt` - loop bound computation here
5. `translate_toplevel`
6. Prove termination for recursive functions

### Phase 4: Integration (1-2 days)
1. Modify `vyperTestLib.sml` to use new pipeline:
   - Parse JSON → jsonAST term
   - Apply `translate_module` (via cv_compute or similar)
   - Result is vyperAST term
2. Verify all tests still pass

### Phase 5: Cleanup (1 day)
1. Remove old parsing code from `vyperTestLib.sml`
2. Document the new architecture
3. Update CLAUDE.md/AGENTS.md

## Files to Create

| File | Purpose |
|------|---------|
| `jsonASTScript.sml` | Intermediate AST datatypes |
| `jsonASTSyntax.sml` | SML syntax functions (auto-generated mostly) |
| `jsonASTLib.sml` | JSON → jsonAST parser (SML) |
| `jsonToVyperScript.sml` | jsonAST → vyperAST translation (HOL) |

## Design Decisions

1. **Trace/test infrastructure**: Keep as-is. The existing trace translation (call, deployment) is already minimal - no need for intermediate AST there.

2. **Granularity**: Only include fields we actually use. Skip `node_id`, `variable_reads`, `variable_writes`, `source_id`, etc. The intermediate AST is not a complete JSON mirror, just what's needed for translation.

3. **Folded values**: Use `folded_value` when available (in SML parser), don't represent both in intermediate AST. The intermediate AST holds the value we'll use.

4. **Error handling**: Catch malformed JSON in SML parser. Fail fast. The HOL translation assumes well-formed input.

## Benefits When Complete

1. **SML parser becomes trivial** - just field access, no logic
2. **Semantic decisions documented in HOL** - inspectable, modifiable
3. **JSON format explicitly specified** - jsonAST documents what we expect
4. **Easier to extend** - new JSON construct = new constructor + translation clause
5. **Potential for proving translation correct** - if ever desired

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions