
feat: add Pascal / Delphi language support #183

Open
Cameron64 wants to merge 3 commits into Lum1104:main from Cameron64:feat/pascal-delphi-language-support


@Cameron64

Summary

Adds Pascal / Delphi (.pas, .pp, .dpr, .dpk, .inc) as a first-class language for Understand-Anything, backed by tree-sitter-pascal (Isopod/tree-sitter-pascal). Also makes two small additive extensions to StructuralAnalysis that benefit every language — particularly inheritance recovery, which is currently missing from the deterministic structural output across all extractors.

End-to-end validated against a 324-file legacy Delphi 7 codebase (19 MB / 461,452 lines): zero parse errors across the suite, and the resulting graph hits 6,113 nodes / 8,770 edges including 454 inherits edges, 7 implements edges, 1,944 imports edges (756 tagged interface-section vs 124 implementation-section), and 307 related edges (DFM filename pairings).

What's in the PR

New language module

  • packages/core/src/languages/configs/pascal.ts: language config
  • packages/core/src/languages/configs/index.ts: register pascalConfig
  • packages/core/src/plugins/extractors/pascal-extractor.ts: AST walker for unit / program / library modules. Extracts moduleName, declUses (interface- and implementation-section), declTypes → declClass / declIntf with ancestors, declProc / defProc with declArgs / typeref, declField, qualified-method names via genericDot, and a call graph via exprCall
  • packages/core/src/plugins/extractors/index.ts: register PascalExtractor
  • skills/understand/languages/pascal.md: LLM prompt snippet (concepts, import patterns, file patterns, common frameworks, example languageNotes)

Shared-type extensions (additive, all optional — no breaking changes)

In packages/core/src/types.ts:

StructuralAnalysis.classes[].parents?: string[]      // ancestor class names
StructuralAnalysis.classes[].interfaces?: string[]   // implemented interface names
StructuralAnalysis.imports[].section?: "interface" | "implementation"

These let every language extractor surface inheritance deterministically. Today, no extractor emits this on StructuralAnalysis, so the file-analyzer agent has to re-read source to recover parents, and the results are inconsistent across batches. On the validation codebase, switching to deterministic inheritance raised inherits edges from 140 to 454 (3.2×) and let the architecture-analyzer correctly classify reporting forms (38 → 201) instead of dumping them into a generic "Standalone" bucket.

skills/understand/extract-structure.mjs is updated to surface the new fields in its JSON output.

The other extractors are untouched and continue to work — the new fields are optional. Populating them for TypeScript / Python / Java / C# / etc. is a natural follow-up.
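To illustrate how a consumer could turn the new optional fields into edges without re-reading source, here is a minimal sketch. The interfaces below are a simplified, hypothetical mirror of the three added fields, not the real StructuralAnalysis type, and the `SketchEdge` shape and `inheritanceEdges` helper are assumptions made for illustration.

```typescript
// Hypothetical, simplified mirror of the fields this PR adds. The real
// StructuralAnalysis type in packages/core/src/types.ts has more members.
interface ClassInfoSketch {
  name: string;
  parents?: string[];     // ancestor class names
  interfaces?: string[];  // implemented interface names
}

interface ImportInfoSketch {
  source: string;
  section?: "interface" | "implementation";
}

interface StructuralAnalysisSketch {
  classes: ClassInfoSketch[];
  imports: ImportInfoSketch[];
}

// Assumed edge shape, for illustration only.
interface SketchEdge {
  kind: "inherits" | "implements";
  from: string;
  to: string;
}

// Derive inheritance edges purely from the structural output.
// Missing (undefined) fields simply contribute no edges.
function inheritanceEdges(analysis: StructuralAnalysisSketch): SketchEdge[] {
  const edges: SketchEdge[] = [];
  for (const cls of analysis.classes) {
    for (const parent of cls.parents ?? []) {
      edges.push({ kind: "inherits", from: cls.name, to: parent });
    }
    for (const iface of cls.interfaces ?? []) {
      edges.push({ kind: "implements", from: cls.name, to: iface });
    }
  }
  return edges;
}

const sampleAnalysis: StructuralAnalysisSketch = {
  classes: [
    { name: "TReportForm", parents: ["TForm"], interfaces: ["IPrintable"] },
    { name: "TPlain" }, // extractor that does not populate the new fields
  ],
  imports: [{ source: "SysUtils", section: "interface" }],
};

console.log(inheritanceEdges(sampleAnalysis));
```

Because both fields are optional, output from an extractor that has not been updated yields an empty edge list rather than an error.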

New post-merge helper scripts

  • skills/understand/resolve-external-class-refs.mjs: generic, useful for any language. When file-analyzer agents emit inheritance edges pointing at class:external:<Name> because they don't know which file declares the parent (it's in another batch), this pass rewrites the edge target to the actual node ID by class-name lookup, then re-adds the edge to the assembled graph. On the validation codebase this recovers 194 cross-batch inherits / implements edges the merge step would otherwise drop as dangling.
  • skills/understand/emit-dfm-pairs.mjs: Pascal-specific. Pascal forms come in paired .pas + .dfm files; this emits a related edge between each pair after merge.
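The class-name lookup performed by the external-ref pass can be sketched roughly as follows. This is not the code in resolve-external-class-refs.mjs: the node and edge shapes, the example node ID format, and the choice to leave ambiguous names unresolved are all assumptions for illustration, and the sketch only rewrites targets (it does not re-add edges to an assembled graph).

```typescript
// Hypothetical graph shapes; the real graph schema may differ.
interface SketchGraphNode { id: string; type: string; name: string }
interface SketchGraphEdge { source: string; target: string; type: string }

const EXTERNAL_PREFIX = "class:external:";

// Rewrite `class:external:<Name>` targets to real node IDs by class name.
function resolveExternalClassRefs(
  nodes: SketchGraphNode[],
  edges: SketchGraphEdge[],
): SketchGraphEdge[] {
  // Index class nodes by name. A name declared more than once is marked
  // null so it is never resolved (assumption: ambiguity is left alone).
  const idByName = new Map<string, string | null>();
  for (const node of nodes) {
    if (node.type !== "class") continue;
    idByName.set(node.name, idByName.has(node.name) ? null : node.id);
  }

  return edges.map((edge) => {
    if (!edge.target.startsWith(EXTERNAL_PREFIX)) return edge;
    const className = edge.target.slice(EXTERNAL_PREFIX.length);
    const resolved = idByName.get(className);
    // Unknown or ambiguous names keep their placeholder target.
    return resolved ? { ...edge, target: resolved } : edge;
  });
}

const demoNodes: SketchGraphNode[] = [
  { id: "class:src/Base.pas:TBase", type: "class", name: "TBase" },
];
const demoEdges: SketchGraphEdge[] = [
  { source: "class:src/Child.pas:TChild", target: "class:external:TBase", type: "inherits" },
  { source: "class:src/Child.pas:TChild", target: "class:external:TUnknown", type: "inherits" },
];
console.log(resolveExternalClassRefs(demoNodes, demoEdges));
```

The first edge is rewritten to point at the declaring node, while the second keeps its placeholder because no class named TUnknown exists in the graph.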

Test plan

  • Tree-sitter-pascal grammar (Isopod v0.10.2 built to wasm via tree-sitter build --wasm) parses real Delphi 7 source with zero error nodes across 324 sampled files
  • extract-structure.mjs runs cleanly end-to-end on 25-file batches; new parents / interfaces / section fields populate as expected
  • Full /understand pipeline produces a validated graph (0 issues, 0 warnings in inline validation) with all file-level nodes assigned to layers and all tour-step nodeIds resolving
  • Unit tests for PascalExtractor — not in this PR, happy to add in a follow-up commit if you'd like them before merge

Open question for the maintainer

Distribution of tree-sitter-pascal.wasm. There's no published npm package shipping a prebuilt wasm for Pascal — the existing tree-sitter-pascal@0.0.1 on npm ships only C source via nan native bindings. Options to discuss:

  • (a) Vendor the wasm into a small in-repo workspace package (packages/tree-sitter-pascal-wasm/ or similar). ~700 KB binary asset. Cleanest install experience.
  • (b) Add an optional dependency with a postinstall build step that runs tree-sitter build --wasm against the grammar source.
  • (c) Document a manual build step in the README — users install tree-sitter-cli, clone Isopod's grammar, build to wasm, and place it where require.resolve("tree-sitter-pascal/tree-sitter-pascal.wasm") finds it.

This PR intentionally leaves tree-sitter-pascal out of packages/core/package.json so existing CI / installs are unaffected. The TreeSitterPlugin's existing graceful-degradation path (catch + skip when the wasm doesn't resolve) means Pascal support is simply unavailable until a wasm is provided locally — same posture as if a user had no grammar at all. Once we agree on (a) / (b) / (c) I'm happy to add the dependency wiring in this PR or a follow-up.
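The catch-and-skip posture described above can be sketched as below. This is an assumption-laden illustration, not the TreeSitterPlugin's actual code: `tryResolveGrammar` and `availableLanguages` are hypothetical names, and the resolver is passed in as a parameter (in real code it would presumably be `require.resolve`) so the sketch runs without any grammar installed.

```typescript
// A resolver maps a module specifier to a path, or throws if it is missing.
// In real code this would likely be `require.resolve`.
type GrammarResolver = (specifier: string) => string;

// Return the resolved path, or null when the wasm cannot be found.
function tryResolveGrammar(
  resolve: GrammarResolver,
  specifier: string,
): string | null {
  try {
    return resolve(specifier);
  } catch {
    return null; // grammar not installed: degrade instead of failing
  }
}

// Keep only the languages whose grammar wasm actually resolves.
function availableLanguages(
  resolve: GrammarResolver,
  grammars: Record<string, string>,
): string[] {
  return Object.keys(grammars).filter(
    (lang) => tryResolveGrammar(resolve, grammars[lang]) !== null,
  );
}

// Stand-in resolver for the demo: pretends only one grammar is installed.
// The "installed-grammar" specifier is made up for this example.
const fakeResolve: GrammarResolver = (specifier) => {
  if (specifier === "installed-grammar/grammar.wasm") {
    return "/node_modules/" + specifier;
  }
  throw new Error("Cannot find module '" + specifier + "'");
};

console.log(
  availableLanguages(fakeResolve, {
    somelang: "installed-grammar/grammar.wasm",
    pascal: "tree-sitter-pascal/tree-sitter-pascal.wasm",
  }),
);
```

With this shape, a missing Pascal wasm removes Pascal from the available set and leaves every other language untouched.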

🤖 Generated with Claude Code

Cam Dowdle added 2 commits May 23, 2026 07:28
Introduces a tree-sitter-pascal-backed language module that produces the
same kind of deterministic structural data the other extractors already
provide (functions, classes, imports, exports, call graph) — validated end
to end against a 324-file legacy Delphi 7 codebase (19 MB / 461k lines).

New language module
- packages/core/src/languages/configs/pascal.ts — language config
  (extensions .pas/.pp/.dpr/.dpk/.inc, treeSitter→tree-sitter-pascal.wasm,
  concepts list, file patterns)
- packages/core/src/languages/configs/index.ts — register pascalConfig
- packages/core/src/plugins/extractors/pascal-extractor.ts — AST walker
  for unit / program / library modules: extracts moduleName, uses-clauses
  (interface- and implementation-section), declTypes (classes +
  interfaces), declProc/defProc (with declArgs/typeref), declField,
  call graph via exprCall, and qualified-method names via genericDot
- packages/core/src/plugins/extractors/index.ts — register PascalExtractor
- skills/understand/languages/pascal.md — LLM prompt snippet (key concepts,
  import patterns, file patterns, common frameworks, example
  languageNotes)

Shared-type extensions (additive, all optional — no breaking changes for
existing extractors)
- packages/core/src/types.ts:
    StructuralAnalysis.classes[].parents?: string[]
        Ancestor class names. Lets every language extractor surface
        inheritance deterministically (Pascal `class(TParent)`, Java
        `extends X`, Python `class X(Y)`, etc.) so the file-analyzer
        agent doesn't have to re-read source to recover the parent
    StructuralAnalysis.classes[].interfaces?: string[]
        Implemented interface names. For Pascal this is every typeref
        after the first under declClass; for Java `implements X, Y`
    StructuralAnalysis.imports[].section?: "interface" | "implementation"
        Section-scoped imports. Pascal-specific for now (other
        languages don't have section-scoped uses-clauses)
- skills/understand/extract-structure.mjs: pass through the new
  classes[].parents / classes[].interfaces / imports[].section fields
  to the file-analyzer JSON output

New post-merge helper scripts
- skills/understand/emit-dfm-pairs.mjs: Pascal-specific. After merge,
  scans for `file:*.pas` ↔ `file:*.dfm` filename pairs and emits a
  `related` edge between them — Pascal forms come in paired
  source/form-definition files and should be linked in the graph
- skills/understand/resolve-external-class-refs.mjs: generic, useful
  for any language. When file-analyzer agents emit inheritance edges
  pointing at `class:external:<Name>` (because they don't know which
  file declares the parent), this pass rewrites the target to the
  actual cross-batch node ID by class-name lookup. On the CW2 sample
  it recovers 194 cross-batch inherits/implements edges that the merge
  step would otherwise drop as dangling
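
The DFM pairing pass described above could look roughly like this sketch. It is not the code in emit-dfm-pairs.mjs: the `file:<path>` ID format follows the commit message, but the edge shape and the case-insensitive stem match are assumptions for illustration.

```typescript
// Hypothetical edge shape for the sketch.
interface RelatedEdgeSketch {
  source: string;
  target: string;
  type: "related";
}

// Pair each `file:*.pas` node with a `file:*.dfm` node of the same stem.
function emitDfmPairs(fileNodeIds: string[]): RelatedEdgeSketch[] {
  // Index .dfm files by their path without extension.
  // Assumption: stems are compared case-insensitively.
  const dfmByStem = new Map<string, string>();
  for (const id of fileNodeIds) {
    if (id.toLowerCase().endsWith(".dfm")) {
      dfmByStem.set(id.slice(0, -4).toLowerCase(), id);
    }
  }

  const edges: RelatedEdgeSketch[] = [];
  for (const id of fileNodeIds) {
    if (!id.toLowerCase().endsWith(".pas")) continue;
    const dfm = dfmByStem.get(id.slice(0, -4).toLowerCase());
    if (dfm) edges.push({ source: id, target: dfm, type: "related" });
  }
  return edges;
}

console.log(
  emitDfmPairs([
    "file:src/MainForm.pas",
    "file:src/MainForm.dfm",
    "file:src/Utils.pas", // no matching .dfm, so no edge
  ]),
);
```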

End-to-end validation
- tree-sitter-pascal grammar (Isopod/tree-sitter-pascal v0.10.2, built
  to wasm via `tree-sitter build --wasm`) parses real CW2 Delphi 7
  source with zero error nodes (sampled across data modules, forms,
  and SOAP/WSDL stubs — including a 4424-line / 243 KB data module)
- Full pipeline on 324 Pascal files produced 6,113 nodes / 8,770 edges
  including 454 `inherits` + 7 `implements` + 1,944 `imports` (with
  756 tagged `interface`-section vs 124 `implementation`-section) +
  307 `related` (DFM pairings) — a 3.2× jump in inheritance edges
  over a v1 run that lacked the new shared-type fields
- The "untyped forms" bucket in the architecture-analyzer's layer
  output shrank from 218 to 186 once inheritance became deterministic

Open question for the maintainer
- Distribution of `tree-sitter-pascal.wasm`: there is no published npm
  package shipping a prebuilt wasm. Options to discuss: (a) vendor the
  wasm into a small workspace package, (b) add an optionalDependency
  with a postinstall build step, (c) document a manual `tree-sitter
  build --wasm` step. This PR leaves the dependency out of
  packages/core/package.json so existing CI / installs are unaffected;
  the TreeSitterPlugin's existing graceful-degradation path means
  Pascal support is simply unavailable until a wasm is provided
… TS+Python

Follow-up to the Pascal/Delphi language-support PR, pre-empting two
expected review asks:

PascalExtractor unit test suite
- New `packages/core/src/plugins/extractors/__tests__/pascal-extractor.test.ts`
  with 13 tests covering: import section-tagging (interface vs
  implementation vs untagged for .dpr), class inheritance (single
  parent, parent + multi-interface split, pure interface inheritance,
  ancestor-less classes), procedure/function extraction (params,
  return type, qualified method names), and call graph (both
  parenthesized and bare-identifier procedure calls).
- The suite skips cleanly (with a debug warning) if the
  tree-sitter-pascal grammar isn't installed in node_modules, matching
  the PR's open question about wasm distribution. Skip is evaluated
  at collection time via top-level await — `describe.skipIf` doesn't
  see `beforeAll` mutations because collection runs first.

Populate `parents` / `interfaces` for two other extractors
- python-extractor: pulls Python `class X(Y, Z)` bases out of the
  superclasses field. Python has no syntactic class/interface
  distinction, so everything lands in `parents` and `interfaces`
  stays undefined. Keyword-args like `metaclass=Meta` are skipped.
- typescript-extractor: walks `class_heritage`, putting
  `extends_clause` types in `parents` and `implements_clause` types
  in `interfaces`. Handles `identifier`, `type_identifier`,
  `generic_type`, `member_expression`, and `nested_type_identifier`.

Bug fix in PascalExtractor.extractCallGraph
- Pascal allows bare-identifier procedure calls (`Foo;` instead of
  `Foo();`). Tree-sitter parses these as `(statement (identifier))`
  rather than `exprCall`. The extractor now records both shapes,
  filtered to skip non-call statements (assignments, binary
  expressions, statements with nested calls).

All 241 existing extractor tests still pass; Pascal suite passes 13/13
when grammar is available.

Cameron64 commented May 23, 2026

Pre-empting two likely review asks, I pushed two follow-up commits (9b7e663, 636b1da). They also include one bug fix:

1. PascalExtractor unit tests — 13 tests covering import section-tagging, class inheritance (single parent / parent + multi-interface / pure interface inheritance / ancestor-less), procedure/function extraction (params, return type, qualified Class.Method names), and call graph. Skips cleanly via top-level await when tree-sitter-pascal.wasm isn't installed (the open distribution question), so contributors without the grammar can still run the rest of the suite.

2. parents / interfaces populated for every built-in extractor — to demonstrate the shared-type shape works for the full language matrix, not just Pascal:

  • TypeScript (class_heritage → extends_clause / implements_clause): class Cat extends Animal implements Feline, Pettable → parents=["Animal"], interfaces=["Feline","Pettable"]
  • Python (superclasses field): class Dog(Animal, Trainable): → parents=["Animal","Trainable"] (Python has no syntactic class/interface distinction, so all bases go in parents; metaclass= keyword args skipped)
  • Java (superclass field + interfaces field; interface_declaration uses extends_interfaces): full split between parents and interfaces, plus interface-extends-interface → parents
  • C# (base_list child node): applies the standard C# I-prefix convention via splitCSharpBaseRefs — if the first entry's bare name starts with I[A-Z] everything goes to interfaces; otherwise the first is the class parent and the rest are interfaces. Matches IDE outline behavior.
  • C++ (base_class_clause child): class Cat : public Animal, protected Pettable → parents=["Animal","Pettable"] (C++ has no syntactic interface distinction)
  • Go: no class inheritance, but embedded fields in structs (type T struct { Inner; *Other }) promote the embedded type's methods. Surfaced in parents since that's the closest Go semantic.
  • Ruby: class Cat < Animal (superclass field) → parents; include Mod / prepend Mod / extend Mod mixins at class body top level → interfaces (they contribute methods at runtime, semantically like interface implementation)
  • PHP (base_clause for extends, class_interface_clause for implements): clean split. Interface-extends-interface → parents.
  • Rust: built a traitsByType map during impl_item walking. impl Trait for Type → interfaces on the matching struct/enum. trait Foo: Bar + Baz supertrait bounds → parents on the trait.
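
The C# I-prefix rule from the list above is the least obvious of these splits, so here is a minimal sketch of it. It follows the rule as stated (first entry's bare name matching I[A-Z] means everything is an interface), but it is not the PR's splitCSharpBaseRefs: the function names and the string-based handling of qualified and generic names are assumptions.

```typescript
interface BaseSplitSketch {
  parents: string[];
  interfaces: string[];
}

// Strip generic arguments and namespace qualification:
// "System.Collections.Generic.IList<Foo>" -> "IList"
function bareTypeName(ref: string): string {
  const withoutGenerics = ref.split("<")[0];
  const segments = withoutGenerics.split(".");
  return segments[segments.length - 1].trim();
}

// Apply the I-prefix naming convention to a C# base list.
function splitBaseRefsByIPrefix(refs: string[]): BaseSplitSketch {
  if (refs.length === 0) return { parents: [], interfaces: [] };
  // "I" followed by an uppercase letter: treat every entry as an interface.
  if (/^I[A-Z]/.test(bareTypeName(refs[0]))) {
    return { parents: [], interfaces: refs.slice() };
  }
  // Otherwise the first entry is the class parent, the rest are interfaces.
  return { parents: [refs[0]], interfaces: refs.slice(1) };
}

console.log(splitBaseRefsByIPrefix(["Animal", "IPettable"]));
console.log(splitBaseRefsByIPrefix(["System.IDisposable", "IComparable<Cat>"]));
console.log(splitBaseRefsByIPrefix(["Image"])); // "Im..." is not I[A-Z]
```

The heuristic can misclassify a class whose name happens to start with I plus an uppercase letter, which is the usual trade-off of relying on a naming convention rather than type resolution.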

3. Fixed a real bug in PascalExtractor.extractCallGraph — Pascal allows bare-identifier procedure calls (Foo; instead of Foo();), which tree-sitter parses as (statement (identifier)) rather than exprCall. The extractor now records both shapes, filtered to skip non-call statements (assignments, binary expressions, statements with nested calls). Test coverage in the new suite.
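
The two call shapes can be illustrated with a small tree walk. This is a sketch over a hand-built, simplified node type rather than real tree-sitter nodes, and it is not the PR's extractCallGraph: `SyntaxNodeLike`, `collectCallees`, the "callee is the first child of exprCall" rule, and the omission of anonymous tokens such as `;` are all assumptions.

```typescript
// Simplified stand-in for a syntax node (named children only).
interface SyntaxNodeLike {
  type: string;
  text: string;
  children: SyntaxNodeLike[];
}

// Collect callee names from both `Foo();` and bare `Foo;` shapes.
function collectCallees(root: SyntaxNodeLike): string[] {
  const callees: string[] = [];
  const visit = (node: SyntaxNodeLike): void => {
    if (node.type === "exprCall" && node.children.length > 0) {
      // Parenthesized call. Assumption: the callee is the first child.
      callees.push(node.children[0].text);
    } else if (
      node.type === "statement" &&
      node.children.length === 1 &&
      node.children[0].type === "identifier"
    ) {
      // Bare-identifier call: (statement (identifier)). Statements that
      // wrap assignments, binary expressions, or nested calls do not match.
      callees.push(node.children[0].text);
    }
    for (const child of node.children) visit(child);
  };
  visit(root);
  return callees;
}

// Tiny builder for the demo tree.
const mk = (
  type: string,
  text: string,
  children: SyntaxNodeLike[] = [],
): SyntaxNodeLike => ({ type, text, children });

// Roughly: begin Foo; Bar(); X := Y; end
const demoTree = mk("block", "", [
  mk("statement", "Foo;", [mk("identifier", "Foo")]),
  mk("statement", "Bar();", [mk("exprCall", "Bar()", [mk("identifier", "Bar")])]),
  mk("statement", "X := Y;", [
    mk("assignment", "X := Y", [mk("identifier", "X"), mk("identifier", "Y")]),
  ]),
]);

console.log(collectCallees(demoTree));
```

Only the first two statements contribute callees; the assignment is skipped because its statement node does not wrap a lone identifier.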

All 241 existing extractor tests still pass; new Pascal suite is 13/13 when the grammar is available; smoke-tested every extractor against representative source for each language.

Completes the shared-type extension introduced in the previous commits.
With this commit, every built-in extractor now surfaces inheritance
deterministically, so the file-analyzer agent no longer has to re-read
source to emit `inherits` / `implements` edges in any language.

Java (`packages/core/src/plugins/extractors/java-extractor.ts`)
- class_declaration: `superclass` field → parents; `interfaces` field
  (which wraps a super_interfaces node) → interfaces
- interface_declaration: `extends_interfaces` child node → parents
  (interface inheritance, not implementation)
- Shared helper `extractTypeRefs()` walks the wrapper node and pulls
  out `type_identifier` / `scoped_type_identifier` / `generic_type`
  (peeling the type_identifier out of generic_type when present)

C# (`packages/core/src/plugins/extractors/csharp-extractor.ts`)
- class_declaration: `base_list` child holds the colon-separated
  parent + interfaces list (C# has no syntactic distinction, just the
  I-prefix naming convention). Apply the convention via
  `splitCSharpBaseRefs`: if the first entry's bare type name starts
  with `I[A-Z]`, treat every entry as an interface; otherwise the
  first is the class parent and the rest are interfaces. This matches
  what every C# IDE does for outline/symbol classification.
- interface_declaration: every base_list entry is itself an interface
  parent → parents
- `extractBaseListRefs()` handles `identifier`, `qualified_name`,
  `predefined_type`, `generic_name`

C++ (`packages/core/src/plugins/extractors/cpp-extractor.ts`)
- class_specifier / struct_specifier: child node `base_class_clause`
  holds the `: public Foo, protected Bar` list. C++ has no
  syntactic interface concept (abstract classes look the same), so
  every base lands in `parents`
- Handles `type_identifier`, `qualified_identifier`, `template_type`

Go (`packages/core/src/plugins/extractors/go-extractor.ts`)
- Go has no class inheritance, but embedded fields in structs
  (`type T struct { Inner; *Other }`) promote the embedded type's
  methods — the closest Go has to inheritance. An embedded field is
  a `field_declaration` with no `field_identifier`; the type itself
  is the field name. Surface those in `parents`. For pointer-embed
  (`*Foo`) strip the `*`; for qualified-embed (`pkg.Foo`) keep the
  full ref.

Ruby (`packages/core/src/plugins/extractors/ruby-extractor.ts`)
- class declarations: `superclass` field → parents
- module mixins (`include Mod`, `prepend Mod`, `extend Mod` at the
  top level of the class body) → interfaces — they contribute
  methods at runtime, semantically like interface implementation

PHP (`packages/core/src/plugins/extractors/php-extractor.ts`)
- class_declaration: `base_clause` child (extends) → parents;
  `class_interface_clause` child (implements) → interfaces
- interface_declaration: `base_clause` (extends) → parents
  (interface inheritance)
- `extractPhpTypeRefs()` walks the clause's `name` / `qualified_name`
  children

Rust (`packages/core/src/plugins/extractors/rust-extractor.ts`)
- Rust has no class inheritance, but `impl Trait for Type` declares
  that `Type` implements `Trait`. Built a `traitsByType` map during
  impl walking, then attached it as `interfaces` on the matching
  struct/enum during the final pass (parallel to the existing
  methodsByType pattern)
- trait_item: `bounds` field holds supertrait bounds (e.g.
  `trait Foo: Bar + Baz`) → parents (direct trait inheritance)
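
The two-pass traitsByType pattern described above can be sketched as follows, written in TypeScript since that is the extractor's implementation language. The record shapes and the `attachTraitImpls` name are hypothetical; the real extractor works on tree-sitter nodes rather than pre-extracted records.

```typescript
// Hypothetical records: one per `impl Trait for Type` block, and one per
// struct/enum declaration found in the same file.
interface ImplRecordSketch { traitName: string; typeName: string }
interface TypeDeclSketch { name: string; interfaces?: string[] }

function attachTraitImpls(
  types: TypeDeclSketch[],
  impls: ImplRecordSketch[],
): TypeDeclSketch[] {
  // Pass 1: group trait names by the type they are implemented for.
  const traitsByType = new Map<string, string[]>();
  for (const impl of impls) {
    const traits = traitsByType.get(impl.typeName) ?? [];
    traits.push(impl.traitName);
    traitsByType.set(impl.typeName, traits);
  }
  // Pass 2: attach the collected traits as `interfaces` on matching types.
  // Types with no impls are returned unchanged (field stays undefined).
  return types.map((decl) => {
    const traits = traitsByType.get(decl.name);
    return traits ? { ...decl, interfaces: traits } : decl;
  });
}

console.log(
  attachTraitImpls(
    [{ name: "Cat" }, { name: "Rock" }],
    [
      { traitName: "Animal", typeName: "Cat" },
      { traitName: "Display", typeName: "Cat" },
    ],
  ),
);
```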

All 241 existing extractor tests still pass; smoke-tested all 9
extractors against representative source for each language.