feat: add Pascal / Delphi language support #183
Introduces a tree-sitter-pascal-backed language module that produces the
same kind of deterministic structural data the other extractors already
provide (functions, classes, imports, exports, call graph) — validated end
to end against a 324-file legacy Delphi 7 codebase (19 MB / 461k lines).
New language module
- packages/core/src/languages/configs/pascal.ts — language config
(extensions .pas/.pp/.dpr/.dpk/.inc, treeSitter→tree-sitter-pascal.wasm,
concepts list, file patterns)
- packages/core/src/languages/configs/index.ts — register pascalConfig
- packages/core/src/plugins/extractors/pascal-extractor.ts — AST walker
for unit / program / library modules: extracts moduleName, uses-clauses
(interface- and implementation-section), declTypes (classes +
interfaces), declProc/defProc (with declArgs/typeref), declField,
call graph via exprCall, and qualified-method names via genericDot
- packages/core/src/plugins/extractors/index.ts — register PascalExtractor
- skills/understand/languages/pascal.md — LLM prompt snippet (key concepts,
import patterns, file patterns, common frameworks, example
languageNotes)
Shared-type extensions (additive, all optional — no breaking changes for
existing extractors)
- packages/core/src/types.ts:
StructuralAnalysis.classes[].parents?: string[]
Ancestor class names. Lets every language extractor surface
inheritance deterministically (Pascal `class(TParent)`, Java
`extends X`, Python `class X(Y)`, etc.) so the file-analyzer
agent doesn't have to re-read source to recover the parent
StructuralAnalysis.classes[].interfaces?: string[]
Implemented interface names. For Pascal this is every typeref
after the first under declClass; for Java `implements X, Y`
StructuralAnalysis.imports[].section?: "interface" | "implementation"
Section-scoped imports. Pascal-specific for now (other
languages don't have section-scoped uses-clauses)
- skills/understand/extract-structure.mjs: pass through the new
classes[].parents / classes[].interfaces / imports[].section fields
to the file-analyzer JSON output
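The additive nature of these fields can be illustrated with a minimal sketch. Only `parents`, `interfaces`, and `section` come from this commit; the trimmed-down interfaces, the `name` / `source` fields, and the `inheritanceEdges` helper are assumptions made for illustration and are not the real contents of `types.ts`.

```typescript
// Hypothetical, trimmed-down shapes. Only `parents`, `interfaces`, and
// `section` are taken from the commit message; everything else is assumed.
interface ClassInfo {
  name: string;
  parents?: string[]; // ancestor class names
  interfaces?: string[]; // implemented interface names
}

interface ImportInfo {
  source: string;
  section?: "interface" | "implementation"; // Pascal-specific for now
}

interface StructuralAnalysis {
  classes: ClassInfo[];
  imports: ImportInfo[];
}

type InheritanceEdge = [from: string, kind: "inherits" | "implements", to: string];

// Turn the optional fields into edge tuples. Extractors that never set the
// fields simply contribute no edges, which is why the change is non-breaking.
function inheritanceEdges(analysis: StructuralAnalysis): InheritanceEdge[] {
  const edges: InheritanceEdge[] = [];
  for (const cls of analysis.classes) {
    for (const parent of cls.parents ?? []) edges.push([cls.name, "inherits", parent]);
    for (const iface of cls.interfaces ?? []) edges.push([cls.name, "implements", iface]);
  }
  return edges;
}

// Output of an extractor that predates the new fields.
const legacy: StructuralAnalysis = {
  classes: [{ name: "Old" }],
  imports: [{ source: "SysUtils" }],
};

// Output shaped like what a Pascal `class(TForm, ILogger)` might produce.
const pascal: StructuralAnalysis = {
  classes: [{ name: "TMainForm", parents: ["TForm"], interfaces: ["ILogger"] }],
  imports: [{ source: "SysUtils", section: "interface" }],
};
```

Because every new field is optional, `legacy` type-checks unchanged and yields no edges, while `pascal` yields one `inherits` and one `implements` tuple.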
New post-merge helper scripts
- skills/understand/emit-dfm-pairs.mjs: Pascal-specific. After merge,
scans for `file:*.pas` ↔ `file:*.dfm` filename pairs and emits a
`related` edge between them — Pascal forms come in paired
source/form-definition files and should be linked in the graph
- skills/understand/resolve-external-class-refs.mjs: generic, useful
for any language. When file-analyzer agents emit inheritance edges
pointing at `class:external:<Name>` (because they don't know which
file declares the parent), this pass rewrites the target to the
actual cross-batch node ID by class-name lookup. On the CW2 sample
it recovers 194 cross-batch inherits/implements edges that the merge
step would otherwise drop as dangling
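The two post-merge passes described above can be sketched roughly as follows. The `GraphNode` / `GraphEdge` shapes and the example node IDs are hypothetical, and the rule of leaving a reference untouched when zero or several classes share the name is an assumption of this sketch; the real scripts may resolve ambiguity differently.

```typescript
// Hypothetical graph shapes; the real node/edge schema may differ.
interface GraphNode { id: string; type: string; name: string }
interface GraphEdge { source: string; target: string; type: string }

const EXTERNAL_PREFIX = "class:external:";

// Rewrite `class:external:<Name>` targets to a concrete class node ID.
// Assumption: only rewrite when exactly one class carries that name.
function resolveExternalClassRefs(nodes: GraphNode[], edges: GraphEdge[]): GraphEdge[] {
  const idsByName = new Map<string, string[]>();
  for (const node of nodes) {
    if (node.type !== "class") continue;
    const ids = idsByName.get(node.name) ?? [];
    ids.push(node.id);
    idsByName.set(node.name, ids);
  }
  return edges.map((edge) => {
    if (!edge.target.startsWith(EXTERNAL_PREFIX)) return edge;
    const candidates = idsByName.get(edge.target.slice(EXTERNAL_PREFIX.length)) ?? [];
    const only = candidates.length === 1 ? candidates[0] : undefined;
    return only === undefined ? edge : { ...edge, target: only };
  });
}

// Emit one `related` edge for every `file:X.pas` that has a sibling `file:X.dfm`.
function emitDfmPairs(fileIds: string[]): GraphEdge[] {
  const known = new Set(fileIds);
  const edges: GraphEdge[] = [];
  for (const id of fileIds) {
    if (!id.startsWith("file:") || !id.endsWith(".pas")) continue;
    const dfmId = id.slice(0, -".pas".length) + ".dfm";
    if (known.has(dfmId)) edges.push({ source: id, target: dfmId, type: "related" });
  }
  return edges;
}
```

Running the resolver after merge rather than per batch is what makes the name lookup possible, since only then are class nodes from every batch visible at once.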
End-to-end validation
- tree-sitter-pascal grammar (Isopod/tree-sitter-pascal v0.10.2, built
to wasm via `tree-sitter build --wasm`) parses real CW2 Delphi 7
source with zero error nodes (sampled across data modules, forms,
and SOAP/WSDL stubs — including a 4424-line / 243 KB data module)
- Full pipeline on 324 Pascal files produced 6,113 nodes / 8,770 edges
including 454 `inherits` + 7 `implements` + 1,944 `imports` (with
756 tagged `interface`-section vs 124 `implementation`-section) +
307 `related` (DFM pairings) — a 3.2× jump in inheritance edges
over a v1 run that lacked the new shared-type fields
- The "untyped forms" bucket in the architecture-analyzer's layer
output shrank from 218 to 186 once inheritance became deterministic
Open question for the maintainer
- Distribution of `tree-sitter-pascal.wasm`: there is no published npm
package shipping a prebuilt wasm. Options to discuss: (a) vendor the
wasm into a small workspace package, (b) add an optionalDependency
with a postinstall build step, (c) document a manual `tree-sitter
build --wasm` step. This PR leaves the dependency out of
packages/core/package.json so existing CI / installs are unaffected;
the TreeSitterPlugin's existing graceful-degradation path means
Pascal support is simply unavailable until a wasm is provided
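The graceful-degradation behaviour relied on here might look roughly like the sketch below. It is not the actual `TreeSitterPlugin` code: the `Resolver` type, `resolveGrammarWasm`, and `availableLanguages` are hypothetical names, and the resolver is injected (standing in for something like `require.resolve`) so the example does not depend on a particular module system.

```typescript
// `Resolver` stands in for something like `require.resolve`; it is injected
// so the sketch runs under any module system. All names are hypothetical.
type Resolver = (specifier: string) => string;

// Return the resolved wasm path, or null when the grammar is not installed.
function resolveGrammarWasm(resolve: Resolver, specifier: string): string | null {
  try {
    return resolve(specifier);
  } catch {
    // Missing grammar: report "unavailable" instead of propagating the error.
    return null;
  }
}

// Keep only the languages whose grammar wasm actually resolves.
function availableLanguages(resolve: Resolver, grammars: Record<string, string>): string[] {
  return Object.entries(grammars)
    .filter(([, specifier]) => resolveGrammarWasm(resolve, specifier) !== null)
    .map(([language]) => language);
}
```

With this shape, a missing `tree-sitter-pascal.wasm` only removes Pascal from the available list; every other language keeps working and nothing throws at install or startup.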
… TS+Python
Follow-up to the Pascal/Delphi language-support PR, pre-empting two
expected review asks:
PascalExtractor unit test suite
- New `packages/core/src/plugins/extractors/__tests__/pascal-extractor.test.ts`
with 13 tests covering: import section-tagging (interface vs
implementation vs untagged for .dpr), class inheritance (single
parent, parent + multi-interface split, pure interface inheritance,
ancestor-less classes), procedure/function extraction (params, return
type, qualified method names), and call graph (both parenthesized and
bare-identifier procedure calls).
- The suite skips cleanly (with a debug warning) if the
tree-sitter-pascal grammar isn't installed in node_modules, matching
the PR's open question about wasm distribution. Skip is evaluated at
collection time via top-level await, because `describe.skipIf`
doesn't see `beforeAll` mutations (collection runs first).
Populate `parents` / `interfaces` for two other extractors
- python-extractor: pulls Python `class X(Y, Z)` bases out of the
superclasses field. Python has no syntactic class/interface
distinction, so everything lands in `parents` and `interfaces` stays
undefined. Keyword-args like `metaclass=Meta` are skipped.
- typescript-extractor: walks `class_heritage`, putting
`extends_clause` types in `parents` and `implements_clause` types in
`interfaces`. Handles `identifier`, `type_identifier`,
`generic_type`, `member_expression`, and `nested_type_identifier`.
Bug fix in PascalExtractor.extractCallGraph
- Pascal allows bare-identifier procedure calls (`Foo;` instead of
`Foo();`). Tree-sitter parses these as `(statement (identifier))`
rather than `exprCall`. The extractor now records both shapes,
filtered to skip non-call statements (assignments, binary
expressions, statements with nested calls).
All 241 existing extractor tests still pass; Pascal suite passes 13/13
when grammar is available.
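The call-graph fix can be illustrated with a small sketch over a stand-in AST. `AstNode` is a hypothetical minimal node shape rather than the tree-sitter node API, and treating the first child of `exprCall` as the callee is an assumption of this sketch; only the node type names `exprCall`, `statement`, and `identifier` come from the commit message.

```typescript
// Minimal stand-in for a syntax node; the real tree-sitter API is richer.
interface AstNode { type: string; text: string; children: AstNode[] }

function node(type: string, text: string, children: AstNode[] = []): AstNode {
  return { type, text, children };
}

// Collect callee names from both call shapes:
//   1. `exprCall` nodes (assumed here: the callee is the first child)
//   2. bare calls parsed as `(statement (identifier))`, e.g. `Foo;`
// Statements with any other shape (assignments, binary expressions) are
// not recorded themselves, but nested `exprCall` nodes are still visited.
function collectCallees(root: AstNode, out: string[] = []): string[] {
  const first = root.children[0];
  if (root.type === "exprCall" && first) {
    out.push(first.text);
  } else if (
    root.type === "statement" &&
    root.children.length === 1 &&
    first?.type === "identifier"
  ) {
    out.push(first.text);
  }
  for (const child of root.children) collectCallees(child, out);
  return out;
}

// Hand-built tree for: `Foo; Bar(1); X := 2;`
const body = node("block", "", [
  node("statement", "Foo;", [node("identifier", "Foo")]),
  node("statement", "Bar(1);", [
    node("exprCall", "Bar(1)", [node("identifier", "Bar"), node("literal", "1")]),
  ]),
  node("statement", "X := 2;", [
    node("assignment", "X := 2", [node("identifier", "X"), node("literal", "2")]),
  ]),
]);
```

Requiring the statement to have exactly one `identifier` child is what keeps the assignment `X := 2;` out of the call list while still catching `Foo;`.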
Pre-empting two likely review asks: pushed two follow-up commits (
1. PascalExtractor unit tests: 13 tests covering import section-tagging, class inheritance (single parent / parent + multi-interface / pure interface inheritance / ancestor-less), procedure/function extraction (params, return type, qualified
2.
3. Fixed a real bug in
All 241 existing extractor tests still pass; new Pascal suite is 13/13 when the grammar is available; smoke-tested every extractor against representative source for each language.
Completes the shared-type extension introduced in the previous commits.
With this commit, every built-in extractor now surfaces inheritance
deterministically, so the file-analyzer agent no longer has to re-read
source to emit `inherits` / `implements` edges in any language.
Java (`packages/core/src/plugins/extractors/java-extractor.ts`)
- class_declaration: `superclass` field → parents; `interfaces` field
(which wraps a super_interfaces node) → interfaces
- interface_declaration: `extends_interfaces` child node → parents
(interface inheritance, not implementation)
- Shared helper `extractTypeRefs()` walks the wrapper node and pulls
out `type_identifier` / `scoped_type_identifier` / `generic_type`
(peeling the type_identifier out of generic_type when present)
C# (`packages/core/src/plugins/extractors/csharp-extractor.ts`)
- class_declaration: `base_list` child holds the comma-separated
parent + interfaces list that follows the colon (C# has no syntactic
distinction, just the I-prefix naming convention). Apply the
convention via `splitCSharpBaseRefs`: if the first entry's bare type
name starts with `I[A-Z]`, treat every entry as an interface;
otherwise the first is the class parent and the rest are interfaces.
This mirrors the I-prefix convention from the .NET naming guidelines.
- interface_declaration: every base_list entry is itself an interface
parent → parents
- `extractBaseListRefs()` handles `identifier`, `qualified_name`,
`predefined_type`, `generic_name`
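The I-prefix heuristic can be sketched as below. The function name `splitCSharpBaseRefs` comes from the commit message, but its signature, the `BaseSplit` return shape, and the `bareTypeName` helper are assumptions made for illustration.

```typescript
interface BaseSplit { parents: string[]; interfaces: string[] }

// Strip namespace qualifiers and generic arguments: `Ns.IFoo<T>` -> `IFoo`.
function bareTypeName(ref: string): string {
  const withoutGenerics = ref.split("<")[0] ?? ref;
  const segments = withoutGenerics.split(".");
  return (segments[segments.length - 1] ?? withoutGenerics).trim();
}

// Assumed signature: takes the base_list entries in source order.
function splitCSharpBaseRefs(refs: string[]): BaseSplit {
  const [first, ...rest] = refs;
  if (first === undefined) return { parents: [], interfaces: [] };
  // First entry looks like an interface: there is no class parent at all.
  if (/^I[A-Z]/.test(bareTypeName(first))) return { parents: [], interfaces: refs };
  // Otherwise the first entry is the class parent, the rest are interfaces.
  return { parents: [first], interfaces: rest };
}
```

The second-character check matters: a base named `Item` starts with `I` but is followed by a lowercase letter, so it is still treated as a class parent. Being purely name-based, the heuristic will misclassify a class whose name happens to match `I[A-Z]`.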
C++ (`packages/core/src/plugins/extractors/cpp-extractor.ts`)
- class_specifier / struct_specifier: child node `base_class_clause`
holds the `: public Foo, protected Bar` list. C++ has no
syntactic interface concept (abstract classes look the same), so
every base lands in `parents`
- Handles `type_identifier`, `qualified_identifier`, `template_type`
Go (`packages/core/src/plugins/extractors/go-extractor.ts`)
- Go has no class inheritance, but embedded fields in structs
(`type T struct { Inner; *Other }`) promote the embedded type's
methods — the closest Go has to inheritance. An embedded field is
a `field_declaration` with no `field_identifier`; the type itself
is the field name. Surface those in `parents`. For pointer-embed
(`*Foo`) strip the `*`; for qualified-embed (`pkg.Foo`) keep the
full ref.
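A minimal sketch of the embedded-field rule, assuming the struct fields have already been reduced to a hypothetical `StructField` shape (an optional name plus the type text); the real extractor works on tree-sitter nodes instead.

```typescript
// Hypothetical reduced field shape: `name` is absent for embedded fields.
interface StructField { name?: string; typeText: string }

// Embedded fields become parents: strip a leading `*` for pointer embeds,
// keep package-qualified references such as `pkg.Foo` intact.
function embeddedParents(fields: StructField[]): string[] {
  return fields
    .filter((field) => field.name === undefined)
    .map((field) => field.typeText.trim().replace(/^\*\s*/, ""));
}

// Fields for: type T struct { Inner; *Other; pkg.Foo; count int }
const fields: StructField[] = [
  { typeText: "Inner" },
  { typeText: "*Other" },
  { typeText: "pkg.Foo" },
  { name: "count", typeText: "int" },
];
```

Here `embeddedParents(fields)` would report `Inner`, `Other`, and `pkg.Foo`, and ignore the named field `count`.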
Ruby (`packages/core/src/plugins/extractors/ruby-extractor.ts`)
- class declarations: `superclass` field → parents
- module mixins (`include Mod`, `prepend Mod`, `extend Mod` at the
top level of the class body) → interfaces — they contribute
methods at runtime, semantically like interface implementation
PHP (`packages/core/src/plugins/extractors/php-extractor.ts`)
- class_declaration: `base_clause` child (extends) → parents;
`class_interface_clause` child (implements) → interfaces
- interface_declaration: `base_clause` (extends) → parents
(interface inheritance)
- `extractPhpTypeRefs()` walks the clause's `name` / `qualified_name`
children
Rust (`packages/core/src/plugins/extractors/rust-extractor.ts`)
- Rust has no class inheritance, but `impl Trait for Type` declares
that `Type` implements `Trait`. Built a `traitsByType` map during
impl walking, then attached it as `interfaces` on the matching
struct/enum during the final pass (parallel to the existing
methodsByType pattern)
- trait_item: `bounds` field holds supertrait bounds (e.g.
`trait Foo: Bar + Baz`) → parents (direct trait inheritance)
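The two-pass `traitsByType` approach could be sketched like this. `ImplItem`, `TypeDecl`, and `attachTraits` are hypothetical names over pre-extracted data, and de-duplicating repeated trait names is an assumption of this sketch; only the `traitsByType` map and the final-pass attachment are described in the commit message.

```typescript
// `impl Trait for Type` has a traitName; an inherent `impl Type` does not.
interface ImplItem { typeName: string; traitName?: string }
interface TypeDecl { name: string; interfaces?: string[] }

function attachTraits(decls: TypeDecl[], impls: ImplItem[]): TypeDecl[] {
  // Pass 1: group trait names by the type they are implemented for.
  const traitsByType = new Map<string, string[]>();
  for (const impl of impls) {
    if (impl.traitName === undefined) continue; // inherent impl, no trait
    const traits = traitsByType.get(impl.typeName) ?? [];
    if (!traits.includes(impl.traitName)) traits.push(impl.traitName);
    traitsByType.set(impl.typeName, traits);
  }
  // Pass 2: attach the collected traits to the matching struct/enum.
  return decls.map((decl) => {
    const traits = traitsByType.get(decl.name);
    return traits ? { ...decl, interfaces: traits } : decl;
  });
}
```

Deferring the attachment to a final pass means the order of `impl` blocks relative to the struct or enum declaration in the file does not matter.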
All 241 existing extractor tests still pass; smoke-tested all 9
extractors against representative source for each language.
Summary
Adds Pascal / Delphi (`.pas`, `.pp`, `.dpr`, `.dpk`, `.inc`) as a first-class language for Understand-Anything, backed by `tree-sitter-pascal` (Isopod/tree-sitter-pascal). Also makes two small additive extensions to `StructuralAnalysis` that benefit every language, particularly inheritance recovery, which is currently missing from the deterministic structural output across all extractors.
End-to-end validated against a 324-file legacy Delphi 7 codebase (19 MB / 461,452 lines): zero parse errors across the suite, and the resulting graph hits 6,113 nodes / 8,770 edges including 454 `inherits` edges, 7 `implements` edges, 1,944 `imports` edges (756 tagged `interface`-section vs 124 `implementation`-section), and 307 `related` edges (DFM filename pairings).
What's in the PR
New language module
- `packages/core/src/languages/configs/pascal.ts`: language config
- `packages/core/src/languages/configs/index.ts`: register
- `packages/core/src/plugins/extractors/pascal-extractor.ts`: AST walker for `unit` / `program` / `library`: extracts `moduleName`, `declUses` (interface- and implementation-section), `declTypes` → `declClass` / `declIntf` with ancestors, `declProc` / `defProc` with `declArgs` / `typeref`, `declField`, qualified-method names via `genericDot`, and a call graph via `exprCall`
- `packages/core/src/plugins/extractors/index.ts`: register `PascalExtractor`
- `skills/understand/languages/pascal.md`: LLM prompt snippet (concepts, import patterns, file patterns, common frameworks, example `languageNotes`)
Shared-type extensions (additive, all optional, no breaking changes)
In `packages/core/src/types.ts`, the new optional fields are `classes[].parents`, `classes[].interfaces`, and `imports[].section`.
These let every language extractor surface inheritance deterministically. Today, no extractor emits this on `StructuralAnalysis`, so the `file-analyzer` agent has to re-read source to recover parents, which is inconsistent across batches. On the validation codebase, switching to deterministic inheritance raised `inherits` edges from 140 to 454 (3.2×) and let the architecture-analyzer correctly classify reporting forms (38 → 201) instead of dumping them into a generic "Standalone" bucket.
`skills/understand/extract-structure.mjs` is updated to surface the new fields in its JSON output.
The other extractors are untouched and continue to work, since the new fields are optional. Populating them for TypeScript / Python / Java / C# / etc. is a natural follow-up.
New post-merge helper scripts
- `skills/understand/resolve-external-class-refs.mjs`: generic, useful for any language. When file-analyzer agents emit inheritance edges pointing at `class:external:<Name>` because they don't know which file declares the parent (it's in another batch), this pass rewrites the edge target to the actual node ID by class-name lookup, then re-adds the edge to the assembled graph. On the validation codebase this recovers 194 cross-batch `inherits` / `implements` edges the merge step would otherwise drop as dangling.
- `skills/understand/emit-dfm-pairs.mjs`: Pascal-specific. Pascal forms come in paired `.pas` + `.dfm` files; this emits a `related` edge between each pair after merge.
Test plan
- tree-sitter-pascal grammar (built to wasm via `tree-sitter build --wasm`) parses real Delphi 7 source with zero error nodes across 324 sampled files
- `extract-structure.mjs` runs cleanly end-to-end on 25-file batches; new `parents` / `interfaces` / `section` fields populate as expected
- `/understand` pipeline produces a validated graph (0 issues, 0 warnings in inline validation) with all file-level nodes assigned to layers and all tour-step nodeIds resolving
- Unit tests for `PascalExtractor`: not in this PR, happy to add in a follow-up commit if you'd like them before merge
Open question for the maintainer
Distribution of `tree-sitter-pascal.wasm`. There's no published npm package shipping a prebuilt wasm for Pascal; the existing `tree-sitter-pascal@0.0.1` on npm ships only C source via `nan` native bindings. Options to discuss:
- (a) Vendor the wasm into a small workspace package (`packages/tree-sitter-pascal-wasm/` or similar). ~700 KB binary asset. Cleanest install experience.
- (b) Add an optionalDependency with a postinstall build step that runs `tree-sitter build --wasm` against the grammar source.
- (c) Document a manual build: install `tree-sitter-cli`, clone Isopod's grammar, build to wasm, and place it where `require.resolve("tree-sitter-pascal/tree-sitter-pascal.wasm")` finds it.
This PR intentionally leaves `tree-sitter-pascal` out of `packages/core/package.json` so existing CI / installs are unaffected. The `TreeSitterPlugin`'s existing graceful-degradation path (catch + skip when the wasm doesn't resolve) means Pascal support is simply unavailable until a wasm is provided locally, the same posture as if a user had no grammar at all. Once we agree on (a) / (b) / (c) I'm happy to add the dependency wiring in this PR or a follow-up.
🤖 Generated with Claude Code