Expose language-agnostic owner, comment, match, and enclosing-context metadata #557

@buger

Summary

probe search -o json already gives the right high-level primitive for source discovery: language-aware blocks with line ranges, AST node types, and useful scope flags. For downstream traceability/evidence tooling, the JSON contract needs a bit more language-agnostic AST/search metadata so callers do not have to run their own repo-wide parsers.

This issue intentionally does not ask Probe to understand requirement semantics or test frameworks. Probe should not need to know what Vitest, Jest, Mocha, or any other framework means. The ask is only for generic facts Probe can know from the AST/search result:

  • what symbol or declaration owns this block?
  • what enclosing AST context/call/declaration contains this block?
  • what comments are attached to the returned block?
  • did a text match occur in a comment, string literal, or ordinary code token?
  • can search/extract return one stable semantic owner block without merging unrelated owners?

Downstream tools can then apply their own policy. For example, Proof can decide that only comments matching Implements:, Verifies:, or MCDC count as evidence. Probe only needs to expose the source facts.
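To make the split of responsibilities concrete, here is a minimal sketch of what such a downstream policy could look like. It assumes comments arrive in the leading_comments shape proposed later in this issue; the names classifyComment, Evidence, and ANNOTATION are hypothetical and belong to the consumer, not to Probe.

```typescript
// Hypothetical downstream policy. Nothing here is part of Probe.
interface LeadingComment {
  text: string;
  start_line: number;
  end_line: number;
}

type EvidenceKind = "implements" | "verifies" | "mcdc";

interface Evidence {
  kind: EvidenceKind;
  requirementId: string;
  line: number;
}

// Annotation verbs recognised by the consumer; Probe never sees this list.
const ANNOTATION = /^\/\/\s*(Implements:|Verifies:|MCDC)\s+([A-Z]+-REQ-\d+)/;

// Returns evidence for an annotated comment, or null for a loose comment.
function classifyComment(comment: LeadingComment): Evidence | null {
  const m = ANNOTATION.exec(comment.text.trim());
  if (m === null) return null;
  const [, verb = "", requirementId = ""] = m;
  const kind = verb.replace(":", "").toLowerCase() as EvidenceKind;
  return { kind, requirementId, line: comment.start_line };
}
```

A comment such as "// SYS-REQ-429 appears here without an annotation verb." yields null under this policy, while the Implements:, Verifies:, and MCDC comments in the fixture below are accepted.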

Tested with:

  • CLI JSON output reports "version": "0.6.0"
  • npm package installed as @probelabs/probe@0.6.0-rc315
  • macOS, local CLI installed at ~/.npm-global/bin/probe

Reproduction Fixture

Create this disposable mixed-language fixture:

tmp="$(mktemp -d)"
mkdir -p "$tmp/web/src/__tests__" "$tmp/pkg/demo"

cat > "$tmp/web/src/service.ts" <<'EOF'
export class PolicyService {
  // Implements: SYS-REQ-424
  async evaluatePolicy(input: string): Promise<boolean> {
    return input.length > 0 && input !== "deny";
  }
}

// Implements: SYS-REQ-425
export const normalizeDecision = (raw: string) => {
  return raw.trim().toLowerCase();
};
EOF

cat > "$tmp/web/src/__tests__/service.test.ts" <<'EOF'
import { describe, it, expect, test } from "vitest";

// Verifies: SYS-REQ-424 [boundary]
test("accepts valid policy", () => {
  expect(true && true).toBe(true);
});

// MCDC SYS-REQ-424: input_valid=T, not_denied=T => TRUE
it("records witness row", () => {
  expect(true).toBe(true);
});

describe("normalization", () => {
  // Verifies: SYS-REQ-425
  it("normalizes decisions", () => {
    expect(" ALLOW ".trim().toLowerCase()).toBe("allow");
  });
});
EOF

cat > "$tmp/pkg/demo/demo.go" <<'EOF'
package demo

// Implements: SYS-REQ-426
func RunDemo(flag bool) bool {
    return flag
}
EOF

cat > "$tmp/web/src/noise.ts" <<'EOF'
export const literalOnly = "Implements: SYS-REQ-427";

export function unrelated() {
  return "SYS-REQ-428";
}

// SYS-REQ-429 appears here without an annotation verb.
export function looseComment() {
  return true;
}
EOF

Case 1: Go function ownership works and is the useful baseline

Command:

probe search --allow-tests --strict-elastic-syntax --max-results 20 --no-merge \
  -o json '"SYS-REQ-426"' "$tmp"

Actual useful result excerpt:

{
  "code": "// Implements: SYS-REQ-426\nfunc RunDemo(flag bool) bool {\n    return flag\n}",
  "node_type": "function_declaration",
  "owner_symbol": "RunDemo",
  "scope": "function",
  "lines": [3, 6]
}

This is the shape downstream tools need: the returned block includes the leading comment and exposes the owning symbol.

Expected: preserve this behavior and expose equivalent generic owner metadata where the AST makes it knowable in JS/TS/TSX.

Case 2: TypeScript class methods and exported const arrows miss generic owner symbols

Command:

probe search --allow-tests --strict-elastic-syntax --max-results 20 --no-merge \
  -o json '"SYS-REQ-424" OR "SYS-REQ-425"' "$tmp"

Actual TypeScript method result excerpt:

{
  "code": "  // Implements: SYS-REQ-424\n  async evaluatePolicy(input: string): Promise<boolean> {\n    return input.length > 0 && input !== \"deny\";\n  }",
  "node_type": "method_definition",
  "scope": "function",
  "lines": [2, 5]
}

Actual exported arrow result excerpt:

{
  "code": "// Implements: SYS-REQ-425\nexport const normalizeDecision = (raw: string) => {\n  return raw.trim().toLowerCase();\n};",
  "node_type": "export_statement",
  "scope": "declaration",
  "lines": [8, 11]
}

Missing/inconsistent generic fields:

  • no owner_symbol for method evaluatePolicy
  • no owner_symbol for variable declarator normalizeDecision
  • no generic containing declaration for the class method, e.g. containing class name
  • node_type: "export_statement" is too generic for the semantic owner

Related command:

probe symbols "$tmp/web/src/service.ts" -o json

Actual symbols excerpt:

[
  {
    "file": ".../web/src/service.ts",
    "symbols": [
      {
        "name": "export_statement",
        "kind": "export",
        "signature": "export class PolicyService { ... }",
        "line": 1,
        "end_line": 6
      },
      {
        "name": "export_statement",
        "kind": "export",
        "signature": "export const normalizeDecision = (raw: string) => {",
        "line": 9,
        "end_line": 11
      }
    ]
  }
]

Expected language-agnostic shape, using generic symbol/declaration concepts:

{
  "language": "typescript",
  "node_type": "method_definition",
  "owner_symbol": "evaluatePolicy",
  "owner_qualified_symbol": "PolicyService.evaluatePolicy",
  "enclosing_symbols": [
    {"kind": "class", "name": "PolicyService"}
  ],
  "scope": "function"
}

and:

{
  "language": "typescript",
  "node_type": "variable_declarator",
  "owner_symbol": "normalizeDecision",
  "owner_qualified_symbol": "normalizeDecision",
  "scope": "function"
}

No framework or domain knowledge is needed here. These are just AST owner/declaration facts.
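As an illustration of how little is needed, the qualified name can be derived mechanically from the other two fields. The sketch below assumes enclosing_symbols is ordered outermost first and that "." is an acceptable separator; both are assumptions of this example, and qualifiedSymbol is a hypothetical helper name.

```typescript
interface EnclosingSymbol {
  kind: string;
  name: string;
}

// Joins enclosing declaration names (assumed outermost first) with the owner name.
function qualifiedSymbol(owner: string, enclosing: EnclosingSymbol[]): string {
  return [...enclosing.map((s) => s.name), owner].join(".");
}

console.log(qualifiedSymbol("evaluatePolicy", [{ kind: "class", name: "PolicyService" }]));
console.log(qualifiedSymbol("normalizeDecision", []));
```

For a top-level declaration the enclosing list is empty, so the qualified name equals the owner name, which matches the normalizeDecision example above.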

Case 3: Callback blocks need generic enclosing-call context, not framework detection

Command:

probe search --allow-tests --strict-elastic-syntax --max-results 20 --no-merge \
  -o json '"SYS-REQ-424" OR "SYS-REQ-425"' "$tmp"

Actual callback result excerpts:

{
  "code": "// Verifies: SYS-REQ-424 [boundary]\ntest(\"accepts valid policy\", () => {\n  expect(true && true).toBe(true);\n});",
  "is_test": true,
  "node_type": "arrow_function",
  "scope": "test",
  "lines": [3, 6]
}
{
  "code": "// MCDC SYS-REQ-424: input_valid=T, not_denied=T => TRUE\nit(\"records witness row\", () => {\n  expect(true).toBe(true);\n});",
  "is_test": true,
  "node_type": "arrow_function",
  "scope": "test",
  "lines": [8, 11]
}
{
  "code": "  // Verifies: SYS-REQ-425\n  it(\"normalizes decisions\", () => {\n    expect(\" ALLOW \".trim().toLowerCase()).toBe(\"allow\");\n  });",
  "is_test": true,
  "node_type": "arrow_function",
  "scope": "test",
  "lines": [14, 17]
}

Probe does not need to know these are Vitest tests. The useful missing data is generic AST context:

  • this arrow function is an argument to a call expression
  • the call expression callee text is test or it
  • the call expression first argument is a string literal
  • the callback is nested inside another call expression with callee describe

Expected language-agnostic shape:

{
  "language": "typescript",
  "node_type": "arrow_function",
  "scope": "test",
  "enclosing_call": {
    "callee": "test",
    "first_arg_literal": "accepts valid policy",
    "line": 4
  },
  "enclosing_calls": [
    {
      "callee": "test",
      "first_arg_literal": "accepts valid policy",
      "line": 4
    }
  ]
}

For the nested callback:

{
  "enclosing_call": {
    "callee": "it",
    "first_arg_literal": "normalizes decisions",
    "line": 15
  },
  "enclosing_calls": [
    {
      "callee": "describe",
      "first_arg_literal": "normalization",
      "line": 13
    },
    {
      "callee": "it",
      "first_arg_literal": "normalizes decisions",
      "line": 15
    }
  ]
}

This remains framework-agnostic. Downstream tools can decide whether test, it, describe, or any other callee name matters.
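A consumer-side sketch of that decision, assuming the enclosing_calls shape shown above with the outermost call first. The function labelPath and the set of interesting callee names are hypothetical and live entirely in the downstream tool.

```typescript
interface EnclosingCall {
  callee: string;
  first_arg_literal: string | null;
  line: number;
}

// Collects the string labels of the enclosing calls the consumer cares about.
function labelPath(calls: EnclosingCall[], callees: ReadonlySet<string>): string[] {
  const labels: string[] = [];
  for (const call of calls) {
    if (callees.has(call.callee) && call.first_arg_literal !== null) {
      labels.push(call.first_arg_literal);
    }
  }
  return labels;
}

const nested: EnclosingCall[] = [
  { callee: "describe", first_arg_literal: "normalization", line: 13 },
  { callee: "it", first_arg_literal: "normalizes decisions", line: 15 },
];

// The consumer, not Probe, decides that these callee names are meaningful.
console.log(labelPath(nested, new Set(["describe", "it", "test"])).join(" > "));
```

Swapping the set for a different framework's callee names requires no change on the Probe side.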

Case 4: Requirement IDs inside strings or loose comments are returned as search hits

Command:

probe search --allow-tests --strict-elastic-syntax --max-results 20 --no-merge \
  -o json '"SYS-REQ-427" OR "SYS-REQ-428" OR "SYS-REQ-429"' "$tmp"

Actual result excerpts:

{
  "code": "export const literalOnly = \"Implements: SYS-REQ-427\";",
  "node_type": "export_statement",
  "scope": "declaration",
  "matched_keywords": ["sys-req-427"]
}
{
  "code": "export function unrelated() {\n  return \"SYS-REQ-428\";\n}",
  "node_type": "function_declaration",
  "owner_symbol": "unrelated",
  "scope": "function",
  "matched_keywords": ["sys-req-428"]
}
{
  "code": "// SYS-REQ-429 appears here without an annotation verb.\nexport function looseComment() {\n  return true;\n}",
  "node_type": "export_statement",
  "owner_symbol": "looseComment",
  "scope": "declaration",
  "matched_keywords": ["sys-req-429"]
}

It is correct for Probe to find these textual matches. The missing generic metadata is match classification:

  • did the match occur in a comment, string literal, identifier, or other code token?
  • if in a comment, is the comment leading/trailing/inner relative to the returned owner?
  • what are the exact comment line ranges?
  • what is the comment text without requiring callers to parse code?

Expected:

{
  "matches": [
    {
      "text": "SYS-REQ-427",
      "line": 1,
      "column": 41,
      "kind": "string"
    }
  ],
  "leading_comments": []
}

and for a real leading comment:

{
  "leading_comments": [
    {
      "text": "// Implements: SYS-REQ-424",
      "start_line": 2,
      "end_line": 2
    }
  ],
  "matches": [
    {
      "text": "SYS-REQ-424",
      "line": 2,
      "column": 18,
      "kind": "comment",
      "comment_role": "leading"
    }
  ]
}

Downstream tools can then reject string-literal matches or loose comments by policy.
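For example, a consumer could keep only matches that sit in a leading comment. This is a sketch against the expected matches shape above; the kind and comment_role value sets are assumptions taken from the questions listed in this case, and evidenceMatches is a hypothetical name.

```typescript
interface Match {
  text: string;
  line: number;
  column: number;
  // Assumed value set; the final contract may name token kinds differently.
  kind: "comment" | "string" | "identifier" | "code";
  comment_role?: "leading" | "trailing" | "inner";
}

// Keeps only matches found in a comment attached in front of the owner.
function evidenceMatches(matches: Match[]): Match[] {
  return matches.filter((m) => m.kind === "comment" && m.comment_role === "leading");
}

const sample: Match[] = [
  { text: "SYS-REQ-427", line: 1, column: 41, kind: "string" },
  { text: "SYS-REQ-424", line: 2, column: 18, kind: "comment", comment_role: "leading" },
];

console.log(evidenceMatches(sample).map((m) => m.text));
```

Whether a leading comment also needs an annotation verb is a separate policy check on the comment text, so SYS-REQ-429 would pass this filter and be rejected later.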

Case 5: extract can drop attached leading comments or return partial callback blocks

Command:

probe extract -o json \
  "$tmp/web/src/service.ts:3" \
  "$tmp/web/src/service.ts:9"

Actual result excerpt:

{
  "code": "  async evaluatePolicy(input: string): Promise<boolean> {\n    return input.length > 0 && input !== \"deny\";\n  }",
  "lines": [3, 5],
  "node_type": "merged_ast_line"
}

The leading comment at line 2 is not included, even though line 3 is the signature line of the method that the comment is attached to.

Command:

probe extract --allow-tests -o json \
  "$tmp/web/src/__tests__/service.test.ts:4" \
  "$tmp/web/src/__tests__/service.test.ts:15"

Actual result excerpt:

{
  "code": "test(\"accepts valid policy\", () => {",
  "lines": [4, 4],
  "node_type": "context"
}

Expected:

  • extracting a line inside a semantic owner can optionally include attached leading comments
  • extracting a line inside a callback can return the full enclosing call/callback block when requested
  • extraction exposes the same generic owner/comment/context metadata as search

Possible option:

probe extract --semantic-block --allow-tests -o json "$file:$line"

Where --semantic-block means:

  • choose the smallest useful semantic owner for the line
  • include attached leading comments
  • avoid partial fragments for functions, methods, declarations, and callback call blocks
  • include generic comments, matches, and enclosing context fields
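The first two bullets can be sketched as a selection over candidate AST ranges. This is only an illustration of the requested behavior, not a description of how Probe works internally; the Candidate shape and the semanticBlock function are hypothetical.

```typescript
interface Candidate {
  node_type: string;
  start_line: number;
  end_line: number;
  // First line of an attached leading comment, if the node has one.
  leading_comment_start?: number;
}

// Picks the smallest candidate containing `line` and widens it to its leading comment.
function semanticBlock(candidates: Candidate[], line: number): [number, number] | null {
  const containing = candidates.filter((c) => c.start_line <= line && line <= c.end_line);
  if (containing.length === 0) return null;
  const smallest = containing.reduce((a, b) =>
    b.end_line - b.start_line < a.end_line - a.start_line ? b : a
  );
  return [smallest.leading_comment_start ?? smallest.start_line, smallest.end_line];
}

// Candidate ranges for service.ts from the fixture.
const serviceTs: Candidate[] = [
  { node_type: "class_declaration", start_line: 1, end_line: 6 },
  { node_type: "method_definition", start_line: 3, end_line: 5, leading_comment_start: 2 },
  { node_type: "variable_declarator", start_line: 9, end_line: 11, leading_comment_start: 8 },
];

console.log(semanticBlock(serviceTs, 3));
console.log(semanticBlock(serviceTs, 9));
```

With these ranges, line 3 resolves to the method plus its comment rather than to the whole class, and line 9 resolves to the arrow declaration plus its comment.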

Case 6: Default search merging can combine multiple semantic owners

Command without --no-merge:

probe search --allow-tests --strict-elastic-syntax --max-results 20 \
  -o json '"SYS-REQ-424" OR "SYS-REQ-425"' "$tmp"

Actual result excerpt:

{
  "code": "// Verifies: SYS-REQ-424 [boundary]\ntest(\"accepts valid policy\", () => {\n  expect(true && true).toBe(true);\n});\n\n// MCDC SYS-REQ-424: input_valid=T, not_denied=T => TRUE\nit(\"records witness row\", () => {\n  expect(true).toBe(true);\n});\n\ndescribe(\"normalization\", () => {\n  // Verifies: SYS-REQ-425\n  it(\"normalizes decisions\", () => {\n    expect(\" ALLOW \".trim().toLowerCase()).toBe(\"allow\");\n  });",
  "is_test": true,
  "lines": [3, 17],
  "matched_keywords": ["sys-req-424", "sys-req-425"]
}

This is useful for LLM context, but not for evidence tools because one result now contains multiple semantic owners and multiple requirement IDs.

Expected:

  • keep current merge behavior for general search if desired
  • provide a stable mode that prevents merging across semantic owner boundaries

Possible option:

probe search --semantic-blocks --allow-tests -o json '"SYS-REQ-424" OR "SYS-REQ-425"' "$tmp"

Where --semantic-blocks means:

  • one semantic owner per result
  • no merging across function/method/declaration/callback-owner boundaries
  • attached leading comments included
  • generic comments, matches, and enclosing context fields included
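One way to picture the second bullet is a merge step that only joins fragments sharing an owner. The sketch below is an assumption about the desired behavior, not Probe's current merge logic; the Block shape, the owner key format, and mergeWithinOwners are all hypothetical.

```typescript
interface Block {
  // Any stable owner key, for example a callee name plus its line.
  owner: string;
  lines: [number, number];
}

// Merges adjacent fragments only when they belong to the same semantic owner.
function mergeWithinOwners(blocks: Block[], maxGap = 1): Block[] {
  const sorted = [...blocks].sort((a, b) => a.lines[0] - b.lines[0]);
  const out: Block[] = [];
  for (const b of sorted) {
    const last = out[out.length - 1];
    if (last !== undefined && last.owner === b.owner && b.lines[0] - last.lines[1] <= maxGap) {
      last.lines = [last.lines[0], Math.max(last.lines[1], b.lines[1])];
    } else {
      out.push({ owner: b.owner, lines: [b.lines[0], b.lines[1]] });
    }
  }
  return out;
}

// The three callback blocks from service.test.ts stay separate.
const callbacks: Block[] = [
  { owner: "test@4", lines: [3, 6] },
  { owner: "it@9", lines: [8, 11] },
  { owner: "it@15", lines: [14, 17] },
];
console.log(mergeWithinOwners(callbacks).length);
```

Fragments of a single owner, such as lines [3, 4] and [5, 6] of the same callback, would still be joined into one result.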

Proposed JSON Fields

This is intentionally generic:

{
  "file": "/path/to/file.ts",
  "language": "typescript",
  "lines": [2, 5],
  "code": "...",
  "node_type": "method_definition",
  "scope": "function",
  "owner_symbol": "evaluatePolicy",
  "owner_qualified_symbol": "PolicyService.evaluatePolicy",
  "enclosing_symbols": [
    {"kind": "class", "name": "PolicyService", "line": 1}
  ],
  "enclosing_call": null,
  "enclosing_calls": [],
  "symbol_signature": "async evaluatePolicy(input: string): Promise<boolean>",
  "leading_comments": [
    {
      "text": "// Implements: SYS-REQ-424",
      "start_line": 2,
      "end_line": 2
    }
  ],
  "matches": [
    {
      "text": "SYS-REQ-424",
      "start_line": 2,
      "start_column": 18,
      "end_line": 2,
      "end_column": 29,
      "kind": "comment",
      "comment_role": "leading"
    }
  ]
}

For callback blocks, the same shape can include generic call context:

{
  "node_type": "arrow_function",
  "enclosing_call": {
    "callee": "it",
    "first_arg_literal": "normalizes decisions",
    "line": 15
  },
  "enclosing_calls": [
    {"callee": "describe", "first_arg_literal": "normalization", "line": 13},
    {"callee": "it", "first_arg_literal": "normalizes decisions", "line": 15}
  ]
}
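For consumers written in TypeScript, the proposed fields could be typed roughly as follows. This is a sketch derived from the two examples above, not a published Probe schema: which fields are optional and the exact kind and comment_role value sets are assumptions, and describeBlock is a hypothetical helper.

```typescript
interface EnclosingSymbol { kind: string; name: string; line: number }
interface EnclosingCall { callee: string; first_arg_literal: string | null; line: number }
interface LeadingComment { text: string; start_line: number; end_line: number }

interface Match {
  text: string;
  start_line: number;
  start_column: number;
  end_line: number;
  end_column: number;
  kind: "comment" | "string" | "identifier" | "code"; // assumed value set
  comment_role?: "leading" | "trailing" | "inner";    // assumed value set
}

interface ProbeBlock {
  file: string;
  language: string;
  lines: [number, number];
  code: string;
  node_type: string;
  scope: string;
  owner_symbol?: string;
  owner_qualified_symbol?: string;
  symbol_signature?: string;
  enclosing_symbols: EnclosingSymbol[];
  enclosing_call: EnclosingCall | null;
  enclosing_calls: EnclosingCall[];
  leading_comments: LeadingComment[];
  matches: Match[];
}

// One-line label for a block, falling back from symbol owner to call owner.
function describeBlock(b: ProbeBlock): string {
  const owner =
    b.owner_qualified_symbol ?? b.owner_symbol ?? b.enclosing_call?.callee ?? "<anonymous>";
  return `${owner} (${b.node_type}) ${b.file}:${b.lines[0]}-${b.lines[1]}`;
}
```

The fallback chain in describeBlock reflects the two owner styles in this issue: declarations are identified by symbol, callbacks by their enclosing call.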

Acceptance Criteria

  • Go behavior remains compatible with current owner_symbol results.
  • TS/JS class methods expose method owner and containing class/object where knowable.
  • TS/JS exported const arrow functions expose the variable declarator name as owner where knowable.
  • Callback blocks can expose generic enclosing call context: callee text, first literal argument if present, and enclosing call chain.
  • probe symbols returns useful JS/TS symbol/declaration names instead of generic export_statement for common exported classes/functions/const arrows.
  • JSON results expose structured attached comments with line ranges.
  • JSON results expose match locations and token kind (comment, string, code, etc.).
  • A search/extract mode exists for evidence-style consumers that returns one semantic owner per result and includes attached comments.
  • The fixture above can be used as regression coverage for all cases.

Why this matters

Without these generic fields, downstream multi-language tools have to reparse source code using their own AST logic, which recreates language-specific behavior and makes Go, JS, TS, and TSX support diverge.

With these fields, Probe can remain the language-agnostic source-discovery layer. Downstream tools can apply their own domain policy on top.
