
feat: add query owner context metadata #560

Open
buger wants to merge 5 commits into main from feat/semantic-owner-context-557-558

Conversation

@buger (Collaborator) commented May 6, 2026

Summary

  • add probe query --with-context / --owner-context for opt-in JSON owner-block metadata
  • introduce a shared language-agnostic source context layer for AST match, owner, comments, enclosing symbols, and enclosing calls
  • improve search JSON metadata with language, owner symbols, qualified owners, enclosing symbols/calls, leading comments, and classified text matches
  • improve JS/TS symbols output by unwrapping exported declarations to their semantic inner names
  • handle multiline block comments, HTML comments, and multiline strings when classifying search match metadata
  • document the implemented query/search JSON context surfaces
  • add mixed TS/Go req-id search JSON regressions for the --no-merge evidence path, including negative string/comment hit classification

Implemented: probe query --with-context

probe query now has a new opt-in flag:

probe query 'fetch($$$ARGS)' ./src -l typescript --format json --with-context

The default query JSON remains backward-compatible. With --with-context, each result keeps the existing fields and additionally includes:

  • schema_version: "probe.query.context.v1" at the top level
  • language: inferred source language
  • pattern: source pattern and nullable pattern ID
  • match: exact AST match node type, content, line range, and column range
  • owner: smallest useful enclosing source block where available
  • owner.symbol and owner.qualified_symbol
  • owner.node_type, owner.scope, owner.lines, and owner.columns
  • owner.comments: attached leading/source comments as raw source facts
  • owner.enclosing_symbols: containing class/module/impl-style symbols where knowable
  • owner.enclosing_call / owner.enclosing_calls: generic call context for callback/call nesting where knowable
  • owner.content: owning block source text when available
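Putting those fields together, a single `--with-context` result might look roughly like the following sketch. The wrapper key, field nesting, and all values here are illustrative assumptions for orientation, not verbatim probe output:

```json
{
  "schema_version": "probe.query.context.v1",
  "matches": [
    {
      "language": "typescript",
      "pattern": { "source": "fetch($$$ARGS)", "id": null },
      "match": { "node_type": "call_expression", "lines": [12, 12], "columns": [4, 28] },
      "owner": {
        "symbol": "loadPolicy",
        "qualified_symbol": "PolicyService.loadPolicy",
        "node_type": "method_definition",
        "scope": "class",
        "lines": [10, 18],
        "comments": [{ "lines": [9, 9], "text": "// fetches the active policy set" }],
        "enclosing_symbols": [{ "kind": "class", "name": "PolicyService", "line": 1 }],
        "enclosing_calls": [],
        "content": "async loadPolicy() { /* ... */ }"
      }
    }
  ]
}
```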

This is intentionally generic. Probe reports source facts only; it does not parse requirement IDs, policy annotations, checklist semantics, test frameworks, or security meanings.

Implemented: search JSON source metadata

Search JSON keeps its existing shape but now includes additional source facts useful for text-first evidence discovery:

  • language
  • owner_symbol for common owner blocks, including TS/JS methods and exported const arrows
  • owner_qualified_symbol, e.g. PolicyService.evaluatePolicy
  • enclosing_symbols, e.g. containing class/module/impl symbols
  • enclosing_call / enclosing_calls, e.g. generic describe(...) / it(...) callback call chains without framework interpretation
  • leading_comments with line ranges and raw text
  • matches with text, line/column range, kind (comment, string, code), and comment_role when applicable
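In sketch form, a single search result carrying the new source facts might include fields like these (field shapes and values are illustrative assumptions):

```json
{
  "language": "typescript",
  "owner_symbol": "evaluatePolicy",
  "owner_qualified_symbol": "PolicyService.evaluatePolicy",
  "enclosing_symbols": [{ "kind": "class", "name": "PolicyService", "line": 1 }],
  "enclosing_calls": [],
  "leading_comments": [{ "lines": [41, 41], "text": "// Implements: SYS-REQ-424" }],
  "matches": [
    { "text": "SYS-REQ-424", "lines": [41, 41], "kind": "comment", "comment_role": "leading" }
  ]
}
```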

The added regressions cover the intended search path:

probe search --allow-tests --strict-elastic-syntax --max-results 20 --no-merge \
  --format json '"SYS-REQ-424" OR "SYS-REQ-425"' ./fixture

They verify TS method ownership, exported const arrow ownership, qualified class-method identity, nested callback call chains, test callback scope, leading comments, classified comment matches, string-literal hits, and loose comment hits. Probe still does not interpret which comments are valid evidence; downstream tools decide that from the raw source facts.

JS/TS fixes

  • probe search -o json can now report owner symbols for TS/JS method_definition blocks.
  • probe search -o json can now report owner symbols for exported const arrow functions even when the returned block is an export_statement.
  • probe symbols now unwraps common TS/JS export_statement / declare_statement wrappers and returns semantic names like PolicyService and normalizeDecision instead of generic export_statement names.

Architecture notes

  • Keeps existing default output compatible; richer query metadata is opt-in via --with-context.
  • Uses existing parser/language abstractions rather than adopting the issue contract verbatim.
  • Shares source-context helpers between query and search while keeping their JSON contracts compatible with existing command behavior.
  • Treats this as the strong Probe-native slice: structural query context and neutral search/source metadata.

Out of scope for this PR

  • batch pattern execution / --patterns-file
  • new search --semantic-blocks behavior
  • new extract --semantic-block behavior
  • wildcard query syntax changes beyond existing ast-grep support
  • policy interpretation of comments, requirement IDs, frameworks, or checklist semantics

Tests

  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --lib
  • cargo test --test integration_tests
  • cargo test --test query_command_tests
  • cargo test --test query_command_json_tests test_query_json_with_context_reports_owner_block
  • cargo test --test json_format_tests test_search_json_no_merge_reports_req_id_source_metadata
  • cargo test --test json_format_tests test_search_json_classifies_req_id_noise_without_policy_interpretation
  • npm test -- --runInBand npm/tests/unit/search-delegate.test.js
  • git diff --check

Refs #557
Refs #558

buger added 2 commits May 6, 2026 06:05
Add --with-context for query JSON output, backed by a shared language-agnostic source context helper.

Also expose compatible search JSON metadata for language, leading comments, and classified text matches, and unwrap JS/TS export statements in symbols output.
Track block comments, HTML comments, and multiline quoted strings when adding search match metadata so the new JSON fields do not regress existing search comment handling.
probelabs Bot (Contributor) commented May 6, 2026

{}


Powered by Visor from Probelabs

Last updated: 2026-05-06T10:11:03.359Z | Triggered by: pr_updated | Commit: 1a9be12

💡 TIP: You can chat with Visor using /visor ask <your question>

probelabs Bot (Contributor) commented May 6, 2026

Architecture Issues (10)

Severity Location Issue
🟠 Error src/semantic_context.rs:130-180
build_query_source_context() and build_search_owner_context() contain similar AST traversal and owner detection logic but return different context types. This violates DRY and creates maintenance burden.
💡 Suggestion: Extract common AST traversal and owner detection logic into shared functions. Use a unified context builder that can be configured for query vs. search output requirements.
🟠 Error src/semantic_context.rs:200-350
extract_owner_symbol_from_source() contains hard-coded parsing patterns for multiple languages with string manipulation. This duplicates functionality likely already present in language-specific implementations and does not scale.
💡 Suggestion: Leverage existing language implementations in src/language/ rather than building parallel parsing logic. Use a strategy pattern where each language implementation provides its own symbol extraction logic.
🟠 Error src/semantic_context.rs:750-880
classify_match_kind_at() implements a full lexical scanner with state tracking for block comments, strings, and escape sequences. This is complex and error-prone compared to leveraging tree-sitter's existing node type information.
💡 Suggestion: Use tree-sitter node type information to determine if text is in a comment or string literal rather than reimplementing lexical scanning. Only fall back to custom scanning for edge cases that tree-sitter cannot handle.
🟢 Info docs/reference/output-formats.md:128-183
The schema_version field suggests awareness of evolution needs, but there is no clear migration path or backward compatibility strategy documented for when the schema changes.
💡 Suggestion: Document a clear schema versioning strategy including how consumers should handle unknown fields, how to detect version changes, and what the migration path will be when probe.query.context.v2 is introduced.
🟡 Warning src/semantic_context.rs:520-580
normalize_owner_node() contains special-case logic for variable_declarator and arguments nodes that could be generalized through a more extensible node normalization strategy.
💡 Suggestion: Replace hard-coded node type checks with a strategy pattern that allows language implementations to provide their own normalization rules. Create a NodeNormalizer trait that can be implemented per language.
🟡 Warning src/semantic_context.rs:480-520
is_owner_node() and is_function_like() contain hard-coded lists of node types that must be manually updated for each new language support. This does not scale well as language support grows.
💡 Suggestion: Move node type classification into language implementations using a trait-based approach. Each language implementation should provide methods to determine if a node is an owner, function-like, or module-like.
🟡 Warning src/extract/symbols.rs:144-165
semantic_symbol_node() specifically handles TypeScript/JavaScript export_statement and declare_statement wrappers as a special case by walking children to find the real symbol node. This creates language-specific logic in what should be a general symbol extraction module.
💡 Suggestion: Normalize export/declare wrappers during AST parsing or in the language implementation layer rather than handling them as a special case during symbol extraction.
🟡 Warning src/search/search_output.rs:574-680
Search output now uses semantic_context functions but maintains its own JsonResult structure with different field names and organization than query output. This creates inconsistency between search and query JSON schemas.
💡 Suggestion: Align search and query JSON output structures to use consistent field names and organization. Share a common JSON serialization layer for context metadata to ensure consistency across commands.
🟡 Warning src/query.rs:396-443
format_and_print_query_results() directly calls probe_code::semantic_context::build_query_source_context() and manually constructs JSON output by building serde_json::json! objects inline. This creates tight coupling between query formatting and context extraction logic.
💡 Suggestion: Introduce a context formatter abstraction that handles the conversion from QuerySourceContext to JSON. This would allow the formatting logic to be reused and tested independently from the query command.
🟡 Warning src/semantic_context.rs:1-958
The semantic_context module handles multiple responsibilities: AST traversal, symbol extraction, comment parsing, string classification, and scope classification. This violates Single Responsibility Principle.
💡 Suggestion: Split the module into focused components: AST traversal utilities, symbol extraction, comment/string classification, and scope classification. Each component should have a clear, single responsibility.

Performance Issues (11)

Severity Location Issue
🟡 Warning src/semantic_context.rs:88
build_query_source_context reads the entire file into memory with std::fs::read_to_string() for every query match when --with-context is enabled. For large files or many matches, this creates significant memory pressure and I/O overhead.
💡 Suggestion: Consider implementing a file content cache that shares the parsed source across multiple matches in the same file. The cache could be keyed by file path and modified time to avoid re-reading and re-parsing the same file multiple times.
🟡 Warning src/semantic_context.rs:88
Each call to build_query_source_context creates a new parser and parses the entire file. When processing multiple matches from the same file, this results in redundant parsing work.
💡 Suggestion: Implement a parse cache keyed by file path that reuses the tree-sitter Tree across multiple context extractions. Only reparse if the file has changed.
🟡 Warning src/semantic_context.rs:324
extract_owner_symbol_from_source allocates new Strings for every symbol name extraction using .to_string(). When processing many search results, this creates significant heap allocation pressure.
💡 Suggestion: Consider returning Cow<str> or using string slices with lifetimes to avoid allocation when the extracted name is already a substring of the source code.
🟡 Warning src/semantic_context.rs:324
extract_owner_symbol_from_source iterates through up to 20 lines of code and performs multiple string operations (strip_prefix, split, contains) for each line. This is O(n*m) where n=lines and m=operations per line.
💡 Suggestion: Consider early termination once a symbol is found, or use a more efficient pattern matching approach like regex or a single-pass parser.
🟡 Warning src/semantic_context.rs:409
leading_comments_from_block allocates a new Vec and Strings for every comment found. For code blocks with many leading comments, this creates multiple heap allocations.
💡 Suggestion: Consider using SmallVec or pre-allocating with Vec::with_capacity if you can estimate the number of comments. Alternatively, return Cow<str> to avoid copying comment text.
🟡 Warning src/semantic_context.rs:451
classify_text_matches_in_block contains nested loops: outer loop over keywords, inner loop over lines, and inner search for each keyword position. This is O(k*l*s) where k=keywords, l=lines, s=searches per line.
💡 Suggestion: Consider using the Aho-Corasick algorithm for multi-pattern matching, or at least break early once keywords are found. For large code blocks with many keywords, this could be slow.
🟡 Warning src/semantic_context.rs:451
classify_text_matches_in_block calls .to_lowercase() on every line and keyword for comparison. This allocates new Strings for every comparison, creating significant memory pressure.
💡 Suggestion: Consider case-insensitive matching without allocation, using libraries like regex with a case-insensitive flag, or implement a custom comparator that avoids allocation.
🟡 Warning src/semantic_context.rs:727
find_smallest_covering_node recursively traverses the AST tree to find the smallest node covering a byte range. In the worst case, this visits every node in the tree.
💡 Suggestion: Consider using tree-sitter's built-in methods for finding nodes at specific positions, or implement early termination if the node is already small enough.
🟡 Warning src/query.rs:207
When with_context is true, build_query_source_context is called for every match in the results. For queries returning hundreds of matches, this means parsing the same file multiple times.
💡 Suggestion: Group matches by file and parse each file only once, then extract context for all matches in that file from the single parse.
🟡 Warning src/search/search_output.rs:642
extract_owner_symbol is called for every search result and allocates a new String via to_string(). For large result sets, this creates thousands of heap allocations.
💡 Suggestion: Consider using a cache keyed by (node_type, code_signature) to avoid re-extracting the same owner symbol, or use string slices with lifetimes.
🟡 Warning src/extract/symbols.rs:109
semantic_symbol_node is called for every symbol node and performs additional AST traversal by iterating through children. For files with many symbols, this adds significant overhead.
💡 Suggestion: Cache the result of semantic_symbol_node lookups, or use a more efficient pattern like checking node.kind() directly before traversing children.

Quality Issues (1)

Severity Location Issue
🟠 Error contract:0
Output schema validation failed: must have required property 'issues'

Last updated: 2026-05-06T10:18:41.601Z | Triggered by: pr_updated | Commit: 1a9be12

buger (Collaborator, Author) commented May 6, 2026

Thanks, this PR now covers the important first slice for Proof's source-discovery path. The scoped split in the description is right: Probe should expose source facts, while Proof keeps requirement/test-policy interpretation and language-specific instrumentation.

For Proof's planned cleanup, this is enough to move req-id text discovery/autolink toward Probe:

  • search --allow-tests --no-merge -o json can return annotation-bearing blocks.
  • Search JSON now exposes language, owner_symbol, leading_comments, and classified matches.
  • That lets Proof reject string-literal hits and apply its own Implements / Verifies / MCDC comment policy without reparsing every file.
  • Common TS/JS owners like methods and exported const arrows are covered.

What is still missing before Proof can remove the remaining non-instrumentation AST/source readers:

  1. Search-level enclosing context

    query --with-context exposes owner.qualified_symbol, owner.enclosing_symbols, and owner.enclosing_calls, but search only exposes owner_symbol.

    For Proof, req-id discovery starts with text search (SYS-REQ-*, SW-REQ-*, etc.), not structural query patterns. When a req-id appears in a JS/TS callback block, Proof still needs generic context like:

    {
      "owner_symbol": null,
      "enclosing_calls": [
        {"callee": "describe", "first_arg_literal": "normalization", "line": 13},
        {"callee": "it", "first_arg_literal": "normalizes decisions", "line": 15}
      ]
    }

    Probe should not interpret this as a test framework. It only needs to expose the AST call chain. Proof can decide whether describe / it matters.

  2. Search-level qualified owners / containing symbols

    Search can now return owner_symbol, but Proof still cannot reliably distinguish PolicyService.evaluatePolicy from another evaluatePolicy in the same file/module without either reparsing or accepting weaker evidence identity.

    A lightweight search-side equivalent of the query owner fields would be enough:

    {
      "owner_symbol": "evaluatePolicy",
      "owner_qualified_symbol": "PolicyService.evaluatePolicy",
      "enclosing_symbols": [
        {"kind": "class", "name": "PolicyService", "line": 1}
      ]
    }
  3. Search regression for negative text hits

    The new search regression covers the positive path well. It would be useful to also assert the negative/source-classification cases through the real search JSON path:

    • req id inside a string -> matches[].kind == "string"
    • req id in a loose non-annotation comment -> matches[].kind == "comment" but no domain interpretation from Probe
    • leading annotation comment -> matches[].kind == "comment" and comment_role == "leading"

    Proof can then make its policy decision entirely from Probe JSON.
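    Concretely, those three classification cases could be asserted against matches entries shaped roughly like this (a sketch; the texts and exact field combinations are assumptions):

```json
{
  "matches": [
    { "text": "SYS-REQ-424", "kind": "string" },
    { "text": "SYS-REQ-424", "kind": "comment" },
    { "text": "SYS-REQ-425", "kind": "comment", "comment_role": "leading" }
  ]
}
```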

  4. Semantic block mode remains a later gap

    --no-merge is good enough for the first integration. Longer term, an explicit search mode that guarantees one semantic owner per result would let evidence tooling avoid relying on merge heuristics:

    probe search --semantic-blocks --allow-tests -o json '"SYS-REQ-424"'

    Not asking for that in this PR, just calling out that it remains the thing that would let Proof remove more local source-block cleanup.

With those additions, Proof could use Probe as the generic source-reader for trace/evidence discovery across Go/JS/TS and reserve local AST parsing for cases that truly require transformation or language execution semantics, such as MC/DC instrumentation and coverage/tool-specific processing.
