
feat: add query owner context metadata #560

Open
buger wants to merge 5 commits into main from feat/semantic-owner-context-557-558

Conversation

@buger (Collaborator) commented May 6, 2026

Summary

  • add probe query --with-context / --owner-context for opt-in JSON owner-block metadata
  • introduce a shared language-agnostic source context layer for AST match, owner, comments, enclosing symbols, and enclosing calls
  • improve search JSON metadata with language, owner symbols, qualified owners, enclosing symbols/calls, leading comments, and classified text matches
  • improve JS/TS symbols output by unwrapping exported declarations to their semantic inner names
  • handle multiline block comments, HTML comments, and multiline strings when classifying search match metadata
  • document the implemented query/search JSON context surfaces
  • add mixed TS/Go req-id search JSON regressions for the --no-merge evidence path, including negative string/comment hit classification

Implemented: probe query --with-context

probe query now has a new opt-in flag:

probe query 'fetch($$$ARGS)' ./src -l typescript --format json --with-context

The default query JSON remains backward-compatible. With --with-context, each result keeps the existing fields and additionally includes:

  • schema_version: "probe.query.context.v1" at the top level
  • language: inferred source language
  • pattern: source pattern and nullable pattern ID
  • match: exact AST match node type, content, line range, and column range
  • owner: smallest useful enclosing source block where available
  • owner.symbol and owner.qualified_symbol
  • owner.node_type, owner.scope, owner.lines, and owner.columns
  • owner.comments: attached leading/source comments as raw source facts
  • owner.enclosing_symbols: containing class/module/impl-style symbols where knowable
  • owner.enclosing_call / owner.enclosing_calls: generic call context for callback/call nesting where knowable
  • owner.content: owning block source text when available
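Putting those fields together, a single `--with-context` result might look roughly like the following sketch. The wrapper key, field nesting, and all values here are illustrative assumptions for orientation, not verbatim probe output:

```json
{
  "schema_version": "probe.query.context.v1",
  "matches": [
    {
      "language": "typescript",
      "pattern": { "source": "fetch($$$ARGS)", "id": null },
      "match": { "node_type": "call_expression", "lines": [12, 12], "columns": [4, 28] },
      "owner": {
        "symbol": "loadPolicy",
        "qualified_symbol": "PolicyService.loadPolicy",
        "node_type": "method_definition",
        "scope": "class",
        "lines": [10, 18],
        "comments": [{ "lines": [9, 9], "text": "// fetches the active policy set" }],
        "enclosing_symbols": [{ "kind": "class", "name": "PolicyService", "line": 1 }],
        "enclosing_calls": [],
        "content": "async loadPolicy() { /* ... */ }"
      }
    }
  ]
}
```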

This is intentionally generic. Probe reports source facts only; it does not parse requirement IDs, policy annotations, checklist semantics, test frameworks, or security meanings.

Implemented: search JSON source metadata

Search JSON keeps its existing shape but now includes additional source facts useful for text-first evidence discovery:

  • language
  • owner_symbol for common owner blocks, including TS/JS methods and exported const arrows
  • owner_qualified_symbol, e.g. PolicyService.evaluatePolicy
  • enclosing_symbols, e.g. containing class/module/impl symbols
  • enclosing_call / enclosing_calls, e.g. generic describe(...) / it(...) callback call chains without framework interpretation
  • leading_comments with line ranges and raw text
  • matches with text, line/column range, kind (comment, string, code), and comment_role when applicable
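In sketch form, a single search result carrying the new source facts might include fields like these (field shapes and values are illustrative assumptions):

```json
{
  "language": "typescript",
  "owner_symbol": "evaluatePolicy",
  "owner_qualified_symbol": "PolicyService.evaluatePolicy",
  "enclosing_symbols": [{ "kind": "class", "name": "PolicyService", "line": 1 }],
  "enclosing_calls": [],
  "leading_comments": [{ "lines": [41, 41], "text": "// Implements: SYS-REQ-424" }],
  "matches": [
    { "text": "SYS-REQ-424", "lines": [41, 41], "kind": "comment", "comment_role": "leading" }
  ]
}
```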

The added regressions cover the intended search path:

probe search --allow-tests --strict-elastic-syntax --max-results 20 --no-merge \
  --format json '"SYS-REQ-424" OR "SYS-REQ-425"' ./fixture

They verify TS method ownership, exported const arrow ownership, qualified class-method identity, nested callback call chains, test callback scope, leading comments, classified comment matches, string-literal hits, and loose comment hits. Probe still does not interpret which comments are valid evidence; downstream tools decide that from the raw source facts.

JS/TS fixes

  • probe search -o json can now report owner symbols for TS/JS method_definition blocks.
  • probe search -o json can now report owner symbols for exported const arrow functions even when the returned block is an export_statement.
  • probe symbols now unwraps common TS/JS export_statement / declare_statement wrappers and returns semantic names like PolicyService and normalizeDecision instead of generic export_statement names.

Architecture notes

  • Keeps existing default output compatible; richer query metadata is opt-in via --with-context.
  • Uses existing parser/language abstractions rather than adopting the issue contract verbatim.
  • Shares source-context helpers between query and search while keeping their JSON contracts compatible with existing command behavior.
  • Treats this as the strong Probe-native slice: structural query context and neutral search/source metadata.

Out of scope for this PR

  • batch pattern execution / --patterns-file
  • new search --semantic-blocks behavior
  • new extract --semantic-block behavior
  • wildcard query syntax changes beyond existing ast-grep support
  • policy interpretation of comments, requirement IDs, frameworks, or checklist semantics

Tests

  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --lib
  • cargo test --test integration_tests
  • cargo test --test query_command_tests
  • cargo test --test query_command_json_tests test_query_json_with_context_reports_owner_block
  • cargo test --test json_format_tests test_search_json_no_merge_reports_req_id_source_metadata
  • cargo test --test json_format_tests test_search_json_classifies_req_id_noise_without_policy_interpretation
  • npm test -- --runInBand npm/tests/unit/search-delegate.test.js
  • git diff --check

Refs #557
Refs #558

buger added 2 commits May 6, 2026 06:05
Add --with-context for query JSON output, backed by a shared language-agnostic source context helper.

Also expose compatible search JSON metadata for language, leading comments, and classified text matches, and unwrap JS/TS export statements in symbols output.
Track block comments, HTML comments, and multiline quoted strings when adding search match metadata so the new JSON fields do not regress existing search comment handling.
probelabs Bot (Contributor) commented May 6, 2026

{}


Powered by Visor from Probelabs

Last updated: 2026-05-06T10:11:03.359Z | Triggered by: pr_updated | Commit: 1a9be12

💡 TIP: You can chat with Visor using /visor ask <your question>

probelabs Bot (Contributor) commented May 6, 2026

Architecture Issues (10)

Severity Location Issue
🟠 Error src/semantic_context.rs:130-180
build_query_source_context() and build_search_owner_context() contain similar AST traversal and owner detection logic but return different context types. This violates DRY and creates maintenance burden.
💡 Suggestion: Extract common AST traversal and owner detection logic into shared functions. Use a unified context builder that can be configured for query vs. search output requirements.
🟠 Error src/semantic_context.rs:200-350
extract_owner_symbol_from_source() contains hard-coded parsing patterns for multiple languages with string manipulation. This duplicates functionality likely already present in language-specific implementations and does not scale.
💡 Suggestion: Leverage existing language implementations in src/language/ rather than building parallel parsing logic. Use a strategy pattern where each language implementation provides its own symbol extraction logic.
🟠 Error src/semantic_context.rs:750-880
classify_match_kind_at() implements a full lexical scanner with state tracking for block comments, strings, and escape sequences. This is complex and error-prone compared to leveraging tree-sitter's existing node type information.
💡 Suggestion: Use tree-sitter node type information to determine if text is in a comment or string literal rather than reimplementing lexical scanning. Only fall back to custom scanning for edge cases that tree-sitter cannot handle.
🟢 Info docs/reference/output-formats.md:128-183
The schema_version field suggests awareness of evolution needs, but there is no clear migration path or backward compatibility strategy documented for when the schema changes.
💡 Suggestion: Document a clear schema versioning strategy including how consumers should handle unknown fields, how to detect version changes, and what the migration path will be when probe.query.context.v2 is introduced.
🟡 Warning src/semantic_context.rs:520-580
normalize_owner_node() contains special-case logic for variable_declarator and arguments nodes that could be generalized through a more extensible node normalization strategy.
💡 Suggestion: Replace hard-coded node type checks with a strategy pattern that allows language implementations to provide their own normalization rules. Create a NodeNormalizer trait that can be implemented per language.
🟡 Warning src/semantic_context.rs:480-520
is_owner_node() and is_function_like() contain hard-coded lists of node types that must be manually updated for each new language support. This does not scale well as language support grows.
💡 Suggestion: Move node type classification into language implementations using a trait-based approach. Each language implementation should provide methods to determine if a node is an owner, function-like, or module-like.
🟡 Warning src/extract/symbols.rs:144-165
semantic_symbol_node() specifically handles TypeScript/JavaScript export_statement and declare_statement wrappers as a special case by walking children to find the real symbol node. This creates language-specific logic in what should be a general symbol extraction module.
💡 Suggestion: Normalize export/declare wrappers during AST parsing or in the language implementation layer rather than handling them as a special case during symbol extraction.
🟡 Warning src/search/search_output.rs:574-680
Search output now uses semantic_context functions but maintains its own JsonResult structure with different field names and organization than query output. This creates inconsistency between search and query JSON schemas.
💡 Suggestion: Align search and query JSON output structures to use consistent field names and organization. Share a common JSON serialization layer for context metadata to ensure consistency across commands.
🟡 Warning src/query.rs:396-443
format_and_print_query_results() directly calls probe_code::semantic_context::build_query_source_context() and manually constructs JSON output by building serde_json::json! objects inline. This creates tight coupling between query formatting and context extraction logic.
💡 Suggestion: Introduce a context formatter abstraction that handles the conversion from QuerySourceContext to JSON. This would allow the formatting logic to be reused and tested independently from the query command.
🟡 Warning src/semantic_context.rs:1-958
The semantic_context module handles multiple responsibilities: AST traversal, symbol extraction, comment parsing, string classification, and scope classification. This violates Single Responsibility Principle.
💡 Suggestion: Split the module into focused components: AST traversal utilities, symbol extraction, comment/string classification, and scope classification. Each component should have a clear, single responsibility.

Performance Issues (11)

Severity Location Issue
🟡 Warning src/semantic_context.rs:88
build_query_source_context reads the entire file into memory with std::fs::read_to_string() for every query match when --with-context is enabled. For large files or many matches, this creates significant memory pressure and I/O overhead.
💡 Suggestion: Consider implementing a file content cache that shares the parsed source across multiple matches in the same file. The cache could be keyed by file path and modified time to avoid re-reading and re-parsing the same file multiple times.
🟡 Warning src/semantic_context.rs:88
Each call to build_query_source_context creates a new parser and parses the entire file. When processing multiple matches from the same file, this results in redundant parsing work.
💡 Suggestion: Implement a parse cache keyed by file path that reuses the tree-sitter Tree across multiple context extractions. Only reparse if the file has changed.
🟡 Warning src/semantic_context.rs:324
extract_owner_symbol_from_source allocates new Strings for every symbol name extraction using .to_string(). When processing many search results, this creates significant heap allocation pressure.
💡 Suggestion: Consider returning Cow<str> or using string slices with lifetimes to avoid allocation when the extracted name is already a substring of the source code.
🟡 Warning src/semantic_context.rs:324
extract_owner_symbol_from_source iterates through up to 20 lines of code and performs multiple string operations (strip_prefix, split, contains) for each line. This is O(n*m) where n=lines and m=operations per line.
💡 Suggestion: Consider early termination once a symbol is found, or use a more efficient pattern matching approach like regex or a single-pass parser.
🟡 Warning src/semantic_context.rs:409
leading_comments_from_block allocates a new Vec and Strings for every comment found. For code blocks with many leading comments, this creates multiple heap allocations.
💡 Suggestion: Consider using SmallVec or pre-allocating with Vec::with_capacity if you can estimate the number of comments. Alternatively, return Cow<str> to avoid copying comment text.
🟡 Warning src/semantic_context.rs:451
classify_text_matches_in_block contains nested loops: outer loop over keywords, inner loop over lines, and inner search for each keyword position. This is O(k*l*s) where k=keywords, l=lines, s=searches per line.
💡 Suggestion: Consider using the Aho-Corasick algorithm for multi-pattern matching, or at least break early once keywords are found. For large code blocks with many keywords, this could be slow.
🟡 Warning src/semantic_context.rs:451
classify_text_matches_in_block calls .to_lowercase() on every line and keyword for comparison. This allocates new Strings for every comparison, creating significant memory pressure.
💡 Suggestion: Consider case-insensitive matching without allocation, using libraries like regex with a case-insensitive flag, or implement a custom comparator that avoids allocation.
🟡 Warning src/semantic_context.rs:727
find_smallest_covering_node recursively traverses the AST tree to find the smallest node covering a byte range. In the worst case, this visits every node in the tree.
💡 Suggestion: Consider using tree-sitter's built-in methods for finding nodes at specific positions, or implement early termination if the node is already small enough.
🟡 Warning src/query.rs:207
When with_context is true, build_query_source_context is called for every match in the results. For queries returning hundreds of matches, this means parsing the same file multiple times.
💡 Suggestion: Group matches by file and parse each file only once, then extract context for all matches in that file from the single parse.
🟡 Warning src/search/search_output.rs:642
extract_owner_symbol is called for every search result and allocates a new String via to_string(). For large result sets, this creates thousands of heap allocations.
💡 Suggestion: Consider using a cache keyed by (node_type, code_signature) to avoid re-extracting the same owner symbol, or use string slices with lifetimes.
🟡 Warning src/extract/symbols.rs:109
semantic_symbol_node is called for every symbol node and performs additional AST traversal by iterating through children. For files with many symbols, this adds significant overhead.
💡 Suggestion: Cache the result of semantic_symbol_node lookups, or use a more efficient pattern like checking node.kind() directly before traversing children.

Quality Issues (1)

Severity Location Issue
🟠 Error contract:0
Output schema validation failed: must have required property 'issues'

Last updated: 2026-05-06T10:18:41.601Z | Triggered by: pr_updated | Commit: 1a9be12

buger (Collaborator, Author) commented May 6, 2026

Thanks, this PR now covers the important first slice for Proof's source-discovery path. The scoped split in the description is right: Probe should expose source facts, while Proof keeps requirement/test-policy interpretation and language-specific instrumentation.

For Proof's planned cleanup, this is enough to move req-id text discovery/autolink toward Probe:

  • search --allow-tests --no-merge -o json can return annotation-bearing blocks.
  • Search JSON now exposes language, owner_symbol, leading_comments, and classified matches.
  • That lets Proof reject string-literal hits and apply its own Implements / Verifies / MCDC comment policy without reparsing every file.
  • Common TS/JS owners like methods and exported const arrows are covered.

What is still missing before Proof can remove the remaining non-instrumentation AST/source readers:

  1. Search-level enclosing context

    query --with-context exposes owner.qualified_symbol, owner.enclosing_symbols, and owner.enclosing_calls, but search only exposes owner_symbol.

    For Proof, req-id discovery starts with text search (SYS-REQ-*, SW-REQ-*, etc.), not structural query patterns. When a req-id appears in a JS/TS callback block, Proof still needs generic context like:

    {
      "owner_symbol": null,
      "enclosing_calls": [
        {"callee": "describe", "first_arg_literal": "normalization", "line": 13},
        {"callee": "it", "first_arg_literal": "normalizes decisions", "line": 15}
      ]
    }

    Probe should not interpret this as a test framework. It only needs to expose the AST call chain. Proof can decide whether describe / it matters.

  2. Search-level qualified owners / containing symbols

    Search can now return owner_symbol, but Proof still cannot reliably distinguish PolicyService.evaluatePolicy from another evaluatePolicy in the same file/module without either reparsing or accepting weaker evidence identity.

    A lightweight search-side equivalent of the query owner fields would be enough:

    {
      "owner_symbol": "evaluatePolicy",
      "owner_qualified_symbol": "PolicyService.evaluatePolicy",
      "enclosing_symbols": [
        {"kind": "class", "name": "PolicyService", "line": 1}
      ]
    }
  3. Search regression for negative text hits

    The new search regression covers the positive path well. It would be useful to also assert the negative/source-classification cases through the real search JSON path:

    • req id inside a string -> matches[].kind == "string"
    • req id in a loose non-annotation comment -> matches[].kind == "comment" but no domain interpretation from Probe
    • leading annotation comment -> matches[].kind == "comment" and comment_role == "leading"

    Proof can then make its policy decision entirely from Probe JSON.
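    Concretely, those three classification cases could be asserted against matches entries shaped roughly like this (a sketch; the texts and exact field combinations are assumptions):

```json
{
  "matches": [
    { "text": "SYS-REQ-424", "kind": "string" },
    { "text": "SYS-REQ-424", "kind": "comment" },
    { "text": "SYS-REQ-425", "kind": "comment", "comment_role": "leading" }
  ]
}
```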

  4. Semantic block mode remains a later gap

    --no-merge is good enough for the first integration. Longer term, an explicit search mode that guarantees one semantic owner per result would let evidence tooling avoid relying on merge heuristics:

    probe search --semantic-blocks --allow-tests -o json '"SYS-REQ-424"'

    Not asking for that in this PR, just calling out that it remains the thing that would let Proof remove more local source-block cleanup.

With those additions, Proof could use Probe as the generic source-reader for trace/evidence discovery across Go/JS/TS and reserve local AST parsing for cases that truly require transformation or language execution semantics, such as MC/DC instrumentation and coverage/tool-specific processing.
