fix(base-extractor): preserve full string value across escape sequences in getStringValue#450

Open
tirth8205 wants to merge 1 commit into Egonex-AI:main from tirth8205:fix/base-extractor-string-escapes

@tirth8205

Contributor

Problem

  • tree-sitter splits a string literal's contents into multiple string_fragment nodes whenever an escape_sequence (e.g. \t, \", \n) appears between them. getStringValue returns only the first string_fragment child, so everything from the first escape onward is silently dropped. I verified this against the real tree-sitter-typescript WASM grammar shipped in the repo: for the source import x from './a\tb', the string node has children [', string_fragment './a', escape_sequence '\t', string_fragment 'b', '], and getStringValue returns "./a" instead of the full import source. Likewise "a\"b" returns "a" (should be a"b) and "he\tllo" returns "he". The only consumer, typescript-extractor.extractImport (line 364), therefore records a truncated import source for any module path containing an escape. getStringValue is also re-exported publicly from extractors/index.ts, so external callers may be affected too.
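The truncation described above can be reproduced without loading the grammar. The following is a minimal sketch: the `mockNode` builder and the `getStringValueFirstFragment` function are hypothetical stand-ins (the pre-fix source is not shown in this PR), and the child list is copied from the description rather than produced by tree-sitter.

```typescript
// Minimal stand-in for the node shape the helper relies on (assumed fields).
interface TreeSitterNode {
  type: string;
  text: string;
  childCount: number;
  child(i: number): TreeSitterNode | null;
}

// Hypothetical builder for fake nodes; not part of the repository.
function mockNode(type: string, text: string, children: TreeSitterNode[] = []): TreeSitterNode {
  return {
    type,
    text,
    childCount: children.length,
    child: (i: number) => children[i] ?? null,
  };
}

// Assumed shape of the pre-fix behavior: return the first string_fragment only.
function getStringValueFirstFragment(node: TreeSitterNode): string {
  for (let i = 0; i < node.childCount; i++) {
    const child = node.child(i);
    if (child && child.type === "string_fragment") return child.text;
  }
  return node.text.replace(/^['"]|['"]$/g, "");
}

// Children as listed for the source text './a\tb' (a backslash followed by t).
const importSource = mockNode("string", "'./a\\tb'", [
  mockNode("'", "'"),
  mockNode("string_fragment", "./a"),
  mockNode("escape_sequence", "\\t"),
  mockNode("string_fragment", "b"),
  mockNode("'", "'"),
]);

const truncated = getStringValueFirstFragment(importSource);
console.log(truncated); // prints ./a
```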

Fix

  • Concatenate the text of every content child (string_fragment and escape_sequence) instead of returning only the first fragment:

      export function getStringValue(node: TreeSitterNode): string {
        let value = "";
        let found = false;
        for (let i = 0; i < node.childCount; i++) {
          const child = node.child(i);
          if (child && (child.type === "string_fragment" || child.type === "escape_sequence")) {
            value += child.text;
            found = true;
          }
        }
        if (found) return value;
        return node.text.replace(/^['"]|['"]$/g, "");
      }
      …

Testing

Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and tsc --noEmit all pass locally on this branch.

Found via a static correctness audit of the shared tree-sitter base extractor.

🤖 Generated with Claude Code

…es in getStringValue

tree-sitter splits a string literal's contents into multiple string_fragment
nodes whenever an escape_sequence appears between them. getStringValue returned
only the first string_fragment, silently dropping everything from the first
escape onward (e.g. './a\tb' became './a'). This truncated import sources for
any module path containing an escape.

Fix concatenates the text of every content child (string_fragment and
escape_sequence) instead of returning only the first fragment, preserving the
full raw value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@thejesh23 left a comment

Contributor

1. Function returns raw source text, not the decoded string value.
escape_sequence.text is the literal source (e.g. the two characters \ + t), so getStringValue on './a\tb' now returns ./a\tb verbatim — a backslash plus t, not a tab. The name and JSDoc ("unquoted string value") imply a decoded value; consumers comparing import sources to filesystem paths or resolved module specifiers will still be wrong, just wrong differently. Either decode the escapes here or rename/document this as "raw inner text".
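If decoding is the chosen direction, one possible shape is sketched below. `decodeJsEscapes` is a hypothetical helper, not something in the repository, and it assumes its input is the raw inner text of a quoted JS/TS string literal. It covers the common single-character escapes, `\xHH`, `\uHHHH`, `\u{...}`, and line continuations; legacy octal escapes are not handled, and an out-of-range `\u{...}` code point would throw a RangeError.

```typescript
// Single-character escapes with a non-identity meaning in JS string literals.
const SIMPLE_ESCAPES: Record<string, string> = {
  n: "\n",
  t: "\t",
  r: "\r",
  b: "\b",
  f: "\f",
  v: "\v",
  "0": "\0",
};

// Hypothetical helper: decode escapes in the raw inner text of a string literal.
function decodeJsEscapes(raw: string): string {
  return raw.replace(
    /\\(?:u\{([0-9a-fA-F]+)\}|u([0-9a-fA-F]{4})|x([0-9a-fA-F]{2})|(\r\n|[\s\S]))/g,
    (_match: string, braced?: string, u4?: string, x2?: string, single?: string) => {
      const hex = braced ?? u4 ?? x2;
      if (hex !== undefined) return String.fromCodePoint(parseInt(hex, 16));
      const ch = single ?? "";
      // A backslash followed by a line terminator is a line continuation.
      if (ch === "\n" || ch === "\r" || ch === "\r\n") return "";
      // Unknown escapes such as \" or \\ decode to the character itself.
      return SIMPLE_ESCAPES[ch] ?? ch;
    },
  );
}

const decodedPath = decodeJsEscapes("./a\\tb"); // "./a" + tab + "b"
const decodedQuote = decodeJsEscapes('a\\"b'); // a"b
console.log(JSON.stringify(decodedPath), JSON.stringify(decodedQuote));
```

Keeping the decoder separate from the child-concatenation step would let callers that want the raw inner text keep it while import resolution uses the decoded form.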

2. Behavior is JS/TS-grammar-specific but the helper is re-exported as generic.
Only string_fragment/escape_sequence are recognized; Python/Go/Rust/Ruby/etc. use different child node types (string_content, interpreted_string_literal children, raw_string_literal, etc.), so for those grammars the function silently falls through to the quote-stripping path. Worth either restricting the helper to JS-family extractors or adding the other content node types — relevant to #435 (Dart extractor already calls this for string-literal imports).
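One way to make the grammar dependence explicit is to pass the set of content node types in, rather than hard-coding them. This is only a sketch: `getRawStringText`, `mockNode`, and both type sets are hypothetical names, `string_content` is taken from the comment above and has not been checked against any particular grammar, and the triple-quoted mock node is illustrative rather than real parser output.

```typescript
interface TreeSitterNode {
  type: string;
  text: string;
  childCount: number;
  child(i: number): TreeSitterNode | null;
}

// Hypothetical builder for fake nodes; not part of the repository.
function mockNode(type: string, text: string, children: TreeSitterNode[] = []): TreeSitterNode {
  return {
    type,
    text,
    childCount: children.length,
    child: (i: number) => children[i] ?? null,
  };
}

// Content node types from the PR (JS/TS grammars).
const JS_FAMILY_TYPES: ReadonlySet<string> = new Set(["string_fragment", "escape_sequence"]);
// Assumed set for a grammar that uses string_content; verify per grammar.
const STRING_CONTENT_TYPES: ReadonlySet<string> = new Set(["string_content", "escape_sequence"]);

// Same concatenation as the PR, but the caller states which child types count.
function getRawStringText(node: TreeSitterNode, contentTypes: ReadonlySet<string>): string {
  let value = "";
  let found = false;
  for (let i = 0; i < node.childCount; i++) {
    const child = node.child(i);
    if (child && contentTypes.has(child.type)) {
      value += child.text;
      found = true;
    }
  }
  return found ? value : node.text.replace(/^['"]|['"]$/g, "");
}

// Illustrative node for a triple-quoted literal with a string_content child.
const tripleQuoted = mockNode("string", '"""doc"""', [
  mockNode("string_start", '"""'),
  mockNode("string_content", "doc"),
  mockNode("string_end", '"""'),
]);

// With the JS-only set, the quote-stripping fallback removes one quote per side.
const viaFallback = getRawStringText(tripleQuoted, JS_FAMILY_TYPES);
const viaContent = getRawStringText(tripleQuoted, STRING_CONTENT_TYPES);
console.log(viaFallback, viaContent);
```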

3. Test coverage misses the fallback and template-string paths.
All three new tests hit the new concatenation branch on the TS grammar; the node.text.replace(/^['"]|['"]$/g, "") fallback and template_string (with template_chars / ${...}) are still unexercised. The second assertion also uses toContain rather than toBe, so it would pass on the pre-fix truncated output "a" as well. Tighten that assertion and add a fallback-path case.
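A rough sketch of the two missing checks, written against mock nodes so it runs without the grammar. The `getStringValue` body is copied from the PR description; `mockNode` and the `checks` object are hypothetical, and the exact form of the existing toContain assertion is an assumption, since the test file is not shown here.

```typescript
interface TreeSitterNode {
  type: string;
  text: string;
  childCount: number;
  child(i: number): TreeSitterNode | null;
}

// Hypothetical builder for fake nodes; not part of the repository.
function mockNode(type: string, text: string, children: TreeSitterNode[] = []): TreeSitterNode {
  return {
    type,
    text,
    childCount: children.length,
    child: (i: number) => children[i] ?? null,
  };
}

// Helper as proposed in the PR description.
function getStringValue(node: TreeSitterNode): string {
  let value = "";
  let found = false;
  for (let i = 0; i < node.childCount; i++) {
    const child = node.child(i);
    if (child && (child.type === "string_fragment" || child.type === "escape_sequence")) {
      value += child.text;
      found = true;
    }
  }
  if (found) return value;
  return node.text.replace(/^['"]|['"]$/g, "");
}

// Source text "a\"b": fragment, escape (backslash + quote), fragment.
const escapedQuote = mockNode("string", '"a\\"b"', [
  mockNode('"', '"'),
  mockNode("string_fragment", "a"),
  mockNode("escape_sequence", '\\"'),
  mockNode("string_fragment", "b"),
  mockNode('"', '"'),
]);

// A node with no content children exercises the quote-stripping fallback.
const noChildren = mockNode("string", "'plain'");

const checks = {
  // Exact equality distinguishes the full raw value from the truncated "a".
  exact: getStringValue(escapedQuote) === 'a\\"b',
  // A containment check cannot: the pre-fix output "a" also contains "a".
  looseOnTruncated: "a".includes("a"),
  fallback: getStringValue(noChildren) === "plain",
};
console.log(checks);
```

In the real suite the first and third checks would presumably become toBe assertions against nodes produced by the parser.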
