fix(base-extractor): preserve full string value across escape sequences in getStringValue #450
…es in getStringValue

tree-sitter splits a string literal's contents into multiple string_fragment nodes whenever an escape_sequence appears between them. getStringValue returned only the first string_fragment, silently dropping everything from the first escape onward (e.g. './a\tb' became './a'). This truncated import sources for any module path containing an escape.

Fix concatenates the text of every content child (string_fragment and escape_sequence) instead of returning only the first fragment, preserving the full raw value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
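The behaviour change described in this commit message can be sketched with hand-built nodes. `NodeLike`, `firstFragmentOnly`, and `joinContentChildren` are hypothetical names used for illustration, not the repository's code, and the quote-stripping fallback in the pre-fix version is an assumption.

```typescript
// Minimal stand-in for a tree-sitter syntax node; only the fields used here.
interface NodeLike {
  type: string;
  text: string;
  children: NodeLike[];
}

// Node types treated as string content. Both names come from the commit
// message; other grammars may use different ones.
const CONTENT_TYPES = new Set(["string_fragment", "escape_sequence"]);

// Pre-fix behaviour as described: only the first string_fragment is returned.
// (The fallback here is assumed, not taken from the repository.)
function firstFragmentOnly(node: NodeLike): string {
  const first = node.children.find((c) => c.type === "string_fragment");
  return first ? first.text : node.text.replace(/^['"]|['"]$/g, "");
}

// Post-fix behaviour as described: every content child is joined in order.
function joinContentChildren(node: NodeLike): string {
  const parts = node.children.filter((c) => CONTENT_TYPES.has(c.type));
  if (parts.length > 0) return parts.map((c) => c.text).join("");
  // No recognised content children: fall back to stripping the outer quotes.
  return node.text.replace(/^['"]|['"]$/g, "");
}

const leaf = (type: string, text: string): NodeLike => ({ type, text, children: [] });

// Hand-built tree for the source literal './a\tb' (backslash + t, not a tab).
const example: NodeLike = {
  type: "string",
  text: "'./a\\tb'",
  children: [
    leaf("'", "'"),
    leaf("string_fragment", "./a"),
    leaf("escape_sequence", "\\t"),
    leaf("string_fragment", "b"),
    leaf("'", "'"),
  ],
};

console.log(firstFragmentOnly(example));   // ./a
console.log(joinContentChildren(example)); // ./a\tb
```

The joined result is still raw source text: the escape is kept as the two characters `\` and `t`, not converted to a tab.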
thejesh23 left a comment
1. Function returns raw source text, not the decoded string value.
escape_sequence.text is the literal source (e.g. the two characters \ + t), so getStringValue on './a\tb' now returns ./a\tb verbatim — a backslash plus t, not a tab. The name and JSDoc ("unquoted string value") imply a decoded value; consumers comparing import sources to filesystem paths or resolved module specifiers will still be wrong, just wrong differently. Either decode the escapes here or rename/document this as "raw inner text".
2. Behavior is JS/TS-grammar-specific but the helper is re-exported as generic.
Only string_fragment/escape_sequence are recognized; Python/Go/Rust/Ruby/etc. use different child node types (string_content, interpreted_string_literal children, raw_string_literal, etc.), so for those grammars the function silently falls through to the quote-stripping path. Worth either restricting the helper to JS-family extractors or adding the other content node types — relevant to #435 (Dart extractor already calls this for string-literal imports).
3. Test coverage misses the fallback and template-string paths.
All three new tests hit the new concatenation branch on TS grammar; the node.text.replace(/^['"]|['"]$/g, "") fallback and template_string (with template_chars / ${...}) are still unexercised, and the second assertion uses toContain rather than toBe so it would also pass on the pre-fix truncated output a. Tighten that assertion and add a fallback-path case.
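On point 1, one way to return a decoded value is to post-process the raw inner text. The sketch below uses a hypothetical `decodeJsEscapes` helper that is not part of this PR. It assumes JS/TS escape syntax only, is lenient about malformed escapes, and does not handle legacy octal escapes.

```typescript
// Single-character escapes defined by the JS/TS string literal syntax.
const SIMPLE_ESCAPES: Record<string, string> = {
  n: "\n", t: "\t", r: "\r", b: "\b", f: "\f", v: "\v", "0": "\0",
};

// A backslash followed by one of these is a line continuation (yields nothing).
const LINE_TERMINATORS = new Set(["\n", "\r", "\u2028", "\u2029"]);

// Hypothetical helper: turn raw inner text (escapes still spelled out)
// into the string value the literal denotes.
function decodeJsEscapes(raw: string): string {
  return raw.replace(
    /\\(?:u\{([0-9a-fA-F]+)\}|u([0-9a-fA-F]{4})|x([0-9a-fA-F]{2})|\r\n|([\s\S]))/g,
    (_match: string, braced?: string, u4?: string, x2?: string, ch?: string) => {
      const hex = braced ?? u4 ?? x2;
      if (hex !== undefined) return String.fromCodePoint(parseInt(hex, 16));
      // ch is undefined only when the \r\n alternative matched.
      if (ch === undefined || LINE_TERMINATORS.has(ch)) return "";
      // Unknown escapes such as \" or \\ decode to the character itself.
      return SIMPLE_ESCAPES[ch] ?? ch;
    },
  );
}

// Strict equality checks, in the spirit of point 3 (toBe rather than toContain).
console.log(decodeJsEscapes("./a\\tb") === "./a\tb");        // true
console.log(decodeJsEscapes('a\\"b') === 'a"b');             // true
console.log(decodeJsEscapes("\\u0041\\x42\\u{43}") === "ABC"); // true
```

Keeping decoding in a separate step would let getStringValue stay documented as "raw inner text" while callers that compare against filesystem paths opt in to the decoded form.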
Problem
tree-sitter splits a string literal's contents into multiple string_fragment nodes whenever an escape_sequence (e.g. \t, \", \n) appears between them. getStringValue returns only the first string_fragment child, so everything from the first escape onward is silently dropped. I verified this against the real tree-sitter-typescript WASM grammar shipped in the repo: for the source import x from './a\tb' the string node has children [', string_fragment './a', escape_sequence '\t', string_fragment 'b', '], and getStringValue returns "./a" instead of the full import source. Likewise "a\"b" returns "a" (should be a"b) and "he\tllo" returns "he". The only consumer, typescript-extractor.extractImport (line 364), therefore records a truncated import source for any module path containing an escape, and getStringValue is also re-exported publicly from extractors/index.ts.

Fix
]|['"]$/g, ""); }…

Testing
Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and tsc --noEmit all pass locally on this branch.

Found via a static correctness audit of the shared tree-sitter base extractor.
🤖 Generated with Claude Code