sync: upstream just-bash@2.14.5#10
Conversation
* jq: accept literal control chars inside JSON strings Real jq tolerates JSON strings containing unescaped control characters (newlines, tabs, carriage returns, etc.) even though RFC 8259 forbids them. Real-world inputs — webhooks, log lines piped through jq, output from terminals copy-pasted into a script — frequently contain literal tabs/newlines inside string values, and rejecting those at the JSON.parse boundary surprises users who expect parity with jq. Pre-process the slice we hand to JSON.parse: walk the string tracking quote/escape state, and when a control char (≤ 0x1f) appears inside a string, replace it with its JSON escape (\n, \t, \r, \b, \f, or \uXXXX). Outside strings the input is left untouched, so structural whitespace keeps its meaning. Tests cover newline, tab, and CR inside strings; preserved escape sequences alongside literal control chars; control chars outside strings (whitespace) leaving compact output unchanged; and concatenated JSON values where one contains a literal newline. * add changeset
…er (vercel-labs#218) * perf(grep): 5-123x speedup via RE2 matcher reuse and literal pre-filter Addresses three layered hot spots that together produced 700-800ms per-exec latency for case-insensitive, word-boundary, and alternation patterns over a ~25k-line corpus. 1. RE2JS Matcher cache (packages/just-bash/src/regex/user-regex.ts) regex.test()/exec() each allocated a new RE2 Matcher per call. For a recursive grep over 100+ files this meant ~25k Matcher allocations per invocation. Cache the Matcher and swap its charSequence in place when the input changes. Note: Matcher.resetMatcherInput is unsafe with raw strings — the constructor wraps strings via MatcherInput.utf16(), but resetMatcherInput assigns its argument directly and reset() then calls .length() as a method on the assigned value, throwing TypeError on strings. So we mutate the cached Utf16MatcherInput.charSequence field directly and reset(). 2. Combine test() + exec() in searchContent (packages/just-bash/src/commands/search-engine/matcher.ts) The fast path called regex.test() to decide if a line matched, then regex.exec() again on matched lines for the column position. Replace with a single regex.exec() — its return value already encodes both the boolean and the position. Saves one RE2.find() per matched line. 3. Substring pre-filter (packages/just-bash/src/commands/search-engine/regex.ts) Extract must-have literal substrings from the compiled pattern at buildRegex time. In the per-line loop, run String.indexOf first; if no needle is present, the regex provably cannot match, so RE2 is skipped entirely. Handles bare literals, \b...\b wrapped patterns (-w), and top-level alternation (-E 'a|b'). For -i, both line and needles compare in lowercase. Falls back to the full regex for any pattern containing quantifiers, anchors, character classes, or other regex metacharacters — false positives are fine, false negatives would be bugs. Benchmark (HTTP-API-fronted virtualfs sandbox backed by Neon Postgres, recursive grep over 113-file 25,956-line TypeScript source tree, server- reported ms, p50 of 3 runs after warmup): Pattern before after speedup -ri 'async' 680ms 14ms 48x -rw 'type' 476ms 5ms 90x -rE 'interface|type' 823ms 7ms 123x -r 'interface' 44ms 4ms 11x -rF 'Promise<' 26ms 3ms 9x -rn 'export' 21ms 4ms 5x Tests: - packages/just-bash/src/commands/search-engine/regex.test.ts 29/29 pass (NEW) direct preFilter extractor coverage: 8 happy-path (literal / -w wrap / alternation / -F escape / -i / \\n decode), 16 safety-negative (each quantifier, char class, anchor, dot, \\d \\w \\s \\b alone, capture groups, mixed alternation), and 5 structural (no | split inside groups or [classes], ordering preserved, partial \\b not stripped, empty pattern) - packages/just-bash/src/regex/user-regex.test.ts (67 tests) - packages/just-bash/src/commands/grep/{basic,advanced,perl,exclude,binary}.test.ts - packages/just-bash/src/spec-tests/grep/grep-spec.test.ts (332 tests) - packages/just-bash/src/comparison-tests/grep.comparison.test.ts (validates output against system grep) - packages/just-bash/src/commands/{sed,awk,rg}/*.test.ts (sister commands sharing UserRegex) * fix(grep): address Copilot review — hex escape false negatives, stdin preFilter, misleading comment - Reject \x/\u/\U escapes in literalFromAlternative to prevent false-negative pre-filter needles (e.g. \x41 would have produced 'x41' instead of 'A') - Pass preFilter to searchContent in the stdin code path (was only wired for file inputs, missing the perf win on piped input) - Reword acquireMatcher comment: says resetMatcherInput was used but the code explicitly avoids it due to a re2js 1.2.1 bug; clarify charSequence mutation - Add safety tests for \x and \u escape rejection * chore: add changeset for grep perf improvement
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
|
|
4ff4e40 to
f544617
Compare
| this._matcherInput = input; | ||
| return this._matcher; | ||
| } | ||
| if (this._matcherInput !== input) { |
There was a problem hiding this comment.
[MUST FIX] packages/just-bash/src/regex/user-regex.ts:96 — The acquireMatcher method mutates this._matcher.matcherInput.charSequence in-place across different UserRegex instances and calls. UserRegex instances are not necessarily single-threaded singletons — if the same UserRegex is reused across parallel command invocations (e.g., two concurrent grep calls sharing the same compiled regex object), two concurrent callers can race on this._matcher and this._matcherInput, leading to incorrect match results or crashes.
More critically, within a single call to searchContent, the firstMatch result from regex.exec(line) at line ~222 is stored, then regex.exec(line) is called again in the onlyMatching/vimgrep loop. Between these calls, regex.lastIndex is advanced by RE2's side effects but the cached Matcher's internal state from the previous exec call is still live. The acquireMatcher path only calls matcher.reset() when the input changes — if the same line is passed twice in a row (first exec to get firstMatch, then loop iterations), reset() is skipped and the Matcher resumes from wherever lastIndex left off. This is by design for global regex iteration, but the reset logic must correctly handle the case where lastIndex was explicitly set to 0 (regex.lastIndex = 0 before the diff's removed calls).
Please verify that: (1) the Matcher's internal position is always consistent with this._lastIndex at the start of acquireMatcher, and (2) concurrent use of a single UserRegex across parallel executions is either impossible or protected.
Reviewed at f544617
| // directly and then calls .length() as a method, which throws on a | ||
| // raw string). MatcherInput is not exported, so we mutate the existing | ||
| // wrapper's charSequence field — Matcher.reset() reads matcherInput.length() | ||
| // afterwards, so the new length is picked up correctly. |
There was a problem hiding this comment.
[MUST FIX] packages/just-bash/src/regex/user-regex.ts:104 — The comment explains that resetMatcherInput is broken in re2js 1.2.1 and works around it by directly mutating this._matcher.matcherInput.charSequence. This is coupling the implementation to private internals of re2js that could change in any patch release, and it suppresses the type error with as any. If re2js 1.2.1 is confirmed broken, pin it in package.json with an exact version (or at most ~1.2.1) to prevent accidental upgrades that would break this assumption silently. As written, a minor re2js upgrade could change the internal field name or layout, and this code would silently produce wrong matches with no type-level protection.
Reviewed at f544617
| for (let i = 0; i < lastIdx; i++) { | ||
| const line = lines[i]; | ||
| // Pre-filter: skip lines that can't contain any required literal. | ||
| // For invertMatch=true (-vc), a no-match line still counts, so we |
There was a problem hiding this comment.
[MUST FIX] packages/just-bash/src/commands/search-engine/matcher.ts:168 — In the onlyMatching branch (line ~230 in the diff), the loop is changed to start from firstMatch instead of a fresh regex.exec(line). This is correct only if regex.lastIndex has already been advanced past the first match before the loop continues with match = regex.exec(line). However, the pre-filter branch that sets firstMatch = null also never calls regex.exec, so regex.lastIndex is 0 at that point — that case is fine. The issue is in the matches && onlyMatching path: the first exec call sets lastIndex past the first match; the loop then calls regex.exec(line) for subsequent matches. This looks correct. But please confirm the case where firstMatch[0].length === 0 (zero-length match): the original code had regex.lastIndex++ inside the loop body — verify that this guard still applies correctly when starting from firstMatch rather than a fresh exec.
Reviewed at f544617
| if (core.length === 0) return null; | ||
|
|
||
| const alternatives = splitTopLevelAlternation(core); | ||
| if (alternatives === null) return null; |
There was a problem hiding this comment.
[SHOULD FIX] packages/just-bash/src/commands/search-engine/regex.ts:280 — The literalFromAlternative reject-list regex /[dDwWsSbBAZzGQE0-9ckpPNXRxuU]/ uses character ranges 0-9 and A-Z to block numeric and uppercase back-references. However, the range A-Z in a character class covers all uppercase letters, including some that are only used as literal-escape sequences in certain regex flavors (e.g., \C, \F, \H, \I, etc.). This is an overly broad rejection — fine for safety, but the intent comment only mentions \d, \w, \s, \b, \B. More importantly, the range 0-9 would block \0 through \9, which are back-references in BRE/ERE and octal escapes in PCRE, so rejection is correct. Please add a test case or comment that explicitly documents why the full A-Z range is rejected rather than individual characters, to make future maintenance safer.
Reviewed at f544617
| padding: "60px 80px", | ||
| padding: "60px 200px", | ||
| }} | ||
| > |
There was a problem hiding this comment.
[SHOULD FIX] examples/website/app/opengraph-image.tsx:33 — The <svg> element uses fill-rule and clip-rule as HTML attribute names, but in JSX/React these must be camelCased: fillRule and clipRule. Using hyphenated attribute names in JSX will produce a React warning and the attributes will be ignored in the rendered output (the triangle fill may not render correctly).
Reviewed at f544617
Review SummaryFive issues found across the diff:
5 comments posted · Model: |
Automated upstream sync to just-bash@2.14.5 has merge conflicts.
Check out this branch, resolve conflicts at TypeScript-source level, then run:
Commit and push to update this PR.
Need help on this PR? Tag
@codesmithwith what you need.