🔤 Update Unicode scanner to v2026.03.0 #73
refactor: Expand common Unicode exclusions and improve documentation Update version to 2026.03.0 across all references. Enhance the --exclude-common flag to cover additional typography characters including soft hyphens, superscripts, subscripts, Roman numerals, combining diacritical marks, and the replacement character. These additions reduce false positives in documentation and markdown repositories while maintaining security against actual threats. Improve help text clarity for the --exclude-common option.
Walkthrough
Version bumped to v2026.03.0 AI+. The run script expands the […]
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
check-for-unicode/run.sh (1)
207-219: Please add a negative-control test for these new exemptions.

The current docs test in `check-for-unicode/test-suite/run-tests.sh`, lines 62-76, only checks that the fixtures pass with `--exclude-common`. Please add the companion assertion that the same fixtures still fail without the flag, so a future regression can't silently widen the default-safe set.

Example follow-up in `check-for-unicode/test-suite/run-tests.sh`:

```diff
 # Test documentation with --exclude-common flag (should exit 0 - common Unicode excluded)
 echo -e "${YELLOW}Testing documentation with --exclude-common flag...${NC}"
 for file in "${SCRIPT_DIR}"/*docs*.md "${SCRIPT_DIR}"/*typography*; do
     if [ -f "$file" ]; then
         ((total++))
         filename=$(basename "$file")
         if "$SCANNER" --exclude-common "$file" > /dev/null 2>&1; then
             echo -e " ${GREEN}✓ PASS${NC}: $filename (common Unicode excluded)"
             ((passed++))
         else
             echo -e " ${RED}✗ FAIL${NC}: $filename (unexpected detection with --exclude-common)"
             ((failed++))
         fi
     fi
 done
+
+# The same fixtures should still be flagged without --exclude-common
+echo -e "${YELLOW}Testing documentation without --exclude-common flag...${NC}"
+for file in "${SCRIPT_DIR}"/*docs*.md "${SCRIPT_DIR}"/*typography*; do
+    if [ -f "$file" ]; then
+        ((total++))
+        filename=$(basename "$file")
+        if "$SCANNER" "$file" > /dev/null 2>&1; then
+            echo -e " ${RED}✗ FAIL${NC}: $filename (unexpected pass without --exclude-common)"
+            ((failed++))
+        else
+            echo -e " ${GREEN}✓ PASS${NC}: $filename (still flagged without --exclude-common)"
+            ((passed++))
+        fi
+    fi
+done
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@check-for-unicode/run.sh` around lines 207 - 219, Add a negative-control assertion to run-tests.sh: after the existing test that runs the fixtures with the --exclude-common flag, add a companion check that runs the exact same fixtures command without --exclude-common and asserts it fails (non-zero exit) so regressions widening the default-safe set are caught; use the same fixtures variable/command used in the current block and negate the exit (e.g., expect non-zero or use ! before the command) to make the intent explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@check-for-unicode/run.sh`:
- Around line 3-8: The startup banner is hardcoded as "v2.1.1 AI+" while the
script defines VERSION="2026.03.0"; update the banner rendering to use the
VERSION variable instead of the literal string so --help/--version/JSON output
match (e.g., construct the banner via VERSION or a derived variable like
BANNER="v${VERSION} AI+"). Locate uses of the hardcoded "v2.1.1 AI+" (the banner
print routine) and replace them to reference ${VERSION} (and keep the "AI+"
suffix if desired) so all outputs are consistent.
- Around line 56-58: The help text for the --exclude-common flag is missing two
characters now suppressed by is_common_unicode(): include explicit mention of
U+00AD (soft hyphen) and U+FFFD (replacement character) in the --exclude-common
description so the CLI accurately reflects behavior; update the descriptive
string where the --exclude-common option is defined (the help/usage text) to
list “soft hyphen (U+00AD)” and “replacement character (U+FFFD)” alongside the
other examples and ensure the wording matches the existing style and
punctuation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7a2ed491-167c-46d0-80fa-a4043f4d929f
📒 Files selected for processing (1)
check-for-unicode/run.sh
Correct the Roman numeral Unicode range from U+2179 to U+217F to include the complete set of Roman numeral characters. Remove checks for combining diacritical marks (U+0300-U+030C) and the replacement character (U+FFFD) as these are not security threats and cause false positives in legitimate text processing.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@check-for-unicode/run.sh`:
- Around line 56-58: The help text and implementation disagree:
is_common_unicode() now handles soft hyphen (U+00AD) but the help text doesn't
mention it, and the help text claims "combining accents" (U+0300–U+030C) are
excluded although is_common_unicode() lacks that pattern. Update run.sh so the
help text and is_common_unicode() are consistent—either add U+0300–U+030C
matching logic to is_common_unicode() (e.g., include a regex/class for combining
accents) if you intend to exclude them, or remove "combining accents" from the
help and/or add U+00AD (soft hyphen) to the help text to reflect current
behavior; reference the help block in run.sh and the is_common_unicode()
function when making the change.
- Around line 207-215: The help text says --exclude-common should skip combining
accents (U+0300–U+030C) but is_common_unicode() lacks that pattern; add a check
in is_common_unicode() that returns 0 for codes in the U+0300–U+030C range (e.g.
[[ "$unicode_code" =~ ^030[0-9A-C]$ ]] && return 0) so those characters are
treated as common when --exclude-common is used, and ensure harmful_patterns no
longer unconditionally flags these combining-accent codepoints (or is skipped
when is_common_unicode() returns 0); reference the functions/variables
is_common_unicode, harmful_patterns, and unicode_code when making the change.
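The range check suggested above can also be written as a numeric comparison. The following is a minimal sketch, assuming `unicode_code` arrives as a hex string as in the quoted regex; the helper name `is_combining_accent` is hypothetical and does not appear in `run.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from run.sh): returns 0 when the hex code point
# falls inside U+0300-U+030C, the range named in the review comment.
is_combining_accent() {
    local unicode_code="$1"
    # Reject anything that is not plain hex before doing arithmetic on it.
    [[ "$unicode_code" =~ ^[0-9A-Fa-f]+$ ]] || return 1
    # base#number syntax converts the hex string to an integer.
    local value=$((16#$unicode_code))
    (( value >= 0x0300 && value <= 0x030C ))
}
```

A numeric comparison may be easier to adjust than a character class if the excluded range is later widened, though the regex form in the comment is equivalent for this range.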
Update help text to clarify --exclude-common behavior and mention soft hyphen exclusion. Replace hardcoded version string with VERSION variable for dynamic version display in header output.
Update help text to clarify that --exclude-common also suppresses AI-confusion and homograph checks. Fix alignment of header banner text to center properly within the box borders.
♻️ Duplicate comments (1)
check-for-unicode/run.sh (1)
208-216: ⚠️ Potential issue | 🟡 Minor

`--exclude-common` still doesn't exclude `U+FFFD`.

Line 358 still classifies the replacement character as harmful, but `is_common_unicode()` never returns true for `FFFD`. That leaves this PR short of the stated exclusion set and keeps a docs-heavy false positive active.

Suggested fix:

```diff
     # Roman numerals: U+2160-U+217F (used in outlines, legal docs, lists)
     [[ "$unicode_code" =~ ^21[67][0-9A-F]$ ]] && return 0
+    # Replacement Character: U+FFFD
+    [[ "$unicode_code" == "FFFD" ]] && return 0
     return 1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@check-for-unicode/run.sh` around lines 208 - 216, The exclusion set misses the replacement character U+FFFD causing --exclude-common to still flag it; update the common-unicode checks (the is_common_unicode() logic that examines unicode_code) to treat "FFFD" as common by returning success for that code (i.e., add a condition checking [[ "$unicode_code" == "FFFD" ]] && return 0 or equivalent within the same block that handles other common codes such as 00AD/00B*/207*/208*/21[67]*). Ensure the check uses the same unicode_code variable and placement as the existing patterns so U+FFFD is excluded consistently.
Reorganize help text for better readability by rewrapping lines at 72 characters. Remove soft hyphen from common Unicode exclusions and refine Roman numeral detection to exclude Latin-lookalike confusables (I, VI, X, v, x) while maintaining detection of other Roman numeral characters.
Actionable comments posted: 1
🧹 Nitpick comments (2)
check-for-unicode/run.sh (2)
648-649: Consider computing banner width instead of hand-tuning spaces.

This now uses `${VERSION}`, but the centering is still hardcoded for the current version length. The next version bump will need another manual spacing tweak to keep the box aligned. Rendering the line with a fixed inner width via `printf` would make future bumps zero-maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@check-for-unicode/run.sh` around lines 648 - 649, Replace the two hardcoded echo lines that print the banner (the lines containing "${VERSION}") with a printf-based rendering that computes a fixed inner width and centers the text; compute an INNER_WIDTH variable (or TOTAL_WIDTH minus borders), build the two content strings ("Big Bear Unicode Security Scanner v${VERSION} AI+" and "Detecting dangerous Unicode & AI injection attacks"), and use printf with a field width (and padding) to center each string inside the box borders so future version-length changes keep the box aligned (update the lines that currently call echo -e "\033[1;35m║ ... ${VERSION} ... ║\033[0m").
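One way the `printf`-based centering could look is sketched below. This is an illustration under assumptions, not the script's actual code: `center_line` is a hypothetical helper, and `INNER_WIDTH=62` is taken from the inner width reported in the Greptile summary of this PR. Color escape codes are omitted for brevity.

```bash
#!/usr/bin/env bash
# Sketch only: VERSION matches the value quoted in the review; INNER_WIDTH
# and center_line are hypothetical names, not taken from run.sh.
VERSION="2026.03.0"
INNER_WIDTH=62

# Print text centered between box borders, padding with spaces on both sides.
# ${#text} counts characters, so this assumes single-width (ASCII) text.
center_line() {
    local text="$1"
    local len=${#text}
    local left=$(( (INNER_WIDTH - len) / 2 ))
    if (( left < 0 )); then left=0; fi
    # Any odd leftover space goes to the right-hand side.
    local right=$(( INNER_WIDTH - len - left ))
    if (( right < 0 )); then right=0; fi
    # %*s takes its field width from the argument list.
    printf '║%*s%s%*s║\n' "$left" '' "$text" "$right" ''
}

center_line "Big Bear Unicode Security Scanner v${VERSION} AI+"
center_line "Detecting dangerous Unicode & AI injection attacks"
```

Because the padding is derived from the string length, a longer or shorter `VERSION` shifts the text but leaves the right border in the same column.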
208-216: Add regression coverage for the new suppression branches.

These new exclusions are not exercised by the current `--exclude-common` tests: `check-for-unicode/test-suite/run-tests.sh:62-76` only scans `*docs*.md` and `*typography*`, so the superscripts already present in `check-for-unicode/test-suite/ai-injection-test.txt:33-36` never hit this path, and there is still no fixture for the Roman numeral carve-out. A focused test here would make the behavior change much safer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@check-for-unicode/run.sh` around lines 208 - 216, Add targeted regression tests exercising the new suppression branches by adding a fixture containing the specific superscript (U+00B2, U+00B3, U+00B9, U+2070-U+2079), subscript (U+2080-U+2084) and Roman numeral codepoints (including the carve-out examples U+2160, U+2165, U+2169, U+2174, U+2179) and update the test-runner invocation so that this fixture is scanned (modify check-for-unicode/test-suite/run-tests.sh to include the new fixture or broaden the globs it uses instead of only *docs*.md and *typography*); ensure the test asserts that lines with those codes are excluded by the suppression logic that checks the unicode_code variable (the regex branches starting with [[ "$unicode_code" =~ ^00B[239]$ ]], [[ "$unicode_code" =~ ^207[0-9]$ ]], [[ "$unicode_code" =~ ^208[0-4]$ ]], and the Roman numeral branch [[ "$unicode_code" =~ ^21[67][0-9A-F]$ ]] and its exclusion list).
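The kind of focused check requested here can be sketched as a table-driven test. The function `is_common_sketch` below is a hypothetical stand-in that only copies the regex branches and carve-out list quoted in this comment; the real `is_common_unicode()` in `run.sh` may contain other branches, and `check_table` is likewise an illustrative name.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the suppression branches quoted in the review;
# the real is_common_unicode() in run.sh may differ.
is_common_sketch() {
    local unicode_code="$1"
    [[ "$unicode_code" =~ ^00B[239]$ ]] && return 0   # U+00B2, U+00B3, U+00B9
    [[ "$unicode_code" =~ ^207[0-9]$ ]] && return 0   # U+2070-U+2079
    [[ "$unicode_code" =~ ^208[0-4]$ ]] && return 0   # U+2080-U+2084
    # Roman numerals, except the five Latin-lookalike confusables.
    if [[ "$unicode_code" =~ ^21[67][0-9A-F]$ ]] &&
       [[ ! "$unicode_code" =~ ^(2160|2165|2169|2174|2179)$ ]]; then
        return 0
    fi
    return 1
}

# Table-driven check: each row is "code expected-exit-status".
check_table() {
    local failures=0 code want got
    while read -r code want; do
        if is_common_sketch "$code"; then got=0; else got=1; fi
        if [[ "$got" -ne "$want" ]]; then
            echo "FAIL: U+$code expected $want, got $got"
            failures=$((failures + 1))
        fi
    done <<'EOF'
00B2 0
2075 0
2083 0
2161 0
217F 0
2160 1
2169 1
2179 1
0041 1
EOF
    return "$failures"
}

check_table && echo "all suppression branches behave as expected"
```

Listing the carve-out code points with an expected status of 1 gives the negative control: if a later change drops the guard, those rows start failing.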
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@check-for-unicode/run.sh`:
- Around line 56-59: The help text for the "--exclude-common" option incorrectly
claims it excludes Roman numerals while the scanner only reports five confusable
Roman numerals defined in the harmful_patterns set (and those are explicitly
retained); either update the help string to say something like "non-confusable
Roman numerals" or expand harmful_patterns to include the full U+2160–U+217F
range so Roman numerals are actually suppressible—modify the "--exclude-common"
help text or append the U+2160–U+217F codepoints to the harmful_patterns
collection accordingly.
Expand the --exclude-common option to cover additional common Unicode characters including common spaces, angle quotes, and per mille sign. Update documentation and add clarifying notes about superscript character coverage to reduce false positives in documentation and markdown repositories.
Extend the subscript digits Unicode range from U+2080-U+2084 to U+2080-U+2089 to include all subscript digits. Update the regex pattern from ^208[0-4]$ to ^208[0-9]$ to match the complete range of subscript digit characters.
Add an explanatory comment clarifying that the subscript digits regex pattern ^208[0-9]$ also covers U+2085-U+2089, which are not currently included in the harmful_patterns list. This documents the intentionally broader matching for future compatibility.
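To see which code points the widened pattern accepts, a short enumeration such as the following can help. It is a sketch that only exercises the regex quoted in the commit message; the variable names are illustrative and not taken from `run.sh`.

```bash
#!/usr/bin/env bash
# Enumerate U+2080-U+2089 and collect the code points the pattern accepts.
pattern='^208[0-9]$'
matched=()
for ((value = 0x2080; value <= 0x2089; value++)); do
    code=$(printf '%04X' "$value")
    if [[ "$code" =~ $pattern ]]; then
        matched+=("$code")
    fi
done
echo "${#matched[@]} code points matched: ${matched[*]}"

# U+208A (subscript plus sign) lies just past the range and is not matched.
if [[ "208A" =~ $pattern ]]; then
    echo "unexpected match for U+208A"
fi
```

The pattern stops at the digit `9`, so the subscript operators that follow the digits in the same block stay outside the exclusion.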
- Enhance the `--exclude-common` flag to cover additional typography characters (soft hyphens, superscripts, subscripts, Roman numerals, combining diacritical marks, replacement character)
- Improve help text clarity for the `--exclude-common` option

Greptile Summary
This PR bumps the scanner version to `2026.03.0`, expands `is_common_unicode()` with new exclusions for superscript digits, subscript digits, and non-confusable Roman numerals, and updates the help text and banner accordingly.

Key observations:
- […] `v${VERSION}`) are applied consistently across the comment header, `VERSION` variable, and display output. The banner padding has been adjusted to 7 spaces on each side, producing a correct 62-character inner width for the `2026.03.0` version string.
- `is_common_unicode()` exclusions correctly implement the described intent: superscripts (`^00B[239]$`, `^207[0-9]$`), subscripts (`^208[0-9]$`), and Roman numerals (`^21[67][0-9A-F]$`) are all treated as common, while the five explicitly confusable Roman numerals (U+2160, U+2165, U+2169, U+2174, U+2179) are correctly preserved as detectable via the negated guard `[[ ! "$unicode_code" =~ ^(2160|2165|2169|2174|2179)$ ]]`.
- […] `(note: also suppresses some AI-confusion and homograph checks)`, which is a meaningful transparency improvement over the old `(very permissive)` label.
- […] `^207[0-9]$`, `^208[0-9]$`, `^21[67][0-9A-F]$`) over-cover the actual `harmful_patterns` population by design (the excess codes are not in `harmful_patterns` and are therefore functionally inert), and the inline comments acknowledge this explicitly.
- […] `--exclude-common`; those issues remain open and are not repeated here.

Confidence Score: 3/5
- The `--exclude-common` exclusions introduce documented trade-offs that suppress active security detections (AI-confusion superscripts/subscripts, confusable punctuation) and warrant review before merging.
- The `is_common_unicode()` patterns are logically sound: the Roman numeral confusable-guard is correctly structured, regex ranges match their documented intent, and the over-inclusive subscript/superscript coverage is explicitly annotated. However, several of the newly excluded characters (`00B2`, `00B3`, `00B9`, `2070–2079`, `2080–2084`) are explicitly listed in `harmful_patterns` under the AI-confusion category, and multiple previous review threads have flagged unresolved concerns about silencing those detections. The framing shift from "very permissive" to "recommended for docs/markdown repos" may increase adoption of the flag among users who don't fully appreciate the trade-offs. Score reflects that while no new blocking bugs were introduced, the open concerns from prior threads around security coverage regression have not yet been resolved.
- The `is_common_unicode()` function in `check-for-unicode/run.sh` (lines 209–219) warrants careful review due to the interplay between the new exclusions and existing `harmful_patterns` entries.

Important Files Changed
`is_common_unicode()` expanded with superscript/subscript/Roman-numeral exclusions that silence entries already present in `harmful_patterns`; help text and banner updated. The exclusion logic is functionally correct but intentionally silences several AI-confusion and confusable-character detections when `--exclude-common` is active.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Character from harmful_patterns] --> B{--exclude-common\nflag active?}
    B -- No --> Z[Run detection check]
    B -- Yes --> C{is_common_unicode?}
    C -- Smart quotes\nU+2018/19/1C/1D --> SKIP[Skip / no alert]
    C -- Dashes\nU+2010-2015 --> SKIP
    C -- Ellipsis U+2026 --> SKIP
    C -- Common spaces\nU+2007-200A --> SKIP
    C -- Angle quotes\nU+2039/203A --> SKIP
    C -- Per mille U+2030 --> SKIP
    C -- Superscripts\nU+00B2/B3/B9\nU+2070-2079 --> SKIP
    C -- Subscripts\nU+2080-2089 --> SKIP
    C -- Roman numerals\nU+2160-217F\nexcl. 2160/2165/2169/2174/2179 --> SKIP
    C -- Confusable Roman numerals\n2160 I, 2165 VI, 2169 X\n2174 v, 2179 x --> Z
    C -- No match --> Z
    Z --> R{Pattern found\nin file?}
    R -- Yes --> ALERT[Report finding]
    R -- No --> NEXT[Next pattern]
    style SKIP fill:#f96,color:#000
    style ALERT fill:#f33,color:#fff
```

Comments Outside Diff (2)
check-for-unicode/run.sh, line 75: Stale example comment

The `--exclude-common` help text (lines 56–58) was updated to remove "very permissive" in favor of "recommended for docs/markdown repos," but this example still uses the old phrasing. Update it for consistency.

check-for-unicode/run.sh, line 75: The inline comment on this example contradicts the updated help text. The help describes the flag as "recommended for docs/markdown repos" (line 58), but this example labels it "Very permissive for documentation", a different framing that could confuse users.
Last reviewed commit: b93595d