🔤 Update Unicode scanner to v2026.03.0 by dragonfire1119 · Pull Request #73 · bigbeartechworld/big-bear-scripts

dragonfire1119 · 2026-03-08T00:12:27Z

Update version to 2026.03.0 across all references
Expand --exclude-common flag to cover additional typography characters (soft hyphens, superscripts, subscripts, Roman numerals, combining diacritical marks, replacement character)
Reduce false positives in documentation and markdown repositories
Improve help text clarity for the --exclude-common option

Greptile Summary

This PR bumps the scanner version to 2026.03.0, expands is_common_unicode() with new exclusions for superscript digits, subscript digits, and non-confusable Roman numerals, and updates the help text and banner accordingly.

Key observations:

The version bump and dynamic banner (v${VERSION}) are applied consistently across the comment header, VERSION variable, and display output. The banner padding has been adjusted to 7 spaces on each side, producing a correct 62-character inner width for the 2026.03.0 version string.
The new is_common_unicode() exclusions correctly implement the described intent: superscripts (^00B[239]$, ^207[0-9]$), subscripts (^208[0-9]$), and Roman numerals (^21[67][0-9A-F]$) are all treated as common, while the five explicitly confusable Roman numerals (U+2160, U+2165, U+2169, U+2174, U+2179) are correctly preserved as detectable via the negated guard [[ ! "$unicode_code" =~ ^(2160|2165|2169|2174|2179)$ ]].
The updated help text now includes an inline note (note: also suppresses some AI-confusion and homograph checks), which is a meaningful transparency improvement over the old (very permissive) label.
All of the new exclusions (^207[0-9]$, ^208[0-9]$, ^21[67][0-9A-F]$) over-cover the actual harmful_patterns population by design (the excess codes are not in harmful_patterns and are therefore functionally inert), and the inline comments acknowledge this explicitly.
Multiple previous review threads have raised concerns about the security trade-offs of expanding --exclude-common; those issues remain open and are not repeated here.
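
The exclusion logic described in these observations can be sketched as follows. This is a reconstruction from the patterns quoted in the review, not the actual source of run.sh: the function body, the ordering of the checks, and the sample loop are assumptions, and the older exclusions (smart quotes, dashes, and so on) are left out.

```bash
#!/usr/bin/env bash
# Sketch reconstructed from the review text; NOT the actual run.sh source.
# Takes a 4-digit uppercase hex code point (e.g. "2161").
# Returns 0 when the code point counts as "common", 1 otherwise.
is_common_unicode() {
    local unicode_code="$1"

    # Superscript digits: U+00B2, U+00B3, U+00B9 and U+2070-U+2079
    [[ "$unicode_code" =~ ^00B[239]$ ]] && return 0
    [[ "$unicode_code" =~ ^207[0-9]$ ]] && return 0

    # Subscript digits: U+2080-U+2089
    [[ "$unicode_code" =~ ^208[0-9]$ ]] && return 0

    # Roman numerals U+2160-U+217F, except the five Latin lookalikes,
    # which stay detectable because of the negated guard.
    if [[ "$unicode_code" =~ ^21[67][0-9A-F]$ ]] &&
       [[ ! "$unicode_code" =~ ^(2160|2165|2169|2174|2179)$ ]]; then
        return 0
    fi

    return 1
}

# Sample classification (the code points chosen here are illustrative).
for code in 00B2 2075 2089 2161 2160 2179 0041; do
    if is_common_unicode "$code"; then
        echo "U+$code: common (skipped with --exclude-common)"
    else
        echo "U+$code: still checked"
    fi
done
```

The guard is evaluated together with the range match, so a confusable numeral such as U+2160 falls through to `return 1` instead of being swallowed by the broader `^21[67][0-9A-F]$` pattern.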

Confidence Score: 3/5

The code logic is functionally correct but the expanded --exclude-common exclusions introduce documented trade-offs that suppress active security detections (AI-confusion superscripts/subscripts, confusable punctuation) and warrant review before merging.
The version bump and banner changes are straightforward and correctly implemented. The new is_common_unicode() patterns are logically sound — the Roman numeral confusable-guard is correctly structured, regex ranges match their documented intent, and the over-inclusive subscript/superscript coverage is explicitly annotated. However, several of the newly excluded characters (00B2, 00B3, 00B9, 2070–2079, 2080–2084) are explicitly listed in harmful_patterns under the AI-confusion category, and multiple previous review threads have flagged unresolved concerns about silencing those detections. The framing shift from "very permissive" to "recommended for docs/markdown repos" may increase adoption of the flag among users who don't fully appreciate the trade-offs. Score reflects that while no new blocking bugs were introduced, the open concerns from prior threads around security coverage regression have not yet been resolved.
The is_common_unicode() function in check-for-unicode/run.sh (lines 209–219) warrants careful review due to the interplay between the new exclusions and existing harmful_patterns entries.

Important Files Changed

Filename	Overview
check-for-unicode/run.sh	Version bumped to 2026.03.0; `is_common_unicode()` expanded with superscript/subscript/Roman-numeral exclusions that silence entries already present in `harmful_patterns`; help text and banner updated. The exclusion logic is functionally correct but intentionally silences several AI-confusion and confusable-character detections when `--exclude-common` is active.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Character from harmful_patterns] --> B{--exclude-common\nflag active?}
    B -- No --> Z[Run detection check]
    B -- Yes --> C{is_common_unicode?}

    C -- Smart quotes\nU+2018/19/1C/1D --> SKIP[Skip / no alert]
    C -- Dashes\nU+2010-2015 --> SKIP
    C -- Ellipsis U+2026 --> SKIP
    C -- Common spaces\nU+2007-200A --> SKIP
    C -- Angle quotes\nU+2039/203A --> SKIP
    C -- Per mille U+2030 --> SKIP

    C -- Superscripts\nU+00B2/B3/B9\nU+2070-2079 --> SKIP
    C -- Subscripts\nU+2080-2089 --> SKIP

    C -- Roman numerals\nU+2160-217F\nexcl. 2160/2165/2169/2174/2179 --> SKIP

    C -- Confusable Roman numerals\n2160 I, 2165 VI, 2169 X\n2174 v, 2179 x --> Z
    C -- No match --> Z

    Z --> R{Pattern found\nin file?}
    R -- Yes --> ALERT[Report finding]
    R -- No --> NEXT[Next pattern]

    style SKIP fill:#f96,color:#000
    style ALERT fill:#f33,color:#fff
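
The decision flow in the chart can be approximated with a small filter. This is only a sketch: the `harmful_patterns` entries, the `patterns_to_check` helper, and the `case`-based `is_common_unicode` stand-in are hypothetical and do not reproduce the data structures in run.sh.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the scanner's pattern table ("CODE:description").
# The descriptions are illustrative; the real table lives in run.sh.
harmful_patterns=(
    "00B2:superscript two"
    "2080:subscript zero"
    "2160:Roman numeral one"
)

# Compact stand-in covering only the ranges named in the flowchart.
# The confusable Roman numerals are matched first so they are never skipped.
is_common_unicode() {
    case "$1" in
        2160|2165|2169|2174|2179) return 1 ;;
        00B[239]|207[0-9]|208[0-9]) return 0 ;;
        21[67][0-9A-F]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the code points that would still reach the detection check.
# $1 is "true" when --exclude-common is active, anything else otherwise.
patterns_to_check() {
    local exclude_common="$1" entry code
    for entry in "${harmful_patterns[@]}"; do
        code="${entry%%:*}"
        if [[ "$exclude_common" == true ]] && is_common_unicode "$code"; then
            continue    # "Skip / no alert" branch of the chart
        fi
        echo "$code"    # "Run detection check" branch of the chart
    done
}

echo "Without --exclude-common:"; patterns_to_check false
echo "With --exclude-common:";    patterns_to_check true
```

With the flag active, only the confusable numeral remains in the list, which is the trade-off the review calls out: the superscript and subscript entries are silenced even though they appear in `harmful_patterns`.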

Comments Outside Diff (2)

check-for-unicode/run.sh, line 75 (link)

Stale example comment

The --exclude-common help text (lines 56–58) was updated to remove "very permissive" in favor of "recommended for docs/markdown repos," but this example still uses the old phrasing. Update the example comment for consistency.

check-for-unicode/run.sh, line 75 (link)

The inline comment on this example contradicts the updated help text. The help describes the flag as "recommended for docs/markdown repos" (line 58), but this example labels it "Very permissive for documentation"—a different framing that could confuse users.

_{Last reviewed commit: b93595d}

refactor: Expand common Unicode exclusions and improve documentation

Update version to 2026.03.0 across all references. Enhance the --exclude-common flag to cover additional typography characters including soft hyphens, superscripts, subscripts, Roman numerals, combining diacritical marks, and the replacement character. These additions reduce false positives in documentation and markdown repositories while maintaining security against actual threats. Improve help text clarity for the --exclude-common option.

coderabbitai · 2026-03-08T00:12:41Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Version bumped to v2026.03.0 AI+. The run script expands the --exclude-common help text and is_common_unicode checks to treat ellipsis, soft hyphen, superscripts, subscripts, and Roman numerals as common; the ASCII banner now renders the dynamic VERSION variable.

Changes

Cohort / File(s)	Summary
Version, help text, and Unicode detection `check-for-unicode/run.sh`	Bumped `VERSION` to `2026.03.0` (rendered in the banner as `v2026.03.0 AI+`). Updated usage/help text for `--exclude-common` to list additional typography. Expanded `is_common_unicode` to include U+00AD (soft hyphen), ellipsis, superscript digits (U+00B2, U+00B3, U+00B9, U+2070–U+2079), subscript digits (U+2080–U+2084), and Roman numerals (U+2160–U+217F) while excluding certain confusables. ASCII art banner now interpolates `VERSION`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐇 I nibble bytes and count each glyph anew,
Soft hyphens hide while superscripts peek through.
Subscripts hum, Roman numerals stand tall,
The banner sings the version for all.
A happy rabbit hops — small fixes, neat and true.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is related to the changeset but contains inaccuracies regarding implemented features.	Update the PR description to remove references to soft hyphens, combining diacritical marks, and U+FFFD, which do not appear in the actual diff. Align description with the actual code changes: superscripts, subscripts, and Roman numerals (with confusable carve-out).

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: updating the Unicode scanner to version 2026.03.0, which is the primary focus of the pull request.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

check-for-unicode/run.sh (1)

207-219: Please add a negative-control test for these new exemptions.

The current docs test in check-for-unicode/test-suite/run-tests.sh, Lines 62-76, only checks that the fixtures pass with --exclude-common. Please add the companion assertion that the same fixtures still fail without the flag, so a future regression can't silently widen the default-safe set.

Example follow-up in check-for-unicode/test-suite/run-tests.sh

 # Test documentation with --exclude-common flag (should exit 0 - common Unicode excluded)
 echo -e "${YELLOW}Testing documentation with --exclude-common flag...${NC}"
 for file in "${SCRIPT_DIR}"/*docs*.md "${SCRIPT_DIR}"/*typography*; do
     if [ -f "$file" ]; then
         ((total++))
         filename=$(basename "$file")
         if "$SCANNER" --exclude-common "$file" > /dev/null 2>&1; then
             echo -e "  ${GREEN}✓ PASS${NC}: $filename (common Unicode excluded)"
             ((passed++))
         else
             echo -e "  ${RED}✗ FAIL${NC}: $filename (unexpected detection with --exclude-common)"
             ((failed++))
         fi
     fi
 done
+
+# The same fixtures should still be flagged without --exclude-common
+echo -e "${YELLOW}Testing documentation without --exclude-common flag...${NC}"
+for file in "${SCRIPT_DIR}"/*docs*.md "${SCRIPT_DIR}"/*typography*; do
+    if [ -f "$file" ]; then
+        ((total++))
+        filename=$(basename "$file")
+        if "$SCANNER" "$file" > /dev/null 2>&1; then
+            echo -e "  ${RED}✗ FAIL${NC}: $filename (unexpected pass without --exclude-common)"
+            ((failed++))
+        else
+            echo -e "  ${GREEN}✓ PASS${NC}: $filename (still flagged without --exclude-common)"
+            ((passed++))
+        fi
+    fi
+done
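
The positive and negative checks in the diff above could also share one helper, so the expected outcome is stated explicitly at each call site. This is a minimal sketch: `expect_scan` and `fake_scanner` are hypothetical names, and the stub stands in for the real scanner, which is not invoked here. The counters use `$((...))` assignments because `((total++))` returns a non-zero status when `total` is 0, which matters if a script runs under `set -e`.

```bash
#!/usr/bin/env bash
total=0; passed=0; failed=0

# expect_scan <pass|fail> <label> <command...>
# Runs the command quietly and records whether its exit status
# matches the expected outcome.
expect_scan() {
    local expected="$1" label="$2"
    shift 2
    local actual=fail
    if "$@" > /dev/null 2>&1; then
        actual=pass
    fi
    total=$((total + 1))
    if [[ "$actual" == "$expected" ]]; then
        passed=$((passed + 1))
        echo "  PASS: $label"
    else
        failed=$((failed + 1))
        echo "  FAIL: $label (expected $expected, got $actual)"
    fi
}

# Hypothetical stub: succeeds only when --exclude-common is the first argument.
fake_scanner() {
    [[ "$1" == "--exclude-common" ]]
}

expect_scan pass "docs fixture with --exclude-common"    fake_scanner --exclude-common docs.md
expect_scan fail "docs fixture without --exclude-common" fake_scanner docs.md
echo "total=$total passed=$passed failed=$failed"
```

In a real test suite the stub would be replaced by the scanner path, for example `expect_scan fail "$filename" "$SCANNER" "$file"` for the negative control.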

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@check-for-unicode/run.sh` around lines 207 - 219, Add a negative-control
assertion to run-tests.sh: after the existing test that runs the fixtures with
the --exclude-common flag, add a companion check that runs the exact same
fixtures command without --exclude-common and asserts it fails (non-zero exit)
so regressions widening the default-safe set are caught; use the same fixtures
variable/command used in the current block and negate the exit (e.g., expect
non-zero or use ! before the command) to make the intent explicit.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@check-for-unicode/run.sh`:
- Around line 3-8: The startup banner is hardcoded as "v2.1.1 AI+" while the
script defines VERSION="2026.03.0"; update the banner rendering to use the
VERSION variable instead of the literal string so --help/--version/JSON output
match (e.g., construct the banner via VERSION or a derived variable like
BANNER="v${VERSION} AI+"). Locate uses of the hardcoded "v2.1.1 AI+" (the banner
print routine) and replace them to reference ${VERSION} (and keep the "AI+"
suffix if desired) so all outputs are consistent.
- Around line 56-58: The help text for the --exclude-common flag is missing two
characters now suppressed by is_common_unicode(): include explicit mention of
U+00AD (soft hyphen) and U+FFFD (replacement character) in the --exclude-common
description so the CLI accurately reflects behavior; update the descriptive
string where the --exclude-common option is defined (the help/usage text) to
list “soft hyphen (U+00AD)” and “replacement character (U+FFFD)” alongside the
other examples and ensure the wording matches the existing style and
punctuation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a2ed491-167c-46d0-80fa-a4043f4d929f

📥 Commits

Reviewing files that changed from the base of the PR and between d241a43 and f32e56e.

📒 Files selected for processing (1)

check-for-unicode/run.sh

Correct the upper bound of the Roman numeral Unicode range from U+2179 to U+217F so that it includes the complete set of Roman numeral characters. Remove checks for combining diacritical marks (U+0300-U+030C) and the replacement character (U+FFFD) as these are not security threats and cause false positives in legitimate text processing.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@check-for-unicode/run.sh`:
- Around line 56-58: The help text and implementation disagree:
is_common_unicode() now handles soft hyphen (U+00AD) but the help text doesn't
mention it, and the help text claims "combining accents" (U+0300–U+030C) are
excluded although is_common_unicode() lacks that pattern. Update run.sh so the
help text and is_common_unicode() are consistent—either add U+0300–U+030C
matching logic to is_common_unicode() (e.g., include a regex/class for combining
accents) if you intend to exclude them, or remove "combining accents" from the
help and/or add U+00AD (soft hyphen) to the help text to reflect current
behavior; reference the help block in run.sh and the is_common_unicode()
function when making the change.
- Around line 207-215: The help text says --exclude-common should skip combining
accents (U+0300–U+030C) but is_common_unicode() lacks that pattern; add a check
in is_common_unicode() that returns 0 for codes in the U+0300–U+030C range (e.g.
[[ "$unicode_code" =~ ^030[0-9A-C]$ ]] && return 0) so those characters are
treated as common when --exclude-common is used, and ensure harmful_patterns no
longer unconditionally flags these combining-accent codepoints (or is skipped
when is_common_unicode() returns 0); reference the functions/variables
is_common_unicode, harmful_patterns, and unicode_code when making the change.

ℹ️ Review info

Run ID: 6e75333f-04f0-4ecf-ad06-be6012f7b289

📥 Commits

Reviewing files that changed from the base of the PR and between f32e56e and ba70228.

📒 Files selected for processing (1)

check-for-unicode/run.sh

Update help text to clarify --exclude-common behavior and mention soft hyphen exclusion. Replace hardcoded version string with VERSION variable for dynamic version display in header output.

Update help text to clarify that --exclude-common also suppresses AI-confusion and homograph checks. Fix alignment of header banner text to center properly within the box borders.

coderabbitai

♻️ Duplicate comments (1)

check-for-unicode/run.sh (1)

208-216: ⚠️ Potential issue | 🟡 Minor

--exclude-common still doesn't exclude U+FFFD.

Line 358 still classifies the replacement character as harmful, but is_common_unicode() never returns true for FFFD. That leaves this PR short of the stated exclusion set and keeps a docs-heavy false positive active.

Suggested fix

     # Roman numerals: U+2160-U+217F (used in outlines, legal docs, lists)
     [[ "$unicode_code" =~ ^21[67][0-9A-F]$ ]] && return 0
+    # Replacement Character: U+FFFD
+    [[ "$unicode_code" == "FFFD" ]] && return 0
     return 1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@check-for-unicode/run.sh` around lines 208 - 216, The exclusion set misses
the replacement character U+FFFD causing --exclude-common to still flag it;
update the common-unicode checks (the is_common_unicode() logic that examines
unicode_code) to treat "FFFD" as common by returning success for that code
(i.e., add a condition checking [[ "$unicode_code" == "FFFD" ]] && return 0 or
equivalent within the same block that handles other common codes such as
00AD/00B*/207*/208*/21[67]*). Ensure the check uses the same unicode_code
variable and placement as the existing patterns so U+FFFD is excluded
consistently.

ℹ️ Review info

Run ID: d301de07-f91c-4fba-90b7-ee0a3d1aa5b7

📥 Commits

Reviewing files that changed from the base of the PR and between c18fdac and 765be03.

📒 Files selected for processing (1)

check-for-unicode/run.sh

Reorganize help text for better readability by rewrapping lines at 72 characters. Remove soft hyphen from common Unicode exclusions and refine the Roman numeral exclusion so that the Latin-lookalike confusables (I, VI, X, v, x) stay detectable while other Roman numeral characters are treated as common.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

check-for-unicode/run.sh (2)
648-649: Consider computing banner width instead of hand-tuning spaces.

This now uses ${VERSION}, but the centering is still hardcoded for the current version length. The next version bump will need another manual spacing tweak to keep the box aligned. Rendering the line with a fixed inner width via printf would make future bumps zero-maintenance.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@check-for-unicode/run.sh` around lines 648 - 649, Replace the two hardcoded
echo lines that print the banner (the lines containing "${VERSION}") with a
printf-based rendering that computes a fixed inner width and centers the text;
compute an INNER_WIDTH variable (or TOTAL_WIDTH minus borders), build the two
content strings ("Big Bear Unicode Security Scanner v${VERSION} AI+" and
"Detecting dangerous Unicode & AI injection attacks"), and use printf with a
field width (and padding) to center each string inside the box borders so future
version-length changes keep the box aligned (update the lines that currently
call echo -e "\033[1;35m║ ... ${VERSION} ... ║\033[0m").
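
A printf-based version of the banner lines might look like the sketch below. The `center_line` and `print_banner` helpers and the `INNER_WIDTH` value of 62 are assumptions (the width is taken from the Greptile summary); the ANSI color codes from the original echo lines are omitted for brevity.

```bash
#!/usr/bin/env bash
VERSION="2026.03.0"
INNER_WIDTH=62   # assumed inner width of the box, excluding the two borders

# center_line <text> <width>
# Pads ASCII text with spaces so the result is exactly <width> characters.
# Text that is already too long is printed unchanged.
center_line() {
    local text="$1" width="$2"
    local len=${#text}
    if (( len >= width )); then
        printf '%s' "$text"
        return
    fi
    local left=$(( (width - len) / 2 ))
    local right=$(( width - len - left ))   # takes the extra space on odd gaps
    printf '%*s%s%*s' "$left" '' "$text" "$right" ''
}

print_banner() {
    local title="Big Bear Unicode Security Scanner v${VERSION} AI+"
    local subtitle="Detecting dangerous Unicode & AI injection attacks"
    printf '║%s║\n' "$(center_line "$title" "$INNER_WIDTH")"
    printf '║%s║\n' "$(center_line "$subtitle" "$INNER_WIDTH")"
}

print_banner
```

With `VERSION=2026.03.0` the title is 48 characters long, so a 62-character inner width leaves 7 spaces on each side, the same padding the hand-tuned banner uses. A longer or shorter version string would be re-centered without further edits.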
208-216: Add regression coverage for the new suppression branches.

These new exclusions are not exercised by the current --exclude-common tests: check-for-unicode/test-suite/run-tests.sh:62-76 only scans *docs*.md and *typography*, so the superscripts already present in check-for-unicode/test-suite/ai-injection-test.txt:33-36 never hit this path, and there is still no fixture for the Roman numeral carve-out. A focused test here would make the behavior change much safer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@check-for-unicode/run.sh` around lines 208 - 216, Add targeted regression
tests exercising the new suppression branches by adding a fixture containing the
specific superscript (U+00B2, U+00B3, U+00B9, U+2070-U+2079), subscript
(U+2080-U+2084) and Roman numeral codepoints (including the carve-out examples
U+2160, U+2165, U+2169, U+2174, U+2179) and update the test-runner invocation so
that this fixture is scanned (modify check-for-unicode/test-suite/run-tests.sh
to include the new fixture or broaden the globs it uses instead of only
*docs*.md and *typography*); ensure the test asserts that lines with those codes
are excluded by the suppression logic that checks the unicode_code variable (the
regex branches starting with [[ "$unicode_code" =~ ^00B[239]$ ]], [[
"$unicode_code" =~ ^207[0-9]$ ]], [[ "$unicode_code" =~ ^208[0-4]$ ]], and the
Roman numeral branch [[ "$unicode_code" =~ ^21[67][0-9A-F]$ ]] and its exclusion
list).
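
One way to build such fixtures without pasting raw Unicode into the test suite is to generate them from hex code points. The sketch below is an assumption about how that could be done: `utf8_from_hex`, `write_fixture`, and the fixture file names are hypothetical, and only code points in the Basic Multilingual Plane are handled. Because the five carve-out numerals are meant to stay detectable, they go into a separate fixture that should still be flagged even with `--exclude-common`.

```bash
#!/usr/bin/env bash
# Emit the UTF-8 bytes for a BMP code point given as hex (e.g. "2160").
utf8_from_hex() {
    local cp=$(( 16#$1 ))
    local bytes
    if (( cp < 0x80 )); then
        bytes=$(printf '\\x%02X' "$cp")
    elif (( cp < 0x800 )); then
        bytes=$(printf '\\x%02X\\x%02X' \
            $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) )))
    else
        bytes=$(printf '\\x%02X\\x%02X\\x%02X' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) )))
    fi
    printf '%b' "$bytes"   # %b expands the \xHH escapes into raw bytes
}

# write_fixture <file> <hex code>...
# Writes one labelled line per code point.
write_fixture() {
    local out="$1" code
    shift
    : > "$out"
    for code in "$@"; do
        printf 'U+%s: ' "$code" >> "$out"
        utf8_from_hex "$code" >> "$out"
        printf '\n' >> "$out"
    done
}

tmpdir="$(mktemp -d)"
# Expected to pass with --exclude-common and fail without it.
write_fixture "$tmpdir/suppressed-typography.txt" 00B2 00B3 00B9 2070 2079 2080 2084
# Expected to be flagged with or without --exclude-common.
write_fixture "$tmpdir/confusable-roman.txt" 2160 2165 2169 2174 2179
wc -l "$tmpdir"/*.txt
rm -rf "$tmpdir"
```

The generated files would then be scanned by the test runner alongside the existing `*docs*.md` and `*typography*` globs, with the expected exit status asserted for each fixture.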

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@check-for-unicode/run.sh`:
- Around line 56-59: The help text for the "--exclude-common" option incorrectly
claims it excludes Roman numerals while the scanner only reports five confusable
Roman numerals defined in the harmful_patterns set (and those are explicitly
retained); either update the help string to say something like "non-confusable
Roman numerals" or expand harmful_patterns to include the full U+2160–U+217F
range so Roman numerals are actually suppressible—modify the "--exclude-common"
help text or append the U+2160–U+217F codepoints to the harmful_patterns
collection accordingly.

ℹ️ Review info

Run ID: 9e3a4328-de56-4795-91ae-db50e55168a8

📥 Commits

Reviewing files that changed from the base of the PR and between 765be03 and 0160904.

📒 Files selected for processing (1)

check-for-unicode/run.sh

Expand the --exclude-common option to cover additional common Unicode characters including common spaces, angle quotes, and per mille sign. Update documentation and add clarifying notes about superscript character coverage to reduce false positives in documentation and markdown repositories.

Extend the subscript digits Unicode range from U+2080-U+2084 to U+2080-U+2089 to include all subscript digits. Update the regex pattern from ^208[0-4]$ to ^208[0-9]$ to match the complete range of subscript digit characters.

Add an explanatory comment clarifying that the subscript digits regex pattern ^208[0-9]$ matches U+2080-U+2089, of which U+2085-U+2089 are not currently included in the harmful_patterns list. This documents the intentionally broader matching for future compatibility.
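
The over-coverage can be illustrated with a short loop. This is a sketch under one assumption taken from the review comments, namely that harmful_patterns lists only U+2080-U+2084 among the subscript digits; `in_harmful_patterns` and `inert_subscripts` are hypothetical helpers, not functions from run.sh.

```bash
#!/usr/bin/env bash
# Assumption from the review text: only U+2080-U+2084 are in harmful_patterns.
in_harmful_patterns() {
    [[ "$1" =~ ^208[0-4]$ ]]
}

# List subscript code points that the exclusion regex matches but that
# have no harmful_patterns entry, so excluding them changes nothing.
inert_subscripts() {
    local digit code
    for digit in 0 1 2 3 4 5 6 7 8 9; do
        code="208${digit}"
        if [[ "$code" =~ ^208[0-9]$ ]] && ! in_harmful_patterns "$code"; then
            echo "$code"
        fi
    done
}

inert_subscripts
```

Under that assumption the loop prints 2085 through 2089: the exclusion already covers them, but it only takes effect if those code points are ever added to harmful_patterns.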