add --sanitize option #3729

Merged
keith-hall merged 2 commits into sharkdp:master from curious-rabbit:sanitize on Jul 1, 2026

curious-rabbit commented May 6, 2026

This adds an alias for --strip-ansi and enforces stricter sanitization.

--strip-ansi is currently the only option for handling untrusted data, but its name is confusing and does not fully reflect what the code does. It is also incomplete for the purpose of sanitization.

This PR adds an alternative to the --strip-ansi option named --sanitize. It extends the filtered characters to cover all relevant sequences that could spoof content or trigger terminal commands.

Specifically, the added changes are:

  • Parse 8-bit C1 introducers and DCS/SOS/PM/APC bodies (the fix sanitation for --strip-ansi #3725 fix).
  • Substitute bare CR with U+FFFD () (line-overwrite forgery).
  • Substitute SO/SI with U+FFFD () (charset-shift forgery).
  • Substitute non-introducer 8-bit C1 controls with U+FFFD () (RI is a cursor-up overwrite vector).
  • Substitute the remaining C0 controls and DEL (BEL, BS, VT, CAN, SUB, …) with U+FFFD ().
  • Substitute Unicode bidi and zero-width formatting characters (U+200B–U+200D, U+202A–U+202E, U+2066–U+2069, U+FEFF) with U+FFFD () (content-spoofing / Trojan-Source vector).

CRLF, FF, tab, and newline pass through unchanged. The auto-mode plain-text carve-out is preserved, as is loop-through cat mode.
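
The character-level substitution rules listed above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `substitute_controls` and `is_spoofing_format_char` are hypothetical names, the sketch works on an already-decoded `&str`, and it assumes escape sequences (including the 8-bit C1 introducers and their bodies) were stripped by an earlier pass, so any ESC or C1 character still present is simply replaced.

```rust
// Hypothetical sketch; names and structure are assumptions, not bat's API.
const REPLACEMENT: char = '\u{FFFD}';

/// Bidi and zero-width formatting characters named in the PR description.
fn is_spoofing_format_char(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200D}'
            | '\u{202A}'..='\u{202E}'
            | '\u{2066}'..='\u{2069}'
            | '\u{FEFF}'
    )
}

/// Replace risky control and formatting characters with U+FFFD.
/// Assumes escape sequences were already removed by a previous pass.
fn substitute_controls(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        let keep = match c {
            // Tab, newline, and form feed pass through unchanged.
            '\t' | '\n' | '\u{0C}' => true,
            // CR is kept only as part of CRLF; a bare CR is replaced.
            '\r' => chars.peek() == Some(&'\n'),
            // Remaining C0 controls (including SO/SI) and DEL.
            '\u{00}'..='\u{1F}' | '\u{7F}' => false,
            // C1 controls, e.g. RI (U+008D).
            '\u{80}'..='\u{9F}' => false,
            // Bidi and zero-width formatting characters.
            c if is_spoofing_format_char(c) => false,
            _ => true,
        };
        out.push(if keep { c } else { REPLACEMENT });
    }
    out
}
```

For example, `substitute_controls("safe\rEVIL")` turns the bare CR into a visible U+FFFD, so the second half can no longer silently overwrite the first on a terminal, while `"a\r\nb"` is returned unchanged.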

eth-p commented May 6, 2026

Collaborator

For a bit of context on why I introduced --strip-ansi in #2999:

The syntax highlighting definitions we use expect plain source code as their input. If bat tried to syntax highlight source code that was already highlighted with ANSI escape sequences, it would slow down considerably and produce incorrect highlighting output.

Primarily, it was meant to support using bat as a pager for man. Newer versions of man emit ANSI escape sequences, and that caused issues highlighting using the manpage syntax we have.

At the same time, I also added an auto mode that would strip common ANSI escape sequences from the input when the highlighting language wasn't text. That was meant to ensure syntax highlighting worked as expected even if the file was pre-highlighted, while still preserving bat's ability to print formatted text piped in from other executables.

The reason it was named --strip-ansi was to reflect its intent of stripping out escape sequences. Since this pull request would change the behaviour to also replace invisible characters with textual representations, that would be a breaking change in my opinion.

Adding more control sequences to strip out is a very nice improvement and I would absolutely love to see that added, but I would prefer if the control character replacement was a superset of --strip-ansi rather than a change to it.

From a UX perspective, I think having --sanitize=auto/always/never could be a sensible way to approach that. It would imply --strip-ansi=auto/always/never and set it accordingly, but also enable the control character substitution on top of that.
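
One way to picture that layering is the rough sketch below. It is an illustration only, not bat's option handling: `When`, `FilterSettings`, and `resolve_filters` are hypothetical names, the `never` default for --strip-ansi is an assumption, and letting --sanitize override an explicit --strip-ansi value is just one possible reading of "set it accordingly".

```rust
use std::str::FromStr;

// Hypothetical tri-state shared by both options; not bat's actual type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum When {
    Auto,
    Always,
    Never,
}

impl FromStr for When {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "auto" => Ok(When::Auto),
            "always" => Ok(When::Always),
            "never" => Ok(When::Never),
            other => Err(format!("expected auto, always, or never, got '{other}'")),
        }
    }
}

// Hypothetical resolved settings after combining both flags.
#[derive(Debug, PartialEq, Eq)]
struct FilterSettings {
    strip_ansi: When,
    substitute_controls: When,
}

/// If --sanitize is given, it sets --strip-ansi to the same value and
/// enables control-character substitution on top. Otherwise only the
/// --strip-ansi value applies (assumed default: never).
fn resolve_filters(sanitize: Option<When>, strip_ansi: Option<When>) -> FilterSettings {
    match sanitize {
        Some(when) => FilterSettings {
            strip_ansi: when,
            substitute_controls: when,
        },
        None => FilterSettings {
            strip_ansi: strip_ansi.unwrap_or(When::Never),
            substitute_controls: When::Never,
        },
    }
}
```

With this shape, --strip-ansi on its own keeps its current behaviour, and the substitution step only ever runs when --sanitize is requested.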

@keith-hall, thoughts?

In any case, though, awesome work!

@keith-hall

Collaborator

Completely agree 👍

@curious-rabbit

Author

Alright, I will make this a separate option that uses the existing filter and extends it to all relevant control characters.

@curious-rabbit

Author

This should be ready to merge now.

@curious-rabbit

Author

I fixed the merge conflict and squashed the commits.
Please let me know if there is anything else needed to get this merged.

Comment thread src/preprocessor.rs Outdated
keith-hall merged commit 7895139 into sharkdp:master on Jul 1, 2026
24 checks passed