Add dotdump_utf8 helper for UTF-8-aware file viewer rendering by C-Oliver · Pull Request #2146 · CybercentreCanada/assemblyline-base

C-Oliver · 2026-05-27T12:43:45Z

Summary

Adds dotdump_utf8() to assemblyline.common.str_utils, a UTF-8-aware
counterpart to the existing dotdump() helper. Valid UTF-8 multi-byte
sequences (accented Latin, CJK, emoji, etc.) are preserved as their decoded
characters; only invalid or non-printable bytes are replaced with `.`.

This is the supporting library change for an upcoming
assemblyline-ui PR that switches the File Viewer ASCII tab to render
UTF-8 instead of dot-replacing every non-ASCII byte.

Why

The current File Viewer applies bytes.translate(FILTER_ASCII), which maps
every byte outside printable ASCII (32–126, plus tab/LF/CR) to `.`. Because
every byte of a UTF-8 multi-byte sequence is in the 0x80–0xFF range,
files containing café, 日本語, 🎉, etc. are rendered as runs of dots.
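
To make the problem concrete, here is a minimal sketch of the ASCII-only behaviour. The table `FILTER_ASCII_SKETCH` and the function `ascii_dotdump` are hypothetical stand-ins built from the description above (printable ASCII plus tab/LF/CR kept, everything else dotted); they are not copied from the codebase, and the real FILTER_ASCII may differ in detail.

```python
# Stand-in for FILTER_ASCII (assumption): keep printable ASCII (32-126)
# plus tab, LF and CR; map every other byte value to ord('.').
KEEP = set(range(32, 127)) | {0x09, 0x0A, 0x0D}
FILTER_ASCII_SKETCH = bytes(b if b in KEEP else ord(".") for b in range(256))


def ascii_dotdump(data: bytes) -> str:
    """Dot-replace every byte outside the kept ASCII set."""
    return data.translate(FILTER_ASCII_SKETCH).decode("ascii")


print(ascii_dotdump("café".encode()))    # caf..      (é is 2 bytes)
print(ascii_dotdump("日本語".encode()))   # .........  (3 chars x 3 bytes)
print(ascii_dotdump("🎉ok".encode()))     # ....ok     (emoji is 4 bytes)
```

Each multi-byte character becomes as many dots as it has encoded bytes, which is why non-ASCII text is unreadable in the current view.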

Implementation

Reuses the existing _valid_utf8 regex (RFC 3629, already used by
escape_str_strict) so the definition of "valid UTF-8" stays consistent
across the codebase.
_valid_utf8.split(s) returns alternating non-matching / captured
segments. Captured (valid UTF-8) segments are decoded to str;
non-matching bytes are replaced one-for-one with `.`, so each invalid
byte yields exactly one dot and the byte length of invalid sections is
preserved.
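
The split-and-decode approach described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `VALID_UTF8_SKETCH` is a stand-in pattern written from the RFC 3629 byte ranges (with tab/LF/CR and printable ASCII as the accepted single bytes), and `dotdump_utf8_sketch` is a hypothetical name. The actual `_valid_utf8` regex in str_utils may accept a slightly different set of single bytes.

```python
import re

# Stand-in for _valid_utf8 (assumption): ONE capturing group that matches
# runs of printable ASCII (plus tab/LF/CR) and well-formed multi-byte
# sequences per RFC 3629.
VALID_UTF8_SKETCH = re.compile(
    rb"((?:"
    rb"[\x09\x0a\x0d\x20-\x7e]"             # printable ASCII + tab/LF/CR
    rb"|[\xc2-\xdf][\x80-\xbf]"             # 2-byte, overlongs excluded
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"         # 3-byte, overlongs excluded
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"  # 3-byte
    rb"|\xed[\x80-\x9f][\x80-\xbf]"         # 3-byte, surrogates excluded
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"      # 4-byte, overlongs excluded
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"          # 4-byte
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2}"      # 4-byte, up to U+10FFFF
    rb")+)"
)


def dotdump_utf8_sketch(data: bytes) -> str:
    """Decode valid UTF-8 runs; replace every other byte with one dot."""
    out = []
    # With a single capturing group, re.split() returns
    # [non-match, captured, non-match, captured, ..., non-match],
    # so odd indexes hold the valid (captured) segments.
    for index, segment in enumerate(VALID_UTF8_SKETCH.split(data)):
        if index % 2:
            out.append(segment.decode("utf-8"))
        else:
            out.append("." * len(segment))
    return "".join(out)


print(dotdump_utf8_sketch("café".encode()))   # café
print(dotdump_utf8_sketch(b"A\x00B\x07C"))    # A.B.C
print(dotdump_utf8_sketch(b"\xc3("))          # .(
print(dotdump_utf8_sketch(b"\xed\xa0\x80"))   # ...
```

Because only segments already matched by the pattern are decoded, the `decode("utf-8")` call in this sketch should not raise; anything the pattern rejects never reaches it.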

Behaviour

| Input | Output |
| --- | --- |
| `b'hello\nworld'` | `'hello\nworld'` |
| `'café'.encode()` | `'café'` |
| `'日本語'.encode()` | `'日本語'` |
| `'🎉ok'.encode()` | `'🎉ok'` |
| `b'A\x00B\x07C'` | `'A.B.C'` |
| `b'\xc3('` (bad 2-byte start) | `'.('` |
| `b'\xed\xa0\x80'` (UTF-16 surrogate) | `'...'` |
| `b'\t\r\n abc'` | `'\t\r\n abc'` |

Risk

New, additive helper. No existing call sites are changed in this PR.
No new dependencies; reuses the existing compiled _valid_utf8 regex.

Adds a new str_utils.dotdump_utf8() helper that preserves valid UTF-8 multi-byte characters (accented Latin, CJK, emoji, etc.) while replacing only invalid or non-printable bytes with '.'. Reuses the existing _valid_utf8 regex so the validation rules stay consistent with escape_str_strict. This unblocks the File Viewer ASCII tab being able to display UTF-8 text instead of dots.

C-Oliver mentioned this pull request May 27, 2026

File Viewer: render UTF-8 instead of dot-replacing non-ASCII bytes CybercentreCanada/assemblyline-ui#1363

Open

cccs-rs requested review from cccs-douglass, cccs-rs and gdesmar May 27, 2026 12:54

C-Oliver wants to merge 1 commit into CybercentreCanada:master from C-Oliver:feat/file-viewer-utf8-support
