Add dotdump_utf8 helper for UTF-8-aware file viewer rendering #2146

Open
C-Oliver wants to merge 1 commit into CybercentreCanada:master from C-Oliver:feat/file-viewer-utf8-support

Summary

Adds dotdump_utf8() to assemblyline.common.str_utils — a UTF-8-aware
counterpart to the existing dotdump() helper. Valid UTF-8 multi-byte
sequences (accented Latin, CJK, emoji, etc.) are preserved as their decoded
characters; only invalid or non-printable bytes are replaced with '.'.

This is the supporting library change for an upcoming
assemblyline-ui PR that switches the File Viewer ASCII tab to render
UTF-8 instead of dot-replacing every non-ASCII byte.

Why

The current File Viewer applies bytes.translate(FILTER_ASCII), which maps
every byte outside printable ASCII (32–126, plus tab/LF/CR) to '.'. Because
every byte of a UTF-8 multi-byte sequence is in the 0x80–0xFF range,
files containing café, 日本語, 🎉, etc. are rendered as solid dots.
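
The effect described above can be reproduced with a small sketch. ASCII_FILTER here is a hypothetical stand-in for FILTER_ASCII, built from the description in this PR (printable ASCII plus tab/LF/CR kept, everything else mapped to '.'); the real table in the library may differ in detail.

```python
# Hypothetical stand-in for FILTER_ASCII (an assumption, not the library's table):
# keep bytes 32-126 plus tab (9), LF (10) and CR (13); map every other byte to '.'.
ASCII_FILTER = bytes(
    b if (32 <= b <= 126 or b in (9, 10, 13)) else ord(".")
    for b in range(256)
)


def ascii_dotdump(data: bytes) -> str:
    """Dot-replace every byte outside the kept ASCII set."""
    return data.translate(ASCII_FILTER).decode("ascii")


for text in ("café", "日本語", "🎉ok"):
    # Each UTF-8 continuation or lead byte is >= 0x80, so it becomes '.'
    print(repr(text), "->", repr(ascii_dotdump(text.encode("utf-8"))))
```

Each non-ASCII character turns into as many dots as it has encoded bytes, which is why CJK or emoji-heavy files show up as runs of dots.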

Implementation

  • Reuses the existing _valid_utf8 regex (RFC 3629, already used by
    escape_str_strict) so the definition of "valid UTF-8" stays consistent
    across the codebase.
  • _valid_utf8.split(s) returns alternating non-matching / captured
    segments. Captured (valid UTF-8) segments are decoded to str;
    non-matching bytes are replaced one-for-one with '.', preserving the
    byte length of invalid sections.
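
A minimal sketch of the split-and-decode approach, under stated assumptions: VALID_MULTIBYTE is a hypothetical pattern written from RFC 3629 for illustration and is not the library's _valid_utf8; ASCII_FILTER and dotdump_utf8_sketch are likewise illustrative names. It also assumes the pattern matches only multi-byte sequences, so ASCII bytes fall into the non-matching segments and go through a printable-ASCII filter. The code in this PR may differ.

```python
import re

# Hypothetical pattern (assumption): a single capture group matching runs of
# well-formed multi-byte UTF-8 sequences per RFC 3629 (no overlongs, no surrogates).
VALID_MULTIBYTE = re.compile(
    rb"((?:[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})+)"
)

# Hypothetical filter table (assumption): keep printable ASCII plus tab/LF/CR.
ASCII_FILTER = bytes(
    b if (32 <= b <= 126 or b in (9, 10, 13)) else ord(".")
    for b in range(256)
)


def dotdump_utf8_sketch(data: bytes) -> str:
    # With one capture group, re.split returns segments that alternate:
    # even indexes are non-matching bytes, odd indexes are captured runs.
    parts = VALID_MULTIBYTE.split(data)
    out = []
    for index, part in enumerate(parts):
        if index % 2:
            # Captured run: already validated by the pattern, safe to decode.
            out.append(part.decode("utf-8"))
        else:
            # Non-matching bytes: one output character per input byte.
            out.append(part.translate(ASCII_FILTER).decode("ascii"))
    return "".join(out)
```

Because the non-matching segments are translated byte-for-byte, an invalid run of n bytes always yields n output characters, for example three dots for the encoded surrogate b'\xed\xa0\x80'.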

Behaviour

Input | Output
b'hello\nworld' | 'hello\nworld'
'café'.encode() | 'café'
'日本語'.encode() | '日本語'
'🎉ok'.encode() | '🎉ok'
b'A\x00B\x07C' | 'A.B.C'
b'\xc3(' (bad 2-byte start) | '.('
b'\xed\xa0\x80' (UTF-16 surrogate) | '...'
b'\t\r\n abc' | '\t\r\n abc'

Risk

  • New, additive helper. No existing call sites are changed in this PR.
  • No new dependencies; reuses the existing compiled _valid_utf8 regex.

Adds a new str_utils.dotdump_utf8() helper that preserves valid UTF-8 multi-byte characters (accented Latin, CJK, emoji, etc.) while replacing only invalid or non-printable bytes with '.'. Reuses the existing _valid_utf8 regex so the validation rules stay consistent with escape_str_strict. This unblocks the File Viewer ASCII tab being able to display UTF-8 text instead of dots.