Chez backend: Invalid UTF-8 encoding for paths with non-ASCII characters #3697

@thanberree

Environment

  • Idris2 Version: 0.7.0-6e52f899b
  • OS: Linux (Ubuntu)
  • Locale: C.UTF-8 (UTF-8 enabled)
  • Backend: Chez Scheme

Description

When compiling an Idris2 project located in a directory whose path contains non-ASCII Unicode characters (e.g., accented characters like é), Idris2 generates the compileChez script with incorrectly encoded path strings. Instead of writing the characters in a form Chez Scheme reads back correctly, Idris2 emits escape sequences such as \233 that Chez Scheme does not interpret as the original characters.

Steps to Reproduce

  1. Create a directory with a non-ASCII character in its name:
     mkdir tpRéférence
     cd tpRéférence
  2. Create a minimal Idris2 project with a main.ipkg file declaring a runmain executable
  3. Compile the project: idris2 --build main.ipkg

Expected Behavior

The compilation should succeed and generate a working executable. The generated compileChez script should contain properly escaped Unicode characters compatible with Chez Scheme's string syntax.
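
As a rough illustration of what "properly escaped" could look like, the sketch below renders a path as a string literal using the R6RS `\x<hex>;` escape, which is part of the standard string syntax that Chez Scheme reads. The helper name `scheme_string_literal` is hypothetical and is not taken from the Idris2 code base; Python is used only to show the intended output, not how the backend works or should be fixed.

```python
def scheme_string_literal(text: str) -> str:
    """Render text as an R6RS string literal, escaping non-ASCII code points.

    Hypothetical helper for illustration only; not Idris2's actual code.
    """
    out = ['"']
    for ch in text:
        if ch == '"' or ch == "\\":
            out.append("\\" + ch)            # escape quote and backslash
        elif 0x20 <= ord(ch) < 0x7F:
            out.append(ch)                   # printable ASCII passes through
        else:
            out.append(f"\\x{ord(ch):X};")   # R6RS hex scalar value escape
    out.append('"')
    return "".join(out)


print(scheme_string_literal("/home/user/tpRéférence"))
# prints "/home/user/tpR\xE9;f\xE9;rence"
```

Writing the script as UTF-8 with the characters left unescaped might work as well, but that depends on how Chez decodes the file, which this sketch does not cover.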

Actual Behavior

Compilation fails with:

Exception in compile-program: failed for /path/to/tpRfrence/build/exec/runmain_app/runmain.ss: no such file or directory
Error: INTERNAL ERROR: Chez exited with return code 255

Note tpRfrence in place of tpRéférence: both é characters are missing from the path that Chez reports.

Root Cause

Examining the generated build/exec/runmain_app/compileChez file reveals:

(parameterize ([optimize-level 3] [compile-file-message #f]) 
  (compile-program "/home/user/tpR\233f\233rence/build/exec/runmain_app/runmain.ss"))

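The byte sequence for é in UTF-8 is 0xC3 0xA9, but Idris2 writes \233, i.e. decimal 233 = 0xE9. That value is the Unicode code point of é (U+00E9), which coincides with its ISO-8859-1 encoding; it is neither of the two UTF-8 bytes. The path therefore appears to be escaped one code point at a time using a decimal escape, rather than written as UTF-8 or in an escape form that Chez Scheme reads back as é.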
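
The numbers involved can be checked with a short Python sketch. The last part reads the digits 233 in two bases; that Chez might treat `\233` as an octal escape is only an assumption made here for illustration and has not been verified against Chez Scheme. It would be consistent with é vanishing from the error message, since 0x9B is a non-printing control character.

```python
ch = "é"

code_point = ord(ch)                  # Unicode scalar value of é
utf8_bytes = ch.encode("utf-8")       # bytes as stored in a UTF-8 path
latin1_bytes = ch.encode("latin-1")   # ISO-8859-1 encoding

print(hex(code_point))                # 0xe9
print(utf8_bytes.hex(" "))            # c3 a9
print(latin1_bytes.hex())             # e9

# The digits "233" from the generated script, read in two bases.
as_decimal = int("233", 10)           # 233 == 0xE9, the code point of é
as_octal = int("233", 8)              # 155 == 0x9B, a C1 control character
print(as_decimal == code_point)       # True
print(hex(as_octal))                  # 0x9b
```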

Impact

  • Projects cannot be compiled if their path contains any non-ASCII characters
  • This affects international users who may have accented characters in their usernames or project paths
  • The error message is misleading as it shows a corrupted path rather than indicating an encoding issue

Workaround

Avoid using non-ASCII characters in project paths.
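
To find out whether a given project location is affected, a small check along these lines may help. It is a sketch only: the function name `non_ascii_components` is made up for this example, and it assumes that any non-ASCII path component can trigger the problem, which has only been observed here with é.

```python
from pathlib import Path


def non_ascii_components(path: Path) -> list[str]:
    """Return the components of path that contain non-ASCII characters."""
    return [part for part in path.parts if not part.isascii()]


if __name__ == "__main__":
    offending = non_ascii_components(Path.cwd())
    if offending:
        print("Non-ASCII path components:", ", ".join(offending))
    else:
        print("Current path is ASCII-only.")
```

Run from the project root, it lists the directory names that would need to be renamed or avoided.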
