Resolved: hhsuitedb.py UTF-8 decode crash + _cs219 misses from long/complex names #388

@Jiachen2000

Status

Resolved on our side. We fixed it by (1) switching ffindex.py/hhsuitedb.py to binary I/O with latin-1 decoding and removing name truncation, and (2) normalizing a few problematic, overly long A3M basenames.

Original Symptoms

UnicodeDecodeError in ffindex.read_index() when hhsuitedb.py reads a .ffindex file.
_a3m/_hhm build successfully but the _cs219 files are missing; a later hhsearch -d <db> then fails with could not open ..._cs219.ffdata.
Sporadic failures tied to very long/complex basenames (multiple double underscores, long family names).
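
A quick way to catch the second symptom before running hhsearch is to check that all six database files exist. This is a minimal sketch, assuming the usual <db>_{a3m,hhm,cs219}.{ffdata,ffindex} layout; missing_db_files is a hypothetical helper, not part of HH-suite:

```python
import os
import tempfile


def missing_db_files(db_prefix):
    """Return the expected database files that do not exist for db_prefix."""
    missing = []
    for part in ("a3m", "hhm", "cs219"):
        for ext in ("ffdata", "ffindex"):
            path = "{}_{}.{}".format(db_prefix, part, ext)
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Demo: a throwaway database where only _a3m and _hhm were built.
demo_dir = tempfile.mkdtemp()
demo_prefix = os.path.join(demo_dir, "mydb")
for part in ("a3m", "hhm"):
    for ext in ("ffdata", "ffindex"):
        open("{}_{}.{}".format(demo_prefix, part, ext), "wb").close()

demo_missing = [os.path.basename(p) for p in missing_db_files(demo_prefix)]
print(demo_missing)  # ['mydb_cs219.ffdata', 'mydb_cs219.ffindex']
```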

Root Causes

  1. The scripts treated .ffindex/.ffdata as UTF-8 text, but these files are effectively binary, tab-separated tables and can contain non-UTF-8 bytes.
  2. Index writing truncated names with "{name:.64}", causing downstream mismatches.
  3. cstranslate --ffindex appears sensitive to very long basenames.
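
The first two causes can be reproduced in isolation. The sketch below uses a made-up index line and a made-up long name (both are illustrative assumptions, not taken from our data) to show why UTF-8 decoding fails, why latin-1 round-trips, and what "{name:.64}" does to a long name:

```python
# A made-up index line: name, offset, length, separated by tabs.
# The byte 0xE9 is not valid UTF-8 in this position.
raw = b"fam\xe9ly_001\t0\t128\n"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 maps every byte 0..255 to exactly one code point, so decoding
# cannot fail and encoding gives back the original bytes.
text = raw.decode("latin-1")
roundtrip = text.encode("latin-1")

# A precision in a string format spec silently cuts the value.
long_name = "A" * 70 + "__x"  # hypothetical 73-character basename
truncated = "{name:.64}".format(name=long_name)

print(utf8_ok)                  # False
print(roundtrip == raw)         # True
print(len(truncated))           # 64
print(truncated == long_name)   # False
```

After truncation, the name stored in the index no longer matches the name other tools look up, which is the downstream mismatch described in item 2.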

What We Changed (local patches)

ffindex.py
Read/write binary; decode/encode lines with latin-1 (1:1 byte mapping).
Removed 64-char truncation; write full names.
Actually sort entries (entries.sort(...)); use mmap.ACCESS_READ.
Use splitlines(); open output index in binary.
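
The ffindex.py changes above can be sketched roughly as follows. This is not the actual patch: write_index and read_entry_data are hypothetical names, and the (name, offset, length) tuple layout is an assumption based on the tab-separated index format.

```python
import mmap
import os
import tempfile


def write_index(entries, index_path):
    """Write (name, offset, length) tuples as a sorted, untruncated index."""
    entries = sorted(entries, key=lambda e: e[0])  # sort by name
    with open(index_path, "wb") as fh:  # binary output, explicit "\n"
        for name, offset, length in entries:
            line = "{}\t{}\t{}\n".format(name, offset, length)
            fh.write(line.encode("latin-1"))  # full name, no ".64" precision


def read_entry_data(data_path, offset, length):
    """Return the raw bytes of one entry using a read-only memory map."""
    with open(data_path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Demo with a throwaway two-entry database.
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "demo.ffdata")
index_path = os.path.join(tmpdir, "demo.ffindex")

payload_b = b">b\nAAAA\n\0"
payload_a = b">a\nCC\n\0"
with open(data_path, "wb") as fh:
    fh.write(payload_b + payload_a)

long_name = "a" * 80  # longer than the old 64-character cut-off
write_index(
    [("b_family", 0, len(payload_b)),
     (long_name, len(payload_b), len(payload_a))],
    index_path,
)

with open(index_path, "rb") as fh:
    index_lines = fh.read().splitlines()

first_name = index_lines[0].split(b"\t")[0].decode("latin-1")
second_entry = read_entry_data(data_path, len(payload_b), len(payload_a))
```

Here first_name is the full 80-character name (it sorts before b_family), and second_entry equals payload_a.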

hhsuitedb.py
New robust read_ffindex(path) (binary + latin-1; ignore malformed short lines).
write_subset_index(...) writes exact, untruncated names in latin-1 with \n.
This stabilizes subset builds used by ffindex_apply_mpi and cstranslate.
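
For reference, a rough sketch of what these two helpers can look like. The function names come from the list above, but the bodies and the write_subset_index(entries, wanted_names, out_path) signature are assumptions for illustration, not the exact patch:

```python
import os
import tempfile


def read_ffindex(path):
    """Read an index as (name, offset, length) tuples, skipping bad lines."""
    entries = []
    with open(path, "rb") as fh:
        # Split on bytes first: str.splitlines() would also break on
        # characters such as "\x85", which latin-1 decoding can produce.
        for raw_line in fh.read().splitlines():
            fields = raw_line.decode("latin-1").split("\t")
            if len(fields) < 3:
                continue  # malformed short line
            try:
                offset, length = int(fields[1]), int(fields[2])
            except ValueError:
                continue  # non-numeric offset or length
            entries.append((fields[0], offset, length))
    return entries


def write_subset_index(entries, wanted_names, out_path):
    """Write only the wanted entries, with exact names and "\\n" endings."""
    wanted = set(wanted_names)
    with open(out_path, "wb") as fh:
        for name, offset, length in entries:
            if name in wanted:
                line = "{}\t{}\t{}\n".format(name, offset, length)
                fh.write(line.encode("latin-1"))


# Demo: one non-UTF-8 name, one malformed line, one very long name.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "full.ffindex")
dst = os.path.join(tmpdir, "subset.ffindex")
long_name = "PF00001__" + "x" * 90  # hypothetical long basename

with open(src, "wb") as fh:
    fh.write(b"fam\xe9ly\t0\t10\n")
    fh.write(b"broken line\n")
    fh.write(long_name.encode("latin-1") + b"\t10\t20\n")

entries = read_ffindex(src)
write_subset_index(entries, {long_name}, dst)
subset = read_ffindex(dst)
```

In this demo, entries holds two tuples (the malformed line is dropped) and subset holds only the long entry with its name intact.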

After these changes, _a3m/_hhm build reliably and _cs219 builds for almost all families.

Filename Fix

Simplifying input .a3m filenames can help in some cases.
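
One possible way to simplify names automatically is sketched below. The rules here (collapse runs of non-alphanumeric characters into a single underscore, cap the stem at 48 characters, append a short hash to keep shortened names distinct) are assumptions chosen for illustration; simplify_basename and max_len are hypothetical, and we do not know the exact length at which cstranslate starts to fail.

```python
import hashlib
import re


def simplify_basename(name, max_len=48):
    """Return a shorter, simpler .a3m filename derived from name."""
    stem = name[:-4] if name.endswith(".a3m") else name
    # Collapse runs such as "__" or "-." into one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
    if len(cleaned) > max_len:
        # Keep a prefix and add 8 hex digits of a hash of the original
        # stem so that different long names stay different.
        digest = hashlib.sha1(stem.encode("latin-1", "replace")).hexdigest()[:8]
        cleaned = cleaned[: max_len - 9].rstrip("_") + "_" + digest
    return cleaned + ".a3m"


short_result = simplify_basename("PF00001__7tm_1__seed.a3m")
long_result_a = simplify_basename("x" * 200 + "a.a3m")
long_result_b = simplify_basename("x" * 200 + "b.a3m")

print(short_result)  # PF00001_7tm_1_seed.a3m
```

If names are rewritten this way, it is worth keeping a mapping from old to new names so results can be traced back to the original families.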
