Description
Status
Resolved on our side. The fix had two parts: (1) making ffindex.py/hhsuitedb.py use binary I/O with latin-1 and no name truncation, and (2) normalizing a few problematic, overly long A3M basenames.
Original Symptoms
UnicodeDecodeError in ffindex.read_index() when hhsuitedb.py reads .ffindex.
Successful _a3m/_hhm but missing _cs219 files; later hhsearch -d <db> fails with could not open ..._cs219.ffdata.
Sporadic failures tied to very long/complex basenames (multiple double underscores, long family names).
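A minimal sketch of how these symptoms can arise, using made-up index lines and names (nothing here is taken from the actual database). A byte that is not valid UTF-8 makes a UTF-8 decode fail, while latin-1 accepts any byte and round-trips it unchanged. Separately, the format spec "{name:.64}" maps two distinct long names to the same 64-character key.

```python
# Hypothetical index line: name, offset, length, separated by tabs.
# The byte 0xE9 followed by an ASCII letter is not valid UTF-8.
raw_line = b"fam\xe9ly__domain\t0\t128\n"

try:
    raw_line.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 maps every byte 0..255 to one code point, so decoding never fails
# and encoding the result gives back the original bytes.
text = raw_line.decode("latin-1")
roundtrip_ok = text.encode("latin-1") == raw_line

# Two made-up basenames that differ only after the 64th character.
name_a = "X" * 64 + "__variant_1"
name_b = "X" * 64 + "__variant_2"

# Precision in a string format spec truncates to at most 64 characters.
short_a = "{name:.64}".format(name=name_a)
short_b = "{name:.64}".format(name=name_b)

print(utf8_ok, roundtrip_ok, short_a == short_b)
```

After truncation the two entries are indistinguishable in the index, which is one way a lookup by full name could stop matching.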
Root Causes
- Scripts treated .ffindex/.ffdata as UTF-8 text; they’re effectively binary tab tables (non-UTF-8 bytes possible).
- Index writing truncated names with "{name:.64}", causing downstream mismatches.
- cstranslate --ffindex appears sensitive to very long basenames.
What We Changed (local patches)
ffindex.py
Read/write binary; decode/encode lines with latin-1 (1:1 byte mapping).
Removed 64-char truncation; write full names.
Actually sort entries (entries.sort(...)); use mmap.ACCESS_READ.
Use splitlines(); open output index in binary.
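The ffindex.py changes listed above could look roughly like the following sketch. It is not the actual patch: the function names, the FFindexEntry tuple, and the choice to sort by name are assumptions made for illustration, and the demo files are made up.

```python
import mmap
import os
import tempfile
from collections import namedtuple

# Hypothetical record type for one index row.
FFindexEntry = namedtuple("FFindexEntry", ["name", "offset", "length"])


def read_index(index_path):
    """Read an index file as bytes and decode each line with latin-1."""
    entries = []
    with open(index_path, "rb") as fh:
        # bytes.splitlines() breaks only on \n, \r and \r\n.
        for raw_line in fh.read().splitlines():
            if not raw_line:
                continue
            name, offset, length = raw_line.decode("latin-1").split("\t")[:3]
            entries.append(FFindexEntry(name, int(offset), int(length)))
    return entries


def write_index(entries, index_path):
    """Write full, untruncated names, sorted by name, in binary mode."""
    ordered = sorted(entries, key=lambda entry: entry.name)
    with open(index_path, "wb") as fh:
        for entry in ordered:
            line = "{}\t{}\t{}\n".format(entry.name, entry.offset, entry.length)
            fh.write(line.encode("latin-1"))


def read_data(data_path):
    """Map the data file read-only; the caller closes the returned mmap."""
    with open(data_path, "rb") as fh:
        return mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)


# Small demo with made-up entries in a temporary directory.
tmp_dir = tempfile.mkdtemp()
index_path = os.path.join(tmp_dir, "demo.ffindex")
data_path = os.path.join(tmp_dir, "demo.ffdata")

long_name = "fam\xe9ly__" + "x" * 80  # 88 characters, one non-ASCII
with open(data_path, "wb") as fh:
    fh.write(b"ABCDEFGHI\x00wxyz\x00")

write_index(
    [FFindexEntry("zeta", 10, 5), FFindexEntry(long_name, 0, 10)],
    index_path,
)
loaded = read_index(index_path)

data = read_data(data_path)
first_payload = data[loaded[0].offset:loaded[0].offset + loaded[0].length - 1]
data.close()
```

Splitting on bytes before decoding is a deliberate choice in this sketch: str.splitlines() also treats some latin-1 code points (for example \x85) as line boundaries, which could split a name in two.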
hhsuitedb.py
New robust read_ffindex(path) (binary + latin-1; ignore malformed short lines).
write_subset_index(...) writes exact, untruncated names in latin-1 with \n.
This stabilizes subset builds used by ffindex_apply_mpi and cstranslate.
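As a rough illustration of the two hhsuitedb.py helpers, here is a sketch under stated assumptions: the parameter lists (in particular keep_names and out_path) and the tuple layout are hypothetical, since the real signatures are not shown in this report. Only lines with fewer than three tab-separated fields are skipped.

```python
import os
import tempfile


def read_ffindex(path):
    """Return (name, offset, length) tuples; skip malformed short lines."""
    entries = []
    with open(path, "rb") as fh:
        for raw_line in fh.read().splitlines():
            fields = raw_line.decode("latin-1").split("\t")
            if len(fields) < 3:
                continue  # too few fields to be an index row
            entries.append((fields[0], int(fields[1]), int(fields[2])))
    return entries


def write_subset_index(entries, keep_names, out_path):
    """Write only the wanted entries, with exact names and \\n line endings."""
    keep = set(keep_names)
    with open(out_path, "wb") as fh:
        for name, offset, length in entries:
            if name in keep:
                line = "{}\t{}\t{}\n".format(name, offset, length)
                fh.write(line.encode("latin-1"))


# Demo with a made-up index containing one malformed line
# and one name with a non-UTF-8 byte.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "full.ffindex")
subset_path = os.path.join(tmp_dir, "subset.ffindex")

with open(src_path, "wb") as fh:
    fh.write(b"famA\t0\t10\nbroken\nfam\xe9B__long\t10\t20\n")

all_entries = read_ffindex(src_path)
write_subset_index(all_entries, ["fam\xe9B__long"], subset_path)

with open(subset_path, "rb") as fh:
    subset_bytes = fh.read()
```

Because names are decoded and encoded with the same 1:1 codec, the subset index carries byte-identical names to the source index.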
After these changes, _a3m/_hhm build reliably and _cs219 builds for almost all families.
Filename Fix
Simplifying input .a3m filenames (shorter names, fewer double underscores) can help in some cases.
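One possible way to simplify basenames is sketched below. The rules are hypothetical and not the ones we applied by hand: unusual characters become underscores, runs of underscores collapse to one, and names above an arbitrary 64-character cap are shortened with a short hash suffix so that distinct inputs stay distinct. The function name and the cap are assumptions.

```python
import hashlib
import re

MAX_LEN = 64  # arbitrary cap chosen for this sketch


def simplify_basename(name, max_len=MAX_LEN):
    """Return a shorter, plainer basename derived from `name`."""
    # Replace anything outside a conservative character set.
    simple = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    # Collapse runs of underscores such as "__" into a single "_".
    simple = re.sub(r"_{2,}", "_", simple)
    if len(simple) > max_len:
        # Keep a prefix and append 8 hex digits of a hash of the original
        # name, so two long names with a shared prefix do not collide.
        digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
        simple = simple[: max_len - 9] + "_" + digest
    return simple


short = simplify_basename("PF00001__some__family")
capped_a = simplify_basename("A" * 100 + "__1")
capped_b = simplify_basename("A" * 100 + "__2")
```

If names are rewritten this way, keeping a mapping from new to original basenames makes it possible to trace results back to the input families.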