Resolved: hhsuitedb.py UTF-8 decode crash + _cs219 misses from long/complex names

### Status
Resolved on our side — fixed by (1) making `ffindex.py`/`hhsuitedb.py` use binary I/O with latin-1 and no name truncation, and (2) normalizing a few problematic, overly long A3M basenames.


### Original Symptoms
- `UnicodeDecodeError` in `ffindex.read_index()` when `hhsuitedb.py` reads a `.ffindex` file.
- `_a3m` and `_hhm` build successfully but `_cs219` files are missing; a later `hhsearch -d <db>` then fails because it cannot open `..._cs219.ffdata`.
- Sporadic failures tied to very long or complex basenames (multiple double underscores, long family names).


### Root Causes
1. Scripts treated `.ffindex`/`.ffdata` as UTF-8 text; they’re effectively binary tab tables (non-UTF-8 bytes possible).
2. Index writing truncated names with `"{name:.64}"`, causing downstream mismatches.
3. `cstranslate --ffindex` appears sensitive to very long basenames.
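
Root causes 1 and 2 can be reproduced in isolation. The snippet below is only an illustration: the index line and the two long names are made up, not taken from our database.

```python
# Hypothetical index line: name, offset, length, separated by tabs.
# The byte 0xE9 followed by an ASCII letter is not valid UTF-8.
raw_line = b"fam\xe9ly_name\t0\t128\n"

try:
    raw_line.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # this is the crash seen when the index is read as text

# latin-1 maps each byte 0..255 to exactly one code point,
# so decoding cannot fail and re-encoding gives back the same bytes.
text = raw_line.decode("latin-1")
round_trip_ok = text.encode("latin-1") == raw_line

# A ".64" precision in a format spec silently cuts strings to 64 characters,
# so two names that share a 64-character prefix become identical.
name_a = "x" * 64 + "__variant_A"
name_b = "x" * 64 + "__variant_B"
trunc_a = "{name:.64}".format(name=name_a)
trunc_b = "{name:.64}".format(name=name_b)
names_collide = trunc_a == trunc_b
```

With truncated names in the index, any later lookup by full name can miss, which matches the downstream mismatches we saw.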


### What We Changed (local patches)
`ffindex.py`
- Read/write binary; decode/encode lines with latin-1 (1:1 byte mapping).
- Removed 64-char truncation; write full names.
- Actually sort entries (`entries.sort(...)`); use `mmap.ACCESS_READ`.
- Use `splitlines()`; open output index in binary.

`hhsuitedb.py`
- New robust `read_ffindex(path)` (binary + latin-1; ignore malformed short lines).
- `write_subset_index(...)` writes exact, untruncated names in latin-1 with `\n` line endings.
- This stabilizes subset builds used by `ffindex_apply_mpi` and `cstranslate`.
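
For anyone wanting to apply something similar, here is a minimal sketch of the approach, not the actual patch. It assumes the usual three-column index layout (name, offset, length, tab-separated). `FFindexEntry`, `parse_ffindex_bytes` and `format_index` are names invented for this example; only `read_ffindex(path)` mirrors the function mentioned above.

```python
from collections import namedtuple

# Illustrative record type (assumption, not from the patched scripts).
FFindexEntry = namedtuple("FFindexEntry", ["name", "offset", "length"])


def parse_ffindex_bytes(data):
    """Parse raw index bytes; skip lines that are too short or non-numeric."""
    entries = []
    for raw in data.splitlines():
        fields = raw.decode("latin-1").split("\t")
        if len(fields) < 3:
            continue  # malformed short line
        try:
            entries.append(FFindexEntry(fields[0], int(fields[1]), int(fields[2])))
        except ValueError:
            continue  # offset or length is not an integer
    return entries


def read_ffindex(path):
    """Read an index file in binary mode so no UTF-8 decoding is attempted."""
    with open(path, "rb") as fh:
        return parse_ffindex_bytes(fh.read())


def format_index(entries):
    """Serialize entries sorted by name, with full names and "\\n" endings."""
    ordered = sorted(entries, key=lambda e: e.name)
    return b"".join(
        f"{e.name}\t{e.offset}\t{e.length}\n".encode("latin-1") for e in ordered
    )
```

Because latin-1 code points equal byte values, sorting the decoded names gives the same order as sorting the raw bytes.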

After these changes, `_a3m/_hhm` build reliably and `_cs219` builds for almost all families.

### Filename Fix
Simplifying the input `.a3m` filenames can help in some cases.
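
One possible way to do the renaming is sketched below. The rules (collapse repeated underscores, cap the length, append a short hash to keep names unique) and the 48-character limit are assumptions made for this example, and `simplify_basename` is a hypothetical helper, not part of HH-suite.

```python
import hashlib
import re


def simplify_basename(name, max_len=48):
    """Return a shorter, simpler basename; unchanged if already simple."""
    # Replace anything outside a conservative character set, then collapse
    # runs of underscores such as "__" into a single "_".
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    stem = re.sub(r"_{2,}", "_", stem).strip("_")
    if stem == name and len(stem) <= max_len:
        return name
    # A short digest of the original name keeps distinct inputs distinct
    # even when their simplified stems would otherwise be identical.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    return f"{stem[:max_len - 9].rstrip('_')}_{digest}"
```

Keep a mapping from old to new names if downstream tools need the original family names.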
