BUG: _from_tsv crashes on Latin-1 encoded TSV files (e.g., µ character) #1594

@bruAristimunha

Description of the problem

_from_tsv uses encoding="utf-8-sig", so it raises UnicodeDecodeError on Latin-1 encoded TSV files that contain characters such as µ (the micro sign, common in European datasets that record µV units in channels.tsv).

Found while processing OpenNeuro datasets in batch with eegdash. A scan of 577 cloned EEG/iEEG datasets (~413k TSV files) found 767 non-UTF-8 TSV files across 8 datasets: ds006233 (347), ds004621 (167), ds003620 (160), ds005692 (59), ds004598 (20), ds005691 (8, iEEG), ds003574 (5), ds004588 (1).
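A scan of this kind can be sketched as follows. This is a hypothetical illustration, not the script that produced the counts above; the function name find_non_utf8_tsv is an assumption, and the sketch simply tries to decode every .tsv file under a root directory as UTF-8.

```python
from pathlib import Path


def find_non_utf8_tsv(root):
    """Return sorted paths of .tsv files under root whose bytes are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    bad = []
    for path in sorted(Path(root).rglob("*.tsv")):
        if not path.is_file():
            continue  # skip directories and broken symlinks
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Reading each file fully is acceptable here because BIDS TSV sidecars are usually small; very large files would need a chunked check instead.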

Steps to reproduce

Minimal:

from pathlib import Path
from mne_bids.tsv_handler import _from_tsv

tsv_path = Path("/tmp/test.tsv")
tsv_path.write_bytes(b"name\tunit\nEEG\t\xb5V\n")  # 'µV' in latin-1
_from_tsv(tsv_path)  # UnicodeDecodeError

End-to-end on a real OpenNeuro dataset:

import openneuro
from mne_bids import BIDSPath, read_raw_bids

openneuro.download(
    dataset="ds004621",
    target_dir="/tmp/ds004621",
    include=["dataset_description.json", "participants.tsv",
             "sub-13/eeg/sub-13_task-rest_*"],
)
bp = BIDSPath(subject="13", task="rest", datatype="eeg",
              suffix="eeg", extension=".vhdr", root="/tmp/ds004621")
read_raw_bids(bp, on_ch_mismatch="rename")

Expected results

The file is read successfully (encoding detected and applied), or a clear warning is emitted before falling back.

Actual results

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 25: invalid start byte. The error is raised deep inside numpy.loadtxt, leaving users no recovery path.
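The "invalid start byte" message follows from how UTF-8 is laid out. The short sketch below uses only built-in bytes and str methods and shows why 0xb5 cannot begin a UTF-8 character, while the same byte decodes cleanly as Latin-1.

```python
raw = b"\xb5V"  # 'µV' as stored by a Latin-1 writer

# 0xb5 is 0b10110101: the leading bits 10 mark a UTF-8 continuation byte,
# so it is not allowed to start a character.
assert raw[0] >> 6 == 0b10

reason = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    reason = err.reason  # 'invalid start byte'

latin1_text = raw.decode("latin-1")  # 'µV'
utf8_bytes = "µV".encode("utf-8")    # b'\xc2\xb5V' (µ takes two bytes in UTF-8)
```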

Proposed fix

Detect the encoding deterministically (BOM check plus UTF-8 validity check), then pass the detected encoding to np.loadtxt. When UTF-8 decoding fails, fall back to latin-1, which never raises because every byte value maps to a character. No new dependency is needed.
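The detection step described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in the linked fix: the function name detect_tsv_encoding is hypothetical, only the UTF-8 BOM is considered (UTF-16/32 BOMs are ignored), and the standard library is used throughout.

```python
import codecs


def detect_tsv_encoding(raw: bytes) -> str:
    """Pick an encoding name for the raw bytes of a TSV file.

    Hypothetical helper: returns 'utf-8-sig' or 'utf-8' when the bytes are
    valid UTF-8, and 'latin-1' otherwise.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so decoding cannot fail,
        # although non-Latin-1 files would be decoded to the wrong text.
        return "latin-1"
    # Valid UTF-8: use the -sig variant only when a BOM is actually present.
    return "utf-8-sig" if raw.startswith(codecs.BOM_UTF8) else "utf-8"


raw = b"name\tunit\nEEG\t\xb5V\n"  # same bytes as the minimal example
encoding = detect_tsv_encoding(raw)
rows = [line.split("\t") for line in raw.decode(encoding).splitlines()]
```

The returned name could then be passed as the encoding argument of np.loadtxt. Because the fallback is silent in this sketch, a real implementation would probably also emit a warning, as the expected results suggest.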

Fix in #1593.
