Description of the problem
_from_tsv uses encoding="utf-8-sig", which crashes on Latin-1-encoded TSV files containing characters like µ (the micro sign, common in European datasets that record µV units in channels.tsv).
Found while processing OpenNeuro datasets in batch with eegdash. A scan of 577 cloned EEG/iEEG datasets (~413k TSV files) found 767 non-UTF-8 TSV files across 8 datasets: ds006233 (347), ds004621 (167), ds003620 (160), ds005692 (59), ds004598 (20), ds005691 (8, iEEG), ds003574 (5), ds004588 (1).
Steps to reproduce
Minimal:
from pathlib import Path
from mne_bids.tsv_handler import _from_tsv
tsv_path = Path("/tmp/test.tsv")
tsv_path.write_bytes(b"name\tunit\nEEG\t\xb5V\n") # 'µV' in latin-1
_from_tsv(tsv_path) # UnicodeDecodeError
End-to-end on a real OpenNeuro dataset:
import openneuro
from mne_bids import BIDSPath, read_raw_bids
openneuro.download(
    dataset="ds004621",
    target_dir="/tmp/ds004621",
    include=["dataset_description.json", "participants.tsv",
             "sub-13/eeg/sub-13_task-rest_*"],
)
bp = BIDSPath(subject="13", task="rest", datatype="eeg",
              suffix="eeg", extension=".vhdr", root="/tmp/ds004621")
read_raw_bids(bp, on_ch_mismatch="rename")
Expected results
The file is read successfully (encoding detected and applied), or a clear warning is emitted before falling back.
Actual results
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 25: invalid start byte, raised deep inside numpy.loadtxt, with no recovery path for users.
Proposed fix
Detect the encoding deterministically (BOM check, then a UTF-8 validity check), pass the detected encoding to np.loadtxt, and fall back to latin-1 when UTF-8 decoding fails. Latin-1 never raises, since every byte value maps to a character. No new dependency.
Fix in #1593.
References