Foundation models are reshaping EEG analysis, yet EEG tokenization remains an open challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. TFM-Tokenizer uses a dual-path architecture with time-frequency masking to learn robust motif representations, and it is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits:
- **Accuracy:** Experiments on four diverse EEG benchmarks show consistent performance gains across both single- and multi-dataset pretraining settings, with up to a 17% improvement in Cohen’s Kappa over strong baselines.
- **Generalization:** As a plug-and-play component, TFM-Tokenizer consistently boosts the performance of diverse foundation models, including BIOT and LaBraM.
- **Scalability:** By operating at the single-channel level rather than relying on the strict 10–20 EEG system, our method has the potential to be device-agnostic. On ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, our tokenizer outperforms baselines by 14%.

A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent token structure, improving both representation quality and interpretability.
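For intuition only, here is a minimal toy sketch of the core idea, not the repository's implementation: a single-channel segment is mapped to a time-frequency representation, and each time frame is vector-quantized against a codebook to produce discrete tokens. The codebook below is random purely for illustration; in TFM-Tokenizer the vocabulary is learned end-to-end with time-frequency masking.

```python
# Toy sketch (NOT the paper's implementation): discretize single-channel EEG
# by (1) computing a time-frequency representation and (2) snapping each time
# frame to its nearest codebook entry. The codebook here is random.
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
fs = 256                                   # assumed sampling rate (Hz)
eeg = rng.standard_normal(10 * fs)         # 10 s of synthetic single-channel EEG

# 1) Time-frequency representation: magnitude spectrogram.
_, _, Z = stft(eeg, fs=fs, nperseg=fs)     # Z: (freq_bins, time_frames)
spec = np.abs(Z).T                         # (time_frames, freq_bins)

# 2) Vector-quantize each time frame to a discrete token ID.
codebook = rng.standard_normal((64, spec.shape[1]))    # 64-entry vocabulary
dists = ((spec[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(1)                   # one token per time frame

print(tokens[:10])
```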
```bash
conda create --name tfm_tokenizer python=3.10
conda activate tfm_tokenizer
pip install -r requirements.txt
```
The datasets used for this study can be accessed at:
- TUEV and TUAB: https://isip.piconepress.com/projects/nedc/html/tuh_eeg/
- CHB-MIT: https://physionet.org/content/chbmit/1.0.0/
- EarEEG (EESM23): https://openneuro.org/datasets/ds005178/versions/1.0.0

The ./dataset_processing folder contains scripts for processing the above datasets. Run the script below to execute the processing, after updating the paths inside it so the raw data is read from, and the processed data saved to, the correct locations:
```bash
./datasets_processing/data_set_processing.sh
```
Update the `data_dir` field in ./configs/dataset_configs.yaml to point to your processed data directory.
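If you prefer to edit the config programmatically, here is a minimal sketch, assuming `data_dir` is a top-level key in the YAML (adjust the key path if your config nests it per dataset):

```python
# Sketch: point the dataset config at the processed data.
# Assumes "data_dir" is a top-level key; adjust if it is nested per dataset.
import yaml

path = "configs/dataset_configs.yaml"
with open(path) as f:
    cfg = yaml.safe_load(f)

cfg["data_dir"] = "/path/to/processed/data"

with open(path, "w") as f:
    yaml.safe_dump(cfg, f)
```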
Then run the appropriate script to pretrain the TFM-Tokenizer, followed by TFM-Encoder pretraining and fine-tuning.

For the single-dataset pretraining setting:

```bash
./tfm_tokenizer_training_script_single_dataset.sh
```

For the multi-dataset pretraining setting:

```bash
./tfm_tokenizer_training_script_multiple_dataset.sh
```
The ./pretrained_weights directory provides our pretrained weights for the TFM-Tokenizer and the downstream transformer in both the single- and multi-dataset settings. Edit and run the following script to obtain evaluation results on the test set (uncomment the lines in the .sh file that correspond to your experiment setting):
```bash
./tfm_tokenizer_inference.sh
```
We also provide the ./token_visualization_samples.ipynb notebook with code to visualize the tokens produced by our tokenizer.
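Outside the notebook, a quick way to eyeball a token sequence is a step plot of token IDs over time. The array below is synthetic; in practice you would substitute the IDs emitted by the tokenizer:

```python
# Sketch: plot a (synthetic) discrete token sequence over time.
import numpy as np
import matplotlib.pyplot as plt

tokens = np.random.default_rng(0).integers(0, 64, size=120)  # fake token IDs

plt.figure(figsize=(8, 2))
plt.step(np.arange(len(tokens)), tokens, where="post")
plt.xlabel("time frame")
plt.ylabel("token ID")
plt.title("Discrete token sequence for one EEG segment")
plt.tight_layout()
plt.show()
```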
If you find our work or this repository useful, please consider giving it a star ⭐ and citing our paper:
```bibtex
@article{pradeepkumar2025single,
  title={Single-channel {EEG} tokenization through time-frequency modeling},
  author={Pradeepkumar, Jathurshan and Piao, Xihao and Chen, Zheng and Sun, Jimeng},
  journal={arXiv preprint arXiv:2502.16060},
  year={2025}
}
```
We appreciate your interest in our work! 😃