This folder includes dataset helpers and recipes used by xTuring examples.
sift50m_subset_builder.py filters amazon-agi/SIFT-50M to English examples in the following categories:
- closed_ended_content_level
- open_ended
- optional: controllable_generation
python examples/datasets/sift50m_subset_builder.py \
--output-dir ./data/sift50m_en_small \
--max-examples 100000 \
--include-controllable-generation \
--jsonl

Notes:
- Use --language-col or --category-col if the dataset schema changes.
- Set --max-examples 0 to keep all rows after filtering.
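
The filtering rules above can be pictured with a small, dependency-free sketch. It is not the code in sift50m_subset_builder.py: the function name filter_rows, the default column names language and category, and the "en" language code are assumptions made for illustration. Only the category names and the meaning of --max-examples 0 come from the text above.

```python
from typing import Iterable, Iterator, Mapping

# Category names come from the list above. The column names and the "en"
# language code used below are assumptions for illustration only.
BASE_CATEGORIES = {"closed_ended_content_level", "open_ended"}
OPTIONAL_CATEGORY = "controllable_generation"


def filter_rows(
    rows: Iterable[Mapping[str, str]],
    language_col: str = "language",   # hypothetical default
    category_col: str = "category",   # hypothetical default
    include_controllable_generation: bool = False,
    max_examples: int = 0,
) -> Iterator[Mapping[str, str]]:
    """Yield English rows in the selected categories (max_examples=0 means no cap)."""
    allowed = set(BASE_CATEGORIES)
    if include_controllable_generation:
        allowed.add(OPTIONAL_CATEGORY)

    kept = 0
    for row in rows:
        if row.get(language_col) != "en":
            continue
        if row.get(category_col) not in allowed:
            continue
        yield row
        kept += 1
        # Stop early only when a positive cap was requested.
        if max_examples and kept >= max_examples:
            break
```

Passing different language_col or category_col values mirrors what the --language-col and --category-col flags are for when the schema changes.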
SIFT-50M rows include an audio_path column and (often) a data_source column.
sift50m_audio_mapper.py adds a resolved audio_file column and can drop rows
whose audio files are missing.
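
As a rough illustration of what resolving audio_file could involve, the sketch below joins a per-source root directory with audio_path. It is an assumption about the approach, not the mapper's actual code: parse_audio_roots and resolve_audio_file are hypothetical helpers, and the real script may resolve paths differently. The name=path form matches how --audio-root is passed on the command line.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def parse_audio_roots(specs: Iterable[str]) -> Dict[str, str]:
    """Turn ["mls=/data/mls", ...] into {"mls": "/data/mls", ...}."""
    roots = {}
    for spec in specs:
        name, _, path = spec.partition("=")
        roots[name] = path
    return roots


def resolve_audio_file(
    row: Mapping[str, str],
    roots: Mapping[str, str],
    audio_path_col: str = "audio_path",
    data_source_col: str = "data_source",
    verify_exists: bool = False,
) -> Optional[str]:
    """Return the joined path, or None if the source is unknown or the file is missing."""
    root = roots.get(row.get(data_source_col))
    if root is None:
        return None
    candidate = os.path.join(root, row[audio_path_col])
    # Only touch the filesystem when verification was requested.
    if verify_exists and not os.path.isfile(candidate):
        return None
    return candidate
```

In this sketch, a row that resolves to None is one that a drop-missing step would discard.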
python examples/datasets/sift50m_audio_mapper.py \
--input-dir ./data/sift50m_en_small \
--output-dir ./data/sift50m_en_small_mapped \
--audio-root mls=/data/mls \
--audio-root cv15=/data/commonvoice15 \
--audio-root vctk=/data/vctk \
--verify-exists \
--drop-missing \
--jsonl

If your dataset uses different columns:
python examples/datasets/sift50m_audio_mapper.py \
--input-dir ./data/sift50m_en_small \
--output-dir ./data/sift50m_en_small_mapped \
--audio-path-col audio_path \
--data-source-col data_source

Each script writes:
- a Hugging Face dataset directory (via save_to_disk)
- subset.jsonl (if --jsonl is set)
- a *_meta.json file with the filter settings and detected columns
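
To check what a run recorded, the metadata file can be read back with the standard library. This is a minimal sketch: load_meta is a hypothetical helper, it assumes the *_meta.json file sits directly in the output directory, and it makes no assumption about the keys inside the file.

```python
import glob
import json
import os
from typing import Any, Dict


def load_meta(output_dir: str) -> Dict[str, Any]:
    """Return {filename: parsed JSON} for every *_meta.json file in output_dir."""
    metas = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "*_meta.json"))):
        with open(path, encoding="utf-8") as fh:
            metas[os.path.basename(path)] = json.load(fh)
    return metas


if __name__ == "__main__":
    # Directory name taken from the example commands above.
    for name, meta in load_meta("./data/sift50m_en_small").items():
        print(name, meta)
```

The dataset directory itself, since it is written with save_to_disk, should be loadable with load_from_disk from the Hugging Face datasets library.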