A unified, downstream-grounded data selection framework for LLM supervised fine-tuning. It exposes a single Selector protocol and a collection of ready-to-use strategies, so you can switch between random baselines, perplexity filters, embedding-based selectors, quality scorers, diversity algorithms, and LLM-as-a-judge methods without changing the surrounding pipeline.
The framework is designed around two principles:
- One interface for every strategy: every selector implements `select(samples) -> list[dict]`, with parameters configured at construction time.
- Efficient multi-`k` experiments: score-based selectors can be scored once at `max(k)` and truncated for smaller budgets, and expensive GPU/API scorers support resumable score caching.
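The single-interface principle can be pictured as a structural type. The following is a minimal sketch, assuming a `typing.Protocol`-style definition; `SelectorSketch`, `FirstKSelector`, and `run_pipeline` are hypothetical names used for illustration and are not taken from the package.

```python
from collections.abc import Mapping, Sequence
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SelectorSketch(Protocol):
    """Hypothetical stand-in for the framework's Selector protocol."""

    def select(self, samples: Sequence[Mapping[str, Any]]) -> list[dict[str, Any]]:
        ...


class FirstKSelector:
    """Toy strategy: keep the first k samples (illustration only)."""

    def __init__(self, k: int) -> None:
        # All parameters, including the budget k, are fixed at construction time.
        self.k = k

    def select(self, samples: Sequence[Mapping[str, Any]]) -> list[dict[str, Any]]:
        return [dict(sample) for sample in samples[: self.k]]


def run_pipeline(
    selector: SelectorSketch, samples: Sequence[Mapping[str, Any]]
) -> list[dict[str, Any]]:
    # The pipeline depends only on .select(), so strategies are interchangeable.
    return selector.select(samples)
```

Because the pipeline function only calls `select`, swapping in a different strategy should not require touching the code around it.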
This subproject uses uv and requires Python 3.12.
cd data-selection
uv sync

To install development tools (pre-commit, pyright, codespell):

uv sync --group dev

Add new dependencies with `uv add <package-name>` instead of editing `pyproject.toml` manually.
| Selector | File | Key Dependency | Description |
|---|---|---|---|
| `RandomSelector` | `random_selection.py` | none | Random baseline with optional seed. |
| `SourceBalancedRandomSelector` | `source_balanced_random.py` | none | Random selection balanced by the `source` field. |
| `LengthBasedSelector` | `length_based.py` | none | Select shortest or longest samples. |
| `PerplexityBasedSelector` | `perplexity_based.py` | dataflow `PerplexityScorer` | Select by perplexity (low, high, or mid). |
| `EmbeddingSimilaritySelector` | `embedding_similarity.py` | dataflex `offline_near_Selector` | NEAR-style similarity to a target set or domain proxy. |
| `DeitaQualitySelector` | `deita_quality.py` | dataflow `DeitaQuality/ComplexityScorer` | Quality × complexity product. |
| `QualityScorerSelector` | `quality_scorer.py` | dataflow `FineWebEduScorer/PairQualScorer` | FineWeb-Edu / PairQual / composite quality scoring. |
| `DiversityKCenterSelector` | `diversity_kcenter.py` | dataflex `offline_tsds_Selector` | TSDS diversity K-Center selection. |
| `LLMAsSelector` | `llm_selector.py` | dataflow `MetaScorer` | Multi-dimensional LLM-as-a-judge scoring. |
| `CompositeSelector` | `composite.py` | none | Chain multiple selectors sequentially. |

The framework normalizes each input sample to a shared messages representation:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Two raw formats are supported out of the box:

- Alpaca-style: `{"instruction": "...", "output": "..."}`
- Conversations-style: `{"conversations": [{"role": "...", "content": "..."}]}` (also accepts `from`/`value` keys and nested `messages` arrays)

To read samples that use different field names, pass `instruction_key`, `output_key`, and `conversations_key` to `run_selection`.

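To make the normalization step concrete, here is a minimal sketch of how the two raw formats could be mapped to the shared `messages` representation. The function `normalize_sample` is hypothetical and is not the package's implementation; it does not handle nested `messages` arrays and does not remap role names.

```python
from typing import Any


def normalize_sample(
    sample: dict[str, Any],
    instruction_key: str = "instruction",
    output_key: str = "output",
    conversations_key: str = "conversations",
) -> dict[str, list[dict[str, str]]]:
    """Convert an Alpaca-style or conversations-style record to {"messages": [...]}."""
    if conversations_key in sample:
        messages = []
        for turn in sample[conversations_key]:
            # Accept either role/content or from/value naming for each turn.
            role = turn.get("role", turn.get("from"))
            content = turn.get("content", turn.get("value"))
            messages.append({"role": role, "content": content})
        return {"messages": messages}
    if instruction_key in sample and output_key in sample:
        return {
            "messages": [
                {"role": "user", "content": sample[instruction_key]},
                {"role": "assistant", "content": sample[output_key]},
            ]
        }
    raise ValueError(f"Unrecognized sample format with keys: {sorted(sample)}")
```

Raising on unknown formats, rather than silently skipping, is a design choice of this sketch; the real loader may behave differently.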
from data_selection import RandomSelector
samples = [
    {"instruction": "task 1", "output": "result 1", "source": "a"},
    {"instruction": "task 2", "output": "result 2", "source": "b"},
    {"instruction": "task 3", "output": "result 3", "source": "a"},
]
selector = RandomSelector(k=2, seed=42)
selected = selector.select(samples)

from data_selection import RandomSelector
from data_selection.runner import run_selection
selector = RandomSelector(k=100, seed=42)
run_selection(
    selector,
    input_path="data/input.jsonl",
    output_path="data/output_random.jsonl",
)

`run_selection` also accepts a list of `k` values. For score-based selectors, scoring runs once at `max(k)` and results are truncated for each smaller budget:

run_selection(
    selector,
    input_path="data/input.jsonl",
    output_path="data/output_k{k}.jsonl",
    k=[1000, 10000, 100000],
)

- `RandomSelector(k, seed=None)`: uniform random sample.
- `SourceBalancedRandomSelector(k, source_key="source", seed=None)`: samples evenly across unique `source` values.
- `LengthBasedSelector(k, strategy="shortest")`: `shortest` or `longest` by extracted text length.

from data_selection import PerplexityBasedSelector
selector = PerplexityBasedSelector(
    k=1000,
    strategy="low",  # "low", "high", or "mid"
    text_key="text",
    lang="en",
    model_name="dataflow/operators/eval/GeneralText/models/Kenlm/wikipedia",
)

Set `scores_cache_path="scores.jsonl"` to cache expensive perplexity scores and resume interrupted runs.

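The resumable caching idea can be sketched as an append-only JSONL file that is read back before scoring. The record layout (`id` and `score` fields) and the helper `score_with_cache` are assumptions made for illustration; the cache format the package actually writes may differ.

```python
import json
from collections.abc import Callable
from pathlib import Path


def score_with_cache(
    texts: dict[str, str],
    score_fn: Callable[[str], float],
    cache_path: str,
) -> dict[str, float]:
    """Score each text once, appending results to a JSONL cache as they arrive."""
    path = Path(cache_path)
    scores: dict[str, float] = {}
    if path.exists():
        # Reload whatever an earlier run managed to write.
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    scores[record["id"]] = record["score"]
    with path.open("a", encoding="utf-8") as f:
        for sample_id, text in texts.items():
            if sample_id in scores:
                continue  # Already scored in an earlier (possibly interrupted) run.
            scores[sample_id] = score_fn(text)
            f.write(json.dumps({"id": sample_id, "score": scores[sample_id]}) + "\n")
            f.flush()  # Persist immediately so progress survives a crash.
    return {sample_id: scores[sample_id] for sample_id in texts}
```

Writing one line per sample and flushing after each write means an interrupted run loses at most the sample being scored at that moment.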
from data_selection import EmbeddingSimilaritySelector
selector = EmbeddingSimilaritySelector(
    k=100000,
    domain_proxy_text="Solve the following math problem step by step.",
    embed_model="Qwen/Qwen3-Embedding-8B",
    embed_method="auto",
    batch_size=64,
)

You can also point `query_path` to a JSONL file containing target samples; the selector ranks candidates by cosine similarity to the centroid of the target embeddings.

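Ranking by cosine similarity to a target centroid can be sketched in a few lines of plain Python. The embeddings here are ordinary lists of floats supplied by the caller; `rank_by_centroid_similarity` and its helpers are hypothetical and stand in for whatever the selector does internally with model embeddings.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_centroid_similarity(
    candidates: list[list[float]],
    targets: list[list[float]],
    k: int,
) -> list[int]:
    """Return indices of the k candidates closest (by cosine) to the target centroid."""
    center = centroid(targets)
    scored = [(cosine(vec, center), idx) for idx, vec in enumerate(candidates)]
    # Highest similarity first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

Comparing against a single centroid keeps the cost linear in the number of candidates, at the price of collapsing a multi-modal target set into one point.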
from dataflow.operators.eval import DeitaQualityScorer, DeitaComplexityScorer
from data_selection import DeitaQualitySelector
selector = DeitaQualitySelector(
    k=100,
    quality_scorer=DeitaQualityScorer(device="cuda"),
    complexity_scorer=DeitaComplexityScorer(device="cuda"),
    scores_cache_path="deita_scores.jsonl",
)

For FineWeb-Edu / PairQual scoring, use `QualityScorerSelector`.

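The quality × complexity product used for Deita-style ranking can be illustrated with precomputed scores. This is a minimal sketch assuming both scores are already available as parallel lists and that higher is better; `rank_by_quality_times_complexity` is a hypothetical helper, not part of the package.

```python
def rank_by_quality_times_complexity(
    quality: list[float],
    complexity: list[float],
    k: int,
) -> list[int]:
    """Return indices of the k samples with the largest quality * complexity product."""
    if len(quality) != len(complexity):
        raise ValueError("quality and complexity must have the same length")
    combined = [q * c for q, c in zip(quality, complexity)]
    # Sort indices by the combined score, best first.
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order[:k]
```

A product, unlike a sum, penalizes samples that score well on one axis but poorly on the other.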
from data_selection import DiversityKCenterSelector
selector = DiversityKCenterSelector(
    k=100000,
    embed_model="Qwen/Qwen3-Embedding-0.6B",
    embed_method="auto",
    batch_size=32,
    sigma=0.75,
    alpha=0.6,
)

from dataflow.operators.eval import MetaScorer
from dataflow.serving.APILLMServing_request import APILLMServing_request
from data_selection import LLMAsSelector
llm_serving = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",
    key_name_of_api_key="DF_API_KEY",
    model_name="gpt-4o",
    max_workers=10,
)
scorer = MetaScorer(llm_serving=llm_serving, dimensions=[...])
selector = LLMAsSelector(k=1000, text_key="text", scorer=scorer)

from data_selection import CompositeSelector, RandomSelector, LengthBasedSelector
selector = CompositeSelector([
    LengthBasedSelector(k=10000, strategy="longest"),
    RandomSelector(k=1000, seed=42),
])

The composite runs each selector in order, feeding the output of one into the next.

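Sequential chaining amounts to a simple fold over the stages. The sketch below models each stage as a plain function from samples to samples; `chain`, `longest`, and `first` are hypothetical helpers and do not reproduce `CompositeSelector`.

```python
from collections.abc import Callable, Sequence
from typing import Any

Samples = list[dict[str, Any]]


def chain(stages: Sequence[Callable[[Samples], Samples]], samples: Samples) -> Samples:
    """Apply each stage in order, passing the survivors of one stage to the next."""
    current = samples
    for stage in stages:
        current = stage(current)
    return current


def longest(k: int) -> Callable[[Samples], Samples]:
    # Hypothetical stage: keep the k samples with the longest "output" text.
    return lambda samples: sorted(samples, key=lambda s: len(s["output"]), reverse=True)[:k]


def first(k: int) -> Callable[[Samples], Samples]:
    # Hypothetical stage: keep the first k samples it receives.
    return lambda samples: samples[:k]
```

Stage order matters: a later stage only ever sees what the earlier stages kept, so budgets typically shrink from one stage to the next.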
- Score-based selectors set `_score_based = True`. When `run_selection` receives multiple `k` values, it runs the selector once at `max(k)` and truncates the ranked list for each smaller `k`.
- Set `scores_cache_path` on `PerplexityBasedSelector`, `DeitaQualitySelector`, `QualityScorerSelector`, or `LLMAsSelector` to write per-sample scores to JSONL. On the next run, cached samples are skipped.

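The score-once, truncate-many behaviour can be sketched as follows, assuming higher scores are better. `select_for_budgets` is a hypothetical helper that mirrors the idea; it is not the code path inside `run_selection`.

```python
from collections.abc import Callable
from typing import Any


def select_for_budgets(
    samples: list[dict[str, Any]],
    score_fn: Callable[[dict[str, Any]], float],
    budgets: list[int],
) -> dict[int, list[dict[str, Any]]]:
    """Score every sample once, then slice the ranked list for each budget."""
    # sorted() evaluates the key exactly once per sample.
    ranked = sorted(samples, key=score_fn, reverse=True)
    top = ranked[: max(budgets)]
    # Each smaller budget is a prefix of the max(k) ranking, so no rescoring is needed.
    return {k: top[:k] for k in budgets}
```

This only works for selectors whose output for a smaller `k` is a prefix of the output for a larger `k`, which is why it is restricted to score-based selectors.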
uv run python main.py

uv run python examples/random.py

uv run --with pyright pyright

uv run pre-commit run --all-files

Pre-commit hooks include isort, black-jupyter, `pyupgrade --py312-plus`, autoflake, bandit, pyright, and codespell. Some hooks rewrite files; if a commit fails, re-stage and retry.

- Create a module under `src/data_selection/selectors/`.
- Implement `select(self, samples: Sequence[Mapping[str, Any]]) -> list[dict[str, Any]]`.
- Configure all strategy parameters (including `k`) in `__init__`.
- If the selector sorts by a computed score, set `_score_based = True`.
- Export the selector in `src/data_selection/selectors/__init__.py` and `src/data_selection/__init__.py`.
- Add an example under `examples/` (optional but recommended).
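Putting these steps together, a new selector might look like the sketch below. `KeywordCountSelector` and its `keyword` and `text_key` parameters are invented for illustration; only the `select` signature, the `k`-in-`__init__` convention, and the `_score_based` flag come from the steps above.

```python
from collections.abc import Mapping, Sequence
from typing import Any


class KeywordCountSelector:
    """Hypothetical selector: rank samples by how often a keyword appears."""

    # Marks the selector as score-based so multi-k runs can reuse one ranking.
    _score_based = True

    def __init__(self, k: int, keyword: str, text_key: str = "output") -> None:
        # All strategy parameters, including k, are set here.
        self.k = k
        self.keyword = keyword.lower()
        self.text_key = text_key

    def _score(self, sample: Mapping[str, Any]) -> int:
        # Case-insensitive count of non-overlapping keyword occurrences.
        return str(sample.get(self.text_key, "")).lower().count(self.keyword)

    def select(self, samples: Sequence[Mapping[str, Any]]) -> list[dict[str, Any]]:
        ranked = sorted(samples, key=self._score, reverse=True)
        return [dict(sample) for sample in ranked[: self.k]]
```

A class like this would then need to be exported from the two `__init__.py` files listed above before it can be imported from `data_selection`.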