Skip to content

Latest commit

 

History

History
626 lines (530 loc) · 47.1 KB

File metadata and controls

626 lines (530 loc) · 47.1 KB

Experiment Settings and Results

MMD Computing

The following settings are employed in our experiments:

  • Embedding model: Qwen3-Embedding-8B (4096-dimensional)
  • Truncate length: 40960 tokens (right truncation)
  • Kernel type: RBF
  • Kernel sigma: 1.0 (constant)
  • Estimator: Biased MMD estimator
  • Dataset size: 5000 samples with seed 42
  • Max concurrent requests: 1024

Reference Datasets:

Our experiments were conducted with the following package versions:

  • Python: 3.12.12
  • vllm: 0.16.0
  • torch: 2.9.1+cu128
  • transformers: 4.57.6

Training

Data Construction

Training Configuration:

cutoff_len: 4096
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 1e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1

We use DeepSpeed ZeRO-3 for distributed training. Chat templates are set according to model families:

  • qwen for Qwen2.5-7B
  • llama3 for Llama-3.1-8B

Data Quality

Training Configuration:

cutoff_len: 32768
packing: false
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

We use DeepSpeed ZeRO-3 for distributed training. Chat templates are set according to model families:

  • qwen for Qwen2.5-7B
  • llama3 for Llama-3.1-8B
  • mistral for Mistral-7B-v0.3

Training Datasets:

Domain Dataset Samples
Math OpenR1-Math-220k 20,000
Math ODA-Math-460k 20,000
Math ScaleQuest 20,000
Math Synthetic-1 20,000
General Infinity-Instruct 20,000
General dataflow-instruct-10k 10,000
General OpenHermes-2.5 20,000
General ultrachat_200k 20,000
General WizardLM_evol_instruct_V2_196k 20,000
General tulu-3-sft-mixture 20,000
General smoltalk-chinese 20,000
Science MegaScience 20,000
Science Nemotron-Science-v1 20,000
Medical UltraMedical 20,000
Finance Finance-Instruct-500k 20,000
Law Lawyer-Llama 20,000

Evaluation config

General Text

General text evaluation uses lm-evaluation-harness with MMLU-Redux as the primary benchmark:

  • max_model_len: 32768
  • num_fewshot: 5
  • apply_chat_template: true

Math

Math evaluation is performed using Qwen2.5-Math with the following generation parameters:

  • temperature: 0.6
  • max_tokens_per_call: 16384
  • top_p: 1
  • apply_chat_template: true

Benchmarks: GSM8K, AMC23, AIME24, Minerva Math, Gaokao2024-Mix, OlympiadBench, and MATH.

Science

Science evaluation follows the MegaScience evaluation protocol. Benchmarks include MMLU-STEM, MMLU-Pro, GPQA, SuperGPQA, ChemBench, PIQA, and SciBench.

Medical

Medical evaluation employs MedR-Bench, MedMCQA, and MedCaseReasoning.

Finance

Finance evaluation uses XFinBench, FinEval-KR, and CPA-QKA.

Law

Law evaluation uses LegalBench and LexGLUE.

For benchmarks consisting of multiple subtasks, we report the average score across all subtasks. For tasks evaluated by exact answer matching in Medical, Finance and Law, we additionally employ an LLM-as-a-judge protocol.

Models are served using vLLM for inference in Medical, Finance and Law.

Accuracy Results

General Text

Qwen2.5-7B

Method abstract_algebra anatomy astronomy business_ethics clinical_knowledge college_biology college_chemistry college_computer_science college_mathematics college_medicine college_physics computer_security conceptual_physics econometrics electrical_engineering elementary_mathematics formal_logic global_facts high_school_biology high_school_chemistry high_school_computer_science high_school_european_history high_school_geography high_school_government_and_politics high_school_macroeconomics high_school_mathematics high_school_microeconomics high_school_physics high_school_psychology high_school_statistics high_school_us_history high_school_world_history human_aging human_sexuality international_law jurisprudence logical_fallacies machine_learning management marketing medical_genetics miscellaneous moral_disputes moral_scenarios nutrition philosophy prehistory professional_accounting professional_law professional_medicine professional_psychology public_relations security_studies sociology us_foreign_policy virology world_religions Avg
dataflow 56.2 71.7 93.6 90.6 75.8 87.8 66.7 71.1 59.6 73.6 57.0 81.4 76.1 72.2 70.4 68.0 57.5 54.5 91.6 67.7 83.0 89.0 89.0 89.0 81.4 60.0 92.9 63.9 91.7 67.3 90.0 93.9 82.8 86.4 78.9 81.8 94.6 60.7 91.9 96.9 83.0 88.9 86.5 43.3 81.4 83.1 83.0 62.0 57.3 79.8 78.1 82.4 72.3 91.8 88.9 90.7 86.9 78.0
finance-instruct 62.9 67.7 91.5 82.4 72.7 88.8 62.7 72.2 58.6 74.7 52.0 79.4 76.1 72.2 73.5 69.1 55.2 61.4 89.5 63.6 79.0 86.8 90.0 91.0 81.4 58.0 91.8 61.9 91.7 69.4 85.0 94.9 77.0 84.0 84.2 85.9 94.6 68.5 90.9 95.8 83.0 87.8 82.3 48.5 81.4 82.0 83.0 62.0 50.0 79.8 80.2 82.4 75.5 89.8 87.9 93.0 86.9 77.6
lawyer-llama 52.8 71.7 88.3 89.4 74.7 84.7 61.3 69.1 55.6 71.3 54.0 81.4 76.1 70.1 69.4 69.1 56.3 63.6 89.5 62.6 82.0 87.9 88.0 91.0 82.5 56.0 91.8 66.0 90.6 73.5 87.0 93.9 82.8 85.2 80.0 81.8 93.2 65.2 91.9 95.8 84.0 87.8 85.4 50.5 82.5 80.9 85.0 57.6 56.1 78.8 82.3 79.1 75.5 90.8 86.9 90.7 85.9 77.5
megascience 59.6 73.7 89.4 90.6 76.8 85.7 61.3 71.1 54.5 72.4 59.0 81.4 76.1 68.0 68.4 71.1 64.4 56.8 90.5 68.7 81.0 87.9 87.0 89.0 83.5 59.0 91.8 63.9 89.6 70.4 85.0 91.9 81.6 88.9 80.0 80.8 93.2 65.2 87.9 94.8 79.0 86.7 81.2 42.3 81.4 79.8 85.0 64.1 58.5 80.8 81.2 84.6 73.4 88.8 88.9 88.4 87.9 77.6
nemotron-science 58.4 72.7 91.5 91.8 76.8 86.7 66.7 74.2 55.6 70.1 57.0 82.5 70.7 68.0 67.3 64.9 60.9 58.0 90.5 71.7 89.0 87.9 89.0 89.0 82.5 58.0 92.9 61.9 89.6 70.4 86.0 93.9 82.8 86.4 82.1 80.8 91.9 69.7 89.9 96.9 84.0 90.0 83.3 54.6 80.4 79.8 86.0 59.8 56.1 84.8 83.3 78.0 71.3 87.8 87.9 93.0 87.9 78.2
openhermes 56.2 69.7 87.2 88.2 77.8 86.7 62.7 70.1 62.6 71.3 54.0 79.4 78.3 68.0 70.4 73.2 57.5 59.1 92.6 64.6 82.0 86.8 90.0 92.0 82.5 56.0 94.9 63.9 91.7 69.4 83.0 93.9 82.8 87.7 82.1 83.8 91.9 64.0 91.9 96.9 84.0 86.7 85.4 41.2 84.5 80.9 83.0 62.0 57.3 77.8 81.2 80.2 75.5 87.8 87.9 90.7 88.9 77.8
openr1 58.4 69.7 89.4 87.1 73.7 84.7 61.3 63.9 51.5 74.7 57.0 83.5 77.2 71.1 71.4 67.0 57.5 53.4 92.6 68.7 83.0 89.0 85.0 90.0 80.4 60.0 91.8 60.8 90.6 73.5 87.0 94.9 81.6 84.0 76.8 80.8 90.5 59.6 86.9 96.9 79.0 87.8 84.4 49.5 81.4 80.9 82.0 59.8 53.7 78.8 80.2 79.1 74.5 86.7 88.9 88.4 88.9 76.9
scale 57.3 70.7 86.2 82.4 71.7 83.7 60.0 68.0 56.6 72.4 56.0 82.5 76.1 69.1 69.4 63.9 50.6 65.9 88.4 58.6 80.0 90.1 88.0 89.0 80.4 51.0 93.9 62.9 90.6 67.3 84.0 91.9 80.5 85.2 78.9 79.8 93.2 66.3 88.9 97.9 84.0 87.8 84.4 42.3 83.5 79.8 83.0 57.6 56.1 76.8 77.1 79.1 74.5 88.8 88.9 90.7 87.9 76.3
smoltalk 60.7 69.7 90.4 89.4 74.7 87.8 62.7 70.1 60.6 73.6 58.0 82.5 75.0 69.1 71.4 69.1 57.5 61.4 91.6 70.7 85.0 87.9 88.0 91.0 81.4 55.0 92.9 58.8 90.6 69.4 84.0 93.9 82.8 85.2 80.0 82.8 91.9 68.5 91.9 94.8 83.0 87.8 88.5 49.5 83.5 80.9 83.0 62.0 57.3 77.8 81.2 76.9 76.6 91.8 86.9 90.7 87.9 78.0
synthetic_1 58.4 70.7 92.6 85.9 72.7 83.7 64.0 64.9 56.6 73.6 55.0 82.5 80.4 72.2 68.4 60.8 46.0 59.1 91.6 65.7 79.0 89.0 87.0 90.0 78.4 59.0 88.8 57.7 88.5 71.4 87.0 90.9 77.0 84.0 81.1 80.8 89.2 60.7 89.9 95.8 79.0 87.8 81.2 45.4 81.4 79.8 79.0 59.8 53.7 77.8 82.3 78.0 73.4 86.7 88.9 93.0 90.9 76.3
tulu 57.3 71.7 90.4 87.1 72.7 85.7 62.7 70.1 61.6 74.7 55.0 82.5 76.1 66.0 71.4 69.1 59.8 63.6 93.7 71.7 80.0 87.9 88.0 89.0 82.5 57.0 93.9 60.8 90.6 67.3 86.0 92.9 83.9 88.9 83.2 81.8 90.5 64.0 89.9 93.8 82.0 87.8 85.4 54.6 83.5 80.9 80.0 62.0 56.1 80.8 80.2 80.2 75.5 89.8 89.9 88.4 87.9 77.9
ultrachat 56.2 72.7 90.4 89.4 76.8 86.7 66.7 72.2 53.5 71.3 56.0 78.4 76.1 66.0 70.4 67.0 59.8 53.4 91.6 67.7 78.0 87.9 89.0 89.0 82.5 59.0 91.8 60.8 90.6 69.4 85.0 94.9 79.3 86.4 82.1 84.8 94.6 64.0 88.9 96.9 83.0 87.8 86.5 40.2 83.5 80.9 85.0 58.7 51.2 79.8 81.2 80.2 77.7 89.8 87.9 93.0 87.9 77.4
ultramedical 59.6 72.7 91.5 92.9 75.8 84.7 65.3 71.1 52.5 74.7 51.0 82.5 78.3 68.0 68.4 68.0 56.3 55.7 87.4 67.7 84.0 90.1 85.0 90.0 81.4 58.0 92.9 63.9 92.7 68.4 90.0 92.9 79.3 85.2 83.2 81.8 89.2 64.0 86.9 96.9 84.0 87.8 83.3 45.4 80.4 79.8 83.0 56.5 56.1 76.8 82.3 74.7 72.3 91.8 89.9 95.3 86.9 77.3
wizardlm 50.6 70.7 89.4 88.2 74.7 89.8 66.7 71.1 64.6 74.7 56.0 82.5 78.3 72.2 72.4 70.1 58.6 60.2 91.6 68.7 81.0 87.9 88.0 90.0 83.5 54.0 92.9 62.9 92.7 70.4 86.0 91.9 81.6 85.2 84.2 82.8 91.9 64.0 88.9 94.8 83.0 87.8 83.3 42.3 82.5 79.8 82.0 60.9 54.9 79.8 81.2 84.6 77.7 92.9 85.9 90.7 87.9 77.9

Llama-3.1-8B

Method abstract_algebra anatomy astronomy business_ethics clinical_knowledge college_biology college_chemistry college_computer_science college_mathematics college_medicine college_physics computer_security conceptual_physics econometrics electrical_engineering elementary_mathematics formal_logic global_facts high_school_biology high_school_chemistry high_school_computer_science high_school_european_history high_school_geography high_school_government_and_politics high_school_macroeconomics high_school_mathematics high_school_microeconomics high_school_physics high_school_psychology high_school_statistics high_school_us_history high_school_world_history human_aging human_sexuality international_law jurisprudence logical_fallacies machine_learning management marketing medical_genetics miscellaneous moral_disputes moral_scenarios nutrition philosophy prehistory professional_accounting professional_law professional_medicine professional_psychology public_relations security_studies sociology us_foreign_policy virology world_religions Avg
dataflow 39.3 60.6 67.0 64.7 63.6 67.3 48.0 50.5 27.3 69.0 37.0 77.3 52.2 46.4 64.3 42.3 41.4 39.8 82.1 45.5 66.0 74.7 70.0 80.0 56.7 42.0 68.4 38.1 80.2 46.9 81.0 87.9 65.5 80.2 77.9 66.7 87.8 47.2 75.8 90.6 75.0 83.3 70.8 24.7 73.2 70.8 70.0 50.0 40.2 65.7 65.6 74.7 64.9 78.6 80.8 79.1 84.8 63.5
finance-instruct 41.6 57.6 68.1 69.4 72.7 66.3 45.3 46.4 33.3 70.1 37.0 79.4 51.1 51.5 56.1 41.2 41.4 44.3 77.9 48.5 59.0 80.2 76.0 84.0 54.6 37.0 74.5 38.1 75.0 46.9 85.0 87.9 65.5 80.2 80.0 74.7 87.8 51.7 79.8 92.7 77.0 84.4 74.0 30.9 71.1 71.9 68.0 47.8 45.1 68.7 71.9 73.6 69.1 87.8 82.8 90.7 82.8 65.1
lawyer-llama 37.1 62.6 67.0 63.5 68.7 69.4 46.7 46.4 37.4 67.8 39.0 78.4 58.7 50.5 56.1 45.4 40.2 45.5 73.7 45.5 65.0 75.8 67.0 76.0 48.5 33.0 72.4 33.0 74.0 38.8 86.0 85.9 64.4 76.5 78.9 66.7 79.7 53.9 71.7 92.7 75.0 84.4 74.0 30.9 77.3 65.2 62.0 44.6 45.1 68.7 72.9 75.8 67.0 86.7 84.8 93.0 83.8 63.7
megascience 36.0 65.7 73.4 64.7 68.7 69.4 46.7 49.5 31.3 72.4 34.0 79.4 45.7 49.5 58.2 42.3 37.9 45.5 78.9 44.4 67.0 81.3 76.0 80.0 52.6 34.0 75.5 41.2 75.0 42.9 86.0 86.9 72.4 80.2 76.8 68.7 86.5 50.6 79.8 92.7 84.0 87.8 68.8 32.0 74.2 70.8 65.0 42.4 46.3 66.7 71.9 67.0 59.6 85.7 88.9 83.7 88.9 64.6
nemotron-science 36.0 67.7 67.0 70.6 74.7 72.4 57.3 56.7 41.4 69.0 46.0 79.4 53.3 52.6 64.3 46.4 46.0 44.3 83.2 49.5 66.0 79.1 85.0 86.0 54.6 39.0 78.6 47.4 84.4 56.1 86.0 91.9 72.4 85.2 77.9 74.7 90.5 51.7 75.8 92.7 76.0 87.8 70.8 28.9 73.2 69.7 62.0 50.0 48.8 69.7 71.9 75.8 67.0 88.8 85.9 86.0 84.8 67.5
openhermes 43.8 66.7 66.0 69.4 70.7 73.5 52.0 52.6 38.4 69.0 49.0 81.4 54.3 54.6 65.3 41.2 48.3 39.8 77.9 47.5 67.0 83.5 82.0 85.0 62.9 41.0 76.5 47.4 82.3 49.0 86.0 90.9 74.7 80.2 77.9 74.7 91.9 53.9 80.8 94.8 84.0 86.7 70.8 46.4 69.1 75.3 63.0 52.2 51.2 68.7 76.0 73.6 74.5 86.7 87.9 83.7 86.9 68.1
openr1 29.2 35.4 30.9 29.4 28.3 22.4 17.3 25.8 22.2 28.7 20.0 34.0 27.2 23.7 30.6 25.8 23.0 25.0 24.2 31.3 32.0 33.0 26.0 27.0 18.6 28.0 26.5 21.6 25.0 23.5 40.0 39.4 25.3 23.5 38.9 28.3 31.1 32.6 28.3 39.6 28.0 30.0 31.2 26.8 29.9 29.2 33.0 29.3 25.6 18.2 34.4 34.1 23.4 21.4 39.4 34.9 35.4 28.5
scale 43.8 62.6 63.8 69.4 69.7 71.4 50.7 47.4 32.3 65.5 43.0 79.4 51.1 52.6 62.2 40.2 44.8 37.5 78.9 41.4 59.0 75.8 72.0 82.0 51.5 38.0 71.4 38.1 79.2 43.9 84.0 86.9 71.3 79.0 77.9 66.7 83.8 52.8 77.8 89.6 74.0 75.6 71.9 32.0 73.2 73.0 67.0 47.8 41.5 69.7 67.7 70.3 67.0 80.6 83.8 93.0 82.8 64.2
smoltalk 39.3 73.7 76.6 75.3 74.7 74.5 52.0 60.8 34.3 73.6 44.0 78.4 53.3 59.8 63.3 41.2 51.7 40.9 82.1 59.6 67.0 82.4 82.0 86.0 69.1 46.0 73.5 51.5 83.3 55.1 88.0 87.9 72.4 84.0 80.0 69.7 90.5 51.7 82.8 95.8 81.0 88.9 74.0 42.3 77.3 77.5 64.0 53.3 51.2 67.7 70.8 78.0 70.2 86.7 89.9 88.4 87.9 69.4
synthetic_1 27.0 33.3 34.0 25.9 33.3 22.4 17.3 30.9 21.2 29.9 21.0 43.3 27.2 28.9 30.6 26.8 26.4 31.8 27.4 28.3 41.0 51.6 32.0 33.0 22.7 26.0 21.4 21.6 24.0 25.5 53.0 51.5 25.3 38.3 37.9 25.3 32.4 31.5 30.3 38.5 22.0 37.8 32.3 18.6 29.9 33.7 33.0 28.3 31.7 22.2 37.5 35.2 23.4 27.6 41.4 32.6 32.3 30.7
tulu 36.0 63.6 67.0 76.5 68.7 68.4 50.7 48.5 37.4 64.4 50.0 80.4 53.3 47.4 61.2 41.2 47.1 40.9 76.8 51.5 63.0 79.1 79.0 84.0 60.8 42.0 78.6 46.4 85.4 48.0 84.0 86.9 73.6 75.3 78.9 72.7 91.9 58.4 80.8 92.7 78.0 87.8 77.1 39.2 74.2 77.5 63.0 48.9 52.4 65.7 74.0 70.3 69.1 84.7 84.8 86.0 84.8 66.8
ultrachat 40.4 63.6 72.3 72.9 73.7 75.5 45.3 55.7 35.4 71.3 47.0 77.3 54.3 54.6 57.1 40.2 49.4 42.0 81.1 51.5 63.0 78.0 80.0 84.0 57.7 44.0 79.6 49.5 85.4 45.9 86.0 89.9 69.0 81.5 80.0 72.7 90.5 50.6 72.7 95.8 82.0 87.8 80.2 37.1 76.3 76.4 64.0 46.7 50.0 65.7 71.9 74.7 71.3 88.8 86.9 86.0 87.9 67.6
ultramedical 36.0 47.5 54.3 45.9 43.4 42.9 36.0 40.2 20.2 51.7 26.0 63.9 34.8 37.1 43.9 28.9 28.7 33.0 55.8 27.3 49.0 61.5 55.0 63.0 36.1 26.0 49.0 20.6 53.1 26.5 72.0 77.8 55.2 59.3 48.4 42.4 59.5 37.1 54.5 60.4 54.0 74.4 40.6 20.6 50.5 41.6 50.0 34.8 35.4 34.3 44.8 58.2 41.5 44.9 68.7 62.8 66.7 46.1
wizardlm 40.4 65.7 72.3 70.6 72.7 69.4 54.7 51.5 38.4 66.7 47.0 77.3 56.5 56.7 63.3 43.3 48.3 39.8 82.1 52.5 64.0 78.0 80.0 81.0 58.8 37.0 73.5 50.5 82.3 43.9 88.0 88.9 73.6 77.8 76.8 74.7 91.9 52.8 75.8 92.7 76.0 86.7 74.0 37.1 74.2 77.5 63.0 54.3 47.6 65.7 72.9 72.5 70.2 88.8 86.9 90.7 86.9 67.3

Mistral-7B-v0.3

Method abstract_algebra anatomy astronomy business_ethics clinical_knowledge college_biology college_chemistry college_computer_science college_mathematics college_medicine college_physics computer_security conceptual_physics econometrics electrical_engineering elementary_mathematics formal_logic global_facts high_school_biology high_school_chemistry high_school_computer_science high_school_european_history high_school_geography high_school_government_and_politics high_school_macroeconomics high_school_mathematics high_school_microeconomics high_school_physics high_school_psychology high_school_statistics high_school_us_history high_school_world_history human_aging human_sexuality international_law jurisprudence logical_fallacies machine_learning management marketing medical_genetics miscellaneous moral_disputes moral_scenarios nutrition philosophy prehistory professional_accounting professional_law professional_medicine professional_psychology public_relations security_studies sociology us_foreign_policy virology world_religions Avg
dataflow 27.0 54.5 58.5 55.3 59.6 66.3 48.0 36.1 37.4 63.2 45.0 77.3 45.7 43.3 52.0 39.2 23.0 31.8 70.5 44.4 64.0 69.2 72.0 67.0 53.6 31.0 54.1 36.1 71.9 35.7 79.0 84.8 65.5 69.1 73.7 66.7 86.5 42.7 72.7 79.2 64.0 83.3 63.5 19.6 54.6 66.3 67.0 42.4 50.0 54.5 61.5 71.4 58.5 71.4 79.8 76.7 77.8 58.2
finance-instruct 31.5 53.5 60.6 56.5 59.6 58.2 41.3 39.2 34.3 55.2 34.0 69.1 45.7 41.2 53.1 38.1 28.7 35.2 65.3 40.4 48.0 72.5 60.0 71.0 50.5 25.0 48.0 22.7 68.8 30.6 75.0 77.8 67.8 65.4 78.9 64.6 81.1 43.8 71.7 74.0 54.0 83.3 66.7 29.9 58.8 67.4 58.0 42.4 34.1 41.4 56.2 72.5 63.8 75.5 78.8 79.1 78.8 55.8
lawyer-llama 25.8 58.6 59.6 38.8 53.5 60.2 50.7 42.3 22.2 60.9 27.0 59.8 30.4 39.2 38.8 10.3 34.5 23.9 54.7 41.4 36.0 68.1 45.0 66.0 53.6 17.0 42.9 29.9 69.8 37.8 72.0 74.7 46.0 64.2 65.3 66.7 75.7 32.6 56.6 67.7 45.0 73.3 37.5 29.9 56.7 46.1 51.0 42.4 36.6 40.4 39.6 59.3 46.8 61.2 71.7 69.8 75.8 49.2
megascience 34.8 55.6 58.5 58.8 64.6 61.2 48.0 49.5 34.3 57.5 42.0 75.3 46.7 39.2 52.0 43.3 27.6 44.3 70.5 39.4 60.0 78.0 64.0 70.0 50.5 22.0 61.2 38.1 72.9 33.7 76.0 78.8 65.5 75.3 77.9 69.7 79.7 46.1 74.7 82.3 73.0 82.2 65.6 22.7 68.0 68.5 63.0 48.9 45.1 53.5 58.3 74.7 66.0 80.6 76.8 79.1 78.8 59.4
nemotron-science 36.0 58.6 61.7 56.5 62.6 63.3 60.0 44.3 38.4 60.9 29.0 74.2 47.8 47.4 49.0 43.3 31.0 37.5 65.3 47.5 55.0 76.9 79.0 76.0 60.8 31.0 58.2 34.0 77.1 46.9 81.0 78.8 65.5 74.1 73.7 73.7 74.3 44.9 74.7 85.4 68.0 82.2 62.5 27.8 61.9 58.4 64.0 43.5 43.9 57.6 65.6 65.9 62.8 81.6 81.8 81.4 79.8 60.1
openhermes 36.0 55.6 56.4 63.5 63.6 60.2 46.7 43.3 37.4 55.2 45.0 68.0 42.4 38.1 59.2 39.2 34.5 35.2 63.2 40.4 59.0 80.2 71.0 73.0 55.7 32.0 61.2 36.1 70.8 36.7 82.0 80.8 70.1 71.6 71.6 68.7 83.8 46.1 67.7 87.5 61.0 81.1 61.5 30.9 59.8 65.2 68.0 46.7 48.8 53.5 54.2 62.6 66.0 75.5 78.8 69.8 76.8 58.7
openr1 22.5 40.4 38.3 32.9 31.3 38.8 30.7 35.1 27.3 29.9 26.0 40.2 28.3 29.9 30.6 29.9 20.7 34.1 38.9 31.3 34.0 70.3 43.0 40.0 19.6 23.0 30.6 26.8 42.7 20.4 58.0 64.6 31.0 43.2 50.5 33.3 33.8 39.3 39.4 43.8 34.0 54.4 36.5 22.7 42.3 40.4 35.0 32.6 28.0 26.3 38.5 47.3 44.7 32.7 41.4 34.9 48.5 36.2
scale 30.3 57.6 57.4 47.1 56.6 59.2 40.0 56.7 36.4 52.9 29.0 66.0 48.9 35.1 55.1 43.3 31.0 46.6 58.9 32.3 53.0 69.2 64.0 64.0 46.4 33.0 49.0 33.0 71.9 23.5 68.0 77.8 65.5 74.1 72.6 61.6 67.6 44.9 65.7 72.9 61.0 77.8 61.5 24.7 55.7 64.0 53.0 44.6 39.0 48.5 62.5 68.1 53.2 76.5 81.8 76.7 74.7 55.1
smoltalk 33.7 55.6 57.4 60.0 55.6 62.2 41.3 37.1 31.3 58.6 40.0 68.0 48.9 45.4 54.1 40.2 34.5 43.2 68.4 30.3 55.0 74.7 62.0 67.0 52.6 34.0 54.1 39.2 75.0 32.7 77.0 83.8 65.5 65.4 71.6 69.7 79.7 42.7 68.7 78.1 70.0 77.8 62.5 36.1 62.9 62.9 70.0 50.0 43.9 54.5 61.5 67.0 64.9 78.6 75.8 79.1 78.8 58.1
synthetic_1 32.6 52.5 55.3 44.7 43.4 49.0 32.0 30.9 33.3 46.0 31.0 58.8 40.2 40.2 46.9 38.1 20.7 36.4 47.4 38.4 50.0 73.6 51.0 56.0 43.3 36.0 43.9 32.0 56.2 25.5 68.0 82.8 49.4 59.3 68.4 53.5 63.5 36.0 59.6 62.5 49.0 76.7 59.4 26.8 49.5 62.9 53.0 44.6 30.5 37.4 50.0 61.5 55.3 53.1 62.6 72.1 72.7 49.2
tulu 38.2 56.6 59.6 63.5 62.6 64.3 56.0 49.5 38.4 70.1 45.0 74.2 50.0 40.2 58.2 40.2 35.6 43.2 68.4 45.5 56.0 75.8 75.0 74.0 58.8 32.0 61.2 40.2 75.0 38.8 79.0 78.8 64.4 74.1 73.7 73.7 82.4 46.1 75.8 83.3 63.0 81.1 69.8 43.3 62.9 69.7 69.0 47.8 39.0 58.6 66.7 72.5 69.1 76.5 78.8 83.7 77.8 61.5
ultrachat 36.0 58.6 59.6 70.6 64.6 72.4 56.0 44.3 38.4 63.2 49.0 71.1 44.6 43.3 54.1 38.1 37.9 42.0 72.6 42.4 59.0 75.8 77.0 76.0 56.7 38.0 61.2 32.0 78.1 46.9 76.0 80.8 70.1 76.5 75.8 76.8 90.5 48.3 74.7 88.5 70.0 83.3 68.8 39.2 61.9 74.2 70.0 51.1 52.4 59.6 74.0 70.3 71.3 85.7 83.8 74.4 79.8 62.9
ultramedical 28.1 32.3 47.9 31.8 40.4 43.9 34.7 37.1 29.3 37.9 29.0 34.0 25.0 36.1 32.7 21.6 28.7 28.4 44.2 39.4 39.0 68.1 38.0 54.0 26.8 23.0 27.6 20.6 44.8 33.7 69.0 60.6 42.5 45.7 51.6 34.3 29.7 25.8 41.4 39.6 27.0 47.8 41.7 26.8 45.4 31.5 39.0 26.1 28.0 38.4 37.5 44.0 44.7 46.9 60.6 53.5 39.4 38.2
wizardlm 34.8 57.6 60.6 58.8 64.6 65.3 61.3 47.4 36.4 56.3 42.0 70.1 46.7 44.3 59.2 43.3 28.7 38.6 72.6 45.5 52.0 79.1 72.0 75.0 59.8 31.0 62.2 39.2 77.1 41.8 74.0 77.8 63.2 67.9 73.7 75.8 81.1 43.8 73.7 82.3 61.0 82.2 68.8 35.1 56.7 69.7 63.0 48.9 48.8 56.6 65.6 68.1 64.9 79.6 82.8 74.4 83.8 60.5

Math

Qwen2.5-7B

Method aime24 amc23 gaokao2024_mix gsm8k math minerva_math olympiadbench Avg
dataflow 3.3 47.5 33.0 87.8 70.9 30.5 33.6 43.8
infinity-instruct 16.7 47.5 28.6 88.0 68.4 27.2 31.3 44.0
openhermes 0.0 27.5 33.0 77.9 37.6 15.8 13.8 29.4
openr1 20.0 57.5 70.3 92.6 82.7 40.1 46.8 58.6
scale 6.7 45.0 29.7 89.2 72.2 32.7 34.5 44.3
smoltalk 6.7 55.0 47.3 81.5 68.9 28.3 33.3 45.9
synthetic_1 16.7 50.0 68.1 92.6 82.4 38.2 45.9 56.3
tulu 3.3 35.0 30.8 82.3 48.6 17.3 18.8 33.7
ultrachat 0.0 15.0 27.5 80.0 47.2 15.8 16.0 28.8
wizardlm 3.3 22.5 22.0 79.7 46.0 17.3 15.1 29.4

Llama-3.1-8B

Method aime24 amc23 gaokao2024_mix gsm8k math minerva_math olympiadbench Avg
dataflow 0.0 7.5 20.9 67.4 29.2 13.6 7.6 20.9
infinity-instruct 0.0 15.0 18.7 64.0 26.0 13.2 7.1 20.6
openhermes 0.0 7.5 20.9 58.4 15.4 6.2 4.0 16.1
openr1 3.3 22.5 36.3 80.9 53.0 20.6 20.7 33.9
scale 3.3 10.0 15.4 77.1 36.1 14.7 10.8 23.9
smoltalk 0.0 2.5 20.9 30.6 19.4 13.2 5.9 13.2
synthetic_1 0.0 20.0 37.4 81.7 47.7 16.2 17.6 31.5
tulu 0.0 12.5 12.1 64.4 20.8 16.2 5.2 18.7
ultrachat 0.0 7.5 17.6 42.3 15.9 7.4 3.1 13.4
wizardlm 0.0 0.0 17.6 33.4 12.1 8.8 4.0 10.8

Mistral-7B-v0.3

Method aime24 amc23 gaokao2024_mix gsm8k math minerva_math olympiadbench Avg
dataflow 0.0 15.0 14.3 57.2 17.2 9.6 4.4 16.8
infinity-instruct 0.0 7.5 12.1 50.9 14.3 7.4 3.6 13.7
openhermes 0.0 7.5 13.2 41.7 7.7 3.7 1.9 10.8
openr1 3.3 32.5 36.3 80.5 51.8 17.6 22.1 34.9
scale 0.0 10.0 13.2 68.8 27.0 5.9 6.4 18.8
smoltalk 0.0 2.5 11.0 47.0 13.6 8.5 3.4 12.3
synthetic_1 0.0 17.5 37.4 82.8 43.0 13.2 16.9 30.1
tulu 0.0 5.0 9.9 53.8 10.7 7.7 4.4 13.1
ultrachat 0.0 2.5 18.7 17.3 6.0 6.6 2.2 7.6
wizardlm 0.0 0.0 13.2 20.5 7.0 5.1 2.7 6.9

Science

Qwen2.5-7B

Method ChemBench-multi-choice ChemBench-str-match gpqa_diamond gpqa_main mmlu mmlu_pro piqa scibench-chemistry scibench-math scibench-physics super_gpqa Avg Weighted Avg
dataflow 43.4 41.0 24.7 26.3 67.6 43.0 70.9 28.6 41.5 30.4 20.0 39.8 39.1
infinity-instruct 44.9 35.2 22.7 23.9 67.2 43.0 79.2 32.0 42.2 30.4 18.7 39.9 38.6
megascience 46.4 38.9 35.9 29.5 73.5 54.7 76.5 33.5 39.5 34.8 29.0 44.7 47.4
nemotron-science 49.4 28.7 25.8 25.7 70.2 47.1 82.5 14.3 29.3 13.7 25.8 37.5 43.6
openhermes 43.9 31.6 27.8 26.3 64.2 41.0 77.4 18.4 19.0 15.4 19.8 35.0 37.8
smoltalk 42.0 43.4 27.8 25.9 66.2 44.1 76.4 30.1 36.7 31.7 20.4 40.4 39.3
tulu 45.7 34.0 25.8 27.2 67.7 45.2 77.4 20.7 29.9 22.0 22.0 38.0 40.6
ultrachat 39.5 38.9 22.7 24.3 60.1 39.5 73.6 18.4 21.8 12.3 18.6 33.6 35.6
wizardlm 37.8 28.3 28.3 26.6 64.9 40.6 72.6 18.8 26.5 15.0 19.6 34.4 37.4

Llama-3.1-8B

Method ChemBench-multi-choice ChemBench-str-match gpqa_diamond gpqa_main mmlu mmlu_pro piqa scibench-chemistry scibench-math scibench-physics super_gpqa Avg Weighted Avg
dataflow 28.2 21.3 14.6 18.8 44.2 22.0 19.7 18.0 15.0 9.7 11.8 20.3 22.8
infinity-instruct 37.0 19.7 15.7 16.1 49.1 24.9 37.6 13.9 14.3 12.8 11.2 22.9 25.2
megascience 48.5 21.7 27.3 24.6 63.1 40.0 59.0 17.7 19.7 13.7 21.4 32.4 37.6
nemotron-science 45.6 15.2 19.2 24.3 59.2 33.2 70.1 3.8 12.2 1.3 19.0 27.6 34.2
openhermes 26.6 18.0 19.7 21.4 44.3 25.3 37.1 2.3 3.4 0.4 13.3 19.3 24.5
smoltalk 32.2 22.1 19.2 19.4 44.5 23.3 38.0 10.2 11.6 4.8 12.0 21.6 23.9
tulu 40.8 23.4 21.2 20.8 53.3 27.6 55.2 9.8 10.2 9.3 15.5 26.1 29.4
ultrachat 20.4 18.9 14.1 19.9 43.8 27.7 17.4 3.4 6.8 1.8 15.0 17.2 24.8
wizardlm 24.6 15.2 13.6 20.3 48.4 27.5 36.2 8.6 8.2 5.3 14.1 20.2 26.2

Mistral-7B-v0.3

Method ChemBench-multi-choice ChemBench-str-match gpqa_diamond gpqa_main mmlu mmlu_pro piqa scibench-chemistry scibench-math scibench-physics super_gpqa Avg Weighted Avg
dataflow 32.6 13.5 9.1 13.6 44.6 18.7 42.5 7.1 8.8 4.4 12.3 18.9 23.2
infinity-instruct 30.9 16.0 14.1 16.7 42.8 18.5 47.1 6.4 8.8 5.3 9.0 19.6 21.3
megascience 43.6 16.8 22.2 19.4 56.0 31.4 66.7 9.0 8.2 10.1 17.9 27.4 32.4
nemotron-science 44.7 10.2 16.2 16.7 53.9 26.7 75.2 3.4 2.7 3.1 16.2 24.5 30.3
openhermes 16.6 10.7 8.1 10.3 35.6 15.4 46.2 0.4 3.4 1.3 9.6 14.3 18.4
smoltalk 24.3 15.6 14.1 17.0 36.0 15.9 32.8 4.9 6.8 2.6 8.7 16.2 18.2
tulu 35.9 13.1 23.7 21.4 43.4 17.8 56.4 5.6 10.9 3.5 10.8 22.1 22.7
ultrachat 19.1 11.9 12.6 13.4 38.8 18.3 35.9 4.5 5.4 4.0 11.9 16.0 20.7
wizardlm 18.9 12.3 14.1 10.3 37.8 16.3 39.9 3.8 4.8 1.8 11.0 15.5 19.7

Medicine

Llama-3.1-8B

Method MedCaseReasoning MedMCQA MedR-Bench Avg
dataflow 21.0 47.1 79.8 49.3
infinity-instruct 19.0 46.6 78.0 47.8
openhermes 20.1 48.0 78.4 48.8
smoltalk 19.1 41.5 79.5 46.7
tulu 21.0 45.2 76.1 47.4
ultrachat 19.5 47.8 80.7 49.3
ultramedical 22.0 62.4 74.5 53.0
wizardlm 19.3 47.5 76.5 47.8

Mistral-7B-v0.3

Method MedCaseReasoning MedMCQA MedR-Bench Avg
dataflow 14.5 47.1 70.3 44.0
infinity-instruct 14.8 40.4 72.1 42.5
openhermes 12.5 42.0 71.9 42.1
smoltalk 16.3 42.2 79.1 45.8
tulu 15.3 42.6 71.0 43.0
ultrachat 15.5 43.0 75.9 44.8
ultramedical 17.4 58.9 69.0 48.4
wizardlm 15.7 40.2 78.8 44.9

Qwen2.5-7B

Method MedCaseReasoning MedMCQA MedR-Bench Avg
dataflow 18.5 55.2 77.2 50.3
infinity-instruct 19.1 54.9 74.4 49.5
openhermes 17.4 55.0 80.0 50.8
smoltalk 18.1 55.3 80.9 51.4
tulu 17.8 52.1 77.5 49.1
ultrachat 18.8 55.4 77.6 50.6
ultramedical 19.8 66.5 72.4 52.9
wizardlm 18.2 54.3 78.3 50.3

Finance

Llama-3.1-8B

Method CPA-KQA FinEval-KR XFinBench Avg
dataflow 27.1 30.7 59.3 39.0
finance-instruct 30.0 35.6 58.2 41.3
infinity-instruct 37.6 41.6 63.7 47.6
openhermes 38.6 36.6 57.2 44.1
smoltalk 29.0 33.7 54.9 39.2
tulu 38.1 35.6 62.3 45.3
ultrachat 32.9 33.7 55.9 40.8
wizardlm 32.4 30.7 57.0 40.0

Mistral-7B-v0.3

Method CPA-KQA FinEval-KR XFinBench Avg
dataflow 26.2 25.7 56.1 36.0
finance-instruct 23.3 28.7 54.9 35.7
infinity-instruct 33.3 39.6 56.6 43.2
openhermes 25.2 30.7 54.3 36.7
smoltalk 20.5 22.8 50.3 31.2
tulu 20.5 23.8 53.3 32.5
ultrachat 20.0 21.8 48.0 29.9
wizardlm 25.7 31.7 54.0 37.1

Qwen2.5-7B

Method CPA-KQA FinEval-KR XFinBench Avg
dataflow 56.7 64.4 62.3 61.1
finance-instruct 57.6 55.4 58.6 57.2
infinity-instruct 61.4 61.4 63.9 62.2
openhermes 58.6 57.4 61.1 59.0
smoltalk 59.0 63.4 60.9 61.1
tulu 59.5 64.4 62.5 62.1
ultrachat 58.1 61.4 59.1 59.5
wizardlm 60.0 64.4 63.7 62.7

Law

Llama-3.1-8B

Method LegalBench LexGLUE Avg
dataflow 83.5 51.2 67.4
infinity-instruct 91.2 68.9 80.1
lawyer-llama 84.2 56.3 70.2
openhermes 91.4 66.2 78.8
smoltalk 90.0 69.4 79.7
tulu 91.0 65.3 78.1
ultrachat 89.6 66.3 77.9
wizardlm 90.1 66.9 78.5

Mistral-7B-v0.3

Method LegalBench LexGLUE Avg
dataflow 82.0 41.3 61.6
infinity-instruct 90.9 61.8 76.4
lawyer-llama 89.2 51.5 70.3
openhermes 91.1 64.4 77.7
smoltalk 88.9 48.6 68.8
tulu 89.9 56.5 73.2
ultrachat 90.7 60.3 75.5
wizardlm 92.3 29.0 60.7

Qwen2.5-7B

Method LegalBench LexGLUE Avg
dataflow 85.6 64.6 75.1
infinity-instruct 91.0 64.7 77.8
lawyer-llama 80.4 62.2 71.3
openhermes 91.5 61.0 76.3
smoltalk 87.1 65.3 76.2
tulu 92.3 63.9 78.1
ultrachat 85.6 64.4 75.0
wizardlm 91.9 64.8 78.4

Data Selection

The Data Selection track selects subsets from the same candidate pools used in the Data Quality track and then trains a Qwen2.5-7B model on each selected subset. The training configuration is identical to the Data Quality track, with one exception: the selected subset is used in full and is not truncated to 20k samples. Because the Law pool contains fewer than 100k samples, Law selection results are reported only for k = 1k and 10k.

Training Datasets

Domain Dataset Pool Size Selection Budgets (k)
Math ScaleQuest-Math 1k / 10k / 100k
General WizardLM_evol_instruct_V2_196k (143k subset) 143,000 1k / 10k / 100k
Science MegaScience 1k / 10k / 100k
Medical UltraMedical 1k / 10k / 100k
Finance Finance-Instruct-500k 500,000 1k / 10k / 100k
Law Lawyer-Llama 1k / 10k

For embedding-similarity selectors, the domain proxy for each domain is built by sampling 200 examples from the target dataset used for MMD computation in the Data Quality track.

Data Selection Results

Finance

Method Count (k) FinCDM XFinBench Avg
deita_quality 1k 40.19 62.99 51.59
10k 56.59 64.14 60.36
100k 46.62 60.69 53.66
diversity_kcenter 1k 54.02 63.22 58.62
10k 57.56 57.47 57.51
100k 55.95 58.39 57.17
embedding_similarity-new 1k 53.38 51.49 52.44
10k 45.66 47.82 46.74
100k 45.66 59.31 52.48
length_based 1k 34.73 61.84 48.28
10k 54.34 59.31 56.83
100k 45.98 64.14 55.06
perplexity_based 1k 53.70 60.69 57.19
10k 49.84 60.69 55.26
100k 49.52 60.46 54.99
perplexity_based_high 1k 48.87 59.08 53.98
10k 47.59 55.86 51.73
100k 55.95 58.39 57.17
perplexity_based_mid 1k 44.69 15.86 30.28
10k 54.66 58.39 56.53
100k 52.09 56.55 54.32
quality_scorer 1k 60.13 61.84 60.98
10k 51.45 58.16 54.80
100k 45.66 61.15 53.40
random 1k 54.66 61.38 58.02
10k 53.38 59.54 56.46
100k 49.84 57.47 53.66
source_balanced_random 1k 54.02 62.30 58.16
10k 54.34 61.15 57.75
100k 54.02 59.54 56.78

Law

Method Count (k) LegalBench LexGLUE Avg
full - 81.30 60.99 71.15
deita_quality 1k 73.81 39.58 56.69
10k 87.79 52.03 69.91
diversity_kcenter 1k 85.34 55.38 70.36
10k 82.53 61.30 71.92
embedding_similarity-new 1k 86.35 56.26 71.30
10k 85.71 65.00 75.35
length_based 1k 79.53 24.59 52.06
10k 84.46 51.77 68.12
perplexity_based 1k 83.62 60.17 71.90
10k 86.31 57.74 72.02
perplexity_based_high 1k 78.08 60.98 69.53
10k 81.43 62.69 72.06
perplexity_based_mid 1k 83.53 55.96 69.75
10k 85.63 63.62 74.63
quality_scorer 1k 83.90 55.96 69.93
10k 85.27 58.51 71.89
random 1k 84.99 59.77 72.38
10k 82.95 63.07 73.01
source_balanced_random 1k 86.05 54.43 70.24
10k 83.33 61.74 72.54

Medicine

Method Count (k) MedCaseReasoning MedMCQA MedRBench Avg
deita_quality 1k 14.05 57.26 75.03 48.78
10k 13.94 60.41 73.88 49.41
100k 15.05 64.09 72.83 50.66
diversity_kcenter 1k 20.18 64.07 72.52 52.26
10k 16.39 63.02 72.94 50.78
100k 15.38 59.89 73.98 49.75
embedding_similarity-new 1k 17.61 63.78 73.15 51.51
10k 14.83 64.36 74.82 51.33
100k 15.50 60.24 75.44 50.39
length_based 1k 13.15 56.68 74.29 48.04
10k 14.83 56.87 75.24 48.98
100k 14.27 61.39 73.46 49.71
perplexity_based 1k 19.96 65.29 73.77 53.01
10k 14.16 63.83 71.79 49.92
100k 15.27 58.79 75.44 49.83
perplexity_based_high 1k 16.50 59.36 72.62 49.49
10k 14.38 61.01 74.92 50.10
100k 13.82 59.62 72.52 48.65
perplexity_based_mid 1k 18.62 62.87 73.77 51.75
10k 13.49 64.57 73.77 50.61
100k 14.60 60.41 73.25 49.42
quality_scorer 1k 19.84 65.62 71.47 52.31
10k 14.05 64.52 74.50 51.02
100k 14.60 62.11 76.38 51.03
random 1k 19.84 64.52 73.46 52.61
10k 14.83 64.28 72.83 50.65
100k 14.49 61.63 73.77 49.97
source_balanced_random 1k 20.18 65.19 73.56 52.98
10k 14.49 64.26 72.62 50.46
100k 14.49 60.98 74.19 49.89

Math

Method Count (k) aime24 amc23 gaokao2024_mix gsm8k math minerva_math olympiadbench Avg
deita_quality 1k 10.00 52.50 34.10 89.30 72.80 30.10 38.10 46.70
10k 13.30 47.50 28.60 89.40 73.50 32.70 37.50 46.07
100k 10.00 40.00 28.60 89.90 71.80 33.80 35.70 44.26
diversity_kcenter 1k 13.30 57.50 34.10 89.70 73.40 32.70 36.60 48.19
10k 13.30 45.00 29.70 90.70 72.40 32.00 32.90 45.14
100k 10.00 45.00 34.10 90.30 72.00 33.10 35.70 45.74
embedding_similarity-new 1k 10.00 42.50 33.00 89.20 73.60 32.00 35.60 45.13
10k 13.30 47.50 29.70 90.30 73.80 32.40 35.00 46.00
100k 23.30 42.50 36.30 91.00 72.90 34.20 35.00 47.89
length_based 1k 16.70 47.50 35.20 89.00 72.00 29.80 36.30 46.64
10k 6.70 42.50 30.80 90.30 73.10 32.70 33.20 44.19
100k 10.00 52.50 35.20 89.80 73.50 33.50 36.70 47.31
perplexity_based 1k 16.70 45.00 28.60 89.60 73.10 31.60 35.30 45.70
10k 13.30 37.50 29.70 91.00 72.60 31.20 36.40 44.53
100k 10.00 50.00 24.20 90.00 71.80 30.10 33.60 44.24
perplexity_based_high 1k 10.00 42.50 29.70 88.60 72.80 30.10 35.70 44.20
10k 10.00 45.00 29.70 90.00 73.00 31.20 35.60 44.93
100k 6.70 42.50 29.70 90.10 72.70 33.80 32.90 44.06
perplexity_based_mid 1k 10.00 45.00 37.40 89.10 73.40 34.20 37.20 46.61
10k 3.30 45.00 31.90 90.30 72.60 36.40 34.70 44.89
100k 6.70 47.50 34.10 90.10 72.10 30.10 35.40 45.14
quality_scorer 1k 10.00 37.50 28.60 84.60 71.70 29.40 35.90 42.53
10k 3.30 40.00 31.90 90.10 73.30 26.10 37.90 43.23
100k 6.70 45.00 33.00 90.50 72.20 30.50 34.50 44.63
random 1k 16.70 52.50 38.50 89.80 72.50 33.10 36.00 48.44
10k 10.00 42.50 33.00 90.10 72.30 34.20 34.50 45.23
100k 13.30 37.50 30.80 89.50 71.70 31.20 35.70 44.24
source_balanced_random 1k 6.70 42.50 39.60 90.00 72.70 30.90 34.50 45.27
10k 16.70 47.50 35.20 89.80 72.20 33.80 35.00 47.17
100k 13.30 45.00 28.60 89.30 72.20 30.90 33.90 44.74

General

Metod Count (k) MMLU-Redux (avg of 57)
deita_quality 1k 77.43
10k 77.44
100k 77.09
diversity_kcenter 1k 77.74
10k 77.88
100k 77.59
embedding_similarity-new 1k 76.95
10k 77.58
100k 77.48
length_based 1k 77.37
10k 76.28
100k 77.09
perplexity_based 1k 77.11
10k 77.40
100k 76.65
perplexity_based_high 1k 77.99
10k 77.85
100k 77.66
perplexity_based_mid 1k 77.84
10k 77.19
100k 77.73
quality_scorer 1k 77.48
10k 77.06
100k 76.57
random 1k 77.46
10k 77.32
100k 77.40
source_balanced_random 1k 77.43
10k 77.34
100k 76.99

Scinece

Method Count (k) mmlu mmlu_pro gpqa_main gpqa_diamond super_gpqa ChemBench-multi-choise ChemBench-str-match scibench-physics scibench-chemistry scibench-math piqa Avg
deita_quality 1k 57.46 40.61 24.11 28.28 21.26 42.37 12.70 12.78 16.54 23.81 58.76 30.79
10k 72.23 50.03 30.58 39.90 24.96 49.25 37.30 30.40 33.83 37.41 77.04 43.90
100k 74.11 55.73 34.38 32.32 28.35 49.84 39.34 34.80 32.33 40.14 79.76 45.56
diversity_kcenter 1k 59.62 43.05 22.32 30.81 23.08 47.48 12.70 14.10 11.65 17.69 60.66 31.20
10k 72.85 54.10 29.91 36.36 27.78 46.23 38.11 33.92 29.70 40.82 78.56 44.39
100k 74.08 56.07 35.04 38.38 29.49 50.08 40.57 35.24 29.32 38.78 76.22 45.75
embedding_similarity-new 1k 55.33 38.79 28.79 29.80 21.15 45.05 9.02 6.61 9.77 11.56 64.09 29.09
10k 72.60 52.88 30.13 36.36 27.25 49.29 38.52 31.72 33.08 37.41 80.36 44.51
100k 73.99 54.56 33.04 33.33 28.35 48.78 39.34 29.52 31.58 35.37 78.29 44.20
length_based 1k 62.56 46.84 27.68 29.29 23.52 49.33 26.64 30.40 30.08 34.69 66.27 38.84
10k 73.19 55.77 31.47 35.86 28.53 50.79 37.70 36.56 34.21 42.86 80.14 46.10
100k 74.15 57.95 34.60 32.83 30.15 51.14 38.93 34.80 32.71 38.10 75.14 45.50
perplexity_based 1k 60.62 41.36 24.11 31.31 21.28 42.30 8.20 11.89 10.90 10.88 66.54 29.95
10k 69.71 48.79 26.12 27.78 24.77 42.45 37.70 23.35 24.81 31.29 72.31 39.01
100k 71.59 50.29 29.69 36.36 25.93 46.27 34.02 25.55 25.56 34.69 75.46 41.40
perplexity_based_high 1k 60.95 43.09 22.10 20.71 23.13 47.01 6.56 11.89 10.53 8.16 68.23 29.31
10k 72.69 52.93 28.79 29.80 27.16 47.13 31.97 29.96 28.57 37.41 77.31 42.16
100k 73.20 54.45 30.80 34.34 28.32 46.50 38.93 34.36 27.07 36.73 76.12 43.71
perplexity_based_mid 1k 59.22 44.66 25.45 28.79 23.65 43.87 13.52 22.47 19.55 25.85 61.32 33.49
10k 73.55 54.45 35.27 42.42 28.27 44.65 40.98 34.36 31.20 41.50 76.99 45.79
100k 73.37 55.87 35.49 44.44 29.11 48.94 34.02 33.48 28.95 38.10 79.98 45.61
quality_scorer 1k 58.92 40.75 26.34 24.24 20.42 42.92 9.84 9.25 10.90 10.20 64.15 28.90
10k 70.78 49.87 28.79 35.35 24.94 40.64 40.16 31.28 27.44 39.46 74.32 42.09
100k 73.89 55.99 28.35 33.33 29.29 48.74 37.70 36.56 31.58 40.14 79.11 44.97
random 1k 60.24 45.19 30.36 33.84 23.72 46.54 13.11 14.98 14.29 12.24 64.20 32.61
10k 72.48 55.53 31.70 28.79 28.45 45.48 42.21 29.96 31.58 39.46 73.39 43.55
100k 73.86 54.38 32.14 34.85 27.99 45.68 36.48 30.40 30.83 36.73 77.04 43.67
source_balanced_random 1k 56.52 43.42 25.89 33.33 22.95 43.59 9.43 12.33 14.29 11.56 59.19 30.23
10k 73.96 55.68 31.70 28.28 28.82 47.05 40.98 30.84 33.08 44.90 78.51 44.89
100k 74.33 56.23 33.93 39.39 29.67 47.96 42.21 32.60 33.83 35.37 81.01 46.05