- [2025/10] On The Fragility of Benchmark Contamination Detection in Reasoning Models
- [2025/09] Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
- [2025/07] How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework
- [2025/05] How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
- [2024/06] LiveBench: A Challenging, Contamination-Free LLM Benchmark
- [2024/06] VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
- [2024/06] Benchmark Data Contamination of Large Language Models: A Survey
- [2024/06] DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
- [2024/04] Benchmarking Benchmark Leakage in Large Language Models
- [2024/03] Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
- [2024/03] Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
- [2024/02] Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
- [2024/01] KoLA: Carefully Benchmarking World Knowledge of Large Language Models
- [2023/09] Proving Test Set Contamination for Black-Box Language Models
- [2023/09] Time Travel in LLMs: Tracing Data Contamination in Large Language Models
- [2023/09] To the Cutoff... and Beyond? A Longitudinal Perspective on LLM Data Contamination
- [2023/09] DyVal: Graph-informed Dynamic Evaluation of Large Language Models