- [2025/10] On The Fragility of Benchmark Contamination Detection in Reasoning Models
- [2025/09] Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
- [2025/07] How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework
- [2025/05] How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
- [2024/06] LiveBench: A Challenging, Contamination-Free LLM Benchmark
- [2024/06] VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
- [2024/06] Benchmark Data Contamination of Large Language Models: A Survey
- [2024/06] DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
- [2024/04] Benchmarking Benchmark Leakage in Large Language Models
- [2024/03] Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
- [2024/03] Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
- [2024/02] Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
- [2024/01] KoLA: Carefully Benchmarking World Knowledge of Large Language Models
- [2023/09] Proving Test Set Contamination for Black-Box Language Models
- [2023/09] Time Travel in LLMs: Tracing Data Contamination in Large Language Models
- [2023/09] To the Cutoff... and Beyond? A Longitudinal Perspective on LLM Data Contamination
- [2023/09] DyVal: Graph-informed Dynamic Evaluation of Large Language Models