Commit bb61e90

Update index.html
1 parent b1c1199 commit bb61e90

1 file changed: +3 −11 lines changed

index.html

Lines changed: 3 additions & 11 deletions
@@ -112,17 +112,10 @@ <h1 class="title is-1 publication-title">
 <!-- Abstract. -->
 <div class="columns is-centered has-text-centered">
 <div class="column is-four-fifths">
-<h2 class="title is-3">🔔News</h2>
+<h2 class="title is-3">Abstract</h2>
 <div class="content has-text-justified">
 <p>
-<b>🔥[2025-02-03] Our paper <a href="https://arxiv.org/abs/2502.01718">AceCoder</a> is out. <a href="https://huggingface.co/collections/TIGER-Lab/acecoder-67a16011a6c7d65cad529eba">Models</a> and <a href="https://huggingface.co/datasets/TIGER-Lab/AceCode-87K">Datasets</a> are also released 🚀.</b>
-</p>
-</div>
-<h2 class="title is-3">Introduction</h2>
-<div class="content has-text-justified">
-<p>
-Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. It shows an average of <b>10</b>-point improvement for Llama-3.1-8B-Ins and <b>5</b>-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model <b>on par with</b> 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve model on HumanEval-plus by over <b>25%</b> and MBPP-plus by <b>6%</b> for merely <b>80</b> optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
-</p>
+Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD) have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce \textsc{Critique-Coder}, which is trained on a hybrid of RL and CRL by substituting 20\% of the standard RL data with CRL data. We fine-tune multiple models (\textsc{Critique-Coder}) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that \textsc{Critique-Coder} consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our \textsc{Critique-Coder-8B} can reach over 60\% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, \textsc{Critique-Coder} also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning. </p>
 </div>
 </div>
 </div>
@@ -133,8 +126,7 @@ <h2 class="title is-3">Introduction</h2>
 <section class="hero is-light is-small">
 <div class="hero-body has-text-centered">
 <h1 class="title is-1 acecoder">
-🂡
-<span class="acecoder">AceCoder</span>
+<span class="acecoder">Critique-Coder</span>
 </h1>
 </div>
 </section>

0 commit comments
