
MicroCoder: Breaking Training Bottlenecks for Modern Coding Models

Content

🚀 News · 📖 Paper List · ✨ Motivation · 📈 Analysis · 🖥️ Algorithms · 🗂️ Scaling Law · 💯 Code Evaluator · 📌 Citation · 🔖 License

Links

Project Page · Algorithm Paper · Scaling Law Paper · Insights Blog

 

🚀 News

  • [2026.3.10] The papers were uploaded to arXiv.
 
 
 

📖 Paper List

This is the project page for MicroCoder and a brief summary of the papers below:

  • Breaking Training Bottlenecks: Effective Reinforcement Learning for Modern Coding Models
    Zongqian Li 1, 2, Shaohan Huang 1, Zewen Chi 1, Yixuan Su 2, Lexin Zhou 3, Li Dong 1, Nigel Collier 2, Furu Wei 1
    Microsoft 1, University of Cambridge 2, Princeton University 3
    Algorithm Paper
  • Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems
    Zongqian Li 1, 2, Tengchao Lv 1, Shaohan Huang 1, Yixuan Su 2, Qinzheng Sun 1, Qiufeng Yin 1, Ying Xin 1, Scarlett Li 1, Lei Cui 1, Nigel Collier 2, Furu Wei 1
    Microsoft 1, University of Cambridge 2
    Scaling Law Paper
  • MicroCoder-Insights: Training Recipes for Modern Coding Models
    Insights Blog
 
 
 

✨ Motivation

  • Cross-generational training effectiveness: current training methods yield substantial improvements on Qwen 2.5 models but minimal improvements on Qwen 3 models, revealing generation-specific training bottlenecks.
  • Dataset difficulty gap: mainstream datasets are difficult for Qwen 2.5 but relatively simple given Qwen 3's capabilities, indicating the need for more challenging training corpora.
  • Fundamental behavioral differences: output behavior differs fundamentally between generations. Qwen 3 models exhibit pronounced upward trends in response length during training, whereas Qwen 2.5 models show stable or decreasing lengths; across the series progression from Qwen 2.5 Instruct to Qwen 3 Instruct to Qwen 3 Thinking, standard outputs grow in both length and variance.

Figure: Algorithm: GRPO+, Max Response Length: 8K, Test Dataset: LiveCodeBench v6, Train Batch Size: 64

 
 
 

📈 Analysis: MicroCoder-Insights

MicroCoder-Insights: Training Recipes for Modern Coding Models

Through comprehensive analysis across more than thirty controlled experiments, we distill 34 key training insights spanning seven main aspects: the code evaluator, temperature, training data, context length and its extension, truncation-mask strategies, batch size and on-policy training, and KL loss and clip ratio.

 
 
 

🖥️ Algorithms: MicroCoder-GRPO

Breaking Training Bottlenecks: Effective Reinforcement Learning for Modern Coding Models

To address these training bottlenecks, we propose MicroCoder-GRPO, an enhanced Group Relative Policy Optimization approach with three key innovations:

  • conditional truncation masking, which preserves long-output potential while maintaining training stability;
  • diversity-determined temperature selection, which maintains and encourages output diversity;
  • removal of the KL loss combined with a high clipping ratio, which facilitates exploration.

The modifications of MicroCoder-GRPO relative to GRPO are highlighted in red in the equations below:

Notation:

  • $\theta$ / $\theta_{\text{old}}$: current / old (reference) policy parameters; $\pi_{\theta}$, $\pi_{\theta_{\text{old}}}$: the corresponding policies
  • $T(D)$: training temperature determined by output diversity $D$
  • $\beta_0$: KL loss weight (set to 0)
  • $\varepsilon$: clipping trust-region parameter; $\varepsilon_{\text{high}}$: high clipping value
  • $L_{\max}$: maximum response length; $\rho$: masking probability; $m$: repeat-check parameter (128 tokens)
  • $q$: query; $Q$: set of queries; $P(Q)$: probability distribution over queries
  • $G$: number of sampled outputs; $o_i$: output $i$; $r_i$: reward for output $i$; $A_i$: advantage score for output $i$
  • $U(0,1)$: uniform distribution over $[0,1]$; $\mathbb{I}[\cdot]$: indicator function; $\mathbf{D}_{\text{KL}}$: KL divergence
  • $\text{non-incorrect}(o_i)$: whether output $i$ is non-incorrect; $\neg\text{repeat}(o_i, m)$: non-repetition check (the final 128 tokens differ from the preceding 128 tokens)
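The conditional truncation mask described above can be sketched as follows. This is a minimal illustration assuming token-ID sequences and a reward convention where negative reward marks an incorrect output; the function and variable names are ours, not taken from the paper's code:

```python
import random

def truncation_mask(output_tokens, reward, max_len, rho, m=128):
    """Decide whether a length-truncated output is masked out of the loss.

    Per the paper's description: an output that hits the length limit is
    masked with probability rho (I[u < rho], u ~ U(0,1)), but only if it
    is non-incorrect and its final m tokens differ from the preceding m
    tokens (the repetition check with m = 128).
    """
    truncated = len(output_tokens) >= max_len
    if not truncated:
        return False  # complete outputs always contribute to the loss
    non_incorrect = reward >= 0  # assumption: negative reward = incorrect
    not_repeating = output_tokens[-m:] != output_tokens[-2 * m:-m]
    if non_incorrect and not_repeating:
        return random.random() < rho  # mask with probability rho
    return False
```

A degenerate repetitive tail (final 128 tokens identical to the preceding 128) or an incorrect answer is never masked, so such truncated outputs still receive their negative learning signal.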

Figure: Temperature: 1.2, Train Dataset: MicroCoder-Dataset, Test Dataset: LiveCodeBench v6, Train Batch Size: 64
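For context, the group-relative advantage $A_i$ that MicroCoder-GRPO inherits from GRPO normalizes each reward $r_i$ against the group of $G$ outputs sampled for the same query. A minimal sketch of this standard GRPO normalization (not taken from the paper's code):

```python
def group_relative_advantages(rewards):
    """Standard GRPO advantage: z-score each reward within its group.

    rewards: scalar rewards r_i for the G outputs o_i sampled from the
    same query q. Returns A_i = (r_i - mean) / (std + eps).
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    eps = 1e-8  # avoid division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantages are centered within each group, outputs are rewarded only relative to their siblings for the same query, which removes the need for a learned value baseline.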

 
 
 

🗂️ Scaling Law: Data Difficulty

Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems

 
 
 

💯 Code Evaluator: MicroCoder-Evaluator

 
 
 

📌 Citation

@misc{li2026breakingtrainingbottleneckseffective,
      title={Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models}, 
      author={Zongqian Li and Shaohan Huang and Zewen Chi and Yixuan Su and Lexin Zhou and Li Dong and Nigel Collier and Furu Wei},
      year={2026},
      eprint={2603.07777},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.07777}, 
}
@misc{li2026scalingdatadifficultyimproving,
      title={Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems}, 
      author={Zongqian Li and Tengchao Lv and Shaohan Huang and Yixuan Su and Qinzheng Sun and Qiufeng Yin and Ying Xin and Scarlett Li and Lei Cui and Nigel Collier and Furu Wei},
      year={2026},
      eprint={2603.07779},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.07779}, 
}
 
 
 

🔖 License

 
 
 
