latest papers 05-11 #185


Merged (1 commit, May 11, 2025)
37 changes: 31 additions & 6 deletions README.md
@@ -4,16 +4,17 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per

## News

- 🔥🔥🔥 [2025/05/04] Featured papers:
+ 🔥🔥🔥 [2025/05/11] Featured papers:

- - 🔥🔥 [SWE-smith: Scaling Data for Software Engineering Agents](https://arxiv.org/abs/2504.21798) from Stanford University.
+ - 🔥🔥 [Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware](https://arxiv.org/abs/2505.05057) from Harbin Institute of Technology.

- - 🔥🔥 [Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges](https://arxiv.org/abs/2504.20799) from Handong Global University.
+ - 🔥🔥 [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949) from NVIDIA.

- - 🔥🔥 [AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers](https://arxiv.org/abs/2504.20115) from University of Science and Technology of China.
+ - 🔥 [SWE-smith: Scaling Data for Software Engineering Agents](https://arxiv.org/abs/2504.21798) from Stanford University.

- - 🔥 [SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs](https://arxiv.org/abs/2504.14757) from FPT Software AI Center.
+ - 🔥 [Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges](https://arxiv.org/abs/2504.20799) from Handong Global University.

+ - 🔥 [AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers](https://arxiv.org/abs/2504.20115) from University of Science and Technology of China.

🔥🔥🔥 Recent works from Codefuse:

@@ -202,6 +203,8 @@ We list several recent surveys on similar topics. While they are all about langu

15. "Challenges and Paths Towards AI for Software Engineering" [2025-03] [[paper](https://arxiv.org/abs/2503.22625)]

16. "Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents" [2025-05] [[paper](https://arxiv.org/abs/2505.05283)]

## 2. Models

<p align='center'>
@@ -366,6 +369,8 @@ These LLMs are not specifically trained for code, but have demonstrated varying

77. **Command A**: "Command A: An Enterprise-Ready Large Language Model" [2025-04] [[paper](https://arxiv.org/abs/2504.00698)]

78. **Llama-Nemotron**: "Llama-Nemotron: Efficient Reasoning Models" [2025-05] [[paper](https://arxiv.org/abs/2505.00949)]

### 2.2 Existing LLM Adapted to Code

These models are general-purpose LLMs further pretrained on code-related data.
@@ -426,6 +431,8 @@ These models are Transformer encoders, decoders, and encoder-decoders pretrained

8. **CoLSBERT** (MLM): "Scaling Laws Behind Code Understanding Model" [2024-02] [[paper](https://arxiv.org/abs/2402.12813)]

9. **BiGSCoder**: "BiGSCoder: State Space Model for Code Understanding" [2025-05] [[paper](https://arxiv.org/abs/2505.01475)]

#### Decoder

1. **GPT-C** (CLM): "IntelliCode Compose: Code Generation Using Transformer" [2020-05] [ESEC/FSE 2020] [[paper](https://arxiv.org/abs/2005.08025)]
@@ -946,6 +953,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

60. **SWE-smith**: "SWE-smith: Scaling Data for Software Engineering Agents" [2025-04] [[paper](https://arxiv.org/abs/2504.21798)]

61. "Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency" [2025-05] [[paper](https://arxiv.org/abs/2505.02133)]

### 3.4 Interactive Coding

- "Interactive Program Synthesis" [2017-03] [[paper](https://arxiv.org/abs/1703.03539)]
@@ -1312,6 +1321,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

- [**Verilog**] "VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation" [2025-04] [[paper](https://arxiv.org/abs/2504.15659)]

- [**Chisel**] "ChiseLLM: Unleashing the Power of Reasoning LLMs for Chisel Agile Hardware Development" [2025-04] [[paper](https://arxiv.org/abs/2504.19144)]

- [**Verilog**] "ComplexVCoder: An LLM-Driven Framework for Systematic Generation of Complex Verilog Code" [2025-04] [[paper](https://arxiv.org/abs/2504.20653)]

## 5. Methods/Models for Downstream Tasks
@@ -1890,6 +1901,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Using LLMs for Library Migration" [2025-04] [[paper](https://arxiv.org/abs/2504.13272)]

- "Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes" [2025-05] [[paper](https://arxiv.org/abs/2505.02184)]

### Type Prediction

- "Learning type annotation: is big data enough?" [2021-08] [ESEC/FSE 2021] [[paper](https://dl.acm.org/doi/10.1145/3468264.3473135)]
@@ -2060,6 +2073,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "ReaderLM-v2: Small Language Model for HTML to Markdown and JSON" [2025-03] [[paper](https://arxiv.org/abs/2503.01151)]

- "WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch" [2025-05] [[paper](https://arxiv.org/abs/2505.03733)]

### Automated Machine Learning

- "Large Language Models Synergize with Automated Machine Learning" [2024-05] [[paper](https://arxiv.org/abs/2405.03727)]
@@ -2334,6 +2349,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs" [2025-03] [[paper](https://arxiv.org/abs/2504.00048)]

- "Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards" [2025-05] [[paper](https://arxiv.org/abs/2505.04671)]

### Program Proof

- "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03] [FSE 2023] [[paper](https://arxiv.org/abs/2303.04910)]
@@ -2822,6 +2839,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "How Accurately Do Large Language Models Understand Code?" [2025-04] [[paper](https://arxiv.org/abs/2504.04372)]

- "Program Semantic Inequivalence Game with Large Language Models" [2025-05] [[paper](https://arxiv.org/abs/2505.03818)]

### Malicious Code Detection

- "I-MAD: Interpretable Malware Detector Using Galaxy Transformer" [2019-09] [Comput. Secur. 2021] [[paper](https://arxiv.org/abs/1909.06865)]
@@ -3126,6 +3145,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "The Code Barrier: What LLMs Actually Understand?" [2025-04] [[paper](https://arxiv.org/abs/2504.10557)]

- "Can Large Language Models Predict Parallel Code Performance?" [2025-05] [[paper](https://arxiv.org/abs/2505.03988)]

### Software Modeling

- "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12] [[paper](https://arxiv.org/abs/2212.03404)]
@@ -3494,6 +3515,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Mitigating Sensitive Information Leakage in LLMs4Code through Machine Unlearning" [2025-02] [[paper](https://arxiv.org/abs/2502.05739)]

- "Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware" [2025-05] [[paper](https://arxiv.org/abs/2505.05057)]

### Bias

- "Exploring Multi-Lingual Bias of Large Code Models in Code Generation" [2024-04] [[paper](https://arxiv.org/abs/2404.19368)]
@@ -3905,6 +3928,7 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
| 2025-03 | arXiv | DynaCode | 405 | Python | "DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation" [[paper](https://arxiv.org/abs/2503.10452)] |
| 2025-03 | arXiv | BigO(Bench) | 3105 | Python | "BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?" [[paper](https://arxiv.org/abs/2503.15242)] [[data](https://github.com/facebookresearch/bigobench)] |
| 2025-04 | arXiv | - | 842K | Python | "A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs" [[paper](https://arxiv.org/abs/2504.15564)] [[data](https://anonymous.4open.science/r/class-level-benchmark-dataset-B132/README.md)] |
| 2025-05 | arXiv | YABLoCo | 215 | C, C++ | "YABLoCo: Yet Another Benchmark for Long Context Code Generation" [[paper](https://arxiv.org/abs/2505.04406)] [[data](https://github.com/yabloco-codegen/yabloco-benchmark)] |

\* Automatically mined/human-annotated

@@ -4183,6 +4207,7 @@ $^\diamond$ Machine/human prompts
| 2025-04 | arXiv | Multi-SWE-bench | 1632 | Java, JS, TS, Go, Rust, C, C++ | "Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving" [[paper](https://arxiv.org/abs/2504.02605)] [[data](https://github.com/multi-swe-bench/multi-swe-bench)] |
| 2025-04 | arXiv | SWE-PolyBench | 2110 | Java, JS, TS, Python | "SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents" [[paper](https://arxiv.org/abs/2504.08703)] [[data](https://github.com/amazon-science/SWE-PolyBench)] |
| 2025-04 | arXiv | SecRepoBench | 318 | C/C++ | "SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories" [[paper](https://arxiv.org/abs/2504.21205)] |
| 2025-05 | arXiv | OmniGIRL | 959 | Python, JS, TS, Java | "OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution" [[paper](https://arxiv.org/abs/2505.04606)] [[data](https://github.com/DeepSoftwareAnalytics/OmniGIRL)] |

\*Line Completion/API Invocation Completion/Function Completion
