latest papers 05-11 #185


Merged (1 commit, May 11, 2025)
37 changes: 31 additions & 6 deletions README.md
@@ -4,16 +4,17 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per

## News

- 🔥🔥🔥 [2025/05/04] Featured papers:
+ 🔥🔥🔥 [2025/05/11] Featured papers:

- - 🔥🔥 [SWE-smith: Scaling Data for Software Engineering Agents](https://arxiv.org/abs/2504.21798) from Stanford University.
+ - 🔥🔥 [Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware](https://arxiv.org/abs/2505.05057) from Harbin Institute of Technology.

- - 🔥🔥 [Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges](https://arxiv.org/abs/2504.20799) from Handong Global University.
+ - 🔥🔥 [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949) from NVIDIA.

- - 🔥🔥 [AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers](https://arxiv.org/abs/2504.20115) from University of Science and Technology of China.
+ - 🔥 [SWE-smith: Scaling Data for Software Engineering Agents](https://arxiv.org/abs/2504.21798) from Stanford University.

- - 🔥 [SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs](https://arxiv.org/abs/2504.14757) from FPT Software AI Center.
+ - 🔥 [Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges](https://arxiv.org/abs/2504.20799) from Handong Global University.

+ - 🔥 [AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers](https://arxiv.org/abs/2504.20115) from University of Science and Technology of China.

🔥🔥🔥 Recent works from Codefuse:

@@ -202,6 +203,8 @@ We list several recent surveys on similar topics. While they are all about langu

15. "Challenges and Paths Towards AI for Software Engineering" [2025-03] [[paper](https://arxiv.org/abs/2503.22625)]

16. "Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents" [2025-05] [[paper](https://arxiv.org/abs/2505.05283)]

## 2. Models

<p align='center'>
@@ -366,6 +369,8 @@ These LLMs are not specifically trained for code, but have demonstrated varying

77. **Command A**: "Command A: An Enterprise-Ready Large Language Model" [2025-04] [[paper](https://arxiv.org/abs/2504.00698)]

78. **Llama-Nemotron**: "Llama-Nemotron: Efficient Reasoning Models" [2025-05] [[paper](https://arxiv.org/abs/2505.00949)]

### 2.2 Existing LLM Adapted to Code

These models are general-purpose LLMs further pretrained on code-related data.
@@ -426,6 +431,8 @@ These models are Transformer encoders, decoders, and encoder-decoders pretrained

8. **CoLSBERT** (MLM): "Scaling Laws Behind Code Understanding Model" [2024-02] [[paper](https://arxiv.org/abs/2402.12813)]

9. **BiGSCoder**: "BiGSCoder: State Space Model for Code Understanding" [2025-05] [[paper](https://arxiv.org/abs/2505.01475)]

#### Decoder

1. **GPT-C** (CLM): "IntelliCode Compose: Code Generation Using Transformer" [2020-05] [ESEC/FSE 2020] [[paper](https://arxiv.org/abs/2005.08025)]
@@ -946,6 +953,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

60. **SWE-smith**: "SWE-smith: Scaling Data for Software Engineering Agents" [2025-04] [[paper](https://arxiv.org/abs/2504.21798)]

61. "Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency" [2025-05] [[paper](https://arxiv.org/abs/2505.02133)]

### 3.4 Interactive Coding

- "Interactive Program Synthesis" [2017-03] [[paper](https://arxiv.org/abs/1703.03539)]
@@ -1312,6 +1321,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

- [**Verilog**] "VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation" [2025-04] [[paper](https://arxiv.org/abs/2504.15659)]

- [**Chisel**] "ChiseLLM: Unleashing the Power of Reasoning LLMs for Chisel Agile Hardware Development" [2025-04] [[paper](https://arxiv.org/abs/2504.19144)]

- [**Verilog**] "ComplexVCoder: An LLM-Driven Framework for Systematic Generation of Complex Verilog Code" [2025-04] [[paper](https://arxiv.org/abs/2504.20653)]

## 5. Methods/Models for Downstream Tasks
@@ -1890,6 +1901,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Using LLMs for Library Migration" [2025-04] [[paper](https://arxiv.org/abs/2504.13272)]

- "Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes" [2025-05] [[paper](https://arxiv.org/abs/2505.02184)]

### Type Prediction

- "Learning type annotation: is big data enough?" [2021-08] [ESEC/FSE 2021] [[paper](https://dl.acm.org/doi/10.1145/3468264.3473135)]
@@ -2060,6 +2073,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "ReaderLM-v2: Small Language Model for HTML to Markdown and JSON" [2025-03] [[paper](https://arxiv.org/abs/2503.01151)]

- "WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch" [2025-05] [[paper](https://arxiv.org/abs/2505.03733)]

### Automated Machine Learning

- "Large Language Models Synergize with Automated Machine Learning" [2024-05] [[paper](https://arxiv.org/abs/2405.03727)]
@@ -2334,6 +2349,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs" [2025-03] [[paper](https://arxiv.org/abs/2504.00048)]

- "Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards" [2025-05] [[paper](https://arxiv.org/abs/2505.04671)]

### Program Proof

- "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03] [FSE 2023] [[paper](https://arxiv.org/abs/2303.04910)]
@@ -2822,6 +2839,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "How Accurately Do Large Language Models Understand Code?" [2025-04] [[paper](https://arxiv.org/abs/2504.04372)]

- "Program Semantic Inequivalence Game with Large Language Models" [2025-05] [[paper](https://arxiv.org/abs/2505.03818)]

### Malicious Code Detection

- "I-MAD: Interpretable Malware Detector Using Galaxy Transformer" [2019-09] [Comput. Secur. 2021] [[paper](https://arxiv.org/abs/1909.06865)]
@@ -3126,6 +3145,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "The Code Barrier: What LLMs Actually Understand?" [2025-04] [[paper](https://arxiv.org/abs/2504.10557)]

- "Can Large Language Models Predict Parallel Code Performance?" [2025-05] [[paper](https://arxiv.org/abs/2505.03988)]

### Software Modeling

- "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12] [[paper](https://arxiv.org/abs/2212.03404)]
@@ -3494,6 +3515,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Mitigating Sensitive Information Leakage in LLMs4Code through Machine Unlearning" [2025-02] [[paper](https://arxiv.org/abs/2502.05739)]

- "Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware" [2025-05] [[paper](https://arxiv.org/abs/2505.05057)]

### Bias

- "Exploring Multi-Lingual Bias of Large Code Models in Code Generation" [2024-04] [[paper](https://arxiv.org/abs/2404.19368)]
@@ -3905,6 +3928,7 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
| 2025-03 | arXiv | DynaCode | 405 | Python | "DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation" [[paper](https://arxiv.org/abs/2503.10452)] |
| 2025-03 | arXiv | BigO(Bench) | 3105 | Python | "BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?" [[paper](https://arxiv.org/abs/2503.15242)] [[data](https://github.com/facebookresearch/bigobench)] |
| 2025-04 | arXiv | - | 842K | Python | "A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs" [[paper](https://arxiv.org/abs/2504.15564)] [[data](https://anonymous.4open.science/r/class-level-benchmark-dataset-B132/README.md)] |
| 2025-05 | arXiv | YABLoCo | 215 | C, C++ | "YABLoCo: Yet Another Benchmark for Long Context Code Generation" [[paper](https://arxiv.org/abs/2505.04406)] [[data](https://github.com/yabloco-codegen/yabloco-benchmark)] |

\* Automatically mined/human-annotated

@@ -4183,6 +4207,7 @@ $^\diamond$ Machine/human prompts
| 2025-04 | arXiv | Multi-SWE-bench | 1632 | Java, JS, TS, Go, Rust, C, C++ | "Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving" [[paper](https://arxiv.org/abs/2504.02605)] [[data](https://github.com/multi-swe-bench/multi-swe-bench)] |
| 2025-04 | arXiv | SWE-PolyBench | 2110 | Java, JS, TS, Python | "SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents" [[paper](https://arxiv.org/abs/2504.08703)] [[data](https://github.com/amazon-science/SWE-PolyBench)] |
| 2025-04 | arXiv | SecRepoBench | 318 | C/C++ | "SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories" [[paper](https://arxiv.org/abs/2504.21205)] |
| 2025-05 | arXiv | OmniGIRL | 959 | Python, JS, TS, Java | "OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution" [[paper](https://arxiv.org/abs/2505.04606)] [[data](https://github.com/DeepSoftwareAnalytics/OmniGIRL)] |

\*Line Completion/API Invocation Completion/Function Completion
