This is the repo for our survey, a comprehensive review of LLM research for code. Works in each category are ordered chronologically. If you have a basic understanding of machine learning but are new to NLP, we also provide a list of recommended readings in section 4.
- 2.2 Existing LLM Further Trained on Code
- 2.3 General Pretraining on Code
- 3.1 Pretraining
- 3.2 Benchmarks
We list six recent surveys on similar topics. While all of them cover language models for code, the first two focus on the NLP side, and the latter four focus on the SE side.
- "Large Language Models Meet NL2Code: A Survey", 2022-12, ACL 2023, [paper]
- "A Survey on Pretrained Language Models for Neural Code Intelligence", 2022-12, arXiv, [paper]
- "An Empirical Comparison of Pre-Trained Models of Source Code", 2023-02, ICSE 2023, [paper]
- "Large Language Models for Software Engineering: A Systematic Literature Review", 2023-08, arXiv, [paper]
- "Towards an Understanding of Large Language Models in Software Engineering Tasks", 2023-08, arXiv, [paper]
- "Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey", 2023-10, arXiv, [paper]
These LLMs are not specifically trained for code, but have demonstrated varying coding capabilities.
- LaMDA: "LaMDA: Language Models for Dialog Applications", 2022-01, arXiv, [paper]
- PaLM: "PaLM: Scaling Language Modeling with Pathways", 2022-04, arXiv, [paper]
- GPT-NeoX: "GPT-NeoX-20B: An Open-Source Autoregressive Language Model", 2022-04, ACL 2022 Workshop on Challenges & Perspectives in Creating Large Language Models, [paper] [repo]
- BLOOM: "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model", 2022-11, arXiv, [paper] [model]
- LLaMA: "LLaMA: Open and Efficient Foundation Language Models", 2023-02, arXiv, [paper]
- GPT-4: "GPT-4 Technical Report", 2023-03, arXiv, [paper]
- LLaMA 2: "Llama 2: Open Foundation and Fine-Tuned Chat Models", 2023-07, arXiv, [paper] [repo]
- Phi-1.5: "Textbooks Are All You Need II: phi-1.5 technical report", 2023-09, arXiv, [paper] [model]
- Baichuan 2: "Baichuan 2: Open Large-scale Language Models", 2023-09, arXiv, [paper] [repo]
- Qwen: "Qwen Technical Report", 2023-09, arXiv, [paper] [repo]
These models are general-purpose LLMs further pretrained on code-related data; a sketch of such continued pretraining follows the list.
- Codex (GPT-3): "Evaluating Large Language Models Trained on Code", 2021-07, arXiv, [paper]
- PaLM Coder (PaLM): "PaLM: Scaling Language Modeling with Pathways", 2022-04, arXiv, [paper]
- Minerva (PaLM): "Solving Quantitative Reasoning Problems with Language Models", 2022-06, arXiv, [paper]
- PaLM 2 * (PaLM 2): "PaLM 2 Technical Report", 2023-05, arXiv, [paper]
- Code LLaMA (LLaMA 2): "Code Llama: Open Foundation Models for Code", 2023-08, arXiv, [paper] [repo]
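A minimal sketch of what such continued pretraining looks like in practice, assuming the Hugging Face `transformers` API. The checkpoint name, learning rate, and toy corpus are placeholders, not the recipe of any paper above:

```python
# Hedged sketch: continue pretraining a general-purpose causal LM on code.
# "gpt2" is a stand-in base model; real recipes differ per paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

code_corpus = ["def add(a, b):\n    return a + b\n"]  # toy code "dataset"

model.train()
for text in code_corpus:
    batch = tokenizer(text, return_tensors="pt")
    # Standard CLM objective: labels are the inputs, shifted internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```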
These models are Transformer encoders, decoders, and encoder-decoders pretrained from scratch on code using objectives adapted from general language modeling; a sketch of the fill-in-the-middle (FIM) objective used by several of the decoder models follows the list.
- CuBERT (MLM + NSP): "Learning and Evaluating Contextual Embedding of Source Code", 2019-12, ICML 2020, [paper] [repo]
- CodeBERT (MLM + RTD): "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", 2020-02, EMNLP findings 2020, [paper] [repo]
- GraphCodeBERT (MLM + DFG Edge Prediction + DFG Node Alignment): "GraphCodeBERT: Pre-training Code Representations with Data Flow", 2020-09, ICLR 2021, [paper] [repo]
- SynCoBERT (MLM + Identifier Prediction + AST Edge Prediction + Contrastive Learning): "SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation", 2021-08, arXiv, [paper]
- DISCO (MLM + Node Type MLM + Contrastive Learning): "Towards Learning (Dis)-Similarity of Source Code from Program Contrasts", 2021-10, ACL 2022, [paper]
- Code-MVP (MLM + Type Inference + Contrastive Learning): "CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training", 2022-05, NAACL 2022 Technical Track, [paper]
- GPT-C (CLM): "IntelliCode Compose: Code Generation Using Transformer", 2020-05, ESEC/FSE 2020, [paper]
- CodeGPT (CLM): "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation", 2021-02, NeurIPS Datasets and Benchmarks 2021, [paper] [repo]
- CodeParrot (CLM), 2021-12, [blog]
- PolyCoder (CLM): "A Systematic Evaluation of Large Language Models of Code", 2022-02, DL4C@ICLR 2022, [paper] [repo]
- CodeGen (CLM): "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis", 2022-03, ICLR 2023, [paper] [repo]
- InCoder (Causal Masking): "InCoder: A Generative Model for Code Infilling and Synthesis", 2022-04, ICLR 2023, [paper] [repo]
- PyCodeGPT (CLM): "CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation", 2022-06, IJCAI-ECAI 2022, [paper] [repo]
- PanGu-Coder (CLM): "PanGu-Coder: Program Synthesis with Function-Level Language Modeling", 2022-07, arXiv, [paper]
- SantaCoder (FIM): "SantaCoder: don't reach for the stars!", 2023-01, arXiv, [paper] [model]
- CodeGeeX (CLM): "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X", 2023-03, arXiv, [paper] [repo]
- StarCoder (FIM): "StarCoder: may the source be with you!", 2023-05, arXiv, [paper] [model]
- Phi-1 (CLM): "Textbooks Are All You Need", 2023-06, arXiv, [paper] [model]
- CodeFuse (CLM): "CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model", 2023-10, arXiv, [paper] [model]
- CodeShell (CLM), 2023-10, [repo]
- DeepSeek Coder (CLM), 2023-10, [repo]
- PyMT5 (Span Corruption): "PyMT5: multi-mode translation of natural language and Python code with transformers", 2020-10, EMNLP 2020, [paper]
- DOBF (MLM + Deobfuscation): "DOBF: A Deobfuscation Pre-Training Objective for Programming Languages", 2021-02, NeurIPS 2021, [paper] [repo]
- Mastropaolo et al. (Span Corruption): "Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks", 2021-02, ICSE 2021, [paper] [repo]
- PLBART (DAE): "Unified Pre-training for Program Understanding and Generation", 2021-03, NAACL 2021, [paper] [repo]
- CodeT5 (Span Corruption + Identifier Tagging + Masked Identifier Prediction + Text2Code + Code2Text): "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation", 2021-09, EMNLP 2021, [paper] [repo]
- SPT-Code (Span Corruption + NSP + Method Name Prediction): "SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations", 2022-01, ICSE 2022 Technical Track, [paper]
- AlphaCode (MLM + CLM): "Competition-Level Code Generation with AlphaCode", 2022-02, Science, [paper] [arxiv]
- NatGen (Code Naturalization): "NatGen: Generative pre-training by "Naturalizing" source code", 2022-06, ESEC/FSE 2022, [paper] [repo]
- CodeT5+ (Span Corruption + CLM + Text-Code Contrastive Learning + Text-Code Translation): "CodeT5+: Open Code Large Language Models for Code Understanding and Generation", 2023-05, arXiv, [paper] [repo]
- CugLM (MLM + NSP + CLM): "Multi-task Learning based Pre-trained Language Model for Code Completion", 2020-12, ASE 2020, [paper]
- UniXcoder (MLM + NSP + CLM + Span Corruption + Contrastive Learning + Code2Text): "UniXcoder: Unified Cross-Modal Pre-training for Code Representation", 2022-03, ACL 2022, [paper] [repo]
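Several decoder-only models above (InCoder, SantaCoder, StarCoder) are trained with infilling objectives. Below is a hedged sketch of the fill-in-the-middle (FIM) data transformation in the spirit of those papers; the sentinel token strings here are placeholders, as each model defines its own special tokens:

```python
import random

# Placeholder sentinels; real models define their own special tokens.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def fim_transform(document: str, rng: random.Random) -> str:
    """Split a document into (prefix, middle, suffix) at random positions
    and reorder it so a left-to-right LM learns to infill the middle."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # Prefix-suffix-middle (PSM) ordering: the middle is predicted last,
    # conditioned on both sides of the gap.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

print(fim_transform("def add(a, b):\n    return a + b\n", random.Random(0)))
```

After this transformation, training proceeds with the ordinary left-to-right CLM loss, which is why FIM can be added at negligible cost.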
These models apply Instruction Fine-Tuning techniques to enhance the capabilities of Code LLMs; a sketch of the underlying prompt formatting and loss masking follows the list.
- WizardCoder (StarCoder + Evol-Instruct): "WizardCoder: Empowering Code Large Language Models with Evol-Instruct", 2023-06, arXiv, [paper] [repo]
- PanGu-Coder 2 (StarCoder + Evol-Instruct + RRTF): "PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback", 2023-07, arXiv, [paper]
- OctoCoder (StarCoder) / OctoGeeX (CodeGeeX2): "OctoPack: Instruction Tuning Code Large Language Models", 2023-08, arXiv, [paper] [repo]
- MFTCoder (Code LLaMA): "MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning", 2023-11, arXiv, [paper] [repo]
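A hedged sketch of the mechanics these methods share: format each (instruction, response) pair with a prompt template and mask the prompt tokens out of the loss, so only response tokens are supervised. The template and the -100 ignore index follow common practice (PyTorch's cross-entropy default), not the exact recipe of any single paper above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer

def build_example(instruction: str, response: str) -> dict:
    """Tokenize prompt + response; supervise only the response tokens."""
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    prompt_ids = tokenizer(prompt)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token)["input_ids"]
    input_ids = prompt_ids + response_ids
    # -100 is ignored by PyTorch's cross-entropy loss, so the model is
    # never penalized for (or trained to reproduce) the prompt itself.
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}

ex = build_example("Write a function that adds two numbers.",
                   "def add(a, b):\n    return a + b")
```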
- CompCoder: "Compilable Neural Code Generation with Compiler Feedback", 2022-03, ACL 2022, [paper]
- CodeRL: "CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning", 2022-07, NeurIPS 2022, [paper] [repo]
- PPOCoder: "Execution-based Code Generation using Deep Reinforcement Learning", 2023-01, TMLR 2023, [paper] [repo]
- RLTF: "RLTF: Reinforcement Learning from Unit Test Feedback", 2023-07, arXiv, [paper] [repo]
- CodeSearchNet: "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search", 2019-09, arXiv, [paper] [repo] [data]
- The Pile: "The Pile: An 800GB Dataset of Diverse Text for Language Modeling", 2020-12, arXiv, [paper] [data]
- CodeParrot, 2022-02, [data]
- The Stack: "The Stack: 3 TB of permissively licensed source code", 2022-11, arXiv, [paper] [data]
- ROOTS: "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset", 2023-03, NeurIPS 2022 Datasets and Benchmarks Track, [paper] [data]
- CodeXGLUE: "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation", 2021-02, NeurIPS Datasets and Benchmarks 2021, [paper] [repo] [data]
Date | Venue | Benchmark | Size | Language | Source |
---|---|---|---|---|---|
2018-08 | EMNLP 2018 | CONCODE | 104K | Java | "Mapping Language to Code in Programmatic Context" [paper] [data] |
2019-10 | EMNLP-IJCNLP 2019 | JuICe | 1.5M/3725 * | Python | "JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation" [paper] [data] |
2021-05 | NeurIPS 2021 | APPS | 10000 | Python | "Measuring Coding Challenge Competence With APPS" [paper] [data] |
2021-07 | arXiv | HumanEval | 164 | Python | "Evaluating Large Language Models Trained on Code" [paper] [data] |
2021-08 | arXiv | MBPP/MathQA-Python | 974/23914 | Python | "Program Synthesis with Large Language Models" [paper] [MBPP] [MathQA-Python] |
2021-08 | ACL/IJCNLP 2021 | PlotCoder | 40797 | Python | "PlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context" [paper] [data] |
2022-01 | arXiv | DSP | 1119 | Python | "Training and Evaluating a Jupyter Notebook Data Science Assistant" [paper] [data] |
2022-03 | EACL 2023 Findings | MCoNaLa | 896 | Python | "MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages" [paper] [data] |
2022-06 | arXiv | AixBench | 336 | Java | "AixBench: A Code Generation Benchmark Dataset" [paper] [data] |
2022-08 | IEEE Trans. Software Engineering | MultiPL-E | | | "MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation" [paper] [data] |
2022-10 | ICLR 2023 | MBXP | 12.4K | Python, Java, JS, TypeScript, Go, C#, PHP, Ruby, Kotlin, C++, Perl, Scala, Swift | "Multi-lingual Evaluation of Code Generation Models" [paper] [data] |
2022-10 | ICLR 2023 | Multilingual HumanEval | 1.9K | Python, Java, JS, TypeScript, Go, C#, PHP, Ruby, Kotlin, Perl, Scala, Swift | "Multi-lingual Evaluation of Code Generation Models" [paper] [data] |
2022-10 | ICLR 2023 | MathQA-X | 5.6K | Python, Java, JS | "Multi-lingual Evaluation of Code Generation Models" [paper] [data] |
2022-11 | arXiv | ExeDS | 534 | Python | "Execution-based Evaluation for Data Science Code Generation Models" [paper] [data] |
2022-11 | arXiv | DS-1000 | 1000 | Python | "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation" [paper] [data] |
2022-12 | arXiv | ODEX | 945 | Python | "Execution-Based Evaluation for Open-Domain Code Generation" [paper] [data] |
2023-02 | arXiv | CoderEval | 460 | Python, Java | "CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models" [paper] [data] |
2023-03 | arXiv | xCodeEval | 5.5M | C, C#, C++, Go, Java, JS, Kotlin, PHP, Python, Ruby, Rust | "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [paper] [data] |
2023-03 | arXiv | HumanEval-X | 820 | Python, C++, Java, JS, Go | "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [paper] [data] |
2023-05 | arXiv | HumanEval+ | 164 | Python | "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" [paper] [data] |
2023-06 | arXiv | StudentEval | 1749 | Python | "StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code" [paper] [data] |
2023-08 | arXiv | HumanEvalPack | 984 | Python, JS, Go, Java, C++, Rust | "OctoPack: Instruction Tuning Code Large Language Models" [paper] [data] |
2023-09 | arXiv | CodeApex | 476 | C++ | "CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models" [paper] [data] |
* Automatically mined/human-annotated
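Most execution-based benchmarks in this table (HumanEval, MBPP, and their derivatives) report pass@k. A sketch of the unbiased estimator introduced in "Evaluating Large Language Models Trained on Code": with n samples per problem of which c pass, pass@k = 1 - C(n-c, k)/C(n, k), computed in a numerically stable product form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 - C(n-c, k)/C(n, k), expanded as a product to avoid huge binomials.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))    # 0.05
print(pass_at_k(n=200, c=10, k=100))  # close to 1
```

The benchmark-level score is this quantity averaged over all problems.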
Date | Venue | Benchmark | Size | Language | Source |
---|---|---|---|---|---|
2020-06 | NeurIPS 2020 | Transcoder GeeksforGeeks | 1.4K | C++, Java, Python | "Unsupervised Translation of Programming Languages" [paper] [data] |
2021-02 | NeurIPS Datasets and Benchmarks 2021 | CodeTrans | 11.8K | Java, C# | "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [paper] [data] |
2021-08 | ACL 2023 Findings | Avatar | 9515 | Java, Python | "AVATAR: A Parallel Corpus for Java-Python Program Translation" [paper] [data] |
2022-06 | AAAI 2022 | CoST | 132K | C++, Java, Python, C#, JS, PHP, C | "Multilingual Code Snippets Training for Program Translation" [paper] [data] |
2022-06 | arXiv | XLCoST | 567K | C++, Java, Python, C#, JS, PHP, C | "XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence" [paper] [data] |
2023-03 | arXiv | xCodeEval | 5.6M | C, C#, C++, Go, Java, JS, Kotlin, PHP, Python, Ruby, Rust | "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval" [paper] [data] |
2023-03 | arXiv | HumanEval-X | 1640 | Python, C++, Java, JS, Go | "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X" [paper] [data] |
2023-08 | arXiv | G-TransEval | 4000 | C++, Java, C#, JS, Python | "On the Evaluation of Neural Code Translation: Taxonomy and Benchmark" [paper] [data] |
Date | Venue | Benchmark | Size | Language | Source |
---|---|---|---|---|---|
2014-07 | ISSTA 2014 | Defects4J | 357 | Java | "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs" [paper] [data] |
2015-12 | IEEE Trans. Software Engineering | ManyBugs/IntroClass | 185/998 | C | "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs" [paper] [data] |
2016-11 | FSE 2016 | BugAID | 105K | JS | "Discovering Bug Patterns in JavaScript" [paper] [data] |
2017-02 | AAAI 2017 | DeepFix | 6971 | C | "DeepFix: Fixing Common C Language Errors by Deep Learning" [paper] [data] |
2017-05 | ICSE-C 2017 | Codeflaws | 3902 | C | "Codeflaws: A Programming Competition Benchmark for Evaluating Automated Program Repair Tools" [paper] [data] |
2017-10 | SPLASH 2017 | QuixBugs | 80 | Java, Python | "QuixBugs: a multi-lingual program repair benchmark set based on the quixey challenge" [paper] [data] |
2018-12 | ACM Trans. Softw. Eng. Methodol. | BFP | 124K | Java | "An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation" [paper] [data] |
2019-01 | ICSE 2019 | unnamed | 21.8K * | Java | "On Learning Meaningful Code Changes via Neural Machine Translation" [paper] [data] |
2019-05 | MSR 2020 | ManySStuBs4J | 154K | Java | "How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset" [paper] [data] |
2019-11 | ASE 2019 | Refactory | 1783 | Python | "Re-factoring based program repair applied to programming assignments" [paper] [data] |
2020-07 | ISSTA 2020 | CoCoNut | 24M | Java, Python, C, JS | "CoCoNuT: combining context-aware neural translation models using ensemble for program repair" [paper] [data] |
2020-11 | ESEC/FSE 2020 | BugsInPy | 493 | Python | "BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies" [paper] [data] |
2021-07 | ICML 2021 | TFix | 105K | JS | "TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer" [paper] [data] |
2022-11 | ESEC/FSE 2022 | TypeBugs | 93 | Python | "PyTER: Effective Program Repair for Python Type Errors" [paper] [data] |
2023-08 | arXiv | HumanEvalPack | 984 | Python, JS, Go, Java, C++, Rust | "OctoPack: Instruction Tuning Code Large Language Models" [paper] [data] |
* This is a code-change dataset, and only a subset of it concerns bug fixing.
Date | Venue | Benchmark | Size | Language | Source |
---|---|---|---|---|---|
2016-08 | ACL 2016 | CODE-NN | 66K/32K | C#/SQL | "Summarizing Source Code using a Neural Attention Model" [paper] [data] |
2017-07 | IJCNLP 2017 | unnamed | 150K | Python | "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation" [paper] [data] |
2018-05 | ICPC 2018 | DeepCom | 588K | Java | "Deep code comment generation" [paper] [data] |
2018-07 | IJCAI 2018 | TL-CodeSum | 411K | Java | "Summarizing Source Code with Transferred API Knowledge" [paper] [data] |
2019-09 | arXiv | CodeSearchNet | 2.3M | Go, JS, Python, PHP, Java, Ruby | "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [paper] [data] |
2023-08 | arXiv | HumanEvalPack | 984 | Python, JS, Go, Java, C++, Rust | "OctoPack: Instruction Tuning Code Large Language Models" [paper] [data] |
Date | Venue | Benchmark | Size | Language | Source |
---|---|---|---|---|---|
2018-03 | WWW 2018 | StaQC | 148K/120K | Python/SQL | "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" [paper] [data] |
2018-05 | ICSE 2018 | DeepCS | 18.2M | Java | "Deep Code Search" [paper] [data] |
2018-05 | MSR 2018 | CoNaLa | 600K/2.9K | Python | "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow" [paper] [data] |
2019-08 | arXiv | unnamed | 287 | Java | "Neural Code Search Evaluation Dataset" [paper] [data] |
2019-09 | arXiv | CodeSearchNet | 2.3M/99 | Go, PHP, JS, Python, Java, Ruby | "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" [paper] [data] |
2020-02 | SANER 2020 | CosBench | 52 | Java | "Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries" [paper] [data] |
2020-08 | arXiv | SO-DS | 2.2K | Python | "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent" [paper] [data] |
2020-10 | ACM Trans. Knowl. Discov. Data | FB-Java | 249K | Java | "Deep Graph Matching and Searching for Semantic Code Retrieval" [paper] [data] |
2021-02 | NeurIPS Datasets and Benchmarks 2021 | AdvTest/WebQueryTest | 280K/1K | Python | "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" [paper] [data] |
2021-05 | ACL/IJCNLP 2021 | CoSQA | 21K | Python | "CoSQA: 20,000+ Web Queries for Code Search and Question Answering" [paper] [data] |
Date | Venue | Benchmark | Size | Language | Source |
---|---|---|---|---|---|
2019-12 | ESEC/FSE 2020 | TypeWriter OSS | 208K | Python | "TypeWriter: Neural Type Prediction with Search-based Validation" [paper] [data] |
2020-04 | PLDI 2020 | Typilus | 252K | Python | "Typilus: Neural Type Hints" [paper] [data] |
2020-04 | ICLR 2020 | LambdaNet | 300 * | TypeScript | "LambdaNet: Probabilistic Type Inference using Graph Neural Networks" [paper] [data] |
2021-04 | MSR 2021 | ManyTypes4Py | 869K | Python | "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference" [paper] [data] |
2022-10 | MSR 2022 | ManyTypes4TypeScript | 9.1M | TypeScript | "ManyTypes4TypeScript: a comprehensive TypeScript dataset for sequence-based type inference" [paper] [data] |
2023-02 | ECOOP 2023 | TypeWeaver | 513 * | TypeScript | "Do Machine Learning Models Produce TypeScript Types That Type Check?" [paper] [data] |
2023-03 | ICLR 2023 | BetterTypes4Py/InferTypes4Py | 608K/4.6K | Python | "TypeT5: Seq2seq Type Inference using Static Analysis" [paper] [data] |
2023-05 | arXiv | OpenTau | 744 * | TypeScript | "Type Prediction With Program Decomposition and Fill-in-the-Type Training" [paper] [data] |
* These are project counts.
Date | Venue | Benchmark | Size | Language | Source |
---|---|---|---|---|---|
2023-03 | arXiv | RepoEval | 1600/1600/373 * | Python | "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation" [paper] [data] |
2023-06 | arXiv | RepoBench | 890K/9M/43K | Python, Java | "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems" [paper] [data] |
2023-06 | arXiv | Stack-Repo | 816K | Java | "RepoFusion: Training Code Models to Understand Your Repository" [paper] [data] |
2023-09 | arXiv | CodePlan | 645/21 | C#/Python | "CodePlan: Repository-level Coding using LLMs and Planning" [paper] [data] ** |
2023-10 | arXiv | SWE-Bench | 2294 | Python | "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [paper] [data] |
2023-10 | arXiv | CrossCodeEval | 9928 | Python, Java, TypeScript, C# | "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion" [paper] [data] |
* Line Completion/API Invocation Completion/Function Completion
** This is the link given in the paper, but we are unable to access it at the time of writing.
30 papers as a primer on LLMs.
Date | Keyword | Paper | TL;DR |
---|---|---|---|
2014-09 | Attention | Neural Machine Translation by Jointly Learning to Align and Translate | The original attention, proposed for encoder-decoder RNN |
2015-08 | BPE | Neural Machine Translation of Rare Words with Subword Units | Byte-pair encoding: split rare words into subword units |
2017-06 | Transformer | Attention Is All You Need | Replace LSTM with self-attention for long-range dependency and parallel training |
2017-10 | Mixed Precision Training | Mixed Precision Training | Train in fp16 while keeping an fp32 master copy of weights, to save memory
2018-04 | GLUE | GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | A language understanding benchmark |
2018-06 | GPT | Improving Language Understanding by Generative Pre-Training | Pretraining-finetuning paradigm applied to Transformer decoder |
2018-10 | BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Masked Language Modeling (MLM) applied to Transformer encoder for pretraining |
2019-02 | GPT-2 | Language Models are Unsupervised Multitask Learners | GPT made larger (1.5B). They found language models implicitly learn about downstream tasks (such as translation) during pretraining. |
2019-05 | SuperGLUE | SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems | Another language understanding benchmark
2019-07 | RoBERTa | RoBERTa: A Robustly Optimized BERT Pretraining Approach | An optimized BERT |
2019-09 | Megatron-LM | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Model parallelism |
2019-10 | ZeRO | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | Memory-efficient distributed optimization |
2019-10 | T5 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Transformer encoder-decoder pretrained with an MLM-like denoising objective |
2020-05 | GPT-3 | Language Models are Few-Shot Learners | By training an even larger version of GPT-2 (175B), they discovered a new learning paradigm: In-Context Learning (ICL) |
2020-09 | MMLU | Measuring Massive Multitask Language Understanding | A world-knowledge and complex reasoning benchmark |
2020-12 | Pile | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | A diverse pretraining dataset |
2021-06 | LoRA | LoRA: Low-Rank Adaptation of Large Language Models | Memory-efficient finetuning |
2021-09 | FLAN | Finetuned Language Models Are Zero-Shot Learners | Instruction-finetuning |
2021-10 | T0 | Multitask Prompted Training Enables Zero-Shot Task Generalization | Also instruction finetuning, but applied to the much smaller T5 |
2021-12 | Gopher | Scaling Language Models: Methods, Analysis & Insights from Training Gopher | A 280B LLM with comprehensive experiments |
2022-01 | CoT | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Chain-of-Thought reasoning
2022-03 | InstructGPT | Training language models to follow instructions with human feedback | GPT-3 instruction finetuned with RLHF (reinforcement learning from human feedback) |
2022-03 | Chinchilla | Training Compute-Optimal Large Language Models | A smaller (70B) version of Gopher that's pretrained on more data |
2022-04 | PaLM | PaLM: Scaling Language Modeling with Pathways | The largest dense model ever (540B) |
2022-05 | 0-shot CoT | Large Language Models are Zero-Shot Reasoners | Tell LLMs to think step by step, and they can actually do it |
2022-06 | BIG Bench | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | Another world-knowledge and complex reasoning benchmark |
2022-06 | Emergent Ability | Emergent Abilities of Large Language Models | A review on emergent abilities |
2022-10 | Flan | Scaling Instruction-Finetuned Language Models | Consolidate all the existing instruction tuning datasets, and you get SOTA |
2022-11 | BLOOM | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | The largest open-source LLM, trained on 46 languages, with detailed discussion about training and evaluation |
2022-12 | Self-Instruct | Self-Instruct: Aligning Language Models with Self-Generated Instructions | Instruction tuning using LLM-generated data |
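As a companion to the Attention and Transformer entries above, a minimal NumPy sketch of scaled dot-product attention, the core operation of "Attention Is All You Need" (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    # Row-wise softmax, shifted by the max for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))  # 4 tokens, d_k=8
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```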
This list aims to provide the essential background for understanding current LLM technologies, and thus excludes more recent models such as LLaMA, GPT-4, or PaLM 2. For comprehensive reviews of these more general topics, we refer to other sources such as this paper or these repositories: Awesome-LLM, Awesome AIGC Tutorials. For specific domains, see Awesome Domain LLM, Awesome Tool Learning, and Awesome-LLM-MT.
If you find this repo or our survey helpful, please consider citing us:
@misc{zhang2023survey,
title={A Survey on Language Models for Code},
author={Ziyin Zhang and Chaoyu Chen and Bingchang Liu and Cong Liao and Zi Gong and Hang Yu and Jianguo Li and Rui Wang},
year={2023},
eprint={2311.07989},
archivePrefix={arXiv},
primaryClass={cs.CL}
}