[TOC]
ā LLM For Security ā Fuzzing (Concrete Execution)
AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing https://arxiv.org/abs/2409.10737 https://ui.adsabs.harvard.edu/abs/2023arXiv230712469Z/abstract
ęęÆčæå± | Prompt-Fuzzļ¼åŗäŗLLMēåŗęØ”ē³ęµčÆé©±åØčŖåØåēę https://mp.weixin.qq.com/s/V3OC6BbZuMC0eev53C_LWA Prompt Fuzzing for Fuzz Driver Generation https://arxiv.org/abs/2312.17677
CovRL-Fuzzļ¼åŗäŗå¤§ęØ”ååå¼ēJavaScriptč§£éåØęØ”ē³ęµčÆęęÆ | ęęÆčæå± https://mp.weixin.qq.com/s/6DaD7rw3t7TFYupTJoEtaw
ęęÆčæå± | MINERļ¼äøē§ēØäŗREST API樔ē³ęµčÆēę··åę°ę®é©±åØę¹ę³ https://mp.weixin.qq.com/s/u26iSavxWGswrcuLaaACKA
ęęÆčæå± | SDFUZZ:ē®ę ē¶ę驱åØēå®å樔ē³ęµčÆ https://mp.weixin.qq.com/s/zodAi0hD05HuIji6krz9TA
SEAMFUZZļ¼ē°ē樔ē³ęµčÆēå¦ä¹ ē§åčŖéåŗēŖåēē„ https://mp.weixin.qq.com/s/duEtfMfayefdBzlqm1rM_g
大樔åäøęØ”ē³ęµčÆčæč”ē»åēē 究论ęę±ę»ļ½ęęÆčæå± https://mp.weixin.qq.com/s/DL4pGH-7nPi3eSRD-rlD-w
2024äæ”ęÆå®å Øé¢åå大锶ä¼Fuzz论ęę±ę»ļ½ęęÆčæå± https://mp.weixin.qq.com/s/Dn5ooUahCT7IqXIMevjVug
2024幓软件巄ēØé”¶ä¼Fuzz论ęę±ę» https://mp.weixin.qq.com/s/-ZYX-G_jX1AN8lzbHvW21g
https://github.com/cuiao/SCU_ThesisDissertation_LaTeXTemplate ååøę„ęļ¼2016幓5ę30ę„
ä½č ę¬äŗŗęÆäøååå·å¤§å¦ēå¦ēļ¼åØå¦ä¹ ē§ē ę“»åØäøē»åøøéč¦čæč”å¦ęÆåä½ćåØä½æēØä¼ ē»ēåå¤ē软件ļ¼å¦_MicrosoftĀ® Word_ļ¼ę¶ļ¼ē±äŗå ¶åƹę°å¦ēē¹ę®éę±ēęÆęäøå¤å儽ļ¼å ę¤å¾å¾ä¼éå°åē§åę ·ēé®é¢ļ¼ę ę³é«ęå°åä½ćä½č å¶ē¶ę„触å°äŗLaTeXęēē³»ē»ļ¼å ¶å¼ŗå¤§ēåč½ćä¼ē¾ēę°å¦ęēå便å©ēčŖåØåå·„å ·ēä¼å¤ä¼ē¹ä½æäŗŗå°č±”ę·±å»ļ¼ē¹å«éåēå·„ē§å¦ēå¦ä¹ å使ēØćē»čæäŗč§£ļ¼åē°å½é ęå论ęäø»č¦ä½æēØLaTeXčæč”ęēļ¼äøå½å å¤č®øå¤é«ę ”åęä¾LaTeXēå¦ä½č®ŗę樔ēļ¼čęę ”åØčæę¹é¢ēåå±čæē„ę¾äøč¶³ćåØčæę ·ēåØęŗé©±ä½æäøļ¼ä½č åØå©ēØčŖå·±č¾äøŗåēŗ§ēLaTeXē„čÆļ¼åčäŗå京大å¦åå¦ē_pkuthss_樔ēåå ¶ä»ēøå ³ęē®ēåŗē”äøļ¼å¼åäŗ_scuthesis_čæäøŖéēØäŗåå·å¤§å¦ē ē©¶ēļ¼č½å å«ęę¬ē§é锹ļ¼ä½å°é¢ęŖäæ®ę¹ļ¼ä½æēØēLaTeXå¦ä½č®ŗę樔ēćåøęę¤ęØ”ēč½å¤ē»åä½åå¦ęä¾äøäøŖé¢å¤ēéę©ļ¼ęØ”ēäøč„ęēēµļ¼čæčÆ·åä½åå¦ę¹čÆęę£ļ¼ēčØćę°å»ŗäøäøŖ_ISSUSE_ę_FORK_äøäøŖę°åęÆäæ®ę¹ć
https://github.com/fatestigma/scuthesis ScuThsis ęÆäøäøŖéå®ę¹ēåå·å¤§å¦ę¬ē§ēęÆäøč®¾č®”论ę樔ęæļ¼ęØ”ęæēå¶ä½ē®ēęÆäøŗäŗč®©å¤§å®¶åÆä»„ę“容ęēå©ēØ LaTeX ęēå¼ęå¶ä½åŗęēē²¾ē¾ēęÆäøč®ŗęļ¼äøę„ęé«ęÆäøč®ŗęēęē蓨éļ¼äŗę„åÆä»„å°ę“å¤ēē²¾åę¾åØč®ŗęå 容ēč¾åŗäøć
åØ2016å±ęÆäøēäøå
±ę10å¤ä½åäøScuThesiså
ęµļ¼å¹¶ę4ä½č·å¾åå·å¤§å¦ę¬ē§ä¼ē§ęÆäøč®ŗęļ¼å
¶äø1ä½č¢«ęåļ¼č¦ę±ęäŗ¤docę ¼å¼ę攣ć
https://github.com/fanmcgrady/SCU-Thesis-LaTeX-Template 左大ē„ļ¼åå·å¤§å¦č®”ē®ęŗå¦é¢å·¦å¼å¤§ē„ļ¼åčÆęä»ēå士论ęęÆēØčŖå·±č®¾č®”ēLaTeX樔ęæåēć
é¢ē®ļ¼ćåŗå 蔨达å¼ē¼ēØę øåæęęÆē ē©¶ćļ¼ęå “č¶£ēåå¦åÆä»„ę读äøäø
čęäŗå·¦å¤§ē„ēå士论ę仄åę触å¾å¤ć äøŗäŗčæ½é大ē„ēčę„ļ¼ęēå士论ęå½ē¶ä¹ē»åƹäøč½ēØWordäŗć åå·å¤§å¦ē ē©¶ēé¢ęä¾äŗWordēę¬ē论ęę ¼å¼ęØ”ęæļ¼ å¹¶ę²”ęęä¾ēøå ³ēå¦ä½č®ŗęLaTeX樔ęæć åØGithubäøę¾å°åä½č cuiaoē锹ē®_scuthesis_ć ę¬é”¹ē®ęÆåØåä½č ēåŗē”äøčæč”äŗäøäŗäæ®ę¹ļ¼č½å¤ę»”č¶³åå·å¤§å¦å¦ä½č®ŗęēę ¼å¼č¦ę±ć åøęę¬é”¹ē®č½č®©å¤§å®¶ę“å äøę³Øäŗå¦ä½č®ŗęēåä½ļ¼
https://github.com/SunnyHaze/scu-thesis-template?tab=readme-ov-file
- ę¬Word樔ęæä»åŗé¾ę„: https://github.com/SunnyHaze/scu-thesis-template
- åč¾å¤§ä½¬ęä¾ēLatex樔ęæļ¼https://github.com/fatestigma/scuthesis
- å¦äøäøŖåč¾å¤§ä½¬ęä¾ēLatex樔ęæļ¼https://github.com/dahakawang/scu_thesis_template
- å¦äøäøŖå¤§ä½¬ęä¾ēå¦ä½č®ŗęLatex樔ęæ: https://github.com/cuiao/SCU_ThesisDissertation_LaTeXTemplate
å½å®¶ę åćäæ”ęÆäøęē® čµęŗęčæ°ć ē±TC4ļ¼å Øå½äæ”ęÆäøęē®ę ååęęÆå§åä¼ļ¼å½å£ ļ¼äø»ē®”éØéØäøŗå½å®¶ę åå§ć
äø»č¦čµ·čåä½ äøå½ē§å¦é¢ęē®ę ę„äøåæ ćå½å®¶å¾ä¹¦é¦ ćå京大å¦å¾ä¹¦é¦ ćęø å大å¦å¾ä¹¦é¦ ćåäøåøč大å¦å¾ä¹¦é¦ ć广äøēē«äøå±±å¾ä¹¦é¦ ćäøęµ·å¾ä¹¦é¦ ćäøå½ē§å¦ęęÆäæ”ęÆē ē©¶ę ćé¦é½å¾ä¹¦é¦ ćäøå¤®é³ä¹å¦é¢å¾ä¹¦é¦ ćäøå½ē¤¾ä¼ē§å¦é¢å¾ä¹¦é¦ ć
äø»č¦čµ·čäŗŗ å®ę ćēę“ ćå»ē½ē½ ćęØę § ćč”å°č ćé²å½å¼ŗ ćęÆåę ćęÆé å ćēŗŖéę© ćę±å¦å ćéę„ ćå¼ åØ ćēå®å® ćę²ę£å ćč¢ēēŗ¢ ćé»äø½å©· ć
LLM驱åØē¼ŗé·ę£ęµēäøč¬ęµēØ LiĀ Y,Ā YangĀ WZ,Ā ZhangĀ Y,Ā XueĀ YX.Ā SurveyĀ onĀ FuzzingĀ BasedĀ onĀ LargeĀ LanguageĀ Model.Ā RuanĀ JianĀ XueĀ Bao/Journal ofĀ SoftwareĀ (inĀ Chinese).Ā http://www.jos.org.cn/1000-9825/7323.htm
This is basically a work-flow design contest š
https://aicyberchallenge.com/ DARPA's Artificial Intelligence Cyber Challenge (AIxCC)
Huang, Linghan, Peizhou Zhao, Huaming Chen, and Lei Ma. "Large language models based fuzzing techniques: A survey."Ā arXiv preprint arXiv:2402.00350 (2024).
https://arxiv.org/abs/2402.00350
Huang, Linghan, Peizhou Zhao, Huaming Chen, and Lei Ma. "Large language models based fuzzing techniques: A survey."Ā arXiv preprint arXiv:2402.00350 (2024).
https://arxiv.org/abs/2402.00350
LiĀ Y,Ā YangĀ WZ,Ā ZhangĀ Y,Ā XueĀ YX.Ā SurveyĀ onĀ FuzzingĀ BasedĀ onĀ LargeĀ LanguageĀ Model.Ā RuanĀ JianĀ XueĀ Bao/Journal ofĀ SoftwareĀ (inĀ Chinese).Ā http://www.jos.org.cn/1000-9825/7323.htm
Zhang, C., Zheng, Y., Bai, M., Li, Y., Ma, W., Xie, X., Li, Y., Sun, L., & Liu, Y. (2024). How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1223ā1235. https://doi.org/10.1145/3650212.3680355
Essentially, a fuzz driver is a piece of code responsible for accepting mutated input from fuzzers and executing the APIs accordingly. An effective driver must contain a correct and robust API usage since incorrect or unsound usage can result in extensive false positive or negative fuzzing results, incurring extra manual validation efforts or testing resources waste. Due to the high standard required, fuzz drivers are typically written by human experts, which is a labor-intensive and time-consuming process. For instance, š OSS-Fuzz, the largest fuzzing framework for open-source projects, maintains thousands of fuzz drivers written by hundreds of contributors over the past seven years.
Fuzz Driver Basics. The key components of a fuzz driver are illustrated in Figure 1. A typical fuzz driver has three necessary parts: prerequisites initialization (line 3), execution (line 4), and post-cleaning (line 7). Besides, there are three optional parts commetned in lines 2, 5, and 6 that can improve a driverās effectiveness. Line 2 part improves a driver by proper input arrangement such as rejecting too short or too long inputs, interpreting input data as multiple testing arguments, etc. Line 5 part enables a driver to call more APIs which triggers more program behaviors during fuzzing. Finally, line 6 part adds semantic oracles for detecting logical bugs. These oracles are similar to assert statements in unit tests, aborting execution when certain program properties are unsatisfied. Since a driver will be repeatedly executed with randomly mutated input, there is a high requirement on its correctness and robustness. Incorrect or unrobust usage can lead to both false positives and negatives. For instance, if a driver failed to feed the mutated data into the API, its fuzzing can never find any bug. Or if an API argument is incorrectly initialized, false crashes may be raised.
Minimal Requirements of Effective Fuzz Drivers. The minimal requirements covers the line 3,4, and 7 of Figure 1, which mainly include correctly initializing the arguments and satisfying necessary control flow dependencies. Argument initialization can be one of the following cases (in the order of simplicity): ā¶ C1: If the argument value can be any value or should be naive values like 0 or NULL, a variable is declared or a literal constant is used directly; ā· C2: If the argument is supposed to be a macro or a global variable that is already defined in common libraries or the target APIās project, it is located and used; āø C3: If creating the argument requires the use of common library APIs, such as creating a file and writing specific content, common practices are followed; ā¹ C4: If initializing the argument requires the output of other APIs within the project, those APIs are initialized first following the above initialization cases.
Lyu, Y., Xie, Y., Chen, P., & Chen, H. (2024). Prompt Fuzzing for Fuzz Driver Generation (No. arXiv:2312.17677). arXiv. https://doi.org/10.48550/arXiv.2312.17677
Xu, H., Ma, W., Zhou, T., Zhao, Y., Chen, K., Hu, Q., Liu, Y., & Wang, H. (2024). CKGFuzzer: LLM-Based Fuzz Driver Generation Enhanced By Code Knowledge Graph (No. arXiv:2411.11532). arXiv. https://doi.org/10.48550/arXiv.2411.11532
Sun, Y. (2024). Automated Generation and Compilation of Fuzz Driver Based on Large Language Models. Proceedings of the 2024 9th International Conference on Cyber Security and Information Engineering, 461ā468. https://doi.org/10.1145/3689236.3689272
Murali, A., Mathews, N., Alfadel, M., Nagappan, M., & Xu, M. (2024). FuzzSlice: Pruning False Positives in Static Analysis Warnings through Function-Level Fuzzing. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1ā13. https://doi.org/10.1145/3597503.3623321
We propose FuzzSlice, a novel framework that automatically prunes possible false positives among static analysis warnings. Unlike prior work that mostly focuses on confirming true positives among static analysis warnings, which inevitably requires end-to-end fuzzing, FuzzSlice focuses on ruling out potential false positives, which are the majority in static analysis reports. The key insight that we base our work on is that a warning that does not yield a crash when fuzzed at the function level in a given time budget is a possible false positive. To achieve this, FuzzSlice first aims to generate compilable code slices at the function level. Then, FuzzSlice fuzzes these code slices instead of the entire binary to prune possible false positives. FuzzSlice is also unlikely to misclassify a true bug as a false positive because the crashing input can be reproduced by a fuzzer at the function level as well.
We evaluate FuzzSlice on the Juliet synthetic dataset and real-world complex C projects:
openssl,tmuxandopenssh-portable. Our evaluation shows that the ground truth in the Juliet dataset had 864 false positives which were all detected by FuzzSlice. For the open-source repositories, we were able to get the developers from two of these open-source repositories to independently label these warnings. FuzzSlice automatically identifies 33 out of 53 false positives confirmed by developers in these two repositories. This implies that FuzzSlice can reduce the number of false positives by 62.26% in the open-source repositories and by 100% in the Juliet dataset.
Xia, C. S., Paltenghi, M., Le Tian, J., Pradel, M., & Zhang, L. (2024). Fuzz4All: Universal Fuzzing with Large Language Models. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1ā13. https://doi.org/10.1145/3597503.3639121
This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java, and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 98 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 64 bugs already confirmed by developers as previously unknown.
Sheng, Z., Xu, Q., Huang, J., Woodcock, M., Huang, H., Donaldson, A. F., Gu, G., & Huang, J. (2025). All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching (No. arXiv:2509.07225). arXiv. https://doi.org/10.48550/arXiv.2509.07225
Our team, All You Need Is A Fuzzing Brain, was one of seven finalists in DARPAās Artificial Intelligence Cyber Challenge (AIxCC), placing fourth in the final round. During the competition, we developed a Cyber Reasoning System (CRS) that autonomously discovered 28 security vulnerabilitiesāincluding six previously unknown zero-daysāin real-world open-source C and Java projects, and successfully patched 14 of them. The complete CRS is open source at github.com/o2lab/afc-crs-all-you-need-is-a-fuzzing-brain. This paper provides a detailed technical description of our CRS, with an emphasis on its LLM-powered components and strategies. Building on AIxCC, we further introduce a public leaderboard for benchmarking state-of-the-art LLMs on vulnerability detection and patching tasks, derived from the AIxCC dataset. The leaderboard is available at o2lab.github.io/FuzzingBrain-Leaderboard.
![]()
Yang, Y., Yao, S., Chen, J., & Lee, W. (n.d.). Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving. https://www.usenix.org/conference/usenixsecurity25/presentation/yang-yupeng This paper is included in the Proceedings of the 34th USENIX Security Symposium.
More quotes can be found on below "Complex Input Constraints"
Fuzzing feeds random input to a program and checks for faulty behaviors [21, 22, 35, 70, 91]. Modern fuzzers rely on random mutations (e.g., bit-flips) and coverage feedback to produce inputs covering more code regions. However, naive random mutations perform badly at exploring deep logic behind complex constraints. To overcome this problem, researchers have proposed generic and target-specific solutions.
- Generic Solutions. Generic solutions aim at solving complex constraints without target-specific heuristics. The typical approach is to track the relationship between the input and the evaluation of the constraints. Taint tracking and symbolic/concolic execution fall into this category. Taint tracking enables fuzzers [24, 31, 51, 63] to learn which parts of the input affect constraint evaluation. Afterward, fuzzers focus their mutations on the relevant parts to increase the chances of solving the constraints. Symbolic execution replaces the input with symbolic values and then tracks computations stemming from them. Then, constraint evaluation can be represented in SMT formulas, and SMT solvers [17, 74] help determine the exact inputs to solve them. In the context of fuzzing, symbolic execution is often enhanced with concolic execution [33, 59, 90], where concrete values and execution paths from fuzzers help speed up SMT solving and path exploration. Additionally, approaches like REDQUEEN [5] and others [10, 57, 81] aim to reduce constraint tracking overhead through approximation.
- Unfortunately, these solutions fall short when programs become as complex as language processors. Taint tracking fails, because just after the first stage (i.e., tokenization in Fig.1), every input byte is applied some constraints and gets tainted, rendering taint tracking pointless. Symbolic and concolic fuzzing fail for several reasons. First, they require heavy engineering to model every operation semantics in the program (e.g., native instruction semantics [90] or IR semantics [9, 12, 59, 78]) to track symbol propagation correctly. This process also faces difficulties in operations commonly seen in language processors (e.g., symbolized pointers [12]). To demonstrate, Fig.2 shows a simplified language processor for C++. Line 27 contains the code region that is only triggered when the input source contains structs inside functions. Our experiments show that state-of-the-art symbolic/concolic execution engines, including
SymCC[59],Angr[78], andSymQemu[60], cannot track the input constraint leading to line 27 correctly due to incomplete semantics modeling. Second, even when semantics modeling is complete (via heavy manual effort), the enormous number of control paths in early language processor stages can cause path explosion for symbolic execution and over-constraining [12] for concolic execution. Additionally, certain stages might contain logic that defeats symbolic solvers, such as hash-based tokenization that cannot be inverted. Researchers have proposed symbolizing tokens instead of input bytes to mitigate this problem for Javascript [27]. However, its generalizability to other languages remains unclear [88], and it only mitigates the problem in the first stage, which does not handle constraints in later stages (i.e., semantics checking and code generation).
- Unfortunately, these solutions fall short when programs become as complex as language processors. Taint tracking fails, because just after the first stage (i.e., tokenization in Fig.1), every input byte is applied some constraints and gets tainted, rendering taint tracking pointless. Symbolic and concolic fuzzing fail for several reasons. First, they require heavy engineering to model every operation semantics in the program (e.g., native instruction semantics [90] or IR semantics [9, 12, 59, 78]) to track symbol propagation correctly. This process also faces difficulties in operations commonly seen in language processors (e.g., symbolized pointers [12]). To demonstrate, Fig.2 shows a simplified language processor for C++. Line 27 contains the code region that is only triggered when the input source contains structs inside functions. Our experiments show that state-of-the-art symbolic/concolic execution engines, including
- Target-specific Solutions. Researchers have proposed solutions to overcome complex constraints in specific language processors.
Csmith[88] generates C programs avoiding undefined behaviors. DIE [55] andFuzzilli[30] maintain Javascript semantics to reach deep just-in-time (JIT) regions. These approaches require heavy effort to design target-specific heuristics, making them un-generalizable. Furthermore, researchers have proposed annotation-based approaches [11, 89]. Still, they require expert knowledge of the annotation language and an understanding of the target constraints, making them non-trivial to apply to new targets. Additionally, token-level fuzzing [67] and grammar-level fuzzing [4, 80] have been proposed. They are semi-generic because they do not target a specific language due to the wide availability of tokenizers and grammar. However, they only try to overcome constraints in the first two stages (see Fig.1), leaving the later stages unhandled.
Yu, J., Shao, Y., Miao, H., & Shi, J. (2025). PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs (No. arXiv:2409.14729). arXiv. https://doi.org/10.48550/arXiv.2409.14729
In this paper, we propose
PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing,PROMPTFUZZselects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLMās resilience.PROMPTFUZZoperates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high quality prompt injections. UsingPROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts.
By deploying the generated attack prompts from
PROMPTFUZZin a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours, demonstratingPROMPTFUZZās effectiveness compared to experienced human attackers. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness,PROMPTFUZZcontinues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks.
Deng, Y., Xia, C. S., Yang, C., Zhang, S. D., Yang, S., & Zhang, L. (2024). Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1ā13. https://doi.org/10.1145/3597503.3623343
This paper proposes FuzzGPT, the first approach to guiding LLMs to directly synthesize unusual input programs for effective fuzzing. While the recent TitanFuzz work samples programs from the natural probability distribution encoded in pre-trained LLMs, FuzzGPT aims to shift the distribution towards unusual programs in the search space to explore code paths rarely hit by ordinary programs. FuzzGPT is mainly built on the well-known hypothesis that historical bug-triggering programs may include edge-case/valuable code ingredients important for bug finding. ... To implement FuzzGPT, we first construct a dataset of bug-triggering code snippets by mining bug reports from open-source repositories of the target DL libraries. Built on this dataset, FuzzGPT includes the following strategies. (1) In-context learning [37]: we provide LLMs with either a few examples of historical bug-triggering programs (few-shot learning [7, 51]) or a partial bug-triggering program (zeroshot learning [47, 58]) to either generate a new code snippet or to autocomplete the partial code. (2) Fine-tuning [50]: we modify the model weights by training on the extracted historical bug-triggering programs to obtain fine-tuned LLMs that are specially designed to generate similar bug-triggering code snippets. From both learning strategies, FuzzGPT can prime the LLMs to generate bug-triggering programs by capturing code ingredients within either the local context examples or fine-tuning dataset.
Yang, Y., Yao, S., Chen, J., & Lee, W. (n.d.). Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving.
To address the problem of complex input constraints, researchers have proposed two categories of solutions: generic solutions and target-specific solutions.
- Generic solutions aim at solving complex constraints without target-specific heuristics [3, 5, 57, 71, 90]. The typical approach is to track the relationship between the input and the evaluation of branching conditions within the program. Taint tracking and concolic execution fall into this category. Taint tracking enables fuzzers [24, 31, 51, 63] to learn which parts of the input affect the branching condition checks. Afterward, fuzzers can focus the mutation on the relevant parts of the input to increase the chances of passing such checks. Concolic execution combines a fuzzer with a symbolic solver [9, 12, 59, 78, 90], which replaces the input with symbolic values and then tracks computations stemming from them. Afterward, branching conditions can be represented in SMT formulas and solved by SMT solvers [17, 74].
- Unfortunately, these solutions fall short on programs as complex as language processors. Taint tracking would fail after the tokenization stage, which applies constraints on every input byte, rendering taint tracking pointless. Concolic execution would fail due to incomplete semantics modeling or over-constraining [12].
- To deal with such limitations, researchers have proposed target-specific solutions for specific language processors [30, 55, 88, 97]. Although these approaches show effectiveness on their respective targets, they require heavy manual effort to embed target-specific heuristics into the fuzzer, making them ungeneralizable to other targets. Further, researchers have proposed annotation-based approaches [11, 89]. Still, they require expert knowledge of the annotation language and an understanding of the targetās input constraints, making them non-trivial to apply to new targets.
- Semi-generic approaches, such as token-level [67] and grammar-level [4, 80] fuzzing, have been proposed for language processors. Due to the wide availability of tokenizers and grammar, these approaches can be generalized. However, they only aim to overcome lexical and syntax constraints, which are still ineffective in exploring deeper logics (e.g., in optimization and execution stages) behind semantic constraints [11, 55]. To date, we face a dilemma where generic solutions fail to effectively handle language processors while target-specific solutions struggle to generalize.
... (They give a more detailed version of these common solutions later. I quote them at the section under heading of "Language Processors (Compilers, Interpreters, ...)")
(They decided to use LLM to solve this problem. But first, here are the challenges of an LLM-based approach.)
We identify two unique challenges that must be addressed to make an LLM-based approach practical.
- First, it is unclear how to prioritize constraints to maximize coverage gain. Modern language processors easily contain up to millions of lines of code (see Table 4). Since inference with LLMs is not free [85, 86], it is important to prioritize the constraints to solve. However, constraint prioritization is non-trivial. An ideal prioritization strategy should not only favor those having greater potential to reach more code regions, but also account for the solving difficulty for LLMs. This is because we notice that LLMs may easily solve some constraints, but fail entirely with others due to lacking reasoning capabilities or inherently unsolvable problems (e.g., one-way functions). To date, effectively satisfying these two properties to maximize coverage gain remains an open challenge.
- Second, it is non-trivial to decide the degree of context presentation. To instruct an LLM to solve a constraint, we need to provide the constraint context (e.g., source code) for it to reason about. Presenting the entire code base is impractical, considering the magnitude of modern language processor code bases. Presenting execution traces or slices is not ideal either because their sizes can still be huge, with total lines of code easily exceeding 10k in our experiments. On the flip side, presenting only a narrow context (e.g., the function enclosing the constraintās evaluation point) can miss necessary information for the LLM to understand the full picture of the constraint. Therefore, it remains unclear how to present sufficient and concise context to achieve both effectiveness and efficiency.
... (Later they give a more detailed explanation of above-mentioned common solutions and challenges in applying LLM to solve the problem of complex constraints)
- Constraint Prioritization. The first challenge we identify is it is unclear how to prioritize unsolved constraints to maximize coverage gain. Language processors can contain up to millions of lines of code (see Table 4). It is important to prioritize the unsolved constraints to solve (i.e., the uncovered lines to cover), because inference with LLMs is not free [85, 86], and it is by far impractical to ask LLMs to reason about all unsolved constraints. However, constraint prioritization is non-trivial. An ideal prioritization strategy should satisfy the following two properties. First, it should prioritize the constraints having greater potential to reach more code regions if solved. For instance, in Fig.2, consider a scenario where line 20 and line 27 are both uncovered. Since line 27 can lead to additional uncovered regions, such as deeper functions for processing local structs, while line 20 merely bails out, prioritizing reaching line 27 is more effective. Second, prioritization should account for the solving difficulty. LLMs may easily solve some constraints but fail entirely with others. Solving success rates can drop when constraints involve complex logic beyond the LLMās reasoning capabilities, or when they involve inherently hard-to-solve problems (e.g., one-way functions). To remain cost-effective, constraints that LLMs struggle to solve should be deprioritized, even if they have the potential to cover more code regions. To date, effectively satisfying these two properties to maximize coverage gain remains an open challenge.
- Context Granularity. After we decide which constraint to solve, the next step is to collect relevant context, structure it into a prompt, and query the LLM for results. To work with LLMs, we think the best context for a constraint is the relevant source code. However, it is non-trivial to decide the source granularity to present to the LLM. Presenting the entire code base is not cost-effective, considering the magnitude of modern language processor code bases (see Table 4). A smarter approach would be using dynamic program slicing [87, 94, 95] to extract only code relevant to the branching condition (which guards uncovered code regions) by setting the branching condition as the slicing criterion. However, we notice this approach is not ideal either. The dynamic slice of a language processor can still be prohibitively huge. In fact, setting any branching condition in one stage (see Fig.1) as the slicing criterion should result in the majority of all previous stages being included in the dynamic slice. For instance, tokenization and parsing stages are both relevant to any branching condition in the code generation stage, as inputs failing to be tokenized or parsed can never trigger code generation. It is not ideal to present the entirety of the tokenizer and parser as context for constraints in the code generation stage. Additionally, even if the cost budget and LLM context window size support large slices, presenting redundant information can cause more hallucinations [6]. On the flip side, presenting only a narrow context, such as the function containing the uncovered code region (i.e., the enclosing function), can miss necessary information for understanding the constraint clearly. For instance, prior to reaching the enclosing function, there could be early constraint checks that reject the input. To conclude, how to present sufficient and concise context to achieve both effectiveness and efficiency remains unclear.
... (They give their solutions as following)
We propose two solutions to solve the two aforementioned challenges respectively, hybrid centrality prioritization and iterative context construction. Next, we discuss our insights.
- Hybrid Centrality Prioritization. As discussed in §2.4, an ideal prioritization algorithm should (i) prioritize constraints having greater potential to trigger more code regions and (ii) deprioritize constraints that LLMs may not be able to solve.
- To address (i), in the field of fuzzing seed scheduling, researchers have faced similar challenges in measuring the potential to reach more code regions [7, 57, 69]. An effective solution is graph centrality analysis, a method to estimate a nodeās influence in a graph. Researchers have shown that after building the inter-procedural Control Flow Graph (CFG), centrality analysis can effectively measure one basic blockās potential to reach more other basic blocks [69]. Inspired by this, we extend the existing approach and design a centrality analysis mechanism to prioritize unsolved constraints.
- To address (ii), we need a way to measure the constraint solving difficulty. The first approach we consider is using graph complexity analysis [98] to approximate constraint complexity, which seems to correlate with solving difficulty. However, in reality, we notice that pretrained LLMs are capable of picking up hints in the context that immediately enable them to solve the constraints, without thoroughly analyzing the constraints in detail. For instance, in Fig.2, imagine a scenario where lines after line 20 are not covered due to failure to pass the constraints evaluated at line 19. In our experiment, LLMs (e.g., GPT-4o-mini) can draw insights from just the function names (e.g., ts_node_has_error, tree_sitter_cpp) to realize that the input should be a syntax-correct C++ source, and then proceed to generate an input that successfully triggers line 21. However, graph complexity analysis would consider the parsing stage as a complex operation and neglect such hints. Moreover, asking the LLMs themselves to score each constraint isnāt viable because it is neither cost-effective nor reliable (because it is hard to verify the scores). Instead, we propose approximating solving difficulty using past solving history: the more LLMs fail to solve a certain type of constraint, the harder it is for LLMs to solve similar constraints in the future. Finally, we use the approximated difficulty to adjust the centrality score, so that the final scores start high for constraints with higher centrality scores but are scaled down later if the constraints appear hard to solve.
- Iterative Context Construction. Considering the āpicking up hintsā capability discussed above, we deem it impractical to decide how much context is sufficient for an LLM a priori. Instead, we draw insights from recent works on LLMs for software engineering [19, 93], which show that LLMs can improve and correct their code generation results iteratively with feedback information. ==To provide a sufficient and concise context, we start by presenting the LLM with a narrow context around the constraint evaluation point (i.e., the enclosing function). We then instruct the LLM to either solve the constraint or request additional context (e.g., symbol definitions).== This approach allows us to leverage dynamic and static analysis techniques to iteratively expand the context. If the LLM provides a solution, we verify its correctness. In case of failure, the execution trace is analyzed to identify a failing pointās context, which is then fed back to the LLM. When the LLM requests more context, source code analysis tools [50] are employed to supply the requested context. Through this process, we gradually construct the context until reaching a successful solution or an iteration threshold.







