LLM & Fuzzing

[TOC]

Res

Paper Reading List

AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing https://arxiv.org/abs/2409.10737 https://ui.adsabs.harvard.edu/abs/2023arXiv230712469Z/abstract

技术进展 | Prompt-Fuzz：基于LLM的库模糊测试驱动自动化生成 https://mp.weixin.qq.com/s/V3OC6BbZuMC0eev53C_LWA Prompt Fuzzing for Fuzz Driver Generation https://arxiv.org/abs/2312.17677

CovRL-Fuzz：基于大模型变异的JavaScript解释器模糊测试技术 | 技术进展 https://mp.weixin.qq.com/s/6DaD7rw3t7TFYupTJoEtaw

技术进展 | MINER：一种用于REST API模糊测试的混合数据驱动方法 https://mp.weixin.qq.com/s/u26iSavxWGswrcuLaaACKA

技术进展 | SDFUZZ:目标状态驱动的定向模糊测试 https://mp.weixin.qq.com/s/zodAi0hD05HuIji6krz9TA

SEAMFUZZ：灰盒模糊测试的学习种子自适应突变策略 https://mp.weixin.qq.com/s/duEtfMfayefdBzlqm1rM_g

大模型与模糊测试进行结合的研究论文汇总｜技术进展 https://mp.weixin.qq.com/s/DL4pGH-7nPi3eSRD-rlD-w

2024信息安全领域四大顶会Fuzz论文汇总｜技术进展 https://mp.weixin.qq.com/s/Dn5ooUahCT7IqXIMevjVug

2024年软件工程顶会Fuzz论文汇总 https://mp.weixin.qq.com/s/-ZYX-G_jX1AN8lzbHvW21g

Graduation Thesis

Thesis Template

https://github.com/cuiao/SCU_ThesisDissertation_LaTeXTemplate 发布日期：2016年5月30日

作者本人是一名四川大学的学生，在学习科研活动中经常需要进行学术写作。在使用传统的字处理软件（如_Microsoft® Word_）时，由于其对数学等特殊需求的支持不够友好，因此往往会遇到各种各样的问题，无法高效地写作。作者偶然接触到了LaTeX排版系统，其强大的功能、优美的数学排版和便利的自动化工具等众多优点使人印象深刻，特别适合理工科学生学习和使用。经过了解，发现国际期刊论文主要使用LaTeX进行排版，且国内外许多高校均提供LaTeX的学位论文模版，而我校在这方面的发展还略显不足。在这样的动机驱使下，作者在利用自己较为初级的LaTeX知识，参考了北京大学同学的_pkuthss_模版和其他相关文献的基础上，开发了_scuthesis_这个适用于四川大学研究生（虽包含有本科选项，但封面未修改）使用的LaTeX学位论文模版。希望此模版能够给各位同学提供一个额外的选择，模版中若有瑕疵，还请各位同学批评指正，留言、新建一个_ISSUSE_或_FORK_一个新分支修改。

https://github.com/fatestigma/scuthesis ScuThsis 是一个非官方的四川大学本科生毕业设计论文模板，模板的制作目的是为了让大家可以更容易的利用 LaTeX 排版引擎制作出排版精美的毕业论文，一来提高毕业论文的排版质量，二来可以将更多的精力放在论文内容的输出上。

在2016届毕业生中共有10多位参与ScuThesis内测，并有4位获得四川大学本科优秀毕业论文，其中1位被打回，要求提交doc格式文档。

https://github.com/fanmcgrady/SCU-Thesis-LaTeX-Template 左大神（四川大学计算机学院左劼大神）告诉我他的博士论文是用自己设计的LaTeX模板写的。

题目：《基因表达式编程核心技术研究》，有兴趣的同学可以拜读一下

膜拜了左大神的博士论文以后感触很多。为了追逐大神的脚步，我的博士论文当然也绝对不能用Word了。四川大学研究生院提供了Word版本的论文格式模板，并没有提供相关的学位论文LaTeX模板。在Github上找到原作者cuiao的项目_scuthesis_。本项目是在原作者的基础上进行了一些修改，能够满足四川大学学位论文的格式要求。希望本项目能让大家更加专注于学位论文的写作！

https://github.com/SunnyHaze/scu-thesis-template?tab=readme-ov-file

本Word模板仓库链接: https://github.com/SunnyHaze/scu-thesis-template
前辈大佬提供的Latex模板：https://github.com/fatestigma/scuthesis
另一个前辈大佬提供的Latex模板：https://github.com/dahakawang/scu_thesis_template
另一个大佬提供的学位论文Latex模板: https://github.com/cuiao/SCU_ThesisDissertation_LaTeXTemplate

Survey

Translation

Graduation Paper

Links

国家标准《信息与文献资源描述》由TC4（全国信息与文献标准化技术委员会）归口，主管部门为国家标准委。

主要起草单位中国科学院文献情报中心、国家图书馆、北京大学图书馆、清华大学图书馆、华东师范大学图书馆、广东省立中山图书馆、上海图书馆、中国科学技术信息研究所、首都图书馆、中央音乐学院图书馆、中国社会科学院图书馆。

主要起草人宋文、王洋、喻爽爽、杨慧、胡小菁、鲍国强、毛凌文、毛雅君、纪陆恩、朱学军、魏来、张娟、王宁宁、沈正华、袁玉红、黄丽婷。

Intro

LLM-Based Fuzzing Procedure

LLM驱动缺陷检测的一般流程 Li Y, Yang WZ, Zhang Y, Xue YX. Survey on Fuzzing Based on Large Language Model. Ruan Jian Xue Bao/Journal of Software (in Chinese). http://www.jos.org.cn/1000-9825/7323.htm

LLM-Driven Test Cases Generation

LLM-Driven Bug/Vulnerability Discovery

LLM-Driven Post-Fuzzing Processing

LLM-Based Fuzzing (Integrating LLM Into Fuzzing Work-Flow 😅)

This is basically a work-flow design contest 😅

https://aicyberchallenge.com/ DARPA's Artificial Intelligence Cyber Challenge (AIxCC)

Surveys & Overview

Huang, Linghan, Peizhou Zhao, Huaming Chen, and Lei Ma. "Large language models based fuzzing techniques: A survey." arXiv preprint arXiv:2402.00350 (2024). https://arxiv.org/abs/2402.00350

Li Y, Yang WZ, Zhang Y, Xue YX. Survey on Fuzzing Based on Large Language Model. Ruan Jian Xue Bao/Journal of Software (in Chinese). http://www.jos.org.cn/1000-9825/7323.htm

Fuzzers by PUT

Fuzzing via Library API & Fuzz Driver Generation

Zhang, C., Zheng, Y., Bai, M., Li, Y., Ma, W., Xie, X., Li, Y., Sun, L., & Liu, Y. (2024). How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1223–1235. https://doi.org/10.1145/3650212.3680355

Essentially, a fuzz driver is a piece of code responsible for accepting mutated input from fuzzers and executing the APIs accordingly. An effective driver must contain a correct and robust API usage since incorrect or unsound usage can result in extensive false positive or negative fuzzing results, incurring extra manual validation efforts or testing resources waste. Due to the high standard required, fuzz drivers are typically written by human experts, which is a labor-intensive and time-consuming process. For instance, 🔗 OSS-Fuzz, the largest fuzzing framework for open-source projects, maintains thousands of fuzz drivers written by hundreds of contributors over the past seven years.

Fuzz Driver Basics. The key components of a fuzz driver are illustrated in Figure 1. A typical fuzz driver has three necessary parts: prerequisites initialization (line 3), execution (line 4), and post-cleaning (line 7). Besides, there are three optional parts commetned in lines 2, 5, and 6 that can improve a driver’s effectiveness. Line 2 part improves a driver by proper input arrangement such as rejecting too short or too long inputs, interpreting input data as multiple testing arguments, etc. Line 5 part enables a driver to call more APIs which triggers more program behaviors during fuzzing. Finally, line 6 part adds semantic oracles for detecting logical bugs. These oracles are similar to assert statements in unit tests, aborting execution when certain program properties are unsatisfied. Since a driver will be repeatedly executed with randomly mutated input, there is a high requirement on its correctness and robustness. Incorrect or unrobust usage can lead to both false positives and negatives. For instance, if a driver failed to feed the mutated data into the API, its fuzzing can never find any bug. Or if an API argument is incorrectly initialized, false crashes may be raised.

Minimal Requirements of Effective Fuzz Drivers. The minimal requirements covers the line 3,4, and 7 of Figure 1, which mainly include correctly initializing the arguments and satisfying necessary control flow dependencies. Argument initialization can be one of the following cases (in the order of simplicity): ❶ C1: If the argument value can be any value or should be naive values like 0 or NULL, a variable is declared or a literal constant is used directly; ❷ C2: If the argument is supposed to be a macro or a global variable that is already defined in common libraries or the target API’s project, it is located and used; ❸ C3: If creating the argument requires the use of common library APIs, such as creating a file and writing specific content, common practices are followed; ❹ C4: If initializing the argument requires the output of other APIs within the project, those APIs are initialized first following the above initialization cases.

Lyu, Y., Xie, Y., Chen, P., & Chen, H. (2024). Prompt Fuzzing for Fuzz Driver Generation (No. arXiv:2312.17677). arXiv. https://doi.org/10.48550/arXiv.2312.17677

Xu, H., Ma, W., Zhou, T., Zhao, Y., Chen, K., Hu, Q., Liu, Y., & Wang, H. (2024). CKGFuzzer: LLM-Based Fuzz Driver Generation Enhanced By Code Knowledge Graph (No. arXiv:2411.11532). arXiv. https://doi.org/10.48550/arXiv.2411.11532

Sun, Y. (2024). Automated Generation and Compilation of Fuzz Driver Based on Large Language Models. Proceedings of the 2024 9th International Conference on Cyber Security and Information Engineering, 461–468. https://doi.org/10.1145/3689236.3689272

Murali, A., Mathews, N., Alfadel, M., Nagappan, M., & Xu, M. (2024). FuzzSlice: Pruning False Positives in Static Analysis Warnings through Function-Level Fuzzing. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. https://doi.org/10.1145/3597503.3623321

We propose FuzzSlice, a novel framework that automatically prunes possible false positives among static analysis warnings. Unlike prior work that mostly focuses on confirming true positives among static analysis warnings, which inevitably requires end-to-end fuzzing, FuzzSlice focuses on ruling out potential false positives, which are the majority in static analysis reports. The key insight that we base our work on is that a warning that does not yield a crash when fuzzed at the function level in a given time budget is a possible false positive. To achieve this, FuzzSlice first aims to generate compilable code slices at the function level. Then, FuzzSlice fuzzes these code slices instead of the entire binary to prune possible false positives. FuzzSlice is also unlikely to misclassify a true bug as a false positive because the crashing input can be reproduced by a fuzzer at the function level as well.

We evaluate FuzzSlice on the Juliet synthetic dataset and real-world complex C projects: openssl, tmux and openssh-portable. Our evaluation shows that the ground truth in the Juliet dataset had 864 false positives which were all detected by FuzzSlice. For the open-source repositories, we were able to get the developers from two of these open-source repositories to independently label these warnings. FuzzSlice automatically identifies 33 out of 53 false positives confirmed by developers in these two repositories. This implies that FuzzSlice can reduce the number of false positives by 62.26% in the open-source repositories and by 100% in the Juliet dataset.

Fuzzing via Regular Entrance

Fuzzing All Software

Xia, C. S., Paltenghi, M., Le Tian, J., Pradel, M., & Zhang, L. (2024). Fuzz4All: Universal Fuzzing with Large Language Models. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. https://doi.org/10.1145/3597503.3639121

This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java, and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 98 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 64 bugs already confirmed by developers as previously unknown.

Sheng, Z., Xu, Q., Huang, J., Woodcock, M., Huang, H., Donaldson, A. F., Gu, G., & Huang, J. (2025). All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching (No. arXiv:2509.07225). arXiv. https://doi.org/10.48550/arXiv.2509.07225

Our team, All You Need Is A Fuzzing Brain, was one of seven finalists in DARPA’s Artificial Intelligence Cyber Challenge (AIxCC), placing fourth in the final round. During the competition, we developed a Cyber Reasoning System (CRS) that autonomously discovered 28 security vulnerabilities—including six previously unknown zero-days—in real-world open-source C and Java projects, and successfully patched 14 of them. The complete CRS is open source at github.com/o2lab/afc-crs-all-you-need-is-a-fuzzing-brain. This paper provides a detailed technical description of our CRS, with an emphasis on its LLM-powered components and strategies. Building on AIxCC, we further introduce a public leaderboard for benchmarking state-of-the-art LLMs on vulnerability detection and patching tasks, derived from the AIxCC dataset. The leaderboard is available at o2lab.github.io/FuzzingBrain-Leaderboard.

Fuzzing System-Level Software

Language Processors (Compilers, Interpreters, ...)

Yang, Y., Yao, S., Chen, J., & Lee, W. (n.d.). Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving. https://www.usenix.org/conference/usenixsecurity25/presentation/yang-yupeng This paper is included in the Proceedings of the 34th USENIX Security Symposium.

More quotes can be found on below "Complex Input Constraints"

Fuzzing feeds random input to a program and checks for faulty behaviors [21, 22, 35, 70, 91]. Modern fuzzers rely on random mutations (e.g., bit-flips) and coverage feedback to produce inputs covering more code regions. However, naive random mutations perform badly at exploring deep logic behind complex constraints. To overcome this problem, researchers have proposed generic and target-specific solutions.

Generic Solutions. Generic solutions aim at solving complex constraints without target-specific heuristics. The typical approach is to track the relationship between the input and the evaluation of the constraints. Taint tracking and symbolic/concolic execution fall into this category. Taint tracking enables fuzzers [24, 31, 51, 63] to learn which parts of the input affect constraint evaluation. Afterward, fuzzers focus their mutations on the relevant parts to increase the chances of solving the constraints. Symbolic execution replaces the input with symbolic values and then tracks computations stemming from them. Then, constraint evaluation can be represented in SMT formulas, and SMT solvers [17, 74] help determine the exact inputs to solve them. In the context of fuzzing, symbolic execution is often enhanced with concolic execution [33, 59, 90], where concrete values and execution paths from fuzzers help speed up SMT solving and path exploration. Additionally, approaches like REDQUEEN [5] and others [10, 57, 81] aim to reduce constraint tracking overhead through approximation.
- Unfortunately, these solutions fall short when programs become as complex as language processors. Taint tracking fails, because just after the first stage (i.e., tokenization in Fig.1), every input byte is applied some constraints and gets tainted, rendering taint tracking pointless. Symbolic and concolic fuzzing fail for several reasons. First, they require heavy engineering to model every operation semantics in the program (e.g., native instruction semantics [90] or IR semantics [9, 12, 59, 78]) to track symbol propagation correctly. This process also faces difficulties in operations commonly seen in language processors (e.g., symbolized pointers [12]). To demonstrate, Fig.2 shows a simplified language processor for C++. Line 27 contains the code region that is only triggered when the input source contains structs inside functions. Our experiments show that state-of-the-art symbolic/concolic execution engines, including SymCC [59], Angr [78], and SymQemu [60], cannot track the input constraint leading to line 27 correctly due to incomplete semantics modeling. Second, even when semantics modeling is complete (via heavy manual effort), the enormous number of control paths in early language processor stages can cause path explosion for symbolic execution and over-constraining [12] for concolic execution. Additionally, certain stages might contain logic that defeats symbolic solvers, such as hash-based tokenization that cannot be inverted. Researchers have proposed symbolizing tokens instead of input bytes to mitigate this problem for Javascript [27]. However, its generalizability to other languages remains unclear [88], and it only mitigates the problem in the first stage, which does not handle constraints in later stages (i.e., semantics checking and code generation).
Target-specific Solutions. Researchers have proposed solutions to overcome complex constraints in specific language processors. Csmith [88] generates C programs avoiding undefined behaviors. DIE [55] and Fuzzilli [30] maintain Javascript semantics to reach deep just-in-time (JIT) regions. These approaches require heavy effort to design target-specific heuristics, making them un-generalizable. Furthermore, researchers have proposed annotation-based approaches [11, 89]. Still, they require expert knowledge of the annotation language and an understanding of the target constraints, making them non-trivial to apply to new targets. Additionally, token-level fuzzing [67] and grammar-level fuzzing [4, 80] have been proposed. They are semi-generic because they do not target a specific language due to the wide availability of tokenizers and grammar. However, they only try to overcome constraints in the first two stages (see Fig.1), leaving the later stages unhandled.

Protocols /Implementations

OS Kernels

Database System

Fuzzing Application Software /Programs

Fuzzing Against Neural Networks

🔥 Fuzzing Against Large Language Model 😅

Yu, J., Shao, Y., Miao, H., & Shi, J. (2025). PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs (No. arXiv:2409.14729). arXiv. https://doi.org/10.48550/arXiv.2409.14729

In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM’s resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts.

By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours, demonstrating PROMPTFUZZ’s effectiveness compared to experienced human attackers. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks.

Deng, Y., Xia, C. S., Yang, C., Zhang, S. D., Yang, S., & Zhang, L. (2024). Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. https://doi.org/10.1145/3597503.3623343

This paper proposes FuzzGPT, the first approach to guiding LLMs to directly synthesize unusual input programs for effective fuzzing. While the recent TitanFuzz work samples programs from the natural probability distribution encoded in pre-trained LLMs, FuzzGPT aims to shift the distribution towards unusual programs in the search space to explore code paths rarely hit by ordinary programs. FuzzGPT is mainly built on the well-known hypothesis that historical bug-triggering programs may include edge-case/valuable code ingredients important for bug finding. ... To implement FuzzGPT, we first construct a dataset of bug-triggering code snippets by mining bug reports from open-source repositories of the target DL libraries. Built on this dataset, FuzzGPT includes the following strategies. (1) In-context learning [37]: we provide LLMs with either a few examples of historical bug-triggering programs (few-shot learning [7, 51]) or a partial bug-triggering program (zeroshot learning [47, 58]) to either generate a new code snippet or to autocomplete the partial code. (2) Fine-tuning [50]: we modify the model weights by training on the extracted historical bug-triggering programs to obtain fine-tuned LLMs that are specially designed to generate similar bug-triggering code snippets. From both learning strategies, FuzzGPT can prime the LLMs to generate bug-triggering programs by capturing code ingredients within either the local context examples or fine-tuning dataset.

Fuzzing by Strategies

Testing Mode

White-Box Testing

Black-Box Testing

Grey-Box Testing

Test Case Generation

Directly Generation

Seed + Mutation

Fuzzing with Historical Bugs

Complex Input Constraints

Yang, Y., Yao, S., Chen, J., & Lee, W. (n.d.). Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving.

To address the problem of complex input constraints, researchers have proposed two categories of solutions: generic solutions and target-specific solutions.

Generic solutions aim at solving complex constraints without target-specific heuristics [3, 5, 57, 71, 90]. The typical approach is to track the relationship between the input and the evaluation of branching conditions within the program. Taint tracking and concolic execution fall into this category. Taint tracking enables fuzzers [24, 31, 51, 63] to learn which parts of the input affect the branching condition checks. Afterward, fuzzers can focus the mutation on the relevant parts of the input to increase the chances of passing such checks. Concolic execution combines a fuzzer with a symbolic solver [9, 12, 59, 78, 90], which replaces the input with symbolic values and then tracks computations stemming from them. Afterward, branching conditions can be represented in SMT formulas and solved by SMT solvers [17, 74].
- Unfortunately, these solutions fall short on programs as complex as language processors. Taint tracking would fail after the tokenization stage, which applies constraints on every input byte, rendering taint tracking pointless. Concolic execution would fail due to incomplete semantics modeling or over-constraining [12].
To deal with such limitations, researchers have proposed target-specific solutions for specific language processors [30, 55, 88, 97]. Although these approaches show effectiveness on their respective targets, they require heavy manual effort to embed target-specific heuristics into the fuzzer, making them ungeneralizable to other targets. Further, researchers have proposed annotation-based approaches [11, 89]. Still, they require expert knowledge of the annotation language and an understanding of the target’s input constraints, making them non-trivial to apply to new targets.
Semi-generic approaches, such as token-level [67] and grammar-level [4, 80] fuzzing, have been proposed for language processors. Due to the wide availability of tokenizers and grammar, these approaches can be generalized. However, they only aim to overcome lexical and syntax constraints, which are still ineffective in exploring deeper logics (e.g., in optimization and execution stages) behind semantic constraints [11, 55]. To date, we face a dilemma where generic solutions fail to effectively handle language processors while target-specific solutions struggle to generalize.

... (They give a more detailed version of these common solutions later. I quote them at the section under heading of "Language Processors (Compilers, Interpreters, ...)")

(They decided to use LLM to solve this problem. But first, here are the challenges of an LLM-based approach.)

We identify two unique challenges that must be addressed to make an LLM-based approach practical.

First, it is unclear how to prioritize constraints to maximize coverage gain. Modern language processors easily contain up to millions of lines of code (see Table 4). Since inference with LLMs is not free [85, 86], it is important to prioritize the constraints to solve. However, constraint prioritization is non-trivial. An ideal prioritization strategy should not only favor those having greater potential to reach more code regions, but also account for the solving difficulty for LLMs. This is because we notice that LLMs may easily solve some constraints, but fail entirely with others due to lacking reasoning capabilities or inherently unsolvable problems (e.g., one-way functions). To date, effectively satisfying these two properties to maximize coverage gain remains an open challenge.
Second, it is non-trivial to decide the degree of context presentation. To instruct an LLM to solve a constraint, we need to provide the constraint context (e.g., source code) for it to reason about. Presenting the entire code base is impractical, considering the magnitude of modern language processor code bases. Presenting execution traces or slices is not ideal either because their sizes can still be huge, with total lines of code easily exceeding 10k in our experiments. On the flip side, presenting only a narrow context (e.g., the function enclosing the constraint’s evaluation point) can miss necessary information for the LLM to understand the full picture of the constraint. Therefore, it remains unclear how to present sufficient and concise context to achieve both effectiveness and efficiency.

... (Later they give a more detailed explanation of above-mentioned common solutions and challenges in applying LLM to solve the problem of complex constraints)

Constraint Prioritization. The first challenge we identify is it is unclear how to prioritize unsolved constraints to maximize coverage gain. Language processors can contain up to millions of lines of code (see Table 4). It is important to prioritize the unsolved constraints to solve (i.e., the uncovered lines to cover), because inference with LLMs is not free [85, 86], and it is by far impractical to ask LLMs to reason about all unsolved constraints. However, constraint prioritization is non-trivial. An ideal prioritization strategy should satisfy the following two properties. First, it should prioritize the constraints having greater potential to reach more code regions if solved. For instance, in Fig.2, consider a scenario where line 20 and line 27 are both uncovered. Since line 27 can lead to additional uncovered regions, such as deeper functions for processing local structs, while line 20 merely bails out, prioritizing reaching line 27 is more effective. Second, prioritization should account for the solving difficulty. LLMs may easily solve some constraints but fail entirely with others. Solving success rates can drop when constraints involve complex logic beyond the LLM’s reasoning capabilities, or when they involve inherently hard-to-solve problems (e.g., one-way functions). To remain cost-effective, constraints that LLMs struggle to solve should be deprioritized, even if they have the potential to cover more code regions. To date, effectively satisfying these two properties to maximize coverage gain remains an open challenge.
Context Granularity. After we decide which constraint to solve, the next step is to collect relevant context, structure it into a prompt, and query the LLM for results. To work with LLMs, we think the best context for a constraint is the relevant source code. However, it is non-trivial to decide the source granularity to present to the LLM. Presenting the entire code base is not cost-effective, considering the magnitude of modern language processor code bases (see Table 4). A smarter approach would be using dynamic program slicing [87, 94, 95] to extract only code relevant to the branching condition (which guards uncovered code regions) by setting the branching condition as the slicing criterion. However, we notice this approach is not ideal either. The dynamic slice of a language processor can still be prohibitively huge. In fact, setting any branching condition in one stage (see Fig.1) as the slicing criterion should result in the majority of all previous stages being included in the dynamic slice. For instance, tokenization and parsing stages are both relevant to any branching condition in the code generation stage, as inputs failing to be tokenized or parsed can never trigger code generation. It is not ideal to present the entirety of the tokenizer and parser as context for constraints in the code generation stage. Additionally, even if the cost budget and LLM context window size support large slices, presenting redundant information can cause more hallucinations [6]. On the flip side, presenting only a narrow context, such as the function containing the uncovered code region (i.e., the enclosing function), can miss necessary information for understanding the constraint clearly. For instance, prior to reaching the enclosing function, there could be early constraint checks that reject the input. To conclude, how to present sufficient and concise context to achieve both effectiveness and efficiency remains unclear.

... (They give their solutions as following)

We propose two solutions to solve the two aforementioned challenges respectively, hybrid centrality prioritization and iterative context construction. Next, we discuss our insights.

Hybrid Centrality Prioritization. As discussed in §2.4, an ideal prioritization algorithm should (i) prioritize constraints having greater potential to trigger more code regions and (ii) deprioritize constraints that LLMs may not be able to solve.
- To address (i), in the field of fuzzing seed scheduling, researchers have faced similar challenges in measuring the potential to reach more code regions [7, 57, 69]. An effective solution is graph centrality analysis, a method to estimate a node’s influence in a graph. Researchers have shown that after building the inter-procedural Control Flow Graph (CFG), centrality analysis can effectively measure one basic block’s potential to reach more other basic blocks [69]. Inspired by this, we extend the existing approach and design a centrality analysis mechanism to prioritize unsolved constraints.
- To address (ii), we need a way to measure the constraint solving difficulty. The first approach we consider is using graph complexity analysis [98] to approximate constraint complexity, which seems to correlate with solving difficulty. However, in reality, we notice that pretrained LLMs are capable of picking up hints in the context that immediately enable them to solve the constraints, without thoroughly analyzing the constraints in detail. For instance, in Fig.2, imagine a scenario where lines after line 20 are not covered due to failure to pass the constraints evaluated at line 19. In our experiment, LLMs (e.g., GPT-4o-mini) can draw insights from just the function names (e.g., ts_node_has_error, tree_sitter_cpp) to realize that the input should be a syntax-correct C++ source, and then proceed to generate an input that successfully triggers line 21. However, graph complexity analysis would consider the parsing stage as a complex operation and neglect such hints. Moreover, asking the LLMs themselves to score each constraint isn’t viable because it is neither cost-effective nor reliable (because it is hard to verify the scores). Instead, we propose approximating solving difficulty using past solving history: the more LLMs fail to solve a certain type of constraint, the harder it is for LLMs to solve similar constraints in the future. Finally, we use the approximated difficulty to adjust the centrality score, so that the final scores start high for constraints with higher centrality scores but are scaled down later if the constraints appear hard to solve.
Iterative Context Construction. Considering the “picking up hints” capability discussed above, we deem it impractical to decide how much context is sufficient for an LLM a priori. Instead, we draw insights from recent works on LLMs for software engineering [19, 93], which show that LLMs can improve and correct their code generation results iteratively with feedback information. ==To provide a sufficient and concise context, we start by presenting the LLM with a narrow context around the constraint evaluation point (i.e., the enclosing function). We then instruct the LLM to either solve the constraint or request additional context (e.g., symbol definitions).== This approach allows us to leverage dynamic and static analysis techniques to iteratively expand the context. If the LLM provides a solution, we verify its correctness. In case of failure, the execution trace is analyzed to identify a failing point’s context, which is then fed back to the LLM. When the LLM requests more context, source code analysis tools [50] are employed to supply the requested context. Through this process, we gradually construct the context until reaching a successful solution or an iteration threshold.

Fuzzing Under Multi-Agent LLM

↗ LLM Agents, AI Workflow, & Agentic MLLM

FilesExpand file tree

LLM & Fuzzing.md

Latest commit

History