
LLM & Fuzzing

[TOC]

Res

Related Topics

↗ LLM For Security ↗ Fuzzing (Concrete Execution)

Paper Reading List

AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing https://arxiv.org/abs/2409.10737

Technical Progress | Prompt-Fuzz: LLM-Based Automated Fuzz Driver Generation for Library Fuzzing https://mp.weixin.qq.com/s/V3OC6BbZuMC0eev53C_LWA Prompt Fuzzing for Fuzz Driver Generation https://arxiv.org/abs/2312.17677

CovRL-Fuzz: Fuzzing JavaScript Interpreters via LLM-Based Mutation | Technical Progress https://mp.weixin.qq.com/s/6DaD7rw3t7TFYupTJoEtaw

Technical Progress | MINER: A Hybrid Data-Driven Approach for REST API Fuzzing https://mp.weixin.qq.com/s/u26iSavxWGswrcuLaaACKA

Technical Progress | SDFuzz: Target-State-Driven Directed Fuzzing https://mp.weixin.qq.com/s/zodAi0hD05HuIji6krz9TA

SeamFuzz: Learning Seed-Adaptive Mutation Strategies for Grey-Box Fuzzing https://mp.weixin.qq.com/s/duEtfMfayefdBzlqm1rM_g


A Roundup of Research Papers Combining Large Models with Fuzzing | Technical Progress https://mp.weixin.qq.com/s/DL4pGH-7nPi3eSRD-rlD-w

2024äæ”ęÆå®‰å…Øé¢†åŸŸå››å¤§é”¶ä¼šFuzzč®ŗę–‡ę±‡ę€»ļ½œęŠ€ęœÆčæ›å±• https://mp.weixin.qq.com/s/Dn5ooUahCT7IqXIMevjVug

A Roundup of Fuzzing Papers from Top Software-Engineering Conferences in 2024 https://mp.weixin.qq.com/s/-ZYX-G_jX1AN8lzbHvW21g

Graduation Thesis

Thesis Template

https://github.com/cuiao/SCU_ThesisDissertation_LaTeXTemplate Released: May 30, 2016

The author is a student at Sichuan University who frequently needs to do academic writing in the course of study and research. Traditional word processors (such as _Microsoft® Word_) handle special needs such as mathematics poorly, which leads to all sorts of problems and inefficient writing. The author stumbled upon the LaTeX typesetting system and was impressed by its many strengths: powerful features, beautiful mathematical typesetting, and convenient automation tools, making it especially suitable for students in science and engineering. On further investigation, the author found that papers in international journals are mainly typeset with LaTeX, and that many universities at home and abroad provide LaTeX thesis templates, whereas our university still lags somewhat in this respect. Motivated by this, the author, with fairly elementary LaTeX knowledge and building on the _pkuthss_ template from Peking University and other related references, developed _scuthesis_, a LaTeX thesis template for graduate students of Sichuan University (an undergraduate option is included, but the cover page has not been adapted). Hopefully this template offers fellow students an additional choice. If the template has flaws, please point them out: leave a comment, open a new _ISSUE_, or _FORK_ a new branch and fix it.

https://github.com/fatestigma/scuthesis ScuThesis is an unofficial LaTeX template for Sichuan University undergraduate graduation theses. It was created so that everyone can more easily produce a beautifully typeset thesis with the LaTeX engine: first, to improve the typesetting quality of graduation theses, and second, to free up more energy for the content itself.

More than ten students from the class of 2016 took part in the ScuThesis beta test, and four of them won Sichuan University's outstanding undergraduate thesis award; one was sent back and asked to submit the document in .doc format.

https://github.com/fanmcgrady/SCU-Thesis-LaTeX-Template Zuo Jie (of the College of Computer Science, Sichuan University) told me that his doctoral thesis was written with a LaTeX template of his own design.

Title: 《基因表达式编程核心技术研究》 (Research on Core Technologies of Gene Expression Programming); interested readers may want to look it up.

Reading Zuo's doctoral thesis was deeply impressive. To follow in his footsteps, my own doctoral thesis absolutely could not be written in Word either. The Graduate School of Sichuan University provides a Word-format thesis template but no corresponding LaTeX template for dissertations. I found the original author cuiao's project _scuthesis_ on GitHub. This project makes some modifications on top of the original so that it meets Sichuan University's dissertation formatting requirements. I hope it lets everyone focus more on the writing of the thesis itself!

https://github.com/SunnyHaze/scu-thesis-template

Survey

Translation

Graduation Paper

(figure: Fuzz4All.excalidraw)

Links

The Chinese national standard 《信息与文献 资源描述》 (Information and Documentation: Resource Description) is administered by TC4 (the National Technical Committee on Information and Documentation Standardization); its competent authority is the Standardization Administration of China.

äø»č¦čµ·č‰å•ä½ äø­å›½ē§‘å­¦é™¢ę–‡ēŒ®ęƒ…ęŠ„äø­åæƒ 、国家图书馆 ć€åŒ—äŗ¬å¤§å­¦å›¾ä¹¦é¦† ć€ęø…åŽå¤§å­¦å›¾ä¹¦é¦† ć€åŽäøœåøˆčŒƒå¤§å­¦å›¾ä¹¦é¦† ć€å¹æäøœēœē«‹äø­å±±å›¾ä¹¦é¦† ć€äøŠęµ·å›¾ä¹¦é¦† ć€äø­å›½ē§‘å­¦ęŠ€ęœÆäæ”ęÆē ”ē©¶ę‰€ ć€é¦–éƒ½å›¾ä¹¦é¦† ć€äø­å¤®éŸ³ä¹å­¦é™¢å›¾ä¹¦é¦† ć€äø­å›½ē¤¾ä¼šē§‘å­¦é™¢å›¾ä¹¦é¦† 怂

äø»č¦čµ·č‰äŗŗ 宋文 ć€ēŽ‹ę“‹ ć€å–»ēˆ½ēˆ½ ć€ęØę…§ ć€čƒ”å°č ć€é²å›½å¼ŗ ć€ęÆ›å‡Œę–‡ 、毛雅君 、纪陆恩 ć€ęœ±å­¦å†› ć€é­ę„ ć€å¼ åØŸ ć€ēŽ‹å®å® ć€ę²ˆę­£åŽ ć€č¢ēŽ‰ēŗ¢ 、黄丽婷 怂

Intro

LLM-Based Fuzzing Procedure

The general workflow of LLM-driven defect detection. From: Li Y, Yang WZ, Zhang Y, Xue YX. Survey on Fuzzing Based on Large Language Model. Ruan Jian Xue Bao/Journal of Software (in Chinese). http://www.jos.org.cn/1000-9825/7323.htm

LLM-Driven Test Cases Generation

LLM-Driven Bug/Vulnerability Discovery

LLM-Driven Post-Fuzzing Processing

LLM-Based Fuzzing (Integrating LLM Into Fuzzing Work-Flow 😅)

This is basically a work-flow design contest 😅

https://aicyberchallenge.com/ DARPA's Artificial Intelligence Cyber Challenge (AIxCC)

Surveys & Overview

Huang, Linghan, Peizhou Zhao, Huaming Chen, and Lei Ma. "Large language models based fuzzing techniques: A survey." arXiv preprint arXiv:2402.00350 (2024). https://arxiv.org/abs/2402.00350


Li Y, Yang WZ, Zhang Y, Xue YX. Survey on Fuzzing Based on Large Language Model. Ruan Jian Xue Bao/Journal of Software (in Chinese). http://www.jos.org.cn/1000-9825/7323.htm

Fuzzers by PUT

Fuzzing via Library API & Fuzz Driver Generation

Zhang, C., Zheng, Y., Bai, M., Li, Y., Ma, W., Xie, X., Li, Y., Sun, L., & Liu, Y. (2024). How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1223–1235. https://doi.org/10.1145/3650212.3680355

Essentially, a fuzz driver is a piece of code responsible for accepting mutated input from fuzzers and executing the APIs accordingly. An effective driver must contain correct and robust API usage, since incorrect or unsound usage can result in extensive false-positive or false-negative fuzzing results, incurring extra manual validation effort or wasting testing resources. Due to the high standard required, fuzz drivers are typically written by human experts, which is a labor-intensive and time-consuming process. For instance, 🔗 OSS-Fuzz, the largest fuzzing framework for open-source projects, maintains thousands of fuzz drivers written by hundreds of contributors over the past seven years.

Fuzz Driver Basics. The key components of a fuzz driver are illustrated in Figure 1. A typical fuzz driver has three necessary parts: prerequisite initialization (line 3), execution (line 4), and post-cleaning (line 7). Besides these, there are three optional parts, commented at lines 2, 5, and 6, that can improve a driver’s effectiveness. The part at line 2 improves a driver through proper input arrangement, such as rejecting inputs that are too short or too long, interpreting the input data as multiple testing arguments, etc. The part at line 5 enables a driver to call more APIs, which triggers more program behaviors during fuzzing. Finally, the part at line 6 adds semantic oracles for detecting logical bugs. These oracles are similar to assert statements in unit tests, aborting execution when certain program properties are unsatisfied. Since a driver will be repeatedly executed with randomly mutated input, there is a high requirement on its correctness and robustness. Incorrect or non-robust usage can lead to both false positives and negatives. For instance, if a driver fails to feed the mutated data into the API, its fuzzing can never find any bug; if an API argument is incorrectly initialized, false crashes may be raised.
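To make this anatomy concrete, here is a minimal driver sketch in Python using the real atheris fuzzing engine (the paper's Figure 1 is a C libFuzzer driver; the `libfoo` API below is hypothetical, and the comments map each part to the description above):

```python
# A minimal fuzz-driver sketch with atheris (pip install atheris).
# The parts mirror the driver anatomy described above; the "libfoo" API
# names are made up for illustration.
import sys
import atheris

with atheris.instrument_imports():
    import libfoo  # hypothetical library under test

def TestOneInput(data: bytes):
    # Optional (Figure 1, line 2): input arrangement -- reject degenerate
    # inputs and split the raw bytes into several typed arguments.
    if len(data) < 4 or len(data) > 4096:
        return
    fdp = atheris.FuzzedDataProvider(data)
    flags = fdp.ConsumeInt(4)
    payload = fdp.ConsumeBytes(fdp.remaining_bytes())

    parser = libfoo.Parser(flags)   # necessary: prerequisite initialization
    result = parser.parse(payload)  # necessary: execution of the target API

    if result is not None:
        result.normalize()          # optional: call more APIs for coverage
        # Optional: a semantic oracle, like an assert in a unit test.
        assert parser.parse(result.serialize()) is not None, "round-trip broke"

    parser.close()                  # necessary: post-cleaning

if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```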

Minimal Requirements of Effective Fuzz Drivers. The minimal requirements cover lines 3, 4, and 7 of Figure 1, which mainly include correctly initializing the arguments and satisfying necessary control-flow dependencies. Argument initialization can be one of the following cases (in order of simplicity): ❶ C1: If the argument value can be any value or should be a naive value like 0 or NULL, a variable is declared or a literal constant is used directly; ❷ C2: If the argument is supposed to be a macro or a global variable that is already defined in common libraries or the target API’s project, it is located and used; ❸ C3: If creating the argument requires the use of common library APIs, such as creating a file and writing specific content, common practices are followed; ❹ C4: If initializing the argument requires the output of other APIs within the project, those APIs are initialized first following the above initialization cases.
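A hedged sketch of the four cases, driving a hypothetical C library `libfoo` from Python via ctypes (all symbol names are illustrative; only the shape of each case follows the paper):

```python
# Sketch of the four argument-initialization cases (C1-C4) against a
# hypothetical C library "libfoo" loaded via ctypes. Every libfoo symbol
# here is illustrative, not a real API.
import ctypes
import tempfile

lib = ctypes.CDLL("libfoo.so")  # hypothetical target library

def build_driver_arguments(data: bytes):
    # C1: the argument may be any/naive value -- use a literal directly.
    depth = 0

    # C2: the argument should be a macro / global already defined in the
    # project -- locate and reuse it (an imagined mode constant here,
    # hypothetically copied from libfoo's header).
    FOO_MODE_STRICT = 1

    # C3: creating the argument needs common library APIs -- e.g. a file
    # holding the mutated input, created with standard practice.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(data)
    tmp.close()

    # C4: the argument is the output of another API in the same project --
    # initialize that API first (recursively applying C1-C3 to it).
    ctx = lib.foo_context_new(FOO_MODE_STRICT)

    return depth, FOO_MODE_STRICT, tmp.name, ctx
```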

Lyu, Y., Xie, Y., Chen, P., & Chen, H. (2024). Prompt Fuzzing for Fuzz Driver Generation (No. arXiv:2312.17677). arXiv. https://doi.org/10.48550/arXiv.2312.17677

Xu, H., Ma, W., Zhou, T., Zhao, Y., Chen, K., Hu, Q., Liu, Y., & Wang, H. (2024). CKGFuzzer: LLM-Based Fuzz Driver Generation Enhanced By Code Knowledge Graph (No. arXiv:2411.11532). arXiv. https://doi.org/10.48550/arXiv.2411.11532

Sun, Y. (2024). Automated Generation and Compilation of Fuzz Driver Based on Large Language Models. Proceedings of the 2024 9th International Conference on Cyber Security and Information Engineering, 461–468. https://doi.org/10.1145/3689236.3689272

Murali, A., Mathews, N., Alfadel, M., Nagappan, M., & Xu, M. (2024). FuzzSlice: Pruning False Positives in Static Analysis Warnings through Function-Level Fuzzing. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. https://doi.org/10.1145/3597503.3623321

We propose FuzzSlice, a novel framework that automatically prunes possible false positives among static analysis warnings. Unlike prior work that mostly focuses on confirming true positives among static analysis warnings, which inevitably requires end-to-end fuzzing, FuzzSlice focuses on ruling out potential false positives, which are the majority in static analysis reports. The key insight that we base our work on is that a warning that does not yield a crash when fuzzed at the function level in a given time budget is a possible false positive. To achieve this, FuzzSlice first aims to generate compilable code slices at the function level. Then, FuzzSlice fuzzes these code slices instead of the entire binary to prune possible false positives. FuzzSlice is also unlikely to misclassify a true bug as a false positive because the crashing input can be reproduced by a fuzzer at the function level as well.

We evaluate FuzzSlice on the Juliet synthetic dataset and real-world complex C projects: openssl, tmux and openssh-portable. Our evaluation shows that the ground truth in the Juliet dataset had 864 false positives which were all detected by FuzzSlice. For the open-source repositories, we were able to get the developers from two of these open-source repositories to independently label these warnings. FuzzSlice automatically identifies 33 out of 53 false positives confirmed by developers in these two repositories. This implies that FuzzSlice can reduce the number of false positives by 62.26% in the open-source repositories and by 100% in the Juliet dataset.
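The decision rule reduces to: a warning whose function-level slice does not crash within the time budget is a likely false positive. A minimal sketch of that triage loop (the slice-compilation helper is a stand-in for FuzzSlice's actual slice construction; `-max_total_time` is a standard libFuzzer flag):

```python
# Sketch of FuzzSlice-style triage: fuzz each warning's enclosing function
# in isolation for a fixed budget; no crash => likely false positive.
import subprocess

TIME_BUDGET_SECS = 300  # illustrative per-warning fuzzing budget

def build_slice_harness(warning) -> str:
    """Stand-in: compile a self-contained, function-level code slice around
    the warning into a libFuzzer binary and return its path."""
    raise NotImplementedError

def triage(warning) -> str:
    harness = build_slice_harness(warning)
    proc = subprocess.run([harness, f"-max_total_time={TIME_BUDGET_SECS}"],
                          capture_output=True)
    # libFuzzer exits non-zero on a crash; the crashing input reproduces at
    # function level too, so true bugs are unlikely to be misclassified.
    return ("keep (crashed at function level)" if proc.returncode != 0
            else "likely false positive")
```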

Fuzzing via Regular Entrance

Fuzzing All Software

Xia, C. S., Paltenghi, M., Le Tian, J., Pradel, M., & Zhang, L. (2024). Fuzz4All: Universal Fuzzing with Large Language Models. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. https://doi.org/10.1145/3597503.3639121

This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are well-suited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java, and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 98 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 64 bugs already confirmed by developers as previously unknown.
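A heavily simplified sketch of that LLM-powered fuzzing loop (the `llm` callable, the GCC-based SUT wrapper, and the prompt-update heuristic are assumptions for illustration, not Fuzz4All's actual implementation):

```python
# Sketch of a Fuzz4All-style loop: an LLM generates candidate inputs from a
# prompt, the system under test (SUT) runs them, and the prompt is updated
# with an earlier generation each round.
import random
import subprocess

def run_sut(program_text: str) -> subprocess.CompletedProcess:
    # Stand-in SUT: syntax-check the generated program with GCC.
    return subprocess.run(["gcc", "-x", "c", "-", "-fsyntax-only"],
                          input=program_text.encode(), capture_output=True)

def fuzz4all_style_loop(llm, rounds: int = 100):
    """llm: callable prompt -> completion (any LLM API; an assumption here)."""
    prompt = "Generate a short, unusual C program exercising rare features.\n"
    corpus, findings = [], []
    for _ in range(rounds):
        candidate = llm(prompt)
        if run_sut(candidate).returncode < 0:  # killed by a signal => crash
            findings.append(candidate)
        corpus.append(candidate)
        # Iterative prompt update: feed an earlier generation back and ask
        # for something new (the paper explores several update strategies).
        prompt = ("Here is an example program:\n" + random.choice(corpus) +
                  "\nGenerate a different program that is also unusual.\n")
    return findings
```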

Sheng, Z., Xu, Q., Huang, J., Woodcock, M., Huang, H., Donaldson, A. F., Gu, G., & Huang, J. (2025). All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching (No. arXiv:2509.07225). arXiv. https://doi.org/10.48550/arXiv.2509.07225

Our team, All You Need Is A Fuzzing Brain, was one of seven finalists in DARPA’s Artificial Intelligence Cyber Challenge (AIxCC), placing fourth in the final round. During the competition, we developed a Cyber Reasoning System (CRS) that autonomously discovered 28 security vulnerabilities—including six previously unknown zero-days—in real-world open-source C and Java projects, and successfully patched 14 of them. The complete CRS is open source at github.com/o2lab/afc-crs-all-you-need-is-a-fuzzing-brain. This paper provides a detailed technical description of our CRS, with an emphasis on its LLM-powered components and strategies. Building on AIxCC, we further introduce a public leaderboard for benchmarking state-of-the-art LLMs on vulnerability detection and patching tasks, derived from the AIxCC dataset. The leaderboard is available at o2lab.github.io/FuzzingBrain-Leaderboard.

Fuzzing System-Level Software
Language Processors (Compilers, Interpreters, ...)

Yang, Y., Yao, S., Chen, J., & Lee, W. (2025). Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving. In Proceedings of the 34th USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity25/presentation/yang-yupeng

More quotes from this paper appear below, under "Complex Input Constraints".

Fuzzing feeds random input to a program and checks for faulty behaviors [21, 22, 35, 70, 91]. Modern fuzzers rely on random mutations (e.g., bit-flips) and coverage feedback to produce inputs covering more code regions. However, naive random mutations perform badly at exploring deep logic behind complex constraints. To overcome this problem, researchers have proposed generic and target-specific solutions.

  • Generic Solutions. Generic solutions aim at solving complex constraints without target-specific heuristics. The typical approach is to track the relationship between the input and the evaluation of the constraints. Taint tracking and symbolic/concolic execution fall into this category. Taint tracking enables fuzzers [24, 31, 51, 63] to learn which parts of the input affect constraint evaluation. Afterward, fuzzers focus their mutations on the relevant parts to increase the chances of solving the constraints. Symbolic execution replaces the input with symbolic values and then tracks computations stemming from them. Then, constraint evaluation can be represented in SMT formulas, and SMT solvers [17, 74] help determine the exact inputs to solve them. In the context of fuzzing, symbolic execution is often enhanced with concolic execution [33, 59, 90], where concrete values and execution paths from fuzzers help speed up SMT solving and path exploration. Additionally, approaches like REDQUEEN [5] and others [10, 57, 81] aim to reduce constraint tracking overhead through approximation.
    • Unfortunately, these solutions fall short when programs become as complex as language processors. Taint tracking fails because, just after the first stage (i.e., tokenization in Fig. 1), every input byte has constraints applied to it and gets tainted, rendering taint tracking pointless. Symbolic and concolic fuzzing fail for several reasons. First, they require heavy engineering to model every operation's semantics in the program (e.g., native instruction semantics [90] or IR semantics [9, 12, 59, 78]) to track symbol propagation correctly. This process also faces difficulties with operations commonly seen in language processors (e.g., symbolized pointers [12]). To demonstrate, Fig. 2 shows a simplified language processor for C++. Line 27 contains the code region that is only triggered when the input source contains structs inside functions. Our experiments show that state-of-the-art symbolic/concolic execution engines, including SymCC [59], Angr [78], and SymQemu [60], cannot track the input constraint leading to line 27 correctly due to incomplete semantics modeling. Second, even when semantics modeling is complete (via heavy manual effort), the enormous number of control paths in early language processor stages can cause path explosion for symbolic execution and over-constraining [12] for concolic execution. Additionally, certain stages might contain logic that defeats symbolic solvers, such as hash-based tokenization that cannot be inverted. Researchers have proposed symbolizing tokens instead of input bytes to mitigate this problem for JavaScript [27]. However, its generalizability to other languages remains unclear [88], and it only mitigates the problem in the first stage, leaving constraints in later stages (i.e., semantics checking and code generation) unhandled.
  • Target-specific Solutions. Researchers have proposed solutions to overcome complex constraints in specific language processors. Csmith [88] generates C programs avoiding undefined behaviors. DIE [55] and Fuzzilli [30] maintain JavaScript semantics to reach deep just-in-time (JIT) regions. These approaches require heavy effort to design target-specific heuristics, making them ungeneralizable. Furthermore, researchers have proposed annotation-based approaches [11, 89]. Still, they require expert knowledge of the annotation language and an understanding of the target constraints, making them non-trivial to apply to new targets. Additionally, token-level fuzzing [67] and grammar-level fuzzing [4, 80] have been proposed. They are semi-generic because they do not target a specific language, thanks to the wide availability of tokenizers and grammars. However, they only try to overcome constraints in the first two stages (see Fig. 1), leaving the later stages unhandled.
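A toy illustration (didactic only, not from the paper) of why blind byte mutation stalls on such staged constraints: each stage below rejects almost all random inputs before the next stage runs, so the deep branch is essentially unreachable without structurally valid input, which an LLM can produce trivially.

```python
# Toy "language processor" with staged constraints (didactic only).
# A blind byte-mutation fuzzer almost never reaches deep_branch(), because
# each stage rejects nearly all random inputs before the next stage runs.
def process(src: bytes):
    text = src.decode("ascii", errors="strict")  # stage 1: tokenization-ish
    if not text.startswith("fn main() {"):       # stage 2: parsing-ish
        raise SyntaxError("expected a function")
    body = text[len("fn main() {"):].rstrip("}")
    if "struct" not in body:                     # stage 3: semantic check
        return "ok, shallow path"
    return deep_branch(body)                     # rarely-reached region

def deep_branch(body: str):
    # The interesting (possibly buggy) logic lives here; random bytes satisfy
    # stages 1-3 simultaneously with vanishing probability, while an LLM can
    # simply emit "fn main() { struct S; }".
    return "reached deep logic"

if __name__ == "__main__":
    print(process(b"fn main() { struct S; }"))
```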
Protocols / Implementations
OS Kernels
Database Systems
Fuzzing Application Software / Programs
Fuzzing Against Neural Networks
🔄 Fuzzing Against Large Language Model 😅

Yu, J., Shao, Y., Miao, H., & Shi, J. (2025). PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs (No. arXiv:2409.14729). arXiv. https://doi.org/10.48550/arXiv.2409.14729

In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM’s resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high-quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts.

By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours, demonstrating PROMPTFUZZ’s effectiveness compared to experienced human attackers. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks.
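A schematic of the two stages described above (all helper functions are placeholders standing in for the paper's components; only the prepare/focus structure is taken from the text):

```python
# Schematic of PROMPTFUZZ's two stages (structure from the abstract above;
# the helpers below are placeholders, not the authors' code).
def attack_success(response) -> bool: ...      # placeholder: judge whether the
                                               # injection hijacked the target
def generate_injections(seeds, few_shot): ...  # placeholder: LLM-driven mutator

def prepare_phase(candidate_seeds, target_llm):
    """Select promising initial seeds and collect few-shot examples."""
    seeds, few_shot = [], []
    for seed in candidate_seeds:
        response = target_llm(seed)
        if attack_success(response):
            seeds.append(seed)
            few_shot.append((seed, response))
    return seeds, few_shot

def focus_phase(seeds, few_shot, target_llm, rounds=100):
    """Generate diverse, high-quality injections from the collected examples."""
    found = []
    for _ in range(rounds):
        for injection in generate_injections(seeds, few_shot):
            if attack_success(target_llm(injection)):
                found.append(injection)
                few_shot.append((injection, None))  # reuse successes as examples
    return found
```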

Deng, Y., Xia, C. S., Yang, C., Zhang, S. D., Yang, S., & Zhang, L. (2024). Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13. https://doi.org/10.1145/3597503.3623343

This paper proposes FuzzGPT, the first approach to guiding LLMs to directly synthesize unusual input programs for effective fuzzing. While the recent TitanFuzz work samples programs from the natural probability distribution encoded in pre-trained LLMs, FuzzGPT aims to shift the distribution towards unusual programs in the search space to explore code paths rarely hit by ordinary programs. FuzzGPT is mainly built on the well-known hypothesis that historical bug-triggering programs may include edge-case/valuable code ingredients important for bug finding. ... To implement FuzzGPT, we first construct a dataset of bug-triggering code snippets by mining bug reports from open-source repositories of the target DL libraries. Built on this dataset, FuzzGPT includes the following strategies. (1) In-context learning [37]: we provide LLMs with either a few examples of historical bug-triggering programs (few-shot learning [7, 51]) or a partial bug-triggering program (zero-shot learning [47, 58]) to either generate a new code snippet or to autocomplete the partial code. (2) Fine-tuning [50]: we modify the model weights by training on the extracted historical bug-triggering programs to obtain fine-tuned LLMs that are specially designed to generate similar bug-triggering code snippets. From both learning strategies, FuzzGPT can prime the LLMs to generate bug-triggering programs by capturing code ingredients within either the local context examples or the fine-tuning dataset.
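The in-context-learning strategy is essentially prompt assembly from mined bug-triggering snippets. A hedged sketch (the snippets and the `llm` callable are stand-ins, not the paper's mined dataset or model):

```python
# Sketch of FuzzGPT's few-shot in-context learning: prime the LLM with
# historical bug-triggering programs mined from the target DL library's bug
# reports, then ask for a new edge-case program.
import random

HISTORICAL_SNIPPETS = [  # toy stand-ins for mined bug-triggering programs
    "import torch\nx = torch.empty(0)\nprint(x.reshape(-1, 0))",
    "import torch\nm = torch.nn.Conv2d(1, 1, kernel_size=0)",
]

def few_shot_prompt(k: int = 2) -> str:
    shots = random.sample(HISTORICAL_SNIPPETS, k=min(k, len(HISTORICAL_SNIPPETS)))
    examples = "\n\n".join("# bug-triggering example\n" + s for s in shots)
    return (examples + "\n\n# Write another short, unusual PyTorch program "
            "likely to trigger a crash or an incorrect result:\n")

def generate_candidate(llm) -> str:
    """llm: callable prompt -> completion. The returned candidate is then
    executed under the DL library and checked for crashes/inconsistencies."""
    return llm(few_shot_prompt())
```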

Fuzzing by Strategies

Testing Mode

White-Box Testing
Black-Box Testing
Grey-Box Testing

Test Case Generation

Direct Generation
Seed + Mutation
Fuzzing with Historical Bugs
Complex Input Constraints

Yang, Y., Yao, S., Chen, J., & Lee, W. (2025). Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving.

To address the problem of complex input constraints, researchers have proposed two categories of solutions: generic solutions and target-specific solutions.

  • Generic solutions aim at solving complex constraints without target-specific heuristics [3, 5, 57, 71, 90]. The typical approach is to track the relationship between the input and the evaluation of branching conditions within the program. Taint tracking and concolic execution fall into this category. Taint tracking enables fuzzers [24, 31, 51, 63] to learn which parts of the input affect the branching condition checks. Afterward, fuzzers can focus the mutation on the relevant parts of the input to increase the chances of passing such checks. Concolic execution combines a fuzzer with a symbolic solver [9, 12, 59, 78, 90], which replaces the input with symbolic values and then tracks computations stemming from them. Afterward, branching conditions can be represented in SMT formulas and solved by SMT solvers [17, 74].
    • Unfortunately, these solutions fall short on programs as complex as language processors. Taint tracking would fail after the tokenization stage, which applies constraints on every input byte, rendering taint tracking pointless. Concolic execution would fail due to incomplete semantics modeling or over-constraining [12].
  • To deal with such limitations, researchers have proposed target-specific solutions for specific language processors [30, 55, 88, 97]. Although these approaches show effectiveness on their respective targets, they require heavy manual effort to embed target-specific heuristics into the fuzzer, making them ungeneralizable to other targets. Further, researchers have proposed annotation-based approaches [11, 89]. Still, they require expert knowledge of the annotation language and an understanding of the target’s input constraints, making them non-trivial to apply to new targets.
  • Semi-generic approaches, such as token-level [67] and grammar-level [4, 80] fuzzing, have been proposed for language processors. Due to the wide availability of tokenizers and grammars, these approaches can be generalized. However, they only aim to overcome lexical and syntax constraints, which is still ineffective for exploring the deeper logic (e.g., in optimization and execution stages) behind semantic constraints [11, 55]. To date, we face a dilemma where generic solutions fail to effectively handle language processors while target-specific solutions struggle to generalize.

... (They give a more detailed version of these common solutions later in the paper; that passage is quoted above, under the heading "Language Processors (Compilers, Interpreters, ...)".)

(They decided to use an LLM to solve this problem. But first, here are the challenges of an LLM-based approach.)

We identify two unique challenges that must be addressed to make an LLM-based approach practical.

  • First, it is unclear how to prioritize constraints to maximize coverage gain. Modern language processors easily contain up to millions of lines of code (see Table 4). Since inference with LLMs is not free [85, 86], it is important to prioritize the constraints to solve. However, constraint prioritization is non-trivial. An ideal prioritization strategy should not only favor those having greater potential to reach more code regions, but also account for the solving difficulty for LLMs. This is because we notice that LLMs may easily solve some constraints, but fail entirely with others due to lacking reasoning capabilities or inherently unsolvable problems (e.g., one-way functions). To date, effectively satisfying these two properties to maximize coverage gain remains an open challenge.
  • Second, it is non-trivial to decide the degree of context presentation. To instruct an LLM to solve a constraint, we need to provide the constraint context (e.g., source code) for it to reason about. Presenting the entire code base is impractical, considering the magnitude of modern language processor code bases. Presenting execution traces or slices is not ideal either because their sizes can still be huge, with total lines of code easily exceeding 10k in our experiments. On the flip side, presenting only a narrow context (e.g., the function enclosing the constraint’s evaluation point) can miss necessary information for the LLM to understand the full picture of the constraint. Therefore, it remains unclear how to present sufficient and concise context to achieve both effectiveness and efficiency.

... (Later they give a more detailed explanation of the above-mentioned common solutions and of the challenges in applying LLMs to the problem of complex constraints.)

  • Constraint Prioritization. The first challenge we identify is that it is unclear how to prioritize unsolved constraints to maximize coverage gain. Language processors can contain up to millions of lines of code (see Table 4). It is important to prioritize the unsolved constraints to solve (i.e., the uncovered lines to cover), because inference with LLMs is not free [85, 86], and it is by far impractical to ask LLMs to reason about all unsolved constraints. However, constraint prioritization is non-trivial. An ideal prioritization strategy should satisfy the following two properties. First, it should prioritize the constraints having greater potential to reach more code regions if solved. For instance, in Fig.2, consider a scenario where line 20 and line 27 are both uncovered. Since line 27 can lead to additional uncovered regions, such as deeper functions for processing local structs, while line 20 merely bails out, prioritizing reaching line 27 is more effective. Second, prioritization should account for the solving difficulty. LLMs may easily solve some constraints but fail entirely with others. Solving success rates can drop when constraints involve complex logic beyond the LLM’s reasoning capabilities, or when they involve inherently hard-to-solve problems (e.g., one-way functions). To remain cost-effective, constraints that LLMs struggle to solve should be deprioritized, even if they have the potential to cover more code regions. To date, effectively satisfying these two properties to maximize coverage gain remains an open challenge.
  • Context Granularity. After we decide which constraint to solve, the next step is to collect relevant context, structure it into a prompt, and query the LLM for results. To work with LLMs, we think the best context for a constraint is the relevant source code. However, it is non-trivial to decide the source granularity to present to the LLM. Presenting the entire code base is not cost-effective, considering the magnitude of modern language processor code bases (see Table 4). A smarter approach would be using dynamic program slicing [87, 94, 95] to extract only code relevant to the branching condition (which guards uncovered code regions) by setting the branching condition as the slicing criterion. However, we notice this approach is not ideal either. The dynamic slice of a language processor can still be prohibitively huge. In fact, setting any branching condition in one stage (see Fig.1) as the slicing criterion should result in the majority of all previous stages being included in the dynamic slice. For instance, tokenization and parsing stages are both relevant to any branching condition in the code generation stage, as inputs failing to be tokenized or parsed can never trigger code generation. It is not ideal to present the entirety of the tokenizer and parser as context for constraints in the code generation stage. Additionally, even if the cost budget and LLM context window size support large slices, presenting redundant information can cause more hallucinations [6]. On the flip side, presenting only a narrow context, such as the function containing the uncovered code region (i.e., the enclosing function), can miss necessary information for understanding the constraint clearly. For instance, prior to reaching the enclosing function, there could be early constraint checks that reject the input. To conclude, how to present sufficient and concise context to achieve both effectiveness and efficiency remains unclear.

... (They give their solutions as follows.)

We propose two solutions to solve the two aforementioned challenges respectively, hybrid centrality prioritization and iterative context construction. Next, we discuss our insights.

  • Hybrid Centrality Prioritization. As discussed in §2.4, an ideal prioritization algorithm should (i) prioritize constraints having greater potential to trigger more code regions and (ii) deprioritize constraints that LLMs may not be able to solve.
    • To address (i), in the field of fuzzing seed scheduling, researchers have faced similar challenges in measuring the potential to reach more code regions [7, 57, 69]. An effective solution is graph centrality analysis, a method to estimate a node’s influence in a graph. Researchers have shown that after building the inter-procedural Control Flow Graph (CFG), centrality analysis can effectively measure one basic block’s potential to reach more other basic blocks [69]. Inspired by this, we extend the existing approach and design a centrality analysis mechanism to prioritize unsolved constraints.
    • To address (ii), we need a way to measure the constraint solving difficulty. The first approach we consider is using graph complexity analysis [98] to approximate constraint complexity, which seems to correlate with solving difficulty. However, in reality, we notice that pretrained LLMs are capable of picking up hints in the context that immediately enable them to solve the constraints, without thoroughly analyzing the constraints in detail. For instance, in Fig.2, imagine a scenario where lines after line 20 are not covered due to failure to pass the constraints evaluated at line 19. In our experiment, LLMs (e.g., GPT-4o-mini) can draw insights from just the function names (e.g., ts_node_has_error, tree_sitter_cpp) to realize that the input should be a syntax-correct C++ source, and then proceed to generate an input that successfully triggers line 21. However, graph complexity analysis would consider the parsing stage as a complex operation and neglect such hints. Moreover, asking the LLMs themselves to score each constraint isn’t viable because it is neither cost-effective nor reliable (because it is hard to verify the scores). Instead, we propose approximating solving difficulty using past solving history: the more LLMs fail to solve a certain type of constraint, the harder it is for LLMs to solve similar constraints in the future. Finally, we use the approximated difficulty to adjust the centrality score, so that the final scores start high for constraints with higher centrality scores but are scaled down later if the constraints appear hard to solve.
  • Iterative Context Construction. Considering the "picking up hints" capability discussed above, we deem it impractical to decide how much context is sufficient for an LLM a priori. Instead, we draw insights from recent works on LLMs for software engineering [19, 93], which show that LLMs can improve and correct their code generation results iteratively with feedback information. ==To provide a sufficient and concise context, we start by presenting the LLM with a narrow context around the constraint evaluation point (i.e., the enclosing function). We then instruct the LLM to either solve the constraint or request additional context (e.g., symbol definitions).== This approach allows us to leverage dynamic and static analysis techniques to iteratively expand the context. If the LLM provides a solution, we verify its correctness. In case of failure, the execution trace is analyzed to identify a failing point’s context, which is then fed back to the LLM. When the LLM requests more context, source code analysis tools [50] are employed to supply the requested context. Through this process, we gradually construct the context until reaching a successful solution or an iteration threshold.
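Condensing both ideas into one hedged sketch (the decay constant, stub helpers, constraint attributes, and prompt wording are assumptions about the shape of the approach, not the authors' implementation):

```python
# Sketch of hybrid centrality prioritization + iterative context construction.
# Centrality is assumed precomputed on the inter-procedural CFG; all stubs
# are stand-ins for the paper's dynamic/static analysis components.
import math

def enclosing_function_source(c): ...    # stand-in: source of enclosing function
def is_candidate_input(reply): ...       # stand-in: classify the LLM's reply
def reaches_target(inp, c): ...          # stand-in: execute and check coverage
def failing_point_context(inp, c): ...   # stand-in: context from execution trace
def symbol_definition(name): ...         # stand-in: via source-analysis tooling

failures = {}  # constraint kind -> count of past LLM solving failures

def priority(constraint) -> float:
    # (i) potential: centrality of the guarded block in the CFG;
    # (ii) difficulty: exponentially downweight kinds of constraints the
    # LLM has repeatedly failed to solve.
    return constraint.centrality * math.exp(-failures.get(constraint.kind, 0))

def solve_with_iterative_context(constraint, llm, max_rounds=5):
    context = enclosing_function_source(constraint)  # start narrow
    for _ in range(max_rounds):
        reply = llm("Context:\n" + context + "\nProduce an input that reaches "
                    "the guarded line, or name a symbol definition you need.")
        if is_candidate_input(reply):
            if reaches_target(reply, constraint):    # verify by execution
                return reply
            context += failing_point_context(reply, constraint)  # feedback
        else:
            context += symbol_definition(reply)      # expand on request
    failures[constraint.kind] = failures.get(constraint.kind, 0) + 1
    return None
```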


Fuzzing Under Multi-Agent LLM

↗ LLM Agents, AI Workflow, & Agentic MLLM

Ref