Add ClusterFuzzLite for continuous fuzz testing #1282
Conversation
Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.
Co-authored-by: bact <128572+bact@users.noreply.github.com>
@copilot People will be confused by the Dockerfile* files. We should name them properly, indicating which one is for normal use and which one is for fuzzing. Alternatively, put the Dockerfile for fuzzing in the fuzz/ directory. What is the best pattern that is commonly used?
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Moved the fuzzing Dockerfile to
Pull request overview
This pull request adds ClusterFuzzLite infrastructure to enable continuous fuzz testing of PyThaiNLP functions, helping discover edge cases, crashes, and security vulnerabilities in text processing operations.
Changes:
- Added GitHub Actions workflow for automated fuzzing on PRs, pushes, and daily schedule
- Created fuzzing Docker environment with OSS-Fuzz base builder
- Implemented two initial fuzzing harnesses for `word_tokenize()` and `normalize()` functions
- Added build script and comprehensive documentation for fuzzing infrastructure
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| `.github/workflows/clusterfuzzlite.yml` | GitHub Actions workflow orchestrating ClusterFuzzLite fuzzing with AddressSanitizer |
| `fuzz/Dockerfile` | Docker configuration extending OSS-Fuzz base builder for Python projects |
| `fuzz/build.sh` | Build script for compiling fuzzing harnesses with atheris |
| `fuzz/fuzz_tokenize.py` | Fuzzing harness testing `word_tokenize()` with random Unicode input |
| `fuzz/fuzz_util_normalize.py` | Fuzzing harness testing `normalize()` with random Unicode input |
| `fuzz/README.md` | Documentation covering setup, local testing, CI/CD integration, and expansion roadmap |
@copilot apply suggestions in this thread #1282 (review)
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Applied all suggestions from the code review (499966b):
- Workflow changes:
- Fuzzer files:
- Build script:
- Dockerfile:
Two issues in fuzz/fuzz_util_normalize.py
The fuzz testing job can start now but it still fails. Fix all the issues.
- Rename `TestOneInput` to `test_one_input` (Python snake_case convention)
- Remove `TypeError` from the except clause in both fuzzers: since we raise `TypeError` ourselves for validation failures, catching it was silently swallowing those checks instead of reporting them as fuzzer findings
- Use `compile_python_fuzzer` instead of `cp` + `chmod` in build.sh: a bare `cp` produces a Python file without a proper executable wrapper, which fails ClusterFuzzLite's bad-build-check and causes "No fuzz targets found in out dir" (CI job 65987217516)
- Update README example to match the new naming and exception handling

Co-authored-by: bact <128572+bact@users.noreply.github.com>
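The `TypeError`-swallowing problem described above can be shown with a toy example. The function names here are hypothetical, not the PR's actual code; they only contrast the buggy and the fixed pattern:

```python
# Toy illustration: a harness that raises TypeError for its own output
# validation must not also catch TypeError, or the finding disappears.

def validate_swallowing(result):
    """Buggy pattern: the except clause eats our own validation error."""
    try:
        if not isinstance(result, list):
            raise TypeError("expected a list")
    except TypeError:
        pass  # the finding never reaches the fuzzer


def validate_propagating(result):
    """Fixed pattern: TypeError escapes and is reported as a finding."""
    if not isinstance(result, list):
        raise TypeError("expected a list")
```

With the buggy pattern, a bad return type passes silently; with the fixed one, the fuzzer records a crash and saves the triggering input.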
Fixed all three issues (06a94ea):
1. Renamed
2. Removed
3. Fixed "No fuzz targets found" (CI job 65987217516): The root cause was using
@copilot There are issues with the fuzz build and fuzz targets. Fix them. See the run log at https://github.com/PyThaiNLP/pythainlp/actions/runs/22751954112/job/65988614803?pr=1282
…install

PyInstaller (used by `compile_python_fuzzer`) cannot follow the editable install path hooks created by `pip install -e .`. When bundling the fuzzer executables, pythainlp is not found in the bundle, causing: `ModuleNotFoundError: No module named 'pythainlp'`

Fix: use `pip install .` so pythainlp lands in site-packages where PyInstaller's dependency scanner can discover and bundle it. Seen in CI job 65988614803.

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Fixed in d3d661f. The full error from job 65988614803:
Fix: changed the Dockerfile from
@copilot In fuzz/fuzz_tokenize.py and fuzz/fuzz_util_normalize.py, check if ValueError and UnicodeDecodeError are redundantly captured and fix them if necessary. The workflow can finish successfully now. Check the run log at
UnicodeDecodeError is a subclass of UnicodeError which is a subclass of ValueError (UnicodeDecodeError -> UnicodeError -> ValueError). Listing both in the except clause is therefore redundant; ValueError alone is sufficient to catch UnicodeDecodeError as well. The CI run (job 65991658653) completed successfully: both fuzz targets built and ran for 300 seconds with no crashes found. The only notable entry in the logs is a 401 on the coverage API which is a known ClusterFuzzLite limitation when no prior coverage artifact exists. Co-authored-by: bact <128572+bact@users.noreply.github.com>
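The subclass chain described above can be verified directly in Python; `decode_or_skip` is a hypothetical helper, included only to show the single-exception pattern:

```python
# UnicodeDecodeError -> UnicodeError -> ValueError: checking the chain
# shows why listing both in an except clause is redundant.
assert issubclass(UnicodeDecodeError, UnicodeError)
assert issubclass(UnicodeError, ValueError)


def decode_or_skip(data: bytes):
    """`except ValueError` alone is enough to catch decode failures."""
    try:
        return data.decode("utf-8")
    except ValueError:  # also catches UnicodeDecodeError
        return None
```

An `except (ValueError, UnicodeDecodeError)` clause behaves identically to `except ValueError`, so the shorter form is preferred.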
Fixed redundant exception in 88a86f3. The workflow run (job 65991658653) looks good overall:
@copilot Can we fix that?
…rts()

Without `atheris.instrument_imports()`, the Python bytecode is not instrumented for coverage, so libFuzzer receives no feedback on which code paths are hit and cannot guide mutation. This causes the warning: `WARNING: no interesting inputs were found so far. Is the code instrumented for coverage?`

Fix: wrap the pythainlp imports in both harnesses inside a `with atheris.instrument_imports():` block so atheris instruments the module bytecode at import time.

Also update README:
- Fix the example code to use `atheris.instrument_imports()`
- Fix example except clause: `ValueError` alone covers `UnicodeDecodeError` (UnicodeDecodeError -> UnicodeError -> ValueError)
- Add "Known warnings on first run" section documenting:
  - gh-pages branch bootstrap message (one-time, benign)
  - coverage instrumentation warning (fixed by this commit)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Fixed in 9b853f7. Coverage instrumentation warning: the root cause was that the pythainlp imports were not instrumented; both harnesses now wrap them:

```python
with atheris.instrument_imports():
    import pythainlp.tokenize  # or pythainlp.util
```
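Beyond instrumentation, a harness also has to turn the raw fuzzer bytes into a `str`. A common approach uses `atheris.FuzzedDataProvider.ConsumeUnicodeNoSurrogates()`; the sketch below is illustrative, with a plain UTF-8 decode as a stand-in when atheris is unavailable, and `max_len` is an assumed cap, not PR code:

```python
# Hedged sketch: converting raw fuzzer bytes into text for word_tokenize()
# or normalize(). The fallback branch is a stand-in for environments
# without atheris installed.

def bytes_to_text(data: bytes, max_len: int = 4096) -> str:
    try:
        import atheris
        fdp = atheris.FuzzedDataProvider(data)
        return fdp.ConsumeUnicodeNoSurrogates(max_len)
    except ImportError:
        return data[:max_len].decode("utf-8", errors="replace")
```

The provider-based path avoids discarding inputs that are not valid UTF-8, which keeps more of the fuzzer's mutations usable as test cases.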
Updated headings to follow consistent casing and formatting. Adjusted sections related to directory structure, fuzzing targets, local testing, CI/CD integration, adding new fuzzers, and best practices for corpus storage.
Update README to clarify error message during workflow run.
@bact Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub. You can ask me to try again later by mentioning me in a new comment. If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: Sorry for the inconvenience!


What do these changes do
Adds ClusterFuzzLite infrastructure to continuously fuzz test PyThaiNLP functions for crashes, edge cases, and security vulnerabilities, along with safe corpus management practices.
What was wrong
PyThaiNLP lacked automated fuzz testing to discover memory safety issues, Unicode handling bugs, and unexpected crashes in text processing functions. The fuzzing setup also lacked guidance on corpus handling and had no protection against committing large/duplicate corpus files, sensitive data, or crash-triggering inputs to version control.
Several CI build failures and warnings were encountered and resolved during setup:
- The initial configuration placed `Dockerfile` and `build.sh` in `fuzz/` with a custom `dockerfile-path` parameter, but ClusterFuzzLite hardcodes the build context to `.clusterfuzzlite/Dockerfile` regardless of that parameter, causing `lstat .clusterfuzzlite: no such file or directory`.
- After fixing the directory structure, the build failed with `No fuzz targets found in out dir` because the build script used `cp` + `chmod` to place Python files in `$OUT`. ClusterFuzzLite's bad-build-check requires proper libFuzzer executables; a plain Python script without an executable wrapper fails this check.
- After switching to `compile_python_fuzzer`, the executables crashed at startup with `ModuleNotFoundError: No module named 'pythainlp'`. PyInstaller (used internally by `compile_python_fuzzer`) cannot follow the path hooks created by an editable install (`pip install -e .`), so pythainlp was missing from the bundled executables.
- Without wrapping the target module imports in `atheris.instrument_imports()`, the Python bytecode was not instrumented for coverage, causing libFuzzer to print `WARNING: no interesting inputs were found so far. Is the code instrumented for coverage?` and preventing coverage-guided mutation.
- Additionally, `TestOneInput` did not follow Python naming conventions, and `TypeError` was being caught in the `except` clause after being explicitly raised for output validation, silencing those checks. The `except` clauses also redundantly listed `UnicodeDecodeError` alongside `ValueError`, since `UnicodeDecodeError` is a subclass of `UnicodeError`, which is a subclass of `ValueError`.

How this fixes it
GitHub Actions Workflow (`.github/workflows/clusterfuzzlite.yml`)
- Runs on PRs to `dev` (code-change mode), pushes to `dev` (batch mode), and daily at 06:00 UTC (batch mode)
- `contents: write` permission for storing corpus and coverage data to gh-pages

ClusterFuzzLite Configuration (`.clusterfuzzlite/`)
- `Dockerfile` — required location for ClusterFuzzLite's fuzzing Docker image (extends `gcr.io/oss-fuzz-base/base-builder-python`); installs pythainlp with `pip install .` (non-editable) so PyInstaller can discover and bundle it
- `build.sh` — uses `compile_python_fuzzer` to create proper libFuzzer executable wrappers for each `fuzz/fuzz_*.py` harness; uses pinned `atheris==2.3.0` for supply-chain security

Fuzzing Harnesses (`fuzz/`)
- `fuzz_tokenize.py` — tests `word_tokenize()` with random Unicode input
- `fuzz_util_normalize.py` — tests `normalize()` with random Unicode input
- Both wrap imports in `with atheris.instrument_imports():` to enable coverage-guided fuzzing; use `test_one_input` (Python snake_case convention); only catch `ValueError` (which already covers `UnicodeDecodeError` via the exception hierarchy); `TypeError` raised by output validation propagates as a fuzzer finding
- `README.md` — local testing guide, directory structure documentation, corpus best practices, expansion roadmap, and known first-run warnings (one-time `gh-pages` branch creation message, coverage instrumentation requirement)

Corpus Storage Best Practices
- `.gitignore` updated to exclude local fuzzing artifacts (`fuzz/corpus/`, `fuzz/crashes/`, `crash-*`, `leak-*`, `timeout-*`, `oom-*`, etc.)
- `fuzz/README.md` documents minimize/de-duplicate procedures, data sanitization requirements, dedicated storage approach (gh-pages branch), crash handling workflow, and security considerations

Your checklist for this pull request
Original prompt
This pull request was created from Copilot chat.