A specialized Python tool for crawling and analyzing syzkaller bug reports, with a focus on vulnerability research and data collection.
This project provides automated capabilities to extract, filter, and analyze kernel vulnerability reports from syzkaller's database. It enables systematic security research by collecting structured data about kernel bugs, their reproducer code, and fix information.
- Automated Bug Collection: Retrieves bug entries from syzkaller summary pages
- Intelligent Filtering: Selects bugs based on C reproducer availability and architecture (x86)
- Detail Page Enhancement: Visits each bug's detail page to extract reproducer URLs, kernel config, disk images, and commit info
- Linux Version Lookup: Maps kernel commit hashes to version tags via linux_versions.csv
- Structured Output: Exports data as JSON with metadata (timestamp, bug count, column list)
- Respectful Crawling: Implements rate limiting, retry with exponential backoff, and proper HTTP headers
- Source Identification: Query syzkaller summary pages
  - Linux 6.1: https://syzkaller.appspot.com/linux-6.1/fixed
  - All Linux versions: https://syzkaller.appspot.com/upstream/fixed
  - Focus on fixed bugs for confirmed vulnerabilities
- Reproducer Filtering: Select bugs providing C reproducers
  - Ensures reproducible crash conditions
  - Enables further analysis and validation
- Architecture Filtering: Target x86-specific vulnerabilities
  - Excludes ARM-specific issues
  - Focuses on most common server/desktop architecture
- Detail Page Enhancement: For each selected bug, extract from its detail page:
  - C reproducer URL
  - Kernel configuration (.config) URL
  - Disk image URL (if available)
  - Linux kernel commit hash
  - Linux version tag (looked up from commit hash)
- Rate Limiting: Implements respectful crawling with appropriate delays
- Retry Logic: Exponential backoff with up to 3 attempts for failed requests
- Error Handling: Robust error recovery and logging
- Data Validation: Verifies mainline kernel commits and filters non-mainline forks
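The retry behaviour listed above (exponential backoff, up to 3 attempts) could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's implementation: `fetch_with_retry`, its parameters, and the 1-second base delay are hypothetical, and the fetch function is injected so the logic can be exercised without network access.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[str], str], url: str,
                     max_attempts: int = 3, base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch(url), waiting exponentially longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Possible wiring with the standard library (not executed here);
# urllib.error.URLError is a subclass of OSError, so it triggers a retry:
# from urllib.request import urlopen
# html = fetch_with_retry(lambda u: urlopen(u, timeout=30).read().decode(), url)
```

Injecting `sleep` also makes it easy to add a fixed delay between successful requests for rate limiting, or to replace it with a no-op in tests.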
syzkaller_bug_crawler/
├── syzkaller_crawler.py # Core crawler implementation
├── pyproject.toml # Project configuration and dependencies
├── uv.lock # Dependency lock file
├── linux_versions.csv # Commit hash to Linux version tag mappings
├── .gitignore
├── README.md
└── data/ # Output directory (gitignored, created during execution)
    ├── raw_bugs/ # Unfiltered bug table from listing page
    └── filtered_bugs/ # Filtered and enhanced bug data
- Python 3.10+
- uv (Python package manager)
- Clone the repository:
git clone https://github.com/MirageTurtle/syzkaller_bug_crawler.git
cd syzkaller_bug_crawler
- Install dependencies using uv:
uv sync
- Generate the newest Linux version mappings:
(echo "version,commit" && \
git ls-remote https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git "refs/tags/*" \
| grep '\^{}' | grep -v '\-rc' | grep -v '\-tree' \
| sed 's/\^{}//' \
| awk '{gsub("refs/tags/", "", $2); print $2 "," $1}' \
| grep -E '^v[0-9]+\.[0-9]+\.[0-9]+,') > linux_versions.csv

from syzkaller_crawler import get_syzkaller_bug_table
# Retrieve bug table for Linux 6.1 fixed bugs
df = get_syzkaller_bug_table("https://syzkaller.appspot.com/linux-6.1/fixed")
if df is not None:
    print(f"Retrieved {len(df)} bugs")
    print(df.head())

# Default crawl (Linux 6.1 fixed bugs)
uv run python syzkaller_crawler.py
# With filters
uv run python syzkaller_crawler.py --c-reproducer-only --arch x86
# Custom URL and output directory
uv run python syzkaller_crawler.py --url "https://syzkaller.appspot.com/upstream/fixed" --output results
# Test mode (first 50 bugs only)
uv run python syzkaller_crawler.py --test

Each bug record in the output JSON contains the following fields:
- bug_id: Unique identifier extracted from syzkaller bug URL
- title: Bug description and crash details
- bug_url: Link to the bug report on syzkaller
- reproducer_type: Type of reproducer available (c, syz, or unknown)
- row_index: Original row position in the syzkaller listing table
- reproducer_url: Direct link to the C reproducer code (from detail page)
- config_url: Link to the kernel configuration file (from detail page)
- disk_url: Link to the disk image for reproduction (from detail page, if available)
- linux_commit: Kernel commit hash associated with the crash (from detail page)
- linux_version: Linux version tag mapped from the commit hash (if found in linux_versions.csv)
- architecture: Target CPU architecture (x86 or arm, from detail page)
The project follows standard Python conventions. For development tools:
# Install development dependencies
uv add --dev pytest ruff
# Run tests
uv run pytest
# Format and lint code
uv run ruff check --fix .
uv run ruff format .

- Mainline kernel commit filter: The detail page enhancement only accepts mainline (torvalds/linux.git) commit URLs, so crawling stable branch pages (e.g. linux-6.1/fixed), where commits point to stable/linux-stable.git, will result in all bugs being filtered out.
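To make the limitation above concrete, here is a hedged sketch of what a mainline-only check might look like. `is_mainline_commit_url` is a hypothetical helper, not the crawler's actual function, and the URLs use a placeholder commit id; the git.kernel.org path layout shown is an assumption about how commit links are formed.

```python
from urllib.parse import urlparse

# Repository path fragment that identifies the mainline tree.
MAINLINE_REPO = "torvalds/linux.git"


def is_mainline_commit_url(url: str) -> bool:
    """Return True if the commit URL points at the mainline kernel tree."""
    return MAINLINE_REPO in urlparse(url).path


# Placeholder URLs (the id value is made up).
mainline = ("https://git.kernel.org/pub/scm/linux/kernel/git/"
            "torvalds/linux.git/commit/?id=abc123")
stable = ("https://git.kernel.org/pub/scm/linux/kernel/git/"
          "stable/linux-stable.git/commit/?id=abc123")

print(is_mainline_commit_url(mainline))  # True
print(is_mainline_commit_url(stable))    # False
```

Under a check like this, every bug whose commit link points at the stable tree is rejected, which is why a stable branch listing yields an empty result.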
- Batch processing and scheduling
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.