
Syzkaller Bug Crawler

A specialized Python tool for crawling and analyzing syzkaller bug reports, with a focus on vulnerability research and data collection.

Purpose

This project automates the extraction, filtering, and analysis of kernel vulnerability reports from syzkaller's database. It enables systematic security research by collecting structured data about kernel bugs, their reproducer code, and fix information.

Key Features

  • Automated Bug Collection: Retrieves bug entries from syzkaller summary pages
  • Intelligent Filtering: Selects bugs based on C reproducer availability and architecture (x86)
  • Detail Page Enhancement: Visits each bug's detail page to extract reproducer URLs, kernel config, disk images, and commit info
  • Linux Version Lookup: Maps kernel commit hashes to version tags via linux_versions.csv
  • Structured Output: Exports data as JSON with metadata (timestamp, bug count, column list)
  • Respectful Crawling: Implements rate limiting, retry with exponential backoff, and proper HTTP headers

Methodology

Data Collection Process

  1. Source Identification: Query syzkaller summary pages

    • Linux 6.1: https://syzkaller.appspot.com/linux-6.1/fixed
    • All Linux versions: https://syzkaller.appspot.com/upstream/fixed
    • Focus on fixed bugs for confirmed vulnerabilities
  2. Reproducer Filtering: Select bugs providing C reproducers

    • Ensures reproducible crash conditions
    • Enables further analysis and validation
  3. Architecture Filtering: Target x86-specific vulnerabilities

    • Excludes ARM-specific issues
    • Focuses on the most common server/desktop architecture
  4. Detail Page Enhancement: For each selected bug, extract from its detail page:

    • C reproducer URL
    • Kernel configuration (.config) URL
    • Disk image URL (if available)
    • Linux kernel commit hash
    • Linux version tag (looked up from commit hash)
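
The version lookup in step 4 can be sketched as a linear scan over linux_versions.csv (a minimal sketch assuming the version,commit column layout produced by the generation command in the Installation section; the function name is illustrative, not the project's actual API):

```python
import csv

def lookup_linux_version(commit_hash: str, csv_path: str = "linux_versions.csv"):
    """Return the version tag whose commit matches the (possibly abbreviated) hash."""
    with open(csv_path, newline="") as f:
        # rows look like: {"version": "v6.1.55", "commit": "<full sha1>"}
        for row in csv.DictReader(f):
            if row["commit"].startswith(commit_hash):
                return row["version"]
    return None  # commit not found in the mapping
```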

Best Practices Implemented

  • Rate Limiting: Implements respectful crawling with appropriate delays
  • Retry Logic: Exponential backoff with up to 3 attempts for failed requests
  • Error Handling: Robust error recovery and logging
  • Data Validation: Verifies mainline kernel commits and filters non-mainline forks
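
The retry pattern above can be sketched roughly like this (illustrative only, not the project's actual implementation; the delay constants, User-Agent string, and use of the standard library's urllib rather than a third-party HTTP client are assumptions):

```python
import time
import urllib.request
from urllib.error import URLError

def fetch_with_retry(url: str, max_attempts: int = 3, base_delay: float = 1.0):
    """GET a page, retrying with exponential backoff; returns bytes or None."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "syzkaller-bug-crawler (research use)"}
    )
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except URLError:
            if attempt == max_attempts - 1:
                return None  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, ...
    return None
```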

Project Structure

syzkaller_bug_crawler/
├── syzkaller_crawler.py     # Core crawler implementation
├── pyproject.toml           # Project configuration and dependencies
├── uv.lock                  # Dependency lock file
├── linux_versions.csv       # Commit hash to Linux version tag mappings
├── .gitignore
├── README.md
└── data/                    # Output directory (gitignored, created during execution)
    ├── raw_bugs/            # Unfiltered bug table from listing page
    └── filtered_bugs/       # Filtered and enhanced bug data

Quick Start

Prerequisites

  • Python 3.10+
  • uv (Python package manager)

Installation

  1. Clone the repository:
git clone https://github.com/MirageTurtle/syzkaller_bug_crawler.git
cd syzkaller_bug_crawler
  2. Install dependencies using uv:
uv sync
  3. Generate the latest Linux version mappings:
(echo "version,commit" && \
git ls-remote https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git "refs/tags/*" \
  | grep '\^{}' | grep -v '\-rc' | grep -v '\-tree' \
  | sed 's/\^{}//' \
  | awk '{gsub("refs/tags/", "", $2); print $2 "," $1}' \
  | grep -E '^v[0-9]+\.[0-9]+\.[0-9]+,') > linux_versions.csv

Basic Usage

from syzkaller_crawler import get_syzkaller_bug_table

# Retrieve bug table for Linux 6.1 fixed bugs
df = get_syzkaller_bug_table("https://syzkaller.appspot.com/linux-6.1/fixed")

if df is not None:
    print(f"Retrieved {len(df)} bugs")
    print(df.head())
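
Filtering can then be applied to the returned DataFrame. A sketch with a toy DataFrame standing in for the crawled table (the reproducer_type column name matches the Data Fields section, but treating the raw listing table as having that column is an assumption):

```python
import pandas as pd

# Toy stand-in for the crawled bug table
df = pd.DataFrame({
    "bug_id": ["abc123", "def456", "789aaa"],
    "reproducer_type": ["c", "syz", "c"],
})

# Keep only bugs that ship a C reproducer
c_bugs = df[df["reproducer_type"] == "c"]
print(f"{len(c_bugs)} of {len(df)} bugs have a C reproducer")
```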

Command Line Usage

# Default crawl (Linux 6.1 fixed bugs)
uv run python syzkaller_crawler.py

# With filters
uv run python syzkaller_crawler.py --c-reproducer-only --arch x86

# Custom URL and output directory
uv run python syzkaller_crawler.py --url "https://syzkaller.appspot.com/upstream/fixed" --output results

# Test mode (first 50 bugs only)
uv run python syzkaller_crawler.py --test

Data Fields

Each bug record in the output JSON contains the following fields:

  • bug_id: Unique identifier extracted from syzkaller bug URL
  • title: Bug description and crash details
  • bug_url: Link to the bug report on syzkaller
  • reproducer_type: Type of reproducer available (c, syz, or unknown)
  • row_index: Original row position in the syzkaller listing table
  • reproducer_url: Direct link to the C reproducer code (from detail page)
  • config_url: Link to the kernel configuration file (from detail page)
  • disk_url: Link to the disk image for reproduction (from detail page, if available)
  • linux_commit: Kernel commit hash associated with the crash (from detail page)
  • linux_version: Linux version tag mapped from the commit hash (if found in linux_versions.csv)
  • architecture: Target CPU architecture (x86 or arm, from detail page)
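
Consuming an exported file might look like this (a sketch; the top-level key names timestamp, bug_count, columns, and bugs are assumptions inferred from the metadata described in Key Features, and the helper name is illustrative):

```python
import json

def load_bug_report(path: str):
    """Load an exported JSON file; returns (metadata dict, list of bug records)."""
    with open(path) as f:
        data = json.load(f)
    # assumed layout: metadata keys alongside a "bugs" list
    meta = {k: data[k] for k in ("timestamp", "bug_count", "columns") if k in data}
    return meta, data.get("bugs", [])

# usage (path illustrative): meta, bugs = load_bug_report("data/filtered_bugs/bugs.json")
```

Each record carries the fields listed above, so downstream analysis can index by bug_id.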

Development

Code Style

The project follows standard Python conventions. For development tools:

# Install development dependencies
uv add --dev pytest ruff

# Run tests
uv run pytest

# Format and lint code
uv run ruff check --fix .
uv run ruff format .

Known Limitations

  • Mainline kernel commit filter: The detail page enhancement only accepts mainline (torvalds/linux.git) commit URLs, so crawling stable branch pages (e.g. linux-6.1/fixed) where commits point to stable/linux-stable.git will result in all bugs being filtered out.

Future Enhancements

  • Batch processing and scheduling

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
