- Project Overview
- Directory Structure
- Core Components
- Pipeline Workflows
- Data Flow
- Extending KNighter
KNighter is an LLM-powered static analysis checker synthesis tool that automatically generates Clang Static Analyzer (CSA) checkers from historical patch commits. The system transforms bug fix patches from the Linux kernel into functional static analysis checkers through a multi-stage AI-driven pipeline.
- 🔍 Patch Analysis: Extracts bug patterns from Git commits
- 🤖 LLM-Powered Generation: Uses multiple AI models for checker synthesis
- 🔧 Automatic Refinement: Self-improves checkers by reducing false positives
- 📊 Batch Processing: Handles multiple commits and checkers efficiently
- 🚀 Production Ready: Integrates with LLVM and Linux kernel builds
knighter-dev/
├── 📁 src/ # Core application code
│ ├── 🐍 main.py # Entry point & CLI interface
│ ├── 🤖 agent.py # LLM orchestration & prompting
│ ├── ⚙️ checker_gen.py # Checker generation pipeline
│ ├── 🔄 checker_refine.py # Checker refinement & improvement
│ ├── 🔍 checker_scan.py # Static analysis execution
│ ├── 📊 checker_data.py # Data models & persistence
│ ├── 🎯 model.py # Multi-LLM client management
│ ├── 🌐 global_config.py # Configuration management
│ ├── 🛠️ tools.py # Utility functions
│ ├── 🏷️ commit_label.py # Commit classification
│ ├── 📁 backends/ # Analysis backend abstraction
│ │ └── 🔧 csa.py # Clang Static Analyzer backend
│ ├── 📁 targets/ # Target system abstraction
│ │ └── 🐧 linux.py # Linux kernel integration
│ ├── 📁 tests/ # Test suite
│ └── 📁 kparser/ # Code parsing utilities
│ ├── 🌳 kparser.py # Tree-sitter integration
│ ├── ⚡ kfunction.py # Function analysis
│ └── 📁 tree-sitter-cpp/ # C++ parser (submodule)
├── 📁 checker_database/ # Pre-existing checker examples
│ ├── 📁 ArrayBoundChecker/ # Example: Array bounds checking
│ │ ├── 📄 checker.cpp # Checker implementation
│ │ ├── 📝 pattern.md # Bug pattern description
│ │ ├── 📋 plan.md # Implementation plan
│ │ ├── 🧠 pattern_embedding.pt # Vector embedding
│ │ └── 🧠 plan_embedding.pt # Vector embedding
│ └── ... (100+ more checkers)
├── 📁 prompt_template/ # LLM prompt engineering
│ ├── 📄 patch2pattern.md # Patch → Pattern prompts
│ ├── 📄 pattern2plan.md # Pattern → Plan prompts
│ ├── 📄 plan2checker.md # Plan → Code prompts
│ ├── 📁 knowledge/ # Domain knowledge base
│ │ ├── 💡 utility.md # Helper functions
│ │ ├── 💡 suggestions.md # Best practices
│ │ └── 💡 template.md # Code templates
│ └── 📁 examples/ # Few-shot learning examples
├── 📁 scripts/ # Automation & setup scripts
│ ├── 🔧 setup_llvm.py # LLVM environment setup
│ ├── 📊 collect_commits.py # Commit harvesting
│ ├── ✅ collect_valid_checkers.py # Checker validation
│ ├── 📈 count_errors.py # Error analysis
│ └── 🔢 count_tokens.py # Token usage tracking
├── 📁 llvm_utils/ # LLVM integration utilities
│ ├── 🏗️ create_plugin.py # Plugin creation
│ ├── 🛠️ utility.cpp # C++ utility functions
│ └── 🛠️ utility.h # C++ utility headers
├── 📁 commits/ # Commit data files
│ ├── 📄 commits.txt # Evaluation commits
│ └── 📄 commits-sampled.txt # Sampled commits for ablation
├── 📁 assets/ # Documentation assets
├── 📄 config-example.yaml # Configuration template
├── 📄 requirements.txt # Python dependencies
├── 🐳 Dockerfile # Container definition
├── 🐳 docker-compose.yml # Container orchestration
└── 📚 README.md # Project documentation
Purpose: Main application entry point with CLI interface
Key Features:
- 5 operational modes: gen, refine, scan, triage, label
- Fire-based CLI with automatic help generation
- Configuration initialization and validation
- Error handling and logging setup
Usage:
# Generate checkers from commits
python main.py gen --commit_file=commits.txt --config_file=config-generate.yaml
# Refine existing checkers
# This will scan the kernel with the checkers and refine them
python main.py refine --checker_dir=results/ --config_file=config-refine.yaml
# Triage the report
python main.py triage --config_file=config-refine.yaml results-refine/
# Scan kernel with a single checker only
python main.py scan_single --config_file=config-scan.yaml checker_file.cpp
Purpose: Coordinates LLM interactions for the generation pipeline
Key Functions:
- patch2pattern(): Extracts bug patterns from patches
- pattern2plan(): Creates implementation plans
- plan2checker(): Generates checker code
- repair_checker(): Fixes compilation errors
- Template loading and prompt engineering
Data Flow:
Patch → patch2pattern() → Bug Pattern
Bug Pattern → pattern2plan() → Implementation Plan
Implementation Plan → plan2checker() → Checker Code
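The three-stage flow above can be sketched as a chain of functions in which each stage consumes the previous stage's output. This is a minimal illustration, not the code in src/agent.py: the `LLM` type, the prompt strings, the function signatures, and the `synthesize` and `echo_llm` helpers are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def patch2pattern(llm: LLM, patch: str) -> str:
    """Stage 1: ask the model to describe the bug pattern behind a patch."""
    return llm(f"Describe the bug pattern fixed by this patch:\n{patch}")


def pattern2plan(llm: LLM, pattern: str) -> str:
    """Stage 2: turn the bug pattern into an implementation plan."""
    return llm(f"Write a checker implementation plan for this pattern:\n{pattern}")


def plan2checker(llm: LLM, plan: str) -> str:
    """Stage 3: turn the plan into checker source code."""
    return llm(f"Write a CSA checker that follows this plan:\n{plan}")


def synthesize(llm: LLM, patch: str) -> str:
    """Chain the three stages; each stage consumes the previous stage's output."""
    pattern = patch2pattern(llm, patch)
    plan = pattern2plan(llm, pattern)
    return plan2checker(llm, plan)


def echo_llm(prompt: str) -> str:
    """Stand-in model for local experiments: returns the first line of the prompt."""
    return prompt.splitlines()[0]
```

Passing the model as a plain callable keeps the stages easy to test with a stand-in such as `echo_llm` before wiring in a real provider.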
Purpose: End-to-end checker synthesis from commits
Pipeline Stages:
- Pattern Extraction: Analyze patch to identify vulnerability pattern
- Plan Generation: Create detailed implementation strategy
- Code Generation: Produce compilable CSA checker
- Syntax Repair: Fix compilation errors iteratively
- Validation: Test against known examples
Key Classes:
- CheckerGenerator: Main pipeline orchestrator
- GenerationConfig: Pipeline configuration
- GenerationResult: Output metadata and statistics
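The iterative syntax-repair stage can be pictured as a bounded compile-and-repair loop. The sketch below is an assumption about the general shape, not the logic in src/checker_gen.py: `compile_fn`, `repair_fn`, and `max_attempts` are hypothetical names, and a real run would call the LLVM build and an LLM rather than the callables passed in here.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: `compile_fn` returns (ok, error_log);
# `repair_fn` takes the current code plus the error log and returns new code.
CompileFn = Callable[[str], Tuple[bool, str]]
RepairFn = Callable[[str, str], str]


def repair_until_compiles(
    code: str,
    compile_fn: CompileFn,
    repair_fn: RepairFn,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return code that compiles, or None once the repair budget is spent."""
    for _ in range(max_attempts):
        ok, errors = compile_fn(code)
        if ok:
            return code
        # Feed the compiler diagnostics back into the repair step.
        code = repair_fn(code, errors)
    # Check the result of the last repair before giving up.
    ok, _ = compile_fn(code)
    return code if ok else None
```

Bounding the loop matters because each repair is an LLM call; an unbounded loop could spend tokens indefinitely on a checker that never compiles.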
Purpose: Improves generated checkers by reducing false positives
Refinement Process:
- Scanning: Execute checker against Linux kernel
- Report Analysis: Identify false positive patterns
- Refinement: Generate improved checker versions
- Validation: Re-test refined checkers
- Iteration: Repeat until acceptable accuracy
Key Functions:
- refine_checker(): Main refinement workflow
- analyze_reports(): FP pattern identification
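The scan, analyze, refine, and repeat cycle described above might look roughly like the loop below. It is a sketch under stated assumptions: `scan_fn` stands in for kernel scanning plus report triage and returns only the reports judged to be false positives, `refine_fn` stands in for the LLM refinement call, and `RefineState` and `max_attempts` are hypothetical names rather than the types used in src/checker_refine.py.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signatures for the two external steps.
ScanFn = Callable[[str], List[str]]          # checker code -> false-positive reports
RefineFn = Callable[[str, List[str]], str]   # checker code + FPs -> refined code


@dataclass
class RefineState:
    """Hypothetical record of one refinement run."""
    checker: str
    attempts: int = 0
    fp_history: List[int] = field(default_factory=list)  # FP count per scan


def refine_loop(
    checker: str,
    scan_fn: ScanFn,
    refine_fn: RefineFn,
    max_attempts: int = 3,
) -> Tuple[bool, RefineState]:
    """Refine until a scan reports no false positives or the budget runs out."""
    state = RefineState(checker=checker)
    while True:
        false_positives = scan_fn(state.checker)
        state.fp_history.append(len(false_positives))
        if not false_positives:
            return True, state       # refined checker is acceptable
        if state.attempts >= max_attempts:
            return False, state      # failed refinement
        state.checker = refine_fn(state.checker, false_positives)
        state.attempts += 1
```

Recording the false-positive count per scan makes it easy to see afterwards whether refinement was converging.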
Purpose: Clang Static Analyzer integration and abstraction
Key Capabilities:
- Checker compilation with LLVM build system
- Static analysis execution on Linux kernel
- Report generation and parsing
- Build system integration
- Error capture and diagnostics
Key Methods:
- build_checker(): Build checker plugin
- run_checker(): Execute static analysis
- validate_checker(): Validate checker
- extract_reports(): Extract bug reports
Purpose: Unified interface for multiple LLM providers
Supported Models:
- OpenAI: GPT-4o, o1-preview, o3-mini
- Anthropic: Claude-3.5-Sonnet
- Google: Gemini-1.5-Pro
- DeepSeek: DeepSeek-V3, DeepSeek-Reasoner
- Nvidia: Various models via API
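One way to put several providers behind a single interface is to route on the model name. The sketch below is illustrative only: the prefix table, `resolve_provider`, and `ModelRouter` are assumptions and do not describe how src/model.py selects a client. Nvidia-hosted models are left out because the list above does not name them.

```python
from typing import Callable, Dict

# Hypothetical prefix table; the real routing rules may differ.
PROVIDER_PREFIXES: Dict[str, str] = {
    "gpt": "openai",
    "o1": "openai",
    "o3": "openai",
    "claude": "anthropic",
    "gemini": "google",
    "deepseek": "deepseek",
}

# Hypothetical client signature: (model_name, prompt) -> completion text.
Client = Callable[[str, str], str]


def resolve_provider(model_name: str) -> str:
    """Map a model name to a provider by case-insensitive prefix match."""
    name = model_name.lower()
    for prefix, provider in PROVIDER_PREFIXES.items():
        if name.startswith(prefix):
            return provider
    raise ValueError(f"No provider registered for model {model_name!r}")


class ModelRouter:
    """Sends each request to the client registered for the model's provider."""

    def __init__(self) -> None:
        self._clients: Dict[str, Client] = {}

    def register(self, provider: str, client: Client) -> None:
        self._clients[provider] = client

    def complete(self, model_name: str, prompt: str) -> str:
        provider = resolve_provider(model_name)
        if provider not in self._clients:
            raise KeyError(f"No client registered for provider {provider!r}")
        return self._clients[provider](model_name, prompt)
```

With this shape, the rest of the pipeline only ever calls `complete()` and never needs to know which vendor SDK sits behind a model name.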
Purpose: Core data structures and persistence
Key Classes:
- CheckerData: Checker metadata and state
- RefineAttempt: Refinement iteration tracking
- ReportData: Bug report analysis results
- CheckerStatus: State machine for checker lifecycle
Persistence:
- YAML serialization for human-readable configs
- JSON for structured data exchange
- File-based storage with atomic writes
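File-based storage with atomic writes is commonly done by writing to a temporary file in the same directory and then renaming it over the destination. The following is a generic sketch of that technique using only the standard library; `atomic_write_json` is a hypothetical helper name, and the persistence code in src/checker_data.py may be organized differently.

```python
import json
import os
import tempfile
from typing import Any


def atomic_write_json(path: str, data: Any) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)     # atomically swap the new file into place
    except BaseException:
        # Clean up the partial temp file and re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies midway through, the destination file is left untouched, which matters for long batch runs that periodically checkpoint checker state.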
Purpose: Centralized configuration and environment setup
Configuration Sections:
- Paths: LLVM, Linux kernel, output directories
- Models: LLM provider settings and API keys
- Pipeline: Generation and refinement parameters
- Analysis: Scanning timeouts and resource limits
Singleton Pattern: Ensures consistent configuration across all components
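A singleton configuration object can be sketched as a class whose constructor always returns the same instance. The class body, the `setup()` and `get()` methods, and the `jobs` key below are assumptions for illustration and are not taken from src/global_config.py.

```python
from typing import Any, Dict, Optional


class GlobalConfig:
    """Hypothetical singleton: every GlobalConfig() call returns the same object."""

    _instance: Optional["GlobalConfig"] = None

    def __new__(cls) -> "GlobalConfig":
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._values = {}  # created once, shared by all callers
        return cls._instance

    def setup(self, values: Dict[str, Any]) -> None:
        """Load settings, for example those parsed from a YAML config file."""
        self._values = dict(values)

    def get(self, key: str, default: Any = None) -> Any:
        return self._values.get(key, default)


# Usage: two components constructing the config independently see the same state.
config_a = GlobalConfig()
config_a.setup({"jobs": 32})
config_b = GlobalConfig()
```

Because state lives on the single shared instance, a setting loaded once at startup is visible to every module without passing the object around.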
Purpose: Domain-specific knowledge for LLM prompting
Components:
- utility.md: Pre-implemented helper functions for CSA
- suggestions.md: Best practices for checker development
- template.md: Standard checker code structure
- Examples of successful checker implementations
Purpose: Comprehensive library of existing static analysis checkers
Structure: Each checker includes:
- checker.cpp: Full implementation
- pattern.md: Bug pattern description
- plan.md: Implementation strategy
- *_embedding.pt: Vector embeddings for similarity matching
Usage: Provides examples for few-shot learning and similarity-based retrieval
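Similarity-based retrieval over stored embeddings usually amounts to ranking database entries by cosine similarity to a query vector. The sketch below shows that idea with plain Python lists. It assumes cosine similarity as the metric, which this document does not specify, and it does not load the .pt files; `top_k_similar` and the checker names other than ArrayBoundChecker are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k database entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy two-dimensional embeddings; real embeddings would have many more dimensions.
example_db = {
    "ArrayBoundChecker": [1.0, 0.0],
    "HypotheticalNullChecker": [0.0, 1.0],
    "HypotheticalMixedChecker": [1.0, 1.0],
}
nearest = top_k_similar([1.0, 0.0], example_db, k=2)
```

The top-ranked entries would then supply the few-shot examples included in the generation prompt.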
graph TD
A[Git Commit] --> B[Patch Extraction]
B --> C[Pattern Analysis]
C --> D[Plan Generation]
D --> E[Code Generation]
E --> F[Syntax Repair]
F --> G[Validation]
G --> H[Checker Output]
C --> I[Knowledge Base]
D --> I
E --> I
F --> J[Compilation Errors]
J --> F
graph TD
A[Generated Checker] --> B[Kernel Scanning]
B --> C[Report Analysis]
C --> D{False Positives?}
D -->|Yes| E[LLM Refinement]
D -->|No| F[Refined Checker]
E --> G[Updated Checker]
G --> B
E --> H{Max Attempts?}
H -->|Yes| I[Failed Refinement]
H -->|No| B
- Git Commits: Linux kernel vulnerability fixes
- Configuration: YAML files with pipeline settings
- Knowledge Base: Domain expertise in Markdown format
- Checker Database: 100+ example checkers with embeddings
- Preprocessing: Patch extraction and normalization
- Pattern Analysis: LLM-based vulnerability pattern identification
- Plan Generation: Detailed implementation strategy creation
- Code Synthesis: CSA checker code generation
- Validation: Compilation testing and error repair
- Refinement: False positive reduction through iterative improvement
- Generated Checkers: Compilable C++ CSA checker implementations
- Documentation: Patterns, plans, and implementation notes
- Reports: Analysis results and bug findings
- Metadata: Generation statistics and quality metrics
- Logs: Detailed pipeline execution traces
- Extend Model Client (src/model.py):
def create_new_provider_client(api_key: str):
    """Add new LLM provider integration"""
    return NewProviderClient(api_key=api_key)
- Update Configuration:
# Add to llm_keys.yaml
new_provider_key: "your-api-key"
- Create Backend Class (src/backends/new_backend.py):
from typing import List, Optional, Tuple

# AnalysisBackend, TargetFactory, and ReportData are defined in the KNighter source tree.
class NewAnalysisBackend(AnalysisBackend):
    def build_checker(self, checker_code: str) -> bool:
        """Implement checker compilation"""
        pass

    def validate_checker(self, checker_code: str, commit_id: str, patch: str, target: TargetFactory, skip_build_checker=False) -> Tuple[int, int]:
        """Validate checker"""
        pass

    def run_checker(self, checker_code: str, commit_id: str, target: TargetFactory, object_to_analyze=None, jobs=32, output_dir="tmp", skip_build_checker=False, skip_checkout=False, **kwargs) -> int:
        """Implement static analysis execution"""
        pass

    def extract_reports(self, report_dir: str, output_dir: str, sampled_num: int = 5, stop_num: int = 5, max_num: int = 100, seed: int = 0) -> Tuple[Optional[List[ReportData]], int]:
        """Extract reports"""
        pass
- Create Target Class (src/targets/new_target.py):
class NewTarget(Target):
    def checkout_commit(self, commit_id: str, is_before: bool = False, **kwargs):
        """Prepare target for analysis"""
        pass

    # No self parameter in the signature, so it is declared static here (assumption).
    @staticmethod
    def get_object_name(file_name: str) -> str:
        """Get object name from file name"""
        pass

    def get_patch(self, commit_id: str) -> str:
        """Get patch from commit"""
        pass

    # No self parameter in the signature, so it is declared static here (assumption).
    @staticmethod
    def path_similarity(path1, path2):
        """Calculate path similarity"""
        pass
- Update existing backends to support the new target system