The Vulnerability Data Tools project aims to improve and enrich vulnerability data by maintaining accurate Common Platform Enumeration (CPE) mappings and other metadata. This guide will walk you through the process of contributing to the project, from setting up your environment to submitting your first pull request.
The project consists of three main repositories that work together:
- vulnerability-data-tools: The primary development repository containing scripts and tools for processing vulnerability data. This is where the core functionality lives.
- cve-data-enrichment: The repository where human contributors submit updates and corrections to vulnerability data. This is where you'll be submitting most of your pull requests.
- nvd-data-overrides: An automatically generated repository containing the processed data in NVD-compatible format. This repository is updated through GitHub Actions based on changes in cve-data-enrichment.
Think of these repositories as a pipeline: Contributors make changes in cve-data-enrichment, the tools in vulnerability-data-tools process these changes, and the results are published in nvd-data-overrides.
- Python 3.10 or newer
- Git
- A GitHub account
- The crane tool for container operations
- First, install the crane tool:
    # On macOS using Homebrew
    brew install crane
    # For other platforms, see: https://github.com/google/go-containerregistry
- Clone the vulnerability-data-tools repository:
    git clone https://github.com/anchore/vulnerability-data-tools
    cd vulnerability-data-tools
- Create and activate a Python virtual environment:
    # Using uv (recommended for speed)
    uv venv
    source .venv/bin/activate
    # Or using venv
    python -m venv .venv
    source .venv/bin/activate
    # Or using your preferred virtual environment tool
- Install required Python packages:
    # If using uv
    uv pip install requests check-jsonschema cpe
    # If using pip
    pip install requests check-jsonschema cpe
- Fork and clone the cve-data-enrichment repository:
    # Fork the repository on GitHub first, then:
    cd nvd
    git clone https://github.com/YOUR-USERNAME/cve-data-enrichment
Before diving into the specific tools, it's important to understand how different sources of vulnerability data work together. The vulnerability data ecosystem consists of several key components:
- The National Vulnerability Database (NVD) provides vulnerability information based on the official CVE data. NVD adds metadata such as CPE configurations that describe affected software, as well as severity information such as CVSS.
- The CVE Project (in version 5 format) provides the foundational vulnerability records. CVE records can include detailed information about affected packages and versions, but do not always.
- Package ecosystems (like npm, PyPI, or Go) maintain their own repositories with package metadata that helps identify vulnerable versions.
- Linux distributions (Debian, Ubuntu, Red Hat, SUSE) maintain security trackers that provide additional context about vulnerabilities in their packages.
Our tools help bridge these different sources of information, ensuring accurate and comprehensive vulnerability data. Let's explore how each component works.
The vulnerability data tools use a pipeline approach to gather and process data:
- Data Collection Layer: The first layer of tools gathers data from various sources:
    # wordpress_name_to_slug.py connects to wordpress.org's API
    url_format = "https://api.wordpress.org/{wp_type}/info/1.2/?action=query_{wp_type}&request[page]={page}"
- Normalization Layer: The collected data goes through normalization to ensure consistency:
    # normalization.py handles vendor name standardization
    vendor_aliases = {
        "red hat": "redhat",
        "apache software foundation": "apache",
        "microsoft corporation": "microsoft"
    }
- Enrichment Layer: Additional metadata is added to make the data more useful:
    # enrich_package_metadata.py adds ecosystem-specific details
    if package_type == "maven":
        # Add Maven-specific target software information
        components[-3] = "maven"
- Validation Layer: The enriched data is validated before being committed:
    check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
      cve-data-enrichment/data/anchore/**/CVE-*.json
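To make the normalization layer more concrete, here is a minimal sketch of an alias lookup built from the three entries shown above. The helper name `normalize_vendor` and the whitespace handling are assumptions for illustration; the real normalization.py may use more aliases and different rules.

```python
# Alias table copied from the normalization example above; the real
# normalization.py may contain more entries and different logic.
VENDOR_ALIASES = {
    "red hat": "redhat",
    "apache software foundation": "apache",
    "microsoft corporation": "microsoft",
}


def normalize_vendor(name: str) -> str:
    """Hypothetical helper: lowercase, collapse whitespace, then apply aliases."""
    key = " ".join(name.lower().split())
    # Unknown vendors fall through unchanged (apart from case and spacing).
    return VENDOR_ALIASES.get(key, key)


for raw in ("Red  Hat", "Microsoft Corporation", "frenify"):
    print(raw, "->", normalize_vendor(raw))
```

Returning unknown names unchanged keeps the lookup safe to apply to every record, whether or not an alias exists.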
Each script provides specific functionality and options. Here are some common scenarios:
The cve_enrichment_record_candidates_from_upstream_cve5.py script is one of the most commonly used tools. It supports several useful options:
# Process specific CVEs
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
--cves CVE-2024-XXXXX CVE-2024-YYYYY
# Process only records from specific assigners
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
--assigners github_m wordfence
# Force processing of records, even if they have existing CPE data
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
--force
# Limit the batch size to avoid overwhelming review capacity
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
--batch-size 10
# Enable verbose output for debugging
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
--verbose
The script makes intelligent decisions about which records to process:
- Skips rejected CVEs
- Avoids processing records that already have CPE configurations
- Handles different package ecosystems appropriately
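The selection rules above could be sketched roughly as follows. This is not the script's actual code: the function name `should_process` is hypothetical, and it assumes "existing CPE data" means a `cpes` list on an affected entry of the CVE v5 record, whereas the real script may consult a different source. It also assumes `--force` does not override the rejected check.

```python
def should_process(record: dict, force: bool = False) -> bool:
    """Hypothetical filter mirroring the rules listed above."""
    # Rejected CVEs are always skipped (CVE v5 stores this in cveMetadata.state).
    if record.get("cveMetadata", {}).get("state") == "REJECTED":
        return False

    # Assumption: CPE data lives on the CNA container's affected entries.
    affected = record.get("containers", {}).get("cna", {}).get("affected", [])
    has_cpes = any(entry.get("cpes") for entry in affected)

    # Records that already carry CPEs are skipped unless forced.
    return force or not has_cpes


published = {"cveMetadata": {"state": "PUBLISHED"}}
print(should_process(published))
```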
The enrich_package_metadata.py script adds ecosystem-specific information:
# Enrich all records
python -m scripts.enrich_package_metadata
# Common enrichments include:
# - Adding package repository URLs
# - Standardizing version information
# - Adding ecosystem-specific target software information
The tools use Python's logging framework for debugging and monitoring. You can control logging verbosity:
# Enable debug logging
export PYTHONPATH=.
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --verbose
# Log output includes:
# - Data processing decisions
# - Skipped records and why
# - CPE lookup results
Common log messages and what they mean:
- "No CPEs discovered": The tools couldn't automatically determine CPE identifiers
- "Multiple collectionURL possibilities": Multiple potential package sources found
- "Skipping git version type": Version information uses git commits instead of releases
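For readers unfamiliar with how a `--verbose` flag usually maps onto Python's logging framework, the following sketch shows one common pattern. It illustrates the standard library only; the function name `configure_logging` and the log format are assumptions, and the project's scripts may wire this up differently.

```python
import argparse
import logging


def configure_logging(argv=None) -> int:
    """Parse --verbose and set the root logger level accordingly."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args(argv)

    level = logging.DEBUG if args.verbose else logging.INFO
    # force=True replaces any handlers configured earlier in the process.
    logging.basicConfig(
        level=level,
        format="%(levelname)s %(name)s: %(message)s",
        force=True,
    )
    return level


configure_logging(["--verbose"])
logging.getLogger("example").debug("only visible when --verbose is set")
```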
When working with the tools, you might encounter these common situations:
- Missing CPE Information
    # If no CPEs are found, check:
    # - Vendor and product normalization
    # - Package ecosystem mappings
    # - Existing CPE dictionary entries
- Version Parsing Issues
    # The tools handle various version formats:
    # - Semantic versions (1.2.3)
    # - Custom versions (2024.1.1)
    # - Range specifications (<2.0.0)
- Data Reconciliation Conflicts
    # When upstream data changes:
    # - Use --reconcile flag to update
    # - Check version ranges carefully
    # - Validate CPE configurations
For larger-scale updates, you can combine the tools:
- cpe_lookups_from_cve5.py: Generates CPE pattern lookups from CVE v5 data
- enrich_package_metadata.py: Enriches vulnerability data with additional package information
- cve_enrichment_record_candidates_from_upstream_cve5.py: Creates enrichment record candidates from upstream CVE v5 data, with options like:
    # Process specific CVEs
    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --cves CVE-2024-XXXXX
    # Process records from specific assigners
    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --assigners github_m wordfence
    # Force processing even for records with existing CPE data
    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --force
The toolset includes robust normalization capabilities through normalization.py which helps ensure consistent data formatting:
- Vendor and product name normalization
- Collection URL normalization
- Version string parsing and normalization
- Package type detection
The typical workflow for contributing vulnerability data updates follows these steps:
- Data Update: Run scripts to fetch the latest data
- Analysis: Process new CVEs and identify needed updates
- Review: Verify CPE names and version information
- Validation: Check changes against the schema
- Submission: Create a pull request with your changes
Start by running the update script to fetch the latest vulnerability data:
cd nvd
./scripts/update.sh
This script performs several important tasks:
- Downloads the official CPE dictionary
- Fetches current NVD data
- Updates local git repositories with the latest information
Two main scripts handle CVE processing:
-
Find new CPEs to add:
python -m scripts.cpe_lookups_from_cve5
-
Generate enrichment records:
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --batch-size 100
After generating new records, you'll need to review the changes in the data/anchore directory. Pay special attention to:
- Files marked with needsReview tags
- CPE names and their accuracy
- Version information and ranges
When reviewing version information, follow these guidelines:
- For version ranges, use <= for "up to and including" and < for "up to but not including"
  - "up to and including" means we know version X and below are vulnerable and there is currently no fix
  - "up to but not including" means we know version X is fixed and all previous versions are affected
- Verify version numbers against advisory information
- Check the official CPE dictionary for existing entries
- More information on CPE can be found on the NIST website
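The difference between the two range styles can be checked with a small sketch. It assumes purely numeric dotted versions and uses the `lessThan` and `lessThanOrEqual` field names that appear in the enrichment records. The helpers `parse_version` and `is_affected` are hypothetical, and real version schemes (pre-releases, epochs, and so on) need a proper parser.

```python
def parse_version(text: str) -> tuple:
    """Naive parser: split on dots and compare the parts as integers."""
    return tuple(int(part) for part in text.split("."))


def is_affected(version: str, entry: dict) -> bool:
    """Check a version against one 'affected' range entry."""
    v = parse_version(version)
    if v < parse_version(entry["version"]):
        return False  # below the start of the range
    if "lessThan" in entry:
        # Fix is known: the bound itself is NOT affected.
        return v < parse_version(entry["lessThan"])
    if "lessThanOrEqual" in entry:
        # No fix known yet: the bound itself IS affected.
        return v <= parse_version(entry["lessThanOrEqual"])
    return v == parse_version(entry["version"])


fixed = {"version": "0", "lessThan": "0.17.7"}
unfixed = {"version": "0", "lessThanOrEqual": "8.7.13"}
print(is_affected("0.17.7", fixed), is_affected("8.7.13", unfixed))
```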
After your initial review, enrich the data with additional information:
python -m scripts.enrich_package_metadata
python scripts/compare_to_distros.py
Always validate your changes against the JSON schema before submitting:
check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
cve-data-enrichment/data/anchore/**/CVE-*.json
When new vulnerabilities are reported, you may need to add them to the dataset. Here's a real-world example from January 2025:
{
"additionalMetadata": {
"cna": "github_m",
"cveId": "CVE-2025-23214",
"description": "Cosmos provides users the ability self-host a home server...",
"reason": "Added CPE configurations because not yet analyzed by NVD."
},
"adp": {
"affected": [
{
"collectionURL": "https://pkg.go.dev",
"cpes": [
"cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"
],
"packageName": "github.com/azukaar/cosmos-server",
"packageType": "go-module",
"product": "Cosmos-Server",
"repo": "https://github.com/azukaar/cosmos-server",
"vendor": "azukaar",
"versions": [
{
"lessThan": "0.17.7",
"status": "affected",
"version": "0",
"versionType": "custom"
}
]
}
]
}
}
This example shows the key components:
- Basic metadata about the CVE
- Affected package information including:
- Collection URL (where the package is hosted)
- CPE identifiers
- Package details and type
- Version ranges that are affected
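When reviewing many records, it can help to print these key components in one line per affected entry. The sketch below uses a trimmed copy of the record above. The function name `summarize` and the output format are illustrative assumptions, not part of the project's tooling.

```python
import json

# Trimmed copy of the example record above (description and reason omitted).
RECORD = json.loads("""
{
  "additionalMetadata": {"cna": "github_m", "cveId": "CVE-2025-23214"},
  "adp": {
    "affected": [
      {
        "collectionURL": "https://pkg.go.dev",
        "cpes": ["cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"],
        "packageName": "github.com/azukaar/cosmos-server",
        "packageType": "go-module",
        "versions": [
          {"lessThan": "0.17.7", "status": "affected",
           "version": "0", "versionType": "custom"}
        ]
      }
    ]
  }
}
""")


def summarize(record: dict) -> list[str]:
    """Return one human-readable line per affected entry."""
    cve_id = record["additionalMetadata"]["cveId"]
    lines = []
    for entry in record["adp"]["affected"]:
        n_cpes = len(entry.get("cpes", []))
        n_ranges = len(entry.get("versions", []))
        lines.append(
            f"{cve_id}: {entry['packageName']} ({entry['packageType']}), "
            f"{n_cpes} CPE(s), {n_ranges} version range(s)"
        )
    return lines


print("\n".join(summarize(RECORD)))
```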
A common task is updating version ranges as new releases fix vulnerabilities. For example, this change from January 2025:
"versions": [
{
- "lessThanOrEqual": "8.7.13",
+ "lessThan": "8.7.16",
"status": "affected",
"version": "0",
"versionType": "semver"
}
]
To update a specific CVE record:
- Generate the record if it doesn't exist:
    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --cves CVE-YYYY-XXXXX --force
- Edit the file in cve-data-enrichment/data/anchore/YYYY/CVE-YYYY-XXXXX.json
- Run validation:
    check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
      cve-data-enrichment/data/anchore/YYYY/CVE-YYYY-XXXXX.json
For WordPress plugins and other packages hosted in public repositories, it's important to include repository information. Here's an example:
"packageName": "categorify",
"packageType": "wordpress-plugin",
"product": "Categorify – WordPress Media Library Category & File Manager",
+ "repo": "https://plugins.svn.wordpress.org/categorify",
"vendor": "frenify",
The scripts in vulnerability-data-tools/nvd/scripts/ help automate this process:
- wordpress_name_to_slug.py maps plugin names to their repository slugs
- normalization.py provides functions for standardizing package and vendor names
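As an illustration of the kind of change these scripts automate, the sketch below fills in a missing repo field for a WordPress plugin. It assumes that packageName already holds the plugin slug, as in the categorify example. The helper `add_wordpress_repo` is hypothetical and is not a function from the project's scripts.

```python
# Base URL taken from the categorify example above.
WORDPRESS_SVN_BASE = "https://plugins.svn.wordpress.org"


def add_wordpress_repo(entry: dict) -> dict:
    """Return a copy of an affected entry with a repo URL filled in if missing."""
    updated = dict(entry)
    is_plugin = updated.get("packageType") == "wordpress-plugin"
    if is_plugin and "repo" not in updated:
        # Assumption: packageName is the plugin's repository slug.
        updated["repo"] = f"{WORDPRESS_SVN_BASE}/{updated['packageName']}"
    return updated


entry = {"packageName": "categorify", "packageType": "wordpress-plugin", "vendor": "frenify"}
print(add_wordpress_repo(entry)["repo"])
```

Existing repo values and non-WordPress entries are left untouched, so the helper can be applied to every affected entry in a record.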
When addressing false negatives (usually reported by users or through issue reports):
- Generate the CVE record if it doesn't exist
- Update the CPE information based on the reported issue
- Verify the version information
- Remove the needsReview tag after confirmation
Here's a real example of correcting version information for a previous false negative:
CPE (Common Platform Enumeration) data sometimes needs correction to ensure accurate vulnerability tracking. Here's an example workflow:
- Create or update the CVE record
- Add the correct CPE information
- Document the reason for the override in the commit message
- Validate the changes
Example CPE corrections:
{
"cpes": [
"cpe:2.3:a:phpoffice:phpspreadsheet:*:*:*:*:*:php:*:*",
"cpe:2.3:a:phpspreadsheet_project:phpspreadsheet:*:*:*:*:*:php:*:*"
]
}
When creating CPEs:
- Follow the standard CPE format: cpe:2.3:part:vendor:product:version:update:edition:language:sw_edition:target_sw:target_hw:other
- Use asterisks (*) for fields that apply to all values
- Ensure vendor and product names follow standardized formats
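To see how the fields line up, a CPE 2.3 string can be split into named components. This is a simplified sketch: the helper `parse_cpe` is hypothetical, and it treats a colon preceded by a backslash as escaped but does not handle every quoting rule in the CPE specification.

```python
import re

# Field order from the format string above.
CPE_FIELDS = [
    "part", "vendor", "product", "version", "update", "edition",
    "language", "sw_edition", "target_sw", "target_hw", "other",
]


def parse_cpe(cpe: str) -> dict:
    """Split a CPE 2.3 formatted string into a field-name -> value mapping."""
    # Split on colons that are not escaped with a backslash.
    parts = re.split(r"(?<!\\):", cpe)
    if len(parts) != 13 or parts[0] != "cpe" or parts[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    return dict(zip(CPE_FIELDS, parts[2:]))


parsed = parse_cpe("cpe:2.3:a:phpoffice:phpspreadsheet:*:*:*:*:*:php:*:*")
print(parsed["vendor"], parsed["product"], parsed["target_sw"])
```

Note that target_sw is the third field from the end, which matches the `components[-3]` assignment shown in the enrichment layer snippet earlier in this guide.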
Sometimes a match needs to be removed from appearing in the results. There are many reasons this could happen: missing details for a very old vulnerability, the vulnerability might be marked as wontfix by the upstream project, or the original CPE is just incorrect.
Using the example from Scenario 1:
{
"additionalMetadata": {
"cna": "github_m",
"cveId": "CVE-2025-23214",
"description": "Cosmos provides users the ability self-host a home server...",
"reason": "Added CPE configurations because not yet analyzed by NVD."
},
"adp": {
"affected": [
{
"collectionURL": "https://pkg.go.dev",
"cpes": [
"cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"
],
"packageName": "github.com/azukaar/cosmos-server",
"packageType": "go-module",
"product": "Cosmos-Server",
"repo": "https://github.com/azukaar/cosmos-server",
"vendor": "azukaar",
"versions": [
{
"lessThan": "0.17.7",
"status": "affected",
"version": "0",
"versionType": "custom"
}
]
}
]
}
}
We will change the line
"status": "affected",
to
"status": "unaffected",
We modify the status rather than removing the data so that future script runs won't mistakenly re-add the incorrect data.
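If the same change has to be made by hand in several records, it could be scripted along these lines. The function `mark_unaffected` is a hypothetical helper, not part of the project's scripts, and it assumes the record layout shown in the example above.

```python
import copy


def mark_unaffected(record: dict, package_name: str) -> dict:
    """Return a copy of the record with the package's ranges set to 'unaffected'."""
    updated = copy.deepcopy(record)  # leave the caller's data untouched
    for entry in updated["adp"]["affected"]:
        if entry.get("packageName") != package_name:
            continue
        for version in entry.get("versions", []):
            if version.get("status") == "affected":
                version["status"] = "unaffected"
    return updated


record = {
    "adp": {
        "affected": [
            {
                "packageName": "github.com/azukaar/cosmos-server",
                "versions": [{"lessThan": "0.17.7", "status": "affected", "version": "0"}],
            }
        ]
    }
}
fixed = mark_unaffected(record, "github.com/azukaar/cosmos-server")
print(fixed["adp"]["affected"][0]["versions"][0]["status"])
```

Working on a deep copy keeps the original record available for comparison before the file is written back and validated.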
- Documentation: Always include clear commit messages explaining your changes. Examples:
  - "updates 2025-01-21" - For routine updates with multiple changes
  - "reconcile with NVD" - When aligning data with NVD records
  - "Update CVE-YYYY-XXXXX with fix version" - For specific version updates
- Version Ranges: Be precise with version ranges to avoid false positives/negatives
- CPE Validation: When possible, verify CPEs against the official dictionary
- Incremental Changes: Submit smaller, focused pull requests rather than large changes
- Review Process: Pay attention to automated checks and respond to review feedback
- Create a branch for your changes:
    cd cve-data-enrichment
    git checkout -b fix-cve-yyyy-xxxxx
- Commit your changes with a clear message:
    git add data/anchore/yyyy/CVE-yyyy-xxxxx.json
    git commit -s -m "Update CVE-YYYY-XXXXX with correct version range"
  Don't forget to use the signoff option -s
- Push your changes and create a pull request:
    git push origin fix-cve-yyyy-xxxxx
- Open a pull request on GitHub against the main cve-data-enrichment repository
Your changes will be reviewed, and once merged, they will automatically flow through to the nvd-data-overrides repository.
- Check existing issues in the vulnerability-data-tools repository
- Join the Anchore Community Discourse
- Review the FAQ in the vulnerability-data-tools README
Remember: The goal is to improve the quality of vulnerability data for everyone. Take your time to verify changes and don't hesitate to ask for help when needed.