Contributing to Vulnerability Data Tools: A Technical Guide

Introduction

The Vulnerability Data Tools project aims to improve and enrich vulnerability data by maintaining accurate Common Platform Enumeration (CPE) mappings and other metadata. This guide will walk you through the process of contributing to the project, from setting up your environment to submitting your first pull request.

Project Overview

The project consists of three main repositories that work together:

vulnerability-data-tools: The primary development repository containing scripts and tools for processing vulnerability data. This is where the core functionality lives.
cve-data-enrichment: The repository where human contributors submit updates and corrections to vulnerability data. This is where you'll be submitting most of your pull requests.
nvd-data-overrides: An automatically generated repository containing the processed data in NVD-compatible format. This repository is updated through GitHub Actions based on changes in cve-data-enrichment.

Think of these repositories as a pipeline: Contributors make changes in cve-data-enrichment, the tools in vulnerability-data-tools process these changes, and the results are published in nvd-data-overrides.

Getting Started

Prerequisites

Python 3.10 or newer
Git
A GitHub account
The crane tool for container operations

Setting Up Your Environment

First, install the crane tool:

# On macOS using Homebrew
brew install crane

# For other platforms, see: https://github.com/google/go-containerregistry

Clone the vulnerability-data-tools repository:

git clone https://github.com/anchore/vulnerability-data-tools
cd vulnerability-data-tools

Create and activate a Python virtual environment:

# Using uv (recommended for speed)
uv venv
source .venv/bin/activate

# Or using venv
python -m venv .venv
source .venv/bin/activate

# Or using your preferred virtual environment tool

Install required Python packages:

# If using uv
uv pip install requests check-jsonschema cpe

# If using pip
pip install requests check-jsonschema cpe

Fork and clone the cve-data-enrichment repository:

# Fork the repository on GitHub first, then:
cd nvd
git clone https://github.com/YOUR-USERNAME/cve-data-enrichment

Understanding the Vulnerability Data Ecosystem

Before diving into the specific tools, it's important to understand how different sources of vulnerability data work together. The vulnerability data ecosystem consists of several key components:

The National Vulnerability Database (NVD) provides vulnerability information based on the official CVE data. NVD adds metadata such as CPE configurations that describe affected software, as well as severity information such as CVSS.
The CVE Project (in version 5 format) provides the foundational vulnerability records. CVE records can include detailed information about affected packages and versions, but do not always.
Package ecosystems (like npm, PyPI, or Go) maintain their own repositories with package metadata that helps identify vulnerable versions.
Linux distributions (Debian, Ubuntu, Red Hat, SUSE) maintain security trackers that provide additional context about vulnerabilities in their packages.

Our tools help bridge these different sources of information, ensuring accurate and comprehensive vulnerability data. Let's explore how each component works.

Data Collection and Enrichment Architecture

The vulnerability data tools use a pipeline approach to gather and process data:

Data Collection Layer. The first layer of tools gathers data from various sources:

# wordpress_name_to_slug.py connects to wordpress.org's API
url_format = "https://api.wordpress.org/{wp_type}/info/1.2/?action=query_{wp_type}&request[page]={page}"
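
As a rough illustration of how a template like this can be expanded, the sketch below fills in the two placeholders with `str.format`. The helper name `build_query_url` and the `"plugins"` value are assumptions made for this example; the actual script may build its requests differently.

```python
# Template copied from the snippet above.
URL_FORMAT = (
    "https://api.wordpress.org/{wp_type}/info/1.2/"
    "?action=query_{wp_type}&request[page]={page}"
)


def build_query_url(wp_type: str, page: int) -> str:
    """Hypothetical helper: expand the template for one page of results."""
    # Square brackets are literal text here; only the {name} fields are replaced.
    return URL_FORMAT.format(wp_type=wp_type, page=page)


print(build_query_url("plugins", 1))
```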

Normalization Layer. The collected data goes through normalization to ensure consistency:

# normalization.py handles vendor name standardization
vendor_aliases = {
    "red hat": "redhat",
    "apache software foundation": "apache",
    "microsoft corporation": "microsoft"
}
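
A minimal sketch of how an alias table like this might be applied. The function name `normalize_vendor` and the underscore fallback for unknown vendors are assumptions for illustration, not the behavior of normalization.py.

```python
# Alias table copied from the snippet above.
VENDOR_ALIASES = {
    "red hat": "redhat",
    "apache software foundation": "apache",
    "microsoft corporation": "microsoft",
}


def normalize_vendor(name: str) -> str:
    """Hypothetical helper: lowercase, collapse whitespace, then apply aliases."""
    key = " ".join(name.lower().split())
    if key in VENDOR_ALIASES:
        return VENDOR_ALIASES[key]
    # Assumed fallback: CPE-style names use underscores instead of spaces.
    return key.replace(" ", "_")


print(normalize_vendor("  Red   Hat "))
print(normalize_vendor("Example Vendor"))
```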

Enrichment Layer. Additional metadata is added to make the data more useful:

# enrich_package_metadata.py adds ecosystem-specific details
if package_type == "maven":
    # Add Maven-specific target software information
    components[-3] = "maven"
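
To show why index `-3` is used, here is a small sketch that sets the target software field of a CPE 2.3 string. In the 13 colon-separated components, `target_sw` is third from the end. The function name `set_target_sw` and the example vendor and product are hypothetical, and the naive split on `:` assumes no escaped colons inside a field.

```python
def set_target_sw(cpe: str, target_sw: str) -> str:
    """Hypothetical helper: replace the target_sw field of a CPE 2.3 string."""
    components = cpe.split(":")  # assumption: no escaped "\:" inside any field
    if len(components) != 13 or components[:2] != ["cpe", "2.3"]:
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    # Layout: cpe, 2.3, part, vendor, product, version, update, edition,
    # language, sw_edition, target_sw, target_hw, other
    components[-3] = target_sw
    return ":".join(components)


print(set_target_sw("cpe:2.3:a:example_vendor:example_product:*:*:*:*:*:*:*:*", "maven"))
```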

Validation Layer. The enriched data is validated before being committed:

check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
                 cve-data-enrichment/data/anchore/**/CVE-*.json

Script Usage and Options

Each script provides specific functionality and options. Here are some common scenarios:

CVE Enrichment Record Generation

The cve_enrichment_record_candidates_from_upstream_cve5.py script is one of the most commonly used tools. It supports several useful options:

# Process specific CVEs
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --cves CVE-2024-XXXXX CVE-2024-YYYYY

# Process only records from specific assigners
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --assigners github_m wordfence

# Force processing of records, even if they have existing CPE data
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --force

# Limit the batch size to avoid overwhelming review capacity
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --batch-size 10

# Enable verbose output for debugging
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --verbose

The script applies a few rules when deciding which records to process:

Skips rejected CVEs
Avoids processing records that already have CPE configurations
Handles different package ecosystems appropriately
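
The first two rules can be sketched as a simple filter. This is an illustration only: `should_process` and the `has_cpe_config` argument are hypothetical names, and the assumption that `--force` does not revive rejected CVEs is ours. The `cveMetadata.state` field comes from the CVE v5 record format.

```python
def should_process(cve_record: dict, has_cpe_config: bool, force: bool = False) -> bool:
    """Hypothetical filter mirroring the skip rules listed above."""
    state = cve_record.get("cveMetadata", {}).get("state", "")
    if state.upper() == "REJECTED":
        return False  # rejected CVEs are skipped (assumed: even with force)
    if has_cpe_config and not force:
        return False  # already has CPE data; only reprocess when forced
    return True


published = {"cveMetadata": {"cveId": "CVE-2024-XXXXX", "state": "PUBLISHED"}}
print(should_process(published, has_cpe_config=True))
print(should_process(published, has_cpe_config=True, force=True))
```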

Package Metadata Enrichment

The enrich_package_metadata.py script adds ecosystem-specific information:

# Enrich all records
python -m scripts.enrich_package_metadata

# Common enrichments include:
# - Adding package repository URLs
# - Standardizing version information
# - Adding ecosystem-specific target software information

Debugging and Logging

The tools use Python's logging framework for debugging and monitoring. You can control logging verbosity:

# Enable debug logging
export PYTHONPATH=.
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --verbose

# Log output includes:
# - Data processing decisions
# - Skipped records and why
# - CPE lookup results
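
For readers unfamiliar with how a `--verbose` flag is typically wired to Python's logging module, here is a generic sketch. It is not taken from the project's scripts; the `configure_logging` name, the log format, and the `example` logger are assumptions.

```python
import argparse
import logging


def configure_logging(verbose: bool) -> int:
    """Set the root logger to DEBUG when verbose, INFO otherwise."""
    level = logging.DEBUG if verbose else logging.INFO
    # force=True replaces any handlers configured earlier in the process.
    logging.basicConfig(
        level=level, format="%(levelname)s %(name)s: %(message)s", force=True
    )
    return level


parser = argparse.ArgumentParser()
parser.add_argument("--verbose", action="store_true", help="enable debug logging")
args = parser.parse_args(["--verbose"])  # explicit argv for the example
configure_logging(args.verbose)
logging.getLogger("example").debug("debug messages are now visible")
```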

Common log messages and what they mean:

"No CPEs discovered": The tools couldn't automatically determine CPE identifiers
"Multiple collectionURL possibilities": Multiple potential package sources found
"Skipping git version type": Version information uses git commits instead of releases

Troubleshooting Common Issues

When working with the tools, you might encounter these common situations:

Missing CPE Information

# If no CPEs are found, check:
# - Vendor and product normalization
# - Package ecosystem mappings
# - Existing CPE dictionary entries

Version Parsing Issues

# The tools handle various version formats:
# - Semantic versions (1.2.3)
# - Custom versions (2024.1.1)
# - Range specifications (<2.0.0)

Data Reconciliation Conflicts

# When upstream data changes:
# - Use --reconcile flag to update
# - Check version ranges carefully
# - Validate CPE configurations

Batch Processing and Automation

For larger-scale updates, you can combine the tools:

Data Processing Scripts

cpe_lookups_from_cve5.py: Generates CPE pattern lookups from CVE v5 data
enrich_package_metadata.py: Enriches vulnerability data with additional package information

cve_enrichment_record_candidates_from_upstream_cve5.py: Creates enrichment record candidates from upstream CVE v5 data, with options like:

# Process specific CVEs
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --cves CVE-2024-XXXXX

# Process records from specific assigners
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --assigners github_m wordfence

# Force processing even for records with existing CPE data
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --force

Normalization and Validation

The toolset includes robust normalization capabilities through normalization.py which helps ensure consistent data formatting:

Vendor and product name normalization
Collection URL normalization
Version string parsing and normalization
Package type detection

Typical Workflow

The typical workflow for contributing vulnerability data updates follows these steps:

Data Update: Run scripts to fetch the latest data
Analysis: Process new CVEs and identify needed updates
Review: Verify CPE names and version information
Validation: Check changes against the schema
Submission: Create a pull request with your changes

Updating Data

Start by running the update script to fetch the latest vulnerability data:

cd nvd
./scripts/update.sh

This script performs several important tasks:

Downloads the official CPE dictionary
Fetches current NVD data
Updates local git repositories with the latest information

Processing New CVEs

Two main scripts handle CVE processing:

Find new CPEs to add:
```
python -m scripts.cpe_lookups_from_cve5
```

Generate enrichment records:

python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --batch-size 100

Reviewing Changes

After generating new records, you'll need to review the changes in the data/anchore directory. Pay special attention to:

Files marked with needsReview tags
CPE names and their accuracy
Version information and ranges

When reviewing version information, follow these guidelines:

For version ranges, use <= for "up to and including" and < for "up to but not including"
- "up to and including" means we know version X and below are vulnerable and there is currently no fix
- "up to but not including" means we know version X is fixed and all previous versions are affected
Verify version numbers against advisory information
Check the official CPE dictionary for existing entries
- More information on CPE can be found on the NIST website
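
The difference between the two bounds can be made concrete with a short sketch. It assumes purely numeric dotted versions and uses the `lessThan` and `lessThanOrEqual` field names from the enrichment records; `parse_version` and `is_affected` are hypothetical helpers, not part of the project's tooling.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Assumption: versions are dotted integers such as 1.2.3."""
    return tuple(int(part) for part in text.split("."))


def is_affected(candidate: str, entry: dict) -> bool:
    """Check a version against one entry of a record's versions list."""
    version = parse_version(candidate)
    if version < parse_version(entry.get("version", "0")):
        return False  # below the start of the range
    if "lessThan" in entry:  # up to but not including: the bound is the fix
        return version < parse_version(entry["lessThan"])
    if "lessThanOrEqual" in entry:  # up to and including: no fix known yet
        return version <= parse_version(entry["lessThanOrEqual"])
    return version == parse_version(entry["version"])  # single version


print(is_affected("2.0.0", {"version": "0", "lessThan": "2.0.0"}))
print(is_affected("2.0.0", {"version": "0", "lessThanOrEqual": "2.0.0"}))
```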

Enriching Metadata

After your initial review, enrich the data with additional information:

python -m scripts.enrich_package_metadata
python scripts/compare_to_distros.py

Validating Changes

Always validate your changes against the JSON schema before submitting:

check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
                 cve-data-enrichment/data/anchore/**/CVE-*.json

Common Scenarios and Examples

Scenario 1: Adding New CVE Data

When new vulnerabilities are reported, you may need to add them to the dataset. Here's a real-world example from January 2025:

{
  "additionalMetadata": {
    "cna": "github_m",
    "cveId": "CVE-2025-23214",
    "description": "Cosmos provides users the ability self-host a home server...",
    "reason": "Added CPE configurations because not yet analyzed by NVD."
  },
  "adp": {
    "affected": [
      {
        "collectionURL": "https://pkg.go.dev",
        "cpes": [
          "cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"
        ],
        "packageName": "github.com/azukaar/cosmos-server",
        "packageType": "go-module",
        "product": "Cosmos-Server",
        "repo": "https://github.com/azukaar/cosmos-server",
        "vendor": "azukaar",
        "versions": [
          {
            "lessThan": "0.17.7",
            "status": "affected",
            "version": "0",
            "versionType": "custom"
          }
        ]
      }
    ]
  }
}

This example shows the key components:

Basic metadata about the CVE
Affected package information including:
- Collection URL (where the package is hosted)
- CPE identifiers
- Package details and type
- Version ranges that are affected
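
When reviewing many records, it can help to print these components in one line per version entry. The sketch below is one way to do that; `summarize_record` is a hypothetical helper, and the record is a trimmed copy of the example above.

```python
def summarize_record(record: dict) -> list[str]:
    """Return one human-readable line per version entry in a record."""
    cve_id = record["additionalMetadata"]["cveId"]
    summary = []
    for affected in record["adp"]["affected"]:
        for entry in affected.get("versions", []):
            if "lessThan" in entry:
                bound = f"< {entry['lessThan']}"
            elif "lessThanOrEqual" in entry:
                bound = f"<= {entry['lessThanOrEqual']}"
            else:
                bound = f"== {entry['version']}"
            summary.append(
                f"{cve_id} {affected['packageName']} {entry['status']} {bound}"
            )
    return summary


record = {
    "additionalMetadata": {"cveId": "CVE-2025-23214"},
    "adp": {"affected": [{
        "packageName": "github.com/azukaar/cosmos-server",
        "versions": [{"lessThan": "0.17.7", "status": "affected", "version": "0"}],
    }]},
}
print("\n".join(summarize_record(record)))
```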

Scenario 2: Updating Version Ranges

A common task is updating version ranges as new releases fix vulnerabilities. For example, this change from January 2025:

   "versions": [
     {
-      "lessThanOrEqual": "8.7.13",
+      "lessThan": "8.7.16",
       "status": "affected",
       "version": "0",
       "versionType": "semver"
     }
   ]

To update a specific CVE record:

Generate the record if it doesn't exist:

python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --cves CVE-YYYY-XXXXX --force

Edit the file in cve-data-enrichment/data/anchore/YYYY/CVE-YYYY-XXXXX.json

Run validation:

check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
                 cve-data-enrichment/data/anchore/YYYY/CVE-YYYY-XXXXX.json
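
The file location follows from the CVE ID: the year segment of the ID is the directory name. A small sketch of that mapping, assuming the layout shown in the paths above; `record_path` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath

CVE_ID = re.compile(r"^CVE-(\d{4})-(\d{4,})$")


def record_path(
    cve_id: str, root: str = "cve-data-enrichment/data/anchore"
) -> PurePosixPath:
    """Map a CVE ID to its enrichment record path under root."""
    cve_id = cve_id.upper()
    match = CVE_ID.match(cve_id)
    if match is None:
        raise ValueError(f"not a CVE ID: {cve_id!r}")
    year = match.group(1)
    return PurePosixPath(root) / year / f"{cve_id}.json"


print(record_path("CVE-2025-23214"))
```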

Scenario 3: Adding Package Repository Information

For WordPress plugins and other packages hosted in public repositories, it's important to include repository information. Here's an example:

   "packageName": "categorify",
   "packageType": "wordpress-plugin",
   "product": "Categorify – WordPress Media Library Category & File Manager",
+  "repo": "https://plugins.svn.wordpress.org/categorify",
   "vendor": "frenify",

The scripts in vulnerability-data-tools/nvd/scripts/ help automate this process:

wordpress_name_to_slug.py maps plugin names to their repository slugs
normalization.py provides functions for standardizing package and vendor names

Scenario 4: Handling False Negatives

When addressing false negatives (usually reported by users or through issue reports):

Generate the CVE record if it doesn't exist
Update the CPE information based on the reported issue
Verify the version information
Remove the needsReview tag after confirmation

Here's a real example of correcting version information for a previous false negative:

Scenario 5: Correcting Inaccurate CPE Data

CPE (Common Platform Enumeration) data sometimes needs correction to ensure accurate vulnerability tracking. Here's an example workflow:

Create or update the CVE record
Add the correct CPE information
Document the reason for the override in the commit message
Validate the changes

Example CPE corrections:

{
  "cpes": [
    "cpe:2.3:a:phpoffice:phpspreadsheet:*:*:*:*:*:php:*:*",
    "cpe:2.3:a:phpspreadsheet_project:phpspreadsheet:*:*:*:*:*:php:*:*"
  ]
}

When creating CPEs:

Follow the standard CPE format: cpe:2.3:part:vendor:product:version:update:edition:language:sw_edition:target_sw:target_hw:other
Use asterisks (*) for fields that apply to all values
Ensure vendor and product names follow standardized formats

Scenario 6: Removing a match

Sometimes a match needs to be removed so that it no longer appears in the results. There are many reasons this could happen: details may be missing for a very old vulnerability, the upstream project may have marked the vulnerability as wontfix, or the original CPE may simply be incorrect.

Using the example from Scenario 1:

{
  "additionalMetadata": {
    "cna": "github_m",
    "cveId": "CVE-2025-23214",
    "description": "Cosmos provides users the ability self-host a home server...",
    "reason": "Added CPE configurations because not yet analyzed by NVD."
  },
  "adp": {
    "affected": [
      {
        "collectionURL": "https://pkg.go.dev",
        "cpes": [
          "cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"
        ],
        "packageName": "github.com/azukaar/cosmos-server",
        "packageType": "go-module",
        "product": "Cosmos-Server",
        "repo": "https://github.com/azukaar/cosmos-server",
        "vendor": "azukaar",
        "versions": [
          {
            "lessThan": "0.17.7",
            "status": "affected",
            "version": "0",
            "versionType": "custom"
          }
        ]
      }
    ]
  }
}

To remove the match, change the line "status": "affected" to "status": "unaffected".

We modify the status rather than removing the data so that future script runs won't mistakenly re-add the incorrect data.
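
The same edit can be expressed programmatically. This sketch assumes every version entry in the record should be flipped, which may not hold for records with several affected packages; `mark_unaffected` is a hypothetical helper, and editing the JSON by hand works just as well.

```python
import copy


def mark_unaffected(record: dict) -> dict:
    """Return a copy of the record with affected entries set to unaffected."""
    updated = copy.deepcopy(record)  # leave the caller's record untouched
    for affected in updated["adp"]["affected"]:
        for entry in affected.get("versions", []):
            if entry.get("status") == "affected":
                entry["status"] = "unaffected"
    return updated


record = {
    "adp": {"affected": [{
        "versions": [{"lessThan": "0.17.7", "status": "affected", "version": "0"}],
    }]},
}
print(mark_unaffected(record)["adp"]["affected"][0]["versions"][0]["status"])
```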

Best Practices

Documentation: Always include clear commit messages explaining your changes. Examples:
- "updates 2025-01-21" - For routine updates with multiple changes
- "reconcile with NVD" - When aligning data with NVD records
- "Update CVE-YYYY-XXXXX with fix version" - For specific version updates
Version Ranges: Be precise with version ranges to avoid false positives/negatives
CPE Validation: When possible, verify CPEs against the official dictionary
Incremental Changes: Submit smaller, focused pull requests rather than large changes
Review Process: Pay attention to automated checks and respond to review feedback

Contributing Your Changes

Create a branch for your changes:

cd cve-data-enrichment
git checkout -b fix-cve-yyyy-xxxxx

Commit your changes with a clear message:

git add data/anchore/yyyy/CVE-yyyy-xxxxx.json
git commit -s -m "Update CVE-YYYY-XXXXX with correct version range"

Don't forget to use the signoff option -s.

Push your changes and create a pull request:
```
git push origin fix-cve-yyyy-xxxxx
```
Open a pull request on GitHub against the main cve-data-enrichment repository

Your changes will be reviewed, and once merged, they will automatically flow through to the nvd-data-overrides repository.

Getting Help

Check existing issues in the vulnerability-data-tools repository
Join the Anchore Community Discourse
Review the FAQ in the vulnerability-data-tools README

Remember: The goal is to improve the quality of vulnerability data for everyone. Take your time to verify changes and don't hesitate to ask for help when needed.
