Contributing to Vulnerability Data Tools: A Technical Guide

Introduction

The Vulnerability Data Tools project aims to improve and enrich vulnerability data by maintaining accurate Common Platform Enumeration (CPE) mappings and other metadata. This guide will walk you through the process of contributing to the project, from setting up your environment to submitting your first pull request.

Project Overview

The project consists of three main repositories that work together:

  1. vulnerability-data-tools: The primary development repository containing scripts and tools for processing vulnerability data. This is where the core functionality lives.

  2. cve-data-enrichment: The repository where human contributors submit updates and corrections to vulnerability data. This is where you'll be submitting most of your pull requests.

  3. nvd-data-overrides: An automatically generated repository containing the processed data in NVD-compatible format. This repository is updated through GitHub Actions based on changes in cve-data-enrichment.

Think of these repositories as a pipeline: Contributors make changes in cve-data-enrichment, the tools in vulnerability-data-tools process these changes, and the results are published in nvd-data-overrides.

Getting Started

Prerequisites

  • Python 3.10 or newer
  • Git
  • A GitHub account
  • The crane tool for container operations

Setting Up Your Environment

  1. First, install the crane tool:

    # On macOS using Homebrew
    brew install crane
    
    # For other platforms, see: https://github.com/google/go-containerregistry
  2. Clone the vulnerability-data-tools repository:

    git clone https://github.com/anchore/vulnerability-data-tools
    cd vulnerability-data-tools
  3. Create and activate a Python virtual environment:

    # Using uv (recommended for speed)
    uv venv
    source .venv/bin/activate
    
    # Or using venv
    python -m venv .venv
    source .venv/bin/activate
    
    # Or using your preferred virtual environment tool
  4. Install required Python packages:

    # If using uv
    uv pip install requests check-jsonschema cpe
    
    # If using pip
    pip install requests check-jsonschema cpe
  5. Fork and clone the cve-data-enrichment repository:

    # Fork the repository on GitHub first, then:
    cd nvd
    git clone https://github.com/YOUR-USERNAME/cve-data-enrichment

Understanding the Vulnerability Data Ecosystem

Before diving into the specific tools, it's important to understand how different sources of vulnerability data work together. The vulnerability data ecosystem consists of several key components:

  1. The National Vulnerability Database (NVD) provides vulnerability information based on the official CVE data. NVD adds metadata such as CPE configurations that describe affected software, as well as severity information such as CVSS scores.

  2. The CVE Project (in version 5 format) provides the foundational vulnerability records. CVE records can include detailed information about affected packages and versions, but do not always.

  3. Package ecosystems (like npm, PyPI, or Go) maintain their own repositories with package metadata that helps identify vulnerable versions.

  4. Linux distributions (Debian, Ubuntu, Red Hat, SUSE) maintain security trackers that provide additional context about vulnerabilities in their packages.

Our tools help bridge these different sources of information, ensuring accurate and comprehensive vulnerability data. Let's explore how each component works.

Data Collection and Enrichment Architecture

The vulnerability data tools use a pipeline approach to gather and process data:

  1. Data Collection Layer: The first layer of tools gathers data from various sources:

    # wordpress_name_to_slug.py connects to wordpress.org's API
    url_format = "https://api.wordpress.org/{wp_type}/info/1.2/?action=query_{wp_type}&request[page]={page}"
  2. Normalization Layer: The collected data goes through normalization to ensure consistency:

    # normalization.py handles vendor name standardization
    vendor_aliases = {
        "red hat": "redhat",
        "apache software foundation": "apache",
        "microsoft corporation": "microsoft"
    }
  3. Enrichment Layer: Additional metadata is added to make the data more useful:

    # enrich_package_metadata.py adds ecosystem-specific details
    if package_type == "maven":
        # Add Maven-specific target software information
        components[-3] = "maven"
  4. Validation Layer: The enriched data is validated before being committed:

    check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
                     cve-data-enrichment/data/anchore/**/CVE-*.json
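
To make the normalization layer more concrete, here is a minimal sketch of alias-based vendor normalization. The function name `normalize_vendor` and the whitespace and case handling are assumptions for illustration; the actual contents of normalization.py may differ.

```python
import re

# Alias table mirroring the excerpt above; the real table is likely larger.
VENDOR_ALIASES = {
    "red hat": "redhat",
    "apache software foundation": "apache",
    "microsoft corporation": "microsoft",
}


def normalize_vendor(name: str) -> str:
    """Return a lowercase, alias-resolved vendor name (hypothetical helper)."""
    # Collapse runs of whitespace and lowercase before the alias lookup.
    key = re.sub(r"\s+", " ", name).strip().lower()
    return VENDOR_ALIASES.get(key, key)


if __name__ == "__main__":
    for raw in ["Red Hat", "  Apache   Software Foundation ", "Anchore"]:
        print(f"{raw!r} -> {normalize_vendor(raw)!r}")
```

Names without an alias pass through lowercased, so unknown vendors are still formatted consistently.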

Script Usage and Options

Each script provides specific functionality and options. Here are some common scenarios:

CVE Enrichment Record Generation

The cve_enrichment_record_candidates_from_upstream_cve5.py script is one of the most commonly used tools. It supports several useful options:

# Process specific CVEs
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --cves CVE-2024-XXXXX CVE-2024-YYYYY

# Process only records from specific assigners
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --assigners github_m wordfence

# Force processing of records, even if they have existing CPE data
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --force

# Limit the batch size to avoid overwhelming review capacity
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --batch-size 10

# Enable verbose output for debugging
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 \
    --verbose

The script makes intelligent decisions about which records to process:

  • Skips rejected CVEs
  • Avoids processing records that already have CPE configurations
  • Handles different package ecosystems appropriately
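
The skip rules above can be sketched as a small predicate. This is an illustration only: `should_process` and its `has_existing_cpes` argument are hypothetical, and the sketch assumes the record follows the CVE v5 layout, where `cveMetadata.state` is `PUBLISHED` or `REJECTED`. Whether `--force` also affects rejected records is not stated in this guide, so the sketch keeps skipping them.

```python
def should_process(record: dict, has_existing_cpes: bool, force: bool = False) -> bool:
    """Decide whether a CVE v5 record is a candidate for enrichment.

    `has_existing_cpes` is supplied by the caller (hypothetical); how the real
    script determines it is not covered here.
    """
    state = record.get("cveMetadata", {}).get("state", "")
    if state.upper() == "REJECTED":
        return False  # rejected CVEs are always skipped in this sketch
    if has_existing_cpes and not force:
        return False  # CPE configurations already exist; force overrides this
    return True


if __name__ == "__main__":
    published = {"cveMetadata": {"state": "PUBLISHED"}}
    print(should_process(published, has_existing_cpes=True))
    print(should_process(published, has_existing_cpes=True, force=True))
```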

Package Metadata Enrichment

The enrich_package_metadata.py script adds ecosystem-specific information:

# Enrich all records
python -m scripts.enrich_package_metadata

# Common enrichments include:
# - Adding package repository URLs
# - Standardizing version information
# - Adding ecosystem-specific target software information

Debugging and Logging

The tools use Python's logging framework for debugging and monitoring. You can control logging verbosity:

# Enable debug logging
export PYTHONPATH=.
python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --verbose

# Log output includes:
# - Data processing decisions
# - Skipped records and why
# - CPE lookup results

Common log messages and what they mean:

  • "No CPEs discovered": The tools couldn't automatically determine CPE identifiers
  • "Multiple collectionURL possibilities": Multiple potential package sources found
  • "Skipping git version type": Version information uses git commits instead of releases

Troubleshooting Common Issues

When working with the tools, you might encounter these common situations:

  1. Missing CPE Information

    # If no CPEs are found, check:
    # - Vendor and product normalization
    # - Package ecosystem mappings
    # - Existing CPE dictionary entries
  2. Version Parsing Issues

    # The tools handle various version formats:
    # - Semantic versions (1.2.3)
    # - Custom versions (2024.1.1)
    # - Range specifications (<2.0.0)
  3. Data Reconciliation Conflicts

    # When upstream data changes:
    # - Use --reconcile flag to update
    # - Check version ranges carefully
    # - Validate CPE configurations
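
As an illustration of the range specifications listed under Version Parsing Issues, the sketch below converts a string such as `<2.0.0` into a version entry shaped like the ones used in enrichment records. The helper `range_to_version_entry` is hypothetical, and it handles only the `<` and `<=` operators.

```python
import re

# Matches "<X" or "<=X", with optional surrounding whitespace.
_RANGE = re.compile(r"^\s*(<=|<)\s*(\S+)\s*$")


def range_to_version_entry(spec: str, version_type: str = "custom") -> dict:
    """Convert '<X' or '<=X' into a version entry (hypothetical helper)."""
    match = _RANGE.match(spec)
    if not match:
        raise ValueError(f"unsupported range specification: {spec!r}")
    operator, bound = match.groups()
    key = "lessThan" if operator == "<" else "lessThanOrEqual"
    return {"version": "0", key: bound, "status": "affected", "versionType": version_type}


if __name__ == "__main__":
    print(range_to_version_entry("<2.0.0"))
    print(range_to_version_entry("<= 8.7.13", version_type="semver"))
```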

Batch Processing and Automation

For larger-scale updates, you can combine the data processing scripts listed below.

Data Processing Scripts

  • cpe_lookups_from_cve5.py: Generates CPE pattern lookups from CVE v5 data
  • enrich_package_metadata.py: Enriches vulnerability data with additional package information
  • cve_enrichment_record_candidates_from_upstream_cve5.py: Creates enrichment record candidates from upstream CVE v5 data, with options like:
    # Process specific CVEs
    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --cves CVE-2024-XXXXX
    
    # Process records from specific assigners
    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --assigners github_m wordfence
    
    # Force processing even for records with existing CPE data
    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --force

Normalization and Validation

The toolset includes robust normalization capabilities through normalization.py which helps ensure consistent data formatting:

  • Vendor and product name normalization
  • Collection URL normalization
  • Version string parsing and normalization
  • Package type detection

Typical Workflow

The typical workflow for contributing vulnerability data updates follows these steps:

  1. Data Update: Run scripts to fetch the latest data
  2. Analysis: Process new CVEs and identify needed updates
  3. Review: Verify CPE names and version information
  4. Validation: Check changes against the schema
  5. Submission: Create a pull request with your changes

Updating Data

Start by running the update script to fetch the latest vulnerability data:

cd nvd
./scripts/update.sh

This script performs several important tasks:

  • Downloads the official CPE dictionary
  • Fetches current NVD data
  • Updates local git repositories with the latest information

Processing New CVEs

Two main scripts handle CVE processing:

  1. Find new CPEs to add:

    python -m scripts.cpe_lookups_from_cve5
  2. Generate enrichment records:

    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --batch-size 100

Reviewing Changes

After generating new records, you'll need to review the changes in the data/anchore directory. Pay special attention to:

  1. Files marked with needsReview tags
  2. CPE names and their accuracy
  3. Version information and ranges
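
One way to find the files that still need attention is a plain text search for the needsReview tag. The sketch below assumes only that the tag appears somewhere in the file's text; where exactly it lives in the record is not described here, so no JSON structure is assumed. The function name `files_needing_review` is hypothetical.

```python
from pathlib import Path


def files_needing_review(data_dir: str) -> list[Path]:
    """Return CVE JSON files under data_dir whose text mentions needsReview."""
    hits = []
    for path in sorted(Path(data_dir).rglob("CVE-*.json")):
        # Substring search only; the record is not parsed.
        if "needsReview" in path.read_text(encoding="utf-8"):
            hits.append(path)
    return hits


if __name__ == "__main__":
    for path in files_needing_review("cve-data-enrichment/data/anchore"):
        print(path)
```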

When reviewing version information, follow these guidelines:

  • For version ranges, use <= for "up to and including" and < for "up to but not including"
    • "up to and including" means we know version X and all earlier versions are vulnerable, and there is currently no fix
    • "up to but not including" means we know version X is fixed and all earlier versions are affected
  • Verify version numbers against advisory information
  • Check the official CPE dictionary for existing entries
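
The difference between the two bounds is easiest to see in code. The following sketch checks a version against a single version entry, using the `lessThan` and `lessThanOrEqual` keys that appear in enrichment records. It is a simplification: `_key` handles only plain dotted integers such as 8.7.13, and real version comparison in the tools may be more involved.

```python
def _key(version: str) -> tuple[int, ...]:
    """Turn '8.7.13' into (8, 7, 13); only plain dotted integers are handled."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version: str, entry: dict) -> bool:
    """Check a version against one version entry (illustrative only)."""
    v = _key(version)
    start = entry.get("version", "0")
    if v < _key(start):
        return False
    if "lessThan" in entry:  # exclusive bound: the named version is fixed
        return v < _key(entry["lessThan"])
    if "lessThanOrEqual" in entry:  # inclusive bound: no fix known yet
        return v <= _key(entry["lessThanOrEqual"])
    return v == _key(start)


if __name__ == "__main__":
    fixed = {"version": "0", "lessThan": "8.7.16"}
    unfixed = {"version": "0", "lessThanOrEqual": "8.7.13"}
    print(is_affected("8.7.16", fixed), is_affected("8.7.13", unfixed))
```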

Enriching Metadata

After your initial review, enrich the data with additional information:

python -m scripts.enrich_package_metadata
python scripts/compare_to_distros.py

Validating Changes

Always validate your changes against the JSON schema before submitting:

check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
                 cve-data-enrichment/data/anchore/**/CVE-*.json
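
Whether `**` recurses into subdirectories depends on the shell (bash, for example, needs the globstar option enabled). If that is a concern, one option is to expand the glob in Python and pass the file list to check-jsonschema. The sketch below does this; `build_command` is a hypothetical helper, and the paths are the ones used in the command above.

```python
import glob
import subprocess

SCHEMA = "cve-data-enrichment/schema/enrichment_record.schema.json"
DATA_GLOB = "cve-data-enrichment/data/anchore/**/CVE-*.json"


def build_command(schema: str = SCHEMA, pattern: str = DATA_GLOB) -> list[str]:
    """Expand the recursive glob in Python and build the check-jsonschema call."""
    files = sorted(glob.glob(pattern, recursive=True))
    return ["check-jsonschema", "--schemafile", schema, *files]


if __name__ == "__main__":
    command = build_command()
    if len(command) > 3:
        # Requires check-jsonschema to be installed and on PATH.
        subprocess.run(command, check=False)
    else:
        print("no enrichment records matched", DATA_GLOB)
```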

Common Scenarios and Examples

Scenario 1: Adding New CVE Data

When new vulnerabilities are reported, you may need to add them to the dataset. Here's a real-world example from January 2025:

{
  "additionalMetadata": {
    "cna": "github_m",
    "cveId": "CVE-2025-23214",
    "description": "Cosmos provides users the ability self-host a home server...",
    "reason": "Added CPE configurations because not yet analyzed by NVD."
  },
  "adp": {
    "affected": [
      {
        "collectionURL": "https://pkg.go.dev",
        "cpes": [
          "cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"
        ],
        "packageName": "github.com/azukaar/cosmos-server",
        "packageType": "go-module",
        "product": "Cosmos-Server",
        "repo": "https://github.com/azukaar/cosmos-server",
        "vendor": "azukaar",
        "versions": [
          {
            "lessThan": "0.17.7",
            "status": "affected",
            "version": "0",
            "versionType": "custom"
          }
        ]
      }
    ]
  }
}

This example shows the key components:

  • Basic metadata about the CVE
  • Affected package information including:
    • Collection URL (where the package is hosted)
    • CPE identifiers
    • Package details and type
    • Version ranges that are affected
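
To show how these components fit together, the sketch below reads a trimmed copy of the record above and prints one line per affected version range. The `summarize` helper is hypothetical and relies only on the keys visible in the example.

```python
import json

# Trimmed copy of the example record above.
RECORD = json.loads("""
{
  "additionalMetadata": {"cna": "github_m", "cveId": "CVE-2025-23214"},
  "adp": {"affected": [{
      "cpes": ["cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"],
      "packageName": "github.com/azukaar/cosmos-server",
      "packageType": "go-module",
      "versions": [{"lessThan": "0.17.7", "status": "affected",
                    "version": "0", "versionType": "custom"}]
  }]}
}
""")


def summarize(record: dict) -> list[str]:
    """Return one human-readable line per affected version range."""
    cve_id = record["additionalMetadata"]["cveId"]
    lines = []
    for affected in record["adp"]["affected"]:
        for v in affected.get("versions", []):
            if "lessThan" in v:
                bound = f"< {v['lessThan']}"
            elif "lessThanOrEqual" in v:
                bound = f"<= {v['lessThanOrEqual']}"
            else:
                bound = f"== {v['version']}"
            lines.append(f"{cve_id} {affected['packageName']} {bound} ({v['status']})")
    return lines


if __name__ == "__main__":
    print("\n".join(summarize(RECORD)))
```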

Scenario 2: Updating Version Ranges

A common task is updating version ranges when a new release fixes a vulnerability. For example, this change from January 2025:

   "versions": [
     {
-      "lessThanOrEqual": "8.7.13",
+      "lessThan": "8.7.16",
       "status": "affected",
       "version": "0",
       "versionType": "semver"
     }
   ]

To update a specific CVE record:

  1. Generate the record if it doesn't exist:

    python -m scripts.cve_enrichment_record_candidates_from_upstream_cve5 --cves CVE-YYYY-XXXXX --force
  2. Edit the file in cve-data-enrichment/data/anchore/YYYY/CVE-YYYY-XXXXX.json

  3. Run validation:

    check-jsonschema --schemafile cve-data-enrichment/schema/enrichment_record.schema.json \
                     cve-data-enrichment/data/anchore/YYYY/CVE-YYYY-XXXXX.json
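
The edit in step 2 can also be expressed programmatically. The sketch below applies the same change as the diff in this scenario to a single version entry: it drops the inclusive bound and sets an exclusive bound at the first fixed release. `set_fixed_version` is a hypothetical helper, not part of the project's scripts.

```python
import copy


def set_fixed_version(entry: dict, fixed_version: str) -> dict:
    """Return a copy of a version entry bounded by the first fixed release."""
    updated = copy.deepcopy(entry)
    updated.pop("lessThanOrEqual", None)  # the inclusive bound no longer applies
    updated["lessThan"] = fixed_version   # exclusive bound: the fixed release
    return updated


if __name__ == "__main__":
    before = {"lessThanOrEqual": "8.7.13", "status": "affected",
              "version": "0", "versionType": "semver"}
    print(set_fixed_version(before, "8.7.16"))
```

Returning a copy leaves the original entry untouched, which makes it easy to compare before and after during review.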

Scenario 3: Adding Package Repository Information

For WordPress plugins and other packages hosted in public repositories, it's important to include repository information. Here's an example:

   "packageName": "categorify",
   "packageType": "wordpress-plugin",
   "product": "Categorify – WordPress Media Library Category & File Manager",
+  "repo": "https://plugins.svn.wordpress.org/categorify",
   "vendor": "frenify",

The scripts in vulnerability-data-tools/nvd/scripts/ help automate this process:

  • wordpress_name_to_slug.py maps plugin names to their repository slugs
  • normalization.py provides functions for standardizing package and vendor names
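
As a rough sketch of the repository enrichment shown in this scenario, the code below fills in a missing "repo" field for WordPress plugins. It assumes the package name is the plugin slug, as in the categorify example, and that slugs are lowercase. Both helper names are hypothetical.

```python
from urllib.parse import quote

SVN_BASE = "https://plugins.svn.wordpress.org"


def wordpress_plugin_repo(slug: str) -> str:
    """Build the plugin SVN URL for a slug (hypothetical helper)."""
    return f"{SVN_BASE}/{quote(slug.strip().lower(), safe='')}"


def add_repo(affected: dict) -> dict:
    """Return a copy of an affected entry with "repo" filled in when missing."""
    if affected.get("packageType") != "wordpress-plugin" or "repo" in affected:
        return dict(affected)
    return {**affected, "repo": wordpress_plugin_repo(affected["packageName"])}


if __name__ == "__main__":
    entry = {"packageName": "categorify", "packageType": "wordpress-plugin",
             "vendor": "frenify"}
    print(add_repo(entry)["repo"])
```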

Scenario 4: Handling False Negatives

When addressing false negatives (usually reported by users or through issue reports):

  1. Generate the CVE record if it doesn't exist
  2. Update the CPE information based on the reported issue
  3. Verify the version information
  4. Remove the needsReview tag after confirmation

Here's a real example of correcting version information for a previous false negative:

Scenario 5: Correcting Inaccurate CPE Data

CPE (Common Platform Enumeration) data sometimes needs correction to ensure accurate vulnerability tracking. Here's an example workflow:

  1. Create or update the CVE record
  2. Add the correct CPE information
  3. Document the reason for the override in the commit message
  4. Validate the changes

Example CPE corrections:

{
  "cpes": [
    "cpe:2.3:a:phpoffice:phpspreadsheet:*:*:*:*:*:php:*:*",
    "cpe:2.3:a:phpspreadsheet_project:phpspreadsheet:*:*:*:*:*:php:*:*"
  ]
}

When creating CPEs:

  • Follow the standard CPE format: cpe:2.3:part:vendor:product:version:update:edition:language:sw_edition:target_sw:target_hw:other
  • Use asterisks (*) for fields that apply to all values
  • Ensure vendor and product names follow standardized formats
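
A quick way to sanity-check a CPE string is to split it into the eleven named fields from the format above. The sketch below uses plain string splitting rather than a CPE library, so it does not handle escaped colons inside a field. `parse_cpe` is a hypothetical helper.

```python
CPE_FIELDS = ("part", "vendor", "product", "version", "update", "edition",
              "language", "sw_edition", "target_sw", "target_hw", "other")


def parse_cpe(cpe: str) -> dict:
    """Split a CPE 2.3 formatted string into its 11 named fields.

    Simplification: escaped colons inside a field are not handled.
    """
    parts = cpe.split(":")
    if parts[:2] != ["cpe", "2.3"] or len(parts) != 13:
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    return dict(zip(CPE_FIELDS, parts[2:]))


if __name__ == "__main__":
    fields = parse_cpe("cpe:2.3:a:phpoffice:phpspreadsheet:*:*:*:*:*:php:*:*")
    print(fields["vendor"], fields["product"], fields["target_sw"])
```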

Scenario 6: Removing a Match

Sometimes a match needs to be removed so that it no longer appears in the results. There are many reasons this could happen: details may be missing for a very old vulnerability, the upstream project may have marked the vulnerability as wontfix, or the original CPE may simply be incorrect.

Using the example from Scenario 1:

{
  "additionalMetadata": {
    "cna": "github_m",
    "cveId": "CVE-2025-23214",
    "description": "Cosmos provides users the ability self-host a home server...",
    "reason": "Added CPE configurations because not yet analyzed by NVD."
  },
  "adp": {
    "affected": [
      {
        "collectionURL": "https://pkg.go.dev",
        "cpes": [
          "cpe:2.3:a:cosmos-cloud:cosmos_server:*:*:*:*:*:go:*:*"
        ],
        "packageName": "github.com/azukaar/cosmos-server",
        "packageType": "go-module",
        "product": "Cosmos-Server",
        "repo": "https://github.com/azukaar/cosmos-server",
        "vendor": "azukaar",
        "versions": [
          {
            "lessThan": "0.17.7",
            "status": "affected",
            "version": "0",
            "versionType": "custom"
          }
        ]
      }
    ]
  }
}

We change the line "status": "affected" to "status": "unaffected".

We modify the status rather than removing the data so that future script runs won't mistakenly re-add the incorrect data.
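
The same change can be sketched in code. The hypothetical helper below flips every version entry in a record to unaffected while leaving the rest of the record in place. In practice you may want to change only the specific entry that is wrong.

```python
import copy


def mark_unaffected(record: dict) -> dict:
    """Return a copy of the record with every version status set to unaffected."""
    updated = copy.deepcopy(record)
    for affected in updated["adp"]["affected"]:
        for version in affected.get("versions", []):
            version["status"] = "unaffected"
    return updated


if __name__ == "__main__":
    record = {"adp": {"affected": [{"versions": [
        {"lessThan": "0.17.7", "status": "affected", "version": "0",
         "versionType": "custom"}]}]}}
    print(mark_unaffected(record)["adp"]["affected"][0]["versions"][0]["status"])
```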

Best Practices

  1. Documentation: Always include clear commit messages explaining your changes. Examples:

    • "updates 2025-01-21" - For routine updates with multiple changes
    • "reconcile with NVD" - When aligning data with NVD records
    • "Update CVE-YYYY-XXXXX with fix version" - For specific version updates
  2. Version Ranges: Be precise with version ranges to avoid false positives/negatives

  3. CPE Validation: When possible, verify CPEs against the official dictionary

  4. Incremental Changes: Submit smaller, focused pull requests rather than large changes

  5. Review Process: Pay attention to automated checks and respond to review feedback

Contributing Your Changes

  1. Create a branch for your changes:

    cd cve-data-enrichment
    git checkout -b fix-cve-yyyy-xxxxx
  2. Commit your changes with a clear message:

    git add data/anchore/yyyy/CVE-yyyy-xxxxx.json
    git commit -s -m "Update CVE-YYYY-XXXXX with correct version range"

    Don't forget the sign-off option, -s.

  3. Push your changes and create a pull request:

    git push origin fix-cve-yyyy-xxxxx
  4. Open a pull request on GitHub against the main cve-data-enrichment repository

Your changes will be reviewed, and once merged, they will automatically flow through to the nvd-data-overrides repository.

Getting Help

Remember: The goal is to improve the quality of vulnerability data for everyone. Take your time to verify changes and don't hesitate to ask for help when needed.