
Bug Report: Processing Level Search Missing 40,121 Collections with "Not provided" #2358

Description

@iamsims

cmr_processing_level_report.json
generate_processing_level_report.py

Summary

The CMR search API's processing_level and processing_level_id parameters completely fail to return collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') in UMM-C format. 40,121 collections (74.5% of all CMR collections) are invisible to processing level searches.

Environment

  • CMR Base URL: https://cmr.earthdata.nasa.gov/search/
  • API Endpoint: collections.umm_json
  • Date of Analysis: December 16, 2024
  • Total Collections in CMR: 53,852

Bug Description

Expected Behavior

Collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') should be searchable using the processing_level_id parameter.

From exhaustive scan of all collections:

  • 40,121 collections have ProcessingLevel.Id = "Not provided" (lowercase 'p')
  • 161 collections have ProcessingLevel.Id = "Not Provided" (capital 'P')
  • These are distinct, non-overlapping sets (verified)

When searching with processing_level_id="Not provided", the API should return all 40,282 collections (40,121 + 161), given that the search is case-insensitive.

Actual Behavior

Searching for any case variant ('not provided', 'Not provided', 'NOT PROVIDED') returns only 161 collections: the ones with ProcessingLevel.Id = "Not Provided" (capital 'P').

The 40,121 collections with lowercase "Not provided" are completely missing from the search results.
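A quick way to spot-check this by hand is to request a single item per query and read the reported hit count. The sketch below is not the attached script. It uses only the standard library and assumes the UMM-JSON response body carries a top-level `hits` field; `build_search_url`, `count_hits`, and the `fetch` hook are illustrative names introduced here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"


def build_search_url(level, param="processing_level_id"):
    """Build a collection search URL filtered on one processing level value."""
    # page_size=1 keeps the response small; only the hit count is needed.
    return CMR_COLLECTIONS_URL + "?" + urlencode({param: level, "page_size": 1})


def _http_get(url):
    """Default fetcher: plain HTTP GET returning the raw response body."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def count_hits(level, param="processing_level_id", fetch=_http_get):
    """Return the number of collections the search reports for `level`.

    Assumption: the response JSON has a top-level "hits" field.
    `fetch` is injectable so the function can be exercised offline.
    """
    body = json.loads(fetch(build_search_url(level, param)))
    return int(body["hits"])


# Example (performs live requests, so it is left commented out):
# for variant in ("not provided", "Not provided", "NOT PROVIDED", "Not Provided"):
#     print(repr(variant), count_hits(variant))
```

Passing `param="processing_level"` exercises the second parameter named in this report in the same way.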

Analysis Result (from cmr_processing_level_search_index_report.json):

{
  "search_filter_results": {
    "Not provided": {
      "processing_level_id": 161,
      "processing_level": 161,
      "expected": 40121
    },
    "Not Provided": {
      "processing_level_id": 161,
      "processing_level": 161,
      "expected": 161
    }
  }
}

Impact

  • Severity: CRITICAL
  • Collections Affected: 40,121 collections (74.5% of all CMR collections)
  • User Impact: Users searching by processing level miss 99.6% of the collections whose value matches "Not provided" case-insensitively (40,121 of 40,282). This prevents implementing a fail-closed approach to processing level filtering. For example, when searching for processing_level=1, users cannot reliably include collections with unspecified/unknown processing levels by adding OR processing_level="Not provided" to their query, because 99.6% of those collections are missing from the search index.
  • Scope: Only affects "Not provided" (lowercase 'p'); all other 17 processing levels work correctly
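For reference, the two percentages quoted above follow directly from the counts in this report. A minimal arithmetic check (the variable names are illustrative):

```python
lowercase_count = 40_121    # ProcessingLevel.Id == "Not provided"
capital_count = 161         # ProcessingLevel.Id == "Not Provided"
total_collections = 53_852  # all collections in CMR at scan time

# Share of all CMR collections that carry the lowercase value.
share_of_cmr = lowercase_count / total_collections

# Share of the case-insensitive "not provided" group that search fails to return.
share_missing = lowercase_count / (lowercase_count + capital_count)

print(f"{share_of_cmr:.1%}")   # 74.5%
print(f"{share_missing:.1%}")  # 99.6%
```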

Detailed Analysis

Exhaustive Scan Results (Ground Truth)

From comprehensive analysis of all 53,852 collections:

{
  "exhaustive_scan": {
    "total_collections": 53852,
    "collections_with_levels": 53852,
    "unique_levels": 18,
    "levels": {
      "Not provided": 40121,
      "Not Provided": 161,
      "NA": 1763,
      "3": 4401,
      "2": 3476,
      "4": 1471,
      "1B": 1088,
      "1": 705,
      "1A": 234,
      "0": 221,
      "2G": 76,
      "2P": 53,
      "2B": 34,
      "1C": 20,
      "2A": 14,
      "1T": 11,
      "L2": 2,
      "Level 3": 1
    }
  }
}
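The tally above can be reproduced in outline as follows. This is a sketch rather than the attached script: it assumes each search result item nests the UMM-C record under a `umm` key, and the function names are made up for illustration. The second helper shows what a case-insensitive search would be expected to group together.

```python
from collections import Counter


def tally_processing_levels(items):
    """Count collections per exact (case-sensitive) ProcessingLevel.Id.

    Assumption: each item looks like {"umm": {"ProcessingLevel": {"Id": ...}}}.
    """
    counts = Counter()
    for item in items:
        level = item.get("umm", {}).get("ProcessingLevel", {}).get("Id")
        counts[level if level is not None else "<missing>"] += 1
    return counts


def merge_case_variants(counts):
    """Group exact-match tallies by casefolded Id."""
    merged = Counter()
    for level, n in counts.items():
        merged[level.casefold()] += n
    return merged


# Using the two counts from the scan above:
scan = Counter({"Not provided": 40121, "Not Provided": 161})
print(merge_case_variants(scan)["not provided"])  # 40282
```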

Search Filter Test Results

Testing all 18 unique processing level values:

| Processing Level | Expected | Search Returns | Status |
|---|---|---|---|
| Not provided | 40,121 | 161 | ❌ BROKEN |
| Not Provided | 161 | 161 | ✅ Works |
| NA | 1,763 | 1,763 | ✅ Works |
| 3 | 4,401 | 4,401 | ✅ Works |
| 2 | 3,476 | 3,476 | ✅ Works |
| 4 | 1,471 | 1,471 | ✅ Works |
| 1B | 1,088 | 1,088 | ✅ Works |
| 1 | 705 | 705 | ✅ Works |
| 1A | 234 | 234 | ✅ Works |
| 0 | 221 | 221 | ✅ Works |
| (8 others) | * | * | ✅ Works |

Result: 17 out of 18 processing levels work perfectly. Only "Not provided" (lowercase 'p') is broken.

Reproduction Steps

Prerequisites

  • Python 3.8+

Reproduce the Bug

Download and run the comprehensive analysis script:

# Download the attached script
# Install dependencies
pip install httpx

# Run analysis (takes ~10-15 minutes to scan all 53,852 collections)
python generate_processing_level_report.py

The generated JSON report has two top-level fields, exhaustive_scan and search_filter_results. exhaustive_scan holds the actual number of collections per processing level in CMR; search_filter_results holds the number of collections returned when searching with a filter on that processing level.
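Given that structure, discrepancies can be listed by comparing the two fields level by level. The following is a minimal sketch, not part of the attached script; `find_discrepancies` is a hypothetical helper, and the sample data is limited to the two entries quoted in this report.

```python
def find_discrepancies(report):
    """Return {level: {param: (expected, returned)}} for every mismatch."""
    levels = report["exhaustive_scan"]["levels"]
    results = report["search_filter_results"]
    mismatches = {}
    for level, expected in levels.items():
        found = results.get(level, {})
        for param in ("processing_level_id", "processing_level"):
            returned = found.get(param)
            if returned != expected:
                mismatches.setdefault(level, {})[param] = (expected, returned)
    return mismatches


# Excerpt of the report, using only values quoted in this issue.
report = {
    "exhaustive_scan": {"levels": {"Not provided": 40121, "Not Provided": 161}},
    "search_filter_results": {
        "Not provided": {"processing_level_id": 161, "processing_level": 161},
        "Not Provided": {"processing_level_id": 161, "processing_level": 161},
    },
}

for level, params in find_discrepancies(report).items():
    print(level, params)
```

To run it against the full report, load the file with `json.load` and pass the resulting dictionary to the same function.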

Root Cause Analysis

What Works ✅

  • All 17 other processing level values index and search correctly
  • "Not Provided" (capital 'P') works perfectly (returns all 161 collections)
  • Both processing_level and processing_level_id parameters behave identically

What's Broken ❌

  • Collections with ProcessingLevel.Id = "Not provided" (lowercase 'p') are NOT indexed
  • These 40,121 collections are completely invisible to processing level searches
  • Searching for any case variant only returns the 161 "Not Provided" (capital P) collections

Attachments

Analysis Files

  1. generate_processing_level_report.py: Complete analysis script that:

    • Performs exhaustive scan of all 53,852 collections
    • Tests search filters for all 18 unique processing levels
    • Generates comprehensive JSON report
    • Identifies discrepancies and problematic levels
  2. cmr_processing_level_search_index_report.json: Full analysis results including:

    • Exhaustive scan results (ground truth)
    • Search filter test results for each processing level

Conclusion

This is a critical search indexing bug affecting 74.5% of CMR collections. The bug prevents users from discovering 40,121 collections when filtering by processing level, severely impacting data discovery for Earth science research.

The bug is:

  • 100% reproducible with provided analysis script
  • Well-isolated to lowercase "Not provided" only

Report Generated: December 16, 2025
Analysis Scope: All 53,852 CMR collections
Test Coverage: All 18 unique processing level values
Reproducibility: 100% (verified with comprehensive automated testing)
