Registry Data Integrity Validation #527

Description

@jordanpadams

Are you sure this is not a new requirement or bug?

No

Task Type

Theme

💡 Description

The registry continues to experience data integrity issues: products missing from the archive, broken references, and inconsistencies between what was delivered and what is actually queryable. This theme establishes a systematic set of integrity checks to detect, report, and remediate these problems.

Sub-tasks foreseen are:

  • Collection inventory completeness: compare LIDVIDs listed in collection inventory files against products actually loaded in the registry; report products present in inventories but absent from the registry (As a data engineer, I want to generate a list of product LIDVIDs tracked in inventory files but missing from the registry #508)
  • Orphaned product detection: identify products loaded in the registry that are not referenced by any collection inventory (loaded but untracked/abandoned)
  • Internal reference validation: validate that all internal LID/LIDVID references within product labels (e.g. <Internal_Reference>) resolve to existing products in the registry
  • Bundle/collection hierarchy completeness: verify that all collections referenced in bundle inventory files exist in the registry, and all bundles referencing those collections are consistent
  • Duplicate LIDVID detection: identify cases where the same LIDVID has been ingested more than once, resulting in duplicate or conflicting records
  • Superseded version consistency: confirm that products with multiple versions are correctly marked (latest version active, superseded versions flagged), with no stale "latest" markers on older versions
  • Label-to-index metadata consistency: detect drift between field values indexed in OpenSearch and the corresponding values in the source PDS4 label (e.g. from partial re-ingestion or failed updates)
  • Broken file reference detection: identify products whose label file references (file_ref, file_name, url) point to archive locations that are no longer resolvable
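
To make the first few checks concrete, here is a minimal sketch of how inventory completeness, orphan detection, duplicate detection, and the stale "latest" check might reduce to set and counter operations once the LIDVIDs are in memory. Everything in it is an assumption for illustration: `IntegrityReport`, `check_integrity`, `parse_lidvid`, and `stale_latest_markers` are hypothetical names, the per-product "latest" flag is a hypothetical input, and how the LIDVIDs are actually read from inventory files or queried from the registry is deliberately left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IntegrityReport:
    """Hypothetical result container; not an existing registry type."""
    missing_from_registry: list[str]  # listed in an inventory, not loaded
    orphaned_in_registry: list[str]   # loaded, not listed in any inventory
    duplicate_lidvids: list[str]      # loaded more than once


def check_integrity(inventory_lidvids: Iterable[str],
                    registry_lidvids: Iterable[str]) -> IntegrityReport:
    """Compare LIDVIDs from inventories against LIDVIDs found in the registry.

    Both inputs are assumed to be plain LIDVID strings, with one registry
    entry per indexed document (so duplicates show up as repeats).
    """
    inventory = set(inventory_lidvids)
    registry_counts = Counter(registry_lidvids)
    registry = set(registry_counts)
    return IntegrityReport(
        missing_from_registry=sorted(inventory - registry),
        orphaned_in_registry=sorted(registry - inventory),
        duplicate_lidvids=sorted(
            lidvid for lidvid, n in registry_counts.items() if n > 1
        ),
    )


def parse_lidvid(lidvid: str) -> tuple[str, tuple[int, int]]:
    """Split 'lid::M.n' into the LID and a numerically comparable version."""
    lid, _, vid = lidvid.partition("::")
    major, _, minor = vid.partition(".")
    return lid, (int(major), int(minor))


def stale_latest_markers(latest_flags: dict[str, bool]) -> list[str]:
    """Return LIDVIDs flagged as latest that are not the highest version.

    `latest_flags` maps each loaded LIDVID to a hypothetical boolean
    "is latest" marker; the real registry field may differ.
    """
    newest: dict[str, tuple[int, int]] = {}
    for lidvid in latest_flags:
        lid, vid = parse_lidvid(lidvid)
        newest[lid] = max(newest.get(lid, vid), vid)
    stale = []
    for lidvid, is_latest in latest_flags.items():
        lid, vid = parse_lidvid(lidvid)
        if is_latest and vid != newest[lid]:
            stale.append(lidvid)
    return sorted(stale)


if __name__ == "__main__":
    # Made-up LIDVIDs, for illustration only.
    report = check_integrity(
        inventory_lidvids=["urn:nasa:pds:b:c:p1::1.0", "urn:nasa:pds:b:c:p2::1.0"],
        registry_lidvids=["urn:nasa:pds:b:c:p1::1.0", "urn:nasa:pds:b:c:p1::1.0",
                          "urn:nasa:pds:b:c:p3::1.0"],
    )
    print(report)
```

Versions are compared as integer pairs rather than strings so that, for example, 1.10 sorts after 1.9. The remaining checks (internal references, label-to-index drift, broken file references) need label parsing or archive access and are not covered by this sketch.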

Motivation

Multiple nodes have independently encountered missing or inconsistent data in the registry. Ad-hoc, one-off scripts are being written to diagnose specific incidents (e.g. ~500 missing M20 products). A systematic suite of integrity checks will allow node operators and data engineers to proactively detect and remediate data quality issues rather than reacting to user-reported gaps.

🤖 Generated with Claude Code

    Status
    ToDo
