@djwhitt djwhitt commented Aug 15, 2025

Summary

Adds a validation script, scripts/validate-parquet, that compares Parquet exports produced by the TypeScript and bash export methods and validates them against the source SQLite databases.

Changes

  • Add scripts/validate-parquet with three validation modes:
    • source: Validate Parquet against SQLite databases
    • compare: Compare two Parquet export directories
    • full: Both source and cross-export validation (default)

Features

  • Row count verification between SQLite source and Parquet exports
  • Data sampling with configurable sample rates (default 1%)
  • File structure comparison between different export methods
  • Schema validation ensuring column types match
  • Height range consistency checking
  • Color-coded console output with ✓/✗/⚠ indicators
  • Detailed log files and JSON summary reports
  • Smart defaults for common directory structures
  • Exit codes for programmatic usage
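
The row count and sampling checks listed above could look roughly like the following. This is a minimal sketch of the comparison logic only, not the code in scripts/validate-parquet: the `items` table, its `id` and `height` columns, and the `source_rows` and `validate` helpers are hypothetical, and the Parquet side is stubbed as an in-memory list of rows so the example needs nothing beyond the Python standard library.

```python
import random
import sqlite3


def source_rows(conn, table, start_height, end_height):
    """Fetch rows in the inclusive height range from the SQLite source.

    The table name is interpolated directly, so it must come from trusted code.
    """
    cur = conn.execute(
        f"SELECT id, height FROM {table} WHERE height BETWEEN ? AND ? ORDER BY id",
        (start_height, end_height),
    )
    return [{"id": r[0], "height": r[1]} for r in cur]


def validate(source, exported, sample_rate=0.01, seed=0):
    """Compare row counts, then spot-check a random sample of source rows."""
    result = {
        "source_count": len(source),
        "export_count": len(exported),
        "mismatches": [],
    }
    exported_by_id = {row["id"]: row for row in exported}
    # Always sample at least one row when the source is non-empty.
    sample_size = max(1, round(len(source) * sample_rate)) if source else 0
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    for row in rng.sample(source, sample_size):
        if exported_by_id.get(row["id"]) != row:
            result["mismatches"].append(row["id"])
    result["ok"] = (
        result["source_count"] == result["export_count"]
        and not result["mismatches"]
    )
    return result


# Hypothetical source data: 200 rows, one per height starting at 900000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, height INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(f"tx{i:03d}", 900000 + i) for i in range(200)],
)

source = source_rows(conn, "items", 900000, 900099)
exported = [dict(row) for row in source]  # stand-in for rows read from Parquet
report = validate(source, exported, sample_rate=0.05)
print(report["ok"], report["source_count"], report["export_count"])
```

Comparing counts first is cheap and catches missing or duplicated rows; the sample then catches value-level differences without reading every row, which is the trade-off a configurable sample rate exposes.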

Usage Examples

# Full validation with defaults
./scripts/validate-parquet

# Quick metadata-only check
./scripts/validate-parquet --quick

# Validate specific height range
./scripts/validate-parquet --start-height 900000 --end-height 910000

# Compare export methods only
./scripts/validate-parquet --mode compare
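
Because the script reports its result through its exit code, the invocations above can also be driven from another program. The sketch below assumes the usual convention that 0 means every check passed and any non-zero value means a failure; the PR description does not list the specific codes, and `run_validation` is a hypothetical helper.

```python
import subprocess


def run_validation(cmd):
    """Run a validation command; return (passed, captured stdout).

    Assumes exit code 0 means success and any other value means failure.
    """
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode == 0, completed.stdout


# Example call, to be run from the repository root:
# passed, output = run_validation(["./scripts/validate-parquet", "--mode", "compare"])
# if not passed:
#     raise SystemExit("Parquet validation failed")
```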

Test plan

  • Test with existing Parquet exports from both TypeScript and bash methods
  • Verify row count accuracy against SQLite databases
  • Test height range filtering functionality
  • Validate JSON report generation
  • Test error handling with missing files/directories

Related

  • JIRA: PE-8458
  • Addresses data integrity validation between export methods

🤖 Generated with Claude Code

Add scripts/validate-parquet to compare TypeScript and bash Parquet
export outputs and validate against source SQLite databases.

Features:
- Three validation modes: source, compare, full
- Row count verification between SQLite and Parquet
- Data sampling with configurable rates
- File structure comparison between export methods
- Color-coded output with detailed logging
- JSON report generation
- Smart defaults for common directory structures
- Height range filtering support

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>