Re-process historical data after fixing version extraction #46

Description

@grdumas

Summary

After fixing version extraction in processors (issues #41-#45), all historical data in OpenSearch needs to be re-processed to populate correct test.version values.

Current State

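An estimated ~83% of documents in OpenSearch have incorrect test.version values:

  • Fully affected tests: test.version holds the wrapper version or "unknown"; the benchmark version is not stored anywhere in the document
  • Partially affected tests: test.version holds the wrapper version, but the benchmark version is still available in runs[].configuration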

Required Steps

1. Pre-Migration Validation

Before re-processing:

  • Verify all processor fixes are merged
  • Test fixes with sample data
  • Validate new documents have correct versions
  • Create backup of OpenSearch index
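
One way to take the backup is to copy the index into a second index with the _reindex API (a snapshot is the other common choice). The sketch below only builds the request body so it can be reviewed before it is sent. It assumes the index is chronicler-runs, as in the validation queries in this issue; the backup naming scheme and the function name are hypothetical.

```python
from datetime import date


def build_backup_request(index: str, day: date) -> dict:
    """Build a _reindex request body that copies `index` into a dated backup index.

    The "<index>-backup-YYYYMMDD" naming scheme is an assumption, not a project convention.
    """
    backup_index = f"{index}-backup-{day:%Y%m%d}"
    return {
        "source": {"index": index},
        # op_type=create refuses to overwrite documents that already exist in the backup
        "dest": {"index": backup_index, "op_type": "create"},
    }


request = build_backup_request("chronicler-runs", date(2025, 1, 31))
print(request["dest"]["index"])  # chronicler-runs-backup-20250131
```

The body can be sent as POST /_reindex. Note that _reindex does not copy mappings or settings, so the backup index should be created with the same mappings first.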

2. Migration Strategy Options

Option A: Full Re-ingest (Recommended)

  • Re-process all raw result files
  • Generate new documents with correct versions
  • Replace existing documents

Pros: Clean, consistent results
Cons: Time-intensive, requires raw files

Option B: Partial Update

  • Query existing documents
  • Extract benchmark version from runs[].configuration (for partially affected)
  • Update test.version field

Pros: Faster for partially affected tests
Cons: Cannot fix fully affected tests without raw data
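
A minimal sketch of the extraction step in Option B. The layout of runs[].configuration is not described in this issue, so the key names below are assumptions and will likely differ per benchmark; the sample document is made up.

```python
from typing import Optional

# Assumption: key names under which a benchmark version might be stored.
VERSION_KEYS = ("version", "benchmark_version")


def extract_version_from_config(doc: dict) -> Optional[str]:
    """Return the first benchmark version found in runs[].configuration, else None."""
    for run in doc.get("runs", []):
        config = run.get("configuration") or {}
        for key in VERSION_KEYS:
            value = config.get(key)
            if value:
                return str(value)
    return None


# Made-up document: test.version holds the wrapper version, the config holds the real one.
doc = {
    "test": {"name": "fio", "version": "1.2"},
    "runs": [{"configuration": {"version": "3.36"}}],
}
print(extract_version_from_config(doc))  # 3.36
```

Returning None rather than a default makes it easy to route documents with no recoverable version to the re-ingest path.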

Option C: Hybrid

  • Re-ingest fully affected tests (no version in config)
  • Update partially affected tests (version in config)
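
The hybrid option amounts to routing each test to one of two paths. A small sketch, using the fully and partially affected test lists from the migration script section of this issue; the action names are hypothetical.

```python
from collections import defaultdict

FULLY_AFFECTED = {"coremark", "coremark_pro", "phoronix", "specjbb", "streams", "uperf"}
PARTIALLY_AFFECTED = {"autohpl", "fio", "passmark", "speccpu2017"}


def choose_action(test_name: str) -> str:
    """Map a test name to a migration action: reingest, update, or skip."""
    if test_name in FULLY_AFFECTED:
        return "reingest"  # no version in config, raw files are required
    if test_name in PARTIALLY_AFFECTED:
        return "update"    # version can be read from runs[].configuration
    return "skip"          # not listed as affected


def plan_migration(test_names) -> dict:
    """Group test names by the action they need."""
    plan = defaultdict(list)
    for name in sorted(set(test_names)):
        plan[choose_action(name)].append(name)
    return dict(plan)


print(plan_migration(["fio", "uperf", "coremark", "fio"]))
# {'reingest': ['coremark', 'uperf'], 'update': ['fio']}
```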

3. Migration Script

Create script to:

  1. Identify affected documents by test.name
  2. Re-process raw results OR extract from config
  3. Update test.version field
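  4. Validate results

# Sketch (Python, opensearch-py). locate_raw_result, process_with_fixed_processor and
# extract_version_from_config are placeholders still to be written.
from opensearchpy import OpenSearch, helpers

INDEX = 'chronicler-runs'
client = OpenSearch(hosts=['https://localhost:9200'])  # placeholder host; add auth as needed

affected_tests = {
    'fully': ['coremark', 'coremark_pro', 'phoronix', 'specjbb', 'streams', 'uperf'],
    'partial': ['autohpl', 'fio', 'passmark', 'speccpu2017']
}

def find_documents(test_name):
    # Scroll through every document of one test
    query = {'query': {'term': {'test.name.keyword': test_name}}}
    return helpers.scan(client, index=INDEX, query=query)

for test_name in affected_tests['fully']:
    # Must re-ingest from raw files
    for hit in find_documents(test_name):
        raw_file = locate_raw_result(hit['_source'])
        new_doc = process_with_fixed_processor(raw_file)
        # Replace the whole document, keeping its ID
        client.index(index=INDEX, id=hit['_id'], body=new_doc)

for test_name in affected_tests['partial']:
    # Can extract from existing config
    for hit in find_documents(test_name):
        version = extract_version_from_config(hit['_source'], test_name)
        # Partial update; the nested form merges into the existing test object
        client.update(index=INDEX, id=hit['_id'],
                      body={'doc': {'test': {'version': version}}})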
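
Updating one document per request may be slow on a large index. As a sketch, the partial updates could be batched instead, assuming opensearch-py's helpers.bulk is used to send them. The generator below only builds the action dictionaries; its name and the sample IDs are made up.

```python
from typing import Iterable, Iterator, Tuple


def partial_update_actions(index: str, versions: Iterable[Tuple[str, str]]) -> Iterator[dict]:
    """Yield one bulk 'update' action per (document ID, corrected version) pair."""
    for doc_id, version in versions:
        yield {
            "_op_type": "update",
            "_index": index,
            "_id": doc_id,
            # Nested form, so the update merges into the existing "test" object
            # instead of adding a separate top-level "test.version" key.
            "doc": {"test": {"version": version}},
        }


actions = list(partial_update_actions("chronicler-runs", [("abc123", "3.36")]))
print(actions[0]["doc"])  # {'test': {'version': '3.36'}}
```

The generator can be passed directly to helpers.bulk(client, actions), which splits it into chunks.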

4. Validation

After migration:

  • Document counts per test match the pre-migration counts
  • test.version != test.wrapper_version for affected tests
  • Spot-check documents have correct benchmark versions
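  • Run test queries to verify version filtering works

# Validation queries
# 1. Check version distribution (size 0 returns only the aggregations)
GET /chronicler-runs/_search
{
  "size": 0,
  "aggs": {
    "by_test": {
      "terms": {"field": "test.name.keyword"},
      "aggs": {
        "versions": {"terms": {"field": "test.version.keyword"}}
      }
    }
  }
}

# 2. Find remaining conflation (version == wrapper_version).
#    Expect zero hits for affected tests. The size() checks skip documents missing either field.
GET /chronicler-runs/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "script": {
      "script": "doc['test.version.keyword'].size() > 0 && doc['test.wrapper_version.keyword'].size() > 0 && doc['test.version.keyword'].value == doc['test.wrapper_version.keyword'].value"
    }
  }
}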
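
The checks listed above can also be run on exported data. A sketch, assuming per-test counts were saved before the migration and that documents expose test.name, test.version and test.wrapper_version as nested keys; the function name and message wording are made up.

```python
def check_migration(pre_counts: dict, post_counts: dict, docs: list, affected: set) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Check 1: per-test document counts are unchanged.
    for test_name, before in pre_counts.items():
        after = post_counts.get(test_name, 0)
        if after != before:
            problems.append(f"{test_name}: count changed from {before} to {after}")
    # Check 2: affected tests no longer carry the wrapper version or "unknown".
    for doc in docs:
        test = doc.get("test", {})
        if test.get("name") not in affected:
            continue
        version = test.get("version")
        if version in (None, "unknown") or version == test.get("wrapper_version"):
            problems.append(f"{test.get('name')}: suspicious test.version {version!r}")
    return problems


docs = [{"test": {"name": "fio", "version": "3.36", "wrapper_version": "1.2"}}]
print(check_migration({"fio": 1}, {"fio": 1}, docs, {"fio"}))  # []
```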

5. Rollback Plan

If issues found:

  1. Restore from backup
  2. Fix processor bugs
  3. Re-run migration

Dependencies

  • Processor version-extraction fixes (issues #41-#45) merged

Estimated Impact

Assuming documents are uniformly distributed across tests:

  • Documents affected: ~83% of chronicler-runs index
  • Re-processing time: Depends on data volume and strategy
  • Downtime: None (can update in-place)
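
The ~83% figure rests on the uniform-distribution assumption. A sketch of how the actual share could be computed from the buckets of a terms aggregation on test.name.keyword; the bucket counts and the unaffected test name below are made up for illustration.

```python
def affected_fraction(buckets: list, affected: set) -> float:
    """Share of documents that belong to affected tests, given terms-aggregation buckets."""
    total = sum(bucket["doc_count"] for bucket in buckets)
    if total == 0:
        return 0.0
    hit = sum(bucket["doc_count"] for bucket in buckets if bucket["key"] in affected)
    return hit / total


# Made-up buckets in the shape returned under aggregations.<name>.buckets
buckets = [
    {"key": "coremark", "doc_count": 500},
    {"key": "fio", "doc_count": 300},
    {"key": "example_unaffected", "doc_count": 200},
]
print(affected_fraction(buckets, {"coremark", "fio"}))  # 0.8
```

A terms aggregation returns only the top 10 buckets by default, so its size parameter may need to be raised to cover every test.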

References

  • Analysis: VERSION_CONFLATION_IMPACT.md
  • Example queries in impact document
