Enable multi-version evaluation and checkpoint replay #567

@joellidin

Current Limitations

The evaluator currently:

  • Only evaluates the latest checkpoint discovered
  • Cannot replay/re-evaluate historical checkpoints
  • Doesn't support evaluating checkpoints from multiple model versions
  • Tracks evaluated windows but cannot systematically process all available checkpoints

Proposed Enhancement

Implement comprehensive checkpoint evaluation capabilities:

  1. Multi-version support: Allow the evaluator to work with checkpoints from different model versions simultaneously
  2. Full replay mode: Add the ability to systematically evaluate all historical checkpoints, not just the latest one
  3. Batch evaluation: Process multiple checkpoints in sequence for comprehensive benchmarking
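
As a rough illustration of the replay and batch behaviour described above, the sketch below orders every discovered checkpoint by version and window and skips those already evaluated. The names `plan_replay` and `version_sort_key` are hypothetical, and checkpoints are modelled as plain `(version, window)` tuples rather than whatever type the evaluator actually uses.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical representation: (version string such as "v2.0.9", window number).
Checkpoint = Tuple[str, int]


def version_sort_key(version: str) -> Tuple[int, ...]:
    """Turn "v2.0.10" into (2, 0, 10) so versions sort numerically, not lexically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def plan_replay(
    available: Iterable[Checkpoint],
    evaluated: Set[Checkpoint],
) -> List[Checkpoint]:
    """Return every checkpoint not yet evaluated, oldest version and window first."""
    pending = set(available) - evaluated
    return sorted(pending, key=lambda c: (version_sort_key(c[0]), c[1]))


if __name__ == "__main__":
    available = [("v2.0.9", 150), ("v2.0.10", 10), ("v2.0.9", 20)]
    already_done = {("v2.0.9", 150)}
    for version, window in plan_replay(available, already_done):
        print(f"would evaluate {version} window {window}")
```

Sorting on a numeric version tuple keeps `v2.0.10` after `v2.0.9`, which a plain string sort would get wrong.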

Checkpoint Storage Reorganization

To support this enhancement, we should migrate to a dedicated checkpoint storage architecture:

  • New bucket structure: Move checkpoints from the current validator bucket to a dedicated checkpoint bucket
  • Organized format: Use a hierarchical <version>/<window>/data layout
    • Example: v2.0.9/150/model.safetensors
    • Enables efficient version-based queries
    • Simplifies checkpoint discovery and management
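
One way the proposed layout could be handled in code is sketched below: a helper to build a key, a parser that rejects anything outside the `<version>/<window>/data` shape, and a prefix for version-based listing. All names here (`CheckpointKey`, `build_key`, `parse_key`, `version_prefix`) are hypothetical, and the regex assumes versions always look like `v<major>.<minor>.<patch>`, which the issue does not state explicitly.

```python
import re
from typing import NamedTuple, Optional


class CheckpointKey(NamedTuple):
    version: str
    window: int
    filename: str


# Assumed shape: v<major>.<minor>.<patch>/<window>/<file>, e.g. v2.0.9/150/model.safetensors
_KEY_RE = re.compile(
    r"^(?P<version>v\d+\.\d+\.\d+)/(?P<window>\d+)/(?P<filename>[^/]+)$"
)


def build_key(version: str, window: int, filename: str = "model.safetensors") -> str:
    """Compose an object key in the <version>/<window>/data layout."""
    return f"{version}/{window}/{filename}"


def parse_key(key: str) -> Optional[CheckpointKey]:
    """Split a key back into its parts, or return None if it does not match."""
    match = _KEY_RE.match(key)
    if match is None:
        return None
    return CheckpointKey(match["version"], int(match["window"]), match["filename"])


def version_prefix(version: str) -> str:
    """Prefix to pass to a bucket listing call to fetch one version's checkpoints."""
    return f"{version}/"
```

Because the version is the first path component, a prefix listing on `version_prefix("v2.0.9")` would return only that version's checkpoints on any object store that supports prefix queries.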

Migration Strategy

Create a migration script that:

  • Copies all existing checkpoints from the current validator bucket
  • Reorganizes them into the new <version>/<window>/data structure
  • Preserves metadata and checkpoint integrity
  • Provides verification of successful migration
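
The migration steps above might be sketched as follows. Buckets are modelled as in-memory dictionaries so the example runs on its own; a real script would use the project's storage client instead. The current validator-bucket key layout is not described in this issue, so `example_locate` and the `checkpoints/<version>/<window>.safetensors` pattern it parses are purely hypothetical stand-ins.

```python
import hashlib
import re
from typing import Callable, Dict, List, Optional, Tuple

# (version, window, filename) extracted from an old key, or None if not a checkpoint.
Located = Optional[Tuple[str, int, str]]


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def migrate(
    source: Dict[str, bytes],
    dest: Dict[str, bytes],
    locate: Callable[[str], Located],
) -> Tuple[Dict[str, str], List[str]]:
    """Copy checkpoints into the <version>/<window>/data layout and verify each copy.

    Returns (old_key -> new_key for verified copies, old keys that failed verification).
    The source bucket is only read, never modified.
    """
    migrated: Dict[str, str] = {}
    failures: List[str] = []
    for old_key, payload in source.items():
        located = locate(old_key)
        if located is None:
            continue  # not a checkpoint object; leave it alone
        version, window, filename = located
        new_key = f"{version}/{window}/{filename}"
        dest[new_key] = payload
        # Read the copy back and compare digests; with a real bucket this
        # would be a fresh download or a server-side checksum comparison.
        if sha256_hex(dest[new_key]) == sha256_hex(payload):
            migrated[old_key] = new_key
        else:
            failures.append(old_key)
    return migrated, failures


# HYPOTHETICAL old layout, used only to make the example runnable.
_OLD_RE = re.compile(r"^checkpoints/(v\d+\.\d+\.\d+)/(\d+)\.safetensors$")


def example_locate(old_key: str) -> Located:
    match = _OLD_RE.match(old_key)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), "model.safetensors"
```

Copying rather than moving keeps the original objects in place until the returned failure list is empty, which also helps with backward compatibility during the transition.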

Benefits

  • Complete evaluation history for model performance tracking
  • Ability to compare performance across versions
  • Better checkpoint organization and discoverability
  • Enables retroactive evaluation with updated metrics
  • Supports A/B testing between model versions

Implementation Considerations

  • Maintain backward compatibility during transition
  • Implement efficient checkpoint discovery across versions
  • Add configuration for selective version evaluation
  • Consider checkpoint deduplication strategies
  • Implement progress tracking for batch evaluations
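
Two of the considerations above, selective version evaluation and progress tracking, could look roughly like this. The glob-style patterns and the `select_versions` and `BatchProgress` names are assumptions made for illustration, not an existing configuration format.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_versions(versions: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep versions matching at least one glob pattern, e.g. "v2.*"."""
    pattern_list = list(patterns)
    return [v for v in versions if any(fnmatchcase(v, p) for p in pattern_list)]


@dataclass
class BatchProgress:
    """Minimal counter for a batch evaluation run."""

    total: int
    done: int = 0
    failed: List[str] = field(default_factory=list)

    def record(self, key: str, ok: bool) -> None:
        """Mark one checkpoint as processed, remembering failures by key."""
        self.done += 1
        if not ok:
            self.failed.append(key)

    @property
    def fraction(self) -> float:
        """Share of the batch processed so far; an empty batch counts as complete."""
        return self.done / self.total if self.total else 1.0


if __name__ == "__main__":
    chosen = select_versions(["v1.9.0", "v2.0.9", "v2.1.0"], ["v2.*"])
    progress = BatchProgress(total=len(chosen))
    for version in chosen:
        progress.record(f"{version}/150/model.safetensors", ok=True)
        print(f"{progress.done}/{progress.total} checkpoints evaluated")
```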
