Open
Description
Current Limitations
The evaluator currently:
- Only evaluates the latest checkpoint discovered
- Cannot replay/re-evaluate historical checkpoints
- Doesn't support evaluating checkpoints from multiple model versions
- Tracks evaluated windows but cannot systematically process all available checkpoints
Proposed Enhancement
Implement comprehensive checkpoint evaluation capabilities:
- Multi-version support: Allow the evaluator to work with checkpoints from different model versions simultaneously
- Full replay mode: Add ability to systematically evaluate all historical checkpoints, not just the newest ones
- Batch evaluation: Process multiple checkpoints in sequence for comprehensive benchmarking
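A minimal sketch of how replay mode could choose its work, assuming checkpoints are identified by a (version, window) pair and that version tags look like the v2.0.9 example used later in this issue. The names plan_replay and parse_version are hypothetical and not part of the existing evaluator:

```python
from typing import Iterable, Optional


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag such as 'v2.0.9' into (2, 0, 9) so versions sort numerically."""
    return tuple(int(part) for part in tag.removeprefix("v").split("."))


def plan_replay(
    available: Iterable[tuple[str, int]],
    evaluated: set[tuple[str, int]],
    versions: Optional[set[str]] = None,
) -> list[tuple[str, int]]:
    """Return every checkpoint that still needs evaluation, oldest first.

    available: (version_tag, window) pairs discovered in storage (hypothetical input).
    evaluated: pairs the evaluator has already processed.
    versions:  optional allow-list of version tags; None means all versions.
    """
    pending = {
        (tag, window)
        for tag, window in available
        if (tag, window) not in evaluated
        and (versions is None or tag in versions)
    }
    # Sort by numeric version first, then by window, so v2.0.9 precedes v2.0.10.
    return sorted(pending, key=lambda c: (parse_version(c[0]), c[1]))
```

Sorting on the parsed version rather than the raw string avoids the lexicographic trap where "v2.0.10" would come before "v2.0.9".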
Checkpoint Storage Reorganization
To support this enhancement, we should migrate to a dedicated checkpoint storage architecture:
- New bucket structure: Move checkpoints from the current validator bucket to a dedicated checkpoint bucket
- Organized format: Use hierarchical structure: <version>/<window>/data
  - Example: v2.0.9/150/model.safetensors
  - Enables efficient version-based queries
  - Simplifies checkpoint discovery and management
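The layout could be handled with a pair of small helpers like the ones below. This is only a sketch: CheckpointKey, build_key, parse_key, and version_prefix are hypothetical names, and the pattern assumes version tags are "v" followed by dot-separated integers, which is inferred from the single example above:

```python
import re
from typing import NamedTuple, Optional


class CheckpointKey(NamedTuple):
    version: str   # e.g. "v2.0.9"
    window: int    # e.g. 150
    filename: str  # e.g. "model.safetensors"


# Assumed key shape: <version>/<window>/<filename>, with no deeper nesting.
_KEY_RE = re.compile(
    r"^(?P<version>v\d+(?:\.\d+)*)/(?P<window>\d+)/(?P<filename>[^/]+)$"
)


def build_key(version: str, window: int, filename: str = "model.safetensors") -> str:
    """Compose an object key in the proposed hierarchical layout."""
    return f"{version}/{window}/{filename}"


def parse_key(key: str) -> Optional[CheckpointKey]:
    """Parse a key back into its parts; return None for keys in any other layout."""
    match = _KEY_RE.match(key)
    if match is None:
        return None
    return CheckpointKey(match["version"], int(match["window"]), match["filename"])


def version_prefix(version: str) -> str:
    """Prefix that selects every checkpoint of one version in a prefix-based listing."""
    return f"{version}/"
```

Returning None for unrecognized keys lets discovery code skip objects that are still in the old layout instead of failing on them.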
Migration Strategy
Create a migration script that:
- Copies all existing checkpoints from the current validator bucket
- Reorganizes them into the new <version>/<window>/data structure
- Preserves metadata and checkpoint integrity
- Provides verification of successful migration
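The copy-and-verify flow might look roughly like this. To keep the sketch runnable it uses local directories as stand-ins for the two buckets. Because the current key layout in the validator bucket is not described here, the mapping from an old path to (version, window) is left to a caller-supplied locate function. All names are hypothetical:

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Optional


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(
    src_root: Path,
    dst_root: Path,
    locate: Callable[[Path], Optional[tuple[str, int]]],
) -> tuple[list[Path], list[Path]]:
    """Copy checkpoints into <version>/<window>/<filename> and verify each copy.

    locate(relative_path) returns (version, window), or None to skip the file.
    Returns (verified destination paths, source paths whose copy failed verification).
    """
    migrated, failed = [], []
    # Materialize the file list first so newly written copies are never revisited.
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        target = locate(src.relative_to(src_root))
        if target is None:
            continue
        version, window = target
        dst = dst_root / version / str(window) / src.name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies content and timestamps; the source is left in place
        if sha256_of(src) == sha256_of(dst):
            migrated.append(dst)
        else:
            failed.append(src)
    return migrated, failed
```

Copying rather than moving keeps the validator bucket intact until verification has passed, which also helps with backward compatibility during the transition.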
Benefits
- Complete evaluation history for model performance tracking
- Ability to compare performance across versions
- Better checkpoint organization and discoverability
- Enables retroactive evaluation with updated metrics
- Supports A/B testing between model versions
Implementation Considerations
- Maintain backward compatibility during transition
- Implement efficient checkpoint discovery across versions
- Add configuration for selective version evaluation
- Consider checkpoint deduplication strategies
- Implement progress tracking for batch evaluations
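For the progress-tracking item, one possible approach is to persist the set of completed (version, window) pairs so an interrupted batch run can resume where it stopped. The sketch below assumes a local JSON state file and a caller-supplied evaluate callable. EvalProgress and run_batch are hypothetical names, not existing evaluator APIs:

```python
import json
from pathlib import Path
from typing import Callable, Iterable


class EvalProgress:
    """Records completed checkpoints in a JSON file (assumed storage choice)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self.done: set[tuple[str, int]] = set()
        if self.path.exists():
            # JSON stores each pair as a list; convert back to tuples for set membership.
            self.done = {tuple(item) for item in json.loads(self.path.read_text())}

    def is_done(self, version: str, window: int) -> bool:
        return (version, window) in self.done

    def mark_done(self, version: str, window: int) -> None:
        self.done.add((version, window))
        self.path.write_text(json.dumps(sorted(self.done)))


def run_batch(
    checkpoints: Iterable[tuple[str, int]],
    evaluate: Callable[[str, int], None],
    progress: EvalProgress,
) -> None:
    """Evaluate checkpoints in order, skipping any already recorded as done."""
    checkpoints = list(checkpoints)
    total = len(checkpoints)
    for index, (version, window) in enumerate(checkpoints, start=1):
        if progress.is_done(version, window):
            continue
        evaluate(version, window)
        # Persist only after a successful evaluation so failures are retried next run.
        progress.mark_done(version, window)
        print(f"[{index}/{total}] evaluated {version} window {window}")
```

Writing the state after every checkpoint costs one small file write per evaluation, which is negligible next to the evaluation itself.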