Summary
Configuration handling across training, evaluation, and inference workflows is currently fragmented and partially duplicated. This issue proposes introducing a centralized and standardized configuration system to improve maintainability, reproducibility, and developer experience.
Motivation
Different modules appear to manage configuration independently, which may lead to:
inconsistent parameter naming
duplicated default values
harder experiment reproducibility
configuration drift between pipelines
increased debugging complexity
A unified configuration system would simplify future development and experimentation.
Proposed Changes
Introduce Shared Configuration Schema
Create a standardized configuration abstraction for:
dataset settings
model hyperparameters
tokenizer settings
training arguments
inference/runtime options
device configuration
Possible approaches:
class ModelConfig(BaseObject): ...
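As a rough illustration of what such a schema could look like, here is a minimal sketch using standard-library dataclasses. The `BaseObject` base class above is not specified in this issue, so dataclasses are swapped in here as an assumption; every class name, field name, and default value below is hypothetical and only meant to show the grouping.

```python
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class DataConfig:
    dataset_path: str = "data/train.jsonl"  # hypothetical default
    max_seq_len: int = 512


@dataclass(frozen=True)
class ModelConfig:
    hidden_size: int = 256
    num_layers: int = 4
    dropout: float = 0.1

    def __post_init__(self):
        # Validate once, at construction time, instead of in each pipeline.
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError(f"dropout must be in [0, 1), got {self.dropout}")


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 3e-4
    batch_size: int = 32
    epochs: int = 1


@dataclass(frozen=True)
class RuntimeConfig:
    device: str = "cpu"
    seed: int = 0


@dataclass(frozen=True)
class ExperimentConfig:
    # One top-level object groups the per-area sections.
    data: DataConfig = field(default_factory=DataConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    train: TrainConfig = field(default_factory=TrainConfig)
    runtime: RuntimeConfig = field(default_factory=RuntimeConfig)


cfg = ExperimentConfig(runtime=RuntimeConfig(device="cuda:0"))
print(asdict(cfg)["runtime"])
```

Frozen dataclasses are one option among several; the point is that defaults and validation live in a single place rather than in each script.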
Refactor Existing Pipelines
Update the following so that they all consume the same configuration structure:
training scripts
inference scripts
evaluation utilities
dataset initialization
model loading logic
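The refactor could look roughly like the sketch below, where the training and inference entry points both accept one configuration object instead of keeping their own defaults. `SharedConfig`, `load_model`, `run_training`, and `run_inference` are hypothetical names invented for illustration, and the function bodies are stand-ins rather than this repository's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedConfig:
    # Hypothetical fields; the real schema would be defined by the project.
    model_name: str = "baseline"
    max_seq_len: int = 512
    batch_size: int = 32
    device: str = "cpu"


def load_model(cfg: SharedConfig) -> dict:
    # Stand-in for real model loading; reads only from the shared config.
    return {"name": cfg.model_name, "device": cfg.device}


def run_training(cfg: SharedConfig) -> dict:
    model = load_model(cfg)
    return {"model": model, "batch_size": cfg.batch_size, "max_seq_len": cfg.max_seq_len}


def run_inference(cfg: SharedConfig) -> dict:
    # Uses the same max_seq_len as training, so the two pipelines cannot drift.
    model = load_model(cfg)
    return {"model": model, "max_seq_len": cfg.max_seq_len}


cfg = SharedConfig(device="cuda:0")
train_view = run_training(cfg)
infer_view = run_inference(cfg)
```

Because both entry points read `max_seq_len` and `device` from the same object, a change made for training is automatically reflected in inference.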
Expected Benefits
Improved reproducibility
Easier experiment tracking
Cleaner CLI/API interfaces
Reduced configuration duplication
Simpler onboarding for contributors
Better compatibility with future models
Suggested Scope
Included
config schema standardization
shared config loading utilities
validation/default handling
serialization support
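To make the in-scope items more concrete, the following is a minimal sketch of loading, validation with defaults, and serialization using only the Python standard library. JSON is chosen here as an assumption (the issue does not name a file format), and `TrainConfig`, `config_from_dict`, `dumps`, and `loads` are hypothetical names.

```python
import json
from dataclasses import dataclass, asdict, fields


@dataclass
class TrainConfig:
    # Hypothetical fields and defaults, for illustration only.
    learning_rate: float = 3e-4
    batch_size: int = 32
    device: str = "cpu"


def config_from_dict(raw: dict) -> TrainConfig:
    # Reject unknown keys so typos fail loudly instead of being ignored.
    known = {f.name for f in fields(TrainConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    cfg = TrainConfig(**raw)  # missing keys fall back to the defaults
    if cfg.batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return cfg


def dumps(cfg: TrainConfig) -> str:
    # Sorted keys give stable output that is easy to diff between runs.
    return json.dumps(asdict(cfg), indent=2, sort_keys=True)


def loads(text: str) -> TrainConfig:
    return config_from_dict(json.loads(text))


cfg = config_from_dict({"batch_size": 64})
restored = loads(dumps(cfg))
```

Saving the serialized config next to each run's outputs would give a simple record of exactly which settings produced a result.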
Not Included
changing model architectures
altering training objectives
introducing external experiment platforms