[ENH] Standardize configuration management for training and inference pipelines #665

@ayushk687

Summary

Configuration handling across training, evaluation, and inference workflows is currently fragmented and partially duplicated. This issue proposes introducing a centralized and standardized configuration system to improve maintainability, reproducibility, and developer experience.

Motivation

Different modules appear to manage configuration independently, which may lead to:

inconsistent parameter naming
duplicated default values
harder experiment reproducibility
configuration drift between pipelines
increased debugging complexity

A unified configuration system would simplify future development and experimentation.

Proposed Changes

Introduce Shared Configuration Schema

Create a standardized configuration abstraction for:

dataset settings
model hyperparameters
tokenizer settings
training arguments
inference/runtime options
device configuration

Possible approaches:
class ModelConfig(BaseObject): ...
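
To make the idea concrete, here is a minimal sketch of what such a schema might look like. It uses standard-library dataclasses instead of the `BaseObject` base class mentioned above, since this issue does not specify that class's interface. All class names, field names, and default values are hypothetical placeholders, not existing project API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetConfig:
    # Hypothetical fields; real names would come from the existing pipelines.
    path: str = "data/train.csv"
    batch_size: int = 32


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical model hyperparameters.
    hidden_size: int = 128
    dropout: float = 0.1


@dataclass(frozen=True)
class RuntimeConfig:
    # Hypothetical device and runtime options.
    device: str = "cpu"
    seed: int = 0


@dataclass(frozen=True)
class PipelineConfig:
    # One top-level object groups every section, so defaults live in one place.
    dataset: DatasetConfig = field(default_factory=DatasetConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    runtime: RuntimeConfig = field(default_factory=RuntimeConfig)


# Override one value; everything else falls back to the shared defaults.
config = PipelineConfig(model=ModelConfig(hidden_size=256))
print(config.model.hidden_size, config.runtime.device)
```

Freezing the dataclasses is a design choice in this sketch: a config that cannot be mutated after construction is easier to log and reproduce. Tokenizer and training-argument sections could be added the same way.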

Refactor Existing Pipelines

Update the following to consume the same configuration structure:

training scripts
inference scripts
evaluation utilities
dataset initialization
model loading logic
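
As a rough illustration of the refactor, the sketch below passes one config object to both a training and an inference entry point, so a setting such as the maximum input length cannot drift between them. `SharedConfig`, `train`, and `predict` are hypothetical stand-ins, not the project's actual functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedConfig:
    # Hypothetical fields standing in for the real schema.
    model_name: str = "baseline"
    max_length: int = 512
    device: str = "cpu"


def train(config: SharedConfig) -> dict:
    # Placeholder training entry point: every setting is read from config.
    return {
        "stage": "train",
        "model": config.model_name,
        "max_length": config.max_length,
    }


def predict(config: SharedConfig, text: str) -> dict:
    # Placeholder inference entry point: truncates input with the same
    # max_length that training used, instead of a separate local default.
    return {
        "stage": "predict",
        "model": config.model_name,
        "input": text[: config.max_length],
    }


config = SharedConfig(max_length=8)
train_info = train(config)
pred_info = predict(config, "a long input string")
```

Because neither function defines its own defaults, changing `max_length` in one place changes it for both pipelines.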

Expected Benefits

Improved reproducibility
Easier experiment tracking
Cleaner CLI/API interfaces
Reduced configuration duplication
Simpler onboarding for contributors
Better compatibility with future models

Suggested Scope

Included

config schema standardization
shared config loading utilities
validation/default handling
serialization support

Not Included

changing model architectures
altering training objectives
introducing external experiment platforms
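
The in-scope items (loading utilities, validation and default handling, serialization) could be sketched as follows. This assumes a JSON-serializable dataclass. `TrainingConfig`, its fields, and the helper function names are hypothetical and only illustrate the shape of such utilities.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class TrainingConfig:
    # Hypothetical fields and defaults.
    learning_rate: float = 1e-3
    epochs: int = 10
    device: str = "cpu"

    def __post_init__(self):
        # Validation: fail early on values that cannot be valid.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")


def config_from_dict(data: dict) -> TrainingConfig:
    # Reject unknown keys so typos do not silently fall back to defaults.
    known = {f.name for f in fields(TrainingConfig)}
    unknown = set(data) - known
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return TrainingConfig(**data)


def config_to_json(config: TrainingConfig) -> str:
    # Serialization: sorted keys give a stable output for experiment logs.
    return json.dumps(asdict(config), sort_keys=True)


def config_from_json(text: str) -> TrainingConfig:
    # Loading goes through the same validation path as config_from_dict.
    return config_from_dict(json.loads(text))


# Missing keys take their defaults; the round trip restores an equal config.
config = config_from_dict({"epochs": 3})
restored = config_from_json(config_to_json(config))
```

Routing every load through `config_from_dict` keeps validation in one place, whatever the source of the values (file, CLI, or API call).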
