This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
ike is a DeepSpeed-based training and inference framework for language models. It provides high-level pipeline abstractions (`TrainingPipeline`, `InferencePipeline`) that handle distributed training, checkpointing, logging, and model management while allowing customization of data processing, forward functions, and model architecture.
Install:

```bash
pip install -e .
```

For development:

```bash
pip install -e ".[dev]"
```

Training:

```bash
deepspeed --include localhost:0,1 --master_port 12300 training.py \
    -c cfgs/training.cfg cfgs/model.cfg \
    --save_log --save_model
```

Evaluation:

```bash
deepspeed --include localhost:0,1 --master_port 12300 evaluation.py \
    -c cfgs/evaluation.cfg cfgs/model.cfg \
    --pretrained_model_dir $CHECKPOINT_DIR
```

Generation:

```bash
deepspeed --include localhost:0,1 --master_port 12300 generation.py \
    -c cfgs/generation.cfg cfgs/model.cfg \
    --pretrained_model_dir $CHECKPOINT_DIR
```

Add `--debug_mode` to disable multiprocessing in data loading for easier debugging.
The framework follows a Task → Pipeline → Modules pattern:
- **Pipelines** (`src/ike/training.py`, `src/ike/inference.py`): High-level abstractions that orchestrate:
  - DeepSpeed initialization and distributed training
  - Data loading and batching
  - Model and optimizer creation
  - Logging (TensorBoard, WandB)
  - Checkpoint saving/loading
- **Tasks** (user-defined, e.g., `examples/lm/training.py`): Instantiate pipelines by providing (a sketch follows this list):
  - `load_data_from_filepath_fn`: Data loading function
  - `data_processor_classes`: List of `DataProcessor` subclasses
  - `train_forward_step_fn` / `valid_forward_step_fn`: Forward pass implementations
  - Optional: custom tokenizer, model, and optimizer builders
- **Data Processing** (`src/ike/data.py`):
  - `DataProcessor`: Base class for line-to-data conversion (implement `line2data()`)
  - `DataReformatter`: Optional post-processing (implement `format_data()`)
  - `BasicDataSource`: PyTorch Dataset wrapper
  - Built-in loaders: `load_data_from_jsonl`, `load_data_from_hf`, `load_data_from_jsonl_or_hf`
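As referenced in the Tasks item above, the following is a minimal, hypothetical sketch of how a task wires these pieces together. The constructor keyword names follow the parameter list above, but the actual `TrainingPipeline` signature, import paths, and entry point may differ; see `examples/lm/training.py` for the framework's real usage.

```python
# Hypothetical sketch of a task module; the real constructor and entry point
# may differ. See examples/lm/training.py for the framework's actual usage.
from ike.training import TrainingPipeline               # assumed import path
from ike.data import DataProcessor, load_data_from_jsonl

class MyProcessor(DataProcessor):
    def line2data(self, line):
        # Convert one raw record into a training example (format is task-specific).
        return {"text": line["prompt"] + line["response"]}

def train_forward_step(step, accum_idx, model, tokenizer, batch_data, config):
    ...  # compute and return (loss, ret_data, ret_stat); see signatures below

def valid_forward_step(step, batch_idx, model, tokenizer, batch_data, config):
    ...  # return (ret_data, ret_stat)

pipeline = TrainingPipeline(
    load_data_from_filepath_fn=load_data_from_jsonl,
    data_processor_classes=[MyProcessor],
    train_forward_step_fn=train_forward_step,
    valid_forward_step_fn=valid_forward_step,
)
pipeline.run()  # assumed entry point; the real method name may differ
```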
Training forward step:

```python
def train_forward_step(step, accum_idx, model, tokenizer, batch_data, config) -> (loss, ret_data, ret_stat)
```

Validation forward step:

```python
def valid_forward_step(step, batch_idx, model, tokenizer, batch_data, config) -> (ret_data, ret_stat)
```

Inference forward step:

```python
def forward_step(model, tokenizer, batch_data, config) -> (ret_data, ret_stat)
```
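For orientation, here is a hedged sketch of what a causal-LM `train_forward_step` could look like. It assumes `batch_data` is a list of raw text strings, that `config.max_seq_len` mirrors `--max_seq_len`, and that the model follows the Hugging Face causal-LM interface; the real batch structure is whatever your `DataProcessor` produces, so adapt accordingly.

```python
def train_forward_step(step, accum_idx, model, tokenizer, batch_data, config):
    # Sketch only: assumes batch_data is a list of raw text strings.
    enc = tokenizer(
        batch_data,
        padding=True,
        truncation=True,
        max_length=config.max_seq_len,   # assumes --max_seq_len is exposed on config
        return_tensors="pt",
    ).to(model.device)

    # Causal-LM loss: copy input_ids as labels and mask padding positions.
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100
    outputs = model(**enc, labels=labels)
    loss = outputs.loss

    ret_data = None                       # optional per-batch artifacts
    ret_stat = {"loss": loss.item()}      # scalar stats for logging
    return loss, ret_data, ret_stat
```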
The framework uses configargparse with YAML support. Configuration can be provided via:
- Command line arguments
- Config files (`-c cfgs/training.cfg cfgs/model.cfg`)
- A combination of both (CLI overrides config files)
Key argument groups (see README_args.md for full list):
- Model: `--pretrained_model_dir`, `--tokenizer_dir`, `--attn_implementation`
- Data: `--train_filepaths`, `--valid_filepaths`, `--max_seq_len`
- Training: `--global_batch_size`, `--micro_batch_size`, `--n_epochs`, `--peak_lr`
- DeepSpeed: `--zero_stage`, `--bf16`, `--activation_checkpointing_layers`
- PEFT/LoRA: `--peft_type LORA`, `--peft_lora_r`, `--peft_lora_alpha`
- Management: `--save_log`, `--save_model`, `--validate_interval`
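A hypothetical config fragment combining these groups, assuming each CLI flag maps to a same-named key (as in the LoRA example below). The values are illustrative, not defaults; see README_args.md for the full option set.

```yaml
# Illustrative values only, not the framework's defaults.
pretrained_model_dir: /path/to/base_model
tokenizer_dir: /path/to/tokenizer
train_filepaths: [data/train.jsonl]
valid_filepaths: [data/valid.jsonl]
max_seq_len: 2048
global_batch_size: 64
micro_batch_size: 4
n_epochs: 3
peak_lr: 2.0e-5
zero_stage: 2
bf16: true
save_log: true
save_model: true
validate_interval: 500
```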
For LoRA training, specify PEFT arguments in config:
```yaml
peft_type: LORA
peft_task_type: CAUSAL_LM
peft_lora_r: 8
peft_lora_alpha: 32
peft_lora_target_modules: [q_proj, v_proj]
```

Load trained LoRA checkpoints with `--pretrained_peft_model_dir`.
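For example, evaluating with a LoRA checkpoint might look like the command below, which mirrors the evaluation command above. `$LORA_CHECKPOINT_DIR` is a placeholder; whether the base model also needs to be passed via `--pretrained_model_dir` depends on your config.

```bash
deepspeed --include localhost:0,1 --master_port 12300 evaluation.py \
    -c cfgs/evaluation.cfg cfgs/model.cfg \
    --pretrained_peft_model_dir $LORA_CHECKPOINT_DIR
```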
- `--save_model`: Save the best model based on `--monitor_metric`
- `--save_all_models`: Save every checkpoint at `--save_model_interval`
- `--save_ds_checkpoint`: Include optimizer/scheduler states for training resumption
- `--resume_training`: Resume from a DeepSpeed checkpoint with full state
See `examples/lm/` for language model finetuning and `examples/gsm8k/` for SFT examples. Each example includes:
- Task-specific data processors (`data.py`)
- Training/evaluation scripts
- Config files in `cfgs/`