CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

ike is a DeepSpeed-based training and inference framework for language models. It provides high-level pipeline abstractions (TrainingPipeline, InferencePipeline) that handle distributed training, checkpointing, logging, and model management while allowing customization of data processing, forward functions, and model architecture.

Installation

pip install -e .

For development:

pip install -e ".[dev]"

Common Commands

Training

deepspeed --include localhost:0,1 --master_port 12300 training.py \
    -c cfgs/training.cfg cfgs/model.cfg \
    --save_log --save_model

Inference/Evaluation

deepspeed --include localhost:0,1 --master_port 12300 evaluation.py \
    -c cfgs/evaluation.cfg cfgs/model.cfg \
    --pretrained_model_dir $CHECKPOINT_DIR

Generation

deepspeed --include localhost:0,1 --master_port 12300 generation.py \
    -c cfgs/generation.cfg cfgs/model.cfg \
    --pretrained_model_dir $CHECKPOINT_DIR

Debug Mode

Add --debug_mode to disable multiprocessing in data loading for easier debugging.
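
For example, appended to the training command above:

deepspeed --include localhost:0,1 --master_port 12300 training.py \
    -c cfgs/training.cfg cfgs/model.cfg \
    --save_log --save_model --debug_mode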

Architecture

Core Components

The framework follows a Task → Pipeline → Modules pattern:

  1. Pipelines (src/ike/training.py, src/ike/inference.py): High-level abstractions that orchestrate:

    • DeepSpeed initialization and distributed training
    • Data loading and batching
    • Model and optimizer creation
    • Logging (TensorBoard, WandB)
    • Checkpoint saving/loading
  2. Tasks (user-defined, e.g., examples/lm/training.py): Instantiate pipelines by providing (see the sketch after this list):

    • load_data_from_filepath_fn: Data loading function
    • data_processor_classes: List of DataProcessor subclasses
    • train_forward_step_fn / valid_forward_step_fn: Forward pass implementations
    • Optional: custom tokenizer, model, optimizer builders
  3. Data Processing (src/ike/data.py):

    • DataProcessor: Base class for line-to-data conversion (implement line2data())
    • DataReformatter: Optional post-processing (implement format_data())
    • BasicDataSource: PyTorch Dataset wrapper
    • Built-in loaders: load_data_from_jsonl, load_data_from_hf, load_data_from_jsonl_or_hf
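
A minimal task sketch tying these pieces together follows. The class names, keyword names, loader, and line2data() hook come from the list above; MyProcessor, the assumed record shape, and passing the callables as constructor keywords are illustrative assumptions (real implementations live in, e.g., examples/lm/training.py):

from ike.data import DataProcessor, load_data_from_jsonl
from ike.training import TrainingPipeline

class MyProcessor(DataProcessor):  # hypothetical task-specific processor
    def line2data(self, line):
        # Convert one raw record into model-ready data; the argument
        # and record shape here are assumptions for illustration.
        return {"text": line["text"]}

def train_forward_step(step, accum_idx, model, tokenizer, batch_data, config):
    ...  # see "Forward Step Function Signatures" below

def valid_forward_step(step, batch_idx, model, tokenizer, batch_data, config):
    ...

pipeline = TrainingPipeline(
    load_data_from_filepath_fn=load_data_from_jsonl,
    data_processor_classes=[MyProcessor],
    train_forward_step_fn=train_forward_step,
    valid_forward_step_fn=valid_forward_step,
)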

Forward Step Function Signatures

Training forward step:

def train_forward_step(step, accum_idx, model, tokenizer, batch_data, config) -> (loss, ret_data, ret_stat)

Validation forward step:

def valid_forward_step(step, batch_idx, model, tokenizer, batch_data, config) -> (ret_data, ret_stat)

Inference forward step:

def forward_step(model, tokenizer, batch_data, config) -> (ret_data, ret_stat)
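
As a hedged sketch, a training step for a causal LM might look like the following, assuming a HuggingFace-style model whose output exposes .loss and batch tensors prepared by the task's DataProcessor; only the signature and the (loss, ret_data, ret_stat) return contract come from the framework:

def train_forward_step(step, accum_idx, model, tokenizer, batch_data, config):
    # Assumed batch layout: tensors prepared by the task's DataProcessor.
    outputs = model(
        input_ids=batch_data["input_ids"],
        attention_mask=batch_data["attention_mask"],
        labels=batch_data["labels"],
    )
    loss = outputs.loss
    ret_data = {}                        # extra tensors to pass along, if any
    ret_stat = {"lm_loss": loss.item()}  # presumably scalars for logging
    return loss, ret_data, ret_stat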

Configuration System

ike uses configargparse with YAML support. Configuration can be provided via:

  • Command line arguments
  • Config files (-c cfgs/training.cfg cfgs/model.cfg)
  • Combination of both (CLI overrides config files)
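
For example, a CLI flag overrides the same key set in a config file (the learning-rate value here is illustrative):

deepspeed --include localhost:0,1 --master_port 12300 training.py \
    -c cfgs/training.cfg cfgs/model.cfg --peak_lr 1e-5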

Key argument groups (see README_args.md for the full list; an example config follows):

  • Model: --pretrained_model_dir, --tokenizer_dir, --attn_implementation
  • Data: --train_filepaths, --valid_filepaths, --max_seq_len
  • Training: --global_batch_size, --micro_batch_size, --n_epochs, --peak_lr
  • DeepSpeed: --zero_stage, --bf16, --activation_checkpointing_layers
  • PEFT/LoRA: --peft_type LORA, --peft_lora_r, --peft_lora_alpha
  • Management: --save_log, --save_model, --validate_interval
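
For instance, a minimal training config combining several of these groups might look like this (all values are illustrative):

global_batch_size: 32
micro_batch_size: 4
n_epochs: 3
peak_lr: 2e-5
max_seq_len: 2048
zero_stage: 2
bf16: true
save_log: true
save_model: true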

PEFT/LoRA Support

For LoRA training, specify PEFT arguments in config:

peft_type: LORA
peft_task_type: CAUSAL_LM
peft_lora_r: 8
peft_lora_alpha: 32
peft_lora_target_modules: [q_proj, v_proj]

Load trained LoRA checkpoints with --pretrained_peft_model_dir.
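
For example, mirroring the evaluation command above (assuming the base model directory is still supplied alongside the adapter; $BASE_MODEL_DIR and $LORA_DIR are placeholders):

deepspeed --include localhost:0,1 --master_port 12300 evaluation.py \
    -c cfgs/evaluation.cfg cfgs/model.cfg \
    --pretrained_model_dir $BASE_MODEL_DIR \
    --pretrained_peft_model_dir $LORA_DIR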

Checkpoint Handling

  • --save_model: Save best model based on --monitor_metric
  • --save_all_models: Save every checkpoint at --save_model_interval
  • --save_ds_checkpoint: Include optimizer/scheduler states for training resumption
  • --resume_training: Resume from DeepSpeed checkpoint with full state
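
A plausible pairing of these flags, assuming --resume_training picks up the most recent DeepSpeed checkpoint from the run directory:

deepspeed --include localhost:0,1 --master_port 12300 training.py \
    -c cfgs/training.cfg cfgs/model.cfg \
    --save_log --save_model --save_ds_checkpoint

To continue an interrupted run later, re-run the same command with --resume_training added:

deepspeed --include localhost:0,1 --master_port 12300 training.py \
    -c cfgs/training.cfg cfgs/model.cfg \
    --save_log --save_model --save_ds_checkpoint --resume_training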

Examples

See examples/lm/ for language-model finetuning and examples/gsm8k/ for an SFT example. Each example includes:

  • Task-specific data processors (data.py)
  • Training/evaluation scripts
  • Config files in cfgs/