CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

MLPerf Storage Benchmark Suite (v2.0.0b1) - a Python framework for benchmarking storage systems supporting ML workloads. The suite uses DLIO (Deep Learning I/O) benchmark as its execution engine.

Common Commands

# Install for development
pip install -e .

# Install with test dependencies
pip install -e ".[test]"

# Install with full DLIO support for running benchmarks
pip install -e ".[full]"

# Run all unit tests
pytest tests/unit -v

# Run a single test file
pytest tests/unit/test_cli.py -v

# Run tests with coverage
pytest tests/unit -v --cov=mlpstorage --cov-report=xml

# Run integration tests
pytest tests/integration -v

CLI Usage

The main entry point is mlpstorage with nested subcommands:

# Training benchmarks (unet3d, resnet50, cosmoflow)
mlpstorage training datasize ...   # Calculate required dataset size
mlpstorage training datagen ...    # Generate synthetic data
mlpstorage training run ...        # Execute benchmark
mlpstorage training configview ... # View final configuration

# Checkpointing benchmarks (llama3-8b, llama3-70b, llama3-405b, llama3-1t)
mlpstorage checkpointing run ...
mlpstorage checkpointing datagen ...
mlpstorage checkpointing validate ...

# Other benchmarks
mlpstorage vectordb run ...        # Vector database (PREVIEW)
mlpstorage kvcache run ...         # KV cache

# Utilities
mlpstorage reports reportgen ...   # Generate submission reports
mlpstorage history list/replay ... # Command history

Architecture

Benchmark System

All benchmarks inherit from Benchmark base class (mlpstorage/benchmarks/base.py):

Subclasses implement _run() method and set BENCHMARK_TYPE class attribute
Base class handles cluster info collection, result directories, metadata, and signal handling
Supports dependency injection for cluster collectors and validators (for testing)

Concrete implementations in mlpstorage/benchmarks/:

TrainingBenchmark, CheckpointingBenchmark - DLIO-based benchmarks
VectorDBBenchmark - Vector database operations
KVCacheBenchmark - LLM KV cache management

Registry Pattern

BenchmarkRegistry (mlpstorage/registry.py) dynamically registers benchmarks at import time. Each benchmark registration includes its CLI argument builder, enabling automatic CLI construction.

Configuration Flow

CLI arguments parsed via cli_parser.py
YAML config templates loaded from configs/dlio/workload/
Parameters merged with precedence: CLI args > YAML config > environment variables
Dotted-key parameters (e.g., dataset.num_files_train) flattened/unflattened for DLIO

Validation System

Located in mlpstorage/rules/:

Run Checkers (run_checkers/) - Real-time validation during execution
Submission Checkers (submission_checkers/) - Post-run compliance validation
BenchmarkVerifier (verifier.py) - Orchestrates all validation
Validation states: CLOSED, OPEN, INVALID (defined in config.py as PARAM_VALIDATION enum)

MPI Integration

cluster_collector.py - MPI-based system information collection
Commands executed via CommandExecutor in utils.py with live output streaming
Supports both mpirun and mpiexec via --mpi-bin flag

Key Files

File	Purpose
`mlpstorage/main.py`	Entry point with signal/error handling
`mlpstorage/benchmarks/base.py`	Abstract benchmark base class
`mlpstorage/benchmarks/__init__.py`	Benchmark registry initialization
`mlpstorage/config.py`	Constants, enums, model configurations
`mlpstorage/rules/models.py`	Data classes for validation pipeline
`mlpstorage/utils.py`	Command execution, JSON encoding, config loading

Adding a New Benchmark

Create benchmark class inheriting from Benchmark
Set BENCHMARK_TYPE class attribute
Implement _run() method
Create CLI argument builder in mlpstorage/cli/
Register in mlpstorage/benchmarks/__init__.py via BenchmarkRegistry.register()

Testing

Tests use pytest with fixtures in tests/fixtures/:

mock_collector.py - Mock cluster collector
mock_executor.py - Mock command executor
mock_logger.py - Mock logger
sample_data.py - Sample test data

Test Environment

When running the mlpstorage CLI for manual testing or integration tests, use:

Data directory: /databases/mlps-v3.0/data/
Results directory: /databases/mlps-v3.0/results/

Example Commands

# Generate dataset for unet3d with 4 processes
mlpstorage training datagen \
    --model unet3d \
    --num-processes 4 \
    --data-dir /databases/mlps-v3.0/data/ \
    --results-dir /databases/mlps-v3.0/results

# Run training benchmark for unet3d with 2 h100 accelerators
mlpstorage training run \
    --model unet3d \
    --num-accelerators 2 \
    --accelerator-type h100 \
    --client-host-memory-in-gb 64 \
    --data-dir /databases/mlps-v3.0/data/ \
    --results-dir /databases/mlps-v3.0/results

Note: These benchmarks require MPI (OpenMPI) to be installed. Install with:

# Ubuntu/Debian
sudo apt-get install openmpi-bin

# RHEL/CentOS
sudo yum install openmpi

Key Constants

From mlpstorage/config.py:

Training models: cosmoflow, resnet50, unet3d
LLM models (checkpointing): llama3-8b, llama3-70b, llama3-405b, llama3-1t
Accelerators: h100, a100
Submission categories: CLOSED, OPEN

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLAUDE.md

Project Overview

Common Commands

CLI Usage

Architecture

Benchmark System

Registry Pattern

Configuration Flow

Validation System

MPI Integration

Key Files

Adding a New Benchmark

Testing

Test Environment

Example Commands

Key Constants

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Project Overview

Common Commands

CLI Usage

Architecture

Benchmark System

Registry Pattern

Configuration Flow

Validation System

MPI Integration

Key Files

Adding a New Benchmark

Testing

Test Environment

Example Commands

Key Constants