Communication Efficient LLM Pre-training with SparseLoCo


This repository provides a PyTorch implementation of SparseLoCo, introduced in the paper "Communication Efficient LLM Pre-training with SparseLoCo". SparseLoCo mitigates the communication bottleneck of distributed LLM pre-training by combining Top-k sparsification with error feedback (Top-k EF) and DiLoCo-style local optimization.
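
For intuition, here is a minimal sketch of Top-k error feedback (EF); it is not the repository's implementation in src/tplr/sparseloco.py, and the function name and signature are illustrative. Only the k largest-magnitude entries of an update are kept for communication, and whatever is dropped is accumulated in a local buffer and re-injected at the next step.

import torch

def topk_with_error_feedback(update: torch.Tensor,
                             error_buf: torch.Tensor,
                             k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude entries of `update`; stash the rest in `error_buf`."""
    corrected = (update + error_buf).flatten()   # re-inject previously dropped mass
    idx = corrected.abs().topk(k).indices        # positions of the k largest entries
    sparse = torch.zeros_like(corrected)
    sparse[idx] = corrected[idx]                 # the values that would be communicated
    error_buf.copy_((corrected - sparse).view_as(error_buf))  # remember what was dropped
    return sparse.view_as(update)

# Example: compress a dummy 1M-parameter update to 1% density.
update = torch.randn(1000, 1000)
error_buf = torch.zeros_like(update)
sparse_update = topk_with_error_feedback(update, error_buf, k=10_000)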

Key Features

  • SparseLoCo Optimizer: A reference implementation of the core algorithm in src/tplr/sparseloco.py.
  • Multiple Training Strategies: Includes baselines for robust comparison (a sketch of the shared outer loop follows this list):
    • SparseLoCo: The proposed Top-k EF compression combined with local optimization.
    • DiLoCo: Distributed training with local optimization and a Nesterov outer optimizer.
    • DeMo: DCT-based gradient compression without local optimization.
    • AdamW: Standard distributed data-parallel training.
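
SparseLoCo and the DiLoCo baseline share an inner/outer structure: each worker takes several local optimizer steps, then the workers exchange a (possibly compressed) pseudo-gradient and apply it with an outer optimizer. The sketch below is illustrative only; names such as outer_round, inner_steps, and compress are placeholders rather than this repository's API, and it assumes torch.distributed is already initialized.

import torch
import torch.distributed as dist

def outer_round(model, inner_opt, outer_opt, data_iter, inner_steps, compress):
    # Snapshot the globally synchronized parameters at the start of the round.
    start_params = [p.detach().clone() for p in model.parameters()]

    # 1) Local phase: ordinary optimizer steps (e.g. AdamW), no communication.
    for _ in range(inner_steps):
        loss = model(next(data_iter)).mean()      # placeholder forward pass / loss
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # 2) Communication phase: compress the pseudo-gradient (how far this worker
    #    moved this round) and average it across workers.
    for p, p0 in zip(model.parameters(), start_params):
        pseudo_grad = p0 - p.detach()             # DiLoCo-style outer gradient
        sparse = compress(pseudo_grad)            # e.g. Top-k with error feedback
        dist.all_reduce(sparse)                   # sum across workers...
        sparse /= dist.get_world_size()           # ...then average
        p.data.copy_(p0)                          # reset to the shared starting point
        p.grad = sparse                           # hand the result to the outer optimizer

    # 3) Outer step (e.g. SGD with Nesterov momentum) on the aggregated update.
    outer_opt.step()
    outer_opt.zero_grad()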

Getting Started

1. Prerequisites

  • Python 3.11+
  • uv (for environment management)
  • This codebase has been tested with H100 and H200 GPUs

2. Installation

Clone the repository and install the required dependencies using uv.

git clone https://github.com/tplr-ai/SparseLoCo
cd SparseLoCo
uv sync
source .venv/bin/activate

3. Data Preparation

The training script expects a pre-tokenized and sharded dataset. Use the pretokenize_data.py script to process a dataset from Hugging Face.

The default configuration uses mlfoundations/dclm-baseline-1.0-parquet and expects the output in ~/datasets/dclm_tokenized.

export DATA_DIR="$HOME/datasets"
python pretokenize_data.py --output_dir "$DATA_DIR/dclm_tokenized"

Note: Ensure the --output_dir matches the shards_path in the sweep configuration files (hparams/**/*.yaml) or update the YAML files accordingly.
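
If you want to double-check, a small script along these lines can confirm that every configured shards_path points at an existing directory. It is illustrative only: it assumes the sweep YAMLs store the directory under a shards_path key and that PyYAML is installed.

import glob
import os
import yaml

def find_shards_paths(node):
    """Recursively yield every value stored under a 'shards_path' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "shards_path":
                yield value
            else:
                yield from find_shards_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_shards_paths(item)

for cfg_file in glob.glob("hparams/**/*.yaml", recursive=True):
    with open(cfg_file) as f:
        cfg = yaml.safe_load(f)
    for path in find_shards_paths(cfg):
        expanded = os.path.expanduser(str(path))
        status = "ok" if os.path.isdir(expanded) else "MISSING"
        print(f"{cfg_file}: shards_path={path} -> {status}")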

Running Experiments

Experiments are managed through wandb sweeps. The run_sweep.sh script simplifies the process by creating a sweep and launching a wandb agent.

First, set your W&B API key:

export WANDB_API_KEY="..."

Then, run any of the predefined experiments using the corresponding sweep file. Each experiment is configured to run on 8 GPUs by default (--nproc_per_node=8). You can adjust the number of GPUs by modifying the --nproc_per_node parameter in the sweep configuration files.
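
run_sweep.sh takes care of both steps. For reference, roughly the same flow expressed with the wandb Python API looks like the sketch below; the project name and agent count are illustrative assumptions, not values used by the script.

import yaml
import wandb

# Load one of the sweep configurations shipped with the repository.
with open("hparams/512M/sweeps/sparseloco.yaml") as f:
    sweep_config = yaml.safe_load(f)

# Register the sweep with W&B (the project name here is an assumption).
sweep_id = wandb.sweep(sweep=sweep_config, project="sparseloco")

# Launch an agent. Without a `function` argument, the agent runs the program /
# command defined in the sweep YAML (e.g. a torchrun invocation).
wandb.agent(sweep_id, count=1)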

SparseLoCo (Proposed Method)

bash ./run_sweep.sh hparams/512M/sweeps/sparseloco.yaml

Baselines

DiLoCo Baseline: DiLoCo with a Nesterov outer optimizer

bash ./run_sweep.sh hparams/512M/sweeps/diloco_baseline.yaml

DeMo Baseline: Standard DDP with DeMo

bash ./run_sweep.sh hparams/512M/sweeps/demo_baseline.yaml

AdamW Baseline: Standard DDP with AdamW

bash ./run_sweep.sh hparams/512M/sweeps/adam_baseline.yaml

Citation

If you find SparseLoCo useful in your work, please consider citing it. You can read more in the arXiv preprint.

@misc{sarfi2025sparseloco,
  title         = {Communication Efficient LLM Pre-training with SparseLoCo},
  author        = {Sarfi, Amir and Thérien, Benjamin and Lidin, Joel and Belilovsky, Eugene},
  year          = {2025},
  eprint        = {2508.15706},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  howpublished  = {\url{https://arxiv.org/pdf/2508.15706}}
}
