This repository is the official implementation of our ICLR 2025 paper Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures.
To set up the environment, run:

```bash
conda create -n colm python=3.10
conda activate colm
conda install -c nvidia cuda-python
pip install -r requirement.txt --no-cache-dir --no-build-isolation
git clone https://github.com/hsgser/vllm.git
cd vllm
VLLM_INSTALL_PUNICA_KERNELS=1 pip install -e .
cd ..
pip install traker[fast] --no-cache-dir
pip install flash-attn==2.5.7 --no-build-isolation
pip install -i https://pypi.org/simple/ bitsandbytes
git clone https://github.com/decile-team/submodlib.git
cd submodlib
pip install -e .
cd ..
pip install -e .
```

Note: Our implementation is tied to transformers==4.43.2. If you are using a different transformers version or a different model architecture, you may need to upgrade the libraries and modify the following files accordingly (a minimal version-check sketch follows the list):
- colm/custom_phi.py
- colm/subset_trainer_distributed.py
- colm/train.py
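As a convenience, here is a minimal sketch (not part of this repo; the pinned version string comes from the note above) of a guard you could place at the top of a training script to fail fast on an incompatible transformers version:

```python
# Minimal sketch (hypothetical, not part of the repo): fail fast if the installed
# transformers version differs from the one this implementation is tied to.
import transformers

PINNED_VERSION = "4.43.2"  # version stated in the note above

if transformers.__version__ != PINNED_VERSION:
    raise RuntimeError(
        f"Installed transformers=={transformers.__version__}, but this implementation is "
        f"tied to transformers=={PINNED_VERSION}. Install the pinned version or adapt "
        f"colm/custom_phi.py, colm/subset_trainer_distributed.py, and colm/train.py."
    )
```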
Please download the MathInstruct and SuperGLUE datasets with additional annotations here and store them under the following path: /data/*.jsonl.
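As a quick sanity check after downloading, a sketch like the one below (purely illustrative; it assumes nothing about the field names and only verifies that the files parse as JSON Lines) can confirm the data is in place:

```python
# Minimal sketch (illustrative only): verify the downloaded .jsonl files parse as JSON Lines.
# Adjust the glob pattern to wherever you stored the files (see the path above).
import glob
import json

for path in sorted(glob.glob("data/*.jsonl")):
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} examples; keys of first record: {sorted(records[0])}")
```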
To train with CoLM on MathInstruct, run:

```bash
bash scripts/run_math_efficient.sh
```

Note: We implement CoLM with an efficient last-layer zeroth-order gradient estimation that requires approximately only one forward pass of the model. While the selection time itself is negligible (<0.1s), CoLM still introduces additional overhead, such as synchronizing gradients before selection, broadcasting the selected indices back, padding after selection (which can make some samples longer), transferring tensors between CPU and GPU, context switching, and so on. In the paper, we report the ideal training time of our method, which is the forward-pass time for a batch size of 128 plus the forward- and backward-pass time for a batch size of 64.
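To illustrate why a last-layer zeroth-order estimate costs roughly one forward pass, here is a minimal, hypothetical sketch (not the repo's actual code; `body`, `head`, and the loss computation are placeholder assumptions): the transformer body runs once, and only the output head is re-evaluated twice with SPSA-style perturbations.

```python
# Minimal sketch (hypothetical; not the repo's implementation) of last-layer
# zeroth-order gradient estimation. The expensive transformer body runs once;
# the cheap output head is evaluated twice with +/- perturbations (SPSA).
import torch
import torch.nn.functional as F

@torch.no_grad()
def last_layer_zo_grad(body, head, inputs, labels, eps=1e-3, seed=0):
    hidden = body(inputs)                      # one full forward pass, hidden states cached
    torch.manual_seed(seed)
    z = torch.randn_like(head.weight)          # shared random perturbation direction

    def head_loss():
        logits = head(hidden)                  # re-evaluates only the output head
        return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

    head.weight.add_(eps * z)                  # w + eps * z
    loss_plus = head_loss()
    head.weight.add_(-2 * eps * z)             # w - eps * z
    loss_minus = head_loss()
    head.weight.add_(eps * z)                  # restore the original weights

    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return projected_grad * z                  # zeroth-order estimate of the head's gradient
```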
To evaluate the finetuned model, run:

```bash
cd math_eval
bash eval_finetuned.sh /path/to/your/model
```

If you have any questions related to the code or the paper, feel free to email Dang Nguyen ([email protected]). If you encounter any problems when using the code or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!
Please cite our paper if you find the repo helpful in your work:
```bibtex
@article{nguyen2025mini,
  title   = {Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures},
  author  = {Nguyen, Dang and Yang, Wenhan and Anand, Rathul and Yang, Yu and Mirzasoleiman, Baharan},
  journal = {International Conference on Learning Representations (ICLR)},
  year    = {2025}
}
```

The structure of this repository is largely based on the official implementations of LESS and MeZO. We are grateful to their authors for open-sourcing their code.