This repository is the official implementation of our ICLR 2025 paper Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures.
To set up the environment, run:

```bash
conda create -n colm python=3.10
conda activate colm
conda install -c nvidia cuda-python
pip install -r requirement.txt --no-cache-dir --no-build-isolation
git clone https://github.com/hsgser/vllm.git
cd vllm
VLLM_INSTALL_PUNICA_KERNELS=1 pip install -e .
cd ..
pip install traker[fast] --no-cache-dir
pip install flash-attn==2.5.7 --no-build-isolation
pip install -i https://pypi.org/simple/ bitsandbytes
git clone https://github.com/decile-team/submodlib.git
cd submodlib
pip install -e .
cd ..
pip install -e .
```

Note: Our implementation is tied to transformers==4.43.2. If you are using a different transformers version or a different model architecture, you may need to upgrade the libraries and modify the following files accordingly (a minimal version-check sketch follows the list):
- colm/custom_phi.py
- colm/subset_trainer_distributed.py
- colm/train.py
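As a convenience, here is a minimal sketch (not part of this repo; the pinned version string comes from the note above) of a guard you could place at the top of a training script to fail fast on an incompatible transformers version:

```python
# Minimal sketch (hypothetical, not part of the repo): fail fast if the installed
# transformers version differs from the one this implementation is tied to.
import transformers

PINNED_VERSION = "4.43.2"  # version stated in the note above

if transformers.__version__ != PINNED_VERSION:
    raise RuntimeError(
        f"Installed transformers=={transformers.__version__}, but this implementation is "
        f"tied to transformers=={PINNED_VERSION}. Install the pinned version or adapt "
        f"colm/custom_phi.py, colm/subset_trainer_distributed.py, and colm/train.py."
    )
```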
Please download the MathInstruct and SuperGLUE datasets with additional annotations here and store them under the following path: /data/*.jsonl.
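As a quick sanity check after downloading, a sketch like the one below (purely illustrative; it assumes nothing about the field names and only verifies that the files parse as JSON Lines) can confirm the data is in place:

```python
# Minimal sketch (illustrative only): verify the downloaded .jsonl files parse as JSON Lines.
# Adjust the glob pattern to wherever you stored the files (see the path above).
import glob
import json

for path in sorted(glob.glob("data/*.jsonl")):
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} examples; keys of first record: {sorted(records[0])}")
```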
To train with CoLM on MathInstruct, run:

```bash
bash scripts/run_math_efficient.sh
```

Note: We implement CoLM with an efficient last-layer zeroth-order gradient estimation that requires approximately only one forward pass of the model. While the selection time itself is negligible (<0.1s), CoLM still introduces additional overhead, such as synchronizing gradients before selection, broadcasting the selected indices back, padding after selection (which can make some samples longer), transferring tensors between CPU and GPU, context switching, and so on. In the paper, we report the ideal training time of our method, which is the forward-pass time for a batch size of 128 plus the forward- and backward-pass time for a batch size of 64.
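To illustrate why a last-layer zeroth-order estimate costs roughly one forward pass, here is a minimal, hypothetical sketch (not the repo's actual code; `body`, `head`, and the loss computation are placeholder assumptions): the transformer body runs once, and only the output head is re-evaluated twice with SPSA-style perturbations.

```python
# Minimal sketch (hypothetical; not the repo's implementation) of last-layer
# zeroth-order gradient estimation. The expensive transformer body runs once;
# the cheap output head is evaluated twice with +/- perturbations (SPSA).
import torch
import torch.nn.functional as F

@torch.no_grad()
def last_layer_zo_grad(body, head, inputs, labels, eps=1e-3, seed=0):
    hidden = body(inputs)                      # one full forward pass, hidden states cached
    torch.manual_seed(seed)
    z = torch.randn_like(head.weight)          # shared random perturbation direction

    def head_loss():
        logits = head(hidden)                  # re-evaluates only the output head
        return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

    head.weight.add_(eps * z)                  # w + eps * z
    loss_plus = head_loss()
    head.weight.add_(-2 * eps * z)             # w - eps * z
    loss_minus = head_loss()
    head.weight.add_(eps * z)                  # restore the original weights

    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return projected_grad * z                  # zeroth-order estimate of the head's gradient
```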
To evaluate the finetuned model, run:

```bash
cd math_eval
bash eval_finetuned.sh /path/to/your/model
```

If you have any questions related to the code or the paper, feel free to email Dang Nguyen ([email protected]). If you encounter any problems when using the code or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!
Please cite our paper if you find the repo helpful in your work:
```bibtex
@article{nguyen2025mini,
  title   = {Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures},
  author  = {Nguyen, Dang and Yang, Wenhan and Anand, Rathul and Yang, Yu and Mirzasoleiman, Baharan},
  journal = {International Conference on Learning Representations (ICLR)},
  year    = {2025}
}
```

The structure of this repository is largely based on the official implementations of LESS and MeZO. We are grateful to their authors for open-sourcing their code.