Polygl0t/llm-foundry

██████╗  ██████╗ ██╗  ██╗   ██╗ ██████╗  ██████╗ ████████╗
██╔══██╗██╔═══██╗██║  ╚██╗ ██╔╝██╔════╝ ██╔═████╗╚══██╔══╝
██████╔╝██║   ██║██║   ╚████╔╝ ██║  ███╗██║██╔██║   ██║   
██╔═══╝ ██║   ██║██║    ╚██╔╝  ██║   ██║████╔╝██║   ██║   
██║     ╚██████╔╝███████╗██║   ╚██████╔╝╚██████╔╝   ██║   
╚═╝      ╚═════╝ ╚══════╝╚═╝    ╚═════╝  ╚═════╝    ╚═╝   

Developing foundation models for low-resource languages

Overview
--------

This repository (the foundry) contains all the source code we use in our research and development pipeline.

* data: Scripts for downloading and preprocessing datasets (e.g., HF Hub, Common Crawl).
* distributed: Scripts for training and evaluating language models with DDP and FSDP.
* dpo: Implementation for Direct Preference Optimization via TRL.
* evals: Scripts for evaluating language models via the lm-evaluation-harness.
* gym: Scripts for training and evaluating language models on custom environments (WIP).
* hf_hub: Scripts for interacting with the Hugging Face Hub.
* merge: Scripts for running different merging techniques via mergekit.
* sft: Implementation of Supervised Fine-Tuning via TRL.
* synthetic: Scripts for generating synthetic datasets with vLLM.
* tests: Unit and integration tests for our code base.
* tokenization: Scripts for training, evaluating, and using tokenizers.
* utils: Some miscellaneous utilities for our code base.

Our entire code base is designed to run on the Marvin cluster (University of Bonn).

Installation
------------

You can use `installation.sh` to help you create workspaces on Marvin. Marvin has a dual-stack setup (Intel / AMD), so make sure to create your local environments with this in mind (see `installation.sh` for details).

The `.modules_{amd|intel}.sh` files contain all the modules you need to load in order to run our code base on Marvin's AMD or Intel stack (see `installation.sh` for details).
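One way to pick the right module file automatically is to detect the CPU vendor of the node you are on. This is only a sketch: the variable names and the fallback to the Intel stack are assumptions, not part of the repo.

```shell
# Detect the CPU vendor of the current node and select the matching
# module file (.modules_amd.sh / .modules_intel.sh in the repo root).
vendor=$(grep -m1 'vendor_id' /proc/cpuinfo 2>/dev/null | awk '{print $3}')

case "$vendor" in
    AuthenticAMD) modules_file=".modules_amd.sh" ;;
    *)            modules_file=".modules_intel.sh" ;;  # assumption: default to the Intel stack
esac

echo "Load modules with: source $modules_file"
```

You would then `source "$modules_file"` before launching any jobs on that node.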

You can also use `pyproject.toml` to install specific, known-working dependency groups of our code base. The `pyproject.toml` currently defines the following groups:

* `data`: For downloading and preprocessing datasets.
* `distributed`: For training and evaluating language models with DDP and FSDP.
* `synth`: For generating synthetic datasets with vLLM.
* `trl`: For training and evaluating language models with TRL.
* `tests`: For running our test suite.
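Assuming these are defined as optional dependency groups (extras) in `pyproject.toml`, they would typically be installed with pip's standard extras syntax. A sketch, from the repo root:

```shell
# Editable install with one optional dependency group from pyproject.toml
pip install -e ".[distributed]"

# Several groups can be combined in a single invocation
pip install -e ".[data,tests]"
```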

Acknowledgments
---------------

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge access to the Marvin cluster, hosted by the University of Bonn, along with support from its High Performance Computing & Analytics Lab.
