Polygl0t/llm-foundry

██████╗  ██████╗ ██╗  ██╗   ██╗ ██████╗  ██████╗ ████████╗
██╔══██╗██╔═══██╗██║  ╚██╗ ██╔╝██╔════╝ ██╔═████╗╚══██╔══╝
██████╔╝██║   ██║██║   ╚████╔╝ ██║  ███╗██║██╔██║   ██║   
██╔═══╝ ██║   ██║██║    ╚██╔╝  ██║   ██║████╔╝██║   ██║   
██║     ╚██████╔╝███████╗██║   ╚██████╔╝╚██████╔╝   ██║   
╚═╝      ╚═════╝ ╚══════╝╚═╝    ╚═════╝  ╚═════╝    ╚═╝   

Developing foundation models for low-resource languages

Overview
--------

This repository (the foundry) contains all the source code we use in our research and development pipeline.

* data: Scripts for downloading and preprocessing datasets (e.g., HF Hub, Common Crawl).
* distributed: Scripts for training and evaluating language models with DDP and FSDP.
* dpo: Implementation for Direct Preference Optimization via TRL.
* evals: Scripts for evaluating language models via the lm-evaluation-harness.
* gym: Scripts for training and evaluating language models on custom environments (WIP).
* hf_hub: Scripts for interacting with the Hugging Face Hub.
* merge: Scripts for running different merging techniques via mergekit.
* sft: Implementation of Supervised Fine-Tuning via TRL.
* synthetic: Scripts for generating synthetic datasets with vLLM.
* tests: Unit and integration tests for our code base.
* tokenization: Scripts for training, evaluating, and using tokenizers.
* utils: Some miscellaneous utilities for our code base.

Our entire code base is designed to run on the Marvin cluster (University of Bonn).

Installation
------------

You can use `installation.sh` to help you create workspaces on Marvin. Marvin has a dual-stack setup (Intel / AMD), so make sure to create your local environments with this in mind (see `installation.sh` for details).

The `.modules_{amd|intel}.sh` files contain all the modules you need to load in order to run our code base on Marvin's AMD or Intel stack (see `installation.sh` for details).
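One way to pick the right module file automatically is to detect the CPU vendor of the node you are on. This is only a sketch: the variable names and the fallback to the Intel stack are assumptions, not part of the repo.

```shell
# Detect the CPU vendor of the current node and select the matching
# module file (.modules_amd.sh / .modules_intel.sh in the repo root).
vendor=$(grep -m1 'vendor_id' /proc/cpuinfo 2>/dev/null | awk '{print $3}')

case "$vendor" in
    AuthenticAMD) modules_file=".modules_amd.sh" ;;
    *)            modules_file=".modules_intel.sh" ;;  # assumption: default to the Intel stack
esac

echo "Load modules with: source $modules_file"
```

You would then `source "$modules_file"` before launching any jobs on that node.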

You can also use `pyproject.toml` to install specific, known-working dependency groups of our code base. The `pyproject.toml` currently defines the following groups:

* `data`: For downloading and preprocessing datasets.
* `distributed`: For training and evaluating language models with DDP and FSDP.
* `synth`: For generating synthetic datasets with vLLM.
* `trl`: For training and evaluating language models with TRL.
* `tests`: For running our test suite.
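Assuming these are defined as optional dependency groups (extras) in `pyproject.toml`, they would typically be installed with pip's standard extras syntax. A sketch, from the repo root:

```shell
# Editable install with one optional dependency group from pyproject.toml
pip install -e ".[distributed]"

# Several groups can be combined in a single invocation
pip install -e ".[data,tests]"
```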

Acknowledgments
---------------

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge access to the Marvin cluster, hosted by the University of Bonn, along with support from its High Performance Computing & Analytics Lab.
