Adapted ReadMe, added presentation

JoelDag · JoelDag · commit beb69003e313 · 2026-02-14T14:30:29.000Z
diff --git a/README.md b/README.md
@@ -0,0 +1,69 @@
+# HTYLLM-PG: How To Train Your LLM
+
+This repository is dedicated to developing and investigating language modeling techniques for massively multilingual language models that cover 200 or more languages.
+
+The work was carried out by a project group from the DICE group at the University of Paderborn over two semesters: SS25 and WS25/26.
+
+## Team
+
+- Tutors: Nikit Srivastava, Rene Speck
+- Supervisor: Prof. Dr. Axel Ngonga
+- Participants: Yven de Buhr, Jamil Mounzer, Joel Dag, Sashreek Nayak Dhaimodkar, Luke Friedrichs, Martin Schröder
+
+## Project Scope
+
+Over the course of the project group, we investigated several approaches to training scalable and extensible massively multilingual LLMs and assessed their feasibility.
+
+### Approaches explored in SS25
+
+- Mixture-of-Experts (MoE) training: sparse expert routing to scale model capacity efficiently.
+- Cross-lingual transfer fine-tuning: training related languages together to improve transfer between them.
+- Joint multilingual pretraining: training a dense model across all target languages.
+- Adapter techniques (LoRA, QLoRA, XLoRA): parameter-efficient adaptation for multilingual settings.
+
+### Focus in WS25/26
+
+- Asymmetric hierarchical LoRA adapters: hierarchical adapter-based methods with shared and language-specific components.
+- Dynamic MoE: adaptive expert routing variants and large-scale multilingual models.
+
+## Where To Find The Approaches
+
+- `approaches/CoLA`: hierarchical multilingual adapters (CoLA/HydraLoRA with language-aware routing).
+- `approaches/adapter`: adapter-centered experiments and setup notes.
+- `approaches/cross_lingual_transfer`: cross-lingual transfer fine-tuning and evaluation scripts.
+- `approaches/joint_multilingual_pretraining`: dense multilingual pretraining pipeline.
+- `approaches/moe`: MoE pretraining pipeline and scripts.
+- `approaches/dynamic_moe`: dynamic MoE with DeepSpeed, conversion, and evaluation tooling.
+
+## How To Get Started With The Different Approaches
+
+For the hierarchical asymmetric LoRA approach (in `approaches/CoLA`), the intermediate steps (data sampling, preparation, training, and evaluation) are documented in detail in `approaches/CoLA/docs/`.
+
+Additionally, you can use the presentations in `presentations/` to understand the overall approaches and project progress.
+
+Most explored approaches, especially the ones we focused on in WS25/26, have their own README files to support onboarding.
+
+For CoLA:
+- `approaches/CoLA/README.md`
+- `approaches/CoLA/docs/01_project_documentation.md`
+
+For dynamic MoE:
+- `approaches/dynamic_moe/README.md`
+
+For the other approaches:
+- `approaches/adapter/README.md`
+- `approaches/cross_lingual_transfer/readme.md`
+- `approaches/joint_multilingual_pretraining/readme.md`
+- `approaches/moe/readme.md`
+
+
+## Project Management And Shared Resources
+
+We organized and tracked work mainly via GitHub issues/milestones and a Kanban board:
+
+- GitHub issues: https://github.com/dice-group/HTYLLM-PG/issues
+- Kanban board: https://kanboard.cs.uni-paderborn.de/?controller=BoardViewController&action=show&project_id=850&search=status%3Aopen
+
+We also maintained a shared Sciebo folder for plots, documentation, literature review material, and approach-specific resources such as important training/evaluation datasets:
+
+- Sciebo folder: https://uni-paderborn.sciebo.de/apps/files/files/850613857?dir=/HTYLLM-PG
diff --git a/approaches/CoLA/README.md b/approaches/CoLA/README.md
@@ -1,20 +1,80 @@
-# Hierarchical Adapter Pipeline
+# CoLA: Hierarchical Asymmetric Adapter Pipeline
 
-Built by Yven & Sashreek and Joel to explore multilingual hierarchical CoLA/Hydra adapters with hierarchical (language-aware) routing
+This folder contains our hierarchical multilingual adapter work based on CoLA and HydraLoRA, including language-aware routing for massively multilingual training.
+
+## What This Approach Investigates
+
+- Hierarchical asymmetric adapters (shared low-rank structure + language-specific components)
+- CoLA and HydraLoRA routing variants
+- Language Prior Routing (LPR) with language-id guidance
+- Multilingual ablations across 200 languages
+
+## Folder Guide
+
+- `LLaMA-Factory/`: training framework with CoLA/HydraLoRA implementation changes
+- `data_prep/`: data sampling, clustering, tokenizer extension, and tokenization pipeline
+- `scripts/`: SLURM/local launchers for training and evaluation
+- `configs/`: evaluation task lists and run configuration inputs
+- `tools/two_stage_clustering/`: language grouping JSONs
+- `docs/`: full documentation and generated PDF/MD builds
+- `result_analysis/`: evaluation exports and analysis scripts
 
 ## Setup
-1. **Conda env**: `cd LLaMA-Factory && conda env create -f environment.yaml && conda activate cola_llama_factory`.
-2. **Local installs**: Afterwards uninstall peft and llamafactory again and `pip install -e .` (inside `LLaMA-Factory` and inside of `peft`).
-3. **Models/data**: we use llama3.1B as well as llama3.2-1B / 3B. We have prepared/tokenized datasets referenced in the scripts (e.g. `/scratch/.../tokenized/...`). check on cluster for details
 
-## Hierarchical Adapter TL;DR
-CoLA/Hydra layers now share family-level A matrices, keep B/heads per language, and optionally use Language Prior Routing (bias/hard routing driven by batch-level language IDs + auxiliary loss). See `docs/storyline.md` for the full narrative and implementation details.
+1. Initialize submodules from the repository root:
+```bash
+git submodule update --init --recursive
+```
+
+2. Create and activate the conda environment:
+```bash
+cd approaches/CoLA/LLaMA-Factory
+conda env create -f environment.yml
+conda activate merlin
+```
 
-## Running (Slurm)
-All launchers live in `scripts/`. For example, to train the standard Accelerate MoE CoLA baseline on the cluster:
+3. Install LLaMA-Factory in editable mode so local CoLA/Hydra changes are used:
+```bash
+pip uninstall -y peft llamafactory
+pip install -e .
 ```
+
+See `approaches/CoLA/LLaMA-Factory/setup_conda_env.md` for more details.
+
+4. Check model/data paths and environment variables in the SLURM scripts before launching (cluster-specific `/scratch/...` paths are referenced in several scripts).
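
A minimal sketch of the path check in step 4 (not part of this commit): the helper below lists every line under a directory that mentions a `/scratch/` path, so cluster-specific locations can be reviewed before launching. The function name `list_scratch_paths` and the default directory `scripts` are illustrative assumptions; the sketch relies only on standard `grep -rn`.

```bash
# Hypothetical helper (not in the repository): print file, line number, and
# content of every line under a directory that mentions a /scratch/ path.
list_scratch_paths() {
  local dir="${1:-scripts}"  # assumed default: called from approaches/CoLA/
  # -r: recurse into the directory, -n: show line numbers.
  # If grep finds nothing (or fails), print a short message instead.
  grep -rn '/scratch/' "$dir" || echo "no /scratch/ paths found in $dir"
}
```

Calling `list_scratch_paths scripts` from `approaches/CoLA/` prints the lines to review; it does not modify any file.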
+
+## Running
+
+From `approaches/CoLA/`:
+
+- Baseline CoLA training:
+```bash
 cd scripts
 sbatch accelerate_moe_cola_train.sh
 ```
-Languag Prior routing is work in progress.
-This should also be extended TODO
+
+- Multilingual ablation launcher:
+```bash
+cd scripts/comparison
+sbatch run_multilingual_ablation.sh
+```
+
+- Single-variant comparison jobs:
+  - `scripts/comparison/cola_lpr_job.sh`
+  - `scripts/comparison/hydralora_lpr_job.sh`
+  - `scripts/comparison/lora_job.sh`
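
A minimal sketch (not part of this commit) of queuing the three single-variant jobs in one go. It assumes the job scripts are submitted with `sbatch` like the launchers above, which the README does not state explicitly, and that they can be submitted from `approaches/CoLA/`; if they expect to run from `scripts/comparison`, change directory first. The function `submit_comparison_jobs` and its optional submit-command argument are hypothetical.

```bash
# Hypothetical helper (not in the repository): submit each single-variant
# comparison job listed in this README, one after another.
submit_comparison_jobs() {
  local submit_cmd="${1:-sbatch}"  # pass "echo" for a dry run
  local job
  for job in cola_lpr_job.sh hydralora_lpr_job.sh lora_job.sh; do
    # Paths are relative to approaches/CoLA/ (assumption).
    "$submit_cmd" "scripts/comparison/$job"
  done
}
```

Running `submit_comparison_jobs echo` prints the three script paths without submitting anything, which works as a dry run; calling it without arguments uses `sbatch`.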
+
+## Documentation (Start Here)
+
+- `docs/README.md`
+- `docs/01_project_documentation.md`
+- `docs/02_data_preparation.md`
+- `docs/03_model_training_and_implementation.md`
+- `docs/04_training_orchestration.md`
+- `docs/05_evaluation_and_analysis.md`
+- `docs/06_reproducibility_and_submission.md`
+
+Deep-dive references:
+
+- `docs/extra/hierarchical_adapters_multilingual_study_approaches_explanation.md`
+- `docs/extra/storyline.md`
diff --git a/presentations/HTYLLM_Semester_1_end_Presentation.pdf b/presentations/HTYLLM_Semester_1_end_Presentation.pdf