Skip to content

Commit 2b9b87a

Browse files
MichelDucartierCopilotfabnemEPFLBoyeGuillaume
authored
Big refactor + add mock data (#8)
* Add workflows + mock data * Better workflow * Maybe * Check free space * Free space * Typo * In GPT we trust * Add JSON output * Add iterative variable generation * Add medtrinity question generator * Stuff happened * Stuff happened again * Finish? * Clean * Change output_var to variable * Add validation pass * Maybe * Remove even more space * Fix typo * No cache for pip * Idk * Try with ARM64 only * Small changes * Clean * Update src/mirage/core/loader/base.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestions * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Some docs * Add docstring * Ruff reformat * Fix * Update src/mirage/config/loading.py Co-authored-by: fabnemEPFL <117652591+fabnemEPFL@users.noreply.github.com> * added test_mock_data.sh * Fix * Remove extra configs + add sorting by type in data loading process * Fix invalid feature name for Sglang Sglang `all` feature does not exists in every version of Sglang that is >=0.5.2 hence the requirement modification * Create script to generate .env file * Update the setup scripts * Fix missing dependencies in pyproject.toml * Add support for DatasetDict * Update gitignore * Remove comment * Put tp_size at 1 for mock config * Add error on abstract methods --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: fabnemEPFL <117652591+fabnemEPFL@users.noreply.github.com> Co-authored-by: Fabrice Nemo <fabrice.nemo@epfl.ch> Co-authored-by: Guillaume Boye <guillaume.boye@epfl.ch>
1 parent 27aedbc commit 2b9b87a

37 files changed

Lines changed: 2067 additions & 686 deletions

.github/workflows/docker.yml

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
name: Build and Push Docker Image
2+
3+
on:
4+
push:
5+
branches:
6+
- main
7+
pull_request:
8+
branches:
9+
- main
10+
11+
env:
12+
IMAGE_NAME: michelducartier24/mirage
13+
REGISTRY: docker.io
14+
15+
jobs:
16+
build-docker:
17+
strategy:
18+
fail-fast: true
19+
matrix:
20+
include:
21+
- platform: ubuntu-latest
22+
path: docker/Dockerfile
23+
tag_base: amd64
24+
name: mirage-git
25+
- platform: ubuntu-24.04-arm
26+
path: docker/Dockerfile
27+
tag_base: arm64
28+
name: mirage-git
29+
30+
runs-on: ${{ matrix.platform }}
31+
environment: docker
32+
33+
steps:
34+
- name: Free space (ARM)
35+
if: matrix.platform == 'ubuntu-24.04-arm'
36+
run: |
37+
df -h
38+
du -h -d1 /home/runner || true
39+
40+
rm -rf /opt/hostedtoolcache
41+
rm -rf /home/runner/.cache
42+
rm -rf /home/runner/.docker
43+
rm -rf /home/runner/actions-runner/_work/_tool
44+
45+
df -h
46+
47+
- name: Free disk space
48+
uses: jlumbroso/free-disk-space@main
49+
with:
50+
tool-cache: true
51+
docker-images: true
52+
android: true
53+
dotnet: true
54+
haskell: true
55+
large-packages: true
56+
swap-storage: true
57+
58+
- name: Check free space
59+
run: df -h
60+
61+
- name: Checkout repository
62+
uses: actions/checkout@v4
63+
64+
- name: Log in to DockerHub
65+
uses: docker/login-action@v3
66+
with:
67+
username: ${{ secrets.DOCKER_USERNAME }}
68+
password: ${{ secrets.DOCKER_PASSWORD }}
69+
70+
- name: Set up Docker Buildx
71+
uses: docker/setup-buildx-action@v3
72+
73+
- name: Build and push
74+
uses: docker/build-push-action@v6
75+
with:
76+
context: .
77+
file: ${{ matrix.path }}
78+
push: true
79+
tags: |
80+
${{ secrets.DOCKER_USERNAME }}/${{ matrix.name }}:latest-${{ matrix.tag_base }}
81+
${{ secrets.DOCKER_USERNAME }}/${{ matrix.name }}:${{ github.sha }}-${{ matrix.tag_base }}

.gitignore

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,9 @@ __pycache__/
66
# C extensions
77
*.so
88

9+
tests/output/**
10+
tests/merged/**
11+
912
# Distribution / packaging
1013
.Python
1114
build/
@@ -160,4 +163,4 @@ cython_debug/
160163
#.idea/
161164

162165
logs/
163-
else/
166+
else/

README.md

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,21 @@
22

33
MIRAGE, which stands for Multimodal Intelligent Reformatting and Augmentation Generation Engine, is an advanced platform designed to streamline the processing of datasets using generative models. It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
44

5+
## How to install
6+
7+
To install the library, you can clone it from GitHub and then use pip to install it directly. It is recommended to have already installed `torch` and `sglang` to take advantage of GPU acceleration.
8+
9+
```bash
10+
git clone git@github.com:EPFLiGHT/MIRAGE.git
11+
pip install -e ./MIRAGE
12+
```
13+
14+
For testing and scripts that make use of the library, it is advised to create a .env file. You can do this by running the following command:
15+
```bash
16+
curl https://raw.githubusercontent.com/EPFLiGHT/MIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh
17+
```
18+
19+
520
## Key features
621

722
- Easily configurable with a YAML file which configure the following parameters
@@ -114,4 +129,4 @@ Here, we choose to output a JSON answer with 3 keys ("question", "explanation" a
114129
- Jinja2 to process the YAML: #[link](https://jinja.palletsprojects.com/en/stable/)
115130
- JMESPath: #[link](https://jmespath.org/)
116131
- SGLang: #[link](https://github.com/sgl-project/sglang)
117-
- Paper for performance drom: #[link](https://arxiv.org/abs/2408.02442)
132+
- Paper for performance drom: #[link](https://arxiv.org/abs/2408.02442)

configs/config_medtrinity.yaml

Lines changed: 0 additions & 51 deletions
This file was deleted.

configs/config_mock.yaml

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,55 @@
1+
processors:
2+
- type: llm
3+
server_args:
4+
model_path: Qwen/Qwen3-4B-Instruct-2507
5+
tp_size: 1
6+
disable_custom_all_reduce: true
7+
sampling_params:
8+
temperature: 0.1
9+
top_p: 0.9
10+
max_new_tokens: 1024
11+
custom_params:
12+
chat_template_kwargs:
13+
enable_thinking: false
14+
15+
loading_params:
16+
datasets:
17+
- path: tests/mock_data/data.jsonl
18+
type: JSONL
19+
output_dir: tests/output/data
20+
- path:
21+
train: tests/mock_data/data2/train.jsonl
22+
test: tests/mock_data/data2/test.jsonl
23+
type: JSONL
24+
output_dir: tests/output/data2
25+
26+
num_shards: 4
27+
shard_id: 0
28+
conversations_field: "conversations"
29+
batch_size: 64
30+
31+
processing_params:
32+
inputs:
33+
- name: text
34+
key: text
35+
36+
outputs:
37+
- name: formatted_answer
38+
type: llm
39+
output_type: JSON
40+
output_schema:
41+
- question
42+
- answer
43+
prompt: |
44+
Generate one question and its corresponding answer using the following text:
45+
```
46+
{{ text }}
47+
```
48+
49+
remove_columns: True
50+
output_schema:
51+
conversations:
52+
- role: "user"
53+
content: "{{ formatted_answer.question }}"
54+
- role: "assistant"
55+
content: "{{ formatted_answer.answer }}"

configs/config_small.yaml

Lines changed: 0 additions & 46 deletions
This file was deleted.

docker/Dockerfile

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
FROM docker.io/lmsysorg/sglang:latest
2+
3+
COPY . /workspace/MIRAGE
4+
WORKDIR /workspace/MIRAGE
5+
RUN pip install --no-cache-dir -e .

pyproject.toml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ authors = [{ name = "Meditron team" }]
1111

1212
# Core runtime deps for your scripts
1313
dependencies = [
14-
"sglang[all]>=0.5.2",
14+
"sglang[diffusion]>=0.5.2",
1515
"transformers>=4.46.0",
1616
"pyzmq",
1717
"uvloop<0.22; platform_system != 'Windows'",
@@ -34,6 +34,7 @@ dependencies = [
3434
"fsspec",
3535
"dacite>=1.6.0",
3636
"pydantic>=2.12",
37+
"jmespath"
3738
]
3839

3940
[project.optional-dependencies]
@@ -49,4 +50,4 @@ dev = [
4950
packages = ["src/mirage"]
5051

5152
[tool.hatch.build.targets.sdist]
52-
include = ["src/mirage/**", "pyproject.toml", "README.md"]
53+
include = ["src/mirage/**", "pyproject.toml", "README.md"]

run.sh

Lines changed: 18 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
#!/bin/bash
2-
#SBATCH --job-name=med-sharded
2+
#SBATCH --job-name=mirage-example
33
#SBATCH --chdir=/users/$USER/meditron/MIRAGE/src/mirage
44
#SBATCH --output=/users/$USER/reports/R-%x.%A_%a.out
55
#SBATCH --error=/users/$USER/reports/R-%x.%A_%a.err
@@ -8,21 +8,31 @@
88
#SBATCH --gres=gpu:4
99
#SBATCH --cpus-per-task=288
1010
#SBATCH --time=11:59:59
11-
#SBATCH --environment=/users/$USER/.edf/sglang.toml
1211
#SBATCH -A a127
13-
#SBATCH --array=0-31
12+
#SBATCH --array=0-3
1413

1514
# --- outputs & config ---
16-
export ROOT=/capstor/store/cscs/swissai/a127/homes/$USER/datasets/english_small
15+
export ROOT=$SCRATCH/mirage_example
1716
export SHARDS_ROOT="$ROOT/shards"
1817
export MERGED_DIR="$ROOT/merged"
19-
export CFG=/users/$USER/MIRAGE/configs/config_small.yaml
18+
export CFG=/users/$USER/meditron/MIRAGE/configs/config_small.yaml
2019

2120
# HF cache/home
22-
export HF_HOME=/capstor/store/cscs/swissai/a127/homes/$USER/hf
21+
export HF_HOME=$SCRATCH/hf
2322

2423
mkdir -p "$SHARDS_ROOT"
2524
mkdir -p "$MERGED_DIR"
2625

27-
python /users/$USER/MIRAGE/src/mirage/shard_process.py \
28-
--config "$CFG"
26+
export CMD="python /users/$USER/meditron/MIRAGE/src/mirage/shard_process.py --config $CFG"
27+
28+
SRUN_ARGS=" \
29+
--cpus-per-task $SLURM_CPUS_PER_TASK \
30+
--jobid $SLURM_JOB_ID \
31+
--wait 60 \
32+
-A a127 \
33+
--reservation sai-a127 \
34+
--environment /users/$USER/.edf/mirage.toml
35+
"
36+
# bash -c is needed for the delayed interpolation of env vars to work
37+
srun $SRUN_ARGS bash -c "$CMD"
38+
echo "END TIME: $(date)"

0 commit comments

Comments
 (0)