Commit 202c4f3

qchapp, Copilot, Copilot, and BoyeGuillaume authored
Feature/multi modal (#4)
* first version supporting image modalities and ready to test
* changes to handle cases where image is not in the column of the dataset
* added demo dataset to test
* small change
* fixed sglang image error
* fixed image error
* changes to handle empty shards and test with 32 nodes
* maybe this time
* fix output
* trying to fix strange error
* change to use map correctly and prompt
* not fully working
* using image token
* removed useless code
* last changes before PR
* Update README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update src/mirage/shard_process.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update src/mirage/utils.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fixed small error for lower python version
* Update src/mirage/utils.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Make chat template configurable for multimodal models (#5)
* Initial plan
* Make chat template configurable for multimodal models
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Update documentation for configurable chat template
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Address code review feedback: improve chat template validation and inference
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Improve chat template inference and add early validation
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Remove infer_chat_template method, make chat_template explicit in config
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Improve error message and simplify comment
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Optimize batch processing for both text-only and multimodal samples (#6)
* Initial plan
* Optimize batch processing by separating text-only and multimodal samples
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Optimize chat template validation to run once per batch
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Enable batched processing for multimodal samples
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* added exception and error handling for llm shutdown
* Update src/mirage/utils.py
Co-authored-by: Guillaume Boyé <guillaume.boye@epfl.ch>
* rework to resolve conflicts with main
* Rename project to MMIRAGE (Modular Multimodal)
* fixed imports
* added vision tests
* trying smaller images
* reverted some commits
* added error handling
* trying to fix multimodalities
* changed file path handling
* ready for PR and tests working
* making ready for the merge
* fixed example config
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Guillaume Boyé <guillaume.boye@epfl.ch>
1 parent 8461729 commit 202c4f3

32 files changed

Lines changed: 825 additions & 311 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -164,3 +164,7 @@ cython_debug/
 
 logs/
 else/
+
+# Test outputs
+tests/mock_data/output/
+tests/mock_data/shards/

README.md

Lines changed: 138 additions & 100 deletions
@@ -1,6 +1,6 @@
 # MMIRAGE
 
-MMIRAGE, which stands for Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine, is an advanced platform designed to streamline the processing of datasets using generative models. It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
+MMIRAGE, which stands for **M**odular **M**ultimodal **I**ntelligent **R**eformatting and **A**ugmentation **G**eneration **E**ngine, is an advanced platform designed to streamline the processing of datasets using generative models, including vision-language models (VLMs). It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
 
 ## How to install
 
@@ -11,137 +11,175 @@ git clone git@github.com:EPFLiGHT/MMIRAGE.git
 pip install -e ./MMIRAGE
 ```
 
-For testing and scripts that make use of the library, it is advised to create a .env file. You can do this by running the following command:
+For testing and scripts that make use of the library, it is advised to create a .env file:
 ```bash
-curl https://raw.githubusercontent.com/EPFLiGHT/MMIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh
+./scripts/generate_env.sh
 ```
 
-
-## How to install
-
-To install the library, you can clone it from GitHub and then use pip to install it directly. It is recommended to have already installed `torch` and `sglang` to take advantage of GPU acceleration.
-
-```bash
-git clone git@github.com:EPFLiGHT/MIRAGE.git
-pip install -e ./MIRAGE
-```
-
-For testing and scripts that make use of the library, it is advised to create a .env file. You can do this by running the following command:
-```bash
-curl https://raw.githubusercontent.com/EPFLiGHT/MIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh
-```
-
-
 ## Key features
 
-- Easily configurable with a YAML file which configure the following parameters
-- The prompt to the LLM
-- Variables with the name and their key to a JSON
-- Parallelizable with a multi-node support
-- The training pipeline should use either distributed inference using accelerate
-- Support a variety of LLMs and VLMs (LLM only for a first version)
+- **Multimodal Support**: Process both text and images with vision-language models
+- Easily configurable with a YAML file which configures the following parameters:
+- The prompt to the LLM (using Jinja2 templating)
+- Variables with the name and their JMESPath key to a JSON
+- Image inputs for multimodal processing
+- Parallelizable with multi-node support
+- The training pipeline uses distributed inference with sharding
+- Support a variety of LLMs and VLMs (Vision-Language Models)
 - Support any dataset schemas (configurable with the YAML format)
-- The ability to either output a JSON (or any other structured format) or a plain text
+- The ability to either output a JSON (or any other structured format) or plain text
+- Modular architecture with pluggable processors, loaders, and writers
 
 ## Example usage
 
-### Reformatting dataset
+### Text-only: Reformatting dataset
 
 Suppose you have a dataset with samples of the following format
 
 ```json
 {
 "conversations" : [{"role": "user", "content": "Describe the image"}, {"role": "assistant", "content": "This is a badly formmatted answer"}],
-"modalities" : [<the images>]
+"modalities" : ["<the images>"]
 }
 ```
 
-The dataset contains assistant answers that are badly formatted. The goal would be to use a LLM to format our answer in Markdown. With MMIRAGE, it would be as simple as defining a YAML configuration file.
-Then in the YAML configuration file, we could specify
+The dataset contains assistant answers that are badly formatted. The goal would be to use a LLM to format our answer in Markdown. With MMIRAGE, it would be as simple as defining a YAML configuration file:
 
 ```yaml
-inputs:
-- name: assistant_answer
-key: conversations[1].content
-- name: user_prompt
-key: conversations[0].content
-- name: modalities
-key: modalities
-
-outputs:
-- name: formatted_answer
-type: llm
-output_type: plain
-prompt: |
-Reformat the answer in a markdown format without adding anything else:
-{assistant_answer}
+processors:
+- type: llm
+server_args:
+model_path: Qwen/Qwen3-8B
+tp_size: 4
+trust_remote_code: true
+default_sampling_params:
+temperature: 0.1
+top_p: 1.0
+max_new_tokens: 384
+
+loading_params:
+datasets:
+- path: /path/to/dataset
+type: loadable
+output_dir: /path/to/output/shards
+num_shards: "$SLURM_ARRAY_TASK_COUNT"
+shard_id: "$SLURM_ARRAY_TASK_ID"
+batch_size: 64
+
+processing_params:
+inputs:
+- name: assistant_answer
+key: conversations[1].content
+- name: user_prompt
+key: conversations[0].content
+- name: modalities
+key: modalities
+
+outputs:
+- name: formatted_answer
+type: llm
+output_type: plain
+prompt: |
+Reformat the answer in a markdown format without adding anything else:
+{{ assistant_answer }}
 
-output_schema:
-conversations:
-- role: user
-content: {user_prompt}
-- role: assistant
-content: {formatted_answer}
-modalities: {modalities}
-
+remove_columns: false
+output_schema:
+conversations:
+- role: user
+content: "{{ user_prompt }}"
+- role: assistant
+content: "{{ formatted_answer }}"
+modalities: "{{ modalities }}"
 ```
 
 Configuration explanation:
 
-- `inputs`: specify variables that are defined from the input dataset. For instance by specifying the key `conversations[1].content`, we say that this variable corresponds to `sample["conversations"][1]["content"]`
-- `outputs`: specify variables that are created from the pipeline. We specify how the variable should be created:
-- Here `formatted_answer` is created using a LLM prompt and is a plain text variable (as opposed to JSON variables)
-- `output_schema`: specify the output schema of the dataset. So each sample will follow this format. Here we know that each sample will contain 2 keys: `conversations` and `modalities`
+- `processors`: List of processor configurations. Currently supports `llm` type for LLM-based generation.
+- `loading_params`: Parameters for loading and sharding datasets.
+- `datasets`: List of dataset configurations with path, type, and output directory.
+- `processing_params`:
+- `inputs`: Variables extracted from the input dataset using JMESPath queries.
+- `outputs`: Variables created by processors. Prompts use Jinja2 templating (`{{ variable }}`).
+- `output_schema`: Defines the structure of output samples.
 
-### Transforming datasets
+### Multimodal: Processing images with VLMs
 
-In the second example, we want to generate questions from plain text document. The 3 keys that we want to generate are:
+MMIRAGE supports multimodal processing with vision-language models:
 
-- "question"
-- "answer"
-- "explanation"
+```yaml
+processors:
+- type: llm
+server_args:
+model_path: Qwen/Qwen2-VL-7B-Instruct
+tp_size: 4
+trust_remote_code: true
+chat_template: qwen2-vl # Required for VLMs
+default_sampling_params:
+temperature: 0.1
+top_p: 0.95
+max_new_tokens: 768
+
+loading_params:
+datasets:
+- path: /path/to/image/dataset
+type: loadable
+output_dir: /path/to/output/shards
+num_shards: "$SLURM_ARRAY_TASK_COUNT"
+shard_id: "$SLURM_ARRAY_TASK_ID"
+batch_size: 32
+
+processing_params:
+inputs:
+- name: medical_image
+key: image
+type: image # Mark as image input
+image_base_path: /path/to/images # Base directory for relative paths
+- name: original_caption
+key: caption
+type: text
+
+outputs:
+- name: enhanced_caption
+type: llm
+output_type: plain
+prompt: |
+Describe the medical image in detail.
+Original caption for context: {{ original_caption }}
+
+remove_columns: false
+output_schema:
+image: "{{ medical_image }}"
+caption: "{{ enhanced_caption }}"
+original_caption: "{{ original_caption }}"
+```
 
-Suppose we have the following format:
+Key multimodal features:
+- `chat_template`: Specify the VLM chat template (e.g., `qwen2-vl`)
+- `type: image`: Mark input variables as images
+- `image_base_path`: Base directory for resolving relative image paths
+- Supports PIL Images, URLs, and file paths
 
-```json
-{
-"text" : "This is a very interesting article about cancer"
-}
-```
+## Architecture
 
-```yaml
-inputs:
-- name: plain_text
-key: text
-
-outputs:
-- name: output_dict
-type: prompt
-output_type: JSON
-prompt: |
-I want to generate Q/A pairs from the following text:
-{plain_text}
-output_schema:
-- question
-- explanation
-- answer
-
-output_schema:
-conversations:
-- role: user
-content: {question}
-- role: assistant
-content: |
-{explanation}
-Answer: {answer}
+MMIRAGE uses a modular architecture:
 
+```
+mmirage/
+├── config/ # Configuration loading and validation
+├── core/
+│ ├── loader/ # Dataset loaders (JSONL, HuggingFace)
+│ ├── process/ # Processors (LLM, etc.) and variable system
+│ │ └── processors/
+│ │ └── llm/ # LLM processor with multimodal support
+│ └── writer/ # Output rendering with Jinja2
+├── shard_process.py # Main processing script
+└── merge_shards.py # Shard merging utility
 ```
 
-Here, we choose to output a JSON answer with 3 keys ("question", "explanation" and "answer"). That we will match
-
-## Usefool tools
+## Useful tools
 
-- Jinja2 to process the YAML: #[link](https://jinja.palletsprojects.com/en/stable/)
-- JMESPath: #[link](https://jmespath.org/)
-- SGLang: #[link](https://github.com/sgl-project/sglang)
-- Paper for performance drom: #[link](https://arxiv.org/abs/2408.02442)
+- Jinja2 for template processing: [link](https://jinja.palletsprojects.com/en/stable/)
+- JMESPath for JSON queries: [link](https://jmespath.org/)
+- SGLang for fast inference: [link](https://github.com/sgl-project/sglang)
+- Performance paper: [link](https://arxiv.org/abs/2408.02442)
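
Reading note on the README changes: the new text says that `inputs` are extracted with JMESPath keys and that prompts and `output_schema` values use `{{ variable }}` Jinja2 placeholders. The sketch below illustrates that data flow only. It is not MMIRAGE code: `lookup`, `render`, and `fill_schema` are hypothetical helpers, written with the standard library as simplified stand-ins for the JMESPath and Jinja2 libraries, and the `formatted_answer` value is a placeholder for what the LLM processor would return.

```python
import re


def lookup(sample, path):
    """Resolve a path such as 'conversations[1].content' against a sample.

    Simplified stand-in for a JMESPath query: it handles only plain keys
    and integer indexes, not the full JMESPath grammar.
    """
    current = sample
    for key, index in re.findall(r"([A-Za-z_]\w*)|\[(\d+)\]", path):
        current = current[key] if key else current[int(index)]
    return current


def render(template, variables):
    """Fill '{{ name }}' placeholders; stand-in for Jinja2 rendering."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(variables[match.group(1)]),
        template,
    )


def fill_schema(schema, variables):
    """Recursively render every string found in a nested output schema."""
    if isinstance(schema, dict):
        return {k: fill_schema(v, variables) for k, v in schema.items()}
    if isinstance(schema, list):
        return [fill_schema(item, variables) for item in schema]
    if isinstance(schema, str):
        return render(schema, variables)
    return schema


# Sample and input specs copied from the README example in the diff.
sample = {
    "conversations": [
        {"role": "user", "content": "Describe the image"},
        {"role": "assistant", "content": "This is a badly formmatted answer"},
    ],
    "modalities": ["<the images>"],
}
inputs = [
    {"name": "assistant_answer", "key": "conversations[1].content"},
    {"name": "user_prompt", "key": "conversations[0].content"},
]
variables = {spec["name"]: lookup(sample, spec["key"]) for spec in inputs}

prompt = render(
    "Reformat the answer in a markdown format without adding anything else:\n"
    "{{ assistant_answer }}",
    variables,
)

# Placeholder for the LLM output; a real run would call the processor here.
variables["formatted_answer"] = "**placeholder**"

output_schema = {
    "conversations": [
        {"role": "user", "content": "{{ user_prompt }}"},
        {"role": "assistant", "content": "{{ formatted_answer }}"},
    ]
}
record = fill_schema(output_schema, variables)
print(prompt)
print(record)
```

The `modalities` field is left out of the sketch on purpose: how a non-string value such as a list is carried through a `"{{ modalities }}"` placeholder depends on the project's writer, which this diff does not show.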

configs/config_mock.yaml

Lines changed: 2 additions & 8 deletions
@@ -4,7 +4,7 @@ processors:
 model_path: Qwen/Qwen3-4B-Instruct-2507
 tp_size: 1
 disable_custom_all_reduce: true
-sampling_params:
+default_sampling_params:
 temperature: 0.1
 top_p: 0.9
 max_new_tokens: 1024
@@ -17,15 +17,9 @@ loading_params:
 - path: tests/mock_data/data.jsonl
 type: JSONL
 output_dir: tests/output/data
-- path:
-train: tests/mock_data/data2/train.jsonl
-test: tests/mock_data/data2/test.jsonl
-type: JSONL
-output_dir: tests/output/data2
 
 num_shards: 4
 shard_id: 0
-conversations_field: "conversations"
 batch_size: 64
 
 processing_params:
@@ -46,7 +40,7 @@ processing_params:
 {{ text }}
 ```
 
-remove_columns: True
+remove_columns: true
 output_schema:
 conversations:
 - role: "user"
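
Reading note on `num_shards` and `shard_id`: this config fixes them to `4` and `0`, while the README examples pass `"$SLURM_ARRAY_TASK_COUNT"` and `"$SLURM_ARRAY_TASK_ID"`. A minimal sketch of how such values could be resolved and used to pick one shard's rows follows. Both helpers are hypothetical, and the strided split is an assumption made for illustration; the diff does not show how MMIRAGE actually partitions a dataset.

```python
import os


def resolve_int(value, environ=None):
    """Turn 4, "4", or "$SLURM_ARRAY_TASK_COUNT" into an integer.

    A leading '$' is treated as an environment variable reference
    (an assumption about how the config placeholders are expanded).
    """
    env = os.environ if environ is None else environ
    if isinstance(value, str) and value.startswith("$"):
        value = env[value[1:]]
    return int(value)


def shard_rows(rows, num_shards, shard_id):
    """Return the rows owned by one shard, using a strided split (assumed)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError(f"shard_id {shard_id} outside [0, {num_shards})")
    return rows[shard_id::num_shards]


# Simulated job-array environment so the sketch runs outside SLURM.
fake_env = {"SLURM_ARRAY_TASK_COUNT": "4", "SLURM_ARRAY_TASK_ID": "1"}
num_shards = resolve_int("$SLURM_ARRAY_TASK_COUNT", fake_env)
shard_id = resolve_int("$SLURM_ARRAY_TASK_ID", fake_env)

rows = list(range(10))
print(shard_rows(rows, num_shards, shard_id))

# With fewer rows than shards, some shards are empty and must be tolerated.
print(shard_rows([0, 1], num_shards, 3))
```

The empty-shard case is worth keeping in mind because the commit message mentions changes "to handle empty shards"; any downstream step should accept a shard with no rows.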

configs/config_mock_vision.yaml

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+processors:
+- type: llm
+server_args:
+model_path: Qwen/Qwen3-VL-8B-Instruct
+tp_size: 1
+trust_remote_code: true
+chat_template: qwen2-vl # Chat template for vision-language models
+default_sampling_params:
+temperature: 0.1
+top_p: 0.9
+max_new_tokens: 512
+
+loading_params:
+datasets:
+- path: tests/mock_data_vision/data.jsonl
+type: JSONL
+output_dir: tests/output/data_vision
+image_base_path: tests/mock_data_vision # Base directory where images are stored
+
+num_shards: 4
+shard_id: 0
+batch_size: 1
+
+processing_params:
+inputs:
+- name: image_input
+key: image
+type: image
+
+outputs:
+- name: caption
+type: llm
+output_type: plain
+prompt: |
+Describe what you see in this image in one concise sentence.
+
+remove_columns: false
+output_schema:
+image: "{{ image_input }}"
+caption: "{{ caption }}"
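
Reading note on `image_base_path`: the config comment describes it as the base directory where images are stored, and the README says image inputs may be URLs or file paths. One plausible resolution rule is sketched below. `resolve_image_ref` is a hypothetical helper, the example file name is invented, and the precedence (URLs and absolute paths left untouched, relative paths joined to the base) is an assumption rather than the project's confirmed behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_image_ref(ref, image_base_path=None):
    """Resolve an image reference taken from a dataset row (assumed rule).

    - http(s) URLs are returned unchanged.
    - Absolute paths are returned unchanged.
    - Relative paths are joined to image_base_path when one is given.
    """
    if urlparse(ref).scheme in ("http", "https"):
        return ref
    path = PurePosixPath(ref)
    if path.is_absolute() or image_base_path is None:
        return str(path)
    return str(PurePosixPath(image_base_path) / path)


# 'images/cat.png' is an invented file name used only for illustration.
print(resolve_image_ref("images/cat.png", "tests/mock_data_vision"))
print(resolve_image_ref("https://example.com/cat.png", "tests/mock_data_vision"))
print(resolve_image_ref("/data/cat.png", "tests/mock_data_vision"))
```

`PurePosixPath` keeps the sketch platform independent; code that opens the files would more likely use `pathlib.Path` and check that the resolved file exists before sending it to the model.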

pyproject.toml

Lines changed: 6 additions & 4 deletions
@@ -4,8 +4,8 @@ build-backend = "hatchling.build"
 
 [project]
 name = "mmirage"
-version = "0.1.2"
-description = "Advanced platform designed to streamline the processing of datasets using generative models."
+version = "0.1.3"
+description = "Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine - Advanced platform for processing datasets using generative models including vision-language models."
 readme = "README.md"
 requires-python = ">=3.10"
 authors = [{ name = "Meditron team" }]
@@ -35,7 +35,9 @@ dependencies = [
 "fsspec",
 "dacite>=1.6.0",
 "pydantic>=2.12",
-"jmespath"
+"jmespath",
+"jinja2>=3.0.0",
+"pillow>=9.0.0",
 ]
 
 [project.optional-dependencies]
@@ -51,4 +53,4 @@ dev = [
 packages = ["src/mmirage"]
 
 [tool.hatch.build.targets.sdist]
-include = ["src/mirage/**", "pyproject.toml", "README.md"]
+include = ["src/mmirage/**", "pyproject.toml", "README.md"]
