* first version supporting image modalities and ready to test
* changes to handle cases where image is not in the column of the dataset
* added demo dataset to test
* small change
* fixed sglang image error
* fixed image error
* changes to handle empty shards and test with 32 nodes
* maybe this time
* fix output
* trying to fix strange error
* change to use map correctly and prompt
* not fully working
* using image token
* removed useless code
* last changes before PR
* Update README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update src/mirage/shard_process.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update src/mirage/utils.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fixed small error for lower python version
* Update src/mirage/utils.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Make chat template configurable for multimodal models (#5)
* Initial plan
* Make chat template configurable for multimodal models
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Update documentation for configurable chat template
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Address code review feedback: improve chat template validation and inference
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Improve chat template inference and add early validation
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Remove infer_chat_template method, make chat_template explicit in config
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Improve error message and simplify comment
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Optimize batch processing for both text-only and multimodal samples (#6)
* Initial plan
* Optimize batch processing by separating text-only and multimodal samples
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Optimize chat template validation to run once per batch
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* Enable batched processing for multimodal samples
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: qchapp <74377782+qchapp@users.noreply.github.com>
* added exception and error handling for llm shutdown
* Update src/mirage/utils.py
Co-authored-by: Guillaume Boyé <guillaume.boye@epfl.ch>
* rework to resolve conflicts with main
* Rename project to MMIRAGE (Modular Multimodal)
* fixed imports
* added vision tests
* trying smaller images
* reverted some commits
* added error handling
* trying to fix multimodalities
* changed file path handling
* ready for PR and tests working
* making ready for the merge
* fixed example config
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Guillaume Boyé <guillaume.boye@epfl.ch>
-MMIRAGE, which stands for Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine, is an advanced platform designed to streamline the processing of datasets using generative models. It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
+MMIRAGE, which stands for **M**odular **M**ultimodal **I**ntelligent **R**eformatting and **A**ugmentation **G**eneration **E**ngine, is an advanced platform designed to streamline the processing of datasets using generative models, including vision-language models (VLMs). It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
-To install the library, you can clone it from GitHub and then use pip to install it directly. It is recommended to have already installed `torch` and `sglang` to take advantage of GPU acceleration.
-
-```bash
-git clone git@github.com:EPFLiGHT/MIRAGE.git
-pip install -e ./MIRAGE
-```
-
-For testing and scripts that make use of the library, it is advised to create a .env file. You can do this by running the following command:
-```bash
-curl https://raw.githubusercontent.com/EPFLiGHT/MIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh
-```
-
-
 ## Key features

-- Easily configurable with a YAML file which configure the following parameters
-  - The prompt to the LLM
-  - Variables with the name and their key to a JSON
-- Parallelizable with a multi-node support
-  - The training pipeline should use either distributed inference using accelerate
-- Support a variety of LLMs and VLMs (LLM only for a first version)
+- **Multimodal Support**: Process both text and images with vision-language models
+- Easily configurable with a YAML file which configures the following parameters:
+  - The prompt to the LLM (using Jinja2 templating)
+  - Variables with the name and their JMESPath key to a JSON
+  - Image inputs for multimodal processing
+- Parallelizable with multi-node support
+  - The training pipeline uses distributed inference with sharding
+- Support a variety of LLMs and VLMs (Vision-Language Models)
 - Support any dataset schemas (configurable with the YAML format)
-- The ability to either output a JSON (or any other structured format) or a plain text
+- The ability to either output a JSON (or any other structured format) or plain text
+- Modular architecture with pluggable processors, loaders, and writers

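The multi-node sharding listed among the features can be pictured with a minimal Python sketch. It assumes each job learns its shard index and the shard count from environment variables (the example configurations in this README use `$SLURM_ARRAY_TASK_ID` and `$SLURM_ARRAY_TASK_COUNT`) and that samples are split by a simple stride. `resolve_int` and `shard_indices` are hypothetical helpers written for illustration, not MMIRAGE's API, and the library's actual partitioning may differ.

```python
import os
from typing import Mapping


def resolve_int(value: str, env: Mapping[str, str] = os.environ) -> int:
    """Expand a "$VAR" placeholder from env (if present) and parse an integer."""
    raw = env[value[1:]] if value.startswith("$") else value
    return int(raw)


def shard_indices(num_samples: int, shard_id: int, num_shards: int) -> range:
    """Sample indices owned by one shard under a strided split (an assumption)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError(f"shard_id {shard_id} out of range for {num_shards} shards")
    # A shard whose id exceeds the sample count simply receives an empty range.
    return range(shard_id, num_samples, num_shards)


# Simulated SLURM array environment: job 1 out of 4.
fake_env = {"SLURM_ARRAY_TASK_ID": "1", "SLURM_ARRAY_TASK_COUNT": "4"}
shard_id = resolve_int("$SLURM_ARRAY_TASK_ID", fake_env)
num_shards = resolve_int("$SLURM_ARRAY_TASK_COUNT", fake_env)
print(list(shard_indices(10, shard_id, num_shards)))  # prints [1, 5, 9]
```

With this split every sample belongs to exactly one shard, so the per-shard outputs can be written independently and concatenated afterwards.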
 ## Example usage

-### Reformatting dataset
+### Text-only: Reformatting dataset

 Suppose you have a dataset with samples of the following format

 ```json
 {
     "conversations" : [{"role": "user", "content": "Describe the image"}, {"role": "assistant", "content": "This is a badly formmatted answer"}],
-    "modalities" : [<the images>]
+    "modalities" : ["<the images>"]
 }
 ```

-The dataset contains assistant answers that are badly formatted. The goal would be to use a LLM to format our answer in Markdown. With MMIRAGE, it would be as simple as defining a YAML configuration file.
-Then in the YAML configuration file, we could specify
+The dataset contains assistant answers that are badly formatted. The goal would be to use a LLM to format our answer in Markdown. With MMIRAGE, it would be as simple as defining a YAML configuration file:

 ```yaml
-inputs:
-  - name: assistant_answer
-    key: conversations[1].content
-  - name: user_prompt
-    key: conversations[0].content
-  - name: modalities
-    key: modalities
-
-outputs:
-  - name: formatted_answer
-    type: llm
-    output_type: plain
-    prompt: |
-      Reformat the answer in a markdown format without adding anything else:
-      {assistant_answer}
+processors:
+  - type: llm
+    server_args:
+      model_path: Qwen/Qwen3-8B
+      tp_size: 4
+      trust_remote_code: true
+    default_sampling_params:
+      temperature: 0.1
+      top_p: 1.0
+      max_new_tokens: 384
+
+loading_params:
+  datasets:
+    - path: /path/to/dataset
+      type: loadable
+      output_dir: /path/to/output/shards
+  num_shards: "$SLURM_ARRAY_TASK_COUNT"
+  shard_id: "$SLURM_ARRAY_TASK_ID"
+  batch_size: 64
+
+processing_params:
+  inputs:
+    - name: assistant_answer
+      key: conversations[1].content
+    - name: user_prompt
+      key: conversations[0].content
+    - name: modalities
+      key: modalities
+
+  outputs:
+    - name: formatted_answer
+      type: llm
+      output_type: plain
+      prompt: |
+        Reformat the answer in a markdown format without adding anything else:
+        {{ assistant_answer }}

-output_schema:
-  conversations:
-    - role: user
-      content: {user_prompt}
-    - role: assistant
-      content: {formatted_answer}
-  modalities: {modalities}
-
+  remove_columns: false
+  output_schema:
+    conversations:
+      - role: user
+        content: "{{ user_prompt }}"
+      - role: assistant
+        content: "{{ formatted_answer }}"
+    modalities: "{{ modalities }}"
 ```

 Configuration explanation:

-- `inputs`: specify variables that are defined from the input dataset. For instance by specifying the key `conversations[1].content`, we say that this variable corresponds to `sample["conversations"][1]["content"]`
-- `outputs`: specify variables that are created from the pipeline. We specify how the variable should be created:
-  - Here `formatted_answer` is created using a LLM prompt and is a plain text variable (as opposed to JSON variables)
-- `output_schema`: specify the output schema of the dataset. So each sample will follow this format. Here we know that each sample will contain 2 keys: `conversations` and `modalities`
+- `processors`: List of processor configurations. Currently supports `llm` type for LLM-based generation.
+- `loading_params`: Parameters for loading and sharding datasets.
+  - `datasets`: List of dataset configurations with path, type, and output directory.
+- `processing_params`:
+  - `inputs`: Variables extracted from the input dataset using JMESPath queries.
+  - `outputs`: Variables created by processors. Prompts use Jinja2 templating (`{{ variable }}`).
+  - `output_schema`: Defines the structure of output samples.

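To make the `inputs` and `outputs` steps concrete, here is a small standard-library sketch of what happens to one sample: a key such as `conversations[1].content` is resolved against the sample, and the resulting variable is substituted into a `{{ variable }}` prompt. The `lookup` and `render` functions are simplified, hypothetical stand-ins for JMESPath and Jinja2 (they cover only dotted names, `[index]` steps, and bare placeholders); they are not how MMIRAGE implements these steps.

```python
import re

# A key is a sequence of field names and [index] steps, e.g. conversations[1].content
_TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def lookup(sample, key: str):
    """Resolve a dotted/indexed key against nested dicts and lists.

    Only a small subset of JMESPath: field names and non-negative indices.
    """
    value = sample
    for name, index in _TOKEN.findall(key):
        value = value[name] if name else value[int(index)]
    return value


def render(template: str, variables: dict) -> str:
    """Replace each '{{ name }}' with str(variables[name]); a Jinja2 stand-in."""
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


sample = {
    "conversations": [
        {"role": "user", "content": "Describe the image"},
        {"role": "assistant", "content": "This is a badly formmatted answer"},
    ]
}
inputs = {
    "assistant_answer": "conversations[1].content",
    "user_prompt": "conversations[0].content",
}
variables = {name: lookup(sample, key) for name, key in inputs.items()}
prompt = render(
    "Reformat the answer in a markdown format without adding anything else:\n"
    "{{ assistant_answer }}",
    variables,
)
print(prompt)
```

In the real pipeline the rendered prompt would be sent to the configured processor, and its response would become the `formatted_answer` variable used by `output_schema`.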
-### Transforming datasets
+### Multimodal: Processing images with VLMs

-In the second example, we want to generate questions from plain text document. The 3 keys that we want to generate are:
+MMIRAGE supports multimodal processing with vision-language models:

-- "question"
-- "answer"
-- "explanation"
+```yaml
+processors:
+  - type: llm
+    server_args:
+      model_path: Qwen/Qwen2-VL-7B-Instruct
+      tp_size: 4
+      trust_remote_code: true
+    chat_template: qwen2-vl  # Required for VLMs
+    default_sampling_params:
+      temperature: 0.1
+      top_p: 0.95
+      max_new_tokens: 768
+
+loading_params:
+  datasets:
+    - path: /path/to/image/dataset
+      type: loadable
+      output_dir: /path/to/output/shards
+  num_shards: "$SLURM_ARRAY_TASK_COUNT"
+  shard_id: "$SLURM_ARRAY_TASK_ID"
+  batch_size: 32
+
+processing_params:
+  inputs:
+    - name: medical_image
+      key: image
+      type: image  # Mark as image input
+      image_base_path: /path/to/images  # Base directory for relative paths
+    - name: original_caption
+      key: caption
+      type: text
+
+  outputs:
+    - name: enhanced_caption
+      type: llm
+      output_type: plain
+      prompt: |
+        Describe the medical image in detail.
+        Original caption for context: {{ original_caption }}
+
+  remove_columns: false
+  output_schema:
+    image: "{{ medical_image }}"
+    caption: "{{ enhanced_caption }}"
+    original_caption: "{{ original_caption }}"
+```

-Suppose we have the following format:
+Key multimodal features:
+- `chat_template`: Specify the VLM chat template (e.g., `qwen2-vl`)
+- `type: image`: Mark input variables as images
+- `image_base_path`: Base directory for resolving relative image paths
+- Supports PIL Images, URLs, and file paths

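One plausible reading of how `image_base_path` interacts with the supported string inputs is sketched below: URLs and absolute paths are left alone, and only relative paths are joined to the base directory. `resolve_image_ref` is a hypothetical helper, POSIX-style paths are assumed, and in-memory PIL images (which need no resolution) are out of scope; MMIRAGE's real handling may differ.

```python
import posixpath
from urllib.parse import urlparse


def resolve_image_ref(value: str, image_base_path: str = "") -> str:
    """Return a loadable reference for an image given as a URL or a file path."""
    # Remote images: keep http(s) URLs untouched.
    if urlparse(value).scheme in ("http", "https"):
        return value
    # Absolute paths, or no base directory configured: nothing to resolve.
    if posixpath.isabs(value) or not image_base_path:
        return value
    # Relative path: interpret it relative to the configured base directory.
    return posixpath.join(image_base_path, value)


for ref in ("scans/0001.png", "/data/abs/0002.png", "https://example.org/0003.png"):
    print(resolve_image_ref(ref, "/path/to/images"))
```

Resolving paths once, before batching, keeps the dataset column free to store short relative paths while the processor always receives a reference it can open directly.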
-```json
-{
-    "text" : "This is a very interesting article about cancer"
-}
-```
+## Architecture

-```yaml
-inputs:
-  - name: plain_text
-    key: text
-
-outputs:
-  - name: output_dict
-    type: prompt
-    output_type: JSON
-    prompt: |
-      I want to generate Q/A pairs from the following text:
-      {plain_text}
-    output_schema:
-      - question
-      - explanation
-      - answer
-
-output_schema:
-  conversations:
-    - role: user
-      content: {question}
-    - role: assistant
-      content: |
-        {explanation}
-        Answer: {answer}
+MMIRAGE uses a modular architecture:

+```
+mmirage/
+├── config/          # Configuration loading and validation