EPFLiGHT
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 138 additions & 100 deletions
diff --git a/configs/config_mock.yaml b/configs/config_mock.yaml
Lines changed: 2 additions & 8 deletions
diff --git a/configs/config_mock_vision.yaml b/configs/config_mock_vision.yaml
Lines changed: 40 additions & 0 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 6 additions & 4 deletions
@@ -164,3 +164,7 @@ cython_debug/
 
 logs/
 else/
+
+# Test outputs
+tests/mock_data/output/
+tests/mock_data/shards/
@@ -1,6 +1,6 @@
 # MMIRAGE
 
-MMIRAGE, which stands for Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine, is an advanced platform designed to streamline the processing of datasets using generative models. It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
+MMIRAGE, which stands for **M**odular **M**ultimodal **I**ntelligent **R**eformatting and **A**ugmentation **G**eneration **E**ngine, is an advanced platform designed to streamline the processing of datasets using generative models, including vision-language models (VLMs). It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
 
 ## How to install
 
@@ -11,137 +11,175 @@ git clone git@github.com:EPFLiGHT/MMIRAGE.git
 pip install -e ./MMIRAGE
 ```
 
-For testing and scripts that make use of the library, it is advised to create a .env file. You can do this by running the following command:
+For testing and scripts that make use of the library, it is advised to create a .env file:
 ```bash
-curl https://raw.githubusercontent.com/EPFLiGHT/MMIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh
+./scripts/generate_env.sh
 ```
 
-
-## How to install
-
-To install the library, you can clone it from GitHub and then use pip to install it directly. It is recommended to have already installed `torch` and `sglang` to take advantage of GPU acceleration.
-
-```bash
-git clone git@github.com:EPFLiGHT/MIRAGE.git
-pip install -e ./MIRAGE
-```
-
-For testing and scripts that make use of the library, it is advised to create a .env file. You can do this by running the following command:
-```bash
-curl https://raw.githubusercontent.com/EPFLiGHT/MIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh
-```
-
-
 ## Key features
 
-- Easily configurable with a YAML file which configure the following parameters
-    - The prompt to the LLM
-    - Variables with the name and their key to a JSON
-- Parallelizable with a multi-node support
-    - The training pipeline should use either distributed inference using accelerate 
-- Support a variety of LLMs and VLMs (LLM only for a first version)
+- **Multimodal Support**: Process both text and images with vision-language models
+- Easily configurable with a YAML file which configures the following parameters:
+    - The prompt to the LLM (using Jinja2 templating)
+    - Variables, each given a name and a JMESPath key into the JSON sample
+    - Image inputs for multimodal processing
+- Parallelizable with multi-node support
+    - The training pipeline uses distributed inference with sharding
+- Support a variety of LLMs and VLMs (Vision-Language Models)
 - Support any dataset schemas (configurable with the YAML format)
-- The ability to either output a JSON (or any other structured format) or a plain text
+- The ability to either output a JSON (or any other structured format) or plain text
+- Modular architecture with pluggable processors, loaders, and writers
 
 ## Example usage
 
-### Reformatting dataset
+### Text-only: Reformatting dataset
 
 Suppose you have a dataset with samples of the following format
 
 ```json
 { 
     "conversations" : [{"role": "user", "content": "Describe the image"}, {"role": "assistant", "content": "This is a badly formmatted answer"}],
-    "modalities" : [<the images>]
+    "modalities" : ["<the images>"]
 }
 ```
 
-The dataset contains assistant answers that are badly formatted. The goal would be to use a LLM to format our answer in Markdown. With MMIRAGE, it would be as simple as defining a YAML configuration file.
-Then in the YAML configuration file, we could specify
+The dataset contains assistant answers that are badly formatted. The goal is to use an LLM to reformat each answer in Markdown. With MMIRAGE, this only requires a YAML configuration file:
 
 ```yaml
-inputs:
-  - name: assistant_answer
-    key: conversations[1].content
-  - name: user_prompt
-    key: conversations[0].content
-  - name: modalities
-    key: modalities
-
-outputs:
-  - name: formatted_answer
-    type: llm
-    output_type: plain
-    prompt: | 
-      Reformat the answer in a markdown format without adding anything else:
-      {assistant_answer}
+processors:
+  - type: llm
+    server_args:
+      model_path: Qwen/Qwen3-8B
+      tp_size: 4
+      trust_remote_code: true
+    default_sampling_params:
+      temperature: 0.1
+      top_p: 1.0
+      max_new_tokens: 384
+
+loading_params:
+  datasets:
+    - path: /path/to/dataset
+      type: loadable
+      output_dir: /path/to/output/shards
+  num_shards: "$SLURM_ARRAY_TASK_COUNT"
+  shard_id: "$SLURM_ARRAY_TASK_ID"
+  batch_size: 64
+
+processing_params:
+  inputs:
+    - name: assistant_answer
+      key: conversations[1].content
+    - name: user_prompt
+      key: conversations[0].content
+    - name: modalities
+      key: modalities
+
+  outputs:
+    - name: formatted_answer
+      type: llm
+      output_type: plain
+      prompt: | 
+        Reformat the answer in a markdown format without adding anything else:
+        {{ assistant_answer }}
       
-output_schema:
-  conversations:
-    - role: user
-      content: {user_prompt}
-    - role: assistant
-      content: {formatted_answer}
-  modalities: {modalities}
-
+  remove_columns: false
+  output_schema:
+    conversations:
+      - role: user
+        content: "{{ user_prompt }}"
+      - role: assistant
+        content: "{{ formatted_answer }}"
+    modalities: "{{ modalities }}"
 ```
 
 Configuration explanation:
 
-- `inputs`: specify variables that are defined from the input dataset. For instance by specifying the key `conversations[1].content`, we say that this variable corresponds to `sample["conversations"][1]["content"]`
-- `outputs`: specify variables that are created from the pipeline. We specify how the variable should be created: 
-    - Here `formatted_answer` is created using a LLM prompt and is a plain text variable (as opposed to JSON variables)
-- `output_schema`: specify the output schema of the dataset. So each sample will follow this format. Here we know that each sample will contain 2 keys: `conversations` and `modalities`
+- `processors`: List of processor configurations. Currently supports `llm` type for LLM-based generation.
+- `loading_params`: Parameters for loading and sharding datasets.
+  - `datasets`: List of dataset configurations with path, type, and output directory.
+- `processing_params`:
+  - `inputs`: Variables extracted from the input dataset using JMESPath queries.
+  - `outputs`: Variables created by processors. Prompts use Jinja2 templating (`{{ variable }}`).
+  - `output_schema`: Defines the structure of output samples.
 
-### Transforming datasets
+### Multimodal: Processing images with VLMs
 
-In the second example, we want to generate questions from plain text document. The 3 keys that we want to generate are:
+MMIRAGE supports multimodal processing with vision-language models:
 
-- "question"
-- "answer"
-- "explanation"
+```yaml
+processors:
+  - type: llm
+    server_args:
+      model_path: Qwen/Qwen2-VL-7B-Instruct
+      tp_size: 4
+      trust_remote_code: true
+    chat_template: qwen2-vl  # Required for VLMs
+    default_sampling_params:
+      temperature: 0.1
+      top_p: 0.95
+      max_new_tokens: 768
+
+loading_params:
+  datasets:
+    - path: /path/to/image/dataset
+      type: loadable
+      output_dir: /path/to/output/shards
+  num_shards: "$SLURM_ARRAY_TASK_COUNT"
+  shard_id: "$SLURM_ARRAY_TASK_ID"
+  batch_size: 32
+
+processing_params:
+  inputs:
+    - name: medical_image
+      key: image
+      type: image  # Mark as image input
+      image_base_path: /path/to/images  # Base directory for relative paths
+    - name: original_caption
+      key: caption
+      type: text
+
+  outputs:
+    - name: enhanced_caption
+      type: llm
+      output_type: plain
+      prompt: |
+        Describe the medical image in detail.
+        Original caption for context: {{ original_caption }}
+        
+  remove_columns: false
+  output_schema:
+    image: "{{ medical_image }}"
+    caption: "{{ enhanced_caption }}"
+    original_caption: "{{ original_caption }}"
+```
 
-Suppose we have the following format:
+Key multimodal features:
+- `chat_template`: Specify the VLM chat template (e.g., `qwen2-vl`)
+- `type: image`: Mark input variables as images
+- `image_base_path`: Base directory for resolving relative image paths
+- Supports PIL Images, URLs, and file paths
 
-```json
-{
-    "text" : "This is a very interesting article about cancer"
-}
-```
+## Architecture
 
-```yaml
-inputs:
-  - name: plain_text
-    key: text
-    
-outputs:
-  - name: output_dict
-    type: prompt
-    output_type: JSON
-    prompt: | 
-      I want to generate Q/A pairs from the following text:
-      {plain_text}
-    output_schema:
-      - question
-      - explanation
-      - answer
-        
-output_schema:
-  conversations:
-    - role: user
-      content: {question}
-    - role: assistant
-      content: |
-        {explanation}
-        Answer: {answer}
+MMIRAGE uses a modular architecture:
 
+```
+mmirage/
+├── config/           # Configuration loading and validation
+├── core/
+│   ├── loader/       # Dataset loaders (JSONL, HuggingFace)
+│   ├── process/      # Processors (LLM, etc.) and variable system
+│   │   └── processors/
+│   │       └── llm/  # LLM processor with multimodal support
+│   └── writer/       # Output rendering with Jinja2
+├── shard_process.py  # Main processing script
+└── merge_shards.py   # Shard merging utility
 ```
 
-Here, we choose to output a JSON answer with 3 keys ("question", "explanation" and "answer"). That we will match
-
-## Usefool tools
+## Useful tools
 
-- Jinja2 to process the YAML: #[link](https://jinja.palletsprojects.com/en/stable/)
-- JMESPath: #[link](https://jmespath.org/)
-- SGLang: #[link](https://github.com/sgl-project/sglang)
-- Paper for performance drom: #[link](https://arxiv.org/abs/2408.02442)
+- Jinja2 for template processing: [link](https://jinja.palletsprojects.com/en/stable/)
+- JMESPath for JSON queries: [link](https://jmespath.org/)
+- SGLang for fast inference: [link](https://github.com/sgl-project/sglang)
+- Performance paper: [link](https://arxiv.org/abs/2408.02442)
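
The README hunk above describes `inputs` as variables extracted with JMESPath queries, for example `conversations[1].content`. The sketch below illustrates what such a key selects from a sample. It is a hand-rolled lookup covering only dotted names and `[index]` steps, not the `jmespath` library that MMIRAGE depends on, and the `lookup` helper is a hypothetical name used for illustration only.

```python
import re

# Hypothetical helper: handles only dotted names and [index] steps,
# a small subset of what the real JMESPath library supports.
_TOKEN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(-?\d+)\]")


def lookup(sample, key):
    """Follow a path such as 'conversations[1].content' through nested dicts and lists."""
    current = sample
    for name, index in _TOKEN.findall(key):
        if name:
            current = current[name]        # dictionary step
        else:
            current = current[int(index)]  # list step
    return current


# Sample shaped like the one in the README example.
sample = {
    "conversations": [
        {"role": "user", "content": "Describe the image"},
        {"role": "assistant", "content": "This is a badly formatted answer"},
    ],
    "modalities": ["<the images>"],
}

assistant_answer = lookup(sample, "conversations[1].content")
user_prompt = lookup(sample, "conversations[0].content")
modalities = lookup(sample, "modalities")
```

In other words, the key `conversations[1].content` plays the role of `sample["conversations"][1]["content"]`, and the extracted values become the variables referenced as `{{ assistant_answer }}` and `{{ user_prompt }}` in the prompt and output schema.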
@@ -4,7 +4,7 @@ processors:
       model_path: Qwen/Qwen3-4B-Instruct-2507
       tp_size: 1
       disable_custom_all_reduce: true
-    sampling_params:
+    default_sampling_params:
       temperature: 0.1
       top_p: 0.9
       max_new_tokens: 1024
@@ -17,15 +17,9 @@ loading_params:
     - path: tests/mock_data/data.jsonl
       type: JSONL
       output_dir: tests/output/data
-    - path: 
-        train: tests/mock_data/data2/train.jsonl
-        test: tests/mock_data/data2/test.jsonl
-      type: JSONL
-      output_dir: tests/output/data2
 
   num_shards: 4
   shard_id: 0
-  conversations_field: "conversations"
   batch_size: 64
 
 processing_params:
@@ -46,7 +40,7 @@ processing_params:
         {{ text }}
         ```
   
-  remove_columns: True
+  remove_columns: true
   output_schema:
     conversations:
       - role: "user"
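
This config pins `num_shards: 4` and `shard_id: 0`, while the README example passes `"$SLURM_ARRAY_TASK_COUNT"` and `"$SLURM_ARRAY_TASK_ID"`. How MMIRAGE expands these values and assigns samples to shards is not shown in this diff, so the following is only a sketch under two assumptions: environment references are expanded before integer parsing, and each shard receives a contiguous slice. Both `resolve_int` and `shard_indices` are hypothetical names.

```python
import os


def resolve_int(value):
    """Expand a '$VAR' reference from the environment, then parse the result as an int."""
    return int(os.path.expandvars(str(value)))


def shard_indices(num_samples, num_shards, shard_id):
    """Return the sample indices of one shard, assuming a contiguous, near-even split."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    base, extra = divmod(num_samples, num_shards)
    # The first `extra` shards each take one additional sample.
    start = shard_id * base + min(shard_id, extra)
    stop = start + base + (1 if shard_id < extra else 0)
    return range(start, stop)


# Stand-ins for the values a SLURM job array would export.
os.environ["SLURM_ARRAY_TASK_COUNT"] = "4"
os.environ["SLURM_ARRAY_TASK_ID"] = "0"

num_shards = resolve_int("$SLURM_ARRAY_TASK_COUNT")
shard_id = resolve_int("$SLURM_ARRAY_TASK_ID")
first_shard = list(shard_indices(10, num_shards, shard_id))
all_indices = [i for s in range(num_shards) for i in shard_indices(10, num_shards, s)]
```

With a split of this kind, every sample lands in exactly one shard, which is the property a later merge step would rely on.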
 
@@ -0,0 +1,40 @@
+processors:
+  - type: llm
+    server_args:
+      model_path: Qwen/Qwen3-VL-8B-Instruct
+      tp_size: 1
+      trust_remote_code: true
+    chat_template: qwen2-vl  # Chat template for vision-language models
+    default_sampling_params:
+      temperature: 0.1
+      top_p: 0.9
+      max_new_tokens: 512
+
+loading_params:
+  datasets:
+    - path: tests/mock_data_vision/data.jsonl
+      type: JSONL
+      output_dir: tests/output/data_vision
+      image_base_path: tests/mock_data_vision  # Base directory where images are stored
+
+  num_shards: 4
+  shard_id: 0
+  batch_size: 1
+
+processing_params:
+  inputs:
+    - name: image_input
+      key: image
+      type: image
+
+  outputs:
+    - name: caption
+      type: llm
+      output_type: plain
+      prompt: |
+        Describe what you see in this image in one concise sentence.
+  
+  remove_columns: false
+  output_schema:
+    image: "{{ image_input }}"
+    caption: "{{ caption }}"
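
The `image_base_path` entry above is documented as the base directory where images are stored, and the README mentions support for URLs and file paths. The sketch below shows one plausible way such references could be resolved. It is an assumption about the behavior, not MMIRAGE's implementation, and `resolve_image_ref` and the sample file names are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_image_ref(value, image_base_path):
    """Join relative file paths onto the base directory; leave URLs and absolute paths unchanged."""
    if urlparse(value).scheme in ("http", "https"):
        return value
    path = PurePosixPath(value)
    if path.is_absolute():
        return str(path)
    return str(PurePosixPath(image_base_path) / path)


base = "tests/mock_data_vision"
local = resolve_image_ref("images/sample_001.jpg", base)         # hypothetical relative path
remote = resolve_image_ref("https://example.com/cat.png", base)  # hypothetical URL
absolute = resolve_image_ref("/data/images/scan.png", base)      # hypothetical absolute path
```

Under this reading, a JSONL record only needs to store a path relative to `image_base_path`, which keeps the mock dataset portable across machines.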
@@ -4,8 +4,8 @@ build-backend = "hatchling.build"
 
 [project]
 name = "mmirage"
-version = "0.1.2"
-description = "Advanced platform designed to streamline the processing of datasets using generative models."
+version = "0.1.3"
+description = "Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine - Advanced platform for processing datasets using generative models including vision-language models."
 readme = "README.md"
 requires-python = ">=3.10"
 authors = [{ name = "Meditron team" }]
@@ -35,7 +35,9 @@ dependencies = [
   "fsspec",
   "dacite>=1.6.0",
   "pydantic>=2.12",
-  "jmespath"
+  "jmespath",
+  "jinja2>=3.0.0",
+  "pillow>=9.0.0",
 ]
 
 [project.optional-dependencies]
@@ -51,4 +53,4 @@ dev = [
 packages = ["src/mmirage"]
 
 [tool.hatch.build.targets.sdist]
-include = ["src/mirage/**", "pyproject.toml", "README.md"]
+include = ["src/mmirage/**", "pyproject.toml", "README.md"]