Feature/llm api support by legstar67 · Pull Request #32 · EPFLiGHT/MMIRAGE

legstar67 · 2026-03-26T05:58:03Z

Usage:
First set the environment variable: export OPENAI_API_KEY="YOUR_OPENAI_API_KEY_HERE"
(Note: replace YOUR_OPENAI_API_KEY_HERE with your key, which looks like sk-...)

3 CLI commands:

Submit:

(You can run this concurrently across multiple shards)
python src/mmirage/shard_process.py --config <yaml_config>

Check Status:

(Accepts wildcards to check all shards/modalities at once)
python src/mmirage/core/process/batch/status_checker.py --config <yaml_config>

Collect & Merge:

(Aggregates all receipts and perfectly reconstructs the original dataset)
python src/mmirage/core/process/batch/collector.py --config <yaml_config> --output-path <final_dataset.jsonl>
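
A minimal sketch of the merge idea, not the collector's actual code. It assumes every dataset row has an `id` and every retrieved result carries the matching `custom_id`, the field the OpenAI Batch API uses to correlate outputs with inputs (batch output order is not guaranteed to match input order). The function name `merge_results` and the `llm_output` field are hypothetical.

```python
def merge_results(rows, results, output_field="llm_output"):
    """Attach each result to the row with the same id, keeping the row order."""
    # Index results by custom_id so that arrival order does not matter.
    by_id = {r["custom_id"]: r["content"] for r in results}
    merged = []
    for row in rows:
        out = dict(row)  # copy, so the input rows stay untouched
        out[output_field] = by_id.get(row["id"])  # None when no result arrived
        merged.append(out)
    return merged


rows = [{"id": "s0-0", "text": "a"}, {"id": "s0-1", "text": "b"}]
# Results deliberately arrive in the opposite order.
results = [
    {"custom_id": "s0-1", "content": "B"},
    {"custom_id": "s0-0", "content": "A"},
]
merged = merge_results(rows, results)
```

Keying on an id rather than on position is what lets results from several shards and several batches be merged in any order.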

Note on batch receipts:

Set metadata_output_path to a base filename (for example: tests/output/batch_metadata.jsonl).
Submission writes suffixed receipts such as batch_metadata.text.<run>.jsonl and
batch_metadata.multimodal.<run>.jsonl. Receiver tools expand the base path into
these files automatically when --metadata-path is omitted.
To select receipts explicitly, pass the optional --metadata-path argument; it accepts a '*' wildcard, for example: PATH_TO_FILE/batch_metadata.*.jsonl
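
The base-path expansion described above could look roughly like the following. This is an illustrative sketch: `expand_receipt_paths` is a hypothetical helper, and the real receiver tools may resolve the files differently.

```python
import tempfile
from pathlib import Path


def expand_receipt_paths(base_path):
    """Return the suffixed receipt files that belong to a base receipt path."""
    base = Path(base_path)
    # batch_metadata.jsonl -> batch_metadata.*.jsonl
    pattern = f"{base.stem}.*{base.suffix}"
    return sorted(base.parent.glob(pattern))


# Demo in a temporary directory with two receipts and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "batch_metadata.text.run1.jsonl",
        "batch_metadata.multimodal.run1.jsonl",
        "other.jsonl",
    ):
        (Path(tmp) / name).touch()
    found = [p.name for p in expand_receipt_paths(Path(tmp) / "batch_metadata.jsonl")]
```

The pattern matches only files that have a suffix between the base name and the extension, so the base file name itself is never picked up.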

Example of Usage

Submit:
python src/mmirage/shard_process.py --config configs/config_mock_openai_batch_vision.yaml
Check Status:
python src/mmirage/core/process/batch/status_checker.py --config configs/config_mock_openai_batch_vision.yaml
Collect & Merge:
python src/mmirage/core/process/batch/collector.py --config configs/config_mock_openai_batch_vision.yaml --output-path tests/output/final_dataset.jsonl

…ta logging + integrate the byte-chunking engine into the main llmprocessor + add a buffer over the iterations of map() to bridge the row-batch parameter of the function and the API byte limit without sending non-full batches to the API + add metadata receipt generation to collect output asynchronously later
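
The buffering idea in this commit message can be sketched as follows. This is an assumption-laden illustration, not the PR's orchestrator: the class name `ByteBudgetBuffer`, its methods, and the 100-byte limit are hypothetical, and size is measured as the JSON-encoded request plus one newline byte.

```python
import json


class ByteBudgetBuffer:
    """Buffer requests across calls and submit only when the byte budget is full."""

    def __init__(self, max_bytes, submit):
        self.max_bytes = max_bytes
        self.submit = submit  # callable that receives a list of requests
        self._pending = []
        self._pending_bytes = 0

    def add(self, request):
        # Size of the request as one JSONL line (+1 for the newline).
        size = len(json.dumps(request).encode("utf-8")) + 1
        if size > self.max_bytes:
            raise ValueError("single request exceeds the byte limit")
        # Flush first if this request would overflow the current chunk.
        if self._pending and self._pending_bytes + size > self.max_bytes:
            self.flush()
        self._pending.append(request)
        self._pending_bytes += size

    def flush(self):
        """Submit whatever is pending; call once more after the last map() batch."""
        if self._pending:
            self.submit(self._pending)
            self._pending = []
            self._pending_bytes = 0


submitted = []
buf = ByteBudgetBuffer(max_bytes=100, submit=submitted.append)

# Two simulated map() calls; each request serializes to 42 bytes.
for batch in (["r0", "r1", "r2"], ["r3", "r4"]):
    for rid in batch:
        buf.add({"custom_id": rid, "body": "x" * 10})
buf.flush()  # final flush, analogous to a finalize step after map()

chunk_ids = [[r["custom_id"] for r in chunk] for chunk in submitted]
```

In this run the second chunk holds `r2` and `r3`, which come from different map() calls, so the row batch size and the API byte limit are decoupled.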

Copilot

Pull request overview

Adds provider-batch (OpenAI Batch API) support to MMIRAGE, enabling asynchronous submission, status polling, and later result collection/merge, and wires this into the LLM processor pipeline.

Changes:

Introduces a provider-agnostic batch submission layer (adapter contracts, chunking, orchestration, registry) plus an OpenAI Batch adapter.
Adds receiver-side CLIs/utilities to check batch status and collect/merge results from receipt JSONL metadata.
Integrates “batch mode” into LLMProcessor and adds extensive unit/integration test coverage plus example configs.
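
As a rough illustration of what a provider-agnostic adapter contract with a registry and environment-variable credential fallback might look like. Every name here (`BatchAdapter`, `register`, `create_adapter`, `FakeAdapter`, `FAKE_API_KEY`) is hypothetical and not taken from the PR's code.

```python
import os
from abc import ABC, abstractmethod


class BatchAdapter(ABC):
    """Contract every provider adapter is assumed to implement."""

    @abstractmethod
    def submit(self, requests): ...

    @abstractmethod
    def status(self, batch_id): ...

    @abstractmethod
    def retrieve(self, batch_id): ...


_REGISTRY = {}


def register(name, env_var):
    """Class decorator that records an adapter and its credential env var."""
    def decorator(cls):
        _REGISTRY[name] = (cls, env_var)
        return cls
    return decorator


def create_adapter(name, api_key=None):
    """Build an adapter, falling back to the environment for credentials."""
    cls, env_var = _REGISTRY[name]
    key = api_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"missing credentials: pass api_key or set {env_var}")
    return cls(key)


@register("fake", env_var="FAKE_API_KEY")
class FakeAdapter(BatchAdapter):
    """In-memory stand-in for a real provider, for illustration only."""

    def __init__(self, api_key):
        self.api_key = api_key
        self._batches = {}

    def submit(self, requests):
        batch_id = f"batch_{len(self._batches)}"
        self._batches[batch_id] = list(requests)
        return batch_id

    def status(self, batch_id):
        return "completed"

    def retrieve(self, batch_id):
        return [
            {"custom_id": r["custom_id"], "content": "ok"}
            for r in self._batches[batch_id]
        ]


adapter = create_adapter("fake", api_key="test-key")
batch_id = adapter.submit([{"custom_id": "a"}])
```

With this shape, calling code depends only on the three abstract methods, so adding another provider means registering one more class.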

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
tests/test_openai_batch_adapter.py	Unit tests for OpenAI adapter request building, submission, status, retrieval.
tests/test_integration_receiver.py	Integration test for receipt reading + merge output writing.
tests/test_integration_batch_pipeline.py	Integration coverage for batch-mode pipeline and chunk flushing behavior.
tests/test_batch_status_checker.py	Tests for status checker parsing/dedup + output formatting.
tests/test_batch_orchestrator.py	Tests for orchestrator buffering, flush reasons, metadata writing.
tests/test_batch_collector.py	Tests for collector merge behavior and plain-text fallback.
tests/test_batch_chunking.py	Tests for byte/request-count chunking and oversized request policies.
tests/test_batch_adapter_contracts.py	Tests for adapter ABC compliance + registry credential handling.
src/mmirage/shard_process.py	Adds lifecycle flush (`finalize_processors`) after dataset `.map()`.
src/mmirage/core/process/processors/llm/llm_processor.py	Adds batch-provider mode (no SGLang init), request submission + finalize flush.
src/mmirage/core/process/processors/llm/config.py	Adds `batch_provider` config field; defaults TP size from env.
src/mmirage/core/process/mapper.py	Adds `finalize_processors()` hook to call processor `finalize()`.
src/mmirage/core/process/batch/status_checker.py	New CLI/util to poll batch statuses from receipt JSONL.
src/mmirage/core/process/batch/registry.py	New adapter registry + factory with credential validation/env fallback.
src/mmirage/core/process/batch/orchestrator.py	New stateful orchestrator to buffer across `.map()` calls and persist receipts.
src/mmirage/core/process/batch/openai_adapter.py	New OpenAI Batch adapter implementation.
src/mmirage/core/process/batch/collector.py	New CLI/util to retrieve results and merge back into output JSONL.
src/mmirage/core/process/batch/chunking.py	New provider-agnostic request chunker by byte/request limits.
src/mmirage/core/process/batch/adapter.py	New adapter contracts + normalized submission result model.
src/mmirage/core/process/batch/__init__.py	Exports batch subsystem public API.
src/mmirage/config/openai_batch.py	New OpenAI-specific batch config type.
src/mmirage/config/batch_provider.py	New provider-agnostic batch config contract and retry policy.
configs/config_mock_openai_batch_vision.yaml	Example config for OpenAI batch vision flow.
configs/config_mock_openai_batch.yaml	Example config for OpenAI batch text/structured output flow.

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.

legstar67 added 14 commits March 11, 2026 04:59

Draft : implementation OpenAI support

56e925c

continuation and correction of the OpenAI API support

f99a690

cleaning

7189b3e

small modifications and cleaning

141d214

reset from the main

7811496

implementation of config classes and agnostic-provider classes

8e085a3

pytests for previous implementations

f26f804

OpenAI API implementation

2003d34

unit tests for OpenAI implementation

95ae28c

chunking by byte implementation + its unit tests

4dd199f

unit tests + corrections

77e39e4

implementation of the status checker , and the collector + unit tests…

a5f9b62

… + config mock for text and vision + some corrections

correction on the image handling + tests

b32b4ae

legstar67 requested a review from Copilot March 26, 2026 05:58

Copilot started reviewing on behalf of legstar67 March 26, 2026 05:58

Copilot AI reviewed Mar 26, 2026

legstar67 and others added 7 commits March 26, 2026 17:11

correct the problem Copilot spotted about a global index which is not…

c55efaf

… global

correction of problem of sample id spotted by copilot

fead12e

correction of the file naming, spotted by copilot, to allow sharding

cead89b

add an import

03eed23

Update src/mmirage/core/process/batch/collector.py

d60e337

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'feature/llm-api-support' of github.com:EPFLiGHT/MMIRAGE…

46f5e65

… into feature/llm-api-support

copilot suggestion

c642744

legstar67 requested a review from Copilot March 26, 2026 17:51

Copilot started reviewing on behalf of legstar67 March 26, 2026 17:52

Copilot AI reviewed Mar 26, 2026

legstar67 added 2 commits March 27, 2026 01:00

template application changed + small correction

c7d4bac

suggestion by Copilot, mostly changing the CLI arguement to accept li…

c5e75a2

…st of files

legstar67 requested a review from Copilot March 27, 2026 01:49

legstar67 added 21 commits May 2, 2026 15:07

abstract method added in BaseProcessor

7ff6e10

small update on the completion_window of openai

c9c43e9

adjust adapter.py thanks to the comments of @fabnemEPFL

69beb2f

make --metadata_path optional as already in the config file + tests

ddca511

use of a batch receipt base path

473c906

solve problem of parsing the openai output

92219ac

upgrade of status checker according to the comments

16b270a

modif according to comments : optimization + deduplication

af013e5

new class BatchMetadataRecord instead of using dict

b9ff47c

log, tracking and syntax modif following the comments

bfad50e

small modif following copilot's comments

41d59f2

syntax improved according to comments

eaf089c

correction of use of a dict instead of the config class

91be3de

cleaning

08a1dfa

improve of openai adapter following the comments

69d01f1

improvement of openai adapter following the comments

e81d1a3

improvement of openai adapter following the comments

c32d633

syntax improved following the comments + test recommended by a copilo…

0298f76

…t comment

working version of llm api provider feature

4d4f8f1

Merge branch 'main' into feature/llm-api-support

f23bc70

small correction

645110a

legstar67 temporarily deployed to docker May 12, 2026 11:40 — with GitHub Actions Inactive

Merge branch 'main' into feature/llm-api-support

a8e5680

fabnemEPFL temporarily deployed to docker-ci May 12, 2026 14:19 — with GitHub Actions Inactive

fabnemEPFL deployed to docker-ci May 12, 2026 14:19 — with GitHub Actions Active

fabnemEPFL marked this pull request as ready for review May 12, 2026 15:38

fabnemEPFL approved these changes May 12, 2026

fabnemEPFL merged commit f1a3068 into main May 12, 2026
3 checks passed

fabnemEPFL merged 48 commits into main from feature/llm-api-support

3 participants
