Feature/llm api support #32
Merged
…ta logging + integrate the byte-chunking engine into the main llmprocessor + add a buffer over the iteration of map() to bridge the function's row-batch parameter and the API byte limit without sending non-full batches to the API + add metadata receipt generation to collect output asynchronously later
… + config mock for text and vision + some corrections
Pull request overview
Adds provider-batch (OpenAI Batch API) support to MMIRAGE, enabling asynchronous submission, status polling, and later result collection/merge, and wires this into the LLM processor pipeline.
Changes:
- Introduces a provider-agnostic batch submission layer (adapter contracts, chunking, orchestration, registry) plus an OpenAI Batch adapter.
- Adds receiver-side CLIs/utilities to check batch status and collect/merge results from receipt JSONL metadata.
- Integrates “batch mode” into `LLMProcessor` and adds extensive unit/integration test coverage plus example configs.
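
As a rough illustration of the chunking idea in the list above, here is a minimal sketch of splitting requests by a byte limit and a request-count limit. The function name `chunk_requests`, its parameters, and the "raise on oversized request" policy are assumptions made for this example; they are not taken from `chunking.py`.

```python
import json
from typing import Iterable, Iterator


def chunk_requests(
    requests: Iterable[dict],
    max_bytes: int,
    max_requests: int,
) -> Iterator[list[dict]]:
    """Group requests into chunks that respect a byte limit and a count limit.

    Sizes are measured on the JSONL encoding (one JSON object per line).
    An oversized single request raises here; skipping it would be another policy.
    """
    chunk: list[dict] = []
    chunk_bytes = 0
    for request in requests:
        size = len(json.dumps(request).encode("utf-8")) + 1  # +1 for the newline
        if size > max_bytes:
            raise ValueError(f"request of {size} bytes exceeds limit {max_bytes}")
        # Close the current chunk before it would exceed either limit.
        if chunk and (chunk_bytes + size > max_bytes or len(chunk) >= max_requests):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(request)
        chunk_bytes += size
    if chunk:
        yield chunk
```

Each yielded chunk could then be handed to a provider adapter as one batch file; the real chunker may measure sizes or handle oversized requests differently.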
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/test_openai_batch_adapter.py | Unit tests for OpenAI adapter request building, submission, status, retrieval. |
| tests/test_integration_receiver.py | Integration test for receipt reading + merge output writing. |
| tests/test_integration_batch_pipeline.py | Integration coverage for batch-mode pipeline and chunk flushing behavior. |
| tests/test_batch_status_checker.py | Tests for status checker parsing/dedup + output formatting. |
| tests/test_batch_orchestrator.py | Tests for orchestrator buffering, flush reasons, metadata writing. |
| tests/test_batch_collector.py | Tests for collector merge behavior and plain-text fallback. |
| tests/test_batch_chunking.py | Tests for byte/request-count chunking and oversized request policies. |
| tests/test_batch_adapter_contracts.py | Tests for adapter ABC compliance + registry credential handling. |
| src/mmirage/shard_process.py | Adds lifecycle flush (finalize_processors) after dataset .map(). |
| src/mmirage/core/process/processors/llm/llm_processor.py | Adds batch-provider mode (no SGLang init), request submission + finalize flush. |
| src/mmirage/core/process/processors/llm/config.py | Adds batch_provider config field; defaults TP size from env. |
| src/mmirage/core/process/mapper.py | Adds finalize_processors() hook to call processor finalize(). |
| src/mmirage/core/process/batch/status_checker.py | New CLI/util to poll batch statuses from receipt JSONL. |
| src/mmirage/core/process/batch/registry.py | New adapter registry + factory with credential validation/env fallback. |
| src/mmirage/core/process/batch/orchestrator.py | New stateful orchestrator to buffer across .map() calls and persist receipts. |
| src/mmirage/core/process/batch/openai_adapter.py | New OpenAI Batch adapter implementation. |
| src/mmirage/core/process/batch/collector.py | New CLI/util to retrieve results and merge back into output JSONL. |
| src/mmirage/core/process/batch/chunking.py | New provider-agnostic request chunker by byte/request limits. |
| src/mmirage/core/process/batch/adapter.py | New adapter contracts + normalized submission result model. |
| src/mmirage/core/process/batch/__init__.py | Exports batch subsystem public API. |
| src/mmirage/config/openai_batch.py | New OpenAI-specific batch config type. |
| src/mmirage/config/batch_provider.py | New provider-agnostic batch config contract and retry policy. |
| configs/config_mock_openai_batch_vision.yaml | Example config for OpenAI batch vision flow. |
| configs/config_mock_openai_batch.yaml | Example config for OpenAI batch text/structured output flow. |
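
The buffering and finalize-flush behavior summarized for `orchestrator.py`, `mapper.py`, and `shard_process.py` can be sketched roughly as follows. This is a simplified stand-in under stated assumptions: the class `BufferingSubmitter`, the `submit` callback, and the receipt fields are hypothetical names for illustration, not the PR's actual API.

```python
import json
from pathlib import Path
from typing import Callable


class BufferingSubmitter:
    """Hypothetical stand-in for the orchestrator: buffers rows across calls."""

    def __init__(
        self,
        submit: Callable[[list[dict]], str],
        max_bytes: int,
        receipt_path: Path,
    ) -> None:
        self._submit = submit  # sends one chunk and returns a batch id
        self._max_bytes = max_bytes
        self._receipt_path = receipt_path
        self._buffer: list[dict] = []
        self._buffer_bytes = 0

    def add_rows(self, rows: list[dict]) -> None:
        """Called once per .map() batch; flushes only when the buffer is full."""
        for row in rows:
            size = len(json.dumps(row).encode("utf-8")) + 1
            if self._buffer and self._buffer_bytes + size > self._max_bytes:
                self._flush("byte_limit")
            self._buffer.append(row)
            self._buffer_bytes += size

    def finalize(self) -> None:
        """Called after the last .map() batch to send the remaining rows."""
        if self._buffer:
            self._flush("finalize")

    def _flush(self, reason: str) -> None:
        batch_id = self._submit(self._buffer)
        receipt = {
            "batch_id": batch_id,
            "num_requests": len(self._buffer),
            "flush_reason": reason,
        }
        # Append one receipt per submitted chunk so results can be collected later.
        with self._receipt_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(receipt) + "\n")
        self._buffer, self._buffer_bytes = [], 0
```

The point of the design is that the `.map()` batch size and the provider's byte limit stay decoupled: small row batches accumulate until a chunk is full, and only the final, possibly partial chunk is sent by `finalize()`.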
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… into feature/llm-api-support
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.
fabnemEPFL
approved these changes
May 12, 2026
Usage:

First, set the environment variable:

    export OPENAI_API_KEY="YOUR_OPENAI_API_KEY_HERE"

(Note: replace YOUR_OPENAI_API_KEY_HERE with your key, which looks like sk-...)

3 CLI commands:

You can run this concurrently across multiple shards:

    python src/mmirage/shard_process.py --config <yaml_config>

Accepts wildcards to check all shards/modalities at once:

    python src/mmirage/core/process/batch/status_checker.py --config <yaml_config>

Aggregates all receipts and reconstructs the original dataset:

    python src/mmirage/core/process/batch/collector.py --config <yaml_config> --output-path <final_dataset.jsonl>

Note on the batch receipts:

Set `metadata_output_path` to a base filename (for example: `tests/output/batch_metadata.jsonl`). Submission writes suffixed receipts like `batch_metadata.text.<run>.jsonl` and `batch_metadata.multimodal.<run>.jsonl`. Receiver tools expand the base path into these files automatically when `--metadata-path` is omitted.

With the optional argument `--metadata-path`, you can use `*` to select all receipt files, for example: `PATH_TO_FILE/batch_metadata.*.jsonl`

Example of usage

Submit:

    python src/mmirage/shard_process.py --config configs/config_mock_openai_batch_vision.yaml

Check Status:

    python src/mmirage/core/process/batch/status_checker.py --config configs/config_mock_openai_batch_vision.yaml

Collect & Merge:

    python src/mmirage/core/process/batch/collector.py --config configs/config_mock_openai_batch_vision.yaml --output-path tests/output/final_dataset.jsonl
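
For readers wondering how a base `metadata_output_path` could map to the suffixed receipt files described in the note on batch receipts, here is a minimal sketch using a glob pattern. The helper name `expand_receipt_paths` is an assumption for illustration; the receiver tools may implement the expansion differently.

```python
from pathlib import Path


def expand_receipt_paths(base_path: str) -> list[Path]:
    """Find suffixed receipt files that belong to a base metadata path.

    For base "out/batch_metadata.jsonl" this matches files such as
    "out/batch_metadata.text.<run>.jsonl" in the same directory.
    """
    base = Path(base_path)
    pattern = f"{base.stem}.*{base.suffix}"  # e.g. "batch_metadata.*.jsonl"
    return sorted(base.parent.glob(pattern))
```

This is the same pattern shape as the `batch_metadata.*.jsonl` wildcard that can be passed explicitly via `--metadata-path`.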