Feature/llm api support #32

Merged
fabnemEPFL merged 48 commits into main from feature/llm-api-support on May 12, 2026
Conversation

Contributor

@legstar67 legstar67 commented Mar 26, 2026

Usage:
First set the environment variable: export OPENAI_API_KEY="YOUR_OPENAI_API_KEY_HERE"
(Note: replace YOUR_OPENAI_API_KEY_HERE with your key, which looks like sk-...)
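
A missing key is easier to diagnose before any batch is submitted. The following is a minimal sketch of a fail-fast check, not MMIRAGE's own credential handling; `require_api_key` is a hypothetical helper introduced only for illustration.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before submitting batches")
    return key
```

Calling `require_api_key()` at the top of a wrapper script stops the run early instead of failing partway through a submission.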

3 CLI commands:

  1. Submit:

(You can run this concurrently across multiple shards)
python src/mmirage/shard_process.py --config <yaml_config>

  2. Check Status:

(Accepts wildcards to check all shards/modalities at once)
python src/mmirage/core/process/batch/status_checker.py --config <yaml_config>

  3. Collect & Merge:

(Aggregates all receipts and perfectly reconstructs the original dataset)
python src/mmirage/core/process/batch/collector.py --config <yaml_config> --output-path <final_dataset.jsonl>

Note on batch receipts:

Set metadata_output_path to a base filename (for example: tests/output/batch_metadata.jsonl).
Submission writes suffixed receipts like batch_metadata.text.<run>.jsonl and
batch_metadata.multimodal.<run>.jsonl. Receiver tools expand the base path into
these files automatically when --metadata-path is omitted.
To select receipts explicitly, pass the optional --metadata-path argument. It accepts a '*' wildcard to match all receipt files, for example: PATH_TO_FILE/batch_metadata.*.jsonl
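
The base-path expansion described in this note can be pictured as a glob over the base filename's stem and extension. This is only a sketch under that assumption; `expand_receipts` is a hypothetical name, and the receiver tools in this PR may resolve receipts differently.

```python
from pathlib import Path


def expand_receipts(base_path):
    """List receipt files that share the base path's stem and extension.

    For a base of 'out/batch_metadata.jsonl' the pattern is
    'batch_metadata.*.jsonl', which matches suffixed receipts such as
    'batch_metadata.text.<run>.jsonl' but not the base filename itself.
    """
    base = Path(base_path)
    pattern = f"{base.stem}.*{base.suffix}"
    return sorted(base.parent.glob(pattern))
```

The same pattern is what the explicit wildcard form of --metadata-path spells out by hand.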


Example usage

  1. Submit:
    python src/mmirage/shard_process.py --config configs/config_mock_openai_batch_vision.yaml

  2. Check Status:
    python src/mmirage/core/process/batch/status_checker.py --config configs/config_mock_openai_batch_vision.yaml

  3. Collect & Merge:
    python src/mmirage/core/process/batch/collector.py --config configs/config_mock_openai_batch_vision.yaml --output-path tests/output/final_dataset.jsonl
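
For scripted runs, the three commands above can be assembled from Python. The sketch below assumes the scripts are invoked from the repository root and report failure through a non-zero exit code, which the PR description does not state; `build_commands` and `run_step` are hypothetical helpers. Because batches complete asynchronously, the status step may need to be repeated before collecting.

```python
import subprocess
import sys


def build_commands(config, output_path, python=sys.executable):
    """Return the submit, status, and collect invocations in that order."""
    batch_dir = "src/mmirage/core/process/batch"
    return [
        [python, "src/mmirage/shard_process.py", "--config", config],
        [python, f"{batch_dir}/status_checker.py", "--config", config],
        [python, f"{batch_dir}/collector.py", "--config", config,
         "--output-path", output_path],
    ]


def run_step(command):
    """Run one step; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

A caller would run `run_step(cmds[0])` once per shard, then `run_step(cmds[1])` until the batches have finished, and finally `run_step(cmds[2])`.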

Contributor

Copilot AI left a comment

Pull request overview

Adds provider-batch (OpenAI Batch API) support to MMIRAGE, enabling asynchronous submission, status polling, and later result collection/merge, and wires this into the LLM processor pipeline.

Changes:

  • Introduces a provider-agnostic batch submission layer (adapter contracts, chunking, orchestration, registry) plus an OpenAI Batch adapter.
  • Adds receiver-side CLIs/utilities to check batch status and collect/merge results from receipt JSONL metadata.
  • Integrates “batch mode” into LLMProcessor and adds extensive unit/integration test coverage plus example configs.
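
The chunking step mentioned in the first bullet can be illustrated with a greedy grouping under a byte limit and a request-count limit. This is a sketch only, not the code in chunking.py: `chunk_requests` is a hypothetical function, request size is assumed to be the UTF-8 length of one JSONL line, and an oversized request simply raises, whereas the PR mentions configurable oversized-request policies.

```python
import json


def chunk_requests(requests, max_bytes, max_requests):
    """Greedily group JSON-serializable requests under both limits."""
    chunks, current, current_bytes = [], [], 0
    for req in requests:
        # Size of the request as one JSONL line, newline included.
        size = len(json.dumps(req).encode("utf-8")) + 1
        if size > max_bytes:
            raise ValueError(f"request of {size} bytes exceeds max_bytes={max_bytes}")
        # Start a new chunk when adding this request would break either limit.
        if current and (current_bytes + size > max_bytes
                        or len(current) >= max_requests):
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(req)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks
```

Greedy grouping keeps request order intact, which helps when results later have to be merged back by position or identifier.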

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/test_openai_batch_adapter.py | Unit tests for OpenAI adapter request building, submission, status, retrieval. |
| tests/test_integration_receiver.py | Integration test for receipt reading + merge output writing. |
| tests/test_integration_batch_pipeline.py | Integration coverage for batch-mode pipeline and chunk flushing behavior. |
| tests/test_batch_status_checker.py | Tests for status checker parsing/dedup + output formatting. |
| tests/test_batch_orchestrator.py | Tests for orchestrator buffering, flush reasons, metadata writing. |
| tests/test_batch_collector.py | Tests for collector merge behavior and plain-text fallback. |
| tests/test_batch_chunking.py | Tests for byte/request-count chunking and oversized request policies. |
| tests/test_batch_adapter_contracts.py | Tests for adapter ABC compliance + registry credential handling. |
| src/mmirage/shard_process.py | Adds lifecycle flush (finalize_processors) after dataset .map(). |
| src/mmirage/core/process/processors/llm/llm_processor.py | Adds batch-provider mode (no SGLang init), request submission + finalize flush. |
| src/mmirage/core/process/processors/llm/config.py | Adds batch_provider config field; defaults TP size from env. |
| src/mmirage/core/process/mapper.py | Adds finalize_processors() hook to call processor finalize(). |
| src/mmirage/core/process/batch/status_checker.py | New CLI/util to poll batch statuses from receipt JSONL. |
| src/mmirage/core/process/batch/registry.py | New adapter registry + factory with credential validation/env fallback. |
| src/mmirage/core/process/batch/orchestrator.py | New stateful orchestrator to buffer across .map() calls and persist receipts. |
| src/mmirage/core/process/batch/openai_adapter.py | New OpenAI Batch adapter implementation. |
| src/mmirage/core/process/batch/collector.py | New CLI/util to retrieve results and merge back into output JSONL. |
| src/mmirage/core/process/batch/chunking.py | New provider-agnostic request chunker by byte/request limits. |
| src/mmirage/core/process/batch/adapter.py | New adapter contracts + normalized submission result model. |
| src/mmirage/core/process/batch/__init__.py | Exports batch subsystem public API. |
| src/mmirage/config/openai_batch.py | New OpenAI-specific batch config type. |
| src/mmirage/config/batch_provider.py | New provider-agnostic batch config contract and retry policy. |
| configs/config_mock_openai_batch_vision.yaml | Example config for OpenAI batch vision flow. |
| configs/config_mock_openai_batch.yaml | Example config for OpenAI batch text/structured output flow. |
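
Several entries above describe buffering requests across .map() calls and flushing the remainder through a finalize hook. The sketch below shows that pattern in isolation. It is an assumption about the general shape, not the orchestrator in this PR: `BufferedSubmitter` and its `submit` callable are hypothetical, and receipt persistence is left out.

```python
class BufferedSubmitter:
    """Collects requests across calls and submits them in fixed-size groups."""

    def __init__(self, submit, max_requests):
        self._submit = submit              # callable taking a list of requests
        self._max_requests = max_requests
        self._buffer = []

    def add(self, requests):
        """Buffer requests; flush whenever the buffer reaches the limit."""
        for req in requests:
            self._buffer.append(req)
            if len(self._buffer) >= self._max_requests:
                self.flush()

    def flush(self):
        """Submit the buffered requests, if any, and empty the buffer."""
        if self._buffer:
            self._submit(list(self._buffer))
            self._buffer.clear()

    def finalize(self):
        """Submit whatever is left; call once after the last .map() batch."""
        self.flush()
```

Without the final `finalize()` call, a trailing partial group would stay in memory and never be submitted, which is the gap a lifecycle flush after .map() closes.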

Comment thread src/mmirage/core/process/processors/llm/llm_processor.py Outdated
Comment thread src/mmirage/core/process/processors/llm/llm_processor.py Outdated
Comment thread src/mmirage/core/process/processors/llm/llm_processor.py
Comment thread src/mmirage/core/process/processors/llm/llm_processor.py
Comment thread src/mmirage/shard_process.py Outdated
Comment thread src/mmirage/core/process/batch/collector.py Outdated
Comment thread src/mmirage/core/process/batch/collector.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.

Comment thread src/mmirage/core/process/processors/llm/llm_processor.py
Comment thread src/mmirage/core/process/processors/llm/llm_processor.py Outdated
Comment thread src/mmirage/core/process/processors/llm/llm_processor.py Outdated
Comment thread src/mmirage/core/process/processors/llm/llm_processor.py Outdated
Comment thread src/mmirage/core/process/processors/llm/config.py
Comment thread src/mmirage/core/process/processors/llm/llm_processor.py
@legstar67 legstar67 requested a review from Copilot March 27, 2026 01:49
@fabnemEPFL fabnemEPFL deployed to docker-ci May 12, 2026 14:19 — with GitHub Actions Active
@fabnemEPFL fabnemEPFL marked this pull request as ready for review May 12, 2026 15:38
@fabnemEPFL fabnemEPFL merged commit f1a3068 into main May 12, 2026
3 checks passed
3 participants