-
Notifications
You must be signed in to change notification settings - Fork 3
Feature/llm api support #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 26 commits
Commits
Show all changes
48 commits
Select commit
Hold shift + click to select a range
56e925c
Draft : implementation OpenAI support
legstar67 f99a690
continuation and correction of the OpenAI API support
legstar67 7189b3e
cleaning
legstar67 141d214
small modifications and cleaning
legstar67 7811496
reset from the main
legstar67 8e085a3
implementation of config classes and agnostic-provider classes
legstar67 f26f804
pytests for previous implementations
legstar67 2003d34
OpenAI API implementation
legstar67 95ae28c
unit tests for OpenAI implementation
legstar67 4dd199f
chunking by byte implementation + its unit tests
legstar67 6964db8
implementation of the provider-agnostic batch orchestrator and metada…
legstar67 77e39e4
unit tests + corrections
legstar67 a5f9b62
implementation of the status checker , and the collector + unit tests…
legstar67 b32b4ae
correction on the image handling + tests
legstar67 c55efaf
correct the problem Copilot spotted about a global index which is not…
legstar67 fead12e
correction of problem of sample id spotted by copilot
legstar67 cead89b
correction of the file naming, spotted by copilot, to allow sharding
legstar67 03eed23
add an import
legstar67 d60e337
Update src/mmirage/core/process/batch/collector.py
legstar67 46f5e65
Merge branch 'feature/llm-api-support' of github.com:EPFLiGHT/MMIRAGE…
legstar67 c642744
copilot suggestion
legstar67 c7d4bac
template application changed + small correction
legstar67 c5e75a2
suggestion by Copilot, mostly changing the CLI arguement to accept li…
legstar67 1ce4e2c
small correction, to respect backward compatibility
legstar67 30b52b8
small correction on the config for vision
legstar67 71f9d6f
big improvement : making the 3 steps of the pipeline really provider …
legstar67 7ff6e10
abstract method added in BaseProcessor
legstar67 c9c43e9
small update on the completion_window of openai
legstar67 69beb2f
adjust adapter.py thanks to the comments of @fabnemEPFL
legstar67 ddca511
make --metadata_path optional as already in the config file + tests
legstar67 473c906
use of a batch receipt base path
legstar67 92219ac
solve problem of parsing the openai output
legstar67 16b270a
upgrade of status checker according to the comments
legstar67 af013e5
modif according to comments : optimization + deduplication
legstar67 b9ff47c
new class BatchMetadataRecord instead of using dict
legstar67 bfad50e
log, tracking and syntax modif following the comments
legstar67 41d59f2
small modif following copilot's comments
legstar67 eaf089c
syntax improved according to comments
legstar67 91be3de
correction of use of a dict instead of the config class
legstar67 08a1dfa
cleaning
legstar67 69d01f1
improve of openai adapter following the comments
legstar67 e81d1a3
improvement of openai adapter following the comments
legstar67 c32d633
improvement of openai adapter following the comments
legstar67 0298f76
syntax improved following the comments + test recommended by a copilo…
legstar67 4d4f8f1
working version of llm api provider feature
legstar67 f23bc70
Merge branch 'main' into feature/llm-api-support
legstar67 645110a
small correction
legstar67 a8e5680
Merge branch 'main' into feature/llm-api-support
fabnemEPFL File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| processors: | ||
| - type: llm | ||
| server_args: | ||
| model_path: Qwen/Qwen3-4B-Instruct-2507 | ||
| tp_size: 1 | ||
| disable_custom_all_reduce: true | ||
| default_sampling_params: | ||
| temperature: 0.1 | ||
| top_p: 0.9 | ||
| max_new_tokens: 1024 | ||
| custom_params: | ||
| chat_template_kwargs: | ||
| enable_thinking: false | ||
| batch_provider: | ||
| enabled: true | ||
| provider: openai | ||
| model: gpt-4o-mini | ||
| max_chunk_bytes: 52428800 | ||
| metadata_output_path: tests/output/batch_metadata.jsonl | ||
| credentials: | ||
| api_key: | ||
|
|
||
| loading_params: | ||
| datasets: | ||
| - path: tests/mock_data/data.jsonl | ||
| type: JSONL | ||
| output_dir: tests/output/data_openai_batch | ||
|
|
||
| num_shards: 1 | ||
| shard_id: 0 | ||
| batch_size: 64 | ||
|
|
||
| processing_params: | ||
| inputs: | ||
| - name: text | ||
| key: text | ||
|
|
||
| outputs: | ||
| - name: formatted_answer | ||
| type: llm | ||
| output_type: JSON | ||
| output_schema: | ||
| - question | ||
| - answer | ||
| prompt: | | ||
| Generate one question and its corresponding answer using the following text: | ||
| ``` | ||
| {{ text }} | ||
| ``` | ||
|
|
||
| remove_columns: true | ||
| output_schema: | ||
| conversations: | ||
| - role: "user" | ||
| content: "{{ formatted_answer.question }}" | ||
| - role: "assistant" | ||
| content: "{{ formatted_answer.answer }}" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| processors: | ||
| - type: llm | ||
| server_args: | ||
| model_path: Qwen/Qwen3-VL-8B-Instruct | ||
| tp_size: 1 | ||
| trust_remote_code: true | ||
| chat_template: qwen2-vl | ||
| default_sampling_params: | ||
| temperature: 0.1 | ||
| top_p: 0.9 | ||
| max_new_tokens: 512 | ||
| batch_provider: | ||
| enabled: true | ||
| provider: openai | ||
| model: gpt-4o-mini | ||
| metadata_output_path: tests/output/batch_metadata.jsonl | ||
| credentials: | ||
| api_key: "" | ||
|
|
||
| loading_params: | ||
| datasets: | ||
| - path: tests/mock_data_vision/data.jsonl | ||
| type: JSONL | ||
| output_dir: tests/output/data_openai_batch_vision | ||
| image_base_path: tests/mock_data_vision | ||
|
|
||
| num_shards: 1 | ||
| shard_id: 0 | ||
| batch_size: 1 | ||
|
|
||
| processing_params: | ||
| inputs: | ||
| - name: image_input | ||
| key: image | ||
| type: image | ||
|
|
||
| outputs: | ||
| - name: caption | ||
| type: llm | ||
| output_type: plain | ||
| prompt: | | ||
| Describe what you see in this image in one concise sentence. | ||
|
|
||
| remove_columns: false | ||
| output_schema: | ||
| image: "{{ image_input }}" | ||
| caption: "{{ caption }}" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| """Provider-agnostic batch configuration contracts. | ||
|
|
||
| This module defines the shared configuration shape used by any future batch | ||
| submission provider (OpenAI, Anthropic, etc.). | ||
| """ | ||
|
|
||
| from dataclasses import dataclass, field | ||
| from typing import Any, Dict, Literal, Optional | ||
|
|
||
|
|
||
| @dataclass | ||
| class BatchRetryPolicy: | ||
| """Retry behavior used by provider-neutral batch submission orchestration. | ||
|
|
||
| Attributes: | ||
| max_attempts: Maximum number of submission attempts for retryable errors. | ||
| initial_backoff_seconds: Delay before the first retry attempt. | ||
| backoff_multiplier: Multiplicative factor for subsequent retry delays. | ||
| """ | ||
|
|
||
| max_attempts: int = 3 | ||
| initial_backoff_seconds: float = 2.0 | ||
| backoff_multiplier: float = 2.0 | ||
|
|
||
| def __post_init__(self) -> None: | ||
| if self.max_attempts < 1: | ||
| raise ValueError("max_attempts must be >= 1") | ||
| if self.initial_backoff_seconds < 0: | ||
| raise ValueError("initial_backoff_seconds must be >= 0") | ||
| if self.backoff_multiplier < 1: | ||
| raise ValueError("backoff_multiplier must be >= 1") | ||
|
|
||
|
|
||
| @dataclass | ||
| class BatchProviderConfig: | ||
| """Shared contract for provider-specific batch configuration. | ||
|
|
||
| Concrete provider configs should inherit from this dataclass and extend it | ||
| with provider-specific settings. The fields here are intentionally provider | ||
| neutral so chunking/submission orchestration can run through one typed path. | ||
|
|
||
| Attributes: | ||
| provider: Provider identifier (for example, "openai" or "anthropic"). | ||
| enabled: Whether batch submission mode is enabled. | ||
| max_chunk_bytes: Maximum serialized request bytes per chunk. | ||
| Defaults to 50 MB. | ||
| max_requests_per_chunk: Optional hard cap on number of requests in a | ||
| chunk. If None, no request-count cap is enforced. | ||
| metadata_output_path: Path where submission metadata artifacts are saved. | ||
| retry_policy: Retry policy used by the shared batch layer. | ||
| oversized_request_policy: Handling policy when a single request exceeds | ||
| ``max_chunk_bytes``. ``isolate`` creates a dedicated oversized | ||
| chunk, while ``reject`` fails fast. | ||
| extras: Provider-specific knobs that do not belong in the shared fields. | ||
| credentials: Provider credentials required to submit chunks. | ||
| """ | ||
|
|
||
| provider: str | ||
| enabled: bool = True | ||
| max_chunk_bytes: int = 50 * 1024 * 1024 | ||
| max_requests_per_chunk: Optional[int] = None | ||
| metadata_output_path: str = "" | ||
| retry_policy: BatchRetryPolicy = field(default_factory=BatchRetryPolicy) | ||
| oversized_request_policy: Literal["isolate", "reject"] = "isolate" | ||
| extras: Dict[str, Any] = field(default_factory=dict) | ||
| credentials: Dict[str, str] = field(default_factory=dict) | ||
|
|
||
| def __post_init__(self) -> None: | ||
| self.provider = self.provider.strip().lower() | ||
|
|
||
| if not self.provider: | ||
| raise ValueError("provider must be a non-empty string") | ||
| if self.max_chunk_bytes < 1: | ||
| raise ValueError("max_chunk_bytes must be >= 1") | ||
| if self.max_requests_per_chunk is not None and self.max_requests_per_chunk < 1: | ||
| raise ValueError("max_requests_per_chunk must be >= 1 when provided") | ||
| if self.oversized_request_policy not in {"isolate", "reject"}: | ||
| raise ValueError("oversized_request_policy must be either 'isolate' or 'reject'") |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| """OpenAI-specific batch configuration.""" | ||
|
|
||
| from dataclasses import dataclass, field | ||
| from typing import Any, Dict, Literal, Optional | ||
|
|
||
| from mmirage.config.batch_provider import BatchProviderConfig | ||
|
|
||
|
|
||
| @dataclass | ||
| class OpenAIBatchConfig(BatchProviderConfig): | ||
| """OpenAI Batch API configuration. | ||
|
|
||
| Attributes: | ||
| provider: Fixed provider identifier for OpenAI. | ||
| model: Model name used in each chat completion request body. | ||
| batch_endpoint: Target endpoint used by OpenAI batch jobs. | ||
| completion_window: OpenAI completion window value. | ||
| base_url: Optional base URL, useful for API-compatible gateways. | ||
| metadata: Metadata sent on batch creation. | ||
| """ | ||
|
|
||
| provider: str = "openai" | ||
| model: str = "gpt-4.1-mini" | ||
| batch_endpoint: str = "/v1/chat/completions" | ||
| completion_window: Literal["24h"] = "24h" | ||
| base_url: Optional[str] = None | ||
| metadata: Dict[str, str] = field(default_factory=dict) | ||
|
|
||
| def __post_init__(self) -> None: | ||
| super().__post_init__() | ||
|
|
||
| if not self.model.strip(): | ||
| raise ValueError("model must be a non-empty string") | ||
| if not self.batch_endpoint.startswith("/"): | ||
| raise ValueError("batch_endpoint must start with '/'") | ||
|
|
||
| # Mirror OpenAI-specific fields into generic extras for provider-neutral consumers. | ||
| self.extras.setdefault("model", self.model) | ||
| self.extras.setdefault("batch_endpoint", self.batch_endpoint) | ||
| self.extras.setdefault("completion_window", self.completion_window) | ||
| if self.base_url: | ||
| self.extras.setdefault("base_url", self.base_url) | ||
| if self.metadata: | ||
| self.extras.setdefault("metadata", dict(self.metadata)) | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| """Provider-agnostic batch processing contracts and registry.""" | ||
|
|
||
| from mmirage.core.process.batch.adapter import BatchSubmissionAdapter, BatchSubmissionResult | ||
| from mmirage.core.process.batch.collector import collect_and_merge | ||
| from mmirage.core.process.batch.chunking import BatchRequestChunker, RequestChunk | ||
| from mmirage.core.process.batch.openai_adapter import OpenAIBatchAdapter | ||
| from mmirage.core.process.batch.orchestrator import BatchSubmissionOrchestrator | ||
| from mmirage.core.process.batch.registry import BatchAdapterFactory, BatchAdapterRegistry | ||
| from mmirage.core.process.batch.status_checker import ( | ||
| extract_unique_provider_batches, | ||
| run_status_checker, | ||
| ) | ||
| from mmirage.config.openai_batch import OpenAIBatchConfig | ||
|
|
||
| __all__ = [ | ||
| "BatchSubmissionAdapter", | ||
| "BatchSubmissionResult", | ||
| "collect_and_merge", | ||
| "BatchRequestChunker", | ||
| "RequestChunk", | ||
| "BatchSubmissionOrchestrator", | ||
| "OpenAIBatchAdapter", | ||
| "OpenAIBatchConfig", | ||
| "BatchAdapterFactory", | ||
| "BatchAdapterRegistry", | ||
| "extract_unique_provider_batches", | ||
| "run_status_checker", | ||
| ] |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.