Geti Tune training loop #5065
Conversation
Force-pushed from d279088 to 87c945b
Fine-tune the model on the training dataset, evaluate it and finally export it to OpenVINO format
Force-pushed from 87c945b to 76e05cf
Previously, if there was only one polygon, numpy would attempt a type conversion that interferes with Datumaro
  partial_inner_model = partial_model(cast("type[BaseModel]", new_field.annotation))  # type: ignore[arg-type]
  partial_fields[field_name] = (
-     partial_inner_model | None,
+     partial_inner_model | new_field.annotation | None,
I made this change to allow non-partial objects in the initialization of partial models' nested fields.
For example, the following code would previously raise an error, because PartialGlobalParameters only accepted PartialGlobalDatasetPreparationParameters but not GlobalDatasetPreparationParameters (the superclass).
params = PartialGlobalParameters(
    dataset_preparation=GlobalDatasetPreparationParameters(
        subset_split=SubsetSplit()
    )
)

@A-Artemis please review this change.
Thanks for the fix, but I still think we need to rework the TrainingConfiguration model. With our current implementation we have two definitions: a partial and a complete one. This complicates serializing the object for the endpoint, as well as defining a return type for the endpoint (and its API docs).
I think having PartialTrainingConfiguration is too much; instead, we should just use the complete training configuration and set its values to None where not in use (as distinct from the default value).
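For illustration, a minimal sketch of that alternative, assuming simplified pydantic models; the field definitions below are hypothetical and only reuse class names already mentioned in this thread:

```python
from pydantic import BaseModel

# Hypothetical, simplified models: one complete TrainingConfiguration whose
# sections are optional, so "not in use" is expressed as None rather than via a
# separate Partial* model.
class SubsetSplit(BaseModel):
    train: float = 0.7
    validation: float = 0.2
    test: float = 0.1

class GlobalDatasetPreparationParameters(BaseModel):
    subset_split: SubsetSplit | None = None

class TrainingConfiguration(BaseModel):
    dataset_preparation: GlobalDatasetPreparationParameters | None = None

# A single model type to serialize and to document as the endpoint's return type;
# unused sections simply stay None.
config = TrainingConfiguration(
    dataset_preparation=GlobalDatasetPreparationParameters(subset_split=SubsetSplit())
)
print(config.model_dump())
```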
Pull request overview
This PR implements the complete training pipeline for fine-tuning models using OpenVINO Training Extensions (OTX). The implementation adds the ability to train models on annotated datasets, evaluate their performance, and export them to OpenVINO format. Key infrastructure includes wrapping Datumaro datasets into OTX-compatible objects, orchestrating the training loop via OTXEngine, and persisting artifacts (checkpoints, metrics, exported models) to disk.
- Implements training orchestration using OTX's `OTXEngine`, `OTXDataModule`, and `OTXModel`
- Adds evaluation and OpenVINO export steps after training
- Updates configuration handling to support subset split ratios and OTX-specific format conversion
- Fixes data format issues (BGR→RGB conversion, ragged polygon arrays, Datumaro field types)
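
To make the flow concrete, here is a high-level sketch of the orchestration described above; it is not the actual `otx_trainer.py` code, the construction of `otx_model` and `otx_datamodule` is omitted, and the `train`/`test`/`export` method names are assumptions based on the OTX Engine API.

```python
# Sketch only: names come from this PR's diffs; method names/arguments are assumed.
otx_engine = OTXEngine(
    model=otx_model,            # OTXModel built from the selected architecture
    data=otx_datamodule,        # OTXDataModule wrapping the Datumaro subsets
    device=DeviceType.cpu,      # TODO in the PR: make the device configurable (#5040)
)
otx_engine.train()              # fine-tune on the training subset
metrics = otx_engine.test()     # evaluate the trained model
exported = otx_engine.export()  # export to OpenVINO format
# Artifacts (checkpoints, metrics, exported model) are then persisted to the model folder.
```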
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `application/backend/app/services/training/otx_trainer.py` | Core training pipeline implementation including dataset preparation, model training, evaluation, and export to OpenVINO format |
| `application/backend/tests/unit/services/training/test_otx_trainer.py` | Comprehensive test coverage for new training pipeline methods including configuration preparation, training, evaluation, and export |
| `application/backend/app/services/training_configuration_service.py` | Adds default subset split parameters to training configurations |
| `application/backend/app/services/datumaro_converter.py` | Fixes polygon array handling for instance segmentation and updates field types from `score_field` to `numeric_field` |
| `application/backend/app/workers/inference.py` | Converts frame data from BGR to RGB format before inference |
| `application/backend/app/stream/stream_data.py` | Documents frame data format as HWC BGR |
| `application/backend/app/services/active_model_service.py` | Changes default inference device from AUTO to CPU with explanation |
| `application/backend/tests/unit/services/test_datumaro_converter.py` | Updates test fixture to use valid triangle polygon |
| `application/backend/app/models/training_configuration/__init__.py` | Exports additional configuration classes |
| `application/backend/app/models/partial.py` | Fixes partial model generation to allow both partial and full nested models |
- sample = convert_sample(dataset_item, image_path, project_labels_ids)
+ try:
+     sample = convert_sample(dataset_item, image_path, project_labels_ids)
+ except:
Copilot AI (Dec 18, 2025):
Bare `except` clause catches all exceptions including `SystemExit` and `KeyboardInterrupt`. Use `except Exception:` to catch only application-level exceptions.
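For reference, a small sketch of the suggested fix; the logging call and the `continue` are hypothetical and assume this code sits inside the dataset-conversion loop shown in the diff:

```python
try:
    sample = convert_sample(dataset_item, image_path, project_labels_ids)
except Exception:
    # Only application-level errors are caught; SystemExit/KeyboardInterrupt still propagate.
    logger.exception("Failed to convert dataset item; skipping it")
    continue
```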
- dataset.append(sample)
+ try:
+     dataset.append(sample)
+ except:
Copilot AI (Dec 18, 2025):
Bare `except` clause catches all exceptions including `SystemExit` and `KeyboardInterrupt`. Use `except Exception:` to catch only application-level exceptions.
    self, training_config: dict, dataset_info: DatasetInfo, weights_path: Path, model_id: UUID
) -> tuple[Path, OTXEngine]:
    """Execute model training."""
    _ = weights_path  # TODO: use weights path to initialize model from pre-downloaded weights
Copilot AI (Dec 18, 2025):
The weights_path parameter is currently unused. Consider removing it from the method signature until the TODO is implemented, or implement the functionality now to avoid carrying unused parameters.
- _ = weights_path  # TODO: use weights path to initialize model from pre-downloaded weights
+ logger.info(
+     "Starting training with pre-downloaded weights path: {} (model_id={})",
+     weights_path,
+     model_id,
+ )
    if all_with_confidence
    else []
)
# Polygons array is ragged because the number of vertices is not fixed
This looks like an important conversion for polygons. Is this only for instance segmentation, or could it end up being used somewhere else?
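For context, a minimal numpy illustration of the single-polygon pitfall mentioned in the commit message above (the coordinates are hypothetical):

```python
import numpy as np

# Two polygons with different vertex counts: the collection is "ragged".
triangle = [0.0, 0.0, 1.0, 0.0, 0.5, 1.0]         # 3 vertices (6 coordinates)
quad = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]   # 4 vertices (8 coordinates)

# With a single polygon, np.array() silently builds a regular 2-D float array
# instead of an object array of per-polygon lists, which is not what Datumaro expects.
single = np.array([triangle])   # shape (1, 6), dtype float64

# Forcing dtype=object keeps one Python list per polygon regardless of vertex count.
ragged = np.empty(2, dtype=object)
ragged[:] = [triangle, quad]
```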
otx_engine = OTXEngine(
    model=otx_model,
    data=otx_datamodule,
    device=DeviceType.cpu,  # TODO make device configurable (#5040)
Probably best to leave this as `DeviceType.auto` if the TODO isn't implemented yet.
cfg_data = training_config["data"]
train_subset_cfg_data = cfg_data.get("train_subset")
val_subset_cfg_data = cfg_data.get("val_subset")
test_subset_cfg_data = cfg_data.get("test_subset")
train_subset_cfg_data["input_size"] = cfg_data["input_size"]
val_subset_cfg_data["input_size"] = cfg_data["input_size"]
test_subset_cfg_data["input_size"] = cfg_data["input_size"]
train_subset_sampler_cfg_data = train_subset_cfg_data.pop("sampler", {})
val_subset_sampler_cfg_data = val_subset_cfg_data.pop("sampler", {})
test_subset_sampler_cfg_data = test_subset_cfg_data.pop("sampler", {})
train_subset_config = SubsetConfig(
    sampler=SamplerConfig(**train_subset_sampler_cfg_data), **train_subset_cfg_data
)
val_subset_config = SubsetConfig(
    sampler=SamplerConfig(**val_subset_sampler_cfg_data), **val_subset_cfg_data
)
test_subset_config = SubsetConfig(
    sampler=SamplerConfig(**test_subset_sampler_cfg_data), **test_subset_cfg_data
)
train_subset_config.transforms = TorchVisionTransformLib.generate(train_subset_config)
val_subset_config.transforms = TorchVisionTransformLib.generate(val_subset_config)
test_subset_config.transforms = TorchVisionTransformLib.generate(test_subset_config)
The repeated subset config + sampler + transforms (together with the OTX dataset construction below) can be extracted into a small helper function parametrized by subset type.
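A possible shape for such a helper, reusing the names from the snippet above (`SubsetConfig`, `SamplerConfig`, `TorchVisionTransformLib`); the helper name and signature are hypothetical, and imports are assumed to match the surrounding module:

```python
def _build_subset_config(cfg_data: dict, subset_key: str) -> SubsetConfig:
    """Build the SubsetConfig (sampler + transforms) for one subset, e.g. "train_subset"."""
    subset_cfg_data = dict(cfg_data[subset_key])
    subset_cfg_data["input_size"] = cfg_data["input_size"]
    sampler_cfg_data = subset_cfg_data.pop("sampler", {})
    subset_config = SubsetConfig(sampler=SamplerConfig(**sampler_cfg_data), **subset_cfg_data)
    subset_config.transforms = TorchVisionTransformLib.generate(subset_config)
    return subset_config

train_subset_config = _build_subset_config(cfg_data, "train_subset")
val_subset_config = _build_subset_config(cfg_data, "val_subset")
test_subset_config = _build_subset_config(cfg_data, "test_subset")
```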
        revision_id=dataset_revision_id,
    )

@step("Prepare Model and Training Configuration")
| @step("Prepare Model and Training Configuration") | |
| @step("Prepare Model") |
if training_params.project_id is None:
    raise ValueError("Project ID must be provided for model preparation")
Since project_id is already validated in the run method, we can probably just cast it, similar to L166:
- if training_params.project_id is None:
-     raise ValueError("Project ID must be provided for model preparation")
+ project_id = cast(UUID, training_params.project_id)
model_id = uuid4()

# Create a separate data directory for this test
test_data_dir = tmp_path / "test_data"
Is there any benefit of creating a separate subdirectory `test_data` within `tmp_path` rather than reusing `tmp_path` directly in this test?
Summary
This PR implements the training job steps that fine-tune the model on the training dataset, evaluate it, and finally export it to OpenVINO format. The process to set up the training pipeline involves wrapping the Datumaro subsets into `OTXDataset` objects, using them to initialize an `OTXDataModule` and `OTXEngine`, then using the latter to orchestrate the training, evaluation, and export. At the end of the training, artifacts (including the converted model, logs and metrics) are stored in the model folder.

Resolves #4868
How to test
Submit a training job with:

    curl -X POST -H "Content-Type: application/json" localhost:7860/api/jobs -d '{"job_type": "train", "project_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "parameters": { "model_architecture_id": "Custom_Counting_Instance_Segmentation_MaskRCNN_EfficientNetB2B" }}'

Monitor the job progress with:

    curl localhost:7860/api/jobs

Note: the above steps assume a project with some dataset items, annotated and user-reviewed.
Checklist