Geti Tune training loop #5065
Conversation
Force-pushed from d279088 to 87c945b
Fine-tune the model on the training dataset, evaluate it and finally export it to OpenVINO format
Force-pushed from 87c945b to 76e05cf
Previously, if there was only one polygon, numpy would attempt a type conversion that interferes with Datumaro
  partial_inner_model = partial_model(cast("type[BaseModel]", new_field.annotation))  # type: ignore[arg-type]
  partial_fields[field_name] = (
-     partial_inner_model | None,
+     partial_inner_model | new_field.annotation | None,
I made this change to allow non-partial objects in the initialization of partial models' nested fields.
For example, the following code would previously raise an error, because PartialGlobalParameters only accepted PartialGlobalDatasetPreparationParameters but not GlobalDatasetPreparationParameters (the superclass).
params = PartialGlobalParameters(
    dataset_preparation=GlobalDatasetPreparationParameters(
        subset_split=SubsetSplit()
    )
)

@A-Artemis please review this change.
Thanks for the fix, but I still think we need to rework the TrainingConfiguration model. With our current implementation we have two definitions: a partial and a complete one. This complicates serializing the object for the endpoint, as well as defining a return type for the endpoint (and its API docs).
I think having PartialTrainingConfiguration is too much; instead, we should just use the complete training configuration and set its values to None where not in use (as distinct from the default value).
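For illustration, a minimal sketch of that alternative, assuming simplified pydantic models; the field definitions below are hypothetical and only reuse class names already mentioned in this thread:

```python
from pydantic import BaseModel

# Hypothetical, simplified models: one complete TrainingConfiguration whose
# sections are optional, so "not in use" is expressed as None rather than via a
# separate Partial* model.
class SubsetSplit(BaseModel):
    train: float = 0.7
    validation: float = 0.2
    test: float = 0.1

class GlobalDatasetPreparationParameters(BaseModel):
    subset_split: SubsetSplit | None = None

class TrainingConfiguration(BaseModel):
    dataset_preparation: GlobalDatasetPreparationParameters | None = None

# A single model type to serialize and to document as the endpoint's return type;
# unused sections simply stay None.
config = TrainingConfiguration(
    dataset_preparation=GlobalDatasetPreparationParameters(subset_split=SubsetSplit())
)
print(config.model_dump())
```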
Pull request overview
This PR implements the complete training pipeline for fine-tuning models using OpenVINO Training Extensions (OTX). The implementation adds the ability to train models on annotated datasets, evaluate their performance, and export them to OpenVINO format. Key infrastructure includes wrapping Datumaro datasets into OTX-compatible objects, orchestrating the training loop via OTXEngine, and persisting artifacts (checkpoints, metrics, exported models) to disk.
- Implements training orchestration using OTX's `OTXEngine`, `OTXDataModule`, and `OTXModel`
- Adds evaluation and OpenVINO export steps after training
- Updates configuration handling to support subset split ratios and OTX-specific format conversion
- Fixes data format issues (BGR→RGB conversion, ragged polygon arrays, Datumaro field types)
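
To make the flow concrete, here is a high-level sketch of the orchestration described above; it is not the actual `otx_trainer.py` code, the construction of `otx_model` and `otx_datamodule` is omitted, and the `train`/`test`/`export` method names are assumptions based on the OTX Engine API.

```python
# Sketch only: names come from this PR's diffs; method names/arguments are assumed.
otx_engine = OTXEngine(
    model=otx_model,            # OTXModel built from the selected architecture
    data=otx_datamodule,        # OTXDataModule wrapping the Datumaro subsets
    device=DeviceType.cpu,      # TODO in the PR: make the device configurable (#5040)
)
otx_engine.train()              # fine-tune on the training subset
metrics = otx_engine.test()     # evaluate the trained model
exported = otx_engine.export()  # export to OpenVINO format
# Artifacts (checkpoints, metrics, exported model) are then persisted to the model folder.
```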
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `application/backend/app/services/training/otx_trainer.py` | Core training pipeline implementation including dataset preparation, model training, evaluation, and export to OpenVINO format |
| `application/backend/tests/unit/services/training/test_otx_trainer.py` | Comprehensive test coverage for new training pipeline methods including configuration preparation, training, evaluation, and export |
| `application/backend/app/services/training_configuration_service.py` | Adds default subset split parameters to training configurations |
| `application/backend/app/services/datumaro_converter.py` | Fixes polygon array handling for instance segmentation and updates field types from `score_field` to `numeric_field` |
| `application/backend/app/workers/inference.py` | Converts frame data from BGR to RGB format before inference |
| `application/backend/app/stream/stream_data.py` | Documents frame data format as HWC BGR |
| `application/backend/app/services/active_model_service.py` | Changes default inference device from AUTO to CPU with explanation |
| `application/backend/tests/unit/services/test_datumaro_converter.py` | Updates test fixture to use valid triangle polygon |
| `application/backend/app/models/training_configuration/__init__.py` | Exports additional configuration classes |
| `application/backend/app/models/partial.py` | Fixes partial model generation to allow both partial and full nested models |
- sample = convert_sample(dataset_item, image_path, project_labels_ids)
+ try:
+     sample = convert_sample(dataset_item, image_path, project_labels_ids)
+ except:
Copilot AI (Dec 18, 2025):
Bare `except` clause catches all exceptions including `SystemExit` and `KeyboardInterrupt`. Use `except Exception:` to catch only application-level exceptions.
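For reference, a small sketch of the suggested fix; the logging call and the `continue` are hypothetical and assume this code sits inside the dataset-conversion loop shown in the diff:

```python
try:
    sample = convert_sample(dataset_item, image_path, project_labels_ids)
except Exception:
    # Only application-level errors are caught; SystemExit/KeyboardInterrupt still propagate.
    logger.exception("Failed to convert dataset item; skipping it")
    continue
```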
- dataset.append(sample)
+ try:
+     dataset.append(sample)
+ except:
Copilot AI (Dec 18, 2025):
Bare `except` clause catches all exceptions including `SystemExit` and `KeyboardInterrupt`. Use `except Exception:` to catch only application-level exceptions.
    self, training_config: dict, dataset_info: DatasetInfo, weights_path: Path, model_id: UUID
) -> tuple[Path, OTXEngine]:
    """Execute model training."""
    _ = weights_path  # TODO: use weights path to initialize model from pre-downloaded weights
Copilot AI (Dec 18, 2025):
The weights_path parameter is currently unused. Consider removing it from the method signature until the TODO is implemented, or implement the functionality now to avoid carrying unused parameters.
- _ = weights_path  # TODO: use weights path to initialize model from pre-downloaded weights
+ logger.info(
+     "Starting training with pre-downloaded weights path: {} (model_id={})",
+     weights_path,
+     model_id,
+ )
    if all_with_confidence
    else []
)
# Polygons array is ragged because the number of vertices is not fixed
This looks like an important conversion for polygons. Is this only for instance segmentation, or could it end up being used somewhere else?
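For context, a minimal numpy illustration of the single-polygon pitfall mentioned in the commit message above (the coordinates are hypothetical):

```python
import numpy as np

# Two polygons with different vertex counts: the collection is "ragged".
triangle = [0.0, 0.0, 1.0, 0.0, 0.5, 1.0]         # 3 vertices (6 coordinates)
quad = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]   # 4 vertices (8 coordinates)

# With a single polygon, np.array() silently builds a regular 2-D float array
# instead of an object array of per-polygon lists, which is not what Datumaro expects.
single = np.array([triangle])   # shape (1, 6), dtype float64

# Forcing dtype=object keeps one Python list per polygon regardless of vertex count.
ragged = np.empty(2, dtype=object)
ragged[:] = [triangle, quad]
```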
otx_engine = OTXEngine(
    model=otx_model,
    data=otx_datamodule,
    device=DeviceType.cpu,  # TODO make device configurable (#5040)
Probably best to leave this as `DeviceType.auto` if the TODO isn't implemented yet.
cfg_data = training_config["data"]
train_subset_cfg_data = cfg_data.get("train_subset")
val_subset_cfg_data = cfg_data.get("val_subset")
test_subset_cfg_data = cfg_data.get("test_subset")
train_subset_cfg_data["input_size"] = cfg_data["input_size"]
val_subset_cfg_data["input_size"] = cfg_data["input_size"]
test_subset_cfg_data["input_size"] = cfg_data["input_size"]
train_subset_sampler_cfg_data = train_subset_cfg_data.pop("sampler", {})
val_subset_sampler_cfg_data = val_subset_cfg_data.pop("sampler", {})
test_subset_sampler_cfg_data = test_subset_cfg_data.pop("sampler", {})
train_subset_config = SubsetConfig(
    sampler=SamplerConfig(**train_subset_sampler_cfg_data), **train_subset_cfg_data
)
val_subset_config = SubsetConfig(
    sampler=SamplerConfig(**val_subset_sampler_cfg_data), **val_subset_cfg_data
)
test_subset_config = SubsetConfig(
    sampler=SamplerConfig(**test_subset_sampler_cfg_data), **test_subset_cfg_data
)
train_subset_config.transforms = TorchVisionTransformLib.generate(train_subset_config)
val_subset_config.transforms = TorchVisionTransformLib.generate(val_subset_config)
test_subset_config.transforms = TorchVisionTransformLib.generate(test_subset_config)
The repeated subset config + sampler + transforms (together with the OTX dataset construction below) can be extracted into a small helper function parametrized by subset type.
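A possible shape for such a helper, reusing the names from the snippet above (`SubsetConfig`, `SamplerConfig`, `TorchVisionTransformLib`); the helper name and signature are hypothetical, and imports are assumed to match the surrounding module:

```python
def _build_subset_config(cfg_data: dict, subset_key: str) -> SubsetConfig:
    """Build the SubsetConfig (sampler + transforms) for one subset, e.g. "train_subset"."""
    subset_cfg_data = dict(cfg_data[subset_key])
    subset_cfg_data["input_size"] = cfg_data["input_size"]
    sampler_cfg_data = subset_cfg_data.pop("sampler", {})
    subset_config = SubsetConfig(sampler=SamplerConfig(**sampler_cfg_data), **subset_cfg_data)
    subset_config.transforms = TorchVisionTransformLib.generate(subset_config)
    return subset_config

train_subset_config = _build_subset_config(cfg_data, "train_subset")
val_subset_config = _build_subset_config(cfg_data, "val_subset")
test_subset_config = _build_subset_config(cfg_data, "test_subset")
```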
        revision_id=dataset_revision_id,
    )

@step("Prepare Model and Training Configuration")
| @step("Prepare Model and Training Configuration") | |
| @step("Prepare Model") |
if training_params.project_id is None:
    raise ValueError("Project ID must be provided for model preparation")
Since project_id is already validated in the run method, we can probably just cast it, similar to L166:
- if training_params.project_id is None:
-     raise ValueError("Project ID must be provided for model preparation")
+ project_id = cast(UUID, training_params.project_id)
model_id = uuid4()

# Create a separate data directory for this test
test_data_dir = tmp_path / "test_data"
Is there any benefit of creating a separate subdirectory `test_data` within `tmp_path` rather than reusing `tmp_path` directly in this test?
Summary
This PR implements the training job steps that fine-tune the model on the training dataset, evaluate it, and finally export it to OpenVINO format. The process to set up the training pipeline involves wrapping the Datumaro subsets into `OTXDataset` objects, using them to initialize an `OTXDataModule` and `OTXEngine`, then using the latter to orchestrate the training, evaluation, and export. At the end of the training, artifacts (including the converted model, logs and metrics) are stored in the model folder.

Resolves #4868
How to test
Submit a training job with:

    curl -X POST -H "Content-Type: application/json" localhost:7860/api/jobs -d '{"job_type": "train", "project_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "parameters": { "model_architecture_id": "Custom_Counting_Instance_Segmentation_MaskRCNN_EfficientNetB2B" }}'

Monitor the job progress with:

    curl localhost:7860/api/jobs

Note: the above steps assume a project with some dataset items, annotated and user-reviewed.
Checklist