Conversation

@leoll2 leoll2 commented Dec 8, 2025

Summary

This PR implements the training job steps that fine-tune the model on the training dataset, evaluate it, and finally export it to OpenVINO format. Setting up the training pipeline involves wrapping the Datumaro subsets into OTXDataset objects, using them to initialize an OTXDataModule and an OTXEngine, and then using the latter to orchestrate training, evaluation, and export. At the end of training, the artifacts (including the converted model, logs, and metrics) are stored in the model folder.

Resolves #4868
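For illustration, here is a condensed sketch of that flow, based only on the OTX symbols visible in this PR's diffs; the build_datamodule helper and the bare export() call are assumptions, not the actual implementation:

otx_datamodule = build_datamodule(datumaro_subsets)  # hypothetical helper wrapping subsets into OTXDataset objects
otx_engine = OTXEngine(
    model=otx_model,
    data=otx_datamodule,
    device=DeviceType.cpu,
)

otx_engine.train()           # fine-tune on the train subset
metrics = otx_engine.test()  # evaluate on the test subset
otx_engine.export()          # convert the trained model to OpenVINO format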

How to test

Submit a training job with:

curl -X POST -H "Content-Type: application/json" localhost:7860/api/jobs -d '{"job_type": "train", "project_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "parameters": { "model_architecture_id": "Custom_Counting_Instance_Segmentation_MaskRCNN_EfficientNetB2B" }}'

Monitor the job progress with:

curl localhost:7860/api/jobs

Note: the above steps assume a project with dataset items that have been annotated and user-reviewed.

Checklist

  • The PR title and description are clear and descriptive
  • I have manually tested the changes
  • All changes are covered by automated tests
  • All related issues are linked to this PR (if applicable)
  • Documentation has been updated (if applicable)

@leoll2 leoll2 self-assigned this Dec 8, 2025
@github-actions github-actions bot added the Geti Tune Backend Issues related to Geti Tune Studio backend label Dec 8, 2025

github-actions bot commented Dec 8, 2025

📊 Test coverage report

Metric       Coverage
Lines        40.2%
Functions    34.9%
Branches     84.8%
Statements   40.2%


github-actions bot commented Dec 8, 2025

Docker Image Sizes

CPU

Image                      Size
geti-tune-cpu:pr-5065      2.85G
geti-tune-cpu:sha-d9e9633  2.85G

GPU

Image                      Size
geti-tune-gpu:pr-5065      10.62G
geti-tune-gpu:sha-d9e9633  10.62G

XPU

Image                      Size
geti-tune-xpu:pr-5065      8.69G
geti-tune-xpu:sha-d9e9633  8.69G

@leoll2 leoll2 force-pushed the leonardo/geti-tune-training branch from d279088 to 87c945b on December 8, 2025 17:02
Commit message: Fine-tune the model on the training dataset, evaluate it and finally export it to OpenVINO format
@leoll2 leoll2 force-pushed the leonardo/geti-tune-training branch from 87c945b to 76e05cf on December 10, 2025 16:41
@github-actions github-actions bot added the TEST Any changes in tests label Dec 17, 2025
partial_inner_model = partial_model(cast("type[BaseModel]", new_field.annotation))  # type: ignore[arg-type]
partial_fields[field_name] = (
-   partial_inner_model | None,
+   partial_inner_model | new_field.annotation | None,
leoll2 (Contributor, Author) commented:

I made this change to allow non-partial objects when initializing the nested fields of partial models.
For example, the following code would previously raise an error, because PartialGlobalParameters only accepted PartialGlobalDatasetPreparationParameters but not GlobalDatasetPreparationParameters (the superclass).

params = PartialGlobalParameters(
    dataset_preparation=GlobalDatasetPreparationParameters(
        subset_split=SubsetSplit()
    )
)

@A-Artemis please review this change.
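For context, a condensed, hypothetical reduction of the partial_model mechanism being changed here (the real implementation lives in app/models/partial.py); with the fix, each nested-model field accepts the generated partial model, the original model, or None:

from pydantic import BaseModel, create_model

def partial_model(model: type[BaseModel]) -> type[BaseModel]:
    partial_fields = {}
    for field_name, field in model.model_fields.items():
        annotation = field.annotation
        if isinstance(annotation, type) and issubclass(annotation, BaseModel):
            # Recurse into nested models; after the fix, the original
            # (non-partial) model is accepted alongside its partial variant.
            partial_inner_model = partial_model(annotation)
            partial_fields[field_name] = (partial_inner_model | annotation | None, None)
        else:
            partial_fields[field_name] = (annotation | None, None)
    return create_model(f"Partial{model.__name__}", **partial_fields)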

A contributor replied:

Thanks for the fix, but I still think we need to rework the TrainingConfiguration model. With the current implementation we have two definitions, a partial and a complete one, which complicates serializing the object for the endpoint, as well as defining a return type for the endpoint (and its API docs).

I think having PartialTrainingConfiguration is too much; instead, we should just use the complete training configuration and set its values to None where not in use (as distinct from the default value).
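For what it's worth, a minimal sketch of the single-model approach suggested here, assuming Pydantic v2 (the defaults are illustrative):

from pydantic import BaseModel

class SubsetSplit(BaseModel):
    train: float = 0.7
    val: float = 0.2
    test: float = 0.1

class GlobalDatasetPreparationParameters(BaseModel):
    subset_split: SubsetSplit | None = None

class TrainingConfiguration(BaseModel):
    # None marks a value as "not in use", as distinct from a real default
    dataset_preparation: GlobalDatasetPreparationParameters | None = None

cfg = TrainingConfiguration()             # nothing configured yet
print(cfg.model_dump(exclude_none=True))  # prints {}: a single type to serialize and document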

@leoll2 leoll2 marked this pull request as ready for review December 18, 2025 15:42
@leoll2 leoll2 requested a review from a team as a code owner December 18, 2025 15:42
Copilot AI left a comment:

Pull request overview

This PR implements the complete training pipeline for fine-tuning models using OpenVINO Training Extensions (OTX). The implementation adds the ability to train models on annotated datasets, evaluate their performance, and export them to OpenVINO format. Key infrastructure includes wrapping Datumaro datasets into OTX-compatible objects, orchestrating the training loop via OTXEngine, and persisting artifacts (checkpoints, metrics, exported models) to disk.

  • Implements training orchestration using OTX's OTXEngine, OTXDataModule, and OTXModel
  • Adds evaluation and OpenVINO export steps after training
  • Updates configuration handling to support subset split ratios and OTX-specific format conversion
  • Fixes data format issues (BGR→RGB conversion, ragged polygon arrays, Datumaro field types); see the conversion snippet below
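As a reference for the BGR→RGB fix, the conversion typically looks like this with OpenCV (whether the PR uses cv2 for this step is an assumption):

import cv2
import numpy as np

frame_bgr = np.zeros((480, 640, 3), dtype=np.uint8)     # placeholder HWC BGR frame
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # reorder channels for the model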

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

  • application/backend/app/services/training/otx_trainer.py: Core training pipeline implementation including dataset preparation, model training, evaluation, and export to OpenVINO format
  • application/backend/tests/unit/services/training/test_otx_trainer.py: Comprehensive test coverage for the new training pipeline methods, including configuration preparation, training, evaluation, and export
  • application/backend/app/services/training_configuration_service.py: Adds default subset split parameters to training configurations
  • application/backend/app/services/datumaro_converter.py: Fixes polygon array handling for instance segmentation and updates field types from score_field to numeric_field
  • application/backend/app/workers/inference.py: Converts frame data from BGR to RGB format before inference
  • application/backend/app/stream/stream_data.py: Documents the frame data format as HWC BGR
  • application/backend/app/services/active_model_service.py: Changes the default inference device from AUTO to CPU, with an explanation
  • application/backend/tests/unit/services/test_datumaro_converter.py: Updates a test fixture to use a valid triangle polygon
  • application/backend/app/models/training_configuration/__init__.py: Exports additional configuration classes
  • application/backend/app/models/partial.py: Fixes partial model generation to allow both partial and full nested models

-    sample = convert_sample(dataset_item, image_path, project_labels_ids)
+    try:
+        sample = convert_sample(dataset_item, image_path, project_labels_ids)
+    except:
Copilot AI commented Dec 18, 2025:

Bare except clause catches all exceptions including SystemExit and KeyboardInterrupt. Use except Exception: to catch only application-level exceptions.
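The corrected form would look like this (the logging call and the continue are illustrative, not the PR's actual code):

try:
    sample = convert_sample(dataset_item, image_path, project_labels_ids)
except Exception:
    logger.exception("Failed to convert dataset item; skipping it")
    continue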

-    dataset.append(sample)
+    try:
+        dataset.append(sample)
+    except:
Copilot AI commented Dec 18, 2025:

Bare except clause catches all exceptions including SystemExit and KeyboardInterrupt. Use except Exception: to catch only application-level exceptions.

    self, training_config: dict, dataset_info: DatasetInfo, weights_path: Path, model_id: UUID
) -> tuple[Path, OTXEngine]:
    """Execute model training."""
    _ = weights_path  # TODO: use weights path to initialize model from pre-downloaded weights
Copilot AI commented Dec 18, 2025:


The weights_path parameter is currently unused. Consider removing it from the method signature until the TODO is implemented, or implement the functionality now to avoid carrying unused parameters.

Suggested change:
-    _ = weights_path  # TODO: use weights path to initialize model from pre-downloaded weights
+    logger.info(
+        "Starting training with pre-downloaded weights path: {} (model_id={})",
+        weights_path,
+        model_id,
+    )

    if all_with_confidence
    else []
)
# Polygons array is ragged because the number of vertices is not fixed
A contributor commented:

This looks like an important conversion for polygons. Is this only for instance segmentation, or could it end up being used somewhere else?
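For context, a minimal, PR-independent illustration of why such an array is ragged (the flat x, y coordinate layout is an assumption):

import numpy as np

# Two polygons with different vertex counts cannot be stacked into a regular
# 2-D array; recent NumPy versions require building an object-dtype array explicitly.
polygons = [
    np.array([0.0, 0.0, 10.0, 0.0, 5.0, 8.0]),               # triangle: x1, y1, ..., x3, y3
    np.array([0.0, 0.0, 10.0, 0.0, 10.0, 10.0, 0.0, 10.0]),  # quadrilateral
]
ragged = np.empty(len(polygons), dtype=object)
ragged[:] = polygons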

otx_engine = OTXEngine(
    model=otx_model,
    data=otx_datamodule,
    device=DeviceType.cpu,  # TODO make device configurable (#5040)
A contributor commented:

Probably best to leave this as DeviceType.auto if the TODO isn't implemented yet.

Comment on lines +213 to +234
cfg_data = training_config["data"]
train_subset_cfg_data = cfg_data.get("train_subset")
val_subset_cfg_data = cfg_data.get("val_subset")
test_subset_cfg_data = cfg_data.get("test_subset")
train_subset_cfg_data["input_size"] = cfg_data["input_size"]
val_subset_cfg_data["input_size"] = cfg_data["input_size"]
test_subset_cfg_data["input_size"] = cfg_data["input_size"]
train_subset_sampler_cfg_data = train_subset_cfg_data.pop("sampler", {})
val_subset_sampler_cfg_data = val_subset_cfg_data.pop("sampler", {})
test_subset_sampler_cfg_data = test_subset_cfg_data.pop("sampler", {})
train_subset_config = SubsetConfig(
    sampler=SamplerConfig(**train_subset_sampler_cfg_data), **train_subset_cfg_data
)
val_subset_config = SubsetConfig(
    sampler=SamplerConfig(**val_subset_sampler_cfg_data), **val_subset_cfg_data
)
test_subset_config = SubsetConfig(
    sampler=SamplerConfig(**test_subset_sampler_cfg_data), **test_subset_cfg_data
)
train_subset_config.transforms = TorchVisionTransformLib.generate(train_subset_config)
val_subset_config.transforms = TorchVisionTransformLib.generate(val_subset_config)
test_subset_config.transforms = TorchVisionTransformLib.generate(test_subset_config)
A contributor commented:

The repeated subset config + sampler + transforms setup, together with the OTX dataset construction below it, could be extracted into a small helper function parametrized by subset type; see the sketch below.
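One way the suggested helper could look, reusing the names from the snippet above (a sketch under those assumptions, not the agreed refactor):

def _build_subset_config(cfg_data: dict, subset: str) -> SubsetConfig:
    subset_cfg = dict(cfg_data[f"{subset}_subset"])  # copy to avoid mutating the input
    subset_cfg["input_size"] = cfg_data["input_size"]
    sampler_cfg = subset_cfg.pop("sampler", {})
    config = SubsetConfig(sampler=SamplerConfig(**sampler_cfg), **subset_cfg)
    config.transforms = TorchVisionTransformLib.generate(config)
    return config

train_subset_config = _build_subset_config(cfg_data, "train")
val_subset_config = _build_subset_config(cfg_data, "val")
test_subset_config = _build_subset_config(cfg_data, "test")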

    revision_id=dataset_revision_id,
)

@step("Prepare Model and Training Configuration")
A contributor commented:

Suggested change:
-    @step("Prepare Model and Training Configuration")
+    @step("Prepare Model")

Comment on lines 274 to 275
if training_params.project_id is None:
    raise ValueError("Project ID must be provided for model preparation")
itallix (Contributor) commented Dec 22, 2025:

Since project_id is already validated in the run method, we can probably just cast it, similar to L166:

Suggested change:
-    if training_params.project_id is None:
-        raise ValueError("Project ID must be provided for model preparation")
+    project_id = cast(UUID, training_params.project_id)

model_id = uuid4()

# Create a separate data directory for this test
test_data_dir = tmp_path / "test_data"
itallix (Contributor) commented Dec 22, 2025:

Is there any benefit of creating a separate subdirectory test_data within tmp_path rather than reusing tmp_path directly in this test?

Labels

  • Geti Tune Backend: Issues related to Geti Tune Studio backend
  • TEST: Any changes in tests


Development

Successfully merging this pull request may close these issues.

Implement training stage of training job

5 participants