
Numpy DataItem #4347


Draft · wants to merge 17 commits into base: develop

Conversation

eugene123tw (Contributor)

Summary

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have run e2e tests and there are no issues.
  • I have added the description of my changes to the CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

@github-actions github-actions bot added the BUILD label Apr 15, 2025

@ashwinvaidya17 left a comment


Thanks for the effort. It looks good overall; I have a few minor comments.

ashwinvaidya17 previously approved these changes Apr 17, 2025

@ashwinvaidya17 left a comment


Thanks! I am fine with the changes.

@eugene123tw eugene123tw marked this pull request as ready for review April 17, 2025 13:15
@eugene123tw eugene123tw changed the title Numpy DataItem PoC Numpy DataItem Apr 17, 2025

@kprokofi left a comment


Thanks for the update, Eugene. That said, I feel now the NumpyDataItem is starting to overcomplicate things. We're now duplicating logic by preprocessing and collating everything separately into NumPy arrays. On top of that, for IR validation, we still need the annotations as PyTorch tensors since we're using torchmetrics for evaluation.

Would it make sense to support only NumPy images with optional transforms, and keep the annotations in torch format? From the perspective of the OV Engine and model implementations, it seems more consistent that way.
What do you think?
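The hybrid item suggested here (NumPy image, torch annotations) could look roughly like the following sketch. All names are invented for illustration; in practice the `annotations` field would hold `torch.Tensor` objects, which the sketch types as `Any` so it carries no torch dependency.

```python
# Hypothetical sketch of the suggested hybrid item: the image stays a NumPy
# array (ready for OV inference and NumPy-based transforms), while annotations
# keep whatever tensor type the evaluation stack expects (torch.Tensor in
# practice; typed as Any here to keep the sketch dependency-free).
from dataclasses import dataclass
from typing import Any

import numpy as np


@dataclass
class HybridDataItem:
    image: np.ndarray   # HWC image, NumPy end to end
    annotations: Any    # e.g. torch.Tensor labels/masks, kept in torch format

    def with_image(self, new_image: np.ndarray) -> "HybridDataItem":
        """Return a copy with a transformed image; annotations are untouched."""
        return HybridDataItem(image=new_image, annotations=self.annotations)


# Minimal usage, with a NumPy stand-in for the torch annotation:
item = HybridDataItem(image=np.zeros((4, 4, 3), dtype=np.uint8),
                      annotations=np.array([1, 2]))
resized = item.with_image(np.zeros((2, 2, 3), dtype=np.uint8))
print(resized.image.shape)                      # transformed image shape
print(resized.annotations is item.annotations)  # annotations shared, not copied
```

The design keeps image-side transforms purely in NumPy while evaluation code continues to receive annotations in the format it already consumes.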

@eugene123tw (Contributor, Author)

> Thanks for the update, Eugene. That said, I feel now the NumpyDataItem is starting to overcomplicate things. We're now duplicating logic by preprocessing and collating everything separately into NumPy arrays. On top of that, for IR validation, we still need the annotations as PyTorch tensors since we're using torchmetrics for evaluation.
>
> Would it make sense to support only NumPy images with optional transforms, and keep the annotations in torch format? From the perspective of the OV Engine and model implementations, it seems more consistent that way. What do you think?

@kprokofi I think we should avoid depending on torchmetrics for IR inference/validation. Using torchmetrics requires converting both predictions and annotations into PyTorch tensors, which doesn’t make sense for IR tasks — the outputs are typically generic (e.g., lists, NumPy arrays) and shouldn't be tied to a specific deep learning framework.

I also feel mixing NumPy arrays and torch tensors within a single data entity isn’t clean and could introduce bugs or unnecessary complexity.

A more framework-agnostic alternative could be HuggingFace’s evaluate library (https://huggingface.co/docs/evaluate/index), which supports segmentation, object detection, and other tasks without depending on any particular DL framework.

As for the duplication concern: if we remove torchmetrics dependency for IR, we wouldn’t need to collate annotations separately into tensors — we could simply pass datasets with NumPy data items directly to the evaluator, without needing to wrap them in a PyTorch DataLoader at all. This would simplify the flow.

What do you think?
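The framework-agnostic evaluation described above, computing metrics directly on NumPy data items without any torch collation, could be sketched as below. This is a minimal hand-rolled mean IoU for illustration (the kind of computation `evaluate`'s segmentation metrics perform); the function name and signature are invented, not an existing OTX or `evaluate` API.

```python
# Hedged sketch: a framework-agnostic mean IoU computed directly on NumPy
# label masks, the kind of evaluation that would let IR validation skip
# torchmetrics (and the torch-tensor collation step) entirely.
import numpy as np


def mean_iou(preds: list[np.ndarray], refs: list[np.ndarray], num_labels: int) -> float:
    """Average per-class IoU over a dataset of integer label masks."""
    inter = np.zeros(num_labels, dtype=np.int64)
    union = np.zeros(num_labels, dtype=np.int64)
    for pred, ref in zip(preds, refs):
        for c in range(num_labels):
            p, r = pred == c, ref == c
            inter[c] += np.logical_and(p, r).sum()
            union[c] += np.logical_or(p, r).sum()
    present = union > 0  # ignore classes absent from both predictions and references
    return float((inter[present] / union[present]).mean())


pred = np.array([[0, 0], [1, 1]])
ref = np.array([[0, 1], [1, 1]])
score = mean_iou([pred], [ref], num_labels=2)  # class 0: 1/2, class 1: 2/3
print(score)
```

Because the inputs are plain NumPy arrays, the evaluator can iterate over the dataset's NumPy data items directly, with no DataLoader or tensor conversion in between.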

@github-actions github-actions bot added the TEST Any changes in tests label Apr 22, 2025
@eugene123tw eugene123tw requested a review from kprokofi April 22, 2025 10:44
@github-actions github-actions bot removed the BUILD label Apr 22, 2025
@eugene123tw eugene123tw marked this pull request as draft April 28, 2025 07:29