
Numpy DataItem #4347


Draft · wants to merge 17 commits into base: develop

Conversation

eugene123tw (Contributor)

Summary

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have run e2e tests and there are no issues.
  • I have added the description of my changes to the CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

@github-actions github-actions bot added the BUILD label Apr 15, 2025

@ashwinvaidya17 left a comment


Thanks for the effort. It looks good overall; I have a few minor comments.

ashwinvaidya17 previously approved these changes Apr 17, 2025

@ashwinvaidya17 left a comment


Thanks! I am fine with the changes.

@eugene123tw eugene123tw marked this pull request as ready for review April 17, 2025 13:15
@eugene123tw eugene123tw changed the title Numpy DataItem PoC Numpy DataItem Apr 17, 2025

@kprokofi left a comment


Thanks for the update, Eugene. That said, I feel now the NumpyDataItem is starting to overcomplicate things. We're now duplicating logic by preprocessing and collating everything separately into NumPy arrays. On top of that, for IR validation, we still need the annotations as PyTorch tensors since we're using torchmetrics for evaluation.

Would it make sense to support only NumPy images with optional transforms, and keep the annotations in torch format? From the perspective of the OV Engine and model implementations, it seems more consistent that way.
What do you think?
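The hybrid item suggested here (NumPy image, torch annotations) could look roughly like the following sketch. All names are invented for illustration; in practice the `annotations` field would hold `torch.Tensor` objects, which the sketch types as `Any` so it carries no torch dependency.

```python
# Hypothetical sketch of the suggested hybrid item: the image stays a NumPy
# array (ready for OV inference and NumPy-based transforms), while annotations
# keep whatever tensor type the evaluation stack expects (torch.Tensor in
# practice; typed as Any here to keep the sketch dependency-free).
from dataclasses import dataclass
from typing import Any

import numpy as np


@dataclass
class HybridDataItem:
    image: np.ndarray   # HWC image, NumPy end to end
    annotations: Any    # e.g. torch.Tensor labels/masks, kept in torch format

    def with_image(self, new_image: np.ndarray) -> "HybridDataItem":
        """Return a copy with a transformed image; annotations are untouched."""
        return HybridDataItem(image=new_image, annotations=self.annotations)


# Minimal usage, with a NumPy stand-in for the torch annotation:
item = HybridDataItem(image=np.zeros((4, 4, 3), dtype=np.uint8),
                      annotations=np.array([1, 2]))
resized = item.with_image(np.zeros((2, 2, 3), dtype=np.uint8))
print(resized.image.shape)                      # transformed image shape
print(resized.annotations is item.annotations)  # annotations shared, not copied
```

The design keeps image-side transforms purely in NumPy while evaluation code continues to receive annotations in the format it already consumes.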

@eugene123tw (Contributor, Author)

> Thanks for the update, Eugene. That said, I feel now the NumpyDataItem is starting to overcomplicate things. We're now duplicating logic by preprocessing and collating everything separately into NumPy arrays. On top of that, for IR validation, we still need the annotations as PyTorch tensors since we're using torchmetrics for evaluation.
>
> Would it make sense to support only NumPy images with optional transforms, and keep the annotations in torch format? From the perspective of the OV Engine and model implementations, it seems more consistent that way. What do you think?

@kprokofi I think we should avoid depending on torchmetrics for IR inference/validation. Using torchmetrics requires converting both predictions and annotations into PyTorch tensors, which doesn’t make sense for IR tasks — the outputs are typically generic (e.g., lists, NumPy arrays) and shouldn't be tied to a specific deep learning framework.

I also feel mixing NumPy arrays and torch tensors within a single data entity isn’t clean and could introduce bugs or unnecessary complexity.

A more framework-agnostic alternative could be HuggingFace’s evaluate library (https://huggingface.co/docs/evaluate/index), which supports segmentation, object detection, and other tasks without depending on any particular DL framework.

As for the duplication concern: if we remove torchmetrics dependency for IR, we wouldn’t need to collate annotations separately into tensors — we could simply pass datasets with NumPy data items directly to the evaluator, without needing to wrap them in a PyTorch DataLoader at all. This would simplify the flow.

What do you think?
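The framework-agnostic evaluation described above, computing metrics directly on NumPy data items without any torch collation, could be sketched as below. This is a minimal hand-rolled mean IoU for illustration (the kind of computation `evaluate`'s segmentation metrics perform); the function name and signature are invented, not an existing OTX or `evaluate` API.

```python
# Hedged sketch: a framework-agnostic mean IoU computed directly on NumPy
# label masks, the kind of evaluation that would let IR validation skip
# torchmetrics (and the torch-tensor collation step) entirely.
import numpy as np


def mean_iou(preds: list[np.ndarray], refs: list[np.ndarray], num_labels: int) -> float:
    """Average per-class IoU over a dataset of integer label masks."""
    inter = np.zeros(num_labels, dtype=np.int64)
    union = np.zeros(num_labels, dtype=np.int64)
    for pred, ref in zip(preds, refs):
        for c in range(num_labels):
            p, r = pred == c, ref == c
            inter[c] += np.logical_and(p, r).sum()
            union[c] += np.logical_or(p, r).sum()
    present = union > 0  # ignore classes absent from both predictions and references
    return float((inter[present] / union[present]).mean())


pred = np.array([[0, 0], [1, 1]])
ref = np.array([[0, 1], [1, 1]])
score = mean_iou([pred], [ref], num_labels=2)  # class 0: 1/2, class 1: 2/3
print(score)
```

Because the inputs are plain NumPy arrays, the evaluator can iterate over the dataset's NumPy data items directly, with no DataLoader or tensor conversion in between.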

@github-actions github-actions bot added the TEST Any changes in tests label Apr 22, 2025
@eugene123tw eugene123tw requested a review from kprokofi April 22, 2025 10:44
@github-actions github-actions bot removed the BUILD label Apr 22, 2025
@eugene123tw eugene123tw marked this pull request as draft April 28, 2025 07:29