
[CodeCamp #15] Add sigmoid focal loss cpu impl #2536


Open
wants to merge 4 commits into
base: master

Conversation


@XULU42 XULU42 commented Jan 9, 2023

Motivation

This PR adds a sigmoid focal loss CPU implementation based on the existing CUDA implementation.

Modification

  • Add sigmoid focal loss CPU implementation

  • Add focal loss unit test on CPU (sketched below)

  • Add ops to the EN/ZH documents
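
For illustration, a rough sketch of the kind of CPU unit test this adds (the shapes, the tolerance, the positional (pred, target, gamma, alpha, weight, reduction) call, and the 'sum' reduction are assumptions here, not the exact test in the PR):

```python
import torch
import torch.nn.functional as F
from mmcv.ops import sigmoid_focal_loss


def test_sigmoid_focal_loss_cpu():
    # Illustrative shapes; compares the new CPU path against the
    # element-wise focal loss formula written with plain torch ops.
    pred = torch.randn(4, 19)
    target = torch.randint(0, 19, (4,))

    # (pred, target, gamma, alpha, weight, reduction) passed positionally.
    loss = sigmoid_focal_loss(pred, target, 2.0, 0.25, None, 'sum')

    prob = pred.sigmoid()
    one_hot = F.one_hot(target, num_classes=19).to(pred.dtype)
    p_t = prob * one_hot + (1 - prob) * (1 - one_hot)
    alpha_t = 0.25 * one_hot + 0.75 * (1 - one_hot)
    expected = (-alpha_t * (1 - p_t) ** 2 * p_t.log()).sum()
    assert torch.allclose(loss, expected, atol=1e-4)
```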

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or other linting tools indicated in CONTRIBUTING.md are used to fix potential lint issues.
  • Bug fixes are covered by unit tests; the case that causes the bug should be added to the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • CLA has been signed and all committers have signed the CLA in this PR.

* Add sigmoid focal loss cpu implementation

* Add focal loss unit test on cpu

* Add ops to EN/ZH documents
@XULU42 XULU42 changed the title from "CodeCamp #15" to "[CodeCamp #15] Add sigmoid focal loss cpu impl" on Jan 10, 2023
Member

@grimoire grimoire left a comment

Since this op can be implemented with native torch, please provide a benchmark comparing this op with the torch implementation.
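
For reference, one possible native torch formulation is sketched below (the function name and signature are illustrative, not an mmcv API; torchvision.ops.sigmoid_focal_loss is another ready-made baseline):

```python
import torch
import torch.nn.functional as F


def sigmoid_focal_loss_torch(pred, target, gamma=2.0, alpha=0.25):
    """Sigmoid focal loss written with plain torch ops.

    pred:   (N, C) raw logits
    target: (N,) integer class labels, as mmcv.ops.sigmoid_focal_loss expects
    """
    one_hot = F.one_hot(target, num_classes=pred.size(1)).to(pred.dtype)
    prob = pred.sigmoid()
    # Element-wise BCE term, modulated by (1 - p_t)^gamma and alpha_t.
    ce = F.binary_cross_entropy_with_logits(pred, one_hot, reduction='none')
    p_t = prob * one_hot + (1 - prob) * (1 - one_hot)
    alpha_t = alpha * one_hot + (1 - alpha) * (1 - one_hot)
    return alpha_t * (1 - p_t).pow(gamma) * ce  # (N, C) loss map
```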

@XULU42
Author

XULU42 commented Jan 11, 2023

Thanks for the reviewer's feedback. I have some questions about the benchmark; could you please help me figure them out?

  1. Is there an existing benchmark example for this op? I would guess the answer is YES, given the existing CUDA op.
  2. What is mmcv's goal in reimplementing this op? I would guess performance, given the benchmark requirement. If the benchmark shows the current implementation has no performance advantage over the torch implementation, would the merge be delayed until optimization is complete, or could it be merged before optimization? (HAHA, CodeCamp has a deadline ^_^)
  3. If there is no existing benchmark, here is my plan; could you please check it? I would test different batch sizes (2, 4, 8, ..., 8192) and different num_classes values (2, 4, 8, ..., 8192) through the Python API, run each case 10 times, and compute the average latency of both implementations (sketched below). Where should the benchmark code be put?

Thanks again^_^
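
Roughly, the timing loop I have in mind for point 3 is sketched below (the helper name and shapes are illustrative only, and the mmcv call assumes the positional (pred, target, gamma, alpha, weight, reduction) signature; the torch side would be an analogous call to a native torch implementation):

```python
import time

import torch
from mmcv.ops import sigmoid_focal_loss


def average_latency(batch, num_classes, repeats=10):
    # Illustrative helper: time the CPU op over `repeats` runs.
    pred = torch.randn(batch, num_classes)
    target = torch.randint(0, num_classes, (batch,))
    start = time.perf_counter()
    for _ in range(repeats):
        sigmoid_focal_loss(pred, target, 2.0, 0.25, None, 'mean')
    return (time.perf_counter() - start) / repeats


for batch in [2 ** i for i in range(1, 14)]:            # 2 .. 8192
    for num_classes in [2 ** i for i in range(1, 14)]:  # 2 .. 8192
        print(batch, num_classes, average_latency(batch, num_classes))
```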

@grimoire
Member

  1. Yes, it would help us find the bottleneck of the op.
  2. It would be cool if the benchmark can guide you to improve the performance of the op. And we will merge this PR even if the performance is not the best (something is better than nothing). Your benchmark would help us optimize it in the future.
  3. The plan sounds cool. It would be even better if you could also run some tests with real-world data sizes. pytest-benchmark or the PyTorch profiler can help you build the benchmark. You can place the benchmark in this PR.
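
For instance, a minimal pytest-benchmark sketch could look like the following (the parametrized sizes are placeholders, real detection-scale shapes would be better, and the mmcv op is called with its positional (pred, target, gamma, alpha, weight, reduction) arguments):

```python
import pytest
import torch
from mmcv.ops import sigmoid_focal_loss


@pytest.mark.parametrize('batch', [8, 512, 8192])
@pytest.mark.parametrize('num_classes', [8, 80, 1024])
def test_benchmark_sigmoid_focal_loss_cpu(benchmark, batch, num_classes):
    pred = torch.randn(batch, num_classes)
    target = torch.randint(0, num_classes, (batch,))
    # pytest-benchmark repeats the call and reports min/mean/median latency.
    benchmark(sigmoid_focal_loss, pred, target, 2.0, 0.25, None, 'mean')
```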

According to the reviewer's comments:
1. replace expf with exp
2. directly bind the KernelLauncher function to the impl function
data case:
    batch: [2, 4, 8, ..., 8096]
    num_classes: [2, 4, 8, ..., 4096]
preliminary conclusion:
    1. In small data sizes, the implemented op is superior.
    2. In medium and large data sizes, the implemented op is superior.
Note:
    1. The implemented op is compared to torchvision.ops.sigmoid_focal_loss,
and the latter differs in its 'targets' arg. In the benchmark, different
data is used for each implementation to account for this.
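
For illustration, the interface difference mentioned in the note is roughly the following (shapes are arbitrary; mmcv's op takes integer class indices while torchvision's takes a one-hot float tensor with the same shape as the logits):

```python
import torch
import torch.nn.functional as F
from mmcv.ops import sigmoid_focal_loss as mmcv_sigmoid_focal_loss
from torchvision.ops import sigmoid_focal_loss as tv_sigmoid_focal_loss

pred = torch.randn(32, 80)           # (N, C) logits
label = torch.randint(0, 80, (32,))  # mmcv: integer class indices

loss_mmcv = mmcv_sigmoid_focal_loss(pred, label, 2.0, 0.25, None, 'mean')

# torchvision: binary targets with the same shape as the logits.
one_hot = F.one_hot(label, num_classes=80).to(pred.dtype)
loss_tv = tv_sigmoid_focal_loss(pred, one_hot, alpha=0.25, gamma=2.0,
                                reduction='mean')
# Note: the two 'mean' reductions may normalize differently, so this is
# for timing / interface comparison rather than value comparison.
```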
@XULU42
Author

XULU42 commented Jan 16, 2023

I have used pytest-benchmark to benchmark the implemented sigmoid focal loss op; the benchmark code is in the latest commit. On my 8-core notebook, the conclusions are: 1. In small data sizes, the implemented op is superior. 2. In medium and large data sizes, the torch implementation is superior. The processed benchmark data can be seen in benchmark_round10.csv and the raw benchmark data is in
0003_4e4fa93e7c5ee5a15dd63068c0fe9992068bc881_20230116_012432_uncommited-changes.txt

The next step is to optimize this op for medium and large data sizes, and the torchvision implementation may be a good starting point. Any suggestions are welcome ^_^

@grimoire
Member

Cool.
Do remember to remove the benchmark test code before the final commit. Benchmark tests might slow down GitHub Actions.
