Verify GPU memory consistency for Huber loss (delta=0.5) #21814
base: master
Conversation
Summary of Changes: Hello @MalyalaKarthik66, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request addresses an observed GPU memory inconsistency issue within the Huber loss computation.
Code Review
This pull request addresses a GPU memory inconsistency in the Huber loss calculation by making dtype casting more explicit. The core change in losses.py is sound, though it introduces a small redundancy. Additionally, a new test has been added for verification, which appears to be a debugging utility and is not suitable for merging as a permanent regression test. My review includes suggestions to remove the redundant code and the temporary test.
```diff
  abs_error = ops.abs(error)
- half = ops.convert_to_tensor(0.5, dtype=abs_error.dtype)
+ half = ops.cast(ops.convert_to_tensor(0.5), dtype=abs_error.dtype)
+ delta = ops.cast(delta, dtype=abs_error.dtype)
```
This explicit cast on delta is redundant. On line 1972, delta is already converted to a tensor with y_pred.dtype. Since abs_error.dtype is the same as y_pred.dtype (as it's derived from y_pred and y_true), this line is effectively a no-op (ops.cast(delta, delta.dtype)). You can safely remove this line to avoid the unnecessary operation.
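The no-op can be seen in a minimal NumPy sketch (NumPy standing in for the backend tensors; the variable names are illustrative, not the Keras source):

```python
import numpy as np

# abs_error's dtype is inherited from y_pred/y_true upstream
y_pred = np.array([0.1, 0.9], dtype=np.float32)
abs_error = np.abs(y_pred - np.array([0.0, 1.0], dtype=np.float32))

# delta was already converted with y_pred's dtype earlier in the function
delta = np.asarray(1.0, dtype=y_pred.dtype)

# casting delta to the dtype it already has changes nothing
recast = delta.astype(abs_error.dtype)
assert recast.dtype == delta.dtype == np.dtype("float32")
```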
```python
def test_huber_memory_usage_debug_05(self):
    import tensorflow as tf, numpy as np, keras

    print("\n[Huber GPU Memory Debug: delta=0.5]")
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if not gpus:
        print("No GPU found. Skipping test.")
        return
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(f"[Info] GPU memory growth already set or GPU initialized: {e}")

    x = np.random.rand(1000, 1)
    y = ((3 * x) + 2) + np.random.randn(1000, 1)
    huber_loss = keras.losses.Huber(delta=0.5)
    loss = huber_loss(y, x)
    print(f"Huber loss: {loss.numpy():.6f}")

    memory = sum(
        tf.config.experimental.get_memory_info(f'GPU:{i}')['current']
        for i in range(len(gpus))
    )
    print(f"GPU memory usage: {memory} bytes")

    # sanity check for stable GPU usage (adjust threshold as needed)
    assert memory > 0, f"GPU memory not allocated or usage is zero. Current usage: {memory} bytes"
```
This test appears to be for debugging and manual verification of GPU memory usage, as suggested by its name and implementation. While valuable for your investigation, it's not ideal as a permanent regression test for a few reasons:
- **Backend-specific:** it directly imports and uses `tensorflow`, making it non-portable to other backends.
- **Weak assertion:** the `assert memory > 0` check is not robust; it only confirms that some memory was allocated, rather than verifying the stability of memory usage across different `delta` values, which was the core issue.
- **Debugging artifacts:** it includes `print` statements, which are generally not recommended in production test suites.
- **Naming convention:** the name `test_huber_memory_usage_debug_05` indicates it's a temporary debugging utility.

Since this test has served its purpose for verifying the fix, I recommend removing it from the pull request to maintain a clean and focused test suite.
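For context, the property a robust assertion should pin down — identical loss logic across `delta` values — follows directly from the piecewise Huber definition, sketched here in plain NumPy (illustrative reference code, not the Keras implementation):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it
    error = np.asarray(y_pred, dtype=np.float32) - np.asarray(y_true, dtype=np.float32)
    abs_error = np.abs(error)
    quadratic = 0.5 * np.square(error)
    linear = delta * abs_error - 0.5 * delta ** 2
    return float(np.mean(np.where(abs_error <= delta, quadratic, linear)))

# When every error falls below delta, the result is delta-independent,
# so sweeping delta should change neither the value path nor memory behavior.
small = huber([0.0, 0.0], [0.1, -0.1], delta=0.5)
large = huber([0.0, 0.0], [0.1, -0.1], delta=10.0)
assert np.isclose(small, large)
```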
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master   #21814   +/-   ##
=======================================
  Coverage   82.66%   82.66%
=======================================
  Files         577      577
  Lines       59419    59420       +1
  Branches     9313     9313
=======================================
+ Hits        49121    49122       +1
  Misses       7898     7898
  Partials     2400     2400
```
Force-pushed c871729 to dc135e8
Force-pushed dc135e8 to b17dc86
fix: #21804
Description:
This PR fixes minor dtype inconsistencies and ensures GPU memory consistency when computing the Huber loss in keras.losses.Huber.
Background:
During internal testing (test_huber_memory_usage_debug_05), it was observed that GPU memory usage slightly differed when using delta=0.5, although the Huber loss logic should behave identically across delta values. The issue was not with the computation itself but with how tensors of different dtypes were handled internally before casting.
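A hypothetical NumPy analogue of that hazard: a constant created at the default float64 silently promotes a float32 tensor, doubling the memory of the intermediate (illustrative only; the actual Keras code path uses `ops.convert_to_tensor` and `ops.cast`):

```python
import numpy as np

x = np.random.rand(1000, 1).astype(np.float32)

const64 = np.array([0.5])               # defaults to float64
promoted = x * const64                  # intermediate silently promoted to float64
assert promoted.dtype == np.float64
assert promoted.nbytes == 2 * x.nbytes  # twice the memory of x

const32 = np.array([0.5], dtype=x.dtype)  # explicit dtype, as the fix does
stable = x * const32
assert stable.dtype == np.float32
```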
Fix:
Cast both the 0.5 constant and delta explicitly to abs_error.dtype before computing the loss, so all intermediate tensors share a single dtype.
Verification:
To verify, I manually ran:
`pytest keras/src/losses/losses_test.py::HuberLossTest::test_huber_memory_usage_debug_05 -v -s` and manually changed the delta value in the test each time (e.g., 0.5, 1.0, 10.0).
Verified that GPU memory usage remains stable across all delta values when running the test individually.
Result:
✅ Stable GPU memory usage across deltas