Fix TensorBoard callback step counter never updating#22357
Fix TensorBoard callback step counter never updating#22357pctablet505 wants to merge 1 commit intokeras-team:masterfrom
Conversation
The _push_writer method captured the step argument by value (always 0) from on_train_begin/on_test_begin. This caused two issues: 1. should_record() always returned True (0 % N == 0), defeating the update_freq gating mechanism 2. writer.as_default(step=0) set the default step for custom user summaries to 0 permanently, so all custom summaries overwrote each other at step 0 Use tf.Variable for the step counter instead of a plain integer. The variable is updated on each batch begin, and since writer.as_default() reads the variable's current value, both the recording gate and custom summary default step now reflect the actual batch number. Fixes keras-team#20143
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a significant bug in the Keras TensorBoard callback where batch-level summaries failed to update due to the step counter being captured as a static integer. By introducing TensorFlow Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly addresses the issue of the TensorBoard step counter not updating for batch-level summaries by replacing the Python int with a tf.Variable. This ensures that the step value is passed by reference and updated correctly. The changes are logical and well-implemented. I have added a couple of minor suggestions to improve code maintainability by de-duplicating tensorflow imports.
| ) | ||
|
|
||
| def on_train_begin(self, logs=None): | ||
| import tensorflow as tf |
There was a problem hiding this comment.
This import tensorflow as tf statement is also present in on_test_begin. To avoid duplication and improve maintainability, consider importing tensorflow once at the top of the file using the Keras-idiomatic lazy loader:
# At the top of keras/src/callbacks/tensorboard.py
from keras.src.utils.module_utils import tensorflow as tfThis would allow you to remove the local imports from both on_train_begin and on_test_begin.
|
|
||
| def on_test_begin(self, logs=None): | ||
| self._push_writer(self._val_writer, self._global_test_batch) | ||
| import tensorflow as tf |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #22357 +/- ##
==========================================
- Coverage 82.95% 82.95% -0.01%
==========================================
Files 595 595
Lines 66040 66048 +8
Branches 10305 10307 +2
==========================================
+ Hits 54785 54791 +6
Misses 8639 8639
- Partials 2616 2618 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Fixes: #20143
Problem
The TensorBoard callback was passing a plain Python
int(self._global_train_batch) towriter.as_default(step)andrecord_if(should_record). Becausewriter.as_defaultin TF captures the step value at the moment of the call (not a reference), the step seen by TensorBoard never advanced past 0. Batch-level summaries were either always written (step 0 % N == 0) or the step axis in TensorBoard was frozen at 0, making batch-level curves useless.Root Cause
_push_writeris called once aton_train_beginwith the scalar0. After that,self._global_train_batchis incremented on every batch, but the summary context still holds the original captured integer — not a live reference.Fix
Introduce a
tf.Variable(_train_step_var/_test_step_var) aton_train_begin/on_test_beginand pass the variable to_push_writer. On everyon_train_batch_begin/on_test_batch_begin, assign the current counter into that variable. Becausewriter.as_defaultaccepts a variable and reads it lazily, TensorBoard now sees the correct step for every batch.Files Changed
keras/src/callbacks/tensorboard.py— replace scalar step withtf.Variable, assign on each batch beginThis pull request updates the TensorBoard callback in Keras to improve the way step variables are managed during training and testing. The main change is the introduction of TensorFlow
tf.Variableobjects to track step counts, enabling more robust and flexible summary writing, especially in graph execution mode.Step variable management improvements:
tf.Variableobjects (_train_step_varand_test_step_var) to track training and testing steps, replacing direct use of integer counters for better TensorFlow compatibility. [1] [2]_push_writerand related logic to use the new step variables instead of raw integers, ensuring summaries are written correctly in both eager and graph execution modes.Synchronization of step variables: