
Fix TensorBoard callback step counter never updating #22357

Draft

pctablet505 wants to merge 1 commit into keras-team:master from pctablet505:fix/20143-tensorboard-step-counter

Conversation

@pctablet505 (Collaborator)

Fixes: #20143

Problem

The TensorBoard callback was passing a plain Python int (self._global_train_batch) to writer.as_default(step) and record_if(should_record). Because writer.as_default in TF captures the step value at the moment of the call (not a reference), the step seen by TensorBoard never advanced past 0. Batch-level summaries were either always written (step 0 % N == 0) or the step axis in TensorBoard was frozen at 0, making batch-level curves useless.

Root Cause

_push_writer is called once at on_train_begin with the scalar 0. After that, self._global_train_batch is incremented on every batch, but the summary context still holds the original captured integer — not a live reference.
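The capture-by-value behavior can be illustrated without TensorFlow at all. In this sketch the class and attribute names are illustrative, not Keras internals: `CapturingWriter` models `writer.as_default(step=<int>)` snapshotting the step once, while `LazyWriter` models a step that is re-read on every summary write.

```python
class CapturingWriter:
    """Models writer.as_default(step=<int>): the step is snapshotted once."""
    def __init__(self, step):
        self.step = step

class LazyWriter:
    """Models a step source that is read each time a summary is written."""
    def __init__(self, read_step):
        self.read_step = read_step

counter = {"global_train_batch": 0}
captured = CapturingWriter(counter["global_train_batch"])  # snapshots 0
live = LazyWriter(lambda: counter["global_train_batch"])   # reads lazily

seen = []
for _ in range(3):
    counter["global_train_batch"] += 1  # what each batch begin does
    seen.append((captured.step, live.read_step()))
# captured.step never advances past 0; the lazy reader tracks 1, 2, 3
```

This is exactly the failure mode: the summary context holds the snapshotted `0` while the real counter moves on.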

Fix

Introduce a tf.Variable (_train_step_var / _test_step_var) at on_train_begin / on_test_begin and pass the variable to _push_writer. On every on_train_batch_begin / on_test_batch_begin, assign the current counter into that variable. Because writer.as_default accepts a variable and reads it lazily, TensorBoard now sees the correct step for every batch.
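A minimal sketch of the fixed flow, assuming TensorFlow is installed; the log directory and variable name here are placeholders, not the actual callback code. `tf.summary` reads a `tf.Variable` step at write time, so assignments made after entering `as_default` are visible.

```python
import tensorflow as tf

step_var = tf.Variable(0, dtype=tf.int64)  # plays the role of _train_step_var
writer = tf.summary.create_file_writer("/tmp/tb_demo")

# as_default holds the variable itself, so later assigns are visible
# to every subsequent summary write.
with writer.as_default(step=step_var):
    for batch in range(1, 4):
        step_var.assign(batch)  # done in on_train_batch_begin in the fix
        tf.summary.scalar("batch_loss", 1.0 / batch)  # recorded at step 1, 2, 3
```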

Files Changed

  • keras/src/callbacks/tensorboard.py — replace scalar step with tf.Variable, assign on each batch begin
    This pull request updates the TensorBoard callback in Keras to improve the way step variables are managed during training and testing. The main change is the introduction of TensorFlow tf.Variable objects to track step counts, enabling more robust and flexible summary writing, especially in graph execution mode.

Step variable management improvements:

  • Introduced tf.Variable objects (_train_step_var and _test_step_var) to track training and testing steps, replacing direct use of integer counters for better TensorFlow compatibility.
  • Updated _push_writer and related logic to use the new step variables instead of raw integers, ensuring summaries are written correctly in both eager and graph execution modes.

Synchronization of step variables:

  • Synchronized the new step variables with the global batch counters at the start of each train/test batch, keeping them consistent throughout the training/testing process.
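The synchronization might look roughly like this hypothetical sketch — method and attribute names mirror the PR description, but this is not the actual Keras source:

```python
import tensorflow as tf

class StepSyncSketch:
    """Hypothetical sketch of the step synchronization; not the real
    TensorBoard callback, just the shape of the bookkeeping."""

    def on_train_begin(self, logs=None):
        self._global_train_batch = 0
        self._train_step_var = tf.Variable(0, dtype=tf.int64)
        # The real callback would pass self._train_step_var to _push_writer here.

    def on_train_batch_begin(self, batch, logs=None):
        self._global_train_batch += 1
        # Keep the summary-step variable in lockstep with the counter.
        self._train_step_var.assign(self._global_train_batch)

cb = StepSyncSketch()
cb.on_train_begin()
for b in range(5):
    cb.on_train_batch_begin(b)
# the variable now reads 5, matching the global batch counter
```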

The _push_writer method captured the step argument by value (always 0)
from on_train_begin/on_test_begin. This caused two issues:
1. should_record() always returned True (0 % N == 0), defeating the
   update_freq gating mechanism
2. writer.as_default(step=0) set the default step for custom user
   summaries to 0 permanently, so all custom summaries overwrote
   each other at step 0

Use tf.Variable for the step counter instead of a plain integer. The
variable is updated on each batch begin, and since writer.as_default()
reads the variable's current value, both the recording gate and custom
summary default step now reflect the actual batch number.
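The update_freq gating failure described above can be reproduced in plain Python (no TensorFlow needed). The `update_freq` value and the counter plumbing are illustrative; the point is the difference between a gate that snapshots the step and one that reads it live:

```python
update_freq = 5

def make_gate(get_step):
    # Mirrors record_if(should_record): record only every update_freq steps.
    def should_record():
        return get_step() % update_freq == 0
    return should_record

# Buggy variant: the step is snapshotted once at "training begin".
frozen_step = 0
buggy_gate = make_gate(lambda: frozen_step)

# Fixed variant: the gate reads a live counter, like a tf.Variable.
state = {"step": 0}
fixed_gate = make_gate(lambda: state["step"])

buggy_hits, fixed_hits = [], []
for batch in range(1, 11):
    state["step"] = batch
    if buggy_gate():
        buggy_hits.append(batch)  # fires every batch: 0 % 5 == 0 is always True
    if fixed_gate():
        fixed_hits.append(batch)  # fires only at batches 5 and 10
```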

Fixes keras-team#20143
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a significant bug in the Keras TensorBoard callback where batch-level summaries failed to update due to the step counter being captured as a static integer. By introducing TensorFlow tf.Variables to dynamically track training and testing steps, this change ensures that TensorBoard accurately reflects the progress of batch-level metrics, providing correct and useful visualizations for model development.

Highlights

  • TensorBoard Step Counter Fix: Resolved a critical issue where the TensorBoard callback's step counter was not updating correctly, causing batch-level summaries to display frozen or inaccurate data.
  • Introduction of tf.Variable for Step Tracking: Replaced static Python integer counters with dynamic tf.Variable objects (_train_step_var, _test_step_var) to accurately track training and testing steps, leveraging TensorFlow's lazy evaluation for correct summary writing.
  • Synchronized Step Variable Updates: Implemented logic to assign the current global batch counts to the new tf.Variables at the beginning of each training and testing batch, ensuring that TensorBoard always receives the most up-to-date step value.


Changelog
  • keras/src/callbacks/tensorboard.py
    • Updated the _push_writer method to accept a tf.Variable for step tracking instead of a plain integer.
    • Initialized _train_step_var and _test_step_var as tf.Variables in the on_train_begin and on_test_begin methods, respectively.
    • Added assignment operations in on_train_batch_begin and on_test_batch_begin to update the tf.Variables with the current global batch counts.

@gemini-code-assist (bot) left a comment
Code Review

This pull request correctly addresses the issue of the TensorBoard step counter not updating for batch-level summaries by replacing the Python int with a tf.Variable. This ensures that the step value is passed by reference and updated correctly. The changes are logical and well-implemented. I have added a couple of minor suggestions to improve code maintainability by de-duplicating tensorflow imports.

Referenced code (keras/src/callbacks/tensorboard.py):

    def on_train_begin(self, logs=None):
        import tensorflow as tf

Severity: medium

This import tensorflow as tf statement is also present in on_test_begin. To avoid duplication and improve maintainability, consider importing tensorflow once at the top of the file using the Keras-idiomatic lazy loader:

    # At the top of keras/src/callbacks/tensorboard.py
    from keras.src.utils.module_utils import tensorflow as tf

This would allow you to remove the local imports from both on_train_begin and on_test_begin.


Referenced code (keras/src/callbacks/tensorboard.py):

    def on_test_begin(self, logs=None):
        self._push_writer(self._val_writer, self._global_test_batch)
        import tensorflow as tf
Severity: medium

This import tensorflow as tf is a duplicate of the one in on_train_begin. As suggested in the other comment, this can be de-duplicated by moving the import to the top of the file for better code maintainability.


codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.95%. Comparing base (95e74a9) to head (4f115a5).

Files with missing lines | Patch % | Lines
keras/src/callbacks/tensorboard.py | 83.33% | 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #22357      +/-   ##
==========================================
- Coverage   82.95%   82.95%   -0.01%     
==========================================
  Files         595      595              
  Lines       66040    66048       +8     
  Branches    10305    10307       +2     
==========================================
+ Hits        54785    54791       +6     
  Misses       8639     8639              
- Partials     2616     2618       +2     
Flag Coverage Δ
keras 82.78% <83.33%> (-0.01%) ⬇️
keras-jax 60.84% <83.33%> (+<0.01%) ⬆️
keras-numpy 55.02% <8.33%> (-0.01%) ⬇️
keras-openvino 49.10% <8.33%> (-0.01%) ⬇️
keras-tensorflow 62.06% <83.33%> (+<0.01%) ⬆️
keras-torch 60.87% <83.33%> (+<0.01%) ⬆️
