
Add logging for dataset input and checkpoint saving #1190


Open

linxiulei wants to merge 2 commits into main

Conversation

@linxiulei commented May 18, 2025

This PR adds:

  • A logging utility class that logs a warning when an operation's execution time exceeds a specified threshold.
  • Use of that utility class around dataset input fetches (per iteration) and checkpoint saving, so we can observe executions with poor performance (a rough sketch of the idea follows below).
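
A minimal sketch of the idea (illustrative only: the class name LoggingContext matches the PR, but the body below is a simplified assumption, not the PR's actual implementation, and omits the timeout behavior discussed later in this thread):

import logging
import time


class LoggingContext:
    """Sketch: logs a warning when the wrapped operation takes longer than `threshold` seconds."""

    def __init__(self, name: str, *, threshold: float):
        self._name = name
        self._threshold = threshold
        self._start = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self._start
        if elapsed > self._threshold:
            logging.warning(
                "%s took %.1fs, exceeding the %.1fs threshold.",
                self._name, elapsed, self._threshold,
            )
        return False  # Do not suppress exceptions from the wrapped op.


# Example usage around checkpoint saving (mirrors the diff below):
# with LoggingContext("save_checkpoint", threshold=120):
#     self.save_checkpoint(evaler_summaries=evaler_summaries)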

@linxiulei requested review from ruomingp, markblee and a team as code owners May 18, 2025 20:15
@@ -1091,8 +1097,9 @@ def _run_step(
train_summaries=outputs["summaries"], force_runs=force_run_evals
)

# Checkpointer policy will decide if we should save.
self.save_checkpoint(evaler_summaries=evaler_summaries)
with LoggingContext("save_checkpoint", timeout=180, threshold=120):
Contributor

Can we make this configurable?

Author

Sure, but I am wondering which way is better to make it configurable: an env var or flags?

Contributor

The typical convention we use is that if we want to make the child configurable, then the parent config should contain a config for the child.
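
For illustration of that convention (plain dataclasses here, not axlearn's actual config API; all names are hypothetical): the parent exposes a config field for the child, so callers tune the child through the parent rather than through env vars or flags.

from dataclasses import dataclass, field


@dataclass
class LoggingContextConfig:
    # Hypothetical child config for the logging utility.
    timeout: float = 180.0    # Seconds before warning that the op is still running.
    threshold: float = 120.0  # Seconds after which a completed op is logged as slow.


@dataclass
class TrainerConfig:
    # Hypothetical parent config embedding the child's config.
    checkpoint_logging: LoggingContextConfig = field(default_factory=LoggingContextConfig)


cfg = TrainerConfig()
cfg.checkpoint_logging.threshold = 60.0  # Configure the child via the parent.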

import time


class LoggingContext:
Contributor

Could you explain how this differs from the other monitor functionality that already exists, like GPUMonitor, etc.? If they serve overlapping purposes, should they be unified?

Author

Can you please clarify what other monitor functionality we already have? If they overlap, we can unify them.

Contributor

TfSummaryMonitor, GPUMonitor, GoodputMonitor. I might be forgetting some additional ones.

Author

Thanks, I checked that functionality and believe there is no overlap: this PR is mostly for logging specific operations' execution time (the utility lets you monitor any operation you want), whereas the monitors you mentioned serve specific purposes (goodput, GPU hardware status, etc.).

self.save_checkpoint(evaler_summaries=evaler_summaries)
with LoggingContext("save_checkpoint", timeout=180, threshold=120):
    # Checkpointer policy will decide if we should save.
    self.save_checkpoint(evaler_summaries=evaler_summaries)
Contributor

Since checkpoint syncing is asynchronous, do we expect to see timeouts inside this block?

Author

Yes, IIUC checkpoint saving is asynchronous only if the previous save has completed; otherwise it may get stuck in https://github.com/apple/axlearn/blob/main/axlearn/common/array_serialization.py#L554

@@ -581,7 +582,12 @@ def run(
output = None
stop_trace_step = None

input_iterator = self.input.batches(self._input_iter)
input_iterator = LoggingIterator(
Contributor

Since it is already recorded in the Goodput measurement, I wonder if we need separate logging here.

If I am reading it correctly, this LoggingIterator creates one timer thread every time next() is called, which seems to add a lot of thread creation/cancellation overhead to the main process.

Author

Assuming you mean this code:

                    self._maybe_record_event(measurement.Event.START_DATA_LOADING)
                    try:
                        input_batch = next(input_iterator)
                        self._maybe_record_event(measurement.Event.END_DATA_LOADING)

If the next() somehow hangs and then the whole process aborts, would Goodput measurement help catch that?

My understanding is that this iterator invokes next() every few seconds, depending on the training scale, so creating and cleaning up a thread every few seconds should add only minimal overhead (empirically, creating one thread takes 10 to 100 µs).

However, if you still have concerns about the overhead, we could create a reusable timer thread.
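
For reference, a thread-free variant of the iterator wrapper (an illustrative sketch with a hypothetical name, not the PR's LoggingIterator): it measures only the completed next() call, so it avoids per-iteration thread creation but cannot warn while a fetch is still hanging.

import logging
import time
from typing import Generic, Iterator, TypeVar

T = TypeVar("T")


class ThresholdLoggingIterator(Generic[T]):
    """Sketch: wraps an iterator and logs next() calls slower than `threshold` seconds."""

    def __init__(self, it: Iterator[T], *, name: str, threshold: float):
        self._it = it
        self._name = name
        self._threshold = threshold

    def __iter__(self):
        return self

    def __next__(self) -> T:
        start = time.perf_counter()
        batch = next(self._it)  # StopIteration propagates normally.
        elapsed = time.perf_counter() - start
        if elapsed > self._threshold:
            logging.warning(
                "%s: next() took %.1fs (threshold %.1fs).",
                self._name, elapsed, self._threshold,
            )
        return batch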

Comment on lines 11 to 15
"""A context manager for monitoring the execution time of operations and
logging warnings based on defined timeouts and thresholds.

This context manager can be used to:
- Log a warning if an operation exceeds a specified timeout while still
running.
- Log a warning if an operation's total execution time exceeds a specified
threshold.
Contributor

It's not clear that we need logging-upon-timeout, given the threading concerns raised by @Ethanlm. I wonder if a simpler semantic would be sufficient:

  • LoggingContext always waits until the op finishes and logs a warning if the execution time exceeds the threshold;
  • We rely on the watchdog
    watchdog_timeout_seconds: Optional[float] = None
    to tell us where the threads are stuck.

WDYT?

Author

I agree with starting with something simpler, so I changed this PR to only implement logging upon op completion.

However, I feel logging-upon-timeout is important because program hangs are difficult to debug by digging through the tracebacks that the watchdog timeout prints for so many workers. A clear log message may help identify the workers that cause the hangs.
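
To illustrate the trade-off being discussed (a hedged sketch with a hypothetical name, not this PR's implementation): warning upon timeout requires something like a timer thread that fires while the op is still running, which is exactly the thread-management cost raised above, whereas logging only upon completion needs no extra thread.

import logging
import threading


class TimeoutWarningContext:
    """Sketch: warns while the wrapped op is *still running* after `timeout` seconds."""

    def __init__(self, name: str, *, timeout: float):
        # One timer thread per context; it fires only if the op outlives `timeout`.
        self._timer = threading.Timer(
            timeout,
            lambda: logging.warning("%s still running after %.1fs.", name, timeout),
        )

    def __enter__(self):
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._timer.cancel()  # Op finished (or raised); suppress the timeout warning.
        return False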

linxiulei added 2 commits May 22, 2025 15:58
Introduce logging utilities designed to proactively identify and flag operations exceeding performance expectations.