@DiegoTavares
Collaborator

When a frame is killed due to OOM, the thread that collects stats before reporting back can race the frame wrapping process and gather stats for the frame after some of its procs have already died, leading to an incorrect memory reading for that frame.

How the bug manifests:

For successful frames:

  1. Frame runs → processes accumulate memory → refresh_procs updates stats normally
  2. Frame completes naturally → all processes exit cleanly together
  3. Final stats are captured before the cache is cleared
  4. Memory reported correctly
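The happy path above can be sketched as a toy model. This is only an illustration, not the project's code: the `Proc` record, the function signatures, and the memory figures are assumptions, and only the names `refresh_procs` and `collect_proc_stats` come from the description.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class Proc:
    """Hypothetical process record; not the project's actual type."""
    pid: int
    rss_bytes: int
    state: str = "R"  # "R" = running, "Z" = zombie


def refresh_procs(live_procs):
    """Rebuild the per-session cache, skipping zombie processes."""
    return {p.pid: p for p in live_procs if p.state != "Z"}


def collect_proc_stats(session_processes):
    """Sum resident memory over every cached process in the session."""
    return sum(p.rss_bytes for p in session_processes.values())


# All processes are still alive when the final stats are captured,
# so the frame's memory is the sum over the whole session.
procs = [Proc(100, 1 * GIB), Proc(101, 5 * GIB), Proc(102, 6 * GIB)]
cache = refresh_procs(procs)
final_memory = collect_proc_stats(cache)
print(final_memory // GIB)
```

Because the cache still holds every process at collection time, the total reflects the frame's real usage.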

For killed frames (OOM):

  1. Frame detected using too much memory (e.g., 12GB actual usage)
  2. kill_session() is called → child processes start dying
  3. Next refresh_procs() cycle happens (this runs every report interval)
  4. session_processes.clear() wipes out all the cached process data including the high memory readings
  5. When rebuilding cache, zombie/dying processes are skipped
  6. Only the session leader remains (in zombie state or about to become one)
  7. collect_proc_stats() now reads only the session leader's memory (typically very small, just the shell wrapper)
  8. Massively underreported memory (e.g., reports 1GB instead of 12GB)
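The sequence above can be reproduced with a small simulation. It is a sketch under assumptions: the process table layout, the signatures, and the PIDs are made up for illustration, and the session leader is modelled as still alive (step 6 notes it may already be a zombie). Only the names `refresh_procs`, `collect_proc_stats`, `session_processes`, and `kill_session` come from the description.

```python
GIB = 1024 ** 3


def refresh_procs(session_processes, proc_table):
    """Clear the cache, then rebuild it from the process table, skipping zombies."""
    session_processes.clear()  # previous high readings are lost here
    for pid, (state, rss_bytes) in proc_table.items():
        if state == "Z":
            continue
        session_processes[pid] = rss_bytes


def collect_proc_stats(session_processes):
    """Sum memory over whatever the cache currently holds."""
    return sum(session_processes.values())


# Hypothetical session: pid 100 is the shell wrapper (session leader).
proc_table = {100: ("S", 1 * GIB), 101: ("R", 5 * GIB), 102: ("R", 6 * GIB)}
session_processes = {}

refresh_procs(session_processes, proc_table)
before_kill = collect_proc_stats(session_processes)

# Stand-in for kill_session(): the children die and linger as zombies.
proc_table[101] = ("Z", 0)
proc_table[102] = ("Z", 0)

# The next report cycle races the teardown and sees only the leader.
refresh_procs(session_processes, proc_table)
after_kill = collect_proc_stats(session_processes)

print(before_kill // GIB, after_kill // GIB)
```

In this model the reading taken after the kill covers only the session leader, which matches the 1GB versus 12GB example in the list.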

Collaborator

@ramonfigueiredo ramonfigueiredo left a comment

LGTM

@DiegoTavares DiegoTavares merged commit 084506c into AcademySoftwareFoundation:master Dec 4, 2025
20 checks passed