fix data-store memory problem #7206

Merged
wxtim merged 12 commits into cylc:8.6.x from dwsutherland:ds-mem-fix-7199
Feb 18, 2026
Conversation

@dwsutherland dwsutherland (Member) commented Feb 5, 2026

closes #7199

before
image

after
image

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Code covered by existing tests.
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@dwsutherland dwsutherland added this to the 8.6.3 milestone Feb 5, 2026
@dwsutherland dwsutherland self-assigned this Feb 5, 2026
@dwsutherland dwsutherland mentioned this pull request Feb 5, 2026
@oliver-sanders oliver-sanders added the efficiency For notable efficiency improvements label Feb 9, 2026
@oliver-sanders oliver-sanders (Member) left a comment
Cheers, will have a go at running with the original example.

Did you have to apply some diff to the profiler plugin to reveal the issue?

@dwsutherland dwsutherland (Member, Author) commented Feb 9, 2026

Did you have to apply some diff to the profiler plugin to reveal the issue?

Not really; the changes were very specific, so I didn't think of a general use case.

Sorry, yes, I had to modify the profiler plugin:

sutherlander@cortex-hyper:flow$ git diff main_loop/log_data_store.py

```diff
diff --git a/cylc/flow/main_loop/log_data_store.py b/cylc/flow/main_loop/log_data_store.py
index ec9fc8624..3d731d2fc 100644
--- a/cylc/flow/main_loop/log_data_store.py
+++ b/cylc/flow/main_loop/log_data_store.py
@@ -38,8 +38,21 @@ try:
 except ModuleNotFoundError:
     PLT = False
 
-from pympler.asizeof import asized
-
+from pympler.asizeof import asized, asizeof, Asizer
+
+#    'publish_deltas',
+#    'n_window_nodes',
+#    'n_window_edges',
+#    'n_window_node_walks',
+#    'n_window_completed_walks',
+#    'n_window_depths',
+#    'all_n_window_nodes',
+STORE_OTHER = {
+    'family_pruned_ids',
+    'prune_trigger_nodes',
+    'prune_flagged_nodes',
+    'pruned_task_proxies',
+}
 
 @startup
 async def init(scheduler, state):
@@ -51,19 +64,54 @@ async def init(scheduler, state):
         state['objects'][key] = []
         state['size'][key] = []
 
+    for attr_name in STORE_OTHER:
+        state['objects'][attr_name] = []
+        state['size'][attr_name] = []
+
+    #state['objects']['schd.config'] = []
+    #state['size']['schd.config'] = []
+
+    state['objects']['schd'] = []
+    state['size']['schd'] = []
+
+    state['objects']['dsmgr'] = []
+    state['size']['dsmgr'] = []
+
 
 @periodic
 async def log_data_store(scheduler, state):
     """Count the number of objects and the data store size."""
     state['times'].append(time())
-    for key, value in _iter_data_store(scheduler.data_store_mgr.data):
+    ds = scheduler.data_store_mgr
+    for key, value in _iter_data_store(ds.data):
         state['objects'][key].append(
             len(value)
         )
         state['size'][key].append(
-            asized(value).size
+            asizeof(value)
         )
 
+    for attr_name in STORE_OTHER:
+        attr_value = getattr(ds, attr_name)
+        state['objects'][attr_name].append(
+            len(attr_value)
+        )
+        state['size'][attr_name].append(
+            asizeof(attr_value)
+        )
+
+    #state['objects']['schd.config'].append(1)
+    #state['size']['schd.config'].append(asizeof(scheduler.config))
+    asizer = Asizer()
+    asizer.exclude_refs(scheduler.data_store_mgr)
+    state['objects']['schd'].append(1)
+    state['size']['schd'].append(asizer.asizeof(scheduler))
+
+    asizer = Asizer()
+    asizer.exclude_refs(scheduler)
+    state['objects']['dsmgr'].append(1)
+    state['size']['dsmgr'].append(asizer.asizeof(scheduler.data_store_mgr))
+
 
 @shutdown
 async def report(scheduler, state):
@@ -75,7 +123,9 @@ async def report(scheduler, state):
 def _iter_data_store(data_store):
     for item in data_store.values():
         for key, value in item.items():
-            if key != 'workflow':
+            if key == 'workflow':
+                yield (key, [value])
+            else:
                 yield (key, value)
         # there should only be one workflow in the data store
         break
```

But I think it's probably too specific to make permanent... although I could add a cut-down version of it as a commit here.
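As an aside, the core trick in the diff above is pympler's `Asizer.exclude_refs`, which sizes one object graph while excluding another, so the scheduler and the data-store manager can be measured separately even though they reference each other. A stdlib-only sketch of the same idea (the `deep_size` helper is hypothetical, not part of cylc or pympler):

```python
import sys


def deep_size(obj, exclude=(), _seen=None):
    """Roughly sum sys.getsizeof over an object graph, skipping anything
    in `exclude` -- the idea behind Asizer.exclude_refs."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen or any(obj is ex for ex in exclude):
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_size(key, exclude, _seen)
            size += deep_size(value, exclude, _seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_size(item, exclude, _seen)
    return size


big = list(range(10_000))
holder = {'big': big, 'small': 'x'}
# Excluding `big` mirrors asizer.exclude_refs(scheduler.data_store_mgr):
# the container is sized without the excluded sub-graph.
assert deep_size(holder, exclude=(big,)) < deep_size(holder)
```

This is only a rough recursive walk; pympler handles `__dict__`, slots, and shared references far more accurately.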

@dwsutherland dwsutherland (Member, Author) commented Feb 10, 2026

Have added a version of the plugin modifications used in leak detection:
image

No idea why some tests are failing (functionality hasn't changed); will look.

Ah, it's a test of the memory profiling.

@dpmatthews dpmatthews (Contributor)
I've tested this against my example workflow and against a less cut down version of the same workflow and can confirm this fixes the leak - thanks.

@oliver-sanders oliver-sanders (Member) left a comment
Profiling results look much better!

before:

Image

after:

Image

There's still quite an upward slope to prune_trigger_nodes and n_window_node_walks, but I'm not sure if that's a smaller leak, or just a natural increase due to the character of the workflow's graph.


Thanks for working on the plugin, it's a big help to get these diffs back into the project so we can reproduce the results. I've opened a PR to work on this a bit further (above graphs generated with this diff): dwsutherland#27


The code seems reasonable, but I'm having a bit of difficulty working out the purpose of the different attributes to understand when they should be housekept:

@dwsutherland dwsutherland (Member, Author)
There's still quite an upward slope

It tapers off if run longer:
image

@dwsutherland dwsutherland (Member, Author)
I've updated the plotting to report on sets, and to declutter by reporting, in addition to the data, only those attributes whose size (at any sample) exceeds 2% (by default) of the max size recorded:
image

Because it would get cluttered very quickly with a fixed 2 kB threshold.
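The decluttering rule described above can be sketched as follows (`filter_series` and the sample data are illustrative, not the actual plugin code):

```python
def filter_series(sizes, threshold=0.02):
    """Keep a series only if any of its samples exceeds `threshold`
    (default 2%) of the largest sample recorded across all series."""
    overall_max = max(max(s) for s in sizes.values() if s)
    cutoff = overall_max * threshold
    return {k: s for k, s in sizes.items() if any(v > cutoff for v in s)}


# Hypothetical size samples (bytes) for three attributes over time:
sizes = {'data': [100, 5000], 'tiny': [10, 20], 'walks': [300, 900]}
kept = filter_series(sizes)
# 'tiny' never exceeds 2% of 5000 (= 100 bytes), so it is dropped.
assert set(kept) == {'data', 'walks'}
```

A relative cutoff scales with the workflow, whereas a fixed byte threshold would either hide real growth in small workflows or clutter the plot in large ones.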

The code seems reasonable, but I'm having a bit of difficulty working out the purpose of the different attributes to understand when they should be housekept

Well, the module doc says:

Pruning of data-store elements is done using the collection/set of nodes
generated at the boundary of an active node's graph walk and registering active
node's parents against them. Once active, these boundary nodes act as the prune
triggers for the associated parent nodes. Set operations are used to do a diff
between the nodes of active paths (paths whose node is in n=0)
and the nodes of flagged paths (whose boundary node(s) have become active).

I've added a little more

This method is used to avoid "blinking", where a task becomes non-active and is then removed (along with its window/walk) before a descendant is added, causing it to disappear then reappear in the store (and, hence, UIs).
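The set operations described in that doc string can be illustrated with a minimal sketch (these node sets are invented for illustration; none of the names are cylc's own):

```python
# Nodes on active paths (paths whose node is in the n=0 window):
active_path_nodes = {'a', 'b', 'c', 'd'}
# Nodes on flagged paths (whose boundary node(s) have become active):
flagged_path_nodes = {'c', 'd', 'e', 'f'}

# A diff between the two gives the nodes safe to prune: anything still
# on an active path is kept, which is what prevents "blinking".
prune_candidates = flagged_path_nodes - active_path_nodes
assert prune_candidates == {'e', 'f'}
```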

@dwsutherland dwsutherland (Member, Author) commented Feb 12, 2026

to understand when they should be housekept

Also, I believe the reason for prune_trigger_nodes building up in the first place is that graphs with paths not taken may have boundary nodes that are never reached (which is why the memory problem only presented in some workflows)... I'll comment to this effect above the fix.

Comment on lines +2041 to +2051

````python
# Clear any boundary prune triggers not in the window.
# This can happen where the graph has paths not taken, i.e.:
# ```
# foo => a
# foo:failed => b
# ```
# So if `foo` then `a`, which when active/removed is the prune trigger
# for `foo`.. However, `b` is not used so delete the trigger here.
for trigger_id in set(
        self.prune_trigger_nodes).difference(self.all_n_window_nodes):
    del self.prune_trigger_nodes[trigger_id]
````
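The pattern in the fix, dropping mapping entries whose keys have left a window set, can be shown standalone (the names mirror prune_trigger_nodes and all_n_window_nodes from the PR, but the data here is invented):

```python
# Triggers registered for boundary nodes, some of which the graph walk
# never reached (e.g. the `foo:failed => b` path that was not taken):
prune_trigger_nodes = {'a': ['p1'], 'b': ['p2'], 'stale': ['p3']}
all_n_window_nodes = {'a', 'b'}

# Iterate over a snapshot of the keys (set(...) copies them) so we can
# safely delete from the dict while looping.
for trigger_id in set(prune_trigger_nodes).difference(all_n_window_nodes):
    del prune_trigger_nodes[trigger_id]

assert sorted(prune_trigger_nodes) == ['a', 'b']
```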
@dwsutherland dwsutherland (Member, Author)

Here's the explanation for why this fix was needed.

@oliver-sanders oliver-sanders (Member)

Makes sense!

@oliver-sanders oliver-sanders requested review from wxtim and removed request for dpmatthews February 16, 2026 16:35
Co-authored-by: Tim Pillinger <26465611+wxtim@users.noreply.github.com>
@wxtim wxtim merged commit 394001f into cylc:8.6.x Feb 18, 2026
23 checks passed

Labels

efficiency (For notable efficiency improvements), small

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants