fix data-store memory problem #7206

Merged
wxtim merged 12 commits into cylc:8.6.x from dwsutherland:ds-mem-fix-7199
Feb 18, 2026
Conversation

@dwsutherland dwsutherland (Member) commented Feb 5, 2026

closes #7199

before
image

after
image

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Code covered by existing tests.
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@dwsutherland dwsutherland added this to the 8.6.3 milestone Feb 5, 2026
@dwsutherland dwsutherland self-assigned this Feb 5, 2026
@dwsutherland dwsutherland mentioned this pull request Feb 5, 2026
@oliver-sanders oliver-sanders added the efficiency For notable efficiency improvements label Feb 9, 2026
@oliver-sanders oliver-sanders (Member) left a comment
Cheers, will have a go at running with the original example.

Did you have to apply some diff to the profiler plugin to reveal the issue?

@dwsutherland dwsutherland (Member, Author) commented Feb 9, 2026

Did you have to apply some diff to the profiler plugin to reveal the issue?

Not really; the changes were very specific, so I didn't think of a general use case.

Sorry, yes, I had to modify the profiler plugin:

sutherlander@cortex-hyper:flow$ git diff main_loop/log_data_store.py

```diff
diff --git a/cylc/flow/main_loop/log_data_store.py b/cylc/flow/main_loop/log_data_store.py
index ec9fc8624..3d731d2fc 100644
--- a/cylc/flow/main_loop/log_data_store.py
+++ b/cylc/flow/main_loop/log_data_store.py
@@ -38,8 +38,21 @@ try:
 except ModuleNotFoundError:
     PLT = False
 
-from pympler.asizeof import asized
-
+from pympler.asizeof import asized, asizeof, Asizer
+
+#    'publish_deltas',
+#    'n_window_nodes',
+#    'n_window_edges',
+#    'n_window_node_walks',
+#    'n_window_completed_walks',
+#    'n_window_depths',
+#    'all_n_window_nodes',
+STORE_OTHER = {
+    'family_pruned_ids',
+    'prune_trigger_nodes',
+    'prune_flagged_nodes',
+    'pruned_task_proxies',
+}
 
 @startup
 async def init(scheduler, state):
@@ -51,19 +64,54 @@ async def init(scheduler, state):
         state['objects'][key] = []
         state['size'][key] = []
 
+    for attr_name in STORE_OTHER:
+        state['objects'][attr_name] = []
+        state['size'][attr_name] = []
+
+    #state['objects']['schd.config'] = []
+    #state['size']['schd.config'] = []
+
+    state['objects']['schd'] = []
+    state['size']['schd'] = []
+
+    state['objects']['dsmgr'] = []
+    state['size']['dsmgr'] = []
+
 
 @periodic
 async def log_data_store(scheduler, state):
     """Count the number of objects and the data store size."""
     state['times'].append(time())
-    for key, value in _iter_data_store(scheduler.data_store_mgr.data):
+    ds = scheduler.data_store_mgr
+    for key, value in _iter_data_store(ds.data):
         state['objects'][key].append(
             len(value)
         )
         state['size'][key].append(
-            asized(value).size
+            asizeof(value)
         )
 
+    for attr_name in STORE_OTHER:
+        attr_value = getattr(ds, attr_name)
+        state['objects'][attr_name].append(
+            len(attr_value)
+        )
+        state['size'][attr_name].append(
+            asizeof(attr_value)
+        )
+
+    #state['objects']['schd.config'].append(1)
+    #state['size']['schd.config'].append(asizeof(scheduler.config))
+    asizer = Asizer()
+    asizer.exclude_refs(scheduler.data_store_mgr)
+    state['objects']['schd'].append(1)
+    state['size']['schd'].append(asizer.asizeof(scheduler))
+
+    asizer = Asizer()
+    asizer.exclude_refs(scheduler)
+    state['objects']['dsmgr'].append(1)
+    state['size']['dsmgr'].append(asizer.asizeof(scheduler.data_store_mgr))
+
 
 @shutdown
 async def report(scheduler, state):
@@ -75,7 +123,9 @@ async def report(scheduler, state):
 def _iter_data_store(data_store):
     for item in data_store.values():
         for key, value in item.items():
-            if key != 'workflow':
+            if key == 'workflow':
+                yield (key, [value])
+            else:
                 yield (key, value)
         # there should only be one workflow in the data store
         break
```

But I think it's probably too specific to make permanent... although I could add a cut-down version of it as a commit here.
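As an aside, the core trick in the diff above is pympler's `Asizer.exclude_refs`, which sizes one object graph while excluding another, so the scheduler and the data-store manager can be measured separately even though they reference each other. A stdlib-only sketch of the same idea (the `deep_size` helper is hypothetical, not part of cylc or pympler):

```python
import sys


def deep_size(obj, exclude=(), _seen=None):
    """Roughly sum sys.getsizeof over an object graph, skipping anything
    in `exclude` -- the idea behind Asizer.exclude_refs."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen or any(obj is ex for ex in exclude):
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_size(key, exclude, _seen)
            size += deep_size(value, exclude, _seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_size(item, exclude, _seen)
    return size


big = list(range(10_000))
holder = {'big': big, 'small': 'x'}
# Excluding `big` mirrors asizer.exclude_refs(scheduler.data_store_mgr):
# the container is sized without the excluded sub-graph.
assert deep_size(holder, exclude=(big,)) < deep_size(holder)
```

This is only a rough recursive walk; pympler handles `__dict__`, slots, and shared references far more accurately.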

@dwsutherland dwsutherland (Member, Author) commented Feb 10, 2026

Have added a version of the plugin modifications used in leak detection:
image

No idea why some tests are failing (functionality hasn't changed); will look.

Ah, it's a test of the memory profiling.

@dpmatthews dpmatthews (Contributor)
I've tested this against my example workflow and against a less cut down version of the same workflow and can confirm this fixes the leak - thanks.

@oliver-sanders oliver-sanders (Member) left a comment
Profiling results look much better!

before:

Image

after:

Image

There's still quite an upward slope to prune_trigger_nodes and n_window_node_walks, but I'm not sure if that's a smaller leak, or just a natural increase due to the character of the workflow's graph.


Thanks for working on the plugin, it's a big help to get these diffs back into the project so we can reproduce the results. I've opened a PR to work on this a bit further (above graphs generated with this diff): dwsutherland#27


The code seems reasonable, but I'm having a bit of difficulty working out the purpose of the different attributes to understand when they should be housekept:

@dwsutherland dwsutherland (Member, Author)
There's still quite an upward slope

It tapers off if run longer:
image

@dwsutherland dwsutherland (Member, Author)
I've updated the plotting to report on sets, and to declutter by reporting, in addition to the data, only those attributes whose size (at any sample) exceeds 2% (by default) of the max size recorded:
image

Because it would get cluttered very quickly with a fixed 2 kB threshold.
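The decluttering rule described above can be sketched as follows (`filter_series` and the sample data are illustrative, not the actual plugin code):

```python
def filter_series(sizes, threshold=0.02):
    """Keep a series only if any of its samples exceeds `threshold`
    (default 2%) of the largest sample recorded across all series."""
    overall_max = max(max(s) for s in sizes.values() if s)
    cutoff = overall_max * threshold
    return {k: s for k, s in sizes.items() if any(v > cutoff for v in s)}


# Hypothetical size samples (bytes) for three attributes over time:
sizes = {'data': [100, 5000], 'tiny': [10, 20], 'walks': [300, 900]}
kept = filter_series(sizes)
# 'tiny' never exceeds 2% of 5000 (= 100 bytes), so it is dropped.
assert set(kept) == {'data', 'walks'}
```

A relative cutoff scales with the workflow, whereas a fixed byte threshold would either hide real growth in small workflows or clutter the plot in large ones.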

The code seems reasonable, but I'm having a bit of difficulty working out the purpose of the different attributes to understand when they should be housekept

Well, the module doc says:

Pruning of data-store elements is done using the collection/set of nodes
generated at the boundary of an active node's graph walk and registering active
node's parents against them. Once active, these boundary nodes act as the prune
triggers for the associated parent nodes. Set operations are used to do a diff
between the nodes of active paths (paths whose node is in n=0)
and the nodes of flagged paths (whose boundary node(s) have become active).

I've added a little more

This method is used to avoid "blinking", where a task becomes non-active and is then removed (along with its window/walk) before a descendant is added, causing it to disappear then reappear in the store (and, hence, UIs).
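The set operations described in that doc string can be illustrated with a minimal sketch (these node sets are invented for illustration; none of the names are cylc's own):

```python
# Nodes on active paths (paths whose node is in the n=0 window):
active_path_nodes = {'a', 'b', 'c', 'd'}
# Nodes on flagged paths (whose boundary node(s) have become active):
flagged_path_nodes = {'c', 'd', 'e', 'f'}

# A diff between the two gives the nodes safe to prune: anything still
# on an active path is kept, which is what prevents "blinking".
prune_candidates = flagged_path_nodes - active_path_nodes
assert prune_candidates == {'e', 'f'}
```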

@dwsutherland dwsutherland (Member, Author) commented Feb 12, 2026

to understand when they should be housekept

Also, I believe the reason for prune_trigger_nodes building up in the first place is that graphs with paths not taken may have boundary nodes that are never reached (which is why the memory problem only presented in some workflows)... I'll comment to this effect above the fix.

Comment on lines +2041 to +2051

````python
# Clear any boundary prune triggers not in the window.
# This can happen where the graph has paths not taken, i.e.:
# ```
# foo => a
# foo:failed => b
# ```
# So if `foo` then `a`, which when active/removed is the prune trigger
# for `foo`.. However, `b` is not used so delete the trigger here.
for trigger_id in set(
        self.prune_trigger_nodes).difference(self.all_n_window_nodes):
    del self.prune_trigger_nodes[trigger_id]
````
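The pattern in the fix, dropping mapping entries whose keys have left a window set, can be shown standalone (the names mirror prune_trigger_nodes and all_n_window_nodes from the PR, but the data here is invented):

```python
# Triggers registered for boundary nodes, some of which the graph walk
# never reached (e.g. the `foo:failed => b` path that was not taken):
prune_trigger_nodes = {'a': ['p1'], 'b': ['p2'], 'stale': ['p3']}
all_n_window_nodes = {'a', 'b'}

# Iterate over a snapshot of the keys (set(...) copies them) so we can
# safely delete from the dict while looping.
for trigger_id in set(prune_trigger_nodes).difference(all_n_window_nodes):
    del prune_trigger_nodes[trigger_id]

assert sorted(prune_trigger_nodes) == ['a', 'b']
```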
@dwsutherland dwsutherland (Member, Author)

Here's the explanation for why this fix was needed.

@oliver-sanders oliver-sanders (Member)

Makes sense!

@oliver-sanders oliver-sanders requested review from wxtim and removed request for dpmatthews February 16, 2026 16:35
Co-authored-by: Tim Pillinger <26465611+wxtim@users.noreply.github.com>
@wxtim wxtim merged commit 394001f into cylc:8.6.x Feb 18, 2026
23 checks passed

Labels

efficiency (For notable efficiency improvements), small

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants