CPU and Max RSS Analysis tools by ChrisPaulBennett · Pull Request #6663 · cylc/cylc-flow

ChrisPaulBennett · 2025-03-12T09:14:38Z

This is part of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.

This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.

This adds a Python profiler script. The profiler will be run by Cylc in the same cgroup as the Cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the SQL DB file.

Linked to:
cylc/cylc-ui#2100
cylc/cylc-uiserver#675

Check List

I have read CONTRIBUTING.md and added my name as a Code Contributor.
Contains logically grouped changes (else tidy your branch by rebase).
Does not contain off-topic changes (use other PRs for other changes).
Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
Tests are included (or explain why tests are not needed).
Changelog entry included if this is a change that can affect users
Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

oliver-sanders

🎉

cylc/flow/etc/job.sh

cylc/flow/job_file.py

cylc/flow/etc/job.sh

cylc/flow/scripts/profile.py

tests/functional/jobscript/02-profiler.t

oliver-sanders

👍

cylc/flow/cfgspec/globalcfg.py

cylc/flow/etc/job.sh

cylc/flow/scripts/profiler.py

oliver-sanders · 2025-04-03T10:59:35Z

tests/functional/jobscript/02-profiler.t

+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#-------------------------------------------------------------------------------
+# cylc profile test


This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.

I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.

We should test the job's stderr for the line(s) written by the profiler script complaining of the fault.

@ChrisPaulBennett

The profiler actually fails in this test, but the test passes anyway because it doesn't check whether the profiler did anything useful.

I've had a crack at a test here: ChrisPaulBennett#1

A couple of the sub-tests don't pass at the moment because the cpu/memory are not returned if the job fails.

tests/functional/jobscript/02-profiler/flow.cylc

cylc/flow/scripts/profiler.py

oliver-sanders · 2025-04-03T11:03:48Z

(please ignore the manylinux test failures, we'll be removing this test on master shortly)

cylc/flow/cfgspec/globalcfg.py

wxtim · 2025-04-16T13:50:42Z

I'm getting lots of failures with this (admittedly nasty) workflow on localhost:

[task parameters]
    time = 1..10
    reps = 1..5
[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task<time><reps>
[runtime]
    [[task<time><reps>]]
        script = sleep $CYLC_TASK_PARAM_time

About 2/3 of tasks have FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time' - It looks to me like the profiler fails if the task exits too fast?

Full Traceback

Traceback (most recent call last):
  File "/home/users/tim.pillinger/conda-envs/cylc39/bin/cylc", line 8, in <module>
    sys.exit(main())
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 702, in main
    execute_cmd(command, *cmd_args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 333, in execute_cmd
    entry_point.load()(*args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/terminal.py", line 298, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 62, in main
    get_config(options)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 180, in get_config
    profile(process, cgroup_version, args.delay)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 159, in profile
    write_data(str(cpu_time), "cpu_time")
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 103, in write_data
    with open(filename, 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time'

oliver-sanders · 2025-04-16T14:17:45Z

Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup, but jobs that exit faster than the profiler's poll interval are an edge case that we should handle.
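One way to cover jobs that finish within a single poll interval is to sample before sleeping and once more when asked to stop. The following is a minimal sketch of that idea, not the implementation in this PR; `poll_max` and `read_sample` are hypothetical names, and a `None` sample is assumed to mean "no data available".

```python
import asyncio
from typing import Callable, Optional


async def poll_max(
    read_sample: Callable[[], Optional[int]],
    delay: float,
    keep_looping: Callable[[], bool],
) -> Optional[int]:
    """Return the largest sample seen, or None if nothing could be read."""
    peak: Optional[int] = None

    def record() -> None:
        nonlocal peak
        value = read_sample()  # None means no data could be read
        if value is not None and (peak is None or value > peak):
            peak = value

    while keep_looping():
        record()  # sample first, so a short job is measured at least once
        await asyncio.sleep(delay)
    record()  # one final sample when asked to stop
    return peak
```

With this shape, a job that ends before the first sleep completes still produces one measurement, and a failed read simply leaves the running maximum unchanged.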

wxtim · 2025-04-17T12:59:20Z

Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup

Probably need some user safety rails/warnings about that

oliver-sanders · 2025-04-17T13:01:33Z

Probably need some user safety rails/warnings about that

It's difficult for us to say which job runners do or do not support cgroup profiling. The best we can do is to document it.

cylc/flow/etc/job.sh

cylc/flow/cfgspec/globalcfg.py

cylc/flow/etc/job.sh

ChrisPaulBennett · 2025-04-28T10:23:28Z

I'm not sure how to deal with the linting failure. My Perl is rusty, at best.
If I add "export", as the error code recommends, the test fails. If I remove it, the test also fails.
Dave Matthews' recommendations have been implemented.

oliver-sanders · 2025-05-07T11:57:17Z

Works fine for me:

$ ctb -v tests/functional/jobscript/02-profiler.t -p '*'
ok 1 - 02-profiler-validate
ok 2 - 02-profiler-run
ok    20179 ms ( 0.01 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.28 CPU)
[12:56:44]
All tests successful.
Files=1, Tests=2, 24 wallclock secs ( 0.02 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.29 CPU)
Result: PASS

$ git diff
diff --git a/tests/functional/jobscript/02-profiler.t b/tests/functional/jobscript/02-profiler.t
index 1d8dbc548..601d12971 100644
--- a/tests/functional/jobscript/02-profiler.t
+++ b/tests/functional/jobscript/02-profiler.t
@@ -16,7 +16,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #-------------------------------------------------------------------------------
 # cylc profile test
-REQUIRE_PLATFORM='runner:?(pbs|slurm)'
+export REQUIRE_PLATFORM='runner:?(pbs|slurm)'
 . "$(dirname "$0")/test_header"
 #-------------------------------------------------------------------------------
 set_test_number 2

$ etc/bin/shellchecker 
$ echo $?
0

tests/unit/scripts/test_profiler.py

cylc/flow/scripts/profiler.py

oliver-sanders · 2026-04-08T15:18:42Z

tests/functional/cylc-cat-log/12-delete-kill.t

 log_file="${WORKFLOW_RUN_DIR}/log/foo.log"
 echo "Hello, Mr. Thompson" > "$log_file"

-export CYLC_PROC_POLL_INTERVAL=0.5


This shouldn't have been removed.

oliver-sanders · 2026-04-09T15:14:17Z

Have tested with this example:

./flow.cylc

[scheduling]
    [[graph]]
        R1 = foo

[runtime]
    [[foo]]
        script = """
            python -c '
            from random import random
            from time import sleep
            X = 1000000
            Y = 10
            Z = []
            sleep(1)
            for _ in range(Y):
                Z.extend(random() for _ in range(X))
                sleep(1)
            '
        """
        platform = ...

./bin/memory-spike

#!/usr/bin/env python

from random import random
from time import sleep
X = 1000000
Y = 10
Z = []
sleep(1)
for _ in range(Y):
    Z.extend(random() for _ in range(X))
    sleep(5)

The above python script results in this pattern of memory usage:

Running the above workflow with this diff:

diff --git a/cylc/flow/scripts/profiler.py b/cylc/flow/scripts/profiler.py
index 59ced2173..3acc6ecb2 100755
--- a/cylc/flow/scripts/profiler.py
+++ b/cylc/flow/scripts/profiler.py
@@ -223,6 +223,8 @@ async def profile(_process: Process, delay, keep_looping=lambda: True):
     while keep_looping():
         # Polling the cgroup for memory and keeping track of the max rss value
         max_rss = parse_memory_file(_process)
+        import sys
+        print(f'# {max_rss}', file=sys.stderr)
         if max_rss is not None and max_rss > _process.max_rss:
             _process.max_rss = max_rss
         await asyncio.sleep(delay)

Results in the following memory measurements being written to stderr (when the polling interval is set to 1):

# 73994240
# 110489600
# 150994944
# 191504384
# 232009728
# 272519168
# 313028608
# 353538048
# 394047488
# 434552832

Which is roughly consistent. LGTM.
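As a quick sanity check on the figures above, the growth between consecutive samples can be computed directly. This is only a sketch over the numbers printed to stderr; it assumes each sample lines up roughly with one loop iteration of the test script.

```python
# Memory samples (bytes) copied from the stderr output above.
samples = [
    73994240, 110489600, 150994944, 191504384, 232009728,
    272519168, 313028608, 353538048, 394047488, 434552832,
]

# Growth between consecutive one-second polls.
deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]

# Each loop iteration appends X = 1,000,000 floats, so this is bytes per float.
bytes_per_float = sum(deltas) / len(deltas) / 1_000_000

print(min(deltas), max(deltas), round(bytes_per_float, 1))
# prints: 36495360 40509440 40.1
```

Roughly 40 bytes per appended float is plausible for 64-bit CPython, where a float object takes 24 bytes and its list slot another 8, with the remainder attributable to allocator and list over-allocation overhead.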

@dpmatthews, please could you confirm you're happy with the fields we're polling lines 96-112.

oliver-sanders · 2026-04-09T15:21:26Z

cylc/flow/cfgspec/globalcfg.py

+                Conf('polling interval', VDR.V_INTEGER,
+                     default=10,
+                     desc='''
+                     Configure the profiler polling interval.
+
+                     The interval (in seconds) at which the profiler will
+                     poll the cgroups filesystem for resource usage data.
+                     The default value of 10 seconds should be sufficient for
+                     most use cases, but can be adjusted as needed.
+                ''')


This should really be an ISO8601 interval rather than an int (we switched the other configs from int in Cylc 6).

But will punt that to a follow-on issue - #7265

oliver-sanders · 2026-04-09T15:31:11Z

cylc/flow/scripts/profiler.py

+        if cgroup_version == 2:
+            return Process(
+                cgroup_memory_path=location +
+                cgroup_name + "/" + "memory.stat",


Path mangling like this is a bit ugly, and can also go wrong.

We use Python's own path joining functions in Cylc/Rose, i.e. pathlib.Path or os.path.join.

Opened a follow-on to avoid holding this one up: #7266
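For illustration, the concatenation above could be written with `pathlib` along these lines. This is a sketch only: `memory_stat_path` is a hypothetical helper, and it assumes `location` and `cgroup_name` are plain strings as in the snippet above.

```python
from pathlib import PurePosixPath


def memory_stat_path(location: str, cgroup_name: str) -> PurePosixPath:
    """Build the path to a cgroup's memory.stat file."""
    # Strip any leading "/" from the cgroup name: joining an absolute
    # component would otherwise discard `location` entirely.
    return PurePosixPath(location) / cgroup_name.lstrip("/") / "memory.stat"


print(memory_stat_path("/sys/fs/cgroup/", "/user.slice/job"))
# prints: /sys/fs/cgroup/user.slice/job/memory.stat
```

Joining this way gives the same result whether or not `location` ends with a slash, which is the kind of mistake string concatenation invites.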

oliver-sanders · 2026-04-09T15:48:46Z

tests/unit/scripts/test_profiler.py

+
+
+async def test_profile_data(mocker):
+    # This test should run without error


This test should run without error

This is true of all tests!

Could you add docstrings to these tests to explain what behaviours they are trying to ensure?

E.g., one pattern for test descriptions is to start with the word "it", e.g. "It should run one iteration of the cgroup poller".
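A minimal sketch of that docstring pattern, using a hypothetical `poll_once` function as a stand-in for the code under test (it is not part of the profiler):

```python
def poll_once(read_sample):
    """Hypothetical stand-in for one iteration of a cgroup poller."""
    return read_sample()


def test_poll_once_returns_sample():
    """It should return the value read in a single poll iteration."""
    assert poll_once(lambda: 42) == 42
```

The docstring states the behaviour being protected, so a failure report tells the reader what broke without them having to read the test body.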

oliver-sanders · 2026-04-09T15:54:30Z

tests/integration/scripts/test_cat_log.py



-def test_bad_task_dir(run_dir, brokendir, capsys):
+@pytest.mark.asyncio


Ach, we just got rid of these in the profiler tests, only to add them in the cat_log tests!

We don't need to decorate tests with pytest.mark.asyncio.

oliver-sanders · 2026-04-09T15:58:24Z

tests/functional/jobscript/04-profiler.t

+# NOTE: This test will run the Cylc profiler on the given test platform.
+# The test platform may need to be configured for this to work (e.g.
+# "cgroups path" may need to be set).


This description is still wrong.

This test is the one that mocks the profiler output, so doesn't need to be run on a test platform or require cgroups in order to work.

dpmatthews · 2026-04-09T16:07:09Z

@dpmatthews, please could you confirm you're happy with the fields we're polling lines 96-112.

Yes, anon appears to be a sensible measure to use with cgroups v2. See comments in https://stackoverflow.com/questions/74796436/rss-memory-equivalent-in-cgroup-v2
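For reference, cgroups v2 exposes `memory.stat` as one `key value` pair per line, so reading the `anon` field might look like the sketch below. The function names are hypothetical and this is not the PR's `parse_memory_file`; returning `None` for a missing file is an assumption made here so that callers can skip the sample.

```python
from pathlib import Path
from typing import Optional


def parse_anon_bytes(stat_text: str) -> Optional[int]:
    """Return the "anon" value (bytes) from memory.stat text, if present."""
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "anon":
            return int(value)
    return None


def read_anon_bytes(path: Path) -> Optional[int]:
    """Read "anon" from a memory.stat file, or None if the file is missing."""
    try:
        return parse_anon_bytes(path.read_text())
    except FileNotFoundError:
        # Assumption: a missing file is treated as "no data", not an error.
        return None


print(parse_anon_bytes("anon 73994240\nfile 4096\n"))
# prints: 73994240
```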

ChrisPaulBennett marked this pull request as draft March 12, 2025 09:19

oliver-sanders reviewed Mar 12, 2025

oliver-sanders added this to the 8.x milestone Mar 12, 2025

oliver-sanders assigned ChrisPaulBennett Mar 12, 2025

This was referenced Mar 13, 2025

CPU and Max RSS Analysis tools cylc/cylc-ui#2100

Open

CPU and Max RSS Analysis tools cylc/cylc-uiserver#675

Open

ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from fb1b12b to c5d30b3 Compare March 21, 2025 11:37

ChrisPaulBennett force-pushed the cylc_profiler branch 3 times, most recently from 30a7bb0 to 7091711 Compare April 2, 2025 08:35

ChrisPaulBennett marked this pull request as ready for review April 2, 2025 14:20

oliver-sanders reviewed Apr 3, 2025

oliver-sanders reviewed Apr 10, 2025

cylc/flow/cfgspec/globalcfg.py (outdated, resolved)

ChrisPaulBennett force-pushed the cylc_profiler branch from 4f3d03a to 49fcbc8 Compare April 15, 2025 07:51

ChrisPaulBennett requested a review from oliver-sanders April 15, 2025 09:39

ChrisPaulBennett force-pushed the cylc_profiler branch from 68b0687 to 66acd1f Compare April 17, 2025 09:41

oliver-sanders reviewed Apr 23, 2025

cylc/flow/etc/job.sh (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/cfgspec/globalcfg.py (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/etc/job.sh (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/etc/job.sh (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/etc/job.sh (resolved)

ChrisPaulBennett requested a review from oliver-sanders April 28, 2025 14:02

oliver-sanders reviewed Mar 11, 2026

tests/unit/scripts/test_profiler.py (resolved)

oliver-sanders reviewed Mar 11, 2026

tests/unit/scripts/test_profiler.py (outdated, resolved)

oliver-sanders reviewed Mar 11, 2026

tests/unit/scripts/test_profiler.py (outdated, resolved)

oliver-sanders reviewed Mar 11, 2026

tests/unit/scripts/test_profiler.py (resolved)

ChrisPaulBennett added 2 commits March 17, 2026 11:26

Removing redundant asyncio

3520e8c

Linting

4071d8e

ChrisPaulBennett requested a review from oliver-sanders March 17, 2026 13:11

oliver-sanders mentioned this pull request Mar 24, 2026

cat-log: Fix ps subprocesses running every 1s #7233

Open

8 tasks

Merge branch 'cylc:master' into cylc_profiler

cc9b5a8

ChrisPaulBennett requested a review from MetRonnie March 25, 2026 10:28

MetRonnie mentioned this pull request Mar 27, 2026

Refactor watch and kill logic #7255

Draft

8 tasks

ChrisPaulBennett added 2 commits March 31, 2026 15:34

Reverted back to constant polling of the memory statistic

6c3d108

Linting

e42cc58

oliver-sanders reviewed Mar 31, 2026

cylc/flow/scripts/profiler.py (outdated, resolved)

Code review changes

c2a4099

ChrisPaulBennett force-pushed the cylc_profiler branch from ba23e55 to c2a4099 Compare April 7, 2026 14:32

ChrisPaulBennett added 2 commits April 8, 2026 13:38

Fix unit tests

a79a3ce

Type hinting

9e8f426

oliver-sanders reviewed Apr 8, 2026

oliver-sanders requested review from oliver-sanders and removed request for wxtim April 9, 2026 14:15

oliver-sanders mentioned this pull request Apr 9, 2026

profiler: change the polling interval field from an integer to an ISO8601 interval #7265

Open

oliver-sanders reviewed Apr 9, 2026

oliver-sanders mentioned this pull request Apr 9, 2026

profiler: use stdlib for FS path mangling #7266

Open

oliver-sanders reviewed Apr 9, 2026
