CPU and Max RSS Analysis tools #6663
| # You should have received a copy of the GNU General Public License | ||
| # along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
| #------------------------------------------------------------------------------- | ||
| # cylc profile test |
This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.
I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.
We should test the job's stderr for the line(s) written by the profiler script complaining of the fault.
The profiler actually fails in this test, but the test passes anyway because it doesn't check whether the profiler did anything useful.
I've had a crack at a test here: ChrisPaulBennett#1
A couple of the sub-tests don't pass at the moment because the cpu/memory are not returned if the job fails.
(please ignore the manylinux test failures, we'll be removing this test on master shortly)
I'm getting lots of failures with this (admittedly nasty) workflow on localhost:
[task parameters]
    time = 1..10
    reps = 1..5
[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task<time><reps>
[runtime]
    [[task<time><reps>]]
        script = sleep $CYLC_TASK_PARAM_time
About 2/3 of tasks have Full Traceback
Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup, but jobs that exit faster than the profiler's poll interval are an edge case that we should handle.
Probably need some user safety rails/warnings about that.
It's difficult for us to say which job runners do or do not support cgroup profiling. The best we can do is to document it.
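On the edge case of jobs that exit faster than the poll interval, here is a minimal sketch of one way a poller could cope. It is not the code in this PR: `poll_until_done`, `read_sample` and `job_done` are hypothetical names. The idea is to sample once up front and once more when the job finishes, and to treat an unreadable cgroup file as "no data" rather than an error.

```python
import asyncio


async def poll_until_done(read_sample, job_done, interval):
    """Poll read_sample() every `interval` seconds until job_done is set.

    Hypothetical helper: samples once before waiting and once after the
    job finishes, so a job shorter than `interval` still yields data.
    """
    samples = []

    def take():
        try:
            value = read_sample()
        except OSError:
            # e.g. the cgroup file does not exist (job not in a cgroup).
            value = None
        if value is not None:
            samples.append(value)

    take()
    while not job_done.is_set():
        try:
            # Wake early if the job finishes before the interval elapses.
            await asyncio.wait_for(job_done.wait(), timeout=interval)
        except asyncio.TimeoutError:
            take()
    take()  # final sample once the job has finished
    return max(samples, default=None)
```

With this shape, a failed read never raises out of the loop, and the caller gets `None` when nothing could be measured.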
I'm not sure how to deal with the linting failure. My Perl is rusty, at best.
Works fine for me:
$ ctb -v tests/functional/jobscript/02-profiler.t -p '*'
ok 1 - 02-profiler-validate
ok 2 - 02-profiler-run
ok 20179 ms ( 0.01 usr 0.01 sys + 3.16 cusr 1.10 csys = 4.28 CPU)
[12:56:44]
All tests successful.
Files=1, Tests=2, 24 wallclock secs ( 0.02 usr 0.01 sys + 3.16 cusr 1.10 csys = 4.29 CPU)
Result: PASS
$ git diff
diff --git a/tests/functional/jobscript/02-profiler.t b/tests/functional/jobscript/02-profiler.t
index 1d8dbc548..601d12971 100644
--- a/tests/functional/jobscript/02-profiler.t
+++ b/tests/functional/jobscript/02-profiler.t
@@ -16,7 +16,7 @@
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#-------------------------------------------------------------------------------
# cylc profile test
-REQUIRE_PLATFORM='runner:?(pbs|slurm)'
+export REQUIRE_PLATFORM='runner:?(pbs|slurm)'
. "$(dirname "$0")/test_header"
#-------------------------------------------------------------------------------
set_test_number 2
$ etc/bin/shellchecker
$ echo $?
0
| log_file="${WORKFLOW_RUN_DIR}/log/foo.log" | ||
| echo "Hello, Mr. Thompson" > "$log_file" | ||
|
|
||
| export CYLC_PROC_POLL_INTERVAL=0.5 |
This shouldn't have been removed.
Have tested with this example:
[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]
        script = """
python -c '
from random import random
from time import sleep
X = 1000000
Y = 10
Z = []
sleep(1)
for _ in range(Y):
    Z.extend(random() for _ in range(X))
    sleep(1)
'
        """
        platform = ...
#!/usr/bin/env python
from random import random
from time import sleep
X = 1000000
Y = 10
Z = []
sleep(1)
for _ in range(Y):
    Z.extend(random() for _ in range(X))
    sleep(5)
The above python script results in this pattern of memory usage:
Running the above workflow with this diff:
diff --git a/cylc/flow/scripts/profiler.py b/cylc/flow/scripts/profiler.py
index 59ced2173..3acc6ecb2 100755
--- a/cylc/flow/scripts/profiler.py
+++ b/cylc/flow/scripts/profiler.py
@@ -223,6 +223,8 @@ async def profile(_process: Process, delay, keep_looping=lambda: True):
while keep_looping():
# Polling the cgroup for memory and keeping track of the max rss value
max_rss = parse_memory_file(_process)
+ import sys
+ print(f'# {max_rss}', file=sys.stderr)
if max_rss is not None and max_rss > _process.max_rss:
_process.max_rss = max_rss
await asyncio.sleep(delay)
Results in the following memory measurements being written to stderr (when the
Which is roughly consistent. LGTM.
@dpmatthews, please could you confirm you're happy with the fields we're polling (lines 96-112)?
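For anyone following the diff above, the max-tracking step can be sketched in isolation. This is an illustration only: `read_stat_field` and `update_max` are hypothetical helpers, the "key value" line layout is the one used by cgroup `memory.stat` files, and the `anon` field in the usage note is just an example, not necessarily a field the profiler reads.

```python
def read_stat_field(text, field):
    """Return the integer value of `field` from "key value" lines, or None."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == field:
            try:
                return int(parts[1])
            except ValueError:
                return None
    return None


def update_max(current_max, sample):
    """Keep the largest sample seen so far, ignoring failed (None) polls."""
    if sample is not None and sample > current_max:
        return sample
    return current_max
```

For example, `update_max(0, read_stat_field("anon 4096\nfile 8192\n", "anon"))` gives `4096`, and a poll that returns `None` leaves the running maximum unchanged.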
| Conf('polling interval', VDR.V_INTEGER, | ||
| default=10, | ||
| desc=''' | ||
| Configure the profiler polling interval. | ||
|
|
||
| The interval (in seconds) at which the profiler will | ||
| poll the cgroups filesystem for resource usage data. | ||
| The default value of 10 seconds should be sufficient for | ||
| most use cases, but can be adjusted as needed. | ||
| ''') |
This should really be an ISO8601 interval rather than an int (we switched the other configs from int in Cylc 6).
But will punt that to a follow-on issue: #7265
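To illustrate what an ISO8601 interval would look like for this setting, here is a rough standard-library sketch that converts values such as `PT10S` to seconds. It is not Cylc's own parser and is not what the follow-on issue has to implement; `iso8601_to_seconds` is a hypothetical helper that only handles the day/hour/minute/second subset.

```python
import re

# Subset of ISO 8601 durations: PnDTnHnMnS (no years, months or weeks).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def iso8601_to_seconds(value):
    """Convert a duration such as "PT10S" or "PT1M30S" to seconds."""
    match = _DURATION.match(value)
    # Reject non-matches, a dangling "T", and durations with no components.
    if match is None or value.endswith("T") or not any(match.groups()):
        raise ValueError(f"not a supported ISO 8601 duration: {value!r}")
    parts = {k: float(v) if v else 0.0 for k, v in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
```

Under this sketch the current default of 10 would be written `PT10S`, and a bare `10` would be rejected with a `ValueError`.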
| if cgroup_version == 2: | ||
| return Process( | ||
| cgroup_memory_path=location + | ||
| cgroup_name + "/" + "memory.stat", |
Path mangling like this is a bit ugly, and it can also go wrong.
We use Python's own path joining functions in Cylc/Rose, i.e., pathlib.Path or os.path.join.
Opened a follow-on to avoid holding this one up: #7266
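As a rough illustration of the pathlib approach (the `cgroup_file` helper and the example paths are hypothetical, not taken from the PR): string concatenation silently produces a wrong path if `location` lacks a trailing slash and the cgroup name lacks a leading one, whereas `pathlib` inserts separators for us. One caveat is that joining an absolute-looking component discards everything before it, so the sketch strips the leading slash from the cgroup name.

```python
from pathlib import Path


def cgroup_file(location, cgroup_name, filename):
    """Join the cgroup mount point, cgroup name and file name.

    The leading "/" is stripped from cgroup_name because
    Path("/a") / "/b" evaluates to Path("/b"), dropping the mount point.
    """
    return Path(location) / cgroup_name.lstrip("/") / filename
```

For example, `cgroup_file("/sys/fs/cgroup", "/jobs/123", "memory.stat")` and `cgroup_file("/sys/fs/cgroup/", "jobs/123", "memory.stat")` both give `/sys/fs/cgroup/jobs/123/memory.stat`.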
| async def test_profile_data(mocker): | ||
| # This test should run without error |
This test should run without error
This is true of all tests!
Could you add docstrings to these tests to explain what behaviours they are trying to ensure?
E.g., one pattern for test descriptions is to start with the word "it": "It should run one iteration of the cgroup poller".
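A small sketch of the docstring pattern, using a made-up `poll_once` function as a stand-in for the code under test (it is not part of the profiler):

```python
def poll_once(read_sample):
    """Hypothetical stand-in for one iteration of a poller."""
    return read_sample()


def test_poll_once_returns_sample():
    """It should return the value read during one poll iteration."""
    assert poll_once(lambda: 42) == 42


def test_poll_once_tolerates_missing_data():
    """It should return None when no data could be read."""
    assert poll_once(lambda: None) is None
```

Each docstring states the behaviour being protected, so a failing test name plus its first line tells the reader what broke.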
| def test_bad_task_dir(run_dir, brokendir, capsys): | ||
| @pytest.mark.asyncio |
Ach, we just got rid of these in the profiler tests, only to add them in the cat_log tests!
We don't need to decorate tests with pytest.mark.asyncio.
| # NOTE: This test will run the Cylc profiler on the given test platform. | ||
| # The test platform may need to be configured for this to work (e.g. | ||
| # "cgroups path" may need to be set). |
This description is still wrong.
This test is the one that mocks the profiler output, so doesn't need to be run on a test platform or require cgroups in order to work.
Yes, |

This is part of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.
This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.
This adds a Python profiler script. The profiler will be run by Cylc in the same cgroup as the Cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the SQL db file.
Linked to:
cylc/cylc-ui#2100
cylc/cylc-uiserver#675
Check List
CONTRIBUTING.md and added my name as a Code Contributor. setup.cfg (and conda-environment.yml if present). ?.?.x branch.