Restore CPU usage metrics; add memory usage metrics; fix various small issues with IP accounting and logs #184
Draft
SeanGeb wants to merge 11 commits into
- systemd consistently uses lowercase [1]; we should follow this.
- Apply small code tidies and cleanups that help following commits.
[1]: https://systemd.io/
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
Example output:
# HELP systemd_meta Static systemd metadata
# TYPE systemd_meta gauge
systemd_meta{full_version="257.10-1.fc42"} 1
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
We can easily get the systemd version from its DBus API and use that to automatically enable certain metrics that otherwise require the user to manually check their systemd version and enable the corresponding flag. To avoid breakages in unusual situations - e.g. where support for those metrics has somehow been patched or compiled out - make this behaviour opt-in to start.
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
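As a rough illustration of the version gating described in this commit, the sketch below pulls the major version out of a string shaped like the full_version label in the example output and compares it against a threshold. The helper names (majorVersion, supportsSince) and the threshold of 250 are hypothetical and not taken from this PR; fetching the version over DBus is deliberately left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorVersion returns the leading integer of a systemd version string,
// e.g. 257 for "257.10-1.fc42". Hypothetical helper, not the PR's code.
func majorVersion(full string) (int, error) {
	// Find the first character that is not an ASCII digit.
	end := strings.IndexFunc(full, func(r rune) bool { return r < '0' || r > '9' })
	if end == -1 {
		end = len(full) // the whole string is digits, e.g. "257"
	}
	if end == 0 {
		return 0, fmt.Errorf("no leading version number in %q", full)
	}
	return strconv.Atoi(full[:end])
}

// supportsSince reports whether the running systemd is at least minMajor.
// An unparseable version counts as unsupported, so the metric stays off.
func supportsSince(full string, minMajor int) bool {
	major, err := majorVersion(full)
	return err == nil && major >= minMajor
}

func main() {
	// 250 is a placeholder threshold, not a real requirement of any metric.
	fmt.Println(supportsSince("257.10-1.fc42", 250))
	fmt.Println(supportsSince("unknown", 250))
}
```

Treating a parse failure as "unsupported" keeps the opt-in behaviour conservative: an unexpected version format leaves metrics disabled rather than enabling something the host may not provide.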
Example output:
# HELP systemd_meta Static systemd metadata
# TYPE systemd_meta gauge
systemd_meta{architecture="x86-64",full_version="257.10-1.fc42",virtualization="wsl"} 1
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
Some log messages contain sprintf formatting directives but are only passed as the message argument to a slog instance; remove those directives and avoid complaining about the same error multiple times.
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
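To make the slog issue concrete, here is a minimal sketch (assuming Go 1.21+ for log/slog) contrasting a message that still carries a formatting directive with one that passes the same details as key-value attributes. The message strings, the unit name, and the helper names are invented for illustration and are not the exporter's actual log calls.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
	"strings"
)

// errExample is an illustrative error, not one produced by the exporter.
var errExample = errors.New("connection closed")

// newBufferLogger returns a text logger that writes to buf and omits the
// timestamp so the output is stable between runs.
func newBufferLogger(buf *bytes.Buffer) *slog.Logger {
	opts := &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{} // a zero Attr is dropped from the record
			}
			return a
		},
	}
	return slog.New(slog.NewTextHandler(buf, opts))
}

// logWithDirective shows the problem: slog never expands "%s", so the
// directive appears literally in the logged message.
func logWithDirective(err error) string {
	var buf bytes.Buffer
	newBufferLogger(&buf).Warn("couldn't get property for unit %s", "err", err)
	return buf.String()
}

// logWithAttrs passes the same details as structured attributes instead.
func logWithAttrs(unit string, err error) string {
	var buf bytes.Buffer
	newBufferLogger(&buf).Warn("couldn't get unit property", "unit", unit, "err", err)
	return buf.String()
}

// hasDirective reports whether a logged line still contains a '%'.
func hasDirective(line string) bool { return strings.Contains(line, "%") }

func main() {
	fmt.Print(logWithDirective(errExample))
	fmt.Print(logWithAttrs("ssh.service", errExample))
}
```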
For each of these units there's nothing too interesting to collect:
- automount units have some metadata, mainly Where=.
- path units have some metadata, mainly Unit=, Paths=, MakeDirectory=.
- target units don't provide any functionality of their own and have no unit-type-specific metadata.
Therefore there's no need to log the fact there's no handler for these units - they're well-known and uninteresting from a metrics standpoint so it's reasonable to simply ignore them.
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
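One way to picture the "ignore without logging" behaviour is a lookup on the unit name's suffix. The sketch below is only an assumption about how such a check could look; quietlyIgnored, unitSuffix, and ignoreQuietly are hypothetical names, not functions from the exporter.

```go
package main

import (
	"fmt"
	"strings"
)

// quietlyIgnored lists the unit types this commit treats as well-known and
// uninteresting for metrics, so no "no handler" message is logged for them.
var quietlyIgnored = map[string]bool{
	"automount": true,
	"path":      true,
	"target":    true,
}

// unitSuffix returns the unit type suffix of a unit name, e.g. "target"
// for "multi-user.target", or "" if the name has no dot.
func unitSuffix(name string) string {
	i := strings.LastIndexByte(name, '.')
	if i == -1 {
		return ""
	}
	return name[i+1:]
}

// ignoreQuietly reports whether a unit can be skipped without logging.
func ignoreQuietly(name string) bool {
	return quietlyIgnored[unitSuffix(name)]
}

func main() {
	for _, name := range []string{"multi-user.target", "ssh.service"} {
		fmt.Println(name, ignoreQuietly(name))
	}
}
```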
Scope units closely correspond to services (e.g. for shell sessions), so they can also be units of resource accounting and control; similarly, slices (e.g. system.slice) are units of resource accounting and control that aggregate accounting and apply limits against the aggregate of their child units (which can be slices, scopes, and services). By default, templated units are spawned under a slice named after the unit's (non-templated) prefix, so we'll immediately start collecting aggregate resource consumption metrics for all instances of templated units (e.g. capsule@.service, systemd-journald@.service, modprobe@.service). This also has an immediate benefit on systems used interactively: user-spawned processes (e.g. from an SSH session) are put under user.slice by default, while system processes (e.g. the SSH server itself) are put under system.slice, and VMs or containers under machine.slice; this means an admin can now use systemd_exporter metrics to tell how CPU time or memory is split between users, system services, and system containers/VMs.
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
If IP accounting isn't enabled for a unit, systemd returns the max uint64 value over DBus (i.e. -1 cast to uint64). When this happens we shouldn't export the metric; this is consistent with other metrics like those provided by tasks accounting.
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
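The sentinel check described in this commit can be sketched as follows. This is a minimal illustration, not the exporter's implementation: accountingUnset and accountingValue are hypothetical names, and the sample counter is taken from the example output later in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// accountingUnset is the value systemd reports over DBus when accounting is
// disabled for a unit: the maximum uint64, i.e. -1 cast to uint64.
const accountingUnset uint64 = math.MaxUint64

// accountingValue converts a raw DBus counter to a metric value. The second
// return value is false when the counter is unset and must not be exported.
func accountingValue(raw uint64) (float64, bool) {
	if raw == accountingUnset {
		return 0, false
	}
	return float64(raw), true
}

func main() {
	for _, raw := range []uint64{281676, accountingUnset} {
		if v, ok := accountingValue(raw); ok {
			fmt.Println("export", v)
		} else {
			fmt.Println("skip: accounting not enabled")
		}
	}
}
```

Returning an explicit ok flag, rather than exporting 0, keeps "accounting disabled" distinguishable from "no traffic yet" on the Prometheus side, since the series is simply absent.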
Also makes the labels on the cpu_seconds_total and IP accounting metrics
consistent with other metrics defined against multiple unit types.
Example output:
# HELP systemd_unit_cpu_seconds_total Unit CPU time in seconds
# TYPE systemd_unit_cpu_seconds_total counter
systemd_unit_cpu_seconds_total{name="-.slice",type="Slice"} 2719.072
systemd_unit_cpu_seconds_total{name="NetworkManager-wait-online.service",type="Service"} 0.011585
systemd_unit_cpu_seconds_total{name="NetworkManager.service",type="Service"} 0.183858
systemd_unit_cpu_seconds_total{name="clickhouse-server.service",type="Service"} 875.48686
systemd_unit_cpu_seconds_total{name="console-getty.service",type="Service"} 0.005837
systemd_unit_cpu_seconds_total{name="dbus-broker.service",type="Service"} 1.652653
systemd_unit_cpu_seconds_total{name="dnf-makecache.service",type="Service"} 0.621209
systemd_unit_cpu_seconds_total{name="getty@tty1.service",type="Service"} 0.005612
systemd_unit_cpu_seconds_total{name="init.scope",type="Scope"} 1393.11924
systemd_unit_cpu_seconds_total{name="kmod-static-nodes.service",type="Service"} 0.003782
# HELP systemd_service_ip_egress_bytes Service unit egress IP accounting in bytes.
# TYPE systemd_service_ip_egress_bytes counter
systemd_service_ip_egress_bytes{name="clickhouse-server.service",type="Service"} 281676
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
This is immensely handy to debug e.g. services with a memory leak, or
unexpected OOM kills on an optimised, resource-allocated, multi-workload
system (for an example of why this can happen, see Facebook's extensive
documentation on optimising resource usage with cgroups [1]).
Example output:
# HELP systemd_unit_memory_current_bytes Current memory usage in bytes.
# TYPE systemd_unit_memory_current_bytes gauge
systemd_unit_memory_current_bytes{name="NetworkManager.service",type="Service"} 7.737344e+06
systemd_unit_memory_current_bytes{name="clickhouse-server.service",type="Service"} 2.003456e+09
systemd_unit_memory_current_bytes{name="console-getty.service",type="Service"} 421888
# HELP systemd_unit_memory_peak_bytes Peak memory usage in bytes.
# TYPE systemd_unit_memory_peak_bytes gauge
systemd_unit_memory_peak_bytes{name="NetworkManager-wait-online.service",type="Service"} 2.625536e+06
systemd_unit_memory_peak_bytes{name="NetworkManager.service",type="Service"} 1.8141184e+07
systemd_unit_memory_peak_bytes{name="clickhouse-server.service",type="Service"} 2.224668672e+09
# HELP systemd_unit_swap_current_bytes Current swap usage in bytes.
# TYPE systemd_unit_swap_current_bytes gauge
systemd_unit_swap_current_bytes{name="NetworkManager.service",type="Service"} 0
systemd_unit_swap_current_bytes{name="clickhouse-server.service",type="Service"} 1.94056192e+08
systemd_unit_swap_current_bytes{name="console-getty.service",type="Service"} 0
# HELP systemd_unit_swap_peak_bytes Peak swap usage in bytes.
# TYPE systemd_unit_swap_peak_bytes gauge
systemd_unit_swap_peak_bytes{name="NetworkManager-wait-online.service",type="Service"} 0
systemd_unit_swap_peak_bytes{name="NetworkManager.service",type="Service"} 0
systemd_unit_swap_peak_bytes{name="clickhouse-server.service",type="Service"} 2.51211776e+08
[1]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
Signed-off-by: Sean Gebbett <10674942+SeanGeb@users.noreply.github.com>
Hello!
This is a quick PR to make some quality-of-life improvements to systemd_exporter.
The highlight is that I've restored support for CPU usage metrics. These were previously removed as they queried the cgroup filesystem directly, which trod on the toes of cgroup exporter.
systemd actually provides this information via the CPUUsageNSec= property of active units, so we can bypass any cgroup shenanigans and just get it straight from systemd; this should also smooth over any cgroup v1 vs v2 differences.
In the same vein I've also added equivalent metrics for memory usage.
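For the CPU metric, the value read from CPUUsageNSec= is in nanoseconds, while the exported systemd_unit_cpu_seconds_total is in seconds. A minimal sketch of that conversion is below; cpuSeconds is a hypothetical helper, and treating the max uint64 value as "CPU accounting unavailable" is an assumption carried over from the IP accounting behaviour described in the commits, not something this PR description states.

```go
package main

import (
	"fmt"
	"math"
)

// cpuUnset is assumed to mean "no CPU accounting data" for a unit,
// mirroring the max uint64 sentinel used for IP accounting.
const cpuUnset uint64 = math.MaxUint64

// cpuSeconds converts a CPUUsageNSec= reading (nanoseconds) to seconds.
// The second return value is false when the reading should not be exported.
func cpuSeconds(nsec uint64) (float64, bool) {
	if nsec == cpuUnset {
		return 0, false
	}
	return float64(nsec) / 1e9, true
}

func main() {
	// 1.5 billion nanoseconds of CPU time is 1.5 seconds.
	if s, ok := cpuSeconds(1_500_000_000); ok {
		fmt.Printf("%.1f seconds\n", s)
	}
}
```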
Other smaller fixes include:
systemd to match what the systemd project themselves use.
This PR is not quite ready yet - I still intend to add some additional CPU and memory metrics to reflect any assigned reservations, quotas, or limits for those resources, which can help diagnose throttling or be used for alerts when nearing a hard limit - but please feel free to leave early feedback.
Related issues
Resolves:
MemoryCurrent= and others)
Partially resolves:
node_systemd_version), add systemd version #27
May mitigate:
user-%d.slice units).