Commit 25cd4a9

Add Kueue integration (#23908)
* Add initial Kueue OpenMetrics integration scaffold.

  Start with a basic OpenMetrics V2 check that forwards endpoint metrics under the kueue namespace to enable early endpoint validation before adding curated mappings.
* Add basic Kueue OpenMetrics scraping
* Add curated Kueue OpenMetrics mapping
* Add Kueue resource metric suffixing
* Document Kueue as a cluster check
* Implement Kueue integration
* Fix code coverage missing
* Update README
* Fix manifest
* Update metadata
* Add owners
* Fix Kueue CI validation
* Address Kueue review feedback.
* Add memory
* Pin kind networking and node image for Kueue E2E

  Use non-default service/pod subnets so the kind cluster's API service IP does not collide with the host environment's Kubernetes networking, which hijacked in-cluster traffic and broke Kueue's webhook cert bootstrap. Also scope the LocalQueue readiness wait to the default namespace.
* Fix Kueue go info tag validation.

  Rename the generic Go version label before submission so E2E metrics pass tag validation.
* Make Kueue E2E test pass against live cluster

  Relax metric tag assertions to match the actual tag set emitted by the controller (endpoint, replica_role, cohort tags) instead of pinning an exact subset, and add the missing assets/service_checks.json (with its manifest reference) that assert_service_checks requires.
* Wait for Kueue webhook before applying queue manifests

  The controller deployment can report `Available` before its webhook server is actually serving, causing intermittent `connection refused` failures when applying ResourceFlavor/ClusterQueue. Wait for the webhook service endpoints and retry the apply to absorb the brief cert-propagation window.
* Use remapped tag names in Kueue metric descriptions

  Metric descriptions referenced the raw Prometheus labels ('cluster_queue', 'local_queue'/'localQueue') instead of the tags Datadog actually emits after remapping ('kueue_cluster_queue', 'kueue_local_queue').
* Rename cluster_queue.pending_workloads to pending_workloads

  The raw kueue_pending_workloads metric has no cluster_queue in its name, so the cluster_queue. prefix was inconsistent with every other cluster-queue-indexed metric (which keep bare names and just carry the kueue_cluster_queue tag). Drop the prefix to match the source name and the rest of the convention.
* Sync Kueue configuration example
* Update codeowners
* Apply suggestions from code review
* Apply Kueue review cleanup
* Use metadata assertion for Kueue metric coverage
* Remove Kueue service check metadata
* Configure Kueue manifestless metadata
* Refactor Kueue tag assertions
* Assert Kueue e2e metrics from metadata
* Assert idle Kueue e2e metrics
* Rename Kueue flavor label
* Document Kueue GC summary metrics
* Remove checks
* Add more e2e metrics
* Fix e2e setup
* Fix Kueue e2e controller rollout wait.
* Change codeowners
* Unify Kueue unit and e2e metric expectations.

  Share EXPECTED_METRIC_TAGS between tests, align the OpenMetrics fixture with e2e queue labels, expand unit coverage, and pin go_version to go1.26.3 for kueue.go.info to match the controller toolchain.
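
The tag remapping described in the commit message can be illustrated with a small sketch. This is not the integration's code: `LABEL_RENAMES` and `remap_labels` are hypothetical names, and only the two label pairs named in the message (`cluster_queue` to `kueue_cluster_queue`, `local_queue` to `kueue_local_queue`) are taken from the source. Any other label is assumed to pass through unchanged.

```python
# Hypothetical rename map. The two entries come from the commit message;
# the real check may rename additional labels.
LABEL_RENAMES = {
    "cluster_queue": "kueue_cluster_queue",
    "local_queue": "kueue_local_queue",
}


def remap_labels(labels, renames=LABEL_RENAMES):
    """Return sorted 'name:value' tags with raw Prometheus label names renamed."""
    return sorted(f"{renames.get(name, name)}:{value}" for name, value in labels.items())


# Labels as they might appear on a raw sample (values are illustrative).
print(remap_labels({"cluster_queue": "team-a", "status": "active"}))
```

Metric descriptions should then refer to the renamed tags, since those are the names a user sees when filtering in Datadog.
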
1 parent 94b2b7f commit 25cd4a9

33 files changed

Lines changed: 2865 additions & 0 deletions

.ddev/config.toml

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,7 @@ teamcity = "TeamCity"
 win32_event_log = "Windows Event Log"
 krakend = "KrakenD"
 lustre = "Lustre"
+kueue = "Kueue"
 prefect = "Prefect"
 n8n = "n8n"
 hpe_aruba_edgeconnect = "HPE Aruba EdgeConnect"
@@ -51,6 +52,7 @@ dell_powerflex = "Dell Powerflex"
 [overrides.metrics-prefix]
 krakend = "krakend.api."
 lustre = "lustre."
+kueue = "kueue."
 prefect = "prefect.server."
 n8n = "n8n."
 control_m = "control_m."
@@ -270,6 +272,7 @@ __pycache__ = false
 [overrides.manifest.platforms]
 krakend = ["linux", "windows", "mac_os"]
 lustre = ["linux", "windows", "mac_os"]
+kueue = ["linux", "windows", "mac_os"]
 prefect = ["linux", "windows", "mac_os"]
 n8n = ["linux", "windows", "mac_os"]
 control_m = ["linux", "windows", "mac_os"]

.github/CODEOWNERS

Lines changed: 4 additions & 0 deletions
@@ -635,6 +635,10 @@ plaid/assets/logs/ @DataDog/saa
 /gpu/*.md @DataDog/ebpf-platform @DataDog/documentation
 /gpu/manifest.json @DataDog/ebpf-platform @DataDog/agent-integrations @DataDog/documentation
 
+/kueue/ @DataDog/gpu-monitoring-agent
+/kueue/*.md @DataDog/gpu-monitoring-agent @DataDog/documentation
+/kueue/manifest.json @DataDog/gpu-monitoring-agent @DataDog/agent-integrations @DataDog/documentation
+
 /linux_audit_logs/ @DataDog/agent-integrations
 /linux_audit_logs/*.md @DataDog/agent-integrations @DataDog/documentation
 /linux_audit_logs/manifest.json @DataDog/agent-integrations @DataDog/documentation

.github/workflows/config/labeler.yml

Lines changed: 4 additions & 0 deletions
@@ -889,6 +889,10 @@ integration/kubevirt_handler:
 - changed-files:
   - any-glob-to-any-file:
     - kubevirt_handler/**/*
+integration/kueue:
+- changed-files:
+  - any-glob-to-any-file:
+    - kueue/**/*
 integration/kuma:
 - changed-files:
   - any-glob-to-any-file:

.github/workflows/test-all.yml

Lines changed: 20 additions & 0 deletions
@@ -2338,6 +2338,26 @@ jobs:
       minimum-base-package: ${{ inputs.minimum-base-package }}
       pytest-args: ${{ inputs.pytest-args }}
     secrets: inherit
+  j3c620e6:
+    uses: ./.github/workflows/test-target.yml
+    with:
+      job-name: Kueue
+      target: kueue
+      platform: linux
+      runner: '["ubuntu-22.04"]'
+      repo: "${{ inputs.repo }}"
+      context: ${{ inputs.context }}
+      python-version: "${{ inputs.python-version }}"
+      latest: ${{ inputs.latest }}
+      agent-image: "${{ inputs.agent-image }}"
+      agent-image-py2: "${{ inputs.agent-image-py2 }}"
+      agent-image-windows: "${{ inputs.agent-image-windows }}"
+      agent-image-windows-py2: "${{ inputs.agent-image-windows-py2 }}"
+      test-py2: ${{ inputs.test-py2 }}
+      test-py3: ${{ inputs.test-py3 }}
+      minimum-base-package: ${{ inputs.minimum-base-package }}
+      pytest-args: ${{ inputs.pytest-args }}
+    secrets: inherit
   j739f9be:
     uses: ./.github/workflows/test-target.yml
     with:

code-coverage.datadog.yml

Lines changed: 3 additions & 0 deletions
@@ -421,6 +421,9 @@ services:
   - id: kubevirt_handler
     paths:
       - kubevirt_handler/datadog_checks/kubevirt_handler/
+  - id: kueue
+    paths:
+      - kueue/datadog_checks/kueue/
   - id: kuma
     paths:
       - kuma/datadog_checks/kuma/

kueue/CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# CHANGELOG - Kueue
+
+<!-- towncrier release notes start -->
+

kueue/README.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+# Agent Check: Kueue
+
+## Overview
+
+This check monitors Kueue through the Datadog Agent.
+
+Kueue is a Kubernetes workload queueing system that allows you to manage and schedule workloads on your Kubernetes cluster. It provides a way to prioritize and manage workloads, and to ensure that workloads are scheduled in a fair and efficient manner. This integration collects metrics from the Kueue controller manager and Kueue API server to help you monitor the health and performance of your Kueue cluster.
+
+## Setup
+
+Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates][3].
+
+### Installation
+
+The Kueue check is included in the [Datadog Agent][2] package.
+No additional installation is required on your server.
+
+### Configuration
+
+Kueue is a cluster-level service. Configure this integration as a Cluster Agent cluster check so only one Agent instance scrapes the Kueue metrics endpoint.
+
+1. To collect optional ClusterQueue resource metrics, such as `kueue.cluster_queue.resource_usage.gpu`, configure Kueue with `metrics.enableClusterQueueResources: true` and restart the Kueue controller manager.
+
+2. Provide a [cluster check configuration][10] to the Cluster Agent. For file or ConfigMap based configuration, set `cluster_check: true` in the instance:
+
+   ```yaml
+   clusterAgent:
+     confd:
+       kueue.yaml: |-
+         cluster_check: true
+         init_config:
+         instances:
+           - openmetrics_endpoint: http://kueue-controller-manager-metrics-service.kueue-system.svc:8080/metrics
+   ```
+
+3. Alternatively, annotate the Kueue metrics service with Autodiscovery cluster check annotations:
+
+   ```yaml
+   ad.datadoghq.com/endpoints.checks: |
+     {
+       "kueue": {
+         "instances": [
+           {
+             "openmetrics_endpoint": "http://%%host%%:%%port%%/metrics"
+           }
+         ]
+       }
+     }
+   ```
+
+See the [sample kueue.d/conf.yaml][4] for all available configuration options.
+
+### Validation
+
+[Run the Cluster Agent's `clusterchecks` subcommand][11] and look for `kueue` under the Checks section.
+
+## Data Collected
+
+### Metrics
+
+See [metadata.csv][7] for a list of metrics provided by this integration.
+
+### Events
+
+The Kueue integration does not include any events.
+
+## Troubleshooting
+
+Need help? Contact [Datadog support][8].
+
+
+[2]: https://app.datadoghq.com/account/settings/agent/latest
+[3]: https://docs.datadoghq.com/containers/kubernetes/integrations/
+[4]: https://github.com/DataDog/integrations-core/blob/master/kueue/datadog_checks/kueue/data/conf.yaml.example
+[5]: https://docs.datadoghq.com/agent/configuration/agent-commands/#start-stop-and-restart-the-agent
+[6]: https://docs.datadoghq.com/agent/configuration/agent-commands/#agent-status-and-information
+[7]: https://github.com/DataDog/integrations-core/blob/master/kueue/metadata.csv
+[8]: https://docs.datadoghq.com/help/
+[10]: https://docs.datadoghq.com/containers/cluster_agent/clusterchecks/?tab=helm#configuration-from-configuration-files
+[11]: https://docs.datadoghq.com/containers/troubleshooting/cluster-and-endpoint-checks/#dispatching-logic-in-the-cluster-agent
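
Before configuring the cluster check from the README above, it can help to confirm that the metrics endpoint actually serves Kueue samples. The following is a minimal sketch, not part of the integration: `kueue_metric_names` and `fetch_kueue_metric_names` are hypothetical helpers, the parser handles only plain `name{labels} value` sample lines, and the sample payload (label values included) is illustrative.

```python
from urllib.request import urlopen


def kueue_metric_names(exposition_text, prefix="kueue_"):
    """Collect distinct sample names starting with `prefix` from Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_kueue_metric_names(url, timeout=5):
    """Fetch the endpoint and list its Kueue sample names (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return kueue_metric_names(response.read().decode("utf-8"))


# Offline example using an illustrative payload.
sample = """\
# HELP kueue_pending_workloads Number of pending workloads
# TYPE kueue_pending_workloads gauge
kueue_pending_workloads{cluster_queue="team-a",status="active"} 3
go_info{version="go1.26.3"} 1
"""
print(kueue_metric_names(sample))
```

Inside the cluster, `fetch_kueue_metric_names` could be pointed at the `openmetrics_endpoint` URL from the configuration example. An empty result would suggest the URL or port is wrong, or that the controller is not exposing metrics there.
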
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: Kueue
+fleet_configurable: true
+files:
+- name: kueue.yaml
+  options:
+  - template: init_config
+    options:
+    - template: init_config/openmetrics
+  - template: instances
+    options:
+    - template: instances/openmetrics
+      overrides:
+        openmetrics_endpoint.value.example: http://localhost:8080/metrics
+        openmetrics_endpoint.description: |
+          Endpoint exposing Kueue's Prometheus metrics.
+    - name: cluster_check
+      description: |
+        Set to true when configuring this integration as a Cluster Agent cluster check.
+      value:
+        type: boolean
+        example: true
+    - name: resource_name_map
+      description: |
+        Mapping of Kueue resource label values to metric name suffixes for resource metrics.
+
+        By default, `cpu` is reported as `cpu`, `memory` is reported as `memory`, `nvidia.com/gpu` is reported as
+        `gpu`, and unmapped resources are reported as `other`. Built-in resource names cannot be overridden.
+      value:
+        type: object
+        example:
+          example.com/fpga: fpga
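
The `resource_name_map` description above implies a simple lookup order, sketched here in Python. This is an assumption-laden illustration rather than the check's implementation: `BUILTIN_RESOURCE_SUFFIXES`, `DEFAULT_SUFFIX`, and `resource_suffix` are hypothetical names, and this sketch silently ignores attempts to override a built-in, whereas the real check may instead reject such a configuration.

```python
# Defaults taken from the option description; the names are hypothetical.
BUILTIN_RESOURCE_SUFFIXES = {
    "cpu": "cpu",
    "memory": "memory",
    "nvidia.com/gpu": "gpu",
}
DEFAULT_SUFFIX = "other"


def resource_suffix(resource, resource_name_map=None):
    """Pick the metric name suffix for a Kueue `resource` label value."""
    # Built-ins are looked up first, so a user entry for `cpu` has no effect.
    if resource in BUILTIN_RESOURCE_SUFFIXES:
        return BUILTIN_RESOURCE_SUFFIXES[resource]
    return (resource_name_map or {}).get(resource, DEFAULT_SUFFIX)


user_map = {"example.com/fpga": "fpga", "cpu": "ignored"}
for resource in ("cpu", "nvidia.com/gpu", "example.com/fpga", "example.com/tpu"):
    suffix = resource_suffix(resource, user_map)
    print(resource, "->", "kueue.cluster_queue.resource_usage." + suffix)
```

With the example map, `example.com/fpga` gets its own `fpga` suffix, while the unmapped `example.com/tpu` falls back to `other`.
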

kueue/changelog.d/23908.added

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Initial Release.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# (C) Datadog, Inc. 2026-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+__version__ = '0.0.1'
