Commit d800567

Collect additional GPU metrics
Export GPU metrics to Custom Metrics API
Enrich metrics exporter to Custom Metrics API with workload information
Replace logging lib to castai.Logging and improve logging
Update mockery
1 parent 002c64d commit d800567

29 files changed: +1409 -347 lines changed

.mockery.yaml

Lines changed: 13 additions & 7 deletions
@@ -1,12 +1,18 @@
-with-expecter: true
-dir: 'mock/{{replace .InterfaceDirRelative "internal" "" 1}}'
+dir: mock/{{.InterfaceDirRelative | replace "internal" "" 1}}
+filename: "mock_{{.InterfaceName | lower}}.go"
+template: testify
+template-data:
+  unroll-variadic: true
 packages:
   "github.com/castai/gpu-metrics-exporter/internal/exporter":
     interfaces:
-      Exporter:
-      Scraper:
-      MetricMapper:
-      HttpClient:
+      Exporter: {}
+      Scraper: {}
+      MetricMapper: {}
+      HTTPClient: {}
+  "github.com/castai/gpu-metrics-exporter/internal/workload":
+    interfaces:
+      Resolver: {}
   "github.com/castai/gpu-metrics-exporter/internal/castai":
     interfaces:
-      Client:
+      Client: {}

README.md

Lines changed: 25 additions & 8 deletions
@@ -20,21 +20,38 @@ nv-hostengine.
 Make sure that these fields are exposed by DCGM exporter as metrics:
 
 ```
+DCGM_FI_DEV_GPU_TEMP
+DCGM_FI_DEV_MEMORY_TEMP
+DCGM_FI_DEV_POWER_USAGE
+DCGM_FI_DEV_MEM_MAX_OP_TEMP
+DCGM_FI_DEV_GPU_MAX_OP_TEMP
+DCGM_FI_DEV_SM_CLOCK
+DCGM_FI_DEV_GPU_UTIL
 DCGM_FI_PROF_SM_ACTIVE
 DCGM_FI_PROF_SM_OCCUPANCY
-DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
-DCGM_FI_PROF_DRAM_ACTIVE
-DCGM_FI_PROF_PCIE_TX_BYTES
-DCGM_FI_PROF_PCIE_RX_BYTES
 DCGM_FI_PROF_GR_ENGINE_ACTIVE
-DCGM_FI_DEV_FB_TOTAL
+DCGM_FI_PROF_DRAM_ACTIVE
 DCGM_FI_DEV_FB_FREE
 DCGM_FI_DEV_FB_USED
+DCGM_FI_DEV_FB_TOTAL
+DCGM_FI_DEV_MEM_COPY_UTIL
+DCGM_FI_PROF_PCIE_TX_BYTES
+DCGM_FI_PROF_PCIE_RX_BYTES
 DCGM_FI_DEV_PCIE_LINK_GEN
 DCGM_FI_DEV_PCIE_LINK_WIDTH
-DCGM_FI_DEV_GPU_TEMP
-DCGM_FI_DEV_MEMORY_TEMP
-DCGM_FI_DEV_POWER_USAGE
+DCGM_FI_PROF_NVLINK_TX_BYTES
+DCGM_FI_PROF_NVLINK_RX_BYTES
+DCGM_FI_PROF_PIPE_INT_ACTIVE
+DCGM_FI_PROF_PIPE_FP16_ACTIVE
+DCGM_FI_PROF_PIPE_FP32_ACTIVE
+DCGM_FI_PROF_PIPE_FP64_ACTIVE
+DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
+DCGM_FI_DEV_MIG_MODE
+DCGM_FI_DEV_MIG_MAX_SLICES
+DCGM_FI_DEV_CLOCKS_EVENT_REASONS
+DCGM_FI_DEV_XID_ERRORS
+DCGM_FI_DEV_POWER_VIOLATION
+DCGM_FI_DEV_THERMAL_VIOLATION
 ```
 
 ## Installation
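
The README hunk above asks that every listed field be exposed by the DCGM exporter. One way to verify that against a live scrape is sketched below, assuming the exporter serves the usual Prometheus text format. The helper name `missingMetrics` and the sample payload are illustrative and not part of this commit.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// missingMetrics returns the required metric names that never appear as a
// sample line in a Prometheus text-format payload.
// Hypothetical helper for illustration; it is not code from this repository.
func missingMetrics(payload string, required []string) []string {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		seen[line[:end]] = true
	}

	var missing []string
	for _, name := range required {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Sample payload invented for this sketch.
	payload := `# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 68.5
`
	required := []string{
		"DCGM_FI_DEV_GPU_TEMP",
		"DCGM_FI_DEV_POWER_USAGE",
		"DCGM_FI_DEV_XID_ERRORS",
	}
	fmt.Println("missing:", missingMetrics(payload, required))
}
```

A name that only shows up in a `# HELP` or `# TYPE` comment is still reported as missing, because only sample lines count as exposure.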

charts/gpu-metrics-exporter/templates/dcgm-exporter-configmap.yaml

Lines changed: 12 additions & 1 deletion
@@ -16,14 +16,14 @@ data:
 DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
 DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned
 DCGM_FI_PROF_SM_OCCUPANCY, gauge, The fraction of resident warps on a multiprocessor
-DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, The ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles)
 DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
 DCGM_FI_PROF_DRAM_ACTIVE, gauge, The ratio of cycles the device memory interface is active sending or receiving data.
 
 # Memory usage,,
 DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
 DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
 DCGM_FI_DEV_FB_TOTAL, gauge, Total Frame Buffer of the GPU in MB.
+DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Utilization of the memory copy engine.
 
 # PCIE,,
 DCGM_FI_PROF_PCIE_TX_BYTES, gauge, Total number of bytes transmitted through PCIe TX
@@ -36,4 +36,15 @@ data:
 DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipe is active.
 DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipe is active.
 DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipe is active.
+DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, The ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles)
+
+# Health,,
+DCGM_FI_DEV_CLOCKS_EVENT_REASONS, gauge, Current clock event reasons (bitmask of DCGM_CLOCKS_EVENT_REASON_*)
+DCGM_FI_DEV_XID_ERRORS, gauge, The value is the specific XID error
+DCGM_FI_DEV_POWER_VIOLATION, gauge, Power Violation time in ns.
+DCGM_FI_DEV_THERMAL_VIOLATION, gauge, Thermal Violation time in ns.
+
+# NVLink,,
+DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, The number of bytes of active NvLink tx (transmit) data including both header and payload.
+DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, The number of bytes of active NvLink rx (read) data including both header and payload.
 {{- end }}
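
Each collector row in the ConfigMap above has the form `field name, metric type, help text`, and lines starting with `#` act as section comments. A minimal Go sketch of reading that format follows. The `collectorField` type and `parseCollectors` function are illustrative names assumed for this example, not code from this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// collectorField is one row of a DCGM exporter collectors file:
// "<field name>, <Prometheus metric type>, <help text>".
// Hypothetical type for illustration only.
type collectorField struct {
	Name string
	Type string
	Help string
}

// parseCollectors skips blank lines and '#' comments and splits every other
// line into at most three parts, so commas inside the help text survive.
func parseCollectors(content string) []collectorField {
	var fields []collectorField
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, ",", 3)
		if len(parts) != 3 {
			continue // malformed row: ignored in this sketch
		}
		fields = append(fields, collectorField{
			Name: strings.TrimSpace(parts[0]),
			Type: strings.TrimSpace(parts[1]),
			Help: strings.TrimSpace(parts[2]),
		})
	}
	return fields
}

func main() {
	// Rows copied from the hunk above.
	content := `
# Health,,
DCGM_FI_DEV_XID_ERRORS, gauge, The value is the specific XID error
DCGM_FI_DEV_POWER_VIOLATION, gauge, Power Violation time in ns.

# NVLink,,
DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, The number of bytes of active NvLink tx (transmit) data including both header and payload.
`
	for _, f := range parseCollectors(content) {
		fmt.Printf("%s (%s): %s\n", f.Name, f.Type, f.Help)
	}
}
```

Splitting with a limit of three is a deliberate choice here: help strings may contain commas, and a plain split would truncate them.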

charts/gpu-metrics-exporter/templates/move-api-key-to-secret.yaml

Lines changed: 0 additions & 44 deletions
This file was deleted.

cmd/main.go

Lines changed: 58 additions & 22 deletions
@@ -4,6 +4,7 @@ import (
 	"context"
 	"errors"
 	"fmt"
+	"log/slog"
 	"net/http"
 	"os"
 	"os/signal"
@@ -14,14 +15,17 @@ import (
 	"github.com/sirupsen/logrus"
 	"k8s.io/apimachinery/pkg/labels"
 	"k8s.io/apimachinery/pkg/selection"
-	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/dynamic"
 	"k8s.io/client-go/tools/clientcmd"
 	"k8s.io/client-go/util/flowcontrol"
 
 	"github.com/castai/gpu-metrics-exporter/internal/castai"
 	"github.com/castai/gpu-metrics-exporter/internal/config"
 	"github.com/castai/gpu-metrics-exporter/internal/exporter"
 	"github.com/castai/gpu-metrics-exporter/internal/server"
+	"github.com/castai/gpu-metrics-exporter/internal/workload"
+	"github.com/castai/logging"
+	"github.com/castai/metrics"
 )
 
 var (
@@ -38,19 +42,30 @@ func main() {
 		log.Fatal(err)
 	}
 
-	logLevel, err := logrus.ParseLevel(cfg.LogLevel)
+	logLevel, err := parseLogLevel(cfg.LogLevel)
 	if err != nil {
-		log.Fatal(err)
+		log.Warnf("failed to parse log level, defaulting to 'info': %v", err)
+		logLevel = slog.LevelInfo
 	}
-	log.SetLevel(logLevel)
 
-	if err := run(cfg, log); err != nil && !errors.Is(err, context.Canceled) {
+	castaiLogger := logging.New(logging.NewTextHandler(logging.TextHandlerConfig{
+		Output: os.Stdout,
+		Level: logLevel,
+	}))
+
+	if err := run(cfg, castaiLogger); err != nil && !errors.Is(err, context.Canceled) {
 		log.Fatal(err)
 	}
 }
 
-func run(cfg *config.Config, log logrus.FieldLogger) error {
-	mux := server.NewServerMux(log)
+func parseLogLevel(level string) (slog.Level, error) {
+	var lvl slog.Level
+	err := lvl.UnmarshalText([]byte(level))
+	return lvl, err
+}
+
+func run(cfg *config.Config, log *logging.Logger) error {
+	mux := server.NewServerMux()
 
 	srv := &http.Server{
 		Addr: fmt.Sprintf(":%d", cfg.HTTPListenPort),
@@ -75,19 +90,45 @@ func run(cfg *config.Config, log logrus.FieldLogger) error {
 		cancel()
 	}()
 
-	clientset, err := newKubernetesClientset(cfg)
+	dynClient, err := newDynamicClient(cfg)
 	if err != nil {
-		log.Fatal(err)
+		log.WithField("error", err.Error()).Fatal("failed to create kubernetes dynamic client")
 	}
 
 	labelSelector, err := selectorFromMap(cfg.DCGMLabels)
 	if err != nil {
-		log.Fatal(err)
+		log.WithField("error", err.Error()).Fatal("failed to create get label selector")
+	}
+
+	metricClient, err := metrics.NewMetricClient(
+		metrics.Config{
+			APIAddr: cfg.TelemetryURL,
+			APIToken: cfg.APIKey,
+			ClusterID: cfg.ClusterID,
+		}, log)
+	if err != nil {
+		log.WithField("error", err.Error()).Warn("failed to create metrics client")
+	}
+
+	if metricClient != nil {
+		go func() {
+			if err := metricClient.Start(ctx); err != nil && !errors.Is(err, context.Canceled) {
+				log.WithField("error", err.Error()).Error("error in metrics client")
+			}
+		}()
 	}
 
 	client := setupCastAIClient(log, cfg)
 	scraper := exporter.NewScraper(&http.Client{}, log)
-	mapper := exporter.NewMapper(cfg.NodeName)
+	workloadResolver, err := workload.NewResolver(dynClient, workload.Config{
+		LabelKeys: []string{"workloads.cast.ai/custom-workload"},
+		CacheSize: 512,
+	})
+	if err != nil {
+		log.WithField("error", err.Error()).Fatal("failed to create workload resolver")
+	}
+
+	mapper := exporter.NewMapper(cfg.NodeName, workloadResolver, log)
 	ex := exporter.NewExporter(exporter.Config{
 		ExportInterval: cfg.ExportInterval,
 		Selector: labelSelector.String(),
@@ -96,7 +137,7 @@ func run(cfg *config.Config, log logrus.FieldLogger) error {
 		DCGMExporterHost: cfg.DCGMHost,
 		Enabled: true,
 		NodeName: cfg.NodeName,
-	}, clientset, log, scraper, mapper, client)
+	}, dynClient, log, scraper, mapper, client, metricClient)
 
 	go func() {
 		if err := ex.Start(ctx); err != nil && !errors.Is(err, context.Canceled) {
@@ -108,19 +149,14 @@ func run(cfg *config.Config, log logrus.FieldLogger) error {
 	return srv.ListenAndServe()
 }
 
-func newKubernetesClientset(cfg *config.Config) (*kubernetes.Clientset, error) {
-	config, err := clientcmd.BuildConfigFromFlags("", cfg.KubeConfigPath)
-	if err != nil {
-		return nil, err
-	}
-	config.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(float32(10), 25)
-
-	clientset, err := kubernetes.NewForConfig(config)
+func newDynamicClient(cfg *config.Config) (dynamic.Interface, error) {
+	restConfig, err := clientcmd.BuildConfigFromFlags("", cfg.KubeConfigPath)
 	if err != nil {
 		return nil, err
 	}
+	restConfig.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(float32(10), 25)
 
-	return clientset, nil
+	return dynamic.NewForConfig(restConfig)
 }
 
 func selectorFromMap(labelMap map[string]string) (labels.Selector, error) {
@@ -138,7 +174,7 @@ func selectorFromMap(labelMap map[string]string) (labels.Selector, error) {
 	return selector.Add(requirements...), nil
 }
 
-func setupCastAIClient(log logrus.FieldLogger, cfg *config.Config) castai.Client {
+func setupCastAIClient(log *logging.Logger, cfg *config.Config) castai.Client {
 	clientConfig := castai.Config{
 		ClusterID: cfg.ClusterID,
 		APIKey: cfg.APIKey,

gen_mockery.go

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
-//go:generate go run github.com/vektra/mockery/v2@v2.42.0 --all
+//go:generate go run github.com/vektra/mockery/v3@v3.5.3
 package mockery

go.mod

Lines changed: 30 additions & 17 deletions
@@ -1,53 +1,66 @@
 module github.com/castai/gpu-metrics-exporter
 
-go 1.22.12
+go 1.23.4
 
 require (
+	github.com/castai/logging v0.1.0
+	github.com/castai/metrics v0.0.0-20250917084341-1533777a055a
 	github.com/go-resty/resty/v2 v2.11.0
+	github.com/hashicorp/golang-lru/v2 v2.0.7
 	github.com/jarcoal/httpmock v1.3.1
 	github.com/kelseyhightower/envconfig v1.4.0
 	github.com/prometheus/client_model v0.6.0
 	github.com/prometheus/common v0.49.0
 	github.com/sirupsen/logrus v1.9.3
-	github.com/stretchr/testify v1.8.4
-	golang.org/x/sync v0.6.0
-	google.golang.org/protobuf v1.33.0
+	github.com/stretchr/testify v1.9.0
+	golang.org/x/sync v0.8.0
+	google.golang.org/protobuf v1.35.2
 	k8s.io/api v0.29.2
 	k8s.io/apimachinery v0.29.2
 	k8s.io/client-go v0.29.2
 )
 
 require (
-	github.com/davecgh/go-spew v1.1.1 // indirect
+	github.com/beorn7/perks v1.0.1 // indirect
+	github.com/cenkalti/backoff/v4 v4.3.0 // indirect
+	github.com/cespare/xxhash/v2 v2.3.0 // indirect
+	github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
 	github.com/emicklei/go-restful/v3 v3.11.0 // indirect
 	github.com/evanphx/json-patch v4.12.0+incompatible // indirect
-	github.com/go-logr/logr v1.3.0 // indirect
+	github.com/fsnotify/fsnotify v1.6.0 // indirect
+	github.com/go-logr/logr v1.4.2 // indirect
 	github.com/go-openapi/jsonpointer v0.19.6 // indirect
 	github.com/go-openapi/jsonreference v0.20.2 // indirect
 	github.com/go-openapi/swag v0.22.3 // indirect
 	github.com/gogo/protobuf v1.3.2 // indirect
-	github.com/golang/protobuf v1.5.3 // indirect
+	github.com/golang/protobuf v1.5.4 // indirect
 	github.com/google/gnostic-models v0.6.8 // indirect
 	github.com/google/gofuzz v1.2.0 // indirect
-	github.com/google/uuid v1.3.0 // indirect
+	github.com/google/uuid v1.6.0 // indirect
+	github.com/hamba/avro/v2 v2.27.0 // indirect
 	github.com/imdario/mergo v0.3.6 // indirect
 	github.com/josharian/intern v1.0.0 // indirect
 	github.com/json-iterator/go v1.1.12 // indirect
 	github.com/mailru/easyjson v0.7.7 // indirect
+	github.com/mitchellh/mapstructure v1.5.0 // indirect
 	github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
 	github.com/modern-go/reflect2 v1.0.2 // indirect
 	github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
 	github.com/pkg/errors v0.9.1 // indirect
-	github.com/pmezard/go-difflib v1.0.0 // indirect
+	github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
+	github.com/prometheus/client_golang v1.19.0 // indirect
+	github.com/prometheus/procfs v0.12.0 // indirect
+	github.com/sercand/kuberesolver/v5 v5.1.1 // indirect
 	github.com/spf13/pflag v1.0.5 // indirect
-	github.com/stretchr/objx v0.5.0 // indirect
-	golang.org/x/net v0.23.0 // indirect
-	golang.org/x/oauth2 v0.17.0 // indirect
-	golang.org/x/sys v0.18.0 // indirect
-	golang.org/x/term v0.18.0 // indirect
-	golang.org/x/text v0.14.0 // indirect
-	golang.org/x/time v0.3.0 // indirect
-	google.golang.org/appengine v1.6.7 // indirect
+	github.com/stretchr/objx v0.5.2 // indirect
+	golang.org/x/net v0.30.0 // indirect
+	golang.org/x/oauth2 v0.23.0 // indirect
+	golang.org/x/sys v0.26.0 // indirect
+	golang.org/x/term v0.25.0 // indirect
+	golang.org/x/text v0.19.0 // indirect
+	golang.org/x/time v0.6.0 // indirect
+	google.golang.org/genproto/googleapis/rpc v0.0.0-20241015192408-796eee8c2d53 // indirect
+	google.golang.org/grpc v1.69.0 // indirect
 	gopkg.in/inf.v0 v0.9.1 // indirect
 	gopkg.in/yaml.v2 v2.4.0 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
