Commit 24e95f1

added extra pipeline for metrics monitoring (#416)
* add extra pipeline for kubeletstats monitoring in fargate pods
* feat: OB-41415 send sidecar metrics to observe directly and restructure config for extensibility

Customers have been requesting that we support EKS Fargate hosted clusters. To do this, I add a new Fargate mode (off by default) that will install an otel operator, which will use a sidecar container to query metrics from the pod it is attached to.
1 parent 7a06760 commit 24e95f1

11 files changed: +260 -68 lines changed

charts/agent/Chart.yaml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ apiVersion: v2
 name: agent
 description: Chart to install K8s collection stack based on Observe Agent
 type: application
-version: 0.74.4
+version: 0.75.0
 appVersion: "2.10.1"
 dependencies:
   - name: opentelemetry-collector

charts/agent/README.md

Lines changed: 9 additions & 1 deletion
@@ -1,6 +1,6 @@
 # agent
 
-![Version: 0.74.4](https://img.shields.io/badge/Version-0.74.4-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.10.1](https://img.shields.io/badge/AppVersion-2.10.1-informational?style=flat-square)
+![Version: 0.75.0](https://img.shields.io/badge/Version-0.75.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.10.1](https://img.shields.io/badge/AppVersion-2.10.1-informational?style=flat-square)
 
 Chart to install K8s collection stack based on Observe Agent
 
@@ -30,6 +30,10 @@ This service is a *daemonset* which means it runs on every node in the cluster.
 
 This service is a *single-instance deployment*. It's critical that this service is only a single instance since otherwise it would produce duplicate data. It is responsible for monitoring the other containers of Observe Agent running by scraping the exposed Prometheus metrics of those agents. It's best practice to separate the monitoring of the agents from the agents themselves since if problems develop in those pipelines, we would need the agent telemetry to keep flowing in order to diagnose.
 
+## fargate-collector
+
+This service is an *OpenTelemetryCollector*, a custom resource that is managed by an OpenTelemetry Operator (must be installed separately). It is responsible for collecting metrics from nodes when running on AWS Fargate. It injects a sidecar into every pod with the appropriate annotation, and scrapes the kubelet API of that pod's node for metrics. Daemonsets are not allowed on Fargate, so this service is intended as a replacement for the usual approach to node metric collection with the `node-logs-metrics` daemonset.
+
 ## Maintainers
 
 | Name | Email | Url |
@@ -599,6 +603,10 @@ This service is a *single-instance deployment*. It's critical that this service
 | node.metrics.fileSystem.excludeMountPoints | string | `"[\"/dev/*\",\"/proc/*\",\"/sys/*\",\"/run/k3s/containerd/*\",\"/var/lib/docker/*\",\"/var/lib/kubelet/*\",\"/snap/*\"]"` | |
 | node.metrics.fileSystem.rootPath | string | `"/hostfs"` | |
 | node.metrics.interval | string | `"60s"` | |
+| nodeless.enabled | bool | `false` | Enables nodeless mode. Nodeless mode is intended for environments where daemonsets are not supported. |
+| nodeless.hostingPlatform | string | `""` | The hosting platform for the nodeless mode. Valid values are "fargate". |
+| nodeless.metrics.enabled | bool | `false` | |
+| nodeless.serviceAccounts | object | `{}` | A map of namespaces to lists of service accounts. If you provide service accounts here we will attach a cluster role and binding granting the service accounts permission to the relevant Kubernetes APIs needed to collect metrics. If empty, you will need to manually grant the service accounts the necessary permissions. Example: serviceAccounts: default: ["app1-sa", "app2-sa"] fargate-ns: ["fargate-app-sa"] |
 | observe.collectionEndpoint.value | string | `""` | |
 | observe.entityToken.create | bool | `false` | |
 | observe.entityToken.use | bool | `false` | |
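
The four `nodeless.*` rows added to the values table could be combined roughly as follows. This is a minimal sketch of a values override, not a tested configuration: the file name is hypothetical, and the namespaces and service account names are simply the ones used in the README's own example.

```yaml
# values-fargate.yaml (hypothetical file name)
nodeless:
  enabled: true                # off by default
  hostingPlatform: "fargate"   # the only valid value listed in the README
  metrics:
    enabled: true              # the only nodeless pipeline added in this commit
  serviceAccounts:             # namespace -> list of service account names
    default: ["app1-sa", "app2-sa"]
    fargate-ns: ["fargate-app-sa"]
```

Leaving `serviceAccounts` empty is also allowed according to the README, in which case the permissions have to be granted to the service accounts manually.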

charts/agent/README.md.gotmpl

Lines changed: 4 additions & 0 deletions
@@ -31,6 +31,10 @@ This service is a *daemonset* which means it runs on every node in the cluster.
 
 This service is a *single-instance deployment*. It's critical that this service is only a single instance since otherwise it would produce duplicate data. It is responsible for monitoring the other containers of Observe Agent running by scraping the exposed Prometheus metrics of those agents. It's best practice to separate the monitoring of the agents from the agents themselves since if problems develop in those pipelines, we would need the agent telemetry to keep flowing in order to diagnose.
 
+## fargate-collector
+
+This service is an *OpenTelemetryCollector*, a custom resource that is managed by an OpenTelemetry Operator (must be installed separately). It is responsible for collecting metrics from nodes when running on AWS Fargate. It injects a sidecar into every pod with the appropriate annotation, and scrapes the kubelet API of that pod's node for metrics. Daemonsets are not allowed on Fargate, so this service is intended as a replacement for the usual approach to node metric collection with the `node-logs-metrics` daemonset.
+
 {{ template "chart.homepageLine" . }}
 
 {{ template "chart.maintainersSection" . }}
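
The README text refers to "the appropriate annotation" without naming it. Assuming it means the OpenTelemetry Operator's standard sidecar-injection annotation, `sidecar.opentelemetry.io/inject`, a workload opting in might look like the sketch below. The pod name, image, and service account are hypothetical, and referencing the collector by name assumes the `fargate-collector` resource lives in the same namespace as the pod.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app1                          # hypothetical workload
  namespace: default
  annotations:
    # Assumption: the operator's standard injection annotation, pointing at
    # the OpenTelemetryCollector named "fargate-collector".
    sidecar.opentelemetry.io/inject: "fargate-collector"
spec:
  serviceAccountName: app1-sa         # should be listed under nodeless.serviceAccounts
  containers:
    - name: app
      image: example.com/app1:1.0     # placeholder image
```

The sidecar queries the Kubernetes API using the pod's own service account, which is why that account needs the RBAC permissions this commit adds.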

charts/agent/templates/_config-processors.tpl

Lines changed: 23 additions & 0 deletions
@@ -148,6 +148,14 @@ attributes/debug_source_cadvisor_metrics:
 {{- end -}}
 {{- end -}}
 
+{{- define "config.processors.attributes.sidecar_kubeletstats_metrics" -}}
+attributes/debug_source_sidecar_kubeletstats_metrics:
+  actions:
+    - key: debug_source
+      action: insert
+      value: sidecar_kubeletstats_metrics
+{{- end -}}
+
 {{- define "config.processors.attributes.drop_container_info" -}}
 resource/drop_container_info:
   attributes:
@@ -162,6 +170,21 @@ resource/drop_service_name:
       key: service.name
 {{- end -}}
 
+{{- define "config.processors.metricstransform.duplicate_k8s_cpu_metrics" -}}
+# convert new k8s metric names to the names our Kubernetes Explorer relies on
+metricstransform/duplicate_k8s_cpu_metrics:
+  transforms:
+    - include: container.cpu.usage
+      action: insert
+      new_name: container.cpu.utilization
+    - include: k8s.pod.cpu.usage
+      action: insert
+      new_name: k8s.pod.cpu.utilization
+    - include: k8s.node.cpu.usage
+      action: insert
+      new_name: k8s.node.cpu.utilization
+{{- end -}}
+
 {{- define "config.processors.filter.drop_long_spans" -}}
 {{- if eq .Values.node.forwarder.traces.maxSpanDuration "none" }}
 {{- else if (regexMatch "^[0-9]+(ns|us|ms|s|m|h)$" .Values.node.forwarder.traces.maxSpanDuration) }}

charts/agent/templates/_config.tpl

Lines changed: 8 additions & 0 deletions
@@ -12,6 +12,14 @@
 {{- toYaml $config | indent 2 }}
 {{- end }}
 
+{{- define "observe.sidecar.applyFargateSidecarConfig" -}}
+{{- $values := deepCopy .Values }}
+{{- $data := dict "Values" $values | mustMergeOverwrite (deepCopy .) }}
+{{- $config := mustMergeOverwrite ( include "observe.sidecar.FargateSidecar.config" $data | fromYaml ) ($values.agent.config.FargateSidecar) ($values.agent.config.global.overrides) -}}
+{{- toYaml $config | indent 2 }}
+{{- end }}
+
+
 {{- define "observe.deployment.applyPrometheusScraperConfig" -}}
 {{- $values := deepCopy .Values }}
 {{- $data := dict "Values" $values | mustMergeOverwrite (deepCopy .) }}
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+{{- define "observe.sidecar.FargateSidecar.config" -}}
+
+receivers:
+  {{- include "observe.kubeletstats.receiver" (dict "Values" .Values "endpoint" "https://kubernetes.default.svc/api/v1/nodes/${env:K8S_NODE_NAME}/proxy") | nindent 2 }}
+
+processors:
+
+  {{- include "config.processors.memory_limiter" . | nindent 2 }}
+  {{- include "config.processors.batch" . | nindent 2 }}
+  {{- include "config.processors.resource_detection.cloud" . | nindent 2 }}
+  {{- include "config.processors.attributes.k8sattributes" . | nindent 2 }}
+  {{- include "config.processors.resource.observe_common" . | nindent 2 }}
+  {{- include "config.processors.deltatocumulative" . | nindent 2 }}
+  {{- include "config.processors.attributes.add_empty_service_attributes" . | nindent 2 }}
+  {{- include "config.processors.metricstransform.duplicate_k8s_cpu_metrics" . | nindent 2 }}
+  {{- include "config.processors.attributes.sidecar_kubeletstats_metrics" . | nindent 2 }}
+
+exporters:
+  {{- include "config.exporters.debug" . | nindent 2 }}
+  {{- include "config.exporters.prometheusremotewrite" . | nindent 2 }}
+
+{{ $kubeletstatsExporters := (list "prometheusremotewrite/observe") -}}
+
+{{- if eq .Values.agent.config.global.debug.enabled true }}
+  {{- $kubeletstatsExporters = concat $kubeletstatsExporters ( list "debug/override" ) | uniq }}
+{{- end }}
+
+# in the future, we may add other pipelines, and the failure condition should change to
+# being that no telemetry collection was enabled
+service:
+  pipelines:
+    {{- if .Values.nodeless.metrics.enabled }}
+    metrics/kubeletstats:
+      receivers: [kubeletstats]
+      processors: [memory_limiter, metricstransform/duplicate_k8s_cpu_metrics, k8sattributes, deltatocumulative/observe, batch, resourcedetection/cloud, resource/observe_common, attributes/debug_source_sidecar_kubeletstats_metrics]
+      exporters: [{{ join ", " $kubeletstatsExporters }}]
+    {{- else }}
+    {{- fail "nodeless.metrics.enabled must be true for Fargate sidecar - otherwise no telemetry will be collected" }}
+    {{- end }}
+{{- end }}
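
To make the exporter-list logic in this template concrete, here is a sketch of the `service` section it would emit with `nodeless.metrics.enabled: true` and `agent.config.global.debug.enabled: true`. It shows the define's own output only; the later `fromYaml`/`toYaml` round trip in `observe.sidecar.applyFargateSidecarConfig` may reorder keys and change the list style, and any user overrides are merged on top.

```yaml
service:
  pipelines:
    metrics/kubeletstats:
      receivers: [kubeletstats]
      processors: [memory_limiter, metricstransform/duplicate_k8s_cpu_metrics, k8sattributes, deltatocumulative/observe, batch, resourcedetection/cloud, resource/observe_common, attributes/debug_source_sidecar_kubeletstats_metrics]
      # "debug/override" is appended only when global debug is enabled;
      # otherwise the list is just [prometheusremotewrite/observe]
      exporters: [prometheusremotewrite/observe, debug/override]
```

With `nodeless.metrics.enabled: false` there is no rendered output at all, because the `fail` call aborts the render.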
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+{{- define "observe.kubeletstats.receiver" -}}
+kubeletstats:
+  collection_interval: {{.Values.node.containers.metrics.interval}}
+  auth_type: 'serviceAccount'
+  endpoint: {{ .endpoint }}
+  node: '${env:K8S_NODE_NAME}'
+  insecure_skip_verify: true
+  k8s_api_config:
+    auth_type: serviceAccount
+  metric_groups:
+    - node
+    - pod
+    - container
+  metrics:
+    # The following metrics are optional and must be enabled manually as per:
+    # https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#optional-metrics
+    container.cpu.usage:
+      enabled: true
+    container.uptime:
+      enabled: true
+    k8s.container.cpu.node.utilization:
+      enabled: true
+    k8s.container.cpu_limit_utilization:
+      enabled: true
+    k8s.container.cpu_request_utilization:
+      enabled: true
+    k8s.container.memory.node.utilization:
+      enabled: true
+    k8s.container.memory_limit_utilization:
+      enabled: true
+    k8s.container.memory_request_utilization:
+      enabled: true
+    k8s.node.cpu.usage:
+      enabled: true
+    k8s.node.uptime:
+      enabled: true
+    k8s.pod.cpu.node.utilization:
+      enabled: true
+    k8s.pod.cpu.usage:
+      enabled: true
+    k8s.pod.cpu_limit_utilization:
+      enabled: true
+    k8s.pod.cpu_request_utilization:
+      enabled: true
+    k8s.pod.memory.node.utilization:
+      enabled: true
+    k8s.pod.memory_limit_utilization:
+      enabled: true
+    k8s.pod.memory_request_utilization:
+      enabled: true
+    k8s.pod.uptime:
+      enabled: true
+  extra_metadata_labels:
+    - container.id
+{{- end }}
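
The helper takes a dict with two keys, `Values` and `endpoint`, so each caller decides how the kubelet is reached. As a rough illustration, the Fargate sidecar's call (which passes the API-server proxy URL) would render the head of the receiver approximately like this. The `60s` interval is an assumed value for `node.containers.metrics.interval`; the commit does not show its default.

```yaml
kubeletstats:
  collection_interval: 60s   # assumed value of node.containers.metrics.interval
  auth_type: 'serviceAccount'
  # Fargate path: go through the API server's node proxy instead of port 10250
  endpoint: https://kubernetes.default.svc/api/v1/nodes/${env:K8S_NODE_NAME}/proxy
  node: '${env:K8S_NODE_NAME}'
  insecure_skip_verify: true
  # k8s_api_config, metric_groups, metrics and extra_metadata_labels
  # follow exactly as written in the template
```

Because the endpoint goes through `nodes/<name>/proxy`, the calling service account needs access to the `nodes/proxy` resource.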

charts/agent/templates/_node-logs-metrics-config.tpl

Lines changed: 8 additions & 65 deletions
@@ -70,59 +70,13 @@ receivers:
       network: null
   {{ end -}}
   {{- if .Values.node.containers.metrics.enabled }}
-  kubeletstats:
-    collection_interval: {{.Values.node.containers.metrics.interval}}
-    auth_type: 'serviceAccount'
-    endpoint: {{ if .Values.node.kubeletstats.useNodeIp }}"${env:K8S_NODE_IP}:10250"{{ else }}"${env:K8S_NODE_NAME}:10250"{{ end }}
-    node: '${env:K8S_NODE_NAME}'
-    insecure_skip_verify: true
-    k8s_api_config:
-      auth_type: serviceAccount
-    metric_groups:
-      - node
-      - pod
-      - container
-    metrics:
-      # The following metrics are optional and must be enabled manually as per:
-      # https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#optional-metrics
-      container.cpu.usage:
-        enabled: true
-      container.uptime:
-        enabled: true
-      k8s.container.cpu.node.utilization:
-        enabled: true
-      k8s.container.cpu_limit_utilization:
-        enabled: true
-      k8s.container.cpu_request_utilization:
-        enabled: true
-      k8s.container.memory.node.utilization:
-        enabled: true
-      k8s.container.memory_limit_utilization:
-        enabled: true
-      k8s.container.memory_request_utilization:
-        enabled: true
-      k8s.node.cpu.usage:
-        enabled: true
-      k8s.node.uptime:
-        enabled: true
-      k8s.pod.cpu.node.utilization:
-        enabled: true
-      k8s.pod.cpu.usage:
-        enabled: true
-      k8s.pod.cpu_limit_utilization:
-        enabled: true
-      k8s.pod.cpu_request_utilization:
-        enabled: true
-      k8s.pod.memory.node.utilization:
-        enabled: true
-      k8s.pod.memory_limit_utilization:
-        enabled: true
-      k8s.pod.memory_request_utilization:
-        enabled: true
-      k8s.pod.uptime:
-        enabled: true
-    extra_metadata_labels:
-      - container.id
+  {{- $endpoint := "" }}
+  {{- if .Values.node.kubeletstats.useNodeIp }}
+  {{- $endpoint = "\"${env:K8S_NODE_IP}:10250\"" }}
+  {{- else }}
+  {{- $endpoint = "\"${env:K8S_NODE_NAME}:10250\"" }}
+  {{- end }}
+  {{- include "observe.kubeletstats.receiver" (dict "Values" .Values "endpoint" $endpoint) | nindent 2 }}
   {{ end -}}
   {{- if .Values.node.containers.logs.enabled }}
   filelog:
@@ -165,6 +119,7 @@ processors:
   {{- include "config.processors.batch" . | nindent 2 }}
   {{- include "config.processors.attributes.k8sattributes" . | nindent 2 }}
   {{- include "config.processors.resource.observe_common" . | nindent 2 }}
+  {{- include "config.processors.metricstransform.duplicate_k8s_cpu_metrics" . | nindent 2 }}
 
   {{- if .Values.agent.config.global.fleet.enabled }}
   {{- include "config.processors.resource_detection" . | nindent 2 }}
@@ -189,18 +144,6 @@ processors:
         action: insert
         value: kubeletstats_metrics
 
-  # convert new k8s metric names to the names our Kubernetes Explorer relies on
-  metricstransform/duplicate_k8s_cpu_metrics:
-    transforms:
-      - include: container.cpu.usage
-        action: insert
-        new_name: container.cpu.utilization
-      - include: k8s.pod.cpu.usage
-        action: insert
-        new_name: k8s.pod.cpu.utilization
-      - include: k8s.node.cpu.usage
-        action: insert
-        new_name: k8s.node.cpu.utilization
 
 # Create intermediate lists for pipeline arrays to then modify based on values.yaml
 {{- $logsExporters := (list "otlphttp/observe/base") -}}
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+{{- if .Values.nodeless.enabled }}
+{{- if .Values.nodeless.serviceAccounts }}
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: observe-agent-nodeless-cluster-role-{{ template "observe-agent.namespace" . }}
+  labels:
+    app.kubernetes.io/name: observe-agent-nodeless-cluster-role
+    app.kubernetes.io/instance: observe-agent
+rules:
+  - apiGroups: [""]
+    resources:
+      - nodes
+      - nodes/proxy
+      - namespaces
+      - pods
+      - configmaps
+    verbs: ["get", "list", "watch"]
+
+  - apiGroups: ["apps"]
+    resources:
+      - replicasets
+    verbs: ["get", "list", "watch"]
+---
+{{- range $namespace, $serviceAccounts := .Values.nodeless.serviceAccounts }}
+{{- if not (kindIs "slice" $serviceAccounts) }}
+{{- fail (printf "nodeless.serviceAccounts[%s] must be a list, but got: %v (type: %s)" $namespace $serviceAccounts (kindOf $serviceAccounts)) }}
+{{- end }}
+{{- if not $serviceAccounts }}
+{{- fail (printf "nodeless.serviceAccounts[%s] is empty. Please provide at least one service account or remove the namespace from the map." $namespace) }}
+{{- end }}
+{{- range $serviceAccounts }}
+{{- if not (kindIs "string" .) }}
+{{- fail (printf "nodeless.serviceAccounts[%s] contains a non-string value: %v. All service account names must be strings." $namespace .) }}
+{{- end }}
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: observe-agent-nodeless-cluster-role-binding-{{ $namespace }}-{{ . }}
+  labels:
+    app.kubernetes.io/name: observe-agent-nodeless-cluster-role-binding
+    app.kubernetes.io/instance: observe-agent
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: observe-agent-nodeless-cluster-role-{{ template "observe-agent.namespace" $ }}
+subjects:
+- kind: ServiceAccount
+  name: {{ . }}
+  namespace: {{ $namespace }}
+---
+{{- end }}
+{{- end }}
+{{- end }}
+{{- end }}
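
The nested `range` above emits one ClusterRoleBinding per namespace and service account pair. As a sketch, an entry `default: ["app1-sa"]` under `nodeless.serviceAccounts` would yield roughly the following. The `observe` suffix on the role name is an assumption about what the `observe-agent.namespace` helper renders; the actual value depends on the release.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  # pattern: observe-agent-nodeless-cluster-role-binding-<namespace>-<serviceAccount>
  name: observe-agent-nodeless-cluster-role-binding-default-app1-sa
  labels:
    app.kubernetes.io/name: observe-agent-nodeless-cluster-role-binding
    app.kubernetes.io/instance: observe-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: observe-agent-nodeless-cluster-role-observe   # assumes the namespace helper renders "observe"
subjects:
  - kind: ServiceAccount
    name: app1-sa
    namespace: default
```

A map value that is not a list, an empty list, or a non-string entry stops the render with one of the `fail` messages instead of producing a binding.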
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+{{- if .Values.nodeless.enabled }}
+{{- if eq .Values.nodeless.hostingPlatform "fargate" }}
+apiVersion: opentelemetry.io/v1beta1
+kind: OpenTelemetryCollector
+metadata:
+  name: fargate-collector
+spec:
+  mode: sidecar
+  image: "ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest"
+  env:
+    - name: K8S_NODE_NAME
+      valueFrom:
+        fieldRef:
+          fieldPath: spec.nodeName
+    - name: OBSERVE_CLUSTER_NAME
+      value: "{{ .Values.cluster.name }}"
+    - name: OBSERVE_CLUSTER_UID
+      valueFrom:
+        configMapKeyRef:
+          name: cluster-info
+          key: id
+    - name: OBSERVE_PROMETHEUS_ENDPOINT
+      value: "{{ .Values.observe.collectionEndpoint.value }}v1/prometheus"
+    - name: OBSERVE_AUTHORIZATION_HEADER
+      value: "Bearer {{ .Values.observe.token.value }}"
+  config:
+    {{- include "observe.sidecar.applyFargateSidecarConfig" . | nindent 4 }}
+  initContainers:
+    - name: kube-cluster-info
+      image: observeinc/kube-cluster-info:v0.11.5
+      imagePullPolicy: Always
+      env:
+        - name: NAMESPACE
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.namespace
+{{- else }}
+{{- fail (printf "Invalid nodeless.hostingPlatform, valid values are 'fargate', provided value is %s" .Values.nodeless.hostingPlatform) }}
+{{- end }}
+{{- end }}
