Commit 267684a

feat: OB-41415 send sidecar metrics to observe directly and restructure config for extensibility
Customers have been requesting support for clusters hosted on EKS Fargate. To do this, I add a new Fargate mode (off by default) that will install an OTel operator, which will use a sidecar container to query metrics from the pod it is attached to.
1 parent 5026226 commit 267684a

15 files changed, +232 -643 lines changed

charts/agent/Chart.yaml

Lines changed: 1 addition & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@ apiVersion: v2
22
name: agent
33
description: Chart to install K8s collection stack based on Observe Agent
44
type: application
5-
version: 0.74.3
5+
version: 0.75.0
66
appVersion: "2.10.1"
77
dependencies:
88
- name: opentelemetry-collector
@@ -40,11 +40,6 @@ dependencies:
4040
repository: https://open-telemetry.github.io/opentelemetry-helm-charts
4141
alias: gateway
4242
condition: gatewayDeployment.enabled
43-
- name: opentelemetry-operator
44-
version: 0.93.1
45-
repository: https://open-telemetry.github.io/opentelemetry-helm-charts
46-
alias: fargate-sidecar-injector
47-
condition: node.fargateMode
4843
maintainers:
4944
- name: Observe
5045

charts/agent/README.md

Lines changed: 9 additions & 1 deletion
@@ -1,6 +1,6 @@
11
# agent
22

3-
![Version: 0.74.3](https://img.shields.io/badge/Version-0.74.3-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.10.1](https://img.shields.io/badge/AppVersion-2.10.1-informational?style=flat-square)
3+
![Version: 0.75.0](https://img.shields.io/badge/Version-0.75.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.10.1](https://img.shields.io/badge/AppVersion-2.10.1-informational?style=flat-square)
44

55
Chart to install K8s collection stack based on Observe Agent
66

@@ -30,6 +30,10 @@ This service is a *daemonset* which means it runs on every node in the cluster.
3030

3131
This service is a *single-instance deployment*. It's critical that this service is only a single instance since otherwise it would produce duplicate data. It is responsible for monitoring the other containers of Observe Agent running by scraping the exposed Prometheus metrics of those agents. It's best practice to separate the monitoring of the agents from the agents themselves since if problems develop in those pipelines, we would need the agent telemetry to keep flowing in order to diagnose.
3232

33+
## fargate-collector
34+
35+
This service is an *OpenTelemetryCollector*, a custom resource that is managed by an OpenTelemetry Operator (which must be installed separately). It is responsible for collecting metrics from nodes when running on AWS Fargate. It injects a sidecar into every pod with the appropriate annotation, and that sidecar scrapes the kubelet API of the pod's node for metrics. Daemonsets are not allowed on Fargate, so this service is intended as a replacement for the usual approach to node metric collection with the `node-logs-metrics` daemonset.
36+
3337
## Maintainers
3438

3539
| Name | Email | Url |
@@ -599,6 +603,10 @@ This service is a *single-instance deployment*. It's critical that this service
599603
| node.metrics.fileSystem.excludeMountPoints | string | `"[\"/dev/*\",\"/proc/*\",\"/sys/*\",\"/run/k3s/containerd/*\",\"/var/lib/docker/*\",\"/var/lib/kubelet/*\",\"/snap/*\"]"` | |
600604
| node.metrics.fileSystem.rootPath | string | `"/hostfs"` | |
601605
| node.metrics.interval | string | `"60s"` | |
606+
| nodeless.enabled | bool | `false` | Enables nodeless mode. Nodeless mode is intended for environments where daemonsets are not supported. |
607+
| nodeless.hostingPlatform | string | `""` | The hosting platform for the nodeless mode. Valid values are "fargate". |
608+
| nodeless.metrics.enabled | bool | `false` | |
609+
| nodeless.serviceAccounts | object | `{}` | A map of namespaces to lists of service accounts. If you provide service accounts here we will attach a cluster role and binding granting the service accounts permission to the relevant Kubernetes APIs needed to collect metrics. If empty, you will need to manually grant the service accounts the necessary permissions. Example: serviceAccounts: default: ["app1-sa", "app2-sa"] fargate-ns: ["fargate-app-sa"] |
602610
| observe.collectionEndpoint.value | string | `""` | |
603611
| observe.entityToken.create | bool | `false` | |
604612
| observe.entityToken.use | bool | `false` | |
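
As a rough illustration of the four `nodeless.*` values added to the table above, a values override for a Fargate cluster might look like the sketch below. The namespaces and service account names are the placeholders from the `nodeless.serviceAccounts` description, not defaults, and the file name is hypothetical.

```yaml
# values-fargate.yaml (hypothetical file name)
nodeless:
  enabled: true
  hostingPlatform: "fargate"   # the only valid value listed in the README
  metrics:
    enabled: true              # the Fargate sidecar config template fails to render when this is false
  # Map of namespace -> list of service account names.
  # Each listed account gets a ClusterRoleBinding to the nodeless ClusterRole.
  serviceAccounts:
    default: ["app1-sa", "app2-sa"]
    fargate-ns: ["fargate-app-sa"]
```

If `serviceAccounts` is left empty, the README notes that the necessary permissions have to be granted to the service accounts manually.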

charts/agent/README.md.gotmpl

Lines changed: 4 additions & 0 deletions
@@ -31,6 +31,10 @@ This service is a *daemonset* which means it runs on every node in the cluster.
3131

3232
This service is a *single-instance deployment*. It's critical that this service is only a single instance since otherwise it would produce duplicate data. It is responsible for monitoring the other containers of Observe Agent running by scraping the exposed Prometheus metrics of those agents. It's best practice to separate the monitoring of the agents from the agents themselves since if problems develop in those pipelines, we would need the agent telemetry to keep flowing in order to diagnose.
3333

34+
## fargate-collector
35+
36+
This service is an *OpenTelemetryCollector*, a custom resource that is managed by an OpenTelemetry Operator (which must be installed separately). It is responsible for collecting metrics from nodes when running on AWS Fargate. It injects a sidecar into every pod with the appropriate annotation, and that sidecar scrapes the kubelet API of the pod's node for metrics. Daemonsets are not allowed on Fargate, so this service is intended as a replacement for the usual approach to node metric collection with the `node-logs-metrics` daemonset.
37+
3438
{{ template "chart.homepageLine" . }}
3539

3640
{{ template "chart.maintainersSection" . }}
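
The `fargate-collector` text above refers to "the appropriate annotation" without naming it. With the upstream OpenTelemetry Operator, sidecar injection is requested through the `sidecar.opentelemetry.io/inject` pod annotation; the sketch below assumes that is the annotation meant here. The pod name, image, and service account are hypothetical, and the name of the collector resource created by this chart is not shown in this diff.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app1                         # hypothetical
  namespace: default
  annotations:
    # Assumption: upstream operator annotation. "true" asks the operator to inject
    # the sidecar-mode collector found in this namespace; the name of a specific
    # OpenTelemetryCollector resource can be given instead.
    sidecar.opentelemetry.io/inject: "true"
spec:
  serviceAccountName: app1-sa        # should match an entry under nodeless.serviceAccounts
  containers:
    - name: app
      image: example.com/app1:latest # hypothetical
```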

charts/agent/templates/_config.tpl

Lines changed: 2 additions & 2 deletions
@@ -12,10 +12,10 @@
1212
{{- toYaml $config | indent 2 }}
1313
{{- end }}
1414

15-
{{- define "observe.sidecar.applyFargateSidecarMetricsConfig" -}}
15+
{{- define "observe.sidecar.applyFargateSidecarConfig" -}}
1616
{{- $values := deepCopy .Values }}
1717
{{- $data := dict "Values" $values | mustMergeOverwrite (deepCopy .) }}
18-
{{- $config := mustMergeOverwrite ( include "observe.sidecar.fargateSidecarMetrics.config" $data | fromYaml ) ($values.agent.config.fargateSidecarMetrics) ($values.agent.config.global.overrides) -}}
18+
{{- $config := mustMergeOverwrite ( include "observe.sidecar.FargateSidecar.config" $data | fromYaml ) ($values.agent.config.FargateSidecar) ($values.agent.config.global.overrides) -}}
1919
{{- toYaml $config | indent 2 }}
2020
{{- end }}
2121

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
1+
{{- define "observe.sidecar.FargateSidecar.config" -}}
2+
3+
receivers:
4+
{{- include "observe.kubeletstats.receiver" (dict "Values" .Values "endpoint" "https://kubernetes.default.svc/api/v1/nodes/${env:K8S_NODE_NAME}/proxy") | nindent 2 }}
5+
6+
processors:
7+
8+
{{- include "config.processors.memory_limiter" . | nindent 2 }}
9+
{{- include "config.processors.batch" . | nindent 2 }}
10+
{{- include "config.processors.resource_detection.cloud" . | nindent 2 }}
11+
{{- include "config.processors.attributes.k8sattributes" . | nindent 2 }}
12+
{{- include "config.processors.resource.observe_common" . | nindent 2 }}
13+
{{- include "config.processors.deltatocumulative" . | nindent 2 }}
14+
{{- include "config.processors.attributes.add_empty_service_attributes" . | nindent 2 }}
15+
{{- include "config.processors.metricstransform.duplicate_k8s_cpu_metrics" . | nindent 2 }}
16+
{{- include "config.processors.attributes.sidecar_kubeletstats_metrics" . | nindent 2 }}
17+
18+
exporters:
19+
{{- include "config.exporters.debug" . | nindent 2 }}
20+
{{- include "config.exporters.prometheusremotewrite" . | nindent 2 }}
21+
22+
{{ $kubeletstatsExporters := (list "prometheusremotewrite/observe") -}}
23+
24+
{{- if eq .Values.agent.config.global.debug.enabled true }}
25+
{{- $kubeletstatsExporters = concat $kubeletstatsExporters ( list "debug/override" ) | uniq }}
26+
{{- end }}
27+
28+
# in the future, we may add other pipelines, and the failure condition should change to
29+
# being that no telemetry collection was enabled
30+
service:
31+
pipelines:
32+
{{- if .Values.nodeless.metrics.enabled }}
33+
metrics/kubeletstats:
34+
receivers: [kubeletstats]
35+
processors: [memory_limiter, metricstransform/duplicate_k8s_cpu_metrics, k8sattributes, deltatocumulative/observe, batch, resourcedetection/cloud, resource/observe_common, attributes/debug_source_sidecar_kubeletstats_metrics]
36+
exporters: [{{ join ", " $kubeletstatsExporters }}]
37+
{{- else }}
38+
{{- fail "nodeless.metrics.enabled must be true for Fargate sidecar - otherwise no telemetry will be collected" }}
39+
{{- end }}
40+
{{- end }}
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
{{- define "observe.kubeletstats.receiver" -}}
2+
kubeletstats:
3+
collection_interval: {{.Values.node.containers.metrics.interval}}
4+
auth_type: 'serviceAccount'
5+
endpoint: {{ .endpoint }}
6+
node: '${env:K8S_NODE_NAME}'
7+
insecure_skip_verify: true
8+
k8s_api_config:
9+
auth_type: serviceAccount
10+
metric_groups:
11+
- node
12+
- pod
13+
- container
14+
metrics:
15+
# The following metrics are optional and must be enabled manually as per:
16+
# https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#optional-metrics
17+
container.cpu.usage:
18+
enabled: true
19+
container.uptime:
20+
enabled: true
21+
k8s.container.cpu.node.utilization:
22+
enabled: true
23+
k8s.container.cpu_limit_utilization:
24+
enabled: true
25+
k8s.container.cpu_request_utilization:
26+
enabled: true
27+
k8s.container.memory.node.utilization:
28+
enabled: true
29+
k8s.container.memory_limit_utilization:
30+
enabled: true
31+
k8s.container.memory_request_utilization:
32+
enabled: true
33+
k8s.node.cpu.usage:
34+
enabled: true
35+
k8s.node.uptime:
36+
enabled: true
37+
k8s.pod.cpu.node.utilization:
38+
enabled: true
39+
k8s.pod.cpu.usage:
40+
enabled: true
41+
k8s.pod.cpu_limit_utilization:
42+
enabled: true
43+
k8s.pod.cpu_request_utilization:
44+
enabled: true
45+
k8s.pod.memory.node.utilization:
46+
enabled: true
47+
k8s.pod.memory_limit_utilization:
48+
enabled: true
49+
k8s.pod.memory_request_utilization:
50+
enabled: true
51+
k8s.pod.uptime:
52+
enabled: true
53+
extra_metadata_labels:
54+
- container.id
55+
{{- end }}

charts/agent/templates/_kubeletstats-sidecar.tpl

Lines changed: 0 additions & 73 deletions
This file was deleted.

charts/agent/templates/_node-logs-metrics-config.tpl

Lines changed: 7 additions & 53 deletions
@@ -70,59 +70,13 @@ receivers:
7070
network: null
7171
{{ end -}}
7272
{{- if .Values.node.containers.metrics.enabled }}
73-
kubeletstats:
74-
collection_interval: {{.Values.node.containers.metrics.interval}}
75-
auth_type: 'serviceAccount'
76-
endpoint: {{ if .Values.node.kubeletstats.useNodeIp }}"${env:K8S_NODE_IP}:10250"{{ else }}"${env:K8S_NODE_NAME}:10250"{{ end }}
77-
node: '${env:K8S_NODE_NAME}'
78-
insecure_skip_verify: true
79-
k8s_api_config:
80-
auth_type: serviceAccount
81-
metric_groups:
82-
- node
83-
- pod
84-
- container
85-
metrics:
86-
# The following metrics are optional and must be enabled manually as per:
87-
# https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#optional-metrics
88-
container.cpu.usage:
89-
enabled: true
90-
container.uptime:
91-
enabled: true
92-
k8s.container.cpu.node.utilization:
93-
enabled: true
94-
k8s.container.cpu_limit_utilization:
95-
enabled: true
96-
k8s.container.cpu_request_utilization:
97-
enabled: true
98-
k8s.container.memory.node.utilization:
99-
enabled: true
100-
k8s.container.memory_limit_utilization:
101-
enabled: true
102-
k8s.container.memory_request_utilization:
103-
enabled: true
104-
k8s.node.cpu.usage:
105-
enabled: true
106-
k8s.node.uptime:
107-
enabled: true
108-
k8s.pod.cpu.node.utilization:
109-
enabled: true
110-
k8s.pod.cpu.usage:
111-
enabled: true
112-
k8s.pod.cpu_limit_utilization:
113-
enabled: true
114-
k8s.pod.cpu_request_utilization:
115-
enabled: true
116-
k8s.pod.memory.node.utilization:
117-
enabled: true
118-
k8s.pod.memory_limit_utilization:
119-
enabled: true
120-
k8s.pod.memory_request_utilization:
121-
enabled: true
122-
k8s.pod.uptime:
123-
enabled: true
124-
extra_metadata_labels:
125-
- container.id
73+
{{- $endpoint := "" }}
74+
{{- if .Values.node.kubeletstats.useNodeIp }}
75+
{{- $endpoint = "\"${env:K8S_NODE_IP}:10250\"" }}
76+
{{- else }}
77+
{{- $endpoint = "\"${env:K8S_NODE_NAME}:10250\"" }}
78+
{{- end }}
79+
{{- include "observe.kubeletstats.receiver" (dict "Values" .Values "endpoint" $endpoint) | nindent 2 }}
12680
{{ end -}}
12781
{{- if .Values.node.containers.logs.enabled }}
12882
filelog:

charts/agent/templates/kubeletstats-sidecar.yaml

Lines changed: 0 additions & 24 deletions
This file was deleted.
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
{{- if .Values.nodeless.enabled }}
2+
{{- if .Values.nodeless.serviceAccounts }}
3+
---
4+
apiVersion: rbac.authorization.k8s.io/v1
5+
kind: ClusterRole
6+
metadata:
7+
name: observe-agent-nodeless-cluster-role-{{ template "observe-agent.namespace" . }}
8+
labels:
9+
app.kubernetes.io/name: observe-agent-nodeless-cluster-role
10+
app.kubernetes.io/instance: observe-agent
11+
rules:
12+
- apiGroups: [""]
13+
resources:
14+
- nodes
15+
- nodes/proxy
16+
- namespaces
17+
- pods
18+
- configmaps
19+
verbs: ["get", "list", "watch"]
20+
21+
- apiGroups: ["apps"]
22+
resources:
23+
- replicasets
24+
verbs: ["get", "list", "watch"]
25+
---
26+
{{- range $namespace, $serviceAccounts := .Values.nodeless.serviceAccounts }}
27+
{{- if not (kindIs "slice" $serviceAccounts) }}
28+
{{- fail (printf "nodeless.serviceAccounts[%s] must be a list, but got: %v (type: %s)" $namespace $serviceAccounts (kindOf $serviceAccounts)) }}
29+
{{- end }}
30+
{{- if not $serviceAccounts }}
31+
{{- fail (printf "nodeless.serviceAccounts[%s] is empty. Please provide at least one service account or remove the namespace from the map." $namespace) }}
32+
{{- end }}
33+
{{- range $serviceAccounts }}
34+
{{- if not (kindIs "string" .) }}
35+
{{- fail (printf "nodeless.serviceAccounts[%s] contains a non-string value: %v. All service account names must be strings." $namespace .) }}
36+
{{- end }}
37+
apiVersion: rbac.authorization.k8s.io/v1
38+
kind: ClusterRoleBinding
39+
metadata:
40+
name: observe-agent-nodeless-cluster-role-binding-{{ $namespace }}-{{ . }}
41+
labels:
42+
app.kubernetes.io/name: observe-agent-nodeless-cluster-role-binding
43+
app.kubernetes.io/instance: observe-agent
44+
roleRef:
45+
apiGroup: rbac.authorization.k8s.io
46+
kind: ClusterRole
47+
name: observe-agent-nodeless-cluster-role-{{ template "observe-agent.namespace" $ }}
48+
subjects:
49+
- kind: ServiceAccount
50+
name: {{ . }}
51+
namespace: {{ $namespace }}
52+
---
53+
{{- end }}
54+
{{- end }}
55+
{{- end }}
56+
{{- end }}
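
To make the nested `range` in the RBAC template above concrete, here is a sketch of the binding it would presumably render for a single entry, `nodeless.serviceAccounts: {default: ["app1-sa"]}`. It assumes the `observe-agent.namespace` helper resolves to `observe`; that value is an assumption, since the helper's definition is not part of this diff.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  # Name pattern from the template: ...-binding-<namespace>-<service account>
  name: observe-agent-nodeless-cluster-role-binding-default-app1-sa
  labels:
    app.kubernetes.io/name: observe-agent-nodeless-cluster-role-binding
    app.kubernetes.io/instance: observe-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: observe-agent-nodeless-cluster-role-observe   # assumes the namespace helper renders "observe"
subjects:
  - kind: ServiceAccount
    name: app1-sa
    namespace: default
```

One binding is emitted per (namespace, service account) pair, and all of them point at the single ClusterRole defined at the top of the template. A namespace whose value is not a list, is an empty list, or contains a non-string entry stops rendering with the corresponding `fail` message.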
