
feat: hpa with cpu + mem util scaling options #628

Open
burnjake wants to merge 1 commit into influxdata:master from burnjake:burnjake/telegraf-hpa

Conversation

@burnjake

@burnjake burnjake commented Mar 7, 2024

  • CHANGELOG.md updated - n/a?
  • Rebased/mergable
  • Tests pass (see comment below)
  • Sign CLA (if not already signed)

We would like to scale the number of replicas based on usage, which is currently a slight pain: if we roll our own HPA resource, we have to leave the deployment.spec.replicas field unset ourselves. There's also a pre-existing issue: #624.
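For context, the usual Helm pattern here (and presumably what this PR does in templates/deployment.yaml; a sketch, not the actual diff) is to render spec.replicas only when autoscaling is disabled, so the HPA alone owns the replica count. The .Values.replicaCount key is an assumed name:

```yaml
# Sketch only: omit spec.replicas when an HPA manages scaling,
# so the HPA and the Deployment don't fight over the count.
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
```

This matches the rendered output below: with autoscaling enabled the Deployment has no replicas field, and with it disabled, replicas: 1 is emitted.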

$ helm version
version.BuildInfo{Version:"v3.14.2", GitCommit:"c309b6f0ff63856811846ce18f3bdc93d2b4d54b", GitTreeState:"clean", GoVersion:"go1.22.0"}

Setting autoscaling.enabled: true templates the following Deployment and HPA resources:

$ cat values.yaml | grep autoscaling -A10
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  behavior: {}

$ helm template ./ -s templates/deployment.yaml
---
# Source: telegraf/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: telegraf
      app.kubernetes.io/instance: release-name
  template:
    metadata:
      labels:
        app.kubernetes.io/name: telegraf
        app.kubernetes.io/instance: release-name
      annotations:
        checksum/config: 11e7bc3db613c177911535018f65051a22f67ef0cf419dc2f19448d2a629282f
    spec:
      serviceAccountName: release-name-telegraf
      containers:
      - name: telegraf
        image: "docker.io/library/telegraf:1.29-alpine"
        imagePullPolicy: "IfNotPresent"
        resources:
          {}
        env:
        - name: HOSTNAME
          value: telegraf-polling-service
        volumeMounts:
        - name: config
          mountPath: /etc/telegraf
      volumes:
      - name: config
        configMap:
          name: release-name-telegraf

$ helm template ./ -s templates/horizontalpodautoscaler.yaml
---
# Source: telegraf/templates/horizontalpodautoscaler.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: release-name-telegraf
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

Setting autoscaling.enabled: false templates the following Deployment resource:

$ cat values.yaml | grep autoscaling -A10
autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  behavior: {}

$ helm template ./ -s templates/deployment.yaml
---
# Source: telegraf/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: telegraf
      app.kubernetes.io/instance: release-name
  template:
    metadata:
      labels:
        app.kubernetes.io/name: telegraf
        app.kubernetes.io/instance: release-name
      annotations:
        checksum/config: 11e7bc3db613c177911535018f65051a22f67ef0cf419dc2f19448d2a629282f
    spec:
      serviceAccountName: release-name-telegraf
      containers:
      - name: telegraf
        image: "docker.io/library/telegraf:1.29-alpine"
        imagePullPolicy: "IfNotPresent"
        resources:
          {}
        env:
        - name: HOSTNAME
          value: telegraf-polling-service
        volumeMounts:
        - name: config
          mountPath: /etc/telegraf
      volumes:
      - name: config
        configMap:
          name: release-name-telegraf

$ helm template ./ -s templates/horizontalpodautoscaler.yaml
Error: could not find template templates/horizontalpodautoscaler.yaml in chart
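The "could not find template" error is expected here: presumably the entire HPA manifest is wrapped in a guard on autoscaling.enabled, so the file renders to nothing when disabled. A minimal sketch of that shape (an assumption, not the actual chart source; the telegraf.fullname helper name is also assumed):

```yaml
# templates/horizontalpodautoscaler.yaml -- sketch only.
# When the guard is false the file renders empty, hence
# `helm template -s` reports "could not find template".
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "telegraf.fullname" . }}
spec:
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
{{- end }}
```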

An example with behaviour:

$ cat values.yaml | grep autoscaling -A20
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  behavior:
    scaleDown:
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
      - type: Percent
        value: 10
        periodSeconds: 60

$ helm template ./ -s templates/horizontalpodautoscaler.yaml
---
# Source: telegraf/templates/horizontalpodautoscaler.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: release-name-telegraf
  minReplicas: 1
  maxReplicas: 5
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 4
      - periodSeconds: 60
        type: Percent
        value: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
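Note that the policy keys in the rendered behavior block come out reordered (periodSeconds first), which is consistent with the values map being passed straight through toYaml, since it sorts map keys alphabetically. A hedged sketch of how the template might render it:

```yaml
# Sketch only: pass-through rendering of the behavior map.
# `with` skips the block entirely when behavior is empty ({}).
{{- with .Values.autoscaling.behavior }}
  behavior:
    {{- toYaml . | nindent 4 }}
{{- end }}
```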

        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
Contributor


ok so I understand that the autoscaler will launch additional telegraf nodes if you get above a certain memory and CPU usage, but what ensures that the first pod gets reduced usage? Is there a load balancer or some other proxy in front that would round robin the usage?

Trying to understand the full use-case and how a user would take advantage of this without needing to make modifications to their config. Thanks!

Author

@burnjake burnjake Apr 5, 2024


Hi! Apologies I've been away for a few days. So our use case is to utilise the opentelemetry input, aggregate with basicstats and output with the prometheusclient. We have a traffic pattern where the number of connections varies quite a lot within the day, so varying our replica count is prudent.

As the opentelemetry input expects connections via gRPC, we can't depend on normal load balancing via a k8s service; instead we need to rely on an external LB, which we've plumbed into the ingress of the cluster. It will discover the new replicas and spread the traffic across them (update its connection pool, I think?). In short, we don't need extra configuration within telegraf for this to work, but our use case is indeed very specific!

