Dkron agent problems connecting to cluster on larger installations. #1727

Open
@githblogin

Description

Describe the bug
We use a DigitalOcean Kubernetes cluster. The Dkron v4 server is deployed using the Dkron Helm chart.
Each application namespace has its own (or several) Dkron agents that run jobs based on tags.

All is well and working until we hit what looks like a scaling threshold (~80-85 agent pods), at which point Serf becomes unstable.
Agents can no longer fully connect to the cluster, or rather fail to set their tags. The same setup that worked when the cluster was smaller now mostly fails, although an agent left on its own can sometimes join and then works until it is restarted.

Relevant logs from an agent:

dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app1-dkron-worker-dfcd587f7-57wkn 10.244.27.220"
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere" node=app1-dkron-worker-dfcd587f7-57wkn
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Joining cluster..." cluster=LAN node=app1-dkron-worker-dfcd587f7-57wkn
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.12.71:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app2-service-dkron-worker-545cbd6958-hs7wk 10.244.7.245"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app3-dkron-worker-8578d5d74b-7b88w 10.244.6.221"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-1 10.244.8.31"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-agent-fd99df5cd-9k7fr 10.244.6.82"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-0 10.244.24.233"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-2 10.244.12.71"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-3 10.244.27.229"
...
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.24.233:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.8.31:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.27.229:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Join LAN completed. Synced with 4 initial agents" node=app1-dkron-worker-dfcd587f7-57wkn
...
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberUpdate: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:57Z" level=info msg="2025/04/09 13:31:57 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:57Z" level=info msg="2025/04/09 13:31:57 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
...
dkron Error: agent: Error setting tags: timeout waiting for update broadcast
dkron Usage:
dkron dkron agent [flags]

After this the agent goes into a crash loop and retries until it starts to work or, more likely, does not.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a Dkron cluster using the official Helm chart (we have 4 server nodes in Raft)

  2. Create an application pod with a Dkron agent - it simply runs the "dkron agent" command and connects using the following config:
    dkron.yml: |-
      server: false
      retry-join: ["dkron.dkron.svc.cluster.local"]
      encrypt: custom-hash-value
      profile: wan
      statsd-addr: "127.0.0.1:9125"
      tags:
        app1: cron

  3. Add some jobs (we have 187); simple jobs, nothing too fancy:
    {
      "id": "test_job_app",
      "name": "test_job_app",
      "displayname": "",
      "timezone": "",
      "schedule": "@every 1m",
      "owner": "Gitlab",
      "owner_email": "[email protected]",
      "disabled": false,
      "tags": {
        "app1": "cron:1"
      },
      "metadata": null,
      "retries": 0,
      "dependent_jobs": null,
      "parent_job": "",
      "processors": {
        "fluent": {
          "fluent": "",
          "project": "test_job"
        }
      },
      "concurrency": "forbid",
      "executor": "shell",
      "executor_config": {
        "allowed_exitcodes": "0, 199, 255",
        "command": "echo OK",
        "cwd": "/www/default/cron",
        "mem_limit_kb": "inf",
        "project": "test_job",
        "shell": "true",
        "timeout": "1m"
      },
      "status": "failed",
      "next": "2025-04-09T09:30:26Z",
      "ephemeral": false,
      "expires_at": null
    },

  4. Then start adding more agents (we have ~85 agent pods); at some point new agents become unstable and refuse to connect.
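As a side note, jobs like the one in step 3 can also be created through Dkron's REST API (POST /v1/jobs on the server's HTTP port, 8080 by default). A minimal Python sketch; the in-cluster service URL and the trimmed-down job body are assumptions, adjust them to your release and namespace:

```python
import json
import urllib.request

# Hypothetical in-cluster service URL; adjust to your Helm release/namespace.
DKRON_URL = "http://dkron.dkron.svc.cluster.local:8080/v1/jobs"

# Trimmed-down version of the job body from step 3.
job = {
    "name": "test_job_app",
    "schedule": "@every 1m",
    "tags": {"app1": "cron:1"},
    "concurrency": "forbid",
    "executor": "shell",
    "executor_config": {"command": "echo OK", "shell": "true", "timeout": "1m"},
}

req = urllib.request.Request(
    DKRON_URL,
    data=json.dumps(job).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually submit the job
```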

Expected behavior
I expect agents to continue connecting and taking jobs as before.

Specifications:

  • OS: alpine:latest
  • Version: dkron server 4.0.4; agent latest Devel
  • Executor: shell

Additional context
We also have an older Dkron v3 deployment (version 3.1.10) running in parallel, so for this new installation we added the encrypt setting to prevent the clusters from merging.

I suspect that Serf simply takes longer to broadcast changes than the timeout allows, which is why everything becomes unstable and only works sometimes, while working well at smaller scale. I hope you can advise something; I have tried changing the available timeout-related settings such as serf-reconnect-timeout and raft-multiplier, which did not lead to any noticeable change.
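To sanity-check that suspicion, here is a rough back-of-envelope comparison of idealized gossip convergence time against Serf's broadcast timeout. The numbers are the hashicorp library defaults as I understand them (serf BroadcastTimeout 5 s; memberlist wan profile: 500 ms gossip interval, fanout 4; lan profile: 200 ms, fanout 3) and should be treated as assumptions, not verified values:

```python
import math

# Idealized epidemic broadcast: each gossip round the informed set
# grows by roughly a factor of (fanout + 1).
def rounds_to_reach(n_members: int, fanout: int) -> int:
    return math.ceil(math.log(n_members, fanout + 1))

def convergence_seconds(n_members: int, fanout: int, interval_s: float) -> float:
    return rounds_to_reach(n_members, fanout) * interval_s

BROADCAST_TIMEOUT_S = 5.0  # serf default (assumption; check your version)

# (profile, fanout a.k.a. GossipNodes, GossipInterval in seconds) - assumed defaults
for profile, fanout, interval in [("wan", 4, 0.5), ("lan", 3, 0.2)]:
    t = convergence_seconds(90, fanout, interval)
    print(f"{profile}: ~{t:.1f}s ideal convergence for ~90 members "
          f"(broadcast timeout {BROADCAST_TIMEOUT_S}s)")
```

Even the slower wan profile converges in roughly 1.5 s under this idealized model, well inside the 5 s timeout, which would hint that the timeouts at ~85 pods come from lost or retransmitted gossip packets and broadcast-queue backlog rather than raw propagation time. This is only a sketch of the reasoning, not a diagnosis.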
