Description
Describe the bug
We use a DigitalOcean Kubernetes cluster. The Dkron v4 server is deployed using the dkron helm chart.
Each application namespace runs its own dkron agent (or several) to execute jobs selected by tags.
Everything works well until we hit what looks like a scaling threshold (~80-85 agent pods), at which point serf becomes unstable.
Agents can no longer fully join the cluster, or rather fail to set their tags. The same setup that worked when the cluster was smaller now mostly fails, although if left alone an agent sometimes manages to join and keeps working until it is restarted.
Relevant logs from an agent:
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app1-dkron-worker-dfcd587f7-57wkn 10.244.27.220"
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere" node=app1-dkron-worker-dfcd587f7-57wkn
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Joining cluster..." cluster=LAN node=app1-dkron-worker-dfcd587f7-57wkn
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.12.71:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app2-service-dkron-worker-545cbd6958-hs7wk 10.244.7.245"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: app3-dkron-worker-8578d5d74b-7b88w 10.244.6.221"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-1 10.244.8.31"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-agent-fd99df5cd-9k7fr 10.244.6.82"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-0 10.244.24.233"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-2 10.244.12.71"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberJoin: dkron4-server-3 10.244.27.229"
...
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.24.233:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.8.31:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] memberlist: Initiating push/pull sync with: 10.244.27.229:8946"
dkron time="2025-04-09T13:31:56Z" level=info msg="agent: Join LAN completed. Synced with 4 initial agents" node=app1-dkron-worker-dfcd587f7-57wkn
...
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [INFO] serf: EventMemberUpdate: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:56Z" level=info msg="2025/04/09 13:31:56 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:57Z" level=info msg="2025/04/09 13:31:57 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
dkron time="2025-04-09T13:31:57Z" level=info msg="2025/04/09 13:31:57 [DEBUG] serf: messageJoinType: app1-dkron-worker-dfcd587f7-57wkn"
...
dkron Error: agent: Error setting tags: timeout waiting for update broadcast
dkron Usage:
dkron dkron agent [flags]
After this the pod goes into a crash loop and keeps retrying until it eventually joins, or, more often, does not.
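For context on where that error seems to come from: my reading of the upstream libraries (hashicorp/serf and hashicorp/memberlist, not dkron itself) is that setting tags re-advertises the local node and then waits a bounded time for the update to be gossiped out; if the wait expires, the "timeout waiting for update broadcast" error above is returned. A minimal Go sketch of that wait pattern, with illustrative names rather than the real library identifiers:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForBroadcast illustrates the pattern I believe produces the error:
// the node update is queued for gossip, and the caller blocks until either
// the broadcast is confirmed (notifyCh is closed) or the timeout fires.
// Names are illustrative; the real logic lives in hashicorp/memberlist.
func waitForBroadcast(notifyCh <-chan struct{}, timeout time.Duration) error {
	var timeoutCh <-chan time.Time
	if timeout > 0 {
		timeoutCh = time.After(timeout)
	}
	select {
	case <-notifyCh:
		return nil // update was broadcast in time
	case <-timeoutCh:
		// With ~85 agents gossiping, I suspect this branch is hit because
		// the broadcast queue cannot drain within the allowed window.
		return errors.New("timeout waiting for update broadcast")
	}
}

func main() {
	notifyCh := make(chan struct{}) // never closed: simulates a slow broadcast
	if err := waitForBroadcast(notifyCh, 2*time.Second); err != nil {
		fmt.Println("agent: Error setting tags:", err)
	}
}
```

If that is roughly right, a fixed timeout that is comfortable for a handful of agents could easily be too short once ~85 nodes are gossiping.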
To Reproduce
Steps to reproduce the behavior:
- Set up the dkron cluster using the official helm chart (we have 4 server nodes in raft).
- Create an application pod with a dkron agent. It simply runs the "dkron agent" command and connects using the following config:
dkron.yml: |-
  server: false
  retry-join: ["dkron.dkron.svc.cluster.local"]
  encrypt: custom-hash-value
  profile: wan
  statsd-addr: "127.0.0.1:9125"
  tags:
    app1: cron
- Add some jobs (we have 187); they are simple jobs, nothing too fancy. An example job definition (a sketch for registering such a job via the REST API follows this list):
{
  "id": "test_job_app",
  "name": "test_job_app",
  "displayname": "",
  "timezone": "",
  "schedule": "@every 1m",
  "owner": "Gitlab",
  "owner_email": "[email protected]",
  "disabled": false,
  "tags": {
    "app1": "cron:1"
  },
  "metadata": null,
  "retries": 0,
  "dependent_jobs": null,
  "parent_job": "",
  "processors": {
    "fluent": {
      "fluent": "",
      "project": "test_job"
    }
  },
  "concurrency": "forbid",
  "executor": "shell",
  "executor_config": {
    "allowed_exitcodes": "0, 199, 255",
    "command": "echo OK",
    "cwd": "/www/default/cron",
    "mem_limit_kb": "inf",
    "project": "test_job",
    "shell": "true",
    "timeout": "1m"
  },
  "status": "failed",
  "next": "2025-04-09T09:30:26Z",
  "ephemeral": false,
  "expires_at": null
}
- Then keep adding more agents (we have ~85 agent pods) until, at some point, new agents become unstable and refuse to connect.
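For reference, a minimal Go sketch of registering such a job programmatically against the dkron REST API (POST /v1/jobs). The service URL and port below are assumptions for our cluster; adjust as needed:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical in-cluster service URL; /v1/jobs is the documented
	// dkron REST endpoint for creating or updating a job.
	url := "http://dkron.dkron.svc.cluster.local:8080/v1/jobs"

	// Trimmed-down job definition matching the example above.
	job := []byte(`{
		"name": "test_job_app",
		"schedule": "@every 1m",
		"tags": {"app1": "cron:1"},
		"concurrency": "forbid",
		"executor": "shell",
		"executor_config": {"command": "echo OK", "shell": "true", "timeout": "1m"}
	}`)

	resp, err := http.Post(url, "application/json", bytes.NewReader(job))
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("dkron responded with:", resp.Status)
}
```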
Expected behavior
I expect agents to continue connecting and taking jobs as before.
Specifications:
- OS: alpine:latest
- Version: dkron server 4.0.4; agent latest Devel
- Executor: shell
Additional context
We also have an older dkron v3 deployment (version 3.1.10) running in parallel, so for this new deployment we added the encrypt setting to prevent the two clusters from merging.
I suspect that at this scale serf simply takes longer to broadcast tag changes than the timeout allows, which is why everything becomes unstable, only works intermittently, and runs fine at smaller scale. I hope you can advise something; I have tried changing the available timeout-related settings such as serf-reconnect-timeout and raft-multiplier, but that did not lead to any noticeable change.
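If that suspicion is right, the relevant knob would be serf's broadcast timeout rather than the reconnect or raft settings. As far as I can tell dkron does not expose it as a flag (I may be missing one), but in the serf library it is a plain config field. A minimal Go sketch at the library level, purely to show which setting I mean (the 30s value is just an illustration):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/serf/serf"
)

func main() {
	// serf.DefaultConfig() ships with a BroadcastTimeout of a few seconds;
	// that is the window an operation such as a tag update has to be
	// broadcast before it fails with "timeout waiting for update broadcast".
	conf := serf.DefaultConfig()
	fmt.Println("default broadcast timeout:", conf.BroadcastTimeout)

	// Raising it (illustrative value) would give a large cluster more time
	// to gossip tag updates. Whether dkron can be configured this way, or
	// should change its default, is exactly my question here.
	conf.BroadcastTimeout = 30 * time.Second
	fmt.Println("proposed broadcast timeout:", conf.BroadcastTimeout)
}
```

Whether raising this (or related gossip tuning) would actually help at ~85 agents is the part I cannot verify myself.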