Bug: Persistent ghost instance in the member ring #13810

@lennarkivimae

What is the bug?

Under conditions we have not been able to pin down, one of the instances in the member ring became "corrupted". It ended up with an invalid timestamp of 0001-01-01T00:00:00Z, and we were unable to remove it from the ring. New pods failed to join the ring because an instance was already in place, which resulted in a pod crash loop.

How to reproduce it?

Unfortunately, we don't know exactly what caused this. Our distributors and ingesters were OOM-killed because of a lack of resources, and some of the instances were unhealthy in the ring. After we scaled everything up, one of the ingesters started failing. It was a random pod among all of the pods.

What did you think would happen?

The instance should have disappeared from the ring when the pod was removed, or when we pressed "forget" in the UI.

What was your environment?

Kubernetes - v1.32.9-eks-3025e55
ArgoCD - v2.13.4
Helm - v3.17.4
Mimir - v2.16.0
Consul - v1.21.3

Chart: mimir-distributed - v5.7.0

Any additional context to share?

We tried the following to get the instance out of the ring:

  • Pressed "forget" in the ring UI
  • Batch-sent "forget" to all instances and availability zones
  • Drained the ingesters, distributors, store-gateways, and queriers (we initially thought these were part of the ring and might be putting the instance back, causing the issue)
  • Eventually drained everything in the mimir namespace, along with Consul
    • After scaling everything back up, things returned to normal

Logs:

ts=2025-12-10T13:43:44.844666756Z caller=experimental.go:25 level=warn msg="experimental feature in use" feature=ruler.tenant-federation
ts=2025-12-10T13:43:44.844845679Z caller=main.go:225 level=info msg="Starting application" version="(version=2.16.0, branch=HEAD, revision=b4f36da)"
ts=2025-12-10T13:43:44.848668752Z caller=server.go:368 level=info msg="server listening on addresses" http=[::]:8080 grpc=[::]:9095
ts=2025-12-10T13:43:44.860257598Z caller=memberlist_client.go:463 level=info msg="Using memberlist cluster label and node name" cluster_label=mimir-monitoring node=mimir-monitoring-ingester-eu-west-1c-14-521d78ed
ts=2025-12-10T13:43:44.86138556Z caller=ingester.go:445 level=info msg="TSDB idle compaction timeout set" timeout=1h14m11.259541951s
ts=2025-12-10T13:43:44.862067601Z caller=module_service.go:82 level=info msg=starting module=sanity-check
ts=2025-12-10T13:43:44.862120122Z caller=sanity_check.go:32 level=info msg="Checking directories read/write access"
ts=2025-12-10T13:43:44.862117798Z caller=module_service.go:82 level=info msg=starting module=active-groups-cleanup-service
ts=2025-12-10T13:43:44.862136176Z caller=module_service.go:82 level=info msg=starting module=usage-stats
ts=2025-12-10T13:43:44.862153335Z caller=module_service.go:82 level=info msg=starting module=activity-tracker
ts=2025-12-10T13:43:44.862251028Z caller=sanity_check.go:37 level=info msg="Directories read/write access successfully checked"
ts=2025-12-10T13:43:44.862261514Z caller=sanity_check.go:39 level=info msg="Checking object storage config"
ts=2025-12-10T13:43:44.865358868Z caller=memberlist_client.go:594 level=info msg="memberlist fast-join starting" nodes_found=135 to_join=12
ts=2025-12-10T13:43:44.931652122Z caller=sanity_check.go:44 level=info msg="Object storage config successfully checked"
ts=2025-12-10T13:43:44.93169839Z caller=module_service.go:82 level=info msg=starting module=server
ts=2025-12-10T13:43:44.931816041Z caller=module_service.go:82 level=info msg=starting module=memberlist-kv
ts=2025-12-10T13:43:44.931812433Z caller=module_service.go:82 level=info msg=starting module=runtime-config
ts=2025-12-10T13:43:44.931984065Z caller=module_service.go:82 level=info msg=starting module=ingester-ring
ts=2025-12-10T13:43:44.938946398Z caller=reporter.go:147 level=info msg="usage stats reporter initialized" cluster_id=68bb18dc-9a2a-4a07-97b6-af22b1f06ea2
ts=2025-12-10T13:43:44.951982674Z caller=memberlist_client.go:614 level=info msg="memberlist fast-join finished" joined_nodes=12 elapsed_time=90.629797ms
ts=2025-12-10T13:43:44.952032757Z caller=memberlist_client.go:626 level=info phase=startup msg="joining memberlist cluster" join_members=dns+mimir-monitoring-gossip-ring.mimir.svc.cluster.local.:7946
ts=2025-12-10T13:43:44.955998383Z caller=module_service.go:82 level=info msg=starting module=ingester-service
ts=2025-12-10T13:43:44.956060825Z caller=ingester.go:2924 level=info msg="opening existing TSDBs"
ts=2025-12-10T13:43:44.956249299Z caller=mimir.go:941 level=info msg="Application started"
ts=2025-12-10T13:43:44.956408015Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:45.947717475Z caller=memberlist_client.go:633 level=info phase=startup msg="joining memberlist cluster succeeded" reached_nodes=135 elapsed_time=995.63153ms
ts=2025-12-10T13:43:45.956982197Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:46.957618602Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:47.958693636Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:48.959966319Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:49.960549206Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:50.961162979Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:51.961847052Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:52.962739582Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:53.963710896Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:53.964119535Z caller=ingester.go:699 level=warn msg="failed to stop ingester lifecycler" err="failed to join the ring ingester: failed to CAS-update key collectors/ring: no change detected"
ts=2025-12-10T13:43:53.964146293Z caller=ingester.go:705 level=warn msg="failed to remove shutdown marker" path=/data/tsdb/shutdown-requested.txt err="open /data/tsdb: no such file or directory"
ts=2025-12-10T13:43:53.964171813Z caller=module_service.go:118 level=warn msg="module failed with error" module=ingester-service err="ingester subservice failed: service ingester ring lifecycler failed: failed to join the ring ingester: failed to CAS-update key collectors/ring: no change detected"
ts=2025-12-10T13:43:53.964197777Z caller=mimir.go:958 level=error msg="module failed" module=ingester-service err="ingester subservice failed: service ingester ring lifecycler failed: failed to join the ring ingester: failed to CAS-update key collectors/ring: no change detected"
ts=2025-12-10T13:43:53.964240608Z caller=module_service.go:120 level=info msg="module stopped" module=ingester-ring
ts=2025-12-10T13:43:53.964259198Z caller=memberlist_client.go:773 level=info msg="leaving memberlist cluster"
ts=2025-12-10T13:43:53.964265234Z caller=module_service.go:120 level=info msg="module stopped" module=active-groups-cleanup-service
ts=2025-12-10T13:43:53.964298465Z caller=module_service.go:120 level=info msg="module stopped" module=runtime-config
ts=2025-12-10T13:43:54.662455355Z caller=module_service.go:120 level=info msg="module stopped" module=memberlist-kv
ts=2025-12-10T13:43:54.662718976Z caller=server_service.go:55 level=info msg="server stopped"
ts=2025-12-10T13:43:54.662738683Z caller=module_service.go:120 level=info msg="module stopped" module=server
ts=2025-12-10T13:43:54.662763296Z caller=module_service.go:120 level=info msg="module stopped" module=sanity-check
ts=2025-12-10T13:43:54.662785005Z caller=module_service.go:120 level=info msg="module stopped" module=usage-stats
ts=2025-12-10T13:43:54.66370698Z caller=module_service.go:120 level=info msg="module stopped" module=activity-tracker
ts=2025-12-10T13:43:54.663730914Z caller=mimir.go:945 level=info msg="Application stopped"
ts=2025-12-10T13:43:54.66378564Z caller=log.go:134 level=error msg="error running application" err="failed services\ngithub.com/grafana/mimir/pkg/mimir.(*Mimir).Run\n\t/__w/mimir/mimir/pkg/mimir/mimir.go:1001\nmain.main\n\t/__w/mimir/mimir/cmd/mimir/main.go:227\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1223"
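
In the logs above, the lifecycler reports "existing instance found in ring" ten times, about one second apart, before the join fails with "no change detected". When comparing several crash-looping pods, a small triage script can tally that pattern. The following is a rough sketch, not part of Mimir: `parseLine`, `summarize`, and the `summary` type are hypothetical names, and the regular expression only approximates logfmt.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// field matches key=value pairs whose value is double-quoted or bare.
var field = regexp.MustCompile(`(\w+)=("(?:[^"\\]|\\.)*"|\S+)`)

// parseLine extracts logfmt-style fields from a single log line.
func parseLine(line string) map[string]string {
	out := map[string]string{}
	for _, m := range field.FindAllStringSubmatch(line, -1) {
		v := m[2]
		if len(v) >= 2 && strings.HasPrefix(v, `"`) {
			v = v[1 : len(v)-1] // drop the surrounding quotes
		}
		out[m[1]] = v
	}
	return out
}

// summary holds the two signals of interest for this failure mode.
type summary struct {
	ExistingInstanceChecks int      // "existing instance found in ring" lines
	Errors                 []string // err values of level=error lines
}

// summarize scans a log dump line by line and fills in a summary.
func summarize(logs string) summary {
	var s summary
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		f := parseLine(sc.Text())
		if f["msg"] == "existing instance found in ring" {
			s.ExistingInstanceChecks++
		}
		if f["level"] == "error" && f["err"] != "" {
			s.Errors = append(s.Errors, f["err"])
		}
	}
	return s
}

func main() {
	// Three lines taken from the logs in this issue.
	sample := `ts=2025-12-10T13:43:44.956408015Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:45.956982197Z caller=lifecycler.go:730 level=info msg="existing instance found in ring" state=ACTIVE tokens=512 ring=ingester readOnly=false readOnlyStateUpdate=0001-01-01T00:00:00Z
ts=2025-12-10T13:43:53.964197777Z caller=mimir.go:958 level=error msg="module failed" module=ingester-service err="ingester subservice failed: service ingester ring lifecycler failed: failed to join the ring ingester: failed to CAS-update key collectors/ring: no change detected"`

	s := summarize(sample)
	fmt.Println("existing-instance checks:", s.ExistingInstanceChecks)
	for _, e := range s.Errors {
		fmt.Println("error:", e)
	}
}
```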
