
Goroutine Leak on Thanos Receive (Ingestor&Router) #8557


Description

@xvzf

Thanos, Prometheus and Golang version used:
Thanos: v0.39.2 (judging from historical data, we've seen this behavior on previous releases as well)

Object Storage Provider:
Azure (should not be relevant)

What happened:
We see an exponential increase in active goroutines, probably linked to a mutex within the receive router service. The growth in ingestor goroutines resets whenever the router deployments are restarted, so it is most likely caused by the routers.
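
The growth itself is easy to confirm from the standard go_goroutines metric that the receive pods already export. Below is a minimal, hedged sketch using the Prometheus Go client; the Prometheus address and the job label selector are assumptions about our setup, not anything defined by Thanos. Querying this repeatedly (or graphing it) is how the climb and the reset after router restarts can be tracked.

package main

// Hedged sketch: query go_goroutines for the receive pods via the Prometheus
// HTTP API. The Prometheus address and the job label selector below are
// assumptions about the local setup.

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"}) // hypothetical Prometheus address
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Per-pod goroutine counts for the receive components (label selector is
	// an assumption about how the pods are scraped in this cluster).
	query := `go_goroutines{job=~"thanos-receive.*"}`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}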

Here's an example goroutine:

goroutine 1838951702 [sync.Cond.Wait, 24 minutes]:
sync.runtime_notifyListWait(0xc000d93d50, 0x0)
	/go/pkg/mod/golang.org/[email protected]/src/runtime/sema.go:597 +0x159
sync.(*Cond).Wait(0xc005790f98?)
	/go/pkg/mod/golang.org/[email protected]/src/sync/cond.go:71 +0x85
google.golang.org/grpc/internal/transport.(*http2Client).keepalive(0xc004116488)
	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:1710 +0x225
created by google.golang.org/grpc/internal/transport.newHTTP2Client in goroutine 1838951656
	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:399 +0x1dab
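
For context on that stack: every established gRPC client transport runs a keepalive goroutine, and it only exits once the connection is closed, so client connections that get created but never closed each leave one of these goroutines parked on a condition variable. Below is a minimal standalone sketch of that mechanism (plain grpc-go, not Thanos code; the local listener and keepalive settings are made up for illustration). Running it shows the goroutine count rising while the 50 connections are open and falling again once they are closed. This only illustrates the general mechanism; it is not a claim about where in the receive path the unclosed connections come from.

package main

// Minimal standalone sketch (not Thanos code): every established gRPC client
// transport owns a keepalive goroutine that only exits when the connection is
// closed. Creating connections without ever closing them accumulates
// goroutines parked like the one in the dump above.

import (
	"fmt"
	"net"
	"runtime"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Throwaway in-process gRPC server so the clients have something to dial.
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	go srv.Serve(lis)
	defer srv.Stop()

	before := runtime.NumGoroutine()

	conns := make([]*grpc.ClientConn, 0, 50)
	for i := 0; i < 50; i++ {
		conn, err := grpc.NewClient(lis.Addr().String(),
			grpc.WithTransportCredentials(insecure.NewCredentials()),
			// Keepalive settings are arbitrary; they just ensure the
			// keepalive goroutine is running on each transport.
			grpc.WithKeepaliveParams(keepalive.ClientParameters{Time: 10 * time.Second}),
		)
		if err != nil {
			panic(err)
		}
		conn.Connect() // force the transport (and its keepalive goroutine) to start
		conns = append(conns, conn)
	}
	time.Sleep(2 * time.Second) // give the transports time to come up

	fmt.Printf("goroutines: before=%d, with 50 open connections=%d\n", before, runtime.NumGoroutine())

	for _, c := range conns {
		c.Close() // closing the connection is what lets the keepalive goroutine exit
	}
	time.Sleep(2 * time.Second)
	fmt.Printf("goroutines after closing: %d\n", runtime.NumGoroutine())
}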

How to reproduce it (as minimally and precisely as possible):
We have not found a consistent way to reproduce it yet, though it happens fairly frequently on our Thanos installation.
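
Since there is no reliable reproducer, one way to catch the next occurrence is to periodically pull the goroutine profile from a receive pod and group goroutines by the function they were started for; the entry functions whose counts keep growing point at the leak. A rough sketch follows, assuming /debug/pprof is reachable on the receiver's HTTP port; the URL is a hypothetical local port-forward. Two runs taken some time apart make it obvious which entry functions account for the growth.

package main

// Rough sketch: fetch /debug/pprof/goroutine?debug=1 once and count goroutines
// by their entry function. Re-running this over time shows which creation
// sites are growing. The target URL is an assumption (e.g. a port-forwarded
// Thanos Receive pod exposing its HTTP port locally).

import (
	"bufio"
	"fmt"
	"net/http"
	"sort"
	"strconv"
	"strings"
)

const target = "http://localhost:10902/debug/pprof/goroutine?debug=1" // hypothetical port-forward

func main() {
	resp, err := http.Get(target)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// debug=1 groups identical stacks as "N @ 0x... 0x...", followed by
	// "#\t0xADDR\tFUNC+0xOFF\tFILE:LINE" frames from innermost to outermost;
	// the last frame of a group is the goroutine's entry function.
	byEntry := map[string]int{}
	total := 0
	n := 0
	entry := ""
	flush := func() {
		if n > 0 && entry != "" {
			byEntry[entry] += n
			total += n
		}
		n, entry = 0, ""
	}

	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			if f := strings.Fields(line); len(f) >= 3 {
				entry = f[2] // keep overwriting; the last frame of the group wins
			}
		} else if f := strings.Fields(line); len(f) >= 2 && f[1] == "@" {
			flush()
			n, _ = strconv.Atoi(f[0])
		}
	}
	flush()
	if err := sc.Err(); err != nil {
		panic(err)
	}

	type kv struct {
		entry string
		count int
	}
	sorted := make([]kv, 0, len(byEntry))
	for e, c := range byEntry {
		sorted = append(sorted, kv{e, c})
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].count > sorted[j].count })

	fmt.Printf("total goroutines: %d\n", total)
	for _, s := range sorted {
		fmt.Printf("%8d  %s\n", s.count, s.entry)
	}
}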

Full logs to relevant components:
Logs look normal; there are no warnings or errors beyond what is expected.

Anything else we need to know:

goroutines-receive-ingestor.txt
goroutines-receive-router.txt


After rotating all receive routers (ingestors have not been restarted):
(screenshot: ingestor goroutine count resets)
