Describe the bug
After upgrading to Loki 3.3.2 (Helm chart 6.25.1), the number of open file descriptors in the backend pod (compactor) continuously increases over time.
lsof shows thousands of open REG files located under /var/loki/tsdb-shipper-cache, mostly temporary .tsdb files created by the compactor.
The descriptors are never released until the pod is restarted. Below is a summary of the open file descriptors and paths; the REG count grew from 618 to 10,540 in roughly 24 hours.
Container: 8a5779c3db2fe Pod/Name: loki-backend-0
PID: 1992412
Types summary (count TYPE):
10540 REG
8 sock
2 FIFO
2 DIR
2 a_inode
1 TYPE
1 CHR
Top paths/files (top 20):
8 protocol:
6 (stat:
2 pipe
2 /
1 /var/loki/tsdb-shipper-cache/loki_index_20377/self-monitoring/1760583858916619197-compactor-1760568336175-1760582766176-fd12713.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20376/self-monitoring/1760576660070618371-compactor-1760479384685-1760575547177-5a7c1c53.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20376/reluna-monitoring/1760590459899468451-compactor-1760479200014-1760589218042-74d0f7c7.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20375/self-monitoring/1760487486505869949-compactor-1760399979685-1760486593685-1a141f20.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20375/reluna-monitoring/1760589262123649326-compactor-1760365900365-1760575838926-2b470474.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20374/self-monitoring/1760408286299790462-compactor-1760308337011-1760407211686-fd8b884a.tsdb
1 /usr/bin/loki
1 NAME
1 [eventpoll]
1 [eventfd]
1 /dev/null
Top loki-related paths (wal|chunk|loki):
1 /var/loki/tsdb-shipper-cache/loki_index_20377/self-monitoring/1760583858916619197-compactor-1760568336175-1760582766176-fd12713.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20376/self-monitoring/1760576660070618371-compactor-1760479384685-1760575547177-5a7c1c53.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20376/reluna-monitoring/1760590459899468451-compactor-1760479200014-1760589218042-74d0f7c7.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20375/self-monitoring/1760487486505869949-compactor-1760399979685-1760486593685-1a141f20.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20375/reluna-monitoring/1760589262123649326-compactor-1760365900365-1760575838926-2b470474.tsdb
1 /var/loki/tsdb-shipper-cache/loki_index_20374/self-monitoring/1760408286299790462-compactor-1760308337011-1760407211686-fd8b884a.tsdb
1 /usr/bin/loki
TCP sockets count: 6
Total open fds (from /proc): 19
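For reference, a minimal sketch of how the REG growth was sampled over time (the PID is the compactor process from this report, obtained via crictl inspect as in the reproduction steps below; the 10-minute interval is arbitrary):

PID=1992412   # compactor process in loki-backend-0
while true; do
  # timestamp plus the number of lsof rows whose TYPE column ($5) is REG
  echo "$(date -Is) $(sudo lsof -p "$PID" 2>/dev/null | awk '$5 == "REG"' | wc -l)"
  sleep 600
done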
To Reproduce
Steps to reproduce the behavior:
- Deploy Loki 3.3.2 using Helm chart 6.25.1 in SimpleScalable mode.
compactor:
  enabled: true
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
loki:
  server:
    http_server_idle_timeout: 60s
  commonConfig:
    replication_factor: 1
  limits_config:
    reject_old_samples: true
    reject_old_samples_max_age: 2190h
    retention_period: 2190h
    max_cache_freshness_per_query: 10m
    split_queries_by_interval: 15m
    query_timeout: 300s
    volume_enabled: true
    ingestion_rate_mb: 5
    ingestion_burst_size_mb: 10
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  storage:
    bucketNames:
      chunks: loki-12345
      ruler: loki-12345
      admin: loki-12345
    type: s3
    s3:
      s3: s3://loki-12345
      endpoint: null
      region: eu-central-1
      secretAccessKey: <secretAccessKey>
      accessKeyId: <accessKeyId>
      signatureVersion: null
      s3ForcePathStyle: false
      insecure: false
      http_config: {}
      backoff_config: {}
      disable_dualstack: false
    filesystem:
      chunks_directory: /var/loki/chunks
      rules_directory: /var/loki/rules
      admin_api_directory: /var/loki/admin
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    max_concurrent: 4
deploymentMode: SimpleScalable
auth_enabled: true
monitoring:
  serviceMonitor:
    enabled: true
    labels:
      release: prometheus-operator
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
backend:
  replicas: 1
  persistence:
    volumeClaimsEnabled: true
    enableStatefulSetAutoDeletePVC: true
    size: 10Gi
    storageClass: openebs-hostpath
read:
  replicas: 1
  persistence:
    enableStatefulSetAutoDeletePVC: true
    size: 10Gi
    storageClass: openebs-hostpath
write:
  replicas: 1
  persistence:
    volumeClaimsEnabled: true
    enableStatefulSetAutoDeletePVC: false
    size: 10Gi
    storageClass: openebs-hostpath
chunksCache:
  enabled: false
nodeSelector:
  eks.amazonaws.com/nodegroup: svc
minio:
  enabled: false
singleBinary:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
- Use the tsdb index store (tsdb shipper) with the compactor enabled, as configured above.
- Wait a few hours while the compactor runs periodically.
- Run:
sudo crictl ps | grep loki
sudo crictl inspect <container-hash> | grep pid
sudo lsof -p <pid> 2>/dev/null | awk '{print $5}' | sort | uniq -c | sort -nr
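Note that lsof also reports memory-mapped files (FD column "mem") with TYPE REG, while /proc/<pid>/fd contains only real descriptors, which may be why /proc reports 19 fds while lsof reports 10,540 REG rows. A small sketch for comparing the two views and breaking the REG rows down by path (same <pid> as above):

# real file descriptors only
sudo ls /proc/<pid>/fd | wc -l
# all lsof REG rows (open files plus memory mappings)
sudo lsof -p <pid> 2>/dev/null | awk '$5 == "REG"' | wc -l
# top .tsdb paths under the shipper cache
sudo lsof -p <pid> 2>/dev/null | awk '$5 == "REG" && $NF ~ /tsdb-shipper-cache/ {print $NF}' | sort | uniq -c | sort -nr | head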
Expected behavior
The compactor should close .tsdb file descriptors after it finishes compaction and shipping tasks.
The number of open files should remain stable over time.
Environment:
- Infrastructure: Kubernetes (bare-metal Ubuntu 22.04)
- Deployment tool: Helm (chart 6.25.1 via ArgoCD + AVP)
- Loki version: 3.3.2
- Storage: AWS S3 (tsdb shipper index)
- Compactor: Enabled, running in loki-backend-0
Screenshots, Promtail config, or terminal output
After restarting the loki-backend-0 pod, the open files are released, but the accumulation then starts again.
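The accumulation can also be tracked without lsof via the standard Go process metrics that Loki exposes on /metrics (port 3100 is the chart default; the namespace is a placeholder):

kubectl -n <namespace> port-forward pod/loki-backend-0 3100:3100 &
sleep 2
# if descriptors are genuinely leaking, process_open_fds should climb over time
curl -s http://localhost:3100/metrics | grep -E '^process_(open|max)_fds'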

Promtail configuration:
source:
  repoURL: "https://grafana.github.io/helm-charts"
  chart: promtail
  targetRevision: "6.16.6"

values.yaml:

serviceMonitor:
  enabled: true
  annotations:
    team: logging
  labels:
    release: prometheus-operator
  interval: 30s
  scrapeTimeout: 15s
  relabelings:
    - action: replace
      target_label: cluster
      replacement: "<region>"
prometheusRule:
  enabled: true
  rules:
    - alert: PromtailRequestErrors
      expr: 100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 10
      for: 5m
      labels:
        severity: critical
      annotations:
        description: |
          The {{ $labels.job }} {{ $labels.route }} is experiencing
          {{ printf "%.2f" $value }} errors.
          VALUE = {{ $value }}
          LABELS = {{ $labels }}
        summary: Promtail request errors (instance {{ $labels.instance }})
    - alert: PromtailRequestLatency
      expr: histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Promtail request latency (instance {{ $labels.instance }})
        description: |
          The {{ $labels.job }} {{ $labels.route }} is experiencing
          {{ printf "%.2f" $value }}s 99th percentile latency.
          VALUE = {{ $value }}
          LABELS = {{ $labels }}
config:
  clients:
    - url: http://loki-gateway/loki/api/v1/push
      tenant_id: "<tenant_id>"
  snippets:
    extraRelabelConfigs:
      - action: replace
        target_label: cluster
        replacement: "<value>"
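As a sanity check that the client block above is shipping logs to loki-gateway, Promtail's own metrics can be queried (assuming the chart's default listen port 3101; the pod name and namespace are placeholders):

kubectl -n <namespace> port-forward pod/<promtail-pod> 3101:3101 &
sleep 2
# non-zero, increasing counters indicate entries are being pushed to Loki
curl -s http://localhost:3101/metrics | grep -E '^promtail_sent_entries_total'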
I really hope I am just missing something or have misconfigured values.yaml. In any case, I would really appreciate some practical advice from a colleague.