Description
Describe the bug
When the service is killed by the OS due to OOM, systemd automatically restarts it.
Then the machine's memory consumption steadily increases for 8-9 days until the next OOM.
Logs
I haven't noticed anything particularly unusual in the logs. The OOM event appears in the system logs (dmesg, etc.).
I'll be happy to provide specific grep output/messages on request; otherwise the log is huge.
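For reference, this is roughly how I pull the OOM events out of the kernel log (exact message text varies by kernel version):
$ dmesg -T | grep -i -E 'out of memory|oom-killer'
$ journalctl -k -b | grep -i -E 'out of memory|oom'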
Go-carbon Configuration:
go-carbon.conf:
[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
max-cpu = 4
metric-interval = "1m0s"
[whisper]
data-dir = "/data/graphite/whisper/"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
quotas-file = ""
workers = 4
max-updates-per-second = 0
sparse-create = false
physical-size-factor = 0.75
flock = true
compressed = false
enabled = true
hash-filenames = true
remove-empty-file = false
online-migration = false
online-migration-rate = 5
online-migration-global-scope = ""
[cache]
max-size = 100000000
write-strategy = "max"
[udp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0
[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0
compression = ""
[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0
[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"
[grpc]
listen = "127.0.0.1:7003"
enabled = true
[tags]
enabled = false
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
local-dir = "/data/graphite/tagging/"
tagdb-timeout = "1s"
[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
query-cache-enabled = true
streaming-query-cache-enabled = false
query-cache-size-mb = 0
find-cache-enabled = true
buckets = 100
max-globs = 1000
fail-on-max-globs = false
empty-result-ok = true
do-not-log-404s = false
metrics-as-counters = false
trigram-index = true
internal-stats-dir = ""
cache-scan = false
max-metrics-globbed = 1000000000
max-metrics-rendered = 100000000
trie-index = false
concurrent-index = false
realtime-index = 0
file-list-cache = ""
file-list-cache-version = 1
max-creates-per-second = 0
no-service-when-index-is-not-ready = false
max-inflight-requests = 0
render-trace-logging-enabled = false
[carbonserver.grpc]
listen = ""
enabled = false
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "5m0s"
quota-usage-report-frequency = "1m0s"
[dump]
enabled = false
path = "/var/lib/graphite/dump/"
restore-per-second = 0
[pprof]
listen = "127.0.0.1:7007"
enabled = false
[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
sample-tick = ""
sample-initial = 0
sample-thereafter = 0
[prometheus]
enabled = false
endpoint = "/metrics"
[prometheus.labels]
[tracing]
enabled = false
jaegerEndpoint = ""
stdout = false
send_timeout = "10s"
storage-schemas.conf:
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[redash-metrics]
pattern = (.*{something I prefer to not share}.*)
retentions = 1m:7y
[production]
pattern = (^production.*|^secTeam.*)
retentions = 1m:60d,15m:120d,1h:3y
[non-production]
pattern = (^non-production.*|^canary.*)
retentions = 1m:14d,30m:30d,1h:180d
[default]
pattern = .*
retentions = 1m:14d,5m:90d,30m:1y
storage-aggregation.conf:
[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = max
[someTeam_aggregation]
pattern = ^someTeam.*
xFilesFactor = 0
aggregationMethod = average
[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
I wonder whether the max-size, max-metrics-globbed, or max-metrics-rendered fields have anything to do with the issue.
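If a heap profile would help narrow this down, I can enable the [pprof] section above (it's currently disabled) and capture one the next time memory starts climbing. Assuming the listener exposes the standard net/http/pprof endpoints, something like:
$ curl -s http://127.0.0.1:7007/debug/pprof/heap > heap.pprof
$ go tool pprof -top heap.pprof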
Additional context
A carbonapi service also runs on the same server.
We have an identical dev server, but its carbonapi is almost never queried.
Interestingly, we don't have this issue on the dev server, which suggests the issue is related to queries.
Here is the memory usage graph for prod (left) and dev (right), side by side, for a period of 22 days:

In addition, the systemd status also shows a considerable difference, even though the prod service has been active for only about 1.5 days.
Dev:
$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
Active: active (running) since Mon 2023-12-18 10:07:53 UTC; 2 weeks 6 days ago
Memory: 26.5G
Prod:
$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
Active: active (running) since Sat 2024-01-06 05:18:57 UTC; 1 day 8h ago
Memory: 42.0G
Although that may make sense, since there are almost zero queries on the dev server.
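For completeness: the Memory figure reported by systemd is cgroup-level accounting, which may include page cache from the whisper files, so I also plan to compare it against the process RSS (assuming the process name is go-carbon; RSS is reported in KiB):
$ systemctl show go-carbon.service -p MemoryCurrent
$ ps -C go-carbon -o pid,rss,comm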