
[BUG] OOM (out of memory) recurring every 8-9 days #579

@interfan7


Describe the bug
When the service is killed by the OS due to OOM, systemd automatically starts it again.
Memory consumption on the machine then steadily increases for 8-9 days until the next OOM.

Logs
I haven't noticed anything particularly unusual in the logs. The OOM event does show up in the system logs (dmesg etc.).
I'll be happy to provide specific grep output/messages; otherwise the log is huge.
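If specific excerpts would help, something along these lines should pull just the OOM events out of the system logs (a sketch; the exact kernel message wording may differ on this machine):

$ sudo dmesg -T | grep -i -E 'out of memory|oom-killer'
$ sudo journalctl -k --since "10 days ago" | grep -i oom
$ sudo journalctl -u go-carbon.service | grep -i -E 'killed|oom'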

Go-carbon Configuration:

go-carbon.conf:

[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
max-cpu = 4
metric-interval = "1m0s"

[whisper]
data-dir = "/data/graphite/whisper/"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
quotas-file = ""
workers = 4
max-updates-per-second = 0
sparse-create = false
physical-size-factor = 0.75
flock = true
compressed = false
enabled = true
hash-filenames = true
remove-empty-file = false
online-migration = false
online-migration-rate = 5
online-migration-global-scope = ""

[cache]
max-size = 100000000
write-strategy = "max"

[udp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0
compression = ""

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7003"
enabled = true

[tags]
enabled = false
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
local-dir = "/data/graphite/tagging/"
tagdb-timeout = "1s"

[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
query-cache-enabled = true
streaming-query-cache-enabled = false
query-cache-size-mb = 0
find-cache-enabled = true
buckets = 100
max-globs = 1000
fail-on-max-globs = false
empty-result-ok = true
do-not-log-404s = false
metrics-as-counters = false
trigram-index = true
internal-stats-dir = ""
cache-scan = false
max-metrics-globbed = 1000000000
max-metrics-rendered = 100000000
trie-index = false
concurrent-index = false
realtime-index = 0
file-list-cache = ""
file-list-cache-version = 1
max-creates-per-second = 0
no-service-when-index-is-not-ready = false
max-inflight-requests = 0
render-trace-logging-enabled = false
[carbonserver.grpc]
listen = ""
enabled = false
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "5m0s"
quota-usage-report-frequency = "1m0s"

[dump]
enabled = false
path = "/var/lib/graphite/dump/"
restore-per-second = 0

[pprof]
listen = "127.0.0.1:7007"
enabled = false

[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
sample-tick = ""
sample-initial = 0
sample-thereafter = 0

[prometheus]
enabled = false
endpoint = "/metrics"
[prometheus.labels]

[tracing]
enabled = false
jaegerEndpoint = ""
stdout = false
send_timeout = "10s"

storage-schemas.conf:

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[redash-metrics]
pattern = (.*{something I prefer to not share}.*)
retentions = 1m:7y

[production]
pattern = (^production.*|^secTeam.*)
retentions = 1m:60d,15m:120d,1h:3y

[non-production]
pattern = (^non-production.*|^canary.*)
retentions = 1m:14d,30m:30d,1h:180d

[default]
pattern = .*
retentions = 1m:14d,5m:90d,30m:1y

storage-aggregation.conf:

[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = max

[someTeam_aggregation]
pattern = ^someTeam.*
xFilesFactor = 0
aggregationMethod = average

[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average

I wonder whether the max-size, max-metrics-globbed, or max-metrics-rendered fields have anything to do with the issue.
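One way to narrow this down could be a heap profile: the [pprof] section above is disabled at the moment, but if I enable it and restart, the listener should (as far as I understand) expose the standard Go net/http/pprof endpoints on the configured port, e.g.:

$ curl -s http://127.0.0.1:7007/debug/pprof/heap -o heap.out
$ go tool pprof -top heap.out    # top in-use allocations

That would at least show whether resident memory is dominated by the points cache, the index, or query/render allocations.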

Additional context
The carbonapi service also runs on the same server.
We have an identical dev server, but its carbonapi is almost never queried.
Interestingly, we don't have that issue on the dev server, which suggests the issue has to do with queries.
Here is the memory usage graph for prod (left) and dev (right), side by side, over a period of 22 days:
[image: memory usage graph, prod (left) vs. dev (right), 22 days]
In addition, the systemd status also shows a considerable difference, although the prod service has been active for only about 1.5 days.
Dev:

$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
     Active: active (running) since Mon 2023-12-18 10:07:53 UTC; 2 weeks 6 days ago
     Memory: 26.5G

Prod:

$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
     Active: active (running) since Sat 2024-01-06 05:18:57 UTC; 1 day 8h ago
     Memory: 42.0G

Although that makes sense, since there are almost zero queries on the dev server.
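If it helps, I can also plot go-carbon's own cache metrics against the RSS growth. Assuming carbonserver's /render endpoint accepts format=json and the internal counter is published as cache.size under my graph-prefix (carbon.agents.{host}), something like this should show whether the points cache itself is what grows between OOMs:

$ curl -s 'http://127.0.0.1:8080/render/?target=carbon.agents.*.cache.size&from=-2weeks&format=json'

If cache.size stays flat while RSS climbs, that would point at the query side (carbonserver/carbonapi) rather than the write cache.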
