
Memory leak in kubernetes daemonset #4941

Describe the bug

Running fluentd-kubernetes-daemonset on modest EKS clusters (6-12 nodes, *.large instances), each fluentd pod steadily grows in memory use until it hits its allocated memory limit (currently 750MB) and restarts (timescale roughly one week).

[Graph: per-pod memory usage climbing steadily over ~1 week until the 750MB limit is reached and the pod restarts]

This occurs in multiple clusters with different workloads, different message formats, and different usage patterns, including very low-traffic instances.

There are no obvious problems in the fluentd logs, and no obvious correlation with particular log messages or usage patterns.

To Reproduce

Deploy fluentd-kubernetes-daemonset 1.17.1 with matching debian-s3 and debian-elasticsearch8 backends, a tail input with a JSON parser, and a small number of match rules for JSON log formats. Watch memory use via cluster metrics.
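For context, a minimal sketch of the kind of tail source involved, assuming a standard in_tail source for container logs (path, pos_file, tag, and parser settings here are illustrative, not our exact configuration):

<source>
  @id in_tail_container_logs
  @type tail
  # Tail every container log file on the node
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    # Container logs are parsed as JSON before the match rules below apply
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>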

Expected behavior

Memory use should stabilise well below 750MB per instance.

Your Environment

- Fluentd version: 1.17.1
- Package version:
- Operating system: k8s 1.31 with AL2023 nodes (Amazon Linux 2023.6.20250115)
- Kernel version: 6.1.119-129.201.amzn2023.x86_64

Your Configuration

<match kubernetes.log.hydrology-data-explorer.**>
  @id json_hydrology_data_explorer
  @type rewrite_tag_filter
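  # Retag each event by its 'stream' field (stdout/stderr for container logs)
  # so it is picked up by the kubernetes.log.json.* rules elsewhere in the config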
  <rule>
    key stream
    pattern ^(.+)$
    tag kubernetes.log.json.$1
  </rule>
</match>

<match kubernetes.log.hydro-api.**>
  @id json_hydro_api
  @type rewrite_tag_filter
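  # Same retagging by 'stream', but with a '.ts' suffix on the tag so these
  # events follow a different downstream path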
  <rule>
    key stream
    pattern ^(.+)$
    tag kubernetes.log.json.$1.ts
  </rule>
</match>

Your Error Log

Most instances have few logs. Others show (successful) retries when pushing to the Elasticsearch backend once or twice a day:

2025-04-26 12:45:25 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-26 12:45:27 +0000 chunk="633add28dae5764171d00d39ffc11934" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-26 12:45:26 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633add2e01810b613843ebaa2caa91aa"
2025-04-26 15:06:06 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-26 15:06:07 +0000 chunk="633afc9ad3b10c13ec620ea3304c84d8" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-26 15:06:08 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633afc9f9bfdff286397141c43de483c"
2025-04-27 01:21:11 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-27 01:21:12 +0000 chunk="633b861655fe9a1718104b9411c3f151" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-27 01:21:13 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633b861b1ba05e2e41d3a813d04a3061"
2025-04-27 04:36:17 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-27 04:36:18 +0000 chunk="633bb1b124e3e35e90244a93581b001c" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-27 04:36:18 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633bb1b6df310f483235bdf467bc487b"
2025-04-28 11:52:20 +0000 [warn]: #0 [in_tail_container_logs] /var/log/containers/publish-telemetry-15min-1745840160-ingest-telemetry-1370545147_hydro-production_init-a2d394ce175ea71c68eeffad5bd99d474efbde4c9599237ae658223f2cacf136.log unreadable. It is excluded and would be examined next time.

Additional context

We have attempted Ruby GC tuning in case it helped, including setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR to 1.2, with no apparent effect.

We appreciate this is likely hard to reproduce, especially without matching cluster workloads, but we cannot find a pattern that would yield a minimal, complete test case. We are hoping this either matches a known past issue (that we have missed) or lines up with other future reports.
