
Memory leak in kubernetes daemonset #4941

Describe the bug

Running fluentd-kubernetes-daemonset on modest EKS clusters (6-12 nodes, *.large instances), each fluentd pod steadily grows in memory use until it hits its allocated memory limit (currently 750MB) and restarts (timescale roughly one week).

[Graph: per-pod memory usage climbing steadily over ~1 week until the 750MB limit is reached and the pod restarts]

This occurs in multiple clusters with different workloads, different message formats, and different usage patterns, including very low-traffic instances.

There are no obvious problems in the fluentd logs, and no obvious correlation with particular log messages or usage patterns.

To Reproduce

Deploy fluentd-kubernetes-daemonset 1.17.1 with matching debian-s3 and debian-elasticsearch8 backends, a tail input with a JSON parser, and a small number of match rules for JSON log formats. Watch memory use via cluster metrics.
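For context, a minimal sketch of the kind of tail source involved, assuming a standard in_tail source for container logs (path, pos_file, tag, and parser settings here are illustrative, not our exact configuration):

<source>
  @id in_tail_container_logs
  @type tail
  # Tail every container log file on the node
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    # Container logs are parsed as JSON before the match rules below apply
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>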

Expected behavior

Memory use should stabilise well below 750MB per instance.

Your Environment

- Fluentd version: 1.17.1
- Package version:
- Operating system: k8s 1.31 with AL2023 nodes (Amazon Linux 2023.6.20250115)
- Kernel version: 6.1.119-129.201.amzn2023.x86_64

Your Configuration

<match kubernetes.log.hydrology-data-explorer.**>
  @id json_hydrology_data_explorer
  @type rewrite_tag_filter
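  # Retag each event by its 'stream' field (stdout/stderr for container logs)
  # so it is picked up by the kubernetes.log.json.* rules elsewhere in the config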
  <rule>
    key stream
    pattern ^(.+)$
    tag kubernetes.log.json.$1
  </rule>
</match>

<match kubernetes.log.hydro-api.**>
  @id json_hydro_api
  @type rewrite_tag_filter
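  # Same retagging by 'stream', but with a '.ts' suffix on the tag so these
  # events follow a different downstream path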
  <rule>
    key stream
    pattern ^(.+)$
    tag kubernetes.log.json.$1.ts
  </rule>
</match>

Your Error Log

Most instances have few logs. Others show (successful) retries when pushing to the Elasticsearch backend once or twice a day:

2025-04-26 12:45:25 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-26 12:45:27 +0000 chunk="633add28dae5764171d00d39ffc11934" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-26 12:45:25 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-26 12:45:26 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633add2e01810b613843ebaa2caa91aa"
2025-04-26 15:06:06 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-26 15:06:07 +0000 chunk="633afc9ad3b10c13ec620ea3304c84d8" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-26 15:06:06 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-26 15:06:08 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633afc9f9bfdff286397141c43de483c"
2025-04-27 01:21:11 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-27 01:21:12 +0000 chunk="633b861655fe9a1718104b9411c3f151" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-27 01:21:11 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-27 01:21:13 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633b861b1ba05e2e41d3a813d04a3061"
2025-04-27 04:36:17 +0000 [warn]: #0 [out_es7] failed to flush the buffer. retry_times=0 next_retry_time=2025-04-27 04:36:18 +0000 chunk="633bb1b124e3e35e90244a93581b001c" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-master\", :port=>9200, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): read timeout reached"
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1148:in `rescue in send_bulk'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:1110:in `send_bulk'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `block in write'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `each'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluent-plugin-elasticsearch-5.3.0/lib/fluent/plugin/out_elasticsearch.rb:885:in `write'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2025-04-27 04:36:17 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.2.0/gems/fluentd-1.17.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2025-04-27 04:36:18 +0000 [warn]: #0 [out_es7] retry succeeded. chunk_id="633bb1b6df310f483235bdf467bc487b"
2025-04-28 11:52:20 +0000 [warn]: #0 [in_tail_container_logs] /var/log/containers/publish-telemetry-15min-1745840160-ingest-telemetry-1370545147_hydro-production_init-a2d394ce175ea71c68eeffad5bd99d474efbde4c9599237ae658223f2cacf136.log unreadable. It is excluded and would be examined next time.

Additional context

We have attempted Ruby GC tuning in case it helped, including setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR to 1.2, with no apparent effect.

We appreciate this is likely hard to reproduce, especially without matching cluster workloads, but we cannot find a pattern that would yield a minimal, complete test case. We are hoping this either matches a known past issue (that we have missed) or lines up with other future reports.
