Possible memory leak when using s3_sink, disk buffer, and uploads temporarily break #23875

@jhbigler-pnnl

Description

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We use Vector to ship Zeek and Suricata logs to our MinIO cluster using the AWS S3 sink (configured to use MinIO as its endpoint). Because there can occasionally be downtime (usually network-related), we configure a large disk buffer to preserve as much backlog as possible. I also configure the sink to retry on every error.
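For clarity, the sink points at MinIO via the endpoint option. A rough sketch is below; the endpoint URL, bucket, and region are made-up placeholders, and the real (trimmed) settings are in the Configuration section:

# Sketch only - endpoint, bucket, and region are placeholders, not our real values
type = "aws_s3"
endpoint = "https://minio.internal.example:9000"  # point the AWS S3 sink at MinIO
bucket = "sensor-logs"
region = "us-east-1"  # MinIO typically ignores the region, but the sink still wants one
compression = "gzip"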

We had a period where Vector could not send logs to MinIO, and after a while we noticed that the host running Vector had become completely unresponsive. We realized Vector was using a huge amount of memory - over 20 GB at that point - though unfortunately I don't have a screenshot to back up that figure.

I'm a bit surprised it grew that large, because as I understand it, using a disk buffer means Vector should not keep many events in memory - it reads them from disk as needed. I tried disabling concurrency on that sink and reducing the request rate, but that only seemed to slow the growth in memory use.

Suspecting a memory leak, I generated a Kibana graph of Vector running with memory profiling enabled, overlaid with systemd's reported memory usage for vector.service. It shows that Vector's actual memory use keeps increasing even though the per-component memory use reported by Vector does not noticeably increase:

[Screenshot: Kibana graph overlaying Vector's per-component memory metrics with systemd's reported memory usage for vector.service]
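For reference, the per-component numbers in the graph come from Vector's internal metrics. The pipeline shipping them looks roughly like the following; the component names and the Elasticsearch endpoint are placeholders, not our exact configuration:

# Sketch of the metrics pipeline feeding the graph above - names and the
# Elasticsearch endpoint are placeholders
[sources.vector_internal]
type = "internal_metrics"

[transforms.metrics_as_logs]
type = "metric_to_log"        # Elasticsearch expects log events, not metrics
inputs = ["vector_internal"]

[sinks.metrics_to_es]
type = "elasticsearch"
inputs = ["metrics_as_logs"]
endpoints = ["http://elasticsearch.example:9200"]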

Could this be a memory leak, or am I not actually understanding Vector's disk buffering system correctly?

NOTE - I am not claiming this problem is unique to the S3 sink; it is simply where I am seeing the issue.

Configuration

# /etc/vector/configs/sinks/minio_output.toml
# I do not provide every configuration detail, only the ones I feel might be relevant

type = "aws_s3"
compression = "gzip"

[batch]
max_bytes = 450000000     # ~450 MB per batch
max_events = 10000000
timeout_secs = 75

[buffer]
max_size = 1771061398732  # ~1.6 TiB disk buffer
type = "disk"
when_full = "block"

[encoding]
codec = "text"

[framing]
method = "newline_delimited"

[healthcheck]
enabled = false

[request]
concurrency = "none"
rate_limit_num = 3
timeout = 120
rate_limit_duration_secs = 10

[retry_strategy]
type = "all"

# This is simply the easiest way to simulate an outage - providing fake credentials
[auth]
secret_access_key = "madeup"
access_key_id = "madeup"

Version

0.49.0

Debug Output


Example Data

No response

Additional Context

Vector runs in a systemd unit on a Rocky Linux 8 server.

References

No response
