Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
We use Vector to ship Zeek and Suricata logs to our Minio cluster using the AWS S3 sink (configured to use Minio as its endpoint). Because there can sometimes be downtime (usually due to the network), we configure a large disk buffer to preserve as much of the backlog as possible. I also configure the sink to retry on every error.
We had a period where Vector could not send logs to Minio, and after a while we noticed that the host running Vector was completely unresponsive. We realized that Vector was using a huge amount of memory - it was over 20 GB at that time; unfortunately, I don't have a screenshot to prove that figure.
I'm a bit surprised that it grew that large, because as I understood it, using a disk buffer means Vector should not be keeping many events in memory - it reads them from disk as needed. I tried disabling concurrency on that sink and reducing the request rate, but that only seemed to slow the growth of its memory use.
Suspecting a memory leak, I generated a Kibana graph of Vector running with memory profiling enabled, overlaid with systemd's reporting of vector.service's memory usage (the metrics side of that setup is sketched at the end of this section). What it shows is that Vector's actual memory use keeps increasing even though the memory use of the Vector components, as reported by Vector itself, does not noticeably increase:

Could this be a memory leak, or am I misunderstanding how Vector's disk buffering works?
NOTE - I am not claiming this problem is unique to the aws_s3 sink; it is simply where I am seeing the issue.
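For reference, the per-component memory numbers in the graph come from Vector's own internal metrics, with allocation tracking turned on (via the --allocation-tracing CLI flag, if I remember the flag name correctly - treat that as an assumption). A minimal sketch of that metrics pipeline is below; the component names and exporter address are illustrative, and in reality we ship these metrics to Elasticsearch for the Kibana graph rather than exposing them via Prometheus:
# /etc/vector/configs/sources/internal_metrics_in.toml (illustrative)
type = "internal_metrics"

# /etc/vector/configs/sinks/vector_metrics_out.toml (illustrative)
type = "prometheus_exporter"
inputs = ["internal_metrics_in"]
address = "0.0.0.0:9598"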
Configuration
# /etc/vector/configs/sinks/minio_output.toml
# I do not provide every configuration detail, only the ones I feel might be relevant
type = "aws_s3"
compression = "gzip"
[batch]
max_bytes = 450000000
max_events = 10000000
timeout_secs = 75
[buffer]
max_size = 1771061398732
type = "disk"
when_full = "block"
[encoding]
codec = "text"
[framing]
method = "newline_delimited"
[healthcheck]
enabled = false
[request]
concurrency = "none"
rate_limit_num = 3
timeout = 120
rate_limit_duration_secs = 10
[retry_strategy]
type = "all"
# This is simply the easiest way to simulate an outage - providing fake credentials
[auth]
secret_access_key = "madeup"
access_key_id = "madeup"
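For completeness, the file above is one of several loaded through Vector's config-directory mechanism, where the component name is taken from the file name. The service is started roughly like this (path illustrative):
vector --config-dir /etc/vector/configs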
Version
0.49.0
Debug Output
Example Data
No response
Additional Context
Vector runs in a systemd unit on a Rocky Linux 8 server.
References
No response