A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Background
We run Vector as a Kubernetes DaemonSet to collect container logs (Vector DaemonSet -> AutoMQ / Kafka).
The Kubernetes cluster uses Docker with the json-file log driver.
Docker log rotation:
max files: 8
max size per file: 100MB
Real Docker log files are stored under:
/data1/docker/containers/<container_id>/
Kubernetes pod log paths are symlinks:
/var/log/pods/<namespace>_<pod_name>_<uid>/<container_name>/0.log
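For reference, the rotation settings above correspond to a Docker daemon configuration along these lines (a minimal daemon.json sketch, assuming the rotation is configured globally rather than per container):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "8"
  }
}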
Issue 1: Some Vector pods keep a source lag of around 2K seconds
On some Kubernetes nodes, the Vector kubernetes_logs source lag stays around 2K seconds.
Metric:
vector_source_lag_time_seconds_bucket
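For context, we derive the lag from that histogram with a query along the lines of the sketch below (label names such as component_id are assumptions based on our scrape setup):

histogram_quantile(
  0.99,
  sum by (le, component_id) (rate(vector_source_lag_time_seconds_bucket[5m]))
)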
We tried increasing the Vector CPU limit to 6 cores and raising max_read_bytes to 4 MB, but neither helped much: the source read throughput still stays around ~12 MB/s, and the source lag remains around 2K seconds.
On the sink side, we checked:
vector_kafka_queue_messages_bytes
It stays very stable at around 500 KB.
So this does not look like Kafka sink backpressure or producer queue accumulation.
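For completeness, the producer-queue check is just the raw gauge, e.g. (same assumption about label names):

max by (component_id) (vector_kafka_queue_messages_bytes)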
Another important observation: during business low-traffic hours, the lag can fully recover back to 0.
This suggests that Vector's current consumption capacity is not enough during peak traffic, but we are not sure where the bottleneck is.
Issue 2: Incorrect log interleaving when multiple watchers exist for the same container
Possibly because Vector cannot consume logs fast enough and Docker log rotation is frequent, multiple watchers for rotated log files of the same container can coexist.
For example, Vector may hold multiple file descriptors for the same container:
<container_id>-json.log
<container_id>-json.log.1
<container_id>-json.log.2
<container_id>-json.log.3
...
When this happens, we sometimes see log interleaving/corruption.
A large application log line is split by Docker into multiple 16KB JSON log records. Vector is expected to merge only fragments from the same original log line.
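To illustrate what we mean (the contents below are made up), the json-file driver writes one JSON record per fragment, and only the final fragment of the original application line carries the trailing newline in its log field:

{"log":"<first ~16KB of a long application line> ...","stream":"stdout","time":"2024-01-01T00:00:00.000000001Z"}
{"log":"... <middle ~16KB fragment> ...","stream":"stdout","time":"2024-01-01T00:00:00.000000002Z"}
{"log":"... <final fragment of the same line>\n","stream":"stdout","time":"2024-01-01T00:00:00.000000003Z"}

A correct partial-line merge should recombine exactly these three records and nothing else.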
However, in the actual query result, fragments from different logs can be merged together. Sometimes a fragment from another log appears in the middle of the merged log.
This problem happens frequently during peak traffic, but rarely happens during business low-traffic hours.
Our current suspicion is:
high source lag
-> old rotated file watchers remain active for a long time
-> multiple watchers for the same container coexist
-> Docker partial log merge may incorrectly merge fragments across different rotated files
Please help confirm whether this is an expected limitation, a configuration issue, or a correctness bug in kubernetes_logs / Docker partial log merging.
Configuration
apiVersion: v1
data:
agent.yaml: |
api:
enabled: true
address: "0.0.0.0:8686"
data_dir: /data/vector/
sources:
k8s:
type: file
include:
- /data1/log/kubernetes.audit
logs:
type: kubernetes_logs
use_apiserver_cache: true
max_read_bytes: 4194304
rotate_wait_secs: 1800
exclude_paths_glob_patterns:
- "**/*colo-system_*"
- "**/*k8s-iaas_*"
- "**/*.gz"
- "**/*.tmp"
oldest_first: true
internal_metrics:
type: internal_metrics
transforms:
filter:
type: filter
inputs:
- logs
condition: |-
# Filter by Kubernetes labels
remap:
type: remap
inputs:
- filter
source: |-
# Rewrite the event as:log, pod, app_name, unit_name, container_name, version, idc, @timestamp
route:
type: route
inputs:
- remap
reroute_unmatched: true
route:
multiline_std:
type: reduce
inputs:
- route.std
starts_when: starts_with(string!(.log), "[")
expire_after_ms: 1000
max_events: 100
merge_strategies:
log: concat_newline
group_by:
- pod
- container_name
sinks:
prom_exporter:
type: prometheus_exporter
inputs:
- internal_metrics
address: 0.0.0.0:9090
kafka:
type: kafka
inputs:
- multiline_std
- route._unmatched
- k8s_remap
bootstrap_servers: "${KAFKA_HOSTS}"
topic: log.log-container-app.stdout
librdkafka_options:
request.required.acks: "-1"
batch:
max_events: 1000
encoding:
codec: "json"
Version
vector 0.43.0 (x86_64-unknown-linux-musl)
Debug Output
Example Data
No response
Additional Context
No response
References
No response