
kubernetes_logs has high source lag and may incorrectly merge Docker partial logs when multiple rotated file watchers exist for the same pod log path #25385

@huan89983

Description

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Background

We run Vector as a Kubernetes DaemonSet to collect container logs (pipeline: Vector DaemonSet -> AutoMQ / Kafka).

The Kubernetes cluster uses Docker with the json-file log driver.

Docker log rotation:

  • max files: 8
  • max size per file: 100 MB
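
In daemon.json terms, that rotation policy corresponds to roughly the following (a sketch; these are Docker's standard json-file log options, and the rest of our daemon config is omitted):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "8"
  }
}

So each container can retain up to 8 x 100 MB = 800 MB of log data on disk at any time.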

Real Docker log files are stored under:

/data1/docker/containers/<container_id>/

Kubernetes pod log paths are symlinks:

/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/0.log
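
Concretely, each such symlink resolves into the Docker directory above (placeholders stand for real IDs):

/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/0.log
  -> /data1/docker/containers/<container_id>/<container_id>-json.log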

Issue 1: Some Vector pods keep around 2,000 seconds of source lag

On some Kubernetes nodes, the Vector kubernetes_logs source lag stays at around 2,000 seconds.

Metric:

vector_source_lag_time_seconds_bucket
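
For reference, we chart the p99 lag with a query along these lines (PromQL; the 5m window and the component_id grouping are our own choices, not anything Vector mandates):

histogram_quantile(0.99, sum by (le, component_id) (rate(vector_source_lag_time_seconds_bucket[5m])))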

We tried increasing the Vector CPU limit to 6 cores and raising max_read_bytes to 4 MiB, but it did not help much.


The source read throughput still stays at around 12 MB/s.


Even after increasing CPU and max_read_bytes, the read throughput did not improve significantly, and the source lag stayed around 2,000 seconds.

On the sink side, we checked:

vector_kafka_queue_messages_bytes

It stays very stable at around 500 KB.


So this does not look like Kafka sink backpressure or producer queue accumulation.

Another important observation: during business low-traffic hours, the lag can fully recover back to 0.

This suggests that Vector cannot keep up with peak traffic, but we are not sure where the bottleneck is.
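
One signal that might narrow this down is the per-component utilization gauge from internal_metrics (exposed as vector_utilization via prometheus_exporter, if we read the internal metrics docs right), to see which stage saturates first during peak hours, e.g.:

max by (component_id) (vector_utilization)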

Issue 2: Incorrect log interleaving when multiple watchers exist for the same container

Possibly because Vector cannot consume logs fast enough and Docker log rotation is frequent, multiple watchers for rotated log files of the same container can coexist.

For example, Vector may hold multiple file descriptors for the same container:

<container_id>-json.log
<container_id>-json.log.1
<container_id>-json.log.2
<container_id>-json.log.3
...

When this happens, we sometimes see log interleaving/corruption.

A large application log line is split by Docker into multiple 16KB JSON log records. Vector is expected to merge only fragments from the same original log line.
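
As an illustration of what the source sees (contents and timestamps invented), the json-file driver writes one JSON record per fragment, and only the final fragment's log field ends with a newline:

{"log":"first 16KB of a long line ...","stream":"stdout","time":"2024-01-01T00:00:00.000000001Z"}
{"log":"... middle fragment ...","stream":"stdout","time":"2024-01-01T00:00:00.000000002Z"}
{"log":"... final fragment\n","stream":"stdout","time":"2024-01-01T00:00:00.000000003Z"}

As we understand it, the partial merge (auto_partial_merge, enabled by default) joins consecutive records until it sees one whose log field ends with \n, so correctness depends on fragments arriving in file order from a single file.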

However, in the actual query results, fragments from different log lines are sometimes merged together; occasionally a fragment from another log appears in the middle of a merged log.

This problem happens frequently during peak traffic, but rarely happens during business low-traffic hours.

Our current suspicion is:

high source lag
-> old rotated file watchers remain active for a long time
-> multiple watchers for the same container coexist
-> Docker partial log merge may incorrectly merge fragments across different rotated files
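
For context, our config below sets rotate_wait_secs: 1800, which (as we read the docs) keeps readers on rotated files alive for up to 30 minutes after rotation; with 8 rotated files per container, that seems consistent with several coexisting watchers. A lower value, sketched here, would close them sooner (unvalidated; shown only to make the hypothesis concrete):

logs:
  type: kubernetes_logs
  # hypothetical lower value; we currently run 1800
  rotate_wait_secs: 60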

Please help confirm whether this is an expected limitation, a configuration issue, or a correctness bug in kubernetes_logs / Docker partial log merging.

Configuration

apiVersion: v1
data:
  agent.yaml: |
    api:
      enabled: true
      address: "0.0.0.0:8686"
    data_dir: /data/vector/
    sources:
      k8s:
        type: file
        include:
          - /data1/log/kubernetes.audit
      logs:
        type: kubernetes_logs
        use_apiserver_cache: true
        max_read_bytes: 4194304
        rotate_wait_secs: 1800
        exclude_paths_glob_patterns:
        - "**/*colo-system_*"
        - "**/*k8s-iaas_*"
        - "**/*.gz"
        - "**/*.tmp"
        oldest_first: true
      internal_metrics:
        type: internal_metrics
    transforms:
      filter:
        type: filter
        inputs:
        - logs
        condition: |-
          # Filter by Kubernetes labels
      remap:
        type: remap
        inputs:
        - filter
        source: |-
          # Rewrite the event as: log, pod, app_name, unit_name, container_name, version, idc, @timestamp
      route:
        type: route
        inputs:
        - remap
        reroute_unmatched: true
        route:
          std: # condition elided; matches stdout application logs
      multiline_std:
        type: reduce
        inputs:
        - route.std
        starts_when: starts_with(string!(.log), "[")
        expire_after_ms: 1000
        max_events: 100
        merge_strategies:
          log: concat_newline
        group_by:
        - pod
        - container_name
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs:
        - internal_metrics
        address: 0.0.0.0:9090
      kafka:
        type: kafka
        inputs:
        - multiline_std
        - route._unmatched
        - k8s_remap
        bootstrap_servers: "${KAFKA_HOSTS}"
        topic: log.log-container-app.stdout
        librdkafka_options:
          request.required.acks: "-1"
        batch:
          max_events: 1000
        encoding:
          codec: "json"

Version

vector 0.43.0 (x86_64-unknown-linux-musl)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response
