Description
Describe the bug
There are two different metrics that I use for monitoring Fluentd output behavior:
* fluentd_status_retry_count
* fluentd_output_status_num_errors
I noticed that they both always show the exact same value, meaning every retry is counted as an error.
Although there might be some retries, in my opinion that does not immediately mean there is an error.
The output destination might be under load or have a temporary issue that causes Fluentd to retry, but there is no way to tell a retry, which is fine in my case, apart from an actual error, for example when the destination server is down.
To test this and see whether I can tell an actual error in my output destination apart from a few retries, I scaled down the destination deployment, so no logs could be sent, and set the retry limit to 5.
What I saw was that for each retry there was an error, and after 5 retries both metrics stopped going up.
What I expected to see was errors > retries, because the retries had reached their limit of 5 while the error was still there: the destination was down and no log could be accepted.
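For reference, taking the destination down for this test was just a matter of scaling its deployment to zero, roughly like this (deployment and namespace names are placeholders):

    # placeholder names; scale the destination to 0 replicas so nothing can accept logs
    kubectl scale deployment dest-deployment --replicas=0 -n dest-namespace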
The issue here is that I don't have any clear way to figure out whether my logs are simply being retried or there is an ongoing problem that no retry can solve.
Currently I'm getting alerts on 'errors' even though these errors are only retries, and I can see in Fluentd's logs that the chunks were successfully sent after a few retries, so they are not actual errors.
I want to be able to monitor an actual error that prevents all of the logs from being sent.
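What I would like is to alert only when errors keep growing after the retries have stopped, i.e. something roughly like the following PromQL sketch (labels summed away for simplicity, the time window is arbitrary):

    # rough sketch: fire only when errors outpace retries over the last 10 minutes
    sum(increase(fluentd_output_status_num_errors[10m]))
      > sum(increase(fluentd_status_retry_count[10m]))

With the behavior described above, an expression like this can never fire, because the two counters move in lockstep.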
To Reproduce
Configure an output destination that cannot actually be reached, set the retry limit to a finite number such as 5, and then watch the metrics.
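For example, a trimmed-down output block along these lines should reproduce it (the host here is just a placeholder for a destination that cannot be reached):

    [OUTPUT]
        Name http
        Match kube.*
        # placeholder for an unreachable destination
        Host unreachable.example.invalid
        Port 443
        Format json_lines
        # finite retry limit so the retries eventually give up
        Retry_Limit 5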
Expected behavior
I expect to see more errors than retries, not retries == errors, because the error keeps happening even after the retries have reached their limit.
Your Environment
- Fluentd version: 1.14.6
- Running an image built from the following base image: v1.14.6-debian-forward-1.0
- Running on Kubernetes v1.21.7 with a Fluentd daemonset
Your Configuration
config:
  service: |-
    [SERVICE]
        Daemon Off
        Flush 1
        Log_Level {{.Values.logLevel}}
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{.Values.service.port}}
        Health_Check On
        storage.metrics on
  inputs: |-
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        Refresh_Interval 5
        Skip_Long_Lines On
        Mem_Buf_Limit 25MB
        DB /var/log/fluentbit-tail.db
    @INCLUDE input-systemd.conf
  filters: |-
    [FILTER]
        Name kubernetes
        Match kube.*
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        Use_Kubelet On
        Annotations Off
        Labels On
        Buffer_Size 0
        Keep_Log Off
        Merge_Log_Key log_obj
        Merge_Log On
    [FILTER]
        Name nest
        Match kube.*
        Operation lift
        Nested_under kubernetes
        Add_prefix kubernetes.
    [FILTER]
        Name modify
        Match kube.*
        Copy ${APP_NAME} applicationName
        Copy ${SUB_SYSTEM} subsystemName
    [FILTER]
        Name nest
        Match kube.*
        Operation nest
        Wildcard kubernetes.*
        Nest_under kubernetes
        Remove_prefix kubernetes.
    [FILTER]
        Name nest
        Match kube.*
        Operation nest
        Wildcard kubernetes
        Wildcard log
        Wildcard log_obj
        Wildcard stream
        Wildcard time
        Nest_under json
    @INCLUDE filters-systemd.conf
  outputs: |-
    [OUTPUT]
        Name http
        Match kube.*
        Host ${ENDPOINT}
        Port 443
        URI /logs/rest/singles
        Format json_lines
        TLS On
        Header private_key ${PRIVATE_KEY}
        compress gzip
        Retry_Limit False
    @INCLUDE output-systemd.conf
  extraFiles:
    input-systemd.conf: |-
      [INPUT]
          Name systemd
          Tag host.*
          Systemd_Filter _SYSTEMD_UNIT=kubelet.service
          Read_From_Tail On
          Mem_Buf_Limit 5MB
    filters-systemd.conf: |-
      [FILTER]
          Name modify
          Match host.*
          Add applicationName ${APP_NAME_SYSTEMD}
          Add subsystemName ${SUB_SYSTEM_SYSTEMD}
      [FILTER]
          Name nest
          Match host.*
          Operation nest
          Wildcard _HOSTNAME
          Wildcard SYSLOG_IDENTIFIER
          Wildcard _CMDLINE
          Wildcard MESSAGE
          Nest_under json
    output-systemd.conf: |-
      [OUTPUT]
          Name http
          Match host.*
          Host ${ENDPOINT}
          Port 443
          URI /logs/rest/singles
          Format json_lines
          TLS On
          Header private_key ${PRIVATE_KEY}
          compress gzip
          Retry_Limit 10
Your Error Log
The metrics:
* fluentd_status_retry_count
* fluentd_output_status_num_errors
always show the exact same value.