Description
OS: CentOS 7
Fluentd version: td-agent-3.2.0-0.el7.x86_64
When the aggregator node is failing or responding very slowly under heavy load, fetching the status page /api/plugins.json on a forwarder node can take up to 1-2 minutes.
Steps to reproduce
Forwarder config
<source>
  @type monitor_agent
  bind 127.0.0.1
  port 24220
</source>
<source>
  @type forward
  bind 127.0.0.1
  port 24224
</source>
<match **>
  @type forward
  heartbeat_type tcp
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  # increased this while testing
  phi_threshold 160000
  hard_timeout 120s
  <server>
    name logs1
    host 172.31.3.5
    port 8889
    weight 60
  </server>
  flush_interval 10s
  buffer_type file
  buffer_path /var/log/fluentd/buffer/forward
  buffer_chunk_limit 4m
  buffer_queue_limit 4096
  num_threads 2
  expire_dns_cache 600
</match>
I have a service send logs to the forwarder.
Then, on the aggregator node, I run
# iptables -A INPUT -m statistic --mode random --probability 0.8 --source forwarder.node.ip.address -j DROP
On the forwarder node, I run the following curl request in a loop:
# while true; do timeout 2 curl -s http://localhost:24220/api/plugins.json > /dev/null && echo ok || echo failure; sleep 1; done
After some time it starts printing "failure".
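To see how long each request actually takes, rather than only ok/failure, the loop could be extended roughly as in the sketch below. It relies on curl's `--max-time` and `-w '%{time_total}'` options. The 120-second cap, the `classify` and `probe` helper names, and the output format are my own assumptions for illustration; they are not part of the original test.

```shell
#!/usr/bin/env bash
# Sketch: log how long each request to the monitor_agent endpoint takes.
URL="http://localhost:24220/api/plugins.json"  # endpoint from the report
MAX_TIME=120    # assumption: give up on a single request after 120 s
SLOW_AFTER=2    # same 2 s limit as the original "timeout 2" loop

# classify SECONDS LIMIT -> prints "slow" if SECONDS > LIMIT, else "ok"
classify() {
  awk -v t="$1" -v limit="$2" \
    'BEGIN { if (t + 0 > limit + 0) print "slow"; else print "ok" }'
}

# probe -> prints "<wall clock> <ok|slow|failure> <seconds>s"
probe() {
  local t
  if t=$(curl -s -o /dev/null --max-time "$MAX_TIME" -w '%{time_total}' "$URL"); then
    echo "$(date +%T) $(classify "$t" "$SLOW_AFTER") ${t}s"
  else
    # curl exited non-zero (connection refused, --max-time reached, ...)
    echo "$(date +%T) failure ${t}s"
  fi
}

# Usage: while true; do probe; sleep 1; done
```

Running `while true; do probe; sleep 1; done` on the forwarder should then show the response time growing once the drop rule is active on the aggregator, which would help confirm the 1-2 minute figure.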
When I flush the iptables rules on the aggregator node with
# iptables -F
everything returns to normal.
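If the aggregator has other iptables rules that should stay in place, the drop rule can probably be removed on its own with `-D` and the same rule specification instead of flushing everything. The sketch below wraps that in a small toggle. The `packet_loss` function name and the `IPTABLES` override (for a dry run) are assumptions of mine, not something from the original setup.

```shell
#!/usr/bin/env bash
# Sketch: toggle the packet-drop rule on the aggregator without flushing other rules.
# Set IPTABLES="echo iptables" for a dry run that only prints the command (assumption).
IPTABLES=${IPTABLES:-iptables}
PROBABILITY=0.8   # same drop probability as in the report

# packet_loss on|off FORWARDER_IP
packet_loss() {
  local action
  case "$1" in
    on)  action=-A ;;   # append the DROP rule
    off) action=-D ;;   # delete exactly that rule, leaving the rest untouched
    *)   echo "usage: packet_loss on|off FORWARDER_IP" >&2; return 2 ;;
  esac
  # $IPTABLES is intentionally unquoted so that "echo iptables" splits into two words
  $IPTABLES "$action" INPUT -m statistic --mode random \
    --probability "$PROBABILITY" --source "$2" -j DROP
}
```

For example, `packet_loss on forwarder.node.ip.address` starts dropping packets and `packet_loss off forwarder.node.ip.address` restores normal traffic; both need to be run as root on the aggregator.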
It does not happen every time, but it does in a fairly large percentage of cases.
td-agent 2.5 is not affected.
I also noticed that Docker services that send logs to the forwarder sometimes stop responding as well, but I have not been able to reproduce that in my test environment yet.
Thanks.
Regards,
Sergey