Description
Problem Description
While the real memory circuit breaker improves parent memory accounting by basing it on real heap usage, the durability
is still derived from the child circuit breakers: it is chosen according to whichever child breakers, grouped by their respective durability, contributed the most.
Child circuit breakers are known to be inaccurate, so it is hard to determine the nature of the issue that results in frequent GC or request trips.
Keeping a node around that trips requests frequently isn't optimal and degrades throughput; in some cases it may be better to bounce such nodes to clear garbage.
Here is a sample instance showing how far apart the real memory (parent) usage and the child usages can be:
curl localhost:9200/_cat/thread_pool?v
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8337839048/7.7gb], which is larger than the limit of [8127315968/7.5gb], real usage: [8337839048/7.7gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=0/0b, accounting=6388018/6mb]","bytes_wanted":8337839048,"bytes_limit":8127315968,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8337839048/7.7gb], which is larger than the limit of [8127315968/7.5gb], real usage: [8337839048/7.7gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=0/0b, accounting=6388018/6mb]","bytes_wanted":8337839048,"bytes_limit":8127315968 .....
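To make the mismatch concrete, here is a minimal sketch of the rule described above, applied to the numbers in the sample response. It is an illustration, not the actual implementation: the mapping of child breakers to durability and the tie-breaking rule are assumptions made for this example.

```python
TRANSIENT = "TRANSIENT"
PERMANENT = "PERMANENT"

# Assumed durability of each child breaker (hypothetical mapping for this sketch).
CHILD_DURABILITY = {
    "request": TRANSIENT,
    "in_flight_requests": TRANSIENT,
    "fielddata": PERMANENT,
    "accounting": PERMANENT,
}


def derive_durability(child_usages):
    """Pick the durability whose child breakers account for the most bytes."""
    transient = sum(
        used for name, used in child_usages.items()
        if CHILD_DURABILITY[name] == TRANSIENT
    )
    permanent = sum(
        used for name, used in child_usages.items()
        if CHILD_DURABILITY[name] == PERMANENT
    )
    # Assumption: ties are resolved in favour of TRANSIENT.
    return TRANSIENT if transient >= permanent else PERMANENT


# Values copied from the sample response above.
usages = {
    "request": 0,
    "fielddata": 0,
    "in_flight_requests": 0,
    "accounting": 6388018,
}
real_usage = 8337839048

durability = derive_durability(usages)
tracked_share = sum(usages.values()) / real_usage

print(durability)
print(f"child breakers account for {tracked_share:.4%} of real usage")
```

Under these assumptions the trip is labelled PERMANENT solely because of the 6mb held by the accounting breaker, even though the child breakers together explain well under 1% of the 7.7gb of real usage that actually caused the trip.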
Proposal
Derive durability based on some function of:
- Request trips over a period of time (avoiding flip-flops)
- Heap after GC (ensuring GC throughput is still reasonable)
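One possible shape for such a function is sketched below. The class name, the window length, and both thresholds are hypothetical values chosen for illustration; the idea is only that a trip is reported as PERMANENT when trips are sustained over a sliding window and the heap remains high after GC, and as TRANSIENT otherwise.

```python
from collections import deque


class DurabilityEstimator:
    """Hypothetical estimator combining trip ratio and heap-after-GC."""

    def __init__(self, window_seconds=300.0, trip_ratio_threshold=0.5,
                 heap_after_gc_threshold=0.85):
        self.window_seconds = window_seconds
        self.trip_ratio_threshold = trip_ratio_threshold
        self.heap_after_gc_threshold = heap_after_gc_threshold
        self._events = deque()  # (timestamp, tripped) pairs, oldest first

    def record_request(self, now, tripped):
        self._events.append((now, tripped))
        self._evict(now)

    def _evict(self, now):
        # Drop events that fell out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def trip_ratio(self, now):
        self._evict(now)
        if not self._events:
            return 0.0
        tripped = sum(1 for _, was_tripped in self._events if was_tripped)
        return tripped / len(self._events)

    def durability(self, now, heap_after_gc_bytes, heap_max_bytes):
        # Sustained trips over the window, so a short burst does not flip the result.
        sustained = self.trip_ratio(now) >= self.trip_ratio_threshold
        # GC is not reclaiming much if the heap stays high right after a collection.
        gc_ineffective = (
            heap_after_gc_bytes / heap_max_bytes >= self.heap_after_gc_threshold
        )
        return "PERMANENT" if sustained and gc_ineffective else "TRANSIENT"


GB = 2 ** 30
est = DurabilityEstimator(window_seconds=60.0)
for t in range(10):
    # Every second request trips in this made-up trace.
    est.record_request(float(t), tripped=(t % 2 == 0))

print(est.durability(10.0, heap_after_gc_bytes=9 * GB, heap_max_bytes=10 * GB))
print(est.durability(10.0, heap_after_gc_bytes=2 * GB, heap_max_bytes=10 * GB))
```

Requiring both conditions is a deliberate choice in this sketch: a high trip ratio with a low heap after GC suggests load that will pass, while a high heap after GC with few trips has not yet hurt requests. Whether the two signals should be combined with a logical AND, a weighted score, or something else is exactly the open question of the proposal.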
Should nodes still be considered healthy (as part of the cluster) if they continue to trip the majority of requests over a prolonged period of time, or even when the circuit breaker reports the present PERMANENT
durability? Thoughts?