- Program: Recursor
- Issue type: Feature request
Short description
Due to NSSpeeds cache decay, a really slow authoritative nameserver for a domain is still sent "live" DNS queries every now and then, leading to undesirably long response times for those queries.
Usecase
There are quite a few domains whose authoritative servers are not similarly fast. Recursor learns which nameservers are slow but still forwards a few queries to them every now and then. However, we'd usually prefer "stable", fast resolution times, especially when people measure response times, look closely at outliers and try to do comparisons. 😉
Description
Because the NSSpeeds cache decays, a really slow authoritative nameserver for a domain still receives "live" DNS queries every now and then, and those queries see undesirably long response times. It might make sense to review this behavior and, for example, use some sort of background task to probe NS that were slow in the past and are now due to be re-queried based on their NSSpeeds decay, instead of using "live" queries for that probing, at least when the difference in (expected) response time is huge.
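To make the idea a bit more concrete, here is a rough sketch in Python. It is not Recursor's actual code and does not claim to match how NSSpeeds is implemented. It assumes a simple model where each server's RTT estimate halves every `DECAY_HALF_LIFE_S` seconds without a new measurement, and all names and thresholds (`NSStat`, `pick`, `PROBE_GAP_MS`, ...) are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical tuning values, not taken from Recursor.
DECAY_HALF_LIFE_S = 60.0  # assumed: estimate halves after this long without a sample
PROBE_GAP_MS = 20.0       # assumed: what counts as a "huge" RTT difference


@dataclass
class NSStat:
    rtt_ms: float      # last measured (smoothed) RTT for this server
    updated_at: float  # time of that measurement, in seconds


def decayed(stat: NSStat, now: float) -> float:
    """RTT estimate after applying the assumed exponential decay."""
    age = max(0.0, now - stat.updated_at)
    return stat.rtt_ms * 0.5 ** (age / DECAY_HALF_LIFE_S)


def pick(stats: dict, now: float):
    """Return (server for the live query, servers to probe in the background).

    Plain decay-based selection would send the live query to the server with
    the lowest decayed estimate. If that server was last measured as much
    slower than the best-measured one, keep the live query on the fast server
    and hand the slow one to a background prober instead.
    """
    by_decayed = min(stats, key=lambda ns: decayed(stats[ns], now))
    by_measured = min(stats, key=lambda ns: stats[ns].rtt_ms)
    gap = stats[by_decayed].rtt_ms - stats[by_measured].rtt_ms
    if by_decayed != by_measured and gap >= PROBE_GAP_MS:
        return by_measured, [by_decayed]
    return by_decayed, []


def record(stats: dict, ns: str, rtt_ms: float, now: float) -> None:
    """Store a fresh measurement, whether from a live query or a probe."""
    stats[ns] = NSStat(rtt_ms, now)
```

With a fast server measured at 5 ms just now and a slow one measured at 60 ms five half-lives ago, `pick` keeps the live query on the fast server and returns the slow one as a probe candidate. If the two servers differ by less than `PROBE_GAP_MS`, it falls back to the plain decay-based choice, so cheap re-probing with live queries still happens where it does not hurt much.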
A rather random example domain is nic.is, which seems to have 4 authoritative NS: two seem to be in Iceland with a 60+ ms RTT and two seem to be anycasted at 4-5 ms. We'd ideally expect all queries to come back in 4-5 ms, not some random ones in 60+ ms.
Another case we currently have: we see routing issues to AWS nameserver subnets, affecting both IPv4 and IPv6, where the closest exit for these subnets is currently in Brazil at 160+ ms RTT while other anycast instances of their NS are nearby at just 1 ms. As they use some CNAME chaining, we even seem to have some rare queries hitting the 160+ ms trap twice. Having both IPv4 and IPv6 enabled seems to further increase the probability of hitting that trap, as the NSSpeeds cache is per IP/protocol. A quick (and not in any way scientifically correct) comparison of traffic to the four different NS IPv4 subnets for one target FQDN shows about 60% of queries going to the fastest NS subnet, 8% to the slowest, and the remainder to the other two NS subnets, which are at least 11-15 ms slower than the fastest. Caching of course helps, but unfortunately some of the affected FQDNs on the AWS NS have a TTL of just 5 seconds.
I think there might be some room for improvement 😉
Thanks!