Skip to content

Updated GELFUDPHandler to not make a DNS query for every log message. #135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

camlee
Copy link

@camlee camlee commented Mar 16, 2022

I ran into performance issues in prod because I'm using GELFUDPHandler with a host that requires a DNS lookup (our graylog server is available at something like monitoring.prod.example.com). This resulted in a DNS lookup for every single log message because the CPython DatagramHandler uses socket.sendto() like this: self.sock.sendto(s, self.address): https://github.com/python/cpython/blob/main/Lib/logging/handlers.py#L732

The production-impacting issue we had was overwhelming our internal DNS server with too much traffic which slowed down logging and brought our web services to a crawl. Even for a healthy DNS server, the time to do a DNS lookup per log message is non-zero and worth saving, particularly for apps that do a lot of logging.

This Pull Request updates GELFUDPHandler to connect the socket to the address ahead of time using socket.connect() and then send the data using socket.sendall(). Using wireshark, I've tested that the DNS lookup occurs during connect() and not sendall(). I've also been running this algorithm in production for months and seen the expected drop in traffic to the DNS server.

To handle DNS changes, the code does make a new socket periodically and there's a sock_max_age parameter to control how often (I've set this to 1 hour in prod but figured 5 minutes was a more reasonable default in the library here).

@nklapste
Copy link
Collaborator

nklapste commented May 15, 2022

Good catch! Having a DNS lookup every message can be a huge slowdown!

Though, I'm hesitant on having this override the default behavior of GELFUDPHandler as its supposed to have connection properties similar to the DatagramHandler. This session-ed GELFUDPHandler should likely be defined in its own subclass.

One thing to consider is that with UDP socket session connections things like NATs, or stateful firewalls may over time may decide to drop the connection and leave the sender and receiver unaware that the socket is now dead. This is likely one of the reasons DatagramHandler uses sendto and re-establishes a connection every message.

If possible, are you able to setup/verify DNS caching on the host running your graypy logger? That would likely reduce the number of DNS lookups to your DNS server while graypy logs multiple messages.

@camlee
Copy link
Author

camlee commented May 15, 2022

I'm on a team that runs multiple deployed apps in a variety of languages and on both virtual machines and within kubernetes. We have indeed setup DNS caching on the host machine (in some cases, on the Linux box, in other cases as a sidecar container within kubernetes). This does work but introduces some deployment complexity. For our Python microservices, we've been using a graypy fork with the code in this PR so that the problem is solved within the app: nice and simple.

As for things like NAT; yes, there is indeed the risk that the connection dies and data is lost. For us in particular, the IP address behind the DNS name for our graylog server rotates every 2 weeks or so. This is why I included a mechanism to refresh the socket, and made it configurable: sock_max_age. We could lower the default from 5 minutes if you think that would be safer. For what it's worth though, we also use pystatsd in our microservices which also sends data out over UDP (metrics instead of logs, but same basic idea) and discovered that it actually resolves the IP address on startup and uses that forever.

I'm submitting this PR to help others who use graypy, particularly those who's systems are growing and aren't going to find out about this DNS issue until it bites them in production. I think the out-of-the-box behavior can and should be efficient. Let me know if I can tweak or modify this PR in a way that makes it more adoptable. A new subclass for this "session-like" handler sounds reasonable to me and I'll make that change if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants