Refactor aro-dnsmasq-pre.sh to not overwrite /etc/resolv.conf #4100

Open
wants to merge 5 commits into
base: master

ventifus
Collaborator

@ventifus ventifus commented Feb 12, 2025

Which issue this PR addresses:

Fixes ARO-15180
Companion installer PR openshift/installer-aro-wrapper#255

Derived from the method in https://github.com/openshift/machine-config-operator/blob/master/templates/common/gcp/files/usr-local-bin-update-dns-server.yaml

What this PR does / why we need it:

We've been overwriting /etc/resolv.conf. NetworkManager owns this file, so whenever NetworkManager regenerates it we lose our changes. Instead, create a NetworkManager drop-in, /etc/NetworkManager/conf.d/dns-servers.conf, that sets the node's own IP as the DNS server.
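
As a rough illustration of the drop-in approach described above, the following sketch writes the file atomically. The helper name `write_dns_dropin` and its arguments are hypothetical and are not taken from this PR's script; only the file path, the header comment, and the `[global-dns-domain-*]` section come from the conversation.

```shell
#!/bin/bash
# Hypothetical helper (not the PR's actual code): write a NetworkManager
# drop-in that points global DNS at the node's own IP.
#   $1 = node IP address
#   $2 = target path (defaults to the path named in this PR)
write_dns_dropin() {
    local node_ip="$1"
    local dropin="${2:-/etc/NetworkManager/conf.d/dns-servers.conf}"
    local tmp

    # Create the temp file next to the target so the final mv is a rename
    # on the same filesystem.
    tmp="$(mktemp "${dropin}.XXXXXX")" || return 1
    {
        echo "# Added by dnsmasq.service"
        echo "[global-dns-domain-*]"
        echo "servers=${node_ip}"
    } > "$tmp"
    chmod 0644 "$tmp"
    mv -f "$tmp" "$dropin"
}
```

NetworkManager still has to re-read its configuration before the drop-in takes effect (for example via `nmcli general reload`), and on an SELinux-enforcing node the file may need `restorecon`; both steps are left out of the sketch.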

Test plan for issue:

Is there any documentation that needs to be updated for this PR?

No, but the change needs to be socialized amongst ARO SRE since it affects how nameservers are managed.

How do you know this will function as expected in production?

Tested on an existing UDR cluster with misconfigured DNS to confirm there are no external DNS dependencies. Nodes boot and scale correctly.

@ventifus
Collaborator Author

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

Contributor

@kimorris27 kimorris27 left a comment

What testing do you think we should do before this merges?

@tsatam
Collaborator

tsatam commented Feb 13, 2025

Do we need to make a corresponding change to what the installer puts down? My understanding is that for new clusters, the changes in our operator won't get applied until the cluster is first upgraded, and the cluster will run with what the installer has until then.

@hawkowl
Collaborator

hawkowl commented Feb 14, 2025

@tsatam When the Operator is first installed, it is set to allow all reconciliations; at the end of the install process that is switched to reconcile only on upgrades. So this will apply to new clusters, at the cost of a reboot plus extra install time, which is why we should also update the installer wrapper.

@ventifus ventifus force-pushed the ventifus/ARO-15180-networkmanager-dns-servers branch from 830431a to 2c2ca5c Compare February 15, 2025 00:14
@ventifus
Collaborator Author

ventifus commented Feb 15, 2025

I've tested this now in a UDR + misconfigured DNS cluster (vnet dns = 172.16.0.0). I set aro.dnsmasq.enabled: "false" and manually edited 99-master-aro-dns and 99-worker-aro-dns to have the new content.

After all nodes roll out, they have the following config, which looks good:

sh-5.1# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 10.0.0.7
sh-5.1# cat /etc/resolv.conf.dnsmasq
# Generated by NetworkManager
search reddog.microsoft.com
nameserver 172.16.0.0
sh-5.1# cat /etc/NetworkManager/conf.d/dns-servers.conf
# Added by dnsmasq.service
[global-dns-domain-*]
servers=10.0.0.7

I made sure all the cluster operators were healthy, and worker machinesets can scale up.

N.B. even with this change we still end up touching /etc/resolv.conf with dnsmasq.service's

ExecStopPost=/bin/bash -c '/bin/mv /etc/resolv.conf.dnsmasq /etc/resolv.conf; /usr/sbin/restorecon /etc/resolv.conf'

I'll fix that up to delete dns-servers.conf instead.
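
A minimal sketch of what such a cleanup step might look like, assuming it is called from `ExecStopPost`. The helper name `remove_dns_dropin` is hypothetical, and whether the final unit reloads NetworkManager at this point is an assumption, not something stated in this PR.

```shell
#!/bin/bash
# Hypothetical cleanup helper (an assumption, not the PR's final code):
# remove the drop-in instead of moving files over /etc/resolv.conf.
#   $1 = drop-in path (defaults to the path named in this PR)
remove_dns_dropin() {
    local dropin="${1:-/etc/NetworkManager/conf.d/dns-servers.conf}"

    # rm -f succeeds even if the file is already gone, so repeated stops are safe.
    rm -f "$dropin"

    # Ask NetworkManager to re-read its configuration when nmcli is available.
    # A failed reload is ignored so the stop sequence is not blocked.
    if command -v nmcli >/dev/null 2>&1; then
        nmcli general reload >/dev/null 2>&1 || true
    fi
}
```

With the drop-in gone, NetworkManager regenerates /etc/resolv.conf from its own connection data, so nothing outside NetworkManager writes to that file.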

@ventifus
Collaborator Author

One concern with this: we're losing the search domain. Depending on pod configuration, the node's search list can be exposed to user workloads, so they may depend on it. I'll see whether we can preserve it.

@ventifus
Collaborator Author

OK, I've restored the search domain by adding a [global-dns] section:

sh-5.1# cat /etc/resolv.conf
# Generated by NetworkManager
search reddog.microsoft.com
nameserver 10.0.2.7
sh-5.1# cat /etc/resolv.conf.dnsmasq
# Generated by NetworkManager
search reddog.microsoft.com
nameserver 172.16.0.0
sh-5.1# cat /etc/NetworkManager/conf.d/dns-servers.conf
# Added by dnsmasq.service
[global-dns]
searches=reddog.microsoft.com

[global-dns-domain-*]
servers=10.0.2.7
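
For illustration, the drop-in shown above could be rendered by something like the following. The function name `render_dns_dropin` and its argument order are hypothetical; the sketch assumes that several servers or search domains, if present, are passed already comma-separated.

```shell
#!/bin/bash
# Hypothetical renderer (not the PR's actual code): print a drop-in with an
# optional [global-dns] search list followed by the wildcard domain section.
#   $1 = DNS server list (comma-separated)
#   $2 = search domain list (comma-separated, optional)
render_dns_dropin() {
    local servers="$1"
    local searches="${2:-}"

    echo "# Added by dnsmasq.service"
    # Only emit [global-dns] when there is a search list to preserve.
    if [ -n "$searches" ]; then
        echo "[global-dns]"
        echo "searches=${searches}"
        echo
    fi
    echo "[global-dns-domain-*]"
    echo "servers=${servers}"
}

# Reproduces the file contents shown in this comment.
render_dns_dropin 10.0.2.7 reddog.microsoft.com
```

Omitting the second argument yields the earlier servers-only variant of the file.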

@ventifus
Collaborator Author

Companion installer PR openshift/installer-aro-wrapper#255

echo "$LOCAL_IPS_RAW" | while read -r line
do
echo "nameserver $line" | cut -d'/' -f 1 >> $TMPSELFRESOLV
done

Unless you were having trouble with the code to retrieve the search domains and IP addresses, I'd be tempted to keep it the same.

Collaborator Author

We do have problems with the existing code. It guesses which network interface to use based on whether the br-ex interface exists. We've seen a number of cases where this fails, particularly when the service startup order changes.

Collaborator Author

@zaneb Another approach to finding the search domain

for DEV in $(nmcli --fields device,state,type --terse device | awk 'BEGIN {FS=":"} ; {if ($2 == "connected") { print $1 }}'); do
  nmcli dev show "$DEV" | awk 'BEGIN {FS=":\\s*"}; { if ($1 ~ /DOMAIN/ && $2 ~ /.+/) { print $2} }'
done | sort -u

Fair enough, non-determinism is definitely not what you want here 😄

Looping over all interfaces doesn't look that bad though.
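
One way the per-interface loop discussed in this thread might be condensed is to parse the output for all devices in one pass. The sketch below is a variation on the idea, not the code from this PR: `extract_search_domains` is a hypothetical name, it reads `nmcli device show`-style text on stdin so it can be exercised without NetworkManager, and unlike the loop above it does not filter on connection state.

```shell
#!/bin/bash
# Hypothetical helper: read "KEY:   value" lines (the layout printed by
# `nmcli device show`) on stdin and print the unique DOMAIN values as one
# comma-separated line.
extract_search_domains() {
    # Split on the first colon plus any following whitespace, keep only
    # non-empty DOMAIN entries, de-duplicate, then join with commas.
    awk -F':[[:space:]]*' '$1 ~ /DOMAIN/ && $2 != "" { print $2 }' \
        | sort -u \
        | paste -sd, -
}

# Intended usage on a node (not executed here):
#   nmcli device show | extract_search_domains
```

Because the result no longer depends on picking a single interface, it sidesteps the ordering problem with br-ex described earlier in the thread.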

5 participants