Add support for using DNSNames instead of raw IPs for IMEX daemons#433

Merged
klueska merged 3 commits into kubernetes-sigs:main from klueska:imex-with-hostnames
Sep 5, 2025

Conversation

@klueska
Contributor

@klueska klueska commented Jul 16, 2025

This patch lays the foundation for eventually resolving the following issues:

It does this by leveraging the new "hostnames" feature introduced in newer versions of the IMEX daemon. This "hostnames" feature allows one to list a set of DNS-resolvable names in the nodes_config.cfg file read by an IMEX daemon. In doing so, we are able to provide a level of indirection between each individual IMEX daemon and how it connects to other IMEX daemons in the same IMEX domain.

So long as all of the IMEX daemons agree on the same set of DNS names in their nodes_config.cfg files, we can dynamically change which IMEX daemon each of these names resolves to, thus providing natural support for things like (1) failing over worker pods to other nodes, (2) replacing nodes on node failures, and (3) scaling workloads to include more or fewer workers over time.
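For illustration, the "hostnames" feature enables a nodes_config.cfg along these lines. This is a hypothetical sketch: the exact file syntax and the daemon names shown are assumptions, not taken from this PR.

```
# nodes_config.cfg (hypothetical sketch; exact syntax depends on the
# IMEX daemon version). One DNS-resolvable name per line, instead of
# one raw IP per line:
compute-domain-daemon-0
compute-domain-daemon-1
compute-domain-daemon-2
compute-domain-daemon-3
```

As long as every daemon in the domain is handed this same list, each name can be re-pointed to a different IP without touching the config file itself.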

The only feature directly enabled by this PR is #348. The others are not directly addressed, but this PR does lay the foundation for resolving each of them in the future.

Fixes: #348

Test run to verify it's working as expected:

#!/bin/bash

MAX_ITER=100
TIMEOUT=300  # 5 minutes in seconds
WORKER_DELETE_DELAY=15 # seconds
DAEMON_DELETE_DELAY=30 # seconds

for ((i=1; i<=MAX_ITER; i++)); do
    echo "Starting iteration $i"

    kubectl apply -f nvbandwidth-test-job.yaml

    # Wait up to TIMEOUT seconds for the pod to complete
    SECONDS=0
    WORKER_DELETED=false
    DAEMON_DELETED=false
    while true; do
        STATUS=$(kubectl get pod -l job-name=nvbandwidth-test-launcher -o jsonpath="{.items[0].status.phase}" 2>/dev/null)

        # Pod succeeded
        if [ "$STATUS" == "Succeeded" ]; then
            echo "Pod completed successfully."
            break

        # Pod failed
        elif [ "$STATUS" == "Failed" ]; then
            echo "Pod failed."
            break

        # Delete worker pod after $WORKER_DELETE_DELAY seconds
        elif [ "$SECONDS" -ge $WORKER_DELETE_DELAY ] && [ "$WORKER_DELETED" = "false" ]; then
                kubectl delete pod nvbandwidth-test-worker-0
                WORKER_DELETED=true

        # Force delete (i.e. with SIGKILL) IMEX daemon pods after $DAEMON_DELETE_DELAY seconds
        elif [ "$SECONDS" -ge $DAEMON_DELETE_DELAY ] && [ "$DAEMON_DELETED" = "false" ]; then
                kubectl delete pod -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --grace-period=0 --force
                DAEMON_DELETED=true

        # Timeout reached
        elif [ "$SECONDS" -ge $TIMEOUT ]; then
            echo "Timeout reached ($TIMEOUT seconds). Exiting loop."
            exit 1
        fi

        sleep 5
    done

    kubectl delete -f nvbandwidth-test-job.yaml
done

echo "Finished $MAX_ITER iterations successfully."

@copy-pr-bot

copy-pr-bot Bot commented Jul 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@klueska klueska force-pushed the imex-with-hostnames branch 5 times, most recently from 0a2bc66 to 03b41e2 Compare July 16, 2025 22:57
// might be pointless. Terminate us.
return fmt.Errorf("error (re)starting IMEX daemon: %w", err)
if err := processManager.EnsureStarted(); err != nil {
return fmt.Errorf("failed to ensure IMEX daemon is started: %w", err)
Contributor


What I write next is probably premature before we have talked synchronously. But I already want to leave a thought / questions here that I have on my mind right now.

Does the daemon watch for node config file changes?

So far, I thought that here (after an update) we have to

  1. send a signal to the child process to either make it "re-initiate all TCP connections" or "re-read the config" (and have it decide itself if it wants to (re-)initiate some TCP connections (I choose this wording because as the first step of a TCP connect() there's DNS (re)resolution (hopefully bypassing any cache (I've never nested parentheses in normal text so deeply so I thought I add another layer just for fun))))
  2. or (as before) bluntly restart the child.

And zooming out a little further, in your thinking you are probably carefully distinguishing between these cases:

  • 'growing' (new DNS name in pool)
  • 'shrinking' (DNS name removal from pool)
  • 'changing' (old DNS name pointing to a different IP address)
  • 'not changing' (nothing changed)

And of course these are not all equally difficult to support. I assume your experiment / PoC'ing focused on just one of them?

Contributor


About:

I thought that here (after an update) we have to send a signal to the child process to either make it "re-initiate all TCP connections"

Answered and discussed here: https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/433/files#r2318598969

I assume your experiment / poccing focused on just one of them?

Given the test shared by @klueska I believe it's fair to say we are confirming here mainly

  • 'changing' (old DNS name pointing to a different IP address)

and maybe implicitly also

  • 'not changing' (nothing changed)

@klueska klueska self-assigned this Aug 13, 2025
@klueska klueska added this to the v25.12.0 milestone Aug 13, 2025
@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Aug 13, 2025
@klueska klueska modified the milestones: v25.12.0, v25.8.0 Aug 29, 2025
@klueska klueska force-pushed the imex-with-hostnames branch 5 times, most recently from 32d6a21 to 7db9d06 Compare August 29, 2025 17:30
@klueska
Contributor Author

klueska commented Aug 29, 2025

In its current form, this only works if the rescheduled pods reland on the same node...

@klueska klueska force-pushed the imex-with-hostnames branch 2 times, most recently from 16cea5a to ff87ce4 Compare August 30, 2025 07:40
@klueska klueska changed the title Add support for using hostnames instead of raw IPs for imex daemons Add support for using DNSNames instead of raw IPs for IMEX daemons Aug 30, 2025
@klueska
Contributor Author

klueska commented Aug 30, 2025

In its current form, this only works if the rescheduled pods reland on the same node...

The latest change fixes this issue.

It does this by ensuring a consistent dnsname --> IMEXDaemonIP mapping on all nodes in the IMEX daemon ensemble. Without this, the IMEX daemons were getting confused if the same DNS name was used by different nodes to point to different IMEX daemons in the ensemble.

And it makes sense -- normally DNS names do provide such a global mapping. I was just abusing things by only maintaining a local mapping without any regard for how that mapping was done on other nodes.
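A minimal sketch of one way to get such a globally consistent mapping: render the /etc/hosts entries from a shared name-to-IP mapping, sorted by DNS name, so that every node starting from the same mapping emits byte-identical content. The function and daemon names here are illustrative assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildHostsEntries renders a deterministic /etc/hosts block from a
// DNS-name -> IP mapping. Sorting by DNS name guarantees that every
// node in the ensemble that starts from the same mapping produces
// identical output, regardless of map iteration order.
func buildHostsEntries(nameToIP map[string]string) string {
	names := make([]string, 0, len(nameToIP))
	for name := range nameToIP {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s\t%s\n", nameToIP[name], name)
	}
	return b.String()
}

func main() {
	// Hypothetical daemon names and IPs, for illustration only.
	mapping := map[string]string{
		"compute-domain-daemon-1": "10.0.0.12",
		"compute-domain-daemon-0": "10.0.0.7",
	}
	fmt.Print(buildHostsEntries(mapping))
}
```

Writing this rendered block into /etc/hosts on every node (rather than each node inventing its own local name assignments) is the design point the comment above is making: the mapping must be global, not per-node.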

@klueska klueska force-pushed the imex-with-hostnames branch 5 times, most recently from fb0b257 to 948b741 Compare August 31, 2025 20:29
@klueska
Contributor Author

klueska commented Aug 31, 2025

I encountered a number of issues while stress testing this PR that resulted in a number of bug fixes and overall improvements that will need to be split out into separate PRs.

I'm happy to say though, that with the current branch, on a 4-node GB200 setup, I am able to run the following test to completion for all 100 iterations (where rescheduled workers sometimes land on the same node, sometimes on different nodes):

#!/bin/bash

MAX_ITER=100
TIMEOUT=300  # 5 minutes in seconds
WORKER_DELETE_DELAY=15 # seconds
DAEMON_DELETE_DELAY=30 # seconds

for ((i=1; i<=MAX_ITER; i++)); do
    echo "Starting iteration $i"

    kubectl apply -f nvbandwidth-test-job.yaml

    # Wait up to TIMEOUT seconds for the pod to complete
    SECONDS=0
    WORKER_DELETED=false
    DAEMON_DELETED=false
    while true; do
        STATUS=$(kubectl get pod -l job-name=nvbandwidth-test-launcher -o jsonpath="{.items[0].status.phase}" 2>/dev/null)

        # Pod succeeded
        if [ "$STATUS" == "Succeeded" ]; then
            echo "Pod completed successfully."
            break

        # Pod failed
        elif [ "$STATUS" == "Failed" ]; then
            echo "Pod failed."
            break

        # Delete worker pod after $WORKER_DELETE_DELAY seconds
        elif [ "$SECONDS" -ge $WORKER_DELETE_DELAY ] && [ "$WORKER_DELETED" = "false" ]; then
                kubectl delete pod nvbandwidth-test-worker-0
                WORKER_DELETED=true

        # Force delete (i.e. with SIGKILL) IMEX daemon pods after $DAEMON_DELETE_DELAY seconds
        elif [ "$SECONDS" -ge $DAEMON_DELETE_DELAY ] && [ "$DAEMON_DELETED" = "false" ]; then
                kubectl delete pod -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --grace-period=0 --force
                DAEMON_DELETED=true

        # Timeout reached
        elif [ "$SECONDS" -ge $TIMEOUT ]; then
            echo "Timeout reached ($TIMEOUT seconds). Exiting loop."
            exit 1
        fi

        sleep 5
    done

    kubectl delete -f nvbandwidth-test-job.yaml
done

echo "Finished $MAX_ITER iterations successfully."

Comment thread cmd/compute-domain-daemon/main.go Outdated
}

// IMEXDaemonUpdateLoopWithDNSNames reacts to ComputeDomain status changes by updating the
// /etc/hosts file with IP to DNS name mappings and restarting the IMEX daemon.
Contributor

@jgehrcke jgehrcke Sep 3, 2025


Oversight, this implies we restart the daemon upon every /etc/hosts file update:

[...] by updating the /etc/hosts file with IP to DNS name mappings and restarting the IMEX daemon.

Let's maybe change this to:

[...] by updating the /etc/hosts file with IP to DNS name mappings. Relies on the IMEX daemon to pick that change up automatically and quickly -- which it seems to do via grpc-based health-checking of individual connections. Restart the IMEX daemon upon expected and unexpected crashes.

Connection monitoring (including automatic re-connects involving DNS re-resolution) is implemented in the grpc libraries used by the IMEX daemon: for me, this is currently the best explanation for why something here works at all (without us actively triggering the daemon to do something).

We see things working, and hence I am excited about using this. To be clear, we should not iterate on this approach further in this PR, because we have something working now, and that's great.

I want to explain here, however, that I am rather certain that we will have to revisit this approach in the future. Specifically, to

  • optimize: take control of and minimize fail-over duration
  • fix: missed fail-over scenarios

The core argument for why this "passive" approach might not be sufficient in the long run: for certain and also predictably fast fail-over, we need certain and also predictably fast TCP connection re-establishment involving DNS name re-resolution.

In that last sentence, I refer to TCP connections used between IMEX daemons, in the IMEX daemon ensemble.

Here's why I see that the current approach is not really giving that, or it's yielding that rather indirectly (hard-ish to predict what exactly may happen in prod):

  1. A TCP connection's lifetime can be arbitrarily long.
  2. During connection lifetime, DNS is not used, DNS names don't play a role. DNS names only play a role right before issuing a TCP connect() syscall.
  3. Even health checks on an established (long-lived) TCP connection would not care about DNS mapping changes.
  4. The only way to make a long-lived TCP connection (and hence the grpc conn sitting on top of it) fail is to make one of the remote ends go away (clean socket shutdown, which can be fast, or unclean socket disappearance -- hitting a timeout criterion).
  5. Without us actively doing something here (restarting a process, sending a signal, ...) we never actively take control of tearing down a TCP connection.

That explains why this passive approach entails the potential for wasting time or even for transitioning permanently into a state we cannot recover from, and why an active approach is generally desirable.
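A rough sketch of what an "active" approach could look like: restart the child process whenever the rendered hosts mapping changes, so all of its TCP connections are torn down and re-established (with fresh DNS resolution). Everything here is an assumption for illustration -- the stand-in `sleep` child and the type/method names are not the driver's actual process manager.

```go
package main

import (
	"fmt"
	"os/exec"
)

// supervisor is a hypothetical, minimal process manager: it tracks the
// last hosts-file content it acted on, and restarts its child whenever
// that content changes.
type supervisor struct {
	child *exec.Cmd
	prev  string
}

// ensure (re)starts the child if the rendered hosts content changed.
// Killing the child guarantees its sockets die, forcing peers to
// reconnect and re-resolve DNS names on the next connect().
func (s *supervisor) ensure(hosts string) (restarted bool, err error) {
	if s.child != nil && hosts == s.prev {
		return false, nil // nothing changed; leave connections alone
	}
	if s.child != nil {
		_ = s.child.Process.Kill() // unclean, but all sockets go away
		_, _ = s.child.Process.Wait()
	}
	cmd := exec.Command("sleep", "3600") // stand-in for the IMEX daemon
	if err := cmd.Start(); err != nil {
		return false, err
	}
	s.child = cmd
	s.prev = hosts
	return true, nil
}

func main() {
	var s supervisor
	r1, _ := s.ensure("10.0.0.7 daemon-0\n")
	r2, _ := s.ensure("10.0.0.7 daemon-0\n") // unchanged: no restart
	r3, _ := s.ensure("10.0.0.9 daemon-0\n") // changed: restart
	fmt.Println(r1, r2, r3)                  // prints: true false true
	if s.child != nil {
		_ = s.child.Process.Kill()
	}
}
```

Compared to the passive approach, this trades gratuitous restarts on every mapping change for deterministic, bounded fail-over: connection teardown happens when we decide, not when a health check or TCP timeout eventually notices.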

Again, this is me writing down thoughts that I know will be relevant in the future, not me trying to get us to re-think what we do here in this patch.

Contributor Author


Updated the comment. Will revisit the points made in the rest of the discussion later.

@klueska klueska force-pushed the imex-with-hostnames branch 4 times, most recently from 972cd01 to 44d2462 Compare September 3, 2025 22:02
Comment thread cmd/compute-domain-daemon/main.go
Comment thread cmd/compute-domain-daemon/main.go
Comment thread cmd/compute-domain-daemon/main.go
Comment thread cmd/compute-domain-daemon/main.go
Comment thread cmd/compute-domain-daemon/dnsnames.go
Comment thread cmd/compute-domain-daemon/dnsnames.go Outdated
Comment thread cmd/compute-domain-daemon/dnsnames.go Outdated
Comment thread cmd/compute-domain-daemon/dnsnames.go
Comment thread cmd/compute-domain-daemon/dnsnames.go
Comment thread cmd/compute-domain-daemon/computedomain.go
Comment thread cmd/compute-domain-daemon/dnsnames.go Outdated
Comment thread cmd/compute-domain-daemon/dnsnames.go
@klueska klueska force-pushed the imex-with-hostnames branch 3 times, most recently from 0bc7f08 to be1170e Compare September 4, 2025 13:31
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Without this, the IMEX daemons were getting confused if the same DNS name
was used by different nodes to point to different IMEX daemons in
the ensemble.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@klueska klueska force-pushed the imex-with-hostnames branch from be1170e to 9715887 Compare September 5, 2025 11:08
Contributor

@jgehrcke jgehrcke left a comment


Exciting! After lots of discussion and iteration: let's get this in 🚀

@klueska klueska merged commit e850619 into kubernetes-sigs:main Sep 5, 2025
14 checks passed
@klueska klueska linked an issue Oct 17, 2025 that may be closed by this pull request

Labels

feature issue/PR that proposes a new feature or functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • ComputeDomain: add support for elastic workloads
  • ComputeDomain: add support for node failure (follow workload as it is re-scheduled)

2 participants