WVA Stack Deployment Script #416

Open
dumb0002 wants to merge 2 commits into llm-d-incubation:main from dumb0002:wva-fma
Conversation


@dumb0002 dumb0002 commented Apr 9, 2026

This PR provides a script to deploy the WVA Stack.

What the Script Does:
The deploy-wva-stack.sh script provides a comprehensive deployment automation tool that:

  • Clones WVA Repository: Fetches the official WVA repository from GitHub (configurable branch)
  • Manages Kind Clusters: Creates/deletes Kind clusters with emulated GPU resources for testing
  • Deploys Full WVA Stack: Orchestrates deployment of:
    • WVA Controller (workload variant autoscaling)
    • llm-d Infrastructure (LLM deployment infrastructure)
    • Prometheus & Prometheus Adapter (monitoring and metrics)
    • Optional: HorizontalPodAutoscaler (HPA)
    • Optional: VariantAutoscaling (VA)
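For reference, invocations like the ones exercised later in this review (the `--create-kind`, `--llmd-only`, and `--with-hpa` flags and the `KIND_CLUSTER_NODES` variable all appear in the thread; combining `--create-kind` alone for a full-stack run is an assumption):

```shell
# Full WVA stack on a freshly created Kind cluster with emulated GPUs
./deploy-wva-stack.sh --create-kind

# llm-d infrastructure only, with HPA, on a 2-node Kind cluster
KIND_CLUSTER_NODES=2 ./deploy-wva-stack.sh --create-kind --llmd-only --with-hpa
```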

Signed-off-by: dumb0002 <Braulio.Dumba@ibm.com>

@diegocastanibm diegocastanibm left a comment


Error while creating the cluster:

Creating cluster "kind-wva-gpu-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.32.0) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✗ Joining worker nodes 🚜
Deleted nodes: ["kind-wva-gpu-cluster-control-plane" "kind-wva-gpu-cluster-worker" "kind-wva-gpu-cluster-worker2"]
ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged kind-wva-gpu-cluster-worker kubeadm join --config /kind/kubeadm.conf --v=6" failed with error: exit status 1
Command Output: I0410 19:07:05.433916     137 join.go:421] [preflight] found NodeName empty; using OS hostname as NodeName
I0410 19:07:05.434093     137 joinconfiguration.go:83] loading configuration from "/kind/kubeadm.conf"
W0410 19:07:05.434523     137 common.go:101] your configuration file uses a deprecated API spec: "kubeadm.k8s.io/v1beta3" (kind: "JoinConfiguration"). Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.

@diegocastanibm
Collaborator

If I reduce the number of nodes to 2, the kind cluster is created, but then I have another error:

$ KIND_CLUSTER_NODES=2 ./deploy-wva-stack.sh --create-kind --llmd-only --with-hpa
Install Script Output:
  ==========================================
  [INFO] Starting Workload-Variant-Autoscaler Deployment on kind-emulator
  [INFO] ===========================================================

  [INFO] Checking prerequisites...
  [SUCCESS] All generic prerequisites tools met
  [INFO] Setting TLS verification...
  [INFO] Emulated environment detected - enabling TLS skip verification for self-signed certificates
  [SUCCESS] Successfully set TLS verification to: true
  [INFO] Setting WVA logging level...
  [INFO] Development environment - using debug logging
  [SUCCESS] WVA logging level set to: debug

  [INFO] Loading environment-specific functions for kind-emulator...
  [INFO] Checking Kubernetes-specific prerequisites...
  [INFO] Cluster creation skipped (CREATE_CLUSTER=false)
  [SUCCESS] Using KIND cluster 'kind-wva-gpu-cluster'
  [INFO] Loading WVA image 'ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest' into KIND cluster...
  [INFO] Pulling single-platform image for KIND (platform=linux/arm64) to avoid load errors...
  Error response from daemon: Head "https://ghcr.io/v2/llm-d/llm-d-workload-variant-autoscaler/manifests/latest": denied:
  denied
  [WARNING] Failed to pull image 'ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest' (platform=linux/arm64)
  [INFO] Attempting to use existing local image...
  [ERROR] Image 'ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest' not found locally - Please build or pull the image

  ==========================================
  [ERROR] Install script failed with exit code: 0

Why do I need ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest if I'm using the --llmd-only option?

@dumb0002
Collaborator Author

> Why do I need ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest if I'm using the --llmd-only option?

@diegocastanibm, this is the current behavior of the installation script from the wva repo - it always loads all images as part of the prerequisites steps.

@dumb0002
Collaborator Author

> Error while creating the cluster:
> ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged kind-wva-gpu-cluster-worker kubeadm join --config /kind/kubeadm.conf --v=6" failed with error: exit status 1

@diegocastanibm, this could be due to resource constraints in your environment.

@diegocastanibm
Collaborator

> this is the current behavior of the installation script from the wva repo - it always loads all images as part of the prerequisites steps.

Maybe the behaviour needs to change. I do not see the need for loading the WVA image in this case. What's the point?

@dumb0002
Collaborator Author

dumb0002 commented Apr 13, 2026

> Maybe the behaviour needs to change. I do not see the need of loading WVA in this case. What's the point?

@diegocastanibm, any change would require opening a PR in the WVA upstream repo, since we're just reusing their scripts. IMO, from their use case it makes sense to always load the WVA image, as that's the main goal of their installation script.


@diegocastanibm diegocastanibm left a comment


Critical points:

1- As I pointed out during the deployment, when using --llmd-only --kind, the deployment fails because the upstream WVA install script (deploy/kind-emulator/install.sh, line 131) calls load_image() unconditionally inside check_prerequisites_kind_emulated(), regardless of the DEPLOY_WVA flag. This means it always tries to pull ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest, which requires ghcr.io authentication and isn't needed when WVA is not being deployed.

This is an upstream bug in the WVA repo, but this script should handle it — either by documenting the limitation, adding a workaround (e.g., exporting WVA_IMAGE_PULL_POLICY=IfNotPresent with a dummy image), or opening an upstream issue.
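One possible shape for the workaround mentioned above, sketched with assumed names (`seed_wva_image` and the `DEPLOY_WVA` variable are illustrative, not upstream identifiers): pre-seed a local placeholder tag under the WVA image name so the upstream unconditional `load_image()` finds it without needing ghcr.io authentication.

```shell
# Hypothetical workaround sketch: when WVA itself is skipped (--llmd-only),
# tag a tiny placeholder image under the WVA name so the upstream
# check_prerequisites_kind_emulated() image load succeeds without ghcr.io auth.
WVA_IMAGE="ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest"

seed_wva_image() {
  # Only needed when WVA is not being deployed and the image is absent locally.
  if [ "${DEPLOY_WVA:-true}" = "true" ]; then
    return 0
  fi
  if ! docker image inspect "$WVA_IMAGE" >/dev/null 2>&1; then
    docker pull busybox:latest
    docker tag busybox:latest "$WVA_IMAGE"  # stand-in only; never actually run
  fi
}
```

Opening an upstream issue so `load_image()` honors the deploy flags would still be the cleaner fix.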

2- The upstream WVA install script (deploy/install.sh, line 50) defaults to LLM_D_RELEASE=v0.3.0, but the latest llm-d release is v0.6.0. This deploys a significantly outdated version of llm-d. There is also an inconsistency within the WVA defaults themselves: the inference-scheduler image is pinned to v0.7.0 while the llm-d repo is cloned at v0.3.0.

This script should either export LLM_D_RELEASE to a more recent version, or at minimum document this version gap and how users can override it (e.g., export LLM_D_RELEASE=v0.6.0).
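A minimal way for the wrapper to carry a newer default while still letting users pin a version (the `v0.6.0` value is the release suggested above; treating `LLM_D_RELEASE` as user-overridable in the wrapper is an assumption):

```shell
# Default llm-d to a newer release, but let a caller's prior export win.
: "${LLM_D_RELEASE:=v0.6.0}"
export LLM_D_RELEASE
echo "Deploying llm-d release: $LLM_D_RELEASE"
```

A user could still pin the old behavior with, e.g., `LLM_D_RELEASE=v0.3.0 ./deploy-wva-stack.sh ...`.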

3- Similarly to the llm-d version issue above, the modelservice image version used by the upstream WVA deploy scripts is v0.2.11, while the current release of llm-d-modelservice is v0.4.11. This is a significant version gap that may cause compatibility issues or missing features during testing.

4- The llm-d-sim (inference simulator used in Kind emulator environments) is currently at v0.8.2, but the upstream WVA deploy scripts use an older version. Since the Kind emulator environment relies on llm-d-sim instead of real model serving, running an outdated simulator version could mask bugs or miss behavioral changes that are present in the current release.

Comment on lines +377 to +388
# Run the cleanup function from WVA repository
log_info "Running WVA cleanup function..."
echo ""
echo "=========================================="
echo "WVA Cleanup Output:"
echo "=========================================="

# Disable exit on error temporarily to capture cleanup result
set +e
cleanup
local cleanup_exit_code=$?
set -e

The cleanup is ALWAYS executed. It should be executed ONLY if nothing fails; otherwise it is difficult to debug the WVA scripts when something fails.
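One way to get that behavior, sketched with assumed names (`deploy_ok` and `on_exit` are illustrative): gate cleanup behind a success flag that is only set once deployment completes, so a failed run leaves the stack in place for inspection.

```shell
# Sketch: run the WVA cleanup only on success; on failure, skip it so the
# partially deployed stack can be debugged.
deploy_ok=false

cleanup() {  # stand-in for the upstream WVA cleanup function
  echo "cleanup ran"
}

on_exit() {
  if [ "$deploy_ok" = true ]; then
    cleanup
  else
    echo "deployment failed; skipping cleanup for debugging" >&2
  fi
}
trap on_exit EXIT

# ... deployment steps would run here; mark success only at the very end ...
deploy_ok=true
```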

Comment on lines +603 to +613
# Check if this is cleanup-only mode (--cleanup flag without other deployment changes)
# We detect cleanup-only mode by checking if CLEANUP_BEFORE_DEPLOY is true and
# no cluster creation is requested. The deployment flags (DEPLOY_WVA, DEPLOY_LLM_D)
# should be at their default values (both true) when using --cleanup alone.
local is_cleanup_only=false
if [ "$CLEANUP_BEFORE_DEPLOY" = true ] && [ "$CREATE_KIND_CLUSTER" != "true" ]; then
# Check if deployment flags are at defaults (not modified by --wva-only or --llmd-only)
if [ "$DEPLOY_WVA" = "true" ] && [ "$DEPLOY_LLM_D" = "true" ]; then
is_cleanup_only=true
fi
fi

The logic to detect cleanup-only depends on DEPLOY_WVA and DEPLOY_LLM_D keeping their default values. This means that --cleanup --llmd-only does not do a cleanup. We should add an explicit --cleanup-only flag.
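A sketch of what the explicit flag could look like (the `--cleanup-only` flag and `parse_args` helper are assumptions; `--wva-only`/`--llmd-only` mirror the script's existing options):

```shell
# Explicit cleanup-only mode: no inference from DEPLOY_WVA/DEPLOY_LLM_D defaults.
CLEANUP_ONLY=false
DEPLOY_WVA=true
DEPLOY_LLM_D=true

parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --cleanup-only) CLEANUP_ONLY=true ;;  # works regardless of other flags
      --wva-only)     DEPLOY_LLM_D=false ;;
      --llmd-only)    DEPLOY_WVA=false ;;
    esac
    shift
  done
}
```

With this, `--cleanup-only --llmd-only` would clean up exactly the llm-d side without the default-value guessing.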

Comment on lines +478 to +481
echo ""
echo "=========================================="
log_error "Install script failed with exit code: $?"
exit 1

Your $? is capturing the exit code from echo. Better to save it in a variable before the echos:

Suggested change:

-echo ""
-echo "=========================================="
-log_error "Install script failed with exit code: $?"
-exit 1
+local exit_code=$?
+echo ""
+echo "=========================================="
+log_error "Install script failed with exit code: $exit_code"
+exit 1
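The clobbering is easy to reproduce: `$?` reflects only the most recent command, so the intervening `echo`s reset it to 0 — which also explains the "failed with exit code: 0" seen in the logs above.

```shell
false            # exits with status 1
immediate=$?     # captured right away: 1

false
echo "" >/dev/null   # echo succeeds, resetting $? to 0
clobbered=$?         # the failure status is already gone: 0

echo "immediate=$immediate clobbered=$clobbered"
```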
