Conversation
Signed-off-by: dumb0002 <Braulio.Dumba@ibm.com>
diegocastanibm
left a comment
Error while creating the cluster:
Creating cluster "kind-wva-gpu-cluster" ...
✓ Ensuring node image (kindest/node:v1.32.0) 🖼
✓ Preparing nodes 📦 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
✗ Joining worker nodes 🚜
Deleted nodes: ["kind-wva-gpu-cluster-control-plane" "kind-wva-gpu-cluster-worker" "kind-wva-gpu-cluster-worker2"]
ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged kind-wva-gpu-cluster-worker kubeadm join --config /kind/kubeadm.conf --v=6" failed with error: exit status 1
Command Output: I0410 19:07:05.433916 137 join.go:421] [preflight] found NodeName empty; using OS hostname as NodeName
I0410 19:07:05.434093 137 joinconfiguration.go:83] loading configuration from "/kind/kubeadm.conf"
W0410 19:07:05.434523 137 common.go:101] your configuration file uses a deprecated API spec: "kubeadm.k8s.io/v1beta3" (kind: "JoinConfiguration"). Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.
If I reduce the number of nodes to 2, the kind cluster is created, but then I have another error: Why do I need
@diegocastanibm, this is the current behavior of the installation script from the wva repo - it always loads all images as part of the prerequisites steps.
@diegocastanibm, this could be due to resource constraints in your environment.
Maybe the behaviour needs to change. I do not see the need for loading WVA in this case. What's the point?
@diegocastanibm, any change would require opening a PR in the WVA upstream repo, since we're just reusing their scripts. IMO, from their use case it makes sense to always load the WVA image, as that's the main goal of their installation script.
diegocastanibm
left a comment
Critical points:
1- As I pointed out during the deployment, when using --llmd-only --kind, the deployment fails because the upstream WVA install script (deploy/kind-emulator/install.sh, line 131) calls load_image() unconditionally inside check_prerequisites_kind_emulated(), regardless of the DEPLOY_WVA flag. This means it always tries to pull ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest, which requires ghcr.io authentication and isn't needed when WVA is not being deployed.
This is an upstream bug in the WVA repo, but this script should handle it — either by documenting the limitation, adding a workaround (e.g., exporting WVA_IMAGE_PULL_POLICY=IfNotPresent with a dummy image), or opening an upstream issue.
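A minimal sketch of the suggested workaround, guarding the image load behind the deploy flag. The function names `load_image` and `check_prerequisites_kind_emulated` mirror the upstream script, but their bodies here are placeholders, not upstream code:

```shell
#!/usr/bin/env bash
# Sketch only: the real upstream functions do much more; the point is
# the guard around load_image, which upstream calls unconditionally.
DEPLOY_WVA="${DEPLOY_WVA:-false}"

load_image() {
  # placeholder for the upstream pull/load of the WVA image
  echo "loading ghcr.io/llm-d/llm-d-workload-variant-autoscaler:latest"
}

check_prerequisites_kind_emulated() {
  # guard the unconditional call so --llmd-only skips ghcr.io auth
  if [ "$DEPLOY_WVA" = "true" ]; then
    load_image
  else
    echo "skipping WVA image load (DEPLOY_WVA=$DEPLOY_WVA)"
  fi
}

check_prerequisites_kind_emulated
```

With `DEPLOY_WVA` unset or false, the image pull is skipped entirely, so no ghcr.io credentials are needed.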
2- The upstream WVA install script (deploy/install.sh, line 50) defaults to LLM_D_RELEASE=v0.3.0, but the latest llm-d release is v0.6.0. This deploys a significantly outdated version of llm-d. There is also an inconsistency within the WVA defaults themselves: the inference-scheduler image is pinned to v0.7.0 while the llm-d repo is cloned at v0.3.0.
This script should either export LLM_D_RELEASE to a more recent version, or at minimum document this version gap and how users can override it (e.g., export LLM_D_RELEASE=v0.6.0).
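A hedged sketch of surfacing that version gap and letting users override it. `LLM_D_RELEASE` is the variable named in the upstream script (deploy/install.sh, line 50); the warning helper itself is an illustration, not upstream code:

```shell
#!/usr/bin/env bash
# Warn when the default pin lags the latest known release; version
# numbers come from the review comment above.
DEFAULT_RELEASE="v0.3.0"   # upstream default pin
LATEST_RELEASE="v0.6.0"    # latest llm-d release at the time of review

resolve_release() {
  # honor an explicit override, else fall back to the upstream default
  echo "${LLM_D_RELEASE:-$DEFAULT_RELEASE}"
}

release=$(resolve_release)
if [ "$release" = "$DEFAULT_RELEASE" ]; then
  echo "WARN: deploying llm-d $release; latest is $LATEST_RELEASE." >&2
  echo "WARN: override with: export LLM_D_RELEASE=$LATEST_RELEASE" >&2
fi
echo "llm-d release: $release"
```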
3- Similarly to the llm-d version issue above, the modelservice image version used by the upstream WVA deploy scripts is v0.2.11, while the current release of llm-d-modelservice is v0.4.11. This is a significant version gap that may cause compatibility issues or missing features during testing.
4- The llm-d-sim (inference simulator used in Kind emulator environments) is currently at v0.8.2, but the upstream WVA deploy scripts use an older version. Since the Kind emulator environment relies on llm-d-sim instead of real model serving, running an outdated simulator version could mask bugs or miss behavioral changes that are present in the current release.
# Run the cleanup function from WVA repository
log_info "Running WVA cleanup function..."
echo ""
echo "=========================================="
echo "WVA Cleanup Output:"
echo "=========================================="

# Disable exit on error temporarily to capture cleanup result
set +e
cleanup
local cleanup_exit_code=$?
set -e
The cleanup is ALWAYS executed. It should be executed ONLY if nothing fails. Otherwise it is difficult to debug the WVA scripts if something fails.
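One way to address this, sketched under the assumption that the install step can be wrapped in a function — `install` and `cleanup` below are placeholders for the real script logic:

```shell
#!/usr/bin/env bash
# Run cleanup only on success, so a failed run leaves the environment
# intact for debugging the WVA scripts.
install() { return 0; }            # placeholder: real install logic
cleanup() { echo "cleanup ran"; }  # placeholder: WVA cleanup function

# Disable exit-on-error just long enough to capture the install result
set +e
install
install_exit_code=$?
set -e

if [ "$install_exit_code" -eq 0 ]; then
  cleanup
else
  echo "install failed (exit $install_exit_code); skipping cleanup so the environment can be inspected" >&2
fi
```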
# Check if this is cleanup-only mode (--cleanup flag without other deployment changes)
# We detect cleanup-only mode by checking if CLEANUP_BEFORE_DEPLOY is true and
# no cluster creation is requested. The deployment flags (DEPLOY_WVA, DEPLOY_LLM_D)
# should be at their default values (both true) when using --cleanup alone.
local is_cleanup_only=false
if [ "$CLEANUP_BEFORE_DEPLOY" = true ] && [ "$CREATE_KIND_CLUSTER" != "true" ]; then
  # Check if deployment flags are at defaults (not modified by --wva-only or --llmd-only)
  if [ "$DEPLOY_WVA" = "true" ] && [ "$DEPLOY_LLM_D" = "true" ]; then
    is_cleanup_only=true
  fi
fi
The logic to detect cleanup-only depends on using the default values for DEPLOY_WVA and DEPLOY_LLM_D. This means that --cleanup --llmd-only does not perform a cleanup. We should add an explicit --cleanup-only flag.
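A sketch of what that explicit flag could look like. The flag names other than `--cleanup-only` come from the script under review; the parser itself is illustrative:

```shell
#!/usr/bin/env bash
# An explicit --cleanup-only flag sets its own variable instead of being
# inferred from DEPLOY_WVA/DEPLOY_LLM_D defaults, so combinations like
# "--cleanup --llmd-only" keep their meaning.
CLEANUP_ONLY=false
CLEANUP_BEFORE_DEPLOY=false
DEPLOY_WVA=true
DEPLOY_LLM_D=true

parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --cleanup)      CLEANUP_BEFORE_DEPLOY=true ;;
      --cleanup-only) CLEANUP_ONLY=true ;;
      --wva-only)     DEPLOY_LLM_D=false ;;
      --llmd-only)    DEPLOY_WVA=false ;;
    esac
    shift
  done
}

# e.g. a user who wants a cleanup of only the llm-d pieces:
parse_args --cleanup-only --llmd-only
echo "CLEANUP_ONLY=$CLEANUP_ONLY DEPLOY_WVA=$DEPLOY_WVA"
```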
echo ""
echo "=========================================="
log_error "Install script failed with exit code: $?"
exit 1
Your $? is capturing the exit code from echo. Better to save it in a variable before the echos:
- echo ""
- echo "=========================================="
- log_error "Install script failed with exit code: $?"
- exit 1
+ local exit_code=$?
+ echo ""
+ echo "=========================================="
+ log_error "Install script failed with exit code: $exit_code"
+ exit 1
This PR provides a script to deploy the WVA Stack.
What the Script Does:
The deploy-wva-stack.sh script provides a comprehensive deployment automation tool that: