[Test][KubeRay] Add doctests for Kuberay Autoscaling #51884
base: master
Conversation
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: jiangjiawei1103 <[email protected]>
I think we can omit checking output for:
kubectl exec -it $HEAD_POD -- ray list actors
kubectl exec $HEAD_POD -it -c ray-head -- ray status
kubectl logs $HEAD_POD -c autoscaler | tail -n 20
Otherwise, we would need to create a lot of special regexes.
cc @kevin85421 Do you think this is okay?
[time-stamp]
regex: \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}[.,]?\d*|[A-Z][a-z]{2}\s[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4}
replace: TIME-STAMP
Don't use `|`. Split it into multiple regexes for readability, like `[time-stamp]` and `[time-stamp-milliseconds]`.
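For illustration, a rough sketch of how the split sections could look in the sanitize config. The section names follow the suggestion above; the exact patterns and their ordering are assumptions, not necessarily what the PR ends up with:

```
# More specific pattern first so millisecond suffixes are sanitized as well.
[time-stamp-milliseconds]
regex: \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}[.,]\d+
replace: TIME-STAMP

[time-stamp]
regex: \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}
replace: TIME-STAMP

[time-stamp-date-string]
regex: [A-Z][a-z]{2}\s[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4}
replace: TIME-STAMP
```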
Signed-off-by: jiangjiawei1103 <[email protected]>
Makes sense to me.
Signed-off-by: jiangjiawei1103 <[email protected]>
We now ignore those checks and have removed the corresponding regex patterns. Thanks!
Could you resolve the conflicts? Thanks.
Signed-off-by: 江家瑋 <[email protected]>
Done. Thanks a lot!
}
],
"source": [
"sleep 10 && export WORKER_POD1=$(kubectl get pods --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')\n",
This may become a flaky test if we wait for a fixed time. We might need to extend the timeout. By the way, is there a selector to specifically target the worker pod? It doesn't seem ideal to fetch all pods and sort them by timestamp just to find the newly created worker pod.
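For example, something along these lines (a sketch only; it assumes the worker pod already exists when the cell runs and carries the ray.io/node-type=worker label, which the later revision of this notebook also relies on):

```bash
# Select the worker pod by its KubeRay label instead of sorting every pod
# by creation timestamp.
export WORKER_POD1=$(kubectl get pods \
  --selector=ray.io/node-type=worker \
  -o custom-columns=POD:metadata.name --no-headers)

# Block until the pod is ready instead of sleeping for a fixed 10 seconds.
kubectl wait --for=condition=ready pod/$WORKER_POD1 --timeout=500s
echo "$WORKER_POD1"
```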
}
],
"source": [
"sleep 10 && export WORKER_POD2=$(kubectl get pods --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')\n",
ditto
"source": [ | ||
"### Step 8: Clean up the Kubernetes cluster" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 22, | ||
"id": "92d74542-1984-4519-adde-641b05f9efe8", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [ | ||
"nbval-ignore-output", | ||
"remove-output" | ||
] | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"raycluster.ray.io \"raycluster-autoscaler\" deleted\n", | ||
"configmap \"ray-example\" deleted\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Delete RayCluster and ConfigMap\n", | ||
"kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-cluster.autoscaler.yaml" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 23, | ||
"id": "429d84e4-eb5f-4174-8344-306916216dfa", | ||
"metadata": { | ||
"editable": true, | ||
"slideshow": { | ||
"slide_type": "" | ||
}, | ||
"tags": [ | ||
"nbval-ignore-output", | ||
"remove-output" | ||
] | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"release \"kuberay-operator\" uninstalled\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Uninstall the KubeRay operator\n", | ||
"helm uninstall kuberay-operator" | ||
] | ||
}, |
For Step 8, simply `kind delete cluster` is enough.
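A minimal sketch of the simplified cleanup step, assuming the whole example runs inside a Kind cluster (deleting it tears down the RayCluster, the ConfigMap, and the KubeRay operator in one go):

```bash
# Step 8: Clean up the Kubernetes cluster by deleting the Kind cluster itself.
kind delete cluster
```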
Signed-off-by: jiangjiawei1103 <[email protected]>
….com/JiangJiaWei1103/ray into add-doctests-for-kuberay-autoscaling
"while true; do\n", | ||
" WORKER_POD1=$(kubectl get pods --selector=ray.io/node-type=worker -o custom-columns=POD:metadata.name --no-headers)\n", | ||
" if [[ -n \"$WORKER_POD1\" ]]; then\n", | ||
" export WORKER_POD1\n", | ||
" break\n", | ||
" fi \n", | ||
" sleep 2\n", | ||
"done\n", | ||
"kubectl wait --for=condition=ready pod/$WORKER_POD1 --timeout=500s\n", | ||
"echo $WORKER_POD1" |
I think we can use `kubectl wait` to wait for the `status.availableWorkerReplicas` field of `raycluster/raycluster-autoscaler` to become 1, instead of waiting for the pod name to become a non-empty string.
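Something like the following could replace the polling loop (a sketch; the jsonpath form of `kubectl wait` needs a reasonably recent kubectl, and the field path is an assumption based on the RayCluster status field named above):

```bash
# Wait until the RayCluster reports one available worker replica.
kubectl wait raycluster/raycluster-autoscaler \
  --for=jsonpath='{.status.availableWorkerReplicas}'=1 \
  --timeout=500s
```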
"while true; do\n", | ||
" WORKER_POD2=$(kubectl get pods \\\n", | ||
" --selector=ray.io/node-type=worker \\\n", | ||
" --field-selector=\"metadata.name!=$WORKER_POD1\" \\\n", | ||
" -o custom-columns=POD:metadata.name \\\n", | ||
" --no-headers\n", | ||
" )\n", | ||
" if [[ -n \"$WORKER_POD2\" ]]; then\n", | ||
" export WORKER_POD2\n", | ||
" break\n", | ||
" fi \n", | ||
" sleep 2\n", | ||
"done\n", | ||
"kubectl wait --for=condition=ready pod/$WORKER_POD2 --timeout=500s\n", | ||
"echo $WORKER_POD2" |
Similarly, wait for `availableWorkerReplicas` to become 2.
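For the second scale-up, the same sketch as above with the expected value bumped to 2:

```bash
# Wait until the RayCluster reports two available worker replicas.
kubectl wait raycluster/raycluster-autoscaler \
  --for=jsonpath='{.status.availableWorkerReplicas}'=2 \
  --timeout=500s
```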
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: 江家瑋 <[email protected]>
cc @dayshah for review as ray-docs code owner. Thanks.
cc @kevin85421 for review as KubeRay owner. Thanks.
Are there any changes to the text other than the instructions? If so, would you mind adding some comments to the paragraph so that I can review the changes?
In addition, please make sure the test can pass at least 10 consecutive runs in your local environment. cc @MortalHappiness
@kevin85421 No. But I asked the contributor to split some code blocks into multiple cells. You can just check the code blocks in https://anyscale-ray--51884.com.readthedocs.build/en/51884/cluster/kubernetes/user-guides/configuring-autoscaling.html to see if they look good to you.
@JiangJiaWei1103 Could you post a screenshot here of running it and passing 10 consecutive runs in your local environment?
The change looks good to me. Waiting for the screenshot to prove that it's not flaky locally.
I noticed that the test can be a bit flaky locally. The step that appears to fail is:

```bash
kubectl wait --for=condition=ready pod/$WORKER_POD2 --timeout=500s
```

To confirm this, I ran the following script to test the notebook 10 times:

```bash
#!/bin/bash
success=0
total=10
for i in $(seq 1 $total); do
  echo "Run #$i"
  py.test --nbval user-guides/configuring-autoscaling.ipynb --nbval-kernel-name bash --sanitize-with doc_sanitize.cfg
  if [ $? -eq 0 ]; then
    ((success++))
  fi
done
echo "===================="
echo "Total Runs: $total"
echo "Successful Runs: $success"
echo "Success Rate: $((success * 100 / total))%"
```

Let me know if there's a more stable way to ensure the pod is truly ready before proceeding; happy to update accordingly! Thanks!
The failed cell is step 5.2, so it is not related to that `kubectl wait` command.
Sorry about the confusion. That was my mistake in misinterpreting the failed block. We use [...]. I looked into this and came across this blog post, which explains the situation well:
To verify, I tried disabling [...].

kubectl get pods -l=ray.io/is-ray-node=yes --field-selector=status.phase=Running

(Update) By adding one more field selector, as shown above, experiments indicate that the tests can consistently pass at least 20 times in a row. I'll continue surveying and experimenting to make this test more stable. Thanks! cc @kevin85421
Why are these changes needed?
To automate doc testing and reduce the manual testing burden, as highlighted in ray-project/kuberay#3157, this PR introduces doctests for the KubeRay Autoscaling section.
Doc link: https://anyscale-ray--51884.com.readthedocs.build/en/51884/cluster/kubernetes/user-guides/configuring-autoscaling.html
Related issue number
ray-project/kuberay#3157
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.