
[Test][KubeRay] Add doctests for Kuberay Autoscaling #51884


Open
wants to merge 10 commits into base: master

Conversation

JiangJiaWei1103

@JiangJiaWei1103 JiangJiaWei1103 commented Apr 1, 2025

Why are these changes needed?

To automate doc testing and reduce the manual testing burden, as highlighted in ray-project/kuberay#3157, this PR introduces doctests for the KubeRay Autoscaling section.
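
For reference, the doctests run the notebook with nbval; a sketch of the invocation (the relative paths are an assumption and mirror the command shown later in this thread):

py.test --nbval user-guides/configuring-autoscaling.ipynb --nbval-kernel-name bash --sanitize-with doc_sanitize.cfg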

Doc link: https://anyscale-ray--51884.com.readthedocs.build/en/51884/cluster/kubernetes/user-guides/configuring-autoscaling.html

Related issue number

ray-project/kuberay#3157

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@JiangJiaWei1103 JiangJiaWei1103 requested review from pcmoritz, kevin85421 and a team as code owners April 1, 2025 14:35
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Apr 2, 2025
Member

@MortalHappiness MortalHappiness left a comment


I think we can omit checking output for

  • kubectl exec -it $HEAD_POD -- ray list actors
  • kubectl exec $HEAD_POD -it -c ray-head -- ray status
  • kubectl logs $HEAD_POD -c autoscaler | tail -n 20

Otherwise, we would need to create a lot of special regexes.

cc @kevin85421 Do you think this is okay?

Comment on lines 29 to 31
[time-stamp]
regex: \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}[.,]?\d*|[A-Z][a-z]{2}\s[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4}
replace: TIME-STAMP
Member


Don't use |. Split it into multiple regexes for readability, like [time-stamp] and [time-stamp-milliseconds].
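
For illustration, one possible split that keeps the original patterns under the suggested section names (a sketch, not the final config):

[time-stamp]
regex: [A-Z][a-z]{2}\s[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4}
replace: TIME-STAMP

[time-stamp-milliseconds]
regex: \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}[.,]?\d*
replace: TIME-STAMP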

@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 4, 2025
@kevin85421
Member

I think we can omit checking output for ...

makes sense to me

@JiangJiaWei1103
Author

I think we can omit checking output for

  • kubectl exec -it $HEAD_POD -- ray list actors
  • kubectl exec $HEAD_POD -it -c ray-head -- ray status
  • kubectl logs $HEAD_POD -c autoscaler | tail -n 20

Otherwise, we would need to create a lot of special regexes.

cc @kevin85421 Do you think this is okay?

We now ignore those checks and remove the corresponding regex patterns. Thanks!
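
For context, nbval skips output comparison for a cell when the cell carries the nbval-ignore-output tag, as already used elsewhere in this notebook; a sketch of the relevant cell metadata:

"metadata": {
    "tags": [
        "nbval-ignore-output"
    ]
}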

@MortalHappiness
Member

Could you resolve the conflicts? Thanks.

@JiangJiaWei1103
Author

Could you resolve the conflicts? Thanks.

Done. Thanks a lot!

}
],
"source": [
"sleep 10 && export WORKER_POD1=$(kubectl get pods --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')\n",
Member


This may become a flaky test if we wait for a fixed time. We might need to extend the timeout. By the way, is there a selector to specifically target the worker pod? It doesn't seem ideal to fetch all pods and sort them by timestamp just to find the newly created worker pod.
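
For reference, KubeRay labels pods with their node type, so a selector along these lines (it appears later in this PR) can list worker pods directly instead of sorting every pod by creation timestamp; a minimal sketch:

# List only Ray worker pods by label
kubectl get pods --selector=ray.io/node-type=worker -o custom-columns=POD:metadata.name --no-headers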

}
],
"source": [
"sleep 10 && export WORKER_POD2=$(kubectl get pods --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')\n",
Member


ditto

Comment on lines 851 to 911
"source": [
"### Step 8: Clean up the Kubernetes cluster"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "92d74542-1984-4519-adde-641b05f9efe8",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"nbval-ignore-output",
"remove-output"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"raycluster.ray.io \"raycluster-autoscaler\" deleted\n",
"configmap \"ray-example\" deleted\n"
]
}
],
"source": [
"# Delete RayCluster and ConfigMap\n",
"kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-cluster.autoscaler.yaml"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "429d84e4-eb5f-4174-8344-306916216dfa",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"nbval-ignore-output",
"remove-output"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"release \"kuberay-operator\" uninstalled\n"
]
}
],
"source": [
"# Uninstall the KubeRay operator\n",
"helm uninstall kuberay-operator"
]
},
Member


For step 8, simply kind delete cluster is enough.
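
A minimal sketch of that simplification, assuming the guide's kind cluster keeps the default name:

# Tear down the whole kind cluster, which removes the RayCluster and the operator along with it
kind delete cluster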

Comment on lines 388 to 397
"while true; do\n",
" WORKER_POD1=$(kubectl get pods --selector=ray.io/node-type=worker -o custom-columns=POD:metadata.name --no-headers)\n",
" if [[ -n \"$WORKER_POD1\" ]]; then\n",
" export WORKER_POD1\n",
" break\n",
" fi \n",
" sleep 2\n",
"done\n",
"kubectl wait --for=condition=ready pod/$WORKER_POD1 --timeout=500s\n",
"echo $WORKER_POD1"
Member


I think we can use kubectl wait to wait for the status.availableWorkerReplicas field of raycluster/raycluster-autoscaler to become 1, instead of waiting for the pod name to become a non-empty string.
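
A hedged sketch of that approach, assuming a kubectl version new enough to support jsonpath wait conditions:

# Block until the RayCluster reports one available worker replica
kubectl wait raycluster/raycluster-autoscaler --for=jsonpath='{.status.availableWorkerReplicas}'=1 --timeout=500s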

Comment on lines 480 to 494
"while true; do\n",
" WORKER_POD2=$(kubectl get pods \\\n",
" --selector=ray.io/node-type=worker \\\n",
" --field-selector=\"metadata.name!=$WORKER_POD1\" \\\n",
" -o custom-columns=POD:metadata.name \\\n",
" --no-headers\n",
" )\n",
" if [[ -n \"$WORKER_POD2\" ]]; then\n",
" export WORKER_POD2\n",
" break\n",
" fi \n",
" sleep 2\n",
"done\n",
"kubectl wait --for=condition=ready pod/$WORKER_POD2 --timeout=500s\n",
"echo $WORKER_POD2"
Member


Similarly, wait for availableWorkerReplicas to become 2.
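
The same sketch, adjusted for the second scale-up:

# Block until the RayCluster reports two available worker replicas
kubectl wait raycluster/raycluster-autoscaler --for=jsonpath='{.status.availableWorkerReplicas}'=2 --timeout=500s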

@MortalHappiness MortalHappiness added the go add ONLY when ready to merge, run all tests label Apr 15, 2025
@MortalHappiness
Member

MortalHappiness commented Apr 15, 2025

cc @dayshah for review as ray-docs code owner. Thanks.

@MortalHappiness
Member

cc @kevin85421 for review as KubeRay owner. Thanks.

Member

@kevin85421 kevin85421 left a comment


Are there any changes to the text, as opposed to the instructions? If so, would you mind adding some comments to those paragraphs so that I can review the changes?

In addition, please make sure the test can pass at least 10 consecutive runs in your local environment. cc @MortalHappiness

@MortalHappiness
Member

@kevin85421 No. But I asked the contributor to split some code blocks into multiple cells. You can just check the code blocks in https://anyscale-ray--51884.com.readthedocs.build/en/51884/cluster/kubernetes/user-guides/configuring-autoscaling.html to see if they look good to you.

@MortalHappiness
Member

In addition, please make sure the test can pass at least 10 consecutive runs in your local environment.

@JiangJiaWei1103 Could you post a screenshot here showing it passing 10 consecutive runs in your local environment?

Member

@kevin85421 kevin85421 left a comment


The change looks good to me. Waiting for the screenshot to prove that it's not flaky locally.

@JiangJiaWei1103
Author

JiangJiaWei1103 commented Apr 16, 2025

I noticed that the test can be a bit flaky locally. As shown in the screenshot below, 1 out of 10 runs failed because the second worker pod wasn’t ready, even though we explicitly use

# kubectl wait --for=condition=ready pod/$WORKER_POD2 --timeout=500s

to wait for it to reach the ready state.

Screenshot 2025-04-16 at 10 55 09 PM

To confirm this, I ran the following script to test the notebook 10 times:

#!/bin/bash

success=0
total=10

for i in $(seq 1 $total); do
    echo "Run #$i"
    py.test --nbval user-guides/configuring-autoscaling.ipynb --nbval-kernel-name bash --sanitize-with doc_sanitize.cfg
    if [ $? -eq 0 ]; then
        ((success++))
    fi
done

echo "===================="
echo "Total Runs: $total"
echo "Successful Runs: $success"
echo "Success Rate: $((success * 100 / total))%"

Final result:
Screenshot 2025-04-16 at 11 00 27 PM

Let me know if there’s a more stable way to ensure the pod is truly ready before proceeding—happy to update accordingly! Thanks!

@MortalHappiness
Member

MortalHappiness commented Apr 17, 2025

The failed cell is in step 5.2, so it is not related to the kubectl wait for worker pod 2. The question is why there are 2 worker pods in step 5.2.

@JiangJiaWei1103
Author

JiangJiaWei1103 commented Apr 17, 2025

The failed cell is in step 5.2, so it is not related to the kubectl wait for worker pod 2. The question is why there are 2 worker pods in step 5.2.

Sorry about the confusion. That was my mistake; I misinterpreted the failed block.

We use kubectl wait to ensure that worker pod 1 reached the Ready status. However, it appears that consistent reads aren’t guaranteed by the API server due to the use of the watch cache. I suspect this issue is caused by outdated information (specifically, the pod still appearing as Init because the cache hasn’t been fully updated yet). This might be relevant since we’re using Kubernetes v1.26.

I looked into this and came across this blog post, which explains the situation well:

Kubernetes has long used a watch cache to optimize read operations. The watch cache stores a snapshot of the cluster state and receives updates through etcd watches. However, until now, it couldn't serve consistent reads directly, as there was no guarantee the cache was sufficiently up-to-date.

To verify, I tried disabling --watch-cache when spinning up the kind cluster, but unfortunately the same error still occurred. Do you think it's acceptable to add one more field selector to make the selection stricter, as follows?

kubectl get pods -l=ray.io/is-ray-node=yes --field-selector=status.phase=Running

(Update) After adding the extra field selector shown above, the tests consistently pass at least 20 consecutive runs:

Screenshot 2025-04-18 at 9 29 16 AM

I'll continue surveying and experimenting to make this test more stable. Thanks!

cc @kevin85421
