
[Dashboard] Add flush() after job_id is populated #52780


Merged
4 commits merged into ray-project:master on May 7, 2025

Conversation

@LeoLiao123 (Contributor) commented May 5, 2025

Why are these changes needed?

See the issue description and debugging details in ray-project/kuberay#3508.

Manual Test

Command:

```bash
kubectl ray job submit \
  --working-dir . \
  --name my-rayjob \
  --runtime-env-json='{"excludes":[
      "ray-operator/bin",
      "ray-operator/bin/k8s",
      ".git",
      "apiserver/pkg/swagger/datafile.go"
    ]}' \
  -- python task.py | ts '[%Y-%m-%d %H:%M:%S]'
```

`task.py`:

```python
import time
import ray

ray.init(address="auto")

@ray.remote
def f():
    for i in range(20):
        print(i)
        time.sleep(1)
    return 1

print(ray.get([f.remote()]))
```

**Result before adding `flush()`**
Log output is delayed until after `task.py` completes:

![before](https://github.com/user-attachments/assets/554a80a3-b4df-4e1f-a483-ff5a24428d5f)

`my-rayjob` remains in the `Waiting` state until the script finishes:

![image](https://github.com/user-attachments/assets/b4035dc9-02d9-41ed-b522-f1fbfafa257f)

**Result after adding `flush()`**
Log output appears immediately after submission:

![after](https://github.com/user-attachments/assets/acf792ea-9008-48ae-9daa-d65e6932e998)

`my-rayjob` transitions to the `Running` state right after submission.

![image](https://github.com/user-attachments/assets/801f1fb4-995d-4601-9706-dc51fcf7e5a6)

Related issue number

Closes ray-project/kuberay#3508

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@LeoLiao123 LeoLiao123 marked this pull request as ready for review May 5, 2025 07:32
@LeoLiao123 (Contributor, Author):

@MortalHappiness PTAL

@MortalHappiness (Member) left a comment


LGTM. Thanks!

@MortalHappiness (Member):

@edoakes @jjyao Could you help merge this PR? Thanks. The context is that `kubectl ray job submit` (a subcommand of the kubectl Ray plugin in the KubeRay repo) relies on the output of the `ray job submit` command. We need to flush stdout to ensure the job submission ID can be read promptly.

@MortalHappiness added the go (add ONLY when ready to merge, run all tests) label on May 7, 2025
@edoakes (Collaborator) commented May 7, 2025

> relies on the output of the ray job submit command

This is not a good thing to rely on, it's not a public API... do we have plans to fix it? You should be able to use the REST API directly instead?

@@ -302,6 +302,7 @@ def submit(
cli_logger.print(cf.bold(f"ray job stop {job_id}"))

cli_logger.newline()
cli_logger.flush()
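For context on why this one-line change matters: when a Python process's stdout is a pipe rather than a TTY, the stream is block-buffered, so output can sit in the buffer until it fills or the process exits. The sketch below is not Ray's actual `cli_logger` implementation, just a minimal standalone illustration of the same idea using plain `sys.stdout`; the function name is hypothetical.

```python
import sys

def print_job_submission(job_id: str) -> None:
    """Emit the submitted job's id and flush immediately.

    When stdout is a pipe (e.g. a wrapping CLI reading our output),
    Python block-buffers it; without an explicit flush the consumer
    may not see the job id until the buffer fills or we exit.
    """
    sys.stdout.write(f"ray job stop {job_id}\n")
    sys.stdout.flush()  # make the id visible to the reader right away
```

Without the `flush()`, a consumer like the kubectl plugin reading the pipe line-by-line would block until the submitting process terminated, which is exactly the delayed behavior shown in the "before" screenshots.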
@edoakes (Collaborator):

please leave a comment here so that it isn't inadvertently broken again in the future

and see my other feedback, we really shouldn't rely on these types of behavior

@LeoLiao123 (Contributor, Author):

@edoakes I’ve added the comment. Thanks for the review!

@edoakes edoakes enabled auto-merge (squash) May 7, 2025 01:33
@MortalHappiness (Member):

> relies on the output of the ray job submit command
>
> This is not a good thing to rely on, it's not a public API... do we have plans to fix it? You should be able to use the REST API directly instead?

The kubectl Ray plugin is for interactive use in non-production environments, so I thought it was acceptable. But if you think we should not depend on it, I can open an issue in KubeRay and find some contributors to work on replacing it with the dashboard API instead.

Currently kubectl ray job submit is a wrapper for ray job submit.

@MortalHappiness (Member):

Created issue ray-project/kuberay#3556. I'll add more description on that issue.

@edoakes (Collaborator) commented May 7, 2025

Wrapping `ray job submit` is fine, but relying on its stdout output as a public API is not. If this is required for functionality, I'd suggest removing that dependency.

@MortalHappiness (Member):

Okay, I get your point. We'll still call the `ray job submit` CLI but won't rely on its stdout; we'll use the dashboard API to get the job submission ID.
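One hedged sketch of that direction: query the dashboard's Jobs REST API (`GET /api/jobs/`) and select the newest entry, instead of scraping CLI stdout. The endpoint path and the `submission_id`/`start_time` fields are based on the Ray Jobs API, but treat the exact response shape as an assumption to verify against your Ray version; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

def pick_latest_submission_id(jobs: list) -> str:
    # Choose the most recently started job from a /api/jobs/ payload.
    # Assumes each entry carries "submission_id" and a numeric
    # "start_time" (None for jobs that have not started yet).
    return max(jobs, key=lambda j: j.get("start_time") or 0)["submission_id"]

def latest_submission_id(dashboard_url: str = "http://localhost:8265") -> str:
    # Network call; requires a reachable Ray dashboard at dashboard_url.
    with urlopen(f"{dashboard_url}/api/jobs/") as resp:
        return pick_latest_submission_id(json.load(resp))
```

Selecting by start time is only a heuristic when several jobs are submitted concurrently; a more robust wrapper would submit through the REST API itself so it chooses the submission id up front.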

auto-merge was automatically disabled May 7, 2025 02:29

Head branch was pushed to by a user without write access

@edoakes edoakes merged commit 0b65b4a into ray-project:master May 7, 2025
5 checks passed
zhaoch23 pushed a commit to Bye-legumes/ray that referenced this pull request May 14, 2025
@LeoLiao123 LeoLiao123 deleted the bug/logger-flush branch May 17, 2025 11:17
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jun 3, 2025
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
Labels: community-backlog, go (add ONLY when ready to merge, run all tests)
Successfully merging this pull request may close these issues.

[Bug][Kubectl-Plugin] RayJob stucks at Waiting state for long-running jobs
4 participants