fix: ray module not found handling (#1049) #1055

Merged: 1 commit, May 1, 2025

@andywag (Contributor) commented Apr 28, 2025

Summary:
TorchX has handled ModuleNotFoundError gracefully for a while now. For example, when the sagemaker module is not installed, running torchx runopts prints:

...
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

aws_sagemaker: No module named 'sagemaker'

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]
...
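
The graceful behavior shown in the output above can be sketched roughly as follows. This is an illustration only, not TorchX's actual code: the `SCHEDULER_MODULES` table, the `describe_schedulers` helper, and the module names in it are all hypothetical. The idea is that each scheduler module is imported lazily, and a `ModuleNotFoundError` is reported on one line so the loop can move on to the next scheduler.

```python
import importlib

# Hypothetical name -> module path table (NOT TorchX's real registry).
SCHEDULER_MODULES = {
    "json_ok": "json",                        # stdlib module, always importable
    "missing_dep": "no_such_module_xyz_123",  # stands in for an uninstalled extra
}


def describe_schedulers(modules):
    """Return one status line per scheduler without raising on a missing module."""
    lines = []
    for name, path in modules.items():
        try:
            importlib.import_module(path)
        except ModuleNotFoundError as err:
            # Report the missing dependency and continue with the next scheduler.
            lines.append(f"{name}: {err}")
            continue
        lines.append(f"{name}: ok")
    return lines


for line in describe_schedulers(SCHEDULER_MODULES):
    print(line)
```

The line reported for the missing module has the same shape as the `aws_sagemaker: No module named 'sagemaker'` line in the output above.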

For ray, however, an exception is raised, and the runopts of the schedulers that follow are never printed:

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
    opts = runner.scheduler_run_opts(scheduler)
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
    return self._scheduler(scheduler).run_opts()
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
    sched = factory(self._name, **self._scheduler_params)
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
    module = importlib.import_module(path)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
    session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined

That's because ray_scheduler has its own custom ModuleNotFoundError handling, perhaps for historical reasons.

Test Plan: [x] existing tests must pass

Differential Revision: D73751531

Pulled By: andywag

@facebook-github-bot added the CLA Signed label Apr 28, 2025

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D73751531

Reviewed By: tonykao8080

@andywag force-pushed the export-D73751531 branch from 5526463 to c489694 on May 1, 2025 04:29
@andywag closed this May 1, 2025
@andywag reopened this May 1, 2025
@facebook-github-bot merged commit 9120355 into pytorch:main May 1, 2025
40 of 43 checks passed
Labels: CLA Signed, fb-exported