
Use iris bundle for Ray working_dir to include pb2 files#4182

Merged
rjpower merged 6 commits into main from worktree-agent-a5346899 on Mar 27, 2026

Conversation

@rjpower
Collaborator

@rjpower rjpower commented Mar 26, 2026

Extract create_workspace_zip() from BundleCreator as a reusable function
that produces a zip including gitignored protobuf artifacts (pb2 files).
Wire it into fray v1/v2 ray backends and ray_run.py as the working_dir
source, replacing Ray's native directory bundling which skips gitignored
files. This fixes jobs failing because generated pb2 files were missing
from the uploaded workspace.

Extract create_workspace_zip() from BundleCreator and use it in fray
v1/v2 backends and ray_run.py so that gitignored protobuf artifacts
(pb2 files) are included in Ray job bundles.
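As a rough sketch of the mechanism (simplified, not the actual bundle.py implementation), a gitignore-unaware zip of the workspace looks like this: walking the tree directly means generated `*_pb2.py` files are archived, unlike Ray's native bundling which honors .gitignore.

```python
import zipfile
from pathlib import Path

# Hypothetical simplified version; the real create_workspace_zip lives
# in iris.cluster.client.bundle and differs in details.
EXCLUDE_DIRS = {"__pycache__", ".git", ".venv"}

def zip_workspace(workspace: Path, zip_path: Path) -> Path:
    """Zip every file under workspace, ignoring .gitignore entirely."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(workspace.rglob("*")):
            rel = path.relative_to(workspace)
            # Skip build/VCS noise, but NOT gitignored generated artifacts.
            if any(part in EXCLUDE_DIRS for part in rel.parts):
                continue
            if path.is_file():
                zf.write(path, rel.as_posix())
    return zip_path
```

The key difference from Ray's directory bundling is simply that nothing here consults .gitignore, so gitignored pb2 artifacts survive into the uploaded working_dir.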
@rjpower added the agent-generated label on Mar 26, 2026
@claude
Contributor

claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 58s


PR Review: Use iris bundle for Ray working_dir to include pb2 files

  • Read changed files and understand the diff
  • Check correctness and spec adherence
  • Post review

Extracts create_workspace_zip() from BundleCreator as a standalone function and wires it into the three Ray submission sites (fray v1, fray v2, ray_run.py), replacing Ray's native working_dir+excludes with a pre-built zip that includes gitignored pb2 files. The refactor is clean and the PR fulfills its stated objective.

Findings:

  • Temp directory leak. create_workspace_zip calls tempfile.mkdtemp() and returns a path into it, but nothing ever cleans up the directory. The comment says "caller should not delete the parent directory while the zip is in use," but none of the three call sites ever delete it afterward either. For the BundleCreator path this is worse than before — the old code used TemporaryDirectory as a context manager. In a long-lived process or one that submits many jobs, this leaks zip files on disk indefinitely. Consider returning a context manager or registering an atexit cleanup, or at minimum documenting the expected lifecycle so callers know to clean up. Fix this →

  • Behavioral difference in tests exclusion. The old excludes list used "tests/" (a top-level directory pattern interpreted by Ray), while the new code uses exclude_dirs={"tests"} which matches any path component named tests at any depth (line 68: any(part in all_dirs for part in relative.parts)). This means nested test directories like lib/foo/tests/ that were previously included are now excluded. This is probably fine (and arguably better), but it is a subtle behavioral change worth noting.

  • _GENERATED_ARTIFACT_GLOBS renamed to GENERATED_ARTIFACT_GLOBS. This makes the constant part of the public API. Intentional and fine given the new public functions, but callers outside this module can now depend on it.

Overall this is a straightforward, well-structured refactor. The temp dir leak is the only item I'd want addressed before merge.
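The depth-matching difference in the second finding can be illustrated with a small sketch (helper names are hypothetical, not the actual bundle.py API):

```python
from pathlib import PurePosixPath

# Two exclusion semantics for a set of directory names (illustrative only).
def excluded_any_depth(rel: str, dirs: set[str]) -> bool:
    # New exclude_dirs behavior: a directory name matches at any depth.
    return any(part in dirs for part in PurePosixPath(rel).parts)

def excluded_top_level(rel: str, dirs: set[str]) -> bool:
    # Old Ray excludes=["tests/"] behavior: top-level directory only.
    return PurePosixPath(rel).parts[0] in dirs
```

Under the new semantics, a nested path like `lib/foo/tests/test_x.py` is excluded, whereas Ray's old top-level pattern kept it.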


@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e888345d9



# Use a persistent temp directory (not a context manager) so the caller
# can use the zip path after this function returns.
td = tempfile.mkdtemp(prefix="workspace_zip_")


P2 Badge Delete temporary workspace zip directories after bundling

create_workspace_zip now uses tempfile.mkdtemp(...) to persist workspace.zip, but there is no corresponding cleanup path, and BundleCreator.create_bundle() immediately reads bytes without removing that temp directory. Because this helper is now called for each Ray/Iris submission, long-lived clients will leak one workspace_zip_* directory per job and can eventually exhaust /tmp, causing later submissions to fail.


@rjpower
Collaborator Author

rjpower commented Mar 26, 2026

@claude fix the tempdir leak and the test exclusion bug.

why do we need a new tempdir for every execution? can we not just make a zipfile bundle on the fly?

@claude
Contributor

claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 3m 34s


Fixing tempdir leak and test exclusion bug

  • Read current code in bundle.py and call sites
  • Fix tempdir leak — register atexit cleanup for the temp directory
  • Fix test exclusion bug — caller-supplied exclude_dirs now only matches top-level directories
  • Add test for nested tests/ not being excluded
  • Run pre-commit and tests (all 7 pass ✅)
  • Push changes

Changes made:

  1. Tempdir leak — Added atexit.register(shutil.rmtree, td, ignore_errors=True) so the temp directory is cleaned up when the process exits. This is sufficient because Ray uploads the zip during submit_job before the process terminates.

  2. Test exclusion bug — Split the directory exclusion logic: EXCLUDE_DIRS (like __pycache__, .git) still match at any depth, but caller-supplied extra_dirs (like tests, docs) now only match the first path component. This preserves the old Ray excludes=["tests/"] behavior where only the top-level tests/ directory was excluded.

  3. On "why a new tempdir" — We need a file path because Ray's working_dir parameter requires one. The tempdir is now cleaned up via atexit so there's no leak. A BytesIO approach would avoid the temp file entirely, but would require each call site to write the bytes to disk since Ray doesn't accept in-memory zips.

Tests: Ran pytest lib/iris/tests/cluster/client/test_bundle.py — all 7 tests pass, including the new test_custom_exclude_dirs_only_matches_top_level.
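The split exclusion rule from item 2 can be sketched as follows (names are illustrative; the real bundle.py internals may differ):

```python
from pathlib import PurePosixPath

# Built-in noise directories match at any depth; caller-supplied extras
# match only as the first path component (old Ray excludes behavior).
ALWAYS_EXCLUDE = {"__pycache__", ".git"}

def is_excluded(rel: str, extra_dirs: set[str]) -> bool:
    parts = PurePosixPath(rel).parts
    if any(p in ALWAYS_EXCLUDE for p in parts):
        return True
    return parts[0] in extra_dirs  # top level only
```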


- Register atexit cleanup for temp directories so they don't leak in
  long-lived processes.
- Make caller-supplied exclude_dirs match only top-level directories,
  preserving the old Ray excludes behavior. Default EXCLUDE_DIRS
  (__pycache__, .git, etc.) still match at any depth.
- Add test verifying nested tests/ directories are not excluded.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower
Collaborator Author

rjpower commented Mar 26, 2026

@claude

working_dir (str): Specifies the working directory for the Ray workers. This must either be (1) a local existing directory with total size at most 500 MiB, (2) a local existing zipped file with total unzipped size at most 500 MiB (Note: excludes has no effect), or (3) a URI to a remotely-stored zip file containing the working directory for your job (no file size limit is enforced by Ray). See Remote URIs for details. The specified directory will be downloaded to each node on the cluster, and Ray workers will be started in their node's copy of this directory.

Examples

"." # cwd

"/src/my_project"

"/src/my_project.zip"

"s3://path/to/my_dir.zip"

do it right

@claude
Contributor

claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 2m 35s


Fix temp directory leak — use zip file path directly as working_dir

  • Read current code (bundle.py, call sites, tests)
  • Fix create_workspace_zip to use tempfile.mkstemp(suffix=".zip") instead of mkdtemp + nested zip
  • Register atexit cleanup on the single file (no directory to leak)
  • Run pre-commit — ✅ all checks pass
  • Run tests — ✅ all 7 pass (pytest lib/iris/tests/cluster/client/test_bundle.py)
  • Push changes

What changed: create_workspace_zip now uses tempfile.mkstemp(suffix=".zip") to create a temp zip file directly. Ray accepts zip file paths as working_dir, so the intermediate temp directory was unnecessary. The single temp file is cleaned up via atexit. No call site changes needed since the function still returns a str path ending in .zip.
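A minimal sketch of the mkstemp approach described here (the helper name and atexit wiring are assumptions based on this comment, not the exact code):

```python
import atexit
import os
import tempfile

def make_temp_zip_path() -> str:
    """Create a temp .zip path directly; no intermediate directory."""
    fd, path = tempfile.mkstemp(suffix=".zip")
    os.close(fd)  # zipfile will reopen the path itself

    def _cleanup() -> None:
        # Best-effort removal at process exit; ignore races.
        try:
            os.unlink(path)
        except OSError:
            pass

    atexit.register(_cleanup)
    return path
```

This works because Ray accepts a `.zip` path for working_dir and uploads it during submit_job, before the submitting process exits.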

Use tempfile.mkstemp(suffix='.zip') instead of mkdtemp + nested zip file.
Ray accepts zip file paths directly as working_dir, so no intermediate
directory is needed. The single temp file is cleaned up via atexit.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower
Collaborator Author

rjpower commented Mar 26, 2026

@claude fix bugs:

Fray - Tests / fray-tpu-tests (pull_request): Failing after 1m

ray::run_on_pod_ray() (pid=7980, ip=169.254.123.15)
RuntimeError: The remote function failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment or cannot be found from sys.path [...]:

Traceback (most recent call last):
  File "/tmp/ray/session_2026-03-26_23-28-55_320387_241/runtime_resources/working_dir_files/_ray_pkg_a61c7d9d4effc741/src/fray/v1/cluster/ray/cluster.py", line 20, in <module>
    from iris.cluster.client.bundle import create_workspace_zip
  File "/tmp/ray/session_2026-03-26_23-28-55_320387_241/runtime_resources/working_dir_files/_ray_pkg_a61c7d9d4effc741/.venv/lib/python3.11/site-packages/iris/__init__.py", line 1, in <module>
    raise RuntimeError("""
RuntimeError:

Ambiguous 'iris' package.
Please use either:

  illumon-iris - Python integration for Illumon Iris, a time-series database
  SciTools Iris - Python library for analysing and visualising meteorological and oceanographic data sets
  info.gianlucacosta.iris - A general-purpose library for Python
  irisapi - Iris is a highly configurable and flexible service for paging and messaging
=============================== warnings summary ===============================
lib/fray/tests/test_ray_tpu.py::test_single_slice_simple_run
/workspace/.venv/lib/python3.11/site-packages/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
warnings.warn(

@claude
Contributor

claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 4m 54s


Fixing iris package namespace conflict on Ray workers

  • Analyze the error — PyPI iris package in worker's .venv shadows our local iris module
  • Read affected files
  • Fix: make iris imports lazy in fray v1/v2 (only needed at submission time, not on workers)
  • Run pre-commit and tests — ✅ all 7 bundle tests pass
  • Push changes

Root cause: The worker machine's .venv has a PyPI iris stub package that raises "Ambiguous 'iris' package". When fray/v1/cluster/ray/cluster.py is loaded on the worker, the top-level from iris.cluster.client.bundle import create_workspace_zip resolves to the PyPI package instead of our source.

Fix: Made all iris imports lazy (moved inside the functions that use them) in both fray v1 and v2 backends:

  • fray/v1/cluster/ray/cluster.py: create_workspace_zip import moved into _get_runtime_env()
  • fray/v2/ray_backend/backend.py: create_workspace_zip import moved into build_runtime_env(), configure_logging import moved into _RayActorHostBase.__init__()

These imports are only needed at submission time (not on workers), so lazy loading avoids the namespace conflict entirely.

Tests: pytest lib/iris/tests/cluster/client/test_bundle.py — all 7 pass ✅

…on workers

The worker's .venv has a PyPI iris package that shadows our local iris
source. Since create_workspace_zip and configure_logging are only needed
at submission time (not on workers), move the imports inside the
functions that use them.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower rjpower requested a review from yonromai March 27, 2026 00:10
Contributor

@yonromai yonromai left a comment


:shipit:

rjpower added 2 commits March 27, 2026 07:30
Removed several tests related to create_workspace_zip functionality.
…mp directory

JobSubmissionClient requires a directory path for working_dir, not a zip.
Split bundle.py into collect_workspace_files (file list), create_workspace_zip
(bytes for Iris), and create_workspace_dir (temp dir for Ray). Collapse the
three separate exclude params (dirs/extensions/subpaths) into a single regex.
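A hedged sketch of the final split described in this commit message (all signatures and the regex are assumptions, not the actual bundle.py API):

```python
import io
import re
import zipfile
from pathlib import Path

# Single regex collapsing the old dirs/extensions/subpaths exclude params.
DEFAULT_EXCLUDE = re.compile(r"(^|/)(__pycache__|\.git|\.venv)(/|$)|\.pyc$")

def collect_workspace_files(root: Path, exclude: re.Pattern = DEFAULT_EXCLUDE):
    """Yield (relative_name, path) pairs for files not matching exclude."""
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.is_file() and not exclude.search(rel):
            yield rel, path

def create_workspace_zip(root: Path) -> bytes:
    """In-memory zip bytes (the Iris-facing variant in this description)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel, path in collect_workspace_files(root):
            zf.write(path, rel)
    return buf.getvalue()
```

The directory-producing variant for Ray (`create_workspace_dir` in the message) would copy the same collected file list into a temp directory instead of a zip, since JobSubmissionClient wants a directory path.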
@rjpower rjpower enabled auto-merge (squash) March 27, 2026 15:01
@rjpower rjpower merged commit 7f309a5 into main Mar 27, 2026
40 of 41 checks passed
@rjpower rjpower deleted the worktree-agent-a5346899 branch March 27, 2026 15:04
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
Extract create_workspace_zip() from BundleCreator as a reusable function
that produces a zip including gitignored protobuf artifacts (pb2 files).
Wire it into fray v1/v2 ray backends and ray_run.py as the working_dir
source, replacing Ray's native directory bundling which skips gitignored
files. This fixes jobs failing because generated pb2 files were missing
from the uploaded workspace.

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>