
iris: use tmpfs for task workdirs, replace du with disk_usage #3696

Merged
rjpower merged 22 commits into main from multi/79686b7d on Mar 16, 2026
Conversation

Collaborator
@rjpower rjpower commented Mar 15, 2026

This should prevent us from overloading system memory with temp mounts.

Closes #3391

@rjpower rjpower added the agent-generated (Created by automation/agent) label Mar 15, 2026
Comment thread lib/iris/src/iris/cluster/worker/env_probe.py Outdated
    return
if sys.platform != "linux":
    raise RuntimeError("Docker workdir disk limits require Linux tmpfs mounts")
workdir.mkdir(parents=True, exist_ok=True)
Contributor

Nit: _setup() already calls workdir.mkdir(parents=True, exist_ok=True) on line 473 of task_attempt.py before calling prepare_workdir. This second mkdir is redundant.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c591c3e6b4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/worker/env_probe.py Outdated
@rjpower rjpower changed the title from "iris: use tmpfs for task workdirs, replace du with statvfs" to "iris: use tmpfs for task workdirs, replace du with disk_usage" on Mar 15, 2026
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 0519b8fe0c

Comment thread lib/iris/src/iris/cluster/runtime/docker.py Outdated
rjpower and others added 8 commits March 15, 2026 12:37
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Protocol: two new methods for runtime-specific workdir backing storage
- Docker: mount/unmount bounded tmpfs (Linux only)
- K8s: no-op methods + sizeLimit on emptyDir volume when disk_bytes set
- Process: no-op implementations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…with disk_usage

- Call runtime.prepare_workdir() after workdir creation in _setup()
- Call runtime.cleanup_workdir() before shutil.rmtree in _cleanup()
- Replace subprocess du -sm with shutil.disk_usage() in collect_workdir_size_mb
- Add prepare_workdir/cleanup_workdir mocks to test fixtures
- DockerRuntime: test prepare_workdir tmpfs mount/skip/reject/failure
  and cleanup_workdir mount/unmount behavior
- KubernetesRuntime: test emptyDir sizeLimit when disk_bytes is set
- env_probe: test collect_workdir_size_mb for existing and missing paths
- Worker lifecycle: verify prepare_workdir/cleanup_workdir are called
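The `du -sm` → `shutil.disk_usage` swap described above can be sketched as follows (a minimal sketch: the real `collect_workdir_size_mb` lives in `env_probe.py` and its exact signature may differ):

```python
import shutil
from pathlib import Path

def collect_workdir_size_mb(workdir: Path) -> int:
    """Report used space on the filesystem backing workdir.

    Replaces a `du -sm <workdir>` subprocess (which walks the whole tree)
    with a single statvfs call. Caveat: this measures the backing
    filesystem, so it is only per-task-accurate when the workdir has its
    own mount (e.g. a dedicated tmpfs).
    """
    if not workdir.exists():
        return 0
    return int(shutil.disk_usage(workdir).used / (1024 * 1024))
```

This is much cheaper than spawning `du` every poll cycle, at the cost of the shared-disk approximation the later commits address.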
…age semantic shift

- Change cleanup_workdir to log warning on umount failure instead of raising RuntimeError
- Update collect_workdir_size_mb docstring to document shared-disk approximation behavior
- Add test for warning-on-failure behavior
…e WorkdirSpec

Replace the prepare_workdir/cleanup_workdir protocol methods (which required
no-op stubs on K8s and Process runtimes) with a declarative WorkdirSpec that
flows through stage_bundle. The runtime interprets the spec according to its
capabilities — Docker mounts tmpfs, K8s uses emptyDir sizeLimit, Process
ignores it.

Move disk usage monitoring from shutil.disk_usage in env_probe into a
disk_usage_mb() method on ContainerHandle, throttled to ~60s intervals
instead of every 5s poll cycle. K8s returns 0 since the workdir isn't
observable from the worker node.
The worker container needs CAP_SYS_ADMIN to mount tmpfs workdirs for
per-task disk quota enforcement. Without it, any task requesting
disk_bytes fails with "permission denied" during setup.
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 2db0bdd66f

Comment on lines 805 to 808
if workdir_spec and workdir_spec.disk_bytes > 0:
    self._mount_tmpfs(workdir, workdir_spec.disk_bytes)
if bundle_id:
    bundle_store.extract_bundle_to(bundle_id, workdir)

P1: Unmount tmpfs if bundle staging raises

This mounts tmpfs before extract_bundle_to/write_workdir_files, but there is no rollback if either staging call throws. In the current task flow, tmpfs unmounting happens in DockerContainerHandle.cleanup(), which is only reachable after create_container; staging failures occur earlier, so the mount remains and rmtree later hits a busy mountpoint. Repeated task failures with disk_bytes set will leak RAM-backed mounts on the worker.

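The rollback the reviewer asks for follows a standard mount-then-stage pattern. A hedged sketch (the helper name and injectable `run` parameter are illustrative, not the PR's actual code):

```python
import subprocess
from pathlib import Path
from typing import Callable

def stage_with_rollback(
    workdir: Path,
    size_bytes: int,
    stage: Callable[[Path], None],
    run: Callable = subprocess.run,
) -> None:
    """Mount a bounded tmpfs, stage the bundle, unmount if staging fails.

    `stage` performs bundle extraction / workdir-file writes. `run` is
    injectable for testing; in production it is subprocess.run.
    """
    run(["mount", "-t", "tmpfs", "-o", f"size={size_bytes}", "tmpfs", str(workdir)], check=True)
    try:
        stage(workdir)
    except Exception:
        # Roll back so rmtree doesn't hit a busy mountpoint and repeated
        # staging failures don't leak RAM-backed mounts.
        run(["umount", str(workdir)], check=False)
        raise
```

The key property is that the mount and its cleanup live in the same scope, rather than relying on a container handle that may never be created.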

Comment thread lib/iris/src/iris/cluster/runtime/docker.py Outdated
pytest-xdist spawns 8 workers each with their own session fixture,
so all 8 raced to run `npm ci` in the same dashboard/ directory —
corrupting node_modules and causing exit code 190 non-deterministically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 2c568c5b78

Comment thread lib/iris/src/iris/cluster/runtime/process.py Outdated
…Spec

Replace the Docker-centric `(host_path, container_path, mode)` tuple
representation and the parallel `WorkdirSpec` concept with a single
semantic `MountSpec` dataclass. Each mount declares its kind (WORKDIR,
TMPFS, CACHE) and the runtime resolves host paths at container creation.

Key changes:
- Add MountKind enum and MountSpec frozen dataclass to types.py
- Remove WorkdirSpec entirely
- Docker runtime gains ResolvedMount dataclass, resolve_mounts() method,
  and cache_dir constructor param for host-path resolution
- K8s runtime iterates MountSpecs uniformly: WORKDIR/TMPFS → emptyDir,
  CACHE → hostPath. Fixes the duplicate volumes.append bug.
- Process runtime resolves mount map from MountSpecs via cache_dir
- task_attempt builds semantic MountSpec list instead of host-path tuples
- stage_bundle always receives workdir_mount (not conditional on disk_bytes)
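A minimal sketch of what such a declarative spec might look like (field names beyond `kind` and `container_path` are assumptions; the PR's actual `MountSpec` may differ):

```python
from dataclasses import dataclass
from enum import Enum, auto

class MountKind(Enum):
    WORKDIR = auto()  # task working directory
    TMPFS = auto()    # RAM-backed scratch space
    CACHE = auto()    # shared host cache (uv, cargo, ...)

@dataclass(frozen=True)
class MountSpec:
    """Declarative mount. Each runtime interprets it per its capabilities:
    Docker mounts tmpfs, K8s uses emptyDir sizeLimit, Process ignores it."""
    kind: MountKind
    container_path: str
    size_bytes: int = 0      # 0 = runtime default
    read_only: bool = False
```

Because the dataclass is frozen, specs can be shared freely across the runtime boundary without aliasing concerns; host-path resolution happens inside each runtime at container creation.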
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 87fbe9e145

Comment thread lib/iris/src/iris/cluster/runtime/docker.py Outdated
…_container

create_container() accepted workdir_host_path as a keyword arg but the
ContainerRuntime protocol and all callers pass only config. Read the
path from config.workdir_host_path instead, which task_attempt already
sets, so Docker tasks with WORKDIR mounts resolve correctly.
Collaborator Author

rjpower commented Mar 15, 2026

@claude address review comments, let's clean up mount points if setup fails. Also remove all dumb tests that violate lib/iris/AGENTS.md guidelines

@marin-community marin-community deleted a comment from claude bot Mar 15, 2026
Contributor

claude bot commented Mar 15, 2026

Claude finished @rjpower's task in 6m 2s


Addressing review comments

  • Read lib/iris/AGENTS.md and TESTING.md for test guidelines
  • Read current source files (docker.py, process.py, env_probe.py, task_attempt.py, types.py)
  • Fix: cleanup tmpfs mount if staging fails (wrap staging in try/except with unmount)
  • Fix: keep failed umounts tracked for retry
  • Fix: remove redundant mkdir in _mount_tmpfs
  • Remove tests violating TESTING.md guidelines
  • Run pre-commit and tests
  • Commit and push

Changes made

docker.py:

  • stage_bundle(): Wrapped bundle extraction/write in try/except — if staging fails after tmpfs mount, release_tmpfs() is called to clean up the mount and prevent leaking RAM-backed mounts
  • release_tmpfs(): Failed umounts no longer discard the path from _tmpfs_mounts, keeping it tracked so later cleanup passes can retry
  • _mount_tmpfs(): Removed redundant workdir.mkdir() — workdir is already created by task_attempt._setup() before stage_bundle is called

test_docker_runtime.py — Removed tests violating TESTING.md guidelines:

  • test_resolve_mounts_workdir — constructor round-trip
  • test_resolve_mounts_read_only — trivial attribute mapping
  • test_create_container_resolves_mounts — asserts on handle._resolved_mounts (private)
  • test_cleanup_releases_workdir_and_tmpfs_mounts — internal dispatch via monkey-patching
  • Stripped _tmpfs_mounts private attribute assertions from remaining tests
  • Renamed test_release_tmpfs_warns_on_umount_failure to test_release_tmpfs_keeps_tracking_on_umount_failure to match the new behavior
  • Added test_stage_bundle_unmounts_tmpfs_on_staging_failure for the new cleanup-on-failure behavior

Testing

Ran uv run pytest lib/iris/tests/cluster/runtime/test_docker_runtime.py lib/iris/tests/cluster/worker/ -m "not e2e" -o "addopts=" -v: 67 passed, 2 skipped (skips are integration tests requiring infrastructure).

Collaborator Author

rjpower commented Mar 15, 2026

@claude let's also fix CI:

from filelock import FileLock

E ModuleNotFoundError: No module named 'filelock'
_______________ ERROR at setup of test_static_auth_job_ownership _______________
tests/e2e/conftest.py:65: in _ensure_dashboard_built
from filelock import FileLock
E ModuleNotFoundError: No module named 'filelock'
=========================== short test summary info ============================
ERROR tests/e2e/test_smoke.py::test_workers_ready - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_jobs_tab - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_job_detail - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_task_logs - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_constraints - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_scheduling_diagnostic - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_workers_tab - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_worker_detail - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_autoscaler_tab - ModuleNotFoundError: No module named 'filelock'
ERROR tests/e2e/test_smoke.py::test_dashboard_status_tab - ModuleNotFoundError: No module named 'filelock'

should we just use fcntl instead of filelock? Or install filelock, your call. We don't support Windows.

Contributor

claude bot commented Mar 15, 2026

Claude finished @rjpower's task in 2m 4s


Fixed CI: replaced filelock with fcntl

  • Read tests/e2e/conftest.py to understand filelock usage
  • Replace filelock with fcntl-based locking (stdlib, no extra dependency)
  • Run pre-commit — all checks pass
  • Commit and push

Replaced the filelock.FileLock usage with fcntl.flock(): stdlib, no extra dependency, and Windows support isn't needed. The lock semantics are the same: exclusive lock around the npm build, released on exit.
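A sketch of the flock-based equivalent (the actual conftest.py code may differ; `build_lock` is a hypothetical name):

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def build_lock(lock_path: Path):
    """Exclusive cross-process lock around the dashboard npm build.

    fcntl.flock blocks until the lock is free; releasing the fd (or the
    process exiting) drops the lock, matching filelock.FileLock semantics
    on Linux. Not portable to Windows, which this repo doesn't support.
    """
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks other pytest-xdist workers
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

With this, only one of the eight xdist workers runs `npm ci` at a time; the rest block until the build directory is safe to reuse.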

github-actions bot and others added 2 commits March 15, 2026 23:03
filelock is not installed in CI. Since we don't support Windows,
use the stdlib fcntl module for cross-worker dashboard build locking.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wrap bundle staging in try/except in stage_bundle() so tmpfs is
  unmounted if extract_bundle_to or write_workdir_files raises
- Keep failed umounts in _tmpfs_mounts so later cleanup passes can retry
  instead of silently leaking RAM-backed mounts
- Remove redundant mkdir in _mount_tmpfs (workdir already created by
  task_attempt._setup)
- Remove tests violating TESTING.md: private attribute assertions
  (_tmpfs_mounts, _resolved_mounts), internal dispatch tests, trivial
  constructor round-trips
- Add test for new staging failure cleanup behavior

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
@rjpower rjpower requested a review from ravwojdyla March 15, 2026 23:09
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: af32041e3e

if rm.container_path == self.config.workdir:
    path = Path(rm.host_path)
    if path.exists():
        return int(shutil.disk_usage(path).used / (1024 * 1024))

P2: Compute Docker disk metrics from workdir contents

disk_usage_mb() now returns shutil.disk_usage(path).used, which is bytes used by the entire backing filesystem, not by this task's workdir tree. For Docker tasks where /app is just a subdirectory (common when resources.disk_bytes is unset, or when many task dirs share /dev/shm/iris), tasks will report host/tmpfs-wide usage instead of per-task usage, making resource_usage.disk_mb misleading for debugging and capacity analysis.

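The distinction the reviewer draws can be seen by comparing a tree walk (what `du -sm` measured) with the filesystem-level figure (hypothetical helper names for illustration):

```python
import os
import shutil
from pathlib import Path

def tree_size_mb(root: Path) -> int:
    """Per-directory usage: sum of file sizes under root (du-style)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished mid-walk
    return total // (1024 * 1024)

def fs_used_mb(root: Path) -> int:
    """Filesystem-wide usage: what shutil.disk_usage reports. This only
    equals tree_size_mb when root is the sole content of its own mount
    (e.g. a dedicated tmpfs), which is what the later commits arrange."""
    return shutil.disk_usage(root).used // (1024 * 1024)
```

When `/app` is just a subdirectory of a shared filesystem, `fs_used_mb` overstates per-task usage by everything else on that filesystem.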

Contributor

@rjpower this is an issue?

Comment thread lib/iris/src/iris/cluster/runtime/kubernetes.py
Contributor

@ravwojdyla left a comment

++1, I like that we are still keeping the caches on shm, makes sense.

Some notes:

  • #3696 (comment)
  • could docker.py:_resolve_fast_io_dir() and task_attempt.py:get_fast_io_dir() be consolidated?

Collaborator Author

rjpower commented Mar 16, 2026

@claude address the comments from Rafal.

For the disk utilization, let's make an actual tmpfs for the workdir mounts, that will give us isolation and accurate reporting. Let's also bump the default workdir size to 10gb to avoid users failing too often by default.

Contributor

claude bot commented Mar 16, 2026

Claude finished @rjpower's task in 5m 53s


Addressing Rafal's comments + tmpfs workdir isolation

  • Read current source files (docker.py, task_attempt.py, process.py, types.py, env_probe.py)
  • Consolidate _resolve_fast_io_dir() and get_fast_io_dir() into one function in types.py
  • Always mount tmpfs for Docker workdirs (10GB default) for isolation + accurate disk reporting
  • Update tests for new default-tmpfs behavior
  • Run pre-commit and tests
  • Commit and push

Changes made

types.py — Added canonical get_fast_io_dir() function (moved from both docker.py and task_attempt.py). Single source of truth for tmpfs detection.

docker.py:

  • Removed duplicate _resolve_fast_io_dir() and _TMPFS_DIR/_TMPFS_MIN_FREE_BYTES constants
  • Imports get_fast_io_dir from types.py
  • Added DEFAULT_WORKDIR_DISK_BYTES = 10 * 1024 * 1024 * 1024 (10 GB)
  • stage_bundle(): Now always mounts tmpfs when workdir_mount is provided (even with size_bytes=0), using the 10GB default. This gives every Docker task its own isolated tmpfs, making shutil.disk_usage() report accurate per-task disk metrics

task_attempt.py:

  • Removed duplicate get_fast_io_dir(), _TMPFS_DIR, _TMPFS_MIN_FREE_BYTES
  • Imports get_fast_io_dir from types.py

test_docker_runtime.py — Renamed test_stage_bundle_no_tmpfs_when_zero_size to test_stage_bundle_uses_default_size_when_zero and updated it to verify that size_bytes=0 now mounts tmpfs with the 10GB default.

test_fast_io.py — Updated imports to point to new canonical location (iris.cluster.runtime.types).

Testing

Ran uv run pytest tests/cluster/runtime/test_docker_runtime.py tests/cluster/worker/test_fast_io.py tests/cluster/worker/test_worker.py -m "not e2e" -o "addopts=" -v: 39 passed, 2 skipped (skips are integration tests requiring infrastructure).

…dirs

Address Rafal's review comments:
- Consolidate duplicated _resolve_fast_io_dir() (docker.py) and
  get_fast_io_dir() (task_attempt.py) into a single get_fast_io_dir()
  in types.py
- Always mount tmpfs for Docker workdirs (even when disk_bytes=0),
  defaulting to 10GB. This gives per-task isolation and makes
  shutil.disk_usage report accurate per-task disk metrics instead of
  filesystem-wide usage
- Update tests for new default-tmpfs behavior

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator Author

rjpower commented Mar 16, 2026

/iris-smoke

…oncurrency

The `mount` utility fails with spurious errors ("tmpfs already mounted on
/dev") when many task threads invoke it concurrently on distinct paths under
/dev/shm. Add a _mount_lock (like the existing _pull_lock for docker pull)
to serialize mount/unmount syscalls, and add a post-failure ismount fallback
for any race that slips through.

Adds tests for race recovery, genuine mount failure, and concurrency
serialization.
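The lock-plus-fallback approach described above can be sketched as follows (a simplified sketch: `mount_tmpfs` is illustrative and omits the PR's unmount tracking):

```python
import os
import subprocess
import threading
from pathlib import Path

# Serialize mount(8) invocations, analogous to the existing _pull_lock
# that serializes `docker pull`.
_mount_lock = threading.Lock()

def mount_tmpfs(workdir: Path, size_bytes: int) -> None:
    """Mount a bounded tmpfs under a lock, with a post-failure ismount check.

    A unique device name (iris-<workdir-name>) sidesteps util-linux's
    duplicate-mount detection colliding with Docker's "tmpfs /dev" entry
    in /proc/mounts, per the later commit in this PR.
    """
    device = f"iris-{workdir.name}"
    with _mount_lock:
        result = subprocess.run(
            ["mount", "-t", "tmpfs", "-o", f"size={size_bytes}", device, str(workdir)],
            capture_output=True,
            text=True,
        )
    if result.returncode != 0:
        # A concurrent racer may have mounted it already; only raise if
        # the path genuinely is not a mountpoint.
        if not os.path.ismount(workdir):
            raise RuntimeError(f"tmpfs mount failed: {result.stderr}")
```

The lock narrows the race window; the `ismount` check makes the operation idempotent for any race that slips through anyway.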
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: ef46478462

Comment on lines +536 to +539
workdir=self.workdir,
workdir_files=dict(self.request.entrypoint.workdir_files),
bundle_store=self._bundle_store,
workdir_mount=workdir_mount,

P1: Unmount staged tmpfs when task aborts before container creation

_download_bundle() now mounts tmpfs by calling stage_bundle(..., workdir_mount=...) before _create_container(), but cleanup only calls handle.cleanup() when _container_handle exists. If cancellation happens at the _check_cancelled() right after download, or create_container() fails, the tmpfs mount is never released and rmtree later cannot remove the busy mountpoint, so repeated early aborts leak RAM-backed mounts on workers.


  bundle_id = self.config.env.get("IRIS_BUNDLE_ID", "")
  if bundle_id or workdir_files:
-     stage_mounts = [{"name": "workdir", "mountPath": self.config.workdir, "readOnly": False}]
+     stage_mounts = [{"name": workdir_volume_name, "mountPath": self.config.workdir, "readOnly": False}]

P2: Require a WORKDIR mount before wiring the staging init container

workdir_volume_name is optional and defaults to None, but when bundle_id or workdir_files is present the init container always uses it in stage_mounts. For ContainerConfig instances that still rely on the default empty mounts list, this emits a volumeMount with name: null, and Kubernetes rejects the pod manifest before the task starts.


stage_bundle() was mounting tmpfs before the container handle existed,
leaking RAM-backed mounts when tasks were cancelled between download
and container creation. Move tmpfs provisioning into resolve_mounts()
so the mount lifecycle is tied to the container handle (which already
unmounts in cleanup()). Also move get_fast_io_dir to docker.py — only
Docker needs it; task_attempt no longer knows about fast IO dirs.
Collaborator Author

rjpower commented Mar 16, 2026

/iris-smoke

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 21cce8cef4

if workdir_host_path is None:
    raise RuntimeError("WORKDIR mount requires workdir_host_path")
size = mount.size_bytes if mount.size_bytes > 0 else DEFAULT_WORKDIR_DISK_BYTES
self._mount_tmpfs(workdir_host_path, size)

P1: Mount tmpfs before staging task bundle

Mounting tmpfs here happens during create_container, but TaskAttempt.run() stages the bundle/workdir files earlier (_download_bundle runs before _create_container in lib/iris/src/iris/cluster/worker/task_attempt.py). On Linux, mounting tmpfs on a populated directory hides the previously extracted files, so Docker tasks with a non-empty bundle/workdir payload can start with an empty /app and fail to find user code.


if config.workdir_host_path:
    result[mount.container_path] = str(config.workdir_host_path)
elif mount.kind in (MountKind.CACHE, MountKind.TMPFS):
    if cache_dir:

P1: Require cache_dir for process mount resolution

This conditional silently drops CACHE/TMPFS mounts when cache_dir is unset, and the local platform still constructs ProcessRuntime() without a cache dir (lib/iris/src/iris/cluster/platform/local.py:413). As a result, process-mode tasks stop remapping /uv/cache and /root/.cargo/* paths in env/argv, leaving UV_PYTHON_INSTALL_DIR/CARGO_TARGET_DIR pointing at root-level paths that commonly fail on non-root local workers.


…stage ordering

Remove get_fast_io_dir() indirection and /dev/shm/iris bootstrap ceremony.
Workers have SYS_ADMIN so they can mount tmpfs anywhere — just use /dev/shm
as the hardcoded fast IO base.

Add prepare_workdir() to ContainerRuntime protocol so tmpfs is mounted
BEFORE bundle staging (fixes P1 where mounting after staging hid files).
Make ProcessRuntime.cache_dir required to fix silent CACHE/TMPFS mount drops.
Collaborator Author

rjpower commented Mar 16, 2026

/iris-smoke

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: ab30adb260

--network=host \\
--cap-add SYS_ADMIN \\
--ulimit core=0:0 \\
-v {{ cache_dir }}:{{ cache_dir }} \\

P1: Enable shared propagation on worker cache bind mount

DockerRuntime now mounts tmpfs workdirs inside the worker process (_mount_tmpfs in lib/iris/src/iris/cluster/runtime/docker.py), but the worker creates task containers via the host daemon through /var/run/docker.sock; with this plain cache bind (-v {{ cache_dir }}:{{ cache_dir }}) and no shared propagation, those inner mounts are not propagated to the host namespace. In the default bootstrap path, bundle staging and generated scripts can be written to the worker-only tmpfs while the task container sees the underlying host directory, leading to missing files under /app at runtime.


Comment thread lib/iris/tests/e2e/conftest.py Outdated
subprocess.run(["npm", "ci"], cwd=dashboard_dir, check=True, capture_output=True)
subprocess.run(["npm", "run", "build"], cwd=dashboard_dir, check=True, capture_output=True)

import fcntl

P1: Move local fcntl import to module scope

/workspace/marin/AGENTS.md requires "All imports at the top of the file" and disallows local imports outside narrow exceptions; this introduces import fcntl inside a normal fixture path, which violates the repo rule and makes import/lint behavior less predictable. Please move fcntl into the module-level import block.


rjpower added 3 commits March 16, 2026 11:47
…ule scope

Add :shared to the cache_dir bind mount so that tmpfs mounts made by
the worker process propagate to task containers created via the host
Docker daemon. Move fcntl import to module scope per AGENTS.md.
stop_all uses --async gcloud delete, so the old controller can still be
alive when _clear_remote_state runs. If it writes a checkpoint after the
wipe, the new controller restores stale jobs. Fix: call stop_controller
with wait=True in the smoke test, polling until the VM is gone.
Remove --async from GcpStandaloneWorkerHandle.terminate() (only used for
controller VMs). The --async flag was added to unblock the heartbeat
thread, but that path uses SliceHandle.terminate(), not this class.
Synchronous deletion prevents the stale checkpoint race where a dying
controller writes state after remote state is cleared.
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 867ad728d0

Comment on lines +768 to +770
if now - last_disk_check >= _DISK_CHECK_INTERVAL_SECONDS:
    self.disk_mb = handle.disk_usage_mb()
    last_disk_check = now

P2: Refresh disk usage when tasks stop

This throttles disk sampling to every 60 seconds, but _monitor_loop exits as soon as status.phase == STOPPED before taking another sample. Because last_disk_check starts at 0, most short-lived tasks only record an early startup sample, so disk_mb can stay stale/near-zero even when the task writes significant data later in execution; this makes per-task disk telemetry misleading for the common sub-minute job case.

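One way to implement the suggested fix is to take a final sample when the loop observes the stop. A sketch (`monitor_disk` is hypothetical and heavily simplified from the real _monitor_loop):

```python
import time

_DISK_CHECK_INTERVAL_SECONDS = 60.0

def monitor_disk(handle, is_stopped, poll_interval=0.0, clock=time.monotonic):
    """Throttled disk sampling that always takes a final sample on stop.

    handle.disk_usage_mb() and the 60s throttle mirror the PR; the
    final-sample-on-exit behavior is the reviewer's suggested fix, not
    necessarily the merged code.
    """
    last_check = 0.0
    disk_mb = 0
    while not is_stopped():
        now = clock()
        if now - last_check >= _DISK_CHECK_INTERVAL_SECONDS:
            disk_mb = handle.disk_usage_mb()
            last_check = now
        time.sleep(poll_interval)
    # Task stopped: take one last sample so short-lived tasks still
    # report data written after the startup sample.
    return handle.disk_usage_mb()
```

This keeps the 60-second throttle for long-running tasks while ensuring sub-minute tasks report at least one post-write measurement.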

rjpower added 2 commits March 16, 2026 13:00
…collision

The util-linux mount command has duplicate-mount detection that collides
with Docker's "tmpfs /dev tmpfs" entry in /proc/mounts when the device
arg is also "tmpfs". Use a unique device name (iris-<workdir-name>) so
mount(8) doesn't think we're re-mounting /dev.
…rectly

The worker container can't mount tmpfs and have task containers (created
via host docker.sock) see it. Instead of fighting mount propagation,
set cache_dir to /dev/shm/iris — already RAM-backed — and use it as a
plain directory. No mount calls, no SYS_ADMIN, no /dev/shm bind mount.

- Default cache_dir: /var/cache/iris → /dev/shm/iris
- Bootstrap: drop --cap-add SYS_ADMIN, :shared, /dev/shm bind
- DockerRuntime: delete _mount_tmpfs, release_tmpfs, _fast_io_dir
- CACHE/TMPFS mounts resolve under cache_dir instead of /dev/shm
- GCP: make VM deletion synchronous (drop --async)
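The /dev/shm approach in this commit can be sketched as a plain-directory setup with no mount calls at all; since /dev/shm is already a tmpfs on Linux, a subdirectory of it is RAM-backed for free. The names `IRIS_CACHE_DIR` and `prepare_workdir` below are illustrative, not the actual iris code.

```python
import os
from pathlib import Path

# /dev/shm is already tmpfs on Linux, so a subdirectory of it gives
# RAM-backed workdirs with no mount(8) call and no SYS_ADMIN capability.
DEFAULT_CACHE_DIR = Path(os.environ.get("IRIS_CACHE_DIR", "/dev/shm/iris"))

def prepare_workdir(task_name: str, cache_dir: Path = DEFAULT_CACHE_DIR) -> Path:
    """Create a task workdir as a plain directory under the cache root."""
    workdir = cache_dir / "workdirs" / task_name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir
```

Because task containers are created through the host docker.sock, a path under /dev/shm is visible to them without any mount-propagation tricks, which is what the removed `:shared` and SYS_ADMIN flags were trying to achieve.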
@rjpower rjpower enabled auto-merge (squash) March 16, 2026 21:01
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b505e020f1


volumes.append(
    {
        "name": volume_name,
        "hostPath": {"path": mount.container_path, "type": "DirectoryOrCreate"},

P1: Map Kubernetes CACHE mounts to worker cache_dir

KubernetesContainerHandle.run() now emits CACHE volumes with hostPath.path set to mount.container_path (for worker-generated mounts this is /uv/cache, /root/.cargo/registry, etc.), so task pods no longer use the configured worker cache root. In the normal worker flow (TaskAttempt._create_container), this regresses from the previous behavior where cache mounts were rooted under worker.cache_dir; on CoreWeave this pushes heavy caches onto the node root filesystem instead of the configured NVMe cache path and can trigger disk pressure/evictions during dependency sync.


@rjpower rjpower merged commit 9bcd504 into main Mar 16, 2026
21 of 22 checks passed
@rjpower rjpower deleted the multi/79686b7d branch March 16, 2026 21:12
@rjpower
Collaborator Author

rjpower commented Mar 16, 2026

@claude can you send a fix for this in a new PR?

P1: Map Kubernetes CACHE mounts to worker cache_dir

KubernetesContainerHandle.run() now emits CACHE volumes with hostPath.path set to mount.container_path (for worker-generated mounts this is /uv/cache, /root/.cargo/registry, etc.), so task pods no longer use the configured worker cache root. In the normal worker flow (TaskAttempt._create_container), this regresses from the previous behavior where cache mounts were rooted under worker.cache_dir; on CoreWeave this pushes heavy caches onto the node root filesystem instead of the configured NVMe cache path and can trigger disk pressure/evictions during dependency sync.

this is important, the root filesystem for CW/k8s is weak and can't handle this.

@claude
Contributor

claude bot commented Mar 16, 2026

Claude finished @rjpower's task in 3m 25s (branch: claude/pr-3696-20260316-2213)


Fix Kubernetes CACHE mounts to use worker cache_dir

  • Read kubernetes.py and understand the mount resolution bug
  • Read types.py for MountSpec/MountKind definitions
  • Implement fix: add cache_dir to KubernetesRuntime and map CACHE mounts under it
  • Update main.py to pass cache_dir
  • Add test for CACHE mount host path resolution
  • Run pre-commit and tests
  • Commit and push

Changes

kubernetes.py — Added cache_dir: Path | None to both KubernetesRuntime and KubernetesContainerHandle. CACHE mounts now compute hostPath as cache_dir / container_path.strip("/").replace("/", "-") (matching Docker runtime's pattern), e.g. /mnt/nvme/iris/uv-cache instead of /uv/cache. Falls back to container_path when cache_dir is unset.

main.py — Passes cache_dir=config.cache_dir when constructing KubernetesRuntime.

test_kubernetes_runtime.py — Added test_cache_mounts_use_cache_dir_host_path and test_cache_mounts_fallback_without_cache_dir.

Testing

Ran uv run pytest tests/cluster/runtime/test_kubernetes_runtime.py -v: 21 passed.

Pre-commit: all checks pass.


claude bot pushed a commit that referenced this pull request Mar 16, 2026
…tainer_path

CACHE mounts (uv, cargo) were using container_path (e.g. /uv/cache) as the
hostPath, pushing caches onto the node root filesystem. On CoreWeave this can
trigger disk pressure and evictions. Now cache_dir is threaded through
KubernetesRuntime -> KubernetesContainerHandle so CACHE mounts are rooted under
the configured NVMe cache path (e.g. /mnt/nvme/iris/uv-cache).

Fixes regression from #3696.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
rjpower added a commit that referenced this pull request Mar 16, 2026
CACHE mounts (uv, cargo) were using container_path as the hostPath,
pushing caches onto the node root filesystem. On CoreWeave this triggers
disk pressure. Now cache_dir is threaded through KubernetesRuntime so
CACHE mounts use the configured NVMe cache path.

Fixes regression from #3696.

Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Russell Power <rjpower@users.noreply.github.com>

Labels

agent-generated Created by automation/agent


Development

Successfully merging this pull request may close these issues.

Bound Docker task workdirs by requested disk

2 participants