iris: use tmpfs for task workdirs, replace du with disk_usage #3696
Conversation
```python
        return
    if sys.platform != "linux":
        raise RuntimeError("Docker workdir disk limits require Linux tmpfs mounts")
    workdir.mkdir(parents=True, exist_ok=True)
```
Nit: _setup() already calls workdir.mkdir(parents=True, exist_ok=True) on line 473 of task_attempt.py before calling prepare_workdir. This second mkdir is redundant.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c591c3e6b4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review (reviewed commit 0519b8fe0c)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Protocol: two new methods for runtime-specific workdir backing storage
- Docker: mount/unmount bounded tmpfs (Linux only)
- K8s: no-op methods + sizeLimit on emptyDir volume when disk_bytes set
- Process: no-op implementations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…with disk_usage

- Call runtime.prepare_workdir() after workdir creation in _setup()
- Call runtime.cleanup_workdir() before shutil.rmtree in _cleanup()
- Replace subprocess `du -sm` with shutil.disk_usage() in collect_workdir_size_mb
- Add prepare_workdir/cleanup_workdir mocks to test fixtures
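The `du -sm` replacement described above can be sketched as follows; the helper name `collect_workdir_size_mb` comes from the PR, but this exact body is an assumption, not the merged code:

```python
import shutil
from pathlib import Path

def collect_workdir_size_mb(workdir: Path) -> int:
    """Return MB used on the filesystem backing workdir, or 0 if missing.

    Note: shutil.disk_usage reports usage of the whole backing filesystem,
    so this is accurate per-task only when the workdir is its own mount.
    """
    if not workdir.exists():
        return 0
    return int(shutil.disk_usage(workdir).used / (1024 * 1024))
```

Unlike spawning `du`, this is a single statvfs call with no subprocess overhead, which is why the throttling question matters less here.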
- DockerRuntime: test prepare_workdir tmpfs mount/skip/reject/failure and cleanup_workdir mount/unmount behavior
- KubernetesRuntime: test emptyDir sizeLimit when disk_bytes is set
- env_probe: test collect_workdir_size_mb for existing and missing paths
- Worker lifecycle: verify prepare_workdir/cleanup_workdir are called
…age semantic shift

- Change cleanup_workdir to log a warning on umount failure instead of raising RuntimeError
- Update collect_workdir_size_mb docstring to document shared-disk approximation behavior
- Add test for warning-on-failure behavior
…e WorkdirSpec

Replace the prepare_workdir/cleanup_workdir protocol methods (which required no-op stubs on the K8s and Process runtimes) with a declarative WorkdirSpec that flows through stage_bundle. The runtime interprets the spec according to its capabilities: Docker mounts tmpfs, K8s uses emptyDir sizeLimit, Process ignores it.

Move disk usage monitoring from shutil.disk_usage in env_probe into a disk_usage_mb() method on ContainerHandle, throttled to ~60s intervals instead of every 5s poll cycle. K8s returns 0 since the workdir isn't observable from the worker node.
The worker container needs CAP_SYS_ADMIN to mount tmpfs workdirs for per-task disk quota enforcement. Without it, any task requesting disk_bytes fails with "permission denied" during setup.
💡 Codex Review (reviewed commit 2db0bdd66f)
```python
if workdir_spec and workdir_spec.disk_bytes > 0:
    self._mount_tmpfs(workdir, workdir_spec.disk_bytes)
if bundle_id:
    bundle_store.extract_bundle_to(bundle_id, workdir)
```
Unmount tmpfs if bundle staging raises
This mounts tmpfs before extract_bundle_to/write_workdir_files, but there is no rollback if either staging call throws. In the current task flow, tmpfs unmounting happens in DockerContainerHandle.cleanup(), which is only reachable after create_container; staging failures occur earlier, so the mount remains and rmtree later hits a busy mountpoint. Repeated task failures with disk_bytes set will leak RAM-backed mounts on the worker.
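A minimal sketch of the rollback this review asks for, assuming the `_mount_tmpfs` and `extract_bundle_to` names from the diff and a hypothetical `_umount_tmpfs` for the unwind path:

```python
def stage_bundle(workdir, workdir_spec, bundle_id, bundle_store, runtime):
    mounted = False
    if workdir_spec and workdir_spec.disk_bytes > 0:
        runtime._mount_tmpfs(workdir, workdir_spec.disk_bytes)
        mounted = True
    try:
        if bundle_id:
            bundle_store.extract_bundle_to(bundle_id, workdir)
    except Exception:
        if mounted:
            # Roll back so the later rmtree doesn't hit a busy mountpoint
            # and RAM-backed mounts don't accumulate across failed attempts.
            runtime._umount_tmpfs(workdir)
        raise
```

The key property is that the mount and its unwind live in the same function, so a staging failure cannot leave the mount orphaned.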
pytest-xdist spawns 8 workers, each with its own session fixture, so all 8 raced to run `npm ci` in the same dashboard/ directory, corrupting node_modules and causing exit code 190 non-deterministically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review (reviewed commit 2c568c5b78)
…Spec

Replace the Docker-centric `(host_path, container_path, mode)` tuple representation and the parallel `WorkdirSpec` concept with a single semantic `MountSpec` dataclass. Each mount declares its kind (WORKDIR, TMPFS, CACHE) and the runtime resolves host paths at container creation.

Key changes:
- Add MountKind enum and MountSpec frozen dataclass to types.py
- Remove WorkdirSpec entirely
- Docker runtime gains ResolvedMount dataclass, resolve_mounts() method, and cache_dir constructor param for host-path resolution
- K8s runtime iterates MountSpecs uniformly: WORKDIR/TMPFS → emptyDir, CACHE → hostPath. Fixes the duplicate volumes.append bug.
- Process runtime resolves mount map from MountSpecs via cache_dir
- task_attempt builds semantic MountSpec list instead of host-path tuples
- stage_bundle always receives workdir_mount (not conditional on disk_bytes)
💡 Codex Review (reviewed commit 87fbe9e145)
…_container

create_container() accepted workdir_host_path as a keyword arg, but the ContainerRuntime protocol and all callers pass only config. Read the path from config.workdir_host_path instead, which task_attempt already sets, so Docker tasks with WORKDIR mounts resolve correctly.
@claude address review comments, let's clean up mount points if setup fails. Also remove all dumb tests that violate lib/iris/AGENTS.md guidelines
Claude finished @rjpower's task in 6m 2s: Addressing review comments
@claude let's also fix CI: `E ModuleNotFoundError: No module named 'filelock'`. Should we just use fcntl instead of filelock? Or install filelock, your call. We don't support Windows.
Claude finished @rjpower's task in 2m 4s: Fixed CI (replaced filelock with stdlib fcntl)
filelock is not installed in CI. Since we don't support Windows, use the stdlib fcntl module for cross-worker dashboard build locking.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
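The cross-worker locking described here can be sketched with stdlib `fcntl.flock`; the `build_lock` helper and lock-file location are illustrative, not the PR's exact code:

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def build_lock(lock_path: Path):
    """Exclusive advisory lock so only one xdist worker runs `npm ci`."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the current holder releases
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

flock locks are per-open-file and released automatically if the process dies, which is exactly the crash-safety property filelock was providing, minus the extra dependency.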
- Wrap bundle staging in try/except in stage_bundle() so tmpfs is unmounted if extract_bundle_to or write_workdir_files raises
- Keep failed umounts in _tmpfs_mounts so later cleanup passes can retry instead of silently leaking RAM-backed mounts
- Remove redundant mkdir in _mount_tmpfs (workdir already created by task_attempt._setup)
- Remove tests violating TESTING.md: private attribute assertions (_tmpfs_mounts, _resolved_mounts), internal dispatch tests, trivial constructor round-trips
- Add test for new staging failure cleanup behavior

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
💡 Codex Review (reviewed commit af32041e3e)
```python
if rm.container_path == self.config.workdir:
    path = Path(rm.host_path)
    if path.exists():
        return int(shutil.disk_usage(path).used / (1024 * 1024))
```
Compute Docker disk metrics from workdir contents
disk_usage_mb() now returns shutil.disk_usage(path).used, which is bytes used by the entire backing filesystem, not by this task's workdir tree. For Docker tasks where /app is just a subdirectory (common when resources.disk_bytes is unset, or when many task dirs share /dev/shm/iris), tasks will report host/tmpfs-wide usage instead of per-task usage, making resource_usage.disk_mb misleading for debugging and capacity analysis.
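For contrast, a per-tree measurement that would avoid the filesystem-wide number; `workdir_tree_mb` is a hypothetical helper, not part of the PR:

```python
import os

def workdir_tree_mb(root: str) -> int:
    """Sum file sizes under root only, rather than the whole backing filesystem."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished mid-walk
    return total // (1024 * 1024)
```

The walk costs one stat per file (closer to `du`'s cost profile), while `shutil.disk_usage` is one syscall; the PR later resolves this tension by giving each workdir its own tmpfs so the cheap call is also the accurate one.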
ravwojdyla left a comment
+1, I like that we are still keeping the caches on shm, makes sense.
Some notes:
- #3696 (comment)
- could `docker.py:_resolve_fast_io_dir()` and `task_attempt.py:get_fast_io_dir()` be consolidated?
@claude address the comments from Rafal. For the disk utilization, let's make an actual tmpfs for the workdir mounts; that will give us isolation and accurate reporting. Let's also bump the default workdir size to 10GB to avoid users failing too often by default.
Claude finished @rjpower's task in 5m 53s: Addressing Rafal's comments + tmpfs workdir isolation
…dirs

Address Rafal's review comments:
- Consolidate duplicated _resolve_fast_io_dir() (docker.py) and get_fast_io_dir() (task_attempt.py) into a single get_fast_io_dir() in types.py
- Always mount tmpfs for Docker workdirs (even when disk_bytes=0), defaulting to 10GB. This gives per-task isolation and makes shutil.disk_usage report accurate per-task disk metrics instead of filesystem-wide usage
- Update tests for new default-tmpfs behavior

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/iris-smoke
…oncurrency
The `mount` utility fails with spurious errors ("tmpfs already mounted on
/dev") when many task threads invoke it concurrently on distinct paths under
/dev/shm. Add a _mount_lock (like the existing _pull_lock for docker pull)
to serialize mount/unmount syscalls, and add a post-failure ismount fallback
for any race that slips through.
Adds tests for race recovery, genuine mount failure, and concurrency
serialization.
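The serialization described above can be sketched as below, assuming a module-level lock and an `ismount` fallback; the injectable `runner` parameter exists only to keep this example testable and is not part of the PR:

```python
import os
import subprocess
import threading

_mount_lock = threading.Lock()  # serialize mount(8), like the existing _pull_lock

def mount_tmpfs(path: str, size_bytes: int, runner=subprocess.run) -> None:
    with _mount_lock:
        result = runner(
            ["mount", "-t", "tmpfs", "-o", f"size={size_bytes}", "tmpfs", path],
            capture_output=True,
            text=True,
        )
    # Race fallback: if mount(8) reported failure but the path is already a
    # mountpoint, a concurrent mount won the race and the mount is in place.
    if result.returncode != 0 and not os.path.ismount(path):
        raise RuntimeError(f"tmpfs mount failed for {path}: {result.stderr}")
```

The lock removes the spurious "already mounted" errors; the `ismount` check turns any residual race into a no-op instead of a task failure.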
💡 Codex Review (reviewed commit ef46478462)
```python
workdir=self.workdir,
workdir_files=dict(self.request.entrypoint.workdir_files),
bundle_store=self._bundle_store,
workdir_mount=workdir_mount,
```
Unmount staged tmpfs when task aborts before container creation
_download_bundle() now mounts tmpfs by calling stage_bundle(..., workdir_mount=...) before _create_container(), but cleanup only calls handle.cleanup() when _container_handle exists. If cancellation happens at the _check_cancelled() right after download, or create_container() fails, the tmpfs mount is never released and rmtree later cannot remove the busy mountpoint, so repeated early aborts leak RAM-backed mounts on workers.
```diff
  bundle_id = self.config.env.get("IRIS_BUNDLE_ID", "")
  if bundle_id or workdir_files:
-     stage_mounts = [{"name": "workdir", "mountPath": self.config.workdir, "readOnly": False}]
+     stage_mounts = [{"name": workdir_volume_name, "mountPath": self.config.workdir, "readOnly": False}]
```
Require a WORKDIR mount before wiring the staging init container
workdir_volume_name is optional and defaults to None, but when bundle_id or workdir_files is present the init container always uses it in stage_mounts. For ContainerConfig instances that still rely on the default empty mounts list, this emits a volumeMount with name: null, and Kubernetes rejects the pod manifest before the task starts.
stage_bundle() was mounting tmpfs before the container handle existed, leaking RAM-backed mounts when tasks were cancelled between download and container creation. Move tmpfs provisioning into resolve_mounts() so the mount lifecycle is tied to the container handle (which already unmounts in cleanup()). Also move get_fast_io_dir to docker.py — only Docker needs it; task_attempt no longer knows about fast IO dirs.
/iris-smoke
💡 Codex Review (reviewed commit 21cce8cef4)
```python
if workdir_host_path is None:
    raise RuntimeError("WORKDIR mount requires workdir_host_path")
size = mount.size_bytes if mount.size_bytes > 0 else DEFAULT_WORKDIR_DISK_BYTES
self._mount_tmpfs(workdir_host_path, size)
```
Mount tmpfs before staging task bundle
Mounting tmpfs here happens during create_container, but TaskAttempt.run() stages the bundle/workdir files earlier (_download_bundle runs before _create_container in lib/iris/src/iris/cluster/worker/task_attempt.py). On Linux, mounting tmpfs on a populated directory hides the previously extracted files, so Docker tasks with a non-empty bundle/workdir payload can start with an empty /app and fail to find user code.
```python
if config.workdir_host_path:
    result[mount.container_path] = str(config.workdir_host_path)
elif mount.kind in (MountKind.CACHE, MountKind.TMPFS):
    if cache_dir:
```
Require cache_dir for process mount resolution
This conditional silently drops CACHE/TMPFS mounts when cache_dir is unset, and the local platform still constructs ProcessRuntime() without a cache dir (lib/iris/src/iris/cluster/platform/local.py:413). As a result, process-mode tasks stop remapping /uv/cache and /root/.cargo/* paths in env/argv, leaving UV_PYTHON_INSTALL_DIR/CARGO_TARGET_DIR pointing at root-level paths that commonly fail on non-root local workers.
…stage ordering

Remove get_fast_io_dir() indirection and /dev/shm/iris bootstrap ceremony. Workers have SYS_ADMIN so they can mount tmpfs anywhere; just use /dev/shm as the hardcoded fast IO base.

Add prepare_workdir() to ContainerRuntime protocol so tmpfs is mounted BEFORE bundle staging (fixes P1 where mounting after staging hid files). Make ProcessRuntime.cache_dir required to fix silent CACHE/TMPFS mount drops.
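The mount-then-stage ordering can be sketched as below; `prepare_workdir` matches the protocol method named in the commit, while `setup_task` and the store API are illustrative:

```python
def setup_task(runtime, bundle_store, workdir, bundle_id):
    """Mount first, stage second: files extracted after prepare_workdir land
    on the tmpfs instead of being hidden by a later mount over the directory."""
    runtime.prepare_workdir(workdir)                    # mount tmpfs (no-op on k8s/process)
    bundle_store.extract_bundle_to(bundle_id, workdir)  # stage onto the mount
```

Mounting over a populated directory doesn't delete the files, it shadows them, which is why the reversed order produced an apparently empty /app.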
/iris-smoke
💡 Codex Review (reviewed commit ab30adb260)
```shell
--network=host \
--cap-add SYS_ADMIN \
--ulimit core=0:0 \
-v {{ cache_dir }}:{{ cache_dir }} \
```
Enable shared propagation on worker cache bind mount
DockerRuntime now mounts tmpfs workdirs inside the worker process (_mount_tmpfs in lib/iris/src/iris/cluster/runtime/docker.py), but the worker creates task containers via the host daemon through /var/run/docker.sock; with this plain cache bind (-v {{ cache_dir }}:{{ cache_dir }}) and no shared propagation, those inner mounts are not propagated to the host namespace. In the default bootstrap path, bundle staging and generated scripts can be written to the worker-only tmpfs while the task container sees the underlying host directory, leading to missing files under /app at runtime.
```python
subprocess.run(["npm", "ci"], cwd=dashboard_dir, check=True, capture_output=True)
subprocess.run(["npm", "run", "build"], cwd=dashboard_dir, check=True, capture_output=True)
```

```python
import fcntl
```
Move local fcntl import to module scope
/workspace/marin/AGENTS.md requires "All imports at the top of the file" and disallows local imports outside narrow exceptions; this introduces import fcntl inside a normal fixture path, which violates the repo rule and makes import/lint behavior less predictable. Please move fcntl into the module-level import block.
…ule scope

Add :shared to the cache_dir bind mount so that tmpfs mounts made by the worker process propagate to task containers created via the host Docker daemon. Move fcntl import to module scope per AGENTS.md.
stop_all uses --async gcloud delete, so the old controller can still be alive when _clear_remote_state runs. If it writes a checkpoint after the wipe, the new controller restores stale jobs. Fix: call stop_controller with wait=True in the smoke test, polling until the VM is gone.
Remove --async from GcpStandaloneWorkerHandle.terminate() (only used for controller VMs). The --async flag was added to unblock the heartbeat thread, but that path uses SliceHandle.terminate(), not this class. Synchronous deletion prevents the stale checkpoint race where a dying controller writes state after remote state is cleared.
💡 Codex Review (reviewed commit 867ad728d0)
```python
if now - last_disk_check >= _DISK_CHECK_INTERVAL_SECONDS:
    self.disk_mb = handle.disk_usage_mb()
    last_disk_check = now
```
Refresh disk usage when tasks stop
This throttles disk sampling to every 60 seconds, but _monitor_loop exits as soon as status.phase == STOPPED before taking another sample. Because last_disk_check starts at 0, most short-lived tasks only record an early startup sample, so disk_mb can stay stale/near-zero even when the task writes significant data later in execution; this makes per-task disk telemetry misleading for the common sub-minute job case.
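A sketch of the suggested fix: take one final sample when the loop exits so sub-minute tasks record their last write. The interval constant matches the diff; the function shape, `poll_seconds`, and the injectable `clock` are assumptions for illustration:

```python
import time

_DISK_CHECK_INTERVAL_SECONDS = 60.0

def monitor_disk(handle, is_stopped, poll_seconds=5.0, clock=time.monotonic):
    """Throttled disk sampling with a final sample on exit."""
    last_check = 0.0
    while not is_stopped():
        now = clock()
        if now - last_check >= _DISK_CHECK_INTERVAL_SECONDS:
            handle.disk_usage_mb()  # periodic sample, at most ~1/minute
            last_check = now
        time.sleep(poll_seconds)
    # One last sample after the task stops, so short-lived tasks
    # don't report a stale near-zero value.
    return handle.disk_usage_mb()
```

The final sample costs one extra call per task, which is negligible next to the 60s throttle it complements.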
…collision

The util-linux mount command has duplicate-mount detection that collides with Docker's "tmpfs /dev tmpfs" entry in /proc/mounts when the device arg is also "tmpfs". Use a unique device name (iris-<workdir-name>) so mount(8) doesn't think we're re-mounting /dev.
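The unique-device fix reduces to how the mount argv is built; this helper only constructs the command and is an illustration, not the PR's code:

```python
from pathlib import Path

def tmpfs_mount_cmd(workdir: Path, size_bytes: int) -> list[str]:
    # Unique per workdir, never the literal "tmpfs" device name that
    # collides with Docker's "tmpfs /dev tmpfs" line in /proc/mounts.
    device = f"iris-{workdir.name}"
    return ["mount", "-t", "tmpfs", "-o", f"size={size_bytes}", device, str(workdir)]
```

For tmpfs the device field is an arbitrary label, so any unique string works; it also makes per-task mounts easy to spot in /proc/mounts.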
…rectly

The worker container can't mount tmpfs and have task containers (created via host docker.sock) see it. Instead of fighting mount propagation, set cache_dir to /dev/shm/iris (already RAM-backed) and use it as a plain directory. No mount calls, no SYS_ADMIN, no /dev/shm bind mount.

- Default cache_dir: /var/cache/iris → /dev/shm/iris
- Bootstrap: drop --cap-add SYS_ADMIN, :shared, /dev/shm bind
- DockerRuntime: delete _mount_tmpfs, release_tmpfs, _fast_io_dir
- CACHE/TMPFS mounts resolve under cache_dir instead of /dev/shm
- GCP: make VM deletion synchronous (drop --async)
💡 Codex Review (reviewed commit b505e020f1)
```python
volumes.append(
    {
        "name": volume_name,
        "hostPath": {"path": mount.container_path, "type": "DirectoryOrCreate"},
```
Map Kubernetes CACHE mounts to worker cache_dir
KubernetesContainerHandle.run() now emits CACHE volumes with hostPath.path set to mount.container_path (for worker-generated mounts this is /uv/cache, /root/.cargo/registry, etc.), so task pods no longer use the configured worker cache root. In the normal worker flow (TaskAttempt._create_container), this regresses from the previous behavior where cache mounts were rooted under worker.cache_dir; on CoreWeave this pushes heavy caches onto the node root filesystem instead of the configured NVMe cache path and can trigger disk pressure/evictions during dependency sync.
@claude can you send a fix for this in a new PR? P1 Badge: "Map Kubernetes CACHE mounts to worker cache_dir" (quoted above). This is important, the root filesystem for CW/k8s is weak and can't handle this.
Claude finished @rjpower's task in 3m 25s: Fix Kubernetes CACHE mounts to use worker cache_dir

Testing: all pre-commit checks pass.
…tainer_path

CACHE mounts (uv, cargo) were using container_path (e.g. /uv/cache) as the hostPath, pushing caches onto the node root filesystem. On CoreWeave this can trigger disk pressure and evictions. Now cache_dir is threaded through KubernetesRuntime -> KubernetesContainerHandle so CACHE mounts are rooted under the configured NVMe cache path (e.g. /mnt/nvme/iris/uv-cache). Fixes regression from #3696.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
CACHE mounts (uv, cargo) were using container_path as the hostPath, pushing caches onto the node root filesystem. On CoreWeave this triggers disk pressure. Now cache_dir is threaded through KubernetesRuntime so CACHE mounts use the configured NVMe cache path. Fixes regression from #3696.

Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
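A sketch of rooting CACHE hostPath volumes under the worker's cache_dir; the dict shape mirrors a pod-spec volume entry, and `cache_volume` is a hypothetical helper rather than the PR's exact code:

```python
from pathlib import Path

def cache_volume(cache_dir: str, volume_name: str, container_path: str) -> dict:
    # e.g. container_path /uv/cache under cache_dir /mnt/nvme/iris
    # resolves to hostPath /mnt/nvme/iris/uv/cache on the node.
    host_path = str(Path(cache_dir) / container_path.lstrip("/"))
    return {
        "name": volume_name,
        "hostPath": {"path": host_path, "type": "DirectoryOrCreate"},
    }
```

The lstrip is what prevents the absolute container path from replacing cache_dir entirely when joined, which was the shape of the original bug.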
This should prevent us from overloading system memory with temp mounts.
Closes #3391