
iris: pin tunnel port scan to start at 10000 #3679

Merged

rjpower merged 2 commits into main from will/dev-ex-iris on Mar 14, 2026

Conversation


@Helw150 Helw150 commented Mar 14, 2026

Makes the dashboard port stable across restarts so developers can bookmark or refresh the page without hunting for a new port each time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Helw150 Helw150 requested review from ravwojdyla and rjpower March 14, 2026 21:49

claude bot commented Mar 14, 2026

Claude finished @Helw150's task in 40s


Review of PR #3679

  • Read changed files and understand the diff
  • Review for correctness and issues
  • Post review feedback

LGTM — clean, minimal change.

The diff passes start=10000 to find_free_port() in both _coreweave_tunnel and _gcp_tunnel, which switches from kernel-assigned ephemeral ports (typically 32768–60999) to a sequential scan starting at 10000. The find_free_port implementation in base.py:97 scans range(10000, 11000), so this gives a stable, predictable port range that developers can bookmark.

No issues found. The change is consistent across both platforms and the start parameter is already well-supported by the existing find_free_port API.
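For context, a minimal sketch of what a find_free_port with a start parameter typically looks like (an illustration only, not the actual base.py implementation):

```python
import socket

def find_free_port(start: int = 10000, limit: int = 1000) -> int:
    """Return the first port in [start, start + limit) that we can bind."""
    for port in range(start, start + limit):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
            except OSError:
                continue  # port in use, try the next one
            return port
    raise RuntimeError(f"no free port in [{start}, {start + limit})")
```

Compared to binding port 0 and letting the kernel pick an ephemeral port, the sequential scan makes restarts land on the same port whenever nothing else holds it.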


rjpower commented Mar 14, 2026

I like this (I had it originally) but @dlwh's bots would try to open multiple tunnels in parallel and then stomp on each other. Maybe someone has a suggestion for how to avoid the stomping? We could fcntl a lock file while the tunnel is being set up?

@claude what do you think, could you make a lock around the port scan, maybe /tmp/iris/port_NNNNN with the pid? could you make that less than 5 lines?
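The fcntl idea could be sketched like this (hypothetical, assuming a single advisory lock file serializes the whole scan-and-setup section):

```python
import fcntl
import os

os.makedirs("/tmp/iris", exist_ok=True)
lock_fd = os.open("/tmp/iris/scan.lock", os.O_CREAT | os.O_RDWR, 0o644)
fcntl.flock(lock_fd, fcntl.LOCK_EX)  # blocks until any other scanner releases it
try:
    # Scan for a free port and start the tunnel subprocess here; no other
    # process holding this lock can race the scan in the meantime.
    pass
finally:
    fcntl.flock(lock_fd, fcntl.LOCK_UN)
    os.close(lock_fd)
```

One caveat: a single flock serializes all tunnel setups, whereas per-port PID files would let parallel tunnels proceed on different ports at once.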


claude bot commented Mar 14, 2026

Claude finished @rjpower's task in 2m 56s


Implementing port lock mechanism

  • Read find_free_port implementation
  • Read changed files for context
  • Implement PID-based lock file mechanism
  • Run pre-commit and tests
  • Push changes

Added PID-based lock files to find_free_port() in ~5 new lines. The mechanism:

  1. Before trying to bind a port, check if /tmp/iris/port_NNNNN exists with a live PID → skip it
  2. After successfully binding, write the lock file with os.getpid()
  3. Stale lock files (dead PIDs) are automatically ignored via ProcessLookupError
# Inside find_free_port's scan loop; uses os and pathlib.Path.
lock = Path(f"/tmp/iris/port_{port}")
try:
    os.kill(int(lock.read_text()), 0)  # signal 0 only probes whether the PID is alive
    continue  # port locked by a live process
except (FileNotFoundError, ValueError, ProcessLookupError, PermissionError):
    pass  # missing or stale lock: treat the port as claimable
# ... after successful bind:
lock.parent.mkdir(parents=True, exist_ok=True)
lock.write_text(str(os.getpid()))

Testing: ./infra/pre-commit.py --all-files --fix ✅ | 24 port-related iris tests passed ✅

When multiple tunnels are opened in parallel, they can race on
find_free_port and grab the same port. Write a lock file at
/tmp/iris/port_NNNNN with the owning PID and skip ports held by
live processes on subsequent scans.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
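Putting the pieces together, the lock-aware scan could read roughly like this (a sketch consistent with the snippet above; the real base.py code may differ in details):

```python
import os
import socket
from pathlib import Path

LOCK_DIR = Path("/tmp/iris")  # assumption: same lock directory as the PR

def find_free_port(start: int = 10000, limit: int = 1000) -> int:
    """Scan from `start`, skipping ports whose lock file names a live PID."""
    for port in range(start, start + limit):
        lock = LOCK_DIR / f"port_{port}"
        try:
            os.kill(int(lock.read_text()), 0)  # signal 0: liveness probe only
            continue  # lock held by a live process, skip this port
        except (FileNotFoundError, ValueError, ProcessLookupError, PermissionError):
            pass  # no lock, garbage contents, or dead owner
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("127.0.0.1", port))
        except OSError:
            continue  # something else is already listening here
        LOCK_DIR.mkdir(parents=True, exist_ok=True)
        lock.write_text(str(os.getpid()))
        return port
    raise RuntimeError(f"no free port in [{start}, {start + limit})")
```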

Helw150 commented Mar 14, 2026

> I like this (I had it originally) but @dlwh's bots would try to open multiple tunnels in parallel and then stomp on each other. Maybe someone has a suggestion for how to avoid the stomping? We could fcntl a lock file while the tunnel is being set up?
>
> @claude what do you think, could you make a lock around the port scan, maybe /tmp/iris/port_NNNNN with the pid? could you make that less than 5 lines?

Hmmm, I need to think through why the agents were setting up multiple dashboards. My agent tends to just query Iris directly (though on second reflection, maybe I'm being dumb and this code path affects commands like job logs too).


Helw150 commented Mar 14, 2026

Well, the fix is fine anyhow!!


claude bot commented Mar 14, 2026

Claude finished @Helw150's task in 58s


Analysis: tunnel usage and port contention

  • Read current code to understand where find_free_port is used
  • Analyze parallel tunnel scenarios

You're right to think this through — the tunnel code path is used by every CLI command that talks to the controller, not just the dashboard. Looking at cli/main.py:130, every iris CLI invocation (job logs, status, submit, etc.) calls platform.tunnel() which calls find_free_port(start=10000).

So yes, @rjpower's concern is valid: if you run iris logs ... and iris status ... concurrently (or if an agent fires off multiple CLI calls in parallel), they'd both try to grab port 10000 via the sequential scan. Without the lock, the first process binds 10000, then releases it when the socket closes, and the second process also grabs 10000 — but by then the first tunnel is already using it → conflict.

The PID lock file mechanism that was already pushed handles this correctly:

  1. Process A scans, finds 10000 free, writes /tmp/iris/port_10000 with its PID, starts tunnel on 10000
  2. Process B scans, sees /tmp/iris/port_10000 with a live PID → skips, moves to 10001

The lock files are self-cleaning (stale PIDs from dead processes are ignored via ProcessLookupError), so restarts still land on 10000 when no other tunnels are active. This gives the best of both worlds: stable ports for bookmarking when you have a single tunnel, and no stomping when multiple tunnels run in parallel.

One minor note: the lock is written after the socket is closed (the with block exits before lock.write_text), so there's a tiny TOCTOU (time-of-check to time-of-use) window. In practice this is fine since the tunnel subprocess immediately re-binds the port, but if you wanted belt-and-suspenders you could write the lock while the socket is still held. Not worth changing for this use case though.
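The belt-and-suspenders variant would write the lock before the probe socket is released; a hypothetical claim_port helper (names and structure are illustrative, not from the PR) might look like:

```python
import os
import socket
from pathlib import Path

def claim_port(port: int) -> bool:
    """Hypothetical variant: record our PID while the probe socket is still
    bound, so no concurrent scanner can slip in between bind and lock."""
    lock = Path(f"/tmp/iris/port_{port}")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("127.0.0.1", port))
        except OSError:
            return False  # port already taken
        # Write the lock before the `with` block releases the socket.
        lock.parent.mkdir(parents=True, exist_ok=True)
        lock.write_text(str(os.getpid()))
    return True
```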



rjpower commented Mar 14, 2026

Yeah, for whatever reason codex seemed to like to list tasks, log, and list jobs all at once...

@rjpower rjpower enabled auto-merge (squash) March 14, 2026 22:00
@rjpower rjpower merged commit a2d6a05 into main Mar 14, 2026
19 checks passed
@rjpower rjpower deleted the will/dev-ex-iris branch March 14, 2026 22:07
