Commit 4f49462

iris: switch auth from per-RPC DB lookups to HMAC-SHA256 JWTs (#3630)
Replaces the ad-hoc bearer token auth system (every RPC does a DB hit to hash-lookup tokens) with proper JWT-based auth using PyJWT and HMAC-SHA256 signing.

- **`JwtTokenManager`** replaces `DbTokenVerifier`: verification is a pure crypto check + in-memory revocation set; no database hit on the hot path
- **`VerifiedIdentity`** dataclass carries `user_id` + `role` from JWT claims, eliminating per-RPC role lookups from the DB
- **Login RPC** now exchanges identity tokens (GCP access tokens or raw static config tokens) for JWTs; all auth providers converge on the same flow
- **Signing key** persisted in new `controller_secrets` table (migration 0006) so tokens survive controller restarts
- **Dashboard** login page exchanges raw tokens for JWTs via Login RPC before setting session cookie
- **CLI** `iris login` unified: all providers go through Login RPC

### Files changed

- `lib/iris/src/iris/rpc/auth.py`: `VerifiedIdentity`, `_verified_identity` ContextVar, verifier protocol returns `VerifiedIdentity`
- `lib/iris/src/iris/cluster/controller/auth.py`: `JwtTokenManager`, `_get_or_create_signing_key`, JWT worker tokens
- `lib/iris/src/iris/cluster/controller/service.py`: `_require_role`/`_require_admin` read from JWT claims, Login/CreateApiKey mint JWTs
- `lib/iris/src/iris/cli/main.py`: unified login flow through Login RPC
- `lib/iris/dashboard/src/components/controller/LoginPage.vue`: exchange tokens via Login RPC
- `lib/iris/src/iris/cluster/controller/migrations/0006_jwt_signing_key.sql`: `controller_secrets` table
- `lib/iris/docs/auth.md`: updated architecture docs

## Test plan

- [x] 97 unit tests pass (test_auth.py, test_api_keys.py, test_service.py)
- [x] 1322 full iris tests pass including all e2e tests
- [x] Pre-commit (ruff, black, pyrefly) passes
- [x] E2e static auth ownership test validates JWT-based job isolation
- [x] E2e dashboard login flow validates token exchange in browser

Closes #3623
1 parent ec17543 commit 4f49462

14 files changed: 849 additions, 419 deletions


lib/iris/dashboard/src/components/controller/LoginPage.vue

Lines changed: 34 additions & 1 deletion
@@ -17,10 +17,43 @@ async function login() {
   error.value = null
   loading.value = true
   try {
+    // Exchange the token for a JWT via Login RPC.
+    // Handles raw identity tokens (static config tokens, GCP access tokens).
+    // If Login is unimplemented, the token is already a JWT — use it directly.
+    let sessionToken = trimmed
+    try {
+      const loginResp = await fetch('/iris.cluster.ControllerService/Login', {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json' },
+        body: JSON.stringify({ identity_token: trimmed }),
+      })
+      if (loginResp.ok) {
+        const loginData = await loginResp.json()
+        if (loginData.token) {
+          sessionToken = loginData.token
+        }
+      } else {
+        // Surface auth failures (e.g. invalid token). Only fall through for
+        // "unimplemented" (Login not configured) — token may already be a JWT.
+        const errData = await loginResp.json().catch(() => ({}))
+        const code = errData.code || ''
+        if (code !== 'unimplemented') {
+          throw new Error(errData.message || `Login failed (${loginResp.status})`)
+        }
+      }
+    } catch (loginErr) {
+      // Network errors (Login RPC unreachable) — try token as-is
+      if (loginErr instanceof TypeError) {
+        // fetch network error — ignore and try token directly
+      } else {
+        throw loginErr
+      }
+    }
+
     const resp = await fetch('/auth/session', {
       method: 'POST',
       headers: { 'Content-Type': 'application/json' },
-      body: JSON.stringify({ token: trimmed }),
+      body: JSON.stringify({ token: sessionToken }),
     })
     if (!resp.ok) {
       const body = await resp.json().catch(() => ({}))
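
The fallback rules in this hunk are easy to misread, so here they are restated as a small Python function (Python is used only to keep the examples on this page in one language; the real logic is the Vue code above). `resolve_session_token`, `LoginRejected`, and the dict shape returned by the injected `login` callable are all hypothetical. The sketch assumes a network failure surfaces as `ConnectionError`, standing in for the `TypeError` that `fetch` raises in the browser.

```python
from typing import Callable


class LoginRejected(Exception):
    """Raised when the Login RPC answers with a real auth failure."""


def resolve_session_token(raw_token: str, login: Callable[[str], dict]) -> str:
    """Return the token that should be posted to /auth/session.

    `login` is a hypothetical stand-in for the Login RPC call. It returns a dict
    with `ok`, and optionally `token`, `code`, `message`, `status`.
    """
    try:
        resp = login(raw_token)
    except ConnectionError:
        # Login RPC unreachable: try the raw token as-is.
        return raw_token

    if resp.get("ok"):
        # Successful exchange: prefer the minted JWT, else keep the raw token.
        return resp.get("token") or raw_token

    if resp.get("code", "") == "unimplemented":
        # Login not configured: the raw token may already be a JWT.
        return raw_token

    # Any other error (for example an invalid token) is surfaced to the user.
    raise LoginRejected(resp.get("message") or f"Login failed ({resp.get('status')})")
```

Only two situations fall back to the raw token: an unreachable Login RPC and an explicit `unimplemented` code. Every other failure is raised so that a bad token is not silently forwarded to `/auth/session`.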

lib/iris/docs/auth.md

Lines changed: 37 additions & 24 deletions
@@ -12,21 +12,25 @@ Authentication is configured via the `auth` block in `IrisClusterConfig`. Three
 
 | Mode | Config | Behavior |
 |------|--------|----------|
-| **Null-auth** | No `auth` block | All requests pass as `anonymous` (admin). Workers still get an internal bearer token. |
-| **Static** | `auth.static.tokens: {token: username}` | Pre-shared tokens mapped to usernames. Good for local dev and testing. |
-| **GCP** | `auth.gcp.project_id: <id>` | Users log in with a GCP OAuth2 access token. The controller verifies it against Google's tokeninfo endpoint and checks project access via Cloud Resource Manager. |
+| **Null-auth** | No `auth` block | All requests pass as `anonymous` (admin). Workers still get a JWT. |
+| **Static** | `auth.static.tokens: {token: username}` | Pre-shared tokens exchanged for JWTs via Login RPC. Good for local dev and testing. |
+| **GCP** | `auth.gcp.project_id: <id>` | Users log in with a GCP OAuth2 access token, exchanged for a JWT via Login RPC. |
 
 ### Token lifecycle
 
-All auth modes converge on the same runtime path: **API keys stored in SQLite**.
+All tokens are **JWTs signed with HMAC-SHA256**. The signing key is persisted in the `controller_secrets` table so tokens survive controller restarts.
+
+JWT claims: `sub` (user_id), `role`, `jti` (key_id), `iat`, `exp`.
 
 1. On controller start, `create_controller_auth()` reads the config proto and:
-   - Creates a `system:worker` user with a fresh bearer token (all modes, including null-auth).
-   - For static auth: preloads config tokens into the `api_keys` table.
-   - For GCP auth: instantiates a `GcpAccessTokenVerifier` as the *login verifier*.
-2. On `Login` RPC (GCP mode): the controller verifies the GCP access token, creates/ensures the user, revokes old login keys, mints a new API key, and returns it.
-3. All subsequent RPCs are authenticated by hashing the bearer token (SHA-256) and looking it up in `api_keys`. Expired and revoked keys are rejected.
-4. `last_used_at` is throttled to one DB write per 60s per key.
+   - Loads (or creates) the persistent JWT signing key from `controller_secrets`.
+   - Creates a `system:worker` user with a fresh worker JWT (all modes, including null-auth).
+   - For static auth: preloads config tokens into `api_keys` for audit; sets up `StaticTokenVerifier` as the login verifier.
+   - For GCP auth: instantiates a `GcpAccessTokenVerifier` as the login verifier.
+   - Loads revoked key_ids into an in-memory revocation set.
+2. On `Login` RPC: the controller verifies the identity token (GCP access token or raw static token), creates/ensures the user, revokes old login keys, mints a new JWT, and returns it.
+3. All subsequent RPCs are authenticated by **verifying the JWT signature** and checking the in-memory revocation set. No database hit on the hot path.
+4. The `api_keys` table is retained for audit, key management RPCs, and revocation tracking.
 
 ### Interceptor chain
 
@@ -36,13 +40,13 @@ Request → SelectiveAuthInterceptor → Service handler
     ├─ Login, GetAuthInfo → skip auth (unauthenticated RPCs)
     └─ everything else → AuthInterceptor.verify()
 
-        ├─ Authorization: Bearer <token>
-        └─ Cookie: iris_session=<token>
+        ├─ Authorization: Bearer <JWT>
+        └─ Cookie: iris_session=<JWT>
 ```
 
-In null-auth mode, `NullAuthInterceptor` replaces `AuthInterceptor`: tokens are verified if present (workers), but missing tokens fall through as `anonymous`.
+In null-auth mode, `NullAuthInterceptor` replaces `AuthInterceptor`: JWTs are verified if present (workers), but missing tokens fall through as `anonymous` admin.
 
-The verified user identity is stored in a `ContextVar` (`_verified_user`) and read by service code via `get_verified_user()`.
+The verified identity (user_id + role) is stored in a `ContextVar` (`_verified_identity`) and read by service code via `get_verified_identity()`. Role checks read directly from the JWT claims — no database lookup.
 
 ### Authorization model
 
@@ -61,17 +65,18 @@ In null-auth mode, job ownership enforcement is skipped entirely.
 
 ### Client-side auth
 
-- **CLI**: `iris login` exchanges a GCP access token (or picks the first static token) for an API key, stored in `~/.iris/tokens.json` keyed by cluster name.
-- **Workers**: receive `auth_token` via `WorkerConfig` proto. The autoscaler passes it through from controller config.
+- **CLI**: `iris login` exchanges an identity token (GCP access token or raw static token) for a JWT via the Login RPC, stored in `~/.iris/tokens.json` keyed by cluster name.
+- **Workers**: receive `auth_token` (a JWT) via `WorkerConfig` proto. The autoscaler passes it through from controller config.
 - **Dashboard**: session cookie (`iris_session`) set via `/auth/session` POST or `?session_token=` query param redirect. The frontend shows a login page when `/auth/config` reports `auth_enabled: true` and no valid session exists.
 
-### Schema (migration 0004)
+### Schema
 
 ```sql
+-- migration 0004
 CREATE TABLE api_keys (
     key_id TEXT PRIMARY KEY,
-    key_hash TEXT NOT NULL UNIQUE, -- SHA-256 of raw token
-    key_prefix TEXT NOT NULL, -- first 8 chars for display
+    key_hash TEXT NOT NULL UNIQUE, -- "jwt:<key_id>" for JWT tokens
+    key_prefix TEXT NOT NULL,
     user_id TEXT NOT NULL REFERENCES users(user_id),
     name TEXT NOT NULL,
     created_at_ms INTEGER NOT NULL,
@@ -82,19 +87,27 @@ CREATE TABLE api_keys (
 
 ALTER TABLE users ADD COLUMN role TEXT NOT NULL DEFAULT 'user'
     CHECK (role IN ('admin', 'user', 'worker'));
+
+-- migration 0006
+CREATE TABLE controller_secrets (
+    key TEXT PRIMARY KEY,
+    value TEXT NOT NULL,
+    created_at_ms INTEGER NOT NULL
+);
 ```
 
-### New RPCs
+### RPCs
 
 Added to `ControllerService`:
 
 - `GetAuthInfo` — unauthenticated; returns provider name and GCP project ID.
-- `Login` — unauthenticated; exchanges an identity token for an API key.
-- `CreateApiKey` / `RevokeApiKey` / `ListApiKeys` — API key management.
-- `GetCurrentUser` — returns the authenticated user's identity and role.
+- `Login` — unauthenticated; exchanges an identity token for a JWT.
+- `CreateApiKey` / `RevokeApiKey` / `ListApiKeys` — API key management (returns JWTs).
+- `GetCurrentUser` — returns the authenticated user's identity and role (from JWT claims).
 
 ### Known limitations
 
 - **Bundle downloads** are unauthenticated. Bundle IDs are SHA-256 hashes (256 bits of entropy) acting as capability URLs. Workers and K8s init-containers fetch bundles via stdlib `urlopen` which doesn't support auth headers.
-- **No token refresh**: API keys don't auto-refresh. Login keys from GCP auth are one-shot; re-run `iris login` to get a new one.
+- **No token refresh**: JWTs have a 30-day TTL by default. Re-run `iris login` to get a new one.
 - **Single-role model**: a user has exactly one role. No per-job or per-resource ACLs.
+- **Revocation is in-memory**: revoked JTIs are loaded from the DB at startup and updated on revocation RPCs. A controller restart reloads the full revocation set.
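
The lifecycle described in this file starts by loading, or creating on first run, the signing key in `controller_secrets`. Below is a minimal sketch of that get-or-create step against the migration 0006 schema, run on an in-memory SQLite database. The row name `jwt_signing_key`, the hex encoding of the key, and the function name are assumptions made for illustration; the repository's `_get_or_create_signing_key` is not shown in this diff and may differ.

```python
import secrets
import sqlite3
import time

# Same columns as migration 0006; IF NOT EXISTS is added so the sketch is re-runnable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS controller_secrets (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    created_at_ms INTEGER NOT NULL
)
"""


def get_or_create_signing_key(conn: sqlite3.Connection, name: str = "jwt_signing_key") -> str:
    """Return the stored signing key, generating and persisting one on first use."""
    conn.execute(SCHEMA)
    row = conn.execute(
        "SELECT value FROM controller_secrets WHERE key = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]

    value = secrets.token_hex(32)  # 256 bits of randomness, hex-encoded
    conn.execute(
        "INSERT INTO controller_secrets (key, value, created_at_ms) VALUES (?, ?, ?)",
        (name, value, int(time.time() * 1000)),
    )
    conn.commit()
    return value


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    first = get_or_create_signing_key(conn)
    second = get_or_create_signing_key(conn)
    print(first == second)  # the second call reads the persisted row
```

Because the key is read back rather than regenerated, tokens signed before a restart still verify afterwards, which is the property the docs call out. Rotating the key would invalidate every outstanding token at once.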

lib/iris/pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ dependencies = [
     "httpx>=0.28.1",
     "humanfriendly>=10.0",
     "pydantic>=2.12.5",
+    "PyJWT>=2.8.0",
     "pyyaml>=6.0",
     "starlette>=0.50.0",
     "tabulate>=0.9.0",

lib/iris/src/iris/cli/main.py

Lines changed: 16 additions & 19 deletions
@@ -216,7 +216,7 @@ def iris(
 @iris.command()
 @click.pass_context
 def login(ctx):
-    """Authenticate with the cluster and store an API key locally."""
+    """Authenticate with the cluster and store a JWT locally."""
     controller_url = require_controller_url(ctx)
     config = ctx.obj.get("config")
 
@@ -226,7 +226,6 @@ def login(ctx):
     if config and config.HasField("auth"):
         provider = config.auth.WhichOneof("provider")
     else:
-        # Discover auth method from the controller
         client = ControllerServiceClientSync(address=controller_url, timeout_ms=30000)
         try:
             auth_info = client.get_auth_info(cluster_pb2.GetAuthInfoRequest())
@@ -241,36 +240,34 @@ def login(ctx):
     if provider == "gcp":
         gcp_provider = GcpAccessTokenProvider()
         try:
-            access_token = gcp_provider.get_token()
+            identity_token = gcp_provider.get_token()
         except Exception as e:
             raise click.ClickException(f"Failed to get GCP access token: {e}") from e
-
-        client = ControllerServiceClientSync(address=controller_url, timeout_ms=30000)
-        try:
-            response = client.login(cluster_pb2.LoginRequest(identity_token=access_token))
-        except Exception as e:
-            raise click.ClickException(f"Login failed: {e}") from e
-        finally:
-            client.close()
-
-        raw_token = response.token
-        user_id = response.user_id
     elif provider == "static":
         if not config:
             raise click.ClickException("Static auth requires --config (tokens are in the config file)")
         tokens = dict(config.auth.static.tokens)
         if not tokens:
             raise click.ClickException("No static tokens configured")
-        raw_token = next(iter(tokens))
-        user_id = tokens[raw_token]
+        identity_token = next(iter(tokens))
     else:
         raise click.ClickException(f"Unsupported auth provider: {provider}")
 
+    # All providers converge: exchange identity_token for JWT via Login RPC
+    client = ControllerServiceClientSync(address=controller_url, timeout_ms=30000)
+    try:
+        response = client.login(cluster_pb2.LoginRequest(identity_token=identity_token))
+    except Exception as e:
+        raise click.ClickException(f"Login failed: {e}") from e
+    finally:
+        client.close()
+
     cluster_name = ctx.obj.get("cluster_name", "default")
-    store_token(cluster_name, controller_url, raw_token)
+    store_token(cluster_name, controller_url, response.token)
 
-    click.echo(f"Authenticated as {user_id}")
-    click.echo(f"Dashboard: {controller_url}?session_token={raw_token}")
+    click.echo(f"Authenticated as {response.user_id}")
+    # Token in URL is visible in browser history/logs — acceptable for internal clusters
+    click.echo(f"Dashboard: {controller_url}?session_token={response.token}")
     click.echo(f"Token stored for cluster '{cluster_name}'")
 
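
`store_token` is called in this hunk but its body is not part of the diff. As a rough illustration of a token file keyed by cluster name (which is how `auth.md` describes `~/.iris/tokens.json`), the sketch below writes and reads such a file. The JSON layout, the `store_token_sketch` and `load_token_sketch` names, the explicit `path` parameter, and the `0600` permission are all assumptions, not the repository's implementation.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def store_token_sketch(path: Path, cluster_name: str, controller_url: str, token: str) -> None:
    """Insert or replace the entry for one cluster, leaving other clusters untouched."""
    path.parent.mkdir(parents=True, exist_ok=True)
    data = json.loads(path.read_text()) if path.exists() else {}
    data[cluster_name] = {"url": controller_url, "token": token}
    path.write_text(json.dumps(data, indent=2))
    # The file holds bearer credentials, so restrict it to the owner (assumed policy).
    os.chmod(path, 0o600)


def load_token_sketch(path: Path, cluster_name: str) -> Optional[str]:
    """Return the stored token for a cluster, or None if there is no entry."""
    if not path.exists():
        return None
    entry = json.loads(path.read_text()).get(cluster_name)
    return entry["token"] if entry else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tokens_file = Path(tmp) / "tokens.json"
        store_token_sketch(tokens_file, "default", "http://localhost:10000", "example-jwt")
        print(load_token_sketch(tokens_file, "default"))
```

Since JWTs in this design expire after a TTL, re-running the login simply overwrites the entry for that cluster name; entries for other clusters are preserved.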
