Big 🫏 PR: IAP-gated single-image deploy with a fully mediated data plane by christophervoelpel · Pull Request #114 · google-marketing-solutions/scene-machine

christophervoelpel · 2026-06-17T13:05:54Z

This PR replaces Scene Machine's deploy and auth architecture with one that fits a corporate Google Cloud environment. The app now runs as a single container image deployed as two Cloud Run services (a public, IAP-gated app and a private worker), the browser never touches Firestore or Cloud Storage directly (everything goes through a same-origin /api control plane), and Firestore and Storage rules deny all direct client access. It also folds in the GA Gemini 3 models and improved outpainting, a reworked deploy.sh, new CI, and a batch of UI fixes.

This is not an additive option. It is the new authoritative deploy and auth model for the project. Conflicts with main should resolve to this branch.

What and why

The previous architecture could not be deployed in our corporate environment:

It needed a publicly reachable Cloud Run backend behind an API Gateway, but org policy (Domain Restricted Sharing and IAP-mandatory ingress) blocks public, unauthenticated services.
The browser signed in with Firebase Auth and then read and wrote Firestore and Cloud Storage directly from the client. In a locked-down org we cannot hand the browser direct data-plane access, and the public client config is awkward to govern.

This branch solves both:

The front door is Identity-Aware Proxy only. No service accepts unauthenticated traffic, and there is no separate sign-in UI; IAP admits the user before any app code runs.
The data plane is fully mediated. The browser calls a same-origin /api on the same service that serves the SPA. The server uses the Admin SDK and hands back short-lived, server-signed GET and PUT URLs for media. The browser never holds Firestore or Storage credentials.
It is compatible with Domain Restricted Sharing and IAP-mandatory org policies: a single least-privilege runtime service account, service-to-service auth via OIDC, and no unauthenticated ingress.

Architecture: before and after

flowchart LR
  subgraph BEFORE["BEFORE (main)"]
    direction TB
    b_user["Browser<br/>(Firebase Auth sign-in)"]
    b_gw["API Gateway<br/>(apispec.template.yaml)"]
    b_ui["Angular UI on App Engine<br/>(ui/app.yaml, deploy-ui.sh)"]
    b_run["Backend on Cloud Run"]
    b_fs[("Firestore")]
    b_gcs[("Cloud Storage")]
    b_user --> b_ui
    b_user --> b_gw --> b_run
    b_user -- "direct read/write" --> b_fs
    b_user -- "direct read/write" --> b_gcs
  end

  subgraph AFTER["AFTER (this branch)"]
    direction TB
    a_user["Browser"]
    a_iap{{"Identity-Aware Proxy"}}
    a_app["app service (Cloud Run, public)<br/>serves SPA + same-origin /api"]
    a_worker["worker service (Cloud Run, private)"]
    a_tasks["Cloud Tasks (OIDC)"]
    a_fs[("Firestore<br/>rules: deny all client")]
    a_gcs[("Cloud Storage<br/>rules: deny all client")]
    a_user --> a_iap --> a_app
    a_app -- "Admin SDK" --> a_fs
    a_app -- "server-signed GET/PUT URLs" --> a_gcs
    a_app --> a_tasks -- "OIDC invoke" --> a_worker
    a_worker -- "Admin SDK" --> a_fs
    a_worker -- "Admin SDK" --> a_gcs
  end

Before: three deployables (API Gateway, App Engine UI, Cloud Run backend), Firebase Auth in the browser, and a client that reads and writes Firestore and Storage directly.

After: one container image built once and run as two Cloud Run services selected by environment (ROLE, AUTH_MODE, WORKER_URL). The app service is public but IAP-gated and serves both the built SPA and a same-origin /api control plane. The worker service is private and is only ever invoked by Cloud Tasks using OIDC. No API Gateway, no App Engine, no separate UI deploy. All data access is mediated by the server.

Major changes

Front-door topology (single image, two services). One image (Dockerfile) runs as app or worker based on ROLE, with AUTH_MODE and WORKER_URL selecting behavior. orch.py validates these at startup (rejects an unknown ROLE, requires the IAP audience when ROLE=app and AUTH_MODE=iap, requires WORKER_URL when ROLE=app). The app service serves the built SPA and a same-origin /api; the worker service is private.
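The three startup rules above can be sketched as a small, framework-free check. This is an illustration, not the code in orch.py: the `IAP_AUDIENCE` variable name, the `ConfigError` class, and the function name are assumptions. The local dev command later in this description runs ROLE=app without WORKER_URL, so the real check presumably has a dev-mode exemption that this sketch does not model.

```python
from typing import Mapping

VALID_ROLES = {"app", "worker"}  # the two roles named in this PR


class ConfigError(ValueError):
    """Raised when the front-door environment is inconsistent (hypothetical class)."""


def validate_front_door(env: Mapping[str, str]) -> dict:
    """Check ROLE, AUTH_MODE and WORKER_URL the way the PR text describes."""
    role = env.get("ROLE", "")
    auth_mode = env.get("AUTH_MODE", "")

    # Rule 1: reject an unknown ROLE.
    if role not in VALID_ROLES:
        raise ConfigError(f"unknown ROLE {role!r}; expected one of {sorted(VALID_ROLES)}")

    if role == "app":
        # Rule 2: the IAP audience is required for ROLE=app with AUTH_MODE=iap.
        # IAP_AUDIENCE is an assumed variable name, not taken from the PR.
        if auth_mode == "iap" and not env.get("IAP_AUDIENCE"):
            raise ConfigError("ROLE=app with AUTH_MODE=iap requires IAP_AUDIENCE")
        # Rule 3: the app must know where the worker lives.
        if not env.get("WORKER_URL"):
            raise ConfigError("ROLE=app requires WORKER_URL")

    return {"role": role, "auth_mode": auth_mode}
```

Calling `validate_front_door(os.environ)` once at process start would make a misconfigured service fail before it accepts traffic, which matches the "validates these at startup" behavior described above.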

IAP front door and Firebase Auth removal. The Firebase Auth sign-in path is fully removed: no @angular/fire, no firebase-admin, no custom-token bridge. IAP is the only admission path. The browser-side sign-in UI is gone; the SPA assumes the request already passed IAP.

Mediated data plane and storyboard split. The browser never touches Firestore or Storage directly. Uploads go through POST /api/uploadUrl (server-signed PUT URLs) and reads through GET /api/signUrl (server-signed GET URLs); config is read through GET /api/config. firebase/firestore.rules and firebase/storage.rules both deny all direct client access (allow read, write: if false). Because the server signs GET URLs for any object in GCS_BUCKET, the deploy treats it as a dedicated bucket: it creates and labels the bucket (app=scene-machine), reuses or auto-adopts the default-named bucket on redeploy, and refuses to adopt a foreign or shared bucket unless the deployer explicitly opts in. Storyboards are split into a per-scene Firestore subcollection (projects/<id>/scenes/<index> in orch.py) so the 1 MiB per-document limit applies per scene instead of per project.
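To illustrate why the per-scene split raises the size ceiling, here is a minimal sketch that turns one project dict into a project document plus one document per scene under projects/<id>/scenes/<index>. The `scenes` and `sceneCount` field names, the zero-based index, and the JSON-length size estimate are assumptions for illustration; Firestore computes document size differently and orch.py may structure this otherwise.

```python
import json
from typing import Any

ONE_MIB = 1_048_576  # Firestore's per-document limit cited in the PR text


def split_storyboard(
    project_id: str, project: dict[str, Any]
) -> tuple[dict[str, Any], dict[str, dict[str, Any]]]:
    """Return (project_doc, scene_docs); scene_docs maps a document path to one scene."""
    # "scenes" is a hypothetical field name for the storyboard list.
    scenes = project.get("scenes", [])
    project_doc = {k: v for k, v in project.items() if k != "scenes"}
    project_doc["sceneCount"] = len(scenes)  # hypothetical bookkeeping field
    scene_docs = {
        # Zero-based index is an assumption; the PR only says <index>.
        f"projects/{project_id}/scenes/{i}": scene
        for i, scene in enumerate(scenes)
    }
    return project_doc, scene_docs


def approx_size(doc: Any) -> int:
    """Rough size proxy: UTF-8 length of the JSON encoding (not Firestore's formula)."""
    return len(json.dumps(doc).encode("utf-8"))
```

With three scenes of roughly 600 KB each, `approx_size` of the whole project exceeds `ONE_MIB`, while every entry in `scene_docs` stays well below it, so the limit now constrains a single scene rather than the whole storyboard.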

Deploy, infra, and CI. deploy.sh is reworked into a single-image Cloud Build deploy that is idempotent and least-privilege. The runtime service account self-scopes token-creator and act-as, the worker invoker is service-scoped, and REST calls use ADC tokens (safe under corporate Certificate-Based Access). There are opt-in fast-deploy flags (--app-only, --skip-ui-build, --no-build-cache) plus a local no-deploy dev loop. Supporting scripts live in deploy/libs.sh and deploy/grant-access.sh. The build context and image are slimmed (.dockerignore, .gcloudignore, cloudbuild.yaml layer caching).

GA Gemini 3 models and outpainting (folds in upstream PR #104). Defaults move to GA Gemini 3 models (config.template.txt: GEMINI_MODEL=gemini-3.5-flash, IMAGE_MODEL=gemini-3-pro-image) with improved outpainting (actions/outpaint_image.py). Action signatures and example workflows are brought into parity with the new model parameters.

UI features and fixes. Inline project-name editing on the storyboard, composition, and output headers; per-scene failure display; stuck-upload fix; position-based scene labels; and an image-import feature (paste, pasted URL lists, base64, and drag-from-another-tab across setup, storyboard reference, and the overlay dialog, with a browser-side fetch that has a timeout and size cap).

Tests and docs. UI vitest suite at 197 tests; Python suite at 146 tests. Two obsolete UI specs removed (config.spec.ts, remix-engine.spec.ts). README.md and DEVELOPING.md document the IAP-only deploy, the one-time console steps, the local dev loop, and the fast-deploy flags.

Security and review hardening. Deny-all client rules, server-only config and access allow-list reads, an ALLOWED_DOMAINS charset guard in deploy.sh, and static rule tests in CI. Several correctness and security edges found in review since this PR opened were also closed:

a failed project save is no longer silently treated as saved, so unsaved edits are not lost,
a finished render is no longer discarded when its output URL cannot be signed for a moment (it is retried and kept),
the generation paths degrade gracefully when /api/config has not loaded instead of marking a scene failed,
POST /api/projects is create-only (returns 409 on an existing id), so it cannot overwrite a project or reassign its owner,
GET /api/signUrl caps the number of paths per request, preventing signing-quota exhaustion by a single caller,
the project list reads only the first scene of each project (for its thumbnail) instead of every scene,
GCS_BUCKET must be a dedicated bucket that the deploy creates, labels, and owns,
CI now type-checks the UI test files (the fast test runner compiled but did not type-check them).
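Two of the items above, the create-only POST /api/projects and the per-request cap on GET /api/signUrl, can be sketched without any web framework. This is a hedged illustration using an in-memory dict in place of Firestore and an injected signer in place of Cloud Storage. The cap value, the 201 and 400 status codes, and all function names are assumptions; only the 409 on an existing id comes from the PR text.

```python
from typing import Callable

MAX_SIGN_PATHS = 50  # hypothetical cap; the PR does not state the real number


def create_project(
    store: dict[str, dict], project_id: str, body: dict, user: str
) -> tuple[int, dict]:
    """Create-only insert: an existing id is never overwritten or re-owned."""
    if project_id in store:
        return 409, {"error": "project already exists"}
    store[project_id] = {**body, "createdBy": user}
    return 201, store[project_id]  # 201 is an assumption


def sign_urls(
    paths: list[str],
    signer: Callable[[str], str],
    max_paths: int = MAX_SIGN_PATHS,
) -> tuple[int, dict]:
    """Sign at most max_paths objects per request to bound signing cost per call."""
    if len(paths) > max_paths:
        # Rejecting with 400 is an assumption; the PR only says the count is capped.
        return 400, {"error": f"at most {max_paths} paths per request"}
    return 200, {"urls": {path: signer(path) for path in paths}}
```

The in-memory check-then-write is not atomic. Against Firestore, one way to get the same guarantee atomically would be the client library's `DocumentReference.create()`, which fails if the document already exists; whether orch.py does it that way is not stated in this PR.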

Breaking changes and migration

This PR replaces main's deploy and auth architecture. The API Gateway, the App Engine UI, and the separate UI deploy are removed (apispec.template.yaml, ui/app.yaml, deploy-ui.sh are deleted). Firebase Auth sign-in is removed.
There is no data migration. Existing deployments do a fresh redeploy on the new topology. Old client behavior (direct Firestore and Storage access) stops working by design once the deny-all rules are in place.
Merge conflicts should resolve to this branch. Do not regress to the gateway, App Engine, or client-Firebase architecture.

Security model

Admission: Identity-Aware Proxy gates the public app service. No request reaches app code without passing IAP.
Data access: only the app and worker service accounts read or write Firestore and Storage, using the Admin SDK. Media reaches the browser only through short-lived, server-signed URLs.
Client rules: Firestore and Storage rules deny all direct client access, so a leaked or replayed client token cannot read or write data directly.
Dedicated storage bucket: signed GET URLs are issued for any object in GCS_BUCKET, so the deploy requires a bucket dedicated to Scene Machine. It creates and labels the bucket, auto-adopts the default-named one on redeploy, and refuses a foreign or shared bucket unless the deployer explicitly opts in.
Service-to-service: the worker service is private and only invokable by Cloud Tasks via OIDC; the invoker permission is scoped to that service.
Shared-team CRUD (intentional): every IAP-admitted user can create, read, update, and delete every project. createdBy is a display label, not an access gate. This is a deliberate small-team design choice, not an oversight.

Testing and CI

All four GitHub Actions workflows are green on Linux runners:

ui-tests.yml: UI vitest suite (197 tests) plus a spec type-check and a production ng build.
python-tests.yml: Python tests on 3.11, 3.12, and 3.13 (146 tests) plus the action-signature check.
deploy-checks.yml: ShellCheck and static safety checks on the deploy scripts.
firebase-rules.yml: static and behavioral tests of the Firestore and Storage rules.

The full flow has also been deploy-validated end to end on a live GCP project using ./deploy.sh.

Deploy

One command does the full, safe deploy. There is no auth-mode flag: the deploy is always IAP-gated (the public Firebase sign-in mode was removed), so a plain run is all there is.

./deploy.sh

A first-time deploy on a brand-new project needs a few one-time console steps:

Configure the OAuth consent screen first. The custom OAuth client in the next step cannot be created until the consent screen exists.
On a project without a Google Cloud organization, create a custom OAuth client once for IAP (a project inside an organization gets the managed OAuth client automatically). Skipping this on an org-less project surfaces as 502 "Empty OAuth client".
After the IAP deploy, nobody can open the app (not even the deployer) until they are admitted through IAP. Use deploy/grant-access.sh to admit a user.

The script prints these remaining steps verbatim at the end of a run, and they are documented in README.md.

Deploy options

A plain ./deploy.sh always runs the full, safe, end-to-end deploy. The flags below are opt-in, and each one logs what it skipped:

--non-interactive: for headless or agent-driven runs. It auto-confirms the "deploy to this project?" prompt, and instead of pausing at a human-only console step it fails fast, prints the exact console URL and the command to re-run, then exits so an automation can surface it and resume once the step is done.
--app-only: rebuild the image and redeploy only the app service, reusing the live worker (use when the worker did not change).
--skip-ui-build (alias --use-existing-ui-dist): reuse the existing ui/dist instead of rebuilding the Angular app. The deploy refuses a ui/dist that was built for local dev (sign-in disabled).
--no-build-cache: force a clean, cold image build, for a release or a dependency refresh.

Local development (UI)

You do not run a full deploy to work on the UI. A local loop edits and hot-reloads the Angular app in seconds while still calling the same /api endpoints the deployed app uses. People using Scene Machine should use a deployed instance; this path is for people building it.

One-time setup renders the two gitignored files the UI needs (ui/src/env.ts, ui/definitions/config.json) in the dev front-door mode, and points the local backend at a dev project through your Application Default Credentials:

./scripts/dev-setup.sh
gcloud auth application-default login

Then run two terminals:

# Terminal 1 - the local /api backend
ROLE=app AUTH_MODE=none LOCAL_WORKER=1 FIRESTORE_DB_UI=<your-dev-db> PORT=8080 python3 orch.py

# Terminal 2 - the Angular dev server with the /api proxy
cd ui && npm run dev

Open the printed http://localhost:4200; edits under ui/src hot-reload. A proxy (ui/proxy.conf.json) forwards every /api call to the local backend, so the browser uses the same relative /api/... URLs as production with no CORS setup. On a fresh dev database, seed the config/global document once (DEVELOPING.md gives the command) so /api/config does not return 404.

AUTH_MODE=none (and the UI's controlPlaneMode: 'none') removes the sign-in gate. This is dev only: fine on localhost, unacceptable in production, and deploy.sh refuses to build or ship a none UI (including via --skip-ui-build).
LOCAL_WORKER=1 runs workflow actions in-process instead of scheduling Cloud Tasks, so no separate worker is needed. It is dev-only (no retries or backoff) and is honored only when AUTH_MODE=none, so it can never turn on in a deployed service.
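The rule that LOCAL_WORKER is honored only together with AUTH_MODE=none can be sketched as a single predicate plus a dispatch helper. The function names and the callable-based stubs are assumptions for illustration; the actual scheduling code in orch.py is not shown in this PR description.

```python
from typing import Callable, Mapping


def local_worker_enabled(env: Mapping[str, str]) -> bool:
    """LOCAL_WORKER=1 counts only when the dev-only AUTH_MODE=none is also set."""
    return env.get("LOCAL_WORKER") == "1" and env.get("AUTH_MODE") == "none"


def dispatch(
    env: Mapping[str, str],
    run_in_process: Callable[[], object],
    enqueue_task: Callable[[], object],
) -> object:
    """Run the action locally in dev, otherwise hand it to the task queue stub."""
    if local_worker_enabled(env):
        # Dev path: executes immediately, with no retries or backoff.
        return run_in_process()
    # Deployed path: enqueue_task stands in for scheduling a Cloud Task.
    return enqueue_task()
```

Because the predicate requires both variables, a deployed service running with AUTH_MODE=iap ignores a stray LOCAL_WORKER=1 and still enqueues work for the private worker.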

The full local-dev guide, including running an end-to-end workflow locally, is in DEVELOPING.md.

Known trade-offs and notes

Shared-team delete is intentional. Any admitted user can delete any project. For a small trusted team this is the desired behavior; if per-user ownership is ever needed it would be a follow-on.
Storyboard multi-batch write atomicity. A project that fits in one Firestore batch commits atomically. A larger project spans multiple batches committed in sequence, so a failure partway through can leave the project partially written until the next save. This is a documented trade-off of raising the per-project size ceiling via the per-scene subcollection.
No data migration. Old deployments redeploy fresh; there is no in-place upgrade path from the previous architecture for Firestore project documents. Existing storage buckets are kept and reused: the deploy adds an ownership label to the bucket and moves no media.
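The multi-batch atomicity trade-off described above can be made concrete with a short sketch. Each call to `commit` stands in for one atomic Firestore batch commit; the sequence of commits is not atomic. The helper names and the caller-supplied `batch_size` are assumptions, and no real Firestore limit is implied by the numbers used.

```python
from typing import Callable, Iterator, Sequence

Write = tuple[str, dict]  # (document path, document data)


def chunked(items: Sequence[Write], size: int) -> Iterator[Sequence[Write]]:
    """Yield consecutive slices of at most `size` writes."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def save_in_batches(
    writes: Sequence[Write],
    commit: Callable[[Sequence[Write]], None],
    batch_size: int,
) -> int:
    """Commit writes batch by batch and return how many writes were committed.

    If commit raises partway through, earlier batches stay committed and the
    exception propagates, leaving the project partially written.
    """
    committed = 0
    for batch in chunked(writes, batch_size):
        commit(batch)
        committed += len(batch)
    return committed
```

With five scene writes, a batch size of two, and a commit that fails on its second call, only the first two scenes end up stored. That is the partially written state the note describes, and the next successful save would bring the project back to a consistent state.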

…eway and App Engine Replace the API Gateway and App Engine UI hosting with a single container image that runs as the app or the worker based on ROLE/AUTH_MODE/WORKER_URL. Drop the gateway config fields and point the status viewer at the same origin.

The app role serves the SPA plus a same-origin /api control plane gated by IAP; the worker role is private and Cloud-Tasks-only. Firestore and Storage access is mediated through /api with server-signed URLs, storyboards are split into a per-scene subcollection, and the client security rules deny all direct access.

Remove @angular/fire from the UI and firebase-admin from the Python requirements; the deployed path no longer uses Firebase Auth.

Behind IAP the user is already authenticated, so the SPA no longer initializes Firebase or signs in with a custom token. The /api auth-header interceptor handles the iap and local none modes.

Replace direct client Firestore/Storage with mediated /api calls and server-signed media URLs, re-signing on read so long-lived bindings never serve an expired URL.
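The re-sign-on-read idea can be sketched as a small cache that asks for a fresh signed URL whenever the cached one is expired or close to expiry. In this PR the real logic presumably lives in the Angular client; the sketch is written in Python for consistency with orch.py, and the class name, the safety margin, and the injected `sign` callable (standing in for a GET /api/signUrl call) are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class SignedUrl:
    url: str
    expires_at: float  # epoch seconds


class ResigningCache:
    """Return a signed URL for a path, re-signing when it is near expiry."""

    def __init__(
        self,
        sign: Callable[[str], SignedUrl],
        margin_s: float = 60.0,  # hypothetical safety margin before expiry
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._sign = sign
        self._margin_s = margin_s
        self._clock = clock
        self._cache: dict[str, SignedUrl] = {}

    def url_for(self, path: str) -> str:
        cached = self._cache.get(path)
        # Re-sign if there is no entry or too little lifetime is left.
        if cached is None or cached.expires_at - self._clock() <= self._margin_s:
            cached = self._sign(path)
            self._cache[path] = cached
        return cached.url
```

A view that stays open for hours calls `url_for` each time it reads the media URL, so it gets a cached URL while it is fresh and a newly signed one once the old one is about to lapse.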

Reorganize into a deploy/ folder, make IAM grants check-before-grant, authenticate REST calls with ADC tokens, scope the runtime service account to least privilege, and add a grant-access helper.

Cache Docker layers via cloudbuild.yaml, shrink the build context and runtime image, and add a no-deploy local UI dev loop with an /api proxy.

Adopt the GA Gemini model ids with a separate image-model region, add usage-tracking headers, and outpaint via a blank canvas with adaptive output size. Sync the action definitions and example workflows to the required params.

… fixes Import images by paste, drag, remote link, or base64 across setup, storyboard and the overlay dialog; rename a project inline from the page headers; surface per-scene generation failures by position; steady playback; unstick uploads; and recover interrupted renders.

…nput Update the README, DEVELOPING guide and walkthrough for the IAP-only deployment, the three sign-ins, the fast-deploy flags, the local dev loop, the shared-team access model, and the new paste/links/base64 image input.

Add vitest UI specs, Python front-door/engine/rules tests, and GitHub workflows that build and unit-test the UI, run the Python tests and action contract, run the Firebase rule tests, and ShellCheck the deploy scripts.

gps-readability-bot · 2026-06-17T13:06:19Z