Big 🫏 PR: IAP-gated single-image deploy with a fully mediated data plane#114
Big 🫏 PR: IAP-gated single-image deploy with a fully mediated data plane#114christophervoelpel wants to merge 30 commits into
Conversation
…eway and App Engine Replace the API Gateway and App Engine UI hosting with a single container image that runs as the app or the worker based on ROLE/AUTH_MODE/WORKER_URL. Drop the gateway config fields and point the status viewer at the same origin.
The app role serves the SPA plus a same-origin /api control plane gated by IAP; the worker role is private and Cloud-Tasks-only. Firestore and Storage access is mediated through /api with server-signed URLs, storyboards are split into a per-scene subcollection, and the client security rules deny all direct access.
Remove @angular/fire from the UI and firebase-admin from the Python requirements; the deployed path no longer uses Firebase Auth.
Behind IAP the user is already authenticated, so the SPA no longer initializes Firebase or signs in with a custom token. The /api auth-header interceptor handles the iap and local none modes.
Replace direct client Firestore/Storage with mediated /api calls and server-signed media URLs, re-signing on read so long-lived bindings never serve an expired URL.
Reorganize into a deploy/ folder, make IAM grants check-before-grant, authenticate REST calls with ADC tokens, scope the runtime service account to least privilege, and add a grant-access helper.
Cache Docker layers via cloudbuild.yaml, shrink the build context and runtime image, and add a no-deploy local UI dev loop with an /api proxy.
Adopt the GA Gemini model ids with a separate image-model region, add usage-tracking headers, and outpaint via a blank canvas with adaptive output size. Sync the action definitions and example workflows to the required params.
… fixes Import images by paste, drag, remote link, or base64 across setup, storyboard and the overlay dialog; rename a project inline from the page headers; surface per-scene generation failures by position; and steady playback, unstick uploads, and recover interrupted renders.
…nput Update the README, DEVELOPING guide and walkthrough for the IAP-only deployment, the three sign-ins, the fast-deploy flags, the local dev loop, the shared-team access model, and the new paste/links/base64 image input.
Add vitest UI specs, Python front-door/engine/rules tests, and GitHub workflows that build and unit-test the UI, run the Python tests and action contract, run the Firebase rule tests, and ShellCheck the deploy scripts.
|
Still need readability approvals from:
|
|
Still need readability approvals from:
|
|
Still need readability approvals from:
|
|
Still need readability approvals from:
|
|
Still need readability approvals from:
|
1 similar comment
|
Still need readability approvals from:
|
…oogle-marketing-solutions#113 Cloud Tasks lock Pre-merge review and hardening: - Split the build and runtime service accounts. The app and worker now run as a dedicated runtime service account that does not hold artifactregistry.writer, so a compromised app cannot push or overwrite container images. - Enforce a content-type allow-list on the signed-upload endpoint. - Remove the unnecessary storage target-apply from the firestore-only deploy phase so it cannot abort and leave the deny-all rules undeployed. - Route the raw Cloud Run and service-account IAM grants through the retry wrapper, and give the worker a finite gunicorn timeout. - Fix the generation flow in the UI: do not resume until the global config has loaded, keep the in-flight marker when only URL signing fails, make render-resume per-execution, guard a malformed signed-URL expiry, and tolerate a single missing URL in a batch sign request. - Reject an empty template update and surface an expired status-viewer session. - Update the storage-rules test to match the deny-all rules, remove README trailing whitespace, and add a diff-check CI step. - Add regression tests for the above. Keep the PR google-marketing-solutions#113 fix from regressing: - Port the Cloud Tasks duplicate-execution lock. The worker acquires and releases a per-task Firestore lock and the deploy enables its TTL, so this rewrite of orch.py and deploy.sh does not drop the fix when it lands after PR google-marketing-solutions#113.
…ions#113 Cloud Tasks lock The lock added in PR google-marketing-solutions#113 reads executionId, nodeId, groupId and workflowDefinition from the task payload, so the older test that posted the pre-lock payload shape raised a KeyError and returned 500 (red Python CI). - Update test_worker_url_overrides_host_for_trigger_action to send a valid Cloud Tasks payload and acquire the lock. - Add tests for the lock behavior: a duplicate delivery is skipped with 'Already Triggered', a retryable failure releases the lock and returns 429, and a non-retryable failure keeps the lock. - Validate the task payload in the handler so a malformed body returns a clean 400 instead of a 500, and add a test for it. This keeps the PR google-marketing-solutions#113 duplicate-execution protection as the source of truth rather than weakening it to satisfy the stale test.
Direct signed PUT uploads go from the browser to GCS and bypass Flask's request-size limit, so the server could not previously reject an oversized upload. The client now declares sizeBytes on the /api/uploadUrl request and the server rejects an oversized or invalid declaration before signing, using a per-prefix limit (1 GiB for remix-input, 50 MiB for thumbnail). This is a pre-sign bound for the honest and accidental cases. A custom client could still understate or omit sizeBytes; a bypass-proof bound would sign an x-goog-content-length-range into the PUT URL, tracked as a follow-up. Content type is already allow-listed. - orch.py: validate sizeBytes against a per-prefix maximum. - media.ts: send sizeBytes (the file size, or the byte length for text). - Add tests for oversized, invalid, and within-limit declarations.
|
Thanks for the v2 review — all three blockers are addressed on this branch (now at 35b4ad6): P1 · Python CI (stale P1 · P2 · signed-upload size. The client now sends Separately, I validated the IAP-only deploy end-to-end on a fresh GCP project and the app is running. |
- deploy.sh: when this CLI's 'gcloud run deploy' lacks the --iap flag, enable IAP with 'gcloud run services update app --iap' automatically instead of only printing it as a manual step, and drop that manual step from the summary when the auto-enable succeeds. - deploy.sh: point the OAuth consent step at /auth/overview (the 'Get started' entry for a fresh project) and state it must be done before the custom OAuth client, which otherwise refuses with "you need to configure your OAuth consent screen".
The deploy only ever runs in IAP mode, so --auth-mode had a single valid value and the deployed services already set AUTH_MODE=iap themselves. Remove the flag from the script and the docs. Keep AUTH_MODE as a fixed value inside the script so the banners and the IAP post-deploy steps still read clearly.
The --yes flag only skipped the deploy-target confirmation, which is exactly what --non-interactive already does. Drop --yes and its -y alias and have the auto-confirm check key off --non-interactive, so there is one way to skip the prompt. Update the usage text and the README.
a4a8974 to
c1b3a6e
Compare
The global config resource had no error handling, so a failed /api/config request (a brief server error, a network blip, or an expired IAP session) put the resource in its error state. Reading its value then threw, which broke the home and setup pages instead of falling back to defaults. Catch the error in the loader and return nothing, so the existing fall-back-to-defaults code keeps working.
When a render finished, the code asked the backend for a signed link to the output video with no retry. If that one request failed for a moment, the render was recorded as a failure and the finished video was lost, even though the per-scene generation path already retries and keeps the work. Retry the signing the same way the generation path does, and if it still fails, keep the in-flight marker so reopening the project picks the finished video back up. Add a test for this case.
Two backend fixes: - The sign-URL endpoint signed every path in the request, and each path is one signing call to Google. With no limit, one user could send thousands of paths and use up the project's signing quota for everyone. Cap a single request at 100 paths. - Listing projects rebuilt the full storyboard of every project by reading all of its scenes, but the project list only shows the first scene as a thumbnail. Read just the first scene when listing, so opening the home page does far fewer reads.
The static rules tests already checked that no signed-in user can write Firestore directly, but there was no matching check for reads. Add one so a change that re-opened reads to any signed-in user would fail the tests, in every pull request, with no test project needed.
The fast unit-test runner compiles each test file but never reports type errors, so the tests passed even though seven test files had real type errors (they only showed up in the slower local test command). CI never caught them. Add a step that type-checks the test files and fix the seven errors it surfaces: drop references to env fields that no longer exist, complete a test object that was missing required fields, and guard an optional value.
The per-item commits added the code but left a few gaps, and an adversarial re-review of the change set surfaced more. This fills them: - Sign each URL-signing path once and bound the distinct-path count (orch.py). Repeated 'path' params no longer each spend a signBlob RPC, and the cap measures distinct paths (the real RPC cost) by deduping before the cap check, not raw params. - Guard the generation paths against an unloaded global config (remix-engine.ts). generateStoryboard and combineScenes now fail fast with a clear, recoverable message instead of a vague error from a non-null assertion on an undefined config, covering every caller that reads globalConfig fields. - Make the global-config degradation testable: extract the resource loader into loadGlobalConfig() and test that a failed /api/config resolves to undefined and that the default project config still builds. - Update the project-list test to the first-scene-only summary. - Test the signing limits (cap on distinct paths, duplicates signed once) and the resumed render retry-success path. - Match a parenthesized condition in the static no-broad-read rule test.
The autosave dedupe marked a config as saved the moment its POST/PATCH was sent, before the server responded. If that save failed and the user did not click the Retry snackbar, a later flush or save of the same unchanged project skipped it as already-saved and the work was lost on navigation. Split the two states: advance lastSavedConfig only on a successful response, and track the in-flight object separately so concurrent saves are still deduped. On failure the in-flight marker is cleared so a later flush, save, or the next autosave re-attempts the unsaved work. Add tests for the retry and the success dedupe.
Storyboard and combine generation already fail fast with a recoverable message when /api/config has not loaded, but candidate generation still read the config with a non-null assertion. An unloaded config turned into a vague 'Failed to start workflow' and then marked the scene permanently failed. Add the same 'configuration not loaded yet' guard before starting candidate generation: show a recoverable message and leave the scene unflagged. Also add a test for the live combine-render path when its output URL cannot be signed (marker kept, no error run, button reset).
POST wrote the project document directly even when its id already existed, overwriting it and re-stamping createdBy to the poster, which reassigned the owner shown in 'My projects'. PATCH already preserves the owner; POST did not. Return 409 Conflict when the id already exists, so updates go through PATCH and a repeated create cannot clobber an existing project or its owner. Add a test.
Signed download URLs are issued for every object in GCS_BUCKET, which is safe
only if the bucket is dedicated to Scene Machine. The deploy already creates the
default bucket, but if the name already existed it silently reused it, so a
shared or unrelated bucket could have its contents exposed.
Label buckets the deploy creates with app=scene-machine. On a later run a
labeled bucket is reused; an unlabeled bucket with the default
${PROJECT}-scene-machine name (an upgrade of an existing deployment) is adopted
in place by adding the label, with no data moved; any other pre-existing or
custom-named bucket is refused unless the deployer sets ADOPT_EXISTING_BUCKET=1.
No new deployer task or role: roles/storage.admin already covers bucket creation
and label updates.
…-deploy scope Follow-ups from the review: - README: state that Scene Machine needs a dedicated bucket it creates and owns, not to point GCS_BUCKET at shared data, and how redeploys reuse it (the default-named bucket is adopted automatically; a custom existing name needs ADOPT_EXISTING_BUCKET=1). - README: recommend deploy/grant-access.sh for admitting users, keeping the raw gcloud command as an alternative. - README: note this release targets fresh deployments and does not migrate existing Firestore project documents. - DEVELOPING.md: document the one-time config/global seed for local dev, since /api/config returns 404 on a fresh dev database.
After POST /api/projects became create-only, a POST whose id already exists returns 409. The client did not recognise it: it showed the generic save-failed snackbar, and Retry re-POSTed and hit 409 again, so a project the server already had could get stuck in a failing-save loop. On a 409, mark the id as already-persisted and retry the save once as a PATCH, which updates the existing project and keeps the user's edits. The retry goes through the normal save path with the latest project state, so it is tracked for de-duplication and cannot send stale data over newer edits.
A new project is built from the global config defaults (aspect ratio, resolution, model). When /api/config has not loaded yet, or failed to load, those defaults are empty, so a project created in that window was saved with missing settings. Guard the new-project route: if the config is not loaded, show a recoverable 'try again in a moment' message and return to the home page without creating anything.
Big 🫏 PR: IAP-gated single-image deploy with a fully mediated data plane
This PR replaces Scene Machine's deploy and auth architecture with one that fits a corporate Google Cloud environment. The app now runs as a single container image deployed as two Cloud Run services (a public, IAP-gated
appand a privateworker), the browser never touches Firestore or Cloud Storage directly (everything goes through a same-origin/apicontrol plane), and Firestore and Storage rules deny all direct client access. It also folds in the GA Gemini 3 models and improved outpainting, a reworkeddeploy.sh, new CI, and a batch of UI fixes.This is not an additive option. It is the new authoritative deploy and auth model for the project. Conflicts with
mainshould resolve to this branch.What and why
The previous architecture could not be deployed in our corporate environment:
This branch solves both:
/apion the same service that serves the SPA. The server uses the Admin SDK and hands back short-lived, server-signed GET and PUT URLs for media. The browser never holds Firestore or Storage credentials.Architecture: before and after
flowchart LR subgraph BEFORE["BEFORE (main)"] direction TB b_user["Browser<br/>(Firebase Auth sign-in)"] b_gw["API Gateway<br/>(apispec.template.yaml)"] b_ui["Angular UI on App Engine<br/>(ui/app.yaml, deploy-ui.sh)"] b_run["Backend on Cloud Run"] b_fs[("Firestore")] b_gcs[("Cloud Storage")] b_user --> b_ui b_user --> b_gw --> b_run b_user -- "direct read/write" --> b_fs b_user -- "direct read/write" --> b_gcs end subgraph AFTER["AFTER (this branch)"] direction TB a_user["Browser"] a_iap{{"Identity-Aware Proxy"}} a_app["app service (Cloud Run, public)<br/>serves SPA + same-origin /api"] a_worker["worker service (Cloud Run, private)"] a_tasks["Cloud Tasks (OIDC)"] a_fs[("Firestore<br/>rules: deny all client")] a_gcs[("Cloud Storage<br/>rules: deny all client")] a_user --> a_iap --> a_app a_app -- "Admin SDK" --> a_fs a_app -- "server-signed GET/PUT URLs" --> a_gcs a_app --> a_tasks -- "OIDC invoke" --> a_worker a_worker -- "Admin SDK" --> a_fs a_worker -- "Admin SDK" --> a_gcs endBefore: three deployables (API Gateway, App Engine UI, Cloud Run backend), Firebase Auth in the browser, and a client that reads and writes Firestore and Storage directly.
After: one container image built once and run as two Cloud Run services selected by environment (
ROLE,AUTH_MODE,WORKER_URL). Theappservice is public but IAP-gated and serves both the built SPA and a same-origin/apicontrol plane. Theworkerservice is private and is only ever invoked by Cloud Tasks using OIDC. No API Gateway, no App Engine, no separate UI deploy. All data access is mediated by the server.Major changes
Front-door topology (single image, two services). One image (
Dockerfile) runs asapporworkerbased onROLE, withAUTH_MODEandWORKER_URLselecting behavior.orch.pyvalidates these at startup (rejects an unknownROLE, requires the IAP audience whenROLE=appandAUTH_MODE=iap, requiresWORKER_URLwhenROLE=app). Theappservice serves the built SPA and a same-origin/api; theworkerservice is private.IAP front door and Firebase Auth removal. The Firebase Auth sign-in path is fully removed: no
@angular/fire, nofirebase-admin, no custom-token bridge. IAP is the only admission path. The browser-side sign-in UI is gone; the SPA assumes the request already passed IAP.Mediated data plane and storyboard split. The browser never touches Firestore or Storage directly. Uploads go through
POST /api/uploadUrl(server-signed PUT URLs) and reads throughGET /api/signUrl(server-signed GET URLs); config is read throughGET /api/config.firebase/firestore.rulesandfirebase/storage.rulesboth deny all direct client access (allow read, write: if false). Because the server signs GET URLs for any object inGCS_BUCKET, the deploy treats it as a dedicated bucket: it creates and labels the bucket (app=scene-machine), reuses or auto-adopts the default-named bucket on redeploy, and refuses to adopt a foreign or shared bucket unless the deployer explicitly opts in. Storyboards are split into a per-scene Firestore subcollection (projects/<id>/scenes/<index>inorch.py) so the 1 MiB per-document limit applies per scene instead of per project.Deploy, infra, and CI.
deploy.shis reworked into a single-image Cloud Build deploy that is idempotent and least-privilege: the runtime service account self-scopes token-creator and act-as, the worker invoker is service-scoped, REST calls use ADC tokens (safe under corporate Certificate-Based Access), and there are opt-in fast-deploy flags (--app-only,--skip-ui-build,--no-build-cache) plus a local no-deploy dev loop. Supporting scripts live indeploy/libs.shanddeploy/grant-access.sh. Build context and image are slimmed (.dockerignore,.gcloudignore,cloudbuild.yamllayer caching).GA Gemini 3 models and outpainting (folds in upstream PR #104). Defaults move to GA Gemini 3 models (
config.template.txt:GEMINI_MODEL=gemini-3.5-flash,IMAGE_MODEL=gemini-3-pro-image) with improved outpainting (actions/outpaint_image.py). Action signatures and example workflows are brought into parity with the new model parameters.UI features and fixes. Inline project-name editing on the storyboard, composition, and output headers; per-scene failure display; stuck-upload fix; position-based scene labels; and an image-import feature (paste, pasted URL lists, base64, and drag-from-another-tab across setup, storyboard reference, and the overlay dialog, with a browser-side fetch that has a timeout and size cap).
Tests and docs. UI vitest suite at 197 tests; Python suite at 146 tests. Two obsolete UI specs removed (
config.spec.ts,remix-engine.spec.ts).README.mdandDEVELOPING.mddocument the IAP-only deploy, the one-time console steps, the local dev loop, and the fast-deploy flags.Security and review hardening. Deny-all client rules, server-only config and access allow-list reads, an
ALLOWED_DOMAINScharset guard indeploy.sh, and static rule tests in CI. Several correctness and security edges found in review since this PR opened were also closed:/api/confighas not loaded instead of marking a scene failed,POST /api/projectsis create-only (returns 409 on an existing id), so it cannot overwrite a project or reassign its owner,GET /api/signUrlcaps the number of paths per request, preventing signing-quota exhaustion by a single caller,GCS_BUCKETmust be a dedicated bucket that the deploy creates, labels, and owns,Breaking changes and migration
main's deploy and auth architecture. The API Gateway, the App Engine UI, and the separate UI deploy are removed (apispec.template.yaml,ui/app.yaml,deploy-ui.share deleted). Firebase Auth sign-in is removed.Security model
appservice. No request reaches app code without passing IAP.GCS_BUCKET, so the deploy requires a bucket dedicated to Scene Machine. It creates and labels the bucket, auto-adopts the default-named one on redeploy, and refuses a foreign or shared bucket unless the deployer explicitly opts in.workerservice is private and only invokable by Cloud Tasks via OIDC; the invoker permission is scoped to that service.createdByis a display label, not an access gate. This is a deliberate small-team design choice, not an oversight.Testing and CI
All four GitHub Actions workflows are green on Linux runners:
ui-tests.yml: UI vitest suite (197 tests) plus a spec type-check and a productionng build.python-tests.yml: Python tests on 3.11, 3.12, and 3.13 (146 tests) plus the action-signature check.deploy-checks.yml: ShellCheck and static safety checks on the deploy scripts.firebase-rules.yml: static and behavioral tests of the Firestore and Storage rules.The full flow has also been deploy-validated end to end on a live GCP project using
./deploy.sh.Deploy
One command does the full, safe deploy. There is no auth-mode flag: the deploy is always IAP-gated (the public Firebase sign-in mode was removed), so a plain run is all there is.
A first-time deploy on a brand-new project needs a few one-time console steps:
502 "Empty OAuth client".deploy/grant-access.shto admit a user.The script prints these remaining steps verbatim at the end of a run, and they are documented in
README.md.Deploy options
A plain
./deploy.shalways runs the full, safe, end-to-end deploy. The flags below are opt-in, and each one logs what it skipped:--non-interactive: for headless or agent-driven runs. It auto-confirms the "deploy to this project?" prompt, and instead of pausing at a human-only console step it fails fast, prints the exact console URL and the command to re-run, then exits so an automation can surface it and resume once the step is done.--app-only: rebuild the image and redeploy only theappservice, reusing the liveworker(use when the worker did not change).--skip-ui-build(alias--use-existing-ui-dist): reuse the existingui/distinstead of rebuilding the Angular app. The deploy refuses aui/distthat was built for local dev (sign-in disabled).--no-build-cache: force a clean, cold image build, for a release or a dependency refresh.Local development (UI)
You do not run a full deploy to work on the UI. A local loop edits and hot-reloads the Angular app in seconds while still calling the same
/apiendpoints the deployed app uses. People using Scene Machine should use a deployed instance; this path is for people building it.One-time setup renders the two gitignored files the UI needs (
ui/src/env.ts,ui/definitions/config.json) in the dev front-door mode, and points the local backend at a dev project through your Application Default Credentials:Then run two terminals:
Open the printed
http://localhost:4200; edits underui/srchot-reload. A proxy (ui/proxy.conf.json) forwards every/apicall to the local backend, so the browser uses the same relative/api/...URLs as production with no CORS setup. On a fresh dev database, seed theconfig/globaldocument once (DEVELOPING.md gives the command) so/api/configdoes not return 404.AUTH_MODE=none(and the UI'scontrolPlaneMode: 'none') removes the sign-in gate. This is dev only: fine on localhost, unacceptable in production, anddeploy.shrefuses to build or ship anoneUI (including via--skip-ui-build).LOCAL_WORKER=1runs workflow actions in-process instead of scheduling Cloud Tasks, so no separate worker is needed. It is dev-only (no retries or backoff) and is honored only whenAUTH_MODE=none, so it can never turn on in a deployed service.The full local-dev guide, including running an end-to-end workflow locally, is in
DEVELOPING.md.Known trade-offs and notes