feat: add fake-gcs-server to devstack with GCS integration tests #3155
Restore fake-gcs-server (Google Cloud Storage emulator) as a devstack component with full CI coverage via a new gcs-local backend.
- Add fake-gcs-server Tiltfile, k8s deployment, bucket init job, and secret
- Add gcs-local backend to GHA matrix (minio + postgresql + metadata-service + fake-gcs-server)
- Add gcs-local backend to ux_test_config.yaml (runner-only, no scheduler)
- Monkey-patch GCS client factory in conftest to use anonymous credentials with the emulator (google.auth.default() fails without real GCP creds)
- Update verify_run_provenance to accept ds-type 'gs' for GCS backends
- Install google-cloud-storage in CI for gcs-local backend
- Set METAFLOW_DEFAULT_DATASTORE=gs and STORAGE_EMULATOR_HOST via GITHUB_ENV
When STORAGE_EMULATOR_HOST is set, create a plain storage.Client() that auto-detects the emulator instead of calling google.auth.default() which fails without real GCP credentials. This fixes gcs-local CI tests where flow subprocesses inherit the emulator env var but have no GCP credentials configured.
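The branching described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual Metaflow factory: the function names `emulator_host` and `make_gcs_client` are hypothetical, and it assumes a google-cloud-storage version in which a bare `storage.Client()` picks up `STORAGE_EMULATOR_HOST` on its own.

```python
import os


def emulator_host(env=None):
    """Return the emulator URL if STORAGE_EMULATOR_HOST is set and non-empty, else None.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    return env.get("STORAGE_EMULATOR_HOST") or None


def make_gcs_client():
    """Hypothetical factory: skip google.auth.default() when an emulator is configured."""
    # Imported lazily so this module can be loaded without google-cloud-storage.
    from google.cloud import storage

    if emulator_host():
        # Assumption: the client library detects STORAGE_EMULATOR_HOST itself,
        # so no real GCP credentials are looked up on this path.
        return storage.Client()

    import google.auth

    # Normal path: resolve Application Default Credentials.
    credentials, project = google.auth.default()
    return storage.Client(credentials=credentials, project=project)
```

Because flow subprocesses inherit the environment, a check of this kind inside the factory applies to them as well, whereas a monkey-patch in conftest only affects the test process.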
Greptile Summary
This PR adds
Confidence Score: 5/5
Safe to merge; all findings are P2 style/reliability improvements with no blocking correctness bugs.
No P0 or P1 issues found. The factory change is the reliable emulator fix, and the CI matrix entry is correctly configured. Remaining findings (redundant monkey-patch, missing readiness probe, floating image tags, non-idempotent init) are P2.
devtools/tilt/k8s/fake-gcs-server.yaml and gcs-bucket-init-job.yaml for readiness probe and image pinning; test/ux/core/conftest.py for the now-redundant monkey-patch.
Reviews (4): Last reviewed commit: "fix(ci): skip conda tests for gcs-local,..."
    spec:
      containers:
        - name: fake-gcs-server
          image: fsouza/fake-gcs-server:latest
Mutable latest image tag reduces reproducibility
fsouza/fake-gcs-server:latest can silently pick up a breaking upstream release between CI runs, making failures hard to diagnose. Pinning to a specific release tag (e.g. 1.15.0) keeps the environment reproducible. The same applies to curlimages/curl:latest in gcs-bucket-init-job.yaml.
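One way to keep floating tags from creeping back in is a small lint over the manifests. The sketch below is only an illustration and is not part of this PR; the helper names are hypothetical, and it scans `image:` lines with a regular expression instead of parsing YAML.

```python
import re

# Matches lines such as "image: repo/name:tag" or "- image: repo/name:tag".
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*(\S+)\s*$")


def is_floating(image):
    """Return True if the image reference has no tag or uses the mutable 'latest' tag."""
    if "@" in image:
        return False  # pinned by digest
    # Only inspect the final path component so a registry port is not mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        return True  # no tag at all, which resolves to 'latest'
    return name.rsplit(":", 1)[1] == "latest"


def floating_images(manifest_text):
    """Collect every image reference in the manifest text that is not pinned."""
    found = []
    for line in manifest_text.splitlines():
        match = IMAGE_LINE.match(line)
        if match:
            image = match.group(1).strip("\"'")
            if is_floating(image):
                found.append(image)
    return found
```

Run against the two manifests named in this comment, a check like this would be expected to report both `fsouza/fake-gcs-server:latest` and `curlimages/curl:latest`.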
    restartPolicy: OnFailure
    containers:
      - name: init
        image: curlimages/curl:latest
The gcs-local backend was missing timeout and memory values in the CI matrix, causing pytest --timeout to receive an empty string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    image: curlimages/curl:latest
    command: ["/bin/sh", "-ec"]
    args:
      - |
        curl -sf -X POST \
          http://fake-gcs-server:4443/storage/v1/b \
          -H "Content-Type: application/json" \
          -d '{"name":"metaflow-test"}'
        echo "Bucket 'metaflow-test' created successfully"
Non-idempotent bucket init breaks on retry
With restartPolicy: OnFailure, if the container is killed after the bucket is created (e.g., OOM eviction, node pressure) but before the Job records success, Kubernetes restarts the container. The second attempt POSTs to an already-existing bucket, gets a 409, and curl -sf treats that as a failure — causing repeated retries until the Job's backoffLimit is exhausted and it enters a permanent Failed state. Handling the 409 makes the script safe to retry:
Suggested change:

    image: curlimages/curl:latest
    command: ["/bin/sh", "-ec"]
    args:
      - |
        HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
          http://fake-gcs-server:4443/storage/v1/b \
          -H "Content-Type: application/json" \
          -d '{"name":"metaflow-test"}')
        [ "$HTTP_STATUS" = "200" ] || [ "$HTTP_STATUS" = "409" ] || \
          (echo "Unexpected status: $HTTP_STATUS" && exit 1)
        echo "Bucket 'metaflow-test' ready (status: $HTTP_STATUS)"
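The same retry-safe logic, written out in Python for readers who find the shell version hard to follow. This is an illustrative sketch using only the standard library; `ensure_bucket` and `bucket_ready` are hypothetical names, and the endpoint path and payload are copied from the job above.

```python
import json
import urllib.error
import urllib.request

# 200 means the bucket was created; 409 means it already exists. Both are fine.
OK_STATUSES = {200, 409}


def bucket_ready(status):
    """Return True if the HTTP status means the bucket exists after the call."""
    return status in OK_STATUSES


def ensure_bucket(base_url, name):
    """POST a bucket-create request and treat 'already exists' as success."""
    request = urllib.request.Request(
        f"{base_url}/storage/v1/b",
        data=json.dumps({"name": name}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; recover the status code from the error.
        status = err.code
    if not bucket_ready(status):
        raise RuntimeError(f"Unexpected status: {status}")
    return status
```

Called as `ensure_bucket("http://fake-gcs-server:4443", "metaflow-test")`, a second run after a successful first run should return 409 and not raise, which is the property the Job needs under restartPolicy: OnFailure.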
The full-stack-test workflow was timing out on generate-configs with WAIT_TIMEOUT=600 (10 min). CI runners are slow and services sometimes need longer to initialize.
- Increase WAIT_TIMEOUT from 600 to 900 (15 min)
- Add timeout-minutes: 30 to prevent runaway jobs (was using 6h default)
- Add diagnostic step on failure: dump tilt resource status and recent logs
- Run teardown with if: always() so cleanup happens on failure too

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The conda test takes too long to set up, causing the minikube port-forwarding to fake-gcs-server to die mid-test. Skip conda tests for gcs-local since they test conda integration, not the GCS datastore backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- gcs-local backend to the UX test CI matrix

Resurrected from npow/devstack-fake-gcs, where these changes were added then removed in a cleanup pass.

Test plan
- gcs-local backend passes in CI UX test matrix

🤖 Generated with Claude Code