feat: add fake-gcs-server to devstack with GCS integration tests #3155

Open
npow wants to merge 6 commits into master from gcs-devstack-tests

npow (Collaborator) commented Apr 27, 2026

Summary

  • Add fake-gcs-server (Google Cloud Storage emulator) as a devstack component with full CI coverage
  • Add gcs-local backend to the UX test CI matrix
  • Monkey-patch GCS client factory to use anonymous credentials with the emulator
  • Add Tiltfile, k8s deployment, bucket init job, and secret for fake-gcs-server

Resurrected from npow/devstack-fake-gcs where these changes were added then removed in a cleanup pass.

Test plan

  • Verify gcs-local backend passes in CI UX test matrix
  • Confirm fake-gcs-server starts and bucket init succeeds in devstack
  • Verify GCS anonymous credential monkey-patch works with the emulator

🤖 Generated with Claude Code

Nissan Pow added 3 commits April 27, 2026 18:54
Restore fake-gcs-server (Google Cloud Storage emulator) as a devstack
component with full CI coverage via a new gcs-local backend.

- Add fake-gcs-server Tiltfile, k8s deployment, bucket init job, and secret
- Add gcs-local backend to GHA matrix (minio + postgresql + metadata-service + fake-gcs-server)
- Add gcs-local backend to ux_test_config.yaml (runner-only, no scheduler)
- Monkey-patch GCS client factory in conftest to use anonymous credentials
  with the emulator (google.auth.default() fails without real GCP creds)
- Update verify_run_provenance to accept ds-type 'gs' for GCS backends
- Install google-cloud-storage in CI for gcs-local backend
- Set METAFLOW_DEFAULT_DATASTORE=gs and STORAGE_EMULATOR_HOST via GITHUB_ENV

When STORAGE_EMULATOR_HOST is set, create a plain storage.Client()
that auto-detects the emulator instead of calling google.auth.default()
which fails without real GCP credentials. This fixes gcs-local CI tests
where flow subprocesses inherit the emulator env var but have no GCP
credentials configured.
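
The emulator branch described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual code in gs_storage_client_factory.py: the helper names `emulator_host` and `make_gs_client` are hypothetical, and the sketch relies on the behaviour stated in the commit message, namely that a plain `storage.Client()` detects STORAGE_EMULATOR_HOST on its own.

```python
import os


def emulator_host(environ=None):
    """Return the emulator endpoint if STORAGE_EMULATOR_HOST is set and non-empty."""
    environ = os.environ if environ is None else environ
    host = environ.get("STORAGE_EMULATOR_HOST", "").strip()
    return host or None


def make_gs_client():
    """Hypothetical factory: skip the credential lookup when an emulator is configured."""
    # Deferred imports so this module loads even without the GCP SDK installed.
    from google.cloud import storage

    if emulator_host():
        # Assumption (per the commit message): the client picks up
        # STORAGE_EMULATOR_HOST itself, so no credentials are resolved here.
        return storage.Client()

    import google.auth

    # Normal path: resolve Application Default Credentials.
    credentials, project = google.auth.default()
    return storage.Client(credentials=credentials, project=project)
```

Keeping the environment check in a small pure function makes the branch testable without GCP libraries or credentials, which matters for flow subprocesses that inherit only the environment variable.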

greptile-apps Bot (Contributor) commented Apr 27, 2026

Greptile Summary

This PR adds fake-gcs-server as a devstack component and wires up a gcs-local CI backend that runs Metaflow flows locally against the emulator. The core GCS client factory change (checking STORAGE_EMULATOR_HOST before calling google.auth.default()) is the right fix and covers all production callers. The remaining P2 items — redundant monkey-patch in conftest.py, missing readiness probe on the server Deployment, and the already-noted floating image tags / non-idempotent bucket init — are worth cleaning up but do not block the feature from working.

Confidence Score: 5/5

Safe to merge; all findings are P2 style/reliability improvements with no blocking correctness bugs.

No P0 or P1 issues found. The factory change is the reliable emulator fix, and the CI matrix entry is correctly configured. Remaining findings (redundant monkey-patch, missing readiness probe, floating image tags, non-idempotent init) are P2.

devtools/tilt/k8s/fake-gcs-server.yaml and gcs-bucket-init-job.yaml for readiness probe and image pinning; test/ux/core/conftest.py for the now-redundant monkey-patch.

Important Files Changed

| Filename | Overview |
| --- | --- |
| metaflow/plugins/gcp/gs_storage_client_factory.py | Adds emulator-aware branch to _get_gs_storage_client_default: when STORAGE_EMULATOR_HOST is set, creates a plain storage.Client() instead of calling google.auth.default(). Correct and idiomatic. |
| test/ux/core/conftest.py | Adds _setup_gcs_emulator() which monkey-patches factory.get_gs_storage_client; this is redundant with the factory fix already in this PR and is import-order-fragile for direct importers. |
| .github/workflows/ux-tests.yml | Adds gcs-local matrix entry with STORAGE_EMULATOR_HOST and GCS deps install step; timeout field is present (900), extra_args handled correctly. |
| devtools/tilt/k8s/fake-gcs-server.yaml | Deployment uses floating latest tag and lacks a readiness probe, which can cause the bucket-init job to race with server startup. |
| devtools/tilt/k8s/gcs-bucket-init-job.yaml | Uses floating curlimages/curl:latest tag; curl -sf returns non-zero on a 409 (bucket already exists), making restartPolicy: OnFailure retry loop terminate in a failed state. |
| devtools/tilt/fake_gcs_server.tiltfile | Correctly registers k8s manifests, port-forwards, and returns result with GCS sysroot config and the fake-gcs-secret for pod injection. |
| devtools/tilt/k8s/fake-gcs-secret.yaml | Simple Opaque secret that injects STORAGE_EMULATOR_HOST (cluster-internal URL) into k8s pods. |
| test/ux/core/test_utils.py | Extends verify_run_provenance to assert ds-type == 'gs' when METAFLOW_DEFAULT_DATASTORE=gs; logic is correct. |
| test/ux/ux_test_config.yaml | Adds gcs-local backend entry with scheduler_type/cluster null for local-only flow execution against the GCS emulator. |
| devtools/Tiltfile | Registers fake-gcs-server as a Tilt component with no dependencies and loads the new tiltfile; straightforward addition. |
| .github/workflows/full-stack-test.yml | Adds job-level timeout-minutes, extends wait timeout to 900s, adds tilt state dump on failure, and ensures teardown always runs; all reasonable improvements. |
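
The startup race noted for fake-gcs-server.yaml (bucket init running before the server accepts requests) could also be guarded on the client side by polling until the emulator answers. The sketch below is an illustration of that idea, not the Kubernetes readiness probe itself and not code from this PR. The names `wait_until_ready` and `emulator_probe` are hypothetical, and using a GET on /storage/v1/b as the health check is an assumption about the emulator.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True; report whether it ever succeeded."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # back off before the next attempt
    return False


def emulator_probe(base_url):
    """Build a probe that lists buckets on the emulator (assumed health check)."""
    def probe():
        try:
            with urllib.request.urlopen(f"{base_url}/storage/v1/b", timeout=2) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error: not ready yet.
            return False
    return probe
```

A caller might run `wait_until_ready(emulator_probe("http://fake-gcs-server:4443"))` before creating the bucket. A readiness probe on the Deployment remains the more conventional fix inside the cluster.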

Reviews (4): Last reviewed commit: "fix(ci): skip conda tests for gcs-local,..."

Comment thread .github/workflows/ux-tests.yml
spec:
  containers:
    - name: fake-gcs-server
      image: fsouza/fake-gcs-server:latest

P2 Mutable latest image tag reduces reproducibility

fsouza/fake-gcs-server:latest can silently pick up a breaking upstream release between CI runs, making failures hard to diagnose. Pinning to a specific release tag (e.g. 1.15.0) keeps the environment reproducible. The same applies to curlimages/curl:latest in gcs-bucket-init-job.yaml.

restartPolicy: OnFailure
containers:
  - name: init
    image: curlimages/curl:latest

P2 Mutable latest image tag

curlimages/curl:latest is a floating tag; pinning to a digest or specific version (e.g. curlimages/curl:8.7.1) makes the bucket-init job deterministic across CI runs.

The gcs-local backend was missing timeout and memory values in the
CI matrix, causing pytest --timeout to receive an empty string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Comment on lines +12 to +20
image: curlimages/curl:latest
command: ["/bin/sh", "-ec"]
args:
  - |
    curl -sf -X POST \
      http://fake-gcs-server:4443/storage/v1/b \
      -H "Content-Type: application/json" \
      -d '{"name":"metaflow-test"}'
    echo "Bucket 'metaflow-test' created successfully"

P1 Non-idempotent bucket init breaks on retry

With restartPolicy: OnFailure, if the container is killed after the bucket is created (e.g., OOM eviction, node pressure) but before the Job records success, Kubernetes restarts the container. The second attempt POSTs to an already-existing bucket, gets a 409, and curl -sf treats that as a failure — causing repeated retries until the Job's backoffLimit is exhausted and it enters a permanent Failed state. Handling the 409 makes the script safe to retry:

Suggested change
- image: curlimages/curl:latest
- command: ["/bin/sh", "-ec"]
- args:
-   - |
-     curl -sf -X POST \
-       http://fake-gcs-server:4443/storage/v1/b \
-       -H "Content-Type: application/json" \
-       -d '{"name":"metaflow-test"}'
-     echo "Bucket 'metaflow-test' created successfully"
+ image: curlimages/curl:latest
+ command: ["/bin/sh", "-ec"]
+ args:
+   - |
+     HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
+       http://fake-gcs-server:4443/storage/v1/b \
+       -H "Content-Type: application/json" \
+       -d '{"name":"metaflow-test"}')
+     [ "$HTTP_STATUS" = "200" ] || [ "$HTTP_STATUS" = "409" ] || \
+       (echo "Unexpected status: $HTTP_STATUS" && exit 1)
+     echo "Bucket 'metaflow-test' ready (status: $HTTP_STATUS)"
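
For comparison, the same retry-safe logic can be written in Python. This is a sketch under assumptions rather than part of the PR: `bucket_ready` and `ensure_bucket` are hypothetical names, and the set of accepted statuses (200 for created, 409 for already exists) simply mirrors the shell suggestion above.

```python
import json
import urllib.error
import urllib.request

# Statuses that mean the bucket exists after the call: created now, or already there.
READY_STATUSES = {200, 409}


def bucket_ready(status):
    """True if the HTTP status means the bucket is usable."""
    return status in READY_STATUSES


def ensure_bucket(base_url, name, opener=urllib.request.urlopen):
    """POST a bucket-create request and treat 'already exists' as success."""
    request = urllib.request.Request(
        f"{base_url}/storage/v1/b",
        data=json.dumps({"name": name}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; keep the code so a 409 can be accepted.
        status = err.code
    if not bucket_ready(status):
        raise RuntimeError(f"Unexpected status: {status}")
    return status
```

Calling `ensure_bucket("http://fake-gcs-server:4443", "metaflow-test")` twice should succeed both times, which is the property the Job needs under restartPolicy: OnFailure.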

npow and others added 2 commits April 27, 2026 19:24
The full-stack-test workflow was timing out on generate-configs with
WAIT_TIMEOUT=600 (10 min). CI runners are slow and services sometimes
need longer to initialize.

- Increase WAIT_TIMEOUT from 600 to 900 (15 min)
- Add timeout-minutes: 30 to prevent runaway jobs (was using 6h default)
- Add diagnostic step on failure: dump tilt resource status and recent logs
- Run teardown with if: always() so cleanup happens on failure too

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The conda test takes too long to set up, causing the minikube
port-forwarding to fake-gcs-server to die mid-test. Skip conda
tests for gcs-local since they test conda integration, not the
GCS datastore backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>