Skip to content

feat(gcs): adds WIF Workload Identification Federation support#16002

Merged
bryanprosser-acryl merged 39 commits into
masterfrom
gcs-wif-support-update
Mar 27, 2026
Merged

feat(gcs): adds WIF Workload Identification Federation support#16002
bryanprosser-acryl merged 39 commits into
masterfrom
gcs-wif-support-update

Conversation

@bryanprosser-acryl

@bryanprosser-acryl bryanprosser-acryl commented Jan 28, 2026

Copy link
Copy Markdown
Contributor

This PR adds in support for Google' Workload Identity Federation as an alternative to the existing HMAC authentication for connecting to GCP. It also adds unit tests and updates the documentation to reflect the additional config options.

@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Jan 28, 2026
@datahub-cyborg datahub-cyborg Bot added the needs-review Label for PRs that need review from a maintainer. label Jan 28, 2026
@codecov

codecov Bot commented Jan 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.01325% with 3 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ion/src/datahub/ingestion/source/gcs/gcs_source.py 97.64% 2 Missing ⚠️
.../datahub/ingestion/source/common/gcp_wif_config.py 98.48% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@bryanprosser-acryl

Copy link
Copy Markdown
Contributor Author

@sgomezvillamor
Updated the PR based on your comments above - main changes made:

  1. Docs structure/organisation

Restructured the GCS docs to follow the new _pre.md / _post.md / _recipe.yml format per docs/sources/AGENTS.md. This also resolved the merge conflicts with master. Added WIF use-case examples to gcs_pre.md covering the scenarios you suggested (GKE workloads, cross-cloud auth, credential rotation for compliance).

  1. Shared GCP location for WIF

Moved the WIF implementation out of gcs_source.py into a new shared module: source/common/gcp_wif_config.py. This exposes two reusable exports:

GCPWIFConfig — a Pydantic config mixin with the three input options (file path, JSON dict, JSON string) and format validators. BigQuery, Dataplex, VertexAI can inherit from this directly when they need WIF support.
load_wif_credentials(config) — loads google.auth.credentials.Credentials from any of the three options, applies the cloud-platform scope, and returns (credentials, project_id). This is the exact interface those other sources would need to pass to their google-cloud clients.
GCSSource now just calls load_wif_credentials(self.config) — the GCS-specific part (bearer token injection into boto3) stays in gcs_source.py since that's unique to the S3-compatible XML API approach.

  1. Missing platform, platform_instance, convert_urns_to_lowercase

These were missing from the WIF branch of create_equivalent_s3_config. Fixed — both DataLakeSourceConfig instantiation paths (HMAC and WIF) now pass all three fields.

  1. Temp file security / avoid writing credentials to disk

Eliminated temp file usage entirely. Switched from load_credentials_from_file() to google.auth.load_credentials_from_dict(), which loads WIF credentials directly from a Python dict in memory — no file is ever written. This applies to all three input options (file path is read once into a dict; the dict and JSON string options never touch disk at all).

@bryanprosser-acryl

Copy link
Copy Markdown
Contributor Author

I don't know why the vertica plugin in failing

Comment on lines +109 to +116
if not any(
[
wif_config.gcp_wif_configuration,
wif_config.gcp_wif_configuration_json,
wif_config.gcp_wif_configuration_json_string,
]
):
raise ValueError("No valid WIF configuration provided")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be moved up as a pydantic validation
you could also add the validation that only one is set

"Using Workload Identity Federation configuration from JSON string"
)

credentials, project_id = load_credentials_from_dict(wif_config_dict)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +184 to +193
if len(provided_options) == 0:
raise ValueError(
"One of gcp_wif_configuration (file path), gcp_wif_configuration_json (JSON content), "
"or gcp_wif_configuration_json_string (JSON string) is required when auth_type is 'workload_identity_federation'"
)
elif len(provided_options) > 1:
raise ValueError(
"Cannot specify multiple WIF configuration options. Use only one of: "
"gcp_wif_configuration, gcp_wif_configuration_json, or gcp_wif_configuration_json_string."
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh.... I seee, the validation I was suggesting is here already 👍

Comment on lines +281 to +288
# Pydantic v2 models freeze field assignment after __init__; object.__setattr__
# bypasses this to attach runtime-only state that is not part of the config schema.
object.__setattr__(
aws_config, "_gcs_oauth_credentials", self._wif_credentials
)
object.__setattr__(
aws_config, "_gcs_oauth_project_id", self._wif_project_id
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks hacky, and I haven't found any reference to these fields in internet or ALL github 🤔

how much is this needed?
what's the error you get if not set?

we need more contextual information in the comment in case we ever need to revisit this

@sgomezvillamor sgomezvillamor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

I'm a little bit concerned about the object.__setattr__ calls for _gcs_oauth_credentials and _gcs_oauth_project_id
Approving to unblock but please let's provide more info or context on that.

@sgomezvillamor

Copy link
Copy Markdown
Contributor

I don't know why the vertica plugin in failing

It was fixed with 5b1a316, so update branch to fix

@maggiehays maggiehays added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Mar 26, 2026
…ptions in GCPWIFConfig; refactor GCSOAuthAwsConnectionConfig to use private attributes for credentials
@bryanprosser-acryl

Copy link
Copy Markdown
Contributor Author

Overall LGTM

I'm a little bit concerned about the object.__setattr__ calls for _gcs_oauth_credentials and _gcs_oauth_project_id Approving to unblock but please let's provide more info or context on that.

Updated this to be pydantic v2 way to declare instance-level runtime state that isn't part of the config schema.

@bryanprosser-acryl bryanprosser-acryl merged commit 3a0474c into master Mar 27, 2026
48 of 49 checks passed
@bryanprosser-acryl bryanprosser-acryl deleted the gcs-wif-support-update branch March 27, 2026 17:57
david-leifker pushed a commit that referenced this pull request May 27, 2026
- feat(dataplex): add support for more Dataplex entry groups and hierarchy mapping (#16723)
- feat(dagster): Emit StatusClass aspect during asset ingestion to handle soft deleted assets (#16809)
- fix(ui): enable lint rule on colors for diff files (#16813)
- feat(gcs): adds WIF Workload Identification Federation support (#16002)
- fix(ingest/dep): CVE-2024-27459 (#16822)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants