feat: support DefaultDataCredential for workflow_log and workflow_data #749

@KeitaW

Problem

workflow_log and workflow_data configs only accept StaticDataCredential, requiring static IAM access keys (access_key_id + access_key). There is no way to use the SDK's default credential chain (EKS Pod Identity, IRSA, GKE Workload Identity, instance metadata) for workflow log/data storage.

The dataset/bucket system already supports DefaultDataCredential through the DataCredential union (added in #508 for Azure workload identity). Workflow logs/data should have parity.

Current Behavior

LogConfig and DataConfig restrict credential type to StaticDataCredential:

class LogConfig(ExtraArgBaseModel):
    credential: StaticDataCredential | None = None  # only static keys

class DataConfig(ExtraArgBaseModel):
    credential: StaticDataCredential | None = None  # only static keys

Attempting to configure with only endpoint/region (no keys):

curl -X PATCH /api/configs/workflow -d '{
  "workflow_log": {
    "credential": { "endpoint": "s3://my-bucket", "region": "us-west-2" }
  }
}'
# → 422 Validation Error: access_key_id field required

Expected Behavior

curl -X PATCH /api/configs/workflow -d '{
  "workflow_log": {
    "credential": { "endpoint": "s3://my-bucket", "region": "us-west-2" }
  }
}'
# → 200 OK (uses pod's ambient AWS credentials via DefaultDataCredential)

Why This Already Almost Works

PR #508 added DefaultDataCredential support for Azure workload identity but left the workflow log/data path incomplete:

| Component | Supports DefaultDataCredential? |
| --- | --- |
| DataCredential union | ✅ Union[StaticDataCredential, DefaultDataCredential] |
| storage.Client.create() | ✅ accepts DataCredential |
| S3 backend | ✅ match/case handles DefaultDataCredential |
| workflow_service.py | ✅ delegates to Client.create(), never accesses credential fields directly |
| LogConfig / DataConfig | ❌ restricted to StaticDataCredential |
| create_config_dict / data_endpoints | ❌ restricted to StaticDataCredential |
| Go sidecar MountURL | ❌ unconditionally sets empty env vars, clobbers ambient chain |

Proposed Changes

Python (2 files)

src/utils/connectors/postgres.py:

  • LogConfig.credential: StaticDataCredential | None → DataCredential | None
  • DataConfig.credential: StaticDataCredential | None → DataCredential | None
  • get_all_data_creds return type: Dict[str, StaticDataCredential] → Dict[str, DataCredential]

src/utils/job/task.py:

  • create_config_dict parameter: dict[str, StaticDataCredential] → dict[str, DataCredential]
  • data_endpoints parameter: Dict[str, StaticDataCredential] → Dict[str, DataCredential]
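Widening these signatures means downstream code can no longer assume that access_key_id and access_key exist on every credential. The following is a minimal sketch of how a consumer of Dict[str, DataCredential] might branch on the credential type. The dataclasses are simplified stand-ins for the real models, and client_kwargs and resolve_all are hypothetical helpers that do not exist in the codebase.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class StaticDataCredential:
    """Simplified stand-in for the real model (an assumption)."""
    endpoint: str
    access_key_id: str
    access_key: str
    region: Optional[str] = None


@dataclass(frozen=True)
class DefaultDataCredential:
    """Simplified stand-in: carries no key material at all."""
    endpoint: str
    region: Optional[str] = None


DataCredential = Union[StaticDataCredential, DefaultDataCredential]


def client_kwargs(credential: DataCredential) -> Dict[str, str]:
    """Hypothetical helper: keyword arguments for an S3-style client."""
    kwargs: Dict[str, str] = {}
    if credential.region is not None:
        kwargs["region_name"] = credential.region
    if isinstance(credential, StaticDataCredential):
        # Only static credentials carry explicit keys.
        kwargs["aws_access_key_id"] = credential.access_key_id
        kwargs["aws_secret_access_key"] = credential.access_key
    # For DefaultDataCredential no keys are passed, leaving the SDK's
    # default credential chain to resolve them.
    return kwargs


def resolve_all(creds: Dict[str, DataCredential]) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: accepts the widened dict type."""
    return {name: client_kwargs(cred) for name, cred in creds.items()}


creds: Dict[str, DataCredential] = {
    "workflow_log": DefaultDataCredential("s3://my-bucket", "us-west-2"),
    "workflow_data": StaticDataCredential(
        "s3://my-bucket", "AKIAEXAMPLE", "example-secret", "us-west-2"
    ),
}
resolved = resolve_all(creds)
```

Any existing code path that reads credential.access_key_id without a type check would need the same kind of guard once the annotation is widened.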

Go sidecar (1 file)

src/runtime/pkg/data/data.go: MountURL unconditionally sets static key env vars:

os.Setenv("AWS_ACCESS_KEY_ID", dataCredential.AccessKeyId)
os.Setenv("AWS_SECRET_ACCESS_KEY", dataCredential.AccessKey)

When using DefaultDataCredential, these are empty strings — clobbering the SDK's ambient credential chain. Fix:

if dataCredential.AccessKeyId != "" {
    os.Setenv("AWS_ACCESS_KEY_ID", dataCredential.AccessKeyId)
    os.Setenv("AWS_SECRET_ACCESS_KEY", dataCredential.AccessKey)
}
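To make the guard's behavior concrete, here is the same logic sketched in Python rather than Go, operating on a plain dict that stands in for the process environment. export_static_keys is a hypothetical name used only for illustration, and the role ARN and key values are placeholders.

```python
import os
from typing import MutableMapping, Optional


def export_static_keys(
    access_key_id: str,
    access_key: str,
    env: Optional[MutableMapping[str, str]] = None,
) -> bool:
    """Write static AWS keys to the environment only when they are present.

    Returns True if the variables were written, False if the environment
    was left untouched so the default credential chain can take over.
    """
    target = os.environ if env is None else env
    if access_key_id == "":
        # DefaultDataCredential case: writing empty strings here would
        # override whatever the ambient chain would otherwise resolve.
        return False
    target["AWS_ACCESS_KEY_ID"] = access_key_id
    target["AWS_SECRET_ACCESS_KEY"] = access_key
    return True


# A pod relying on ambient credentials: the environment is not modified.
ambient_env = {"AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/example"}
wrote_ambient = export_static_keys("", "", ambient_env)

# A static credential: both variables are exported as before.
static_env: dict = {}
wrote_static = export_static_keys("AKIAEXAMPLE", "example-secret", static_env)
```

The guard keys off access_key_id alone, matching the Go snippet above. A stricter variant could also require a non-empty access_key before exporting either variable.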

No changes needed

  • workflow_service.py — already generic
  • client.py — already accepts DataCredential union
  • s3.py — already has DefaultDataCredential match arm
  • credentials.py — union already defined

Pydantic Union Discrimination

No discriminator is needed. StaticDataCredential requires both access_key_id and access_key; DefaultDataCredential declares neither. With extra = 'forbid' on both models, the payload's shape alone selects the type:

  • Payload has access_key_id → DefaultDataCredential rejects (extra field) → StaticDataCredential matches
  • Payload lacks access_key_id → DefaultDataCredential matches
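The matching rule can be modeled without any dependencies. The sketch below does not call Pydantic. It reimplements the two constraints (required fields must be present, unknown fields are forbidden) over simplified dataclass stand-ins, with parse_credential as a hypothetical helper, to show that the two payload shapes cannot both match.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class StaticDataCredential:
    """Simplified stand-in: both key fields are mandatory."""
    endpoint: str
    access_key_id: str
    access_key: str
    region: Optional[str] = None


@dataclass(frozen=True)
class DefaultDataCredential:
    """Simplified stand-in: no key fields are declared."""
    endpoint: str
    region: Optional[str] = None


DataCredential = Union[StaticDataCredential, DefaultDataCredential]
CANDIDATES = (StaticDataCredential, DefaultDataCredential)


def _accepts(cls: type, payload: Dict[str, Any]) -> bool:
    """True if payload has every required field and no unknown field."""
    declared = {f.name for f in fields(cls)}
    required = {
        f.name
        for f in fields(cls)
        if f.default is MISSING and f.default_factory is MISSING
    }
    keys = set(payload)
    # required <= keys: nothing mandatory is missing.
    # keys <= declared: models extra = 'forbid'.
    return required <= keys <= declared


def parse_credential(payload: Dict[str, Any]) -> DataCredential:
    """Return the first candidate type whose constraints the payload meets."""
    for cls in CANDIDATES:
        if _accepts(cls, payload):
            return cls(**payload)
    raise ValueError("payload matches no credential type")


ambient = parse_credential({"endpoint": "s3://my-bucket", "region": "us-west-2"})
static = parse_credential(
    {
        "endpoint": "s3://my-bucket",
        "region": "us-west-2",
        "access_key_id": "AKIAEXAMPLE",
        "access_key": "example-secret",
    }
)
```

Under this model a payload that supplies access_key_id without access_key matches neither type and is rejected, so a half-configured static credential does not silently fall back to the ambient chain.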

Motivation

Static credential management for workflow logs creates operational complexity:

  1. Create IAM user + access key
  2. Store in a secret manager
  3. Sync to K8s secret
  4. PATCH to OSMO API after every deployment
  5. OSMO encrypts with MEK, stores in Postgres
  6. Any MEK rotation risks credential loss (see #731, "MEK re-encryption wraps undecryptable JWE in a new JWE layer, causing config bloat")

With DefaultDataCredential, steps 1–6 become: configure endpoint + region once. The pod's ambient credentials handle the rest.

Environment

  • OSMO v6.2-rc6 (also reproducible on main @ d1e8a17)
  • Validated on EKS with IRSA
