
Feat/spiffe add per-tenant identity configuration and token delegation gRPC APIs and a new DB table. #490

Open
prbinu-nvidia wants to merge 9 commits into NVIDIA:main from
prbinu-nvidia:feat/spiffe-api-db-schema

Conversation

@prbinu-nvidia
Contributor

Description

SPIFFE JWT-SVID machine identity support: add per-tenant identity configuration and token delegation via gRPC APIs and a new DB table.


1. Database Layer (api-db)

New: crates/api-db/src/tenant_identity_config.rs

  • TenantIdentityConfig struct for per-org identity config
  • set() – upsert identity config (issuer, audiences, TTL, signing key)
  • find() – fetch config by org
  • delete() – remove config
  • set_token_delegation() – set token exchange config (endpoint, auth method, client secret)
  • delete_token_delegation() – clear delegation config
  • Placeholder key generation (no real encryption yet)

New: crates/api-db/migrations/20260225120000_tenant_identity_config.sql

  • tenant_identity_config table with:
    • Identity: issuer, default_audience, allowed_audiences, token_ttl, subject_domain_prefix, enabled
    • Signing: encrypted_signing_key, signing_key_public, key_id, algorithm, master_key_id
    • Timestamps: created_at, updated_at
    • Delegation: token_endpoint, auth_method, encrypted_auth_method_config, subject_token_audience, token_delegation_created_at
  • FK to tenants(organization_id) with ON DELETE CASCADE

2. gRPC API (rpc)

New Proto Messages

  • GetIdentityConfiguration / SetIdentityConfiguration / DeleteIdentityConfiguration
  • GetTokenDelegation / SetTokenDelegation / DeleteTokenDelegation
  • Messages: GetIdentityConfigRequest, IdentityConfigRequest, IdentityConfigResponse, TokenDelegationRequest, TokenDelegationResponse, GetTokenDelegationRequest

3. API Handlers (handlers/identity_config.rs)

New: crates/api/src/handlers/identity_config.rs (657 lines)

  • get_identity_configuration – read config by org
  • set_identity_configuration – upsert config with org validation
  • delete_identity_configuration – delete config
  • get_token_delegation – read delegation config
  • set_token_delegation – upsert delegation config
  • delete_token_delegation – clear delegation

Helper Functions

  • compute_secret_hash() – SHA256 hash for secrets
  • truncate_hash_for_display() – truncate hash for display
  • struct_to_json / json_to_struct – protobuf ↔ JSON
  • build_response_auth_config() – omit secrets from responses
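A doc comment quoted further down in the review describes the display truncation as "8 hex chars" plus "..". A minimal sketch of such a helper, assuming it takes the hex string by reference and returns the full hash unchanged when it is shorter than the prefix (suggested by the `unwrap_or_else(|| full_hash.to_string())` fragment in the diff); the actual body in the PR may differ:

```rust
/// Number of leading hex characters kept for display (taken from the doc
/// comment quoted in the review).
const DISPLAY_PREFIX_LEN: usize = 8;

/// Truncates a hex-encoded hash to its first 8 characters plus "..".
/// Hashes shorter than the prefix are returned unchanged (assumed behavior).
fn truncate_hash_for_display(full_hash: &str) -> String {
    full_hash
        .get(..DISPLAY_PREFIX_LEN)
        .map(|prefix| format!("{prefix}.."))
        .unwrap_or_else(|| full_hash.to_string())
}
```

Using `str::get` instead of slicing avoids a panic on short input.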

Unit Tests (10)

  • compute_secret_hash, truncate_hash_for_display
  • struct_to_json, json_to_struct, json_to_struct_roundtrip
  • build_response_auth_config_omits_client_secret, truncates_hash, passes_through_non_secret, non_object_returns_clone

4. Configuration (cfg/file.rs)

New: MachineIdentityConfig

  • enabled, algorithm, token_ttl_min, token_ttl_max, token_endpoint_http_proxy
  • New [machine_identity] section in CarbideConfig
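For orientation, a rough sketch of what such a config struct could look like. The field names come from the list above; the field types, the default values, and the `allows_ttl` helper are assumptions for illustration and are not taken from the PR:

```rust
use std::time::Duration;

/// Hypothetical shape of the `[machine_identity]` section. Field names follow
/// the PR description; types and defaults are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct MachineIdentityConfig {
    enabled: bool,
    algorithm: String,
    token_ttl_min: Duration,
    token_ttl_max: Duration,
    token_endpoint_http_proxy: Option<String>,
}

impl Default for MachineIdentityConfig {
    fn default() -> Self {
        Self {
            enabled: false,                 // assumed: off unless configured
            algorithm: "ES256".to_string(), // assumed default
            token_ttl_min: Duration::from_secs(60),
            token_ttl_max: Duration::from_secs(3600),
            token_endpoint_http_proxy: None,
        }
    }
}

impl MachineIdentityConfig {
    /// Returns true when a requested per-tenant TTL lies within the
    /// site-wide bounds (hypothetical helper).
    fn allows_ttl(&self, requested: Duration) -> bool {
        self.token_ttl_min <= requested && requested <= self.token_ttl_max
    }
}
```

The min/max pair presumably bounds the per-tenant `token_ttl`, which is what the helper illustrates.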

5. Integration Test Support (api-test-helper)

New: crates/api-test-helper/src/identity_config.rs

  • set_identity_configuration(), get_identity_configuration(), delete_identity_configuration()
  • set_token_delegation(), get_token_delegation(), delete_token_delegation()
  • Uses grpcurl for gRPC calls

6. Integration Tests (api-integration-tests)

run_identity_config_tests() in tests/lib.rs

  • Runs after tenant creation in test_integration
  • Sets config → get → delete
  • Sets config again → sets token delegation → get delegation → delete delegation
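The test flow above can be modelled with an in-memory stand-in for the table. This is only a sketch of the described semantics (upsert, and delegation requiring an existing identity config, as the `set_token_delegation` doc comment in the diff states). The `Store` type, its fields, and the choice to keep delegation across an upsert are assumptions:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Delegation {
    token_endpoint: String,
    auth_method: String,
}

#[derive(Debug, Clone, PartialEq)]
struct IdentityConfig {
    issuer: String,
    delegation: Option<Delegation>,
}

/// In-memory stand-in for the `tenant_identity_config` table (hypothetical).
#[derive(Default)]
struct Store {
    rows: HashMap<String, IdentityConfig>,
}

impl Store {
    /// Upsert: creates the row, or updates the issuer and keeps any delegation.
    fn set(&mut self, org: &str, issuer: &str) {
        let row = self.rows.entry(org.to_string()).or_insert_with(|| IdentityConfig {
            issuer: String::new(),
            delegation: None,
        });
        row.issuer = issuer.to_string();
    }

    fn find(&self, org: &str) -> Option<&IdentityConfig> {
        self.rows.get(org)
    }

    /// Returns true if a row was removed.
    fn delete(&mut self, org: &str) -> bool {
        self.rows.remove(org).is_some()
    }

    /// Fails when no identity config exists for the org yet.
    fn set_token_delegation(&mut self, org: &str, d: Delegation) -> Result<(), String> {
        match self.rows.get_mut(org) {
            Some(row) => {
                row.delegation = Some(d);
                Ok(())
            }
            None => Err(format!("no identity config for org {org}")),
        }
    }

    /// Clears delegation but keeps the identity config row.
    fn delete_token_delegation(&mut self, org: &str) {
        if let Some(row) = self.rows.get_mut(org) {
            row.delegation = None;
        }
    }
}
```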

7. Fixes

api_fixtures/mod.rs

  • Added machine_identity: MachineIdentityConfig::default() to get_config() in CarbideConfig

8. Documentation

book/src/design/machine-identity/spiffe-svid-sdd.md

  • SDD for SPIFFE JWT-SVID machine identity
  • Architecture, config flows, token delegation

Files Changed (identity_config-related)

| File | Change |
| --- | --- |
| crates/api-db/src/tenant_identity_config.rs | New |
| crates/api-db/migrations/20260225120000_tenant_identity_config.sql | New |
| crates/api/src/handlers/identity_config.rs | New |
| crates/api/src/handlers/mod.rs | Register handler |
| crates/api/src/handlers/machine_identity.rs | Modified |
| crates/api/src/api.rs | Route new RPCs |
| crates/api/src/cfg/file.rs | Add MachineIdentityConfig |
| crates/api/src/tests/common/api_fixtures/mod.rs | Add machine_identity |
| crates/api-test-helper/src/identity_config.rs | New |
| crates/api-test-helper/src/lib.rs | Export identity_config |
| crates/api-integration-tests/tests/lib.rs | Add run_identity_config_tests() |
| crates/rpc/proto/forge.proto | New RPCs and messages |
| book/src/design/machine-identity/spiffe-svid-sdd.md | Updated |

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

#447

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

This PR is part of a larger feature implementation related to #261.

@prbinu-nvidia prbinu-nvidia requested a review from a team as a code owner March 9, 2026 23:05
@prbinu-nvidia prbinu-nvidia force-pushed the feat/spiffe-api-db-schema branch from 610d7b7 to e386435 Compare March 9, 2026 23:45
@prbinu-nvidia prbinu-nvidia force-pushed the feat/spiffe-api-db-schema branch from e386435 to 60ca4cf Compare March 10, 2026 00:03
Contributor

@kensimon kensimon left a comment

I don't think we should be throwing an untyped Struct into the identity config here. It's one thing for the db column to be a JSONB blob, but the RPC call really needs to be strongly typed.

/// Set identity config for an org. On first create, generates a placeholder key.
/// Caller must ensure tenant exists and global machine-identity is enabled.
#[allow(clippy::too_many_arguments)]
pub async fn set(
Contributor

Could you create a struct for this rather than having 10 different arguments? The call sites look like a huge sea of strings and it's very difficult to know what is what. (Having to #[allow(clippy::too_many_arguments)] is a good sign this needs to be reworked.)

Contributor

agree. It should take TenantIdentityConfig or at least the parts of it that are actually settable. And that struct should get moved into api-model.
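One way the suggestion could look, as a sketch only: a parameter struct holding the settable fields, passed to `set` together with the organization id. The struct name `IdentityConfigUpdate`, the field types, and the validation rules are hypothetical; the field names follow the table columns listed in the PR description. The `set` function here is a stand-in that formats a string instead of touching a database:

```rust
/// Hypothetical parameter struct for the settable parts of the identity
/// config. Types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct IdentityConfigUpdate {
    issuer: String,
    default_audience: String,
    allowed_audiences: Vec<String>,
    token_ttl_secs: u32,
    subject_domain_prefix: String,
    enabled: bool,
}

impl IdentityConfigUpdate {
    /// Illustrative sanity checks before the value reaches the database layer.
    fn validate(&self) -> Result<(), String> {
        if self.issuer.is_empty() {
            return Err("issuer must not be empty".to_string());
        }
        if self.token_ttl_secs == 0 {
            return Err("token_ttl_secs must be greater than zero".to_string());
        }
        Ok(())
    }
}

/// Stand-in for the database call: one organization id plus one named struct
/// instead of ten positional arguments.
fn set(organization_id: &str, update: &IdentityConfigUpdate) -> Result<String, String> {
    update.validate()?;
    Ok(format!(
        "{organization_id}: issuer={} ttl={}s",
        update.issuer, update.token_ttl_secs
    ))
}
```

With named fields at the call site, a swapped `issuer` and `default_audience` is visible in review, which is harder to spot with positional string arguments.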


// Identity configuration (per-org)
message GetIdentityConfigRequest {
  string org_id = 1;
Contributor

Please spell out the full organization_id here, and in every similar field in the proto. We shouldn't be changing spellings as we add new messages. (The only place we use "org_id" is in some search filters, which debatably have different naming standards.)

.unwrap_or_else(|| full_hash.to_string())
}

fn struct_to_json(pb: &Struct) -> serde_json::Value {
Contributor

@kensimon kensimon Mar 10, 2026

None of these struct_to_json, value_to_json, etc need to exist. Serde can already serialize protobuf objects.

  string org_id = 1;
  string token_endpoint = 2;
  string auth_method = 3;
  google.protobuf.Struct auth_method_config = 4; // method-specific; never includes secrets in responses
Contributor

Please don't use a google.protobuf.Struct, instead actually strongly-type this. Use a oneof field if you expect the contents to be different depending on the method.
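A protobuf `oneof` typically surfaces in generated Rust code as an enum, so the handler can match on the variant instead of probing an untyped map. A hand-written sketch of that shape follows. The variant names and their fields are illustrative assumptions (borrowed from common OAuth client authentication method names), not the schema this PR defines:

```rust
/// Hypothetical typed replacement for `google.protobuf.Struct auth_method_config`.
/// Variants and fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum AuthMethodConfig {
    ClientSecretBasic { client_id: String, client_secret: String },
    PrivateKeyJwt { client_id: String, key_id: String },
}

impl AuthMethodConfig {
    /// The string that would otherwise travel in the separate `auth_method` field.
    fn method_name(&self) -> &'static str {
        match self {
            AuthMethodConfig::ClientSecretBasic { .. } => "client_secret_basic",
            AuthMethodConfig::PrivateKeyJwt { .. } => "private_key_jwt",
        }
    }

    /// Whether the variant carries a secret that must not be echoed in responses.
    fn carries_secret(&self) -> bool {
        matches!(self, AuthMethodConfig::ClientSecretBasic { .. })
    }
}
```

Because the `match` is exhaustive, adding a new method later produces a compile error at every place that has not handled it yet.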


/// Builds response auth_method_config: omits secrets, passes through *_hash fields (stored in blob).
/// Truncates *_hash values to 8 hex chars + ".." for display in get_token_delegation.
fn build_response_auth_config(config: &serde_json::Value) -> serde_json::Value {
Contributor

Don't pass serde_json::Values around; this should all be strongly typed.
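A sketch of what a typed version of the response builder could look like: a stored form that holds the secret and a separate response form that has no secret field at all. Both struct names and their fields are hypothetical; the truncation to 8 hex characters plus ".." follows the doc comment in the diff:

```rust
/// Hypothetical stored form of a client-secret auth config.
#[derive(Debug, Clone)]
struct StoredClientSecretConfig {
    client_id: String,
    client_secret: String,
    client_secret_hash: String, // hex-encoded hash stored next to the secret
}

/// Hypothetical response form. It has no secret field, so a secret cannot
/// leak through it by construction.
#[derive(Debug, Clone, PartialEq)]
struct ClientSecretConfigView {
    client_id: String,
    client_secret_hash: String, // truncated for display
}

impl From<&StoredClientSecretConfig> for ClientSecretConfigView {
    fn from(stored: &StoredClientSecretConfig) -> Self {
        // Keep 8 hex chars and append "..", falling back to the full value
        // when the stored hash is shorter than that.
        let hash = match stored.client_secret_hash.get(..8) {
            Some(prefix) => format!("{prefix}.."),
            None => stored.client_secret_hash.clone(),
        };
        Self {
            client_id: stored.client_id.clone(),
            client_secret_hash: hash,
        }
    }
}
```

Compared with filtering keys out of a JSON value, the omission here is enforced by the type rather than by a list of field names to skip.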


#[derive(Debug, sqlx::FromRow)]
pub struct TenantIdentityConfig {
    pub organization_id: String,
Contributor

does this need to be here?

If the config is per tenant, then I think this struct could do without it, and you would have things like HashMap<TenantOrgId, TenantIdentityConfig>.
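A small sketch of the layout described here, with the organization id as the map key and the config struct no longer repeating it. `TenantOrgId` is a local stand-in for whatever id type the codebase uses, the config is reduced to two fields, and `issuer_for` is a hypothetical helper:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the codebase's tenant organization id type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TenantOrgId(String);

/// The config itself does not carry the organization id.
#[derive(Debug, Clone, PartialEq)]
struct TenantIdentityConfig {
    issuer: String,
    default_audience: String,
}

/// Looks up the issuer configured for one tenant.
fn issuer_for<'a>(
    configs: &'a HashMap<TenantOrgId, TenantIdentityConfig>,
    org: &TenantOrgId,
) -> Option<&'a str> {
    configs.get(org).map(|cfg| cfg.issuer.as_str())
}
```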

    pub issuer: String,
    pub default_audience: String,
    pub allowed_audiences: serde_json::Value,
    pub token_ttl: i32,
Contributor

consider adding a suffix that indicates the unit (e.g. _s) or making it std::time::Duration
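If the column stays an integer, the conversion to and from `std::time::Duration` can live at the database boundary. A minimal sketch, assuming the column stores whole seconds (the PR does not state the unit) and using hypothetical helper names:

```rust
use std::time::Duration;

/// Converts the integer stored in the `token_ttl` column into a Duration.
/// Assumes the column holds seconds; negative values are rejected.
fn ttl_from_db(token_ttl_secs: i32) -> Result<Duration, String> {
    u64::try_from(token_ttl_secs)
        .map(Duration::from_secs)
        .map_err(|_| format!("token_ttl must be non-negative, got {token_ttl_secs}"))
}

/// Converts back for storage; fails if the value does not fit in an i32.
fn ttl_to_db(ttl: Duration) -> Result<i32, String> {
    i32::try_from(ttl.as_secs()).map_err(|_| "token_ttl too large for column".to_string())
}
```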

/// Caller must ensure tenant exists and global machine-identity is enabled.
#[allow(clippy::too_many_arguments)]
pub async fn set(
    org_id: &str,
Contributor

There's a strongly typed TenantOrganizationId that we should probably use everywhere
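The general pattern is to parse the string once at the API boundary and pass the validated type everywhere below it. The sketch below defines its own stand-in named `TenantOrganizationId`, because the real type's definition and validation rules are not shown in this PR. The rule used here (non-empty, no whitespace) and the `describe` function are purely illustrative:

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the existing `TenantOrganizationId`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TenantOrganizationId(String);

impl FromStr for TenantOrganizationId {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Illustrative rule only: non-empty and free of whitespace.
        if s.is_empty() || s.chars().any(char::is_whitespace) {
            return Err(format!("invalid organization id: {s:?}"));
        }
        Ok(TenantOrganizationId(s.to_string()))
    }
}

/// Functions below the handler accept the validated type instead of `&str`.
fn describe(organization_id: &TenantOrganizationId) -> String {
    format!("identity config for {}", organization_id.0)
}
```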

/// Set token delegation for an org. Identity config must exist first.
/// config_json: serialized auth_method_config. Stored as TEXT (future: encrypted).
pub async fn set_token_delegation(
    org_id: &str,
Contributor

There's a strongly typed TenantOrganizationId that we should probably use everywhere

Ok(())
}

async fn run_identity_config_tests(
Contributor

I would recommend adding unit tests in /tests instead of the integration test, or at least in addition to it. The unit tests run much faster and are actually supposed to do things like raw DB reads. The integration tests should just use gRPC APIs and simulate customer interactions.

    .ok_or_else(|| Status::unauthenticated("No authentication context found"))?;

if !api.runtime_config.machine_identity.enabled {
    return Err(Status::unavailable(
Contributor

Prefer using CarbideError and convert these to tonic::Status using .into. That will lead to more consistent error messages and codes between handlers. E.g.

return Err(CarbideError::InvalidArgument("Machine identity must be enabled in site config".to_string()).into())
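The mechanism behind `.into()` is a single `From` implementation that decides which domain error maps to which gRPC code. The sketch below uses locally defined stand-ins for `tonic::Code`, `tonic::Status`, and `CarbideError` so that it compiles without either crate. The real types have more variants and fields, the `NotFound` variant is an assumption, and `require_enabled` is a hypothetical helper:

```rust
/// Minimal stand-in for a gRPC status code enum.
#[derive(Debug, Clone, PartialEq)]
enum Code {
    InvalidArgument,
    NotFound,
}

/// Minimal stand-in for a gRPC status.
#[derive(Debug, Clone, PartialEq)]
struct Status {
    code: Code,
    message: String,
}

/// Minimal stand-in for the domain error type.
#[derive(Debug)]
enum CarbideError {
    InvalidArgument(String),
    NotFound(String),
}

impl From<CarbideError> for Status {
    // One place decides the error-to-code mapping for every handler.
    fn from(err: CarbideError) -> Self {
        match err {
            CarbideError::InvalidArgument(message) => Status { code: Code::InvalidArgument, message },
            CarbideError::NotFound(message) => Status { code: Code::NotFound, message },
        }
    }
}

/// Handler-style check that returns the domain error converted with `.into()`.
fn require_enabled(enabled: bool) -> Result<(), Status> {
    if !enabled {
        return Err(CarbideError::InvalidArgument(
            "Machine identity must be enabled in site config".to_string(),
        )
        .into());
    }
    Ok(())
}
```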

string org_id = 1;
}

message IdentityConfigRequest {
Contributor

This could be

message IdentityConfigRequest {
  string org_id = 1;
  IdentityConfig config = 2;
}

with

message IdentityConfig {
  bool enabled = 2;
  string issuer = 3;
  string default_audience = 4;
  repeated string allowed_audiences = 5;
  uint32 token_ttl = 6;
  string subject_domain = 7;
  bool rotate_key = 8;
}

let cfg = api
    .database_connection
    .with_txn(|txn| {
        Box::pin(async move {
Contributor

we should increment a version field for the tenant object whenever we make a change

rpc SignMachineIdentity(MachineIdentityRequest) returns (MachineIdentityResponse);
// Get, set, or delete per-org identity configuration (issuer, audiences, TTL, signing key)
rpc GetIdentityConfiguration(GetIdentityConfigRequest) returns (IdentityConfigResponse);
rpc SetIdentityConfiguration(IdentityConfigRequest) returns (IdentityConfigResponse);
Contributor

If we modeled this similarly to other APIs, it would probably be an UpdateTenantConfig request, which stores the new identity configuration as part of the tenant object.

I don't mind the separate API modifying it. But I think it should still increment the tenant object version number so that we are able to track changes.
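A rough sketch of the version-tracking idea, with a hypothetical `Tenant` struct reduced to the fields needed for illustration. Bumping the version only when the stored value actually changes is a design choice of this sketch, not something the review prescribes:

```rust
/// Hypothetical tenant object with a monotonically increasing version.
#[derive(Debug, Clone, PartialEq)]
struct Tenant {
    organization_id: String,
    version: u64,
    identity_issuer: Option<String>,
}

impl Tenant {
    /// Applies an identity config change and bumps the version only when
    /// something actually changed. Returns the version after the call.
    fn set_identity_issuer(&mut self, issuer: Option<String>) -> u64 {
        if self.identity_issuer != issuer {
            self.identity_issuer = issuer;
            self.version += 1;
        }
        self.version
    }
}
```

In a database-backed implementation the bump would presumably happen in the same transaction as the config write, so the two cannot drift apart.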


3 participants