Skip to content

fix: append custom CA to system cert pool instead of replacing it#227

Merged
Shivs11 merged 1 commit intomainfrom
fix/system-cert-pool
Mar 10, 2026
Merged

fix: append custom CA to system cert pool instead of replacing it#227
Shivs11 merged 1 commit intomainfrom
fix/system-cert-pool

Conversation

@Shivs11
Copy link
Copy Markdown
Member

@Shivs11 Shivs11 commented Mar 10, 2026

Summary

  • PR fix: use CA certificate from mTLS secret for server verification #212 introduced ca.crt support for server certificate verification but used x509.NewCertPool(), which creates an empty CA pool — replacing the system CA bundle entirely
  • This breaks connections to Temporal Cloud (public CA) when the mTLS secret contains a ca.crt key from cert-manager (the CA that signed the client cert, not the server cert)
  • This fix uses x509.SystemCertPool() instead, so the custom CA is appended to the system bundle rather than replacing it

Why this broke

cert-manager always includes ca.crt in TLS secrets (the issuing CA). When connecting to Temporal Cloud:

  1. The controller sees ca.crt in the secret (the self-signed client CA)
  2. NewCertPool() creates an empty pool with only that CA
  3. Temporal Cloud's server cert is signed by a public CA (e.g., DigiCert)
  4. The public CA is no longer trusted → x509: certificate signed by unknown authority

What this fixes

  • SystemCertPool() loads the system CA bundle first, then appends the custom CA
  • Both public CAs (Temporal Cloud) and private CAs (self-hosted) are trusted simultaneously
  • Falls back to NewCertPool() with a warning log if the system pool can't be loaded

Affected versions

Test plan

  • Deploy against Temporal Cloud with cert-manager mTLS secret (has ca.crt) — verify connection succeeds
  • Deploy against self-hosted Temporal with private CA — verify connection succeeds
  • Deploy with mTLS secret without ca.crt — verify fallback to system bundle works

🤖 Generated with Claude Code

PR #212 introduced ca.crt support for server verification but used
x509.NewCertPool(), which replaces the system CA bundle entirely.
This breaks connections to Temporal Cloud (public CA) when the mTLS
secret contains a ca.crt from cert-manager (the client CA).

Use x509.SystemCertPool() instead so the custom CA is appended to
the system bundle, allowing both private and public CAs to be trusted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Shivs11 Shivs11 requested review from a team and jlegrone as code owners March 10, 2026 20:08
Copy link
Copy Markdown
Collaborator

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Let's try this. To test without having to cut another potentially bad v1.2.x patch release, can we test it in our dogfooding test env by just using the main image?

@Shivs11 Shivs11 enabled auto-merge (squash) March 10, 2026 20:15
@Shivs11 Shivs11 merged commit eae7c39 into main Mar 10, 2026
17 of 18 checks passed
@Shivs11 Shivs11 deleted the fix/system-cert-pool branch March 10, 2026 20:59
Shivs11 added a commit that referenced this pull request Mar 19, 2026
## Summary
- PR #212 introduced `ca.crt` support for server certificate
verification but used `x509.NewCertPool()`, which creates an **empty**
CA pool — replacing the system CA bundle entirely
- This breaks connections to Temporal Cloud (public CA) when the mTLS
secret contains a `ca.crt` key from cert-manager (the CA that signed the
**client** cert, not the server cert)
- This fix uses `x509.SystemCertPool()` instead, so the custom CA is
**appended** to the system bundle rather than replacing it

## Why this broke
cert-manager always includes `ca.crt` in TLS secrets (the issuing CA).
When connecting to Temporal Cloud:
1. The controller sees `ca.crt` in the secret (the self-signed client
CA)
2. `NewCertPool()` creates an empty pool with **only** that CA
3. Temporal Cloud's server cert is signed by a public CA (e.g.,
DigiCert)
4. The public CA is no longer trusted → `x509: certificate signed by
unknown authority`

## What this fixes
- `SystemCertPool()` loads the system CA bundle first, then appends the
custom CA
- Both public CAs (Temporal Cloud) and private CAs (self-hosted) are
trusted simultaneously
- Falls back to `NewCertPool()` with a warning log if the system pool
can't be loaded

## Affected versions
- v1.2.1, v1.2.2, v1.2.3 — all contain the regression from PR #212
- Closes #223

## Test plan
- [ ] Deploy against Temporal Cloud with cert-manager mTLS secret (has
`ca.crt`) — verify connection succeeds
- [ ] Deploy against self-hosted Temporal with private CA — verify
connection succeeds
- [ ] Deploy with mTLS secret without `ca.crt` — verify fallback to
system bundle works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Shivs11 added a commit that referenced this pull request Mar 20, 2026
## Summary
- PR #212 introduced `ca.crt` support for server certificate
verification but used `x509.NewCertPool()`, which creates an **empty**
CA pool — replacing the system CA bundle entirely
- This breaks connections to Temporal Cloud (public CA) when the mTLS
secret contains a `ca.crt` key from cert-manager (the CA that signed the
**client** cert, not the server cert)
- This fix uses `x509.SystemCertPool()` instead, so the custom CA is
**appended** to the system bundle rather than replacing it

## Why this broke
cert-manager always includes `ca.crt` in TLS secrets (the issuing CA).
When connecting to Temporal Cloud:
1. The controller sees `ca.crt` in the secret (the self-signed client
CA)
2. `NewCertPool()` creates an empty pool with **only** that CA
3. Temporal Cloud's server cert is signed by a public CA (e.g.,
DigiCert)
4. The public CA is no longer trusted → `x509: certificate signed by
unknown authority`

## What this fixes
- `SystemCertPool()` loads the system CA bundle first, then appends the
custom CA
- Both public CAs (Temporal Cloud) and private CAs (self-hosted) are
trusted simultaneously
- Falls back to `NewCertPool()` with a warning log if the system pool
can't be loaded

## Affected versions
- v1.2.1, v1.2.2, v1.2.3 — all contain the regression from PR #212
- Closes #223

## Test plan
- [ ] Deploy against Temporal Cloud with cert-manager mTLS secret (has
`ca.crt`) — verify connection succeeds
- [ ] Deploy against self-hosted Temporal with private CA — verify
connection succeeds
- [ ] Deploy with mTLS secret without `ca.crt` — verify fallback to
system bundle works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
carlydf added a commit that referenced this pull request Mar 21, 2026
Introduces clientpool_test.go with 8 tests that directly cover the two
regression patterns that slipped through CI:

- TestFetchMTLS_CACertAppendsToSystemPool: regression test for PR #227
  (empty cert pool replacing system CAs). Injects a fake system pool and
  verifies both injected system CAs and the custom ca.crt are trusted.

- TestDialAndUpsert_APIKeySkipsCheckHealth: regression test for PR #232
  (CheckHealth called unconditionally). Verifies CheckHealth is not
  invoked when AuthModeAPIKey is used.

To make both code paths injectable in tests, ClientPool gains two new
function fields (dialFn, systemCertPoolFn) wired to sdkclient.Dial and
x509.SystemCertPool by default — no behavior change in production.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
carlydf added a commit that referenced this pull request Mar 21, 2026
Introduces clientpool_test.go with 8 tests that directly cover the two
regression patterns that slipped through CI:

- TestFetchMTLS_CACertAppendsToSystemPool: regression test for PR #227
  (empty cert pool replacing system CAs). Injects a fake system pool and
  verifies both injected system CAs and the custom ca.crt are trusted.

- TestDialAndUpsert_APIKeySkipsCheckHealth: regression test for PR #232
  (CheckHealth called unconditionally). Verifies CheckHealth is not
  invoked when AuthModeAPIKey is used.

To make both code paths injectable in tests, ClientPool gains two new
function fields (dialFn, systemCertPoolFn) wired to sdkclient.Dial and
x509.SystemCertPool by default — no behavior change in production.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
carlydf added a commit that referenced this pull request Mar 23, 2026
## Summary

- Adds `clientpool_test.go` with 8 unit tests covering the auth code
paths that had no test coverage
- Two tests are explicit regression guards for the bugs fixed in #227
and #232
- Makes `dialFn` and `systemCertPoolFn` injectable on `ClientPool` (no
behavior change in production) to enable testing without network I/O or
OS trust store dependencies

## Regression tests

**`TestFetchMTLS_CACertAppendsToSystemPool`** — guards against the PR
#212 bug (fixed in #227): `fetchClientUsingMTLSSecret` used
`x509.NewCertPool()` (empty) instead of `x509.SystemCertPool()`,
silently dropping system root CAs and breaking Temporal Cloud
connections. The test injects a fake system pool and verifies both the
injected system CAs and the custom `ca.crt` are present in the returned
pool. This test fails if the fix is reverted.

**`TestDialAndUpsert_APIKeySkipsCheckHealth`** — guards against the PR
#203 bug (fixed in #232): `DialAndUpsertClient` called `CheckHealth`
unconditionally, which fails on Temporal Cloud with namespace-scoped API
keys. The test uses an injected mock client and asserts `CheckHealth` is
never called for `AuthModeAPIKey`. This test fails if the fix is
reverted.

## Test plan

- [x] `go test ./internal/controller/clientpool/... -v` — all 8 tests
pass
- [x] `go build ./...` — no compilation errors
- [x] Manually revert the PR #227 fix →
`TestFetchMTLS_CACertAppendsToSystemPool` fails
- [x] Manually revert the PR #232 fix →
`TestDialAndUpsert_APIKeySkipsCheckHealth` fails

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
shashwatsuri pushed a commit to shashwatsuri/temporal-worker-controller that referenced this pull request Apr 28, 2026
…mporalio#227)

## Summary
- PR temporalio#212 introduced `ca.crt` support for server certificate
verification but used `x509.NewCertPool()`, which creates an **empty**
CA pool — replacing the system CA bundle entirely
- This breaks connections to Temporal Cloud (public CA) when the mTLS
secret contains a `ca.crt` key from cert-manager (the CA that signed the
**client** cert, not the server cert)
- This fix uses `x509.SystemCertPool()` instead, so the custom CA is
**appended** to the system bundle rather than replacing it

## Why this broke
cert-manager always includes `ca.crt` in TLS secrets (the issuing CA).
When connecting to Temporal Cloud:
1. The controller sees `ca.crt` in the secret (the self-signed client
CA)
2. `NewCertPool()` creates an empty pool with **only** that CA
3. Temporal Cloud's server cert is signed by a public CA (e.g.,
DigiCert)
4. The public CA is no longer trusted → `x509: certificate signed by
unknown authority`

## What this fixes
- `SystemCertPool()` loads the system CA bundle first, then appends the
custom CA
- Both public CAs (Temporal Cloud) and private CAs (self-hosted) are
trusted simultaneously
- Falls back to `NewCertPool()` with a warning log if the system pool
can't be loaded

## Affected versions
- v1.2.1, v1.2.2, v1.2.3 — all contain the regression from PR temporalio#212
- Closes temporalio#223

## Test plan
- [ ] Deploy against Temporal Cloud with cert-manager mTLS secret (has
`ca.crt`) — verify connection succeeds
- [ ] Deploy against self-hosted Temporal with private CA — verify
connection succeeds
- [ ] Deploy with mTLS secret without `ca.crt` — verify fallback to
system bundle works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
shashwatsuri pushed a commit to shashwatsuri/temporal-worker-controller that referenced this pull request Apr 28, 2026
## Summary

- Adds `clientpool_test.go` with 8 unit tests covering the auth code
paths that had no test coverage
- Two tests are explicit regression guards for the bugs fixed in temporalio#227
and temporalio#232
- Makes `dialFn` and `systemCertPoolFn` injectable on `ClientPool` (no
behavior change in production) to enable testing without network I/O or
OS trust store dependencies

## Regression tests

**`TestFetchMTLS_CACertAppendsToSystemPool`** — guards against the PR
temporalio#212 bug (fixed in temporalio#227): `fetchClientUsingMTLSSecret` used
`x509.NewCertPool()` (empty) instead of `x509.SystemCertPool()`,
silently dropping system root CAs and breaking Temporal Cloud
connections. The test injects a fake system pool and verifies both the
injected system CAs and the custom `ca.crt` are present in the returned
pool. This test fails if the fix is reverted.

**`TestDialAndUpsert_APIKeySkipsCheckHealth`** — guards against the PR
temporalio#203 bug (fixed in temporalio#232): `DialAndUpsertClient` called `CheckHealth`
unconditionally, which fails on Temporal Cloud with namespace-scoped API
keys. The test uses an injected mock client and asserts `CheckHealth` is
never called for `AuthModeAPIKey`. This test fails if the fix is
reverted.

## Test plan

- [x] `go test ./internal/controller/clientpool/... -v` — all 8 tests
pass
- [x] `go build ./...` — no compilation errors
- [x] Manually revert the PR temporalio#227 fix →
`TestFetchMTLS_CACertAppendsToSystemPool` fails
- [x] Manually revert the PR temporalio#232 fix →
`TestDialAndUpsert_APIKeySkipsCheckHealth` fails

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v1.2.1 release image still has non-numeric USER despite PR #195 fix

2 participants