Open
32 commits
db09e87
manage external cert
kangclzjc Oct 24, 2025
e0d5652
customize ca
kangclzjc Nov 25, 2025
55a91fa
refactor
kangclzjc Nov 27, 2025
74f3316
add pcs webhook
kangclzjc Nov 27, 2025
2749596
add e2e
kangclzjc Dec 4, 2025
1e4e410
fix conflict
kangclzjc Dec 10, 2025
5d1c702
check
kangclzjc Dec 11, 2025
376e81c
fix default value
kangclzjc Dec 11, 2025
f6fbda1
use poll
kangclzjc Dec 17, 2025
d7c78a7
Cert-manager to Autoprovision e2e test flow
briefgaming Dec 28, 2025
54cf6c9
Refactor cert-manager & autoprovision tests into separate functions
briefgaming Dec 29, 2025
35ed8b7
Add cert-manager images to dependencies.yaml for pre-pulling
briefgaming Dec 29, 2025
8c5aa3a
Poll ClusterIssuer resource
briefgaming Dec 30, 2025
6a87670
Update cert-manager-ctl image version
briefgaming Dec 30, 2025
825d05d
refactor
kangclzjc Dec 31, 2025
edd4800
refactor
kangclzjc Jan 4, 2026
1b01663
unify log format
kangclzjc Jan 11, 2026
2d7355d
reuse image
kangclzjc Jan 11, 2026
dffd572
fix version
kangclzjc Jan 12, 2026
26afe4b
actually use secretName
gflarity Jan 16, 2026
cdb1efb
remove unused clusterTopologyValidationWebhook from cert-manager test
gflarity Jan 16, 2026
3fa4c9f
webhooks are cluster scoped, remove unused namespaces
gflarity Jan 16, 2026
a75184e
add generic UpgradeGrove helper and refactor cert-manager test to use it
gflarity Jan 16, 2026
1cc2fa5
add a comment explaining if statement
gflarity Jan 16, 2026
2c07e42
rename test to make it less cert manager specific
gflarity Jan 16, 2026
b89431e
update CM tests to check certificates, and condensed it into a single…
gflarity Jan 16, 2026
c45b072
add cert management tests to e2e matrix
gflarity Jan 16, 2026
211adc1
add documentation for external certificate management
gflarity Jan 17, 2026
7fd443e
make E2E tests more robust for CI
gflarity Jan 18, 2026
df66b3b
PR feedback
gflarity Jan 22, 2026
611165f
remove trouble shooting guide
gflarity Jan 22, 2026
5a584f1
add some more documentation to the grove.go file in the setup package
gflarity Jan 26, 2026
2 changes: 2 additions & 0 deletions .github/workflows/e2e-test.yaml
@@ -47,6 +47,8 @@ jobs:
            test_pattern: "^Test_SO"
          - test_name: Topology_Aware_Scheduling
            test_pattern: "^Test_TAS"
          - test_name: cert_management
            test_pattern: "^Test_CM"
    name: E2E - ${{ matrix.test_name }}
    steps:
      # print runner specs so we have a record in case of failures
20 changes: 20 additions & 0 deletions docs/api-reference/operator-api.md
@@ -671,6 +671,24 @@ _Appears in:_
| `exemptServiceAccountUserNames` _string array_ | ExemptServiceAccountUserNames is a list of service account usernames that are exempt from authorizer checks.<br />Each service account username in ExemptServiceAccountUserNames should be of the following format:<br />system:serviceaccount:<namespace>:<service-account-name>. ServiceAccounts are represented in this<br />format when checking the username in authenticationv1.UserInfo.Name. | | |


#### CertProvisionMode

_Underlying type:_ _string_

CertProvisionMode defines how webhook certificates are provisioned.

_Validation:_
- Enum: [auto manual]

_Appears in:_
- [WebhookServer](#webhookserver)

| Field | Description |
| --- | --- |
| `auto` | CertProvisionModeAuto enables automatic certificate generation and management via cert-controller.<br />cert-controller automatically generates self-signed certificates and stores them in the Secret.<br /> |
| `manual` | CertProvisionModeManual expects certificates to be provided externally (e.g., by cert-manager, cluster admin).<br /> |


#### ClientConnectionConfiguration


@@ -916,5 +934,7 @@ _Appears in:_
| `bindAddress` _string_ | BindAddress is the IP address on which to listen for the specified port. | | |
| `port` _integer_ | Port is the port on which to serve requests. | | |
| `serverCertDir` _string_ | ServerCertDir is the directory containing the server certificate and key. | | |
| `secretName` _string_ | SecretName is the name of the Kubernetes Secret containing webhook certificates.<br />The Secret must contain tls.crt, tls.key, and ca.crt. | grove-webhook-server-cert | |
| `certProvisionMode` _[CertProvisionMode](#certprovisionmode)_ | CertProvisionMode controls how webhook certificates are provisioned. | auto | Enum: [auto manual] <br /> |


6 changes: 6 additions & 0 deletions docs/installation.md
@@ -102,6 +102,12 @@ This make target leverages Grove [Helm](https://helm.sh/) charts and [Skaffold](
- Grove Scheduler CRDs - `podgangs.scheduler.grove.io`.
- All Grove operator resources defined as a part of [Grove Helm chart templates](../operator/charts/templates).

## Certificate Management

By default, Grove automatically generates and manages TLS certificates for its webhook server. For production environments, you may want to use certificates from your organization's PKI or a certificate manager like cert-manager.

See the [Certificate Management Guide](user-guide/certificate-management.md) for detailed configuration options.

## Verify Installation

Follow the instructions in the [quickstart guide](quickstart.md) to deploy a PodCliqueSet and validate your installation.
298 changes: 298 additions & 0 deletions docs/user-guide/certificate-management.md
@@ -0,0 +1,298 @@
# Certificate Management

Grove's webhook server requires TLS certificates to secure communication with the Kubernetes API server. This guide explains how to configure certificate management for your Grove deployment.

## Overview

Grove supports two certificate management modes:

| Mode | Description | Best For |
|------|-------------|----------|
| **Auto-provisioned** (default) | Grove automatically generates and manages self-signed certificates | Development, testing, quick setup |
| **External certificates** | You provide certificates from an external source (e.g., cert-manager, manual) | Production, enterprise PKI integration |

## Auto-Provisioned Certificates (Default)

By default, Grove's built-in cert-controller automatically:
- Generates self-signed CA and server certificates
- Stores them in a Kubernetes Secret
- Injects the CA bundle into webhook configurations
- Rotates certificates before expiry

This mode requires no additional configuration. Simply deploy Grove:

```bash
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag>
```

### How It Works

1. On startup, the cert-controller checks if the webhook certificate Secret exists
2. If missing or expired, it generates new certificates
3. The CA bundle is automatically injected into `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration` resources
4. Certificates are stored in the Secret specified by `config.server.webhooks.secretName` (default: `grove-webhook-server-cert`)
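
The decision in steps 1 and 2 can be sketched as follows. This is a minimal illustration, not Grove's cert-controller code: `certState`, `needsProvisioning`, and the `renewBefore` window are hypothetical names introduced here, and the sketch only models the "missing or expired" check.

```go
package main

import (
	"fmt"
	"time"
)

// certState is a hypothetical summary of what the controller finds at startup.
type certState struct {
	secretExists bool      // whether the webhook certificate Secret was found
	notAfter     time.Time // expiry of the stored server certificate
}

// needsProvisioning reports whether new certificates should be generated:
// either the Secret is missing, or the certificate has expired or will
// expire within renewBefore (an assumed rotation window).
func needsProvisioning(s certState, now time.Time, renewBefore time.Duration) bool {
	if !s.secretExists {
		return true
	}
	return !now.Before(s.notAfter.Add(-renewBefore))
}

func main() {
	now := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	window := 30 * 24 * time.Hour

	missing := certState{secretExists: false}
	fresh := certState{secretExists: true, notAfter: now.Add(365 * 24 * time.Hour)}
	expiring := certState{secretExists: true, notAfter: now.Add(10 * 24 * time.Hour)}

	fmt.Println(needsProvisioning(missing, now, window))  // true
	fmt.Println(needsProvisioning(fresh, now, window))    // false
	fmt.Println(needsProvisioning(expiring, now, window)) // true
}
```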

## External Certificate Management

For production environments, you may want to use certificates from your organization's PKI or a certificate manager like [cert-manager](https://cert-manager.io/).

### Prerequisites

Before deploying Grove with external certificates, ensure:
1. Your certificate Secret exists in the target namespace
2. The Secret contains the required keys: `tls.crt`, `tls.key`, and `ca.crt`
3. The certificate's Subject Alternative Names (SANs) include the webhook service DNS name
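
As a rough pre-flight check for the second prerequisite, the required keys can be validated against a Secret's data map. This is a sketch under assumptions: `missingKeys` is a hypothetical helper, and `data` stands in for the decoded `data` field of a Secret that you have fetched by other means.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredKeys lists the Secret keys this guide says Grove expects.
var requiredKeys = []string{"tls.crt", "tls.key", "ca.crt"}

// missingKeys returns the required keys that are absent or empty in data,
// sorted for stable output.
func missingKeys(data map[string][]byte) []string {
	var missing []string
	for _, k := range requiredKeys {
		if len(data[k]) == 0 {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Placeholder contents; a real Secret holds PEM-encoded data.
	data := map[string][]byte{
		"tls.crt": []byte("placeholder certificate"),
		"tls.key": []byte("placeholder key"),
	}
	fmt.Println(missingKeys(data)) // [ca.crt]
}
```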

### Required Certificate SANs

The webhook server certificate must include these SANs:
```
grove-operator.<namespace>.svc
grove-operator.<namespace>.svc.cluster.local
```

Replace `<namespace>` with your deployment namespace (default: `default`).
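
The SAN requirement can be checked mechanically. The sketch below is illustrative only: `requiredSANs` and `missingSANs` are hypothetical helpers, the namespace `grove-system` is an assumed example, and the `dnsNames` slice would typically come from the `DNSNames` field of a certificate parsed with Go's `crypto/x509`.

```go
package main

import "fmt"

// requiredSANs returns the DNS names this guide lists for the webhook
// service in the given namespace.
func requiredSANs(namespace string) []string {
	return []string{
		fmt.Sprintf("grove-operator.%s.svc", namespace),
		fmt.Sprintf("grove-operator.%s.svc.cluster.local", namespace),
	}
}

// missingSANs returns the required names that are not present in dnsNames.
func missingSANs(dnsNames []string, namespace string) []string {
	have := make(map[string]bool, len(dnsNames))
	for _, n := range dnsNames {
		have[n] = true
	}
	var missing []string
	for _, want := range requiredSANs(namespace) {
		if !have[want] {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	// Assumed example: a certificate issued without the fully qualified name.
	certNames := []string{"grove-operator", "grove-operator.grove-system.svc"}
	fmt.Println(missingSANs(certNames, "grove-system"))
	// [grove-operator.grove-system.svc.cluster.local]
}
```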

### Configuration

To use external certificates, configure the following Helm values:

```yaml
config:
  server:
    webhooks:
      # Use manual certificate provisioning (external)
      certProvisionMode: manual
      # Name of the Secret containing your certificates
      secretName: "grove-webhook-server-cert"

webhooks:
  # Base64-encoded CA certificate that signed the webhook server certificate
  caBundle: "<base64-encoded-ca-certificate>"
```

### Using cert-manager

[cert-manager](https://cert-manager.io/) is a popular choice for managing certificates in Kubernetes. Here's how to configure Grove with cert-manager.

#### Step 1: Install cert-manager

If not already installed:

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml
```

Wait for cert-manager to be ready:

```bash
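kubectl wait --for=condition=Available deployment/cert-manager -n cert-manager --timeout=120s
kubectl wait --for=condition=Available deployment/cert-manager-webhook -n cert-manager --timeout=120s
# The CA injector performs the caBundle injection used in Step 4
kubectl wait --for=condition=Available deployment/cert-manager-cainjector -n cert-manager --timeout=120s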
```

#### Step 2: Create a Certificate Issuer

Create a self-signed issuer (or use your organization's issuer):

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: grove-selfsigned-issuer
  namespace: <namespace>  # Your Grove deployment namespace
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grove-ca
  namespace: <namespace>
spec:
  isCA: true
  commonName: grove-ca
  secretName: grove-ca-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: grove-selfsigned-issuer
    kind: Issuer
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: grove-ca-issuer
  namespace: <namespace>
spec:
  ca:
    secretName: grove-ca-secret
```

#### Step 3: Create a Certificate for the Webhook

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grove-webhook-cert
  namespace: <namespace>
spec:
  secretName: grove-webhook-server-cert
  duration: 8760h # 1 year
  renewBefore: 720h # 30 days
  subject:
    organizations:
      - grove
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
    - server auth
    - digital signature
    - key encipherment
  dnsNames:
    - grove-operator
    - grove-operator.<namespace>
    - grove-operator.<namespace>.svc
    - grove-operator.<namespace>.svc.cluster.local
  issuerRef:
    name: grove-ca-issuer
    kind: Issuer
```
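
To make the `duration` and `renewBefore` values concrete: `renewBefore` is measured back from the certificate's expiry, so renewal is attempted roughly `duration - renewBefore` after issuance. The arithmetic is sketched below; `renewAfter` is a hypothetical helper used only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// renewAfter returns how long after issuance renewal is attempted, given the
// certificate lifetime and the renewBefore window counted back from expiry.
func renewAfter(duration, renewBefore time.Duration) time.Duration {
	return duration - renewBefore
}

func main() {
	duration := 8760 * time.Hour   // spec.duration: 1 year
	renewBefore := 720 * time.Hour // spec.renewBefore: 30 days

	d := renewAfter(duration, renewBefore)
	fmt.Printf("renew after %.0f hours (%.0f days)\n", d.Hours(), d.Hours()/24)
	// renew after 8040 hours (335 days)
}
```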

#### Step 4: Configure Webhook Annotations

cert-manager can automatically inject the CA bundle into webhook configurations. Add these annotations to your Helm values:

```yaml
webhooks:
  podCliqueSetValidationWebhook:
    annotations:
      cert-manager.io/inject-ca-from: <namespace>/grove-webhook-cert
  podCliqueSetDefaultingWebhook:
    annotations:
      cert-manager.io/inject-ca-from: <namespace>/grove-webhook-cert
  authorizerWebhook:
    annotations:
      cert-manager.io/inject-ca-from: <namespace>/grove-webhook-cert
```
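
The annotation value has the form `<namespace>/<certificate-name>`, pointing at the `Certificate` resource created in Step 3. A small sanity check for that format might look like the sketch below; `parseInjectCAFrom` is a hypothetical helper, not part of cert-manager or Grove, and `grove-system` is an assumed namespace.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseInjectCAFrom splits an inject-ca-from annotation value into the
// namespace and name of the referenced Certificate resource.
func parseInjectCAFrom(value string) (namespace, name string, err error) {
	parts := strings.Split(value, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", errors.New("expected <namespace>/<certificate-name>")
	}
	return parts[0], parts[1], nil
}

func main() {
	ns, name, err := parseInjectCAFrom("grove-system/grove-webhook-cert")
	fmt.Println(ns, name, err) // grove-system grove-webhook-cert <nil>

	_, _, err = parseInjectCAFrom("grove-webhook-cert")
	fmt.Println(err) // expected <namespace>/<certificate-name>
}
```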

#### Step 5: Deploy Grove

```bash
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag> \
  --set config.server.webhooks.certProvisionMode=manual \
  --set webhooks.podCliqueSetValidationWebhook.annotations."cert-manager\.io/inject-ca-from"="<namespace>/grove-webhook-cert" \
  --set webhooks.podCliqueSetDefaultingWebhook.annotations."cert-manager\.io/inject-ca-from"="<namespace>/grove-webhook-cert" \
  --set webhooks.authorizerWebhook.annotations."cert-manager\.io/inject-ca-from"="<namespace>/grove-webhook-cert"
```

Or create a `values.yaml` file with the configuration shown above and deploy:

```bash
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag> -f values.yaml
```

### Manual Certificate Management

If you prefer to manage certificates manually:

#### Step 1: Generate Certificates

```bash
# Create a CA
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -sha256 -days 365 \
  -out ca.crt -subj "/CN=grove-ca"

# Create server certificate
openssl genrsa -out tls.key 2048

# Create CSR config
cat > csr.conf <<EOF
[req]
default_bits = 2048
prompt = no
distinguished_name = dn
req_extensions = req_ext

[dn]
CN = grove-operator

[req_ext]
subjectAltName = @alt_names

[alt_names]
DNS.1 = grove-operator
DNS.2 = grove-operator.<namespace>
DNS.3 = grove-operator.<namespace>.svc
DNS.4 = grove-operator.<namespace>.svc.cluster.local
EOF

# Generate CSR and certificate
openssl req -new -key tls.key -out tls.csr -config csr.conf
openssl x509 -req -in tls.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out tls.crt -days 365 -extensions req_ext -extfile csr.conf
```
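
The same CA-plus-server-certificate flow can be reproduced and verified in Go, which is useful for checking that a given SAN list will actually validate for the webhook service name. This is a sketch under assumptions: it uses ECDSA P-256 keys instead of the RSA 2048 keys from the `openssl` commands to keep it short, keeps everything in memory, and `sign` and `verifyChain` are hypothetical helpers.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// sign issues a certificate from tmpl, signed by parent using signer, and
// returns the parsed result.
func sign(tmpl, parent *x509.Certificate, pub *ecdsa.PublicKey, signer *ecdsa.PrivateKey) (*x509.Certificate, error) {
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, pub, signer)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyChain creates a throwaway CA and a server certificate carrying
// dnsNames, then checks that the server certificate verifies for host
// against that CA.
func verifyChain(dnsNames []string, host string) error {
	now := time.Now()

	caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "grove-ca"},
		NotBefore:             now.Add(-time.Minute),
		NotAfter:              now.Add(365 * 24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	// The CA is self-signed: template and parent are the same.
	ca, err := sign(caTmpl, caTmpl, &caKey.PublicKey, caKey)
	if err != nil {
		return err
	}

	srvKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	srvTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "grove-operator"},
		DNSNames:     dnsNames,
		NotBefore:    now.Add(-time.Minute),
		NotAfter:     now.Add(365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	srv, err := sign(srvTmpl, ca, &srvKey.PublicKey, caKey)
	if err != nil {
		return err
	}

	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err = srv.Verify(x509.VerifyOptions{Roots: roots, DNSName: host, CurrentTime: now})
	return err
}

func main() {
	names := []string{"grove-operator.default.svc", "grove-operator.default.svc.cluster.local"}
	fmt.Println(verifyChain(names, "grove-operator.default.svc"))  // <nil>
	fmt.Println(verifyChain(names, "other.default.svc") != nil)    // true
}
```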

#### Step 2: Create the Secret

```bash
kubectl create secret generic grove-webhook-server-cert \
  --from-file=tls.crt=tls.crt \
  --from-file=tls.key=tls.key \
  --from-file=ca.crt=ca.crt \
  -n <namespace>
```

#### Step 3: Deploy Grove with the CA Bundle

```bash
CA_BUNDLE=$(base64 < ca.crt | tr -d '\n')

helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag> \
  --set config.server.webhooks.certProvisionMode=manual \
  --set webhooks.caBundle="${CA_BUNDLE}"
```
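
The `CA_BUNDLE` value is simply the PEM file encoded as a single base64 line. For tooling written in Go, an equivalent can be sketched as below; `caBundle` is a hypothetical helper and the PEM content is a placeholder, not a real certificate.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// caBundle returns the single-line base64 encoding of a PEM-encoded CA
// certificate. StdEncoding emits no line breaks, so no newline stripping
// is needed.
func caBundle(pem []byte) string {
	return base64.StdEncoding.EncodeToString(pem)
}

func main() {
	// Placeholder PEM; in practice this would be the contents of ca.crt.
	pem := []byte("-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n")
	fmt.Println(caBundle(pem))
}
```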

## Switching Between Modes

### From Auto-Provision to External

1. Set up your external certificate infrastructure (cert-manager or manual)
2. Update your Helm values to disable auto-provisioning by setting `config.server.webhooks.certProvisionMode` to `manual`
3. Upgrade the Grove deployment

```bash
helm upgrade grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag> \
  --set config.server.webhooks.certProvisionMode=manual \
  --set webhooks.caBundle="${CA_BUNDLE}"
```

### From External to Auto-Provision

1. Update your Helm values to enable auto-provisioning by setting `config.server.webhooks.certProvisionMode` to `auto`
2. Remove the `webhooks.caBundle` value (or set it to empty)
3. Upgrade the Grove deployment

```bash
helm upgrade grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag> \
  --set config.server.webhooks.certProvisionMode=auto \
  --set webhooks.caBundle=""
```

The cert-controller will generate new certificates on the next operator restart.

## Configuration Reference

| Helm Value | Type | Default | Description |
|------------|------|---------|-------------|
| `config.server.webhooks.certProvisionMode` | string | `auto` | Certificate provisioning mode: `auto` (cert-controller generates certs) or `manual` (external certs) |
| `config.server.webhooks.secretName` | string | `grove-webhook-server-cert` | Name of the Secret containing certificates |
| `webhooks.caBundle` | string | `""` | Base64-encoded CA certificate for webhook configurations |
| `webhooks.podCliqueSetValidationWebhook.annotations` | map | `{}` | Annotations for the validation webhook (e.g., cert-manager injection) |
| `webhooks.podCliqueSetDefaultingWebhook.annotations` | map | `{}` | Annotations for the defaulting webhook |
| `webhooks.authorizerWebhook.annotations` | map | `{}` | Annotations for the authorizer webhook |
8 changes: 8 additions & 0 deletions operator/api/config/v1alpha1/defaults.go
@@ -79,6 +79,14 @@ func SetDefaults_ServerConfiguration(serverConfig *ServerConfiguration) {
		serverConfig.Webhooks.ServerCertDir = defaultWebhookServerTLSServerCertDir
	}

	if serverConfig.Webhooks.SecretName == "" {
		serverConfig.Webhooks.SecretName = "grove-webhook-server-cert"
	}

	if serverConfig.Webhooks.CertProvisionMode == "" {
		serverConfig.Webhooks.CertProvisionMode = CertProvisionModeAuto
	}

	if serverConfig.HealthProbes == nil {
		serverConfig.HealthProbes = &Server{}
	}
21 changes: 21 additions & 0 deletions operator/api/config/v1alpha1/types.go
@@ -140,8 +140,29 @@ type WebhookServer struct {
	Server `json:",inline"`
	// ServerCertDir is the directory containing the server certificate and key.
	ServerCertDir string `json:"serverCertDir"`
	// SecretName is the name of the Kubernetes Secret containing webhook certificates.
	// The Secret must contain tls.crt, tls.key, and ca.crt.
	// +optional
	// +kubebuilder:default="grove-webhook-server-cert"
	SecretName string `json:"secretName,omitempty"`
	// CertProvisionMode controls how webhook certificates are provisioned.
	// +optional
	// +kubebuilder:default="auto"
	CertProvisionMode CertProvisionMode `json:"certProvisionMode,omitempty"`
}

// CertProvisionMode defines how webhook certificates are provisioned.
// +kubebuilder:validation:Enum=auto;manual
type CertProvisionMode string

const (
	// CertProvisionModeAuto enables automatic certificate generation and management via cert-controller.
	// cert-controller automatically generates self-signed certificates and stores them in the Secret.
	CertProvisionModeAuto CertProvisionMode = "auto"
	// CertProvisionModeManual expects certificates to be provided externally (e.g., by cert-manager, cluster admin).
	CertProvisionModeManual CertProvisionMode = "manual"
)

// Server contains information for HTTP(S) server configuration.
type Server struct {
	// BindAddress is the IP address on which to listen for the specified port.