Commit e86f478

Merge pull request #131 from ckavili/secret-management
secret management blog post added
2 parents 93e5406 + a1cda29 commit e86f478

3 files changed

Lines changed: 153 additions & 0 deletions

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
1+
2+
# Managing Secrets in an AI Platform
3+
4+
## Introduction
5+
6+
At Red Hat, we provide an internal OpenShift AI cluster that serves as a unified platform for experimentation, prototyping, and scalable deployment of AI solutions. Designed to support a potential user base of over 19,000 associates, the platform offers a range of capabilities, including granular role-based access control (RBAC), GPU auto-scaling for efficient resource management, and hosted models from the Granite, Llama, Mistral, and DeepSeek families, as well as specialized models for vision, embeddings, and safety filtering. It also supports demo products like [Models as a Service](https://ai-on-openshift.io/generative-ai/ai-for-everyone/) and the [Chat with Your Documentation](https://ai-on-openshift.io/demos/llm-chat-doc/llm-chat-doc/) RAG implementation. This variety enables teams across Red Hat to explore diverse AI workloads and build solutions tailored to their specific use cases with OpenShift AI.
7+
8+
Like any production-grade platform, supporting these capabilities at scale requires more than just compute resources. It also requires solid engineering practices, including the secure management of sensitive configuration data, such as cloud credentials and authorization tokens used in platform setup.
9+
10+
Since day one, we've managed the cluster lifecycle using [GitOps](https://ai-on-openshift.io/odh-rhoai/gitops/). However, the presence of these sensitive values meant that we initially had to keep our configuration repository private. While this worked operationally, it limited our ability to share the implementation with others.
11+
12+
To address this, we adopted a secret management solution. This enabled us to decouple secrets from the GitOps-managed resources and paved the way for securely opening up parts of the platform's configuration. You can now explore the repository [here](https://github.com/rh-aiservices-bu/rhoaibu-cluster) to see how we run this AI platform in a secure and scalable way.
13+
14+
In this post, we'll walk through the high-level structure of the repository we've opened up, what you can find inside, and how it reflects the way we run and scale OpenShift AI internally. We'll also take a closer look at the secret management approach we adopted: why we chose External Secrets Operator (ESO), how it fits into our GitOps workflow, and the lessons we learned along the way.
15+
16+
## What's In the Box?
17+
18+
The [GitHub repository](https://github.com/rh-aiservices-bu/rhoaibu-cluster) captures how we run production-grade AI infrastructure at scale, offering reusable patterns, modular design, and automation strategies. Here's a quick overview of what you'll find inside:
19+
20+
- Declarative GitOps configurations for managing AI/ML infrastructure
21+
22+
- Customizations for OpenShift AI, including workbenches and model serving
23+
24+
- GPU sharing, time-slicing, and autoscaling across different types of GPUs
25+
26+
- Model deployments that power our Models as a Service platform
27+
28+
- Environment overlays and promotion workflows for dev/prod separation
29+
30+
- Security, observability, and cost-optimization practices built in
31+
32+
Sharing this repo publicly meant revisiting how we manage sensitive values. Transparency is great, but credentials shouldn't live in version control as plain text. They should be managed as code, but in a secure and auditable way. To achieve this, we adopted a secret management tool that integrates with our GitOps flow.
33+
34+
## What is External Secrets Operator?
35+
36+
[External Secrets Operator](https://external-secrets.io/latest/) is a Kubernetes operator that integrates external secret management systems like AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, and Azure Key Vault with Kubernetes. It allows you to securely inject secrets from these external systems into your Kubernetes clusters. We use it to pull secrets from AWS Secrets Manager into our OpenShift AI clusters securely and automatically.
37+
38+
39+
## Why We Chose External Secrets Operator
40+
41+
Several factors influenced our decision to use External Secrets Operator:
42+
43+
1. **Native Kubernetes Integration**: ESO follows the Kubernetes operator pattern, making it easy to integrate with our OpenShift clusters.
44+
45+
2. **AWS Integration**: With our infrastructure running on AWS, ESO's native support for AWS Secrets Manager was a good fit.
46+
47+
3. **GitOps Friendly**: ESO works well with our GitOps workflow using Argo CD, allowing us to manage secret references declaratively in Git without exposing sensitive data.
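To make the GitOps point concrete, here is a minimal sketch of an Argo CD `Application` that would sync a directory containing `ExternalSecret` manifests. The application name, the `openshift-gitops` namespace, the `default` project, and the sync settings are assumptions for illustration. The repository URL, branch, and path come from the example linked in the implementation section of this post, and the actual repository may wire this up differently.

```yaml
# Hypothetical Argo CD Application; names and sync settings are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-registry            # assumed name
  namespace: openshift-gitops     # assumed Argo CD namespace
spec:
  project: default                # assumed project
  source:
    repoURL: https://github.com/rh-aiservices-bu/rhoaibu-cluster
    targetRevision: dev
    path: components/instances/rhoai-instance/components/model-registry
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual drift in the cluster
```

Only the secret reference travels through Git and Argo CD in this flow. The secret value itself is resolved inside the cluster by ESO.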
48+
49+
## Our Implementation
50+
51+
First, let's look at a concrete example of how we use External Secrets in our cluster:
52+
53+
**Database Credentials for Model Registry**
54+
[This example](https://github.com/rh-aiservices-bu/rhoaibu-cluster/blob/dev/components/instances/rhoai-instance/components/model-registry/mysql-secret.yaml) shows how we fetch the database credentials for our Model Registry, which are stored in AWS Secrets Manager. It also demonstrates how one ExternalSecret can map multiple properties from a single remote secret.
55+
56+
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: registry-db-secrets
  namespace: rhoai-model-registries
spec:
  # How often ESO re-reads the value from the external store
  refreshInterval: 1h
  secretStoreRef:
    name: rhoaibu-external-store
    kind: ClusterSecretStore
  target:
    # Name of the Kubernetes Secret that ESO creates and owns
    name: registry-db-secrets
    creationPolicy: Owner
  data:
    # Each entry maps one property of the remote secret to one key of the Secret
    - secretKey: MYSQL_ROOT_PASSWORD
      remoteRef:
        key: model-registries-db-credentials
        property: database-password
    - secretKey: MYSQL_USER_NAME
      remoteRef:
        key: model-registries-db-credentials
        property: database-user
```
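For orientation, the `Secret` that ESO would generate from this manifest looks roughly like the sketch below. The values are placeholders, not real data, and the labels, annotations, and owner reference that ESO adds are omitted.

```yaml
# Illustrative result only; ESO creates this object, it is never committed to Git.
apiVersion: v1
kind: Secret
metadata:
  name: registry-db-secrets
  namespace: rhoai-model-registries
type: Opaque
data:
  MYSQL_ROOT_PASSWORD: <base64 of the remote "database-password" property>
  MYSQL_USER_NAME: <base64 of the remote "database-user" property>
```

Because `creationPolicy` is `Owner`, the generated `Secret` is owned by the `ExternalSecret`, so deleting the `ExternalSecret` also removes the `Secret`.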
80+
81+
### Architecture
82+
83+
We implemented External Secrets Operator using a GitOps approach with the following components:
84+
85+
86+
1. **Operator Installation**: We use the Red Hat Community of Practice (CoP)'s GitOps Catalog to deploy the operator:
87+
88+
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - https://github.com/redhat-cop/gitops-catalog/external-secrets-operator/operator/overlays/stable
```
94+
95+
2. **AWS Secrets Manager Integration**: We configured a `ClusterSecretStore` that connects to AWS Secrets Manager:
96+
97+
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: rhoaibu-external-store
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-west-2
```
108+
109+
110+
3. **Namespace Scoping**: For security, we explicitly define which namespaces can access the external secrets:
111+
112+
```yaml
conditions:
  - namespaces:
      - "cert-manager"
      - "openshift-config"
      - "rhoai-model-registries"
```
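The `conditions` fragment belongs under `spec` of the `ClusterSecretStore`. Combining the two snippets from this section gives the sketch below. It assumes no other fields are set, which may not match the store in the repository.

```yaml
# Sketch: the store and its namespace conditions in one manifest.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: rhoaibu-external-store
spec:
  # Only ExternalSecrets in these namespaces may reference this store
  conditions:
    - namespaces:
        - "cert-manager"
        - "openshift-config"
        - "rhoai-model-registries"
  provider:
    aws:
      service: SecretsManager
      region: us-west-2
```

An `ExternalSecret` in any namespace outside this list that points at `rhoaibu-external-store` will not be synced.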
119+
120+
121+
## Benefits We've Seen
122+
123+
1. **Improved Security**: Secrets are stored securely in AWS Secrets Manager, separate from our application code.
124+
2. **Simplified Management**: One central place (AWS Secrets Manager) to manage all our secrets.
125+
3. **GitOps Compatible**: We can declaratively manage secret references in Git while keeping actual secrets securely stored in AWS.
126+
4. **Automated Syncing**: Secrets are automatically synchronized between AWS and our clusters.
127+
128+
129+
## Alternatives We Considered
130+
131+
1. **Sealed Secrets**: While powerful, it didn't offer the same level of integration with external secret managers.
132+
2. **Vault Operator**: HashiCorp Vault was more complex to set up and maintain compared to using AWS Secrets Manager.
133+
3. **Native Kubernetes Secrets**: Since their values are only base64-encoded, not encrypted, they lack the security and management features we needed.
134+
135+
136+
## Challenges and Solutions
137+
138+
1. **Initial Setup**: Required detailed IAM role configurations to ensure secure access to secrets.
139+
2. **Multi-Region Support**: Solved by using environment-specific patches for different AWS regions.
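One way to express a region override is a Kustomize overlay that patches the store. This is a sketch rather than the repository's actual layout: the `../../base` path and the `us-east-1` value are assumptions.

```yaml
# Hypothetical overlay, e.g. for an environment running in another AWS region.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                    # assumed location of the base ClusterSecretStore
patches:
  - target:
      group: external-secrets.io
      version: v1beta1
      kind: ClusterSecretStore
      name: rhoaibu-external-store
    patch: |-
      - op: replace
        path: /spec/provider/aws/region
        value: us-east-1          # assumed region for this environment
```

Keeping the region in a patch leaves the base manifest identical across environments, so only the one differing value lives in each overlay.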
140+
141+
## Good Practices We Follow
142+
143+
1. **Namespace Isolation**: Strictly control which namespaces can access secrets
144+
2. **Minimal Access**: Use specific IAM roles with least privilege
145+
3. **Version Control**: Maintain secret configurations in Git while keeping sensitive data in AWS
146+
4. **Environment Separation**: Different configurations for dev and prod environments
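As one illustration of the minimal-access practice, ESO's AWS provider can authenticate through a dedicated service account bound to an IAM role instead of long-lived access keys. The service account name, its namespace, and the role ARN below are hypothetical, and this post does not state which authentication method the cluster actually uses.

```yaml
# Sketch only: service account, namespace, and role ARN are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: rhoaibu-external-store
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-west-2
      auth:
        jwt:
          serviceAccountRef:
            name: eso-secrets-reader
            namespace: external-secrets
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: eso-secrets-reader
  namespace: external-secrets
  annotations:
    # Binds the service account to an IAM role; the role should allow reading
    # only the specific secrets the platform needs.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eso-secrets-reader
```

Scoping the IAM role to the individual secret ARNs, rather than to every secret in the account, keeps the impact of a compromised cluster small.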
147+
148+
## Conclusion
149+
150+
External Secrets Operator has proven to be a robust solution for our secret management needs. It provides the right balance of security, ease of use, and integration capabilities. Most importantly, it allowed us to open source our entire cluster setup, from installation to Day 2 operations, while keeping sensitive data secure in AWS Secrets Manager. This separation of configuration from secrets enables us to share our implementation publicly, so others can learn from and build upon our work while our credentials and sensitive data stay protected.

docs/whats-new/whats-new.md

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
11
# What's new?
22

3+
**2025-07-28**: Add [Managing Secrets in an AI Platform](../odh-rhoai/secret-management.md)
4+
35
**2025-07-15**: Add [Deploying a Red Hat Validated Model in a Disconnected OpenShift AI Environment](../odh-rhoai/deploy-validated-models-on-disconnected.md)
46

57
**2025-04-03**: Add [AI for Everyone: What We Learned](../generative-ai/ai-for-everyone.md)

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -114,6 +114,7 @@ nav:
114114
- Custom Serving Runtime (Triton): odh-rhoai/custom-runtime-triton.md
115115
- Dashboard configuration: odh-rhoai/configuration.md
116116
- GitOps (CRs, objects,...): odh-rhoai/gitops.md
117+
- Managing Secrets in an AI Platform: odh-rhoai/managing-secrets.md
117118
- Kueue preemption: odh-rhoai/kueue-preemption/readme.md
118119
- Model serving type modification: odh-rhoai/model-serving-type-modification.md
119120
- NVIDIA GPUs: odh-rhoai/nvidia-gpus.md
