ADR: Architecture document #1

Important

This is a work-in-progress repository and task; expect descriptions, deliverables, and overall decisions to change before, during, and after the development of its MVP.

Nebari Infrastructure Core (NIC) is a standalone Go CLI tool that manages cloud infrastructure for Nebari, using HCL and declarative configuration semantics to deploy all required components for a stable Kubernetes cluster, with support for multi-cloud architectures. NIC will later host an internal ecosystem of supported apps, managed internally by ArgoCD (see ArgoCD core concepts), referred to throughout this document as "Apps".

Before introducing nic-operator, it is important to clarify what a Kubernetes Operator is and why it fits this problem space. In short, an operator is an application that runs inside a Kubernetes cluster and encodes operational knowledge for managing a specific domain. It typically introduces a Custom Resource Definition (CRD) that represents the desired state for some capability (e.g., “this app should be exposed at this hostname with TLS and SSO”), and a controller that continuously reconciles the actual cluster resources until they match that desired state. This “control loop” model allows platform behavior to be expressed declaratively and made repeatable, auditable, and self-healing (see: Kubernetes Operators — what are they?).
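
To make this control-loop model concrete, the sketch below shows the shape of a controller-runtime reconciler in Go. The resource it watches and every name in it are illustrative assumptions, not part of any defined API:

```go
package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// exposedAppReconciler is an illustrative controller: it watches a
// (hypothetical) custom resource describing how an app should be
// exposed, and drives cluster state toward that declaration.
type exposedAppReconciler struct {
	client.Client
}

// Reconcile is the control loop's single entry point. controller-runtime
// calls it whenever the watched resource (or anything it owns) changes.
func (r *exposedAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the custom resource named by req.NamespacedName.
	// 2. Compute the desired child resources (route, certificate, auth policy).
	// 3. Create or patch them, with owner references for garbage collection.
	// 4. Report progress back on the resource's status subresource.
	// Returning an empty Result means "done until the next change".
	return ctrl.Result{}, nil
}
```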

We are building nic-operator to standardize and automate “platform wiring” for applications in the NIC Kubernetes ecosystem: all of the manual integration points between an app deployment and NIC’s "foundational software" layer (discussed in more detail in nebari-dev/nebari-infrastructure-core#5). For now, we can assume the following software pieces make up that foundational layer:

  • Envoy Gateway (Gateway API) for ingress routing
  • cert-manager for TLS certificates
  • Keycloak for SSO/OIDC authentication and user management
  • Istio for mesh capabilities (optional integration, not required for MVP)
  • Argo CD for GitOps-based application deployment (Helm)

There is also an additional internal layer dedicated to metrics collection and telemetry, but that is considered a stretch goal for this initial operator implementation; it should be kept in mind for future discussions.

The operator’s primary objective is self-service onboarding of NIC's platform "Apps".

When a new app is deployed via Helm/Argo CD, the app declares minimal intent, and the platform automatically configures routing, TLS, and SSO — consistently and securely — without manual per-app ingress/cert/auth configuration.
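
As a sketch of what that "minimal intent" could look like as a CRD contract, the Go API types below follow Kubebuilder conventions; every field name is an assumption for illustration, not the final NicApp schema:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NicAppSpec captures the minimal intent an app declares; the operator
// derives routing, TLS, and SSO wiring from it. All field names here
// are illustrative.
type NicAppSpec struct {
	// Hostname the app should be exposed at (drives routing and TLS).
	Hostname string `json:"hostname"`

	// Service and port the generated HTTPRoute should target.
	ServiceName string `json:"serviceName"`
	ServicePort int32  `json:"servicePort"`

	// EnableSSO toggles Keycloak-backed OIDC protection for the route.
	EnableSSO bool `json:"enableSSO,omitempty"`
}

// NicAppStatus is the first-class status subresource: conditions plus
// the generation the controller last acted on.
type NicAppStatus struct {
	Conditions         []metav1.Condition `json:"conditions,omitempty"`
	ObservedGeneration int64              `json:"observedGeneration,omitempty"`
}

// NicApp is the onboarding custom resource reconciled by nic-operator.
type NicApp struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NicAppSpec   `json:"spec,omitempty"`
	Status NicAppStatus `json:"status,omitempty"`
}
```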

To implement nic-operator using a conventional, maintainable Kubernetes controller layout, we will use the Kubebuilder framework. Kubebuilder provides scaffolding and patterns for defining CRDs (OpenAPI schemas), generating manifests, and implementing reconciliation logic, all aligned with Kubernetes API conventions and controller-runtime best practices.

Why Kubebuilder as a framework?

I am choosing Kubebuilder (Go + controller-runtime) as the primary framework for nic-operator because it matches NIC’s Go-first direction and keeps the operator as “just Kubernetes”: CRDs + controllers + reconciliation loops, following upstream conventions. For reference, see Kubebuilder’s Quick Start and the CNCF overview “Kubernetes Operators: what are they?”.

  • Kubernetes-native API design for our onboarding contract

    • Our core deliverable is a stable CRD contract (e.g., NicApp) with strong schema validation, versioning conventions, and a first-class status subresource (conditions, observed generation). Kubebuilder is built precisely for designing and implementing Kubernetes APIs.
    • nic-operator will reconcile across multiple API groups (Gateway API resources like HTTPRoute, cert-manager resources, Envoy Gateway policy CRDs, plus Secrets/status). Go/controller-runtime makes it easier to keep this logic strongly typed and testable at scale.
    • Because Kubebuilder uses the standard RBAC/codegen pipeline, it’s straightforward to evolve RBAC with the controller as we split reconcilers by domain (routing/tls/auth/keycloak) and/or introduce admission validation later.
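
To illustrate that RBAC/codegen pipeline: in Kubebuilder projects, permissions are declared as comment markers next to the reconciler, and controller-gen renders them into Role/ClusterRole manifests. The API group `nic.nebari.dev` and the exact verb sets below are assumptions, not a settled design:

```go
// Illustrative RBAC markers for nic-operator's reconcilers. Running
// `make manifests` (controller-gen) converts these comments into
// ClusterRole rules, so RBAC evolves in lockstep with the code.

//+kubebuilder:rbac:groups=nic.nebari.dev,resources=nicapps,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=nic.nebari.dev,resources=nicapps/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=gateway.networking.k8s.io,resources=httproutes,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=cert-manager.io,resources=certificates,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch
```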

We also considered Operator SDK (Go), Kopf (Python), and Juju (charms). All are valid ecosystems, but they introduce additional scope and/or supporting dependencies beyond what an internal, Argo CD–driven operator needs, so they were not selected for the MVP path. In particular, the Operator SDK path was ruled out because it relies on the OLM (Operator Lifecycle Manager) ecosystem, which does not make sense for this single-operator structure.

Note

We should reconsider this depending on any later decisions about relying on other operators to handle the creation of Grafana (or the LGTM stack), cert-manager, Istio, etc.

TLDR

This issue delivers the primary Architectural Decision Record (ADR), which defines the onboarding contract (CRD-first vs. annotations), domain boundaries (routing/tls/auth/keycloak/core), ownership/garbage-collection rules to avoid Argo CD conflicts, security boundaries (RBAC + secrets), status/readiness expectations, and a phased roadmap.

Additional considerations

For early development and testing, we intend to mirror the local cluster bootstrap workflow/structure used by the vor-infra repository as a template to stand up a cluster with the required platform services.
