This file tells AI coding agents (GitHub Copilot, Codex, Claude, etc.) how to work safely and effectively in this repository.
Terraform module that deploys a production-ready k3s cluster on OCI Always Free resources. All compute, networking, and storage must fit within the Always Free budget — do not introduce resources that incur cost.
| Layer | Technology |
|---|---|
| IaC | Terraform ≥ 1.9 / OpenTofu ≥ 1.9 |
| Cloud | Oracle Cloud Infrastructure (OCI) |
| OS | Ubuntu 24.04 LTS (aarch64) only |
| Kubernetes | k3s (latest resolved at plan time) |
| Ingress | Envoy Gateway (Gateway API) |
| Observability | kube-prometheus-stack (via GitOps) |
| Logging | OCI Unified Logging (optional) |
| Storage | Longhorn |
| GitOps | ArgoCD + Image Updater |
| TLS | cert-manager (Let's Encrypt) |
| Reboots | kured + unattended-upgrades |
| Resource | Free allowance | This module |
|---|---|---|
| A1.Flex compute | 4 OCPUs / 24 GB / 4 instances | 3 servers + 1 worker |
| Block storage | 200 GB | 4 × 50 GB boot volumes = 200 GB; bastion is OCI Bastion Service (managed, no VM, no storage) |
| NLB | 1 | 1 public NLB |
| Flex LB | 2 × 10 Mbps | 1 internal LB |
| E2.1.Micro | 2 | 0 (bastion uses OCI Bastion Service, not a VM) |
| NAT Gateway | 1 per VCN | 1 |
| Object Storage | 20 GB | 2 versioned buckets — Terraform state (enable_object_storage_state) + Longhorn PVC backups (enable_longhorn_backup) |
| Vault (shared) | Software keys + 150 secrets | 3 secrets — k3s_token, longhorn_ui_password, grafana_admin_password (enable_vault = true) |
| Volume backups | 5 total | 4 — one per node, weekly, 1-week retention (enable_backup = true) |
| Notifications | 1M HTTPS + 3K email/month | 1 topic wired to Alertmanager (enable_notifications = false, opt-in) |
| MySQL HeatWave | 1 standalone, 50 GB | 1 DB system in private subnet (enable_mysql = false, opt-in) |
Never add resources that exceed this budget. If a change requires more OCPUs, storage, or additional paid resources, flag it explicitly instead of implementing it.
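The budget rule is simple arithmetic, so a change can be checked before it is proposed. The sketch below is a hypothetical pre-flight check in bash, not part of the module. The function and variable names are illustrative, and the limits are copied from the table above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Always Free ceilings, copied from the allowance table (illustrative names).
FREE_BLOCK_STORAGE_GB=200
FREE_VOLUME_BACKUPS=5

# boot_storage_gb NODES SIZE_GB: total boot volume storage in GB.
boot_storage_gb() { echo $(( $1 * $2 )); }

# within_budget USED LIMIT: succeeds when USED does not exceed LIMIT.
within_budget() { [ "$1" -le "$2" ]; }

# Current layout: 4 nodes with 50 GB boot volumes, one weekly backup each.
used_gb="$(boot_storage_gb 4 50)"
if within_budget "$used_gb" "$FREE_BLOCK_STORAGE_GB" && within_budget 4 "$FREE_VOLUME_BACKUPS"; then
  echo "within Always Free budget (${used_gb} GB block storage, 4 backups)"
else
  echo "over budget: flag the change instead of implementing it" >&2
fi
```

A fifth node at 50 GB would come to 250 GB and fail the check, which is the case the rule above asks agents to flag.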
vars.tf — all input variables (add new vars here)
locals.tf — derived locals (ssh_public_key, k3s_version, common_tags, agent_plugins)
data.tf — cloud-init assembly (join of vars tpl + lib files), random_password resources
terraform.tf — required_providers and version constraints
network.tf — VCN, subnets, IGW, NAT GW, route tables
security.tf — Security Lists
nsg.tf — Network Security Groups
iam.tf — Dynamic Group and Policy (scoped to cluster_name tag, includes log-content and secret-family)
logging.tf — OCI Log Group, Log, Unified Agent Configuration (enabled via enable_oci_logging)
compute.tf — Instance pool (servers), pool (workers), standalone extra worker
lb.tf — Internal Flexible LB (kubeapi HA)
nlb.tf — Public Network LB (HTTP/HTTPS ingress)
backup.tf — Custom weekly backup policy + assignments for all node boot volumes (enable_backup)
vault.tf — OCI Vault (DEFAULT type, SOFTWARE key), three cluster secrets (enable_vault)
objectstorage.tf — Versioned Object Storage bucket for Terraform state (enable_object_storage_state)
notifications.tf — OCI Notifications topic + optional email subscription (enable_notifications)
mysql.tf — MySQL HeatWave DB system in private subnet (enable_mysql)
output.tf — Outputs (IPs, k3s_token, longhorn_ui_credentials, argocd_initial_password_hint, oci_log_group_id, terraform_state_backend, notification_topic_endpoint, mysql_endpoint, vault_id)
files/server-vars.sh.tpl — cloud-init header for servers: ONLY file with Terraform ${var} syntax
files/agent-vars.sh.tpl — cloud-init header for agents: ONLY file with Terraform ${var} syntax
files/lib/common.sh — pure bash: OS bootstrap, unattended-upgrades, OCI CLI, Helm, resolve_flannel_params()
files/lib/k3s-server.sh — pure bash: first-server election, k3s install, main entry point
files/lib/k3s-bootstrap.sh — pure bash: secrets pre-creation, Gateway API CRDs, cert-manager, ArgoCD
files/lib/k3s-agent.sh — pure bash: k3s agent install, main entry point
gitops/apps/ — ArgoCD Application manifests (App of Apps pattern)
gitops/network-policies/ — Default-deny NetworkPolicies (managed by network-policies.yaml App)
gitops/longhorn/ — Longhorn ingress with BasicAuth (Envoy Gateway SecurityPolicy + HTTPRoute)
gitops/cert-manager/ — ClusterIssuer templates + ArgoCD Application template (see adoption notes)
gitops/gateway/ — Envoy Gateway config: EnvoyProxy (DaemonSet/NodePort), GatewayClass, Gateway, redirect HTTPRoute, TLS ClientTrafficPolicy
gitops/external-secrets/ — ClusterSecretStore template + example ExternalSecret CRs (enable_external_secrets)
example/ — Example module usage
.github/workflows/terraform.yml — CI: fmt, validate, tflint, ShellCheck, terraform-docs
.terraform-docs.yml — terraform-docs config (inject mode; CI auto-commits README updates)
renovate.json — Automated dependency updates
- All resources get `freeform_tags = local.common_tags`.
- Versions in `vars.tf` use `# renovate:` inline comments so Renovate opens PRs automatically, for example `# renovate: datasource=github-releases depName=cert-manager/cert-manager` followed by `default = "v1.16.3"`.
- Run `tofu fmt -recursive` (or `terraform fmt -recursive`) before committing; CI enforces it.
- `terraform validate` runs against both the root module and `example/`; keep both valid.
- The `lifecycle { prevent_destroy = true }` on both load balancers is intentional; do not remove it.
- When renaming a resource, always add a `moved {}` block (with `from = oci_core_instance.old_name` and `to = oci_core_instance.new_name`) so existing states don't require `terraform state mv`.
- Remove `moved {}` blocks after one release cycle, and leave a comment in `moved.tf` explaining that it's intentionally empty. The file itself must remain so its purpose is clear.
- `files/server-vars.sh.tpl` and `files/agent-vars.sh.tpl` are the ONLY Terraform templatefiles. They export all Terraform-resolved values as bash `export KEY="value"`. `${var}` is Terraform interpolation; these files render to a plain bash variable header.
- `files/lib/*.sh` are pure bash: no Terraform syntax, no `$${var}` escaping. ShellCheck runs on these files without workarounds (`# shellcheck disable=SC2154` is the only suppression, covering vars exported by the prepended template header).
- `data.tf` assembles the final script with `join("\n", [templatefile(...), file(...), ...])`.
- Ubuntu 24.04 only. No Oracle Linux, no multi-distro branches.
- Always use `set -euo pipefail` at the top of each file.
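To make the header-plus-library split concrete, here is a minimal bash sketch that mimics the assembly locally. It is an illustration only: `render_header`, `lib_body`, `assemble`, and the `K3S_VERSION` export name are assumptions, and the real assembly happens in `data.tf` via `join()`, `templatefile()`, and `file()`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# render_header VERSION: stand-in for a rendered *-vars.sh.tpl file.
# After rendering, the header is plain bash with only export statements.
render_header() {
  printf 'export K3S_VERSION="%s"\n' "$1"
}

# lib_body: stand-in for a files/lib/*.sh file. Pure bash, no Terraform syntax.
# SC2154 is suppressed because K3S_VERSION comes from the prepended header.
lib_body() {
  cat <<'EOF'
# shellcheck disable=SC2154
set -euo pipefail
main() { echo "installing k3s ${K3S_VERSION}"; }
main
EOF
}

# assemble VERSION: header first, then the library, joined by a newline.
assemble() {
  printf '%s\n%s\n' "$(render_header "$1")" "$(lib_body)"
}

# Run the assembled script the way cloud-init would run the final file.
assemble "v0.0.0-example" | bash
```

Because the library never contains `${var}` template syntax, it can be linted and executed on its own with only the exported variables supplied.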
If the component must be bootstrapped before ArgoCD starts (e.g. it provides a CRD that ArgoCD apps depend on):
- Add a version variable to `vars.tf` with a `# renovate:` comment.
- Export the version in `files/server-vars.sh.tpl` as `export MY_VERSION="${my_version}"`.
- Write an `install_<component>()` function in `files/lib/k3s-bootstrap.sh`.
- Call it from `run_bootstrap()` in `k3s-bootstrap.sh`.
- Add the version variable to the `templatefile()` vars map in `data.tf`.
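The bash side of these steps might look like the sketch below. Everything specific in it is hypothetical: the component name `my-component`, the chart repository URL, and the `run` dry-run helper are placeholders, and `run_bootstrap` here is a stand-in for the real function in `k3s-bootstrap.sh`, which is not reproduced.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# install_mycomponent: hypothetical pre-ArgoCD bootstrap step.
# MY_VERSION is expected to be exported by the rendered server-vars header.
install_mycomponent() {
  local version="${MY_VERSION:?MY_VERSION must be exported by the vars header}"
  run helm upgrade --install my-component my-component \
    --repo https://charts.example.invalid \
    --namespace my-component --create-namespace \
    --version "$version" --wait
}

# Stand-in for the bootstrap entry point that calls each install function.
run_bootstrap() {
  install_mycomponent
}
```

Running `DRY_RUN=1 MY_VERSION=v1.2.3 run_bootstrap` prints the Helm command without touching a cluster, which is handy for checking the version wiring.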
If the component is fully managed by ArgoCD (Helm chart from gitops/apps/):
- Add an ArgoCD `Application` manifest to `gitops/apps/` with the chart version pinned and a `# renovate:` comment for automated updates.
- No changes to cloud-init or `vars.tf` are needed.
New Kubernetes manifests belong in gitops/. Add an ArgoCD Application CR in
gitops/apps/ to have ArgoCD manage them automatically.
Users who want to add their own apps on top of the built-in stack must fork this repo. The workflow is:
- Fork the repo on GitHub.
- Run `bash gitops/update-repo-url.sh https://github.com/their-org/their-fork.git` to replace all `repoURL: https://github.com/mbologna/k3s-oci.git` occurrences in `gitops/apps/` with their fork URL. Commit and push.
- Set `gitops_repo_url = "https://github.com/their-org/their-fork.git"` in `terraform.tfvars` so cloud-init writes the correct URL into `app-of-apps.yaml`.
- Update `txtOwnerId` in `gitops/apps/external-dns.yaml` to match `var.cluster_name` (important when `enable_external_dns = true` and sharing a Cloudflare zone).
- Add their own ArgoCD `Application` manifests to `gitops/apps/`. Each can point at any Helm registry or any Git repo; only the App of Apps manifest itself must live in the fork.
When helping users add apps, always remind them to run update-repo-url.sh and set
gitops_repo_url if they haven't already.
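For orientation, the URL rewrite in the fork workflow amounts to a literal search-and-replace over the Application manifests. The function below is a hedged sketch of that idea, not the repository's actual `update-repo-url.sh` (whose contents are not shown here); `replace_repo_url` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# replace_repo_url DIR OLD_URL NEW_URL
# Rewrites "repoURL: OLD_URL" to "repoURL: NEW_URL" in every *.yaml file in DIR.
# Assumes NEW_URL contains no '|' or '&' characters (true for typical Git URLs).
replace_repo_url() {
  local dir="$1" old="$2" new="$3" file tmp old_re
  # Escape regex metacharacters so sed matches the old URL literally.
  old_re="$(printf '%s' "$old" | sed 's/[.[\*^$]/\\&/g')"
  for file in "$dir"/*.yaml; do
    [ -e "$file" ] || continue
    tmp="$(mktemp)"
    # '|' is the sed delimiter because URLs contain '/'.
    sed "s|repoURL: ${old_re}|repoURL: ${new}|g" "$file" > "$tmp"
    # Write back in place so the original file mode is kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
  done
}
```

After a rewrite like this, `grep -r repoURL gitops/apps/` is a quick way to confirm that no manifest still points at the upstream repository.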
| Check | Command |
|---|---|
| Terraform format | terraform fmt -check -recursive |
| Terraform validate (root) | terraform init -backend=false && terraform validate |
| Terraform validate (example) | same, in example/ |
| OpenTofu validate (root + example) | same as above but with tofu |
| tflint | tflint --init && tflint --recursive (pinned version, Renovate-managed; auto-discovers .tflint.hcl) |
| ShellCheck | shellcheck --severity=warning files/*.sh |
| YAML lint (gitops/ + .github/workflows/) | yamllint -d '{extends: relaxed, rules: {line-length: {max: 200}}}' gitops/ .github/workflows/ |
| actionlint | actionlint (GitHub Actions workflow syntax) |
| Trivy IaC scan | trivy config . --severity HIGH,CRITICAL (Terraform + gitops) |
| terraform-docs | fails on diff in PRs; auto-committed on push to main |
Run all checks locally before pushing:
tofu fmt -recursive
tofu init -backend=false && tofu validate
(cd example && tofu init -backend=false && tofu validate)
tflint --init && tflint --recursive
shellcheck --severity=warning files/lib/common.sh files/lib/k3s-server.sh files/lib/k3s-bootstrap.sh files/lib/k3s-agent.sh
yamllint -d '{extends: relaxed, rules: {line-length: {max: 200}}}' gitops/ .github/workflows/
actionlint
trivy config . --severity HIGH,CRITICAL --skip-dirs .terraform,example/.terraform
terraform-docs .

- Do not add paid OCI resources (compute shapes other than A1.Flex, extra NLBs, etc.)
- Do not add Oracle Linux support; Ubuntu 24.04 LTS only
- Do not remove `lifecycle { prevent_destroy = true }` from load balancers
- Do not hardcode secrets, OCIDs, or credentials anywhere
- Do not remove the `# renovate:` comments on version variables
- Do not commit `example/terraform.tfvars` (it is gitignored; `.tfvars.example` is the template)
- Do not break the `terraform validate` step: `server-vars.sh.tpl` / `agent-vars.sh.tpl` vars must match what `data.tf` passes
- Do not suggest terminating TLS at the OCI load balancer. The public-facing LB is the OCI NLB (`nlb.tf`), which operates at L4 TCP only (`protocol = "TCP"`) and cannot inspect or terminate TLS. The one free OCI Flexible LB allocation (L7, TLS-capable) is consumed by the internal kubeapi HA LB (`lb.tf`). TLS must be terminated at Envoy Gateway. cert-manager + Let's Encrypt handles certificate issuance and renewal automatically.
- Do not add nginx or other ingress controllers. Envoy Gateway (Gateway API) is the ingress implementation. All HTTP/HTTPS routing uses standard `HTTPRoute`, `Gateway`, and `GatewayClass` resources.
- Do not re-add `control-plane:NoSchedule` taints. Cloud-init removes these taints after cluster init so user workloads schedule across all 4 nodes. With only 1 worker, keeping the taints makes the worker a single point of failure for all workloads. All nodes are identically sized; etcd and user workloads coexist safely.
- Do not add UFW or any iptables front-end to nodes. k3s manages iptables directly via flannel; adding ufw would flush k3s's rules on `ufw enable` and break pod networking. OCI NSGs provide the security boundary at the hypervisor level, independent of the OS firewall.
- Vault uses `DEFAULT` type and `SOFTWARE` protection only. `VIRTUAL_PRIVATE` vault type and `HSM` protection mode are NOT Always Free. `vault_type = "DEFAULT"` (shared vault) + `protection_mode = "SOFTWARE"` are entirely free. The 150-secret limit covers the three cluster secrets many times over. Never change the vault type or protection mode without verifying cost.
- Vault and key have `prevent_destroy = true`. OCI DEFAULT vaults have a low per-tenancy limit and take a minimum of 7 days to fully delete (the `PENDING_DELETION` state counts against quota). `prevent_destroy` keeps the vault alive across `tofu destroy` / `tofu apply` cycles. If you genuinely need to delete the vault, remove the `lifecycle` block or run `tofu state rm` first.
- Do not add an nginx stream proxy back. The OCI NLB routes directly to Envoy Gateway NodePorts (`is_preserve_source = true` preserves real client IPs transparently). An extra nginx hop adds latency and complexity with no benefit.
- Do not reduce `boot_volume_size_in_gbs` below 50 GB. OCI requires ≥ 50 GB for boot volumes on all shapes (A1.Flex and E2.1.Micro alike). 4 × 50 GB = 200 GB exactly fills the Always Free block storage limit. Do not suggest 47 GB as an optimisation; it is not valid.
- `expose_ssh = true` adds TCP:22 listener + backends to the public NLB and NSG rules allowing `my_public_ip_cidr` to SSH directly to nodes via the NLB IP (see `ssh_command` output).
- `expose_kubeapi = true` adds TCP:6443 to the NLB for direct kubeapi access without a bastion.
- When `expose_ssh = true`, OCI Bastion Service (`enable_bastion`) is redundant. Set `enable_bastion = false` to avoid the lingering-VNIC delay when destroying (OCI Bastion VNICs take 15-30 min to clean up internally after deletion, blocking subnet deletion).
- NSG rules for NLB SSH/kubeapi traffic MUST use `source_type = "CIDR_BLOCK"` with `source = var.my_public_ip_cidr`, NOT `source_type = "NETWORK_SECURITY_GROUP"`. The NLB uses `is_preserve_source = true` so real client IPs arrive at node VNICs directly; NLB NSG rules only match health-check traffic.
- Deployed as a DaemonSet (one Envoy proxy pod per node) via the `EnvoyProxy` resource: every NLB backend serves ingress locally, no cross-node forwarding, no single-pod SPOF.
- `priorityClassName: system-cluster-critical` ensures Envoy proxy pods preempt user workloads under memory pressure and are never evicted before system daemons.
- `resources.requests: 100m CPU / 128Mi RAM` prevents scheduling on nodes that cannot sustain ingress load.
- `PodDisruptionBudget maxUnavailable: 1` for the Envoy DaemonSet is NOT used. Kubernetes PDB does not support DaemonSet-controlled pods (DaemonSets do not implement the scale subresource). kured uses `--ignore-daemonsets` during drain so the one-node-at-a-time guarantee comes from kured's own distributed lock, not a PDB. Do not add a PDB for the Envoy DaemonSet pods.
- All HTTP/HTTPS routing uses standard `HTTPRoute` resources (Gateway API v1). Proprietary `IngressRoute` CRDs are not used.
- HTTP-01 ACME challenges use `gatewayHTTPRoute` solver (cert-manager Gateway API integration). cert-manager is installed with `--feature-gates=ExperimentalGatewayAPISupport=true`.
- TLS certificates live in the `envoy-gateway-system` namespace (same as the Gateway) so no `ReferenceGrant` is needed.
- BasicAuth for Longhorn UI uses Envoy Gateway `SecurityPolicy` with `.htpasswd` Secret: same security, standard API.
- Do not change `envoyDaemonSet` back to `envoyDeployment`; this would reintroduce a single-pod SPOF for all HTTP/HTTPS traffic.
- Replica count is explicitly pinned to 3 in `gitops/apps/longhorn.yaml` via `defaultSettings.defaultReplicaCount=3` and `persistence.defaultClassReplicaCount=3`. Do not rely on the upstream chart default.
- Longhorn is managed entirely by ArgoCD (`gitops/apps/longhorn.yaml`). Cloud-init does NOT install Longhorn.
- With 4 nodes and 3 replicas, any single node can be lost without PVC data loss.
- The etcd HA ceiling applies independently: losing 2 control-plane nodes loses etcd quorum regardless of Longhorn replica count.
- Password is generated by `random_password.longhorn_ui_password` in `data.tf` and exported by `files/server-vars.sh.tpl` as `LONGHORN_UI_PASSWORD_PLAIN` (or fetched from Vault when `enable_vault = true`).
- `files/lib/k3s-bootstrap.sh` generates the APR1 hash via `openssl passwd -apr1` and creates `Secret/longhorn-basic-auth-secret` in `longhorn-system` at bootstrap time. The hash requires runtime password resolution so it cannot be a static gitops file.
- `gitops/longhorn/ingress.yaml` is a template; users configure the `HTTPRoute`, `SecurityPolicy`, and `Certificate` resources there pointing to the pre-created Secret.
- Credentials are available via the `longhorn_ui_credentials` sensitive output.
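The hashing step can be sketched as follows. This is an illustration under assumptions, not the code in `k3s-bootstrap.sh`: the function name `make_htpasswd_line` and the `admin` user name are hypothetical, and only the `openssl passwd -apr1` call is taken from the description above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_htpasswd_line USER PASSWORD
# Prints one htpasswd-style line: USER:$apr1$<salt>$<hash>.
# The password goes to openssl on stdin so it does not appear in the process list.
make_htpasswd_line() {
  local user="$1" password="$2" hash
  hash="$(printf '%s\n' "$password" | openssl passwd -apr1 -stdin)"
  printf '%s:%s\n' "$user" "$hash"
}

# Example with a throwaway password; the salt is random, so the output
# differs on every run.
make_htpasswd_line admin 'example-password'
```

Because the salt changes on every run, the resulting line cannot be committed as a stable manifest, which matches the reason given above for creating the Secret at bootstrap time.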
- Cloud-init bootstraps ClusterIssuers with the correct email from `var.certmanager_email_address`. This must happen at bootstrap time; the email cannot be in git without manual editing.
- `gitops/cert-manager/` contains template ClusterIssuers and an ArgoCD Application template.
- To enable ArgoCD management of ClusterIssuers: update the email in `cluster-issuers.yaml`, then copy `application-template.yaml` to `gitops/apps/cert-manager.yaml`.
- Do NOT place the template in `gitops/apps/` as-is; it contains `changeme@example.com`.
- Separation of concerns: `server-vars.sh.tpl` and `agent-vars.sh.tpl` are the ONLY files with Terraform `${var}` interpolation. All `files/lib/*.sh` are pure bash.
- Assembly: `data.tf` uses `join("\n", [templatefile(vars.tpl), file(lib/common.sh), ...])` to produce a single cloud-init script. The rendered vars header is prepended, making all `export KEY="value"` statements available to the lib scripts at runtime.
- GitOps-first: cloud-init only bootstraps what ArgoCD cannot self-manage:
  - Gateway API CRDs (must exist before ArgoCD syncs `gateway-config` app)
  - cert-manager Helm + ClusterIssuers (email is a runtime Terraform var, not static git)
  - ArgoCD Helm + App of Apps bootstrap
  - External Secrets Operator Helm + ClusterSecretStore (conditional, vault_ocid is runtime)
  - Pre-create Kubernetes Secrets with runtime values (passwords, endpoints)
  - Hostname-specific HTTPS Gateway listener + TLS Certificate + HTTPRoute (NLB IP is runtime; see `configure_grafana_ingress()` and the "Hostname-specific HTTPS resources" section in Deploying web apps)
- Managed by ArgoCD, NOT cloud-init: Envoy Gateway, Longhorn, kured, system-upgrade-controller, external-dns Helm; all in `gitops/apps/*.yaml`.
- Removed vars: `kured_start_time`, `kured_end_time`, `kured_reboot_days`, `kured_chart_version`, `longhorn_chart_version`, `envoy_gateway_chart_version`, `external_dns_chart_version` were removed from `vars.tf`. Configure kured via `gitops/apps/kured.yaml` directly.
- Shared cloud-init vars: `local.k3s_common_cloud_init_vars` in `locals.tf` holds the five vars shared by both server and agent (`k3s_version`, `k3s_subnet`, `k3s_token`, `k3s_url`, `vault_secret_id_k3s_token`). The server templatefile call uses `merge(local.k3s_common_cloud_init_vars, {...server-only...})`; the agent call passes the local directly.
- Flannel interface resolution: `resolve_flannel_params()` in `common.sh` sets `LOCAL_IP` and `FLANNEL_IFACE` (exported) when `K3S_SUBNET` is not `default_route_table`. Called by both `install_k3s_server()` and `install_k3s_agent()`; server adds `--advertise-address` too.
- ShellCheck: `# shellcheck disable=SC2154` in lib/ files covers exported vars from the prepended template header. No other suppressions needed (was 5+ in the old monolith).
- Controlled by `enable_oci_logging` variable (default: `true`).
- Creates: `oci_logging_log_group`, `oci_logging_log`, `oci_logging_unified_agent_configuration`.
- The dynamic group from `iam.tf` is referenced for the agent config.
- The `Custom Logs Monitoring` plugin is enabled in `locals.tf` `agent_plugins`.
- Ships `/var/log/k3s-cloud-init.log` to OCI Logging (10 GB/month free).
- `oci_log_group_id` output provides the OCID for use with `oci logging` CLI.
- README Variables and Outputs sections are auto-generated between `<!-- BEGIN_TF_DOCS -->` and `<!-- END_TF_DOCS -->` markers.
- CI (`terraform-docs` job) auto-commits README if the content drifts (git-push mode).
- Run `terraform-docs .` locally before pushing to avoid an extra CI commit.
- Config is in `.terraform-docs.yml` (inject mode, sort by name).
- Controlled by `enable_vault` variable (default: `true`).
- Uses `vault_type = "DEFAULT"` (shared vault, free). `VIRTUAL_PRIVATE` vaults cost money; never use that type.
- Key uses `protection_mode = "SOFTWARE"` (free). HSM-protected keys are NOT free.
- Stores three secrets: `k3s_token`, `longhorn_ui_password`, `grafana_admin_password`.
- Cloud-init fetches secrets at boot via `oci secrets secret-bundle get-secret-bundle` with `OCI_CLI_AUTH=instance_principal`.
- When `enable_vault = false`, the plaintext values are exported by `server-vars.sh.tpl` / `agent-vars.sh.tpl` as `K3S_TOKEN_PLAIN`, `LONGHORN_UI_PASSWORD_PLAIN`, `GRAFANA_ADMIN_PASSWORD_PLAIN`; the lib scripts use them as fallback.
- The IAM policy uses `concat()` to add `read secret-family` only when `enable_vault = true`.
- Agent script (`files/lib/k3s-agent.sh`) installs OCI CLI and fetches k3s_token from Vault when `VAULT_SECRET_ID_K3S_TOKEN` is non-empty.
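The Vault-or-plaintext behaviour described above can be sketched as a small resolver. This is a minimal illustration, assuming the secret bundle content is base64-encoded. The function names `fetch_vault_secret` and `resolve_secret` are hypothetical and not taken from the lib scripts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# fetch_vault_secret SECRET_OCID: print the decoded secret content.
# Uses instance principal auth, so no API keys are stored on the node.
fetch_vault_secret() {
  OCI_CLI_AUTH=instance_principal oci secrets secret-bundle get-secret-bundle \
    --secret-id "$1" \
    --query 'data."secret-bundle-content".content' \
    --raw-output | base64 -d
}

# resolve_secret SECRET_OCID PLAINTEXT_FALLBACK
# Non-empty OCID: read from Vault. Empty OCID: use the value from the vars header.
resolve_secret() {
  local secret_id="$1" fallback="$2"
  if [ -n "$secret_id" ]; then
    fetch_vault_secret "$secret_id"
  else
    printf '%s\n' "$fallback"
  fi
}
```

A caller would then write something like `K3S_TOKEN="$(resolve_secret "${VAULT_SECRET_ID_K3S_TOKEN:-}" "${K3S_TOKEN_PLAIN:-}")"`, keeping the branch in one place instead of repeating it per secret.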
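- Controlled by `enable_backup` variable (default: `true`).
- Creates a custom `oci_core_volume_backup_policy` with weekly full backups, 1-week retention.
- Assigns the policy to all server boot volumes (`data.oci_core_instance.k3s_servers[*].boot_volume_id`) and the standalone worker boot volume.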
data.oci_core_instance.k3s_servers[*].boot_volume_id) and the standalone worker boot volume. - With 4 nodes and 1-week retention there are at most 4 active backups — within the 5-backup Always Free limit.
- Do NOT increase retention or frequency beyond 1-week/weekly; doing so would exceed the free limit.
- `data.oci_objectstorage_namespace.k3s` is created when either `enable_object_storage_state` or `enable_longhorn_backup` is true; both buckets share it.
- Terraform state bucket (`enable_object_storage_state = true`): versioned, `NoPublicAccess`, name `${cluster_name}-terraform-state`. S3-compatible endpoint and bucket name in `terraform_state_backend` output.
- Longhorn backup bucket (`enable_longhorn_backup = true`): versioned, `NoPublicAccess`, name `${cluster_name}-longhorn-backup`. The `longhorn_backup_setup` output prints the three steps to connect Longhorn (Customer Secret Key → kubectl secret → uncomment `gitops/longhorn/backup-target.yaml`).
- Both buckets share the 20 GB Always Free Object Storage allowance. Longhorn does not rely on object versioning for its backup blobs (it manages its own retention); versioning is enabled on the bucket resource only as accidental-delete protection.
- Users need OCI Customer Secret Keys (S3 credentials) to use either bucket — these are user-created in the Console and not managed by Terraform.
- Controlled by `enable_notifications` variable (default: `false`).
- Creates `oci_ons_notification_topic.k3s_alerts` + optional email subscription (`alertmanager_email`).
- Cloud-init always creates the `alertmanager-oci-config` Secret in the `monitoring` namespace, with a null receiver when disabled and an OCI webhook receiver when enabled.
- `gitops/apps/kube-prometheus-stack.yaml` references this secret via `alertmanager.alertmanagerSpec.configSecret`. Do NOT remove `configSecret: alertmanager-oci-config` from that file; the secret always exists.
- `notification_topic_endpoint` output provides the HTTPS endpoint for the Alertmanager webhook.
- Controlled by `enable_mysql` variable (default: `false`).
- Uses `shape_name = var.mysql_shape` (default `"MySQL.Free"`, the Always Free shape).
- Placed in the private subnet, reachable by all k3s nodes on port 3306.
- Admin password generated by `random_password.mysql_admin_password` (in `mysql.tf`).
- Cloud-init pre-creates a `mysql-credentials` Kubernetes Secret in the `default` namespace.
- `mysql_endpoint` and `mysql_admin_credentials` (sensitive) outputs are available after apply.
- `is_highly_available = false`: HA MySQL is NOT Always Free.
- Controlled by `enable_external_dns` variable (default: `false`).
- Installs External DNS (chart version tracked by Renovate) configured for the Cloudflare provider.
- Syncs `HTTPRoute` hostnames to Cloudflare DNS automatically; annotate resources with `external-dns.alpha.kubernetes.io/hostname: your.host.example.com`.
- Requires `cloudflare_api_token` and `cloudflare_zone_id`.
- `external_dns_domain_filter` limits which zones External DNS manages (prevents accidental changes to unrelated zones when the API token covers multiple zones).
- `txtOwnerId` is hardcoded to `k3s-cluster` in `gitops/apps/external-dns.yaml`; update it to match `var.cluster_name` in your fork so multiple clusters can share a Cloudflare zone without conflicts.
- Controlled by `enable_external_secrets` variable (default: `false`). Requires `enable_vault = true`.
- Installs External Secrets Operator and creates a `ClusterSecretStore` backed by OCI Vault using instance_principal auth, so there are no credentials to rotate.
- The existing IAM `read secret-family` policy (added when `enable_vault = true`) already covers it.
- See `gitops/external-secrets/` for the ClusterSecretStore template and example ExternalSecret CRs.
- Users create `ExternalSecret` resources referencing Vault secret OCIDs; the operator syncs them into Kubernetes Secrets automatically and rotates on the configured refresh interval.
The following issues were discovered while deploying the first HTTPS application. They are documented here so agents do not repeat the investigation.
NLB is_preserve_source = true and NSG rules
The public NLB uses is_preserve_source = true on all backend sets. This means packets arrive at node VNICs with the real client IP as source, not the NLB's own IP. NSG rules that use source_type = NETWORK_SECURITY_GROUP pointing at the NLB NSG will only match health-check traffic (which originates from the NLB's VNIC) — real user traffic is silently dropped. NodePort rules for HTTP (:30080) and HTTPS (:30443) on both the workers NSG and the servers NSG (servers are also NLB backends) must use source = "0.0.0.0/0" with source_type = "CIDR_BLOCK". Nodes are in a private subnet with no public IPs, so this is safe.
cert-manager HTTP-01 self-check blocked by NetworkPolicy
gitops/network-policies/cert-manager.yaml deploys egress NetworkPolicies that kube-router enforces strictly. The original allow-https-egress policy only permitted TCP 443/6443/8443 — it did NOT allow TCP 80. cert-manager's HTTP-01 solver performs a self-check GET request to http://<hostname>/.well-known/acme-challenge/... before submitting to Let's Encrypt. With port 80 egress blocked, kube-router REJECTs the packet with ICMP port-unreachable, which Go's net/http reports as "connection refused". The allow-http-egress NetworkPolicy was added to fix this — do not remove it.
CHACHA20_POLY1305 ciphers crash Envoy TLS on aarch64
gitops/gateway/tls-policy.yaml (ClientTrafficPolicy) must NOT include TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 or TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305. Envoy/BoringSSL on aarch64 rejects these TLS 1.2 cipher names with error code 13, which causes the entire xDS TLS snapshot to be rejected. The result is that no TLS certificate is ever loaded via SDS and all HTTPS connections are dropped with TCP RST. The AES-GCM ciphers are sufficient; TLS 1.3 ChaCha20 (TLS_CHACHA20_POLY1305_SHA256) still negotiates automatically and is unaffected.
HTTPRoute hostnames: [] matches ALL requests
An empty hostnames list in a Gateway API HTTPRoute is identical to omitting the field — it matches every hostname. An HTTP-to-HTTPS redirect route with no (or empty) hostnames redirects ALL HTTP traffic. cert-manager's ACME HTTP-01 challenge HTTPRoute (created automatically by cert-manager) has a more specific hostname+path match and takes precedence — so the match-all redirect is safe. gitops/gateway/redirect.yaml intentionally omits hostnames for exactly this reason. Do NOT add explicit hostnames to the redirect route — the route would break for any hostname not listed.
Hostname-specific HTTPS resources are managed by cloud-init, not gitops/
NLB IP changes on every redeploy. Hardcoding sslip.io addresses in gitops/ breaks GitOps: every redeploy requires manual file edits. The design:
- `local.grafana_hostname` in `locals.tf` auto-computes `grafana.<nlb-ip>.sslip.io` (or uses `var.grafana_hostname` if set).
- `files/server-vars.sh.tpl` exports `GRAFANA_HOSTNAME`.
- `files/lib/k3s-bootstrap.sh:configure_grafana_ingress()` runs after ArgoCD is installed and creates:
  - The `https-grafana` Gateway listener (SSA, field-manager=cloud-init-bootstrap)
  - The `grafana-tls` Certificate in `envoy-gateway-system`
  - The `grafana` HTTPRoute in `monitoring` (SSA, field-manager=cloud-init-bootstrap)
- `gitops/gateway/gateway.yaml` has ONLY the `http` listener (ArgoCD owns it via SSA).
- `gitops/monitoring/grafana-ingress.yaml` has the Grafana HTTPRoute WITHOUT `hostnames` (ArgoCD owns all fields except `spec.hostnames`, which cloud-init-bootstrap owns).
SSA field-manager ownership prevents ArgoCD from clearing cloud-init patches
Gateway API's `spec.listeners` is an `x-kubernetes-list-map-keys: [name]` list, so SSA treats it as a named map and merges by the `name` key. Each SSA manager owns the entries it applied:
- `argocd-controller` owns `spec.listeners[name=http]` (applied from gateway.yaml)
- `cloud-init-bootstrap` owns `spec.listeners[name=https-grafana]` (applied by configure_grafana_ingress)

When ArgoCD syncs gateway.yaml (without `https-grafana`), it only owns `http` and never touches `https-grafana`. The `ignoreDifferences: /spec/listeners` in the gateway-config ArgoCD Application suppresses OutOfSync warnings.
Similarly, spec.hostnames in the Grafana HTTPRoute is owned by cloud-init-bootstrap (via kubectl apply --server-side --field-manager=cloud-init-bootstrap --force-conflicts). ArgoCD's SSA apply (without hostnames in the manifest) doesn't claim or clear the field.
Do NOT use CSA (kubectl apply without --server-side) to patch ArgoCD-managed resources. CSA sets the kubectl.kubernetes.io/last-applied-configuration annotation, which confuses ArgoCD's 3-way merge on the next sync. Always use SSA with a custom field-manager for cloud-init patches to ArgoCD-managed resources.
gateway-config MUST use ServerSideApply=true to avoid resourceVersion: 0 errors. Without SSA, ArgoCD's CSA apply with RespectIgnoreDifferences strips spec.listeners from the patch payload, causing a malformed UPDATE request to fail validation.
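A small wrapper keeps the field manager consistent across every cloud-init patch. The sketch below is illustrative: `ssa_apply` and the `DRY_RUN` switch are hypothetical helpers, while the `kubectl` flags are the ones named above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# ssa_apply MANIFEST_FILE
# Server-side apply under the cloud-init field manager, so ownership of the
# applied fields is recorded separately from ArgoCD's own manager.
# With DRY_RUN=1 the command is printed instead of executed.
ssa_apply() {
  local manifest="$1"
  local -a cmd=(
    kubectl apply --server-side
    --field-manager=cloud-init-bootstrap
    --force-conflicts
    -f "$manifest"
  )
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Afterwards, `kubectl get gateway <name> -n <namespace> -o yaml --show-managed-fields` shows which manager owns each listener entry; the resource name and namespace are left as placeholders here.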
- Controlled by `enable_dns01_challenge` variable (default: `false`). Requires `cloudflare_api_token`.
- When enabled, cloud-init creates a `cloudflare-api-token` Secret in `cert-manager` and switches ClusterIssuers to use DNS-01 (Cloudflare) instead of HTTP-01.
- Benefits: supports wildcard certs (`*.example.com`), no inbound port 80 required.
- See `gitops/cert-manager/cluster-issuers.yaml` for the commented DNS-01 ClusterIssuer variants to use when adopting cert-manager into ArgoCD.