Multi-Cloud Kubernetes Platform (GKE + AKS)

A production-shaped platform running the same application stack on GKE and AKS, joined by an Istio multi-primary mesh, with unified observability through Thanos and Grafana and DNS-level failover through Azure Traffic Manager.

Motivation

Cloud platforms fail. Regions go down, vendors have outages, and enterprise customers often mandate which cloud their data lives on. A platform that bets everything on one provider is one incident away from a full outage with no fallback.

I wanted to build something that treated multi-cloud not as a stretch goal but as a first-class architectural constraint — where the networking, the service mesh, the observability, and the CI/CD pipeline are all designed from the start with the assumption that workloads span two different clouds with different APIs, different IAM models, and non-overlapping network spaces. This platform is the result of that.

Why It Matters

Vendor independence — no single cloud provider failure takes the platform offline. Traffic Manager shifts load in minutes; the mesh reroutes within seconds once traffic is flowing.
Regulated workloads — multi-cloud is often a contractual or compliance requirement, not a choice. This platform demonstrates the patterns (separate VPCs, explicit trust boundaries, no cross-cloud shared credentials) that those environments demand.
Cost optionality — running identical workloads on two clouds means you can shift capacity toward whichever is cheaper at renewal time without re-architecting the application.
Operational realism — the runbooks, alert rules, and failover scripts here are the kind of artefacts a platform team actually maintains, not just YAML that provisions and never gets touched again.

Real-World Uses

This architecture pattern directly maps to production scenarios across several industries:

SaaS platforms needing 99.99% SLAs without betting on one cloud region
Financial services with cloud-specific data residency requirements per jurisdiction
Healthcare and government workloads where a customer's procurement mandates a specific cloud provider per environment
Global applications routing users to the geographically nearest healthy cluster
Disaster recovery setups where one cloud acts as active and the other as warm standby

Architecture

flowchart TD
    DNS["Cloud DNS\nAzure Traffic Manager\n(weighted routing + failover)"]

    DNS --> GKE
    DNS --> AKS

    subgraph GKE ["GKE / us-central1"]
        GIG["Istio ingress gateway"]
        GAGW["api-gateway"]
        GORD["order-service"]
        GINV["inventory-service"]
        GPROM["Prometheus"]
        GIG --> GAGW --> GORD --> GINV
        GPROM --> THANOS
    end

    subgraph AKS ["AKS / East US"]
        AIG["Istio ingress gateway"]
        AAGW["api-gateway"]
        AORD["order-service"]
        AINV["inventory-service"]
        APROM["Prometheus"]
        AIG --> AAGW --> AORD --> AINV
    end

    GORD <-->|"east-west mTLS :15443"| AORD
    APROM -->|"remote_write"| THANOS["Thanos Receive → Query"]
    THANOS --> GRAFANA["Grafana\n(single pane)"]

ASCII version

                                   +----------------------+
                                   | Cloud DNS            |
                                   | Azure TrafficManager |
                                   +----------+-----------+
                                              |
                                     public traffic + health
                                              |
                                +-----------------------------+-----------------------------+
                                |                                                           |
                                v                                                           v
                      +---------------------+                                     +---------------------+
                      | GKE / us-central1   |                                     | AKS / East US       |
                      | network1 / cluster1 |                                     | network2 / cluster2 |
                      +----------+----------+                                     +----------+----------+
                                 |                                                           |
                        +-----------+-----------+                                   +-----------+-----------+
                        | Istio ingress gateway |                                   | Istio ingress gateway |
                        +-----------+-----------+                                   +-----------+-----------+
                                 |                                                           |
                         +-------+-------+                                           +-------+-------+
                         | api-gateway   |                                           | api-gateway   |
                         +-------+-------+                                           +-------+-------+
                                 |                                                           |
                         +-------+-------+                                           +-------+-------+
                         | order-service | <==== east-west mTLS over 15443 =====>    | order-service |
                         +-------+-------+                                           +-------+-------+
                                 |                                                           |
                         +-------+-------+                                           +-------+-------+
                         | inventory-svc |                                           | inventory-svc |
                         +---------------+                                           +---------------+

                         Prometheus                                               Prometheus
                              |                                                        |
                              |                              remote_write               |
                              +-----------------------+<-------------------------------+
                                                      |
                                               Thanos Receive
                                                      |
                                                 Thanos Query
                                                      |
                                                   Grafana

What Works

Terraform provisions GKE and AKS with separate cloud-specific modules
Helm deploys the application stack to both clusters with per-cluster value overrides
Istio runs multi-primary multi-network with east-west gateways for cross-cluster mTLS
Cross-cluster service calls route through the mesh without manual endpoint management
Azure Traffic Manager handles public weighted routing and health-based failover
Grafana on GKE shows metrics from both clusters via Thanos remote_write from AKS
GitHub Actions builds, scans, and deploys with OIDC — no static cloud keys in GitHub

Why I Built It This Way

I wanted one repo that forced me to deal with tradeoffs instead of hiding them. GKE and AKS do not look the same once you get past the cluster marketing page, so I kept them in separate Terraform modules rather than abstracting them into a generic interface that would smooth over the differences.

The mesh is multi-network because pod IPs stop being meaningful the second traffic crosses clouds. Monitoring is push-based from AKS to GKE because that plays nicer with NAT and public edges than a pull-only setup.

Failover is intentionally layered. Public traffic moves at DNS, which is slower but simple and requires no mesh awareness. Inside the mesh, outlier detection reacts faster once requests are already moving. Keeping those two mechanisms separate made the whole platform easier to reason about and debug independently.

Stack

Area	Version / shape	Notes
Terraform	1.14.8	Root module with separate `gke/`, `aks/`, and `dns/` modules
GCP provider	`hashicorp/google` 7.25.0	GKE, VPC, firewall, Cloud DNS
Azure provider	`hashicorp/azurerm` 4.66.0	AKS, VNet, NSG, Traffic Manager
Kubernetes	GKE Regular channel, AKS current GA
Istio	1.29.1	Multi-primary, multi-network with east-west gateways
Helm	4.1.3	App chart, monitoring installs, external-dns
Go	1.22.x	`api-gateway` and `order-service`
Python	3.12.x / Flask 3.1.x	`inventory-service`
Monitoring	kube-prometheus-stack 82.10.3 · Thanos Receive + Query · Grafana 10.7.0	One Grafana view over both clusters
CI	GitHub Actions + OIDC	No static cloud credentials in GitHub

Quick Start

Copy terraform/terraform.tfvars.example to terraform/terraform.tfvars and fill in cloud, registry, and DNS values.
make init TF_BACKEND_BUCKET=<bucket> TF_BACKEND_PREFIX=multi-cloud-k8s/dev
make plan-gke && make plan-aks
make apply-all
Build and push images, then deploy the Helm chart to both clusters.
./mesh/scripts/setup-mesh.sh --gke-context <ctx> --aks-context <ctx>
./monitoring/scripts/install-monitoring.sh --gke-context <ctx> --aks-context <ctx>
./routing/scripts/dns-verify.sh && ./routing/scripts/test-failover.sh && ./scripts/quick-test.sh

Repository Layout

.
├── terraform/          # Cloud infra — separate gke/, aks/, dns/ modules
├── app/                # Three services (api-gateway Go, order-service Go, inventory-service Python)
│   └── docker-compose.yml
├── helm/               # Umbrella chart with per-cluster value overrides
├── mesh/               # Istio multi-cluster setup, east-west gateways, traffic policy
├── monitoring/         # Prometheus, Thanos, Grafana, alerts
├── routing/            # external-dns, Traffic Manager Terraform, failover scripts
├── tests/              # Smoke, integration, load, OPA policy checks
├── docs/               # Architecture docs, ADRs, runbooks
├── costs/              # Cost estimate and optimization notes
├── Makefile            # All workflow targets
└── Justfile            # Alternative task runner

Costs

The always-on lab shape runs around $675/month — on-demand node pools, public edges, central monitoring, both clusters up continuously. Most of the bill is node compute. Detail in costs/estimate.md.

Known Gaps

Tracing headers are propagated but no span store is wired up — Jaeger or Tempo would be the next addition.
Grafana dashboard JSON has some rough panel layout that I'd clean up if I kept this running longer.
No night-time scale-down — the platform stays always-on right now, which trades cost for simplicity.

What I Learned

Multi-network Istio is not the same as multi-cluster Istio. When pod CIDRs overlap across clouds you cannot use flat networking. The east-west gateway pattern — where each cluster exposes services on a dedicated gateway and the mesh routes to remote endpoints by hostname — is the only option, and debugging it when it breaks requires reading Envoy xDS state directly with istioctl proxy-config.
GKE and AKS Terraform modules cannot share an abstraction cleanly. Firewall rules, node pool shapes, workload identity setup, and managed add-on APIs are different enough that a shared module would have been a leaky abstraction. Separate modules with explicit outputs were more honest.
OIDC for GitHub Actions is non-trivial on AKS. GCP has first-class Workload Identity Federation support in the provider. Azure requires a federated credential on an App Registration, RBAC assignment on the subscription, and the azure/login action wired correctly. Getting both working in the same pipeline without static keys took a few iterations.
Push-based remote_write from AKS to GKE is more reliable than federation across clouds. Prometheus federation pulls — if NAT or public IPs shift, scrapes fail silently. With Thanos Receive, the AKS cluster pushes metrics over a stable public endpoint and the failure mode is visible immediately in Thanos Query.
Layering failover at DNS and mesh separately is the right call. DNS failover is slow (TTL-bound) but requires zero mesh awareness — it works even if Istio is broken. Mesh-level outlier detection is fast but only kicks in once traffic is already flowing. Keeping them independent meant I could test and reason about each layer without the other interfering.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-Cloud Kubernetes Platform (GKE + AKS)

Motivation

Why It Matters

Real-World Uses

Architecture

What Works

Why I Built It This Way

Stack

Quick Start

Repository Layout

Costs

Known Gaps

What I Learned

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
.github		.github
.vscode		.vscode
app		app
costs		costs
helm		helm
mesh		mesh
monitoring		monitoring
routing		routing
scripts		scripts
terraform		terraform
tests		tests
.editorconfig		.editorconfig
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
Justfile		Justfile
Makefile		Makefile
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Multi-Cloud Kubernetes Platform (GKE + AKS)

Motivation

Why It Matters

Real-World Uses

Architecture

What Works

Why I Built It This Way

Stack

Quick Start

Repository Layout

Costs

Known Gaps

What I Learned

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages