alphagov · jasonBirchall · Jan 27, 2025 · Jan 28, 2025
diff --git a/source/images/govuk-monitoring-alerting.jpg b/source/images/govuk-monitoring-alerting.jpg
diff --git a/source/images/govuk-platform-1000ft.jpg b/source/images/govuk-platform-1000ft.jpg
diff --git a/source/kubernetes/how-platform-works/index.html.md b/source/kubernetes/how-platform-works/index.html.md
@@ -1,23 +1,96 @@
 ---
+owner_slack: "#govuk-platform-engineering"
 title: How the platform works
 weight: 60
+section: Kubernetes
+type: learn
 layout: multipage_layout
+parent: "/manual.html"
 ---
-
 # How the platform works
 
-The GOV.UK Kubernetes platform is an AWS-hosted [Kubernetes](https://kubernetes.io) cluster, using Amazon's [Elastic Kubernetes Service](https://aws.amazon.com/eks/) (EKS).
+The GOV.UK Kubernetes platform is an AWS-hosted [Kubernetes](https://kubernetes.io) cluster built using Amazon's [Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). It provides a standardised environment for running containerised applications that power GOV.UK services.
+
+The platform is designed to:
+
+- Enable consistent application deployments across staging, integration, and production environments.
+
+- Automate tasks like scaling, load balancing, and secret management.
+
+- Centralise key infrastructure components, such as monitoring and alerting.
+
+
+To meet GOV.UK’s operational requirements, the platform integrates Kubernetes with AWS tools and additional Kubernetes add-ons. These include features for:
+- Managing secrets securely.
+- Automatically scaling the cluster based on workloads.
+- Routing traffic through managed load balancers.
+
+This document outlines the core components of the platform, including the add-ons, authentication methods, and other services that support its operation.
+
+## High-Level Architecture
+
+![High level overview of internal platform](../../images/govuk-platform-1000ft.jpg)
+
+> Note: The [source for the above diagram](https://drive.google.com/file/d/1iYblqBbGXlkOScOBRlg8hqtFvVaA9CXp/view?usp=drive_link) is stored in Google Drive.
+
+The platform is designed to manage and deploy applications reliably and efficiently across multiple environments, including staging, integration, and production. The architecture integrates several components to handle deployment, monitoring, alerting, and logging:
+
+- **Infrastructure Deployment**: Terraform Cloud and the AWS API are used to provision and manage infrastructure, including the Kubernetes clusters hosted on Amazon Elastic Kubernetes Service (EKS).
+
+- **Application Deployment**: Developers push code to GitHub, which triggers GitHub Actions to build and deploy applications using Helm charts via Argo workflows.
+
+- **Monitoring and Alerting**: Prometheus scrapes metrics from applications and infrastructure, with alerts routed through AlertManager to PagerDuty, Slack, and other channels. Grafana provides observability dashboards for performance monitoring.
+
+- **Secret Management**: External-secrets fetches secrets from AWS Secrets Manager, making them available as Kubernetes Secrets for applications.
+
+- **Logging**: Application and cluster logs are centralised in ELK/LogIt for analysis and debugging.
+
+- **Operational Support**: Tools like Sentry capture exceptions, while monitoring systems like CloudWatch and Pingdom enhance reliability and alerting for both applications and infrastructure.
+
+This setup enables automated, scalable, and secure management of the platform while providing observability and operational insights for application teams.
+
+### Application Deployment
+
+For details, see [how applications are deployed](/kubernetes/manage-app/access-ci-cd/).
+
+### Monitoring and Alerting
+
+![Model of the monitoring and alerting setup](../../images/govuk-monitoring-alerting.jpg)
+
+> Note: The [source for the above diagram](https://drive.google.com/file/d/1pB3acw7CFtqTJe1nSyV391B8iPLrKYgC/view?usp=sharing) is stored in Google Drive.
+
+Monitoring and alerting in the GOV.UK Kubernetes platform help maintain the stability, performance, and reliability of applications and infrastructure. The diagram above provides a high-level overview of the monitoring and alerting workflow, and the sections below describe each part.
+
+#### Key Components
+
+1. **Prometheus**  
+   Prometheus collects metrics from applications (e.g., App1, App2) and infrastructure components. It serves as the central hub for scraping and storing performance and operational data.
+
+2. **AlertManager**  
+   AlertManager processes alerts generated by Prometheus based on predefined rules. It routes alerts to various notification channels, including:
+   - **Slack**: Sends alerts to team channels for real-time awareness.
+   - **PagerDuty**: Escalates critical alerts to on-call operators.
+   - **Email**: Distributes alerts for broader notification.
+
+3. **Grafana**  
+   Grafana integrates with Prometheus to provide visualisations and dashboards for observing performance metrics and trends. It allows teams to monitor the health of applications and systems effectively.
+
+4. **Sentry**  
+   Sentry captures exceptions and error logs from applications, feeding them into the monitoring pipeline for analysis and resolution.
 
-For information on how Kubernetes works in general, see the:
+5. **Pingdom and AWS CloudWatch**  
+   - **Pingdom** monitors external service uptime and latency, providing an additional layer of visibility into application availability.
+   - **AWS CloudWatch** provides monitoring for AWS-managed resources, complementing Prometheus's metrics collection.
 
-- [Kubernetes documentation](https://kubernetes.io/docs/home/)
-- [Amazon EKS documentation](https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html)
-- Linux foundation’s free [introduction to Kubernetes training certification](https://training.linuxfoundation.org/training/introduction-to-kubernetes/)
+6. **Seal and Dependabot**  
+   - **Seal** provides alert consolidation and actionable insights for smoother incident response.
+   - **Dependabot** raises pull requests when dependency updates are available, so that outdated dependencies are tracked.
 
-Specific to its implementation of Kubernetes, the GOV.UK Kubernetes platform cluster:
+#### Workflow Summary
 
-- uses add-ons to manage storage and secrets
-- authenticates platform cluster users using an `aws-auth` ConfigMap
+- Prometheus scrapes metrics from applications, and applications report exceptions to Sentry. Prometheus evaluates the metrics against alerting rules and forwards firing alerts to AlertManager, which routes them to the appropriate channels based on their severity and urgency.
+- Teams use Grafana dashboards to analyse metrics and trends for ongoing operational insights.
+- External monitoring tools (Pingdom, CloudWatch) augment observability, while Seal and Dependabot contribute to a streamlined response to issues.
 
 ## Platform cluster add-ons