Certified Cloud Native Platform Engineering Associate (CPNA) Advanced Study Guide

This advanced study guide is organized by the CPNA exam domains and their key competencies. It provides a comprehensive overview of essential concepts in platform engineering, with an emphasis on modern Kubernetes-based platforms and DevOps practices. Each section aligns with the official CPNA domains, integrating recent topics such as HEART metrics, SBOM attestations, Helm commands, Kubernetes core components, and OpenTofu (Terraform's open-source fork). The guide is structured for clarity, with headings for each domain and subheadings for specific competencies, ensuring an easy-to-follow learning path.

Table of Contents

  1. Platform Engineering Core Fundamentals (36%)
  2. Platform Observability, Security and Conformance (20%)
  3. Continuous Delivery and Platform Engineering (16%)
  4. Platform APIs and Provisioning Infrastructure (12%)
  5. IDPs and Developer Experience (8%)
  6. Measuring Your Platform (8%)

Platform Engineering Core Fundamentals (36%)

Overview: This domain covers foundational concepts of cloud-native platform engineering, including microservices and container fundamentals, declarative configuration, Kubernetes basics, and core DevOps principles as they apply to internal platforms. A firm grasp of these basics is critical before diving into more specialized topics.

Declarative Resource Management and Kubernetes Fundamentals

Platform engineering emphasizes declarative configuration – specifying what the desired state of infrastructure or services should be, rather than how to achieve it. In practice, this is seen in tools like Kubernetes manifests (YAML files declaring desired cluster state) and Infrastructure as Code (IaC) languages. Declarative resource management relies on a control loop that continuously reconciles actual state to match the declared state. Kubernetes exemplifies this model: developers submit desired resource definitions to the kube-apiserver (the control plane's API front-end), which stores state in etcd; then controllers and the kube-scheduler act to make reality match the spec (e.g. scheduler assigns pods to nodes, controllers create or update resources). This reconciliation loop is core to Kubernetes' design – control plane components (API server, controller-manager, scheduler) constantly detect differences between the desired state and current cluster state and take action to correct them. For example, if a Deployment declares 3 replicas but only 2 pods are running, the Deployment controller creates a new pod to satisfy the declared replica count. This loop underpins GitOps and other declarative models across platform tools.
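
As an illustration of this declarative model, a minimal Deployment manifest for the three-replica example above might look like the following (names and image are illustrative):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service            # illustrative name
    spec:
      replicas: 3                      # desired state: keep 3 pods running at all times
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
        spec:
          containers:
            - name: app
              image: registry.example.com/example-service:1.0.0   # hypothetical image
              ports:
                - containerPort: 8080

If a pod is deleted or a node fails, the Deployment controller notices the divergence from replicas: 3 and creates a replacement pod – the reconciliation loop in action.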

Kubernetes Core Components: The Kubernetes control plane comprises several key components that work together in this declarative model. The API server is the front-end that exposes the Kubernetes API (e.g. kubectl and controllers communicate with it). etcd is the reliable key-value store where all cluster state is persisted. The scheduler monitors for unassigned Pods and binds them to appropriate nodes based on resource needs and policies. The controller manager runs various controllers which implement the desired state for different resource types (Deployments, ReplicaSets, etc.). Together, these ensure any change to the declared state (like deploying a new microservice or changing a config) results in automated actions to achieve that state – a powerful pattern for platform automation. Platform engineers should understand how this control loop works and how custom controllers (operators) can extend it for new resource types (discussed later).

DevOps Practices in Platform Engineering

DevOps principles – such as automation, continuous improvement, and close collaboration between development and operations – are foundational to platform engineering. A platform engineering team often serves as an enabler, applying DevOps practices to build a self-service Internal Developer Platform (IDP). Key DevOps concepts applied in platform engineering include Infrastructure as Code (using declarative templates and version control for infrastructure), continuous integration/continuous delivery (CI/CD) pipelines, automated testing, monitoring and feedback loops, and blameless incident postmortems.

It's also important to distinguish platform engineering from general DevOps: while DevOps is a broad culture, platform engineering focuses on building a product (the internal platform) that developers use. Thus, platform teams take a product mindset toward developer experience. They implement DevOps techniques behind the scenes to provide golden paths, but also act on user feedback, track platform adoption, and treat developers as customers of the platform. Common DevOps tools (CI servers, container registries, config management, etc.) are leveraged to create a cohesive platform where developers can deploy and operate apps with minimal toil.

Application Environments and Infrastructure Concepts

Modern applications are typically built as microservices packaged in containers. A strong understanding of containers (and how they encapsulate application code with dependencies) and container orchestration is critical. Platform engineers should grasp how applications progress from development to production through different environments (dev, test, staging, prod), and how to provide these environments on demand. Key concepts include:

  • Immutable infrastructure: Rather than manual changes on servers, new deployments involve provisioning new instances (containers, VMs) with updated code/config. This yields consistency across environments.
  • Configuration and secrets management: Separating config from code (e.g. using Kubernetes ConfigMaps and Secrets or similar mechanisms) to adjust environment-specific settings without modifying application images; a minimal example follows this list.
  • Infrastructure concepts: Virtualization, containerization, and clustering. For example, knowing how Kubernetes clusters abstract underlying compute (nodes) and provide a uniform API to deploy workloads. Also understanding cloud infrastructure primitives (VMs, networking, storage) since the platform often must manage cloud resources alongside cluster resources.
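
Returning to the configuration-and-secrets point above, a minimal sketch using a hypothetical app-config ConfigMap consumed as environment variables (names and values are illustrative):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: app-config                      # hypothetical name
    data:
      DATABASE_HOST: db.staging.internal    # environment-specific value, not baked into the image
      LOG_LEVEL: debug
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/example-app:1.0.0   # same image in every environment
          envFrom:
            - configMapRef:
                name: app-config            # only the referenced config differs per environment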

Platform engineers also handle multi-tenancy and isolation concerns. This can involve organizing resources by namespaces or projects to isolate teams, and ensuring one team's workloads can't impact another's (through resource quotas, network policies, etc., as covered in Security). Understanding how to efficiently utilize cloud infrastructure (right-sizing clusters, using autoscaling, etc.) is part of these fundamentals as well.

Platform Architecture and Capabilities

An internal developer platform typically includes multiple integrated capabilities: container orchestration (most often Kubernetes) for scheduling and running microservices; service networking (for service discovery, routing, perhaps a service mesh for traffic policy); CI/CD systems for building and deploying code; logging and monitoring systems for observability; and database/storage services or integrations. Platform architecture refers to how these components are assembled and exposed to developers.

Key architectural principles include modularity (using loosely coupled components via APIs), extensibility (e.g. allowing new services or tooling to be plugged in via operators or webhooks), and self-service interfaces. The platform may expose a UI or CLI (developer portal) that abstracts away the underlying tools. For instance, rather than requiring developers to write raw Kubernetes YAML, a platform might provide a user-friendly interface or templated "golden path" to deploy a service.

In designing platform architecture, engineers should consider capabilities such as: on-demand environment provisioning, automated scaling, zero-downtime deployment strategies, secrets management, and policy enforcement. A well-architected platform will hide complexity (like Kubernetes resource details) behind simpler abstractions or automation. For example, instead of manually configuring CI pipelines, developers might select a predefined pipeline template. Goals and approaches in platform engineering (discussed next) drive these architectural decisions.

Platform Engineering Goals, Objectives, and Approaches

The ultimate goal of platform engineering is to improve developer productivity and happiness by reducing cognitive load and operational toil. Key objectives include: self-service (developers can get what they need without filing tickets), standardization (consistent tooling and best practices across teams), scalability (both of infrastructure and of the number of developers the platform can support), and security/compliance by design (embedding security controls into the platform).

Approaches to achieve these goals involve treating the platform as a product. This means engaging with developer users, collecting feedback, and measuring success (e.g. tracking adoption and satisfaction metrics – see the Developer Experience and Measurement sections for HEART and other metrics). Platform teams often adopt the concept of "Golden Paths" – opinionated defaults or templates that make doing the right thing easy. For example, providing a service template that sets up CI, CD, monitoring, and infrastructure boilerplate automatically. By curating a paved road, the platform ensures reliability and security while freeing developers to focus on writing code.

Another approach is automation everywhere: automating environment setups, test execution, deployments, and recovery actions. Manual processes are minimized. The platform should also align with DevOps research (DORA) practices – enabling frequent deployments, fast recovery, and low change failure rates (these DORA metrics are discussed under Measuring Your Platform). In summary, core fundamentals of platform engineering revolve around building a system that encapsulates DevOps best practices, uses declarative and automated mechanisms (like Kubernetes and CI/CD pipelines), and always keeps developer needs in mind.

Continuous Integration Fundamentals

Continuous Integration (CI) is the practice of frequently merging code changes into a shared repository and automatically building and testing those changes. Platform engineers need to ensure robust CI pipelines are in place as part of the platform's software factory. CI fundamentals include setting up automated builds (compiling code, creating container images), running unit and integration tests on each change, and providing rapid feedback to developers when something fails.

Key points:

  • Source control integration: Every code commit triggers an automated pipeline (using tools like Jenkins, GitHub Actions, GitLab CI, etc.). This pipeline should lint the code, run tests, and build artifacts (e.g., container images or binaries).
  • Artifact management: The platform should include or integrate with artifact repositories (container registries like Harbor or ECR, package registries, etc.) to store built images with proper versioning (often tagging images with git commit SHAs or version numbers).
  • Team collaboration: CI encourages smaller, more frequent merges to avoid integration hell. It often pairs with trunk-based development and feature flagging so that even incomplete features can live in the main branch safely toggled off.

From a platform perspective, CI as a service might be provided to teams – for example, the platform team could maintain shared CI runners and standard pipeline definitions that teams use. Ensuring that pipelines are fast and reliable (perhaps by parallelizing tests or using caching) improves developer experience. Additionally, static code analysis and security scanning can be integrated into CI to catch issues early (this overlaps with security in pipelines, covered later).
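
As a sketch only (your CI system and commands will differ), a minimal GitHub Actions workflow along these lines might look like this; the registry, image name, and make test target are assumptions:

    # .github/workflows/ci.yaml -- illustrative pipeline: test, build, push
    name: ci
    on: [push]
    jobs:
      build-and-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run tests
            run: make test                 # assumes the repo provides a test target
          - name: Build container image tagged with the commit SHA
            run: docker build -t registry.example.com/myapp:${{ github.sha }} .
          - name: Push image (registry login step omitted for brevity)
            run: docker push registry.example.com/myapp:${{ github.sha }}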

Continuous Delivery and GitOps (Introduction)

While detailed continuous delivery (CD) and GitOps practices are covered under a later domain, a fundamental understanding is introduced here. Continuous Delivery means code changes automatically flow through staging environments and can be deployed to production at any time (after passing tests), often via an automated pipeline with manual approval gates if needed. Continuous Deployment is an even further step where changes that pass tests deploy to production automatically with no manual gate. The platform should support these practices by providing the tooling and infrastructure to promote builds between environments, run automated integration tests, and orchestrate deployments.

GitOps is a modern approach to CD, especially for Kubernetes, where the desired state of the system is stored in Git and a reconciler tool continuously applies this state to the cluster. In a GitOps model, developers don't manually kubectl apply changes; instead, they commit a change (for example, update a Kubernetes manifest or Helm chart in a git repo), and a tool like Argo CD or Flux will detect the change and apply it to the cluster, ensuring the cluster state matches the repo. Git serves as the single source of truth for both infrastructure and application config, and rollbacks are as simple as reverting a git commit. This approach inherently provides versioning and audit trails for all changes.

Helm is often used in tandem with GitOps or CD pipelines as a packaging tool for Kubernetes manifests. A Helm chart packages all the YAML needed for an application, templatizing it for different environments. In a CD context (whether push-based or GitOps pull-based), one might use helm commands to manage releases:

  • helm upgrade --install: This is an idempotent command to install a chart if it's not already released, or upgrade it to a new version if it exists. It allows using the same command for both initial deployment and subsequent upgrades, ensuring repeatable results. Many CI/CD pipelines use this to push out changes, optionally with flags like --wait (to wait for the deployment to succeed) and --atomic (to automatically roll back if the upgrade fails).
  • helm rollback: This command rolls back a release to a previous revision. Helm maintains a history of revisions for each release; for example, helm rollback myrelease 1 would revert the release myrelease to revision 1. If no revision number is specified (or 0 is used), it rolls back to the previous release. This provides a quick path to recover if a new deployment causes issues – a critical capability for incident response in continuous delivery.
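
A minimal sketch of how these two commands might appear in a deployment script (release, chart, and namespace names are illustrative):

    # Install on the first run, upgrade on later runs; wait for readiness and
    # roll back automatically if the upgrade fails.
    helm upgrade --install myapp ./charts/myapp \
      --namespace production \
      --set image.tag=abc123 \
      --wait --atomic

    # Inspect release history and roll back to a known-good revision if needed.
    helm history myapp --namespace production
    helm rollback myapp 1 --namespace production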

In summary, continuous delivery fundamentals involve automating the path to production so that software can be released rapidly, safely, and preferably with minimal manual intervention. Platform engineers ensure that the CI flow transitions smoothly into CD by connecting pipeline stages to deployment tools or GitOps processes. They define deployment strategies (rolling updates, blue-green, canary releases) and may utilize progressive delivery techniques. GitOps, with its emphasis on declarative state and automation, has become a gold standard for managing Kubernetes cluster configurations and app deployments in a reliable way.


Platform Observability, Security and Conformance (20%)

Overview: This domain spans three critical aspects: Observability (monitoring the health and performance of the platform and applications), Security (securing the platform and software supply chain), and Conformance/Governance (ensuring the platform and workloads comply with policies and standards). A platform engineer must build a platform that is visible (so issues can be detected), secure (to protect code, deployments, and runtime), and governed (to enforce best practices and compliance automatically).

Observability Fundamentals: Traces, Metrics, Logs, and Events

Observability is the ability to understand the internal state of systems from their external outputs. In cloud-native environments, observability is often discussed in terms of four key data types (sometimes called the "pillars" of observability):

  • Metrics: Numeric measurements collected over time, representing the performance or load of a system (e.g., CPU usage, request latency, error rate). Metrics are typically time-series data that enable monitoring and alerting. They answer "what is the quantitative state?" and are great for automated thresholds (for example, triggering an alert if error rate > 5%). Tools: Prometheus (for scraping metrics), Grafana (for dashboards), etc.

  • Logs: Immutable, timestamped records of discrete events that happened within the system (e.g., an application log entry for a user login or an error stack trace). Logs provide detailed context and are useful for forensic analysis and debugging. They answer "what happened and when?" A logging stack might include Fluentd/Fluent Bit (collecting logs) and Elasticsearch/Kibana or Loki for storage and search.

  • Traces: Distributed traces follow a user request or transaction as it propagates through multiple services, recording the path and time spent in each component. Tracing is essential for understanding how a single transaction flows through a microservices architecture and where bottlenecks or errors occur. Traces provide the contextual linkage that metrics and logs alone might miss, answering "how did this request execute across the system?" Tools: Jaeger, Zipkin, or OpenTelemetry collectors feeding data to a tracing backend.

  • Events: In this context, events can mean discrete incidents or state changes (e.g., Kubernetes events such as a Pod being killed or rescheduled, or alerts firing). They are often used to denote noteworthy occurrences that might not be captured as logs/metrics, or to feed into incident management. Events might also include synthetic checks or external triggers. Essentially, they add color to the timeline of system behavior.

These telemetry data sources are complementary. For example, metrics might show a spike in latency, logs might show errors at that time, and traces could pinpoint which service or call in the request path was slow or failing. An effective platform has an observability stack that correlates these signals. Modern observability practices (including OpenTelemetry) even treat traces, metrics, and logs as part of one correlated telemetry stream. As a platform engineer, ensure that your platform provides: application instrumentation (perhaps via OpenTelemetry SDKs), a metrics aggregator and store (Prometheus is the CNCF standard), centralized log collection, and possibly an APM solution for traces. Also, service meshes like Istio can generate metrics and traces for all service communication automatically.

Finally, observability isn't only about tooling but also about establishing Service Level Objectives (SLOs) and alerts. Platform engineers often work with SREs to define SLOs for the platform itself (e.g., API gateway latency, CI pipeline success rate) and enable product teams to set SLOs for their services. Conformance to reliability targets is assessed via the collected metrics and events (bridging into the "conformance" aspect in terms of meeting defined standards for performance and availability).
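
As a sketch of how such an alert might be expressed, assuming the Prometheus Operator's PrometheusRule CRD is available and a hypothetical http_requests_total metric labeled by job and status code:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: example-slo-alerts
    spec:
      groups:
        - name: example-service
          rules:
            - alert: HighErrorRate
              # Fires when more than 5% of requests fail over a 5-minute window.
              expr: |
                sum(rate(http_requests_total{job="example-service",code=~"5.."}[5m]))
                  / sum(rate(http_requests_total{job="example-service"}[5m])) > 0.05
              for: 10m
              labels:
                severity: page
              annotations:
                summary: "Error rate above 5% for example-service"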

Secure Service Communication (Zero Trust Networking)

In cloud-native platforms, services often need to communicate over the network (e.g., microservice A calls microservice B). Securing inter-service communication means ensuring that these calls are authenticated and encrypted, and that unauthorized access is prevented. Two primary techniques are used:

  • Mutual TLS (mTLS): Each service is issued a cryptographic identity (certificate). When service A calls service B, both sides exchange certificates to mutually authenticate each other, and the connection is encrypted via TLS. This ensures both confidentiality (encrypted traffic) and identity verification (only known services with valid certs can talk). Implementing mTLS manually can be complex, so platforms often leverage a service mesh (like Istio, Linkerd, or Kuma) which injects sidecar proxies to handle mTLS for all service-to-service traffic automatically. One big reason organizations adopt service meshes is to get secure communications via mTLS by default. With Istio for example, you can enable a mesh-wide policy that all traffic is encrypted and only workloads within the mesh with proper certificates can communicate, effectively establishing a zero-trust network where every call is verified.

  • Network Policies and Segmentation: Kubernetes NetworkPolicies can restrict which pods/services can talk to which others at the network level. For instance, you can declare that the frontend deployment is allowed to call backend on port 80, but nothing else can. This is like a firewall for pod traffic. Proper network segmentation prevents an attacker who compromises one service from freely reaching all others. In addition, isolating applications by namespaces and using separate network policies per environment/team provides a defense-in-depth.
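
A minimal NetworkPolicy sketch for the frontend-to-backend rule described above (namespace and labels are illustrative):

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: backend-allow-frontend
      namespace: shop                  # hypothetical namespace
    spec:
      podSelector:
        matchLabels:
          app: backend                 # the policy protects backend pods
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: frontend        # only frontend pods may connect
          ports:
            - protocol: TCP
              port: 80

Combined with a default-deny policy in the namespace, this implements least-privilege network access.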

Other aspects of secure service communication include using service accounts and IAM (especially in cloud provider contexts) so that when service A calls an external service or cloud API, it uses least-privilege credentials. Within Kubernetes, ensure that the default service account tokens aren't overly permissive and consider using Kubernetes RBAC for any API access between services (if they interact with the Kubernetes API).

Additionally, API gateways or ingress controllers often act as a first line of defense for incoming traffic, handling TLS termination from outside clients. These should be configured with strong ciphers, certificate rotations, and possibly client certificate validation if needed. Platforms might integrate with cloud identity systems or OIDC for service-to-service auth on HTTP level (JWT tokens etc.), but mTLS at transport layer remains a gold standard for internal traffic encryption.

In summary, platform engineers should implement a zero trust networking model: assume no request is safe by default, and require authentication/encryption for all service communications. Using a service mesh simplifies this, as it can manage certificate issuance (via an internal CA), automatic key rotation, and encryption of all traffic with minimal application changes. The result is a platform where data in transit is protected and services trust is established via verified identities.

Policy Engines for Platform Governance (Conformance)

Platform conformance and governance ensures that the platform and workloads adhere to certain rules and standards. Rather than relying on humans to catch misconfigurations or policy violations, Kubernetes offers policy as code solutions. A prominent example is Open Policy Agent (OPA) Gatekeeper, which is an admission controller that checks incoming Kubernetes resource requests against a set of policies before they are persisted. With Gatekeeper, platform engineers can define custom policies (in OPA's declarative Rego language) to enforce things like: all images must come from approved registries, pods cannot run as root, certain labels must be present on resources, or specific ingress domains are disallowed, etc. The Open Policy Agent Gatekeeper project helps enforce such policies and strengthen governance in a Kubernetes environment. Essentially, it acts as a gate: any kubectl apply or GitOps sync will be vetted, and if it violates a rule, the change is rejected.

Examples of governance policies:

  • Security conformance: e.g., disallow privileged containers or require that every container has a read-only root filesystem. (These can encode organizational security best practices.)
  • Resource standards: e.g., ensure every Deployment has resource requests/limits set (to prevent runaway resources), or enforce label schemas for all resources (for cost tracking or ownership).
  • Multi-tenancy isolation: e.g., prevent certain namespaces from using hostNetworking or hostPath volumes, which could break isolation.

OPA/Gatekeeper is one approach. Another CNCF project, Kyverno, offers a Kubernetes-native policy engine where policies are themselves Kubernetes resources (written in a more JSON/YAML pattern style). Kyverno can mutate resources to auto-correct issues (like injecting labels) or validate and block changes similar to Gatekeeper.
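
As a concrete sketch of policy as code, a Kyverno ClusterPolicy along the lines of the resource-standards example above might look roughly like this (assuming Kyverno is installed; the policy and rule names are illustrative):

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-resource-limits
    spec:
      validationFailureAction: Enforce   # reject non-compliant resources at admission time
      rules:
        - name: check-container-resources
          match:
            any:
              - resources:
                  kinds:
                    - Pod
          validate:
            message: "CPU and memory requests and limits are required."
            pattern:
              spec:
                containers:
                  - resources:
                      requests:
                        cpu: "?*"
                        memory: "?*"
                      limits:
                        cpu: "?*"
                        memory: "?*"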

Beyond admission control, "conformance" might also refer to ensuring the platform itself is conformant with industry standards. For Kubernetes, CNCF conformance tests ensure your cluster adheres to required APIs; platform engineers managing custom Kubernetes builds or multi-cloud might need to ensure they use conformant K8s versions. Also, using CIS Benchmarks for Kubernetes (via tools like kube-bench) can validate that the cluster setup follows security best practices.

Finally, governance includes audit and compliance: making sure all changes are auditable (which GitOps inherently aids, since git history is an audit log), and periodically reviewing platform configurations against compliance checklists (like SOC 2, PCI, etc., which often have requirements for encryption, least privilege, etc.). Many organizations incorporate automated policy checks in CI/CD (for example, using OPA in CI to lint Terraform manifests or Kubernetes YAML for compliance before deployment).

In short, policy engines and governance tools are the immune system of your platform – they keep bad or non-compliant configurations from ever being deployed, thus maintaining a consistent and secure baseline across the board.

Kubernetes Security Essentials

Building a secure platform requires addressing security at multiple layers: the supply chain, the cluster, and the runtime. Here are essential Kubernetes security considerations and best practices:

  • Cluster Access Control: Restrict access to the Kubernetes API. Use robust authentication (certificates, OIDC tokens tied to SSO, etc.) and Role-Based Access Control (RBAC) to ensure users and service accounts have only the permissions they need. Avoid using cluster-admin privileges for day-to-day work. Additionally, control network access to the API server (e.g., via firewall rules or API server whitelists) so that only trusted networks or VPN users can reach it.

  • Isolation via Namespaces: Leverage Kubernetes Namespaces to separate different teams or environments. Namespaces provide a scoping mechanism for names and also work with RBAC and ResourceQuotas. For example, ensure each team's namespace has its own ServiceAccounts and Roles such that they can't interfere with other teams' resources. Also consider node isolation if needed (taints/tolerations or separate node pools for different trust zones).

  • Network Policies: As mentioned, use NetworkPolicies to limit pod-to-pod communication. By default, Kubernetes allows all pods to talk to each other. A default deny policy plus specific allow rules can implement the principle of least privilege for network access.

  • Secure Pod Configuration: Enforce Pod Security Standards (the successor to the deprecated PodSecurityPolicy). This includes practices like: do not run containers as root (specify a non-zero runAsUser and set runAsNonRoot); drop unnecessary Linux capabilities; use a read-only root filesystem where possible; restrict hostPath volumes and hostNetwork (they should rarely be needed). Many of these can be checked via policy engines (as above) or via admission controllers. Kubernetes now has Pod Security Admission with levels (privileged, baseline, restricted) – aim for the "restricted" level for most workloads to get strong defaults. A minimal securityContext example follows this list.

  • Images and Supply Chain: Only deploy trusted container images. Use image scanning (for vulnerabilities) as part of CI, and again enforce image provenance in the cluster – e.g., require images to be pulled from your private registry (not random public images) and consider running an image admission controller that refuses images failing certain criteria. Keep base images minimal and updated.

  • Secrets Management: Never hardcode secrets in images or configs. Use Kubernetes Secrets (which base64-encode data by default – not strong encryption, but you can combine with tools like Sealed Secrets or vault integrations for encryption at rest). Ensure RBAC limits who can read secrets. Kubernetes can also encrypt secrets at rest in etcd (enable etcd encryption provider with a key).

  • Node Security: If you manage the nodes, keep the OS updated, minimize host attack surface (e.g., remove unused packages), and ensure only necessary ports are open. Use container runtime security features like seccomp and AppArmor profiles to limit what containers can do. Also ensure Kubernetes components themselves (kubelet, kube-proxy) are configured securely (auth, no unauthorized access to kubelet API, etc.).

  • Monitoring and Auditing: Enable audit logging on the API server – this logs all API requests and is invaluable for forensic analysis (e.g., detecting if someone tried to escalate privileges). Also monitor cluster components' metrics and events for signs of issues (like excessive restarts, OOMKills which could indicate a DoS attempt, etc.).

  • Incident Response Prep: Have a plan for compromised credentials or pods. Network segmentation and minimal privileges help contain breaches. You can also use tools like Falco (a runtime security tool that watches syscalls for abnormal behavior) to detect signs of container breakout or crypto mining, etc., in real time.
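
As a sketch of the restricted-style pod settings referenced in the Secure Pod Configuration item above (values reflect common hardening defaults, not a mandated configuration):

    apiVersion: v1
    kind: Pod
    metadata:
      name: hardened-app                 # illustrative
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001                 # arbitrary non-zero UID
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: registry.example.com/app:1.0.0
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]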

By following these essentials, you make the platform resilient against common threats. It's worth noting that Kubernetes security is a broad topic – from the control plane to the network to the workload. A good mental model is "layered defense": assume any single protection can fail, so have multiple layers (for example, even if an attacker gets into a pod, a network policy might block lateral movement, running the pod as non-root limits escalation to the host, and a read-only root filesystem prevents tampering with binaries).

Security in CI/CD Pipelines (Supply Chain Security and SBOM Attestation)

The security of the software supply chain – from code commit, through CI build, to artifact storage and deployment – is crucial. High-profile attacks have leveraged CI/CD pipelines to insert malicious code (SolarWinds) or compromised dependencies. Platform engineers must embed security checks and safeguards into the CI/CD process:

  • Source Control Protections: Use branch protections, code review requirements, and signed commits to ensure only verified code enters the pipeline. Consider adoption of Git signing (GPG or X.509 signatures on commits/tags) to verify authorship.

  • Dependency Management: Integrate dependency scanning (Software Composition Analysis) in CI to catch known vulnerable libraries. Lock down dependency versions (use checksum verification or lockfiles) to prevent unexpected changes.

  • Build Environment Hardening: Ensure your CI runners or build agents are secure. If using shared runners, be cautious of secrets exposure. Rotate credentials that CI uses (like cloud API keys) and follow the principle of least privilege for those creds (e.g., if CI needs to push to a container registry, that token shouldn't also be able to read from prod database). Some organizations use isolated, ephemeral build environments (spin up a fresh container/VM per build) to reduce persistence of any compromise.

  • Image Signing and Verification: After building container images, sign them cryptographically. The Sigstore project (CNCF) provides Cosign for signing container images. By signing images and configuring clusters to verify signatures (e.g., using an admission controller like Cosign's policy controller or Kubernetes image policy webhooks), you ensure only images produced by your trusted CI and signed by your key are run. This mitigates the risk of an attacker pushing an unverified image.

  • SBOM Generation: A Software Bill of Materials (SBOM) is a manifest listing all components, libraries, and dependencies in your software. Automated tools like Syft or CycloneDX generators can create SBOMs during CI. Having an SBOM for each build improves transparency – you know exactly what went into that container image (down to OS package versions, library versions, etc.). This is increasingly important for compliance and for quickly assessing exposure to new vulnerabilities.

  • SBOM Attestation: An SBOM itself is valuable, but one must also ensure the SBOM is authentic and corresponds to the exact artifact built. This is where attestation comes in. An attestation is a signed statement (metadata) about an artifact. In our context, a signed SBOM attestation is a way to say "this SBOM was generated for this specific image, by our build pipeline, at this time" in a verifiable manner. Using a tool like Cosign, the CI pipeline can attach an SBOM to the container image and sign it (Cosign supports storing attestations in OCI registries alongside the image). Consumers or deployment systems can then verify that the SBOM hasn't been tampered with and indeed was produced by the trusted builder. Attestation provides provenance – you can prove who produced the artifact and how. Without attestation, an SBOM is just a text file that could be altered; with attestation, it's bound to the image via a cryptographic signature, assuring it hasn't changed since build time and is trustworthy.

In practice, your pipeline might run: build image → generate SBOM (e.g., syft image:tag -o spdx-json > sbom.spdx or similar) → use cosign to sign and attach the SBOM (e.g., cosign attest --predicate sbom.spdx --key <key> image:tag). Later, a deployment admission controller can require that an image has a valid SBOM attestation before allowing it to run (policy enforcement).
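
A hedged sketch of these steps as they might appear in a pipeline script (image name, key paths, and predicate type are assumptions for illustration):

    # Build step is assumed to have produced this image already.
    IMAGE=registry.example.com/myapp:abc123

    # 1. Generate an SBOM for the built image.
    syft "$IMAGE" -o spdx-json > sbom.spdx.json

    # 2. Sign the image and attach the SBOM as a signed attestation.
    cosign sign --key cosign.key "$IMAGE"
    cosign attest --key cosign.key --type spdxjson --predicate sbom.spdx.json "$IMAGE"

    # 3. At verification/deploy time, confirm a trusted SBOM attestation exists.
    cosign verify-attestation --key cosign.pub --type spdxjson "$IMAGE"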

  • Continuous Vulnerability Scanning: Even after build, continuously scan images in the registry for newly disclosed vulnerabilities (using tools like Trivy or Clair). Feed this information back to the backlog for patching.

  • Secret Detection: Ensure no secrets are leaking in code repositories or container images. CI can run secret scanning (e.g., truffleHog or GitHub's secret scan) to catch accidental commits of keys.

  • Integrity of CI/CD Systems: Treat your Jenkins or GitHub Actions or GitLab CI configuration as critical infrastructure. Lock down who can modify pipeline definitions. If using container-based builds, be aware of supply chain there (e.g., a compromised base image could inject code during build). Prefer official base images and verify checksums.

In summary, CI/CD security pipelines aim to implement DevSecOps – baking security checks into the pipeline so that insecure code is caught early and only vetted artifacts make it to production. By generating and attesting SBOMs, signing images, and enforcing these at deploy time, the platform greatly reduces the risk of unknown or malicious code running in the cluster. This builds trust in the artifacts moving through the pipeline and ultimately running on the platform.


Continuous Delivery and Platform Engineering (16%)

Overview: This domain focuses on how code gets deployed and managed in a platform engineering context. It covers continuous delivery practices, incident response processes, the relationship between CI and CD, and particularly GitOps as a modern approach to managing infrastructure and deployments. It also touches on how platform engineers facilitate rapid, reliable releases and handle failures when they occur.

Continuous Integration vs. Continuous Delivery – The CI/CD Relationship

CI (Continuous Integration) and CD (Continuous Delivery/Deployment) are closely related but distinct stages in the software delivery lifecycle. The CI phase produces artifacts (builds, test results, packages), and CD takes those artifacts and delivers them to runtime environments. Platform engineers should ensure CI outputs (like container images and Helm charts) feed smoothly into CD processes.

Continuous Delivery means that every change is proven deployable at any time. This typically involves automatically promoting builds to a staging environment and running integration tests. If those pass, the build is considered "release-candidate". Whether deployment to production is automatic (continuous deployment) or requires a human promotion, the pipeline mechanics are often similar – the difference is just a manual gate.

Key aspects of connecting CI and CD:

  • Artifact versioning: Tag builds in CI in a way that CD can reference (e.g., image tags correspond to git commit or build ID). Use immutable references for deployments.
  • Configuration management: Often, CD involves deploying the same artifact to multiple environments (dev, stage, prod) with different configs. Platforms might use Helm values or Kustomize overlays or separate manifest repos per environment. Keeping config in version control (and sometimes generating it via templates) is crucial.
  • Orchestration: CD pipelines can be pipeline-driven (Jenkins, Tekton, etc.) or event-driven (GitOps controllers reacting to git commits). Platform engineers need to decide what orchestrates environment promotions. For example, a Jenkins pipeline might, after a successful test run, automatically update a GitOps config repo's prod folder with the new image tag, which triggers Argo CD to deploy it – thus marrying a traditional CI pipeline with GitOps for the last mile.
  • Feedback loops: Ensure CD pipeline results (deployment success, test outcomes, any issues) are reported back and visible to dev teams. Also implement automated rollback if health checks fail (many CD tools and GitOps controllers can do this, or one can rely on Kubernetes features like Deployments with rollback on failure using --atomic Helm upgrades as mentioned).

CI and CD are often visualized as a unified pipeline, but conceptually CI stops at "artifact ready" and CD continues with "artifact deployed in all target environments". A platform engineer should enable both parts: robust CI pipelines and reliable CD automation. This may involve different tooling, but a seamless integration between them is the goal.

Incident Response in Platform Engineering

No system is perfect – deployments will fail, outages will happen. Platform engineers are responsible not just for enabling delivery but also for providing mechanisms to quickly respond to incidents and mitigate impact. Key practices include:

  • Monitoring and Alerting: As part of observability, ensure there are alerts on critical symptoms (high error rates, saturation of resources, etc.). These alerts might feed into an on-call rotation for the platform team and/or the app teams, depending on ownership models. Clear runbooks should exist for common alerts.

  • Automated Rollbacks: The platform should support quickly undoing a bad deployment. As discussed, Helm and Kubernetes Deployment objects support rollbacks. For instance, if a new version causes widespread errors, having an automated rollback (either triggered by a health check failing or manually by an on-call engineer issuing helm rollback or reverting a Git commit) is crucial. Fast rollback capability reduces MTTR (Mean Time to Recovery).

  • Canary and Blue-Green Deployments: To reduce the blast radius of issues, the platform might enable canary releases (deploying to a small percentage of users or a subset of servers first) and automated analysis (using tools like Argo Rollouts or Flagger). If metrics look bad, the canary is aborted. Blue-green deployment keeps the old version live until the new one is confirmed healthy, making switchback easy if needed.

  • Feature Flags: Integrating feature flags into the platform allows turning off a problematic new feature without a full rollback of deployment. Platform engineering might provide a feature flag service that developers can use to decouple feature rollout from code deploy.

  • Incident Management Process: Platform teams often coordinate incident calls and work with dev teams during outages. The platform should provide diagnostic tools: e.g., the ability to snapshot logs from all pods of a service, or quickly exec into containers if needed (with proper authorization), or increase logging levels on the fly. The more self-service and fast these troubleshooting actions are, the quicker the incident can be resolved.

  • Post-Incident Learnings: Ensuring that each incident triggers a post-mortem analysis (without blame) to identify how to prevent or detect earlier the issue in the future. For example, if a bad config caused an outage, platform engineers might implement a new policy to catch that config error next time (a governance improvement), or add a new alert that would have caught the symptom sooner. Over time, this leads to improved resilience.

Platform engineers should also educate and equip application teams with the ability to debug their apps on the platform. This might mean providing them access to certain read-only metrics dashboards or on-demand log retrieval, etc. During a live incident, good communication channels (like a dedicated Slack or incident command center) matter – often the platform team is in the middle facilitating, since they have knowledge of both the infrastructure and (to some extent) the apps.

In summary, a robust platform is not just about delivering changes quickly, but also about safely – minimizing the impact of bad changes and recovering swiftly. Technologies like Helm (with versioned releases), GitOps (with quick Git revert to previous known-good state), and Kubernetes' self-healing (e.g., if a node fails, pods reschedule elsewhere automatically) all contribute to strong incident response when leveraged fully.

GitOps Basics and Workflows

GitOps is a paradigm that has gained popularity for both infrastructure and application management. Its principles are: declarative configuration and automatic reconciliation of state, with Git as the single source of truth and an audit log of changes. Here's how GitOps typically works in a platform engineering context:

  • Configuration in Git: All desired state (Kubernetes manifests, Helm chart values, etc.) is stored in Git repositories. Often there is a repo (or multiple) dedicated to environment configurations. For example, you might have an "infrastructure" repo for cluster baseline manifests (CRDs, RBAC, etc.) and an "applications" repo that has directories for each environment (dev, staging, prod) containing app deployment manifests or Kustomize overlays.

  • Automated Sync Operators: A GitOps tool (like Argo CD or Flux) runs in the cluster. It periodically (or via webhook) checks the Git repo for changes. If it finds that Git's desired state differs from the cluster's current state, it will apply the changes to the cluster (create/update resources) to reconcile them. Argo CD, for example, can track multiple "Applications" (each mapping to a folder or Helm chart in Git) and will continuously ensure the cluster matches what's in Git. This is the pull-based deployment model, as opposed to traditional push (where a CI pipeline would kubectl apply changes). The advantage is that the cluster is self-healing to the Git state – if someone manually deletes a resource, Argo will notice it diverged from Git and recreate it, thus also discouraging manual drifting changes.

  • Separation of Code and Config Pipelines: Developers might still use CI to build new app versions and then update the Git config (e.g., bump the image tag in a Kubernetes Deployment manifest or Helm values file). This commit is what triggers the deployment (through the GitOps operator). This separates concerns: CI handles building artifacts, GitOps handles applying them. It also naturally provides rollback (as mentioned, git revert to previous commit will trigger the operator to revert the deployment).

  • Operational Models: In GitOps, there are different structures like mono-repo vs multi-repo. Some orgs keep each environment in a branch of one repo (e.g., main for prod, dev branch for dev env). Others use separate repos per env or per team. What matters is clarity and access control – e.g., developers might only be allowed to merge to the dev config, whereas ops team handles prod merges (if a separation of duties is needed). Tools like Argo CD support managing multiple target clusters from one repo as well.

Workflow example: A developer wants to deploy version 2.0 of service X to staging. They update the staging/apps/service-x.yaml in the GitOps repo, changing the image tag from 1.5 to 2.0. They create a pull request, it gets approved and merged. Argo CD (which is pointed at the staging folder) detects the commit, and applies the new Deployment spec to the cluster, pulling the 2.0 image. If something goes wrong, the developer can revert the commit (or adjust config) and Argo will sync that change.
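
For a workflow like this, the Argo CD Application tracking the staging folder might look roughly as follows (repository URL, paths, and namespaces are illustrative):

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: staging-apps
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/gitops-config.git   # hypothetical repo
        targetRevision: main
        path: staging/apps              # folder watched for changes
      destination:
        server: https://kubernetes.default.svc
        namespace: staging
      syncPolicy:
        automated:
          prune: true                   # remove resources deleted from Git
          selfHeal: true                # revert out-of-band manual changes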

One must ensure atomic, consistent changes – when multiple manifests need updating together, they should be committed together so that, for example, a new Deployment and its new ConfigMap arrive together. GitOps operators usually apply entire commits in one sync operation to maintain consistency.

GitOps offers many benefits aligned with platform engineering goals: improved auditability (every change is a git commit linked to an author), easier compliance (you can point auditors to repos rather than collecting ad-hoc change records), and alignment with the declarative approach (it's essentially the same idea as Kubernetes – desired state in Git, actual state in cluster). It also can simplify disaster recovery: in theory, if you lose a cluster, you can rebuild a new one and point ArgoCD/Flux to the repo and it will reconstruct all resources.

Platform engineers should know GitOps inside out, as it's a key method for Continuous Delivery on Kubernetes and for infrastructure management (some even manage cloud resources outside K8s via GitOps using tools like Crossplane or Terraform Cloud integrations – storing desired infra state in Git and relying on controllers or pipelines to apply it).

GitOps for Application Environments

Applying GitOps specifically to application delivery means structuring your environment configs and repository in a way that supports promotion and multi-environment deployments. Some considerations for application GitOps:

  • Environment-specific Overlays: Applications often differ slightly between dev/staging/prod (like using a different database endpoint, or lower replica counts in dev). Using tools like Kustomize overlays or Helm with separate values.yaml per environment can manage these variations while reusing the base manifests. For instance, a base Deployment might set replicas as a variable, and each environment's overlay sets its number (dev=1, prod=5). The Git repo could have a base directory and then overlays for each env. Argo CD can be configured with Applications that reference the appropriate path and Kustomize overlay for each environment. A minimal Kustomize sketch follows this list.

  • Promotion via Git: One model is to promote by git merging. For example, after testing in staging, a PR is opened to merge the staging config into prod config (or simply to replicate the version bump in the prod folder). This merge is essentially the promotion approval. Some organizations script this or even use GitOps pipelines where human approval triggers a bot to merge to the prod branch, etc. The advantage is the audit trail of exactly when and who promoted.

  • Helm and GitOps: Helm charts can be integrated with GitOps as well – instead of raw YAML, the Git repo might contain a HelmRelease custom resource (if using the Flux Helm Operator) or an Argo CD Application that pulls a Helm chart from a registry. In this case, the Git repo holds the values files and the desired chart version, and the GitOps operator ensures the correct helm upgrade --install happens in the cluster. Helm can also be used client-side by the platform team to templatize config that is checked into git, but often it's either GitOps with raw manifests (straight YAML or Kustomize) or a combination with the Helm Operator. In any case, understanding Helm commands helps when debugging or doing manual interventions. For instance, if Argo CD reports a sync failure for a Helm chart app, an engineer might run helm template locally to see the rendered output or use helm upgrade --install manually in a pinch.

  • Secrets and GitOps: One tricky part is storing secrets – obviously not in plain text in Git. Solutions include using GitOps with Sealed Secrets (where the encrypted secret is in Git and only the cluster can decrypt it) or using external secret management and having only references in Git. Platform engineers should provide a strategy so that GitOps doesn't become a security risk (e.g., accidentally committing a secret).

  • Drift detection: GitOps not only deploys, but also notices drift. If someone manually scaled a deployment or changed an image via kubectl, the GitOps tool will typically detect that the cluster is "out of sync" with git (since the running replica count doesn't match the desired in git) and it can be configured either to alert or to auto-revert the change to match git. This is powerful for enforcing consistency. Some choose to allow manual changes in dev clusters by not auto-syncing, but in prod auto-sync (with a strict fast revert of any out-of-band change) can ensure compliance with auditable config.
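
For the environment-specific overlay pattern described in the first item above, a minimal Kustomize sketch might look like this (directory layout and names are illustrative):

    # base/kustomization.yaml -- shared manifests reused by every environment
    resources:
      - deployment.yaml
      - service.yaml

    # overlays/prod/kustomization.yaml -- prod-only differences layered on top of base
    resources:
      - ../../base
    replicas:
      - name: example-service           # Deployment defined in the base
        count: 5                        # the dev overlay might set this to 1

An Argo CD Application (or Flux Kustomization) per environment then points at the matching overlay directory.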

In platform engineering, GitOps for apps means developers largely don't directly interact with Kubernetes; they work with Git. This can massively simplify developer experience if done right, because the config manifests become the interface, and they can be code-reviewed, reused, and diffed. The platform team, on the other hand, needs to maintain the GitOps operator and config repo structure, ensuring it's reliable and scales (e.g., Argo CD handling hundreds of apps). They also need to handle bootstrapping: bringing up a new cluster with GitOps, and managing cluster-specific configs (sometimes one repo per cluster vs one central repo that can target many clusters).

Helm in Continuous Delivery (Upgrades and Rollbacks)

As noted earlier, Helm often plays a role in continuous delivery for packaging and versioning Kubernetes applications. In a GitOps scenario, helm charts might be applied by the operator, but in many CI/CD setups (especially those not fully GitOps), CI pipelines use Helm to push out releases. Let's dig a bit deeper into Helm commands in the context of CD:

  • helm upgrade --install: This command is fundamental for CD pipelines. It allows a script to be idempotent – you don't have to check "does this release exist?". If it doesn't exist, --install performs an install; if it does, it performs an upgrade. For example, a pipeline might run: helm upgrade --install myapp ./charts/myapp -n production --set image.tag=abc123. This will either install the app (on first deploy) or update it to use the new image tag. As a best practice, pipelines include the --wait flag to wait for Kubernetes resources to finish deploying, and often --atomic so that if any part of the release fails (maybe a Deployment didn't become ready in time), Helm will automatically roll back to the previous release. Using helm upgrade --install ensures repeatable deployments that can be triggered regardless of state, simplifying automation.

  • helm rollback [release] [revision]: In an incident or failure scenario, this command is a lifesaver. Each successful Helm upgrade bumps a revision number (you can see them with helm history releaseName). If revision 5 is bad, helm rollback releaseName 4 will instruct Helm to apply the manifests from revision 4 (which correspond to the previous known-good state). This is faster and less error-prone than redeploying manually. In a continuous delivery context, you might automate this – e.g., with a simple script, or via Argo CD's rollback in the UI. But knowing the command line is useful for manual intervention. Note that a rollback itself creates a new revision (revision 6 in this example, whose spec matches revision 4).

  • Helm release management: Continuous delivery with Helm also means managing your charts (version controlling them, maybe storing them in a chart repository, etc.). Some platform teams prefer using Helm just as a templating engine (render templates and apply via GitOps), others use full Helm releases in clusters (where Helm itself stores release info in the cluster). If using the latter, back it up regularly, or at least be aware that Helm's release history is stored as Secrets in the release's namespace. Losing it might mean you can't easily roll back (though you could always redeploy from git).

  • Automating Helm via GitOps: If using Argo CD, note it can directly pull Helm charts and do upgrades (it acts somewhat like an automated helm upgrade when it detects a new chart version). In such cases, you might not manually run helm at all for CD, but understanding how Helm works is still important to debug issues (like if a value is not applied as expected, or if a helm hook is failing).

In essence, Helm bridges the gap between raw manifests and full applications by packaging all needed Kubernetes objects and allowing parameterization. This makes it easier to promote the same app to different environments by just changing values. Platform engineers often maintain a set of Helm charts for common components (e.g., a standardized chart for microservices at the company). Developers then reuse these charts, only overriding dev-specific values. The platform team ensures the chart implements best practices (ingress, service, deployment, config all in one). In CD, upgrading the chart version can propagate improvements to all apps using it.

Finally, always document your platform's deployment process (whether it's "commit to this repo and Argo deploys" or "run this helm command or pipeline"). Developers deploying on the platform should have clear instructions, with the platform providing as much automation as possible (e.g., a CLI command or bot that abstracts even the helm commands, etc., depending on the UX you aim for).


Platform APIs and Provisioning Infrastructure (12%)

Overview: This domain addresses how the platform exposes APIs for developers or automated systems to consume, and how infrastructure is provisioned and managed (often via automation and IaC). It includes understanding Kubernetes' extension mechanisms (like the operator pattern and custom resources) and how to provide infrastructure (clusters, cloud resources, services) as on-demand to developers. We also discuss OpenTofu (the open-source Terraform fork) here as it relates to Infrastructure as Code.

Kubernetes Reconciliation Loop (Controllers & Operators)

We introduced Kubernetes' reconciliation concept earlier; here we dive deeper, especially into custom controllers (Operators) which are key to creating Platform APIs. Kubernetes is often called a platform for building platforms – because you can extend its API with your own types and logic.

Controllers and the reconciliation loop: A controller is essentially a loop that watches for changes to certain resources and takes action to drive the current state toward the desired state. For built-in objects, for example, the Deployment controller sees a new Deployment with replicas=3 and creates a ReplicaSet (which in turn creates Pods) until 3 pods are running. This reconcile action happens continuously. Controllers typically use exponential backoff and only act when needed, so the loop is efficient.

Custom Resource Definitions (CRDs): These allow you to define new resource types (kind) in the Kubernetes API (for example, a kind PostgresDatabase or Team or FeatureFlag). Once a CRD is applied, users can create custom resources of that type, which the Kubernetes API will store, but by itself a CRD does nothing – you need a custom controller to give it meaning. Together, CRD + Controller = Operator (Operator is just a term, originating from CoreOS, for a controller that encodes operational knowledge for a specific domain, often managing an application like a database).

Operators as Platform APIs: Platform engineering can use CRDs to create higher-level abstractions for developers. For instance, instead of exposing raw Kubernetes concepts, you might let developers submit a Microservice custom resource with fields like name, repo, version, scale, etc. Then a custom operator will translate that into the necessary Deployment, Service, Ingress, etc., behind the scenes. This becomes a Platform API – developers express intent in a simple, company-specific resource, and the platform's controllers handle the heavy lifting. Another example: a Database CRD where a developer requests kind: Database, spec: { type: postgres, version: 13 } and an operator on the platform will provision a cloud database or spin up a container one, plus perhaps create a Kubernetes Secret with credentials for the app to use. This is powerful: it offers self-service through Kubernetes API but abstracted.
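
A hypothetical custom resource of this kind might look like the following; the API group, kind, and fields below are invented for illustration and do not correspond to a specific real operator:

    apiVersion: platform.example.com/v1alpha1   # hypothetical API group served by a platform operator
    kind: Database
    metadata:
      name: orders-db
      namespace: team-orders
    spec:
      type: postgres
      version: "13"
      size: small                               # abstracted sizing instead of raw instance types

The platform's controller reconciles this object into an actual database (cloud-managed or in-cluster) and typically writes the connection credentials into a Secret for the application to consume.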

Kubernetes' design encourages such extensions – the control plane doesn't care whether a resource is built-in or custom, and controllers can run external to the main API server (usually as Deployments in the cluster). The reconciliation pattern remains the same: the operator sees the desired state (the user wants a DB), checks the current state (no DB exists, or the DB is outdated), then performs steps to converge (create a DB instance via a cloud API or in-cluster, update the status in the resource, etc.).

Understanding how to write or at least configure operators is a key skill for platform engineers. Many open-source operators exist (for monitoring systems, databases, etc.) which you can leverage instead of writing from scratch. Tools like Operator Framework (Operator SDK) or Kubebuilder help in writing custom operators if needed.

Example: The Kubernetes Horizontal Pod Autoscaler (HPA) is a built-in controller – essentially an operator for scaling. It watches the metrics produced by a target workload (such as a Deployment) and adjusts the replica count to match a target metric (like CPU utilization). Similarly, the cert-manager operator manages Certificate custom resources by requesting certificates from issuers such as Let's Encrypt. Platform engineers might install such operators to provide functionality out of the box (e.g., a developer creates a Certificate object and the operator obtains a TLS cert for them).
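
A standard HPA manifest targeting CPU utilization looks like this (the Deployment name is a placeholder):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU exceeds 70%
```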

In summary, the reconciliation loop concept underlies not just how Kubernetes works out-of-the-box, but how you extend it. The cluster itself can become a platform that offers a catalog of APIs (Custom Resources) that developers use to accomplish tasks declaratively. This ties into the next competency on self-service APIs.

APIs for Self-Service Platforms (CRDs and Service Catalogs)

Platform APIs refer to the interfaces (often programmatic, like REST/HTTP or Kubernetes APIs or CLI/SDK) that the internal platform exposes for consumers (developers, QA, maybe other automated systems). A good platform provides clear APIs or tools for common needs: provisioning environments, requesting databases, setting up CI pipelines, etc.

In Kubernetes-centric platforms, as described above, custom resources themselves become APIs. Developers use kubectl (or higher-level CLI wrappers) to create these objects. For example, instead of filing a ticket for a database, a dev might run kubectl apply -f mydb.yaml to create a custom resource of a type the platform team supports. This is a declarative API and has the benefit of integration with Kubernetes RBAC, audit logs, etc.

Another approach is through a Service Catalog/Portal. For instance, CNCF's Backstage (originally created by Spotify) is a popular developer portal. Backstage provides a UI for developers to discover and provision resources (like creating a new microservice via a template, or requesting access to a SaaS service). Backstage does not provision anything itself; it integrates with underlying APIs and acts as a convenient front end. It features a Software Catalog that tracks all the organization's services and their metadata (owners, URLs, etc.), and Software Templates which automate the creation of new components.

Many platform teams implement Backstage and extend it with plugins – for example, a plugin that triggers a pipeline or calls a Kubernetes operator when a dev clicks "Provision Kafka Topic".

Other platforms might offer direct REST APIs or CLI tools. For example, an internal CLI (companyctl create environment --team X) could behind the scenes call Terraform or Kubernetes to spin up resources. The key is: self-service means devs aren't dependent on manual ops tasks; they interact with an API or interface that triggers automated provisioning.

Service Catalog (Kubernetes Service Catalog): There was an initiative to integrate service brokers via the Kubernetes Service Catalog (using the Open Service Broker (OSB) API). It allowed CRDs like ServiceInstance that a broker (such as a cloud provider's service broker) would act on to provision something (like a cloud DB instance). The concept is similar to operators, but follows a broker pattern. It's less talked about now – the Service Catalog project saw limited adoption as operators and direct APIs took the spotlight.

Examples of platform APIs:

  • Provisioning infrastructure: e.g., an API to request a new Kubernetes namespace or a whole new Kubernetes cluster for a team (some platforms allow self-service cluster creation, perhaps in a multi-tenant control plane setup).
  • CI/CD pipeline as a service: a config file in the git repository (like a Jenkinsfile or GitHub Actions workflow) is one form, but some internal platforms let you request a pipeline via an API and have the webhooks and other wiring set up automatically, or offer a UI where you toggle which steps you need.
  • Application configuration: e.g., a "feature flag service" with APIs to set flags, or an "A/B testing API".
  • Cost and quota APIs: devs might query how much resources their services are using via an API the platform provides.

A critical aspect is discoverability – which is where internal developer portals shine. They list what capabilities are available and how to use them (documentation). So, platform APIs don't necessarily mean a raw HTTP API (though it could); it encompasses any interface that developers use to interact with the platform programmatically.

From the perspective of the CPNA exam, understanding that CRDs and Kubernetes extension points are one powerful way to implement platform APIs (the platform itself becomes extensible) is likely key. It ties back to IDP concepts (discussed later in Developer Experience). In summary, platform APIs allow internal platform capabilities to be consumed in a self-serve, programmatic way, reducing friction and accelerating development.

Infrastructure Provisioning with Kubernetes (and Beyond)

This competency is about how infrastructure (compute instances, clusters, networks, etc.) is provisioned. There are two major paradigms:

  • Using Kubernetes to provision infrastructure (a.k.a. Kubernetes control-plane-driven provisioning).
  • Using Infrastructure-as-Code (IaC) tools externally (like Terraform, Pulumi, etc.).

Kubernetes-driven provisioning: This involves operators or controllers that manage not just Kubernetes resources, but external infrastructure. A prime example is Crossplane (a CNCF project), which extends Kubernetes with CRDs representing cloud resources (databases, VPCs, etc.). With Crossplane, a developer could create a custom resource like kind: RDSInstance in the cluster, and the Crossplane provider will call AWS to actually provision an RDS database, then reflect the status back to the resource. This effectively turns Kubernetes into a universal control plane – you can manage both cluster-internal and cloud-external resources with one API and a unified GitOps flow. It's a powerful model for platform engineering because you can coordinate application deployment and infrastructure provisioning via the same pipeline. For example, create an AWSBucket resource and a Deployment that uses it, and an operator ensures the bucket exists before the app starts. Crossplane and similar operator-based IaC use the Kubernetes reconciliation loop for continuous drift detection on infrastructure (if someone changes the AWS resource outside of Crossplane, Crossplane can reconcile it back, depending on configuration).
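
A sketch of such a managed resource is shown below; the exact apiVersion, kind, and field names depend on which Crossplane provider (and version) is installed, so treat this as illustrative rather than copy-paste ready:

```yaml
apiVersion: database.aws.crossplane.io/v1beta1   # provider-specific; check your installed provider's CRDs
kind: RDSInstance
metadata:
  name: orders-db
spec:
  forProvider:
    region: eu-west-1
    engine: postgres
    engineVersion: "13"
    dbInstanceClass: db.t3.small
    allocatedStorage: 20
  writeConnectionSecretToRef:     # the controller writes host/credentials here for apps to consume
    name: orders-db-conn
    namespace: team-orders
```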

Infrastructure as Code (IaC) tools: The traditional approach to provisioning infra (outside of K8s) is using tools like Terraform, which is declarative and cloud-agnostic. Terraform configurations (HCL files) describe desired state of infrastructure. The Terraform engine calculates a plan and applies changes via providers (AWS, Azure, etc.). Platform engineers often maintain Terraform modules or scripts to provision foundational infrastructure (networks, Kubernetes clusters themselves, CI servers). Terraform can be wrapped in pipelines or service catalogs for dev teams when they need infra outside what Kubernetes can easily do. However, Terraform historically is not continuous – you run it on demand (though you could run in a pipeline regularly). That's where something like Crossplane aims to provide continuous reconciliation on infra, bringing K8s-style operations to it.

OpenTofu (Terraform's Open-Source Fork): HashiCorp's Terraform moved to a business source license in 2023, which spurred the community to create a truly open fork called OpenTofu. OpenTofu is basically Terraform with an open governance under the Linux Foundation, maintaining compatibility with Terraform's language and providers. For the purposes of CPNA, one should recognize that OpenTofu is an Infrastructure-as-Code tool allowing you to define and provision cloud resources using declarative configuration and workflows akin to Terraform. It uses the HashiCorp Configuration Language (HCL) and supports all the same providers to manage multi-cloud or on-prem infra in a unified way. The main difference is licensing and community governance – OpenTofu remains fully open source (MPL 2.0), community-driven, while Terraform (upstream) has usage restrictions under the BSL. From a platform engineering perspective, OpenTofu/Terraform knowledge is important for provisioning base infrastructure (like networks, clusters, DBaaS instances) that your platform runs on or integrates with. It can also be offered to developers as a service (with sane defaults and modules) for certain tasks.

Selecting the right tool: There's often a question – do we manage infra via K8s (operators) or via IaC pipelines (Terraform)? Each has pros/cons:

  • Kubernetes operators (like Crossplane) enable a GitOps, unified approach, and can react in real-time to drift or changes. They also expose infra as CRDs which can be easier to compose with apps. However, not all resources might have stable operators available, and giving developers CRDs to create, say, an AWS VPC might be too low-level or too much responsibility.
  • Terraform/OpenTofu is very mature with a huge ecosystem of modules. Many platform teams use Terraform to provision Kubernetes clusters themselves, network plumbing, and other foundational pieces. They might then use Crossplane inside clusters for developer-facing infra (like databases) so that devs can self-service those via K8s while keeping more global infra in Terraform managed by the platform team.

In any case, familiarity with IaC concepts is expected: state management (Terraform keeps state of deployed resources, whereas K8s operators treat the actual cloud resource as state), planning vs. applying changes, and handling of dependencies. Also, API-driven provisioning can involve using cloud provider APIs/CLIs directly or through IaC. For instance, an internal tool might directly call AWS SDK to create resources for speed, but that sacrifices the declarative approach unless it's carefully tracked.

As platform engineering evolves, some organizations implement a layer on top of IaC to expose simplified APIs. For example, a self-service portal where a dev requests a "Postgres DB" and, behind the scenes, the platform triggers a Terraform module to create an RDS instance and returns the credentials.

Concluding on provisioning: Platform Engineers should be comfortable automating infrastructure lifecycle. The exam likely expects knowing that OpenTofu/Terraform provides declarative infra provisioning with code (you write desired state files and these tools apply them), whereas Kubernetes CRDs + controllers can do similar by extending cluster capabilities. Increasingly, platforms treat infrastructure as just another layer of the stack to be orchestrated – some call this concept "Infrastructure as Software", implying that your control plane (be it K8s-based or other) continuously drives the infra toward what the platform needs, rather than humans running CLI commands ad-hoc.

Kubernetes Operator Pattern for Integration

We've touched on operators already, but here the focus is on using the operator pattern to integrate and manage complex systems on the platform. The Operator Pattern essentially means encoding the operational knowledge of managing a service (like deploying, scaling, backups, self-healing) into a controller so that Kubernetes users can declare a desired state and the operator handles the rest.

Why Operators: Certain applications, especially stateful ones, require domain-specific logic to operate. For instance, a MySQL cluster needs to initialize replication, handle failover, etc. Instead of manually running those steps or writing ad-hoc scripts, an operator automates them. It watches a CRD like MySQLCluster and when a new cluster is declared, it creates StatefulSets, configures replication, and maybe even schedules backups by creating CronJobs, etc., all as coded logic. If a node running a DB pod fails, the operator might detect the failed primary and promote a replica. Without an operator, an SRE would have to do these steps at 3am.

Integration via Operators: Platform teams use operators to provide "automated SRE" for certain components. Examples:

  • Database operators (e.g., for PostgreSQL – Crunchy Data's Postgres Operator, or MongoDB's Enterprise Operator) to let teams run databases on the platform with less worry.
  • CI/CD operators (e.g., Tekton is an operator-based CI system where you declare Pipeline CRDs).
  • Integrating external systems: Operators can even represent external SaaS – e.g., create a PagerDutyAlert CRD and an operator calls PagerDuty API to create a service and route, etc. This way you manage external config in Kubernetes.
  • Custom business logic: If your platform needs to integrate with internal tools (like an internal auditing system whenever a new app is created), an operator could watch Microservice CRDs and call external APIs to register that service in various systems.

The operator pattern is essentially an extension mechanism, so its use in platform engineering is broad. A crucial part is designing CRDs that are user-friendly and capture the right level of abstraction. Also, because operators run with quite a bit of power (often cluster-admin to manage resources, or cloud creds to manage cloud resources), security and reliability of operators is important (they are part of your control plane).

Building vs Buying: There are many off-the-shelf operators (cert-manager, ArgoCD itself has an Application controller, etc.), and many cloud vendors provide operators to manage their services from K8s. Platform engineers should know how to evaluate these and how to install and configure them. In cases where no suitable operator exists, platform teams might write their own (using Operator SDK, etc.). Writing an operator involves writing code (usually Go) to handle events for the CRD – not trivial, but powerful.

Lifecycle: Operators themselves often need upgrades; a platform should have a strategy to update operators with minimal impact (maybe using ArgoCD to sync them or Helm charts to install them, etc.).

This competency likely expects that you understand what the operator pattern is and why it's used – "to extend Kubernetes with custom controllers that automate management of complex applications or external integrations." It ties together much of what's discussed: CRDs (the API), controllers (the automation), reconciliation (the mechanism), and self-service (the outcome for users).

To summarize: The operator pattern enables the platform to integrate complex systems and provide them as a service. Instead of manual processes or snowflake scripts, everything becomes a declarative object and a controller loop – aligning with the overall cloud-native philosophy of desired state and continuous reconciliation. Mastering this pattern allows platform engineers to turn common operational tasks into reliable automated services on their platform.


IDPs and Developer Experience (8%)

Overview: This domain centers on Internal Developer Platforms (IDPs) and the developer experience (DevEx) they provide. An IDP is the sum of tooling, workflows, and interfaces that make it easy for developers to build, deploy, and maintain applications. The focus here is on how to simplify access to platform capabilities, utilize developer portals and service catalogs, and even emerging trends like AI/ML assistance in platform automation. Critically, it also covers measuring developer experience (like how to gauge success of an IDP), including the HEART framework.

Simplified Access to Platform Capabilities

The hallmark of a good IDP is that it abstracts complexity and presents capabilities in an accessible way. Platform capabilities can include things like: creating a new microservice repository with CI/CD set up, deploying to a staging environment, viewing logs and metrics, requesting resources (databases, caches), configuring alerts, etc. Simplifying access means developers do not need to know the low-level details of Kubernetes, AWS, Jenkins, etc., to use these capabilities.

Approaches to simplification:

  • Self-Service UI/Portal: A web interface (like Backstage) where devs can click through common tasks. For example, a "Create Service" wizard that asks a few questions (service name, language, maybe pick from templates) and then automatically scaffolds the service (perhaps using a cookiecutter template or Terraform to set up repo and pipelines). Or a "Deploy App" page where they can select a version and environment rather than dealing with kubectl directly.
  • Command-Line Interface (CLI): Some companies provide an internal CLI tool that wraps around multiple operations. E.g., a platformctl or devctl command that under the hood might call kubectl, or GitOps, or other APIs, but for the dev it's one tool with subcommands like platformctl logs my-service to fetch logs, or platformctl deploy my-service --version 2.0. The CLI can incorporate multiple steps (build, push, update manifest) behind one command.
  • Templates & Generators: Provide template repositories or Helm charts for common patterns. This way a developer doesn't start from scratch – they use npm init @company/service or a Yeoman generator or Backstage software template to generate a new service with company standards. This ensures best practices are built-in (observability, Dockerfile, security scans, etc.).
  • Documentation and Training integrated: A key part of access is knowledge. A platform should have concise docs or even in-line help. Some portals integrate documentation so that when viewing a service, you see how to, say, add a new environment variable or how to access certain resources.

Essentially, an IDP acts as a facade to underlying infrastructure – exposing only what is necessary to developers in a cohesive way. By simplifying access, platform engineering aims to reduce the cognitive load on developers (they don't need to be Kubernetes experts to deploy an app, for instance).

API-Driven Service Catalogs

A service catalog is a curated list of services and components available to developers, often with the ability to provision or subscribe to them. Backstage is a prime example of an open-source framework for building such a catalog. It maintains an inventory of all software components (microservices, libraries, data pipelines, etc.), their owners, links to docs, runbooks, etc., making it easier for developers to discover existing capabilities and avoid reinventing wheels.

API-driven implies that the catalog isn't just a static list, but is interactive via APIs. For example, Backstage has plugins that let a developer click "create new service" to invoke the scaffolding API, or "request access to X database" to call an underlying API that starts the process. In another sense, API-driven can mean the catalog itself exposes APIs (GraphQL/REST) so that other tools or CLIs can query it (for example, to list all services owned by team A, or to get the endpoints for service B).

Benefits of a service catalog:

  • Discoverability: New team members or cross-team collaborations can easily find what services exist, who owns them, what they do.
  • Standard Metadata: The catalog can enforce that each service has certain info (owner team, documentation link, compliance tags, etc.), which helps with governance and tracking.
  • One-stop-shop: Ideally, a dev can go to the catalog not just to read about services but to perform actions related to them (deploy, deprecate, update dependencies, etc., depending on integration).
  • Integration of tools: Many catalogs like Backstage support plugins to integrate CI status, monitoring dashboards, incident management (so a dev can see recent build status and current alerts for their service on one page).

From a platform engineering perspective, building a service catalog often involves aggregating data from various sources (git repositories, Kubernetes clusters, monitoring tools, CMDBs). The platform likely provides an API or configuration file for each service that the catalog ingests (Backstage, for instance, uses YAML config files in repos to register components).
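
For example, Backstage registers a component from a catalog-info.yaml file committed to the service's repository (the names and annotation values below are placeholders):

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: orders-api
  description: Order management microservice
  annotations:
    github.com/project-slug: acme/orders-api   # used by source-control plugins to link the repo
  tags:
    - java
spec:
  type: service
  lifecycle: production
  owner: team-orders
```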

The API-driven nature also means that everything in the catalog can potentially be manipulated via API calls – which could enable automation, like automatically registering a new service in the catalog when its repo is created (by calling the catalog's API).

In summary, service catalogs (and portals) are key to driving adoption of the platform – they present the platform's offerings in a developer-centric way and act as the central registry and interface for developer-platform interactions.

Developer Portals for Platform Adoption

Developer portals (like Backstage, or an internal web UI built in-house) are essentially the front-end of the platform. Their goal is to increase platform adoption by offering a great user experience. A portal can be thought of as the "app store" of internal services and tools.

Important aspects of a developer portal:

  • Onboarding: It should help new developers quickly get started – e.g., a tutorial or a guided flow to deploy their first "Hello World" using the platform. Many portals will have an onboarding checklist, maybe automating setup of access, sandbox environments, etc.
  • Centralized Access: The portal might integrate authentication (SSO) and, based on user identity, show the relevant info (like the services you own, the pipelines you have access to, etc.). It becomes the single pane of glass for developers to interact with various subsystems – CI, CD, monitoring, even support tickets.
  • Encouraging Best Practices: By funneling activities through the portal, the platform team can subtly enforce best practices. For example, if creating a new service through the portal's wizard, you ensure it has logging enabled, proper config, etc. If doing manually, devs might skip those. Similarly, the portal can have checks or scorecards (some companies do "service maturity" evaluations that show on the portal dashboard).
  • Community and Collaboration: Some portals include internal blogs, wikis, engineering docs, or discussion forums. While not strictly part of the platform, having those resources linked improves knowledge sharing on using the platform effectively.
  • AI/ML in Portals: An emerging trend is incorporating AI assistants in developer portals. For example, a chatbot that can answer "How do I increase memory for my service in staging?" by looking at docs or config. Or more advanced, analyzing patterns to suggest improvements (e.g., "Team X's service is consistently under-utilizing its pods, consider reducing replicas – click here to apply"). The portal could leverage ML to recommend actions or detect anomalies in usage.

A portal greatly helps platform adoption because it reduces the learning curve. If a platform were only exposed via raw CLI and configs, some devs might struggle or stick to old ways. A polished portal invites them in. It's also a marketing tool internally – showcasing features and reliability of the platform. Some organizations treat their internal portal like a product with release notes, feature announcements, etc., which fosters engagement.

From a CPNA perspective, knowing Backstage is a prime example is useful, but the key is understanding why a portal matters: it improves developer experience (DX).

Backstage's stated mission is to be the "UX layer" over infrastructure – giving developers the best possible experience without requiring them to be experts in every underlying tool (its creators compare it to Kubernetes for DX: a standard toolbox for the user-experience layer of infrastructure). This encapsulates the rationale for portals.

AI/ML in Platform Automation

AI and Machine Learning are increasingly being leveraged to enhance platform engineering:

  • Intelligent Recommendations: Use AI to analyze metrics and logs to suggest optimizations. E.g., an ML model could predict that a service's load requires scaling up before it actually hits an alert, or suggest removing unused resources to cut cost.
  • ChatOps and AI Assistants: An AI-powered chatbot integrated with the platform (maybe in Slack or in the portal) that can answer questions ("What's the error rate of service X?" pulling metrics) or even execute commands ("deploy version Y to staging") if given proper rights. This can simplify interactions for developers – they might literally converse with the platform.
  • Automated Triage: For incident response, ML can help filter noise or cluster related alerts, helping on-call engineers focus on root cause faster. It might also correlate events (like code changes that correlate with a performance regression).
  • Infrastructure optimization: AI could analyze usage patterns to automatically adjust cluster node counts or pod resource requests (horizontal or vertical auto-scaling beyond simple threshold rules).
  • Security anomaly detection: ML systems monitoring for unusual patterns (like a suddenly huge data egress or a process spawning a shell) can complement static security policies.

While these are advanced topics, platform engineers should be aware that AI/ML can be part of their toolset to manage complexity at scale. For instance, running experiments on ML-driven autoscaling vs HPA, or integrating a GPT-based internal assistant trained on your platform's documentation to answer developer queries quickly.

However, an IDP must incorporate AI carefully – it should not overwhelm or mislead developers. The goal remains improving DX, so AI features are usually assistive and optional.

Because the CPNA includes this topic, it likely expects an awareness of these modern trends rather than deep implementation details. It underscores that platform engineering is evolving with technology.

Measuring Developer Experience and Platform Success (HEART framework)

A core part of Developer Experience is actually knowing whether your efforts are paying off – are developers happier and more productive using the platform? The HEART framework is a methodology (originating from Google UX research) for measuring user experience across five dimensions: Happiness, Engagement, Adoption, Retention, and Task Success. Platform teams can apply HEART (originally for product UX) to internal platforms to quantify and track DevEx improvements.

Let's break down HEART in the platform context:

  • Happiness: How satisfied are developers with the platform? This is typically measured via subjective feedback – e.g., periodic developer surveys, Net Promoter Score (NPS) asking "How likely are you to recommend our internal platform to your colleagues?", or CSAT surveys on specific features. Also, qualitative feedback from interviews or focus groups falls here. Metrics could include NPS score, or number of compliments vs complaints in feedback channels. High happiness indicates developers find the platform valuable and not frustrating. Watch out for the paradox: sometimes devs say they are happy (maybe they like the idea of the platform), but other metrics might show low usage – so you need to correlate happiness with actual use.

  • Engagement: The breadth and depth of usage of platform features. This answers: are developers actively using the platform day-to-day? Metrics include daily/weekly active users of the platform's interfaces (portal, CLI), time spent or number of interactions, and usage of various features. For example, how many unique devs ran a deployment or accessed logs this week? How many support tickets or Slack questions indicate people engaging (or struggling)? Engagement also looks at how they engage – e.g., do senior devs contribute improvements (signal of deep engagement). A highly engaged user base suggests the platform is integral to workflows. Low engagement might mean devs circumvent the platform or only use it sparingly (maybe only when they have to).

  • Adoption: How many developers or teams have adopted the platform and how fully. This could be measured by the percentage of projects onboarded to the platform vs legacy paths, growth rate of new users, and the ramp-up time for new users. For example, number of new services created via the platform in the last quarter, or fraction of deploys going through the platform vs manual. Adoption is often tracked after rollouts of new features too (e.g., did 80% of teams start using the new CI pipeline?). High adoption means the platform provides something teams feel is worth investing in. If adoption stalls, platform team might need to do more advocacy or improve features.

  • Retention: Do developers stick with the platform over time or do they abandon it? This is analogous to user churn in a product. Metrics: number of "dormant" users (devs who used to use platform but haven't in X months), or count of teams that onboarded but later built their own parallel solutions. You can track login/activity over long periods: e.g., 90% of users active in Q1 are still active in Q3 (90% retention). If retention is low, find out why – maybe some teams tried it and encountered limitations (leading them to give up). Exit surveys can help: if someone stops using it, ask why.

  • Task Success: Can developers effectively complete their goals on the platform? This is about the efficiency and error rate of performing tasks (which ties to productivity). Metrics could include: deployment success rate (how often do deployments via the platform succeed without manual intervention), average time to deploy a new service (from code to prod) which should decrease with platform use, or how many support tickets are raised (a lot of tickets implies tasks are hard to complete self-service). You could also measure specific funnel steps: e.g., out of 10 new service creation attempts, how many succeeded without contacting the platform team – that would be a task success indicator. In surveys, asking devs if any repetitive tasks are still painful can highlight where task success is lacking. High task success implies the platform is truly making things easier (e.g., a deployment that used to take 1 day now takes 1 hour). Also consider toil reduction – if platform automations remove manual work, that's success. For instance, if no one manually configures CI pipelines anymore because it's auto, that task is 100% successful behind the scenes.

By tracking HEART metrics over time, platform teams get a holistic view: Happiness might be high but if Engagement is low, it's a red flag (maybe people like the concept but don't use it). Or Adoption might be high (everyone forced onto it) but Happiness low (they're unhappy being forced) – which could foretell future backlash or shadow IT solutions.

An example of applying HEART: Say the platform launched a new Developer Portal.

  • Adoption: track what % of devs log in to the portal weekly.
  • Engagement: track average number of actions per session, or how many use advanced features like template generation.
  • Task success: perhaps measure the completion rate of "create new service" flows (do users drop off mid-way? That could indicate confusion).
  • Happiness: survey those who created a service: "How was your experience? Rate 1-5 and feedback".
  • Retention: see if those who created a service come back to create another or manage existing ones later (one-off use vs ongoing use).

According to Google's research, not every project needs to measure all HEART categories – pick ones relevant to your goals. For a new platform, Adoption and Task success might be initial focus (can people onboard and get things done?). For a mature platform, Retention and Engagement become critical (are they sticking or finding workarounds?).

Beyond HEART, another widely used set of metrics for platform success especially on the productivity side are the DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, MTTR), which we discuss in the next section on Measuring Your Platform. DORA metrics gauge software delivery performance, while HEART is more user-experience oriented – but both are complementary. For instance, if your platform's goal is to improve deployment frequency and reduce lead time (DORA metrics), a negative developer experience (HEART metrics) could impede that (developers avoid using the platform or make mistakes, affecting those outcomes).

In summary, measuring developer experience is as important as measuring system performance. It provides the feedback needed to iterate on the platform. Using frameworks like HEART ensures you cover multiple facets: are developers pleased, are they using it fully, are new people adopting, do they keep using, and can they accomplish their jobs effectively? A platform engineering team should define specific KPIs under each of those and report on them to demonstrate the platform's value to the organization and to identify where to focus improvements.

(We'll delve more into metrics like DORA in the next section, which focuses on overall platform measurement including team productivity and efficiency.)


Measuring Your Platform (8%)

Overview: This final domain deals explicitly with how to measure the success and impact of the internal platform. It covers metrics related to platform efficiency, team productivity, and continuous improvement of delivery performance. Two major categories of metrics come to play: Technical throughput/stability metrics (like DORA's four key metrics) and Developer experience metrics (like HEART, discussed above). Together, these help quantify the ROI of platform engineering efforts.

Platform Efficiency and Team Productivity Metrics

To assess platform efficiency, you want to measure how the platform is affecting software delivery and operations in quantifiable terms. Key indicators include the well-known DORA metrics – often considered the gold standard for DevOps performance measurement:

  • Deployment Frequency: How often the organization successfully deploys to production (or to end-users). A high deployment frequency (e.g., on-demand, multiple times per day) suggests an efficient CI/CD pipeline and a culture of continuous delivery. If your platform streamlines CI/CD, this metric should improve (increase). You might measure this per team and overall. For example, before platform adoption, maybe deployments were weekly; after, they're daily.

  • Lead Time for Changes: The time from code committed to code running in production. Shorter lead times mean faster value delivery and ability to iterate quickly. This metric can be measured as median or average time between a commit (or merge) and that commit being deployed. A good internal platform with automated pipelines can drastically cut lead times (e.g., from days to hours or minutes). If using a VCS and CD pipeline, this can be instrumented (e.g., timestamp of commit vs deployment event).

  • Change Failure Rate: The percentage of deployments that result in a failure (incident, rollback, hotfix). Lower is better – it indicates stability and quality of releases. A robust platform with proper testing, monitoring, and safe deployment practices (like canaries) should reduce failures. You measure, e.g., out of 100 deployments, 5 caused issues -> 5% change failure rate. Combining with frequency, elite performers manage high frequency and low failure rate.

  • Mean Time to Recovery (MTTR): How long it takes to restore service when a failure occurs. Short MTTR means effective incident response and resilient systems. The platform can impact this by providing easy rollback mechanisms, good diagnostics, and possibly automated self-healing. You'd measure MTTR as the average time from incident start to resolution. Aiming for hours or minutes rather than days is the goal (DORA sometimes calls this "Time to Restore Service" or "Failed Deployment Recovery Time").

These four metrics are highly regarded because research (by DORA/Google) shows they correlate with high-performing IT organizations both in delivering value and in stability. Platform engineering should enable improvement in all four: more frequent deploys, faster pipeline, fewer failures (through quality gates and best practices baked in), and faster recovery (through tooling and automation).

In practice, you might track these metrics before and after platform initiatives. For example, after introducing a new CI workflow, did deployment frequency go up? After adding better monitoring and canary deployments, did MTTR go down? A positive shift in these metrics demonstrates the platform's value to leadership.

Beyond DORA, other efficiency metrics:

  • Resource Efficiency: How well utilized are resources (compute, etc.) on the platform? If the platform can bin-pack workloads better or auto-scale, you may reduce infrastructure cost for the same throughput. You could measure something like cost per service, or cost relative to developer productivity. But be careful to align with team goals (cost saving shouldn't compromise delivery).
  • Pipeline Efficiency: E.g., average build duration, average deployment time (if pipeline too slow, devs lose time). The platform team might set a goal "80% of builds complete under 5 minutes" etc. Short build times keep developers in flow.
  • Support Tickets/Interventions: How many manual interventions are needed? A decrease indicates the platform is more self-service and reliable. If every release used to require ops person, but now 95% of releases go through without human touch, that's improved efficiency.
  • Uptime and Reliability of Platform Components: If your IDP includes shared services (CI runners, artifact registries, etc.), measure their uptime/error rates. The platform itself must be reliable to be valuable. It's somewhat meta, but tracking "platform downtime" (if, say, pipeline system is down, how long? how often?) can be part of its efficiency.

DORA Metrics for Platform Initiatives

As noted, DORA's four key metrics are central to evaluating continuous delivery performance. When the exam mentions "DORA Metrics for Platform Initiatives," it implies that platform engineering projects should be aligned with improving these metrics. For example:

  • If you implement a new automated testing platform or parallelize tests, you expect lead time to improve (commits get to prod faster).
  • If you introduce infrastructure as code and consistent environments, maybe change failure rate drops (fewer environment-specific bugs).
  • If you implement blue-green deployments and robust monitoring, MTTR might drop (easy to switch traffic back, detect issues quickly).
  • If you create a self-service deployment portal, deployment frequency might rise (teams deploy more often because it's easier and doesn't require ops coordination).

When planning platform features, tying them to one of these outcomes is a good practice. It provides a clear success criterion.

It's also worth noting that these metrics should be stratified properly: e.g., measure per application or per team as needed and then aggregate. Outliers can skew data – median is often used for lead time to avoid a single slow change distorting average.

Some organizations add a fifth metric like Availability or Reliability (though MTTR covers a part of that). Others track Change Volume (but that's essentially deployment frequency). The four are widely accepted, so CPNA likely sticks to them.

Example scenario: Before platform, a team deploys monthly (freq ~0.03 per day), lead time is 1-2 weeks (dev complete to prod), change failure maybe 30% (because deployments are big bang, lots can go wrong), and MTTR is a day (because manual fixes). After platform adoption (with CI/CD, smaller batches, canaries), perhaps that team now deploys daily (freq ~1.0 per day), lead time 1 day or a few hours, failure rate 5%, and MTTR 1 hour. Those improvements directly translate to more business agility and less downtime.

Thus, measuring and reporting DORA metrics can justify platform investments. It's common to present them in internal reviews (like an "engineering productivity" report).

Finally, platform engineers should not chase metrics blindly; context matters (Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure"). They should ensure improvements in metrics reflect real improvements in developer productivity and user satisfaction, not just gaming the numbers. For instance, deploying a trivial config change many times a day could raise frequency but not value – so combine metrics with qualitative insight. The HEART metrics can complement DORA by ensuring the human side (developer sentiment) is positive as well.

Combining Metrics to Tell the Whole Story

By integrating the technical delivery metrics (like DORA) with developer experience metrics (like HEART), platform teams get a comprehensive dashboard:

  • DORA metrics show how the engineering system is performing (speed and stability of deliveries).
  • HEART metrics show how the developers are experiencing the platform.

If both are good (fast deliveries, happy devs), your platform is on the right track. If they diverge (fast deliveries but unhappy devs, or happy devs but slow delivery), that indicates where to investigate:

  • A fast pipeline with unhappy devs could mean the platform forces them to work in ways they don't like, or that they feel unsupported even though the metrics look fine (perhaps there is toil in areas DORA doesn't capture).
  • Happy devs but slow delivery could mean developers like the platform's features, but organizational bottlenecks remain outside the platform (e.g., manual approvals or too much process).

Remember also to measure business impact where possible: e.g., did faster deployment correlate with more features delivered or higher customer satisfaction for external users? That might be beyond CPNA scope, but in real world connecting platform metrics to business KPIs (like revenue from new features, or uptime SLA adherence) elevates the platform's value proposition.

In conclusion, Measuring Your Platform is about creating feedback loops. It's not just about proving value, but also identifying areas for improvement. Continuous improvement is a DevOps principle and platform teams should adopt it: use metrics to find friction points, iterate on the platform, and watch metrics improve, then repeat. A data-driven approach ensures the platform is solving the right problems and evolving based on evidence, not just assumptions.

By mastering both the human-centric metrics (HEART categories like engagement, satisfaction) and the process-centric metrics (DORA's keys for velocity and stability), a platform engineer can ensure their internal developer platform truly empowers developers and accelerates delivery in a measurable, demonstrable way.
