Google Managed Prometheus for HCP Monitoring #4

jimdaga · 2026-02-09T18:33:40Z

Summary

- Adds observability design decision document for Google Managed Prometheus
- Includes cost analysis, cost control strategy, and example configurations
- Adds experiment files with PodMonitoring and Prometheus cluster-wide configs

Context

This PR recreates the content from the original PR openshift-online/gcp-hcp#72, which was lost when the repository was converted from private to public. The original branch and file content have been preserved from the archived repo.

Jira: GCP-343

Test plan

- Verify all 6 files render correctly
- Confirm file content matches the original PR

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openshift-ci · 2026-02-09T18:33:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jimdaga

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jimdaga]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-02-09T18:34:19Z

Walkthrough

Introduces comprehensive documentation and example manifests for implementing Google Managed Prometheus (GMP) in a hybrid architecture for HyperShift control plane monitoring on GCP. Includes design decisions, cost analysis, implementation strategies, and operational guidance with both PodMonitoring and cluster-wide Prometheus deployment examples.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Design & Architectural Decision `design-decisions/observability-google-managed-prometheus.md` | Formal design decision documenting hybrid GMP architecture: self-managed Prometheus collecting metrics with short retention, filtering, and exporting to GMP for long-term storage. Defines scope, constraints (regional isolation, global querying, cost management, golden signals), evaluates three alternatives, provides justification, consequences, and operationalization details including GitOps deployment patterns and cost-control mechanisms. |
| Cost Analysis & Control Strategy `experiments/google-managed-prometheus/COST-ANALYSIS.md`, `experiments/google-managed-prometheus/GMP-COST-CONTROL-STRATEGY.md` | Comprehensive cost analysis covering GMP pricing model, billing scenarios (Baseline, Optimized, Aggressive) at scales of 100-1000 HCPs with per-MC projections. Detailed cost-control strategy document outlining two-tier collection approach with ConfigMap-based metric allowlisting, recording rules, filtering rationale, testing plan, and cost-monitoring/alerting mechanisms. |
| Implementation Guidance `experiments/google-managed-prometheus/README.md` | Comprehensive guide comparing GMP PodMonitoring vs. cluster-wide Prometheus approaches for HCP monitoring, covering migration effort, consistency, RBAC, network policies, resource usage, uptime responsibilities, and cost considerations. Includes detailed steps, architecture diagrams, and testing results for both deployment patterns. |
| Deployment Manifests `experiments/google-managed-prometheus/gmp-podmonitoring-example.yaml`, `experiments/google-managed-prometheus/prometheus-cluster-wide.yaml` | Example Kubernetes manifests: PodMonitoring-based approach with cluster-scoped RBAC, per-HCP namespace resources for kube-apiserver/etcd, Rules CRD recording rules, and ClusterPodMonitoring; cluster-wide Prometheus deployment with ServiceAccount, ClusterRole/Binding, and Prometheus CR configured with GCP integrations and cost-optimization hooks. |

Sequence Diagram(s)

sequenceDiagram
    participant HCP as HCP Components
    participant LocalProm as Local Prometheus
    participant Filter as Metric Filter
    participant GMP as Google Managed<br/>Prometheus
    participant GCM as Google Cloud<br/>Monitoring
    participant Storage as Long-term<br/>Storage

    HCP->>LocalProm: Scrape metrics<br/>(all metrics)
    LocalProm->>LocalProm: Store locally<br/>(short retention)
    LocalProm->>Filter: Export all metrics
    Filter->>Filter: Apply allowlist<br/>(cost control)
    Filter->>GMP: Send filtered metrics
    GMP->>GCM: Ingest & process
    GCM->>Storage: Archive for<br/>long-term retention
    GCM-->>LocalProm: Query results<br/>(cross-region)
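
The allowlist step in the diagram can be illustrated with a minimal Python sketch. This is not the mechanism used in the PR, which the summary above describes as ConfigMap-based allowlisting; the `ALLOWLIST` contents and the `filter_for_export` helper are hypothetical and only show how an allowlist reduces what is exported.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[str, float]  # (metric name, value)

# Hypothetical allowlist for illustration only; the PR's real list is
# defined in its cost-control documents, not here.
ALLOWLIST = frozenset({
    "apiserver_request_total",
    "etcd_server_has_leader",
})


def filter_for_export(samples: Iterable[Sample]) -> List[Sample]:
    """Keep only samples whose metric name is on the allowlist."""
    return [(name, value) for name, value in samples if name in ALLOWLIST]


# Everything is scraped and stored locally; only allowlisted series leave.
scraped = [
    ("apiserver_request_total", 1200.0),
    ("go_gc_duration_seconds", 0.004),
    ("etcd_server_has_leader", 1.0),
]
exported = filter_for_export(scraped)
print(f"exported {len(exported)} of {len(scraped)} samples")
```

Keeping the full scrape locally and filtering only at export time matches the diagram: short-retention data stays available for debugging, while billed ingestion is limited to the allowlisted series.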

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Google Managed Prometheus for HCP Monitoring' directly and clearly summarizes the main change: adding Google Managed Prometheus documentation and configurations for HCP monitoring. |
| Description check | ✅ Passed | The description is fully related to the changeset, providing context about the PR purpose (adding GMP design decision, cost analysis, configurations), its origin, and what was added. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@experiments/google-managed-prometheus/prometheus-cluster-wide.yaml`:
- Around line 74-76: Update the Prometheus image and version fields to the newer
stable tag: replace the current image value
"gke.gcr.io/prometheus-engine/prometheus:v2.53.5-gmp.1-gke.2" and the version
value "v2.53.5-gmp.1-gke.2" with the newer tag "v2.53.5-gmp.0-gke.13" so both
the image reference (image) and the version field (version) consistently point
to v2.53.5-gmp.0-gke.13.
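
The comment calls the suggested tag newer, but the thread does not say how these tags are ordered. A quick check is sketched below, assuming (this is an assumption, not a documented GMP rule) that the numeric fields of the tag compare left to right; the `tag_key` helper is hypothetical.

```python
import re
from typing import Tuple

# Matches tags of the shape seen in this thread, e.g. v2.53.5-gmp.1-gke.2
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-gmp\.(\d+)-gke\.(\d+)$")


def tag_key(tag: str) -> Tuple[int, ...]:
    """Turn a tag into a tuple of ints so tags can be compared numerically."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag}")
    return tuple(int(part) for part in match.groups())


current = "v2.53.5-gmp.1-gke.2"
suggested = "v2.53.5-gmp.0-gke.13"

print(tag_key(current), tag_key(suggested))
print("suggested sorts after current:", tag_key(suggested) > tag_key(current))
```

Under this assumed ordering the suggested tag sorts before the current one, so it is worth confirming against the published image tags before changing the manifest.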

🧹 Nitpick comments (7)

experiments/google-managed-prometheus/README.md (1)
708-712: Update the Files section to include all related documentation.

The Files section lists only 3 files but omits the related cost analysis and strategy documents that are part of this PR:

- `COST-ANALYSIS.md` - Referenced throughout the document for cost projections
- `GMP-COST-CONTROL-STRATEGY.md` - Referenced for filtering implementation details

📝 Suggested fix
 ## Files
 
 - `README.md` - This comparison document
+- `COST-ANALYSIS.md` - Detailed cost projections and pricing analysis
+- `GMP-COST-CONTROL-STRATEGY.md` - Two-tier collection strategy and filtering implementation
 - `gmp-podmonitoring-example.yaml` - Example resources for Option 1 (GMP PodMonitoring)
 - `prometheus-cluster-wide.yaml` - Deployment manifest for Option 2 (Cluster-Wide Prometheus)

experiments/google-managed-prometheus/COST-ANALYSIS.md (2)
77-90: Add language specifier to fenced code block.

The code block describing billing account consolidation should have a language specifier for better rendering and syntax highlighting.
📝 Suggested fix
-```
+```text
 Billing Account: gcp-hcp-production
 ├── Project: gcp-hcp-mc-1 (50B samples/month)

586-588: Remove duplicate horizontal rule separator.

There are two consecutive horizontal rules (---), which appears to be a formatting error.
📝 Suggested fix
 ---
 
----
-
 **Document Version**: 2.0

experiments/google-managed-prometheus/GMP-COST-CONTROL-STRATEGY.md (1)
36-56: Add language specifier to architecture diagram.

The ASCII architecture diagram should have a language specifier for consistent formatting.
📝 Suggested fix
-```
+```text
 ┌─────────────────────────────────────────────┐
 │  HCP Namespaces (ServiceMonitors/PodMonitors) │

experiments/google-managed-prometheus/prometheus-cluster-wide.yaml (2)
11-13: Parameterize environment-specific service account.

The Workload Identity annotation contains a hardcoded GCP service account email specific to the int-mgt-us-c1-yjiv project. This should be documented as a placeholder or parameterized for different environments.
📝 Suggested fix
   annotations:
     # Workload Identity binding
-    iam.gke.io/gcp-service-account: prometheus-agent-hcp@int-mgt-us-c1-yjiv.iam.gserviceaccount.com
+    # TODO: Replace with your GCP service account
+    iam.gke.io/gcp-service-account: prometheus-agent-hcp@YOUR-PROJECT-ID.iam.gserviceaccount.com

104-109: Parameterize environment-specific external labels.

The externalLabels contain hardcoded values specific to the int-mgt-us-c1-yjiv environment. These should be documented as placeholders requiring replacement for each deployment.
📝 Suggested fix
   # External labels (will be added to all metrics)
+  # TODO: Replace these values with your environment-specific configuration
   externalLabels:
-    project_id: "int-mgt-us-c1-yjiv"
-    location: "us-central1"
-    cluster: "int-mgt-us-c1-yjiv-gke"
+    project_id: "YOUR-PROJECT-ID"
+    location: "YOUR-REGION"
+    cluster: "YOUR-CLUSTER-NAME"
     prometheus: "cluster-wide"
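
One way to fill such placeholders per environment is a small render step before applying the manifest. The sketch below is only an illustration: the placeholder names come from the suggested fix above, the values are the ones the original manifest hardcoded, and the `render` helper is hypothetical rather than part of the PR.

```python
from typing import Dict

# Placeholder names from the suggested fix; values are the previously
# hardcoded ones for the int-mgt-us-c1-yjiv environment.
PLACEHOLDERS: Dict[str, str] = {
    "YOUR-PROJECT-ID": "int-mgt-us-c1-yjiv",
    "YOUR-REGION": "us-central1",
    "YOUR-CLUSTER-NAME": "int-mgt-us-c1-yjiv-gke",
}

TEMPLATE = """\
externalLabels:
  project_id: "YOUR-PROJECT-ID"
  location: "YOUR-REGION"
  cluster: "YOUR-CLUSTER-NAME"
  prometheus: "cluster-wide"
"""


def render(template: str, values: Dict[str, str]) -> str:
    """Substitute every placeholder and fail loudly if one is left over."""
    rendered = template
    for placeholder, value in values.items():
        rendered = rendered.replace(placeholder, value)
    leftover = [line.strip() for line in rendered.splitlines() if "YOUR-" in line]
    if leftover:
        raise ValueError(f"unreplaced placeholders: {leftover}")
    return rendered


print(render(TEMPLATE, PLACEHOLDERS))
```

Failing on leftover placeholders keeps a half-filled manifest from being applied with a literal `YOUR-PROJECT-ID` label.
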

experiments/google-managed-prometheus/gmp-podmonitoring-example.yaml (1)
16-46: Cluster-wide secret access is an intentional security trade-off.

The Checkov warning (CKV2_K8S_5) about granting cluster-wide secret read access is valid from a security perspective. However, this is a documented and intentional requirement for the GMP PodMonitoring approach:

- HCP metrics endpoints use mTLS with certificates stored in secrets
- GMP collectors in gke-gmp-system need access to these secrets for authentication
- The README.md explicitly lists this as a challenge: "Custom RBAC: Maintain ClusterRoleBinding (may conflict with GKE updates)"

Consider adding a comment in the manifest to acknowledge this security consideration:
📝 Suggested documentation addition
 ---
 # ClusterRole: Grant secret access to GMP collectors
 # This is required because HCP metrics endpoints use TLS certificates
 # stored in secrets within each HCP namespace.
+#
+# SECURITY NOTE: This grants cluster-wide secret read access to the GMP collector.
+# This is a documented trade-off of the GMP PodMonitoring approach.
+# Consider the Cluster-Wide Prometheus alternative if this is unacceptable.
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole

Add observability design document for Google Managed Prometheus

a4bfb39

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openshift-ci bot requested review from cristianoveiga and patjlm February 9, 2026 18:33

openshift-ci bot added the approved label Feb 9, 2026

coderabbitai bot reviewed Feb 9, 2026

jimdaga marked this pull request as draft February 9, 2026 18:40

openshift-ci bot added the do-not-merge/work-in-progress label Feb 9, 2026

jimdaga marked this pull request as ready for review February 9, 2026 18:50

openshift-ci bot removed the do-not-merge/work-in-progress label Feb 9, 2026

openshift-ci bot requested a review from cblecker February 9, 2026 18:50
