Contributor

@jimdaga jimdaga commented Feb 9, 2026

Summary

  • Adds observability design decision document for Google Managed Prometheus
  • Includes cost analysis, cost control strategy, and example configurations
  • Adds experiment files with PodMonitoring and Prometheus cluster-wide configs

Context

This PR recreates the content from the original PR openshift-online/gcp-hcp#72, which was lost when the repository was converted from private to public. The original branch and file content have been preserved from the archived repo.

Jira: GCP-343

Test plan

  • Verify all 6 files render correctly
  • Confirm file content matches the original PR

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openshift-ci bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jimdaga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 9, 2026

coderabbitai bot commented Feb 9, 2026

Walkthrough

Introduces comprehensive documentation and example manifests for implementing Google Managed Prometheus (GMP) in a hybrid architecture for HyperShift control plane monitoring on GCP. Includes design decisions, cost analysis, implementation strategies, and operational guidance with both PodMonitoring and cluster-wide Prometheus deployment examples.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Design & Architectural Decision**<br/>`design-decisions/observability-google-managed-prometheus.md` | Formal design decision documenting hybrid GMP architecture: self-managed Prometheus collecting metrics with short retention, filtering, and exporting to GMP for long-term storage. Defines scope, constraints (regional isolation, global querying, cost management, golden signals), evaluates three alternatives, provides justification, consequences, and operationalization details including GitOps deployment patterns and cost-control mechanisms. |
| **Cost Analysis & Control Strategy**<br/>`experiments/google-managed-prometheus/COST-ANALYSIS.md`, `experiments/google-managed-prometheus/GMP-COST-CONTROL-STRATEGY.md` | Comprehensive cost analysis covering GMP pricing model, billing scenarios (Baseline, Optimized, Aggressive) at scales of 100-1000 HCPs with per-MC projections. Detailed cost-control strategy document outlining two-tier collection approach with ConfigMap-based metric allowlisting, recording rules, filtering rationale, testing plan, and cost-monitoring/alerting mechanisms. |
| **Implementation Guidance**<br/>`experiments/google-managed-prometheus/README.md` | Comprehensive guide comparing GMP PodMonitoring vs. cluster-wide Prometheus approaches for HCP monitoring, covering migration effort, consistency, RBAC, network policies, resource usage, uptime responsibilities, and cost considerations. Includes detailed steps, architecture diagrams, and testing results for both deployment patterns. |
| **Deployment Manifests**<br/>`experiments/google-managed-prometheus/gmp-podmonitoring-example.yaml`, `experiments/google-managed-prometheus/prometheus-cluster-wide.yaml` | Example Kubernetes manifests: PodMonitoring-based approach with cluster-scoped RBAC, per-HCP namespace resources for kube-apiserver/etcd, Rules CRD recording rules, and ClusterPodMonitoring; cluster-wide Prometheus deployment with ServiceAccount, ClusterRole/Binding, and Prometheus CR configured with GCP integrations and cost-optimization hooks. |
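
The tiered billing scenarios summarized above can be illustrated with a small calculation. This is a minimal sketch only: the tier boundaries and per-million rates below are hypothetical placeholders, not GMP's price list and not the figures from `COST-ANALYSIS.md`, and `monthly_cost` is an illustrative helper name.

```python
# Hypothetical tiers: (upper bound in samples per month, USD per million samples).
# Placeholder numbers for illustration only; substitute the real price list.
HYPOTHETICAL_TIERS = [
    (50_000_000_000, 0.10),
    (250_000_000_000, 0.08),
    (float("inf"), 0.05),
]


def monthly_cost(samples, tiers=HYPOTHETICAL_TIERS):
    """Return the USD cost of `samples` ingested in one month under tiered pricing."""
    cost = 0.0
    lower = 0  # lower bound of the tier currently being billed
    for upper, rate_per_million in tiers:
        if samples <= lower:
            break  # all samples have already been billed in lower tiers
        billable = min(samples, upper) - lower
        cost += billable / 1_000_000 * rate_per_million
        lower = upper
    return cost


# Three projects of 50B samples each, billed separately vs. as one aggregate.
separate = 3 * monthly_cost(50_000_000_000)
consolidated = monthly_cost(150_000_000_000)
```

If tiers are evaluated on aggregate volume, `consolidated` comes out lower than `separate`, because part of the combined volume falls into the cheaper tier. Whether aggregation applies per project or per billing account should be confirmed against the pricing documentation.
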

Sequence Diagram(s)

sequenceDiagram
    participant HCP as HCP Components
    participant LocalProm as Local Prometheus
    participant Filter as Metric Filter
    participant GMP as Google Managed<br/>Prometheus
    participant GCM as Google Cloud<br/>Monitoring
    participant Storage as Long-term<br/>Storage

    HCP->>LocalProm: Scrape metrics<br/>(all metrics)
    LocalProm->>LocalProm: Store locally<br/>(short retention)
    LocalProm->>Filter: Export all metrics
    Filter->>Filter: Apply allowlist<br/>(cost control)
    Filter->>GMP: Send filtered metrics
    GMP->>GCM: Ingest & process
    GCM->>Storage: Archive for<br/>long-term retention
    GCM-->>LocalProm: Query results<br/>(cross-region)
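
The allowlist step in the diagram can be sketched as a simple name filter. This is an illustration under assumptions, not the PR's implementation: the patterns in `ALLOWLIST` are a hypothetical selection (the real list is described as ConfigMap-based in `GMP-COST-CONTROL-STRATEGY.md`), and the sample dictionaries are an invented shape.

```python
import re

# Hypothetical allowlist entries; the real list lives in the PR's ConfigMap.
ALLOWLIST = [
    r"apiserver_request_total",
    r"apiserver_request_duration_seconds_.*",
    r"etcd_server_has_leader",
]


def compile_allowlist(patterns):
    """Join patterns into one alternation; callers must use fullmatch()."""
    return re.compile("(?:%s)" % "|".join(patterns))


def filter_samples(samples, patterns=ALLOWLIST):
    """Keep only samples whose metric name fully matches an allowlist pattern.

    Full matching mirrors Prometheus relabeling, where regexes are anchored
    at both ends, so a pattern never matches a longer name by prefix alone.
    """
    matcher = compile_allowlist(patterns)
    return [s for s in samples if matcher.fullmatch(s["name"])]


scraped = [
    {"name": "apiserver_request_total", "value": 1.0},
    {"name": "apiserver_request_duration_seconds_bucket", "value": 2.0},
    {"name": "go_gc_duration_seconds", "value": 3.0},
]
exported = filter_samples(scraped)
```

Everything in `scraped` would remain queryable in the local Prometheus during its short retention window; only `exported` would count toward managed-service ingestion.
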

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Google Managed Prometheus for HCP Monitoring' directly and clearly summarizes the main change—adding Google Managed Prometheus documentation and configurations for HCP monitoring.
Description check ✅ Passed The description is fully related to the changeset, providing context about the PR purpose (adding GMP design decision, cost analysis, configurations), its origin, and what was added.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@experiments/google-managed-prometheus/prometheus-cluster-wide.yaml`:
- Around line 74-76: Update the Prometheus image and version fields to the newer
stable tag: replace the current image value
"gke.gcr.io/prometheus-engine/prometheus:v2.53.5-gmp.1-gke.2" and the version
value "v2.53.5-gmp.1-gke.2" with the newer tag "v2.53.5-gmp.0-gke.13" so both
the image reference (image) and the version field (version) consistently point
to v2.53.5-gmp.0-gke.13.
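
The requirement that the `image` and `version` fields point at the same tag could be guarded with a small check. This is a hedged sketch: `image_tag` and `tags_consistent` are hypothetical helper names, and the parsing covers only the common `registry/path/name:tag` form.

```python
def image_tag(image):
    """Return the tag of an image reference, or None if it carries no tag."""
    name = image.rsplit("/", 1)[-1]  # drop registry/path (a registry may contain ':port')
    name = name.split("@", 1)[0]     # drop an optional '@sha256:...' digest
    if ":" not in name:
        return None
    return name.split(":", 1)[1]


def tags_consistent(image, version):
    """True when the image tag and the separately declared version agree."""
    return image_tag(image) == version


ok = tags_consistent(
    "gke.gcr.io/prometheus-engine/prometheus:v2.53.5-gmp.0-gke.13",
    "v2.53.5-gmp.0-gke.13",
)
```

A check of this kind would flag a manifest in which only one of the two fields was updated.
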
🧹 Nitpick comments (7)
experiments/google-managed-prometheus/README.md (1)

708-712: Update the Files section to include all related documentation.

The Files section lists only 3 files but omits the related cost analysis and strategy documents that are part of this PR:

  • COST-ANALYSIS.md - Referenced throughout the document for cost projections
  • GMP-COST-CONTROL-STRATEGY.md - Referenced for filtering implementation details
📝 Suggested fix
 ## Files
 
 - `README.md` - This comparison document
+- `COST-ANALYSIS.md` - Detailed cost projections and pricing analysis
+- `GMP-COST-CONTROL-STRATEGY.md` - Two-tier collection strategy and filtering implementation
 - `gmp-podmonitoring-example.yaml` - Example resources for Option 1 (GMP PodMonitoring)
 - `prometheus-cluster-wide.yaml` - Deployment manifest for Option 2 (Cluster-Wide Prometheus)
experiments/google-managed-prometheus/COST-ANALYSIS.md (2)

77-90: Add language specifier to fenced code block.

The code block describing billing account consolidation should have a language specifier for better rendering and syntax highlighting.

📝 Suggested fix
-```
+```text
 Billing Account: gcp-hcp-production
 ├── Project: gcp-hcp-mc-1 (50B samples/month)

586-588: Remove duplicate horizontal rule separator.

There are two consecutive horizontal rules (---), which appear to be a formatting error.

📝 Suggested fix
 ---
 
----
-
 **Document Version**: 2.0
experiments/google-managed-prometheus/GMP-COST-CONTROL-STRATEGY.md (1)

36-56: Add language specifier to architecture diagram.

The ASCII architecture diagram should have a language specifier for consistent formatting.

📝 Suggested fix
-```
+```text
 ┌─────────────────────────────────────────────┐
 │  HCP Namespaces (ServiceMonitors/PodMonitors) │
experiments/google-managed-prometheus/prometheus-cluster-wide.yaml (2)

11-13: Parameterize environment-specific service account.

The Workload Identity annotation contains a hardcoded GCP service account email specific to the int-mgt-us-c1-yjiv project. This should be documented as a placeholder or parameterized for different environments.

📝 Suggested fix
   annotations:
     # Workload Identity binding
-    iam.gke.io/gcp-service-account: prometheus-agent-hcp@int-mgt-us-c1-yjiv.iam.gserviceaccount.com
+    # TODO: Replace with your GCP service account
+    iam.gke.io/gcp-service-account: prometheus-agent-hcp@YOUR-PROJECT-ID.iam.gserviceaccount.com

104-109: Parameterize environment-specific external labels.

The externalLabels contain hardcoded values specific to the int-mgt-us-c1-yjiv environment. These should be documented as placeholders requiring replacement for each deployment.

📝 Suggested fix
   # External labels (will be added to all metrics)
+  # TODO: Replace these values with your environment-specific configuration
   externalLabels:
-    project_id: "int-mgt-us-c1-yjiv"
-    location: "us-central1"
-    cluster: "int-mgt-us-c1-yjiv-gke"
+    project_id: "YOUR-PROJECT-ID"
+    location: "YOUR-REGION"
+    cluster: "YOUR-CLUSTER-NAME"
     prometheus: "cluster-wide"
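
One way to parameterize both the service account annotation and the external labels is to render them from per-environment inputs. This is a minimal sketch, assuming a hypothetical template and helper names; the repository may instead prefer Kustomize, Helm, or another GitOps mechanism.

```python
from string import Template

# Hypothetical template for the externalLabels stanza of the Prometheus CR.
EXTERNAL_LABELS = Template(
    "externalLabels:\n"
    '  project_id: "${project_id}"\n'
    '  location: "${location}"\n'
    '  cluster: "${cluster}"\n'
    '  prometheus: "cluster-wide"\n'
)


def service_account_email(name, project_id):
    """Build a GCP service account email from its name and project ID."""
    return f"{name}@{project_id}.iam.gserviceaccount.com"


def render_external_labels(project_id, location, cluster):
    """Fill the template; substitute() raises KeyError if a value is missing."""
    return EXTERNAL_LABELS.substitute(
        project_id=project_id, location=location, cluster=cluster
    )


rendered = render_external_labels(
    "int-mgt-us-c1-yjiv", "us-central1", "int-mgt-us-c1-yjiv-gke"
)
email = service_account_email("prometheus-agent-hcp", "int-mgt-us-c1-yjiv")
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing value fails loudly instead of leaving a placeholder in a deployed manifest.
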
experiments/google-managed-prometheus/gmp-podmonitoring-example.yaml (1)

16-46: Cluster-wide secret access is an intentional security trade-off.

The Checkov warning (CKV2_K8S_5) about granting cluster-wide secret read access is valid from a security perspective. However, this is a documented and intentional requirement for the GMP PodMonitoring approach:

  1. HCP metrics endpoints use mTLS with certificates stored in secrets
  2. GMP collectors in gke-gmp-system need access to these secrets for authentication
  3. The README.md explicitly lists this as a challenge: "Custom RBAC: Maintain ClusterRoleBinding (may conflict with GKE updates)"

Consider adding a comment in the manifest to acknowledge this security consideration:

📝 Suggested documentation addition
 ---
 # ClusterRole: Grant secret access to GMP collectors
 # This is required because HCP metrics endpoints use TLS certificates
 # stored in secrets within each HCP namespace.
+#
+# SECURITY NOTE: This grants cluster-wide secret read access to the GMP collector.
+# This is a documented trade-off of the GMP PodMonitoring approach.
+# Consider the Cluster-Wide Prometheus alternative if this is unacceptable.
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
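
The kind of condition this warning describes can be sketched as a check over a role's rules. This is an illustrative approximation and not Checkov's actual implementation of CKV2_K8S_5; the function name and the example role are hypothetical.

```python
READ_VERBS = {"get", "list", "watch", "*"}


def grants_secret_read(role):
    """True if any rule in the role allows reading Secrets in the core API group.

    `role` is a ClusterRole or Role manifest already parsed into a dict.
    Rules narrowed by `resourceNames` are still reported; tighten as needed.
    """
    for rule in role.get("rules", []):
        in_core_group = any(g in ("", "*") for g in rule.get("apiGroups", []))
        covers_secrets = bool({"secrets", "*"} & set(rule.get("resources", [])))
        can_read = bool(READ_VERBS & set(rule.get("verbs", [])))
        if in_core_group and covers_secrets and can_read:
            return True
    return False


# Hypothetical role resembling the trade-off discussed above.
example_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "rules": [{"apiGroups": [""], "resources": ["secrets"], "verbs": ["get"]}],
}
flagged = grants_secret_read(example_role)
```

Because a ClusterRole bound through a ClusterRoleBinding applies in every namespace, a positive result here corresponds to the cluster-wide exposure that the suggested security note documents.
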

@jimdaga jimdaga marked this pull request as draft February 9, 2026 18:40
@jimdaga jimdaga marked this pull request as ready for review February 9, 2026 18:50
@openshift-ci openshift-ci bot requested a review from cblecker February 9, 2026 18:50