-
Notifications
You must be signed in to change notification settings - Fork 6
Google Managed Prometheus for HCP Monitoring #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jimdaga The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughIntroduces comprehensive documentation and example manifests for implementing Google Managed Prometheus (GMP) in a hybrid architecture for HyperShift control plane monitoring on GCP. Includes design decisions, cost analysis, implementation strategies, and operational guidance with both PodMonitoring and cluster-wide Prometheus deployment examples. Changes
Sequence Diagram(s)sequenceDiagram
participant HCP as HCP Components
participant LocalProm as Local Prometheus
participant Filter as Metric Filter
participant GMP as Google Managed<br/>Prometheus
participant GCM as Google Cloud<br/>Monitoring
participant Storage as Long-term<br/>Storage
HCP->>LocalProm: Scrape metrics<br/>(all metrics)
LocalProm->>LocalProm: Store locally<br/>(short retention)
LocalProm->>Filter: Export all metrics
Filter->>Filter: Apply allowlist<br/>(cost control)
Filter->>GMP: Send filtered metrics
GMP->>GCM: Ingest & process
GCM->>Storage: Archive for<br/>long-term retention
GCM-->>LocalProm: Query results<br/>(cross-region)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@experiments/google-managed-prometheus/prometheus-cluster-wide.yaml`:
- Around line 74-76: Update the Prometheus image and version fields to the newer
stable tag: replace the current image value
"gke.gcr.io/prometheus-engine/prometheus:v2.53.5-gmp.1-gke.2" and the version
value "v2.53.5-gmp.1-gke.2" with the newer tag "v2.53.5-gmp.0-gke.13" so both
the image reference (image) and the version field (version) consistently point
to v2.53.5-gmp.0-gke.13.
🧹 Nitpick comments (7)
experiments/google-managed-prometheus/README.md (1)
708-712: Update the Files section to include all related documentation.The Files section lists only 3 files but omits the related cost analysis and strategy documents that are part of this PR:
COST-ANALYSIS.md- Referenced throughout the document for cost projectionsGMP-COST-CONTROL-STRATEGY.md- Referenced for filtering implementation details📝 Suggested fix
## Files - `README.md` - This comparison document +- `COST-ANALYSIS.md` - Detailed cost projections and pricing analysis +- `GMP-COST-CONTROL-STRATEGY.md` - Two-tier collection strategy and filtering implementation - `gmp-podmonitoring-example.yaml` - Example resources for Option 1 (GMP PodMonitoring) - `prometheus-cluster-wide.yaml` - Deployment manifest for Option 2 (Cluster-Wide Prometheus)experiments/google-managed-prometheus/COST-ANALYSIS.md (2)
77-90: Add language specifier to fenced code block.The code block describing billing account consolidation should have a language specifier for better rendering and syntax highlighting.
📝 Suggested fix
-``` +```text Billing Account: gcp-hcp-production ├── Project: gcp-hcp-mc-1 (50B samples/month)
586-588: Remove duplicate horizontal rule separator.There are two consecutive horizontal rules (
---) which appears to be a formatting error.📝 Suggested fix
--- ---- - **Document Version**: 2.0experiments/google-managed-prometheus/GMP-COST-CONTROL-STRATEGY.md (1)
36-56: Add language specifier to architecture diagram.The ASCII architecture diagram should have a language specifier for consistent formatting.
📝 Suggested fix
-``` +```text ┌─────────────────────────────────────────────┐ │ HCP Namespaces (ServiceMonitors/PodMonitors) │experiments/google-managed-prometheus/prometheus-cluster-wide.yaml (2)
11-13: Parameterize environment-specific service account.The Workload Identity annotation contains a hardcoded GCP service account email specific to an
int-mgt-us-c1-yjivproject. This should be documented as a placeholder or parameterized for different environments.📝 Suggested fix
annotations: # Workload Identity binding - iam.gke.io/gcp-service-account: prometheus-agent-hcp@int-mgt-us-c1-yjiv.iam.gserviceaccount.com + # TODO: Replace with your GCP service account + iam.gke.io/gcp-service-account: prometheus-agent-hcp@YOUR-PROJECT-ID.iam.gserviceaccount.com
104-109: Parameterize environment-specific external labels.The
externalLabelscontain hardcoded values specific to theint-mgt-us-c1-yjivenvironment. These should be documented as placeholders requiring replacement for each deployment.📝 Suggested fix
# External labels (will be added to all metrics) + # TODO: Replace these values with your environment-specific configuration externalLabels: - project_id: "int-mgt-us-c1-yjiv" - location: "us-central1" - cluster: "int-mgt-us-c1-yjiv-gke" + project_id: "YOUR-PROJECT-ID" + location: "YOUR-REGION" + cluster: "YOUR-CLUSTER-NAME" prometheus: "cluster-wide"experiments/google-managed-prometheus/gmp-podmonitoring-example.yaml (1)
16-46: Cluster-wide secret access is an intentional security trade-off.The Checkov warning (CKV2_K8S_5) about granting cluster-wide secret read access is valid from a security perspective. However, this is a documented and intentional requirement for the GMP PodMonitoring approach:
- HCP metrics endpoints use mTLS with certificates stored in secrets
- GMP collectors in
gke-gmp-systemneed access to these secrets for authentication- The README.md explicitly lists this as a challenge: "Custom RBAC: Maintain ClusterRoleBinding (may conflict with GKE updates)"
Consider adding a comment in the manifest to acknowledge this security consideration:
📝 Suggested documentation addition
--- # ClusterRole: Grant secret access to GMP collectors # This is required because HCP metrics endpoints use TLS certificates # stored in secrets within each HCP namespace. +# +# SECURITY NOTE: This grants cluster-wide secret read access to the GMP collector. +# This is a documented trade-off of the GMP PodMonitoring approach. +# Consider the Cluster-Wide Prometheus alternative if this is unacceptable. apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole
Summary
Context
This PR recreates the content from the original PR openshift-online/gcp-hcp#72, which was lost when the repository was converted from private to public. The original branch and file content have been preserved from the archived repo.
Jira: GCP-343
Test plan
🤖 Generated with Claude Code