The Cost Metrics Aggregator is a Go-based application for collecting and aggregating cost-related metrics from Kubernetes clusters, focusing on node vCPU utilization and pod CPU usage for subscription purposes. It stores data in a PostgreSQL database with partitioned tables for efficient time-series management. The application is deployed on OpenShift with automated image builds via Quay.io and supports local development with Podman.
- Collects node metrics (e.g., core count) and pod metrics (e.g., CPU usage and request seconds) from clusters.
- Stores data in PostgreSQL with UUID-based identifiers and range-partitioned tables for time-series data.
- Aggregates daily node and pod metrics for efficient querying (e.g., total hours and effective core seconds).
- Manages database partitions with automated creation and deletion via OpenShift CronJobs.
- Provides RESTful API endpoints to upload metrics and query node and pod data.
- Deploys on OpenShift with a dedicated PostgreSQL instance and secrets.
- Supports local development with Podman and `podman-compose` for testing and debugging.
- Provides scripts for offline setup and installation.
- OpenShift Deployment:
  - OpenShift cluster (v4.x) with admin access.
  - Quay.io account with permissions to push to `quay.io/almacdon/cost-metrics-aggregator`.
  - GitHub repository (`aptmac/cost-metrics-aggregator`) with push access.
  - `kubectl` installed locally.
- Local Development:
  - Go 1.20 or higher.
  - Podman and `podman-compose` installed.
  - `make` for using the `Makefile`.
  - A storage class (e.g., `standard`) available in OpenShift for PostgreSQL persistence (if deploying locally with OpenShift).
.
├── Containerfile # Container build configuration
├── Makefile # Build, test, and deployment tasks
├── podman-compose.yaml # Local development services (app, database)
├── go.mod # Go module dependencies
├── install.sh # Online installation script
├── api/
│ ├── handlers/ # API request handlers
│ │ ├── query.go
│ │ ├── sources.go
│ │ └── upload.go
│ ├── router.go # API router
│ └── router_test.go
├── cmd/
│ └── server/
│ └── main.go # Application entry point
├── internal/
│ ├── config/ # Server configuration
│ │ ├── config.go
│ │ └── config_test.go
│ ├── db/ # Database layer
│ │ ├── repository.go
│ │ ├── repository_test.go
│ │ ├── testutils/
│ │ │ └── setup.go
│ │ └── migrations/ # SQL migrations
│ │ ├── 0001_init.up.sql
│ │ └── 0001_init.down.sql
│ └── processor/ # CSV processing logic
│ ├── csv_processor.go
│ ├── csv_processor_test.go
│ ├── tar_processor.go
│ ├── tar_processor_test.go
│ └── testutils/
│ └── setup.go
├── scripts/ # Utility scripts
│ ├── generate-ssl-certs.sh # SSL certificate generation
│ ├── reset_db.sh # Database reset utility
│ ├── create/ # Partition creation script
│ │ └── main.go
│ ├── drop/ # Partition deletion script
│ │ └── main.go
│ └── generate_test_upload/ # Test data generation
│ └── main.go
├── grafana/ # Grafana Helm deployment
│ ├── grafana-values.yml # Helm values for Grafana
│ ├── install-grafana.sh # Grafana installation script
│ └── README.md
├── observability/ # Long-term metrics storage stack
│ ├── dashboard.json # Grafana dashboard for cost metrics
│ ├── install.sh # Installation script
│ ├── install-seaweedfs.sh # SeaweedFS-specific installation
│ ├── README.md # Observability stack documentation
│ ├── RESOURCES.md # Resource requirements
│ ├── RESOURCES-SEAWEEDFS.md # SeaweedFS resource details
│ └── manifests/
│ ├── base/ # Base resources (namespace, storage)
│ │ ├── namespace.yml
│ │ ├── serviceaccount.yml
│ │ ├── storage.yml
│ │ └── storage-seaweedfs.yml
│ ├── prometheus/ # Prometheus with federation
│ │ ├── configmap.yml
│ │ ├── service.yml
│ │ └── statefulset.yml
│ ├── thanos/ # Thanos components
│ │ ├── compactor-statefulset.yml
│ │ ├── query-deployment.yml
│ │ └── store-statefulset.yml
│ ├── seaweedfs/ # SeaweedFS object storage
│ │ └── deployment.yml
│ └── grafana/ # Grafana for visualization
│ ├── configmap.yml
│ └── deployment.yml
├── deploy/ # Kubernetes deployment manifests
│ ├── namespace.yml # CMA namespace
│ ├── deployment.yml # CMA application deployment
│ ├── service.yml # CMA service
│ ├── route.yml # CMA route
│ ├── operator/ # Koku Metrics Operator
│ │ ├── operator-serviceaccount.yml
│ │ ├── operator-clusterrole.yml
│ │ ├── operator-clusterrolebinding.yml
│ │ ├── operator-prometheus-rolebinding.yml
│ │ ├── operator-crd.yml
│ │ ├── operator-deployment.yml
│ │ └── CostManagementMetricsConfig.yml
│ ├── postgres/ # PostgreSQL database
│ │ ├── postgres-deployment.yml
│ │ ├── postgres-ssl-config.yml
│ │ ├── cost-metrics-db-secret.yml
│ │ ├── cronjob-create-partitions.yml
│ │ └── cronjob-drop-partitions.yml
│ └── offline/ # Offline variants (registry placeholders)
│ ├── deployment.yml
│ ├── postgres-deployment.yml
│ ├── operator-deployment.yml
│ └── grafana-openshift-values.yaml
└── offline/ # Offline/air-gapped deployment
├── prepare-offline-bundle.sh
├── README.md
├── demo-apps/ # Demo applications bundle
│ ├── prepare-offline-demo-bundle.sh
│ ├── README.md
│ ├── config/ # Helm values
│ │ ├── cryostat.yaml
│ │ └── eap74.yaml
│ └── installation-scripts/
│ ├── install-cryostat-offline.sh
│ ├── install-eap74-offline.sh
│ └── load-images-offline.sh
└── installation-scripts/
├── install-offline.sh
├── install-grafana-offline.sh
└── load-images-offline.sh
The database schema (`internal/db/migrations/0001_init.up.sql`) defines:

- `clusters`: Stores cluster metadata with UUID `id` and `name`.
- `nodes`: Stores node metadata with UUID `id`, `cluster_id`, `name`, `identifier`, and `type`.
- `node_metrics`: Stores time-series node metrics with UUID `id`, `node_id`, `timestamp`, `core_count`, and `cluster_id`, partitioned monthly by `timestamp`.
- `node_daily_summary`: Aggregates daily node metrics by `node_id`, `date`, and `core_count`, storing `total_hours`.
- `pods`: Stores pod metadata with UUID `id`, `cluster_id`, `node_id`, `name`, `namespace`, and `component`.
- `pod_metrics`: Stores time-series pod metrics with UUID `id`, `pod_id`, `timestamp`, `pod_usage_cpu_core_seconds`, `pod_request_cpu_core_seconds`, `node_capacity_cpu_core_seconds`, and `node_capacity_cpu_cores`, partitioned monthly by `timestamp`.
- `pod_daily_summary`: Aggregates daily pod metrics by `pod_id` and `date`, storing `max_cores_used`, `total_pod_effective_core_seconds`, and `total_hours`.

All `id` columns use UUIDs (via `gen_random_uuid()`). The `node_metrics` and `pod_metrics` tables are partitioned for performance.
```shell
git clone https://github.com/aptmac/cost-metrics-aggregator.git
cd cost-metrics-aggregator
```

Create a `./db.env` file for the application:

```shell
echo "DATABASE_URL=postgres://costmetrics:costmetrics@db:5432/costmetrics?sslmode=disable" > ./db.env
echo "POD_LABEL_KEYS=label_rht_comp" >> ./db.env
```

- `DATABASE_URL`: Matches the PostgreSQL service in `podman-compose.yaml`. Uses `sslmode=disable` for local development since the local PostgreSQL container doesn't have SSL configured.
- `POD_LABEL_KEYS`: Defines pod labels for filtering (e.g., `label_rht_comp`).
Note: For OpenShift/production deployments, SSL is enabled by default. The deployment uses `sslmode=require` in the secret configuration.
Use the Makefile to start the application and PostgreSQL database:
```shell
make compose-up
```

This:

- Builds the application image using the `Containerfile`.
- Starts the `app` (aggregator) and `db` (PostgreSQL) services.
- Applies migrations from `internal/db/migrations` to initialize the database schema.

Verify services are running:

```shell
podman ps
```

Expected output includes containers `aggregator` and `aggregator-db`.
Execute unit tests to verify the application logic:
```shell
make test
```

This runs tests in all packages, including CSV processing for node and pod metrics.
Generate a test tar.gz file containing a manifest.json and sample CSV files for the previous 24 hours:
```shell
make generate-test-upload
```

Upload the generated test file to the application:

```shell
make upload-test
```

The `generate-test-upload` target creates a `test_upload.tar.gz` file with a manifest and two CSV files, each containing hourly metrics data compatible with the application's ingestion endpoint. The `upload-test` target sends this file to `http://localhost:8080/api/ingress/v1/upload`. Ensure the application is running before uploading.
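For clients that want to upload an archive programmatically rather than via the Makefile, here is a minimal Go sketch. Sending the archive as a raw request body with an `application/gzip` content type is an assumption; check `api/handlers/upload.go` (or the `upload-test` target) for the exact contract the server expects.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buildUploadRequest constructs a POST request carrying the tar.gz payload
// to the ingestion endpoint. The raw-body encoding is an assumption for
// illustration; the server may instead expect a multipart form.
func buildUploadRequest(baseURL string, payload []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/ingress/v1/upload", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/gzip")
	return req, nil
}

func main() {
	req, err := buildUploadRequest("http://localhost:8080", []byte("tarball-bytes"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```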
💡 Tip: Substitute `start_date` and `end_date` with the current date (in `YYYY-MM-DD` format) to ensure you query data from the current month's partition.
Query node metrics:

```shell
curl "http://localhost:8080/api/metrics/v1/nodes?start_date=2025-05-17&end_date=2027-05-17"
```

Query pod metrics:

```shell
curl "http://localhost:8080/api/metrics/v1/pods?start_date=2025-05-17&end_date=2027-05-17&namespace=test"
```

Connect to the PostgreSQL database to inspect data:

```shell
podman exec -it aggregator-db psql -U costmetrics -d costmetrics
```

List tables and partitions:

```sql
\dt+ node_metrics*
\dt+ pod_metrics*
```

Query summaries:

```sql
SELECT * FROM node_daily_summary WHERE date = '2025-05-17';
SELECT * FROM pod_daily_summary WHERE date = '2025-05-17';
```

Shut down and remove containers:

```shell
make compose-down
```

For a streamlined online deployment using public registries:
```shell
./install.sh
```

This script will:
- Create namespaces for the aggregator and operator
- Deploy PostgreSQL with SSL configuration
- Deploy the Cost Metrics Aggregator
- Install the Koku Metrics Operator
- Apply the CostManagementMetricsConfig
```shell
make build
podman build -t quay.io/almacdon/cost-metrics-aggregator:latest .
podman push quay.io/almacdon/cost-metrics-aggregator:latest
```

- Create the `cost-metrics` namespace:

  ```shell
  kubectl apply -f deploy/namespace.yml
  ```
- Update `deploy/postgres/cost-metrics-db-secret.yml` with base64-encoded values:

  - `postgres-password`: Your PostgreSQL password (e.g., `echo -n "costmetrics" | base64`)
  - `database-url`: Connection string with SSL enabled
    - Format: `postgres://<username>:<password>@postgres:5432/costmetrics?sslmode=require`
    - Example: `echo -n "postgres://costmetrics:costmetrics@postgres:5432/costmetrics?sslmode=require" | base64`
    - Result: `cG9zdGdyZXM6Ly9jb3N0bWV0cmljczpjb3N0bWV0cmljc0Bwb3N0Z3Jlczo1NDMyL2Nvc3RtZXRyaWNzP3NzbG1vZGU9cmVxdWlyZQ==`

  Note: The PostgreSQL deployment is configured with `POSTGRESQL_ENABLE_TLS=true` to support SSL connections.

- Deploy PostgreSQL and the secret:

  ```shell
  kubectl apply -f deploy/postgres/cost-metrics-db-secret.yml -n cost-metrics
  kubectl apply -f deploy/postgres/postgres-deployment.yml -n cost-metrics
  ```
- Deploy the application:

  ```shell
  kubectl apply -f deploy/deployment.yml -n cost-metrics
  kubectl apply -f deploy/service.yml -n cost-metrics
  kubectl apply -f deploy/route.yml -n cost-metrics
  ```
- Deploy CronJobs for partition management:

  ```shell
  kubectl apply -f deploy/postgres/cronjob-create-partitions.yml -n cost-metrics
  kubectl apply -f deploy/postgres/cronjob-drop-partitions.yml -n cost-metrics
  ```
If you need the Koku Metrics Operator for cost management:
```shell
kubectl apply -f deploy/operator/operator-serviceaccount.yml
kubectl apply -f deploy/operator/operator-clusterrole.yml
kubectl apply -f deploy/operator/operator-clusterrolebinding.yml
kubectl apply -f deploy/operator/operator-prometheus-rolebinding.yml
kubectl apply -f deploy/operator/operator-crd.yml
kubectl apply -f deploy/operator/operator-deployment.yml
kubectl apply -f deploy/operator/CostManagementMetricsConfig.yml -n koku-metrics-operator
```

For air-gapped or offline environments, see the offline deployment guide.
- Check pod status:

  ```shell
  kubectl get pods -n cost-metrics -l app=postgres
  kubectl get pods -n cost-metrics -l app=cost-metrics-aggregator
  ```

- Verify the database schema:

  ```shell
  kubectl exec -it <postgres-pod-name> -n cost-metrics -- psql -U costmetrics -d costmetrics -c "\dt+ node_metrics*"
  kubectl exec -it <postgres-pod-name> -n cost-metrics -- psql -U costmetrics -d costmetrics -c "\dt+ pod_metrics*"
  ```

- Check application logs:

  ```shell
  kubectl logs -l app=cost-metrics-aggregator -n cost-metrics
  ```

- Verify CronJob execution:

  ```shell
  kubectl get jobs -n cost-metrics
  kubectl logs <job-pod-name> -n cost-metrics
  ```
You can use kubectl to query the database directly:
Template:

```shell
kubectl exec -n cost-metrics \
  $(kubectl get pod -n cost-metrics -l app=postgres -o jsonpath='{.items[0].metadata.name}') -- \
  psql -U costmetrics -d costmetrics -c "YOUR SQL QUERY HERE"
```

Example (count all records):

```shell
kubectl exec -n cost-metrics \
  $(kubectl get pod -n cost-metrics -l app=postgres -o jsonpath='{.items[0].metadata.name}') -- \
  psql -U costmetrics -d costmetrics -c \
  "SELECT COUNT(*) FROM node_metrics; SELECT COUNT(*) FROM pod_metrics;"
```

- Creation: The `create_partitions.go` script (run by an initContainer and `cronjob-create-partitions`) creates `node_metrics` and `pod_metrics` partitions for the previous and next 90 days.
- Deletion: The `drop_partitions.go` script (run by `cronjob-drop-partitions`) drops partitions older than 90 days.
- Schedule: Both CronJobs run monthly on the 1st at midnight (`0 0 1 * *`).
- `POST /api/ingress/v1/upload`: Uploads a tar.gz file containing `manifest.json` and CSV files (e.g., `node.csv`) for metric ingestion.
- `GET /api/metrics/v1/nodes`: Queries node metrics (e.g., core count, total hours) with optional filters (`start_date`, `end_date`, `cluster_id`, `cluster_name`, `node_type`).
- `GET /api/metrics/v1/pods`: Queries pod metrics (e.g., max cores used, effective core seconds, total hours) with optional filters (`start_date`, `end_date`, `cluster_id`, `namespace`, `component`).
- Local Development:
  - Container Failures: Check `podman logs aggregator` or `podman logs aggregator-db` for errors.
  - Database Connectivity: Ensure `./db.env` has the correct `DATABASE_URL` and the `db` service is running.
  - CSV Processing Errors: Verify CSV format and `interval_start` timestamps (`2006-01-02 15:04:05 +0000 MST`).
- OpenShift Deployment:
  - Build Failures: Check Quay.io build logs for missing dependencies or network issues.
  - Migration Errors: Verify `DATABASE_URL` in `cost-metrics-db-secret.yml` and PostgreSQL pod logs.
  - CronJob Failures: Check job logs for script errors or database permissions.
- Metrics Issues:
  - Query `node_daily_summary` or `pod_daily_summary` to verify `total_hours`:

    ```sql
    SELECT * FROM node_daily_summary WHERE date = '2025-05-17';
    SELECT * FROM pod_daily_summary WHERE date = '2025-05-17';
    ```
- Submit pull requests to `aptmac/cost-metrics-aggregator`.
- Update `internal/db/migrations/` for schema changes and `internal/processor/` for metric processing logic.
- Add tests in relevant packages (e.g., `internal/processor`) for node and pod metric aggregation.
- Test locally with `make compose-up` and `make test` before pushing to Quay.io.