Commit 3ed4b69

profile and monitoring
1 parent 83c1a31 commit 3ed4b69

17 files changed

+89
-4109
lines changed

dev-tools/monitoring/README.md

Lines changed: 50 additions & 309 deletions
Large diffs are not rendered by default.

dev-tools/monitoring/data/.gitignore

Lines changed: 0 additions & 5 deletions
This file was deleted.

dev-tools/monitoring/docs/METRICS.md

Lines changed: 0 additions & 130 deletions
@@ -119,136 +119,6 @@ Exposed on port **8080** via Prometheus endpoint in cardano-db-sync.
- `dbsync_ledger_snapshot_duration_seconds` - Time to save ledger snapshot
- `dbsync_ledger_events_processed_total` - Ledger events processed

## Useful Queries

### PostgreSQL Performance

**Buffer cache hit ratio** (should be >99%):
```promql
rate(pg_stat_database_blks_hit[5m]) /
  (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))
```

**Transaction rate**:
```promql
rate(pg_stat_database_xact_commit[5m])
```

**Database growth rate** (bytes per second):
```promql
deriv(pg_database_size_bytes[5m])
```

**Dead tuple percentage** (high values indicate need for VACUUM):
```promql
pg_stat_user_tables_n_dead_tup /
  (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)
```

### System Performance

**CPU usage percentage**:
```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Memory usage percentage**:
```promql
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
```

**Disk I/O rate** (MB/s):
```promql
(rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])) / 1024 / 1024
```

**Filesystem usage percentage**:
```promql
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

### Cardano DB Sync (once implemented)

**Sync lag** (difference between chain tip and synced block):
```promql
chain_tip_block_height - dbsync_block_height
```

**Block processing rate** (blocks per second):
```promql
rate(dbsync_block_height[5m])
```

**Cache hit ratio**:
```promql
rate(dbsync_cache_hits_total[5m]) /
  (rate(dbsync_cache_hits_total[5m]) + rate(dbsync_cache_misses_total[5m]))
```

**Memory growth rate**:
```promql
deriv(dbsync_memory_heap_size_bytes[5m])
```

**GC overhead**:
```promql
rate(dbsync_memory_gc_cpu_seconds[5m]) /
  (rate(dbsync_memory_gc_cpu_seconds[5m]) + rate(process_cpu_seconds_total[5m]))
```

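Any of the queries above can also be evaluated outside the Prometheus UI through the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not part of this repository: the server address `http://localhost:9090` and the helper names `build_query_url`, `parse_instant_vector`, and `run_query` are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable on its default port on the local machine.
PROMETHEUS_URL = "http://localhost:9090"

# The buffer cache hit ratio query from the PostgreSQL section.
CACHE_HIT_RATIO = (
    "rate(pg_stat_database_blks_hit[5m]) / "
    "(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' sorted label pairs to its sample value as a float."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    return {
        tuple(sorted(series["metric"].items())): float(series["value"][1])
        for series in data["result"]
    }


def run_query(promql: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run the query against a live server (requires network access)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling `run_query(CACHE_HIT_RATIO)` against a running server should return one entry per series, keyed by its labels. Sample values arrive as strings in the API response, which is why the parser converts them with `float`.
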
## Alerting Rules (Examples)

These can be added to Prometheus alerting rules:

```yaml
groups:
  - name: cardano-db-sync
    rules:
      # High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Available memory is below 10%"

      # Database size growing too fast
      - alert: RapidDatabaseGrowth
        expr: deriv(pg_database_size_bytes[1h]) > 1073741824 / 3600  # 1 GiB/hour, in bytes/sec
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Database growing rapidly"
          description: "Database growing at {{ $value | humanize }} bytes/sec"

      # Too many dead tuples
      - alert: HighDeadTuples
        expr: |
          (pg_stat_user_tables_n_dead_tup /
            (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) > 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "High dead tuple ratio on table {{ $labels.relname }}"
          description: "Dead tuples exceed 20%, consider VACUUM"

      # Low buffer cache hit ratio
      - alert: LowBufferCacheHitRatio
        expr: |
          (rate(pg_stat_database_blks_hit[5m]) /
            (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))) < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low PostgreSQL buffer cache hit ratio"
          description: "Cache hit ratio is {{ $value | humanizePercentage }}"
```

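PromQL functions such as `rate` and `deriv` return per-second values, so a growth budget stated per hour has to be divided by 3600 before it is used as an alert threshold. The sketch below only illustrates that unit conversion; the helper names `per_second_threshold` and `hourly_growth` are hypothetical and do not come from this repository.

```python
SECONDS_PER_HOUR = 3600
GIB = 1024 ** 3  # 1073741824 bytes


def per_second_threshold(bytes_per_hour: float) -> float:
    """Convert an hourly growth budget into the bytes/sec value PromQL compares against."""
    return bytes_per_hour / SECONDS_PER_HOUR


def hourly_growth(bytes_per_second: float) -> float:
    """Convert a per-second growth rate back into bytes per hour."""
    return bytes_per_second * SECONDS_PER_HOUR


threshold = per_second_threshold(1 * GIB)
print(f"1 GiB/hour = {threshold:.0f} bytes/sec")  # prints "1 GiB/hour = 298262 bytes/sec"
```

Comparing a per-second expression against the raw value 1073741824 would instead fire only at roughly 1 GiB per second.
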
## See Also

- [Prometheus Query Documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
