Description
Is there an existing issue for this?
- I have searched the existing issues
Is your feature request related to a problem? Please describe.
The Snapshot feature (introduced in #44358) currently relies only on generic Proxy-level metrics (`milvus_proxy_function_call_total` and `milvus_proxy_req_latency`). These track API call counts and latency but provide no snapshot-specific operational insight.
Current limitations:
- Cannot monitor the total number of snapshots across collections
- Cannot track snapshot storage consumption
- Cannot observe restore operation progress in real time via metrics
- Cannot measure snapshot creation/restore duration at the DataCoord level
- No visibility into snapshot-referenced data that cannot be garbage collected
This makes it difficult to:
- Set up alerting for snapshot storage growth
- Monitor restore job progress via Grafana dashboards
- Understand the storage impact of snapshots on object storage costs
- Troubleshoot snapshot-related performance issues
Describe the solution you'd like
Add dedicated Prometheus metrics for the Snapshot feature:
Snapshot Inventory Metrics
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `milvus_snapshot_total` | Gauge | `collection_id`, `db_name` | Total number of snapshots |
| `milvus_snapshot_storage_bytes` | Gauge | `collection_id`, `snapshot_name` | Storage size of snapshot data |
| `milvus_snapshot_referenced_storage_bytes` | Gauge | `collection_id` | Storage size of data referenced by snapshots (cannot be GC'd) |
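
To make the difference between per-snapshot storage and referenced storage concrete, here is a minimal stdlib-only Go sketch. It assumes that snapshots reference segments by ID and that several snapshots may share a segment; the `Snapshot` type and `referencedStorageBytes` function are hypothetical names for illustration, not existing Milvus code.

```go
package main

import "fmt"

// Snapshot is a hypothetical, simplified view of a snapshot: just the IDs of
// the segments it references. It is not the actual Milvus type.
type Snapshot struct {
	Name       string
	SegmentIDs []int64
}

// referencedStorageBytes sums the size of every segment referenced by at
// least one snapshot. A segment shared by several snapshots is counted once,
// which is the assumption this sketch makes about "referenced" storage.
func referencedStorageBytes(snapshots []Snapshot, segmentSize map[int64]int64) int64 {
	seen := make(map[int64]struct{})
	var total int64
	for _, s := range snapshots {
		for _, id := range s.SegmentIDs {
			if _, ok := seen[id]; ok {
				continue // already counted via another snapshot
			}
			seen[id] = struct{}{}
			total += segmentSize[id] // unknown IDs contribute 0
		}
	}
	return total
}

func main() {
	sizes := map[int64]int64{1: 100, 2: 200, 3: 300}
	snaps := []Snapshot{
		{Name: "daily", SegmentIDs: []int64{1, 2}},
		{Name: "weekly", SegmentIDs: []int64{2, 3}},
	}
	// Segment 2 is shared by both snapshots and is only counted once.
	fmt.Println(referencedStorageBytes(snaps, sizes))
}
```

Under this assumption, summing `milvus_snapshot_storage_bytes` across snapshots could overstate the space held back from GC, which is why a separate deduplicated gauge may be useful.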
Snapshot Operation Metrics
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `milvus_snapshot_create_duration_seconds` | Histogram | `collection_id`, `status` | Time to create a snapshot |
| `milvus_snapshot_restore_duration_seconds` | Histogram | `collection_id`, `status` | Time to restore a snapshot |
| `milvus_snapshot_restore_progress_ratio` | Gauge | `job_id`, `snapshot_name` | Restore progress (0.0 - 1.0) |
| `milvus_snapshot_restore_jobs_total` | Gauge | `state` | Number of restore jobs by state (pending/in_progress/completed/failed) |
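
As a rough illustration of how the two restore gauges could be derived, the sketch below computes a clamped progress ratio and a per-state job count. The `RestoreJob` struct, its segment-based progress fields, and both function names are assumptions made for this example; the actual restore job representation in DataCoord may differ.

```go
package main

import "fmt"

// RestoreJob is a hypothetical restore job record, not the Milvus type.
// Progress is assumed to be measured in restored segments.
type RestoreJob struct {
	ID            int64
	State         string // "pending", "in_progress", "completed", or "failed"
	DoneSegments  int
	TotalSegments int
}

// progressRatio returns the job's progress clamped to [0.0, 1.0].
// A job with no known total reports 0 rather than dividing by zero.
func progressRatio(j RestoreJob) float64 {
	if j.TotalSegments <= 0 {
		return 0
	}
	r := float64(j.DoneSegments) / float64(j.TotalSegments)
	if r < 0 {
		return 0
	}
	if r > 1 {
		return 1
	}
	return r
}

// jobsByState counts jobs per state. The four states from the table are
// pre-seeded so that a gauge for an empty state would still report 0.
func jobsByState(jobs []RestoreJob) map[string]int {
	counts := map[string]int{"pending": 0, "in_progress": 0, "completed": 0, "failed": 0}
	for _, j := range jobs {
		counts[j.State]++
	}
	return counts
}

func main() {
	jobs := []RestoreJob{
		{ID: 1, State: "in_progress", DoneSegments: 1, TotalSegments: 4},
		{ID: 2, State: "completed", DoneSegments: 8, TotalSegments: 8},
		{ID: 3, State: "pending"},
	}
	for _, j := range jobs {
		fmt.Printf("job %d: ratio=%.2f\n", j.ID, progressRatio(j))
	}
	fmt.Println(jobsByState(jobs))
}
```

Pre-seeding every state is a deliberate choice in this sketch: dashboards and alerts tend to behave better when a series reads 0 than when it is absent.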
Snapshot Error Metrics
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `milvus_snapshot_operation_errors_total` | Counter | `operation`, `error_type` | Total snapshot operation errors |
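
The error counter's label scheme can be sketched without any metrics library. The example below is a stdlib-only stand-in, not the Prometheus client API: `ErrorCounter`, `Inc`, and `Lines` are hypothetical names, and the label values (`create`, `timeout`, and so on) are illustrative guesses. It renders samples in the Prometheus text exposition style, relying on Go's `%q` quoting, which is only adequate for simple ASCII label values.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// errorKey identifies one label combination of the proposed counter.
type errorKey struct {
	Operation string
	ErrorType string
}

// ErrorCounter is a hypothetical, concurrency-safe counter keyed by
// (operation, error_type). It only illustrates the label scheme.
type ErrorCounter struct {
	mu     sync.Mutex
	counts map[errorKey]uint64
}

// NewErrorCounter returns an empty counter.
func NewErrorCounter() *ErrorCounter {
	return &ErrorCounter{counts: make(map[errorKey]uint64)}
}

// Inc records one error for the given operation and error type.
func (c *ErrorCounter) Inc(operation, errorType string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[errorKey{operation, errorType}]++
}

// Lines renders one sample line per label combination, sorted so the
// output is deterministic.
func (c *ErrorCounter) Lines() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	lines := make([]string, 0, len(c.counts))
	for k, v := range c.counts {
		lines = append(lines, fmt.Sprintf(
			"milvus_snapshot_operation_errors_total{operation=%q,error_type=%q} %d",
			k.Operation, k.ErrorType, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	c := NewErrorCounter()
	c.Inc("create", "timeout")
	c.Inc("create", "timeout")
	c.Inc("restore", "not_found")
	for _, l := range c.Lines() {
		fmt.Println(l)
	}
}
```

Keeping `error_type` to a small, fixed set of values would help bound label cardinality; free-form error messages as label values would not.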
Describe an alternate solution
- Extend existing metrics: Add `operation_type=snapshot_*` labels to existing DataCoord metrics instead of creating new metric families.
- Expose via API only: Keep metrics lightweight and expose detailed snapshot statistics only via `DescribeSnapshot` API responses, letting users build custom exporters.
Anything else? (Additional Context)
- Related Feature Issue: [Feature]: Add Snapshot Functionality for Collections #44358
- Current metrics location: `pkg/metrics/proxy_metrics.go`
- Snapshot implementation: `internal/proxy/snapshot_impl.go`, `internal/datacoord/snapshot*.go`
User Guide Note: The snapshot user guide lists "Monitoring: Track snapshot creation times and storage usage" as a best practice, but there is currently no built-in way to do this via metrics.