[Feature]: Add dedicated metrics for Snapshot feature #47097

@zhuwenxing

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

The Snapshot feature (introduced in #44358) currently relies only on the generic Proxy-level metrics milvus_proxy_function_call_total and milvus_proxy_req_latency. These track API call counts and latency but provide no snapshot-specific operational insight.

Current limitations:

  1. Cannot monitor the total number of snapshots across collections
  2. Cannot track snapshot storage consumption
  3. Cannot observe restore operation progress in real time via metrics
  4. Cannot measure snapshot creation/restore duration at the DataCoord level
  5. No visibility into snapshot-referenced data that cannot be garbage collected

This makes it difficult to:

  • Set up alerting for snapshot storage growth
  • Monitor restore job progress via Grafana dashboards
  • Understand the storage impact of snapshots on object storage costs
  • Troubleshoot snapshot-related performance issues

Describe the solution you'd like

Add dedicated Prometheus metrics for the Snapshot feature:

Snapshot Inventory Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| milvus_snapshot_total | Gauge | collection_id, db_name | Total number of snapshots |
| milvus_snapshot_storage_bytes | Gauge | collection_id, snapshot_name | Storage size of snapshot data |
| milvus_snapshot_referenced_storage_bytes | Gauge | collection_id | Storage size of data referenced by snapshots (cannot be GC'd) |
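
To make the intended shape of these gauges concrete, here is a minimal sketch in Python that aggregates a hypothetical in-memory snapshot list into the three proposed series and renders them in the Prometheus text exposition format. The `Snapshot` record, its fields, and the rule that a segment shared by several snapshots is counted once in the referenced-storage gauge are assumptions for illustration, not Milvus data structures or confirmed semantics.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    # Hypothetical record; field names are illustrative, not Milvus types.
    name: str
    collection_id: int
    db_name: str
    segment_bytes: dict = field(default_factory=dict)  # segment_id -> size in bytes


def render_inventory_metrics(snapshots):
    """Render the three proposed inventory gauges as Prometheus text lines."""
    totals = defaultdict(int)       # (collection_id, db_name) -> snapshot count
    referenced = defaultdict(dict)  # collection_id -> {segment_id: bytes}
    lines = []
    for s in snapshots:
        totals[(s.collection_id, s.db_name)] += 1
        # Assumption: a segment referenced by several snapshots is counted once.
        referenced[s.collection_id].update(s.segment_bytes)
        lines.append(
            f'milvus_snapshot_storage_bytes{{collection_id="{s.collection_id}",'
            f'snapshot_name="{s.name}"}} {sum(s.segment_bytes.values())}'
        )
    for (cid, db), count in sorted(totals.items()):
        lines.append(
            f'milvus_snapshot_total{{collection_id="{cid}",db_name="{db}"}} {count}'
        )
    for cid, segments in sorted(referenced.items()):
        lines.append(
            f'milvus_snapshot_referenced_storage_bytes{{collection_id="{cid}"}} '
            f'{sum(segments.values())}'
        )
    return lines
```

Under this assumption the per-snapshot sizes can add up to more than the referenced total, because overlapping segments appear in each snapshot's size but only once in the collection-level gauge.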

Snapshot Operation Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| milvus_snapshot_create_duration_seconds | Histogram | collection_id, status | Time to create a snapshot |
| milvus_snapshot_restore_duration_seconds | Histogram | collection_id, status | Time to restore a snapshot |
| milvus_snapshot_restore_progress_ratio | Gauge | job_id, snapshot_name | Restore progress (0.0 - 1.0) |
| milvus_snapshot_restore_jobs_total | Gauge | state | Number of restore jobs by state (pending/in_progress/completed/failed) |
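
As a rough illustration of how the two restore gauges could be derived, the following Python sketch computes a clamped progress ratio and a per-state job count. The `RestoreJob` record and the idea of measuring progress as restored segments over total segments are assumptions; the issue does not specify how progress is calculated.

```python
from collections import Counter
from dataclasses import dataclass

# States listed in the proposal for milvus_snapshot_restore_jobs_total.
STATES = ("pending", "in_progress", "completed", "failed")


@dataclass
class RestoreJob:
    # Hypothetical job record; field names are illustrative only.
    job_id: int
    snapshot_name: str
    state: str
    restored_segments: int
    total_segments: int


def progress_ratio(job):
    """Return restore progress clamped to the range [0.0, 1.0]."""
    if job.total_segments <= 0:
        # Nothing to restore: report 1.0 only once the job has completed.
        return 1.0 if job.state == "completed" else 0.0
    return min(1.0, max(0.0, job.restored_segments / job.total_segments))


def jobs_by_state(jobs):
    """Count jobs per state, emitting zero for states with no jobs."""
    counts = Counter(job.state for job in jobs)
    return {state: counts.get(state, 0) for state in STATES}
```

Emitting an explicit zero for every state keeps dashboard panels continuous. Because job_id is a label on the progress gauge, an implementation would probably also need to delete the series for finished jobs to keep label cardinality bounded.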

Snapshot Error Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| milvus_snapshot_operation_errors_total | Counter | operation, error_type | Total snapshot operation errors |
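
One way the duration histograms and the error counter could be fed from the same call site is sketched below in Python. It uses plain in-memory containers rather than a real Prometheus client, and the `observe` helper, the "success"/"fail" status values, and the use of the exception class name as error_type are all assumptions, not part of the proposal.

```python
import time
from collections import Counter, defaultdict

# In-memory stand-ins for the proposed metric families.
durations = defaultdict(list)  # (operation, collection_id, status) -> [seconds]
errors = Counter()             # (operation, error_type) -> count


def observe(operation, collection_id, fn, clock=time.monotonic):
    """Run fn, record its duration with a status label, and count failures."""
    start = clock()
    try:
        result = fn()
    except Exception as exc:
        # Assumption: error_type is the exception class name.
        errors[(operation, type(exc).__name__)] += 1
        durations[(operation, collection_id, "fail")].append(clock() - start)
        raise
    durations[(operation, collection_id, "success")].append(clock() - start)
    return result
```

Recording the duration on both paths means failed operations still show up in the latency distribution, separated by the status label.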

Describe an alternate solution

  1. Extend existing metrics: Add operation_type=snapshot_* labels to existing DataCoord metrics instead of creating new metric families.

  2. Expose via API only: Keep metrics lightweight and expose detailed snapshot statistics only via DescribeSnapshot API responses, letting users build custom exporters.

Anything else? (Additional Context)

User Guide Note: The snapshot user guide mentions "Monitoring: Track snapshot creation times and storage usage" as a best practice, but currently there's no built-in way to do this via metrics.

Labels: kind/feature (Issues related to feature request from users)
