File: victoriametrics/vs-prometheus.md (37 additions, 46 deletions)
@@ -61,44 +61,47 @@ While both systems share the same goal (ingesting and querying metrics), their i
61
61
62
62
The most critical difference lies in how they map a Label (e.g., `pod="A"`) to the data on disk.
63
63
64
-
### Prometheus: The Classic Inverted Index
64
+
### 1.1. The inverted index & block architecture
65
65
66
-
Prometheus uses a search-engine style index designed for **immutable blocks**.
66
+
Prometheus uses a TSDB (Time Series Database) model where data is partitioned into non-overlapping blocks of time (initially 2h, compacting into larger ranges).
67
67
68
-
-**Structure:**
69
-
-**Posting Lists:** A sorted list of Series IDs for every label value.
70
-
-**Offset Table:** A lookup table pointing to where the Posting List begins on disk.
71
-
-**The Workflow:**
72
-
-**Write:** New series live in the **Head Block** (RAM). When this block flushes (every 2h), Prometheus must "stop the world" to rewrite the entire index sequentially.
73
-
-**High Churn Penalty:** If you add 100k ephemeral pods, the Head Block explodes in size because it must track every mapping in memory. Flushing requires rewriting the entire Posting List structure, leading to massive I/O and CPU spikes.
68
+
**Index Structure**
74
69
75
-
### VictoriaMetrics: The Key-Value LSM Tree
70
+
- Inverted Index: The core structure is an inverted index mapping `Label pairs → List of Series IDs (Postings)`.
71
+
- Postings Lists: These are sorted lists of Series IDs. To find data for `app="frontend" AND status="500"`, Prometheus fetches the postings list for each of the two label pairs and intersects them with a linear merge (O(N) in the combined length of the lists, not just in the number of matching series).
72
+
- Symbol Table: All label strings (names and values) are interned in a symbol table to save space, but the table is shared across the whole block and grows with every distinct string seen in that block.
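
To make the postings lookup concrete, here is a minimal sketch of intersecting two sorted postings lists with a two-pointer merge. The `index` dictionary and its series IDs are made up for illustration, and the function is a simplification, not Prometheus's actual implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy index: (label name, label value) -> sorted list of series IDs.
Postings = Dict[Tuple[str, str], List[int]]


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted postings lists with a two-pointer merge."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return out


index: Postings = {
    ("app", "frontend"): [1, 4, 7, 9, 12],
    ("status", "500"): [2, 4, 9, 15],
}

matching = intersect(index[("app", "frontend")], index[("status", "500")])
print(matching)  # [4, 9]
```

Each pointer only moves forward, so the work is bounded by the combined length of the two lists. That is why very long postings lists (for example, a label shared by millions of churned series) make every matching query more expensive.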
76
73
77
-
VictoriaMetrics does **not** use a separate index file format. It treats index entries as simple Key-Value pairs in an LSM tree (the `MergeSet`).
-**Storage:** These keys are appended to a log structure and sorted in the background.
82
-
-**The Workflow:**
83
-
-**Write:** Adding a new series is just an **append-only** operation. VM writes a new key to the end of the LSM tree.
84
-
-**High Churn Advantage:** There is no "read-modify-write" penalty. Creating 100k new pods just means writing 100k small keys. The heavy lifting (sorting/merging) happens lazily in the background.
76
+
In Kubernetes environments with high churn (e.g., generic jobs or HPA-driven scaling), every new pod ID creates a new set of time series, one for each metric the pod exposes.
85
77
86
-
<!-- end list -->
78
+
- Index Bloat: Even if a pod lives for 5 minutes, its series ID and labels remain in the index for the entire duration of the block (and subsequent compacted blocks until retention expires).
79
+
- Memory Pressure: Prometheus keeps the "Head" (active) block in memory. High churn inflates the Head block's index, which can lead to OOM kills.
80
+
- Compaction Spikes: When merging blocks (e.g., 2h -> 8h), the index must be rewritten. Merging massive posting lists requires significant CPU and memory, often causing "compaction storms."
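
A rough back-of-the-envelope sketch of the index bloat described above. All inputs (100 concurrent pods, 50 series per pod, 5-minute pod lifetime) are hypothetical values chosen for illustration; only the 2h block length comes from the text.

```python
def series_in_block(block_minutes: int, pod_lifetime_minutes: int,
                    concurrent_pods: int, series_per_pod: int) -> int:
    """Count the distinct series a block's index must hold when every pod
    is replaced by a new one (with a new pod ID) at the end of its lifetime."""
    generations = block_minutes // pod_lifetime_minutes
    return generations * concurrent_pods * series_per_pod


# Hypothetical workload: 100 pods alive at any moment, 50 series each.
active_series = 100 * 50
indexed_series = series_in_block(
    block_minutes=120,        # 2h block, as described above
    pod_lifetime_minutes=5,   # short-lived pods
    concurrent_pods=100,
    series_per_pod=50,
)

print(active_series)                     # 5000
print(indexed_series)                    # 120000
print(indexed_series // active_series)   # 24
```

Under these assumptions the block indexes 24 times more series than are ever active at once, even though the amount of sample data is the same as for 5,000 long-lived series.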
X[New Series: pod='A'] -->|Append Key| Y[LSM Tree Log]
98
-
Y -->|Key: pod=A+MetricID_1| Z[Background Merge]
99
-
Z -->|No Read penalty on Write| W[High Churn = Easy]
100
-
end
101
-
```
84
+
VictoriaMetrics decouples the index from the data blocks more aggressively than Prometheus. It uses a component called `IndexDB` which is essentially an embedded database optimized for key-value lookups with LSM-like properties.
85
+
86
+
**Index Structure: The "MergeSet"**
87
+
88
+
VM stores index entries in a `MergeSet` (LSM-tree variant). This allows it to handle high write rates (inserting new series) more efficiently than Prometheus's Head block structures.
89
+
90
+
- TSID (Time Series ID): VM assigns a specialized internal TSID (containing MetricGroupID, JobID, etc.) to every series.
91
+
- Rotation (The "Churn Killer"): Unlike Prometheus, which maintains one monolithic index per block, VM rotates its `IndexDB`:
92
+
- Daily Indexes: VM maintains prev, current, and next index structures.
93
+
- Per-Day Inverted Index: It stores separate mappings for Date + Label -> MetricID.
94
+
- Impact: If you have high churn on Tuesday, those millions of short-lived series are isolated to Tuesday's index. Queries for Wednesday do not need to scan or load the "polluted" index from Tuesday. The index cost of churn is therefore roughly O(1) with respect to retention time, whereas in Prometheus it grows with the time range as churn accumulates (O(T)).
95
+
96
+
**Lookup optimization**
97
+
98
+
VM uses a multi-level lookup to reduce disk I/O:
99
+
100
+
- Global Index: Label -> MetricID (for low-cardinality, stable metrics).
- MetricID -> TSID: Final mapping to the physical data location.
103
+
104
+
This "Per-Day" optimization is why VM outperforms Prometheus in high-cardinality lookups over long time ranges. It can skip entire days of index data if the query window doesn't overlap, whereas Prometheus often has to touch index structures that contain irrelevant series.
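
The per-day idea can be sketched as follows. `PerDayIndex` is a hypothetical toy class, not VictoriaMetrics code: it keys entries by `(day, label name, label value)` so that a lookup only visits the days inside the query window.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Set, Tuple


class PerDayIndex:
    """Toy inverted index keyed by (day, label name, label value)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[date, str, str], Set[int]] = defaultdict(set)

    def add(self, day: date, labels: Dict[str, str], metric_id: int) -> None:
        """Register a series as active on the given day."""
        for name, value in labels.items():
            self._entries[(day, name, value)].add(metric_id)

    def lookup(self, start: date, end: date, name: str, value: str) -> Set[int]:
        """Return metric IDs matching name=value on any day in [start, end]."""
        found: Set[int] = set()
        day = start
        while day <= end:
            # Only keys for days inside the window are ever touched.
            found |= self._entries.get((day, name, value), set())
            day += timedelta(days=1)
        return found


tuesday = date(2024, 1, 2)
wednesday = date(2024, 1, 3)

idx = PerDayIndex()
# One stable series that is active on both days.
idx.add(tuesday, {"app": "batch"}, metric_id=0)
idx.add(wednesday, {"app": "batch"}, metric_id=0)
# 1000 short-lived series that only exist on Tuesday.
for metric_id in range(1, 1001):
    idx.add(tuesday, {"app": "batch"}, metric_id=metric_id)

print(len(idx.lookup(wednesday, wednesday, "app", "batch")))  # 1
print(len(idx.lookup(tuesday, wednesday, "app", "batch")))    # 1001
```

The Wednesday-only query never sees Tuesday's 1,000 churned series, which is the isolation effect described above. A real implementation stores these keys in a sorted on-disk structure rather than a Python dictionary.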
102
105
103
106
## 2. The I/O Path: WAL vs. Compressed Parts
104
107
@@ -108,17 +111,17 @@ This explains why VictoriaMetrics has "smoother" disk usage despite flushing mor
108
111
109
112
-**Strategy:** Immediate durability via a WAL file.
110
113
-**Write Pattern:** Every sample is appended to the WAL file on disk.
111
-
-**Compression:** None/Low (for speed).
112
-
-**Payload:** Large (Raw bytes).
114
+
-**Compression:** None/Low (for speed).
115
+
-**Payload:** Large (Raw bytes).
113
116
- **Fsync:** Infrequent (usually on segment rotation or checkpoint); between syncs, writes rely on the OS page cache.
114
117
-**Consequence:** High "Write Amplification" during compaction. The disk is constantly busy writing raw data, and then gets hammered every 2 hours when the Head Block flushes.
115
118
116
119
### VictoriaMetrics: The Buffered Flush
117
120
118
121
-**Strategy:** Periodic durability via compressed micro-parts.
119
122
-**Write Pattern:** Data is buffered in RAM (`inmemoryPart`) and flushed every \~1-5 seconds.
120
-
-**Compression:** High (ZSTD-like + Gorilla).
121
-
-**Payload:** Tiny (Data is compressed _before_ writing).
123
+
-**Compression:** High (ZSTD-like + Gorilla).
124
+
-**Payload:** Tiny (Data is compressed _before_ writing).
122
125
- **Fsync:** **Frequent** (every flush).
123
126
- **Consequence:** Even though VM calls `fsync` every few seconds, the **payload is so small** (on the order of 50 KB versus 2 MB) that modern SSDs handle it effortlessly. This avoids the "Stop the World" I/O spikes seen in Prometheus.
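
The effect of compressing before writing can be illustrated with a small sketch. This is not VictoriaMetrics's on-disk format: `zlib` from the Python standard library stands in for ZSTD, simple timestamp delta encoding stands in for the Gorilla-style encoding, and the sample data is synthetic.

```python
import struct
import zlib
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in ms, value)


def encode_raw(samples: List[Sample]) -> bytes:
    """WAL-style payload: every sample written as 16 uncompressed bytes."""
    return b"".join(struct.pack("<qd", ts, value) for ts, value in samples)


def encode_part(samples: List[Sample]) -> bytes:
    """Buffered-flush payload: delta-encode timestamps, then compress.

    zlib is an assumption standing in for ZSTD.
    """
    prev_ts = 0
    chunks = []
    for ts, value in samples:
        chunks.append(struct.pack("<qd", ts - prev_ts, value))
        prev_ts = ts
    return zlib.compress(b"".join(chunks))


# Synthetic buffer: 1000 samples scraped every 15 seconds with a steady value.
samples = [(1_700_000_000_000 + i * 15_000, 42.0) for i in range(1000)]

raw = encode_raw(samples)
part = encode_part(samples)
print(f"raw: {len(raw)} bytes, compressed part: {len(part)} bytes")
```

Regularly spaced timestamps turn into a run of identical deltas, which a general-purpose compressor shrinks dramatically. Real metric values are less uniform than this synthetic series, so actual ratios will be lower.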
124
127
@@ -132,15 +135,3 @@ This explains why VictoriaMetrics has "smoother" disk usage despite flushing mor
132
135
|**CPU Usage**|**High.** Uses one goroutine per target. The garbage collector struggles with millions of small objects in the Head block. |**Optimized.** Uses a fixed-size worker pool. Optimized code reduces memory allocations, lightening the load on the GC. |
133
136
|**Disk Space**|**Standard.** \~1.5 bytes per sample. |**Ultra-Low.** \~0.4 bytes per sample. Precision reduction + better compression algorithms. |
134
137
|**Operation**|**Spiky.** Periodic heavy loads (Compaction/GC). |**Smooth.** Continuous small background merges. |
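
To put the bytes-per-sample figures from the table in perspective, here is a small worked calculation. The workload (1 million active series, a 15-second scrape interval, 30 days of retention) is a hypothetical example; only the 1.5 and 0.4 bytes-per-sample figures come from the table.

```python
def storage_gb(active_series: int, scrape_interval_s: int,
               retention_days: int, bytes_per_sample: float) -> float:
    """Estimate sample storage in GB (10^9 bytes), ignoring index overhead."""
    samples_per_series = retention_days * 86_400 // scrape_interval_s
    total_samples = active_series * samples_per_series
    return total_samples * bytes_per_sample / 1e9


# Hypothetical workload; bytes-per-sample values are taken from the table.
prometheus_gb = storage_gb(1_000_000, 15, 30, 1.5)
victoria_gb = storage_gb(1_000_000, 15, 30, 0.4)

print(f"Prometheus:      {prometheus_gb:.1f} GB")
print(f"VictoriaMetrics: {victoria_gb:.1f} GB")
```

With these assumptions the estimate comes to roughly 259 GB versus 69 GB. The gap widens linearly with retention, which is why the per-sample figure matters most for long-term storage.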
135
-
136
-
## Summary Recommendation
137
-
138
-
-**Stick with Prometheus if:** You have a small-to-medium static environment, you need 100% standard adherence, and you cannot tolerate even 1 second of data loss on a crash.
139
-
-**Switch to VictoriaMetrics if:**
140
-
1.**High Churn:** You run Kubernetes with frequent deployments or auto-scaling.
141
-
2.**Long Retention:** You need to store months/years of data cheaply.
142
-
3.**Performance Issues:** Your Prometheus is OOMing or using too much CPU.
143
-
144
-
### Final "Under the Hood" Visualization
145
-
146
-
This graph (referenced from our earlier discussion) summarizes the reality: Prometheus shows a "Sawtooth" pattern of resource usage (building up to a flush), whereas VictoriaMetrics shows a "Flat" line, making it far easier to capacity plan for production clusters.