update

ntk148v · ntk148v · commit 4b81d158742f · 2025-12-17T09:37:35.000+07:00
diff --git a/victoriametrics/vs-prometheus.md b/victoriametrics/vs-prometheus.md
@@ -61,44 +61,47 @@ While both systems share the same goal (ingesting and querying metrics), their i
 
 The most critical difference lies in how they map a Label (e.g., `pod="A"`) to the data on disk.
 
-### Prometheus: The Classic Inverted Index
+### 1.1. The inverted index & block archiecture
 
-Prometheus uses a search-engine style index designed for **immutable blocks**.
+Prometheus uses a TSDB (Time Series Database) model where data is partitioned into non-overlapping blocks of time (initially 2h, compacting into larger ranges).
 
-- **Structure:**
-  - **Posting Lists:** A sorted list of Series IDs for every label value.
-  - **Offset Table:** A lookup table pointing to where the Posting List begins on disk.
-- **The Workflow:**
-  - **Write:** New series live in the **Head Block** (RAM). When this block flushes (every 2h), Prometheus must "stop the world" to rewrite the entire index sequentially.
-  - **High Churn Penalty:** If you add 100k ephemeral pods, the Head Block explodes in size because it must track every mapping in memory. Flushing requires rewriting the entire Posting List structure, leading to massive I/O and CPU spikes.
+**Index Structure**
 
-### VictoriaMetrics: The Key-Value LSM Tree
+- Inverted Index: The core structure is an inverted index mapping `Label pairs → List of Series IDs (Postings)`.
+- Postings Lists: These are sorted lists of Series IDs. To find data for `app="frontend" AND status="500"`, Prometheus fetches the postings list for both labels and performs a linear intersection (O(N) complexity relative to the number of matching series).
+- Symbol Table: All string pairs (Label Name + Value) are interned in a symbol table to save space, but this table grows globally within a block.
 
-VictoriaMetrics does **not** use a separate index file format. It treats index entries as simple Key-Value pairs in an LSM tree (the `MergeSet`).
+**The "High Churn" Weakness**
 
-- **Structure:**
-  - **Keys:** `Prefix + LabelName + LabelValue + MetricID`.
-  - **Storage:** These keys are appended to a log structure and sorted in the background.
-- **The Workflow:**
-  - **Write:** Adding a new series is just an **append-only** operation. VM writes a new key to the end of the LSM tree.
-  - **High Churn Advantage:** There is no "read-modify-write" penalty. Creating 100k new pods just means writing 100k small keys. The heavy lifting (sorting/merging) happens lazily in the background.
+In Kubernetes environments with high churn (e.g., generic jobs or HPA-driven scaling), every new pod ID creates a new Time Series.
 
-<!-- end list -->
+- Index Bloat: Even if a pod lives for 5 minutes, its series ID and labels remain in the index for the entire duration of the block (and subsequent compacted blocks until retention expires).
+- Memory Pressure: Prometheus keeps the "Head" (active) block in memory. High churn inflates the Head block's index, leading to OOM kills.
+- Compaction Spikes: When merging blocks (e.g., 2h -> 8h), the index must be rewritten. Merging massive posting lists requires significant CPU and memory, often causing "compaction storms."
 
-```mermaid
-flowchart TD
-    subgraph "Prometheus: Rigid Structure"
-        A[Label: pod='A'] -->|Lookup Offset| B[Posting List]
-        B -->|Contains| C[SeriesID_1, SeriesID_5, ...]
-        C -->|Must rewrite entire list on update| D[High Churn = Pain]
-    end
+### 1.2. VictoriaMetrics: IndexDB & rotation strategy
 
-    subgraph "VictoriaMetrics: LSM Stream"
-        X[New Series: pod='A'] -->|Append Key| Y[LSM Tree Log]
-        Y -->|Key: pod=A+MetricID_1| Z[Background Merge]
-        Z -->|No Read penalty on Write| W[High Churn = Easy]
-    end
-```
+VictoriaMetrics decouples the index from the data blocks more aggressively than Prometheus. It uses a component called `IndexDB` which is essentially an embedded database optimized for key-value lookups with LSM-like properties.
+
+**Index Structure: The "MergeSet"**
+
+VM stores index entries in a `MergeSet` (LSM-tree variant). This allows it to handle high write rates (inserting new series) more efficiently than Prometheus's Head block structures.
+
+- TSID (Time Series ID): VM assigns a specialized internal TSID (containing MetricGroupID, JobID, etc.) to every series.
+- Rotation (The "Churn Killer"): Unlike Prometheus, which maintains one monolithic index per block, VM rotates its `IndexDB`:
+    - Daily Indexes: VM maintains prev, current, and next index structures.
+    - Per-Day Inverted Index: It stores separate mappings for Date + Label -> MetricID.
+      -Impact: If you have high churn on Tuesday, those millions of short-lived series are isolated to Tuesday's index. Queries for Wednesday do not need to scan or load the "polluted" index from Tuesday. This provides O(1) complexity for churn regarding retention time, whereas Prometheus is O(T) (churn accumulates over time).
+
+**Lookup optimization**
+
+VM uses a multi-level lookup to reduce disk I/O:
+
+- Global Index: Label -> MetricID (for low-cardinality, stable metrics).
+- Per-Day Index: Date + Label -> MetricID (for high-churn/ephemeral metrics).
+- MetricID -> TSID: Final mapping to the physical data location.
+
+This "Per-Day" optimization is why VM outperforms Prometheus in high-cardinality lookups over long time ranges. It can skip entire days of index data if the query window doesn't overlap, whereas Prometheus often has to touch index structures that contain irrelevant series.
 
 ## 2. The I/O Path: WAL vs. Compressed Parts
 
@@ -108,17 +111,17 @@ This explains why VictoriaMetrics has "smoother" disk usage despite flushing mor
 
 - **Strategy:** Immediate durability via a WAL file.
 - **Write Pattern:** Every sample is appended to the WAL file on disk.
-  - **Compression:** None/Low (for speed).
-  - **Payload:** Large (Raw bytes).
+    - **Compression:** None/Low (for speed).
+    - **Payload:** Large (Raw bytes).
 - **Fsync:** Infrequent (usually on segment rotation or checkpoint). Relying on OS page cache.
 - **Consequence:** High "Write Amplification" during compaction. The disk is constantly busy writing raw data, and then gets hammered every 2 hours when the Head Block flushes.
 
 ### VictoriaMetrics: The Buffered Flush
 
 - **Strategy:** Periodic durability via compressed micro-parts.
 - **Write Pattern:** Data is buffered in RAM (`inmemoryPart`) and flushed every \~1-5 seconds.
-  - **Compression:** High (ZSTD-like + Gorilla).
-  - **Payload:** Tiny (Data is compressed _before_ writing).
+    - **Compression:** High (ZSTD-like + Gorilla).
+    - **Payload:** Tiny (Data is compressed _before_ writing).
 - **Fsync:** **Frequent** (Every flush).
 - **Consequence:** Even though VM calls `fsync` every few seconds, the **payload is so small** (50KB vs 2MB) that modern SSDs handle it effortlessly. This avoids the "Stop the World" I/O spikes seen in Prometheus.
 
@@ -132,15 +135,3 @@ This explains why VictoriaMetrics has "smoother" disk usage despite flushing mor
 | **CPU Usage**  | **High.** Uses 1 Goroutine per target. Garbage Collector struggles with millions of small objects in Head. | **Optimized.** Uses a fixed-size worker pool. Optimized code reduces memory allocations, lightening the load on the GC. |
 | **Disk Space** | **Standard.** \~1.5 bytes per sample.                                                                      | **Ultra-Low.** \~0.4 bytes per sample. Precision reduction + better compression algorithms.                             |
 | **Operation**  | **Spiky.** Periodic heavy loads (Compaction/GC).                                                           | **Smooth.** Continuous small background merges.                                                                         |
-
-## Summary Recommendation
-
-- **Stick with Prometheus if:** You have a small-to-medium static environment, you need 100% standard adherence, and you cannot tolerate even 1 second of data loss on a crash.
-- **Switch to VictoriaMetrics if:**
-  1.  **High Churn:** You run Kubernetes with frequent deployments or auto-scaling.
-  2.  **Long Retention:** You need to store months/years of data cheaply.
-  3.  **Performance Issues:** Your Prometheus is OOMing or using too much CPU.
-
-### Final "Under the Hood" Visualization
-
-This graph (referenced from our earlier discussion) summarizes the reality: Prometheus shows a "Sawtooth" pattern of resource usage (building up to a flush), whereas VictoriaMetrics shows a "Flat" line, making it far easier to capacity plan for production clusters.