Below are **8 focused, high-quality references** (official docs, deep dives, and case studies) that back the key points I used earlier: VictoriaMetrics’ column-oriented parts (timestamps/values), global inverted index (TSID/MetricID), efficient merges/deduplication, and comparative case studies showing resource savings versus Prometheus. I list each reference with a one-sentence summary of which claim it supports so you can jump straight to the most relevant material.
1. **VictoriaMetrics — official docs (home)** — central documentation for VM (architecture, components, deployment). Supports general architectural claims and links to component docs. ([docs.victoriametrics.com][1])
5. **VictoriaMetrics articles & performance tips (docs/blogs)** — collection of articles including string interning, performance techniques, and part layout (timestamps/values per part). Supports columnar time/value separation and memory optimizations. ([docs.victoriametrics.com][5])
# Comparative / independent analyses and case studies
6. **Prometheus — Storage (official docs)** — Prometheus TSDB block model, 2-hour blocks, per-block index; supports the discussion of Prometheus' per-block index and compaction behavior. ([prometheus.io][6])
8. **Independent write-ups / explainers (blogs, Medium, community posts)** — several third-party analyses summarizing VM internals, columnar parts, and comparative experiences (e.g., Medium posts, dbdb.io, Groundcover explainer). These provide independent corroboration and useful diagrams/interpretations. ([Medium][8])
_Use for:_ alternative explanations, diagrams, and community benchmarks.
---
# Quick mapping: which references support which major claims
- **VM stores timestamps & values in separate per-part files (column-like layout):** vmstorage blog + articles. ([VictoriaMetrics][3])
- **Prometheus uses 2-hour blocks with per-block indexes (leads to duplicated index data):** Prometheus storage docs. ([prometheus.io][6])
- **Real-world resource improvements after migrating from Prometheus to VM:** PingCAP TiDB migration case study. ([TiDB][7])
---
# If you want next steps
I can do any of the following (pick one and I’ll proceed):
- Compile a **downloadable pack** (PDF or Markdown) with excerpts and relevant snippets from the sources above (with full links and short quotes).
- Extract and compare **specific on-disk file formats** (exact file names, byte layout) from VM docs vs Prometheus TSDB format.
- Produce a **brief annotated benchmark** plan you can run (prometheus + remote_write → VM) with metrics to capture (disk, RAM, CPU, query latency) and expected outcomes.
Which of those would you like me to prepare now?
[1]: https://docs.victoriametrics.com/?utm_source=chatgpt.com "Welcome to VictoriaMetrics Docs"
[7]: https://www.pingcap.com/blog/tidb-observability-migrating-prometheus-victoriametrics/?utm_source=chatgpt.com "Moving from Prometheus to VictoriaMetrics"
[8]: https://alexmarket.medium.com/victoriametrics-a-look-inside-its-innards-b00eaa2a1e32?utm_source=chatgpt.com "VictoriaMetrics: a look inside its innards. | by Alex - Medium"
---
# Architecture Deep Dive: VictoriaMetrics vs. Prometheus
## Executive Summary
While both systems share the same goal (ingesting and querying metrics), their internal architectures optimize for completely different patterns.
* **Prometheus** is optimized for **fast recent reads**. It sacrifices memory and write-flexibility to keep the "Head" of data instantly accessible in RAM.
* **VictoriaMetrics** is optimized for **high-volume writes and storage efficiency**. It uses a Log-Structured Merge (LSM) tree approach (similar to ClickHouse or LevelDB) to decouple ingestion from storage, making it resilient to high churn and massive scale.
## 1. The Indexing Engine: Inverted Index vs. LSM Tree
The most critical difference lies in how they map a Label (e.g., `pod="A"`) to the data on disk.
### Prometheus: The Classic Inverted Index
Prometheus uses a search-engine style index designed for **immutable blocks**.
* **Structure:**
    * **Posting Lists:** A sorted list of Series IDs for every label value.
    * **Offset Table:** A lookup table pointing to where the Posting List begins on disk.
* **The Workflow:**
    * **Write:** New series live in the **Head Block** (RAM). When this block flushes (every 2h), Prometheus must "stop the world" to rewrite the entire index sequentially.
    * **High Churn Penalty:** If you add 100k ephemeral pods, the Head Block explodes in size because it must track every mapping in memory. Flushing requires rewriting the entire Posting List structure, leading to massive I/O and CPU spikes (sketched in the example below).
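To make the read-modify-write cost concrete, here is a minimal Go sketch of a posting-list-style index. The types and label strings are illustrative, not Prometheus' actual code; only the access pattern matters: keeping each list sorted means inserts shift existing entries, and a flush must serialize whole lists at once.

```go
package main

import (
	"fmt"
	"sort"
)

// PostingList maps a label pair (e.g. `pod="A"`) to a sorted list of
// series IDs. Illustrative only, not Prometheus' real data structures.
type PostingList map[string][]uint64

// addSeries inserts a series ID while keeping the list sorted. This is
// a read-modify-write: find the position, then shift every later
// element, so the cost grows with the size of the existing list.
func (p PostingList) addSeries(label string, id uint64) {
	ids := p[label]
	i := sort.Search(len(ids), func(j int) bool { return ids[j] >= id })
	ids = append(ids, 0)     // grow by one slot
	copy(ids[i+1:], ids[i:]) // shift the tail right
	ids[i] = id
	p[label] = ids
}

func main() {
	idx := PostingList{}
	// Simulate churn: 100k ephemeral pods all share one namespace label,
	// so that label's posting list must track every new series.
	for id := uint64(0); id < 100_000; id++ {
		idx.addSeries(`namespace="prod"`, id)
	}
	// A head-block flush must serialize each sorted list in full.
	fmt.Printf("flush rewrites %d postings for a single label\n",
		len(idx[`namespace="prod"`]))
}
```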
### VictoriaMetrics: The Key-Value LSM Tree
VictoriaMetrics does **not** use a separate index file format. It treats index entries as simple Key-Value pairs in an LSM tree (the `MergeSet`).
* **Storage:** These keys are appended to a log structure and sorted in the background.
* **The Workflow:**
    * **Write:** Adding a new series is just an **append-only** operation. VM writes a new key to the end of the LSM tree.
    * **High Churn Advantage:** There is no "read-modify-write" penalty. Creating 100k new pods just means writing 100k small keys. The heavy lifting (sorting/merging) happens lazily in the background (sketched after the diagram below).
```mermaid
graph TD
    subgraph "Prometheus: Inverted Index"
        A[New Series: pod='A'] -->|Update Posting List| B[Head Block Index in RAM]
        B -->|Flush every 2h| C[Sorted Posting Lists]
        C -->|Must rewrite entire list on update| D[High Churn = Pain]
    end

    subgraph "VictoriaMetrics: LSM Stream"
        X[New Series: pod='A'] -->|Append Key| Y[LSM Tree Log]
        Y -->|Key: pod=A+MetricID_1| Z[Background Merge]
        Z -->|No Read penalty on Write| W[High Churn = Easy]
    end
```
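Below is the same registration modeled as LSM-style appends, in a minimal Go sketch. The key encoding (`label=value`, a separator byte, then the metric ID) is a simplification for illustration, not VM's actual MergeSet format; what matters is that the write path never reads or rewrites existing entries.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry builds an opaque key for a MergeSet-style index.
// Hypothetical encoding for illustration, not VM's on-disk format.
func indexEntry(label, value string, metricID uint64) string {
	return fmt.Sprintf("%s=%s\x00%016x", label, value, metricID)
}

func main() {
	var part []string // an unsorted in-memory part

	// Registering a new series is a pure append: nothing existing is
	// read or rewritten, so churn costs O(1) work per new series.
	for id := uint64(1); id <= 3; id++ {
		part = append(part, indexEntry("pod", fmt.Sprintf("pod-%d", id), id))
	}

	// Sorting happens lazily in a background merge, off the write path.
	merged := append([]string(nil), part...)
	sort.Strings(merged)
	for _, key := range merged {
		fmt.Printf("%q\n", key)
	}
}
```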
## 2. The I/O Path: WAL vs. Compressed Parts
This explains why VictoriaMetrics has "smoother" disk usage despite flushing more frequently.
### Prometheus: The WAL (Write-Ahead Log)
* **Strategy:** Immediate durability via a WAL file.
* **Write Pattern:** Every sample is appended to the WAL file on disk.
* **Compression:** None/low (for speed).
* **Payload:** Large (raw bytes).
* **Fsync:** Infrequent (usually on segment rotation or checkpoint); relies on the OS page cache.
* **Consequence:** High "write amplification" during compaction. The disk is constantly busy writing raw data, and then gets hammered every 2 hours when the Head Block flushes.
### VictoriaMetrics: The Buffered Flush
* **Strategy:** Periodic durability via compressed micro-parts.
* **Write Pattern:** Data is buffered in RAM (`inmemoryPart`) and flushed every ~1-5 seconds.
* **Compression:** High (ZSTD-like + Gorilla).
* **Payload:** Tiny (data is compressed *before* writing).
* **Fsync:** **Frequent** (every flush).
* **Consequence:** Even though VM calls `fsync` every few seconds, the **payload is so small** (50 KB vs. 2 MB) that modern SSDs handle it effortlessly. This avoids the "stop the world" I/O spikes seen in Prometheus.
**Critical Trade-off:** VictoriaMetrics sacrifices the last ~5 seconds of data (held in the RAM buffer) in the event of a hard crash (`kill -9`) to achieve this massive I/O gain. Prometheus recovers everything via WAL replay.
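Here is a rough Go sketch of the two write paths described above. It uses stdlib `gzip` as a stand-in for VM's real codecs (Gorilla + ZSTD-like), and the batch size is an assumption; the point is that compressing a buffered batch before it hits disk shrinks the fsync payload dramatically compared to appending raw samples.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
)

// sample is a timestamp/value pair: 16 raw bytes on the WAL path.
type sample struct {
	TS  int64
	Val float64
}

// walAppend models the Prometheus path: each sample is appended raw
// (uncompressed), so the bytes hitting disk equal the raw size.
func walAppend(w *bytes.Buffer, s sample) {
	binary.Write(w, binary.LittleEndian, s)
}

// flushCompressed models the VM path: buffer a batch in RAM, compress
// it, then write (and fsync) one small part. gzip stands in for VM's
// actual codecs, so the exact ratio here is only illustrative.
func flushCompressed(batch []sample) int {
	var raw bytes.Buffer
	for _, s := range batch {
		walAppend(&raw, s)
	}
	var part bytes.Buffer
	zw := gzip.NewWriter(&part)
	zw.Write(raw.Bytes())
	zw.Close()
	return part.Len()
}

func main() {
	// One flush interval of a slowly changing gauge at 10k samples/s.
	// Real metrics are highly regular, which is why they compress well.
	batch := make([]sample, 10_000)
	for i := range batch {
		batch[i] = sample{TS: int64(1700000000000 + i), Val: 20.0 + float64(i%10)*0.1}
	}
	fmt.Printf("raw WAL payload:         %d bytes\n", len(batch)*16)
	fmt.Printf("compressed part payload: %d bytes\n", flushCompressed(batch))
}
```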
## 3. Resource Comparison

| Aspect | Prometheus | VictoriaMetrics |
| --- | --- | --- |
| **RAM Usage** | **High.** Scales with Active Series + Ingestion Rate (Head Block bloat). | **Low.** Scales with cache size. Writes don't require massive RAM buffers. |
| **CPU Usage** | **High.** Uses 1 goroutine per target. The garbage collector struggles with millions of small objects in the Head. | **Optimized.** Uses a fixed-size worker pool. Optimized code reduces memory allocations, lightening the load on the GC. |
| **Disk Space** | **Standard.** ~1.5 bytes per sample. | **Ultra-Low.** ~0.4 bytes per sample. Precision reduction + better compression algorithms. |
| **Operation** | **Spiky.** Periodic heavy loads (compaction/GC). | **Smooth.** Continuous small background merges. |
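Applying the per-sample figures from the table, a quick back-of-envelope calculation (the workload of 1M samples/s retained for 30 days is an assumed example, not a benchmark):

```go
package main

import "fmt"

func main() {
	// Hypothetical workload: 1M samples/s retained for 30 days,
	// using the ~1.5 vs ~0.4 bytes-per-sample figures from the table.
	const samplesPerSec = 1_000_000
	const seconds = 30 * 24 * 3600
	total := float64(samplesPerSec) * seconds
	fmt.Printf("Prometheus:      %.1f TB\n", total*1.5/1e12) // ~3.9 TB
	fmt.Printf("VictoriaMetrics: %.1f TB\n", total*0.4/1e12) // ~1.0 TB
}
```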
## Summary Recommendation
* **Stick with Prometheus if:** You have a small-to-medium static environment, you need 100% adherence to standard Prometheus behavior, and you cannot tolerate even 1 second of data loss on a crash.
* **Switch to VictoriaMetrics if:**
    1. **High Churn:** You run Kubernetes with frequent deployments or auto-scaling.
    2. **Long Retention:** You need to store months/years of data cheaply.
    3. **Performance Issues:** Your Prometheus is OOMing or using too much CPU.
### Final "Under the Hood" Visualization
This graph (referenced from our earlier discussion) summarizes the reality: Prometheus shows a "sawtooth" pattern of resource usage (building up to a flush), whereas VictoriaMetrics shows a "flat" line, making capacity planning for production clusters far easier.