chore: docs

killme2008 · killme2008 · commit f512ad2ac54f · 2026-04-21T18:52:09.000+08:00
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
diff --git a/README.md b/README.md
@@ -140,7 +140,7 @@ Writing to GreptimeDB from Node.js? The bulk path is the fastest option by a wid
 | `@opentelemetry/exporter-logs-otlp-proto` | 622k r/s     | 621k r/s     | 0.77×        |
 | `@influxdata/influxdb-client`             | 496k r/s     | 500k r/s     | 0.62×        |
 
-Same pre-generated data, same server, same Node.js runtime; each client is driven with its own default configuration. Arrow Flight ships the batch already-columnar so the server skips text/proto parsing and per-attribute column mapping.
+Same schema, same data generator, same server, same Node.js runtime; each client is driven with its own default configuration. Arrow Flight ships the batch already-columnar so the server skips text/proto parsing and per-attribute column mapping.
 
 On the 22-column log schema the bulk path reaches **~137k rows/s** (2M rows, batch=5000). Unary and streaming numbers, the exact SDK-usage decisions behind each bench, and reproduction commands: [docs/benchmarking.md](./docs/benchmarking.md).
 
diff --git a/bench/index.ts b/bench/index.ts
@@ -11,11 +11,17 @@ const forward = process.argv.slice(3);
 
 if (arg === undefined) {
   console.error(
-    'Usage: pnpm bench <name> [--rows=N --batch-size=N --parallelism=N --endpoint=host:port]',
+    'Usage: pnpm bench <name> [--rows=N --batch-size=N --parallelism=N --num-hosts=N ...]',
   );
   console.error(
     'Available: regular-api, stream-api, bulk-api, cpu-bulk-api, cpu-influxdb, cpu-otel',
   );
+  console.error(
+    'Network flags vary by bench: gRPC benches take --endpoint=host:port; cpu-influxdb /',
+  );
+  console.error(
+    'cpu-otel take --http-endpoint=URL plus --database / --user / --password. See docs/benchmarking.md.',
+  );
   process.exit(2);
 }
 
diff --git a/docs/benchmarking.md b/docs/benchmarking.md
@@ -62,7 +62,7 @@ Today's gap is Arrow JS single-thread encoding (`rowsToArrowTable` = 99% of clie
 
 ## Apples-to-apples: vs InfluxDB JS SDK & OpenTelemetry JS SDK
 
-Three benches share the CPU schema above and write the same pre-generated data through three JS clients, letting us isolate protocol/client overhead from schema effects. Ports are the GreptimeDB defaults: gRPC Bulk on `4001`, InfluxDB v2 and OTLP over HTTP on `4000`.
+Three benches share the CPU schema above and write pre-generated datasets with the same shape and cardinality (series layout + ms-stepped timestamps; Float64 values are re-rolled per run via `Math.random()`) through three JS clients, letting us isolate protocol/client overhead from schema effects. Ports are the GreptimeDB defaults: gRPC Bulk on `4001`, InfluxDB v2 and OTLP over HTTP on `4000`.
 
 - `cpu-bulk-api` — our own `@greptime/ingester`, Arrow Flight bulk path. Writes a proper time-series table: 4-tag composite PK + 5 Float64 fields + ms timestamp.
 - `cpu-influxdb` — `@influxdata/influxdb-client` v1.35, line protocol to `/v1/influxdb/api/v2/write`. GreptimeDB serves the InfluxDB v2 API natively; token is `"<user>:<password>"`. Writes the same tag/field shape; server parses LP and maps to the columnar path.
@@ -82,7 +82,7 @@ Takeaways:
 
 - Arrow Flight bulk wins by a comfortable margin: ~1.3× over OTLP and ~1.6× over InfluxDB LP. The advantage is on the server side: rows arrive as a ready-made Arrow columnar batch, no parsing or per-attribute promotion required.
 - OTLP with `greptime_identity` pays for OTLP proto decode + per-attribute column mapping on the server, plus HTTP/1.1 framing. Still beats InfluxDB LP, which pays for text parsing on top of the same column mapping.
-- Row count is verified after each run via `SELECT COUNT(*)` against the per-protocol table.
+- Row counts were spot-checked out-of-band with `SELECT COUNT(*)` on each per-protocol table; the bench scripts themselves do not run the verification query.
 - Even with `greptime_identity`, the OTel and bulk tables aren't strictly identical — the OTel table still carries log-model columns (`ScopeName`, `TraceId`, etc.) and has no `TAG`-marked primary key, so per-series semantics differ. The numbers here measure ingestion throughput only, not query-path parity.
 
 ### SDK usage notes
@@ -120,7 +120,7 @@ pnpm bench bulk-api --rows=2000000 --batch-size=5000 --endpoint=localhost:4001
 
 Available benchmark names: `regular-api`, `stream-api`, `bulk-api`, `cpu-bulk-api`, `cpu-influxdb`, `cpu-otel`. Shared flags:
 
-- `--rows=N` — total rows to push
+- `--rows=N` — target row count; rounded down to a multiple of `--batch-size` (benches send whole batches only)
 - `--batch-size=N` — per-`write()` batch
 - `--parallelism=N` — concurrent in-flight RPCs (bulk / cpu-\* benches; default 8)
 - `--num-hosts=N` — `cpu-*` benches only; cardinality = `N × 5 × 10 × 20` series (default 100 → 100k series; use 1000 for the blog's 1M-series config)
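
The arithmetic behind `--rows` and `--num-hosts` can be sketched as follows. This is a minimal illustration of the rounding and cardinality rules as the flag descriptions state them, not code from the bench scripts; the helper names `effectiveRows` and `seriesCardinality` are hypothetical.

```typescript
// Hypothetical helpers; names are illustrative and not taken from bench/*.ts.

/** Rows actually sent: the target rounded down to a whole number of batches. */
function effectiveRows(rows: number, batchSize: number): number {
  if (!Number.isInteger(rows) || rows < 0) {
    throw new RangeError("rows must be a non-negative integer");
  }
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  return Math.floor(rows / batchSize) * batchSize;
}

/** Series cardinality for the cpu-* benches: N x 5 x 10 x 20. */
function seriesCardinality(numHosts: number): number {
  return numHosts * 5 * 10 * 20;
}

console.log(effectiveRows(2_000_000, 5000)); // 2000000 (already a multiple)
console.log(effectiveRows(1_234_567, 5000)); // 1230000 (246 whole batches)
console.log(seriesCardinality(100));         // 100000
console.log(seriesCardinality(1000));        // 1000000
```

In practice this means a `--rows` value that is not a multiple of `--batch-size` reports throughput over slightly fewer rows than requested.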