MongoDB Tuning + Global Parallelism + Local Run without minio#264

Open
krinart wants to merge 4 commits into trunk from viktor/tuning


krinart (Contributor) commented Jun 15, 2026

ETL pipeline parallelism

Reworked how the ETL writes into the target so a single run can drive many concurrent writes instead of serializing per table. Each op-segment is split into fixed-size row chunks and dispatched against a shared concurrency budget, controlled by three env knobs:

| Env var | Default | Meaning |
| --- | --- | --- |
| SPICEBENCH_SINK_CHUNK_ROWS | 5000 | Rows per sink.write() call. 0 disables splitting (one write per op-segment). |
| SPICEBENCH_SINK_PARALLELISM_PER_TABLE | 1 | Max in-flight writes within a single table's batch. |
| SPICEBENCH_SINK_PARALLELISM | unbounded | Global ceiling on concurrent writes across all tables; protects the target from overload. |
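As a rough illustration of how these knobs could be read and applied, here is a minimal std-only sketch. The `SinkTuning` type, the `from_lookup` and `split_segment` helpers, and the fall-back-to-default behavior on unparsable values are assumptions made for this example, not the PR's actual code; only the env var names and defaults come from the table above.

```rust
use std::env;

/// Hypothetical holder for the three knobs (not the PR's real type).
#[derive(Debug, PartialEq)]
struct SinkTuning {
    chunk_rows: usize,     // 0 = no splitting
    per_table: usize,      // max in-flight writes per table
    global: Option<usize>, // None = unbounded
}

/// Parse a raw value, falling back to `default` when unset or invalid
/// (the fallback on invalid input is an assumption of this sketch).
fn parse_or<T: std::str::FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

impl SinkTuning {
    /// Build from any key lookup, so the logic is testable without
    /// touching the process environment.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        SinkTuning {
            chunk_rows: parse_or::<usize>(get("SPICEBENCH_SINK_CHUNK_ROWS"), 5000),
            // Clamp to at least 1 so a table can always make progress.
            per_table: parse_or::<usize>(get("SPICEBENCH_SINK_PARALLELISM_PER_TABLE"), 1).max(1),
            global: get("SPICEBENCH_SINK_PARALLELISM").and_then(|s| s.trim().parse().ok()),
        }
    }

    /// Read the knobs from the real environment.
    #[allow(dead_code)]
    fn from_env() -> Self {
        Self::from_lookup(|k| env::var(k).ok())
    }
}

/// Split one op-segment into chunks of at most `chunk_rows` rows.
/// `chunk_rows == 0` means "one write for the whole segment".
fn split_segment<T>(rows: &[T], chunk_rows: usize) -> Vec<&[T]> {
    if chunk_rows == 0 {
        return vec![rows];
    }
    rows.chunks(chunk_rows).collect()
}
```

With the defaults, a 12,000-row op-segment would be split into three writes (5000, 5000, 2000 rows); with SPICEBENCH_SINK_CHUNK_ROWS=0 it stays a single write.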

The two limits compose: per-table concurrency lets a hot table (e.g. lineitem) parallelize, while the global cap bounds the aggregate.
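One way to picture how the two limits compose is a pair of counting semaphores: each write takes a per-table permit first, then a global permit. The sketch below is std-only and purely illustrative; the `Semaphore` type, the `run_writes` function, the thread-per-chunk model, and the sleep standing in for sink.write() are all assumptions, and the real pipeline may well use an async runtime instead.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (std has none); illustrative only.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Arc<Self> {
        Arc::new(Semaphore { permits: Mutex::new(n), cv: Condvar::new() })
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Simulate `chunks_per_table[i]` writes for each table and return the
/// peak number of writes that were in flight at the same time.
fn run_writes(chunks_per_table: &[usize], per_table: usize, global: usize) -> usize {
    let global_sem = Semaphore::new(global);
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();

    for &chunks in chunks_per_table {
        let table_sem = Semaphore::new(per_table); // one budget per table
        for _ in 0..chunks {
            let g = Arc::clone(&global_sem);
            let t = Arc::clone(&table_sem);
            let inf = Arc::clone(&in_flight);
            let pk = Arc::clone(&peak);
            handles.push(thread::spawn(move || {
                t.acquire(); // per-table limit first
                g.acquire(); // then the shared global cap
                let now = inf.fetch_add(1, Ordering::SeqCst) + 1;
                pk.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for sink.write()
                inf.fetch_sub(1, Ordering::SeqCst);
                g.release();
                t.release();
            }));
        }
    }
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Acquiring the permits in a fixed order (table, then global) avoids lock-order deadlocks. In this model, two tables with a per-table limit of 4 and a global cap of 3 never exceed 3 concurrent writes, while a single table with a per-table limit of 2 never exceeds 2 regardless of the global cap.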

Values tuned per target in CI (a write's cost dictates the strategy):

Target CHUNK_ROWS PARALLELISM_PER_TABLE PARALLELISM (global)
spice_cloud (cayenne) 0 (no chunking) 1 unbounded
databricks-sql / lakebase 0 (no chunking) 1 unbounded
mongodb-streams 5000 64 64

Generic CDC metrics contract

A vendor-neutral way for any system adapter to report CDC replication performance back to the benchmark, instead of hard-coding MongoDB/cayenne specifics. It models the universal CDC pipeline as a small per-table record carried as an optional field on the metrics response:

pub struct CdcReplicationMetrics {
    /// Per-table cumulative counters at scrape time.
    pub per_table: Vec<CdcTableMetrics>,
}

pub struct CdcTableMetrics {
    pub table: String,            // "__all__" if the adapter has no per-table split
    pub source_wait_ms: Option<f64>,  // time blocked waiting for the source to deliver changes
    pub apply_ms: Option<f64>,        // time applying coalesced changes to the accelerator
    pub apply_count: Option<u64>,     // number of apply bursts
    pub linger_ms: Option<f64>,       // deliberate coalesce/linger time (batching cost)
    pub rows_applied: Option<u64>,    // row-level records applied (throughput denominator)
    pub bytes_applied: Option<u64>,   // bytes applied
}

The benchmark consumes this uniformly and emits it as per-table SUT gauges (sut_cdc_source_wait_ms, sut_cdc_apply_ms, sut_cdc_rows_applied, sut_cdc_bytes_applied, sut_cdc_linger_ms, and a derived sut_cdc_source_wait_pct), so the same dashboards (throughput, source-wait %) work whether the source is a MongoDB change stream, Postgres WAL, or anything added later. Adapters without per-table granularity report a single __all__ row.
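To make the gauge mapping concrete, here is a hedged sketch of how a consumer might flatten one `CdcTableMetrics` row into named gauges. The struct fields and gauge names are taken from the description above; the helper names and both formulas are assumptions. In particular, the PR does not state how sut_cdc_source_wait_pct is derived, so the sketch assumes it is source-wait time as a share of source-wait plus apply time.

```rust
#[derive(Debug, Clone, Default)]
pub struct CdcTableMetrics {
    pub table: String,
    pub source_wait_ms: Option<f64>,
    pub apply_ms: Option<f64>,
    pub apply_count: Option<u64>,
    pub linger_ms: Option<f64>,
    pub rows_applied: Option<u64>,
    pub bytes_applied: Option<u64>,
}

/// ASSUMED definition: source wait as a percentage of (wait + apply).
/// Returns None when either input is missing or the total is zero.
fn source_wait_pct(m: &CdcTableMetrics) -> Option<f64> {
    let wait = m.source_wait_ms?;
    let apply = m.apply_ms?;
    let total = wait + apply;
    if total <= 0.0 {
        return None;
    }
    Some(100.0 * wait / total)
}

/// ASSUMED throughput definition: rows applied per second of apply time.
fn apply_rows_per_sec(m: &CdcTableMetrics) -> Option<f64> {
    let rows = m.rows_applied? as f64;
    let apply_ms = m.apply_ms?;
    if apply_ms <= 0.0 {
        return None;
    }
    Some(rows / (apply_ms / 1000.0))
}

/// Flatten one row into (gauge name, table label, value) triples,
/// skipping any counter the adapter did not report.
fn to_gauges(m: &CdcTableMetrics) -> Vec<(&'static str, String, f64)> {
    let candidates = vec![
        ("sut_cdc_source_wait_ms", m.source_wait_ms),
        ("sut_cdc_apply_ms", m.apply_ms),
        ("sut_cdc_linger_ms", m.linger_ms),
        ("sut_cdc_rows_applied", m.rows_applied.map(|v| v as f64)),
        ("sut_cdc_bytes_applied", m.bytes_applied.map(|v| v as f64)),
        ("sut_cdc_source_wait_pct", source_wait_pct(m)),
    ];
    candidates
        .into_iter()
        .filter_map(|(name, v)| v.map(|v| (name, m.table.clone(), v)))
        .collect()
}
```

Because every counter is optional, an adapter that only knows its total row count can report a single `__all__` row with `rows_applied` set, and only that one gauge is emitted.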

Example: [screenshot, 2026-06-14 at 5:18:56 PM]

Offline / local ETL + checkpoints (no S3 needed)

Lets the whole benchmark run end-to-end against local files with no S3/MinIO dependency. New flags:

  • spicebench run --etl-source-archive <file.tar.zst> — use a locally generated data archive instead of downloading from S3 (--etl-bucket/--etl-endpoint no longer required).
  • spicebench run --checkpoint-local-dir <dir> — validate against a local directory of pre-computed checkpoints instead of pulling them from a bucket; combine with --validate-results for fully offline validation.
  • spicebench checkpoint --etl-source-archive <file.tar.zst> — generate checkpoints from a local archive and write the manifest to disk instead of uploading; --bucket becomes optional in this mode.
# generate data + checkpoints once, then run fully offline
spicebench generate --scale-factor 1 --output-archive data/sf1.tar.zst
spicebench checkpoint --scenario tpch --version 1.0 --etl-source-archive data/sf1.tar.zst --checkpoint-dir data/ckpt-sf1
spicebench run --scenario tpch --scale-factor 1 \
  --etl-source-archive data/sf1.tar.zst \
  --validate-results --checkpoint-local-dir data/ckpt-sf1 ...

MongoDB sink

Use insert_many(ordered=false) for batch ingestion
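For context on why the ordered flag matters: in MongoDB, an ordered batch insert stops at the first failing document, while an unordered one keeps going and reports the failures afterwards. The toy model below illustrates only that semantic difference using a set of unique ids; it does not call the MongoDB driver, and `insert_batch` and `BatchOutcome` are hypothetical names invented for this sketch.

```rust
use std::collections::HashSet;

/// Outcome of a simulated batch insert.
#[derive(Debug, PartialEq)]
struct BatchOutcome {
    inserted: usize,
    failed_indexes: Vec<usize>, // positions in the input batch that failed
}

/// Toy collection with a unique-id constraint; the only modeled failure
/// is a duplicate key. `ordered` mirrors the insert_many option.
fn insert_batch(existing: &mut HashSet<u64>, ids: &[u64], ordered: bool) -> BatchOutcome {
    let mut out = BatchOutcome { inserted: 0, failed_indexes: Vec::new() };
    for (i, id) in ids.iter().enumerate() {
        if existing.insert(*id) {
            out.inserted += 1;
        } else {
            out.failed_indexes.push(i);
            if ordered {
                break; // ordered: the rest of the batch is not attempted
            }
            // unordered: record the failure and keep inserting
        }
    }
    out
}
```

In this model, inserting ids [1, 2, 3] into a collection that already holds 2 lands one document when ordered and two when unordered, with index 1 reported as failed in both cases.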

krinart changed the title from "MongoDB Turning" to "MongoDB Tuning + Global Parallelism + Local Run without minio" Jun 15, 2026
krinart marked this pull request as ready for review June 15, 2026 00:41