MongoDB Tuning + Global Parallelism + Local Run without minio by krinart · Pull Request #264 · spiceai/spicebench

krinart · 2026-06-15T00:21:09Z

ETL pipeline parallelism

Reworked how the ETL writes into the target so a single run can drive many concurrent writes instead of serializing per table. Each op-segment is split into fixed-size row chunks and dispatched against a shared concurrency budget, controlled by three env knobs:

| Env var | Default | Meaning |
| --- | --- | --- |
| `SPICEBENCH_SINK_CHUNK_ROWS` | `5000` | Rows per `sink.write()` call. `0` disables splitting (one write per op-segment). |
| `SPICEBENCH_SINK_PARALLELISM_PER_TABLE` | `1` | Max in-flight writes within a single table's batch. |
| `SPICEBENCH_SINK_PARALLELISM` | unbounded | Global ceiling on concurrent writes across all tables; protects the target from overload. |

The two limits compose: per-table concurrency lets a hot table (e.g. lineitem) parallelize, while the global cap bounds the aggregate.
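A minimal sketch of how the three knobs could be read and applied, not the code from this PR. `SinkKnobs`, `knobs_from`, `chunk_ranges`, and `effective_table_limit` are hypothetical names, and falling back to the default on an unparsable value is an assumption; only the env var names, defaults, and the meaning of `0` come from the table above.

```rust
use std::ops::Range;

/// Hypothetical container for the three knobs; field names are illustrative.
#[derive(Debug, PartialEq)]
struct SinkKnobs {
    chunk_rows: usize,     // 0 = one write per op-segment
    per_table: usize,      // max in-flight writes within one table's batch
    global: Option<usize>, // None = unbounded
}

/// Build the knobs from a lookup function, so the sketch can be exercised
/// without touching the process environment. In a real binary, pass
/// `|k| std::env::var(k).ok()`.
fn knobs_from(get: impl Fn(&str) -> Option<String>) -> SinkKnobs {
    // Assumption: missing or unparsable values fall back to the defaults.
    let parse = |key: &str| get(key).and_then(|v| v.trim().parse::<usize>().ok());
    SinkKnobs {
        chunk_rows: parse("SPICEBENCH_SINK_CHUNK_ROWS").unwrap_or(5000),
        per_table: parse("SPICEBENCH_SINK_PARALLELISM_PER_TABLE")
            .unwrap_or(1)
            .max(1),
        global: parse("SPICEBENCH_SINK_PARALLELISM"),
    }
}

/// Split an op-segment of `total_rows` rows into row ranges of at most
/// `chunk_rows` rows. `chunk_rows == 0` yields a single range.
fn chunk_ranges(total_rows: usize, chunk_rows: usize) -> Vec<Range<usize>> {
    if total_rows == 0 {
        return Vec::new();
    }
    if chunk_rows == 0 {
        return vec![0..total_rows];
    }
    (0..total_rows)
        .step_by(chunk_rows)
        .map(|start| start..(start + chunk_rows).min(total_rows))
        .collect()
}

/// Upper bound on the writes one table can have in flight once both limits
/// apply: the per-table limit, clipped by the global ceiling if one is set.
fn effective_table_limit(knobs: &SinkKnobs) -> usize {
    match knobs.global {
        Some(global) => knobs.per_table.min(global),
        None => knobs.per_table,
    }
}
```

With the defaults, a 12,000-row op-segment becomes three writes (5,000 + 5,000 + 2,000 rows) issued one at a time for that table; with `SPICEBENCH_SINK_CHUNK_ROWS=0` it stays a single write.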

Values tuned per target in CI (a write's cost dictates the strategy):

| Target | `CHUNK_ROWS` | `PARALLELISM_PER_TABLE` | `PARALLELISM` (global) |
| --- | --- | --- | --- |
| spice_cloud (cayenne) | `0` (no chunking) | `1` | unbounded |
| databricks-sql / lakebase | `0` (no chunking) | `1` | unbounded |
| mongodb-streams | `5000` | `64` | `64` |

Generic CDC metrics contract

A vendor-neutral way for any system adapter to report CDC replication performance back to the benchmark, instead of hard-coding MongoDB/cayenne specifics. It models the universal CDC pipeline as a small per-table record carried as an optional field on the metrics response:

```rust
pub struct CdcReplicationMetrics {
    /// Per-table cumulative counters at scrape time.
    pub per_table: Vec<CdcTableMetrics>,
}

pub struct CdcTableMetrics {
    pub table: String,               // "__all__" if the adapter has no per-table split
    pub source_wait_ms: Option<f64>, // time blocked waiting for the source to deliver changes
    pub apply_ms: Option<f64>,       // time applying coalesced changes to the accelerator
    pub apply_count: Option<u64>,    // number of apply bursts
    pub linger_ms: Option<f64>,      // deliberate coalesce/linger time (batching cost)
    pub rows_applied: Option<u64>,   // row-level records applied (throughput denominator)
    pub bytes_applied: Option<u64>,  // bytes applied
}
```

The benchmark consumes this uniformly and emits it as per-table SUT gauges: `sut_cdc_source_wait_ms`, `sut_cdc_apply_ms`, `sut_cdc_rows_applied`, `sut_cdc_bytes_applied`, `sut_cdc_linger_ms`, and a derived `sut_cdc_source_wait_pct`. The same dashboards (throughput, source-wait %) therefore work whether the source is a MongoDB change stream, Postgres WAL, or anything added later. Adapters without per-table granularity report a single `__all__` row.
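As a rough illustration of how a derived gauge could be computed from one table's counters, here is a sketch. The description does not state the formula behind `sut_cdc_source_wait_pct`, so the definition used here (source wait as a share of source wait + apply + linger time) is an assumption, as are the helper names `source_wait_pct` and `apply_rows_per_sec`. The struct repeats only the fields the helpers need.

```rust
/// Subset of the `CdcTableMetrics` fields described above.
struct CdcTableMetrics {
    table: String,
    source_wait_ms: Option<f64>,
    apply_ms: Option<f64>,
    linger_ms: Option<f64>,
    rows_applied: Option<u64>,
}

/// ASSUMED definition: percentage of accounted time spent waiting on the
/// source. Returns None when a required counter is missing or the total
/// is zero, so no gauge is emitted for that table.
fn source_wait_pct(m: &CdcTableMetrics) -> Option<f64> {
    let wait = m.source_wait_ms?;
    let apply = m.apply_ms?;
    // Linger time is optional; treat a missing value as zero.
    let total = wait + apply + m.linger_ms.unwrap_or(0.0);
    if total > 0.0 {
        Some(100.0 * wait / total)
    } else {
        None
    }
}

/// Hypothetical throughput figure: rows applied per second of apply time.
fn apply_rows_per_sec(m: &CdcTableMetrics) -> Option<f64> {
    let rows = m.rows_applied? as f64;
    let apply = m.apply_ms?;
    if apply > 0.0 {
        Some(rows * 1000.0 / apply)
    } else {
        None
    }
}

/// Helper for the single-row case: an adapter without a per-table split
/// reports everything under the "__all__" table name.
fn all_tables_row(source_wait_ms: f64, apply_ms: f64, rows: u64) -> CdcTableMetrics {
    CdcTableMetrics {
        table: "__all__".to_string(),
        source_wait_ms: Some(source_wait_ms),
        apply_ms: Some(apply_ms),
        linger_ms: None,
        rows_applied: Some(rows),
    }
}
```

Because every counter is an `Option`, an adapter that cannot measure a stage simply leaves it unset and the derived value is skipped rather than reported as zero.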

Offline / local ETL + checkpoints (no S3 needed)

Lets the whole benchmark run end-to-end against local files with no S3/MinIO dependency. New flags:

- `spicebench run --etl-source-archive <file.tar.zst>`: use a locally generated data archive instead of downloading from S3 (`--etl-bucket`/`--etl-endpoint` are no longer required).
- `spicebench run --checkpoint-local-dir <dir>`: validate against a local directory of pre-computed checkpoints instead of pulling them from a bucket; combine with `--validate-results` for fully offline validation.
- `spicebench checkpoint --etl-source-archive <file.tar.zst>`: generate checkpoints from a local archive and write the manifest to disk instead of uploading; `--bucket` becomes optional in this mode.

```shell
# generate data + checkpoints once, then run fully offline
spicebench generate --scale-factor 1 --output-archive data/sf1.tar.zst
spicebench checkpoint --scenario tpch --version 1.0 --etl-source-archive data/sf1.tar.zst --checkpoint-dir data/ckpt-sf1
spicebench run --scenario tpch --scale-factor 1 \
  --etl-source-archive data/sf1.tar.zst \
  --validate-results --checkpoint-local-dir data/ckpt-sf1 ...
```

MongoDB sink

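Uses `insert_many(ordered=false)` for batch ingestion. With unordered inserts, MongoDB does not stop at the first failing document and is free to apply the documents in any order.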

krinart added 2 commits June 14, 2026 16:57

MongoDB Turning

cb949b3

Lint

00c0c1b

krinart changed the title ~~MongoDB Turning~~ MongoDB Tuning + Global Parallelism + Local Run without minio Jun 15, 2026

krinart added 2 commits June 14, 2026 17:26

Lint

d672fd6

Fix databricks

0afb8af

krinart marked this pull request as ready for review June 15, 2026 00:41

krinart wants to merge 4 commits into `trunk` from `viktor/tuning`
