MongoDB Tuning + Global Parallelism + Local Run without minio #264
Open
krinart wants to merge 4 commits into
ETL pipeline parallelism
Reworked how the ETL writes into the target so a single run can drive many concurrent writes instead of serializing per table. Each op-segment is split into fixed-size row chunks and dispatched against a shared concurrency budget, controlled by three env knobs:
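| Env var | Default | Meaning |
|---|---|---|
| `SPICEBENCH_SINK_CHUNK_ROWS` | `5000` | Rows per `sink.write()` call. `0` disables splitting (one write per op-segment). |
| `SPICEBENCH_SINK_PARALLELISM_PER_TABLE` | `1` | Per-table write concurrency. |
| `SPICEBENCH_SINK_PARALLELISM` | | Global cap on aggregate write concurrency. |

The two limits compose: per-table concurrency lets a hot table (e.g. `lineitem`) parallelize, while the global cap bounds the aggregate.

Values tuned per target in CI (a write's cost dictates the strategy):

| Target | `CHUNK_ROWS` | `PARALLELISM_PER_TABLE` | `PARALLELISM` (global) |
|---|---|---|---|
| | `0` (no chunking) | `1` | |
| | `0` (no chunking) | `1` | |
| | `5000` | `64` | `64` |

Generic CDC metrics contract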
A vendor-neutral way for any system adapter to report CDC replication performance back to the benchmark, instead of hard-coding MongoDB/cayenne specifics. It models the universal CDC pipeline as a small per-table record carried as an optional field on the metrics response:
The benchmark consumes this uniformly and emits it as per-table SUT gauges (`sut_cdc_source_wait_ms`, `sut_cdc_apply_ms`, `sut_cdc_rows_applied`, `sut_cdc_bytes_applied`, `sut_cdc_linger_ms`, and a derived `sut_cdc_source_wait_pct`), so the same dashboards (throughput, source-wait %) work whether the source is a MongoDB change stream, Postgres WAL, or anything added later. Adapters without per-table granularity report a single `__all__` row.

Example:
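As an illustrative sketch only (not the PR's actual schema or payload), the per-table record and its expansion into gauges could look like the following. The field names mirror the gauge suffixes listed above; the `CdcTableMetrics` type, the `to_gauges` helper, and in particular the formula for `sut_cdc_source_wait_pct` (source wait as a share of source wait plus apply time) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CdcTableMetrics:
    """Hypothetical per-table CDC record; fields mirror the gauge suffixes."""
    table: str            # table name, or "__all__" when there is no per-table data
    source_wait_ms: float
    apply_ms: float
    rows_applied: int
    bytes_applied: int
    linger_ms: float


def source_wait_pct(m: CdcTableMetrics) -> float:
    # Assumed definition: share of (wait + apply) time spent waiting on the source.
    total = m.source_wait_ms + m.apply_ms
    return 100.0 * m.source_wait_ms / total if total > 0 else 0.0


def to_gauges(m: CdcTableMetrics) -> dict:
    """Expand one record into {gauge_name: value}, all labelled by m.table."""
    return {
        "sut_cdc_source_wait_ms": m.source_wait_ms,
        "sut_cdc_apply_ms": m.apply_ms,
        "sut_cdc_rows_applied": m.rows_applied,
        "sut_cdc_bytes_applied": m.bytes_applied,
        "sut_cdc_linger_ms": m.linger_ms,
        "sut_cdc_source_wait_pct": source_wait_pct(m),
    }


# An adapter without per-table granularity reports one "__all__" row.
row = CdcTableMetrics("__all__", source_wait_ms=750.0, apply_ms=250.0,
                      rows_applied=10_000, bytes_applied=1_200_000, linger_ms=40.0)
gauges = to_gauges(row)
```

Under the assumed formula, a row that waits on the source three times as long as it spends applying would report a source-wait percentage of 75.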

Offline / local ETL + checkpoints (no S3 needed)
Lets the whole benchmark run end-to-end against local files with no S3/MinIO dependency. New flags:
- `spicebench run --etl-source-archive <file.tar.zst>`: use a locally generated data archive instead of downloading from S3 (`--etl-bucket`/`--etl-endpoint` no longer required).
- `spicebench run --checkpoint-local-dir <dir>`: validate against a local directory of pre-computed checkpoints instead of pulling them from a bucket; combine with `--validate-results` for fully offline validation.
- `spicebench checkpoint --etl-source-archive <file.tar.zst>`: generate checkpoints from a local archive and write the manifest to disk instead of uploading; `--bucket` becomes optional in this mode.

```bash
# generate data + checkpoints once, then run fully offline
spicebench generate --scale-factor 1 --output-archive data/sf1.tar.zst
spicebench checkpoint --scenario tpch --version 1.0 --etl-source-archive data/sf1.tar.zst --checkpoint-dir data/ckpt-sf1
spicebench run --scenario tpch --scale-factor 1 \
  --etl-source-archive data/sf1.tar.zst \
  --validate-results --checkpoint-local-dir data/ckpt-sf1 ...
```

MongoDB sink
Use `insert_many(ordered=false)` for batch ingestion
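
To make the batching idea concrete, here is a minimal sketch of chunked, unordered inserts. It assumes a PyMongo-style collection object (PyMongo spells the option `ordered=False`); the `chunked` and `write_batches` helpers are hypothetical names, not the benchmark's sink code, and the `0` = "no splitting" behaviour simply mirrors the `SPICEBENCH_SINK_CHUNK_ROWS` description.

```python
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` rows; size <= 0 yields all rows as one chunk."""
    it = iter(rows)
    if size <= 0:
        batch = list(it)
        if batch:
            yield batch
        return
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_batches(collection, rows, chunk_rows=5000):
    """Insert `rows` in unordered batches; return the number of insert_many calls.

    `collection` is anything exposing a PyMongo-style
    insert_many(documents, ordered=...) method.
    """
    calls = 0
    for batch in chunked(rows, chunk_rows):
        # ordered=False: the server attempts every document in the batch
        # instead of stopping at the first failure, and may apply them in any order.
        collection.insert_many(batch, ordered=False)
        calls += 1
    return calls


class FakeCollection:
    """Stand-in used here so the sketch runs without a MongoDB server."""
    def __init__(self):
        self.batch_sizes = []

    def insert_many(self, documents, ordered=True):
        assert ordered is False
        self.batch_sizes.append(len(documents))


fake = FakeCollection()
n_calls = write_batches(fake, ({"_id": i} for i in range(12)), chunk_rows=5)
```

With a real PyMongo collection, an unordered batch that hits a duplicate key still attempts the remaining documents and then raises `pymongo.errors.BulkWriteError`, so a production caller would need to decide how to handle that.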