Commit 7616805

Merge pull request #860 from ClickHouse/refactor/per-system-script-interface
Refactor: standard install/start/check/stop/load/query interface per system
2 parents 422f6af + f26184d commit 7616805

2,836 files changed: 76,578 additions and 14,010 deletions

.gitignore

Lines changed: 21 additions & 0 deletions
@@ -5,3 +5,24 @@
 *.parquet
 hits.csv
 hits.tsv
+
+# Per-system runtime artifacts produced by benchmark.sh
+result.csv
+log.txt
+load_out.txt
+server.log
+server.pid
+arc_token.txt
+data-size.txt
+.doris_home
+.sirius_env
+
+# Per-system data files
+hits.db
+mydb
+hits.hyper
+hits.vortex
+*.vortex
+
+# Python venvs created by install scripts
+myenv/

CHANGELOG.md

Lines changed: 14 additions & 1 deletion
@@ -2,6 +2,19 @@
 
 Changes in the benchmark methodology or presentation, as well as major news.
 
+### 2026-05-11
+Unified the benchmark scripts for different systems behind a common interface, a set of scripts: `install`, `start`, `check`, `stop`, `load`, `query`, and `data-size`. The dataset download scripts are shared as well. A general benchmark runner in `lib/` ensures that different systems get equal treatment. This makes it easier to add more ways of testing, different datasets, and scenarios to the benchmark, and simplifies support of all 88 systems presented. Note: embedded systems, such as SQLite and the Python duckdb module, are wrapped in a Python HTTP server so that the benchmark can run each query separately.
+
+Restart databases before measuring the cold run of each query, as requested in [#667](https://github.com/ClickHouse/ClickBench/issues/667) and [#793](https://github.com/ClickHouse/ClickBench/issues/793). This prevents unfair measurements and removes a way to cheat for systems that do excessive in-process caching without flushing it before the cold run. Flushing the OS page cache before the cold run is also unified, so that all benchmark entries follow the same rules. Notes: for stateless systems (such as query engines on top of Parquet), the restart is a no-op; for systems without durability and for in-memory systems, the restart before each query also requires repeated data loading, whose time is included in the cold query measurement.
+
+Introduced a new measurement, QPS and error rate on a concurrent workload (10 connections for 10 minutes), to prove the advantage of the refactoring. The metric is not yet exposed in the benchmark.
+
+Re-ran 88 systems on every machine. Fixed queries with regexps for MariaDB and SQLite. Added ARM64 versions for some systems: databend, octosql, octeryx. Switched to a faster data loader for MariaDB. An attempt to rerun CedarDB revealed a bug. Added new systems: Trino, Presto, Quickwit. Added a generic runner for pandas and polars. Fixed issues with Spark variants. Cleaned up some tags. Some systems were found dead: vertica, kinetica, singlestore, heavyai.
+
+Improved the website: important selectors (open-source, hardware, tuned) are moved to the top and shown horizontally; they also filter the visible options in other selectors. Hovering over a system highlights its tags. Added a button on the diagram to remove a system from the report. Added the measurement date to the diagram (as requested in [#639](https://github.com/ClickHouse/ClickBench/issues/639)). Shortened some cloud machine names to reduce clutter. The report methodology (aggregation of the measurements) and the default selection remain unchanged.
+
+(Alexey Milovidov)
+
 ### 2026-05-08
 Refactored directory structure to keep every historical result - they are organized in directories `system/results/YYYYMMDD/*.json` for each date. Compared to using git history, this unifies the format and structure of the results, making them ready for analysis. You can analyze it with clickhouse-local: `ch "SELECT * FROM '*/results/*/*.json'"` or export the data: `ch "SELECT * FROM '*/results/*/*.json' ORDER BY _path INTO OUTFILE 'results.parquet'"` (Alexey Milovidov).
 
@@ -56,7 +69,7 @@ The systems on the main chart are distinguished by color (systems from the same
 
 Added the "open-source" and "proprietary" tags, so that you can list only open-source databases. For the reference, Umbra, Hyper, and CedarDB are proprietary.
 
-Removed pointless tags, that some systems attribute to themself. One system misattributed itself as "mysql-compatible", two others added tags with their names, another reported two programming languages, a few systems reported an "analytical" tag, which is pointless, and one system didn't report itself as "ClickHouse-derivative" while being based on the ClickHouse interfaces and architecture.
+Removed pointless tags that some systems attribute to themselves. One system misattributed itself as "mysql-compatible", two others added tags with their names, another reported two programming languages, a few systems reported an "analytical" tag, which is pointless, and one system didn't report itself as "ClickHouse-derivative" while being based on the ClickHouse interfaces and architecture.
 
 Some systems provided bogus results on the loading time or data size. For example, one system reported data size 1000 times less, and we didn't notice that. This was corrected. The comparison on the loading time will not include stateless systems that don't require data loading.
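The common interface described above is easiest to see as the loop a runner would drive. The following is a minimal sketch, not the actual `lib/` runner (which is not part of this excerpt): the script names match the interface from the entry above, but the control flow, the `TRIES` count, and the cache-flush details are illustrative assumptions.

#!/bin/bash
# Hypothetical sketch of how a runner drives the per-system interface.
set -e

./install    # install the system (idempotent)
./start
./load       # load the dataset; durable systems keep it across restarts

TRIES=3
while read -r query; do
    # Cold run: restart the server and drop the OS page cache first.
    ./stop
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
    ./start
    while ! ./check; do sleep 1; done

    for i in $(seq 1 "$TRIES"); do
        # `query` reads SQL on stdin and reports the runtime on stderr.
        echo "$query" | ./query >/dev/null
    done
done < queries.sql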

arc/benchmark.sh

Lines changed: 4 additions & 203 deletions
@@ -1,204 +1,5 @@
 #!/bin/bash
-# Arc ClickBench Complete Benchmark Script (Go Binary Version)
-set -e
-
-# ============================================================
-# 1. INSTALL ARC FROM .DEB PACKAGE
-# ============================================================
-echo "Installing Arc from .deb package..."
-
-# Fetch latest Arc version from GitHub releases
-echo "Fetching latest Arc version..."
-ARC_VERSION=$(curl -s https://api.github.com/repos/Basekick-Labs/arc/releases/latest | grep -oP '"tag_name": "v\K[^"]+')
-if [ -z "$ARC_VERSION" ]; then
-    echo "Error: Could not fetch latest Arc version from GitHub"
-    exit 1
-fi
-echo "Latest Arc version: $ARC_VERSION"
-
-ARCH=$(uname -m)
-if [ "$ARCH" = "aarch64" ] || [ "$ARCH" = "arm64" ]; then
-    DEB_URL="https://github.com/Basekick-Labs/arc/releases/download/v${ARC_VERSION}/arc_${ARC_VERSION}_arm64.deb"
-    DEB_FILE="arc_${ARC_VERSION}_arm64.deb"
-else
-    DEB_URL="https://github.com/Basekick-Labs/arc/releases/download/v${ARC_VERSION}/arc_${ARC_VERSION}_amd64.deb"
-    DEB_FILE="arc_${ARC_VERSION}_amd64.deb"
-fi
-
-echo "Detected architecture: $ARCH -> $DEB_FILE"
-
-if [ ! -f "$DEB_FILE" ]; then
-    wget -q "$DEB_URL" -O "$DEB_FILE"
-fi
-
-sudo dpkg -i "$DEB_FILE" || sudo apt-get install -f -y
-echo "[OK] Arc installed"
-
-# ============================================================
-# 2. PRINT SYSTEM INFO (Arc defaults)
-# ============================================================
-CORES=$(nproc)
-TOTAL_MEM_KB=$(grep MemTotal /proc/meminfo | awk '{print $2}')
-TOTAL_MEM_GB=$((TOTAL_MEM_KB / 1024 / 1024))
-MEM_LIMIT_GB=$((TOTAL_MEM_GB * 80 / 100))  # 80% of system RAM
-
-echo ""
-echo "System Configuration:"
-echo "  CPU cores: $CORES"
-echo "  Connections: $((CORES * 2)) (cores × 2)"
-echo "  Threads: $CORES (same as cores)"
-echo "  Memory limit: ${MEM_LIMIT_GB}GB (80% of ${TOTAL_MEM_GB}GB total)"
-echo ""
-
-# ============================================================
-# 3. START ARC AND CAPTURE TOKEN FROM LOGS
-# ============================================================
-echo "Starting Arc service..."
-
-# Check if we already have a valid token from a previous run
-if [ -f "arc_token.txt" ]; then
-    EXISTING_TOKEN=$(cat arc_token.txt)
-    echo "Found existing token file, will verify after Arc starts..."
-fi
-
-sudo systemctl start arc
-
-# Wait for Arc to be ready
-echo "Waiting for Arc to be ready..."
-for i in {1..30}; do
-    if curl -sf http://localhost:8000/health > /dev/null 2>&1; then
-        echo "[OK] Arc is ready!"
-        break
-    fi
-    if [ $i -eq 30 ]; then
-        echo "Error: Arc failed to start within 30 seconds"
-        sudo journalctl -u arc --no-pager | tail -50
-        exit 1
-    fi
-    sleep 1
-done
-
-# Try to get token - either from existing file or from logs (first run)
-ARC_TOKEN=""
-
-# First, check if existing token works
-if [ -n "$EXISTING_TOKEN" ]; then
-    if curl -sf http://localhost:8000/health -H "x-api-key: $EXISTING_TOKEN" > /dev/null 2>&1; then
-        ARC_TOKEN="$EXISTING_TOKEN"
-        echo "[OK] Using existing token from arc_token.txt"
-    else
-        echo "Existing token invalid, looking for new token in logs..."
-    fi
-fi
-
-# If no valid token yet, try to extract from logs (first run scenario)
-if [ -z "$ARC_TOKEN" ]; then
-    ARC_TOKEN=$(sudo journalctl -u arc --no-pager | grep -oP '(?:Initial admin API token|Admin API token): \K[^\s]+' | head -1)
-    if [ -n "$ARC_TOKEN" ]; then
-        echo "[OK] Captured new token from logs"
-        echo "$ARC_TOKEN" > arc_token.txt
-    else
-        echo "Error: Could not find or validate API token"
-        echo "If this is not the first run, Arc's database may need to be reset:"
-        echo "  sudo rm -rf /var/lib/arc/data/arc.db"
-        exit 1
-    fi
-fi
-
-echo "Token: ${ARC_TOKEN:0:20}..."
-
-# ============================================================
-# 4. DOWNLOAD DATASET
-# ============================================================
-DATASET_FILE="hits.parquet"
-DATASET_URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
-EXPECTED_SIZE=14779976446
-
-if [ -f "$DATASET_FILE" ]; then
-    CURRENT_SIZE=$(stat -c%s "$DATASET_FILE" 2>/dev/null || stat -f%z "$DATASET_FILE" 2>/dev/null)
-    if [ "$CURRENT_SIZE" -eq "$EXPECTED_SIZE" ]; then
-        echo "[OK] Dataset already downloaded (14GB)"
-    else
-        echo "Re-downloading dataset (size mismatch)..."
-        rm -f "$DATASET_FILE"
-        wget --continue --progress=dot:giga "$DATASET_URL"
-    fi
-else
-    echo "Downloading ClickBench dataset (14GB)..."
-    wget --continue --progress=dot:giga "$DATASET_URL"
-fi
-
-# ============================================================
-# 5. LOAD DATA INTO ARC
-# ============================================================
-echo "Loading data into Arc..."
-
-# Determine Arc's data directory (default: /var/lib/arc/data)
-ARC_DATA_DIR="/var/lib/arc/data"
-TARGET_DIR="$ARC_DATA_DIR/clickbench/hits"
-TARGET_FILE="$TARGET_DIR/hits.parquet"
-
-sudo mkdir -p "$TARGET_DIR"
-
-if [ -f "$TARGET_FILE" ]; then
-    SOURCE_SIZE=$(stat -c%s "$DATASET_FILE" 2>/dev/null || stat -f%z "$DATASET_FILE" 2>/dev/null)
-    TARGET_SIZE=$(stat -c%s "$TARGET_FILE" 2>/dev/null || stat -f%z "$TARGET_FILE" 2>/dev/null)
-    if [ "$SOURCE_SIZE" -eq "$TARGET_SIZE" ]; then
-        echo "[OK] Data already loaded"
-    else
-        echo "Reloading data (size mismatch)..."
-        sudo cp "$DATASET_FILE" "$TARGET_FILE"
-    fi
-else
-    sudo cp "$DATASET_FILE" "$TARGET_FILE"
-    echo "[OK] Data loaded to $TARGET_FILE"
-fi
-
-# ============================================================
-# 6. SET ENVIRONMENT AND RUN BENCHMARK
-# ============================================================
-export ARC_URL="http://localhost:8000"
-export ARC_API_KEY="$ARC_TOKEN"
-export DATABASE="clickbench"
-export TABLE="hits"
-
-echo ""
-echo "Running ClickBench queries (true cold runs)..."
-echo "================================================"
-./run.sh 2>&1 | tee log.txt
-
-# ============================================================
-# 7. STOP ARC AND FORMAT RESULTS
-# ============================================================
-echo "Stopping Arc..."
-sudo systemctl stop arc
-
-# Format results as proper JSON array
-cat log.txt | grep -oE '^[0-9]+\.[0-9]+|^null' | \
-    awk '{
-        if (NR % 3 == 1) printf "[";
-        printf "%s", $1;
-        if (NR % 3 == 0) print "],";
-        else printf ", ";
-    }' > results.txt
-
-echo ""
-echo "[OK] Benchmark complete!"
-echo "================================================"
-echo "Load time: 0"
-echo "Data size: $EXPECTED_SIZE"
-cat results.txt
-echo "================================================"
-
-# ============================================================
-# 8. CLEANUP
-# ============================================================
-echo "Cleaning up..."
-
-# Uninstall Arc package
-sudo dpkg -r arc || true
-
-# Remove Arc data directory
-sudo rm -rf /var/lib/arc
-
-echo "[OK] Cleanup complete"
+# Thin shim — actual flow is in lib/benchmark-common.sh.
+export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-single"
+export BENCH_DURABLE=yes
+exec ../lib/benchmark-common.sh
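With the refactoring, each system's benchmark.sh shrinks to a shim like the one above: it exports the knobs the common runner needs and execs it. A hypothetical shim for a stateless Parquet-based system might differ only in those knobs; the variable values below are illustrative assumptions, not taken from this commit.

#!/bin/bash
# Hypothetical shim for some other, stateless system (values are illustrative).
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-single"
export BENCH_DURABLE=no    # stateless: restart before the cold run is a no-op
exec ../lib/benchmark-common.sh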

arc/check

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+#!/bin/bash
+set -e
+
+ARC_URL="${ARC_URL:-http://localhost:8000}"
+TOKEN=$(cat arc_token.txt 2>/dev/null || true)
+
+if [ -n "$TOKEN" ]; then
+    curl -sf "$ARC_URL/health" -H "x-api-key: $TOKEN" >/dev/null
+else
+    curl -sf "$ARC_URL/health" >/dev/null
+fi
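Since `check` exits zero only when the health endpoint responds, a runner can poll it for readiness. A plausible wait loop is shown below; the 30-second budget mirrors the deleted benchmark.sh, not anything in `check` itself.

# Wait up to ~30 seconds for Arc to become healthy after `./start`.
for i in {1..30}; do
    ./check && break
    [ "$i" -eq 30 ] && { echo "Arc did not become healthy" >&2; exit 1; }
    sleep 1
done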

arc/data-size

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+#!/bin/bash
+set -e
+
+# Source parquet file size (loaded into Arc's data directory).
+F="/var/lib/arc/data/clickbench/hits/hits.parquet"
+if [ -f "$F" ]; then
+    sudo stat -c%s "$F"
+else
+    echo 14779976446
+fi

arc/install

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+#!/bin/bash
+set -e
+
+# Install Arc from a .deb release. Idempotent.
+if dpkg -l arc 2>/dev/null | grep -q '^ii '; then
+    exit 0
+fi
+
+ARC_VERSION=$(curl -s https://api.github.com/repos/Basekick-Labs/arc/releases/latest \
+    | grep -oP '"tag_name": "v\K[^"]+')
+if [ -z "$ARC_VERSION" ]; then
+    echo "Error: Could not fetch latest Arc version from GitHub" >&2
+    exit 1
+fi
+
+ARCH=$(uname -m)
+if [ "$ARCH" = "aarch64" ] || [ "$ARCH" = "arm64" ]; then
+    DEB_FILE="arc_${ARC_VERSION}_arm64.deb"
+else
+    DEB_FILE="arc_${ARC_VERSION}_amd64.deb"
+fi
+DEB_URL="https://github.com/Basekick-Labs/arc/releases/download/v${ARC_VERSION}/${DEB_FILE}"
+
+if [ ! -f "$DEB_FILE" ]; then
+    wget -q "$DEB_URL" -O "$DEB_FILE"
+fi
+
+sudo dpkg -i "$DEB_FILE" || sudo apt-get install -f -y
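Because of the leading `dpkg -l` guard, repeating the script is cheap, which is what the runner relies on; roughly:

./install   # first run: fetches the latest release and installs the .deb
./install   # later runs: the `dpkg -l arc` guard exits 0 immediately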

arc/load

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+#!/bin/bash
+set -e
+
+# Arc loads the parquet file into its data directory and indexes it on startup.
+ARC_DATA_DIR="/var/lib/arc/data"
+TARGET_DIR="$ARC_DATA_DIR/clickbench/hits"
+TARGET_FILE="$TARGET_DIR/hits.parquet"
+
+sudo mkdir -p "$TARGET_DIR"
+
+if [ -f "$TARGET_FILE" ] && \
+   [ "$(stat -c%s hits.parquet)" -eq "$(stat -c%s "$TARGET_FILE")" ]; then
+    :  # already loaded
+else
+    sudo cp hits.parquet "$TARGET_FILE"
+fi
+
+# Free up local space.
+rm -f hits.parquet
+sync

arc/query

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+#!/bin/bash
+# Reads a SQL query from stdin, POSTs it to Arc's HTTP API.
+# Stdout: query response body (JSON).
+# Stderr: query runtime in fractional seconds on the last line (extracted
+# from Arc's journal log line `execution_time_ms=N`).
+# Exit non-zero on error.
+set -e
+
+ARC_URL="${ARC_URL:-http://localhost:8000}"
+ARC_API_KEY="${ARC_API_KEY:-$(cat arc_token.txt 2>/dev/null)}"
+
+query=$(cat)
+
+# Build JSON payload with proper escaping.
+JSON_PAYLOAD=$(jq -Rs '{sql: .}' <<<"$query")
+
+# Mark journal position so we can locate the matching execution_time_ms entry.
+LOG_MARKER=$(date -u +"%Y-%m-%dT%H:%M:%S")
+
+RESPONSE=$(curl -s -w "\n%{http_code}" \
+    -X POST "$ARC_URL/api/v1/query" \
+    -H "x-api-key: $ARC_API_KEY" \
+    -H "Content-Type: application/json" \
+    -d "$JSON_PAYLOAD" \
+    --max-time 300)
+
+HTTP_CODE=$(printf '%s\n' "$RESPONSE" | tail -1)
+BODY=$(printf '%s\n' "$RESPONSE" | head -n -1)
+
+if [ "$HTTP_CODE" != "200" ]; then
+    printf 'arc query failed: HTTP %s\n%s\n' "$HTTP_CODE" "$BODY" >&2
+    exit 1
+fi
+
+# Result body to stdout.
+printf '%s\n' "$BODY"
+
+# Extract execution_time_ms from Arc's journal — give it a moment to flush.
+sleep 0.1
+EXEC_MS=$(sudo journalctl -u arc --since="$LOG_MARKER" --no-pager 2>/dev/null \
+    | grep -oP 'execution_time_ms=\K[0-9]+' | tail -1)
+
+if [ -z "$EXEC_MS" ]; then
+    echo "Could not extract execution_time_ms from arc journal" >&2
+    exit 1
+fi
+
+# Convert ms -> seconds and emit on stderr.
+awk -v ms="$EXEC_MS" 'BEGIN { printf "%.4f\n", ms / 1000 }' >&2
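Per the header contract (SQL on stdin, JSON body on stdout, seconds on the last stderr line), a caller can split the two streams with ordinary redirection. A sketch, assuming a running Arc and a valid arc_token.txt:

# Run one query; keep the JSON result in a file and capture the runtime.
# `2>&1 >result.json` sends stderr to the pipe while stdout goes to the file.
SECONDS_TAKEN=$(echo 'SELECT COUNT(*) FROM hits' \
    | ./query 2>&1 >result.json | tail -1)
echo "query took ${SECONDS_TAKEN}s; result in result.json"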
