feat(arc): add ClickBench results for Arc on c6a.4xlarge #634
Open
xe-nvdk wants to merge 27 commits into ClickHouse:main from Basekick-Labs:main
Commits
29e86a5 adding arc values (xe-nvdk)
a433b23 we missed one query, now is complete (xe-nvdk)
01b692e fixing run.sh and re run, just in case both benchmark in pro m3 max a… (xe-nvdk)
b663892 disabling query caching and re ran the benchmarks (xe-nvdk)
7a40588 updating repo to match the current for arc (xe-nvdk)
b934055 Merge branch 'main' into main (xe-nvdk)
fa56ed3 Merge branch 'ClickHouse:main' into main (xe-nvdk)
7c0ccab Merge branch 'ClickHouse:main' into main (xe-nvdk)
1db5924 adding updated values for m3 max (xe-nvdk)
08fe758 Merge branch 'main' of github.com:Basekick-Labs/ClickBench (xe-nvdk)
bde45ce updating results and scripts for arc (xe-nvdk)
7135fff Merge branch 'ClickHouse:main' into main (xe-nvdk)
3a00ca3 fixing benchmark to load the data (xe-nvdk)
757d7fa Merge branch 'main' of github.com:Basekick-Labs/ClickBench (xe-nvdk)
6e70633 fixing token creation (xe-nvdk)
32c62ba fixing api env passing (xe-nvdk)
56702bc fixing db specification for api creation (xe-nvdk)
82abc81 making sure that we don't have enabled query cache (xe-nvdk)
d6904f8 adding results for arc in clickbench (xe-nvdk)
48a8fc9 Merge branch 'main' into main (xe-nvdk)
8333f83 Merge branch 'ClickHouse:main' into main (xe-nvdk)
799b4a7 refining format of the results (xe-nvdk)
ecd0414 refining format of the results (xe-nvdk)
b905b50 Merge branch 'ClickHouse:main' into main (xe-nvdk)
ad86bf5 Merge branch 'main' of github.com:Basekick-Labs/ClickBench (xe-nvdk)
97da2bd deleting comments in the results (xe-nvdk)
716b715 adding time-series tag (xe-nvdk)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
# Arc ClickBench - Changelog | ||
|
||
## 2025-10-07 - Fixed for ClickBench Submission | ||
|
||
### Issues Reported by ClickBench Maintainers | ||
|
||
1. **`--break-system-packages` Required** | ||
- Problem: Script used `pip3 install` globally, requiring `--break-system-packages` on modern Python | ||
- Fix: Created Python virtual environment (`python3 -m venv arc-venv`) | ||
- Result: All dependencies installed in isolated venv, no system modification | ||
|
||
2. **`ImportError: cannot import name 'Permission'`** | ||
- Problem: Script tried to import `Permission` from `api.auth`, which doesn't exist | ||
- Fix: Removed `Permission` import, use simple `auth.create_token(name, description)` | ||
- Result: Token creation works with Arc's actual auth API | ||
|
||
### Changes Made | ||
|
||
#### `benchmark.sh` | ||
- ✅ Added Python venv creation and activation | ||
- ✅ Fixed auth token creation (removed `Permission` import) | ||
- ✅ Auto-detect CPU cores for optimal worker count | ||
- ✅ Better error handling (30s timeout with logs on failure) | ||
- ✅ Proper cleanup (stop Arc, deactivate venv) | ||
- ✅ Following chdb/benchmark.sh pattern | ||
|
||
#### `README.md` | ||
- ✅ Added complete setup instructions | ||
- ✅ Documented virtual environment approach | ||
- ✅ Manual steps for debugging | ||
- ✅ Architecture and performance notes | ||
|
||
#### `run.sh` | ||
- ✅ Already working correctly | ||
- ✅ Uses environment variables for configuration | ||
- ✅ Proper error handling | ||
|
||
### Testing Checklist | ||
|
||
- [ ] Clean Ubuntu/Debian environment | ||
- [ ] Virtual environment creation | ||
- [ ] Arc installation from GitHub | ||
- [ ] Token creation without `Permission` import | ||
- [ ] Server startup with auto-detected workers | ||
- [ ] Dataset download (14GB) | ||
- [ ] Query execution (43 queries × 3 runs) | ||
- [ ] Results formatting | ||
- [ ] Cleanup (venv deactivation, Arc shutdown) | ||
|
||
### Expected Behavior | ||
|
||
```bash | ||
$ ./benchmark.sh | ||
|
||
Installing system dependencies... | ||
Creating Python virtual environment... | ||
Cloning Arc repository... | ||
Installing Arc dependencies... | ||
Creating API token... | ||
Created API token: xvN6zwR4oSd... | ||
Token created successfully | ||
Starting Arc with 28 workers (14 cores detected)... | ||
Arc started with PID: 12345 | ||
✓ Arc is ready! | ||
Dataset size: 14G hits.parquet | ||
Dataset contains 99,997,497 rows | ||
Running ClickBench queries via Arc HTTP API... | ||
================================================ | ||
Benchmark complete! | ||
✓ Benchmark complete! | ||
|
||
Results saved to: results.json | ||
``` | ||
|
||
### Performance | ||
|
||
Tested on M3 Max (14 cores, 36GB RAM): | ||
- **Total time:** ~22 seconds (43 queries) | ||
- **Workers:** 28 (2x cores, optimal for analytical queries) | ||
- **Query cache:** Disabled (per ClickBench rules) | ||
|
||
### Notes for ClickBench Maintainers | ||
|
||
1. **No system modification:** All dependencies in venv | ||
2. **Simple auth:** No complex permission system, just token creation | ||
3. **Auto-scaling:** Detects CPU cores and sets optimal workers | ||
4. **Error handling:** Clear error messages with logs | ||
5. **Standard format:** Follows chdb pattern (venv, wget, etc.) | ||
|
||
### Future Improvements | ||
|
||
- [ ] Add MinIO for object storage benchmark variant | ||
- [ ] Test on different CPU architectures (ARM, x86) | ||
- [ ] Add memory usage monitoring | ||
- [ ] Optimize for larger datasets (100M+ rows) |
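The changelog above reports 28 workers for 14 detected cores, which is the "2x CPU cores" rule. The script itself is a shell script, so the following is only a minimal Python sketch of that calculation; the function name `worker_count` and its `multiplier` parameter are hypothetical and do not come from the PR.

```python
import os


def worker_count(cores=None, multiplier=2):
    """Return a worker count derived from the core count (2x cores by default).

    Hypothetical helper: it mirrors the rule described in the changelog,
    not the actual contents of benchmark.sh.
    """
    if cores is None:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return cores * multiplier


if __name__ == "__main__":
    detected = os.cpu_count() or 1
    print(f"Starting Arc with {worker_count(detected)} workers ({detected} cores detected)")
```

Passing the core count explicitly makes the rule easy to check for a given machine: 14 cores gives 28 workers, matching the expected output in the changelog.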
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,167 @@ | ||
# Arc - ClickBench Benchmark | ||
|
||
Arc is a high-performance time-series data warehouse built on DuckDB, Parquet, and object storage. | ||
|
||
## System Information | ||
|
||
- **System:** Arc | ||
- **Date:** 2025-10-07 | ||
- **Machine:** m3_max (14 cores, 36GB RAM) | ||
- **Tags:** Python, time-series, DuckDB, Parquet, columnar, HTTP API | ||
- **License:** AGPL-3.0 | ||
- **Repository:** https://github.com/Basekick-Labs/arc | ||
|
||
## Performance | ||
|
||
Arc achieves: | ||
- **Write throughput:** 1.89M records/sec (MessagePack binary protocol) | ||
- **ClickBench:** ~22 seconds total (43 analytical queries) | ||
- **Storage:** DuckDB + Parquet with MinIO/S3/GCS backends | ||
|
||
## Prerequisites | ||
|
||
- Ubuntu/Debian Linux (or compatible) | ||
- Python 3.11+ | ||
- 8GB+ RAM recommended | ||
- Internet connection for dataset download | ||
|
||
## Quick Start | ||
|
||
The benchmark script handles everything automatically: | ||
|
||
```bash | ||
./benchmark.sh | ||
``` | ||
|
||
This will: | ||
1. Create Python virtual environment (no system packages modified) | ||
2. Clone Arc repository | ||
3. Install dependencies in venv | ||
4. Start Arc server with optimal worker count (2x CPU cores) | ||
5. Download ClickBench dataset (14GB parquet file) | ||
6. Run 43 queries × 3 iterations | ||
7. Output results in ClickBench JSON format | ||
|
||
## Manual Steps | ||
|
||
### 1. Install Dependencies | ||
|
||
```bash | ||
sudo apt-get update -y | ||
sudo apt-get install -y python3-pip python3-venv wget curl | ||
``` | ||
|
||
### 2. Create Virtual Environment | ||
|
||
```bash | ||
python3 -m venv arc-venv | ||
source arc-venv/bin/activate | ||
``` | ||
|
||
### 3. Clone and Setup Arc | ||
|
||
```bash | ||
git clone https://github.com/Basekick-Labs/arc.git | ||
cd arc | ||
pip install -r requirements.txt | ||
mkdir -p data logs | ||
``` | ||
|
||
### 4. Create API Token | ||
|
||
```bash | ||
python3 << 'EOF' | ||
from api.auth import AuthManager | ||
|
||
auth = AuthManager(db_path='./data/historian.db') | ||
token = auth.create_token(name='clickbench', description='ClickBench benchmark') | ||
print(f"Token: {token}") | ||
EOF | ||
``` | ||
|
||
### 5. Start Arc Server | ||
|
||
```bash | ||
# Auto-detect cores | ||
CORES=$(nproc) | ||
WORKERS=$((CORES * 2)) | ||
|
||
# Start server | ||
gunicorn -w $WORKERS -b 0.0.0.0:8000 \ | ||
-k uvicorn.workers.UvicornWorker \ | ||
--timeout 300 \ | ||
api.main:app | ||
``` | ||
|
||
### 6. Download Dataset | ||
|
||
```bash | ||
wget https://datasets.clickhouse.com/hits_compatible/hits.parquet | ||
``` | ||
|
||
### 7. Run Benchmark | ||
|
||
```bash | ||
export ARC_URL="http://localhost:8000" | ||
export ARC_API_KEY="your-token-from-step-4" | ||
export PARQUET_FILE="/path/to/hits.parquet" | ||
|
||
./run.sh | ||
``` | ||
|
||
## Configuration | ||
|
||
Arc uses optimal settings for ClickBench: | ||
|
||
- **Workers:** 2x CPU cores (balanced for analytical queries) | ||
- **Query cache:** Disabled (per ClickBench rules) | ||
- **Storage:** Local filesystem (fastest for single-node) | ||
- **Timeout:** 300 seconds per query | ||
|
||
## Results Format | ||
|
||
Results are output in ClickBench JSON format: | ||
|
||
```json | ||
[ | ||
[0.0226, 0.0233, 0.0284], | ||
[0.0324, 0.0334, 0.0392], | ||
... | ||
] | ||
``` | ||
|
||
Each array contains 3 execution times (in seconds) for the same query. | ||
|
||
## Notes | ||
|
||
- **Virtual Environment:** All dependencies installed in isolated venv (no `--break-system-packages` needed) | ||
- **Authentication:** Uses Arc's built-in token auth (simpler than Permission-based auth) | ||
- **Query Cache:** Disabled to ensure fair benchmark (no cache hits) | ||
- **Worker Count:** Auto-detected based on CPU cores, optimized for analytical workloads | ||
- **Timeout:** Generous 300s timeout for complex queries | ||
|
||
## Architecture | ||
|
||
``` | ||
ClickBench Query → Arc HTTP API → DuckDB → Parquet File → Results | ||
``` | ||
|
||
Arc queries the Parquet file directly via DuckDB's `read_parquet()` function, which avoids a separate data import step. | ||
|
||
## Performance Characteristics | ||
|
||
Arc is optimized for: | ||
- **High-throughput writes** (1.89M RPS with MessagePack) | ||
- **Analytical queries** (DuckDB's columnar engine) | ||
- **Object storage** (S3, GCS, MinIO compatibility) | ||
- **Time-series workloads** (built-in time-based indexing) | ||
|
||
## Support | ||
|
||
- GitHub: https://github.com/Basekick-Labs/arc | ||
- Issues: https://github.com/Basekick-Labs/arc/issues | ||
- Docs: https://docs.arc.basekick.com (coming soon) | ||
|
||
## License | ||
|
||
Arc Core is licensed under AGPL-3.0. |
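The README's "Results Format" section describes a JSON array with one list of three timings (in seconds) per query. As a rough sketch of how such a file could be summarized, the Python below totals the first run and the fastest run of each query. The `summarize` function, its output keys, and the handling of missing runs as `null` are assumptions made for illustration; they are not part of the PR.

```python
import json


def summarize(results):
    """Summarize per-query timings given as [run1, run2, run3] lists of seconds.

    Queries with a missing run (assumed here to be recorded as null/None)
    are skipped and not counted as completed.
    """
    first_total = 0.0
    best_total = 0.0
    completed = 0
    for runs in results:
        if not runs or any(t is None for t in runs):
            continue
        first_total += runs[0]   # first execution of the query
        best_total += min(runs)  # fastest of the three executions
        completed += 1
    return {
        "completed": completed,
        "first_total": first_total,
        "best_total": best_total,
    }


if __name__ == "__main__":
    # The two sample rows shown in the README.
    sample = json.loads("[[0.0226, 0.0233, 0.0284], [0.0324, 0.0334, 0.0392]]")
    print(summarize(sample))
```

Reporting the first-run and best-run totals separately keeps any warm-up effect in the first execution visible instead of averaging it away.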
There should be no prerequisites - the benchmark runs automatically on an empty AWS machine with Ubuntu AMI.
Thanks for the feedback. We’ll revisit the submission later this year. For now, we’re happy to have the benchmark numbers internally and will use them for our own reference. Once we release official binaries, we’ll try again to get included in ClickBench.
It's not a problem, let's push this PR to ClickBench. The more systems included, the better.
Hi @alexey-milovidov, we just updated the PR and were able to run benchmark.sh according to the ClickBench guidelines. Let me know if you have any issues running it, but there shouldn't be any. Thank you.