Commits (27)
29e86a5
adding arc values
xe-nvdk Oct 6, 2025
a433b23
we missed one query, now is complete
xe-nvdk Oct 6, 2025
01b692e
fixing run.sh and re run, just in case both benchmark in pro m3 max a…
xe-nvdk Oct 6, 2025
b663892
disabling query caching and re ran the benchmarks
xe-nvdk Oct 6, 2025
7a40588
updating repo to match the current for arc
xe-nvdk Oct 7, 2025
b934055
Merge branch 'main' into main
xe-nvdk Oct 7, 2025
fa56ed3
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 9, 2025
7c0ccab
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 9, 2025
1db5924
adding updated values for m3 max
xe-nvdk Oct 11, 2025
08fe758
Merge branch 'main' of github.com:Basekick-Labs/ClickBench
xe-nvdk Oct 11, 2025
bde45ce
updating results and scripts for arc
xe-nvdk Oct 12, 2025
7135fff
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 12, 2025
3a00ca3
fixing benchmark to load the data
xe-nvdk Oct 12, 2025
757d7fa
Merge branch 'main' of github.com:Basekick-Labs/ClickBench
xe-nvdk Oct 12, 2025
6e70633
fixing token creation
xe-nvdk Oct 12, 2025
32c62ba
fixing api env passing
xe-nvdk Oct 12, 2025
56702bc
fixing db specification for api creation
xe-nvdk Oct 12, 2025
82abc81
making sure that we don't have enabled query cache
xe-nvdk Oct 12, 2025
d6904f8
adding results for arc in clickbench
xe-nvdk Oct 12, 2025
48a8fc9
Merge branch 'main' into main
xe-nvdk Oct 13, 2025
8333f83
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 13, 2025
799b4a7
refining format of the results
xe-nvdk Oct 13, 2025
ecd0414
refining format of the results
xe-nvdk Oct 13, 2025
b905b50
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 13, 2025
ad86bf5
Merge branch 'main' of github.com:Basekick-Labs/ClickBench
xe-nvdk Oct 13, 2025
97da2bd
deleting comments in the results
xe-nvdk Oct 13, 2025
716b715
adding time-series tag
xe-nvdk Oct 13, 2025
95 changes: 95 additions & 0 deletions arc/CHANGELOG.md
@@ -0,0 +1,95 @@
# Arc ClickBench - Changelog

## 2025-10-07 - Fixed for ClickBench Submission

### Issues Reported by ClickBench Maintainers

1. **`--break-system-packages` Required**
- Problem: Script used `pip3 install` globally, requiring `--break-system-packages` on modern Python
- Fix: Created Python virtual environment (`python3 -m venv arc-venv`)
- Result: All dependencies installed in isolated venv, no system modification

2. **`ImportError: cannot import name 'Permission'`**
- Problem: Script tried to import `Permission` from `api.auth`, which doesn't exist
- Fix: Removed the `Permission` import and switched to the simpler `auth.create_token(name, description)`
- Result: Token creation works with Arc's actual auth API

### Changes Made

#### `benchmark.sh`
- ✅ Added Python venv creation and activation
- ✅ Fixed auth token creation (removed `Permission` import)
- ✅ Auto-detect CPU cores for optimal worker count
- ✅ Better error handling (30s timeout with logs on failure)
- ✅ Proper cleanup (stop Arc, deactivate venv)
- ✅ Following chdb/benchmark.sh pattern
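
The core-detection step above can be sketched as follows (a minimal sketch; the actual `benchmark.sh` does this in shell with `nproc`, and the 2x multiplier follows the "2x cores" figure stated in this changelog):

```python
import os

def worker_count(multiplier: int = 2) -> int:
    """Return the gunicorn worker count: 2x detected CPU cores.

    Falls back to a single core if detection fails, since
    os.cpu_count() can return None on some platforms.
    """
    cores = os.cpu_count() or 1
    return cores * multiplier
```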

#### `README.md`
- ✅ Added complete setup instructions
- ✅ Documented virtual environment approach
- ✅ Manual steps for debugging
- ✅ Architecture and performance notes

#### `run.sh`
- ✅ Already working correctly
- ✅ Uses environment variables for configuration
- ✅ Proper error handling

### Testing Checklist

- [ ] Clean Ubuntu/Debian environment
- [ ] Virtual environment creation
- [ ] Arc installation from GitHub
- [ ] Token creation without `Permission` import
- [ ] Server startup with auto-detected workers
- [ ] Dataset download (14GB)
- [ ] Query execution (43 queries × 3 runs)
- [ ] Results formatting
- [ ] Cleanup (venv deactivation, Arc shutdown)

### Expected Behavior

```bash
$ ./benchmark.sh

Installing system dependencies...
Creating Python virtual environment...
Cloning Arc repository...
Installing Arc dependencies...
Creating API token...
Created API token: xvN6zwR4oSd...
Token created successfully
Starting Arc with 28 workers (14 cores detected)...
Arc started with PID: 12345
✓ Arc is ready!
Dataset size: 14G hits.parquet
Dataset contains 99,997,497 rows
Running ClickBench queries via Arc HTTP API...
================================================
Benchmark complete!
✓ Benchmark complete!

Results saved to: results.json
```

### Performance

Tested on M3 Max (14 cores, 36GB RAM):
- **Total time:** ~22 seconds (43 queries)
- **Workers:** 28 (2x cores, optimal for analytical queries)
- **Query cache:** Disabled (per ClickBench rules)

### Notes for ClickBench Maintainers

1. **No system modification:** All dependencies in venv
2. **Simple auth:** No complex permission system, just token creation
3. **Auto-scaling:** Detects CPU cores and sets optimal workers
4. **Error handling:** Clear error messages with logs
5. **Standard format:** Follows chdb pattern (venv, wget, etc.)

### Future Improvements

- [ ] Add MinIO for object storage benchmark variant
- [ ] Test on different CPU architectures (ARM, x86)
- [ ] Add memory usage monitoring
- [ ] Optimize for larger datasets (100M+ rows)
167 changes: 167 additions & 0 deletions arc/README.md
@@ -0,0 +1,167 @@
# Arc - ClickBench Benchmark

Arc is a high-performance time-series data warehouse built on DuckDB, Parquet, and object storage.

## System Information

- **System:** Arc
- **Date:** 2025-10-07
- **Machine:** m3_max (14 cores, 36GB RAM)
- **Tags:** Python, time-series, DuckDB, Parquet, columnar, HTTP API
- **License:** AGPL-3.0
- **Repository:** https://github.com/Basekick-Labs/arc

## Performance

Arc achieves:
- **Write throughput:** 1.89M records/sec (MessagePack binary protocol)
- **ClickBench:** ~22 seconds total (43 analytical queries)
- **Storage:** DuckDB + Parquet with MinIO/S3/GCS backends

## Prerequisites

- Ubuntu/Debian Linux (or compatible)
- Python 3.11+
> **Member:** There should be no prerequisites - the benchmark runs automatically on an empty AWS machine with Ubuntu AMI.
>
> **Author:** Thanks for the feedback. We'll revisit the submission later this year. For now, we're happy to have the benchmark numbers internally and will use them for our own reference. Once we release official binaries, we'll try again to get included in ClickBench.
>
> **Member:** It's not a problem, let's push this PR to ClickBench. The more systems included, the better.
>
> **Author:** Hi @alexey-milovidov, we just updated; we were able to run `benchmark.sh` according to the ClickBench guidelines. Let me know if you have issues running it, but you shouldn't have any. Thank you.

- 8GB+ RAM recommended
- Internet connection for dataset download

## Quick Start

The benchmark script handles everything automatically:

```bash
./benchmark.sh
```

This will:
1. Create Python virtual environment (no system packages modified)
2. Clone Arc repository
3. Install dependencies in venv
4. Start Arc server with optimal worker count (2x CPU cores)
5. Download ClickBench dataset (14GB parquet file)
6. Run 43 queries × 3 iterations
7. Output results in ClickBench JSON format
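
The "43 queries × 3 iterations" loop in step 6 can be sketched as a small timing helper (a hedged sketch; `run_query` here is a hypothetical stand-in for whatever sends one query to Arc's HTTP API):

```python
import time
from typing import Callable, List

def time_runs(run_query: Callable[[], None], runs: int = 3) -> List[float]:
    """Execute one query `runs` times, returning wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(round(time.perf_counter() - start, 4))
    return timings

# For all 43 ClickBench queries, the result is a list of 43 such triples.
```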

## Manual Steps

### 1. Install Dependencies

```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip python3-venv wget curl
```

### 2. Create Virtual Environment

```bash
python3 -m venv arc-venv
source arc-venv/bin/activate
```

### 3. Clone and Setup Arc

```bash
git clone https://github.com/Basekick-Labs/arc.git
cd arc
pip install -r requirements.txt
mkdir -p data logs
```

### 4. Create API Token

```bash
python3 << 'EOF'
from api.auth import AuthManager

auth = AuthManager(db_path='./data/historian.db')
token = auth.create_token(name='clickbench', description='ClickBench benchmark')
print(f"Token: {token}")
EOF
```

### 5. Start Arc Server

```bash
# Auto-detect cores
CORES=$(nproc)
WORKERS=$((CORES * 2))

# Start server
gunicorn -w $WORKERS -b 0.0.0.0:8000 \
    -k uvicorn.workers.UvicornWorker \
    --timeout 300 \
    api.main:app
```

### 6. Download Dataset

```bash
wget https://datasets.clickhouse.com/hits_compatible/hits.parquet
```

### 7. Run Benchmark

```bash
export ARC_URL="http://localhost:8000"
export ARC_API_KEY="your-token-from-step-4"
export PARQUET_FILE="/path/to/hits.parquet"

./run.sh
```

## Configuration

Arc uses optimal settings for ClickBench:

- **Workers:** 2x CPU cores (balanced for analytical queries)
- **Query cache:** Disabled (per ClickBench rules)
- **Storage:** Local filesystem (fastest for single-node)
- **Timeout:** 300 seconds per query
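
A query against Arc's HTTP API presumably looks like an authenticated POST. The sketch below builds such a request with the standard library; note that the `/query` path and the Bearer auth scheme are assumptions for illustration, not confirmed by this README (check Arc's docs for the real endpoint):

```python
import json
import urllib.request

def build_query_request(base_url: str, api_key: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one SQL query.

    The endpoint path and auth header below are illustrative assumptions;
    consult Arc's documentation for the actual API shape.
    """
    payload = json.dumps({"sql": sql}).encode()
    return urllib.request.Request(
        url=f"{base_url}/query",                   # assumed path
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(req, timeout=300)` would match the 300-second timeout listed above.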

## Results Format

Results are output in ClickBench JSON format:

```json
[
  [0.0226, 0.0233, 0.0284],
  [0.0324, 0.0334, 0.0392],
  ...
]
```

Each array contains 3 execution times (in seconds) for the same query.
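
A quick way to sanity-check such a results file, as a minimal sketch assuming the JSON layout shown above (one triple of run times per query):

```python
import json

def summarize(results_json: str):
    """Return (query_count, best_time_per_query) from a ClickBench-style
    results array."""
    runs = json.loads(results_json)
    best = [min(triple) for triple in runs]
    return len(runs), best

count, best = summarize('[[0.0226, 0.0233, 0.0284], [0.0324, 0.0334, 0.0392]]')
# count is the number of queries; best holds each query's fastest run
```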

## Notes

- **Virtual Environment:** All dependencies installed in isolated venv (no `--break-system-packages` needed)
- **Authentication:** Uses Arc's built-in token auth (simpler than Permission-based auth)
- **Query Cache:** Disabled to ensure fair benchmark (no cache hits)
- **Worker Count:** Auto-detected based on CPU cores, optimized for analytical workloads
- **Timeout:** Generous 300s timeout for complex queries

## Architecture

```
ClickBench Query → Arc HTTP API → DuckDB → Parquet File → Results
```

Arc queries the Parquet file directly via DuckDB's `read_parquet()` function, providing excellent analytical performance without data import.
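
In DuckDB terms that means each query runs directly against the file, e.g. `SELECT COUNT(*) FROM read_parquet('hits.parquet')`. A small sketch of pointing a query template at the Parquet file (the `{table}` placeholder convention is an assumption for illustration, not Arc's actual templating):

```python
def render_query(template: str, parquet_path: str) -> str:
    """Substitute the table placeholder with a DuckDB read_parquet() call,
    so the query scans the Parquet file directly with no import step."""
    return template.replace("{table}", f"read_parquet('{parquet_path}')")

sql = render_query("SELECT COUNT(*) FROM {table}", "hits.parquet")
# sql == "SELECT COUNT(*) FROM read_parquet('hits.parquet')"
```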

## Performance Characteristics

Arc is optimized for:
- **High-throughput writes** (1.89M RPS with MessagePack)
- **Analytical queries** (DuckDB's columnar engine)
- **Object storage** (S3, GCS, MinIO compatibility)
- **Time-series workloads** (built-in time-based indexing)

## Support

- GitHub: https://github.com/Basekick-Labs/arc
- Issues: https://github.com/Basekick-Labs/arc/issues
- Docs: https://docs.arc.basekick.com (coming soon)

## License

Arc Core is licensed under AGPL-3.0.