A comprehensive and automated tool for running TPC-DS benchmarks on Cloudberry Database, HashData, Greenplum, and PostgreSQL. Originally derived from Pivotal TPC-DS.
This toolkit provides:
- End-to-end automated TPC-DS benchmark execution
- Support for both local and cloud deployment models
- Configurable data generation (1GB to 100TB scale factors)
- Flexible query execution with customizable parameters
- Detailed performance reporting and scoring
- Support for multiple database products and versions
- Optimized for MPP architectures
- Easy-to-use configuration system
```shell
# 1. Clone the repository
git clone https://github.com/cloudberry-contrib/TPC-DS-Toolkit.git
cd TPC-DS-Toolkit

# 2. Configure your environment
vim tpcds_variables.sh

# 3. Run the benchmark
./run.sh
```

Best for: Cloudberry Database, Greenplum, HashData Lightning
This mode leverages the MPP architecture by generating data directly on segment nodes and loading it using the gpfdist protocol, maximizing resource utilization for faster benchmark execution.
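As an illustration of that mechanism, loading pairs a `gpfdist` server on each segment host with a readable external table. The sketch below is hypothetical (simplified schema, host names, and port are assumptions; the toolkit's `04_load` step generates the real definitions):

```sql
-- Hypothetical sketch; the toolkit generates the real DDL. Assumes a
-- gpfdist server is already serving the generated files on each segment
-- host, e.g.:  gpfdist -d /data/dsgendata -p 8081 &
CREATE READABLE EXTERNAL TABLE ext_tpcds.store_sales (LIKE tpcds.store_sales)
LOCATION ('gpfdist://sdw1:8081/store_sales_*.dat',
          'gpfdist://sdw2:8081/store_sales_*.dat')
FORMAT 'TEXT' (DELIMITER '|');

-- The load itself is then a parallel INSERT ... SELECT:
INSERT INTO tpcds.store_sales SELECT * FROM ext_tpcds.store_sales;
```

Because every segment reads from the external table in parallel, ingestion scales with the number of segment hosts rather than a single client connection.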
Key Configuration:

```shell
export RUN_MODEL="local"
export TABLE_ACCESS_METHOD="USING ao_column"
export TABLE_USE_PARTITION="true"
```

Requirements:

- Running Cloudberry Database or Greenplum cluster
- `gpadmin` access to the coordinator node
- Password-less SSH between coordinator and segment nodes
Please refer to QuickStartLocal.md for detailed instructions.
Best for: HashData Cloud, Cloudberry Database, Greenplum, PostgreSQL
This mode generates data on the client machine and imports it into the database using the COPY command, making it suitable for cloud deployments and remote clients.
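As a sketch of what the load step does in this mode (the file path below is hypothetical, and `dsdgen` emits pipe-delimited text files), each table is populated from the generated flat files:

```sql
-- Illustrative only; the toolkit's load scripts issue the real commands
-- and the file path here is hypothetical. When the flat files live on the
-- client machine, the client-side form is psql's \copy:
--   \copy tpcds.store_sales from '/tmp/dsbenchmark/store_sales_1_2.dat' with (format text, delimiter '|')
COPY tpcds.store_sales FROM '/tmp/dsbenchmark/store_sales_1_2.dat'
WITH (FORMAT text, DELIMITER '|');
```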
Key Configuration:

```shell
export RUN_MODEL="cloud"
export PSQL_OPTIONS="-h <host> -p <port> -U <user>"
export RANDOM_DISTRIBUTION="true"
export TABLE_USE_PARTITION="true"
```

Requirements:

- `psql` client installed on the client machine
- Passwordless database access (via `.pgpass` file)
- Sufficient disk space for data generation
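The disk-space requirement can be estimated from the scale factor (scale 1 = 1GB of raw data). Below is a rough, hypothetical planning helper; the 2x headroom factor for generation plus staging is an assumption, not a toolkit rule:

```shell
# Hypothetical rule of thumb: dsdgen's raw output is roughly
# GEN_DATA_SCALE gigabytes, so plan about twice that in free space for
# generation plus staging. The 2x factor is an assumption.
required_free_gb() { echo $(( $1 * 2 )); }
required_free_gb 1000   # scale factor 1000 (1TB raw); prints 2000
```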
Please refer to QuickStartCloud.md for detailed instructions.
| Version | Date | Specification |
|---|---|---|
| 4.0.0 | 2023/12/15 | |
| 3.2.0 | 2021/06/15 | |
| 2.1.0 | 2015/11/12 | |
| 1.3.1 | 2015/02/19 | |
The latest version of this tool is based on the TPC-DS 4.0.0 specification.
This tool is built with shell scripts and has been tested primarily on CentOS-based operating systems. To accommodate different products, options are available for storage types, partitioning, optimizer settings, and distribution policies. Please review `tpcds_variables.sh` for the detailed configuration options for each model and product.
- Cloudberry Database 1.x / 2.x
- HashData Enterprise / HashData Lightning
- Greenplum Database 4.x / 5.x / 6.x / 7.x
- PostgreSQL 15.x / 16.x / 17.x
For running tests on the coordinator host:
This mode leverages the MPP architecture by generating data in the segment nodes' data directories and loading it with the `gpfdist` protocol. More resources are utilized for data generation and loading, which accelerates the test process.
- Set `RUN_MODEL="local"` in `tpcds_variables.sh`
- Ensure you have a running Cloudberry Database with `gpadmin` access
- Create a `gpadmin` database
- Configure password-less `ssh` between `mdw` (coordinator) and segment nodes (`sdw1..n`)
For running tests from a remote client:
In this mode, all data is generated on the client machine and imported into the database using the COPY command. This mode works for HashData Cloud, Cloudberry Database, Greenplum, and HashData Lightning, and should work for other PostgreSQL-compatible products. However, local mode is recommended for non-cloud MPP products.
- Set `RUN_MODEL="cloud"` in `tpcds_variables.sh`
- Install the `psql` client with passwordless access (`.pgpass`)
- Create a `gpadmin` database with: `ALTER ROLE gpadmin SET warehouse=testforcloud;`
- Configure required variables in `tpcds_variables.sh`:

```shell
export RANDOM_DISTRIBUTION="true"
export TABLE_STORAGE_OPTIONS="compresstype=zstd, compresslevel=5"
export CUSTOM_GEN_PATH="/tmp/dsbenchmark"
export GEN_DATA_PARALLEL="2"
```
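For the passwordless access requirement, a minimal sketch of setting up a pgpass file follows. The host and credentials are hypothetical; the libpq entry format is `hostname:port:database:username:password`, and by default libpq reads `~/.pgpass` (the `PGPASSFILE` variable overrides the location):

```shell
# Create a pgpass file with a hypothetical entry and point libpq at it.
PGPASSFILE="$(mktemp)"
export PGPASSFILE
cat > "$PGPASSFILE" <<'EOF'
mdw:5432:gpadmin:gpadmin:changeme
EOF
chmod 0600 "$PGPASSFILE"   # libpq ignores the file unless permissions are 0600
```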
The following conventions are used in this document: `mdw` for the coordinator node, and `sdw1..n` for segment nodes.
The TPC-DS benchmark execution follows these sequential steps:
1. **Compile TPC-DS Tools** (`00_compile_tpcds`)
   - Builds the `dsdgen` (data generation) and `dsqgen` (query generation) tools from source
   - Only needs to be run once unless the toolkit is updated

2. **Generate Test Data** (`01_gen_data`)
   - Creates datasets using `dsdgen` based on the specified scale factor
   - Supports parallel data generation for faster execution
   - Data is generated either locally on segment nodes or on a remote client depending on the run model

3. **Initialize Cluster** (`02_init`)
   - Configures database settings and GUCs for optimal benchmark performance
   - Records segment configurations for data generation and loading

4. **Create Database Objects** (`03_ddl`)
   - Creates schemas, tables, and indexes according to TPC-DS specifications
   - Supports different storage types, partitioning, and distribution policies
   - Can drop and recreate existing objects if needed

5. **Load Data** (`04_load`)
   - Imports generated datasets into the database
   - Uses the `gpfdist` protocol for local mode or the `COPY` command for cloud mode
   - Supports parallel loading for faster data ingestion

6. **Analyze Tables** (`05_analyze`)
   - Computes table statistics for optimal query planning
   - Improves query performance by providing accurate cardinality estimates

7. **Generate Queries** (`06_sql`)
   - Creates the 99 TPC-DS benchmark queries using `dsqgen`
   - Supports custom query generation seeds for reproducible results

8. **Single User Test (Power Test)**
   - Executes all 99 queries sequentially to measure single-threaded performance
   - Measures the time to complete all queries (power metric)

9. **Single User Reports** (`07_single_user_reports`)
   - Generates detailed reports for the single user test results
   - Stores results in the database for further analysis

10. **Multi User Test (Throughput Test)** (`08_multi_user`)
    - Executes multiple query streams concurrently to measure system throughput
    - Simulates real-world workloads with multiple users accessing the system

11. **Multi User Reports** (`09_multi_user_reports`)
    - Generates detailed reports for the multi user test results
    - Analyzes query performance under concurrent workloads

12. **Final Score Calculation** (`10_score`)
    - Computes the final QphDS (Queries per Hour TPC-DS) score
    - Combines power and throughput metrics according to TPC-DS specifications
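The final step's combination of metrics can be illustrated roughly as follows. This is only a hedged sketch of the geometric-mean shape used by the TPC-DS specification's QphDS@SF metric; the toolkit's `10_score` step implements the actual calculation, and the helper name and the numbers below are hypothetical:

```shell
# Hypothetical illustration: work done (scale factor x query executions)
# divided by a geometric mean of phase elapsed times (seconds), scaled to
# per-hour. The real scoring formula in the TPC-DS spec has more terms;
# 10_score is authoritative.
qphds_sketch() {
  awk -v sf="$1" -v q="$2" -v t1="$3" -v t2="$4" -v t3="$5" \
    'BEGIN { printf "%.0f\n", sf * q * 3600 / (t1 * t2 * t3)^(1/3) }'
}
qphds_sketch 1000 297 3600 3600 3600   # all phases one hour; prints 297000
```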
Install the dependencies on `mdw` for compiling the `dsdgen` (data generation) and `dsqgen` (query generation) tools:

```shell
ssh root@mdw
yum -y install gcc make byacc flex unzip
```

The original source code is from the TPC website.

Simply clone the repository with Git or download the source code from GitHub:

```shell
ssh gpadmin@mdw
git clone https://github.com/cloudberry-contrib/TPC-DS-Toolkit.git
```

Place the folder under /home/gpadmin/ and change ownership to gpadmin:

```shell
chown -R gpadmin:gpadmin TPC-DS-Toolkit
```

To run the benchmark, log in as gpadmin on the coordinator node (mdw):

```shell
ssh gpadmin@mdw
cd ~/TPC-DS-Toolkit
./run.sh
```

By default, this runs a scale 1 (1GB) benchmark with 1 concurrent user, from data generation through to score computation, in the background. Logs are stored as `tpcds_<timestamp>.log` in the `~/TPC-DS-Toolkit` directory.
The benchmark is controlled through the `tpcds_variables.sh` file, which is organized into the following modules:

These options define the core environment settings for the benchmark:

```shell
# Core settings
export ADMIN_USER="gpadmin"   # OS user that executes this toolkit
export BENCH_ROLE="dsbench"   # Database user for running the benchmark

# Remote cluster connection
export PSQL_OPTIONS=""        # Database connection options (host, port, user)
```

These options control the scale and concurrency of the benchmark:
```shell
## Set to "local" to run the benchmark on the COORDINATOR host or "cloud" to run from a remote client.
export RUN_MODEL="local"
## Set to true to enable more detailed logging for troubleshooting purposes.
export LOG_DEBUG="false"
## Scale factor for the TPC-DS dataset, default is 1.
export GEN_DATA_SCALE="1"     # Scale factor (1 = 1GB, 1000 = 1TB)
export MULTI_USER_COUNT="2"   # Number of concurrent users for throughput tests
## Database schema for TPC-DS tables
export DB_SCHEMA_NAME="tpcds"

# For large scale tests, consider:
# - 3TB:  GEN_DATA_SCALE="3000"  with MULTI_USER_COUNT="5"
# - 10TB: GEN_DATA_SCALE="10000" with MULTI_USER_COUNT="7"
# - 30TB: GEN_DATA_SCALE="30000" with MULTI_USER_COUNT="10"
```

These options control the execution of each benchmark step:
```shell
## Step 00_compile_tpcds: Compile TPC-DS tools
export RUN_COMPILE_TPCDS="true"      # Compile data/query generators (one-time setup)

## Step 01_gen_data: Generate test data
export RUN_GEN_DATA="true"           # Generate test data
# To run another TPC-DS with a different BENCH_ROLE using existing tables and data,
# the queries need to be regenerated with the new role.
# Change BENCH_ROLE and set RUN_GEN_DATA to true and GEN_NEW_DATA to false.
export GEN_NEW_DATA="true"           # Generate new data vs reusing existing data
### Default path to store the generated benchmark data; separate multiple paths with spaces.
export CUSTOM_GEN_PATH="/tmp/dsbenchmark"
### How many parallel processes to run on each data path to generate data in all modes
export GEN_DATA_PARALLEL="2"
### The following variable only takes effect when RUN_MODEL is set to "local".
export USING_CUSTOM_GEN_PATH_IN_LOCAL_MODE="false"   # Use custom data generation path in local mode

## Step 02_init: Initialize cluster
export RUN_INIT="true"               # Initialize cluster settings and GUCs

## Step 03_ddl: Create database objects
export RUN_DDL="true"                # Create database schemas/tables
export DROP_EXISTING_TABLES="true"   # Drop existing tables before creating new ones
## Set to true to use random distribution for test tables.
export RANDOM_DISTRIBUTION="false"   # Use random distribution for fact tables

## Step 04_load: Load generated data
export RUN_LOAD="true"               # Load generated data
export LOAD_PARALLEL="2"             # Number of parallel processes to load data (max 24)
export TRUNCATE_TABLES="true"        # Truncate existing tables before loading data

## Step 05_analyze: Compute table statistics
export RUN_ANALYZE="true"            # Compute table statistics for query optimization
export RUN_ANALYZE_PARALLEL="5"      # Number of parallel processes for analyze (max 24)

## Step 06_sql: Generate and run queries
export RUN_SQL="true"                # Run power test queries
### Set statement memory limit for each query execution, default is 1GB.
export STATEMENT_MEM="1GB"           # Memory per statement for single-user test
export RUN_QGEN="true"               # Generate queries for TPC-DS benchmark
export QUERY_INTERVAL="0"            # Wait time between each query execution

## Step 07_single_user_reports: Generate single user reports
export RUN_SINGLE_USER_REPORTS="true"   # Generate single-user test results

## Step 08_multi_user: Run multi-user test
export RUN_MULTI_USER="false"        # Run throughput test queries
### Set statement memory limit for each query execution in multi-user mode, default is 1GB.
export STATEMENT_MEM_MULTI_USER="1GB"   # Memory per statement for multi-user test
export RUN_MULTI_USER_QGEN="true"    # Generate queries for multi-user test

## Step 09_multi_user_reports: Generate multi-user reports
export RUN_MULTI_USER_REPORTS="false"   # Generate multi-user test results

## Step 10_score: Calculate final score
export RUN_SCORE="false"             # Compute final benchmark score
```

These options control various miscellaneous settings:
```shell
## Set to 1 if you want the run to stop when an error occurs during the single and multi user tests.
export ON_ERROR_STOP="0"          # Stop on error flag (1 to stop)
## Set to true to generate queries with the fixed seed "2016032410" so the same queries are generated for all tests.
export UNIFY_QGEN_SEED="true"     # Use unified seed for query generation
export SINGLE_USER_ITERATIONS="1" # Number of times to run the power test
export EXPLAIN_ANALYZE="false"    # Set to true for query plan analysis
## Set to on/off to enable vectorization
export ENABLE_VECTORIZATION="off" # Enable vectorized computing
## Set where gpfdist will run: p (primary) or m (mirror)
export GPFDIST_LOCATION="p"       # gpfdist running location

# System variables (auto-set, no need to modify)
export OSVERSION=$(uname)
export ADMIN_USER=$(whoami)
export ADMIN_HOME=$(eval echo ${HOME}/${ADMIN_USER})
export MASTER_HOST=$(hostname -s)
export DB_SCHEMA_NAME="$(echo "${DB_SCHEMA_NAME}" | tr '[:upper:]' '[:lower:]')"
export DB_EXT_SCHEMA_NAME="ext_${DB_SCHEMA_NAME}"
export GEN_PATH_NAME="dsgendata_${DB_SCHEMA_NAME}"
export BENCH_ROLE="$(echo "${BENCH_ROLE}" | tr '[:upper:]' '[:lower:]')"
export DB_CURRENT_USER=$(psql ${PSQL_OPTIONS} -t -c "SELECT current_user;" 2>/dev/null | tr -d '[:space:]')
```

Key options explained:
- `ON_ERROR_STOP`: Set to 1 to stop the benchmark when an error occurs during query execution.
- `UNIFY_QGEN_SEED`: Uses a unified seed (2016032410) for query generation to ensure reproducible results.
- `SINGLE_USER_ITERATIONS`: Controls how many times the power test runs. The fastest query time from multiple runs is used for final scoring.
- `EXPLAIN_ANALYZE`: When set to `true`, executes queries with `EXPLAIN ANALYZE` to see query plans, costs, and memory usage. For debugging only, as it affects benchmark results.
- `ENABLE_VECTORIZATION`: Set to `on` to enable vectorized computing for better performance (supported in Cloudberry Database 2.0+ and HashData Lightning 1.5.3+). Only works with AO column and PAX table formats.
- `GPFDIST_LOCATION`: Specifies where `gpfdist` will run: `p` for primary segment nodes or `m` for mirror segment nodes.
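As a usage sketch, the step flags can be combined to rerun only part of the pipeline. The example below reuses existing data and tables while repeating the power test and its report; the values are illustrative choices for that scenario, not defaults:

```shell
# Illustrative partial rerun: skip compile/generate/init/ddl/load/analyze,
# regenerate and rerun the power-test queries, then rebuild the report.
export RUN_COMPILE_TPCDS="false"
export RUN_GEN_DATA="false"
export RUN_INIT="false"
export RUN_DDL="false"
export RUN_LOAD="false"
export RUN_ANALYZE="false"
export RUN_SQL="true"
export RUN_QGEN="true"
export RUN_SINGLE_USER_REPORTS="true"
```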
These options control the storage settings for tables:

```shell
## TABLE_ACCESS_METHOD can be ao_row / ao_column / heap in both GPDB 7 and CBDB.
## TABLE_ACCESS_METHOD can be "PAX" for the PAX table format (CBDB 2.0 only);
## remove the blocksize option from TABLE_STORAGE_OPTIONS in that case.
## TABLE_ACCESS_METHOD only works for Cloudberry and Greenplum 7.0 or later.
# export TABLE_ACCESS_METHOD="USING ao_column"   # Uncomment to enable

## Set to use partitioning for the following tables:
## catalog_returns / catalog_sales / inventory / store_returns / store_sales / web_returns / web_sales
export TABLE_USE_PARTITION="true"

## Set TABLE_STORAGE_OPTIONS with different options for GP/CBDB/Cloud
export TABLE_STORAGE_OPTIONS="WITH (appendoptimized=true, orientation=column, compresstype=zstd, compresslevel=5)"
```

Note:
- `TABLE_ACCESS_METHOD`: Default is commented out for compatibility with all products. For Cloudberry Database and Greenplum 7.0+, set to `USING ao_column` for best performance. `USING PAX` is available exclusively for Cloudberry Database 2.0.
- For earlier Greenplum versions without `TABLE_ACCESS_METHOD` support, use full storage options in `TABLE_STORAGE_OPTIONS`: `export TABLE_STORAGE_OPTIONS="appendoptimized=true, orientation=column, compresstype=zlib, compresslevel=5, blocksize=1048576"`
- Distribution policies are defined in `TPC-DS-Toolkit/03_ddl/distribution.txt`. For products supporting `REPLICATED` distribution, 14 dimension tables use `REPLICATED` distribution by default. For early Greenplum versions without `REPLICATED` support, see `TPC-DS-Toolkit/03_ddl/distribution_original.txt`.
- Table partition definitions are in `TPC-DS-Toolkit/03_ddl/*.sql.partition`. When using partitioning with column-oriented tables, large block sizes may cause high memory consumption. Reduce the block size or partition count if you encounter out-of-memory errors.
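To show how these settings come together, here is a hedged DDL sketch in the spirit of the toolkit's `03_ddl` scripts. The column list is heavily simplified and the partition boundaries are hypothetical; the real files define the full TPC-DS schema:

```sql
-- Illustrative only: simplified columns and hypothetical partition bounds.
-- Combines an access method (TABLE_ACCESS_METHOD), storage options
-- (TABLE_STORAGE_OPTIONS), a distribution key, and range partitioning on
-- the sold-date surrogate key (TABLE_USE_PARTITION).
CREATE TABLE tpcds.store_sales (
    ss_sold_date_sk integer,
    ss_item_sk      integer NOT NULL,
    ss_quantity     integer,
    ss_net_paid     numeric(7,2)
)
USING ao_column
WITH (compresstype=zstd, compresslevel=5)
DISTRIBUTED BY (ss_item_sk)
PARTITION BY RANGE (ss_sold_date_sk)
    (START (2450815) END (2453005) EVERY (365),
     DEFAULT PARTITION others);
```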
For optimal performance, consider the following tuning recommendations:

1. **Memory Settings**

   ```shell
   # Recommended for clusters with 8GB+ RAM per segment node
   export STATEMENT_MEM="8GB"
   # Should be set to STATEMENT_MEM / MULTI_USER_COUNT for balanced performance
   export STATEMENT_MEM_MULTI_USER="4GB"
   ```

2. **Storage Optimization**

   ```shell
   # For best compression and query performance
   export TABLE_ACCESS_METHOD="USING ao_column"
   export TABLE_STORAGE_OPTIONS="WITH (appendoptimized=true, orientation=column, compresstype=zstd, compresslevel=9)"
   # Use partitioned tables for better query performance on large datasets
   export TABLE_USE_PARTITION="true"
   ```

3. **Concurrency Tuning**

   ```shell
   # Adjust based on available CPU cores
   export GEN_DATA_PARALLEL="$(nproc)"
   # Set based on available system resources and expected workload
   export MULTI_USER_COUNT="$(( $(nproc) / 2 ))"
   # Increase parallel loading for faster data ingestion
   export LOAD_PARALLEL="4"
   export RUN_ANALYZE_PARALLEL="8"
   ```

4. **Enable Vectorization** (for supported systems)

   ```shell
   export ENABLE_VECTORIZATION="on"
   ```

5. **Distribution Policy**

   ```shell
   # Use random distribution for cloud products or when unsure about data distribution
   export RANDOM_DISTRIBUTION="true"
   ```

6. **Optimizer Settings**

   ```shell
   # Adjust optimizer settings in 01_gen_data/optimizer.txt
   # Configure ORCA vs Planner usage for each query
   # After changing settings, regenerate queries with RUN_QGEN="true"
   ```

7. **Query Generation**

   ```shell
   # Use unified seed for reproducible results
   export UNIFY_QGEN_SEED="true"
   ```

8. **Power Test Iterations**

   ```shell
   # Run the power test multiple times and use the fastest result
   export SINGLE_USER_ITERATIONS="3"
   ```
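The memory rule of thumb in item 1 can be written out explicitly. The helper below is hypothetical (not part of the toolkit) and works in whole gigabytes:

```shell
# Hypothetical helper for the STATEMENT_MEM / MULTI_USER_COUNT rule of
# thumb: split the single-user statement memory budget (GB) across the
# concurrent streams of the throughput test.
per_stream_mem_gb() { echo $(( $1 / $2 )); }
STATEMENT_MEM_MULTI_USER="$(per_stream_mem_gb 8 2)GB"
echo "$STATEMENT_MEM_MULTI_USER"   # prints 4GB
```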
The TPC-DS queries were modified in the following ways to ensure compatibility:

Changed date addition syntax from:

```sql
and (cast('2000-02-28' as date) + 30 days)
```

To:

```sql
and (cast('2000-02-28' as date) + '30 days'::interval)
```

Affected queries: 5, 12, 16, 20, 21, 32, 37, 40, 77, 80, 82, 92, 94, 95, and 98.
Added subqueries for ORDER BY clauses with column aliases:

```sql
-- New version with subquery
select * from (
    -- Original query
) AS sub
order by
    lochierarchy desc
    ,case when lochierarchy = 0 then s_state end
    ,rank_within_parent
limit 100;
```

Affected queries: 36 and 70.
Modified query templates to remove references to columns not present in the query, specifically in common table expressions where alias columns were used in dynamic filters. Affected query: 86.
Added table aliases to improve query parser compatibility. Affected queries: 2, 14, and 23.
Added LIMIT 100 to queries that could produce very large result sets. Affected queries: 34, 64, and 71.
1. **Missing or Invalid Environment Variables**

   Ensure all required environment variables in `tpcds_variables.sh` are set correctly. If any variable is missing or invalid, the script aborts and displays the problematic variable name. Double-check the following key variables: `RUN_MODEL`, `GEN_DATA_SCALE`, `TABLE_ACCESS_METHOD`, `PSQL_OPTIONS`.

2. **Permission Errors**

   - Verify ownership: `chown -R gpadmin:gpadmin /home/gpadmin/TPC-DS-Toolkit`
   - Ensure `gpadmin` has proper database access permissions

3. **Data Generation Failures**

   - Confirm successful compilation of `dsdgen`
   - Verify `CUSTOM_GEN_PATH` points to a valid, writable directory
   - Check available disk space

4. **Query Execution Errors**

   - Ensure tables and schemas exist (set `RUN_DDL="true"` on the first run)
   - Look for syntax errors in modified queries
   - Verify database connectivity

5. **Performance Issues**

   - Adjust memory settings based on system resources
   - Enable vectorization if supported
   - Use appropriate storage options for your workload
   - Consider partitioning for large tables
   - Adjust database parameters (see the `set_gucs` script)

For detailed diagnostics, examine:

- Main log file: `tpcds_<timestamp>.log` in `~/TPC-DS-Toolkit`
- Database server logs
- System resource utilization during test runs
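When combing through the main log, a small helper like the one below can surface failures quickly. It is hypothetical (the search pattern is an assumption, not the toolkit's own error format):

```shell
# Hypothetical helper: list lines in a log file that look like failures,
# with line numbers; prints a fallback message when nothing matches.
scan_log() {
  grep -niE 'error|fatal|failed' "$1" || echo "no failures found"
}
```

Usage: `scan_log ~/TPC-DS-Toolkit/tpcds_<timestamp>.log`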
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.