NOTICE: This repo contains modifications to the official TPC-DS specification, so results produced with it are not comparable to officially audited results.
The following tables are currently used:
Dimension Tables:
- DATE_DIM
- TIME_DIM
- CUSTOMER
- CUSTOMER_ADDRESS
- CUSTOMER_DEMOGRAPHICS
- HOUSEHOLD_DEMOGRAPHICS
- ITEM
- PROMOTION
- STORE
Fact Tables:
- STORE_SALES
These steps set up your environment to perform distributed data generation for the given scale factor.
The scripts assume that you have passwordless SSH from the master node (where you will clone the repos) to every DataNode in your cluster.
These scripts also assume that your $HOME directory is the same path on all DataNodes.
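
Both assumptions can be checked before starting. The following is a minimal sketch and not part of the kit: the function name `check_node` and the `SSH` override variable are hypothetical, and hostnames are assumed to be passed in by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of the kit.
# SSH can be overridden (for example with a stub) for testing.
SSH="${SSH:-ssh}"

# check_node HOST: succeed only if passwordless SSH works and the remote
# $HOME matches the local one.
check_node() {
  local host="$1" remote_home
  # -n keeps ssh from reading stdin; BatchMode=yes makes ssh fail
  # instead of prompting for a password.
  remote_home="$($SSH -n -o BatchMode=yes "$host" 'echo "$HOME"')" || {
    echo "FAIL $host: passwordless SSH not working"
    return 1
  }
  if [ "$remote_home" != "$HOME" ]; then
    echo "FAIL $host: remote HOME is $remote_home, expected $HOME"
    return 1
  fi
  echo "OK $host"
}
```

A possible use, with placeholder hostnames: `for h in node1 node2; do check_node "$h"; done`.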
- `sudo yum -y install gcc make flex bison byacc git`
- `cd $HOME` (use your `$HOME` directory as it's hard coded in some scripts for now)
- `git clone https://github.com/gregrahn/tpcds-kit.git`
- `cd tpcds-kit`
- `git checkout --quiet eff5de2`
- `cd tools`
- `make -f Makefile.suite`
- `cd $HOME` (use your `$HOME` directory as it's hard coded in some scripts for now)
- Clone this repo:
  - `git clone https://github.com/cloudera/impala-tpcds-kit`
  - `cd impala-tpcds-kit`
- Edit `tpcds-env.sh` and modify as needed. The defaults assume you have a `/user/$USER` directory in HDFS. If you don't, run these commands:
  - `sudo -u hdfs hdfs dfs -mkdir /user/$USER`
  - `sudo -u hdfs hdfs dfs -chown $USER /user/$USER`
  - `sudo -u hdfs hdfs dfs -chmod 777 /user/$USER`
- Edit `dn.txt` and put one DataNode hostname per line, with no blank lines.
- Run `push-bits.sh`, which will scp `tpcds-kit` and `impala-tpcds-kit` to each DataNode listed in `dn.txt`.
- Run `set-nodenum.sh`. This will create `impala-tpcds-kit/nodenum.sh` on every DataNode and set the value accordingly. This is used to determine what portion of the distributed data generation is done on each node.
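
Since `dn.txt` must not contain blank lines, it can help to validate the file before pushing bits. The sketch below is not part of the kit, and both function names are hypothetical. It also illustrates one plausible numbering scheme in which a node's number is its 1-based line position in `dn.txt`; whether `set-nodenum.sh` assigns values exactly this way is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the kit.

# validate_dn FILE: fail if FILE is missing, empty, or contains a blank line.
validate_dn() {
  local file="$1"
  [ -s "$file" ] || { echo "$file is missing or empty"; return 1; }
  # A match on this pattern means a blank (or whitespace-only) line exists.
  if grep -q -E '^[[:space:]]*$' "$file"; then
    echo "$file contains blank lines"
    return 1
  fi
  # grep -c . counts the non-empty lines, i.e. the listed hostnames.
  echo "$file lists $(grep -c . "$file") DataNodes"
}

# node_number FILE HOST: print the 1-based line position of HOST in FILE
# (assumed numbering scheme), or fail if HOST is not listed.
node_number() {
  local file="$1" host="$2" n
  # -x matches whole lines only, -F treats HOST as a literal string.
  n="$(grep -n -x -F -- "$host" "$file" | head -n 1 | cut -d: -f1)"
  [ -n "$n" ] || return 1
  echo "$n"
}
```

For example, `validate_dn dn.txt && node_number dn.txt "$(hostname)"` would report the position of the current host, assuming `dn.txt` lists hosts exactly as `hostname` prints them.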
Data lands directly in HDFS, so no local storage is required.
- `hdfs-mkdirs.sh` - Make HDFS directories for each table.
- `gen-dims.sh` - Generate dimension flat files (runs on one DataNode only).
- `run-gen-facts.sh` - Runs `gen-facts.sh` on each DataNode via ssh to generate STORE_SALES flat files.
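
To illustrate how a per-node number can select a slice of the fact data without touching local disk, here is a hedged sketch. It relies on the `dsdgen` options `-PARALLEL`, `-CHILD`, and `-FILTER Y` (write to stdout) and on `hdfs dfs -put -` reading from stdin. The contiguous block assignment, the function names, and the output file naming are assumptions; the actual invocation in `gen-facts.sh` may differ and may pass additional options.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names and the numbering scheme are assumptions.

# child_ids NODENUM THREADS_PER_NODE: print the dsdgen child indices a node
# would own if children are assigned to nodes in contiguous blocks.
child_ids() {
  local node="$1" threads="$2"
  local first=$(( (node - 1) * threads + 1 ))
  local last=$(( node * threads ))
  seq "$first" "$last"
}

# gen_cmd CHILD TOTAL SCALE HDFS_DIR: print (not run) a pipeline that streams
# one STORE_SALES chunk from dsdgen's stdout into a file in HDFS.
gen_cmd() {
  local child="$1" total="$2" scale="$3" dir="$4"
  echo "dsdgen -TABLE store_sales -SCALE $scale -PARALLEL $total -CHILD $child -FILTER Y" \
       "| hdfs dfs -put - $dir/store_sales_${child}_${total}.dat"
}
```

With 4 generator processes per node, `child_ids 2 4` lists the chunks the second node would produce, and `gen_cmd` shows the corresponding pipeline for a single chunk.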
- `impala-create-external-tables.sh` - Creates a Hive database and the external tables pointing to flat files.
- `impala-load-dims.sh` - Load dimension tables (no format specified, modify as necessary, but not required).
- `impala-load-store_sales.sh` - Load STORE_SALES table which uses dynamic partitioning, one partition per calendar day.
- `impala-compute-stats.sh` - Gather table and column statistics on all tables.
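
The generation and load scripts can be chained with a small driver. This is a sketch, not part of the kit: it assumes the scripts are run in the order listed above from the `impala-tpcds-kit` directory, and the `run_steps` function and `DRY_RUN` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical driver; not part of the kit. Assumes the listed order is
# the intended execution order.
STEPS=(
  hdfs-mkdirs.sh
  gen-dims.sh
  run-gen-facts.sh
  impala-create-external-tables.sh
  impala-load-dims.sh
  impala-load-store_sales.sh
  impala-compute-stats.sh
)

# run_steps DIR: run each step from inside DIR, stopping at the first failure.
# With DRY_RUN=1, only print the step names.
run_steps() {
  local dir="$1" step
  for step in "${STEPS[@]}"; do
    echo "==> $step"
    [ "${DRY_RUN:-0}" = "1" ] && continue
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$dir" && "./$step") || { echo "FAILED: $step"; return 1; }
  done
}
```

Running `DRY_RUN=1 run_steps "$HOME/impala-tpcds-kit"` first prints the plan without executing anything.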
`impala-tpcds-kit/queries` contains queries that execute on Impala (v2.3+). Note that the
queries are not qualified with a database name. To run them, start impala-shell
with the `-d` parameter. Alternatively, issue a `use db_name` statement
before running each individual query.
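
As a sketch of the first option, the helper below runs every query file in a directory against a given database using the impala-shell `-d` (database) and `-f` (query file) options. The function name, the `DRY_RUN` variable, and the assumption that query files end in `.sql` are hypothetical; the database name is whatever was created during loading.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the kit.

# run_queries DB DIR: run each *.sql file in DIR against database DB.
# With DRY_RUN=1, only print the commands.
run_queries() {
  local db="$1" dir="$2" f
  for f in "$dir"/*.sql; do
    # Skip the unexpanded pattern when DIR contains no .sql files.
    [ -e "$f" ] || continue
    echo "impala-shell -d $db -f $f"
    [ "${DRY_RUN:-0}" = "1" ] || impala-shell -d "$db" -f "$f"
  done
}
```

For example, `DRY_RUN=1 run_queries db_name queries` lists the commands that would be issued, with `db_name` standing in for the real database.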