This repository provides code for Robust Index Benefit Estimation Through Hierarchical and Two-dimensional Feature Representation.
You can use conda to create a virtual environment and install requirements as follow:
$ conda create --name eddie python=3.9
$ conda activate eddie
$ pip install -r requirements.txt
In addition, you need to set up a PostgresSQL server (version 12.13 is preferred) along with the HypoPG extension (version 1.3.1 is preferred).
Cloning the Repository
This project includes several submodules, which are necessary for running experiments. After cloning the repository, make sure to initialize and update the submodules by running:
$ git submodule update --init --recursive
Eddie requires specific datasets for training and evaluation. You have two options to prepare the datasets:
You can download the preprocessed datasets directly from this data repository and extract them to the datasets/ directory in the project’s root directory.
$ unzip '*.zip' -d datasets/
The directory structure after unzipping will look like this:
├── datasets
├── tpcds__base_w_init_idx.pickle
├── tpcds__base_wo_init_idx.pickle
└── ...
- For Benchmark Databases (TPC-DS, TPC-H): Use the provided script to import TPC-DS and TPC-H databases into PostgreSQL:
$ python database_scripts/import_database.py \
--run_id import_db_tpcds \
--workload_name tpcds \
--scale_factor 10 \
--host localhost \
--port 5432 \
--user postgres \
--password <password> \
--connected_database_name postgres \
--explicit_database_name indexselection_tpcds___10Available parameters:
run_id(required): Unique identifier for this run. This will be used for logging and checkpoint naming.workload_name(required): Indicates the type of database to be imported, (i.e., tpcds or tpch).scale_factor(required): The size of the imported database, (e.g., 1 means 1G).host(required): The host address used for the connection.port(required): The port used for the connection.user(required): The username used for the database connection.password(required): The password used for the database connection.connected_database_name(optional): The database name used for the connection.explicit_database_name(optional): The name of the database to be imported, which is created at the time of import. If the database name already exists, it will not be imported again.(Automatically generated by default).
-
For IMDB Database: Refer to the instructions in the join-order-benchmark repository to set up the IMDB database.
-
For the pre-training augmented training set: Following the "Reproduce entire DBGen benchmark" section from zero-shot-cost-estimation. Start by downloading all files from the
zero-shot-datadirectory in the repository and place them in a localzero-shot-data/directory, which will be used in subsequent steps. The directory structure should look like this:
├── zero-shot-data
├── datasets
├── airline
└── ...
└── runs
├── parsed_plans
├── airline
└── ...
└── raw
├── airline
└── ...
└── ...
Then, follow the instructions in the referenced section to run scripts for downloading additional data, and scale and load the datasets into PostgreSQL.
Before collecting statistics, ensure you can run the ANALYZE command on the target database.
Then execute the following scripts to collect database statistics and histograms:
python database_scripts/collect_db_stats.py \
--host localhost \
--port 5432 \
--user postgres \
--password <password> \
--database_name indexselection_tpcds___10Available parameters:
host(required): The host address used for the connection.port(required): The port used for the connection.user(required): The username used for the database connection.password(required): The password used for the database connection.database_name(required): The database name used for the connection.
Execute the following script to generate all workloads used in the experiment, including benchmark workloads, synthetic workloads, and pre-training workloads:
python workload_gen/gen_workload.py \
--run_id workload_gen \
--host localhost \
--port 5432 \
--user postgres \
--password <password>Available parameters:
run_id(required): Unique identifier for this run. This will be used for logging and checkpoint naming.host(required): The host address used for the connection.port(required): The port used for the connection.user(required): The username used for the database connection.password(required): The password used for the database connection.
For synthetic workloads (e.g., TPC-DS+, TPC-H+), the script calls workload_gen/gen_synthetic_query.py to generate synthetic queries, which are saved in the synthetic_queries directory.
All workloads generated by the script are stored in the workload directory and will be available for use in subsequent steps. Each SQL in the workload is annotated by a query number (e.g., Q20) for easy reference.
Generate all datasets required for the experiment with the following script:
python datasets_gen/gen_datasets.py \
--run_id data_gen \
--host localhost \
--port 5432 \
--user postgres \
--password <password>Available parameters:
run_id(required): Unique identifier for this run. This will be used for logging and checkpoint naming.host(required): The host address used for the connection.port(required): The port used for the connection.user(required): The username used for the database connection.password(required): The password used for the database connection.
You can edit the datasets_gen/data_gen_config.json file to configure dataset generation. Note that this script may take a long time to execute. After completion, the resulting datasets will be stored in the datasets/ directory in the project’s root directory, matching the structure from Option 1:
├── datasets
├── tpcds__base_w_init_idx.pickle
├── tpcds__base_wo_init_idx.pickle
└── ...
Eddie can be trained and evaluated using two methods:
- Configuration-based Execution: Define and modify experiments in
exp_configs/exp_config.json, then runmain.py. - Command-line Execution: Use
run_eddie.pywith specified parameters for flexible experiment execution.
This method allows you to run experiments using predefined configuration files. Simply specify the configuration file, and the script will automatically load the necessary parameters.
- Edit Configuration:
The repository includes an example configuration at exp_configs/exp_config.json. Below is a simplified version with comments for reference:
{
"version": "v1", // Config version for tracking changes
"databases": [
{
"db_id": "tpcds_10", // Database identifier: [TPC-DS, SF=10]
"db_name": "indexselection_tpcds___10", // PostgreSQL database name
"db_stat_path": "./db_stats_data/indexselection_tpcds___10_stats.json",
"hist_file_path": "./db_stats_data/indexselection_tpcds___10_hist_file.csv"
},
{
"db_id": "tpcds_5", // [TPC-DS, SF=5]
"db_name": "indexselection_tpcds___5",
"db_stat_path": "./db_stats_data/indexselection_tpcds___5_stats.json",
"hist_file_path": "./db_stats_data/indexselection_tpcds___5_hist_file.csv"
}
// Add more databases as needed (e.g., tpch_10, tpch_5, imdb)
],
"workloads": [
{
"workload_name": "tpcds", // Identify different workloads
"db_id": "tpcds_10", // Corresponding database identifier
"experiments": [
{
"exp_id": "tpcds__base_wo_init_idx", // Experiment identifier
"dataset_path": "./datasets/tpcds__base_wo_init_idx.pickle",
"run_models": ["eddie", "lib", "queryformer", "postgresql"] // Models to run
},
{
"exp_id": "tpcds__base_w_init_idx",
"dataset_path": "./datasets/tpcds__base_w_init_idx.pickle",
"run_models": ["eddie", "lib", "queryformer", "postgresql"]
},
{
"exp_id": "tpcds__vary_query",
"parent_exp_id": "tpcds__base_w_init_idx", // Depends on parent experiment
"vary_dataset_path": "./datasets/tpcds__vary_query.pickle", // Varying dataset for testing
"run_models": ["eddie", "lib", "queryformer", "postgresql"]
},
{
"exp_id": "tpcds__vary_stat",
"parent_exp_id": "tpcds__base_w_init_idx",
"vary_db_id": "tpcds_5", // Database for varying statistics
"vary_dataset_path": "./datasets/tpcds__vary_stat.pickle",
"run_models": ["eddie", "lib", "queryformer", "postgresql"]
},
{
"exp_id": "tpcds__vary_schema",
"parent_exp_id": "tpcds__base_w_init_idx",
"vary_schema": true, // Vary schema in experiment
"run_models": ["eddie", "lib", "queryformer", "postgresql"]
},
{
"exp_id": "tpcds__perturb_idx",
"dataset_path": "./datasets/tpcds__perturb_idx.pickle",
"run_models": ["eddie", "lib", "queryformer", "postgresql"]
}
]
}
// Add more workloads and experiments as needed
]
}
- Run the Experiment:
Run the main script to train and evaluate Eddie with the specified configuration:
$ python main.py --config exp_configs/exp_config.json
After execution, you can view the final metrics in the log file named after the corresponding experiment, located in the ./logs directory.
Alternatively, you can run Eddie by specifying parameters directly via run_eddie.py script. Depending on the experiment type, different parameters or configurations may apply. Below are the details for running general experiments, pre-training experiments, and end-to-end experiments.
Run a standard experiment with K-fold cross-validation using run_eddie.py.
Usage example:
$ python run_eddie.py \
--run_id tpcds__base_w_init_idx__eddie_v1 \
--model_name eddie \
--dataset_path ./datasets/tpcds__base_w_init_idx.pickle \
--db_stat_path ./db_stats_data/indexselection_tpcds___10_stats.json \
--checkpoints_path ./checkpoints/tpcds__base_w_init_idx__eddie_v1Available parameters:
run_id(required): Unique identifier for this run. This will be used for logging and checkpoint naming.model_name(required): Name of the model to be used.dataset_path(required): Path to the dataset used for K-fold cross-validation.db_stat_path(required): Path to the database statistics file.checkpoints_path(required): Directory to save or load model checkpoints. If the specified checkpoint does not exist, the model will be trained during K-fold cross-validation.vary_dataset_path(optional): Path to a dataset for testing drift scenarios (e.g., changes in queries or data statistics).vary_db_stat_path(optional): Path to alternative database statistics, used when data statistics change in drift scenarios.vary_schema(optional): Whether to vary the schema of the dataset, for drift scenarios testing.
After execution, metrics and logs will be stored in the ./logs directory, with filenames prefixed by the specified run_id. Model checkpoints will be saved in the provided checkpoints_path.
Step 1: Dataset enhancements
This step is to enhance the end-to-end training dataset, using ./datasets_enhance/enhanced_datasets_gen.py to train the model for the end-to-end pipeline.
$ python datasets_enhance/enhanced_datasets_gen.py \
--database_name indexselection_tpcds___10 \
--host localhost \
--port 54321 \
--user postgres \
--password your_password
--enhanced_dataset_path ./datasets/tpcds__end2end.pickleAvailable parameters:
- enhanced_dataset_path (required): The path of the dataset that will be enhanced. (e.g., "./datasets/tpcds__end2end.pickle").
- database_name (required): Name of the PostgreSQL database (e.g., "indexselection_tpcds__10").
- host: The host address used for the connection.
- port: The port used for the connection.
- user: The username used for the connection.
- password: The password used for the connection.
The enhanced training data is generated for the next training.
Step 2: Train the Model
This step is identical to the General Experiment, using run_eddie.py to train the model for the end-to-end pipeline.
$ python run_eddie.py \
--run_id tpcds__end2end_eddie_v1 \
--model_name eddie \
--dataset_path ./datasets/tpcds__end2end.pickle \
--db_stat_path ./db_stats_data/indexselection_tpcds___10_stats.json \
--checkpoints_path ./checkpoints/tpcds__end2end__eddie_v1
--clip_label FalseThe generated checkpoints will be used in the next step.
Step 3: Run End-to-End Evaluation
After training, perform end-to-end evaluation with run_end2end.py. This script supports evaluating multiple models by specifying them in --run_models.
$ python end2end/run_end2end.py \
--run_models eddie \
--db_name indexselection_tpcds___10 \
--dataset_path ./datasets/tpcds__end2end.pickle \
--db_stat_path ./db_stats_data/indexselection_tpcds___10_stats.json \
--checkpoints_path ./checkpoints/tpcds__end2end__{model_name}_v1Available parameters:
- run_models (required): Comma-separated list of models to evaluate (e.g., "eddie,lib").
- db_name (required): Name of the PostgreSQL database (e.g., "indexselection_tpcds__10").
- dataset_path (required): Path to the dataset for evaluation.
- db_stat_path (required): Path to the database statistics file.
- checkpoints_path (required): Directory template to load model checkpoints; {model_name} is a placeholder replaced by each model name from --run_models (e.g., ./checkpoints/tpcds__end2end__eddie_v1 for "eddie").
After execution, metrics and logs for each model are saved in the ./logs directory.
Step 1: Customize Training Parameters
Before the experiment, you can customize the training parameters in the file pretrain_finetune/pretrain_finetuneconfig.py, and the default parameters are the ones used in the paper
Step 2: Pre-train the Model
$ python pretrain_finetune/run_pretrain_eddie.py \
--run_id pretrain_tpcds \
--model_name eddie \
--checkpoints_path ./checkpoints/pretrain_tpcds \
--workload_name tpcdsAvailable parameters:
run_id(required): Unique identifier for this run. This will be used for logging and checkpoint naming.model_name(required): Name of the model to be used.checkpoints_path(required): Directory template to load model checkpointsworkload_nameThis parameter is used to search for the specific experimental parameters of the configuration file in the Step 1
Step 3: Fine-tune the Model
$ python pretrain_finetune/run_finetune_eddie.py \
--run_id tpcds__finetune \
--model_name eddie \
--checkpoints_path ./checkpoints/finetune_tpcds \
--workload_name tpcds \
--pretrain_model_path ./checkpoints/pretrain_tpcds/tpcds__eddie.pth \
--finetune_data_rate 0.5Available parameters:
run_id(required): Unique identifier for this run. This will be used for logging and checkpoint naming.model_name(required): Name of the model to be used.checkpoints_path(required): Directory template to load model checkpoints.workload_nameThis parameter is used to search for the specific experimental parameters of the configuration file in the Step 1pretrain_model_path(optional): Used to find pre-trained model that have been saved. Parameters used for initializing fine-tuning modelfinetune_data_rate(optional): Fine-tune the target data ratio of the model, with a value range of 0-1