# GRACEFUL: A Learned Cost Estimator For UDFs

Source code of our ICE '25 paper "GRACEFUL: A Learned Cost Estimator For UDFs".

Please cite our paper if you find this work useful or use it in your own research:

TODO
- Main Modules
- Prerequisites
- Reproducing the Experiments
- Flat Vector Baseline
- Add custom databases or workloads
## Main Modules

| Module | Description |
|---|---|
| cross_db_benchmark | All code required to reproduce the DBGen benchmark (workload generator, executor, parsing) |
| models | GRACEFUL model code |
| prepare_db | All code required to extract DB statistics, clean the DB, ... |
| setup | Model hyperparameters and commands to execute the DBGen benchmark |
| udf_generator | Code to generate UDFs |
| udf_graph | Code to generate and annotate UDF graphs |
| utils | Utility functions used throughout the codebase |
## Prerequisites

To run the commands in this repository, please install the required libraries defined in the `requirements.txt` file. Please also make sure that both `graphviz` and `graphviz-dev` are available on your system, e.g. installed via:

```shell
sudo apt install graphviz graphviz-dev
```
## Reproducing the Experiments

You can download the following artifacts to fast-forward the setup process, all from here. This includes:

- Collected query traces: preprocessed and annotated query traces of the executed workloads (ready to directly train the model)
- UDF benchmark: UDF workload files, including the UDF code as well as the surrounding SQL queries, for the 20 datasets
- GRACEFUL artifacts: artifacts of the GRACEFUL model used for UDF cost estimation
- DuckDB database files (.db) for the 20 datasets we use
- DeepDB artifacts: artifacts of the DeepDB model used for cardinality estimation

After downloading, extract the zip files using the script `python workload_runs/extract_zips.py` (shipped with the data).
The following evaluation commands use the pre-trained models shipped with the data:
```shell
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/genome/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/genome/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_genome --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_genome_stratDBs_ep60_maxr30_20240503_115437_074 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset genome
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/airline/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/airline/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_airline --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_airline_stratDBs_ep60_maxr30_20240503_020205_034 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset airline
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/financial/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/financial/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_financial --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_financial_stratDBs_ep60_maxr30_20240501_093145_013 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset financial
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/movielens/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/movielens/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_movielens --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_movielens_stratDBs_ep60_maxr30_20240501_093101_065 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset movielens
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/fhnk/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/fhnk/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_fhnk --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_fhnk_stratDBs_ep60_maxr30_20240501_093224_088 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset fhnk
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/employee/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/employee/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_employee --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_employee_stratDBs_ep60_maxr30_20240501_093236_083 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset employee
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/ssb/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/ssb/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_ssb --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_ssb_stratDBs_ep60_maxr30_20240502_141949_072 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset ssb
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/baseball/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/baseball/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_baseball --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_baseball_stratDBs_ep60_maxr30_20240503_101832_095 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset baseball
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/walmart/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/walmart/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_walmart --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_walmart_stratDBs_ep60_maxr30_20240502_134902_066 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset walmart
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/credit/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/credit/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_credit --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_credit_stratDBs_ep60_maxr30_20240501_093221_046 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset credit
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/consumer/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/consumer/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_consumer --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_consumer_stratDBs_ep60_maxr30_20240502_160244_008 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset consumer
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/tournament/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/tournament/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_tournament --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_tournament_stratDBs_ep60_maxr30_20240502_141825_069 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset tournament
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/carcinogenesis/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/carcinogenesis/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_carcinogenesis --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_carcinogenesis_stratDBs_ep60_maxr30_20240502_160202_085 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset carcinogenesis
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/seznam/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/seznam/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_seznam --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_seznam_stratDBs_ep60_maxr30_20240502_160216_072 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset seznam
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/geneea/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/geneea/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_geneea --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_geneea_stratDBs_ep60_maxr30_20240501_093236_018 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset geneea
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/accidents/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/accidents/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_accidents --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_accidents_stratDBs_ep60_maxr30_20240502_160102_051 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset accidents
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/imdb/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/imdb/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_imdb --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_imdb_stratDBs_ep60_maxr30_20240502_142026_057 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset imdb
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/hepatitis/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/hepatitis/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_hepatitis --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_hepatitis_stratDBs_ep60_maxr30_20240502_141851_092 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset hepatitis
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/basketball/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/basketball/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_basketball --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_basketball_stratDBs_ep60_maxr30_20240503_015928_050 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset basketball
python evaluate_pull_up_predictor.py --pullup_plans_path ../Graceful_data/workload_runs/duckdb_pullup/parsed_plans/tpc_h/workload.json --pushdown_plans_path ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/tpc_h/workload.json --model_dir ../Graceful_data/graceful_artifacts/leave_out_tpc_h --model_name act_minrt50ms_complex_dd_wpullupdata_wnoUDFdata_ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge_tpc_h_stratDBs_ep60_maxr30_20240501_093049_070 --device cuda:0 --statistics_file ../Graceful_data/workload_runs/duckdb_pushdown/parsed_plans/statistics_workload_combined.json --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge --dataset tpc_h
```
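The commands above follow a fixed per-dataset template. A small sketch that builds such a command programmatically (the checkpoint names embed training timestamps, so `model_name` must still be looked up in the respective `leave_out_<dataset>` folder under `graceful_artifacts/`):

```python
# Sketch: build the per-dataset evaluation command lines shown above.
DATASETS = [
    "genome", "airline", "financial", "movielens", "fhnk", "employee",
    "ssb", "baseball", "walmart", "credit", "consumer", "tournament",
    "carcinogenesis", "seznam", "geneea", "accidents", "imdb",
    "hepatitis", "basketball", "tpc_h",
]

def eval_command(dataset: str, model_name: str) -> str:
    """Return the evaluate_pull_up_predictor.py invocation for one dataset."""
    runs = "../Graceful_data/workload_runs"
    return (
        "python evaluate_pull_up_predictor.py"
        f" --pullup_plans_path {runs}/duckdb_pullup/parsed_plans/{dataset}/workload.json"
        f" --pushdown_plans_path {runs}/duckdb_pushdown/parsed_plans/{dataset}/workload.json"
        f" --model_dir ../Graceful_data/graceful_artifacts/leave_out_{dataset}"
        f" --model_name {model_name}"
        " --device cuda:0"
        f" --statistics_file {runs}/duckdb_pushdown/parsed_plans/statistics_workload_combined.json"
        " --model_config ddactfonudf_liboh_gradnorm_mldupl_loopend_loopedge"
        f" --dataset {dataset}"
    )
```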
To train the model, please generate the experiment commands using the `gen_train_commands.py` helper file. This includes the model configurations for the ablation study and the leave-one-out experiments.

```shell
python gen_train_commands.py --artifacts_path ../Graceful_data/graceful_artifacts/ --wl_base_path ../Graceful_data/workload_runs/
```
To log the training process to Weights & Biases, please log in to wandb using `wandb login` and set the `--register_run_at_wandb` flag in the training command. Please specify the username (entity) and project name in the `--wandb_entity` and `--wandb_project` arguments. An example run can be seen here: https://wandb.ai/jwehrstein/GRACEFUL
## Add custom databases or workloads

We provide pre-created .duckdb files. These databases come already prepared with the necessary data and are ready to use. If you want to use custom datasets, please execute the following preprocessing steps for your DBMS:

1. Extract meta stats from the DBs. This includes information about the tables and columns in the DBs.

```shell
python udf_generator/create_db_statistics.py --dbms duckdb --dbms_kwargs dir=../Graceful_data/datasets/ --col_stats_dir ../Graceful_data/datasets/statistics/ --target ../Graceful_data/datasets/DB_metadata.csv
```
2. Clean the DB. This applies data cleaning and preprocessing steps that make the data fit the UDFs that will be generated, which reduces the data collection effort since fewer errors appear during query execution. This step is optional. It is done by running the following command:

```shell
python3 clean_DBs.py --work_dir ../Graceful_data/datasets/ --dbms_kwargs dir=../Graceful_data/datasets/
```
3. Regenerate the meta stats (rerun step 1) to reflect the updates made during the cleaning process.
4. Extract column and string statistics:

```shell
python3 gen_dataset_stats.py --data_dir /path/to/your/csv/files/ --work_dir ../Graceful_data/datasets/ --dbs zs_less_scaled --dbms duckdb --dbms_kwargs dir=/duckdb_dir/
```
Next, we need to set up an experiment folder. This folder will be used to store the UDFs, the graphs, and the feature statistics. The following command will create the folder structure:

```shell
python3 udf_generator/generate_udf_wl.py --setup True --experiment_dir ~/experiments --exp_name test_exp
```
In this step, we generate the SQL queries that will be used to create the UDFs. This is done by running the following command:

```shell
python3 dbgensetup.py --generate_workloads --workload_dir ../Graceful_data/workload_runs/my_run/dbs --col_stats_dir ../Graceful_data/datasets/statistics/
```
Next, we need to generate the UDFs. This is done by running the following command:

```shell
python3 generate_udf_wl.py --exp_name test_exp --no_funcs 1000 --exact_tree {True|False}
```

In the example above, we create 1000 queries with a different randomly generated UDF for each query. The parameter `--exact_tree` indicates whether we want to create UDFs with an exact tree structure consisting of all components (COMP, LOOP, IF) (`True`) or with a tree structure that might be a subset of this set (`False`).
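The distinction can be sketched as follows (illustrative only; this is not the generator's actual sampling logic, and we assume the subset case is non-empty):

```python
import random

COMPONENTS = ["COMP", "LOOP", "IF"]

def choose_components(exact_tree: bool):
    """Pick the structural components of one generated UDF."""
    if exact_tree:
        # --exact_tree True: every UDF contains all three components
        return list(COMPONENTS)
    # --exact_tree False: any non-empty subset of the components
    k = random.randint(1, len(COMPONENTS))
    return random.sample(COMPONENTS, k)
```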
If we want to create UDFs for already created SPJA queries, run the following command:

```shell
python3 generate_udf_wl.py --metafile_name udf_stats.json --exp_name test_exp --no_funcs 1000 --exact_tree {True|False}
```

Here, we leverage information about the UDFs that we are about to create; this information is stored in `udf_stats.json` and includes, e.g., the columns that are used in the UDFs.
Once we have obtained/created the UDFs and the corresponding queries, we need to run these workloads to gather the training data.

Execute queries:

```shell
python3 run_benchmark.py --run_workload --source ../Graceful_data/workload_runs/my_run/dbs/[db_name]/sql_scripts/[source_file_name].sql --target ../Graceful_data/workload_runs/my_run/raw/[db_name]/workload.json --database duckdb --db_name [db_name] --databse_conn dir=/duckdb_dir/ --run_kwargs hardware=my_fancy_machine --min_query_ms 100 --query_timeout 30 --repetitions_per_query 1 --udf_load_path ../Graceful_data/workload_runs/my_run/dbs/[db_name]/sql_scripts/udfs.sql --udf_drop_path ../Graceful_data/workload_runs/my_run/dbs/[db_name]/sql_scripts/dropUDFs.sql --cap_workload 5000
```
If the SQL queries contain pull-up queries, please set the `--pullup_udf` flag. This enforces that DuckDB does not apply filter push-down rules on the UDF.
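The cost impact of this choice is exactly what the predictor learns: with push-down, filters run before the UDF, so the UDF is invoked on far fewer rows. A toy Python illustration of the call-count effect (not DuckDB's optimizer, just the intuition):

```python
rows = list(range(100))

calls = {"n": 0}

def udf(r):
    calls["n"] += 1          # count UDF invocations
    return r * 2

def selective_filter(r):
    return r < 10            # keeps 10 of the 100 rows

# Push-down: the filter runs first, the UDF only sees surviving rows.
calls["n"] = 0
pushdown = [udf(r) for r in rows if selective_filter(r)]
pushdown_calls = calls["n"]

# Pull-up: the UDF runs on every row, the filter is applied afterwards.
calls["n"] = 0
pullup = [v for r, v in ((r, udf(r)) for r in rows) if selective_filter(r)]
pullup_calls = calls["n"]
```

Both plans return the same result, but the pull-up variant pays for ten times as many UDF calls here.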
Parse queries:

```shell
python3 parse_plans.py --raw_dir ../Graceful_data/workload_runs/my_run/raw/ --parsed_plan_dir ../Graceful_data/workload_runs/my_run/parsed_plans/ --workloads workload.json --out_names workload.json --databse duckdb --duckdb_dir ../Graceful_data/datasets/ --min_query_ms 100 --udf_code_location ../Graceful_data/workload_runs/my_run/dbs/ --cap_queries 5000 --dataset [db_name] --deepdb_artifacts_dir ../Graceful_data/deepdb_artifacts
```
This will also automatically annotate cardinality estimates from DeepDB and WanderJoin. To deactivate this annotation step, add `--skip_wj` and `--skip_deepdb` to the command.
To prepare model training, we need to parse the UDF code and create the graphs. This is done by running the following command:

```shell
python3 create_udf_graph.py --exp_folder ../Graceful_data/workload_runs/duckdb_pushdown --dataset [db_name] --dbms_kwargs dir=../Graceful_data/datasets/ --duckdb_dir ../Graceful_data/datasets/ --deepdb_dir ../Graceful_data/deepdb_artifacts --add_loop_end_node
```
This command will create a graph for each UDF that was part of a successfully executed query. The graphs are stored in the experiment folder that was created in a previous step; when training/evaluating the model, the graphs of the UDFs are loaded from this location. The parameter `--add_loop_end_node` is optional and adds a node to the graph that represents the end of a loop. The filenames of the graphs mark the UDFs that were used to create them and whether the graph was created with or without the `--add_loop_end_node` parameter.
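Illustratively, the flag changes the loop sub-graph like this (node names are made up for this sketch and are not GRACEFUL's actual graph schema):

```python
def loop_subgraph(add_loop_end_node: bool):
    """Edges of a toy loop sub-graph, as (source, target) pairs."""
    edges = [("LOOP_HEAD", "BODY_STMT")]
    if add_loop_end_node:
        # extra node marking where the loop body ends
        edges += [("BODY_STMT", "LOOP_END"), ("LOOP_END", "LOOP_HEAD")]
    else:
        edges += [("BODY_STMT", "LOOP_HEAD")]  # back-edge goes straight up
    return edges
```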
Next, we need to gather feature statistics from the graphs that were just created. The following command will start the feature-gathering process and store the feature statistics as a JSON file. This file is then accessed when creating a file that includes all feature statistics (UDF and non-UDF).

```shell
python3 create_udf_graph.py --gather True --exp_folder ../Graceful_data/workload_runs/my_run/
```
Next, we need to gather feature statistics from the workload that was executed. The following command will start the feature-gathering process and automatically includes the UDF stats (it assumes the file has been created in the previous step).

```shell
python3 gather_feature_statistics.py --workload_runs workload.json --raw_dir ../Graceful_data/workload_runs/my_run/parsed_plans --target
```
## Flat Vector Baseline

Train and evaluate the flat vector baseline:

```shell
python3 flat_vector_baseline.py --exp_dir ../Graceful_data/workload_runs/duckdb_scan_only/ --test_against imdb --model_type xgboost
```
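For intuition, a flat-vector baseline encodes each plan as a fixed-length vector (e.g. operator counts plus scalar features such as a cardinality estimate) and fits a regressor such as XGBoost on it. A minimal sketch of the featurization idea (the operator names and features here are illustrative, not the script's actual feature set):

```python
# Hypothetical operator vocabulary for the flat encoding.
OPERATORS = ["SEQ_SCAN", "FILTER", "HASH_JOIN", "PROJECTION", "UDF_CALL"]

def featurize(plan_ops, est_cardinality):
    """Flatten a plan into [count per operator type ..., cardinality]."""
    vec = [plan_ops.count(op) for op in OPERATORS]
    vec.append(float(est_cardinality))
    return vec
```

Because the vector length is fixed regardless of plan shape, this baseline cannot model the UDF's internal control flow, which is the gap the graph-based GRACEFUL model addresses.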