Effective caching for linear query workloads over DP databases. Turbo builds upon private multiplicative weights (PMW), a DP mechanism that is powerful in theory but largely ineffective in practice, and transforms it into a highly effective caching object, namely PMW-Bypass. A description of this project can be found in our paper, titled "Turbo: Effective Caching in Differentially-Private Databases".
- `data`: scripts for generating the datasets, queries, and workloads for the Covid and Citibike datasets.
- `experiments`: scripts that automate the execution of Turbo experiments concurrently using Ray. You can extend `runner.cli.caching.py` with your own configuration to generate your own experiments.
- `notebooks`: notebooks and utils for visualizing and analyzing execution logs.
- `packaging`: scripts for building Turbo.
- `turbo`: Turbo's core functionality. Refer here for Turbo core's structure.
We package Turbo using the following Docker images:
- `turbo` includes:
  - the Turbo system
  - the datasets, queries, and workloads used in the evaluation for the Covid and Citibike datasets
  - scripts for reproducing all experiments
- `timescaledb` includes an instance of Postgres running the TimescaleDB extension. This is the DBMS used for storing data and running queries.
- `redis-cache` includes an instance of RedisAI for storing differentially-private results and histograms (represented as tensors).
- `redis-budgets` includes an instance of Redis for budget accounting.
Make sure you have a working installation of docker.
Clone this repository on your machine:
```bash
git clone https://github.com/columbia/turbo-artifact.git
```
Enter the repository:
```bash
cd turbo-artifact
```
Build the docker image for Turbo. This will automatically install all dependencies required for the Turbo system, as well as the datasets, queries, and workloads used in the evaluation section of the paper. This step takes about 20 minutes to finish due to the processing and generation of the datasets.
```bash
sudo docker build -t turbo -f Dockerfile .
```
Deploy Postgres as well as two Redis instances. Postgres is used for storing the datasets and executing queries. The first Redis instance (RedisAI) is used for caching the differentially-private results or histograms (stored as tensors using RedisAI). The second Redis instance is used for budget accounting. Check the default ports for the three instances in `docker-compose.yaml` and change the host ports if they are already in use by other services.
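If you are unsure whether a host port is free, a quick check like the sketch below can help. The port numbers shown are common defaults for Postgres and Redis, not necessarily the values in your `docker-compose.yaml`, so adjust the pattern accordingly:

```bash
# Check whether the candidate host ports are already bound.
# 5432 (Postgres) and 6379/6380 (Redis) are assumed defaults; confirm
# the actual values in docker-compose.yaml before relying on this.
sudo ss -ltn | grep -E ':(5432|6379|6380)\s' || echo "ports appear free"
```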
```bash
sudo docker compose up -d
```
We prototype Turbo and perform a microbenchmark evaluation using a simulator, which simulates the execution of Turbo by orchestrating the arrival of new queries and data into the system. You can control the simulation and create your own simulation settings by editing the configuration files. The `experiments/runner.cli.caching.py` script automates the execution of multiple experiments concurrently using Ray. The configuration for each experiment is hardcoded inside the script.
The script `experiments/run_all.sh` contains the complete list of commands that generate the experiments presented in the paper.
Run `sudo docker images` and verify that there are four images: `turbo`, `timescaledb`, `redis`, and `redisai`.
Set up TimescaleDB by creating the databases and hypertables. Use the following command:
```bash
sudo docker exec -i timescaledb psql -U postgres < packaging/timescaledb.sql
```
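To confirm that the setup script ran, you can list the databases inside the container. This is a generic psql check; the exact database names to expect depend on `packaging/timescaledb.sql`:

```bash
# List all databases in the TimescaleDB container; you should see the
# evaluation databases (e.g., for Covid and Citibike) created by
# packaging/timescaledb.sql.
sudo docker exec -it timescaledb psql -U postgres -c '\l'
```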
If you have changed the default addresses of TimescaleDB, Redis, and RedisAI in the previous steps, then also update the addresses inside the configuration files `turbo/config/turbo_system_eval_monoblock_covid.json` and `turbo/config/turbo_system_eval_monoblock_citibike.json`. These files are used by the experiments that evaluate the system's runtime performance.
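To locate the relevant fields, a grep over the two configs can help. The key names below (`host`, `port`, `address`) are an assumption, so inspect the JSON to find the exact fields used:

```bash
# Find connection settings in the runtime-evaluation configs.
# The key names are assumed; adjust the pattern to match the actual schema.
grep -nE '"(host|port|address)"' \
  turbo/config/turbo_system_eval_monoblock_covid.json \
  turbo/config/turbo_system_eval_monoblock_citibike.json
```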
Now, reproduce all Turbo experiments by running the `turbo` docker image with the following command:
```bash
sudo docker run -v $PWD/logs:/turbo/logs -v $PWD/turbo/config:/turbo/turbo/config -v $PWD/temp:/tmp --network=host --name turbo --shm-size=204.89gb --rm turbo experiments/run_all.sh
```
This step takes around 12 hours to finish.
Make sure that the turbo container has enough disk space at `/tmp`, which Ray uses to store checkpoints and other internal data. If that's not the case, mount a host directory that has enough space at `/tmp` using an additional `-v` flag followed by the directory of your choice, for example: `-v $PWD/temp:/tmp`.
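Before launching, you can sanity-check the free space on the host directory you plan to mount, for example:

```bash
# Create the host directory (if missing) and report its free space;
# $PWD/temp matches the -v flag in the command above.
mkdir -p $PWD/temp && df -h $PWD/temp
```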
With the `-v` flags we mount the `turbo/logs` and `turbo/config` directories from the host into the container, so that the logs remain accessible on the host even after the container stops, and so that the container can access user-defined configurations stored on the host.
Alternatively, you can run the experiments from `experiments/run_all.sh` individually. Enter the Turbo container:
```bash
sudo docker run -v $PWD/logs:/turbo/logs -v $PWD/turbo/config:/turbo/turbo/config -v $PWD/temp:/tmp --network=host --name turbo --shm-size=204.89gb -it turbo
```
and simply type the python command from `experiments/run_all.sh` that corresponds to the experiment you want to reproduce.
For example, the following command runs the experiment for Figure 8.a:
```bash
python experiments/runner.cli.caching.py --exp caching_monoblock_heuristics --dataset covid19
```
If you want to bypass `runner.cli.caching.py`, you can directly call the simulation entrypoint with your configuration file as an argument. For example, the following command runs the experiment for Figure 10.d:
```bash
python run_simulation.py --omegaconf turbo/config/turbo_system_eval_monoblock_covid.json
```
Make sure that you clean the state of Postgres/Redis after each experiment, before you start a new one. You can do it manually or simply use the `packaging/storage_utils.py` helper functions:
```bash
python packaging/storage_utils.py --omegaconf turbo/config/turbo_system_eval_monoblock_covid.json --storage "*" --function delete-all --database covid
```
Set the `--database` argument to `covid` or `citibike`, depending on which dataset you used for the experiment.
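If you want to reset the state for both datasets in one go, a small loop like the following works, assuming the two configuration files named earlier in this README:

```bash
# Reset Postgres/Redis state for both datasets (sketch; assumes the
# monoblock config files shipped with the repo).
for db in covid citibike; do
  python packaging/storage_utils.py \
    --omegaconf turbo/config/turbo_system_eval_monoblock_${db}.json \
    --storage "*" --function delete-all --database ${db}
done
```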
Note that for the system's performance evaluation we bypass `runner.cli.caching.py`, since we do not wish to parallelize those experiments. `run_simulation.py` is the official entrypoint for running microbenchmarks under controlled simulation settings.
The `experiments/runner.cli.caching.py` script will automatically analyze the execution logs and create plots corresponding to the figures presented in the paper.
Note that the terminology used in the labels might differ slightly from that used in the paper. Typically, you will see Turbo referred to as "Hybrid", an alternative name we use internally to describe Turbo.
Check the `logs/figures` directory for all the outputs. Note that for figure_10 (runtime performance) we output the results in text format (`figure_10.txt`) instead of a barplot.
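To inspect the outputs from the host, something like:

```bash
# List the generated figures and print the runtime-performance results,
# which are emitted as text rather than a plot.
ls logs/figures
cat logs/figures/figure_10.txt
```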