This repository contains the source code used for "TUNA: Tuning Unstable and Noisy Cloud Applications" (to appear in EuroSys'25). TUNA is a sampling methodology that uses a few key insights to improve the quality of configurations found during system autotuning. In particular, TUNA uses
- An outlier detector to prevent unstable configurations from being learned,
- A noise adjustor model to provide a more stable signal to an optimizer to learn from, and
- Cost-conscious multi-fidelity tuning to improve the rate of convergence.
This code can be used to run the main experimental results from our paper. Users interested in using these techniques can make small modifications to the scripts to integrate their own target systems, particularly in `nautilus` or in `proxy/evaluation_server.py`.
To pull the code, we use submodules for library dependencies. The following commands should pull all the necessary code:

```bash
git clone git@github.com:uw-mad-dash/TUNA.git
cd TUNA
git submodule update --init --recursive
```
This project has been tested on Ubuntu 20; however, we believe it should work on most versions of Linux. We use the following dependencies:

- `python3.11`
- `miniconda`
- `docker`
- `java 21`
- `fio`
- `stress-ng`
- `linux-tools`

Additional Python library dependencies are listed in `src/processing/requirements.txt`. Both the Python libraries and the system dependencies can be installed using scripts in `src/processing`. For the scripts to run properly, the `mlos` conda environment must be activated.
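For example, a minimal sketch of installing the Python dependencies by hand, assuming the `mlos` conda environment already exists (it is created by the MLOS makefile in the orchestrator setup below):

```bash
# Activate the tuning environment, then install the Python
# dependencies used by the processing scripts.
conda activate mlos
pip install -r src/processing/requirements.txt
```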
TUNA uses MLOS as its base tuning framework, implementing a custom scheduling and sampling policy on top of it and selecting tuners from those MLOS offers (we intend to incorporate these changes upstream in time). TUNA also uses Nautilus to manage and deploy the execution environment. The code here is a fork of a private library that will be released soon.
- `src`
  - `benchmarks`: Metric collection scripts
  - `client`: Orchestrator-side scheduling policies
  - `MLOS`: The MLOS library. We used a specific branch that is no longer available on git, so we provide a fork with an up-to-date version and minor updates to keep it compatible with updated versions of the libraries.
  - `nautilus`: Nautilus dependency. See paper.
  - `proto`: gRPC communication definition files
  - `proxy`: A worker-side proxy to forward incoming messages to nautilus
    - `executors`: Worker-side server to listen for incoming gRPC requests
    - `nautilus`: Stub files for proper linting
  - `processing`: Deployment and management scripts
  - `spaces`
    - `benchmark`: Workload definitions for nautilus
    - `dbms`: Database definitions for nautilus
    - `experiment`: Files defining how benchmarks, dbms, knobs, and params are combined
    - `knobs`: List of knob definitions
    - `params`: Machine characteristics
The environment for TUNA requires two different types of nodes: an orchestrator and a set of workers. These two node types have slightly different setup instructions, found below.
All scripts are found in `src/processing`. One script that may be useful is `add_hosts.sh <hosts> <port>`, which adds all of the hosts listed in the specified hosts file to the trusted hosts list. This is required for `pssh`, which we use to deploy our scripts. Note that the hosts file must not contain usernames when using this script.
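For example, a hypothetical invocation that trusts every host listed in `hosts.cloudlab` on the default ssh port:

```bash
# Add each host in hosts.cloudlab to the trusted hosts list so that
# pssh can connect without interactive prompts.
./add_hosts.sh hosts.cloudlab 22
```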
These scripts should work on any platform; however, we have only tested them on Azure and CloudLab. We provide experiment files for `c220g5` nodes on CloudLab and `D8s_v5` nodes on Azure. These instance types will become important later; users can also provide their own files in `src/spaces`.
If you are trying to replicate the work found in our paper, we recommend using 10 worker nodes and 1 orchestrator node, where the 10 worker nodes are the first 10 nodes that were created. This will allow you to use our provided hosts files (`hosts.azure` or `hosts.cloudlab`). Alternatively, a custom hosts file can be used.
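For illustration, a hosts file is a plain list of addresses, one per line and with no usernames (the addresses below are placeholders):

```
10.0.0.4
10.0.0.5
10.0.0.6
```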
To install and copy our files, there are two commands we need to run.

```bash
./worker_setup_remote.sh <hosts>
```

The first command installs all of the dependencies and sets up the environment.

```bash
./worker_deployment.sh <hosts> <node_type>
```

The second command starts all of the required processes. Note that the second command will report that some commands failed. This is expected, as they simply ensure that any previous instances are stopped and deleted before the initialization process begins.
At some point while this command is running, user interaction is required to confirm that the docker image building in the background has completed. There are two options here. First, you can connect to one of the workers, run `sudo tmux a -t install`, and make sure that the pane has completed all of its commands. Alternatively, you can wait around 20 minutes, which will most likely be long enough for the image to finish building.
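If you prefer not to attach interactively, a hedged alternative is to print the tail of the `install` session from the orchestrator (assuming passwordless ssh; `<worker-host>` is a placeholder for an address from your hosts file):

```bash
# Print the last lines of the 'install' tmux session on a worker to
# check whether the image build has finished.
ssh <worker-host> "sudo tmux capture-pane -p -t install | tail -n 5"
```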
Building the orchestrator requires slightly more interaction from the user.
```bash
bash orchestrator_deploy.sh <orchestrator_host> 22
```
First, as before, we set up the environment using a deployment script. This will, again, automate the file transfer and environment setup.

Next, connect to your orchestrator node and run the following commands. Note that the first command is `tmux`. We recommend using it because tuning runs are long-running and `ssh` disconnects are common without it; however, it is not strictly required.
```bash
tmux
make -C src/MLOS  # only the conda-env target is required
conda activate mlos
```
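If you did start inside `tmux`, you can detach with `Ctrl-b d` and later reattach without interrupting a running tuning session:

```bash
# Reattach to the most recently used tmux session.
tmux attach
```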
To test the functionality of TUNA, there is one main script that can be run on the orchestrator node:

```bash
python3 TUNA.py <experiment> <seed> <hosts>
```
An example of this with the parameters filled out the way we used it in the paper is as follows:

```bash
python3 TUNA.py spaces/experiment/pg16.1-tpcc-8c32m.json 1 hosts.cloudlab
```
Running this command will start a tuning run lasting around 8 hours. The results will be written to the `results` folder in `.csv` and `.pickle` formats. These can then be rerun using `mass_reruns_v2.py`; however, this is not required to get tuning results.
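As a quick sanity check once a run finishes, you can inspect the output from the orchestrator (a sketch; the exact file names depend on the experiment and seed):

```bash
# List the generated result files and peek at the first rows of the
# CSV output.
ls results/
head -n 5 results/*.csv
```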
Here we provide a brief description of each tuning script. All of the following scripts are run with the general pattern `python3 <script> <experiment> <seed> <hosts>`, with the exception of `parallel_prior.py`, which takes an additional noise argument at the end. An example is included in the next section.
- `mass_reruns_v2.py`: A script to rerun selected configurations for the "inference" step as described in the paper.
- `naive_measured.py`: A script to run our naive distributed tuning setup, where every configuration is run on each node. This is the script we used in our ablation study.
- `parallel_gp.py`: A script to run traditional sampling with a Gaussian process model.
- `parallel_prior.py`: A script to run traditional sampling with injected Gaussian noise determined by the noise level. This script is run as follows: `python3 parallel_prior.py <experiment> <seed> <hosts> <noise>`
- `parallel.py`: A script to run the default traditional sampling baseline as described in the paper.
- `TUNA_gp.py`: A script to run TUNA with the optimizer switched out for a Gaussian process model.
- `TUNA_no_model.py`: The full TUNA sampling methodology without the noise adjustor model.
- `TUNA_no_outlier.py`: The full TUNA sampling methodology without the outlier detection.
- `TUNA.py`: The normal TUNA sampling script that is used throughout the paper.
| Figure | Description | Estimated Azure Cost | Tuning Commands |
|---|---|---|---|
| Figure 3 | Run parallel tuning runs with a Gaussian prior. This is used to compare the rate of convergence in a noisy environment. | N/A | `python3 parallel_prior.py spaces/experiment/pg16.1-epinions-c220g5.json <seed> <hosts> <noise>` |
| Figure 9a | Run 10 parallel tuning runs and 10 runs using TUNA targeting the TPC-C workload on Postgres 16.1 running on Azure. Take the data generated from this output and rerun it on a set of 10 new worker nodes. | $320 | `python3 parallel.py spaces/experiment/pg16.1-tpcc-8c32m.json <seed> <hosts>`<br>`python3 TUNA.py spaces/experiment/pg16.1-tpcc-8c32m.json <seed> <hosts>`<br>`python3 mass_reruns_v2.py` |
| Figure 9b | Run 10 parallel tuning runs and 10 runs using TUNA targeting the epinions workload on Postgres 16.1 running on Azure. Take the data generated from this output and rerun it on a set of 10 new worker nodes. | $320 | `python3 parallel.py spaces/experiment/pg16.1-epinions-8c32m.json <seed> <hosts>`<br>`python3 TUNA.py spaces/experiment/pg16.1-epinions-8c32m.json <seed> <hosts>`<br>`python3 mass_reruns_v2.py` |
| Figure 9c | Run 10 parallel tuning runs and 10 runs using TUNA targeting the TPC-H workload on Postgres 16.1 running on Azure. Take the data generated from this output and rerun it on a set of 10 new worker nodes. | $320 | `python3 parallel.py spaces/experiment/pg16.1-tpch-8c32m.json <seed> <hosts>`<br>`python3 TUNA.py spaces/experiment/pg16.1-tpch-8c32m.json <seed> <hosts>`<br>`python3 mass_reruns_v2.py` |
| Figure 10 | Identical to Figure 9a, but running in a different region. | $320 | `python3 parallel.py spaces/experiment/pg16.1-tpcc-8c32m.json <seed> <hosts>`<br>`python3 TUNA.py spaces/experiment/pg16.1-tpcc-8c32m.json <seed> <hosts>`<br>`python3 mass_reruns_v2.py` |
| Figure 11 | Identical to Figure 9a, but running on CloudLab `c220g5` nodes rather than on Azure. | N/A | `python3 parallel.py spaces/experiment/pg16.1-tpcc-c220g5.json <seed> <hosts>`<br>`python3 TUNA.py spaces/experiment/pg16.1-tpcc-c220g5.json <seed> <hosts>`<br>`python3 mass_reruns_v2.py` |
| Figure 12 | Run 10 parallel tuning runs and 10 runs using TUNA targeting the YCSB-C workload on Redis running on Azure. Take the data generated from this output and rerun it on a set of 10 new worker nodes. | $320 | `python3 parallel.py spaces/experiment/redis7.2-ycsb-8c32m-latency.json <seed> <hosts>`<br>`python3 TUNA.py spaces/experiment/redis7.2-ycsb-8c32m-latency.json <seed> <hosts>`<br>`python3 mass_reruns_v2.py` |
| Figure 14 | Identical to Figure 9a, but using a Gaussian process optimizer rather than a random forest optimizer (SMAC). | $320 | `python3 parallel_gp.py spaces/experiment/pg16.1-tpcc-8c32m.json <seed> <hosts>`<br>`python3 TUNA_gp.py spaces/experiment/pg16.1-tpcc-8c32m.json <seed> <hosts>`<br>`python3 mass_reruns_v2.py` |
| Figure 16 | Identical to Figure 9a, with the outlier detector ablated. | $1400 | `python3 TUNA.py spaces/experiment/pg16.1-epinions-8c32m.json <seed> <hosts>`<br>`python3 TUNA_no_model.py spaces/experiment/pg16.1-epinions-8c32m.json <seed> <hosts>` |
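Each figure aggregates runs across multiple seeds. As a hedged illustration, the Figure 9a commands could be driven as follows (the seed range and hosts file are assumptions; adjust them to your deployment):

```bash
# Sweep seeds 1-10 for the parallel baseline and for TUNA
# (Figure 9a); each run takes roughly 8 hours.
for seed in $(seq 1 10); do
  python3 parallel.py spaces/experiment/pg16.1-tpcc-8c32m.json "$seed" hosts.azure
  python3 TUNA.py spaces/experiment/pg16.1-tpcc-8c32m.json "$seed" hosts.azure
done
```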
- https://aka.ms/mlos/tuna-eurosys-dataset - The VM noise dataset used to inform the design of TUNA.