Squall Local Configs

avitorovic edited this page May 10, 2012 · 11 revisions

We explain the contents of a config file using INSTALL_DIR/dip/SQLtoQueryPlanPlugin/confs/0.1G_hyracks_serial as an example:

DIP_DISTRIBUTED false
DIP_QUERY_NAME hyracks

DIP_TOPOLOGY_NAME_PREFIX teamX
DIP_DATA_ROOT /path/to/tpch/on/local/machine/
DIP_SQL_ROOT ../dip/SQLtoQueryPlanPlugin/SQLqueries/

# DIP_DB_SIZE is in GBs
DIP_DB_SIZE 0.1
DIP_MAX_SRC_PAR 1

# below are unlikely to change
DIP_EXTENSION .tbl
DIP_READ_SPLIT_DELIMITER \|
DIP_GLOBAL_ADD_DELIMITER |
DIP_GLOBAL_SPLIT_DELIMITER \|

DIP_ACK_EVERY_TUPLE true
DIP_KILL_AT_THE_END true

To distinguish Squall parameters from Storm parameters, every Squall parameter carries the prefix DIP, which is short for Distributed Incremental Processing. DIP_DISTRIBUTED must be false to execute the query plan in Local mode. DIP_QUERY_NAME must correspond to a query in INSTALL_DIR/dip/SQLtoQueryPlanPlugin/SQLqueries/. In this case, DIP_QUERY_NAME = hyracks corresponds to the SQL query in INSTALL_DIR/dip/SQLtoQueryPlanPlugin/SQLqueries/hyracks.sql. The topology name is built by concatenating DIP_TOPOLOGY_NAME_PREFIX and DIP_TOPOLOGY_NAME. DIP_TOPOLOGY_NAME_PREFIX exists only to distinguish different users, so it can remain empty.
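The two naming rules above can be sketched in a few lines of Python. This is only an illustration, not Squall's own code: the function names are hypothetical, the value used for DIP_TOPOLOGY_NAME is invented (the sample config does not set it), and the plain concatenation without a separator is taken literally from the description.

```python
import posixpath


def sql_query_path(sql_root: str, query_name: str) -> str:
    """Locate the SQL file for DIP_QUERY_NAME under DIP_SQL_ROOT.

    Assumption: the file is named <DIP_QUERY_NAME>.sql, as in the
    hyracks -> hyracks.sql example from the text.
    """
    return posixpath.join(sql_root, query_name + ".sql")


def topology_name(prefix: str, name: str) -> str:
    """Concatenate DIP_TOPOLOGY_NAME_PREFIX and DIP_TOPOLOGY_NAME.

    Assumption: no separator is inserted between the two parts.
    """
    return prefix + name


if __name__ == "__main__":
    print(sql_query_path("../dip/SQLtoQueryPlanPlugin/SQLqueries/", "hyracks"))
    # "demo" is a hypothetical DIP_TOPOLOGY_NAME value.
    print(topology_name("teamX", "demo"))
    # An empty prefix leaves the topology name unchanged.
    print(topology_name("", "demo"))
```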

The database path is built by concatenating DIP_DATA_ROOT, DIP_DB_SIZE and the string G; with the config above this gives /path/to/tpch/on/local/machine/0.1G. DIP_DB_SIZE is kept as a separate parameter because our optimizer uses it when allocating parallelism for Storm components. The only way you can control parallelism is via DIP_MAX_SRC_PAR: for small relations (fewer than 100 tuples) the source parallelism is 1, and for all others it is set to DIP_MAX_SRC_PAR. The parallelism of Bolts is set automatically, taking into account the position of a component in the query plan, so that no component becomes a bottleneck while as few nodes as possible are used.
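As a rough sketch of the two rules in this paragraph, the database path and the source parallelism could be computed as follows. The function names are hypothetical and this is not the optimizer's actual code; in particular, how Squall counts the tuples of a relation is not described here, so the tuple count is simply passed in.

```python
SMALL_RELATION_TUPLES = 100  # threshold stated in the text


def database_path(data_root: str, db_size: str) -> str:
    """Concatenate DIP_DATA_ROOT, DIP_DB_SIZE and the string "G".

    db_size is kept as a string so that "0.1" appears in the path
    exactly as written in the config file.
    """
    return data_root + db_size + "G"


def source_parallelism(tuple_count: int, max_src_par: int) -> int:
    """Parallelism of a source: 1 for small relations, else DIP_MAX_SRC_PAR."""
    if tuple_count < SMALL_RELATION_TUPLES:
        return 1
    return max_src_par


if __name__ == "__main__":
    print(database_path("/path/to/tpch/on/local/machine/", "0.1"))
    print(source_parallelism(25, 4))       # small relation
    print(source_parallelism(150_000, 4))  # larger relation
```

Because the path is a plain concatenation, DIP_DATA_ROOT presumably needs its trailing slash, as in the sample config.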

Due to main memory constraints, you cannot run an arbitrarily large database with small component parallelism. For information on detecting this behavior, please consult Squall query plans vs Storm topologies, section How to know we run out of memory?. You control this through the DIP_MAX_SRC_PAR parameter: the larger its value, the larger the database that can be processed.

DIP_SQL_ROOT is the path to the directory with SQL queries on your local machine; the example config above uses a relative path. DIP_ACK_EVERY_TUPLE selects how we detect that processing is done, so that the final result and the full execution time can be acquired. If the parameter is set to true, we ack each and every tuple. If it is set to false, each Spout sends a special message as its last tuple. For more information about the implications of this parameter, please consult Squall query plans vs Storm topologies, section To ack or not to ack?.

Now we explain the parameters you most likely will not need to change. DIP_EXTENSION is the file extension of the files in your database. In our case, the database files were named customer.tbl, orders.tbl, etc. DIP_READ_SPLIT_DELIMITER is a regular expression used for splitting a tuple in a database file into columns. DIP_GLOBAL_ADD_DELIMITER and DIP_GLOBAL_SPLIT_DELIMITER are used internally by Squall for serializing and deserializing tuples passed between components. DIP_KILL_AT_THE_END ensures that your topology is killed after the final result is written to a file. If you set it to false, your topology will run forever, consuming resources that could be used by other topologies executing at the same time.
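The difference between the escaped and unescaped delimiters is easier to see in code. In a regular expression a bare | means alternation, so the split delimiters are written as \|, while the add delimiter is the literal character that gets written between fields. The sketch below illustrates this in Python; the function names and the sample line are hypothetical, and the join/split scheme is an assumption about how such delimiters are typically used, not a copy of Squall's serialization code.

```python
import re

READ_SPLIT_DELIMITER = r"\|"    # DIP_READ_SPLIT_DELIMITER (regex)
GLOBAL_ADD_DELIMITER = "|"      # DIP_GLOBAL_ADD_DELIMITER (literal)
GLOBAL_SPLIT_DELIMITER = r"\|"  # DIP_GLOBAL_SPLIT_DELIMITER (regex)


def read_tuple(line: str) -> list:
    """Split one line of a database file into columns."""
    return re.split(READ_SPLIT_DELIMITER, line.rstrip("\n"))


def serialize(fields: list) -> str:
    """Join the fields of a tuple with the literal add delimiter."""
    return GLOBAL_ADD_DELIMITER.join(fields)


def deserialize(message: str) -> list:
    """Recover the fields by splitting on the split-delimiter regex."""
    return re.split(GLOBAL_SPLIT_DELIMITER, message)


if __name__ == "__main__":
    # Hypothetical line in the style of a .tbl file.
    fields = read_tuple("1|Customer#000000001|BUILDING\n")
    print(fields)
    # Serializing and deserializing returns the same fields.
    print(deserialize(serialize(fields)) == fields)
```

This scheme only round-trips correctly when no field contains the delimiter character itself.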

Thus, to change the database size, modify the DIP_DB_SIZE parameter, and to change the query, modify DIP_QUERY_NAME. You can find more examples of config files in INSTALL_DIR/dip/SQLtoQueryPlanPlugin/confs/, but only those ending with _serial are applicable to Local mode. You can also write config files from scratch, but make sure you put them in INSTALL_DIR/dip/SQLtoQueryPlanPlugin/confs/. To run a config file MY_CONFIG from this directory, run:

cd $INSTALL_DIR/bin
./pluginSQLLocalRun.sh MY_CONFIG

Keep in mind that you have to set the DIP_DATA_ROOT parameter in every config file you want to use.
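A quick way to catch a missing DIP_DATA_ROOT before launching a run is to read the config file yourself. The sketch below assumes the format seen in the example config (one KEY VALUE pair per line, separated by whitespace, with blank lines and lines starting with # ignored). It is a hypothetical helper, not the parser Squall uses.

```python
def parse_config(text: str) -> dict:
    """Parse KEY VALUE lines, skipping blank lines and # comments."""
    conf = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace only, so values such
        # as \| are kept exactly as written.
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1] if len(parts) > 1 else ""
    return conf


def check_data_root(conf: dict) -> None:
    """Raise ValueError if DIP_DATA_ROOT is missing or empty."""
    if not conf.get("DIP_DATA_ROOT"):
        raise ValueError("DIP_DATA_ROOT must be set in the config file")


if __name__ == "__main__":
    sample = """
# DIP_DB_SIZE is in GBs
DIP_DB_SIZE 0.1
DIP_DATA_ROOT /path/to/tpch/on/local/machine/
DIP_READ_SPLIT_DELIMITER \\|
"""
    conf = parse_config(sample)
    check_data_root(conf)  # passes: DIP_DATA_ROOT is present
    print(conf["DIP_DATA_ROOT"])
```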