This project, written in Python and Cython, deals with the discovery of Relaxed Functional Dependencies (RFDs) [1] using a bottom-up approach: instead of taking a fixed threshold as input and then finding all the RFDs, this method infers distances from the different RHS attributes by itself and then discovers the RFDs that hold for them.
rfd-discovery takes a dataset, representing a table of a relational database, in CSV format as input and prints the set of the discovered RFDs.
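As an illustration of what an RFD states, the following minimal sketch checks whether a dependency with given distance thresholds holds on a small table. It is not the discovery algorithm used by rfd-discovery: the function name rfd_holds, the absolute-difference distance, and the sample rows are all assumptions made for this example.

```python
from itertools import combinations


def rfd_holds(rows, lhs, rhs, thresholds):
    """Check a distance-based RFD on a list of numeric rows.

    The dependency holds if every pair of rows that is "close" on all LHS
    columns (distance <= threshold) is also close on the RHS column.
    Hypothetical helper, not part of rfd-discovery.
    """
    for first, second in combinations(rows, 2):
        lhs_close = all(
            abs(first[col] - second[col]) <= thresholds[col] for col in lhs
        )
        if lhs_close and abs(first[rhs] - second[rhs]) > thresholds[rhs]:
            return False  # a pair close on the LHS but far on the RHS
    return True


# Made-up sample table: columns are (height, weight, shoe_size).
rows = [
    (170, 65, 40),
    (172, 66, 41),
    (190, 90, 45),
]

# height (<= 2), weight (<= 1) -> shoe_size (<= 1)
print(rfd_holds(rows, lhs=[0, 1], rhs=2, thresholds={0: 2, 1: 1, 2: 1}))
# Same LHS thresholds, but shoe_size must now match exactly.
print(rfd_holds(rows, lhs=[0, 1], rhs=2, thresholds={0: 2, 1: 1, 2: 0}))
```

With a threshold of 1 on the RHS the dependency holds. Tightening it to 0 breaks it, because the first two rows are close on the LHS but differ by 1 on the RHS.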
The CSV file can contain values of the following types:
- int;
- int32;
- int64;
- float;
- float64;
- string;
- datetime64*.
*For dates, you can use any of the formats recognized by pandas.
rfd-discovery is developed using Python 3.5, a C compiler (gcc or Visual Studio C++) and Cython 0.25.2. Cython is used to reduce time and memory consumption in CPU-bound operations.
To run rfd-discovery correctly, you have to install Python 3.5 and Cython 0.25. To install all the requirements correctly, you also need pip 9.0 (or higher).
rfd-discovery uses the following Python libraries:
matplotlib✛
numpy✛
pandas✛
tornado
Cython
nltk
flask
You can install these by following the Setup Section.
✛ These libraries are part of the SciPy stack.
To install rfd-discovery and all its requirements, you have to create a virtual environment using virtualenv on Python 3.5. To install virtualenv, run the following:
[sudo] pip3 install virtualenv
on Linux/macOS
or
pip install virtualenv
from a command prompt run as administrator on Windows.
To create a virtual environment, in the main directory of the project run:
virtualenv venv
To activate the virtual environment, in the main directory of the project run:
source venv/bin/activate
on Linux/macOS
or
venv\Scripts\activate
on Windows.
You can check that the virtual environment is active by verifying that the command prompt has the prefix (venv).
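If the prompt prefix is not visible (for example, with a customized shell prompt), the same check can be done from Python. This is a minimal sketch that relies only on the standard sys module; the function name in_virtual_environment is an assumption made for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment.

    Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    releases make sys.base_prefix differ from sys.prefix.
    """
    base_prefix = getattr(sys, "real_prefix", None) or getattr(
        sys, "base_prefix", sys.prefix
    )
    return base_prefix != sys.prefix


print("virtual environment active:", in_virtual_environment())
```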
To install all the requirements, run the following:
pip install -r requirements.txt
This should install, using pip, all the requirements.
To install WordNet, run:
python setup.py install
Part of rfd-discovery is written using Cython, a superset of the Python programming language designed to give C-like
performance with code that is mostly written in Python. This is because the operations performed by the code are mostly
CPU-bound and demanding in both computation and memory.
You can compile the Cython code by running the following:
python build.py build_ext --inplace
This will generate C code from the Cython code and will try to compile it.
Note that you'll need gcc or another C compiler.
If the build phase ends without errors, you should have some .c and .pyd (or .so, depending on your OS) files. Don't worry about dealing with these, Python does it automatically :).
Using rfd-discovery is easy enough. Just run the following command:
python3 main.py -c <csv-file> [options]
-c <csv-file>
: the path of the dataset on which you want to discover RFDs.
Options:
-v
: display the version number;
-s <sep>
: the separator character used in your CSV file. If you don't provide it, rfd-discovery tries to infer it for you;
-h
: indicates that the CSV file has a header row. If you don't provide it, rfd-discovery tries to infer it for you;
-r <rhs_index>
: the column number of the RHS attribute. It must be a valid integer. You can omit it only if you don't specify the LHS attributes (in that case, RFDs are searched using each attribute as RHS and the remaining ones as LHS);
-l <lhs_index_1,lhs_index_2,...,lhs_index_k>
: the column indexes of the LHS attributes, separated by commas (e.g. 1,2,3). You can omit them:
    - if you don't specify the RHS index, RFDs are searched using each attribute as RHS and the remaining ones as LHS;
    - if you specify a valid RHS index, the remaining attributes are used as LHS;
-i <index_col>
: the column that contains the primary key of the dataset. When specified, the program does not compute distances on it. NOTE: the index column should contain unique values;
-d <datetime columns>
: a comma-separated list of columns whose values are in datetime format. When specified, rfd-discovery can express the distance between two dates in time units (e.g. ms, sec, min);
--semantic
: use the semantic distance based on WordNet for strings;
--human
: print the RFDs to the standard output in a human-readable form;
--help
: show help.
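To give an idea of what a distance between two dates expressed in a time unit looks like, here is a minimal sketch based on the standard datetime module. It only illustrates the concept behind the -d option: the function name date_distance, the UNITS table, and the date format string are assumptions, not the code used by rfd-discovery.

```python
from datetime import datetime, timedelta

# Hypothetical unit table mirroring the units mentioned for -d (ms, sec, min).
UNITS = {
    "ms": timedelta(milliseconds=1),
    "sec": timedelta(seconds=1),
    "min": timedelta(minutes=1),
}


def date_distance(first, second, unit="sec", fmt="%Y-%m-%d %H:%M:%S"):
    """Return the absolute distance between two date strings in the given unit."""
    delta = abs(datetime.strptime(second, fmt) - datetime.strptime(first, fmt))
    # Dividing a timedelta by a timedelta yields a float.
    return delta / UNITS[unit]


print(date_distance("2017-01-01 10:00:00", "2017-01-01 10:30:00", unit="min"))
```

Taking the absolute value makes the distance symmetric, so the order of the two dates does not matter.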
python main.py -c resources/dataset.csv
python main.py -c resources/dataset.csv -r 0
python main.py -c resources/dataset.csv -r 0 -l 1,2,3 -s , -h 0
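When -s and -h are omitted, the separator and the header row are inferred. How rfd-discovery does this is not described here; as a rough sketch of one possible approach, the standard csv.Sniffer class can guess both from a sample of the file. The function name infer_csv_layout, the list of candidate separators, and the sample data are assumptions made for this example.

```python
import csv


def infer_csv_layout(sample, candidates=";,\t|"):
    """Guess the separator and the presence of a header row from a text sample.

    Uses the heuristics of csv.Sniffer, so the result is a best guess only.
    """
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample, delimiters=candidates)
    return dialect.delimiter, sniffer.has_header(sample)


# Made-up sample: a header row followed by numeric rows, separated by ';'.
sample = "id;height;weight\n1;170;65\n2;172;66\n3;190;90\n"
separator, has_header = infer_csv_layout(sample)
print(repr(separator), has_header)
```

Because these are heuristics, passing -s and -h explicitly remains the safer choice for files with unusual layouts.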