dariodip/rfd-discovery

rfd-discovery

Description

This project, written in Python and Cython, deals with the discovery of Relaxed Functional Dependencies (RFDs) [1] using a bottom-up approach: instead of taking a fixed threshold as input and then finding all the RFDs, this method infers the distances for the different RHS attributes by itself and then discovers the RFDs for them.

rfd-discovery takes a dataset, representing a table of a relational database, in CSV format as input and prints the set of the discovered RFDs.
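
To make the idea of an RFD concrete, here is a minimal sketch of checking one dependency on a toy table. It assumes numeric attributes, absolute difference as the distance, and the usual reading of an RFD: every pair of rows that is within the LHS thresholds must also be within the RHS threshold. The function `rfd_holds` and the sample data are hypothetical and are not taken from rfd-discovery's code.

```python
from itertools import combinations


def rfd_holds(rows, lhs_thresholds, rhs, rhs_threshold):
    """Return True if every pair of rows that is close on all LHS
    attributes is also close on the RHS attribute.

    Hypothetical helper: distance is the absolute difference."""
    for a, b in combinations(rows, 2):
        close_on_lhs = all(
            abs(a[col] - b[col]) <= thr for col, thr in lhs_thresholds.items()
        )
        if close_on_lhs and abs(a[rhs] - b[rhs]) > rhs_threshold:
            return False  # this pair violates the dependency
    return True


# Toy data, invented for illustration only.
rows = [
    {"height": 170, "weight": 65},
    {"height": 171, "weight": 66},
    {"height": 180, "weight": 80},
    {"height": 181, "weight": 95},
]

strict = rfd_holds(rows, {"height": 1}, "weight", 1)    # weight within 1
relaxed = rfd_holds(rows, {"height": 1}, "weight", 15)  # weight within 15
print(strict, relaxed)
```

With a threshold of 1 on `weight` the last two rows violate the dependency, while a threshold of 15 makes it hold. A bottom-up method looks for such thresholds instead of asking for them up front.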

The CSV file can contain columns of the following types:

  • int;
  • int32;
  • int64;
  • float;
  • float64;
  • string;
  • datetime64*.

*For dates, you can use any of the formats known to pandas.
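
As an illustration of how such column types show up when a CSV file is loaded, the sketch below reads a small in-memory table with pandas. The column names and values are made up, and this is not necessarily how rfd-discovery loads its input; it only shows `pandas.read_csv` with an explicit separator and a date column.

```python
import io

import pandas as pd

# Invented sample data: integer, float, string and date columns.
csv_text = (
    "id;height;name;born\n"
    "1;1.70;Ann;2001-05-17\n"
    "2;1.82;Bob;1999-11-02\n"
)

# parse_dates asks pandas to convert the listed columns to datetime64.
df = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=["born"])
print(df.dtypes)

# The distance between two dates is a Timedelta, which can be
# expressed in different time units (days, seconds, ...).
gap = abs(df.loc[0, "born"] - df.loc[1, "born"])
print(gap.days, gap.total_seconds())
```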

Requirements

rfd-discovery is developed using Python 3.5, a C compiler (gcc or Visual Studio C++) and Cython 0.25.2. Cython is used to reduce the time and memory consumed by CPU-bound operations.

To run rfd-discovery correctly, you have to install Python 3.5 and Cython 0.25. To install all the requirements correctly, you need pip 9.0 (or higher).

rfd-discovery uses the following Python libraries:
matplotlib✛
numpy✛
pandas✛
tornado
Cython
nltk
flask

You can install these by following the Setup Section.

✛ These libraries are part of the SciPy stack.

Setup

In order to install rfd-discovery and all its requirements, you have to create a virtual environment using virtualenv on Python 3.5. To install virtualenv, run the following:

[sudo] pip3 install virtualenv on Linux/macOS or pip install virtualenv from a prompt opened as administrator on Windows.

To create a virtual environment, in the main directory of the project run:

virtualenv venv

To activate the virtual environment, in the main directory of the project run:

source venv/bin/activate on Linux/macOS or venv\Scripts\activate on Windows.

You can check whether the virtual environment is active by looking for the prefix (venv) in the command prompt.
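
The same check can be done from inside Python. The sketch below is only a convenience and is not part of rfd-discovery; the helper name `in_virtual_environment` is made up. It relies on `sys.base_prefix`, which differs from `sys.prefix` inside an environment, and on `sys.real_prefix`, which older virtualenv releases set instead.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases expose the original interpreter location
    # as sys.real_prefix; venv and newer virtualenv use sys.base_prefix.
    base = getattr(sys, "real_prefix", None) or getattr(
        sys, "base_prefix", sys.prefix
    )
    return base != sys.prefix


print(in_virtual_environment())
```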

To install all the requirements, run the following:

pip install -r requirements.txt

This should install, using pip, all the requirements.

To install WordNet, run:

python setup.py install

Build

Part of rfd-discovery is written in Cython, a superset of the Python programming language designed to give C-like performance with code that is mostly written in Python. Cython is used because the operations in the code are mostly CPU-bound and would otherwise waste computation time and memory.
You can compile the Cython code by running the following:

python build.py build_ext --inplace

This will generate C code from the Cython code and will try to compile it.

Note that you'll need gcc or another C compiler.

If the build phase ends without errors, you should have some .c and .pyd (or .so, depending on your OS) files. Don't worry about dealing with these; Python handles them automatically :).

Usage

Using rfd-discovery is easy enough. Just run the following command:

python3 main.py -c <csv-file> [options]

  • -c <csv-file>: the path of the dataset on which you want to discover RFDs;

Options:

  • -v : display the version number;
  • -s <sep>: the separator character used in your CSV file. If you don't provide this, rfd-discovery tries to infer it for you;
  • -h: indicates that the CSV file has a header row. If you don't provide this, rfd-discovery tries to infer it for you;
  • -r <rhs_index>: the column number of the RHS attribute. It must be a valid integer. You can omit it only if you don't specify LHS attributes (rfd-discovery will then find RFDs using each attribute as RHS and the remaining ones as LHS);
  • -l <lhs_index_1, lhs_index_2, ...,lhs_index_k>: column indexes of the LHS attributes, separated by commas (e.g. 1,2,3). You can omit them:
    if you don't specify the index of the RHS attribute, rfd-discovery will find RFDs using each attribute as RHS and the remaining ones as LHS;
    if you specify a valid RHS index, rfd-discovery will use the remaining attributes as LHS;
  • -i <index_col>: the column that contains the primary key of the dataset. If you specify it, the program will not calculate distances on it. NOTE: the index column should contain unique values;
  • -d <datetime columns>: a list of columns, separated by commas, whose values are in datetime format. If you specify this, rfd-discovery can express the distance between two dates in time units (e.g. ms, sec, min);
  • --semantic: use semantic distance based on WordNet for strings;
  • --human: print the RFDs to the standard output in a human-readable form;
  • --help: show help.
Valid Examples:
Check on each combination of attributes:

python main.py -c resources/dataset.csv

Infer LHS attributes given a fixed RHS attribute index:

python main.py -c resources/dataset.csv -r 0

RHS and LHS fixed, separator and header line specified:

python main.py -c resources/dataset.csv -r 0 -l 1,2,3 -s , -h 0
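
When -s and -h are omitted, rfd-discovery tries to infer the separator and the header row. The source does not say how this is done, so the following is only a sketch of one possible approach, using `csv.Sniffer` from the Python standard library on invented sample text; the tool's actual inference may work differently.

```python
import csv

# Invented sample: the first few lines of a CSV file.
sample = "id;height;weight\n1;170;65\n2;181;80\n"

sniffer = csv.Sniffer()

# Guess the delimiter among a set of plausible candidates.
dialect = sniffer.sniff(sample, delimiters=",;\t|")

# Heuristically decide whether the first row looks like a header.
has_header = sniffer.has_header(sample)

print(repr(dialect.delimiter), has_header)
```

Sniffing is a heuristic and can fail on small or irregular files, which is why passing -s and -h explicitly, as in the last example above, is the safer choice.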
