This file explains how to run the project.
First, fetch the datasets with Git LFS (`git lfs install && git lfs pull`); they are too big for GitHub.
Both Python and Java must be installed on the system. In our tests the Python version had to be lower than 3.10.9; we did not test for a minimum required version.
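For reference, a quick way to check the interpreter version before running anything (a minimal sketch, not part of the project):

```python
import sys

# The project was tested with Python versions below 3.10.9; warn otherwise.
if sys.version_info >= (3, 10, 9):
    print(f"Warning: Python {sys.version.split()[0]} detected; "
          "this project was tested only with versions below 3.10.9.")
```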
See *Install dependencies automatically* or *Install dependencies manually*.
All required packages can easily be installed with:
`pip install -r requirements.txt`
We implemented SON both for data stored on the local machine (`Frequent_Itemset_local.py`) and for data stored in a MongoDB database (`Frequent_Itemset_db.py`).
In both cases `pyspark` is necessary. Install it with:

`pip install pyspark`
Additional steps may be required (e.g. installing Spark on the system).
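As a quick sanity check that PySpark is working (a minimal sketch; it only assumes a local Spark installation):

```python
from pyspark.sql import SparkSession

# Start a local Spark session and run a trivial job to verify the setup.
spark = SparkSession.builder.master("local[*]").appName("spark-check").getOrCreate()
rdd = spark.sparkContext.parallelize(range(100), 4)
print("Partitions:", rdd.getNumPartitions(), "Sum:", rdd.sum())
spark.stop()
```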
`matplotlib` is also used for plotting the results, and `efficient-apriori` is a state-of-the-art implementation of Apriori that we used for some benchmarks:

`pip install matplotlib efficient-apriori`
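For reference, this is roughly how `efficient-apriori` can be used as a baseline (a minimal sketch with toy transactions, not the exact benchmark code of this project):

```python
from efficient_apriori import apriori

# Toy basket data; the real benchmarks use the datasets in Datasets/.
transactions = [
    ("bread", "milk", "eggs"),
    ("bread", "milk"),
    ("milk", "eggs"),
]

# Frequent itemsets with support >= 0.5; rules are not needed here.
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1.0)
print(itemsets)
```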
If you want to run SON on data stored in a MongoDB database, you also need the following steps:
- Install MongoDB on the system (the installation depends on the OS)
- Install `pymongo` with `pip install pymongo` (a quick connection check is sketched below)
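A minimal sketch to verify that MongoDB is reachable with `pymongo` (it assumes a default local instance on port 27017; adjust the URI if your setup differs):

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance and ping it.
client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
client.admin.command("ping")
print("MongoDB is up. Databases:", client.list_database_names())
```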
Local datasets are already present in the `Datasets/` folder. To start the algorithm run:

`python Frequent_Itemset_local.py`

You can change the selected dataset in the script `Frequent_Itemset_local.py` (line 106): 0 for travel reviews, 1 for online retail.
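Illustratively, the selection at line 106 boils down to a flag like the following (the actual variable name in `Frequent_Itemset_local.py` may differ):

```python
# Hypothetical illustration of the dataset switch at line 106.
DATASET = 0  # 0 -> travel reviews, 1 -> online retail
```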
Execution information can be found in the file `logs/SON.log`.
For a MongoDB execution, the following steps are required:
- Load the data into the database with the scripts in the `dataset_importers/` folder, running `python dataset_importers/import_travel_reviews.py` or `python dataset_importers/import_online_retail.py` (see the importer sketch below)
- Change the dataset in the script `Frequent_Itemset_db.py` accordingly (line 97): 0 for travel reviews, 1 for online retail
- Run the script `Frequent_Itemset_db.py` with `python Frequent_Itemset_db.py`
Execution information can be found in the file `logs/SON.log`.
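As an illustration of what a dataset importer could look like (a minimal sketch; the database, collection, and file names below are assumptions, not the exact names used by the project's scripts):

```python
import csv
from pymongo import MongoClient

# Read a CSV from Datasets/ and insert one document per row into MongoDB.
# Database, collection, and file names are illustrative assumptions.
client = MongoClient("mongodb://localhost:27017")
collection = client["frequent_itemsets"]["travel_reviews"]

with open("Datasets/travel_reviews.csv", newline="") as f:
    rows = [dict(row) for row in csv.DictReader(f)]

collection.delete_many({})  # start from a clean collection
collection.insert_many(rows)
print(f"Imported {collection.count_documents({})} documents")
```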
Follow the instructions in the *How to run* section, then execute the script `benchmark.py` with:

`python benchmark.py`
The benchmark program takes care of loading the data wherever it is needed.
Results are saved in the file `logs/benchmark.log`.
The benchmark dataset can be changed by defining a preprocessing function that returns the dataset as a list in the `Scripts/preprocessing.py` file, and then passing it as an argument to the benchmark preprocessing function call (line 125).
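A custom preprocessing function could look like this (a minimal sketch; the dataset file and the exact signature expected by `benchmark.py` are assumptions):

```python
# Illustrative addition to Scripts/preprocessing.py
import csv

def preprocess_my_dataset():
    """Return the dataset as a list of transactions (one list of items per basket)."""
    with open("Datasets/my_dataset.csv", newline="") as f:  # hypothetical file
        return [row for row in csv.reader(f)]
```

The function would then be passed to the benchmark preprocessing call at line 125 in place of the default one.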
The benchmark can be configured by changing the parameters in the file at line 127 (such as the support to use and the number of partitions to create).
The `support` parameter sets the frequency threshold for the frequent itemsets, while the `partitions` parameter sets the number of partitions used by the algorithm. By default `partitions` is `None`, which lets the specific partitioner assign it: the local version uses one partition per core by default, while the database version uses the connector's partitioner. Setting `partitions` to anything other than `None` forces both the DB and the local versions to use the specified number of partitions.
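In PySpark terms, the effect of the `partitions` parameter for the local version corresponds roughly to the following (a sketch, not the project's exact code):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "partitions-demo")
baskets = [["bread", "milk"], ["milk", "eggs"], ["bread", "eggs"]]

partitions = None  # None -> Spark falls back to its default (one partition per core locally)
rdd = sc.parallelize(baskets, numSlices=partitions)
print("Partitions used:", rdd.getNumPartitions())
sc.stop()
```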