DataSet

Set the current directory

ROOT="$(git rev-parse --show-toplevel)"
cd "${ROOT}/dataset"

Download dataset

curl -fsSL "https://snap.stanford.edu/data/cit-HepPh.txt.gz" | gunzip -d > cit-HepPh.txt
curl -fsSL "https://snap.stanford.edu/data/cit-HepPh-dates.txt.gz" | gunzip -d > cit-HepPh-dates.txt
# OR https://www.cs.cornell.edu/projects/kddcup/datasets.html

Set up Python virtual environment

python3 -m venv .venv
source .venv/bin/activate
python --version
# Python 3.10.8

Run the partition script. It generates two directories inside the dataset directory: batch and incremental.

The partition script also cleans the high-energy physics citation network dataset. Some nodes in the edge file have no date, and some nodes with a date do not appear in the edge file. To handle this, the script keeps only the intersection of the two files: a node is retained only if it cites or is cited by another node in the edge file and it has an exact date.

python partition.py
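The intersection step described above can be illustrated with a minimal sketch. This is not the code in partition.py. The function names (`parse_edges`, `parse_dates`, `intersect`) are hypothetical, and the sketch assumes that both files hold whitespace-separated columns with `#` comment lines, that node ids can be compared as integers, that an edge is kept only when both endpoints have a date, and that the first date wins when an id appears more than once.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int]


def parse_edges(lines: Iterable[str]) -> List[Edge]:
    """Parse 'from to' pairs, skipping blank lines and '#' comments."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def parse_dates(lines: Iterable[str]) -> Dict[int, str]:
    """Parse 'node date' pairs; the first date seen for a node is kept (assumption)."""
    dates: Dict[int, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node, date = line.split()[:2]
        dates.setdefault(int(node), date)
    return dates


def intersect(edges: List[Edge], dates: Dict[int, str]) -> Tuple[List[Edge], Dict[int, str]]:
    """Keep edges whose endpoints are both dated, and dates of nodes that still have an edge."""
    kept_edges = [(s, d) for s, d in edges if s in dates and d in dates]
    kept_nodes = {n for edge in kept_edges for n in edge}
    kept_dates = {n: dates[n] for n in kept_nodes}
    return kept_edges, kept_dates


# Toy input: node 3 has no date, node 5 has a date but no edges.
toy_edges = parse_edges(["# FromNodeId ToNodeId", "1\t2", "2\t3", "4\t1"])
toy_dates = parse_dates(["# id date", "1\t1992-01-01", "2\t1992-02-01",
                         "4\t1993-01-01", "5\t1994-01-01"])
clean_edges, clean_dates = intersect(toy_edges, toy_dates)
print(clean_edges)          # [(1, 2), (4, 1)]
print(sorted(clean_dates))  # [1, 2, 4]
```

In the toy input, the edge from 2 to 3 is dropped because node 3 has no date, and node 5 is dropped because it never appears in a kept edge.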