DataSet

Set the current directory

ROOT="$(git rev-parse --show-toplevel)"
cd "${ROOT}/dataset"

Download dataset

curl -fsSL "https://snap.stanford.edu/data/cit-HepPh.txt.gz" | gunzip -d > cit-HepPh.txt
curl -fsSL "https://snap.stanford.edu/data/cit-HepPh-dates.txt.gz" | gunzip -d > cit-HepPh-dates.txt
# OR https://www.cs.cornell.edu/projects/kddcup/datasets.html

Set up Python virtual environment

python3 -m venv .venv
source .venv/bin/activate
python --version
# Python 3.10.8

Run the partition script. It generates two directories inside the dataset directory: batch and incremental.

The partition script also cleans the high-energy physics citation network dataset. Some nodes in the edge file have no date, and some nodes with a date do not appear in the edge file. To handle this, we take the intersection of the two files: a node is kept only if it cites or is cited by another node in the edge file and it has an exact date.

python partition.py
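
The cleaning rule described above can be sketched roughly as follows. This is an illustrative sketch, not the code in partition.py: the function names (parse_edges, parse_dates, intersect) are hypothetical, and it assumes both files are whitespace-separated text with "#" comment lines, one "source target" pair per line in the edge file and one "node YYYY-MM-DD" pair per line in the dates file. It also assumes an edge is kept only when both endpoints are dated, which is one possible reading of the intersection rule.

```python
from datetime import date


def parse_edges(lines):
    """Parse 'src dst' lines, skipping blank lines and '#' comments (assumed format)."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def parse_dates(lines):
    """Parse 'node YYYY-MM-DD' lines into a {node: date} dict (assumed format)."""
    dates = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node, day = line.split()[:2]
        dates[int(node)] = date.fromisoformat(day)
    return dates


def intersect(edges, dates):
    """Keep edges whose endpoints are both dated, and dates only for nodes on a kept edge."""
    kept_edges = [(s, d) for s, d in edges if s in dates and d in dates]
    kept_nodes = {n for edge in kept_edges for n in edge}
    kept_dates = {n: dates[n] for n in kept_nodes}
    return kept_edges, kept_dates
```

To try it on the downloaded files, pass open file handles to the parsers, for example parse_edges(open("cit-HepPh.txt")) and parse_dates(open("cit-HepPh-dates.txt")), then call intersect on the two results.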
