A Sketch-based Approach towards Scalable and Efficient Attributed Network Embedding
- BGENA is a super-fast sketch-based ANE solver, which uses BinSketch coupled with a novel edge propagation mechanism.
- PBGENA, Parallelized BGENA, works with MPI to leverage a system's multi-core architecture to further accelerate the embedding process.
- PBGENA's binary embeddings allow efficient bitarray/sparse-matrix storage.
- PBGENA embeddings achieve state-of-the-art performance in graph analysis tasks like node classification and link prediction.
- PBGENA is highly flexible and can work with just the topology or the attributes of the graph and enables turning off propagation altogether.
git clone https://github.com/tapadeep/PBGENA.git
pip install -r PBGENA/requirements.txt
cd PBGENA/Code/Algorithm
python BGENA.py --graph Facebook --N 2000 --alpha 0.5 --b_a 0.7 --b_t 0.6
cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph Facebook --algorithm BGENA --tr 0.7 --multi True
Ignore the --multi
flag for graphs that are not multi-labeled.
cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph Facebook --algorithm BGENA --erf 0.3 --N 2000 --alpha 0.8 --b_a 1 --b_t 0
cd PBGENA/Code/Algorithm
python PBGENA.py --graph PubMed --N 8000 --alpha 0.65 --b_a 0 --b_t 0.8 --p 6
The system must have Open MPI, MPICH, or Microsoft MPI installed.
cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph PubMed --algorithm PBGENA --tr 0.7
cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph PubMed --algorithm PBGENA --erf 0.3 --N 8000 --alpha 1 --b_a 0.25 --b_t 0 --p 6
cd PBGENA/Code/Algorithm
python Baseline.py --graph CiteSeer --algorithm FeatherNode --reduction_dimensions 25 --eval_points 5 --order 1
cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph CiteSeer --algorithm FeatherNode --tr 0.7
cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph CiteSeer --algorithm FeatherNode --erf 0.3 --reduction_dimensions 25 --eval_points 5 --order 1
Flag | Type | Description | Range | Default |
---|---|---|---|---|
--algorithm |
STRING |
Embedding Method to be used | {BGENA, PBGENA, Baselines} |
PBGENA |
--alpha |
FLOAT |
Fraction of the Embedding Dimension to be used for attributes |
[0, 1] | - |
--b_a |
FLOAT |
Attribute Bitset Probability | [0, 1] | - |
--b_t |
FLOAT |
Topology Bitset Probability | [0, 1] | - |
--erf |
FLOAT |
Fraction of edges to be removed for link prediction |
(0, 1) | 0.3 |
--f |
INTEGER |
Fragment parameter for PBGENA communication calls |
ℕ | 1 |
--f_a |
FLOAT |
Reduction fraction for b_a each pass | [1, ∞) | 2 |
--f_t |
FLOAT |
Reduction fraction for b_t each pass | [1, ∞) | 2 |
--graph |
STRING |
Network to be used for embedding | Any Network | - |
--l_a |
INTEGER |
Number of passes of attribute edge propagation |
ℕ | 1 |
--l_t |
INTEGER |
Number of passes of topology edge propagation |
ℕ | 1 |
--multi |
BOOLEAN |
Identify if the graph is Multi-Labeled | {True, False} | False |
--N |
INTEGER |
Embedding Dimension | ℕ | 2000 |
--p |
INTEGER |
Number of workers to use | ℕ | 32 |
--tr |
FLOAT |
Training Ratio for Node Classification | (0, 1) | 0.7 |
Some of the others are standard flags, and the remaining flag descriptions are available with the python graph learning framework Karate Club.
An online Google Colaboratory Notebook demonstrates PBGENA's working, and other examples are in the Examples
folder.
Graph | α | b_a | b_t |
---|---|---|---|
Wikipedia | 0.85 | 0.00 | 0.20 |
Cora | 0.60 | 0.80 | 0.80 |
CiteSeer | 0.80 | 0.90 | 0.40 |
0.50 | 0.70 | 0.60 | |
BlogCatalog | 0.60 | 0.00 | 0.00 |
Flickr | 1.00 | 0.00 | 0.00 |
PubMed | 0.65 | 0.00 | 0.80 |
PPI | 0.10 | 0.95 | 0.50 |
0.60 | 0.00 | 0.00 | |
Google+ | 0.00 | 0.00 | 0.00 |
0.00 | 0.00 | 0.00 | |
TWeibo | 0.60 | 0.00 | 0.00 |
MAKG | 0.85 | 0.86 | 0.60 |
Graph | α | b_a | b_t |
---|---|---|---|
Wikipedia | 0.95 | 0.20 | 0.20 |
Cora | 0.60 | 0.30 | 0.40 |
CiteSeer | 0.90 | 0.20 | 0.40 |
0.80 | 1.00 | 0.00 | |
BlogCatalog | 0.90 | 0.05 | 0.00 |
Flickr | 0.90 | 0.00 | 0.00 |
PubMed | 1.00 | 0.25 | 0.00 |
PPI | 0.00 | 0.00 | 0.30 |
1.00 | 0.00 | 0.00 | |
Google+ | 0.90 | 0.05 | 0.10 |
0.95 | 0.20 | 0.00 | |
TWeibo | 0.95 | 0.00 | 0.00 |
MAKG | 0.85 | 0.86 | 0.60 |
The other baseline hyperparameters used in the experiments can be obtained here.
Graph | #Vertices | #Edges | #Attributes | #Labels | Multi- labeled? |
---|---|---|---|---|---|
Wikipedia | 2,405 | 11,596 | 4,973 | 17 | No |
Cora | 2,708 | 5,278 | 1,433 | 7 | No |
CiteSeer | 3,312 | 4,536 | 3,703 | 6 | No |
4,039 | 88,234 | 1,283 | 193 | Yes | |
BlogCatalog | 5,196 | 171,743 | 8,189 | 6 | No |
Flickr | 7,575 | 239,738 | 12,047 | 9 | No |
PubMed | 19,717 | 44,324 | 500 | 3 | No |
PPI | 56,944 | 793,632 | 50 | 121 | Yes |
81,306 | 1,342,296 | 216,839 | 4,065 | Yes | |
Google+ | 107,614 | 12,238,285 | 15,907 | 468 | Yes |
232,965 | 57,307,946 | 602 | 41 | No | |
TWeibo | 2,320,895 | 50,133,369 | 1,657 | 9 | No |
MAKG | 59,249,719 | 976,901,586 | 7,211 | 100 | Yes |
Some preprocessed networks are added to the Datasets
folder, and the remaining can be downloaded online. All of the datasets are obtained from Dr. Jieming Shi's website. To generate files suitable for PBGENA with data from their website, use:
cd PBGENA/Datasets
python data_processing.py --file cora --graph Cora
This converts cora.attr.tar.gz
to the Cora
folder containing relevant files which can be used to train PBGENA. Use the --multi True
flag for converting multi-labeled networks.
Graph | Node Classification (%) | Link Prediction (%) | Time (s) | |||
Macro-F1 | Micro-F1 | AUC-ROC | Avg Precision | Serial | Parallel | |
Wikipedia | 70.23±4.45 | 79.53±1.78 | 85.32±0.26 | 86.49±0.27 | 0.85 | 0.08 |
Cora | 84.88±1.56 | 85.66±1.15 | 89.50±0.60 | 90.42±0.65 | 0.64 | 0.08 |
CiteSeer | 68.95±0.95 | 72.07±0.54 | 92.93±0.55 | 94.33±0.45 | 0.80 | 0.09 |
32.89±1.65 | 75.41±0.88 | 97.70±0.05 | 97.88±0.04 | 1.22 | 0.16 | |
BlogCatalog | 91.97±0.58 | 92.07±0.59 | 80.23±0.14 | 80.46±0.17 | 0.36 | 0.06 |
Flickr | 78.08±0.76 | 77.92±0.81 | 90.40±0.07 | 90.49±0.08 | 0.09 | 0.06 |
PubMed | 86.55±0.24 | 86.75±0.22 | 91.91±0.23 | 92.38±0.28 | 2.09 | 0.20 |
PPI | 40.30±0.06 | 54.58±0.05 | 96.12±0.03 | 96.82±0.03 | 16.30 | 1.05 |
25.43±0.33 | 56.47±0.74 | 97.46±0.02 | 97.50±0.02 | 46.19 | 1.61 | |
Google+ | 59.24±0.41 | 84.57±0.23 | 93.63±0.05 | 93.21±0.12 | 13.83 | 1.14 |
87.01±0.11 | 88.13±0.12 | 88.10±0.01 | 86.16±0.03 | 75.32 | 3.68 | |
TWeibo | 17.14±0.20 | 56.83±0.09 | 96.22±0.02 | 96.39±0.01 | 84.08 | 14.02 |
MAKG | 37.77±0.16 | 50.26±0.21 | 85.15±0.03 | 84.81±0.05 | 24504.66 | 6136.49 |
Except for MAKG all of the above reported results were performed with N=2000
, tr=0.7
, erf=0.3
, and p=32
in a server with 270GB RAM with proper hyperparameters. For MAKG, we used tr=0.1
, p=6
, and f=100
. To get even better results perform the experiments at N=8000
. A more comprehensive performance report can be viewed here.
To create your network, you need three files: edge_list.npy
, attribute_matrix.npz
, and label_array.npy
or label_array.npz
(depending on whether the graph is single-labeled or multi-labeled). Ensure all your vertex IDs are in the range (0, nodes-1). Create a numpy array of the edges of shape (edges, 2) and save that file as edge_list.npy
. Store the attributes as a sparse CSR matrix of shape (nodes, attributes). To run PBGENA for non-attributed graphs, create an empty attribute matrix for the graph of the proper shape and set α=0. Save the attribute file as attribute_matrix.npz
. The file for label array is required only for node classification and can be ignored for performing link prediction on unlabeled graphs. If the graph is single-labeled, the file label_array.npy
is a simple 1D array where the i'th number denotes the label for node i. For multi-labeled graphs, label_array.npz
is a MultiLabelBinarized CSR-matrix of shape (nodes, labels). Finally, put these three files in a folder with the graph name and add it to the Datasets
folder.