Influenza viral proteins target human proteins that have distinctive sequence and topological features in the human interactome network. This graph variational auto-encoder is therefore designed to learn both the topological features and the sequence features of the human proteins.
The full dataset is constructed from the STRING physical interaction network between human proteins. Influenza-human interactions are retrieved from STRING Viruses for Influenza A virus and added on top of the STRING human-human interaction network.
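A minimal sketch of this step, assuming the STRING human physical links and the STRING-Viruses Influenza A links have already been downloaded as whitespace-separated tables with protein1/protein2 columns (the file names below are placeholders, not the repository's actual inputs):

```python
import pandas as pd
import networkx as nx

# Placeholder file names; substitute the actual STRING downloads.
human_links = pd.read_csv("human.protein.physical.links.txt", sep=r"\s+")
flu_links = pd.read_csv("influenzaA_human.links.txt", sep=r"\s+")

graph = nx.Graph()
# Human-human physical interactions form the backbone of the network.
graph.add_edges_from(zip(human_links["protein1"], human_links["protein2"]))
# Influenza A - human interactions are layered on top of the human network.
graph.add_edges_from(zip(flu_links["protein1"], flu_links["protein2"]))

nx.write_gml(graph, "influenza_human_PPN_clean.gml")
```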
All influenza and host sequences were then embedded with SeqVec, turning each sequence into a vector of length 1024.
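A sketch of this embedding step, assuming SeqVec is run through the bio_embeddings package (the repository's own embedding script may differ):

```python
from bio_embeddings.embed import SeqVecEmbedder

embedder = SeqVecEmbedder()  # downloads the SeqVec weights on first use

def embed_protein(sequence: str):
    # SeqVec produces per-residue embeddings; reduce_per_protein collapses
    # them into a single fixed-length vector of 1024 for the whole protein.
    per_residue = embedder.embed(sequence)
    return embedder.reduce_per_protein(per_residue)

vector = embed_protein("MKTIIALSYIFCLALG")  # toy sequence; result has length 1024
```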
First, 20% of the edges are randomly removed from the adjacency matrix and split into a test set and a validation set. The remaining edges, together with the sequence embedding features, are fed into the variational graph auto-encoder (VGAE). The learning task is to reconstruct the masked 20% of edges by learning the topological and sequence features of the nodes from the remaining 80% of edges.
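As a sketch, this masking can be expressed with PyTorch Geometric's RandomLinkSplit; the toy graph below is only a stand-in for the real influenza-human network, and the repository's own splitting code may be implemented differently:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.utils import to_undirected

# Toy stand-in: 100 nodes with 1024-dimensional SeqVec features and random edges.
x = torch.randn(100, 1024)
edge_index = to_undirected(torch.randint(0, 100, (2, 400)))
data = Data(x=x, edge_index=edge_index)

# Hold out 10% of edges for validation and 10% for testing (20% removed in total).
split = RandomLinkSplit(num_val=0.1, num_test=0.1,
                        is_undirected=True,
                        add_negative_train_samples=False)
train_data, val_data, test_data = split(data)
# The VGAE is trained on train_data and scored on reconstructing the held-out edges.
```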
A typical GCN layer convolves information from a node's one-hop neighbours, but in this project I used a high-order GCN (HO-GCN), which can convolve information from second-order or even more distant neighbours.
HO-GCN achieves this by performing random walks at each GCN layer before passing the information on to the next layer. This naturally introduces two new hyper-parameters: the number of random-walk steps and the probability of restart at each step.
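The layer itself isn't reproduced here, but the idea can be sketched as a propagation step that takes several random-walk steps over the normalized adjacency matrix, restarting at the layer's input features with probability alpha (class and argument names below are illustrative, not the repository's):

```python
import torch
import torch.nn as nn

class HighOrderPropagation(nn.Module):
    """k-step random walk with restart over node features.

    adj_norm is the (symmetrically) normalized adjacency matrix and
    alpha is the restart probability at each step.
    """

    def __init__(self, num_walks: int, alpha: float):
        super().__init__()
        self.num_walks = num_walks
        self.alpha = alpha

    def forward(self, h: torch.Tensor, adj_norm: torch.Tensor) -> torch.Tensor:
        h0 = h  # the features the walk restarts to
        for _ in range(self.num_walks):
            # One walk step with probability (1 - alpha), restart with probability alpha.
            h = (1 - self.alpha) * (adj_norm @ h) + self.alpha * h0
        return h
```

With num_walks=1 and alpha=0 this reduces to ordinary one-hop GCN propagation; larger values of num_walks mix in information from higher-order neighbours.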
The docker image can be obtained with:
docker pull ghcr.io/dingquanyu/ho-vgae_ppi_predictor:latest
Then docker run automatically starts the training:
docker run --rm -it ghcr.io/dingquanyu/ho-vgae_ppi_predictor:latest
The default entrypoint is set to:
python /app/ho-vgae_ppi_predictor/train.py \
--path_to_graph=/input_data/influenza_human_PPN_clean.gml \
--path_to_node_features=/input_data/feature_mtx.pkl \
--model=HOVGAE --alpha=0.2
The only tunable hyper-parameter is alpha, and the user is free to overwrite the entrypoint to use a different alpha. The user can also choose either "HOVGAE" or "VGAE" as --model to see what difference the high-order GCN makes.
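For example, to train the plain VGAE with a different alpha, the entrypoint can be overridden along these lines (illustrative invocation; the flags and input paths are the ones shown above):
docker run --rm -it --entrypoint python ghcr.io/dingquanyu/ho-vgae_ppi_predictor:latest \
/app/ho-vgae_ppi_predictor/train.py \
--path_to_graph=/input_data/influenza_human_PPN_clean.gml \
--path_to_node_features=/input_data/feature_mtx.pkl \
--model=VGAE --alpha=0.5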
Since the number of random-walk steps at each convolution layer is itself a hyper-parameter, it is very time-consuming and computationally expensive to determine the best number of steps. Thus, below is an approximation of infinitely many random-walk steps at each layer.
At each layer, every additional step propagates the features one hop further over the graph with restart probability alpha. And since the random walk converges to a Wiener process as the number of steps grows, if the number of walks is infinite, then the per-layer propagation can be replaced by its closed-form limit.
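The closed form is not spelled out above; under the usual reading of "random walk with restart" as personalized PageRank (my assumption, not a quotation of the repository's derivation), one step of propagation and its infinite-step limit are:

```latex
% One random-walk step at a layer, with restart probability \alpha and
% symmetrically normalized adjacency \hat{A} = \tilde{D}^{-1/2}(A+I)\tilde{D}^{-1/2}:
H^{(t+1)} = (1-\alpha)\,\hat{A}\,H^{(t)} + \alpha\,H^{(0)}

% As t \to \infty the geometric series converges, because the spectral radius
% of (1-\alpha)\hat{A} is strictly less than 1 for 0 < \alpha \le 1:
H^{(\infty)} = \alpha\,\bigl(I - (1-\alpha)\,\hat{A}\bigr)^{-1} H^{(0)}
```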