
Transformers with Pytorch Geometric and Programl #179

Open
mtzortzi opened this issue Sep 29, 2021 · 6 comments

@mtzortzi

❓ Questions and Help

At the point of choosing which node attributes I want to use in my model, I want to choose the 'text' attribute, containing "[external] call shl i64 ...". The issue with this attribute is that it is of type string, so for it to be compatible with PyTorch Geometric (PyG) graphs I have to somehow encode it into a numerical representation, i.e. embeddings.
I have chosen to use Transformers for this job, and I saw that you also used Transformers in your Case Study C: Algorithm classification. So I have some questions about this.

  1. When you create your text strings to be compatible with the Transformer, how is this information traversed from the graphs? For example, in what way did you take these strings ("[external] call shl i64") in that specific order, and why is this order the correct one for the strings to be contextually relevant?

  2. As far as the pre-trained models from transformers (or sentence_transformers, if you chose that) are concerned, typically we have pre-trained models available. But here we have a very "different language" than the usual vocabularies: we have the "main strings" of an LLVM IR code line. So there is neither a dataset (from HuggingFace, for example) to download and build my tokenizer from, nor a pre-trained model to work with. My solution is to build a dataset from the LLVM IR code of each graph and then follow the steps to create my transformer model from scratch. Did you use a pre-trained model for your Transformer, or did you create the dataset and then build it from scratch?

Additional Context

I'm currently working with your ProGraML graphs: after converting each LLVM IR file into a ProGraML graph, I convert it into networkx and then into a PyTorch Geometric (PyG) graph.
My goal is to predict the end-to-end time for each kernel on each device (GPU or CPU), so as to eventually decide which device is best for each kernel.
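
For reference, a rough sketch of my pipeline; I'm assuming the programl Python package exposes from_llvm_ir / to_networkx and that the node attribute names match the proto, so please adjust to whatever your installed version provides:

```python
# Rough sketch: LLVM IR -> ProGraML -> networkx -> PyG. The programl calls
# (from_llvm_ir, to_networkx), the file name and the attribute names are my
# assumptions about the pip package's API; adjust to your installed version.
import programl
import torch
from torch_geometric.utils import from_networkx

ir = open("kernel.ll").read()            # one LLVM IR module (placeholder path)
pg = programl.from_llvm_ir(ir)           # ProGraML program graph
nxg = programl.to_networkx(pg)           # networkx MultiDiGraph

# Collect the string 'text' attribute per node; this is what still needs to be
# turned into a numerical representation (embeddings) before PyG can use it.
node_texts = [d["text"] for _, d in nxg.nodes(data=True)]

# Drop non-numeric node attributes so the topology converts cleanly.
for _, d in nxg.nodes(data=True):
    for key in ("text", "features", "function", "block"):
        d.pop(key, None)

data = from_networkx(nxg)                  # edge_index etc. from the topology
data.x = torch.zeros(len(node_texts), 64)  # placeholder until embeddings exist
```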

I would be really grateful if you could answer my questions.

Thank you,
Marianna

@ChrisCummins
Owner

Hi @mtzortzi,

CC'ing @Zacharias030 who did the transformer experiment you're referencing, he can provide the context for that.

For your second question: we didn't use a pretrained Transformer model. In fact, we didn't really find pre-trained embeddings particularly helpful. See our ablation studies, where we compare inst2vec pre-trained embeddings against random initialization.

Cheers,
Chris

@Zacharias030
Collaborator

Hi Marianna,

thanks for your interest in the transformer details!

In standard transformers, such as those from HuggingFace, an input sentence, e.g. "What don't the foxes say?", is tokenized into elements of a vocabulary according to certain splitting rules (mostly splits on whitespace, but also subword splittings), here for instance What|do|##n't|the|fox|##es|say|?|[PAD]|[PAD]|[PAD].... Each token from the vocabulary is then looked up in an embedding table, and the sequence length of the transformer model is the number of tokens that an input can have.
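
For illustration, this is the standard tokenization step with a stock HuggingFace tokenizer; bert-base-uncased is only an arbitrary example, and the exact subword splits depend on that model's vocabulary:

```python
# Standard NLP tokenization as described above, with a stock HuggingFace
# tokenizer. bert-base-uncased is only an example model; the exact splits and
# special tokens depend on the chosen model's vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("What don't the foxes say?"))

# Padding to a fixed sequence length is handled by the tokenizer itself:
enc = tok("What don't the foxes say?", padding="max_length", max_length=16)
print(enc["input_ids"])
```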

Here, we consider the Transformer model as a graph neural network where each token is now associated with one node in the graph, which means each node is associated with exactly one embedding vector as an input.

In your example of a ProGraML graph, you have a node whose text is "[external] call shl i64". We would directly build a vocabulary from different such text strings (considering each one to be a unique token!) over many ProGraML graphs, where each unique string constitutes a different element in our vocabulary. In order to keep the cardinality of the vocabulary finite, we only keep the most frequent text strings as unique tokens and map all very rare tokens to a special element in our vocabulary, the so-called "unknown token" [UNK].
In order to have fewer such "misses" that are mapped to the unknown token, we experiment with different ways of mapping instructions to a vocabulary, where taking the set of unique texts is just the simplest idea.
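
A minimal sketch of that vocabulary construction; the cutoff size and the [UNK] symbol are placeholders, not the exact values we used:

```python
# Treat each unique node `text` as one token, keep only the most frequent
# strings, and map everything else to an unknown token. VOCAB_SIZE and the
# unknown-token name are illustrative placeholders.
from collections import Counter

def build_vocab(all_node_texts, vocab_size=8192, unk="[UNK]"):
    counts = Counter(all_node_texts)                     # frequency over many graphs
    most_common = [t for t, _ in counts.most_common(vocab_size - 1)]
    vocab = {unk: 0}
    vocab.update({text: i + 1 for i, text in enumerate(most_common)})
    return vocab

def encode(node_texts, vocab, unk="[UNK]"):
    # Rare / unseen strings fall back to the unknown token's index.
    return [vocab.get(t, vocab[unk]) for t in node_texts]
```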

Since our vocabulary is specific to LLVM instructions, it is now easy to see why pretrained huggingface models don't quite "apply" to our problem formulation. In the paper, as @ChrisCummins mentioned, we instead compare by using pretrained embeddings from inst2vec, because these use the same text-based tokenization and thus directly apply.

If you are interested in using pretrained Transformer models from NLP, you would have to keep using their tokenization and think carefully about how to aggregate the sequence of tokens that you obtain for each node into a single node representation of the ProGraML graph.
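
For example, one simple (by no means the only) aggregation would be mean pooling of the subword representations into one vector per node; a sketch, again with bert-base-uncased only as a stand-in:

```python
# One possible aggregation when reusing a pretrained NLP transformer: mean-pool
# the last hidden states of the node text's subword tokens into a single node
# vector. Illustration only; bert-base-uncased is just an example model.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def node_embedding(text: str) -> torch.Tensor:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state       # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)              # (dim,)

vec = node_embedding("[external] call shl i64")
```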

I hope that I was able to give you a better picture of the relation between our graph transformer models and standard NLP transformers, please don't hesitate to follow up with questions with an @-mention of me!
I'd be happy to send you my thesis, where I go into much more detail on the formulation of the graph transformers that are used in the paper.

Best,
Zacharias

@mtzortzi
Author

Hi @Zacharias030,

First of all, I would like to thank you both for your immediate answers, I really appreciate it.

Yes, exactly, that was my thought too. Right now I've created some text files as my dataset, in which, for each graph, I have one "big text" corresponding to the strings of the text node attribute:

graph0: node0text, node1text, node2text, node3text ...
graph1: node0text, node1text, node2text, node3text ...

I have used the name graph0 in order to be clearer. In the place of "node0text, node1text, node2text, node3text ..." I have
" [external] call shl i64 ... "

The order of the node texts ([external] call shl i64 ...) is the order derived from the networkx object while looping over each node['text']. I was wondering how this order is the best one to ensure contextuality. Is this handled by the nature of the graph?

After creating my dataset and building my tokenizer, I train the transformer with a masked language modelling objective. That's how I initialize my transformer from scratch.
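
Concretely, my from-scratch setup currently looks roughly like the following; file names, vocabulary size and hyperparameters are placeholders I'm still experimenting with:

```python
# Rough sketch of the "train a tokenizer + masked LM from scratch" setup.
# File names, vocabulary size and hyperparameters are placeholders.
import os
from tokenizers import ByteLevelBPETokenizer
from datasets import load_dataset
from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# 1. Train a byte-level BPE tokenizer on the per-graph text files.
os.makedirs("llvm_tokenizer", exist_ok=True)
bpe = ByteLevelBPETokenizer()
bpe.train(files=["graphs_text.txt"], vocab_size=8192,
          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
bpe.save_model("llvm_tokenizer")
tokenizer = RobertaTokenizerFast.from_pretrained("llvm_tokenizer")

# 2. Randomly initialised (not pretrained) masked language model.
config = RobertaConfig(vocab_size=8192, hidden_size=512, num_hidden_layers=6,
                       num_attention_heads=8, max_position_embeddings=514)
model = RobertaForMaskedLM(config)

# 3. Tokenize the corpus and train with the standard MLM objective.
ds = load_dataset("text", data_files={"train": "graphs_text.txt"})["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="mlm_out", num_train_epochs=1,
                                         per_device_train_batch_size=16),
                  train_dataset=ds, data_collator=collator)
trainer.train()
```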

So, if my understanding is correct, you took the inst2vec embedding vectors, but you didn't train your Transformer with them? I'd be really happy if you could send me your thesis, to understand your graph transformers better.

Well, keeping their tokenization for each sequence of tokens that I obtain, and then somehow averaging the subword embedding vectors to generate an approximate vector for the original word, is exactly the question I'm working on right now.

I'm sorry for the volume of questions.

Best,
Marianna

@Zacharias030
Collaborator

The order of the node texts ([external] call shl i64 ...) is the order derived from the networkx object while looping over each node['text']. I was wondering how this order is the best one to ensure contextuality. Is this handled by the nature of the graph?

I don't know exactly how we ordered the nodes while creating the networkx graphs, but you could easily take a topological ordering of the nodes (as Ben-Nun et al. (NeurIPS 2018) do).
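
For illustration only (not necessarily the ordering we used), something like the sketch below; note that a topological sort only exists for acyclic graphs, so graphs with loops need a fallback or explicit removal of back-edges:

```python
# Illustration: a topological ordering of the nodes with networkx.
# topological_sort requires a DAG, so graphs with cycles (loops) need a
# fallback, e.g. a DFS preorder.
import networkx as nx

def node_order(nxg):
    try:
        return list(nx.topological_sort(nxg))
    except nx.NetworkXUnfeasible:              # the graph contains cycles
        return list(nx.dfs_preorder_nodes(nxg))
```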

In my opinion this could still be lacking though, because the context for each node is precisely the graph neighborhood and that's why we created the ProGraML graphs in the first place.
Since we may want to train our transformer models from scratch anyway, we should be able to design a graph-transformer that can faithfully represent the context for each node and simultaneously aggregate sub-nodetext tokens into node-level vector representations. I could imagine a two-stage design with a per-node transformer underneath a graph transformer that can be trained end-to-end, or a hybrid architecture that interchanges the two modes of aggregation.
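
Very roughly, such a two-stage design could look like the sketch below; it is only an illustration of the idea, not the architecture from the paper or my thesis:

```python
# Sketch of a two-stage design: a per-node transformer over subword tokens,
# followed by a graph layer that attends over the ProGraML neighbourhood.
# Trainable end-to-end; dimensions and layer counts are arbitrary.
import torch
import torch.nn as nn
from torch_geometric.nn import TransformerConv

class TwoStageGraphTransformer(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128, heads: int = 4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.node_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.graph_layer = TransformerConv(dim, dim, heads=heads, concat=False)

    def forward(self, node_token_ids: torch.Tensor, edge_index: torch.Tensor):
        # node_token_ids: (num_nodes, tokens_per_node) subword ids per node.
        h = self.node_encoder(self.tok_emb(node_token_ids))  # (N, T, dim)
        node_repr = h.mean(dim=1)                # aggregate sub-nodetext tokens
        return self.graph_layer(node_repr, edge_index)  # attend over neighbours
```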

I'm attaching my thesis for context on Graph Transformers and as additional background for the ProGraML paper.

thesis_zacharias_fisches_print_version.pdf

@mastersthesis{Fisches2020,
  author =        {Fisches, Zacharias V.},
  school =        {ETH Zurich},
  title =         {{Neural Self-Supervised Models of Code}},
  year =          {2020},
}

@mtzortzi
Author

Hello @Zacharias030,

Thank you so much for your willingness to answer my questions. I'm also thrilled about the field of transformers and your work motivates me to learn more about it. I've read your thesis and it's really enlightening.

I don't know exactly how we ordered the nodes while creating the networkx graphs, but you could easily take a topological ordering of the nodes (as Ben-Nun et al. (NeurIPS 2018) do).

In my opinion this could still be lacking though, because the context for each node is precisely the graph neighborhood and that's why we created the ProGraML graphs in the first place.

As far as the NLP encoding is concerned, yes, I believe that as well. That's why I was wondering what the order of the nodes is when I loop over the networkx graph: when you create the total text of the graph (which I tokenize and use to train the transformer and build my vocabulary), you probably want to keep the text of neighboring nodes close together, unless this is achieved by the position encoding instead. This brings me to the following question:

When using the Graph Transformer, is the position encoding (PE) that is needed the "position" feature from the edges, is that right? From the ProGraML paper, I understand that there is a different position encoding for each type of edge (each value of the edge feature "flow": control, data, call), e.g. for call edges it is mentioned that there is no position label, while for control-flow edges it says that "for instructions with a single control successor, the position of the control edge is 0" ... "for n successor statements, the control edge positions are in the range 0 <= e <= n"...
So I'm a bit confused about which position encoding to use.
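
For context, this is roughly how I read those features out of the networkx graph at the moment; I'm assuming the edges carry the proto's flow and position fields as attributes, so please correct me if the names differ:

```python
# Sketch of pulling per-edge flow type and position label out of the networkx
# graph, e.g. to use as edge-type-specific positional information. The
# attribute names ("flow", "position") are my assumption about the conversion.
import torch

def edge_features(nxg):
    flows, positions = [], []
    for _, _, data in nxg.edges(data=True):
        flows.append(data.get("flow", 0))          # control / data / call
        positions.append(data.get("position", 0))  # position label per edge
    return torch.tensor(flows), torch.tensor(positions)
```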

simultaneously aggregate sub-nodetext tokens into node-level vector representations.

If we are using the Graph Transformer, we mostly benefit from restricting the transformer's attention to only the local neighbors in our graph. This is the outer part of the architecture. The inner part concerns the creation of the vector representations that will constitute the embeddings for the outer part. What do you mean by the "aggregation of the sub-nodetext tokens"? Would you please explain this to me?

I'm really excited about this and it would be really great if I could contribute in a way to this project.

Thank you very much once again,
Marianna

@Zacharias030
Collaborator

Hey Marianna,

thanks for your kind words and continued interest in our work! I'll have to postpone a more detailed answer for a few days and get back to you!

Best,
Zach
