Transformers with Pytorch Geometric and Programl #179
Comments
Hi @mtzortzi, CC'ing @Zacharias030, who did the transformer experiment you're referencing; he can provide the context for that. For your second question, we didn't use a pretrained Transformer model. In fact, we didn't really find pre-trained embeddings particularly helpful; see our ablation studies where we compare inst2vec pre-trained embeddings against random initialization. Cheers,
Hi Marianna, thanks for your interest in the transformer details!

In standard transformers, such as those from huggingface, the input sentences are first split into (sub-word) tokens, and each token is mapped to its own embedding vector. Here, we consider the Transformer model as a graph neural network where each token is now associated with one node in the graph, which means each node is associated with exactly one embedding vector as an input. In your example of a ProGraML graph, you have a node whose `text` attribute is a full instruction string, and that whole string is mapped to a single entry of our vocabulary, i.e. a single embedding vector.

Since our vocabulary is specific to LLVM instructions, it is now easy to see why pretrained huggingface models don't quite "apply" to our problem formulation. In the paper, as @ChrisCummins mentioned, we instead compare by using pretrained embeddings from inst2vec, because these also provide one embedding per (preprocessed) LLVM statement, i.e. the same one-vector-per-node granularity.

If you are interested in using pretrained Transformer models from NLP, you would have to keep using their tokenization and give it a good think how to aggregate the sequence of tokens that you obtain for each node into a single node-representation of the ProGraML graph.

I hope that I was able to give you a better picture of the relation between our graph transformer models and standard NLP transformers. Please don't hesitate to follow up with questions with an @-mention of me! Best,
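To make the "one node, one embedding" formulation concrete, here is a minimal sketch; the vocabulary entries, embedding dimension, and unknown-token handling below are made up for illustration and are not ProGraML's actual vocabulary:

```python
import torch
import torch.nn as nn

# Made-up vocabulary mapping full node texts to integer ids, standing in for
# ProGraML's LLVM-specific vocabulary (real entries and size differ).
vocab = {"[external]": 0, "br label %4": 1, "add i64 %a, %b": 2, "!UNK": 3}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

def node_to_vector(node_text: str) -> torch.Tensor:
    """Map one node's `text` attribute to exactly one embedding vector."""
    idx = vocab.get(node_text, vocab["!UNK"])
    return embedding(torch.tensor(idx))

vec = node_to_vector("add i64 %a, %b")
print(vec.shape)  # torch.Size([32])
```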
Hi @Zacharias030,

First of all, I would like to thank you both for your immediate answers, I really appreciate it. Yes, that was exactly my thought too.

Right now, I've created some text files as my dataset, in which I have for each graph one "big text" corresponding to the strings of the `text` node attribute:

graph0: node0text, node1text, node2text, node3text ...

I have used the name graph0 to be clearer; in the place of "node0text, node1text, node2text, node3text ..." I have the actual `text` strings of the nodes. The order of the node texts ([external] call shl i64 ...) is the order derived from the networkx object while looping over each node['text']. I was wondering whether this order is the best one to ensure contextuality. Is this handled by the nature of the graph?

After creating my dataset and building my tokenizer, I train the transformer with a masked language modelling objective. That's how I initialize my transformer from scratch. So, if my understanding is correct, you took the inst2vec vectors as the embedding vectors, but you didn't train your Transformer with them?

I'd be really happy if you could send me your thesis, to understand your graph transformers better.

Keeping the tokenization for each sequence of tokens that I'm obtaining and then somehow averaging their subword embedding vectors to generate an approximate vector for the original word is a question that I'm working on right now.

I'm sorry for the volume of questions. Best,
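(One concrete reading of the subword-averaging idea above, as a hedged sketch: run an off-the-shelf NLP encoder over each node's `text` string and mean-pool the resulting sub-word vectors into one vector per node. The model choice and plain mean pooling are arbitrary assumptions, not the ProGraML approach.)

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Arbitrary off-the-shelf encoder; any pretrained NLP model could stand in here.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

@torch.no_grad()
def pooled_node_vector(node_text: str) -> torch.Tensor:
    """Tokenize one node's `text` string and mean-pool its sub-word embeddings."""
    inputs = tokenizer(node_text, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state   # (1, num_subwords, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)           # one approximate vector per node

vec = pooled_node_vector("[external] call shl i64")
print(vec.shape)  # torch.Size([768]) for this particular encoder
```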
I don't know exactly how we ordered the nodes while creating the networkx graphs, but you could easily take a topological ordering of the nodes (like Ben-Nun et al., NeurIPS 2018, do). In my opinion this could still be lacking, though, because the context for each node is precisely the graph neighborhood, and that's why we created the ProGraML graphs in the first place.

I'm attaching my thesis for context on Graph Transformers and as additional background for the ProGraML paper. thesis_zacharias_fisches_print_version.pdf
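A minimal sketch of such an ordering with networkx; note that loops make these graphs cyclic, so a strict topological sort needs a fallback, and treating node 0 as the root is an assumption about the ProGraML layout:

```python
import networkx as nx

def ordered_node_texts(g: nx.MultiDiGraph) -> list:
    """Return node `text` attributes in (approximately) topological order."""
    try:
        order = list(nx.topological_sort(g))
    except nx.NetworkXUnfeasible:
        # Loops make the graph cyclic, so a strict topological sort fails;
        # fall back to a DFS preorder from node 0 (assumed to be the root).
        order = list(nx.dfs_preorder_nodes(g, source=0))
        seen = set(order)
        order += [n for n in g.nodes if n not in seen]  # keep unreachable nodes too
    return [g.nodes[n]["text"] for n in order]
```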
Hello @Zacharias030, Thank you so much for your willingness to answer my questions. I'm also thrilled about the field of transformers and your work motivates me to learn more about it. I've read your thesis and it's really enlightening.
As far as the NLP encoding is concerned, yes, I believe that as well. That's why I was wondering about the order of the nodes when I made the "for loop" over the networkx graph, since when you create the total text of the graph (which I tokenize and train the transformer on to build my vocabulary) you probably want to keep the text of neighboring nodes close together. Unless this is achieved by the position encoding instead.

This brings me to the following question: when using the Graph Transformer, is the position encoding (PE) that is needed the "position" feature from the edges? From the ProGraML paper, I understand that there is a different position encoding for each type of edge (each value of the edge feature "flow": control, data, call), i.e. for call edges it is mentioned that there is no position label, and for control-flow edges it says that for "instructions with a single control successor, the position of the control edge is 0" and for "n successor statements, the control edge positions are in the range 0 <= e <= n".
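(To make the edge question concrete, here is a small sketch of pulling the per-edge flow and position attributes out of the networkx form and turning them into tensors for a PyG-style model. The attribute names follow the ProGraML edge schema, while the enum values and the position cap are assumptions for illustration.)

```python
import torch
import networkx as nx

MAX_POSITION = 32  # assumed cap on positional labels, not a ProGraML constant

def edge_features(g: nx.MultiDiGraph):
    """Collect edge_index plus per-edge flow type and position label."""
    src, dst, flows, positions = [], [], [], []
    for u, v, data in g.edges(data=True):
        src.append(u)
        dst.append(v)
        flows.append(data["flow"])  # control / data / call (enum encoding assumed)
        positions.append(min(data.get("position", 0), MAX_POSITION - 1))
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return edge_index, torch.tensor(flows), torch.tensor(positions)
```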
If we are using the Graph Transformer, we mostly benefit from restricting the transformer's attention to the local neighbors in our graph. This is the outer part of the architecture. The inner part concerns the creation of the vector representations that will constitute the embeddings for the outer part.

What do you mean by the "aggregation of the sub-nodetext tokens"? Would you please explain this to me? I'm really excited about this and it would be great if I could contribute to this project in some way. Thank you very much once again,
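(For concreteness, the "attention restricted to local neighbors" idea can be sketched as masked self-attention over the adjacency structure. This is a toy, single-head illustration, not the ProGraML implementation; real graph transformers add projections, multiple heads, edge features, and residuals.)

```python
import torch
import torch.nn.functional as F

def neighborhood_attention(x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """Self-attention where each node may only attend to its graph neighbors (and itself).

    x:          (num_nodes, dim) node embeddings
    edge_index: (2, num_edges) source/target node indices
    """
    num_nodes, dim = x.shape
    scores = x @ x.t() / dim ** 0.5                   # dense pairwise attention scores
    mask = torch.full((num_nodes, num_nodes), float("-inf"))
    mask[edge_index[0], edge_index[1]] = 0.0          # unmask edges
    mask.fill_diagonal_(0.0)                          # unmask self-attention
    attn = F.softmax(scores + mask, dim=-1)           # non-neighbors get zero weight
    return attn @ x

x = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
print(neighborhood_attention(x, edge_index).shape)    # torch.Size([5, 16])
```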
Hey Marianna, thanks for your kind words and continued interest in our work! I'll have to postpone a more detailed answer for a few days and get back to you! Best,
❓ Questions and Help
At the point of choosing which node attributes I want to use in my model, I want to choose the 'text' attribute, containing "[external] call shl i64 ...". The thing with this attribute is that it is of type string, so for it to be compatible with PyTorch Geometric (PyG) graphs I have to somehow encode it into a numerical representation, i.e. embeddings.
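For illustration only, a minimal sketch of one way to make the string 'text' attribute PyG-compatible, by mapping each distinct node text to an integer id; the vocabulary here is built on the fly and is not ProGraML's own:

```python
import torch
from torch_geometric.data import Data

def encode_graph(node_texts, edge_list):
    """Map each node's `text` string to an integer id and build a PyG Data object.

    node_texts: list of strings, one per node (e.g. "[external]", "call", ...)
    edge_list:  list of (source, target) node-index pairs
    """
    vocab = {}
    ids = [vocab.setdefault(t, len(vocab)) for t in node_texts]
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(-1)        # (num_nodes, 1)
    edge_index = torch.tensor(edge_list, dtype=torch.long).t()   # (2, num_edges)
    return Data(x=x, edge_index=edge_index)

data = encode_graph(["[external]", "call", "shl i64"], [(0, 1), (1, 2)])
print(data)  # Data(x=[3, 1], edge_index=[2, 2])
```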
I have chosen to use Transformers for this job; I saw that you also used Transformers in your Case Study C: Algorithm Classification. So, I have some questions about this.
When you create your text strings to be compatible with the Transformer, how is this information traversed from the graphs? For example, in which way did you take the strings "[external] call shl i64" in that specific order? Why is this order the correct one for the strings to be contextually relevant?
As far as pre-trained models from transformers (or sentence_transformers, if you chose that one) are concerned, typically we have pre-trained models available. But here we have a very "different language" from the usual vocabularies: we have the "main strings" of an LLVM IR code line. So there is neither a dataset from HuggingFace, for example, to download and build my tokenizer from, nor a pre-trained model to work with. My solution is to build my own dataset from the LLVM IR code I have for each graph and then follow the steps to create my transformer model from scratch. Did you use a pre-trained model for your Transformer, or did you create the dataset and then train it from scratch?
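As a hedged sketch of this "from scratch" route, one could train a byte-level BPE tokenizer on the extracted LLVM IR text files with the HuggingFace tokenizers library; the file name, vocabulary size, special tokens, and output directory below are placeholders:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a tokenizer from scratch on the per-graph LLVM IR text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],            # placeholder: the per-graph "big text" file(s)
    vocab_size=8000,                 # arbitrary choice
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("llvm_ir_tokenizer", exist_ok=True)
tokenizer.save_model("llvm_ir_tokenizer")   # writes vocab.json and merges.txt

enc = tokenizer.encode("[external] call shl i64")
print(enc.tokens)
```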
Additional Context
I'm currently working with your ProGraML graphs: after converting each LLVM IR module into a ProGraML graph, I convert it into networkx and after that into a PyTorch Geometric (PyG) graph.
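For reference, a rough sketch of the first steps of that pipeline, assuming the `programl` Python package's `from_llvm_ir` and `to_networkx` helpers (check the package docs for exact signatures) and a hypothetical input file `kernel.ll`:

```python
import programl

# Hypothetical path to an LLVM IR module; replace with your own.
with open("kernel.ll") as f:
    ir = f.read()

# Build the ProGraML graph and convert it to networkx.
graph = programl.from_llvm_ir(ir)
nx_graph = programl.to_networkx(graph)

# Node `text` attributes in iteration order, as used for the text files above.
node_texts = [data["text"] for _, data in nx_graph.nodes(data=True)]
print(len(node_texts), node_texts[:5])
```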
My goal is to predict the end-to-end time for each kernel to run on each device (GPU or CPU), so as to eventually decide which device is best for each kernel.
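(A minimal sketch of the kind of readout this goal implies: predict one runtime per device from a graph-level embedding and pick the device with the smallest prediction. The architecture, dimensions, and device ordering are assumptions, not part of ProGraML.)

```python
import torch
import torch.nn as nn

class DevicePicker(nn.Module):
    """Toy readout: per-device runtime regression plus argmin device selection."""

    def __init__(self, graph_dim: int = 64, num_devices: int = 2):
        super().__init__()
        self.runtime_head = nn.Sequential(
            nn.Linear(graph_dim, 64), nn.ReLU(), nn.Linear(64, num_devices)
        )

    def forward(self, graph_embedding: torch.Tensor):
        predicted_runtimes = self.runtime_head(graph_embedding)  # (batch, num_devices)
        best_device = predicted_runtimes.argmin(dim=-1)          # 0 = CPU, 1 = GPU (assumed)
        return predicted_runtimes, best_device

picker = DevicePicker()
runtimes, device = picker(torch.randn(4, 64))
print(runtimes.shape, device)   # torch.Size([4, 2]) and a tensor of device indices
```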
I would be really grateful if you could answer my questions.
Thank you,
Marianna