Error in Generation of Embeddings from list of sequences via ProtT5 #134
Comments
Seems like you are running out of vRAM.
@mheinzinger, thanks, I have solved my issue. Can you please tell me how to select the maximum residue length for my protein sequences? The ProtT5 tokenizer has an option to set a max_length for the number of residues.
I usually do not set the parameter at all. ProtT5 uses learned positional encodings and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger out-of-memory on my GPU; those get removed from the dataset.
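A minimal sketch of this embed-or-skip workflow (illustrative, not from the thread; `model`, `tokenizer`, and `device` are assumed to be set up as in the ProtT5 examples, and the OOM is caught as a plain `RuntimeError` for compatibility with older PyTorch versions):

```python
import torch

def embed_or_skip(model, tokenizer, sequence, device):
    """Embed one full-length protein; return None if it does not fit into GPU memory."""
    spaced = " ".join(list(sequence))  # ProtT5 expects whitespace between residues
    ids = tokenizer(spaced, add_special_tokens=True, return_tensors="pt").to(device)
    try:
        with torch.no_grad():
            emb = model(input_ids=ids["input_ids"],
                        attention_mask=ids["attention_mask"]).last_hidden_state
        return emb[0, :-1]   # per-residue embeddings; drop the trailing </s> token
    except RuntimeError:     # CUDA out of memory for very long sequences
        torch.cuda.empty_cache()
        return None          # caller removes this protein from the dataset
```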
@mheinzinger, in my use case the protein sequences are not PDB chains; they are derived from proteins that belong to reviewed Swiss-Prot entries of UniProtKB. I want to do feature extraction for my protein sequences with the ProtT5 model. Can you tell me which code better fits my use case: the one in this notebook https://colab.research.google.com/drive/1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing, or the other one here: https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing? Should I generate per-protein representations or per-residue representations? And if I pass a single protein sequence to ProtT5 instead of a batch, will the resulting embedding be the same as the one produced when sequences are provided in a batch, or do we get better results by providing sequences in batches?
Providing sequences as a batch or processing them as single sequences should not make a difference (except for batching being faster).
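As a quick sanity check, one can compare the two directly (an illustrative snippet, not from the thread; `model`, `tokenizer`, `device`, and two pre-processed, whitespace-separated sequences `s1` and `s2` are assumed to exist):

```python
import torch

with torch.no_grad():
    # two sequences as one batch, padded to the longer one
    batch = tokenizer([s1, s2], add_special_tokens=True, padding="longest",
                      return_tensors="pt").to(device)
    emb_batch = model(**batch).last_hidden_state

    # the first sequence on its own
    single = tokenizer([s1], add_special_tokens=True, return_tensors="pt").to(device)
    emb_single = model(**single).last_hidden_state

n = single["input_ids"].shape[1]  # number of real tokens of s1
print(torch.allclose(emb_batch[0, :n], emb_single[0], atol=1e-3))  # expected: True (up to fp16 noise)
```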
I am generating the embeddings for my protein sequences via ProtT5 with the following code. I have 5000 protein sequences in total, which I provide as a list. I set the max_length parameter to 500, but it gives me an out-of-memory error. Can you help me fix it? I want to generate per-protein embeddings; the final output should have shape (5000, 1024).
RuntimeError: CUDA out of memory. Tried to allocate 10.77 GiB
Code:

```python
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare the protein sequences as a list
p_sequence = list(p_sequence)

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in p_sequence]

# tokenize sequences and pad/truncate to a fixed length of 500 tokens
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="max_length", truncation=True, max_length=500)
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for each sequence in the batch and remove padded and special tokens
emb_list = []
for i in range(len(sequence_examples)):
    seq_len = int(attention_mask[i].sum())                   # real tokens: residues + </s>
    emb_i = embedding_repr.last_hidden_state[i, :seq_len - 1]
    emb_list.append(emb_i)

# take the mean of the residue embeddings to get one vector per protein
emb_per_protein_list = []
for emb in emb_list:
    emb_per_protein = torch.mean(emb, dim=0)
    emb_per_protein_list.append(emb_per_protein)
```
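A common way to avoid the out-of-memory error here is to run the 5000 sequences through the encoder in small mini-batches rather than in a single forward pass, mean-pool each sequence over its real (non-padded) residues, and stack the results into the desired (5000, 1024) matrix. A sketch along those lines, reusing `model`, `tokenizer`, `sequence_examples`, and `device` from the code above (the batch size of 16 is an arbitrary starting point to tune for your GPU):

```python
batch_size = 16                 # reduce if you still run out of memory
emb_per_protein_list = []

for start in range(0, len(sequence_examples), batch_size):
    batch = sequence_examples[start:start + batch_size]
    ids = tokenizer(batch, add_special_tokens=True, padding="longest",
                    return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(input_ids=ids["input_ids"],
                       attention_mask=ids["attention_mask"]).last_hidden_state

    for i in range(len(batch)):
        # number of real tokens (residues + </s>); drop padding and the </s> token
        seq_len = int(ids["attention_mask"][i].sum())
        emb = hidden[i, :seq_len - 1]
        emb_per_protein_list.append(emb.mean(dim=0).cpu())

emb_per_protein = torch.stack(emb_per_protein_list)   # shape: (5000, 1024)
```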