EmbedAIRR

EmbedAIRR is a tool for extracting embeddings and attention matrices from protein sequences using pre-trained models. Each output type can be configured separately, with dedicated options for CDR3 sequences. The currently implemented models are ESM2, from the 2023 paper "Evolutionary-scale prediction of atomic-level protein structure with a language model", and AntiBERTa2-CSSP, from the 2023 preprint "Enhancing Antibody Language Models with Structural Information".

Usage

Clone the repository:
```
git clone <repository-url>
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Run the embedding script:

python embed_airr.py --fasta_path <input-file> --output_path <output-file> --model_name <model_name> --<optional_arguments>
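
For scripted runs, the same invocation can be assembled in Python. The following is a minimal sketch and not part of EmbedAIRR: `build_embed_command` is a hypothetical helper, and the file and directory names are placeholders. It uses only flags documented in this README. Passing several values as space-separated tokens follows the `--layers -1 6` example and is assumed, not confirmed, to apply to the other multi-choice options.

```python
def build_embed_command(fasta_path, output_path, model_name, **options):
    """Assemble an argument list for embed_airr.py (hypothetical helper).

    Keyword options map to flags, e.g. layers=[-1, 6] -> --layers -1 6.
    """
    cmd = [
        "python", "embed_airr.py",
        "--fasta_path", str(fasta_path),
        "--output_path", str(output_path),
        "--model_name", model_name,
    ]
    for name, value in options.items():
        cmd.append(f"--{name}")
        # Assumption: multi-value options are passed as separate tokens.
        values = value if isinstance(value, (list, tuple)) else [value]
        cmd.extend(str(v) for v in values)
    return cmd


# Placeholder paths; replace them with real files before running.
cmd = build_embed_command(
    "sequences.fasta", "outputs", "esm2_t33_650M_UR50D",
    layers=[-1, 6], extract_embeddings=["pooled", "unpooled"],
)
print(" ".join(cmd))
# To execute from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.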

Arguments

--model_name (str, required): Model name. Example: esm2_t33_650M_UR50D.
--fasta_path (str, required): Path to the FASTA file.
--output_path (str, required): Directory for output files. A subdirectory is created for each output type.
--cdr3_path (str, optional): Path to the CDR3 CSV file. Only required when calculating CDR3 sequence embeddings.
--context (int, optional): Number of amino acids to include before and after the CDR3 sequence. Default is 0.
--layers (str, optional): Representation layers to extract from the model. Default is the last layer. Example: --layers -1 6.
--extract_embeddings (str, optional): Set the embedding return types. Choose one or more from: pooled, unpooled, false. Default is pooled.
--extract_cdr3_embeddings (str, optional): Set the CDR3 embedding return types. Choose one or more from: pooled, unpooled, false. Requires --cdr3_path to be set. Default is pooled.
--extract_attention_matrices (str, optional): Set the attention matrix return types. Choose one or more from: false, all_heads, average_layer, average_all. Default is false.
--extract_cdr3_attention_matrices (str, optional): Set the CDR3 attention matrix return types. Choose one or more from: false, all_heads, average_layer, average_all. Requires --cdr3_path to be set. Default is false.
--batch_size (int, optional): Batch size for loading sequences. Default is 1024.
--discard_padding (bool, optional): Discard padding tokens from unpooled embeddings output. Default is False.
--max_length (int, optional): Length to which sequences will be padded. Default is 140.
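
To illustrate what `--context` describes, the sketch below widens a CDR3 span by a fixed number of residues on each side. It is an illustration under stated assumptions, not EmbedAIRR's actual code: `cdr3_window` is a hypothetical function, the CDR3 is located by plain substring search, and the window is clipped at the sequence ends. The example sequence is made up.

```python
def cdr3_window(sequence, cdr3, context=0):
    """Return (start, end) indices of the CDR3 plus `context` residues per side.

    Hypothetical helper: assumes the CDR3 occurs verbatim in the sequence.
    """
    start = sequence.find(cdr3)
    if start == -1:
        raise ValueError("CDR3 not found in sequence")
    end = start + len(cdr3)
    # Clip so the window never runs past either end of the sequence.
    return max(0, start - context), min(len(sequence), end + context)


sequence = "QVQLVQSGAEVKKPGASCARDYWGQGTLVTVSS"  # made-up example
cdr3 = "CARDYW"
start, end = cdr3_window(sequence, cdr3, context=2)
print(sequence[start:end])  # prints "ASCARDYWGQ"
```

With `context=0`, the default listed above, the window covers the CDR3 residues only.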
