Python classes in this package allow convenient access to UniProt for protein ID mapping and information retrieval.
Protein IDs differ from database to database. The class UniProtMapper maps protein IDs from one database to the corresponding IDs of another database. Both databases are specified by letter codes.
from UniProtClient import UniProtMapper
origin_database = 'P_GI' # NCBI GI number
target_database = 'ACC' # UniProt Accession
gi_2_acc_mapping = UniProtMapper(origin_database, target_database)
The resulting object provides a method, map_protein_ids, which takes a list of protein ID strings as input and returns a pandas DataFrame. The DataFrame has two columns, "From" and "To", holding the origin and target IDs, respectively.
gi_numbers = ['224586929', '224586929', '4758208'] # IDs should be represented as a list of strings
# a pandas DataFrame is returned containing the columns "From" and "To"
mapping_df = gi_2_acc_mapping.map_protein_ids(gi_numbers)
uniprot_accessions = mapping_df['To'].tolist()
mapping_df
| | From | To |
|---|---|---|
| 0 | 224586929 | Q9Y2R2 |
| 1 | 224586929 | B4DZW8 |
| 2 | 4758208 | P51452 |
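As the table shows, one origin ID can map to several target IDs, so a flat list of the "To" column loses the association. A minimal sketch of grouping the targets per origin ID follows. It rebuilds the result above by hand instead of querying UniProt, and the variable name targets_by_origin is only illustrative.

```python
import pandas as pd

# Hand-built copy of the mapping result shown above (no network access needed).
mapping_df = pd.DataFrame(
    {
        "From": ["224586929", "224586929", "4758208"],
        "To": ["Q9Y2R2", "B4DZW8", "P51452"],
    }
)

# Collect all target IDs for each origin ID into a list.
targets_by_origin = mapping_df.groupby("From")["To"].apply(list).to_dict()

for origin_id, accessions in targets_by_origin.items():
    print(origin_id, accessions)
```

Keeping the grouping as a dictionary makes it easy to spot origin IDs with more than one accession before passing them on.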
UniProt provides a variety of protein-specific information, such as protein family, organism, function, EC number, and many more.
The class UniProtProteinInfo is initialized with column identifiers specifying the requested information. Spaces in column names should be replaced by underscores.
If no columns are specified, the following defaults are used:
Column-ID |
---|
id |
entry_name |
protein_names |
families |
organism |
ec |
genes(PREFERRED) |
go(molecular_function) |
The column "protein_names" contains all protein names, with secondary names given in brackets or parentheses. If this column is requested, the primary name is extracted and added as a new column called "primary_name".
from UniProtClient import UniProtProteinInfo
info = UniProtProteinInfo()
info.load_protein_info(["B4DZW8", "Q9Y2R2", "P51452"])
entry | entry_name | protein_names | protein_families | organism | ec_number | gene_names(primary) | gene_ontology(molecular_function) | primary_name | subfamily | family | superfamily |
---|---|---|---|---|---|---|---|---|---|---|---|
P51452 | DUS3_HUMAN | Dual specificity protein phosphatase 3 (EC 3.1... | Protein-tyrosine phosphatase family, Non-recep... | Homo sapiens (Human) | 3.1.3.16; 3.1.3.48 | DUSP3 | cytoskeletal protein binding [GO:0008092]; MAP... | Dual specificity protein phosphatase 3 | Non-receptor class dual specificity subfamily | Protein-tyrosine phosphatase family | None |
Q9Y2R2 | PTN22_HUMAN | Tyrosine-protein phosphatase non-receptor type... | Protein-tyrosine phosphatase family, Non-recep... | Homo sapiens (Human) | 3.1.3.48 | PTPN22 | kinase binding [GO:0019900]; non-membrane span... | Tyrosine-protein phosphatase non-receptor type 22 | Non-receptor class 4 subfamily | Protein-tyrosine phosphatase family | None |
B4DZW8 | B4DZW8_HUMAN | cDNA FLJ55436, highly similar to Tyrosine-prot... | | Homo sapiens (Human) | | | protein tyrosine phosphatase activity [GO:0004... | cDNA FLJ55436, highly similar to Tyrosine-prot... | None | None | None |
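The package's own rule for deriving "primary_name" is not shown here. As a rough sketch of the idea, assuming the primary name is everything before the first opening parenthesis or bracket, it could look like the following. The helper extract_primary_name and the example string are made up for illustration and are not part of UniProtClient.

```python
import re


def extract_primary_name(protein_names: str) -> str:
    """Return the text before the first ' (' or ' [' (hypothetical helper)."""
    # Split once at the first whitespace followed by "(" or "[".
    # Names that themselves contain parentheses would be cut short by this rule.
    return re.split(r"\s[\(\[]", protein_names, maxsplit=1)[0].strip()


# Made-up example string in the "name (secondary) [extra]" layout described above.
example = "Example protein 1 (EC 1.2.3.4) (Alternative name) [Cleaved into: Fragment A]"
print(extract_primary_name(example))
print(extract_primary_name("Name without secondary entries"))
```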
If downloaded, the string 'protein_families' is parsed automatically and split into the categories subfamily, family, and superfamily.
Some proteins belong to multiple families. The default behaviour is to extract the individual categories and merge them into a ;-separated string.
# Widen the displayed column width; not needed for the extraction itself.
import pandas as pd
pd.set_option('display.max_colwidth', 400)
info = UniProtProteinInfo(merge_multi_fam_strings="string") # Default behaviour
info.load_protein_info(["Q923J1"])[["organism", "subfamily", "family", "superfamily"]]
entry | organism | subfamily | family | superfamily |
---|---|---|---|---|
Q923J1 | Mus musculus (Mouse) | ALPK subfamily; LTrpC subfamily | Alpha-type protein kinase family; Transient receptor (TC 1.A.4) family | Protein kinase superfamily; - |
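To illustrate how such a split and merge might work, here is a minimal sketch. It assumes the raw families string separates family associations with ";" and the levels within one association with ",", and that each term ends in "subfamily", "family", or "superfamily". The input string is assembled from the values in the table above, and the function names are hypothetical, not the package's API.

```python
def parse_families(families: str) -> list:
    """Split a families string into one dict per family association."""
    associations = []
    for part in families.split(";"):
        levels = {"subfamily": None, "family": None, "superfamily": None}
        for term in part.split(","):
            term = term.strip()
            # "superfamily" and "subfamily" also end in "family",
            # so the longer suffixes are checked first.
            for level in ("superfamily", "subfamily", "family"):
                if term.endswith(level):
                    levels[level] = term
                    break
        associations.append(levels)
    return associations


def merge_as_string(associations: list, level: str) -> str:
    """Join one level across all associations, using '-' for missing values."""
    return "; ".join(a[level] or "-" for a in associations)


# Assumed raw string, built from the table values shown above.
raw = (
    "Protein kinase superfamily, Alpha-type protein kinase family, ALPK subfamily; "
    "Transient receptor (TC 1.A.4) family, LTrpC subfamily"
)
parsed = parse_families(raw)
for level in ("subfamily", "family", "superfamily"):
    print(level, "->", merge_as_string(parsed, level))
```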
Setting merge_multi_fam_strings to 'list' will arrange each family association in a list.
To keep types consistent, this applies to proteins with only one family as well.
info = UniProtProteinInfo(merge_multi_fam_strings="list") # Return family associations as lists
info.load_protein_info(["Q923J1", "Q9Y2R2"])[["organism", "subfamily", "family", "superfamily"]]
entry | organism | subfamily | family | superfamily |
---|---|---|---|---|
Q923J1 | Mus musculus (Mouse) | [ALPK subfamily, LTrpC subfamily] | [Alpha-type protein kinase family, Transient receptor (TC 1.A.4) family] | [Protein kinase superfamily, None] |
Q9Y2R2 | Homo sapiens (Human) | [Non-receptor class 4 subfamily] | [Protein-tyrosine phosphatase family] | [None] |
Setting merge_multi_fam_strings to None will create an individual row for each family association, with the remaining protein information repeated in each row.
info = UniProtProteinInfo(merge_multi_fam_strings=None)
info.load_protein_info(["Q923J1"])[["organism", "subfamily", "family", "superfamily"]]
entry | organism | subfamily | family | superfamily |
---|---|---|---|---|
Q923J1 | Mus musculus (Mouse) | ALPK subfamily | Alpha-type protein kinase family | Protein kinase superfamily |
Q923J1 | Mus musculus (Mouse) | LTrpC subfamily | Transient receptor (TC 1.A.4) family | None |
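With one row per family association, the same entry appears several times in the index, which changes how lookups behave in pandas. The sketch below rebuilds the table above by hand, assuming the returned DataFrame is indexed by "entry" as displayed, and shows a lookup and a per-entry aggregation. All variable names are illustrative.

```python
import pandas as pd

# Hand-built copy of the result shown above (no network access needed).
info_df = pd.DataFrame(
    {
        "organism": ["Mus musculus (Mouse)", "Mus musculus (Mouse)"],
        "subfamily": ["ALPK subfamily", "LTrpC subfamily"],
        "family": [
            "Alpha-type protein kinase family",
            "Transient receptor (TC 1.A.4) family",
        ],
        "superfamily": ["Protein kinase superfamily", None],
    },
    index=pd.Index(["Q923J1", "Q923J1"], name="entry"),
)

# A repeated index label makes .loc return a DataFrame, not a single row.
rows = info_df.loc["Q923J1"]

# Number of family associations per entry.
n_associations = info_df.groupby(level="entry").size()

# Collect the families of each entry back into one list.
families_per_entry = info_df.groupby(level="entry")["family"].agg(list)

print(rows)
print(n_associations)
print(families_per_entry)
```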