UniProtClient

IMPORTANT! Mapping is UNDER CONSTRUCTION and currently not working.

Python classes in this package allow convenient access to UniProt for protein ID mapping and information retrieval.

Usage

Mapping

Protein IDs differ from database to database. The class UniProtMapper maps protein IDs from one database to the corresponding IDs of another database. Both databases are specified by letter codes.

from UniProtClient import UniProtMapper
origin_database = 'P_GI'  # NCBI GI number
target_database = 'ACC'  # UniProt accession
gi_2_acc_mapping = UniProtMapper(origin_database, target_database)

The resulting object provides the method map_protein_ids, which takes a list of protein IDs (as strings) and returns a pandas DataFrame. The DataFrame has two columns, "From" and "To", holding the origin and target IDs, respectively.

gi_numbers = ['224586929', '224586929', '4758208']  # IDs must be given as a list of strings
# A pandas DataFrame with the columns "From" and "To" is returned
mapping_df = gi_2_acc_mapping.map_protein_ids(gi_numbers)
uniprot_accessions = mapping_df['To'].tolist()

mapping_df

	From	To
0	224586929	Q9Y2R2
1	224586929	B4DZW8
2	4758208	P51452
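
Because one origin ID can map to several target IDs (224586929 maps to both Q9Y2R2 and B4DZW8 above), it can be handy to collapse the table into a dictionary. The following is a minimal sketch using only the standard library. `group_mapping` is a hypothetical helper and not part of UniProtClient, and the pairs are copied from the table above.

```python
from collections import defaultdict


def group_mapping(pairs):
    """Collect all target IDs per origin ID, keeping first-seen order."""
    grouped = defaultdict(list)
    for origin_id, target_id in pairs:
        if target_id not in grouped[origin_id]:  # skip duplicate pairs
            grouped[origin_id].append(target_id)
    return dict(grouped)


# (From, To) pairs as shown in the table above
pairs = [
    ("224586929", "Q9Y2R2"),
    ("224586929", "B4DZW8"),
    ("4758208", "P51452"),
]
gi_to_acc = group_mapping(pairs)
```

With a real result, the pairs could presumably be obtained via `zip(mapping_df["From"], mapping_df["To"])`.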

Protein information

UniProt provides a variety of protein-specific information, such as protein family, organism, function, EC number, and many more. The class UniProtProteinInfo is initialized with column identifiers specifying the requested information. Spaces in column names should be replaced by underscores.
If no columns are specified, the following defaults are used:

Column-ID
id
entry_name
protein_names
families
organism
ec
genes(PREFERRED)
go(molecular_function)

The column "protein_names" contains all names of a protein; secondary names are given in brackets or parentheses. If this column is requested, the primary name is extracted and added as a new column called "primary_name".
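
How such an extraction might work can be sketched as follows. This is not necessarily how the package does it: `extract_primary_name` is a hypothetical helper that simply cuts the string at the first opening bracket or parenthesis preceded by whitespace. The example string is illustrative, assembled from the name and EC numbers shown for P51452 in this document. A primary name that itself contains " (" would be truncated by this simple rule.

```python
import re

# Whitespace followed by "(" or "[" is assumed to start a secondary name
_SECONDARY_START = re.compile(r"\s[\(\[]")


def extract_primary_name(protein_names):
    """Return the text before the first bracketed or parenthesized name."""
    match = _SECONDARY_START.search(protein_names)
    if match is None:
        return protein_names.strip()  # no secondary names present
    return protein_names[:match.start()].strip()


# Illustrative input, not a verbatim UniProt record
names = "Dual specificity protein phosphatase 3 (EC 3.1.3.16) (EC 3.1.3.48)"
primary = extract_primary_name(names)
```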

from UniProtClient import UniProtProteinInfo
info = UniProtProteinInfo()

info.load_protein_info(["B4DZW8", "Q9Y2R2", "P51452"])

	entry_name	protein_names	protein_families	organism	ec_number	gene_names(primary)	gene_ontology(molecular_function)	primary_name	subfamily	family	superfamily
entry
P51452	DUS3_HUMAN	Dual specificity protein phosphatase 3 (EC 3.1...	Protein-tyrosine phosphatase family, Non-recep...	Homo sapiens (Human)	3.1.3.16; 3.1.3.48	DUSP3	cytoskeletal protein binding [GO:0008092]; MAP...	Dual specificity protein phosphatase 3	Non-receptor class dual specificity subfamily	Protein-tyrosine phosphatase family	None
Q9Y2R2	PTN22_HUMAN	Tyrosine-protein phosphatase non-receptor type...	Protein-tyrosine phosphatase family, Non-recep...	Homo sapiens (Human)	3.1.3.48	PTPN22	kinase binding [GO:0019900]; non-membrane span...	Tyrosine-protein phosphatase non-receptor type 22	Non-receptor class 4 subfamily	Protein-tyrosine phosphatase family	None
B4DZW8	B4DZW8_HUMAN	cDNA FLJ55436, highly similar to Tyrosine-prot...		Homo sapiens (Human)			protein tyrosine phosphatase activity [GO:0004...	cDNA FLJ55436, highly similar to Tyrosine-prot...	None	None	None

Protein Families

If downloaded, the column 'protein_families' is parsed automatically and split into the categories subfamily, family, and superfamily. Some proteins belong to multiple families. By default, the individual categories are extracted and merged into a semicolon-separated string.

# Widen the displayed columns. Not needed for the extraction itself.
import pandas as pd
pd.set_option('max_colwidth', 400)

info = UniProtProteinInfo(merge_multi_fam_strings="string")  # Default behaviour
info.load_protein_info(["Q923J1"])[["organism", "subfamily", "family", "superfamily"]]

	organism	subfamily	family	superfamily
entry
Q923J1	Mus musculus (Mouse)	ALPK subfamily; LTrpC subfamily	Alpha-type protein kinase family; Transient receptor (TC 1.A.4) family	Protein kinase superfamily; -
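
The parsing described above can be approximated with a few lines of plain Python. This is a sketch under assumptions, not the package's implementation: `parse_families` and `merge_as_string` are hypothetical helpers, and the input string is an assumed raw format (associations separated by "; ", levels separated by ", ") reconstructed from the output shown above.

```python
def parse_families(families):
    """Split a protein-family string into one dict per family association."""
    associations = []
    for part in families.split("; "):
        levels = {"subfamily": None, "family": None, "superfamily": None}
        for entry in part.split(", "):
            entry = entry.strip()
            # Check the longer suffixes first, since both "subfamily" and
            # "superfamily" also end with "family"
            for level in ("superfamily", "subfamily", "family"):
                if entry.endswith(level):
                    levels[level] = entry
                    break
        associations.append(levels)
    return associations


def merge_as_string(associations, level):
    """Join one level across all associations, using '-' for missing values."""
    return "; ".join(a[level] or "-" for a in associations)


# Assumed raw string for Q923J1, reconstructed from the table above
raw = ("Protein kinase superfamily, Alpha-type protein kinase family, "
       "ALPK subfamily; Transient receptor (TC 1.A.4) family, LTrpC subfamily")
parsed = parse_families(raw)
superfamily = merge_as_string(parsed, "superfamily")
```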

Setting merge_multi_fam_strings to 'list' collects the family associations in a list. To keep types consistent, this also applies to proteins with only one family.

info = UniProtProteinInfo(merge_multi_fam_strings="list")  # Return family associations as lists
info.load_protein_info(["Q923J1", "Q9Y2R2"])[["organism", "subfamily", "family", "superfamily"]]

	organism	subfamily	family	superfamily
entry
Q923J1	Mus musculus (Mouse)	[ALPK subfamily, LTrpC subfamily]	[Alpha-type protein kinase family, Transient receptor (TC 1.A.4) family]	[Protein kinase superfamily, None]
Q9Y2R2	Homo sapiens (Human)	[Non-receptor class 4 subfamily]	[Protein-tyrosine phosphatase family]	[None]

Setting merge_multi_fam_strings to None creates an individual row for each family association; the remaining protein information is identical across these rows.

info = UniProtProteinInfo(merge_multi_fam_strings=None)
info.load_protein_info(["Q923J1"])[["organism", "subfamily", "family", "superfamily"]]

	organism	subfamily	family	superfamily
entry
Q923J1	Mus musculus (Mouse)	ALPK subfamily	Alpha-type protein kinase family	Protein kinase superfamily
Q923J1	Mus musculus (Mouse)	LTrpC subfamily	Transient receptor (TC 1.A.4) family	None
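
The relationship between the 'list' layout and the one-row-per-association layout can be illustrated with a short sketch. `expand_to_rows` is a hypothetical helper, not part of UniProtClient. It assumes that the three family lists of a record have equal length, as in the 'list' output shown earlier for Q923J1.

```python
def expand_to_rows(entry, record):
    """Turn one record with list-valued family columns into one row per association."""
    levels = ("subfamily", "family", "superfamily")
    rows = []
    # zip walks the three lists in parallel, one family association at a time
    for values in zip(*(record[level] for level in levels)):
        row = {"entry": entry, "organism": record["organism"]}
        row.update(zip(levels, values))
        rows.append(row)
    return rows


# Values copied from the 'list' output for Q923J1
record = {
    "organism": "Mus musculus (Mouse)",
    "subfamily": ["ALPK subfamily", "LTrpC subfamily"],
    "family": ["Alpha-type protein kinase family",
               "Transient receptor (TC 1.A.4) family"],
    "superfamily": ["Protein kinase superfamily", None],
}
rows = expand_to_rows("Q923J1", record)
```

Each resulting dictionary corresponds to one row of the table above; the non-family fields (here only the organism) are repeated unchanged.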
