QuickGO is a database that provides protein function annotations, with proteins specified by UniProt ID. Grouping proteins by function rather than by family extends protein associations beyond evolutionary relationships. However, a protein may have multiple functions (e.g. receptor tyrosine kinases) and is then not uniquely assigned to a single group.
The provided code can be used to extract all or selected annotations from QuickGO.
If not already installed, install pip and git:
conda install git
conda install pip
Then install via pip:
pip install git+https://github.com/c-feldmann/QuickGOProteinAnnotation
The script annotate_protein_list.py takes an input file (here: demo_data/demo_uniprot_ids.tsv) in which proteins are specified in the column "uniprot_id". Results are saved as a tab-separated file to demo_data/demo_output.tsv.
python annotate_protein_list.py -i demo_data/demo_uniprot_ids.tsv -o demo_data/demo_output.tsv -c "uniprot_id" -s tab
Argument | Explanation |
---|---|
-i | input file |
-o | output file |
-c | column name |
-s | separator |
The default value for -s is "tab", and the default output file is named go_function_annotation.tsv.
A short example of how this package can be used in Python code:
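If your UniProt IDs live in a Python list, an input file for the script can be produced with the standard library. This is only a sketch: the helper name `write_uniprot_input` and the file name `my_uniprot_ids.tsv` are made up for illustration and are not part of this package. It assumes a file with a single column is sufficient, matching the column passed via -c.

```python
import csv
import tempfile
from pathlib import Path


def write_uniprot_input(path, uniprot_ids, column="uniprot_id"):
    """Write one UniProt ID per row under a single header column (tab-separated)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow([column])
        for uniprot_id in uniprot_ids:
            writer.writerow([uniprot_id])


# Write a small demo file into a temporary directory (hypothetical file name).
demo_path = Path(tempfile.mkdtemp()) / "my_uniprot_ids.tsv"
write_uniprot_input(demo_path, ["Q16512", "P30085", "P25774"])
written_lines = demo_path.read_text().splitlines()
```

The resulting file can then be passed to the script with -i, using -c "uniprot_id" and -s tab.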
from go_protein_annotation import DefaultAnnotation
test_proteins = ["Q16512", "P30085", "P25774"]
default_annotation = DefaultAnnotation()
protein_class_df = default_annotation.annotate_proteins(test_proteins)
protein_class_df
| | uniprot_id | protein_function |
|---|---|---|
| 0 | Q16512 | Transcription regulator |
| 1 | Q16512 | Kinase |
| 2 | P30085 | Kinase |
| 3 | P25774 | Peptidase |
QuickGO functions are ordered hierarchically. For example, an explicit annotation of peptidase activity also implies hydrolase activity. The provided code extracts all explicit functional annotations and extends them with the implied annotations.
To obtain all annotations for a protein, the class AllFunctionAnnotation is used.
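To illustrate what extending explicit annotations with implicit ones means, here is a minimal sketch that walks up a parent mapping. It is not the package's implementation: the function `with_implicit_terms` and the dictionary `TOY_PARENTS` are hypothetical, and the mapping is a small hand-written excerpt rather than data fetched from QuickGO.

```python
# Hand-written excerpt of child -> parent relations, for illustration only.
TOY_PARENTS = {
    "GO:0008233": {"GO:0016787"},  # peptidase activity -> hydrolase activity
    "GO:0016787": {"GO:0003824"},  # hydrolase activity -> catalytic activity
    "GO:0003824": set(),           # treated as the top of this toy excerpt
}


def with_implicit_terms(explicit_terms, parents):
    """Return the explicit terms plus every ancestor reachable via `parents`."""
    seen = set()
    stack = list(explicit_terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # already expanded via another path
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


# An explicit peptidase annotation also yields the more general terms.
expanded = with_implicit_terms({"GO:0008233"}, TOY_PARENTS)
```

Tracking visited terms keeps the traversal finite even when several children share a parent.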
from go_protein_annotation import AllFunctionAnnotation
all_functions = AllFunctionAnnotation()
# For a single protein
all_functions_q16512 = all_functions.get_protein_functions("Q16512")
# For a list of proteins
protein_functions = all_functions.annotate_proteins(["Q16512", "P30085"])
all_functions_q16512.head(10)
| | uniprot_id | go_id | protein_function |
|---|---|---|---|
| 0 | Q16512 | GO:0005515 | protein binding |
| 1 | Q16512 | GO:0035639 | purine ribonucleoside triphosphate binding |
| 2 | Q16512 | GO:0000166 | nucleotide binding |
| 3 | Q16512 | GO:1901363 | heterocyclic compound binding |
| 4 | Q16512 | GO:0050681 | androgen receptor binding |
| 5 | Q16512 | GO:0140110 | transcription regulator |
| 6 | Q16512 | GO:0017076 | purine nucleotide binding |
| 7 | Q16512 | GO:0019901 | protein kinase binding |
| 8 | Q16512 | GO:0042826 | histone deacetylase binding |
| 9 | Q16512 | GO:0035257 | nuclear hormone receptor binding |
protein_functions.groupby("uniprot_id").nunique()
| uniprot_id | go_id | protein_function |
|---|---|---|
| P30085 | 30 | 30 |
| Q16512 | 55 | 55 |
Often it is useful to extract only a subset of protein functions. This can be achieved using the class SelectedFunctionAnnotation.
from go_protein_annotation import SelectedFunctionAnnotation
selected_functions = {
    "GO:0016301",  # Kinase activity
    "GO:0140110",  # Transcription regulator activity
    "GO:0008233",  # Peptidase activity
}
sel_function_extraction = SelectedFunctionAnnotation(selected_functions)
out = sel_function_extraction.get_protein_functions("Q16512")
out
| | uniprot_id | go_id | protein_function |
|---|---|---|---|
| 0 | Q16512 | GO:0140110 | transcription regulator |
| 1 | Q16512 | GO:0016301 | kinase |
Users can also define their own groups. For each group, three arguments need to be specified:
- Required functions: a set of functions that a protein must have to be assigned to this group.
- Permitted functions: a set of functions that must not overlap with the protein's functions, i.e. a protein with any of these functions is not assigned to this group.
- A name for the group.
Groups are passed to the class SpecialFunctionAnnotation, which is also used to define the class DefaultAnnotation. The individual definitions can be found in the file
go_protein_annotation/default_use.py.
A simple example to separate protein kinases from other kinases and non-kinases:
from go_protein_annotation import SpecialFunctionAnnotation
# Must have 'GO:0004672' (protein kinase activity)
# No permitted functions
# Name: "Protein kinase"
protein_kinases = ({"GO:0004672"}, set(), "Protein kinase")
# Must have 'GO:0016301' (kinase activity)
# Must not have 'GO:0004672' (protein kinase activity)
# Name: "Other kinase"
other_kinases = ({"GO:0016301"}, {"GO:0004672"}, "Other kinase")
# No required functions (all proteins would match this)
# Must not have 'GO:0016301' (kinase activity)
# Name: "Non-kinase"
non_kinases = (set(), {"GO:0016301"}, "Non-kinase")
example_classification = SpecialFunctionAnnotation([protein_kinases, other_kinases, non_kinases])
test_protein_annotations = example_classification.annotate_proteins(test_proteins)
test_protein_annotations
| | uniprot_id | protein_function |
|---|---|---|
| 0 | Q16512 | Protein kinase |
| 1 | P30085 | Other kinase |
| 2 | P25774 | Non-kinase |
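The group rule used in this example (required functions present, none of the functions in the second set present) can be sketched in plain Python. This is an assumption about the matching logic based on the description, not the package's actual code. The helper `matching_groups` and the toy function sets are hypothetical, and the sets are hand-written rather than retrieved from QuickGO.

```python
def matching_groups(protein_functions, groups):
    """Return the names of all groups whose rule the given function set satisfies."""
    names = []
    for required, excluded, name in groups:
        has_all_required = required <= protein_functions  # subset test
        has_any_excluded = bool(excluded & protein_functions)  # overlap test
        if has_all_required and not has_any_excluded:
            names.append(name)
    return names


groups = [
    ({"GO:0004672"}, set(), "Protein kinase"),
    ({"GO:0016301"}, {"GO:0004672"}, "Other kinase"),
    (set(), {"GO:0016301"}, "Non-kinase"),
]

# Hypothetical function sets; a protein kinase also carries the implied kinase term.
toy_protein_kinase = {"GO:0004672", "GO:0016301"}
toy_other_kinase = {"GO:0016301"}
toy_non_kinase = {"GO:0008233"}

protein_kinase_groups = matching_groups(toy_protein_kinase, groups)
other_kinase_groups = matching_groups(toy_other_kinase, groups)
non_kinase_groups = matching_groups(toy_non_kinase, groups)
```

Because the three rules exclude one another, each toy set lands in exactly one group. Rules that do not exclude each other can assign a protein to several groups, as seen for Q16512 in the default annotation.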
See go_protein_annotation/default_use.py. A detailed explanation will follow.
- Only QuickGO protein functions are used. QuickGO also provides information about involvement in biological processes. These annotations are not considered.
- The classes AllFunctionAnnotation and SelectedFunctionAnnotation accept the keyword alternative_name_dict
  - Keys: GO ID
  - Value: Alternative name
- The classes AllFunctionAnnotation and SelectedFunctionAnnotation accept the keyword simplify_name
  - True (default): " activity" is removed from each protein function name (e.g. "kinase activity" -> "kinase")
  - False: protein functions are named as given by QuickGO
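How the two naming keywords might interact can be sketched as follows. The function `display_name` is hypothetical and not part of the package. It assumes that an alternative name, when given, takes precedence over simplification, and that only a trailing " activity" is stripped; the package may behave differently on both points.

```python
def display_name(go_id, go_name, alternative_name_dict=None, simplify_name=True):
    """Pick the name shown for a GO term (illustrative logic, assumptions as stated)."""
    # Assumption: a user-supplied alternative name wins over everything else.
    if alternative_name_dict and go_id in alternative_name_dict:
        return alternative_name_dict[go_id]
    suffix = " activity"
    # Assumption: only a trailing " activity" is removed.
    if simplify_name and go_name.endswith(suffix):
        return go_name[: -len(suffix)]
    return go_name


simplified = display_name("GO:0016301", "kinase activity")
unchanged = display_name("GO:0016301", "kinase activity", simplify_name=False)
renamed = display_name(
    "GO:0016301", "kinase activity", alternative_name_dict={"GO:0016301": "Kinase"}
)
```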