c-feldmann/QuickGOProteinAnnotation

QuickGOProteinAnnotation

The QuickGO database provides function annotations for proteins, identified by UniProt ID. Arranging proteins by function rather than by family extends protein associations beyond evolutionary relationships. However, a protein may have multiple functions (e.g. receptor tyrosine kinases) and therefore cannot always be assigned to a single function.

The provided code can be used to extract all or selected annotations from QuickGO.

Installation in Conda

If not already installed, install pip and git:

conda install git
conda install pip

Then install via pip:

pip install git+https://github.com/c-feldmann/QuickGOProteinAnnotation

Quickstart

From Terminal

The script annotate_protein_list.py takes an input file (here: demo_data/demo_uniprot_ids.tsv) in which the proteins are listed in the column "uniprot_id". Results are saved as a tab-separated file to demo_data/demo_output.tsv.

python annotate_protein_list.py -i demo_data/demo_uniprot_ids.tsv -o demo_data/demo_output.tsv -c "uniprot_id" -s tab

Argument  Explanation
-i        input file
-o        output file
-c        column name
-s        separator

The default value for -s is "tab", and the default output file is go_function_annotation.tsv.
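To annotate your own proteins from the terminal, the input file needs a header column whose name matches the value given to -c. The following is a minimal sketch, not part of the package, that writes such a file with Python's standard library. The helper name write_uniprot_ids and the file name my_uniprot_ids.tsv are made up for illustration, and the sketch assumes that a single-column file is sufficient as input.

```python
import csv
import tempfile
from pathlib import Path


def write_uniprot_ids(path, uniprot_ids, column="uniprot_id"):
    """Write one UniProt ID per row under a single header column (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow([column])
        for uniprot_id in uniprot_ids:
            writer.writerow([uniprot_id])


# Written to a temporary directory so the sketch leaves no files behind in the repository.
input_path = Path(tempfile.mkdtemp()) / "my_uniprot_ids.tsv"
write_uniprot_ids(input_path, ["Q16512", "P30085", "P25774"])

# Read the file back to check its layout: header first, then one ID per line.
written_lines = input_path.read_text().splitlines()
print(written_lines)
```

A file written this way could then be passed to the script via -i, together with -c "uniprot_id".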

In Python

A short example of how this package can be used in Python code:

from go_protein_annotation import DefaultAnnotation
test_proteins = ["Q16512", "P30085", "P25774"]
default_annotation = DefaultAnnotation()
protein_class_df = default_annotation.annotate_proteins(test_proteins)
protein_class_df

   uniprot_id  protein_function
0  Q16512      Transcription regulator
1  Q16512      Kinase
2  P30085      Kinase
3  P25774      Peptidase
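Because a protein can appear in several rows, it can be convenient to collect the functions per protein. The following sketch does this in plain Python on the rows shown above, copied in by hand rather than fetched from QuickGO. The variable names are illustrative and not part of the package.

```python
from collections import defaultdict

# Rows copied from the example output above: (uniprot_id, protein_function).
rows = [
    ("Q16512", "Transcription regulator"),
    ("Q16512", "Kinase"),
    ("P30085", "Kinase"),
    ("P25774", "Peptidase"),
]

# Collect every function reported for each protein, keeping the row order.
functions_by_protein = defaultdict(list)
for uniprot_id, protein_function in rows:
    functions_by_protein[uniprot_id].append(protein_function)

# Proteins with more than one function are the ones that are not uniquely assigned.
multi_function = sorted(
    uniprot_id
    for uniprot_id, functions in functions_by_protein.items()
    if len(functions) > 1
)
print(dict(functions_by_protein))
print(multi_function)
```

With a pandas DataFrame, the same grouping could be done with groupby on the "uniprot_id" column.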

Details

QuickGO functions are ordered hierarchically. For example, an explicit annotation of peptidase activity also implies hydrolase activity. The provided code extracts all explicit functional annotations and extends them with the implied ones.
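The idea of extending explicit annotations with implied ones can be sketched as a walk up the hierarchy. The example below uses a hand-written and heavily simplified parent map (TOY_PARENTS) instead of the real ontology, and the function with_implicit_annotations is a hypothetical helper, not the package's implementation.

```python
# Simplified toy hierarchy (child -> parents); the real ontology has more terms and links.
TOY_PARENTS = {
    "GO:0008233": {"GO:0016787"},  # peptidase activity -> hydrolase activity
    "GO:0016787": {"GO:0003824"},  # hydrolase activity -> catalytic activity
    "GO:0003824": set(),           # catalytic activity (top of this toy hierarchy)
}


def with_implicit_annotations(explicit_terms, parents):
    """Return the explicit terms plus every ancestor reachable through `parents`."""
    seen = set()
    stack = list(explicit_terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # already handled, also guards against cycles
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


peptidase_terms = with_implicit_annotations({"GO:0008233"}, TOY_PARENTS)
print(sorted(peptidase_terms))
```

In this toy example an explicit peptidase annotation yields hydrolase activity and catalytic activity as implied annotations.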

All Protein Functions

To obtain all annotations for a protein, use the class AllFunctionAnnotation.

from go_protein_annotation import AllFunctionAnnotation
all_functions = AllFunctionAnnotation()

# For a single protein
all_functions_q16512 = all_functions.get_protein_functions("Q16512")

# For a list of proteins
protein_functions = all_functions.annotate_proteins(["Q16512", "P30085"])
all_functions_q16512.head(10)

   uniprot_id  go_id       protein_function
0  Q16512      GO:0005515  protein binding
1  Q16512      GO:0035639  purine ribonucleoside triphosphate binding
2  Q16512      GO:0000166  nucleotide binding
3  Q16512      GO:1901363  heterocyclic compound binding
4  Q16512      GO:0050681  androgen receptor binding
5  Q16512      GO:0140110  transcription regulator
6  Q16512      GO:0017076  purine nucleotide binding
7  Q16512      GO:0019901  protein kinase binding
8  Q16512      GO:0042826  histone deacetylase binding
9  Q16512      GO:0035257  nuclear hormone receptor binding
protein_functions.groupby("uniprot_id").nunique()

            go_id  protein_function
uniprot_id
P30085      30     30
Q16512      55     55

A Subset of Protein Functions

Often it can be useful to extract only a subset of protein functions. This can be achieved using the class SelectedFunctionAnnotation.

from go_protein_annotation import SelectedFunctionAnnotation
selected_functions = {"GO:0016301",  # Kinase activity
                      "GO:0140110",  # Transcription regulator activity
                      "GO:0008233",  # Peptidase activity
                      }
sel_function_extraction = SelectedFunctionAnnotation(selected_functions)
out = sel_function_extraction.get_protein_functions("Q16512")
out

   uniprot_id  go_id       protein_function
0  Q16512      GO:0140110  transcription regulator
1  Q16512      GO:0016301  kinase

User-defined Protein Annotations

Users can also define groups of their own. For each group, three arguments need to be specified:

  • Required functions: A set of functions which a protein must have to be assigned to this group.
  • Permitted functions: A set of functions that must not overlap with the protein's functions, i.e. a protein with any of these functions is excluded from the group.
  • A name

Such groups are passed to the class SpecialFunctionAnnotation, which is also used to define the class DefaultAnnotation. The individual definitions can be found in the file go_protein_annotation/default_use.py. A simple example that separates protein kinases from other kinases and non-kinases:

from go_protein_annotation import SpecialFunctionAnnotation
# Must have 'GO:0004672' (protein kinase activity)
# No permitted functions
# Name: "Protein kinase"
protein_kinases = ({"GO:0004672"}, set(), "Protein kinase")

# Must have 'GO:0016301' (kinase activity)
# Must not have 'GO:0004672' (protein kinase activity)
# Name: "Other kinase"
other_kinases = ({"GO:0016301"}, {"GO:0004672"}, "Other kinase")

# No required functions (all proteins would match this)
# Must not have 'GO:0016301' (kinase activity)
# Name: "Non-kinase"
non_kinases = (set(), {"GO:0016301"}, "Non-kinase")

example_classification = SpecialFunctionAnnotation([protein_kinases, other_kinases, non_kinases])

test_protein_annotations = example_classification.annotate_proteins(test_proteins)
test_protein_annotations

   uniprot_id  protein_function
0  Q16512      Protein kinase
1  P30085      Other kinase
2  P25774      Non-kinase
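The rule behind these groups (all required functions present, none of the functions in the second set present) can be sketched with plain set operations. This is an illustration of the rule as described above, not the package's implementation. The function matching_groups and the shortened function sets in toy_proteins are hypothetical, and whether the package reports all matching groups or only one is not stated here.

```python
def matching_groups(protein_functions, group_definitions):
    """Return the names of all groups whose rule the protein satisfies (hypothetical helper)."""
    names = []
    for required, excluded, name in group_definitions:
        has_all_required = required <= protein_functions  # subset test
        has_no_excluded = protein_functions.isdisjoint(excluded)
        if has_all_required and has_no_excluded:
            names.append(name)
    return names


# Same three definitions as in the example above: (required, must-not-have, name).
groups = [
    ({"GO:0004672"}, set(), "Protein kinase"),
    ({"GO:0016301"}, {"GO:0004672"}, "Other kinase"),
    (set(), {"GO:0016301"}, "Non-kinase"),
]

# Hypothetical, heavily shortened function sets; real proteins carry many more GO terms.
toy_proteins = {
    "protein_a": {"GO:0016301", "GO:0004672"},  # kinase and protein kinase activity
    "protein_b": {"GO:0016301"},                # kinase activity only
    "protein_c": {"GO:0008233"},                # peptidase activity only
}

assigned = {
    protein: matching_groups(functions, groups)
    for protein, functions in toy_proteins.items()
}
print(assigned)
```

Because the three definitions exclude each other, every toy protein ends up in exactly one group. Overlapping definitions would place a protein in several groups in this sketch.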

Default Function Definition

See go_protein_annotation/default_use.py. A detailed explanation will follow.

Miscellaneous

  • Only QuickGO protein functions are used. QuickGO also provides information about involvement in biological processes; these annotations are not considered.
  • The classes AllFunctionAnnotation and SelectedFunctionAnnotation accept the keyword alternative_name_dict
    • Keys: GO ID
    • Value: Alternative name
  • The classes AllFunctionAnnotation and SelectedFunctionAnnotation accept the keyword simplify_name
    • True (default): " activity" is removed from each protein function name (e.g. "kinase activity" -> "kinase")
    • False: protein functions are named as given by QuickGO
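The effect of the two keywords on a single function name can be sketched as follows. The helper display_name is hypothetical and only mirrors the behaviour described in the list above. That an alternative name takes precedence over simplification, and that only a trailing " activity" is removed, are assumptions.

```python
def display_name(go_id, go_name, alternative_name_dict=None, simplify_name=True):
    """Pick the name shown for a GO term (hypothetical helper, see assumptions above)."""
    # Assumption: a user-supplied alternative name wins over any simplification.
    if alternative_name_dict and go_id in alternative_name_dict:
        return alternative_name_dict[go_id]
    suffix = " activity"
    # Assumption: only a trailing " activity" is stripped.
    if simplify_name and go_name.endswith(suffix):
        return go_name[: -len(suffix)]
    return go_name


simplified = display_name("GO:0016301", "kinase activity")
unchanged = display_name("GO:0016301", "kinase activity", simplify_name=False)
renamed = display_name(
    "GO:0016301", "kinase activity", alternative_name_dict={"GO:0016301": "Kinase"}
)
print(simplified, unchanged, renamed, sep=" | ")
```

In the package itself, both keywords are passed when creating AllFunctionAnnotation or SelectedFunctionAnnotation.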
