test

Images

An inline image; the title is optional.

A reference style image.

Singletons

| Dataset | Singletons |
|---------|------------|
| Total   | 5790292    |
| TARA    | 2713509    |
| OSD     | 786615     |
| GOS     | 2290168    |

- [TARA-OSD-GOS protein sequences classification](#tara-osd-gos-protein-sequences-classification)
- [Hierarchical clustering](#hierarchical-clustering)
- [Filtering singletons & spurious ORFs](#filtering-singletons--spurious-orfs)
- [BLAST/LAMBDA search vs UniProtKB](#blastlambda-search-vs-uniprotkb)
- [BLAST/LAMBDA search/alignment vs NCBI nr database](#blastlambda-searchalignment-vs-ncbi-nr-database)

# TARA-OSD-GOS protein sequences classification
We functionally analysed the protein sequences by their similarity to known protein families using the ultrafast protein classification tool [UProC (version 1.2.0)](http://uproc.gobics.de/), which is part of the CoMet web server and is also available as an open-source C library.

We used Pfam (release 28.0) as the reference database.

Why we chose to run UProC:

UProC is up to three orders of magnitude faster than profile-based methods and achieves up to 80% higher sensitivity on unassembled short reads (100 bp) from simulated metagenomes. In addition, UProC does not depend on a multiple alignment of family-specific sequences.

Script

### Results
The retrieved hits represent the *KNOWN* fraction (families), while the no-hits (i.e. the *UNKNOWN* families) will be further processed and divided into **"Genomic Unknowns"** (ORFs with unknown function, but similar to hypothetical genes in sequenced genomes) and **"Environmental Unknowns"** (ORFs with unknown function and no appreciable similarity to genes in sequenced genomes).
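The KNOWN/UNKNOWN split described here can be sketched in a few lines of Python. This is only an illustration, assuming the ORF identifiers with a Pfam hit have already been extracted from the UProC output; the function name `split_known_unknown` and the identifiers are hypothetical and not part of this repository.

```python
def split_known_unknown(all_orf_ids, hit_orf_ids):
    """Partition ORF identifiers into KNOWN (at least one hit) and UNKNOWN (no hit)."""
    hits = set(hit_orf_ids)  # duplicates collapse: an ORF may have several hits
    known = [orf for orf in all_orf_ids if orf in hits]
    unknown = [orf for orf in all_orf_ids if orf not in hits]
    return known, unknown


# Hypothetical identifiers, for illustration only.
all_orfs = ["orf_1", "orf_2", "orf_3", "orf_4"]
uproc_hits = ["orf_2", "orf_4", "orf_2"]

known, unknown = split_known_unknown(all_orfs, uproc_hits)
print("KNOWN:", known)
print("UNKNOWN:", unknown)
```

The UNKNOWN list is the input for the later division into Genomic and Environmental Unknowns.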

[Results (ToDo!)](link to owcloud-mpi-bremen.de)

# Hierarchical clustering
We performed a hierarchical clustering with the [CD-HIT programs](http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide) to reduce sequence redundancy and improve the performance of the downstream sequence analyses.

First, we used cd-hit to cluster the input down to 60% identity.

Then, since the lowest identity threshold of CD-HIT is around 40%, we used PSI-CD-HIT, which clusters proteins at very low thresholds (in our case 30%).

Finally, we used the script clstr_rev.pl, which combines a .clstr file with its parent .clstr file.
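The three steps above could be chained roughly as follows. This is a sketch, not the script used here: it only builds the command lines, following the usage pattern in the CD-HIT user guide, and does not run them. The file names (`proteins.fasta`, the `nr` prefix) and the word size `-n 4` for the 60% step are assumptions.

```python
import shlex


def clustering_commands(fasta, prefix="nr"):
    """Return the three pipeline steps as shell command strings (nothing is executed)."""
    q = shlex.quote
    c60 = f"{prefix}60"  # cd-hit output at 60% identity
    c30 = f"{prefix}30"  # psi-cd-hit output at 30% identity
    merged = f"{prefix}60-30.clstr"  # combined cluster file
    return [
        # Step 1: cluster the input down to 60% identity (word size 4 is assumed).
        f"cd-hit -i {q(fasta)} -o {q(c60)} -c 0.6 -n 4",
        # Step 2: cluster the 60% representatives down to 30% identity.
        f"psi-cd-hit.pl -i {q(c60)} -o {q(c30)} -c 0.3",
        # Step 3: combine the 30% .clstr file with its parent 60% .clstr file.
        f"clstr_rev.pl {q(c60 + '.clstr')} {q(c30 + '.clstr')} > {q(merged)}",
    ]


for cmd in clustering_commands("proteins.fasta"):
    print(cmd)
```

Printing the commands first makes it easy to check thresholds and file names before launching a long clustering run.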

To make the information in the cluster files easier to read, we built summary tables using the script clstr2text.pl.
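To show what such a summary table contains, here is a minimal Python sketch that reads the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by one line per member, with the representative marked by `*`). It is not the script mentioned above; the function `parse_clstr` and the sequence names are hypothetical, and the line format is assumed to match standard CD-HIT output.

```python
import re
from collections import Counter

# One member line looks like: "0<TAB>120aa, >seq_a... *" or "... at 95.00%".
MEMBER_RE = re.compile(r"^\d+\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Yield (cluster_id, sequence_id, length_aa, is_representative) rows."""
    cluster_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            cluster_id = int(line.split()[1])
            continue
        match = MEMBER_RE.match(line)
        if match and cluster_id is not None:
            length, seq_id, tail = match.groups()
            yield cluster_id, seq_id, int(length), tail == "*"


# Hypothetical cluster file content, for illustration only.
example = """>Cluster 0
0\t120aa, >seq_a... *
1\t118aa, >seq_b... at 95.00%
>Cluster 1
0\t80aa, >seq_c... *
"""

rows = list(parse_clstr(example.splitlines()))
sizes = Counter(cluster_id for cluster_id, _, _, _ in rows)

for cluster_id, seq_id, length, is_rep in rows:
    print(cluster_id, seq_id, length, is_rep, sizes[cluster_id], sep="\t")
```

Each output row gives the cluster, the member, its length, whether it is the representative, and the cluster size.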

Scripts (ToDo!): is it possible to combine cd-hit, psi-cd-hit.pl and clstr_rev.pl into a single script?

### Results
Number of clusters at 30% --> representative sequences...
[Tables]()
