A version-controlled, open source library of marine microbial eukaryotic protein sequences for the taxonomic annotation of environmental metatranscriptomes
The Marine Functional EukaRyotic Reference Taxa (MarFERReT) is a version-controlled and open source reference sequence library of marine eukaryote proteins that allows for community-supported expansion over time. MarFERReT was constructed for the primary purpose of taxonomic annotation of environmental metatranscriptomes. The MarFERReT data products can be downloaded from the Zenodo repository as described in Part A, or built from the source data using a containerized pipeline following the steps in Part B. The case studies included in Part C illustrate how MarFERReT can be used on its own or in combination with other reference libraries for taxonomic and functional annotation.
The contents of this repo are organized into four parts:
- Part A: Using MarFERReT directly
- Part B: Building MarFERReT
- Part C: Using MarFERReT
- Part D: Future MarFERReT releases
Finalized MarFERReT data products include nearly 28 million intra-species clustered protein sequences, metadata with curated taxonomy identifiers, Pfam protein annotations, core transcribed gene catalogs for marine microbial eukaryote lineages, and other supporting data. The URLs are provided for the sources of the individual sequences, and the compiled, translated, and clustered sequences are available for download through this Zenodo repository.
If you download the data products directly, you can skip the build steps listed below. The Zenodo repository contains the intra-taxa clustered protein sequences for the subset of validated entries, a pre-constructed DIAMOND indexed database of the MarFERReT protein sequences, and Pfam 34.0 annotations. MarFERReT can be combined with other protein sequence reference libraries for expanded phylogenetic coverage. For an example of combining databases, see the section on combining MarFERReT with other reference sequence libraries. An example workflow for using these data to annotate environmental metatranscriptomes is described in the section on using MarFERReT to annotate environmental metatranscriptomes.
This section details how to build your own copy of MarFERReT starting from source reference sequences and the scripts stored in this repository. If you want to begin using MarFERReT right away, you can download the clustered protein sequences and taxonomic information from the Zenodo repository as described above and skip to Part C.
If you're still here, that means you're ready to get into the technical details of building your own copy of the MarFERReT data. This work is broken down into seven steps:
- Cloning the MarFERReT repository
- Collecting and organizing inputs
- Building software containers
- Running MarFERReT database construction pipeline
- Annotating MarFERReT database sequences
- Building Core Transcribed Gene catalog
- Combining MarFERReT with other reference sequence libraries
The first step is to copy the MarFERReT pipeline code onto the computer where you intend to build the database. This can be done by cloning this repository into a suitable directory on your machine.
Two sets of input files are required to build MarFERReT: 1) the source reference sequences and 2) a corresponding metadata file. The source reference sequences will need to be collected from their various public locations, and the metadata file will need to be edited to match the reference sequences.
Once you have cloned the MarFERReT repository onto your machine, make a new directory called `source_seqs` under the `data` directory. You will deposit all of the fasta files of the source reference sequences into this directory. Detailed instructions for finding and downloading the source reference sequences used to build MarFERReT v1 can be found in this document: download_source_sequences.md. Before running the MarFERReT pipeline, all of these fasta files should be unzipped.
A metadata file entitled `MarFERReT.{VERSION}.metadata.csv` contains important information on each of the source reference sequences used to build the MarFERReT database. Every source reference sequence in the `source_seqs` directory should have a corresponding line in the metadata file with at least the following fields properly filled in:
- `entry_id`: a unique MarFERReT identifier for the reference sequence
- `accepted`: [Y/N]; determines inclusion in final build
- `marferret_name`: a human-readable name for the reference sequence (no spaces or special characters)
- `tax_id`: an NCBI taxonomical identifier
- `source_filename`: this should exactly match the name of the fasta file in `source_seqs` (unzipped)
- `seq_type`: the sequence type of the source fasta -- 'nt' for nucleotide and 'aa' for amino acid
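The requirements above can be checked before starting a build. The following is a minimal sketch, not part of the MarFERReT pipeline: it assumes the metadata file is comma-delimited with a header row whose column names match the field names listed above, and the names `REQUIRED_FIELDS` and `check_metadata` are hypothetical.

```python
import csv
from pathlib import Path

# Field names taken from the list above; the helper itself is illustrative only.
REQUIRED_FIELDS = ["entry_id", "accepted", "marferret_name",
                   "tax_id", "source_filename", "seq_type"]


def check_metadata(metadata_csv, source_dir):
    """Return a list of problems found; an empty list means none were detected."""
    with open(metadata_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
        if missing:
            return ["missing columns: " + ", ".join(missing)]
        rows = list(reader)

    problems = []
    listed = set()
    for row in rows:
        values = {f: (row[f] or "").strip() for f in REQUIRED_FIELDS}
        entry = values["entry_id"] or "<blank>"
        # Every required field must be filled in.
        for field, value in values.items():
            if not value:
                problems.append(f"entry {entry}: empty field '{field}'")
        # Controlled vocabularies described in the field list.
        if values["accepted"] and values["accepted"] not in ("Y", "N"):
            problems.append(
                f"entry {entry}: accepted must be 'Y' or 'N', got '{values['accepted']}'")
        if values["seq_type"] and values["seq_type"] not in ("nt", "aa"):
            problems.append(
                f"entry {entry}: seq_type must be 'nt' or 'aa', got '{values['seq_type']}'")
        if values["source_filename"]:
            listed.add(values["source_filename"])

    # Every fasta file in source_seqs needs a corresponding metadata row.
    on_disk = {p.name for p in Path(source_dir).iterdir() if p.is_file()}
    for name in sorted(on_disk - listed):
        problems.append(f"{name}: present in source_seqs but has no metadata row")
    return problems
```

Calling `check_metadata("data/MarFERReT.v1.metadata.csv", "data/source_seqs")` from the repository root would return one message per problem; the paths shown are assumptions based on the directory layout described above.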
The MarFERReT database construction pipeline is entirely containerized, meaning that you do not need to worry about software dependencies. Additionally, MarFERReT supports both Singularity and Docker containerization, depending on user preference. The necessary containers can be built in two steps:
- Install either Singularity or Docker on your machine.
- Navigate to the `containers` directory and run either the `build_singularity_images.sh` or `build_docker_images.sh` script from the command line.
Once the input source reference sequences have been collected, metadata has been organized, and the software containers have been built, you are ready to run the MarFERReT database construction pipeline. Navigate to the `scripts` directory and run the `assemble_marferret.sh` script from the command line. You will be prompted to enter either `1` or `2` depending on whether you are using Singularity or Docker containerization.
The pipeline will take several hours to run, depending on your computer system specifications. When it is done you should find the following outputs in the `data` directory:
- `MarFERReT.{VERSION}.proteins.faa.gz` -- MarFERReT protein library
- `MarFERReT.{VERSION}.taxonomies.tab.gz` -- taxonomy mapping file required as input for building the DIAMOND database
- `MarFERReT.{VERSION}.proteins_info.tab.gz` -- mapping file connecting each MarFERReT protein to its originating reference sequence
- `/aa_seq` -- directory with translated & standardized amino acid sequences
- `/taxid_grouped` -- directory with amino acid sequences grouped by taxid
- `/clustered` -- directory with amino acid sequences clustered within taxid
The three .gz files listed above can also be downloaded directly from the Zenodo repository.
Information on the functions of the proteins included in MarFERReT can be added in by annotating the sequences with one of the many bioinformatic tools available for functional inference. In this repository we have included a script for annotating the database with Pfam (now included as a part of the InterPro consortium).
To annotate MarFERReT, you must first download a copy of the Pfam database of HMM profiles. Make a new directory named `pfam` under the `data` directory. Download into this directory the latest version of Pfam from the Pfam ftp site.
Once the Pfam HMM database has been downloaded, navigate to the `scripts` directory and run the `pfam_annotate.sh` script from the command line. In addition to the `data/pfam/Pfam-A.hmm` HMM database, this script requires the `data/MarFERReT.{VERSION}.proteins.faa.gz` file as an input.
The annotation script can take many days to run, as every protein is compared against every HMM profile. Once it has successfully completed, you should find the following outputs in the `data` directory:
- `MarFERReT.{VERSION}.best_pfam_annotations.csv.gz` -- a summary of the best Pfam annotation for each MarFERReT reference protein
- `pfam/MarFERReT.{VERSION}.pfam.domtblout.tab.gz` -- the complete set of Pfam annotations against each MarFERReT reference protein
The complete set of Pfam annotations can also be found on the Zenodo repository as `MarFERReT.v1.Pfam_annotations.tar.gz`.
After the MarFERReT protein sequences have been functionally annotated, sets of core transcribed genes (CTGs) can be derived from the RNA-derived data for specific marine lineages or for all eukaryotes. Selecting the Pfam IDs that are present in at least 95% of species of a given lineage defines a set of functions that can reasonably be expected in a relatively complete transcriptome. These CTG catalogs can be used downstream of environmental sequence annotation with MarFERReT to assess the coverage of environmental taxon bins, as demonstrated in Case Study 2. Documentation and code for generating and using the CTG catalogs from Pfam annotations for user-defined lineages are found here: Case_study_2.md
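The 95% prevalence rule can be illustrated with a short Python sketch. This is not the code from Case_study_2.md: the function name `core_transcribed_genes` and the input layout (a dictionary mapping each species in the lineage to the set of Pfam IDs detected in it) are assumptions made for illustration.

```python
from collections import Counter


def core_transcribed_genes(species_pfams, min_fraction=0.95):
    """Return the Pfam IDs found in at least `min_fraction` of the species.

    species_pfams: dict mapping a species identifier to an iterable of Pfam IDs
                   detected in that species (hypothetical input layout).
    """
    n_species = len(species_pfams)
    if n_species == 0:
        return set()

    # Count, for each Pfam ID, the number of species in which it occurs.
    prevalence = Counter()
    for pfam_ids in species_pfams.values():
        prevalence.update(set(pfam_ids))  # set() so each species counts once

    return {pfam_id for pfam_id, count in prevalence.items()
            if count / n_species >= min_fraction}
```

With 20 species in a lineage, for example, a Pfam ID must be detected in at least 19 of them to enter the catalog.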
MarFERReT can be combined with other domain-focused reference sequence libraries or new reference sequence transcriptomes and genomes to expand taxonomic coverage. In the Case Studies, we show an example combining MarFERReT with a filtered version of the prokaryote-focused MARMICRODB library. Both libraries use NCBI Taxonomy identifiers as their primary classification framework, facilitating compatible annotation approaches. After downloading or building the MarFERReT protein sequence database, bacterial sequences can be downloaded from the MARMICRODB Zenodo repository and the libraries concatenated together for use in downstream processes.
MarFERReT can also be combined with individual reference sequence transcriptomes and genomes that have just been released or are not incorporated in current reference libraries, or to add representation for specific research needs. This also requires that every sequence entry has an NCBI Taxonomy identifier. Instructions and code for combining MarFERReT with other large reference libraries like MARMICRODB or with sets of individual reference sequence entries can be found on the codebase repository here: combining_marferret_and_other_references.md
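As a rough sketch of the concatenation step, the Python function below merges several protein FASTA files (gzipped or plain) into one gzipped file and stops if the same sequence identifier appears twice. It is not the procedure from combining_marferret_and_other_references.md; the function name `concatenate_fastas` is hypothetical, and the taxonomy mapping files of the combined libraries would need to be merged separately.

```python
import gzip


def concatenate_fastas(input_paths, output_path):
    """Write all records from `input_paths` to a gzipped FASTA at `output_path`.

    Returns the number of sequences written. Raises ValueError if a sequence
    identifier (the header text up to the first whitespace) occurs twice.
    """
    seen_ids = set()  # held in memory; large libraries will need several GB
    n_sequences = 0
    with gzip.open(output_path, "wt") as out:
        for path in input_paths:
            opener = gzip.open if str(path).endswith(".gz") else open
            with opener(path, "rt") as handle:
                for line in handle:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        seq_id = parts[0] if parts else ""
                        if seq_id in seen_ids:
                            raise ValueError(f"duplicate sequence ID: {seq_id}")
                        seen_ids.add(seq_id)
                        n_sequences += 1
                    # Guard against files that lack a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_sequences
```

The duplicate check matters because downstream tools map each sequence identifier to a single taxonomy identifier, so identifiers must stay unique across the combined library.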
The primary intended use of MarFERReT is the taxonomic annotation of marine metatranscriptomic datasets. This can be done without building your own copy of the database. Instead, the current MarFERReT database files necessary for annotation can be downloaded from the Zenodo repository.
One means of performing this taxonomic annotation is to search MarFERReT for the closest matches to your data sequences. There are many bioinformatic tools available for this type of sequence alignment. One popular tool for high-performance sequence alignment of large datasets is DIAMOND. In the `scripts` directory of this repository, we have included a script named `build_diamond_db.sh` for using DIAMOND in combination with MarFERReT. This script requires the `MarFERReT.v1.proteins.faa.gz` and `MarFERReT.v1.taxonomies.tab.gz` files as inputs in the `data` directory.
The case studies presented here are practical examples of how MarFERReT can be used by itself or in conjunction with other protein sequence libraries to assign taxonomic identity to environmental sequences using DIAMOND fast protein alignment, and then to assess the completeness of annotated environmental transcript bins.
A visual diagram of the case study analyses can be found here:
Case Study 1 shows how MarFERReT can be used to annotate unknown environmental sequences using the DIAMOND fast protein-alignment tool (Buchfink et al., 2015). In summary, a DIAMOND-formatted database is created from sequence data and NCBI Taxonomy information, and used to annotate unknown environmental reads.
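To give a sense of what the annotation results look like downstream, the sketch below tallies taxonomic assignments from a DIAMOND results file. It assumes the search was run with DIAMOND's taxonomic classification output (`--outfmt 102`), which writes three tab-separated columns: query ID, assigned NCBI taxID (0 when unclassified), and E-value. The case study itself may use different settings, and the function name `count_taxid_assignments` is hypothetical.

```python
import gzip
from collections import Counter


def count_taxid_assignments(path):
    """Count queries per NCBI taxID in a DIAMOND --outfmt 102 results file.

    Assumed columns: query ID, taxID (0 = unclassified), E-value.
    Blank or malformed lines are skipped.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[1].isdigit():
                continue
            counts[int(fields[1])] += 1
    return counts
```

The resulting counter can be sorted with `counts.most_common()` to list the most frequently assigned taxa, with key `0` holding the number of unclassified queries.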
Case Study 2 documents how to identify core transcribed genes (CTGs) for user-defined lineages using MarFERReT protein sequences, and provides an example of how to estimate the completeness of environmental transcriptome bins with taxonomic annotation (from Case Study 1) and functional annotation with Pfam 34.0 (now included as a part of the InterPro consortium). In summary, the taxonomic and functional annotations are aggregated together, and the percentage of lineage-specific CTGs is determined for each species-level environmental taxon bin.
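The completeness estimate reduces to a simple ratio, sketched here in Python under stated assumptions: the Pfam IDs observed in a taxon bin and the CTG catalog for its lineage are both available as collections of Pfam IDs, and the function name `ctg_completeness` is hypothetical rather than taken from the case study code.

```python
def ctg_completeness(bin_pfam_ids, ctg_catalog):
    """Return the percentage of catalog CTGs detected in one taxon bin.

    bin_pfam_ids: Pfam IDs annotated among the sequences assigned to the bin.
    ctg_catalog:  Pfam IDs in the CTG catalog of the bin's lineage.
    """
    catalog = set(ctg_catalog)
    if not catalog:
        raise ValueError("CTG catalog is empty")
    detected = set(bin_pfam_ids) & catalog  # Pfam IDs outside the catalog are ignored
    return 100.0 * len(detected) / len(catalog)
```

A bin in which three of four catalog Pfam IDs are detected would score 75%, regardless of how many non-catalog Pfam IDs it also contains.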
While the case studies discussed above demonstrate the practicality of using MarFERReT for annotating environmental metatranscriptomes, MarFERReT can also be used to supplement other analyses that depend on reference proteins, including the analysis of marine metagenomes and metaproteomes.
MarFERReT was designed to be updated as new microbial eukaryote functional reference sequences are publicly released, with releases identified through literature reviews, updates to public repositories, or user nominations. Suggestions for new additions or changes to MarFERReT in future versions can be submitted via the ‘Issues’ request function on this repository. When submitting an organism request for future MarFERReT versions, the following information is required:
- Full scientific name of the organism (with strain name if possible)
- An NCBI Taxonomy taxID of the organism (as specific as possible, e.g. strain-level)
- A URL to the location of the assembled source data, with additional instructions if necessary
- Brief justification for why this organism should be included, e.g. "New SAGs from a clade of marine haptophytes".
- A citation or publication for the data, if available.
New entries will be processed through the workflow described for candidate entries and validated by hierarchical clustering of annotated protein content (see MarFERReT candidate entry validation notes). Future versions of MarFERReT will be documented in a changelog in the repository (link), describing any additions or modifications to the library composition. The changelog will detail updates to the MarFERReT code and MarFERReT files hosted on Zenodo, including revisions to the scripts, metadata files, functional annotation protocols, protein sequence library, DIAMOND databases, and CTG inventories.