The North Pacific Eukaryotic Gene Catalog (NPEGC) is a compilation of metatranscriptome sequence data and annotations derived from 261 samples collected from four oceanographic research cruises in the North Pacific Ocean.
- 261 metatranscriptomes from five studies across four cruises
- 182 million transcript contigs (clustered at 99% protein identity)
- Taxonomic and functional annotations
- Read abundance data
Sample sites for metatranscriptomes in the North Pacific Eukaryotic Gene Catalog
- Diel1: 48 samples from SCOPE HOE-Legacy 2 (July 2015) (Diel 1 project page)
- Gradients1: 47 samples from KOK1606 (April-May 2016) (Gradients 1 project page)
- Gradients2: 59 samples from MGL1704 (May-June 2017) (Gradients 2 project page)
- Gradients3: 63 samples from KM1906 (April 2019) (Gradients 3 project page)
- G3 diel: 44 samples from KM1906 (April 2019) (G3 diel study project page)
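As a quick consistency check, the per-study sample counts listed above can be tallied. This is a minimal Python sketch; the dictionary name and layout are illustrative only, and the values are copied from the list above.

```python
# Per-study sample counts as listed above: study -> (cruise, number of samples).
STUDIES = {
    "Diel1": ("SCOPE HOE-Legacy 2", 48),
    "Gradients1": ("KOK1606", 47),
    "Gradients2": ("MGL1704", 59),
    "Gradients3": ("KM1906", 63),
    "G3 diel": ("KM1906", 44),
}

total_samples = sum(n for _, n in STUDIES.values())
# Gradients3 and G3 diel share cruise KM1906, so the set has four members.
cruises = {cruise for cruise, _ in STUDIES.values()}

print(f"{len(STUDIES)} studies, {len(cruises)} cruises, {total_samples} samples")
# prints "5 studies, 4 cruises, 261 samples"
```

The tally shows why the catalog has five studies but only four cruises: Gradients3 and the G3 diel study were both sampled on KM1906.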
These scripts are used across all studies in the North Pacific Eukaryotic Gene Catalog:
- Illumina_QC_AWS.sh: Performs quality control and trimming of raw Illumina sequencing data using Trimmomatic.
- NPEGC.6tr_frame_selection_clustering.sh: Translates nucleotide sequences in six frames, selects the longest coding frame(s), and clusters protein sequences at 99% identity.
- NPEGC.diamond_taxonomy.log.sh: Assigns taxonomic identifiers to protein sequences using DIAMOND alignment against the MarFERReT + MARMICRODB database.
- NPEGC.hmmer_function.sh: Annotates protein sequences with protein families using HMMER against the Pfam database.
- NPEGC.nt_kallisto_counts.sh: Quantifies transcript abundances by pseudoaligning short reads to assembled transcripts using kallisto.
- aggregate_kallisto_counts.R: Consolidates kallisto output files, joining sequence length and estimated count values for each project's metatranscriptome.
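The aggregation step can be pictured as a join on transcript ID across per-sample kallisto outputs. The Python sketch below is not the repository's `aggregate_kallisto_counts.R`; it only illustrates the idea, assuming kallisto's standard `abundance.tsv` layout (columns `target_id`, `length`, `eff_length`, `est_counts`, `tpm`). The function name, sample labels, and toy values are hypothetical.

```python
import csv
import io


def aggregate_counts(tables):
    """Join per-sample kallisto tables on target_id.

    `tables` maps a sample label to the text of that sample's abundance.tsv.
    Returns {target_id: {"length": int, <sample label>: est_counts, ...}}.
    """
    merged = {}
    for sample, text in tables.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            # Sequence length is a property of the contig, so store it once.
            entry = merged.setdefault(row["target_id"], {"length": int(row["length"])})
            entry[sample] = float(row["est_counts"])
    return merged


# Toy inputs (made-up values) standing in for two samples' abundance.tsv files.
HEADER = "target_id\tlength\teff_length\test_counts\ttpm\n"
toy = {
    "sampleA": HEADER + "contig_1\t900\t750.0\t12.0\t5.1\n" + "contig_2\t300\t150.0\t0.0\t0.0\n",
    "sampleB": HEADER + "contig_1\t900\t748.5\t3.5\t1.4\n" + "contig_2\t300\t149.2\t8.0\t9.9\n",
}

counts = aggregate_counts(toy)
for contig, entry in sorted(counts.items()):
    print(contig, entry)
```

In practice each table would be read from a sample's kallisto output directory rather than from an inline string; the inline strings keep the sketch self-contained.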
Each study (G1PA, G2PA, G3PA, G3PA_diel, D1PA) has two specific scripts:
- {STUDY_ID}.process_short_reads.sh: Performs quality control and preprocessing of raw sequencing data for the specific study.
- {STUDY_ID}.trinity_assemblies.sh: Uses Trinity to perform de novo assembly of metatranscriptomes for the specific study.
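The `{STUDY_ID}` placeholder expands to one script pair per study. The small Python sketch below lists the resulting filenames, assuming the placeholder is substituted literally with the study IDs named above; the variable names are illustrative.

```python
STUDY_IDS = ["G1PA", "G2PA", "G3PA", "G3PA_diel", "D1PA"]
TEMPLATES = [
    "{STUDY_ID}.process_short_reads.sh",
    "{STUDY_ID}.trinity_assemblies.sh",
]

# One filename per (study, template) pair, grouped by study.
script_names = [t.format(STUDY_ID=s) for s in STUDY_IDS for t in TEMPLATES]
for name in script_names:
    print(name)
```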
Links to study-specific scripts:
- Gradients 1 (G1PA):
- Gradients 2 (G2PA):
- Gradients 3 (G3PA):
- G3 Diel (G3PA_diel):
- Diel1 (D1PA):
Additional metadata and associated datasets are available on the Simons CMAP ocean data portal.
- SCOPE Diel1 associated data: https://simonscmap.com/catalog/cruises/KM1513
- Gradients 1 associated data: https://simonscmap.com/catalog/cruises/KOK1606
- Gradients 2 associated data: https://simonscmap.com/catalog/cruises/MGL1704
- Gradients 3 associated data: https://simonscmap.com/catalog/cruises/KM1906
Additional metadata for the Gradients cruises can be found here: http://scope.soest.hawaii.edu/data/gradients/gradients.html
If you use this data in your research, please cite:
Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J. & Armbrust, E. V. The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Sci. Data 11, 1161 (2024). doi:10.1038/s41597-024-04005-5