North Pacific Eukaryotic Gene Catalog (NPEGC)

Overview

The North Pacific Eukaryotic Gene Catalog (NPEGC) is a compilation of metatranscriptome sequence data and annotations derived from 261 samples collected from four oceanographic research cruises in the North Pacific Ocean.

Key Features

  • 261 metatranscriptomes from five studies conducted on four cruises
  • 182 million transcript contigs (clustered at 99% protein identity)
  • Taxonomic and functional annotations
  • Read abundance data

Figure (NPEGC cruise tracks): sample sites for metatranscriptomes in the North Pacific Eukaryotic Gene Catalog.

Data Sources

  1. Diel1: 48 samples from SCOPE HOE-Legacy 2 (July 2015); see the Diel 1 project page
  2. Gradients1: 47 samples from KOK1606 (April-May 2016); see the Gradients 1 project page
  3. Gradients2: 59 samples from MGL1704 (May-June 2017); see the Gradients 2 project page
  4. Gradients3: 63 samples from KM1906 (April 2019); see the Gradients 3 project page
  5. G3 diel: 44 samples from KM1906 (April 2019); see the G3 diel study project page
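
The sample counts listed above can be cross-checked against the catalog totals. The following is a minimal sketch, not part of the repository; the dictionary name `STUDIES` and its layout are illustrative assumptions, while the cruise names and counts are copied from the list above.

```python
# Study name -> (cruise, number of samples), copied from the Data Sources list.
# The variable names here are illustrative only.
STUDIES = {
    "Diel1": ("SCOPE HOE-Legacy 2", 48),
    "Gradients1": ("KOK1606", 47),
    "Gradients2": ("MGL1704", 59),
    "Gradients3": ("KM1906", 63),
    "G3 diel": ("KM1906", 44),
}

# Total number of metatranscriptome samples across all studies.
total_samples = sum(count for _, count in STUDIES.values())

# Distinct cruises; Gradients3 and G3 diel share cruise KM1906.
cruises = {cruise for cruise, _ in STUDIES.values()}

print(f"{len(STUDIES)} studies, {len(cruises)} cruises, {total_samples} samples")
```

Because Gradients3 and the G3 diel study were both sampled on KM1906, the five studies correspond to four cruises, and the per-study counts sum to the 261 samples stated in the Overview.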

Data Products

1. Raw Metatranscriptome Assemblies

2. Processed Protein Contigs and Annotations

3. Processed Nucleotide Metatranscripts and Read Counts

Script Index

Universal Scripts

These scripts are used across all studies in the North Pacific Eukaryotic Gene Catalog:

  1. Illumina_QC_AWS.sh: Performs quality control and trimming of raw Illumina sequencing data using Trimmomatic.

  2. NPEGC.6tr_frame_selection_clustering.sh: Translates nucleotide sequences, selects the longest coding frame(s), and clusters protein sequences at 99% identity.

  3. NPEGC.diamond_taxonomy.log.sh: Assigns taxonomic identifiers to protein sequences using DIAMOND alignment against the MarFERReT + MARMICRODB database.

  4. NPEGC.hmmer_function.sh: Annotates protein sequences with protein families using HMMER against the Pfam database.

  5. NPEGC.nt_kallisto_counts.sh: Quantifies transcript abundances by pseudoaligning short reads to assembled transcripts using kallisto.

  6. aggregate_kallisto_counts.R: Consolidates kallisto output files, joining sequence length and estimated count values for each project's metatranscriptome.
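
The translation and frame-selection step (script 2 above) can be illustrated with a small pure-Python sketch. This is not the repository's implementation, which is a shell script. It assumes the standard genetic code and assumes that "longest coding frame" means the frame containing the longest stretch without a stop codon. The function names are hypothetical, and the 99% identity clustering step is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Translate all six reading frames, labelled +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def longest_frames(seq):
    """Return (frame, peptide) pairs tied for the longest stop-free stretch."""
    best = {
        label: max(protein.split("*"), key=len)
        for label, protein in six_frames(seq).items()
    }
    top = max(len(peptide) for peptide in best.values())
    return [(label, pep) for label, pep in best.items() if len(pep) == top]


print(longest_frames("ATGGCCAAATAG"))
```

For the toy sequence `ATGGCCAAATAG`, the forward frame +1 translates to `MAK*`, but the longest stop-free stretch is on the reverse strand, which is why all six frames are considered. Ties are returned together, matching the "frame(s)" wording in the script description.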

Study-Specific Scripts

Each study (G1PA, G2PA, G3PA, G3PA_diel, D1PA) has two specific scripts:

  1. {STUDY_ID}.process_short_reads.sh: Performs quality control and preprocessing of raw sequencing data for the specific study.

  2. {STUDY_ID}.trinity_assemblies.sh: Uses Trinity to perform de novo assembly of metatranscriptomes for the specific study.
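
The `{STUDY_ID}` placeholder expands to one script pair per study. The sketch below enumerates the expected file names, assuming the pattern is applied literally to the five study IDs listed above; the helper name `study_scripts` is hypothetical and no directory layout is implied.

```python
# Study IDs as listed in this README.
STUDY_IDS = ["G1PA", "G2PA", "G3PA", "G3PA_diel", "D1PA"]

# File name patterns for the two study-specific scripts.
SCRIPT_TEMPLATES = [
    "{study_id}.process_short_reads.sh",
    "{study_id}.trinity_assemblies.sh",
]


def study_scripts(study_id):
    """Return the two expected script names for one study ID."""
    return [template.format(study_id=study_id) for template in SCRIPT_TEMPLATES]


# All expected study-specific script names, in study order.
all_scripts = [name for study_id in STUDY_IDS for name in study_scripts(study_id)]

for name in all_scripts:
    print(name)
```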

Links to study-specific scripts:

Associated Data

Additional metadata and associated datasets are available on the Simons CMAP ocean data portal.

Additional metadata for the Gradients cruises can be found here: http://scope.soest.hawaii.edu/data/gradients/gradients.html

Citation

If you use this data in your research, please cite:

Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J. & Armbrust, E. V. The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Sci. Data 11, 1161 (2024). doi:10.1038/s41597-024-04005-5