North Pacific Eukaryotic Gene Catalog (NPEGC)

Overview

The North Pacific Eukaryotic Gene Catalog (NPEGC) is a compilation of metatranscriptome sequence data and annotations derived from 261 samples collected from four oceanographic research cruises in the North Pacific Ocean.

Key Features

261 metatranscriptomes from five studies across four research cruises
182 million transcript contigs (clustered at 99% protein identity)
Taxonomic and functional annotations
Read abundance data

Figure: Sample sites for metatranscriptomes in the North Pacific Eukaryotic Gene Catalog.

Data Sources

Diel1: 48 samples from SCOPE HOE-Legacy 2 (July 2015) (Diel 1 project page)
Gradients1: 47 samples from KOK1606 (April-May 2016) (Gradients 1 project page)
Gradients2: 59 samples from MGL1704 (May-June 2017) (Gradients 2 project page)
Gradients3: 63 samples from KM1906 (April 2019) (Gradients 3 project page)
G3 diel: 44 samples from KM1906 (April 2019) (G3 diel study project page)
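
As a quick consistency check, the per-study sample counts listed above can be tallied against the catalog total. The following is a minimal sketch; the dictionary layout and variable names are illustrative and not part of the repository.

```python
# Sample counts and cruise names copied from the Data Sources list above.
# The data structure itself is illustrative, not taken from the repository.
studies = {
    "Diel1": ("SCOPE HOE-Legacy 2", 48),
    "Gradients1": ("KOK1606", 47),
    "Gradients2": ("MGL1704", 59),
    "Gradients3": ("KM1906", 63),
    "G3 diel": ("KM1906", 44),
}

# Total number of metatranscriptome samples across all studies.
total_samples = sum(count for _, count in studies.values())

# Distinct cruises: Gradients3 and G3 diel share cruise KM1906.
cruises = {cruise for cruise, _ in studies.values()}

print(f"{len(studies)} studies, {len(cruises)} cruises, {total_samples} samples")
```

The tally shows how five studies map onto four cruises: Gradients3 and G3 diel were both sampled on KM1906.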

Data Products

Script Index

Universal Scripts

These scripts are used across all studies in the North Pacific Eukaryotic Gene Catalog:

Illumina_QC_AWS.sh: Performs quality control and trimming of raw Illumina sequencing data using Trimmomatic.
NPEGC.6tr_frame_selection_clustering.sh: Translates nucleotide sequences, selects the longest coding frame(s), and clusters protein sequences at 99% identity.
NPEGC.diamond_taxonomy.log.sh: Assigns taxonomic identifiers to protein sequences using DIAMOND alignment against the MarFERReT + MARMICRODB database.
NPEGC.hmmer_function.sh: Annotates protein sequences with protein families using HMMER against the Pfam database.
NPEGC.nt_kallisto_counts.sh: Quantifies transcript abundances by pseudoaligning short reads to assembled transcripts using kallisto.
aggregate_kallisto_counts.R: Consolidates kallisto output files, joining sequence length and estimated count values for each project's metatranscriptome.
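
To illustrate the consolidation step described for aggregate_kallisto_counts.R, here is a minimal Python sketch of joining sequence length and estimated counts across samples. It is not the repository's R script. It assumes kallisto's standard abundance.tsv columns (target_id, length, eff_length, est_counts, tpm), and the sample names, contig names, and function names are hypothetical.

```python
import csv
import io


def read_abundance(handle):
    """Parse one kallisto abundance.tsv into {target_id: (length, est_counts)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["target_id"]: (int(row["length"]), float(row["est_counts"]))
        for row in reader
    }


def aggregate(samples):
    """Join per-sample tables into one table keyed by contig.

    samples maps a sample name to an open abundance.tsv handle.
    Returns (sample_names, table), where each table row is
    [length, est_counts for sample 1, est_counts for sample 2, ...].
    Contigs missing from a sample get an estimated count of 0.0.
    """
    names = sorted(samples)
    table = {}
    for i, name in enumerate(names):
        for target, (length, count) in read_abundance(samples[name]).items():
            row = table.setdefault(target, [length] + [0.0] * len(names))
            row[i + 1] = count
    return names, table


# Hypothetical in-memory inputs standing in for two abundance.tsv files.
header = "target_id\tlength\teff_length\test_counts\ttpm\n"
s1 = io.StringIO(header + "contig_a\t300\t150.0\t10\t5.0\n"
                          "contig_b\t150\t20.0\t2.5\t9.0\n")
s2 = io.StringIO(header + "contig_a\t300\t150.0\t4\t7.0\n")

names, table = aggregate({"S1": s1, "S2": s2})
for target in sorted(table):
    print(target, table[target])
```

In practice the handles would come from opening each sample's abundance.tsv on disk; the in-memory strings keep the sketch self-contained.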

Study-Specific Scripts

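Each study (G1PA, G2PA, G3PA, G3PA_diel, D1PA) has two study-specific scripts, named by substituting the study's prefix for {STUDY_ID} (the Gradients 3 scripts listed below use the prefix G3PA_UW):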

{STUDY_ID}.process_short_reads.sh: Performs quality control and preprocessing of raw sequencing data for the specific study.
{STUDY_ID}.trinity_assemblies.sh: Uses Trinity to perform de novo assembly of metatranscriptomes for the specific study.

Links to study-specific scripts:

Gradients 1 (G1PA):
- G1PA.process_short_reads.sh
- G1PA.trinity_assemblies.sh
Gradients 2 (G2PA):
- G2PA.process_short_reads.sh
- G2PA.trinity_assemblies.sh
Gradients 3 (G3PA):
- G3PA_UW.process_short_reads.sh
- G3PA_UW.trinity_assemblies.sh
G3 Diel (G3PA_diel):
- G3PA_diel.process_short_reads.sh
- G3PA_diel.trinity_assemblies.sh
Diel1 (D1PA):
- D1PA.process_short_reads.sh
- D1PA.trinity_assemblies.sh
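
The naming pattern above can be expanded programmatically, for example when looping over studies in a workflow. This is a minimal sketch; the prefix mapping is inferred from the script names listed above, and the function and variable names are hypothetical.

```python
# File-name templates from the Study-Specific Scripts section.
SCRIPT_TEMPLATES = (
    "{prefix}.process_short_reads.sh",
    "{prefix}.trinity_assemblies.sh",
)

# Study ID -> script prefix, inferred from the listed script names.
# Note that G3PA scripts carry the G3PA_UW prefix.
PREFIXES = {
    "G1PA": "G1PA",
    "G2PA": "G2PA",
    "G3PA": "G3PA_UW",
    "G3PA_diel": "G3PA_diel",
    "D1PA": "D1PA",
}


def study_scripts(study_id):
    """Return the two script file names for a given study ID."""
    prefix = PREFIXES[study_id]
    return [template.format(prefix=prefix) for template in SCRIPT_TEMPLATES]


# Flat list of every study-specific script name.
all_scripts = [name for study in PREFIXES for name in study_scripts(study)]
print("\n".join(all_scripts))
```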

Associated Data

Additional metadata and associated datasets are available on the Simons CMAP ocean data portal.

SCOPE Diel1 associated data: https://simonscmap.com/catalog/cruises/KM1513
Gradients 1 associated data: https://simonscmap.com/catalog/cruises/KOK1606
Gradients 2 associated data: https://simonscmap.com/catalog/cruises/MGL1704
Gradients 3 associated data: https://simonscmap.com/catalog/cruises/KM1906

Additional metadata for the Gradients cruises can be found here: http://scope.soest.hawaii.edu/data/gradients/gradients.html

Citation

If you use this data in your research, please cite:

Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J. & Armbrust, E. V. The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Sci. Data 11, 1161 (2024). doi:10.1038/s41597-024-04005-5

Repository: armbrustlab/NPac_euk_gene_catalog
