forked from awslabs/open-data-registry
dnastack-covid-19-sra-data.yaml
Name: DNAStack COVID19 SRA Data
Description: >
  The [Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra/) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA is the largest publicly available repository of SARS-CoV-2 sequencing data.

  This dataset was created by DNAstack from SARS-CoV-2 sequencing data sourced from the SRA. Where possible, DNAstack processed the raw sequence data through a unified bioinformatics pipeline to produce genome assemblies and variant calls. Because a standardized workflow was used, public data generated with different methodologies can be combined and compared, enabling a more powerful global analysis of available SARS-CoV-2 data and giving researchers rapid access to aggregated downstream results.

  Methodology: Reads from the SRA were extracted in FASTQ format, then processed with a pipeline chosen according to the sequencing technology used to generate them: the [ARTIC protocol](https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html) for Oxford Nanopore-derived reads; the [SIGNAL pipeline](https://github.com/jaleezyy/covid-19-signal) for paired-end Illumina reads; and the [CoSA pipeline](https://github.com/PacificBiosciences/CoSA) (using [DeepVariant](https://github.com/google/deepvariant) for variant calling) for PacBio reads. Briefly, reads were primer-trimmed and aligned to the SARS-CoV-2 reference genome, after which contiguous regions were assembled and variant sites were called. [Pangolin](https://github.com/cov-lineages/pangolin) was then used to assign a viral lineage based on the assembled genome.
Documentation: https://github.com/DNAstack/dnastack-open-data
Contact: "[DNAstack]([email protected])"
ManagedBy: "[DNAstack](https://dnastack.com/)"
UpdateFrequency: Rolling
Tags:
- aws-pds
- bam
- bioinformatics
- coronavirus
- COVID-19
- fasta
- fastq
- global
- genetic
- genomic
- health
- life sciences
- long read sequencing
- SARS-CoV-2
- vcf
- virus
- whole genome sequencing
License: "[DNAstack terms of use](https://dnastack.com/terms-of-use/)"
Resources:
  - Description: SARS-CoV-2 raw sequencing and output data (FASTQ, BAM, FASTA, VCF)
    ARN: arn:aws:s3:::dnastack-covid-19-sra-data
    Region: us-west-2
    Type: S3 Bucket
    RequesterPays: False
    Explore:
      - "[Browse bucket](https://dnastack-covid-19-sra-data.s3.amazonaws.com/)"
DataAtWork:
  Tutorials:
    - Title: "Viral lineage assignment"
      URL: https://github.com/DNAstack/dnastack-open-data/tree/master/tutorials/assign_lineage
      AuthorName: "Heather Ward"
      Services:
        - Batch
        - EC2
        - S3
  Tools & Applications:
    - Title: "Viral AI"
      URL: https://viral.ai/collections/ncbi-sra-covid
      AuthorName: DNAstack
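# Access sketch (illustrative comment, not part of the registry schema).
# The Resources entry sets RequesterPays to False and links to a plain HTTPS
# bucket listing, so this assumes anonymous reads are permitted. Under that
# assumption, the AWS CLI can list the bucket without credentials:
#
#   aws s3 ls --no-sign-request --region us-west-2 s3://dnastack-covid-19-sra-data/
#
# The bucket name and region are taken from the ARN and Region fields; no
# object prefixes are shown because this file does not document the layout.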