[Fetch_SRR_Accession] New wf to retrieve SRR after Terra2NCBI wf (#668)
* initial commit part 1 retrieve srr from Biosample

* update task and wf names and meta

* dockstore add

* Documentation and update column name

* update dockstore name

* Remove unnecessary blank lines in fetch_srr_metadata WDL task

* Update SRR metadata workflow and documentation for clarity and accuracy

* Remove redundant docker input from wf_update_srr_metadata workflow

* update

* update dockstore

* initial updates

* handle multiple SRR accessions as string version outputs

* update task path

* forgot to import task versioning

* update dockstore yml

* comma sep output as string instead of array

* update wf name

* test local worked

* set euo pipefail

* more explicit fail invalid biosample

* update logic failure

* logic handling valid biosample or SRA

* enhance error handling and logging for biosample ID or SRA fetching

* Update logic for no SRR accessions and invalid samples

* update docs version in table

* add sample level to docs

* update input and output tables
fraser-combe authored Nov 26, 2024
1 parent 24b6abe commit b4aad55
Showing 8 changed files with 149 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .dockstore.yml
@@ -195,6 +195,11 @@ workflows:
    primaryDescriptorPath: /workflows/utilities/data_import/wf_terra_2_bq.wdl
    testParameterFiles:
      - /tests/inputs/empty.json
  - name: Fetch_SRR_Accession_PHB
    subclass: WDL
    primaryDescriptorPath: /workflows/utilities/data_import/wf_fetch_srr_accession.wdl
    testParameterFiles:
      - /tests/inputs/empty.json
  - name: Concatenate_Column_Content_PHB
    subclass: WDL
    primaryDescriptorPath: /workflows/utilities/file_handling/wf_concatenate_column.wdl
52 changes: 52 additions & 0 deletions docs/workflows/public_data_sharing/fetch_srr_accession.md
@@ -0,0 +1,52 @@
# Fetch SRR Accession Workflow

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Public Data Sharing](../../workflows_overview/workflows_type.md/#public-data-sharing) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.3.0 | Yes | Sample-level |

## Fetch SRR Accession

This workflow retrieves the Sequence Read Archive (SRA) accession (SRR) associated with a given sample accession. The primary inputs are BioSample IDs (e.g., SAMN00000000) or SRA Experiment IDs (e.g., SRX000000), which link to sequencing data in the SRA repository.

The workflow uses the `fastq-dl` tool to fetch metadata from SRA, then parses that metadata to extract and output the associated SRR accession(s).

### Inputs

| **Terra Task Name** | **Variable** | **Type** | **Description**| **Default Value** | **Terra Status** |
| --- | --- | --- | --- | --- | --- |
| fetch_srr | **sample_accession** | String | SRA-compatible accession, such as a **BioSample ID** (e.g., "SAMN00000000") or **SRA Experiment ID** (e.g., "SRX000000"), used to retrieve SRR metadata. | | Required |
| fetch_srr | **cpu** | Int | Number of CPUs allocated for the task. | 2 | Optional |
| fetch_srr | **disk_size** | Int | Disk space in GB allocated for the task. | 10 | Optional |
| fetch_srr | **docker** | String | Docker image for metadata retrieval. | `us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0` | Optional |
| fetch_srr | **memory** | Int | Memory in GB allocated for the task. | 8 | Optional |
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

### Workflow Tasks

This workflow has a single task that performs metadata retrieval for the specified sample accession.

??? task "`fastq-dl`: Fetches SRR metadata for sample accession"
    When provided a BioSample accession or SRA experiment ID, `fastq-dl` collects metadata and returns the appropriate SRR accession.

    !!! techdetails "fastq-dl Technical Details"
        |  | Links |
        | --- | --- |
        | Task | [Task on GitHub](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_fetch_srr_accession.wdl) |
        | Software Source Code | [fastq-dl Source](https://github.com/rvalieris/fastq-dl) |
        | Software Documentation | [fastq-dl Documentation](https://github.com/rvalieris/fastq-dl#documentation) |
        | Original Publication | [fastq-dl: A fast and reliable tool for downloading SRA metadata](https://doi.org/10.1186/s12859-021-04346-3) |

### Outputs

| **Variable** | **Type** | **Description**|
|---|---|---|
| srr_accession | String | The SRR accession(s) associated with the input sample accession, comma-separated when there is more than one; `No SRR accession found` when none are available. |
| fetch_srr_accession_version | String | The version of the fetch_srr_accession workflow. |
| fetch_srr_accession_analysis_date | String | The date the fetch_srr_accession analysis was run. |
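
Because `srr_accession` is a single string that may hold several comma-separated runs, downstream steps may need to split it. The following is a minimal sketch of one way to do that in POSIX shell; the function name `split_srr_accessions` and the example value are hypothetical and not part of the workflow.

```shell
#!/bin/sh
set -eu

# Split a comma-separated srr_accession value into one accession per line.
# Prints nothing for an empty value or for the task's placeholder text.
split_srr_accessions() {
  case "$1" in
    ""|"No SRR accession found") return 0 ;;
  esac
  printf '%s\n' "$1" | tr ',' '\n'
}

# Hypothetical example value; a real run would supply the workflow output.
srr_accession="SRR000001,SRR000002"
split_srr_accessions "$srr_accession"
```

Treating the placeholder text as "no runs" keeps callers from passing it on as if it were an accession.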

## References

> Valieris, R. et al., "fastq-dl: A fast and reliable tool for downloading SRA metadata." Bioinformatics, 2021.
1 change: 1 addition & 0 deletions docs/workflows_overview/workflows_alphabetically.md
@@ -47,6 +47,7 @@ title: Alphabetical Workflows
| [**TheiaValidate**](../workflows/standalone/theiavalidate.md)| This workflow performs basic comparisons between user-designated columns in two separate tables. | Any taxa | | No | v2.0.0 | [TheiaValidate_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaValidate_PHB:main?tab=info) |
| [**Transfer_Column_Content**](../workflows/data_export/transfer_column_content.md)| Transfer contents of a specified Terra data table column for many samples ("entities") to a GCP storage bucket location | Any taxa | Set-level | Yes | v1.3.0 | [Transfer_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Transfer_Column_Content_PHB:main?tab=info) |
| [**Usher_PHB**](../workflows/phylogenetic_placement/usher.md)| Use UShER to rapidly and accurately place your samples on any existing phylogenetic tree | Monkeypox virus, SARS-CoV-2, Viral | Sample-level, Set-level | Yes | v2.1.0 | [Usher_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Usher_PHB:main?tab=info) |
| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) |
| [**VADR_Update**](../workflows/genomic_characterization/vadr_update.md)| Update VADR assignments | HAV, Influenza, Monkeypox virus, RSV-A, RSV-B, SARS-CoV-2, Viral, WNV | Sample-level | Yes | v1.2.1 | [VADR_Update_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/VADR_Update_PHB:main?tab=info) |
| [**Zip_Column_Content**](../workflows/data_export/zip_column_content.md)| Zip contents of a specified Terra data table column for many samples ("entities") | Any taxa | Set-level | Yes | v2.1.0 | [Zip_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Zip_Column_Content_PHB:main?tab=info) |

1 change: 1 addition & 0 deletions docs/workflows_overview/workflows_kingdom.md
@@ -24,6 +24,7 @@ title: Workflows by Kingdom
| [**TheiaMeta**](../workflows/genomic_characterization/theiameta.md) | Genome assembly and QC from metagenomic sequencing | Any taxa | Sample-level | Yes | v2.0.0 | [TheiaMeta_Illumina_PE_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaMeta_Illumina_PE_PHB:main?tab=info) |
| [**TheiaValidate**](../workflows/standalone/theiavalidate.md)| This workflow performs basic comparisons between user-designated columns in two separate tables. | Any taxa | | No | v2.0.0 | [TheiaValidate_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/TheiaValidate_PHB:main?tab=info) |
| [**Transfer_Column_Content**](../workflows/data_export/transfer_column_content.md)| Transfer contents of a specified Terra data table column for many samples ("entities") to a GCP storage bucket location | Any taxa | Set-level | Yes | v1.3.0 | [Transfer_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Transfer_Column_Content_PHB:main?tab=info) |
| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) |
| [**Zip_Column_Content**](../workflows/data_export/zip_column_content.md)| Zip contents of a specified Terra data table column for many samples ("entities") | Any taxa | Set-level | Yes | v2.1.0 | [Zip_Column_Content_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Zip_Column_Content_PHB:main?tab=info) |

</div>
1 change: 1 addition & 0 deletions docs/workflows_overview/workflows_type.md
@@ -75,6 +75,7 @@ title: Workflows by Type
| [**Mercury_Prep_N_Batch**](../workflows/public_data_sharing/mercury_prep_n_batch.md)| Prepare metadata and sequence data for submission to NCBI and GISAID | Influenza, Monkeypox virus, SARS-CoV-2, Viral | Set-level | No | v2.2.0 | [Mercury_Prep_N_Batch_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Mercury_Prep_N_Batch_PHB:main?tab=info) |
| [**Terra_2_GISAID**](../workflows/public_data_sharing/terra_2_gisaid.md)| Upload of assembly data to GISAID | SARS-CoV-2, Viral | Set-level | Yes | v1.2.1 | [Terra_2_GISAID_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Terra_2_GISAID_PHB:main?tab=info) |
| [**Terra_2_NCBI**](../workflows/public_data_sharing/terra_2_ncbi.md)| Upload of sequence data to NCBI | Bacteria, Mycotics, Viral | Set-level | No | v2.1.0 | [Terra_2_NCBI_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Terra_2_NCBI_PHB:main?tab=info) |
| [**Fetch_SRR_Accession**](../workflows/public_data_sharing/fetch_srr_accession.md)| Update SRR metadata in a Terra data table at the sample level | Any taxa | Sample-level | Yes | v2.3.0 | [Fetch_SRR_Accession_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Fetch_SRR_Accession_PHB:main?tab=info) |

</div>

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -43,6 +43,7 @@ nav:
- Samples_to_Ref_Tree: workflows/phylogenetic_placement/samples_to_ref_tree.md
- Usher_PHB: workflows/phylogenetic_placement/usher.md
- Public Data Sharing:
- Fetch_SRR_Accession: workflows/public_data_sharing/fetch_srr_accession.md
- Mercury_Prep_N_Batch: workflows/public_data_sharing/mercury_prep_n_batch.md
- Terra_2_GISAID: workflows/public_data_sharing/terra_2_gisaid.md
- Terra_2_NCBI: workflows/public_data_sharing/terra_2_ncbi.md
62 changes: 62 additions & 0 deletions tasks/utilities/data_handling/task_fetch_srr_accession.wdl
@@ -0,0 +1,62 @@
version 1.0

task fetch_srr_accession {
  input {
    String sample_accession
    String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0"
    Int disk_size = 10
    Int cpu = 2
    Int memory = 8
  }
  meta {
    volatile: true
  }
  command <<<
    set -euo pipefail

    # Output the current date and fastq-dl version for debugging
    date -u | tee DATE
    fastq-dl --version | tee VERSION

    echo "Fetching metadata for accession: ~{sample_accession}"

    # Run fastq-dl and capture stderr
    fastq-dl --accession ~{sample_accession} --only-download-metadata -m 2 --verbose 2> stderr.log || true

    # Handle whether the ID/accession is valid and contains SRR metadata based on stderr
    if grep -q "No results found for" stderr.log; then
      echo "No SRR accession found" > srr_accession.txt
      echo "No SRR accession found for accession: ~{sample_accession}"
    elif grep -q "received an empty response" stderr.log; then
      echo "No SRR accession found" > srr_accession.txt
      echo "No SRR accession found for accession: ~{sample_accession}"
    elif grep -q "is not a Study, Sample, Experiment, or Run accession" stderr.log; then
      echo "Invalid accession: ~{sample_accession}" >&2
      exit 1
    elif [[ ! -f fastq-run-info.tsv ]]; then
      echo "No metadata file found for accession: ~{sample_accession}" >&2
      exit 1
    else
      # Extract SRR accessions from the TSV file if it exists
      SRR_accessions=$(awk -F'\t' 'NR>1 {print $1}' fastq-run-info.tsv | paste -sd ',' -)
      if [[ -z "${SRR_accessions}" ]]; then
        echo "No SRR accession found" > srr_accession.txt
      else
        echo "Extracted SRR accessions: ${SRR_accessions}"
        echo "${SRR_accessions}" > srr_accession.txt
      fi
    fi
  >>>
  output {
    String srr_accession = read_string("srr_accession.txt")
    String fastq_dl_version = read_string("VERSION")
  }
  runtime {
    docker: docker
    memory: "~{memory} GB"
    cpu: cpu
    disks: "local-disk " + disk_size + " SSD"
    disk: disk_size + " GB"
    preemptible: 1
  }
}
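
The branching in the task's command block can be tried out offline, without calling `fastq-dl`. The sketch below mirrors the task's `grep` checks and its `awk`/`paste` extraction against mock files. The helper names `classify_stderr` and `extract_srr`, the mock log line, and the TSV column names are assumptions made up for illustration; only the matched message fragments and the "first column, skip header" rule come from the task itself.

```shell
#!/bin/sh
set -eu

# Map a stderr log to one of three outcomes, mirroring the task's grep checks:
# "none" (valid accession, no runs), "invalid" (not an accession), or "ok".
classify_stderr() {
  if grep -q -e "No results found for" -e "received an empty response" "$1"; then
    echo "none"
  elif grep -q "is not a Study, Sample, Experiment, or Run accession" "$1"; then
    echo "invalid"
  else
    echo "ok"
  fi
}

# Join column 1 of a run-info TSV (header skipped) with commas.
extract_srr() {
  awk -F'\t' 'NR>1 {print $1}' "$1" | paste -sd ',' -
}

# Mock inputs, made up for illustration (not real fastq-dl output).
workdir=$(mktemp -d)
printf 'run_accession\tsample_accession\nSRR000001\tSAMN00000000\nSRR000002\tSAMN00000000\n' \
  > "$workdir/fastq-run-info.tsv"
printf 'No results found for SAMN00000000\n' > "$workdir/stderr.log"

classify_stderr "$workdir/stderr.log"
extract_srr "$workdir/fastq-run-info.tsv"
```

In the task, the first two outcomes write the `No SRR accession found` placeholder and succeed, while an invalid accession or a missing `fastq-run-info.tsv` exits with status 1.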
26 changes: 26 additions & 0 deletions workflows/utilities/data_import/wf_fetch_srr_accession.wdl
@@ -0,0 +1,26 @@
version 1.0

import "../../../tasks/utilities/data_handling/task_fetch_srr_accession.wdl" as srr_task
import "../../../tasks/task_versioning.wdl" as versioning_task

workflow fetch_srr_accession {
  meta {
    description: "This workflow retrieves the Sequence Read Archive (SRA) accession (SRR) associated with a given sample accession. It uses the fastq-dl tool to fetch metadata from SRA and outputs the SRR accession."
  }
  input {
    String sample_accession
  }
  call versioning_task.version_capture {
    input:
  }
  call srr_task.fetch_srr_accession as fetch_srr {
    input:
      sample_accession = sample_accession
  }
  output {
    String srr_accession = fetch_srr.srr_accession
    # Version Captures
    String fetch_srr_accession_version = version_capture.phb_version
    String fetch_srr_accession_analysis_date = version_capture.date
  }
}
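
Since the workflow takes a single required input, an inputs file for a command-line run can be very small. The sketch below writes one, assuming the usual `<workflow name>.<input name>` key convention for WDL inputs JSON. The file name `inputs.json` and the placeholder accession are illustrative, and the runner invocation in the trailing comment is an assumption that is not part of this commit.

```shell
#!/bin/sh
set -eu

# Placeholder accession taken from the workflow documentation; replace with a real one.
sample_accession="SAMN00000000"

# Write an inputs file keyed by <workflow name>.<input name>.
cat > inputs.json <<EOF
{
  "fetch_srr_accession.sample_accession": "${sample_accession}"
}
EOF

cat inputs.json

# A WDL runner could then be pointed at this file, for example (assumption):
#   miniwdl run workflows/utilities/data_import/wf_fetch_srr_accession.wdl -i inputs.json
```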
