Solve some Singularity issues. Updates. DSL2 Migration. #57

Open
wants to merge 8 commits into
base: master
Changes from 2 commits
87 changes: 48 additions & 39 deletions README.md
@@ -4,24 +4,6 @@

Convert BAM files back to FASTQ.

## Quickstart with Conda

We do not recommend Conda for running the workflow. It may happen that packages are not available in any channels anymore and that the environment is broken. For reproducible research, please use containers.

Provided you have a working [Conda](https://docs.conda.io/en/latest/) installation, you can run the workflow with

```bash
mkdir test_out/
nextflow run main.nf \
-profile local,conda \
-ansi-log \
--input=/path/to/your.bam \
--outputDir=test_out \
--sortFastqs=false
```

For each BAM file in the comma-separated `--input` parameter, one directory with FASTQs is created in the `outputDir`. With the `local` profile the processing jobs will be executed locally. The `conda` profile will let Nextflow create a Conda environment from the `task-environment.yml` file. By default, the conda environment will be created in the source directory of the workflow (see [nextflow.config](https://github.com/DKFZ-ODCF/nf-bam2fastq/blob/master/nextflow.config)).

## Quickstart with Docker

Depending on the version of the workflow that you want to run, it might not be possible to rebuild the Conda environment. Therefore, to guarantee reproducibility, we provide [container images](https://github.com/orgs/DKFZ-ODCF/packages) of the task environment.
@@ -41,25 +23,47 @@ nextflow run main.nf \

On your cluster, you may not have access to Docker. In this situation, you can use [Singularity](https://singularity.lbl.gov/), provided it is installed on your cluster. Unfortunately, Nextflow will fail to convert the Docker image into a Singularity image unless Docker is available, but you can build the Singularity image yourself:

1. Create a Singularity image from the public Docker container
```bash
version=1.0.0
repoDir=/path/to/nf-bam2fastq

singularity build \
"$repoDir/cache/singularity/nf-bam2fastq_$version.sif" \
"docker://ghcr.io/dkfz-odcf/nf-bam2fastq:$version"
```
Note that the location and name of the Singularity image is configured in the `nextflow.config`.
3. Now, you can run the workflow with the "singularity" profile, e.g. on an LSF cluster:
```bash
nextflow run /path/to/nf-bam2fastq/main.nf \
-profile lsf,singularity \
-ansi-log \
--input=test/test1_paired.bam,test/test1_unpaired.bam \
--outputDir=test_out \
--sortFastqs=true
```
You can run the workflow with the "singularity" profile, e.g. on an LSF cluster:

```bash
nextflow run $repoDir/main.nf \
-profile lsf,singularity \
--input=$repoDir/test/test1_paired.bam,$repoDir/test/test1_unpaired.bam \
--outputDir=test_out \
--sortFastqs=true
```

Nextflow will automatically pull the Docker image, convert it into a Singularity image, put it at `$repoDir/cache/singularity/ghcr.io-dkfz-odcf-nf-bam2fastq-$version.img`, and then run the workflow.
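
The cached file name above can be derived from the image URL. As a rough sketch, the following Bash function reproduces that mapping for the image used here. The function name `cached_image_name` is hypothetical, and the rule (drop the `docker://` scheme, replace `/` and `:` with `-`, append `.img`) is inferred from the file name shown above, not taken from the Nextflow documentation.

```bash
#!/usr/bin/env bash

# Hypothetical helper: map a docker:// URL to the cache file name seen above.
cached_image_name() {
    local url="${1#docker://}"   # strip the scheme, if present
    url="${url//\//-}"           # replace every "/" with "-"
    url="${url//:/-}"            # replace every ":" with "-"
    printf '%s.img\n' "$url"
}

cached_image_name "docker://ghcr.io/dkfz-odcf/nf-bam2fastq:1.0.0"
# prints: ghcr.io-dkfz-odcf-nf-bam2fastq-1.0.0.img
```

This can be handy for checking whether the image for a given version is already present in `cache/singularity/` before starting a run.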

> WARNING: Caching the container is probably *not* concurrency-safe. If several workflow runs start at the same time and all of them try to cache the Singularity container, the cached image may end up in an inconsistent state. In that case, build the image manually beforehand with the following command:
> ```bash
> version=1.0.0
> repoDir=/path/to/nf-bam2fastq
>
> singularity build \
> "$repoDir/cache/singularity/ghcr.io-dkfz-odcf-nf-bam2fastq-$version.img" \
> "docker://ghcr.io/dkfz-odcf/nf-bam2fastq:$version"
> ```
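
One possible way to guard the manual build against concurrent callers is an exclusive file lock. The sketch below is not part of the workflow; it assumes `flock` (util-linux) is installed, and the function name `cache_image_once` and its optional builder arguments are hypothetical. File locks may not behave reliably on every shared file system, so treat this as an illustration only.

```bash
#!/usr/bin/env bash

# Build IMAGE from SOURCE at most once, even if several processes call this
# function at the same time. Any further arguments replace the default
# builder command (`singularity build`), which mainly helps with testing.
cache_image_once() {
    local image="$1" source="$2"
    shift 2
    local -a builder=(singularity build)
    if (( $# > 0 )); then
        builder=("$@")
    fi
    local tmp="$image.tmp.$$"
    mkdir -p "$(dirname "$image")"
    (
        # Block until this process holds an exclusive lock on descriptor 9.
        flock -x 9
        # Another process may have built the image while we were waiting.
        if [[ ! -e "$image" ]]; then
            # Build under a temporary name and move it into place afterwards,
            # so that a failed build never leaves a partial image behind.
            "${builder[@]}" "$tmp" "$source" && mv -- "$tmp" "$image"
        fi
    ) 9>"$image.lock"
}

# Usage (assumes singularity is on the PATH):
# cache_image_once \
#     "$repoDir/cache/singularity/ghcr.io-dkfz-odcf-nf-bam2fastq-$version.img" \
#     "docker://ghcr.io/dkfz-odcf/nf-bam2fastq:$version"
```

The second check inside the lock is what lets late callers skip the build instead of overwriting a finished image.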

## Quickstart with Conda

> NOTE: Conda is a decent tool for building containers, although these containers tend to be rather big. However, we do *not* recommend you use Conda for reproducibly running workflows. The Conda solution proposed here really is mostly for development. We will not give support for this.

We do not recommend Conda for running the workflow. Packages may disappear from all channels, which leaves the environment broken. For reproducible research, please use containers.

Provided you have a working [Conda](https://docs.conda.io/en/latest/) installation, you can run the workflow with

```bash
mkdir test_out/
nextflow run main.nf \
-profile local,conda \
--input=/path/to/your.bam \
--outputDir=test_out \
--sortFastqs=false
```

For each BAM file in the comma-separated `--input` parameter, one directory with FASTQs is created in the `outputDir`. With the `local` profile the processing jobs will be executed locally. The `conda` profile will let Nextflow create a Conda environment from the `task-environment.yml` file. By default, the conda environment will be created in the source directory of the workflow (see [nextflow.config](https://github.com/DKFZ-ODCF/nf-bam2fastq/blob/master/nextflow.config)).
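
To illustrate the comma-separated `--input` value and the per-BAM output directories, here is a small Bash sketch. The helper names `join_by_comma` and `fastq_dir_for` and the `/data/*.bam` paths are made up for this example; the `_fastqs` and `_sorted_fastqs` directory suffixes are the ones checked in `test/test1.sh`.

```bash
#!/usr/bin/env bash

# Join all arguments with commas to form the value of --input.
join_by_comma() {
    local IFS=','
    printf '%s\n' "$*"
}

# Directory in which the FASTQs of one BAM are expected.
# The third argument mirrors the value of --sortFastqs.
fastq_dir_for() {
    local outputDir="$1" bam="$2" sorted="${3:-false}"
    local suffix="_fastqs"
    if [[ "$sorted" == "true" ]]; then
        suffix="_sorted_fastqs"
    fi
    printf '%s/%s%s\n' "$outputDir" "$(basename "$bam")" "$suffix"
}

bams=(/data/a.bam /data/b.bam)   # hypothetical input files
echo "--input=$(join_by_comma "${bams[@]}")"
for bam in "${bams[@]}"; do
    fastq_dir_for test_out "$bam" false
done
# prints:
# --input=/data/a.bam,/data/b.bam
# test_out/a.bam_fastqs
# test_out/b.bam_fastqs
```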


## Remarks

@@ -132,8 +136,8 @@ By default, the Conda environments of the jobs as well as the Singularity contai
cd $workflowRepoDir
# Refer to the nextflow.config for the name of the Singularity image.
singularity build \
cache/singularity/nf-bam2fastq_1.0.0.sif \
docker://ghcr.io/dkfz-odcf/nf-bam2fastq:1.0.0
cache/singularity/ghcr.io-dkfz-odcf-nf-bam2fastq-$version.img \
docker://ghcr.io/dkfz-odcf/nf-bam2fastq:$version

# Test your container
test/test1.sh test-results/ singularity nextflowEnv/
@@ -189,6 +193,11 @@ This is an outline of the procedure to release the container to [Github Containe

## Release Notes

* 1.3.0
* Minor: Let Nextflow automatically create the cached Singularity image.
> NOTE: The cached image name was changed to Nextflow's default name. If you want to prevent a re-conversion of the image, you may rename an existing image to `cache/singularity/ghcr.io-dkfz-odcf-nf-bam2fastq-$version.img`.
* Patch: Mention Conda only for development in `README.md`.

* 1.2.0
* Minor: Updated to miniconda3:4.10.3 base container, because the previous version (4.9.2) didn't build anymore.
* Minor: Use `-env none` for "lsf" cluster profile. Local environment should not be copied. This probably caused problems with the old "dkfzModules" environment profile.
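
The rename mentioned in the 1.3.0 note could be scripted roughly as follows. This is only a sketch: the function name `migrate_cached_image` is hypothetical, and the old file name pattern `nf-bam2fastq_$version.sif` is the one used by the previous `singularity` profile.

```bash
#!/usr/bin/env bash

# Rename an image cached under the old name to the name Nextflow now expects.
# Does nothing if there is no old image or if the new image already exists.
migrate_cached_image() {
    local cacheDir="$1" version="$2"
    local old="${cacheDir}/nf-bam2fastq_${version}.sif"
    local new="${cacheDir}/ghcr.io-dkfz-odcf-nf-bam2fastq-${version}.img"
    if [[ -e "$old" && ! -e "$new" ]]; then
        mv -- "$old" "$new"
        echo "renamed"
    else
        echo "skipped"
    fi
}

# Usage (paths are examples):
# migrate_cached_image /path/to/nf-bam2fastq/cache/singularity 1.0.0
```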
84 changes: 44 additions & 40 deletions main.nf
@@ -162,10 +162,6 @@ String toSortMemoryString(MemoryUnit mem) {
}

/** The actual workflow */
bamFiles_ch = Channel.
fromPath(params.input.split(',') as List<String>,
checkIfExists: true)


Boolean compressBamToFastqOutput = params.sortFastqs ? params.compressIntermediateFastqs : true

@@ -180,10 +176,10 @@ process bamToFastq {
publishDir params.outputDir, enabled: !params.sortFastqs, mode: publishMode.toString()

input:
file bamFile from bamFiles_ch
file bamFile

output:
tuple file(bamFile), file("**/*.${fastqSuffix(compressBamToFastqOutput)}") into readsFiles_ch
tuple file(bamFile), file("**/*.${fastqSuffix(compressBamToFastqOutput)}")

shell:
"""
@@ -199,29 +195,6 @@ process bamToFastq {

}

// Create two channels of matched paired-end and unmatched or single-end reads, each of tuples of (bam, fastq).
readsFiles_ch.into { readsFilesA_ch; readsFilesB_ch }
pairedFastqs_ch = readsFilesA_ch.flatMap {
def (bam, fastqs) = it
fastqs.grep { it.getFileName() =~ /.+_R[12]\.fastq(?:\.[^.]*)?$/ }.
groupBy { fastq -> fastq.getFileName().toString().replaceFirst("_R[12].fastq(?:.gz)?\$", "") }.
collect { key, files ->
assert files.size() == 2
files.sort()
[bam, files[0], files[1]]
}
}


// Unpaired FASTQs are unmatched or orphaned paired-reads (1 or 2) and singletons, i.e. unpaired reads.
unpairedFastqs_ch = readsFilesB_ch.flatMap {
def (bam, fastqs) = it
fastqs.
grep { it.getFileName() =~ /.+_(U[12]|S)\.fastq(?:\.[^.]*)?$/ }.
collect { [bam, it] }
}


process nameSortUnpairedFastqs {
cpus { params.sortThreads + (params.compressIntermediateFastqs ? params.compressorThreads : 0 ) }
memory { (sortMemory + 100.MB) * params.sortThreads * 1.2 }
@@ -235,15 +208,14 @@ process nameSortUnpairedFastqs {
params.sortFastqs

input:
tuple file(bam), file(fastq) from unpairedFastqs_ch
tuple file(bam), file(fastq)

output:
tuple file(bam), file(sortedFastqFile) into sortedUnpairedFastqs_ch
tuple file(bam), file(sortedFastqFile)

script:
bamFileName = bam.getFileName().toString()
outDir = "${bamFileName}_sorted_fastqs"
sortedFastqFile = sortedFastqFile(outDir, fastq, true)
outDir = "${bam.getFileName().toString()}_sorted_fastqs" as String
sortedFastqFile = sortedFastqFile(outDir, fastq.toRealPath(), true)
"""
mkdir -p "$outDir"
compressedInputFastqs="$compressBamToFastqOutput" \
@@ -272,16 +244,15 @@ process nameSortPairedFastqs {
params.sortFastqs

input:
tuple file(bam), file(fastq1), file(fastq2) from pairedFastqs_ch
tuple file(bam), file(fastq1), file(fastq2)

output:
tuple file(bam), file(sortedFastqFile1), file(sortedFastqFile2) into sortedPairedFastqs_ch
tuple file(bam), file(sortedFastqFile1), file(sortedFastqFile2)

script:
bamFileName = bam.getFileName().toString()
outDir = "${bamFileName}_sorted_fastqs"
sortedFastqFile1 = sortedFastqFile(outDir, fastq1, true)
sortedFastqFile2 = sortedFastqFile(outDir, fastq2, true)
outDir = "${bam.getFileName().toString()}_sorted_fastqs" as String
sortedFastqFile1 = sortedFastqFile(outDir, fastq1.toRealPath(), true)
sortedFastqFile2 = sortedFastqFile(outDir, fastq2.toRealPath(), true)
"""
mkdir -p "$outDir"
compressedInputFastqs="$compressBamToFastqOutput" \
@@ -298,6 +269,39 @@ process nameSortPairedFastqs {

}

workflow {

bamFiles_ch = Channel.fromPath(params.input.split(',') as List<String>, checkIfExists: true)
readsFiles_ch = bamToFastq(bamFiles_ch)

pairedFastqs_ch = readsFiles_ch.flatMap {
def (bam, fastqs) = it
fastqs.grep {
it.getFileName() =~ /.+_R[12]\.fastq(?:\.[^.]*)?$/
}.
groupBy { fastq ->
fastq.getFileName().toString().replaceFirst("_R[12]\\.fastq(?:\\.[^.]*)?\$", "")
}.
collect { key, files ->
assert files.size() == 2
files.sort()
[bam, files[0], files[1]]
}
}
nameSortPairedFastqs(pairedFastqs_ch)

// Unpaired FASTQs are unmatched or orphaned paired-reads (1 or 2) and singletons, i.e. unpaired reads.
unpairedFastqs_ch = readsFiles_ch.flatMap {
def (bam, fastqs) = it
fastqs.
grep { it.getFileName() =~ /.+_(U[12]|S)\.fastq(?:\.[^.]*)?$/ }.
collect { [bam, it] }
}
nameSortUnpairedFastqs(unpairedFastqs_ch)


}
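
To make the file-name matching in the `workflow` block easier to follow, here is a stand-alone Bash sketch of the same classification. It is not part of the workflow. The patterns are translated to POSIX extended regular expressions because Bash does not support `(?:...)` groups, the function names `classify_fastq` and `pair_key` are hypothetical, and `pair_key` uses the same suffix pattern as the filter (any single extension after `.fastq`), which is an assumption about the intended behaviour.

```bash
#!/usr/bin/env bash

# ERE translations of the two patterns used in the flatMap closures.
paired_re='^.+_R[12]\.fastq(\.[^.]*)?$'
unpaired_re='^.+_(U[12]|S)\.fastq(\.[^.]*)?$'

# Print "paired", "unpaired" or "other" for one FASTQ path.
classify_fastq() {
    local name
    name="$(basename "$1")"
    if [[ "$name" =~ $paired_re ]]; then
        echo paired
    elif [[ "$name" =~ $unpaired_re ]]; then
        echo unpaired
    else
        echo other
    fi
}

# Grouping key for paired files: the file name without the _R1/_R2 suffix.
pair_key() {
    basename "$1" | sed -E 's/_R[12]\.fastq(\.[^.]*)?$//'
}

for f in lane1_R1.fastq.gz lane1_R2.fastq.gz lane1_U1.fastq.gz lane1_S.fastq.gz; do
    printf '%s\t%s\n' "$f" "$(classify_fastq "$f")"
done
```

Files that share a `pair_key` form one `(bam, R1, R2)` tuple, while every unpaired file becomes its own `(bam, fastq)` tuple.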


workflow.onComplete {
println "Workflow run $workflow.runName completed at $workflow.complete with status " +
6 changes: 5 additions & 1 deletion nextflow.config
@@ -61,11 +61,15 @@ profiles {
}

singularity {
process.container = "nf-bam2fastq_${ext.containerVersion}.sif"
// Automatically pull the Docker image, and put it into the cache directory
process.container = "docker://ghcr.io/dkfz-odcf/nf-bam2fastq:${ext.containerVersion}"
singularity.enabled = true
singularity.cacheDir = "${projectDir}/cache/singularity"
// The singularity containers are stored in the workflow-directory
singularity.autoMounts = true
// Don't mount the home directory by default, because Bash may set up the environment from
// there, thus breaking the environment encapsulation and reproducibility of the workflow.
singularity.runOptions = "--no-home"
}

lsf {
File renamed without changes.
Binary file not shown.
File renamed without changes.
Binary file not shown.
22 changes: 15 additions & 7 deletions test/test1.sh
@@ -10,7 +10,7 @@ set -ue
set -o pipefail

outDir="${1:?No outDir set}"
environmentProfile="${2:-conda}"
environmentProfile="${2:-singularity}"
nextflowEnvironment="${3:-$outDir/nextflowEnv}"

if [[ "$environmentProfile" == "mamba" ]]; then
@@ -78,18 +78,18 @@ nextflow run "$workflowDir/main.nf" \
-ansi-log \
-resume \
-work-dir "$outDir/work" \
--input="$workflowDir/test/test1_paired.bam,$workflowDir/test/test1_unpaired.bam" \
--input="$workflowDir/test/reference/test1_paired.bam,$workflowDir/test/reference/test1_unpaired.bam" \
--outputDir="$outDir" \
--sortFastqs=false \
--compressorThreads=0 \
--sortThreads=1 \
--sortMemory="100 MB"
assertEqual \
"$(readsInBam "$workflowDir/test/test1_paired.bam")" \
"$(readsInBam "$workflowDir/test/reference/test1_paired.bam")" \
"$(readsInOutputDir "$outDir/test1_paired.bam_fastqs")" \
"Read number in unsorted output FASTQs on paired-end input bam"
assertEqual \
"$(readsInBam "$workflowDir/test/test1_unpaired.bam")" \
"$(readsInBam "$workflowDir/test/reference/test1_unpaired.bam")" \
"$(readsInOutputDir "$outDir/test1_unpaired.bam_fastqs")" \
"Read number in unsorted output FASTQs on single-end input bam"

@@ -98,19 +98,27 @@ nextflow run "$workflowDir/main.nf" \
-ansi-log \
-resume \
-work-dir "$outDir/work" \
--input="$workflowDir/test/test1_paired.bam,$workflowDir/test/test1_unpaired.bam" \
--input="$workflowDir/test/reference/test1_paired.bam,$workflowDir/test/reference/test1_unpaired.bam" \
--outputDir="$outDir" \
--sortFastqs=true \
--compressorThreads=0 \
--sortThreads=1 \
--sortMemory="100 MB"
assertEqual \
"$(readsInBam "$workflowDir/test/test1_paired.bam")" \
"$(readsInBam "$workflowDir/test/reference/test1_paired.bam")" \
"$(readsInOutputDir "$outDir/test1_paired.bam_sorted_fastqs")" \
"Read number in sorted output FASTQs on paired-end input bam"
assertEqual \
"$(readsInBam "$workflowDir/test/test1_unpaired.bam")" \
"$(readsInBam "$workflowDir/test/reference/test1_unpaired.bam")" \
"$(readsInOutputDir "$outDir/test1_unpaired.bam_sorted_fastqs")" \
"Read number in sorted output FASTQs on single-end input bam"

for ref in reference/test*/*; do
out="$outDir/$(echo "$ref" | sed "s/reference//")"
assertEqual \
"$(zcat "$ref" | md5sum | cut -d' ' -f1)" \
"$(zcat "$out" | md5sum | cut -d' ' -f1)" \
"MD5 of $ref and $out"
done
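
The MD5 loop above compares decompressed content, so differences in gzip headers or compression level do not matter. A variant that does not depend on the current working directory could look like the sketch below. The function names `content_md5` and `compare_to_reference` are hypothetical, and the sketch assumes that the reference directory mirrors the layout of the output directory and that `gzip`, `md5sum`, and `find` are available.

```bash
#!/usr/bin/env bash

# MD5 of the decompressed content of a gzip file.
content_md5() {
    gzip -dc "$1" | md5sum | cut -d' ' -f1
}

# Compare every *.gz below refDir with the file at the same relative path
# below outDir. Returns non-zero if a file is missing or its content differs.
compare_to_reference() {
    local refDir="${1%/}" outDir="${2%/}"
    local ref out status=0
    while IFS= read -r -d '' ref; do
        out="$outDir/${ref#"$refDir"/}"
        if [[ ! -f "$out" ]]; then
            echo "missing: $out" >&2
            status=1
        elif [[ "$(content_md5 "$ref")" != "$(content_md5 "$out")" ]]; then
            echo "differs: $out" >&2
            status=1
        fi
    done < <(find "$refDir" -type f -name '*.gz' -print0)
    return "$status"
}

# Usage (directory names are examples):
# compare_to_reference "$workflowDir/test/reference" "$outDir"
```

Reading the file list through process substitution keeps the loop in the current shell, so `status` is still set when the function returns.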

testFinished