Compress FASTA files with contigs and scaffolds using bgzip instead of gzip, to enable file indexing #746

Open
amizeranschi opened this issue Jan 19, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@amizeranschi
Contributor

Description of feature

Certain downstream analysis tools expect the FASTA files with contigs and scaffolds to be compressed with bgzip rather than gzip, so that the files can be indexed. To avoid decompressing and recompressing them, it would be helpful if the MEGAHIT and SPAdes modules compressed their outputs directly with bgzip.
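
For context on the difference between the two formats: a bgzip (BGZF) file is a series of small gzip members, each carrying an extra header subfield tagged `BC` that records the block size, which is what makes random access and indexing possible. The sketch below is one way to check which kind of compression a file has. The function name `is_bgzf` is made up for illustration and is not part of the pipeline or of any library.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block (hypothetical helper)."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes, deflate method, and the FEXTRA flag (bit 2) must be set
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        if not header[3] & 0x04:
            return False
        # XLEN: length of the extra field that follows the fixed header
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for the BGZF one: SI1='B', SI2='C', SLEN=2
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False
```

A file written by plain gzip normally has no extra field at all, so the check returns False for it.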

@amizeranschi amizeranschi added the enhancement New feature or request label Jan 19, 2025
@jfy133
Member

jfy133 commented Jan 19, 2025

Do you have examples of what tools require this type of gzipping?

This will require quite a fundamental change to the official nf-core modules, so I want to check if it's worth the effort.

@amizeranschi
Contributor Author

I've been looking into QC tools for finding misassemblies from reads mapped to contigs. Two promising examples are DeepMAsED (Deep learning for Metagenome Assembly Error Detection) and ResMiCo (Residual neural network for Misassembled Contig identification). Both of them use BAM files created by the pipeline, and they require indexing of both the BAM files with mapped reads and the FASTA files with assembled contigs. Indexing the FASTA files fails because they are compressed with plain gzip rather than bgzip.

On a side note, these tools are available in Bioconda and Biocontainers, so they'd make good candidates for adding as nf-core modules and being incorporated into nf-core/mag to complement MetaQUAST for assembly QC. Any thoughts on this?

I'm including some relevant links below:

DeepMAsED (Deep learning for Metagenome Assembly Error Detection)
https://academic.oup.com/bioinformatics/article/36/10/3011/5756210
https://github.com/leylabmpi/DeepMAsED
https://anaconda.org/bioconda/deepmased/files
https://quay.io/repository/biocontainers/deepmased?tab=tags

ResMiCo (Residual neural network for Misassembled Contig identification)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011001
https://github.com/leylabmpi/ResMiCo
https://anaconda.org/bioconda/resmico/files
https://quay.io/repository/biocontainers/resmico?tab=tags

@amizeranschi
Contributor Author

Yet another QC tool for misassembly detection is metaMIC (Reference-free Misassembly Identification and Correction of metagenomic assemblies). This one uses a random forest classifier instead of deep learning, and it also starts from BAM + FASTA (while also requiring a samtools pileup file):

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02810-y
https://github.com/ZhaoXM-Lab/metaMIC

Unfortunately, this package isn't in Bioconda or Biocontainers.

@jfy133
Member

jfy133 commented Jan 19, 2025

Short answer:

finding misassemblies from reads mapped to contigs

This sounds in scope to me! Maybe propose this on the Slack channel first!

2 participants