Compress FASTA files with contigs and scaffolds using bgzip instead of gzip, to enable file indexing #746
Comments
Do you have examples of tools that require this type of gzipping? This would require quite a fundamental change to the official nf-core modules, so I want to check whether it's worth the effort.
I've been looking into QC tools for finding misassemblies from reads mapped to contigs. Two promising examples are DeepMAsED (Deep learning for Metagenome Assembly Error Detection) and ResMiCo (Residual neural network for Misassembled Contig identification). Both of them use the BAM files created by the pipeline, and they require indexing of both the BAM files with mapped reads and the FASTA files with assembled contigs. Indexing the FASTA files fails because they are compressed with plain gzip rather than bgzip. On a side note, these tools are available in Bioconda and Biocontainers, so they'd make good candidates for adding as nf-core modules and incorporating into nf-core/mag to complement MetaQUAST for assembly QC. Any thoughts on this? I'm including some relevant links below: DeepMAsED (Deep learning for Metagenome Assembly Error Detection) ResMiCo (Residual neural network for Misassembled Contig identification)
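To make the failure mode concrete: a bgzip file is a series of gzip members whose header carries an extra "BC" subfield (the BGZF layout from the SAM/BAM specification), and plain gzip output lacks that subfield, so indexers reject it. The following is a minimal sketch of a check for this, assuming only the Python standard library; the function name `is_bgzf` is hypothetical and is not part of nf-core/mag or any of the tools mentioned above.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header.

    Hypothetical helper for illustration only. It inspects just the first
    gzip member and does not validate the rest of the file.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN
        id1, id2, method, flags, _mtime, _xfl, _os, xlen = struct.unpack(
            "<4BI2BH", header
        )
        # Must be gzip (1f 8b), deflate (8), with the FEXTRA flag (0x04) set
        if (id1, id2, method) != (0x1F, 0x8B, 8) or not flags & 0x04:
            return False
        extra = handle.read(xlen)

    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<2BH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

A pipeline step could call something like this on a `*.fa.gz` output before handing it to a tool that needs an index, and recompress only when it returns False.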
Yet another QC tool for misassembly detection is metaMIC (Reference-free Misassembly Identification and Correction of metagenomic assemblies). This one uses a random forest classifier instead of deep learning, and it also starts from BAM + FASTA (while also requiring a samtools pileup file): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02810-y Unfortunately, this package isn't in Bioconda or Biocontainers.
Short answer:
This sounds in scope to me! Maybe propose this on the Slack channel first!
Description of feature
Certain downstream analysis tools expect the FASTA files with contigs and scaffolds to be compressed with bgzip instead of gzip, for file indexing purposes. To save the time spent decompressing and recompressing, it would be helpful if the MEGAHIT and SPAdes modules compressed their outputs directly with bgzip instead of gzip.
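For anyone unfamiliar with the difference: bgzip output is still valid gzip, but the data is split into independent blocks of at most 64 KiB, each with a "BC" extra field recording the block size, which is what makes random access and indexing possible. Below is a rough pure-Python sketch of that block layout, following the BGZF description in the SAM/BAM specification. The names `bgzf_block` and `write_bgzf` are hypothetical, and this is an illustration only; in a real module the `bgzip` program from htslib would do this job.

```python
import struct
import zlib

# Standard 28-byte empty BGZF block that marks end of file
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00"
    " 00 00 00 00 00 00 00 00"
)
# Uncompressed bytes per block, kept below 64 KiB so the block fits in 16 bits
MAX_BLOCK_INPUT = 0xFF00


def bgzf_block(data):
    """Wrap one chunk of data in a single BGZF block (illustrative sketch)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    payload = compressor.compress(data) + compressor.flush()
    block_size = 18 + len(payload) + 8  # header + payload + trailer
    header = struct.pack(
        "<4BI2BH2BHH",
        0x1F, 0x8B, 8, 4,     # gzip magic, deflate, FEXTRA flag
        0,                    # MTIME
        0, 0xFF,              # XFL, OS (unknown)
        6,                    # XLEN: one 6-byte subfield
        ord("B"), ord("C"),   # subfield identifier
        2,                    # subfield payload length
        block_size - 1,       # BSIZE: total block size minus one
    )
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + payload + trailer


def write_bgzf(data, path):
    """Write bytes to path as a sequence of BGZF blocks plus the EOF marker."""
    with open(path, "wb") as out:
        for start in range(0, len(data), MAX_BLOCK_INPUT):
            out.write(bgzf_block(data[start:start + MAX_BLOCK_INPUT]))
        out.write(BGZF_EOF)
```

Because every block is a complete gzip member, the result can still be read by ordinary gzip tools, so switching the compressor should not break consumers that only decompress the file.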