gzip files created from multiple gzip inputs with cat not fully processed #31
Comments
Thank you for the detailed report. I found the root cause of this issue: A gzip file consists of a series of members concatenated one after another. MultiGzDecoder decodes all members of a file.
I'll implement this fix in the next release.
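To illustrate the point about gzip members, here is a minimal Python sketch (not Kun-peng's actual code, which is not shown in this thread). It builds a three-member gzip stream in memory, the same byte layout that `cat a.gz b.gz c.gz` produces, and compares a decoder that stops after the first member with one that keeps going. The helper names `decode_first_member` and `decode_all_members` are hypothetical and exist only for this example.

```python
import gzip
import zlib

# Three independently compressed chunks; joining the bytes is what
# `cat part1.gz part2.gz part3.gz` does on disk.
parts = [b"read1\n", b"read2\n", b"read3\n"]
combined = b"".join(gzip.compress(p) for p in parts)


def decode_first_member(data):
    """Decode only the first gzip member and ignore whatever follows it."""
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    return d.decompress(data)


def decode_all_members(data):
    """Decode every gzip member in turn until the input is exhausted."""
    out = []
    while data:
        d = zlib.decompressobj(wbits=31)
        out.append(d.decompress(data))
        out.append(d.flush())
        data = d.unused_data  # bytes that belong to the next member
    return b"".join(out)


print(decode_first_member(combined))
print(decode_all_members(combined))
```

The single-member decoder silently returns only the first chunk, which matches the symptom reported in this issue. Python's own `gzip.decompress` already walks all members, much like the MultiGzDecoder mentioned above.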
I see this should be fixed in https://github.com/eric9n/Kun-peng/releases/tag/v0.7.5, thanks!
We can confirm the new release fixes this issue. Thanks for the quick turnaround!
Hi, my colleague @mperisin-lallemand and I dug a little deeper to understand why Kun-peng does not classify all reads in some sequence files (see also #30). An example that is reproducible for us:
Input
1000 read ONT dataset https://github.com/MaestSi/MetONTIIME/blob/master/Zymo-GridION-EVEN-BB-SN_sup_pass_filtered_27F_1492Rw_1000_reads.fastq.gz
(see https://github.com/MaestSi/MetONTIIME/tree/master?tab=readme-ov-file#test-dataset )
Steps to reproduce:
kun-peng/kun_peng classify --db kun-peng/k2_pluspfp_20240904 --chunk-dir temp-7 --output-dir kun-peng_out_ont-zymo-original -p 20 Zymo-GridION-EVEN-BB-SN_sup_pass_filtered_27F_1492Rw_1000_reads.fastq.gz
Split the file into three parts (seqkit split -p 3 Zymo-original.fastq.gz) per https://bioinf.shenwei.me/seqkit/usage/#split, then recombine the parts with cat or zcat:
cat Zymo-original.fastq.gz.split/Zymo-original.part_001.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_002.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_003.fastq.gz > Zymo-combined-from-3.fastq.gz
and
zcat Zymo-original.fastq.gz.split/Zymo-original.part_001.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_002.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_003.fastq.gz > Zymo-combined-from-3-zcat.fastq.gz
kun-peng/kun_peng classify --db kun-peng/k2_pluspfp_20240904 --chunk-dir temp-10 --output-dir kun-peng_out_ont-zymo-combined-3 -p 20 Zymo-combined-from-3.fastq.gz
kun-peng/kun_peng classify --db kun-peng/k2_pluspfp_20240904 --chunk-dir temp-11 --output-dir kun-peng_out_ont-zymo-combined-3-zcat -p 20 Zymo-combined-from-3-zcat.fastq.gz
The cat version generates an "output_1.txt" file with 334 lines, while the zcat version generates an "output_1.txt" file with the full 1000 lines.
It's our understanding that using "cat" to combine multiple .gz files is valid usage (see e.g. https://stackoverflow.com/a/8005155/1712389 ), and this hasn't caused issues in other downstream software as far as we're aware. ONT sequencers by default generate multiple fastq.gz files per sequence run, so ONT sequence data preprocessing usually includes a concatenation step.
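One way to check whether a file like the cat-combined one contains several gzip members is to count the lines in each member separately. The sketch below is a hypothetical diagnostic, not part of Kun-peng or seqkit. It uses synthetic four-line FASTQ records and assumes the three parts hold 334, 333 and 333 reads; that split is an assumption chosen so the first part matches the 334 lines seen above.

```python
import gzip
import zlib


def lines_per_member(data):
    """Return the number of newline-terminated lines in each gzip member."""
    counts = []
    while data:
        d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
        text = d.decompress(data) + d.flush()
        counts.append(text.count(b"\n"))
        data = d.unused_data  # remaining bytes start the next member
    return counts


def fake_fastq(n):
    """Build n synthetic four-line FASTQ records (illustrative data only)."""
    return b"".join(b"@r%d\nACGT\n+\nIIII\n" % i for i in range(n))


# Stand-in for `cat part_001.gz part_002.gz part_003.gz`; sizes are assumed.
combined = b"".join(gzip.compress(fake_fastq(n)) for n in (334, 333, 333))

counts = lines_per_member(combined)
reads = [c // 4 for c in counts]  # four lines per FASTQ record
print(len(counts), reads, sum(reads))
```

For a real file, the same function could be called on the bytes read with `open(path, "rb").read()`. More than one entry in the result means the file is multi-member, and the first entry shows how many lines a first-member-only decoder would see.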