gzip files created from multiple gzip inputs with cat not fully processed #31

Closed
rzelle-lallemand opened this issue Jan 27, 2025 · 4 comments

@rzelle-lallemand

Hi, my colleague @mperisin-lallemand and I dug a little deeper to understand why Kun-peng doesn't classify all reads in some sequence files (see also #30). Here is an example that is reproducible for us:

Input

1000 read ONT dataset https://github.com/MaestSi/MetONTIIME/blob/master/Zymo-GridION-EVEN-BB-SN_sup_pass_filtered_27F_1492Rw_1000_reads.fastq.gz
(see https://github.com/MaestSi/MetONTIIME/tree/master?tab=readme-ov-file#test-dataset )

Steps to reproduce:

  1. By default, Kun-peng 0.7.4 classifies all 1000 reads.

kun-peng/kun_peng classify --db kun-peng/k2_pluspfp_20240904 --chunk-dir temp-7 --output-dir kun-peng_out_ont-zymo-original -p 20 Zymo-GridION-EVEN-BB-SN_sup_pass_filtered_27F_1492Rw_1000_reads.fastq.gz

  2. I then split the fastq.gz file into three parts with seqkit v2.8.2 (seqkit split -p 3 Zymo-original.fastq.gz) per https://bioinf.shenwei.me/seqkit/usage/#split
  3. I then recombined the three parts with either cat or zcat

cat Zymo-original.fastq.gz.split/Zymo-original.part_001.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_002.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_003.fastq.gz > Zymo-combined-from-3.fastq.gz

and

zcat Zymo-original.fastq.gz.split/Zymo-original.part_001.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_002.fastq.gz Zymo-original.fastq.gz.split/Zymo-original.part_003.fastq.gz > Zymo-combined-from-3-zcat.fastq.gz

  4. I then ran both versions through Kun-peng 0.7.4:

kun-peng/kun_peng classify --db kun-peng/k2_pluspfp_20240904 --chunk-dir temp-10 --output-dir kun-peng_out_ont-zymo-combined-3 -p 20 Zymo-combined-from-3.fastq.gz

kun-peng/kun_peng classify --db kun-peng/k2_pluspfp_20240904 --chunk-dir temp-11 --output-dir kun-peng_out_ont-zymo-combined-3-zcat -p 20 Zymo-combined-from-3-zcat.fastq.gz

  5. The cat version generates an "output_1.txt" file with 334 lines, while the zcat version generates an "output_1.txt" file with the full 1000 lines.
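To cross-check the line counts above against the inputs, the reads in each combined file can be counted independently of Kun-peng. The sketch below is an illustration only: `count_fastq_records` is a hypothetical helper, and it assumes four lines per FASTQ record. It checks the gzip magic bytes first because zcat writes decompressed data, so a file created by redirecting zcat is plain FASTQ despite its .gz name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def count_fastq_records(path):
    """Count reads in a FASTQ file, assuming four lines per record.

    Falls back to plain text when the file lacks the gzip magic bytes,
    which is the case for output redirected from zcat.
    """
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    # gzip.open reads through all members of a concatenated gzip file.
    with opener(path, "rb") as handle:
        return sum(1 for _ in handle) // 4
```

Calling `count_fastq_records("Zymo-combined-from-3.fastq.gz")` and comparing the result with the number of lines in "output_1.txt" would show whether reads were dropped during classification.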

It's our understanding that using "cat" to combine multiple .gz files is valid usage (see e.g. https://stackoverflow.com/a/8005155/1712389 ), and this hasn't caused issues in other downstream software as far as we're aware. ONT sequencers by default generate multiple fastq.gz files per sequence run, so ONT sequence data preprocessing usually includes a concatenation step.
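A minimal Python sketch of why this is valid: the gzip format allows several members back to back, so joining compressed files byte for byte (which is what cat does) yields a stream that decompresses to the joined contents. The two tiny reads below are made-up stand-ins for real .fastq.gz parts.

```python
import gzip

# Two independently compressed chunks, standing in for two .fastq.gz parts.
part_a = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
part_b = gzip.compress(b"@read2\nTTGA\n+\nIIII\n")

# Byte-level concatenation is what `cat a.gz b.gz > ab.gz` does.
combined = part_a + part_b

# gzip.decompress walks every member in the stream.
text = gzip.decompress(combined)
record_count = text.count(b"\n") // 4  # four lines per record
print(record_count)  # 2
```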

@eric9n
Owner

eric9n commented Jan 29, 2025

Thank you for the detailed report. I found the root cause of this issue:
The problem is related to how we use the flate2 crate for reading gzip files. When cat is used to combine multiple gzip files, the result is a file with multiple gzip members (each with its own header and footer). Our current implementation uses GzDecoder from flate2, which reads only the first gzip member, so only part of the data is processed (334 reads).
The solution is to switch from flate2::read::GzDecoder to flate2::read::MultiGzDecoder which properly handles files with multiple gzip members. According to flate2's documentation:

A gzip file consists of a series of members concatenated one after another. MultiGzDecoder decodes all members of a file.
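The difference between decoding one member and decoding all of them can be sketched outside Rust. The Python analogue below is not Kun-peng's code and does not use flate2; `decode_first_member` and `decode_all_members` are hypothetical names, and the member sizes are an assumption chosen to mirror the 334-of-1000 symptom in the report.

```python
import gzip
import zlib


def decode_first_member(data):
    """Decode one gzip member and stop, ignoring any trailing members."""
    decoder = zlib.decompressobj(wbits=31)  # 31 = 16 + 15 selects gzip framing
    return decoder.decompress(data) + decoder.flush()


def decode_all_members(data):
    """Decode every gzip member in turn and join the results."""
    chunks = []
    while data:
        decoder = zlib.decompressobj(wbits=31)
        chunks.append(decoder.decompress(data) + decoder.flush())
        data = decoder.unused_data  # bytes left after this member's trailer
    return b"".join(chunks)


record = b"@read\nACGT\n+\nIIII\n"
# Three members, as produced by concatenating three compressed parts.
stream = b"".join(gzip.compress(record * n) for n in (334, 333, 333))

first_only = decode_first_member(stream).count(b"\n") // 4
all_members = decode_all_members(stream).count(b"\n") // 4
print(first_only)   # 334
print(all_members)  # 1000
```

The single-member function stops at the first trailer and silently leaves the rest of the stream unread, which matches the partial output described above; looping until no bytes remain recovers every read.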

@eric9n
Owner

eric9n commented Jan 29, 2025

I'll implement this fix in the next release.

@rzelle-lallemand
Author

I see this should be fixed in https://github.com/eric9n/Kun-peng/releases/tag/v0.7.5, thanks!

@rzelle-lallemand
Author

We can confirm the new release fixes this issue. Thanks for the quick turnaround!
