Benchmarking kun_peng with ZYMO D6311 sample #30

humbleflowers · 2025-01-03T12:59:40Z

I am benchmarking Kun-peng on the ZYMO D6311 standard, which contains 10 species in log-distributed proportions.
Theoretical Composition Based on Genomic DNA: Listeria monocytogenes - 89.1%, Pseudomonas aeruginosa - 8.9%, Bacillus subtilis - 0.89%, Saccharomyces cerevisiae - 0.89%, Escherichia coli - 0.089%, Salmonella enterica - 0.089%, Lactobacillus fermentum - 0.0089%, Enterococcus faecalis - 0.00089%, Cryptococcus neoformans - 0.00089%, and Staphylococcus aureus - 0.000089%
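For a sense of scale, the percentages above can be converted into an expected sequencing yield per species. This is only an illustrative sketch: it assumes the genomic-DNA percentages translate directly into fractions of sequenced bases, and `total_bases` is a hypothetical run yield, not a value reported in this thread.

```python
# Theoretical genomic-DNA composition in percent, as listed above.
COMPOSITION_PCT = {
    "Listeria monocytogenes": 89.1,
    "Pseudomonas aeruginosa": 8.9,
    "Bacillus subtilis": 0.89,
    "Saccharomyces cerevisiae": 0.89,
    "Escherichia coli": 0.089,
    "Salmonella enterica": 0.089,
    "Lactobacillus fermentum": 0.0089,
    "Enterococcus faecalis": 0.00089,
    "Cryptococcus neoformans": 0.00089,
    "Staphylococcus aureus": 0.000089,
}


def expected_bases(total_bases):
    """Expected bases per species, assuming yield is proportional to gDNA share."""
    return {sp: total_bases * pct / 100.0 for sp, pct in COMPOSITION_PCT.items()}


if __name__ == "__main__":
    # 10e9 is a hypothetical total yield chosen only for illustration.
    for species, bases in expected_bases(10e9).items():
        print(f"{species:28s} {bases:18,.1f}")
```

Under that assumption, a hypothetical 10 Gb run would contain roughly 8.9 kb of Staphylococcus aureus, so the rarest members may be covered by only a handful of long reads even when classification works correctly.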
The command I ran was:

kun-peng classify -p 30 --db /mnt/wwbixprd/kraken2_wwm_ont/k2-241224/ --output-dir test/ --chunk-dir chunk-test/ -S /mnt/wwbixprd/ont_zymo_samples/zym_D6311_r104.10gb.fastq.gz

I got the following output:

100.00  1       0       R       1       root
100.00  1       0       R1      131567    cellular organisms
100.00  1       0       D       2           Bacteria
100.00  1       0       K       1783272       Bacillati
100.00  1       0       P       1239            Bacillota
100.00  1       0       C       91061             Bacilli
100.00  1       0       O       1385                Bacillales
100.00  1       0       F       186820                Listeriaceae
100.00  1       1       G       1637                    Listeria

kun-peng identifies only a single taxon, and only at genus level. Is there a way to improve classification so that all species are identified? My database contains 25K organisms.
Thank you.
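One quick sanity check on a report like the one above is to compare the number of reads it accounts for with the number of reads in the input. The sketch below assumes the six-column, Kraken 2-style layout shown in the output (percentage, clade reads, direct reads, rank code, taxid, name); the function names are made up for illustration.

```python
def parse_report(text):
    """Parse a Kraken 2-style report into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The name column may contain spaces, so split at most 5 times.
        pct, clade, direct, rank, taxid, name = line.split(None, 5)
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),
        })
    return rows


def total_reads(rows):
    """Reads covered by the report: the unclassified row (rank U) plus root (taxid 1)."""
    return sum(r["clade_reads"] for r in rows if r["rank"] == "U" or r["taxid"] == 1)


if __name__ == "__main__":
    report = """\
100.00  1       0       R       1       root
100.00  1       1       G       1637                    Listeria
"""
    print(total_reads(parse_report(report)))
```

For the report in this post the total is a single read, which points to the input not being read in full rather than to a database problem.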


eric9n · 2025-01-03T13:58:37Z

To help investigate this issue, I would need to:

Check the database you're using:
Could you share where you downloaded the database from?
Or provide a link to the database you used?

Access the sample data:
Could you provide a link to the ZYMO D6311 sample data you used?
Or tell us where we can download the same sample?

mperisin-lallemand · 2025-01-21T20:49:29Z

Hi @eric9n and @humbleflowers. I am running into this same issue with ONT reads (only the first read is classified). This appears to be an ONT-specific issue, because when I classify a single Illumina read file (R1 in this case), all the reads are classified. Perhaps Kun-peng is having an issue with ONT fastq headers?

mperisin-lallemand · 2025-01-21T21:11:57Z

@eric9n and @humbleflowers. The issue in my case was due to long sequence read names that exceeded the maximum sequence name length allowed by the SAM format (https://www.biostars.org/p/482930/; SAM specification, https://samtools.github.io/hts-specs/SAMv1.pdf: "QNAME limited to 254 bytes (was 255). (Aug 2015)"). I shortened the ONT sequence names with: zcat file.fastq.gz | sed 's/ parent_read_id=[^ ]* / /g' | gzip > file.modified.fastq.gz, and then Kun-peng classified all reads.
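The header check and the sed substitution above can be sketched in Python as follows. This is an illustration rather than part of Kun-peng: it assumes plain four-line FASTQ records, reuses the 254-byte figure quoted from the SAM specification, and the function names are hypothetical. Note that a later comment in this thread attributes the failure to the gzip container rather than to header length.

```python
import re

SAM_QNAME_MAX = 254  # limit quoted above from the SAM specification


def shorten_header(header):
    """Drop the parent_read_id field, mirroring the sed expression above."""
    return re.sub(r" parent_read_id=[^ ]* ", " ", header)


def long_headers(fastq_lines, limit=SAM_QNAME_MAX):
    """Yield header lines longer than `limit` bytes.

    Assumes four-line FASTQ records, so every fourth line is a header.
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            header = line.rstrip("\n")
            if len(header.encode()) > limit:
                yield header


if __name__ == "__main__":
    # The field names in this example header are invented for illustration.
    print(shorten_header("@read1 runid=abc parent_read_id=read1 ch=12"))
```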

eric9n · 2025-01-23T00:48:09Z

Could you please provide a small example of the problematic FASTQ sequence? Just need one record that contains the long sequence name. This will help me reproduce and test the issue.

mperisin-lallemand · 2025-01-23T17:19:23Z

@eric9n, here are the first 5 ONT reads with long sequence names. GitHub will not let me upload a fastq file, so I changed the extension to txt.

example_ont_reads.txt

Q-chen27 · 2025-01-25T12:01:40Z

@eric9n @mperisin-lallemand @humbleflowers I've tested the example_ont_reads.txt with Kun-peng v0.7.4 against a 273GB bacteria-only database, and all sequences were successfully classified.

C b855900f-87f6-4fd5-a4dd-76c69183d0fc 48296 2852 0:10 48296:2 0:2 48296:7 2:2 0:1 2:6 0:522 2:2 0:390
C 0cac2675-101c-4d4e-9877-cc9c5f39788f 87883 10544 0:23 48296:1 2:2 0:1 2:5 0:21 2:1 0:11 431946:1 2:1 0:1 2:14 1224:3 0:1 1224:1 0:13 1224:6 0:1 2:4 0:1 1224:1 2:1 1224:2 2:2 0:1 2:1 0:2 2:1 0:1 2:1 0:1 2:2 0:1 2:2 0:3 2:1 0:35 2:1 0:2 2:15 0:1 2:1 0:1 2:5 0:1 2:1 0:15 2:2 0:2 87883:3 0:1 87883:1 0:13 2:1 0:6 2:12 0:12 2:3 0:5 2:1 0:22 2:4 0:23 2:1 0:1 2:10 0:1 2:1 0:2 2:1 0:1 2:2 0:5 2:1 0:1 2:1 0:42 2:7 0:2 2:1 0:16 1783272:9 0:1 76335:6 0:15 526226:6 1783272:4 0:26 76335:1 0:23 2:1 0:3 1783272:4 0:15 1783272:23 2:6 0:1 2:2 0:10 1783272:1 0:1 1783272:1 2:1 0:1 2:1 0:12 2:1 0:1 2:19 0:1 2:1 0:9 1783272:2 0:1 1783272:5 0:1 1783272:1 0:9 54914:1 2:1 54914:1 2:1 54914:1 2:1 54914:1 2:9 0:1 2:3 0:1 2:5 0:1 2:15 0:1 2:4 0:1 2:8 0:1 2:15 0:2 2:32 0:41 2:9 54914:2 0:1 54914:1 0:3 54914:2 0:4 2:1 0:2 2:10 0:1 2:4 0:28 2:11 0:9 2:5 0:1 1239:1 0:1 1239:1 1491:1 0:9 2:1 0:33 2:6 0:19 526226:1 0:54 2:1 0:2 2:1 0:22 2:3 0:19 3100176:1 0:2 3100176:1 0:3 3100176:1 0:21 526226:1 3100176:3 0:4 373687:1 0:24 2:1 0:1 2:3 0:1 2:1 0:1 2:3 0:3 2:1 0:13 2:5 0:1 2:4 0:41 2:1 0:1 2:1 0:2 2:1 0:5 2:4 0:14 3100176:1 0:18 3100176:1 0:1 3100176:1 0:8 3100176:1 0:9 2:1 0:1 2:1 0:1 3100176:1 0:1 3100176:3 0:1 2:1 3100176:1 0:1 3100176:1 0:1 3100176:1 0:2 3100176:2 2:1 0:1 2:1 0:3 2:1 0:5 2:1 0:3 2:1 0:1 2:1 0:1 2:1 0:14 2:1 0:19 2:2 0:1 2:3 0:3 2:1 0:1 2:3 0:2 2:1 0:1 2:1 0:1 2:9 0:1 2:1 0:1 2:3 0:12 431946:1 2:7 0:1 2:17 0:12 2:1 0:4 2:1 0:2 2:1 0:15 2:11 0:53 2:1 0:1 2:4 0:10 2:2 0:1 2:11 0:1 2:19 0:1 2:4 0:1 2:3 0:2 431946:1 0:5 431946:2 0:1 2:1 0:19 2:2 0:31 2:1 0:3 2:1 0:2 2:11 0:1 2:1 0:14 2:1 0:3 2:1 0:1 2:5 0:9 2:7 0:12 2:4 0:21 2:6 0:1 2:3 0:1 2:7 0:1 2:2 0:10 2:2 0:1 2:11 0:1 2:5 0:33 2:3 0:40 2:1 0:1 2:1 0:2 2:1 0:1 2:3 0:1 2:1 0:2 2:10 0:1 2:2 0:1 2:1 0:1 2:1 0:27 2:1 0:1 2:1 0:1 2:9 0:12 2:1 0:18 2:1 0:11 2:2 0:1 2:2 0:9 2:7 0:10 2:1 0:11 2:14 0:2 2:1 0:1 2:3 0:3 2:13 0:9 2:2 0:11 2:1 0:28 1236:1 2:2 0:4 2:2 0:1 2:1 0:1 2:1 0:3 2:3 0:15 2:1 0:1 2:4 0:17 2:1 0:1 2:4 0:1 2:1 0:1 
2:23 0:10 2:1 0:4 2:1 0:1 2:2 0:4 2:1 0:2 2:1 0:4 2:3 0:2 2:3 0:1 2:3 0:2 2:1 0:2 2:1 0:1 2:3 0:2 2:4 0:1 2:2 0:2 2:3 0:1 2:1 3100176:1 0:1 3100176:2 0:1 3100176:1 2:4 0:1 2:1 0:1 2:4 0:4 2:1 0:2 2:1 0:1 2:1 0:1 2:2 0:1 2:1 0:45 2:5 0:1 2:24 0:3 2:2 0:2 3100176:1 0:1 3100176:2 0:2 2:3 0:20 2:5 0:1 2:5 0:1 2:5 0:6 2:2 0:1 2:3 0:9 2:9 0:9 2:8 0:10 2:18 0:3 2:1 0:4 2:1 0:4 2:9 0:1 2:8 0:1 2:1 0:1 2:13 0:5 2:1 0:4 2:14 0:16 2:1 0:10 2:1 0:1 2:1 0:42 2:1 0:1 2:5 0:31 2:3 0:1 2:4 1224:1 0:15 87883:1 0:37 2:2 0:1 2:10 0:60 2:3 0:1 2:5 0:25 87883:1 0:1 87883:2 2:1 0:1 2:2 0:1 2:3 0:2 1224:1 2:1 0:3 1224:2 0:1 1224:1 0:1 1224:1 87883:1 2:1 0:13 1224:3 0:1 1224:3 0:10 1224:14 0:9 1224:1 0:7 1224:2 2:1 1224:1 0:2 87883:2 0:1 87883:1 0:1 87883:2 0:43 1224:1 2:7 0:1 2:3 1224:2 0:1 2:1 1224:2 0:36 2:4 0:49 2:1 0:1 2:1 0:1 2:2 0:1 2:2 0:3 2:1 0:12 2:1 0:3 2:1 0:1 2:3 0:1 2:2 0:3 2:1 0:1 2:1 0:1 2:1 0:2 2:1 0:2 2:15 0:1 2:1 0:1 2:3 0:6 2707174:1 0:2 2:5 0:1 2:2 0:3 2:2 0:2 87883:3 0:1 87883:1 0:3 87883:1 0:13 2:6 0:20 2:6 0:1 2:2 0:1 2:1 0:27 2:1 0:1 2:5 0:10 2:3 0:1 2:14 0:1 2:5 0:7 2:1 0:3 2:11 0:1 2:6 0:1 2:1 0:2 2:1 0:45 2:5 0:40 526226:1 0:16
C 23fdab33-a418-460c-9232-0f4db5722ccb 48296 6515 0:14 48296:7 2:2 0:1 2:5 0:32 2608247:1 0:1239 1783272:1 0:1 2:1 0:1 2:1 0:1 2:1 0:1 2:1 0:1 2:1 2877941:1 0:821
C f87ce81c-2d4e-49c0-a166-b5cc37fb5e32 48296 1380 0:8 48296:1 0:2 48296:7 2:2 0:1 2:5 0:310 1812112:1 0:119
C 89a96608-1899-49e1-b077-767a40d5ae27 2 3928 0:14 48296:1 0:49 61645:1 0:2 2922858:1 0:41 2:1 0:66 72407:1 0:89 115981:1 0:4 1160769:1 0:2 525370:1 0:13 1239:1 0:1 470:1 0:80 2993654:1 2763540:1 0:28 748003:1 0:6 29347:1 0:162 1915078:1 2:1 0:1 28131:1 0:1 2833771:1 0:21 287:1 0:79 2675710:1 0:81 161896:1 0:1 2:1 297246:1 0:69 1241834:1 0:9 115981:1 0:5 2:1 1202538:1 0:6 2917715:1 0:57 1970738:1 0:1 1921565:1 0:106 374981:1 0:17 3066272:1 2993654:1 0:13 37637:1 0:8 2490996:1 0:33 2:1 0:109 2675710:1 1194418:1 0:2 2982694:1 0:54 2:3 0:5
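The records above appear to follow the Kraken 2-style per-read layout: status (C or U), read name, assigned taxid, read length, then a list of taxid:count pairs. A minimal sketch for tallying those pairs is shown below; it assumes single-end output like the lines above, and the function name is hypothetical.

```python
from collections import Counter


def parse_read_record(line):
    """Split one per-read output line into fields and tally hits per taxid.

    Assumes single-end records, where the length column is a plain integer.
    """
    status, read_id, taxid, length, *pairs = line.split()
    hits = Counter()
    for pair in pairs:
        tax, _, count = pair.partition(":")
        if count.isdigit():  # ignore tokens that are not taxid:count pairs
            hits[tax] += int(count)
    return {
        "classified": status == "C",
        "read_id": read_id,
        "taxid": int(taxid),
        "length": int(length),
        "hits": hits,
    }


if __name__ == "__main__":
    rec = parse_read_record(
        "C f87ce81c-2d4e-49c0-a166-b5cc37fb5e32 48296 1380 0:8 48296:1 0:2 "
        "48296:7 2:2 0:1 2:5 0:310 1812112:1 0:119"
    )
    print(rec["taxid"], rec["hits"].most_common(3))
```

For the fourth record above, this gives 8 hits for taxid 48296, 7 for taxid 2, 1 for taxid 1812112, and 440 positions with taxid 0.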

rzelle-lallemand · 2025-01-30T02:42:15Z

@humbleflowers, can you retry with the latest Kun-peng release (v0.7.5 or newer) to see if that resolves your issue?

The long sequence names of our ONT data turned out to be a red herring. When we shortened the sequence names, we generated a new (non-concatenated) .gz archive that could be successfully parsed by Kun-peng v0.7.4, so we thought the sequence name length was the issue. But per #31 it turned out to be an issue with the type of .gz file.

P.S. You can probably also skip the "-S" flag which is only necessary if using paired-end data where R1 and R2 are stored in a single file.
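The concatenated-.gz behaviour described above can be reproduced with the Python standard library. The sketch below is not Kun-peng's code; it only illustrates the general failure mode, in which a decoder that stops after the first gzip member sees just the first chunk of a file built with `cat a.gz b.gz > all.gz`. The helper names are made up for this example.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tell zlib to expect a gzip header and trailer


def make_concatenated_gzip(parts):
    """Join independently gzipped chunks, like `cat a.gz b.gz > all.gz`."""
    return b"".join(gzip.compress(p) for p in parts)


def read_first_member_only(blob):
    """Decode one gzip member and stop, ignoring any members after it."""
    return zlib.decompressobj(wbits=GZIP_WBITS).decompress(blob)


def read_all_members(blob):
    """Decode every member by restarting the decoder on the leftover bytes."""
    out = []
    while blob:
        decoder = zlib.decompressobj(wbits=GZIP_WBITS)
        out.append(decoder.decompress(blob))
        blob = decoder.unused_data  # bytes belonging to the next member, if any
    return b"".join(out)


if __name__ == "__main__":
    r1 = b"@r1\nACGT\n+\nIIII\n"
    r2 = b"@r2\nTTGA\n+\nIIII\n"
    blob = make_concatenated_gzip([r1, r2])
    print(read_first_member_only(blob).count(b"\n") // 4, "record(s) from a first-member-only read")
    print(read_all_members(blob).count(b"\n") // 4, "record(s) when all members are read")
```

Re-compressing the fully decoded stream into a single member (for example `zcat all.gz | gzip > single.gz`) is a workaround for tools that only read the first member, which is consistent with why the sed pipeline earlier in the thread appeared to fix the problem.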

humbleflowers · 2025-01-30T13:47:09Z

Hello all,

Thanks for looking into it, @rzelle-lallemand. I will try again with the new update.

@eric9n In case you want to look at the sample, I was running kun-peng against zymo D6322: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR728/ERR7287988/zymo_hmw_r104.fastq.gz

Thank you. Apologies for the late response, as I was away.

Q-chen27 · 2025-02-01T05:40:20Z

@humbleflowers
Thank you for providing the sample information. I've analyzed the ZYMO D6311 dataset using the latest version of Kun-peng with our comprehensive pan-domain database (204,477 genomes, 4.3TB) recently described in our preprint (https://www.biorxiv.org/content/10.1101/2024.12.19.629356v1.abstract). Here's the command I used:

kun-peng classify -p 20 --db ${db_path} --chunk-dir ${chunk_path} --output-dir ${out_path} -S zymo_hmw_r104.fastq.gz

After running Bracken (v2.9) for abundance estimation, we successfully identified eight out of the ten species from the ZYMO D6311 mock community.

The top 20 species and their relative abundances are as follows:

Thank you!


rzelle-lallemand mentioned this issue Jan 27, 2025

gzip files created from multiple gzip inputs with cat not fully processed #31

Closed
