Benchmarking kun_peng with ZYMO D6311 sample #30

Open
humbleflowers opened this issue Jan 3, 2025 · 9 comments
@humbleflowers

humbleflowers commented Jan 3, 2025

I am benchmarking kun-peng on the ZYMO D6311 standard, which contains 10 species in log-distributed proportions.
Theoretical Composition Based on Genomic DNA: Listeria monocytogenes - 89.1%, Pseudomonas aeruginosa - 8.9%, Bacillus subtilis - 0.89%, Saccharomyces cerevisiae - 0.89%, Escherichia coli - 0.089%, Salmonella enterica - 0.089%, Lactobacillus fermentum - 0.0089%, Enterococcus faecalis - 0.00089%, Cryptococcus neoformans - 0.00089%, and Staphylococcus aureus - 0.000089%
The command I ran was:

kun-peng classify -p 30 --db /mnt/wwbixprd/kraken2_wwm_ont/k2-241224/ --output-dir test/ --chunk-dir chunk-test/ -S /mnt/wwbixprd/ont_zymo_samples/zym_D6311_r104.10gb.fastq.gz

I got the following output:

100.00  1       0       R       1       root
100.00  1       0       R1      131567    cellular organisms
100.00  1       0       D       2           Bacteria
100.00  1       0       K       1783272       Bacillati
100.00  1       0       P       1239            Bacillota
100.00  1       0       C       91061             Bacilli
100.00  1       0       O       1385                Bacillales
100.00  1       0       F       186820                Listeriaceae
100.00  1       1       G       1637                    Listeria

kun-peng identifies only one organism, and only at the genus level. Is there a way to improve classification so that all species are identified? My database contains 25K organisms.
Thank you.

@eric9n
Owner

eric9n commented Jan 3, 2025

To help investigate this issue, I would need to:

Check the database you're using:
Could you share where you downloaded the database from?
Or provide a link to the database you used?

Access the sample data:
Could you provide a link to the ZYMO D6311 sample data you used?
Or tell us where we can download the same sample?

@mperisin-lallemand

Hi @eric9n and @humbleflowers. I am running into this same issue with ONT reads (only the first read is classified). This appears to be an ONT specific issue because when I classify only one Illumina read (R1 in this case), all the reads are classified. Perhaps Kun-peng is having an issue with ONT fastq headers?

@mperisin-lallemand

@eric9n and @humbleflowers. The issue in my case was due to long sequence read names that exceeded the maximum sequence name length allowed by the SAM format (https://www.biostars.org/p/482930/; SAM specification, https://samtools.github.io/hts-specs/SAMv1.pdf: "QNAME limited to 254 bytes (was 255). (Aug 2015)"). I shortened the ONT sequence names with:

zcat file.fastq.gz | sed 's/ parent_read_id=[^ ]* / /g' | gzip > file.modified.fastq.gz

After that, Kun-peng classified all reads.

@eric9n
Owner

eric9n commented Jan 23, 2025

Could you please provide a small example of the problematic FASTQ sequence? Just need one record that contains the long sequence name. This will help me reproduce and test the issue.

@mperisin-lallemand

@eric9n, here are the first 5 ONT reads with long sequence names. GitHub will not let me upload a FASTQ file, so I changed the extension to .txt.

example_ont_reads.txt

@Q-chen27
Collaborator

@eric9n @mperisin-lallemand @humbleflowers I've tested example_ont_reads.txt with Kun-peng v0.7.4 against a 273 GB bacteria-only database; all sequences were successfully classified.

C b855900f-87f6-4fd5-a4dd-76c69183d0fc 48296 2852 0:10 48296:2 0:2 48296:7 2:2 0:1 2:6 0:522 2:2 0:390
C 0cac2675-101c-4d4e-9877-cc9c5f39788f 87883 10544 0:23 48296:1 2:2 0:1 2:5 0:21 2:1 0:11 431946:1 2:1 0:1 2:14 1224:3 0:1 1224:1 0:13 1224:6 0:1 2:4 0:1 1224:1 2:1 1224:2 2:2 0:1 2:1 0:2 2:1 0:1 2:1 0:1 2:2 0:1 2:2 0:3 2:1 0:35 2:1 0:2 2:15 0:1 2:1 0:1 2:5 0:1 2:1 0:15 2:2 0:2 87883:3 0:1 87883:1 0:13 2:1 0:6 2:12 0:12 2:3 0:5 2:1 0:22 2:4 0:23 2:1 0:1 2:10 0:1 2:1 0:2 2:1 0:1 2:2 0:5 2:1 0:1 2:1 0:42 2:7 0:2 2:1 0:16 1783272:9 0:1 76335:6 0:15 526226:6 1783272:4 0:26 76335:1 0:23 2:1 0:3 1783272:4 0:15 1783272:23 2:6 0:1 2:2 0:10 1783272:1 0:1 1783272:1 2:1 0:1 2:1 0:12 2:1 0:1 2:19 0:1 2:1 0:9 1783272:2 0:1 1783272:5 0:1 1783272:1 0:9 54914:1 2:1 54914:1 2:1 54914:1 2:1 54914:1 2:9 0:1 2:3 0:1 2:5 0:1 2:15 0:1 2:4 0:1 2:8 0:1 2:15 0:2 2:32 0:41 2:9 54914:2 0:1 54914:1 0:3 54914:2 0:4 2:1 0:2 2:10 0:1 2:4 0:28 2:11 0:9 2:5 0:1 1239:1 0:1 1239:1 1491:1 0:9 2:1 0:33 2:6 0:19 526226:1 0:54 2:1 0:2 2:1 0:22 2:3 0:19 3100176:1 0:2 3100176:1 0:3 3100176:1 0:21 526226:1 3100176:3 0:4 373687:1 0:24 2:1 0:1 2:3 0:1 2:1 0:1 2:3 0:3 2:1 0:13 2:5 0:1 2:4 0:41 2:1 0:1 2:1 0:2 2:1 0:5 2:4 0:14 3100176:1 0:18 3100176:1 0:1 3100176:1 0:8 3100176:1 0:9 2:1 0:1 2:1 0:1 3100176:1 0:1 3100176:3 0:1 2:1 3100176:1 0:1 3100176:1 0:1 3100176:1 0:2 3100176:2 2:1 0:1 2:1 0:3 2:1 0:5 2:1 0:3 2:1 0:1 2:1 0:1 2:1 0:14 2:1 0:19 2:2 0:1 2:3 0:3 2:1 0:1 2:3 0:2 2:1 0:1 2:1 0:1 2:9 0:1 2:1 0:1 2:3 0:12 431946:1 2:7 0:1 2:17 0:12 2:1 0:4 2:1 0:2 2:1 0:15 2:11 0:53 2:1 0:1 2:4 0:10 2:2 0:1 2:11 0:1 2:19 0:1 2:4 0:1 2:3 0:2 431946:1 0:5 431946:2 0:1 2:1 0:19 2:2 0:31 2:1 0:3 2:1 0:2 2:11 0:1 2:1 0:14 2:1 0:3 2:1 0:1 2:5 0:9 2:7 0:12 2:4 0:21 2:6 0:1 2:3 0:1 2:7 0:1 2:2 0:10 2:2 0:1 2:11 0:1 2:5 0:33 2:3 0:40 2:1 0:1 2:1 0:2 2:1 0:1 2:3 0:1 2:1 0:2 2:10 0:1 2:2 0:1 2:1 0:1 2:1 0:27 2:1 0:1 2:1 0:1 2:9 0:12 2:1 0:18 2:1 0:11 2:2 0:1 2:2 0:9 2:7 0:10 2:1 0:11 2:14 0:2 2:1 0:1 2:3 0:3 2:13 0:9 2:2 0:11 2:1 0:28 1236:1 2:2 0:4 2:2 0:1 2:1 0:1 2:1 0:3 2:3 0:15 2:1 0:1 2:4 0:17 2:1 0:1 2:4 0:1 2:1 0:1 
2:23 0:10 2:1 0:4 2:1 0:1 2:2 0:4 2:1 0:2 2:1 0:4 2:3 0:2 2:3 0:1 2:3 0:2 2:1 0:2 2:1 0:1 2:3 0:2 2:4 0:1 2:2 0:2 2:3 0:1 2:1 3100176:1 0:1 3100176:2 0:1 3100176:1 2:4 0:1 2:1 0:1 2:4 0:4 2:1 0:2 2:1 0:1 2:1 0:1 2:2 0:1 2:1 0:45 2:5 0:1 2:24 0:3 2:2 0:2 3100176:1 0:1 3100176:2 0:2 2:3 0:20 2:5 0:1 2:5 0:1 2:5 0:6 2:2 0:1 2:3 0:9 2:9 0:9 2:8 0:10 2:18 0:3 2:1 0:4 2:1 0:4 2:9 0:1 2:8 0:1 2:1 0:1 2:13 0:5 2:1 0:4 2:14 0:16 2:1 0:10 2:1 0:1 2:1 0:42 2:1 0:1 2:5 0:31 2:3 0:1 2:4 1224:1 0:15 87883:1 0:37 2:2 0:1 2:10 0:60 2:3 0:1 2:5 0:25 87883:1 0:1 87883:2 2:1 0:1 2:2 0:1 2:3 0:2 1224:1 2:1 0:3 1224:2 0:1 1224:1 0:1 1224:1 87883:1 2:1 0:13 1224:3 0:1 1224:3 0:10 1224:14 0:9 1224:1 0:7 1224:2 2:1 1224:1 0:2 87883:2 0:1 87883:1 0:1 87883:2 0:43 1224:1 2:7 0:1 2:3 1224:2 0:1 2:1 1224:2 0:36 2:4 0:49 2:1 0:1 2:1 0:1 2:2 0:1 2:2 0:3 2:1 0:12 2:1 0:3 2:1 0:1 2:3 0:1 2:2 0:3 2:1 0:1 2:1 0:1 2:1 0:2 2:1 0:2 2:15 0:1 2:1 0:1 2:3 0:6 2707174:1 0:2 2:5 0:1 2:2 0:3 2:2 0:2 87883:3 0:1 87883:1 0:3 87883:1 0:13 2:6 0:20 2:6 0:1 2:2 0:1 2:1 0:27 2:1 0:1 2:5 0:10 2:3 0:1 2:14 0:1 2:5 0:7 2:1 0:3 2:11 0:1 2:6 0:1 2:1 0:2 2:1 0:45 2:5 0:40 526226:1 0:16
C 23fdab33-a418-460c-9232-0f4db5722ccb 48296 6515 0:14 48296:7 2:2 0:1 2:5 0:32 2608247:1 0:1239 1783272:1 0:1 2:1 0:1 2:1 0:1 2:1 0:1 2:1 0:1 2:1 2877941:1 0:821
C f87ce81c-2d4e-49c0-a166-b5cc37fb5e32 48296 1380 0:8 48296:1 0:2 48296:7 2:2 0:1 2:5 0:310 1812112:1 0:119
C 89a96608-1899-49e1-b077-767a40d5ae27 2 3928 0:14 48296:1 0:49 61645:1 0:2 2922858:1 0:41 2:1 0:66 72407:1 0:89 115981:1 0:4 1160769:1 0:2 525370:1 0:13 1239:1 0:1 470:1 0:80 2993654:1 2763540:1 0:28 748003:1 0:6 29347:1 0:162 1915078:1 2:1 0:1 28131:1 0:1 2833771:1 0:21 287:1 0:79 2675710:1 0:81 161896:1 0:1 2:1 297246:1 0:69 1241834:1 0:9 115981:1 0:5 2:1 1202538:1 0:6 2917715:1 0:57 1970738:1 0:1 1921565:1 0:106 374981:1 0:17 3066272:1 2993654:1 0:13 37637:1 0:8 2490996:1 0:33 2:1 0:109 2675710:1 1194418:1 0:2 2982694:1 0:54 2:3 0:5
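
For anyone who wants to inspect per-read lines like the ones above: they appear to follow the Kraken 2 style layout (status, read ID, assigned taxid, read length, then `taxid:count` pairs). Below is a minimal Python sketch under that assumption, with whitespace-separated fields. `parse_read_line` is a hypothetical helper written for illustration, not part of Kun-peng.

```python
from collections import Counter


def parse_read_line(line):
    """Split one per-read classification line into its fields.

    Assumed layout: status, read ID, taxid, length, then taxid:count pairs.
    """
    status, read_id, taxid, length, *pairs = line.split()
    hits = Counter()
    for pair in pairs:
        tax, count = pair.split(":")
        hits[int(tax)] += int(count)  # sum repeated taxids along the read
    return {
        "classified": status == "C",
        "read_id": read_id,
        "taxid": int(taxid),
        "length": int(length),
        "hits": hits,
    }


# Fourth read from the output above.
line = ("C f87ce81c-2d4e-49c0-a166-b5cc37fb5e32 48296 1380 "
        "0:8 48296:1 0:2 48296:7 2:2 0:1 2:5 0:310 1812112:1 0:119")
record = parse_read_line(line)
print(record["taxid"], record["hits"].most_common(3))
```

Summing the counts per taxid makes it easier to see how many hits support the assigned taxon compared with taxid 0 (no hit).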

@rzelle-lallemand

rzelle-lallemand commented Jan 30, 2025

@humbleflowers, can you retry with the latest Kun-peng release (v0.7.5 or newer) to see if that resolves your issue?

The long sequence names of our ONT data turned out to be a red herring. When we shortened the sequence names, we generated a new (non-concatenated) .gz archive that could be successfully parsed by Kun-peng v0.7.4, so we thought the sequence name length was the issue. But per #31 it turned out to be an issue with the type of .gz file.

P.S. You can probably also skip the "-S" flag which is only necessary if using paired-end data where R1 and R2 are stored in a single file.
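
To check whether a given .gz file is a concatenated (multi-member) archive, something like the following Python sketch may help. It assumes the distinction that matters is single-member versus multi-member gzip, which is my reading of the comment above rather than a confirmed diagnosis of #31. `count_gzip_members` is a hypothetical helper that uses only the standard library.

```python
import gzip
import io
import zlib


def count_gzip_members(stream, chunk_size=1 << 20):
    """Count gzip members in a binary stream, discarding the decompressed data."""
    members = 0
    decomp = None
    buf = b""
    while True:
        if not buf:
            buf = stream.read(chunk_size)
            if not buf:
                break
        if decomp is None:
            decomp = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
        decomp.decompress(buf)
        if decomp.eof:
            # One member finished; leftover bytes belong to the next member.
            members += 1
            buf = decomp.unused_data
            decomp = None
        else:
            buf = b""
    if decomp is not None:
        raise ValueError("stream ended in the middle of a gzip member")
    return members


# Small in-memory demonstration with a made-up FASTQ record.
record = b"@read1\nACGT\n+\nIIII\n"
single = gzip.compress(record)
concatenated = gzip.compress(record) + gzip.compress(record)
print(count_gzip_members(io.BytesIO(single)))
print(count_gzip_members(io.BytesIO(concatenated)))
```

For a real file, open it in binary mode and pass the handle, for example `with open("file.fastq.gz", "rb") as fh: count_gzip_members(fh)`. A count above 1 means the file was built from several gzip members, as happens with `cat a.gz b.gz > c.gz`.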

@humbleflowers
Author

Hello guys

Thanks for looking into it, @rzelle-lallemand. I will try again with the new release.

@eric9n In case you want to look at the sample, I was running kun-peng against Zymo D6322: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR728/ERR7287988/zymo_hmw_r104.fastq.gz

Thank you. Apologies for the late response; I was away.

@Q-chen27
Collaborator

Q-chen27 commented Feb 1, 2025

@humbleflowers
Thank you for providing the sample information. I've analyzed the ZYMO D6311 dataset using the latest version of Kun-peng with our comprehensive pan-domain database (204,477 genomes, 4.3TB) recently described in our preprint (https://www.biorxiv.org/content/10.1101/2024.12.19.629356v1.abstract). Here's the command I used:

kun-peng classify -p 20 --db ${db_path} --chunk-dir ${chunk_path} --output-dir ${out_path} -S zymo_hmw_r104.fastq.gz

After running Bracken (v2.9) for abundance estimation, we successfully identified eight out of the ten species from the ZYMO D6311 mock community.

The top 20 species and their relative abundances are as follows:

species | relative_abundance
Salmonella_enterica | 0.35432818
Escherichia_coli | 0.34752198
Enterococcus_faecalis | 0.0672665
Staphylococcus_aureus | 0.04936784
Pseudomonas_aeruginosa | 0.02813966
Listeria_monocytogenes | 0.0133805
Klebsiella_pneumoniae | 0.00729877
Wurfbainia_villosa | 0.00665557
Kokia_cookei | 0.00655194
Saccharomyces_cerevisiae | 0.00635697
Shigella_boydii | 0.00602923
Hordeum_vulgare | 0.00582005
Kokia_kauaiensis | 0.00477517
Viscum_album | 0.00452311
Bacillus_subtilis | 0.00403585
Triticum_aestivum | 0.00281451
Escherichia_fergusonii | 0.00251929
Cymbidium_sinense | 0.00251645
Hymenobacter_volaticus | 0.00236811
Kokia_drynarioides | 0.00208665

Thank you!
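
The "eight out of ten" count can be checked against the table with a short Python sketch. The expected species list is the theoretical composition from the first post, and the observed values are the top 20 rows above. It assumes exact name matching after replacing spaces with underscores, so synonyms or renamed taxa would be missed.

```python
# Expected members of the mock community, as listed in the first post.
expected = [
    "Listeria monocytogenes", "Pseudomonas aeruginosa", "Bacillus subtilis",
    "Saccharomyces cerevisiae", "Escherichia coli", "Salmonella enterica",
    "Lactobacillus fermentum", "Enterococcus faecalis",
    "Cryptococcus neoformans", "Staphylococcus aureus",
]

# Top 20 species and relative abundances from the table above.
observed = {
    "Salmonella_enterica": 0.35432818, "Escherichia_coli": 0.34752198,
    "Enterococcus_faecalis": 0.0672665, "Staphylococcus_aureus": 0.04936784,
    "Pseudomonas_aeruginosa": 0.02813966, "Listeria_monocytogenes": 0.0133805,
    "Klebsiella_pneumoniae": 0.00729877, "Wurfbainia_villosa": 0.00665557,
    "Kokia_cookei": 0.00655194, "Saccharomyces_cerevisiae": 0.00635697,
    "Shigella_boydii": 0.00602923, "Hordeum_vulgare": 0.00582005,
    "Kokia_kauaiensis": 0.00477517, "Viscum_album": 0.00452311,
    "Bacillus_subtilis": 0.00403585, "Triticum_aestivum": 0.00281451,
    "Escherichia_fergusonii": 0.00251929, "Cymbidium_sinense": 0.00251645,
    "Hymenobacter_volaticus": 0.00236811, "Kokia_drynarioides": 0.00208665,
}


def to_key(name):
    """Convert 'Genus species' to the 'Genus_species' form used in the table."""
    return name.replace(" ", "_")


found = [s for s in expected if to_key(s) in observed]
missing = [s for s in expected if to_key(s) not in observed]

print(f"found {len(found)} of {len(expected)} expected species")
for species in missing:
    print("not in top 20:", species)
```

This only tests presence in the top 20 rows. A species reported as missing may still appear further down the full Bracken output.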
