Benchmarking kun_peng with ZYMO D6311 sample #30
Comments
To help investigate this issue, I would need to: Check the database you're using: Access the sample data:
Hi @eric9n and @humbleflowers. I am running into the same issue with ONT reads (only the first read is classified). This appears to be an ONT-specific issue, because when I classify a single Illumina read file (R1 in this case), all the reads are classified. Perhaps Kun-peng is having an issue with ONT FASTQ headers?
@eric9n and @humbleflowers. The issue in my case was due to long sequence read names that exceeded the maximum sequence name length allowed by the SAM format (https://www.biostars.org/p/482930/; SAM specification, https://samtools.github.io/hts-specs/SAMv1.pdf: "QNAME limited to 254 bytes (was 255). (Aug 2015)"). I shortened the ONT sequence names with:
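One way to shorten read names is to keep only the first whitespace-delimited token of each FASTQ header. The sketch below is a minimal illustration in Python and is not necessarily the command used in this comment; the function name `shorten_fastq_names` is hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
def shorten_fastq_names(lines):
    """Yield FASTQ lines with each header cut at the first whitespace.

    Assumes four-line records: header, sequence, '+', qualities.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0 and line.startswith("@"):
            # Keep only the read ID, dropping the free-text description.
            yield line.split(None, 1)[0] + "\n"
        else:
            yield line


# Hypothetical ONT-style record with a long description after the read ID.
record = [
    "@b855900f-87f6-4fd5-a4dd-76c69183d0fc runid=abc ch=12 start_time=x\n",
    "ACGT\n",
    "+\n",
    "!!!!\n",
]
print("".join(shorten_fastq_names(record)), end="")
```

The same generator can be fed an open file handle (for example from `gzip.open(path, "rt")`) and its output written to a new file.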
Could you please provide a small example of the problematic FASTQ sequence? Just need one record that contains the long sequence name. This will help me reproduce and test the issue.
@eric9n, here are the first 5 ONT reads with long sequence names. GitHub will not let me upload a FASTQ file, so I changed the extension to .txt.
@eric9n @mperisin-lallemand @humbleflowers I've tested example_ont_reads.txt with Kun-peng v0.7.4 against a 273GB bacteria-only database, and all sequences were successfully classified.
C b855900f-87f6-4fd5-a4dd-76c69183d0fc 48296 2852 0:10 48296:2 0:2 48296:7 2:2 0:1 2:6 0:522 2:2 0:390
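To read a classification line like the one above, it helps to split it into its fields. The following is a rough sketch that assumes the Kraken 2-style column layout (status, read ID, assigned taxid, read length, then `taxid:count` k-mer hits); the function name `parse_classification_line` is hypothetical, and whether Kun-peng matches this layout in every detail is an assumption here.

```python
from collections import Counter


def parse_classification_line(line):
    """Parse one Kraken 2-style output line into a dict (assumed layout)."""
    status, read_id, taxid, length, hits = line.split(None, 4)
    kmers_per_taxid = Counter()
    for pair in hits.split():
        hit_taxid, count = pair.split(":")
        # Keep taxids as strings so non-numeric markers would not crash int().
        kmers_per_taxid[hit_taxid] += int(count)
    return {
        "classified": status == "C",
        "read_id": read_id,
        "taxid": int(taxid),
        "length": int(length),
        "kmers_per_taxid": dict(kmers_per_taxid),
    }


line = ("C b855900f-87f6-4fd5-a4dd-76c69183d0fc 48296 2852 "
        "0:10 48296:2 0:2 48296:7 2:2 0:1 2:6 0:522 2:2 0:390")
result = parse_classification_line(line)
print(result["classified"], result["taxid"], result["length"])
print(result["kmers_per_taxid"])
```

Summing the hits per taxid shows how many k-mers supported the assigned taxon versus how many were unclassified (taxid 0).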
@humbleflowers, can you retry with the latest Kun-peng release (v0.7.5 or newer) to see if that resolves your issue? The long sequence names of our ONT data turned out to be a red herring. When we shortened the sequence names, we generated a new (non-concatenated) .gz archive that could be successfully parsed by Kun-peng v0.7.4, so we thought the sequence name length was the issue. But per #31 it turned out to be an issue with the type of .gz file.
P.S. You can probably also skip the "-S" flag, which is only necessary if using paired-end data where R1 and R2 are stored in a single file.
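A concatenated .gz archive is a file made of several gzip members back to back, which the gzip format allows but which some readers stop parsing after the first member. As a rough way to check whether a file is of this type, the sketch below counts members with Python's standard `zlib` module. The function name `count_gzip_members` is hypothetical, and it reads the whole input into memory, so it is only meant for small test files.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members contained in an in-memory byte string."""
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=31)
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated or corrupt gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one.
        data = decompressor.unused_data
    return members


# Build a two-member archive, as produced by `cat a.gz b.gz > ab.gz`.
single = gzip.compress(b"@read1\nACGT\n+\n!!!!\n")
concatenated = single + gzip.compress(b"@read2\nTTTT\n+\n!!!!\n")
print(count_gzip_members(single), count_gzip_members(concatenated))
```

A count greater than one means the file is a multi-member archive; decompressing and recompressing it in one pass yields a single-member file.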
Hello guys. Thanks for looking into it, @rzelle-lallemand. I will try again with the new update.
@eric9n In case you want to look at the sample, I was running kun-peng against zymo D6322: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR728/ERR7287988/zymo_hmw_r104.fastq.gz
Thank you. Apologies for the late response, as I was away.
@humbleflowers I ran the following command on your sample:
kun-peng classify -p 20 --db ${db_path} --chunk-dir ${chunk_path} --output-dir ${out_path} -S zymo_hmw_r104.fastq.gz
After running Bracken (v2.9) for abundance estimation, we successfully identified eight out of the ten species from the ZYMO D6311 mock community. The top 20 species and their relative abundances are as follows:
species | relative_abundance
Thank you!
I am benchmarking Kun-peng on the ZYMO D6311 mock community, which contains 10 species in log-distributed proportions.
Theoretical Composition Based on Genomic DNA: Listeria monocytogenes - 89.1%, Pseudomonas aeruginosa - 8.9%, Bacillus subtilis - 0.89%, Saccharomyces cerevisiae - 0.89%, Escherichia coli - 0.089%, Salmonella enterica - 0.089%, Lactobacillus fermentum - 0.0089%, Enterococcus faecalis - 0.00089%, Cryptococcus neoformans - 0.00089%, and Staphylococcus aureus - 0.000089%
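To get a feel for what these log-distributed proportions mean in practice, the sketch below converts the theoretical composition into expected read counts. It assumes reads are sampled in proportion to genomic DNA (ignoring differences in read length and genome size), and the depth of 1,000,000 reads is a hypothetical value chosen only for illustration.

```python
# Theoretical genomic-DNA composition (%) as listed above.
COMPOSITION_PCT = {
    "Listeria monocytogenes": 89.1,
    "Pseudomonas aeruginosa": 8.9,
    "Bacillus subtilis": 0.89,
    "Saccharomyces cerevisiae": 0.89,
    "Escherichia coli": 0.089,
    "Salmonella enterica": 0.089,
    "Lactobacillus fermentum": 0.0089,
    "Enterococcus faecalis": 0.00089,
    "Cryptococcus neoformans": 0.00089,
    "Staphylococcus aureus": 0.000089,
}


def expected_reads(total_reads):
    """Expected reads per species if sampling follows the DNA proportions."""
    return {
        species: total_reads * pct / 100
        for species, pct in COMPOSITION_PCT.items()
    }


# Hypothetical sequencing depth, for illustration only.
for species, n in expected_reads(1_000_000).items():
    print(f"{species:26s} {n:12.2f}")
```

At that hypothetical depth the rarest member, Staphylococcus aureus, is expected to contribute fewer than one read, so the lowest-abundance species may not be detectable regardless of the classifier.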
The command I ran was:
kun-peng classify -p 30 --db /mnt/wwbixprd/kraken2_wwm_ont/k2-241224/ --output-dir test/ --chunk-dir chunk-test/ -S /mnt/wwbixprd/ont_zymo_samples/zym_D6311_r104.10gb.fastq.gz
I got the following output:
Kun-peng identifies only one species, at the genus level. Is there a way to improve classification so that all species are identified? My database contains 25K organisms.
Thank you.