Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reads come mostly from one chromosome only #178

Open
dariober opened this issue Dec 19, 2022 · 1 comment
Open

Reads come mostly from one chromosome only #178

dariober opened this issue Dec 19, 2022 · 1 comment

Comments

@dariober
Copy link

Hello - I'm simulating reads from Leishmania major genome using training reads sequenced from this genome:

read_analysis.py genome -o training -i fastq_runid_ae10a0b01321b03612390f5e1e5fe2b0d834516a_0_0.fastq.gz \
    -rg TriTrypDB-45_LmajorFriedlin_Genome.fasta -t 16

simulator.py genome -c training -n 20000 -t 32 -b guppy -rg TriTrypDB-45_LmajorFriedlin_Genome.fasta

After mapping the simulated reads back to the reference genome, I find that most reads come from one chromosome only:

minimap2 -a -x map-ont -t 12 TriTrypDB-45_LmajorFriedlin_Genome.fasta simulated_aligned_reads.fasta simulated_unaligned_reads.fasta \
| samtools sort > simulated.bam
samtools index simulated.bam

samtools idxstats simulated.bam | column -t
LmjF.01  268988   0      0
LmjF.02  355712   15     0
LmjF.03  384502   0      0
LmjF.04  472852   0      0
LmjF.05  465823   1      0
LmjF.06  516869   0      0
LmjF.07  596352   14     0
LmjF.08  574960   0      0
LmjF.09  573434   0      0
LmjF.10  570865   0      0
LmjF.11  582573   1      0
LmjF.12  675346   0      0
LmjF.13  654595   0      0
LmjF.14  622644   2      0
LmjF.15  629517   0      0
LmjF.16  714651   0      0
LmjF.17  684829   0      0
LmjF.18  739748   0      0
LmjF.19  702208   0      0
LmjF.20  742537   0      0
LmjF.21  772972   22     0
LmjF.22  716602   0      0
LmjF.23  772565   0      0
LmjF.24  840950   0      0
LmjF.25  912845   23     0
LmjF.26  1091540  0      0
LmjF.27  1130424  2      0
LmjF.28  1160104  0      0
LmjF.29  1212663  0      0
LmjF.30  1403434  0      0
LmjF.31  1484328  5      0
LmjF.32  1604637  0      0
LmjF.33  1583653  1      0
LmjF.34  1866748  86     0
LmjF.35  2090474  4      0
LmjF.36  2682151  12218  0
*        0        0      7856

The training reads look reasonably mapped:

minimap2 -a -x map-ont -t 12 TriTrypDB-45_LmajorFriedlin_Genome.fasta fastq_runid_ae10a0b01321b03612390f5e1e5fe2b0d834516a_0_0.fastq.gz \
| samtools sort > out.bam
samtools index out.bam

samtools idxstats out.bam | column -t
LmjF.01  268988   24   0
LmjF.02  355712   24   0
LmjF.03  384502   30   0
LmjF.04  472852   33   0
LmjF.05  465823   30   0
LmjF.06  516869   34   0
LmjF.07  596352   41   0
LmjF.08  574960   105  0
LmjF.09  573434   45   0
LmjF.10  570865   42   0
LmjF.11  582573   43   0
LmjF.12  675346   173  0
LmjF.13  654595   44   0
LmjF.14  622644   67   0
LmjF.15  629517   92   0
LmjF.16  714651   63   0
LmjF.17  684829   48   0
LmjF.18  739748   61   0
LmjF.19  702208   67   0
LmjF.20  742537   55   0
LmjF.21  772972   56   0
LmjF.22  716602   57   0
LmjF.23  772565   42   0
LmjF.24  840950   79   0
LmjF.25  912845   72   0
LmjF.26  1091540  88   0
LmjF.27  1130424  213  0
LmjF.28  1160104  89   0
LmjF.29  1212663  63   0
LmjF.30  1403434  292  0
LmjF.31  1484328  254  0
LmjF.32  1604637  128  0
LmjF.33  1583653  155  0
LmjF.34  1866748  224  0
LmjF.35  2090474  198  0
LmjF.36  2682151  237  0
*        0        0    1570

This is with NanoSim 3.1.0.

Any idea why I'm getting reads from virtually one chromosome only?

@SaberHQ
Copy link
Collaborator

SaberHQ commented Feb 25, 2023

Sorry for this late reply @dariober

It is indeed very weird behaviour by NanoSim which I have never seen before. Have you tried using pre-trained models as well? Is the issue remains in that case?

It would be also insightful to investigate the output files from the training stage. This will help us know whether the issue is related to the training process or the simulation stage. For example, you may check the length distributions of aligned reads or the histogram of match, mismatch, and indels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants