No output with --n_reads > 2000000 #4

Open
lpipes opened this issue Jan 29, 2023 · 6 comments


@lpipes

lpipes commented Jan 29, 2023

Hi, I am wondering why I get no output (no FASTQ files) when I set --n_reads higher than 2000000. With 1,000,000 and 2,000,000 I do get output. The command I'm using is:

python simulate_metagenome.py --genomes_file random.fasta --genome_abundances abundances.tsv --output_folder output --n_reads 3000000
@willboulton
Collaborator

Hi, thanks for catching this bug; I hope our simulator is useful to you. I think this is due to a limit we had set on how much random data to create when shuffling reads, and it should have been fixed by the last commit. Are you using the Docker version? If so, I will update that as well.

@lpipes
Author

lpipes commented Feb 9, 2023

No, I'm not using the Docker version. Thanks for fixing this! I am able to generate more simulated reads now. However, I see that some amplicons are completely missing from my reads (see the example below, where amplicon 90 is absent). Is this meant to simulate amplicon dropout and, if so, how is that simulated?

85      hCoV-19/Italy/VEN-IZSVe-22RS8391-6_PD/2022      89      False   5000000 0.2     998944  0.0022428652075625      0.00030361152567553085  308
86      hCoV-19/Italy/VEN-IZSVe-22RS8391-6_PD/2022      91      False   5000000 0.2     998944  0.0001771084847563      5.695689668623623e-07   1

@rabiafidan
Collaborator

Yes, we simulate amplicon dropout. Amplicons might drop out because of:

  1. Differential read-depth coverage along the genome (that amplicon might end up not having any reads). See also our preprint Methods, section B.
  2. Primers failing to align to the reference genome you provide. Many reference genomes are missing portions at the genome ends, so the first and last primers frequently drop out because there is nothing for them to align to. Likewise, if that portion of the reference genome contains many Ns, the alignment might also fail.
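To illustrate the first cause, here is a minimal Python sketch of how uneven coverage alone can leave an amplicon with zero reads. It is not the simulator's actual code: the function names `simulate_amplicon_counts` and `dropped_amplicons`, and the idea of drawing each read's amplicon from a weighted distribution, are assumptions made only for illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def simulate_amplicon_counts(weights: Dict[int, float], n_reads: int, seed: int) -> Dict[int, int]:
    """Assign each of n_reads reads to an amplicon, with probability proportional to its weight.

    Hypothetical helper: the weights stand in for relative read-depth coverage.
    """
    rng = random.Random(seed)
    amplicons = list(weights)
    draws = rng.choices(amplicons, weights=[weights[a] for a in amplicons], k=n_reads)
    counts = Counter(draws)
    # Report every amplicon, including those that received no reads.
    return {a: counts.get(a, 0) for a in amplicons}


def dropped_amplicons(counts: Dict[int, int]) -> List[int]:
    """Return the amplicons that ended up with zero reads."""
    return [a for a, c in counts.items() if c == 0]


if __name__ == "__main__":
    # Amplicon 90 is given zero relative coverage, so it can never receive a read.
    example = simulate_amplicon_counts({89: 1.0, 90: 0.0, 91: 1.0}, n_reads=1000, seed=1)
    print(dropped_amplicons(example))
```

An amplicon with a small but non-zero weight can also end up with no reads purely by chance, and this becomes more likely as the total number of reads for that genome gets smaller.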

@lpipes
Author

lpipes commented Feb 9, 2023

Ok thanks for the explanation!

@lpipes lpipes closed this as completed Feb 9, 2023
@gavinmdouglas

gavinmdouglas commented Apr 3, 2023

Hi there,

Thanks for making this great tool! I ran into the same problem, with the most recent development version in a conda environment. I was able to fix it by changing shuf --random-source={random_seed} to shuf (edit: in art_runner.py). I believe this original argument just ensures that the output is reproducibly shuffled, which is not something we need for our purposes.

Cheers,

Gavin
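For anyone who still needs a reproducible read order, one possible alternative is to shuffle inside Python with a seeded generator, which does not depend on a finite pool of random bytes. GNU `shuf --random-source=FILE` reads its random bytes from FILE, so a file that is too small for the input would be one plausible cause of the failure, although that is not confirmed in this thread. The sketch below is an assumption rather than the code in art_runner.py: the function name `shuffle_fastq_records` is hypothetical, and it assumes standard 4-line FASTQ records held in memory.

```python
import random
from typing import List


def shuffle_fastq_records(lines: List[str], seed: int) -> List[str]:
    """Shuffle 4-line FASTQ records reproducibly, keeping each record intact."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    # Group the flat list of lines into records of 4 lines each.
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    # A generator seeded per call: the same seed always gives the same order.
    random.Random(seed).shuffle(records)
    # Flatten the shuffled records back into a list of lines.
    return [line for record in records for line in record]
```

Calling the function twice with the same seed returns the same order, and a different seed will generally give a different order. For very large read sets, holding every record in memory may not be practical, so this is only a starting point.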

@willboulton
Collaborator

Hi Gavin,

Thanks for flagging this. I'll reopen the issue; glad it no longer affects you!

Yep, just to confirm that you're right about the shuffling: the random source is there only to make the order of the reads reproducible. The reads themselves would still be the same for runs with the same random seed.

Best,
Will

@willboulton willboulton reopened this Apr 4, 2023