No output with --n_reads > 2000000 #4

lpipes · 2023-01-29T07:09:25Z

Hi, I am wondering why I get no output (no FASTQ files) if I set --n_reads higher than 2000000. For 1,000,000 and 2,000,000 I do get output. The command that I'm using is:

python simulate_metagenome.py --genomes_file random.fasta --genome_abundances abundances.tsv --output_folder output --n_reads 3000000


willboulton · 2023-01-29T15:07:03Z

Hi - thanks for catching this bug, and I hope our simulator is useful to you. I think this comes from a limit we had set on how much random data to create when shuffling reads, and it should have been fixed by the last commit. Are you using the Docker version? If so, I will update that as well.

lpipes · 2023-02-09T19:13:23Z

No, I'm not using the Docker version. Thanks for fixing this! I am able to generate more simulated reads now. However, I see that some amplicons are completely missing from my reads (in the example below, amplicon 90 is absent between amplicons 89 and 91). Is this meant to simulate amplicon dropout, and if so, how is it simulated?

85      hCoV-19/Italy/VEN-IZSVe-22RS8391-6_PD/2022      89      False   5000000 0.2     998944  0.0022428652075625      0.00030361152567553085  308
86      hCoV-19/Italy/VEN-IZSVe-22RS8391-6_PD/2022      91      False   5000000 0.2     998944  0.0001771084847563      5.695689668623623e-07   1

rabiafidan · 2023-02-09T19:39:55Z

Yes, we simulate amplicon dropout. Amplicons might drop out because of:

- Differential read-depth coverage along the genome (an amplicon might end up with no reads at all). See also our preprint Methods, section B.
- Primers failing to align to the reference genome you provide. Many reference genomes are missing some portions at the genome ends, so the first and last primers frequently drop out because there is nothing for them to align to. Alignment can also fail if that portion of the reference genome contains lots of Ns.
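
As a rough illustration of the first cause, here is a minimal sketch of how an amplicon with a very small share of the reads can end up with none. It assumes each read is assigned to an amplicon independently with a fixed probability, which is a simplification and not necessarily how the simulator draws reads; the function name and the example numbers are hypothetical.

```python
def zero_read_probability(share: float, n_reads: int) -> float:
    """Probability that an amplicon receives no reads at all.

    Assumption (not taken from the simulator): each of the n_reads reads
    lands on this amplicon independently with probability `share`.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if n_reads < 0:
        raise ValueError("n_reads must be non-negative")
    return (1.0 - share) ** n_reads


if __name__ == "__main__":
    # Hypothetical values: a low-coverage amplicon and a moderate read count.
    print(zero_read_probability(share=1e-6, n_reads=1_000_000))
```

Under this assumption, an amplicon whose expected read count (share times n_reads) is around one or below has a substantial chance of being absent from the output.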

lpipes · 2023-02-09T19:41:39Z

Ok thanks for the explanation!

gavinmdouglas · 2023-04-03T14:02:21Z

Hi there,

Thanks for making this great tool! I ran into the same problem, with the most recent development version in a conda environment. I was able to fix it by changing shuf --random-source={random_seed} to shuf (edit: in art_runner.py). I believe this original argument just ensures that the output is reproducibly shuffled, which is not something we need for our purposes.

Cheers,

Gavin

willboulton · 2023-04-04T09:48:46Z

Hi Gavin,

Thanks for flagging this - I'll reopen the issue, glad it doesn't affect you!

Yep - just to confirm that you're right about the shuffling. The random source is there only to make the order of the reads reproducible; the reads themselves would still be the same for runs with the same random seed.

Best,
Will
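
For readers who still want a reproducible read order without relying on shuf --random-source, below is a minimal sketch of a seeded in-memory shuffle. It is not the code in art_runner.py: the function names are hypothetical, and it assumes every FASTQ record is exactly four lines and that the file fits in memory. A seeded pseudo-random generator does not depend on a pre-generated amount of random data, which seems to be what the limit mentioned earlier in this thread ran into.

```python
import random
from typing import Iterable, List


def read_fastq_records(lines: Iterable[str]) -> List[str]:
    """Group lines into four-line FASTQ records (assumes unwrapped sequences)."""
    stripped = [line.rstrip("\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    return ["\n".join(stripped[i:i + 4]) for i in range(0, len(stripped), 4)]


def shuffle_records(records: List[str], seed: int) -> List[str]:
    """Return a shuffled copy; the same seed gives the same order."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Tiny hypothetical input: three reads with made-up names and qualities.
    demo = []
    for i in range(3):
        demo += [f"@read{i}", "ACGT", "+", "IIII"]
    for record in shuffle_records(read_fastq_records(demo), seed=42):
        print(record)
```

Shuffling whole records rather than individual lines keeps each read's header, sequence, and quality string together. For paired-end output, the two files would need to be shuffled with the same permutation so that mates stay in sync.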

lpipes closed this as completed Feb 9, 2023

willboulton reopened this Apr 4, 2023
