Invalid 'times' argument after simulate_experiment - works with subsetted dataset #65

orangebromeliad · 2019-03-26T11:39:08Z

Hi team,

I'm trying to simulate RNAseq data for the malaria parasite P. berghei. The transcript file has 5254 transcripts, and I am trying to simulate two groups each with one replicate and no log fold changes between the two, for now. The following script does not work, producing the error reproduced below:

# FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta'
fasta = readDNAStringSet(fasta_file)

# ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
# here all transcripts will have ~equal FPKM
readspertx = round(2 * width(fasta) / 100)

fold_changes = matrix(1,ncol=2, nrow=5254)

# simulation call:
simulate_experiment(fasta  = fasta_file,
                    reads_per_transcript=readspertx,
                    fold_changes=fold_changes,
                    num_reps=c(1,1),
                    paired = T,
                    reportCoverage = T,
                    readlen = 100,
                    outdir='polyester_pberghei_simulated_reads_twosample')

Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument
In addition: Warning messages:
1: In rnbinom(n = length(basemeans), mu = basemeans, size = size) :
  NAs produced
2: In rnbinom(n = length(basemeans), mu = basemeans, size = size) :
  NAs produced

However when I use a subset of twenty transcripts the package appears to work without producing any errors:

# FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta'
fasta = readDNAStringSet(fasta_file)

# subset the FASTA file to first 20 transcripts
small_fasta = fasta[1:20]
writeXStringSet(small_fasta, 'berghei_small.fa')

# ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
# here all transcripts will have ~equal FPKM
readspertx = round(2 * width(small_fasta) / 100)

transcript.number = count_transcripts('berghei_small.fa')
fold_changes = matrix(1, ncol = 2, nrow=20)

# simulation call:
simulate_experiment(fasta  = 'berghei_small.fa',
                    reads_per_transcript=readspertx,
                    fold_changes=fold_changes,
                    num_reps=c(1,1),
                    paired = T,
                    reportCoverage = T,
                    readlen = 100,
                    outdir='polyester_pberghei_simulated_reads_small')

I don't believe that this is a duplicate of the earlier closed issue: #8 since I'm quite confident that the readspertx (reads per transcript) and fold_changes objects are the right dimensions. Hence I have been unable to find what the issue is.

Hope that this helps. Sessioninfo is below too:

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SGSeq_1.16.2                Rsamtools_1.34.1            Biostrings_2.50.2           XVector_0.22.0             
 [5] polyester_1.18.0            tibble_2.1.1                gridExtra_2.3               tidyselect_0.2.5           
 [9] readxl_1.3.1                data.table_1.12.0           scales_1.0.0                RColorBrewer_1.1-2         
[13] pheatmap_1.0.12             DESeq2_1.22.2               SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[17] BiocParallel_1.16.6         matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0       
[21] GenomeInfoDb_1.18.2         IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        
[25] ggplot2_3.1.0               tidyr_0.8.3                 dplyr_0.8.0.1              

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             bit64_0.9-7              httr_1.4.0               progress_1.2.0          
 [5] tools_3.5.2              backports_1.1.3          utf8_1.1.4               R6_2.4.0                
 [9] rpart_4.1-13             Hmisc_4.2-0              DBI_1.0.0                lazyeval_0.2.2          
[13] colorspace_1.4-1         nnet_7.3-12              withr_2.1.2              prettyunits_1.0.2       
[17] bit_1.1-14               compiler_3.5.2           cli_1.1.0                htmlTable_1.13.1        
[21] rtracklayer_1.42.2       labeling_0.3             checkmate_1.9.1          genefilter_1.64.0       
[25] stringr_1.4.0            digest_0.6.18            foreign_0.8-71           base64enc_0.1-3         
[29] pkgconfig_2.0.2          htmltools_0.3.6          limma_3.38.3             htmlwidgets_1.3         
[33] rlang_0.3.2              rstudioapi_0.10          RSQLite_2.1.1            acepack_1.4.1           
[37] RCurl_1.95-4.12          magrittr_1.5             GenomeInfoDbData_1.2.0   Formula_1.2-3           
[41] Matrix_1.2-16            Rcpp_1.0.1               munsell_0.5.0            fansi_0.4.0             
[45] stringi_1.4.3            logspline_2.1.12         yaml_2.2.0               zlibbioc_1.28.0         
[49] plyr_1.8.4               grid_3.5.2               blob_1.1.1               crayon_1.3.4            
[53] lattice_0.20-38          splines_3.5.2            GenomicFeatures_1.34.6   annotate_1.60.1         
[57] hms_0.4.2                locfit_1.5-9.1           knitr_1.22               pillar_1.3.1            
[61] igraph_1.2.4             RUnit_0.4.32             biomaRt_2.38.0           geneplotter_1.60.0      
[65] reshape2_1.4.3           XML_3.98-1.19            glue_1.3.1               latticeExtra_0.6-28     
[69] BiocManager_1.30.4       cellranger_1.1.0         gtable_0.3.0             purrr_0.3.2             
[73] assertthat_0.2.1         xfun_0.5                 xtable_1.8-3             survival_2.43-3         
[77] GenomicAlignments_1.18.1 AnnotationDbi_1.44.0     memoise_1.1.0            cluster_2.0.7-1

Best,
Tim L.

The text was updated successfully, but these errors were encountered:

alyssafrazee · 2019-03-26T22:06:25Z

My guess here is that some values of `readspertx` are zero or NA (but that there aren't any zeros/NAs in the first 20 transcripts / small simulation). Sadly this is a known limitation: #28. However, you can work around this by either removing the transcripts you don't want to simulate from, or replacing all the 0 counts with some small number (like 1) to approximate the simulation. Hope this helps!

…

On Tue, Mar 26, 2019 at 4:39 AM orangebromeliad ***@***.***> wrote: Hi team, I'm trying to simulate RNAseq data for the malaria parasite P. berghei. The transcript file has 5254 transcripts, and I am trying to simulate two groups each with one replicate and no log fold changes between the two, for now. The following script does not work, producing the error reproduced below: # FASTA annotation fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta' fasta = readDNAStringSet(fasta_file) # ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20 # here all transcripts will have ~equal FPKM readspertx = round(2 * width(fasta) / 100) fold_changes = matrix(1,ncol=2, nrow=5254) # simulation call: simulate_experiment(fasta = fasta_file, reads_per_transcript=readspertx, fold_changes=fold_changes, num_reps=c(1,1), paired = T, reportCoverage = T, readlen = 100, outdir='polyester_pberghei_simulated_reads_twosample') Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument In addition: Warning messages: 1: In rnbinom(n = length(basemeans), mu = basemeans, size = size) : NAs produced 2: In rnbinom(n = length(basemeans), mu = basemeans, size = size) : NAs produced However when I use a subset of twenty transcripts the package appears to work without producing any errors: # FASTA annotation fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta' fasta = readDNAStringSet(fasta_file) # subset the FASTA file to first 20 transcripts small_fasta = fasta[1:20] writeXStringSet(small_fasta, 'berghei_small.fa') # ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20 # here all transcripts will have ~equal FPKM readspertx = round(2 * width(small_fasta) / 100) transcript.number = count_transcripts('berghei_small.fa') fold_changes = matrix(1, ncol = 2, nrow=20) # simulation call: simulate_experiment(fasta = 'berghei_small.fa', reads_per_transcript=readspertx, fold_changes=fold_changes, num_reps=c(1,1), paired = T, reportCoverage = T, readlen = 100, outdir='polyester_pberghei_simulated_reads_small') I don't believe that this is a duplicate of the earlier closed issue: #8 <#8> since I'm quite confident that the readspertx (reads per transcript) and fold_changes objects are the right dimensions. Hence I have been unable to find what the issue is. Hope that this helps. Sessioninfo is below too: R version 3.5.2 (2018-12-20) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] SGSeq_1.16.2 Rsamtools_1.34.1 Biostrings_2.50.2 XVector_0.22.0 [5] polyester_1.18.0 tibble_2.1.1 gridExtra_2.3 tidyselect_0.2.5 [9] readxl_1.3.1 data.table_1.12.0 scales_1.0.0 RColorBrewer_1.1-2 [13] pheatmap_1.0.12 DESeq2_1.22.2 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 [17] BiocParallel_1.16.6 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 [21] GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 [25] ggplot2_3.1.0 tidyr_0.8.3 dplyr_0.8.0.1 loaded via a namespace (and not attached): [1] bitops_1.0-6 bit64_0.9-7 httr_1.4.0 progress_1.2.0 [5] tools_3.5.2 backports_1.1.3 utf8_1.1.4 R6_2.4.0 [9] rpart_4.1-13 Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.2 [13] colorspace_1.4-1 nnet_7.3-12 withr_2.1.2 prettyunits_1.0.2 [17] bit_1.1-14 compiler_3.5.2 cli_1.1.0 htmlTable_1.13.1 [21] rtracklayer_1.42.2 labeling_0.3 checkmate_1.9.1 genefilter_1.64.0 [25] stringr_1.4.0 digest_0.6.18 foreign_0.8-71 base64enc_0.1-3 [29] pkgconfig_2.0.2 htmltools_0.3.6 limma_3.38.3 htmlwidgets_1.3 [33] rlang_0.3.2 rstudioapi_0.10 RSQLite_2.1.1 acepack_1.4.1 [37] RCurl_1.95-4.12 magrittr_1.5 GenomeInfoDbData_1.2.0 Formula_1.2-3 [41] Matrix_1.2-16 Rcpp_1.0.1 munsell_0.5.0 fansi_0.4.0 [45] stringi_1.4.3 logspline_2.1.12 yaml_2.2.0 zlibbioc_1.28.0 [49] plyr_1.8.4 grid_3.5.2 blob_1.1.1 crayon_1.3.4 [53] lattice_0.20-38 splines_3.5.2 GenomicFeatures_1.34.6 annotate_1.60.1 [57] hms_0.4.2 locfit_1.5-9.1 knitr_1.22 pillar_1.3.1 [61] igraph_1.2.4 RUnit_0.4.32 biomaRt_2.38.0 geneplotter_1.60.0 [65] reshape2_1.4.3 XML_3.98-1.19 glue_1.3.1 latticeExtra_0.6-28 [69] BiocManager_1.30.4 cellranger_1.1.0 gtable_0.3.0 purrr_0.3.2 [73] assertthat_0.2.1 xfun_0.5 xtable_1.8-3 survival_2.43-3 [77] GenomicAlignments_1.18.1 AnnotationDbi_1.44.0 memoise_1.1.0 cluster_2.0.7-1 Best, Tim L. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#65>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AB6e2Ct3PNF5h3esJ4Wp730AruYljTrmks5vagbdgaJpZM4cLObX> .

orangebromeliad · 2019-03-28T11:15:55Z

Hi Alyssa,

Yes you are right; it was because of the use of the round function when calculating the reads-per-transcript, if I remove it then the problem goes away. I'm using a genome which has some very short transcripts which were reduced to 0.5 or smaller reads per transcript by the calculation and round was turning this to zero.

Would it be worth inserting an error into the code to check for this an inform of the problem? Something like:

if(any(reads_per_transcript <= 0)){error('Reads_per_transcript contains zero or negative values')}
?

Best wishes and thank you for your help,
Tim,

orangebromeliad closed this as completed Mar 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid 'times' argument after simulate_experiment - works with subsetted dataset #65

Invalid 'times' argument after simulate_experiment - works with subsetted dataset #65

orangebromeliad commented Mar 26, 2019

alyssafrazee commented Mar 26, 2019 via email

orangebromeliad commented Mar 28, 2019

Invalid 'times' argument after simulate_experiment - works with subsetted dataset #65

Invalid 'times' argument after simulate_experiment - works with subsetted dataset #65

Comments

orangebromeliad commented Mar 26, 2019

alyssafrazee commented Mar 26, 2019 via email

orangebromeliad commented Mar 28, 2019