Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid 'times' argument after simulate_experiment - works with subsetted dataset #65

Closed
orangebromeliad opened this issue Mar 26, 2019 · 2 comments

Comments

@orangebromeliad
Copy link

Hi team,

I'm trying to simulate RNAseq data for the malaria parasite P. berghei. The transcript file has 5254 transcripts, and I am trying to simulate two groups each with one replicate and no log fold changes between the two, for now. The following script does not work, producing the error reproduced below:

# FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta'
fasta = readDNAStringSet(fasta_file)

# ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
# here all transcripts will have ~equal FPKM
readspertx = round(2 * width(fasta) / 100)

fold_changes = matrix(1,ncol=2, nrow=5254)

# simulation call:
simulate_experiment(fasta  = fasta_file,
                    reads_per_transcript=readspertx,
                    fold_changes=fold_changes,
                    num_reps=c(1,1),
                    paired = T,
                    reportCoverage = T,
                    readlen = 100,
                    outdir='polyester_pberghei_simulated_reads_twosample') 

Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument
In addition: Warning messages:
1: In rnbinom(n = length(basemeans), mu = basemeans, size = size) :
  NAs produced
2: In rnbinom(n = length(basemeans), mu = basemeans, size = size) :
  NAs produced

However when I use a subset of twenty transcripts the package appears to work without producing any errors:

# FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta'
fasta = readDNAStringSet(fasta_file)

# subset the FASTA file to first 20 transcripts
small_fasta = fasta[1:20]
writeXStringSet(small_fasta, 'berghei_small.fa')

# ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
# here all transcripts will have ~equal FPKM
readspertx = round(2 * width(small_fasta) / 100)

transcript.number = count_transcripts('berghei_small.fa')
fold_changes = matrix(1, ncol = 2, nrow=20)

# simulation call:
simulate_experiment(fasta  = 'berghei_small.fa',
                    reads_per_transcript=readspertx,
                    fold_changes=fold_changes,
                    num_reps=c(1,1),
                    paired = T,
                    reportCoverage = T,
                    readlen = 100,
                    outdir='polyester_pberghei_simulated_reads_small') 

I don't believe that this is a duplicate of the earlier closed issue: #8 since I'm quite confident that the readspertx (reads per transcript) and fold_changes objects are the right dimensions. Hence I have been unable to find what the issue is.

Hope that this helps. Sessioninfo is below too:

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SGSeq_1.16.2                Rsamtools_1.34.1            Biostrings_2.50.2           XVector_0.22.0             
 [5] polyester_1.18.0            tibble_2.1.1                gridExtra_2.3               tidyselect_0.2.5           
 [9] readxl_1.3.1                data.table_1.12.0           scales_1.0.0                RColorBrewer_1.1-2         
[13] pheatmap_1.0.12             DESeq2_1.22.2               SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[17] BiocParallel_1.16.6         matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0       
[21] GenomeInfoDb_1.18.2         IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        
[25] ggplot2_3.1.0               tidyr_0.8.3                 dplyr_0.8.0.1              

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             bit64_0.9-7              httr_1.4.0               progress_1.2.0          
 [5] tools_3.5.2              backports_1.1.3          utf8_1.1.4               R6_2.4.0                
 [9] rpart_4.1-13             Hmisc_4.2-0              DBI_1.0.0                lazyeval_0.2.2          
[13] colorspace_1.4-1         nnet_7.3-12              withr_2.1.2              prettyunits_1.0.2       
[17] bit_1.1-14               compiler_3.5.2           cli_1.1.0                htmlTable_1.13.1        
[21] rtracklayer_1.42.2       labeling_0.3             checkmate_1.9.1          genefilter_1.64.0       
[25] stringr_1.4.0            digest_0.6.18            foreign_0.8-71           base64enc_0.1-3         
[29] pkgconfig_2.0.2          htmltools_0.3.6          limma_3.38.3             htmlwidgets_1.3         
[33] rlang_0.3.2              rstudioapi_0.10          RSQLite_2.1.1            acepack_1.4.1           
[37] RCurl_1.95-4.12          magrittr_1.5             GenomeInfoDbData_1.2.0   Formula_1.2-3           
[41] Matrix_1.2-16            Rcpp_1.0.1               munsell_0.5.0            fansi_0.4.0             
[45] stringi_1.4.3            logspline_2.1.12         yaml_2.2.0               zlibbioc_1.28.0         
[49] plyr_1.8.4               grid_3.5.2               blob_1.1.1               crayon_1.3.4            
[53] lattice_0.20-38          splines_3.5.2            GenomicFeatures_1.34.6   annotate_1.60.1         
[57] hms_0.4.2                locfit_1.5-9.1           knitr_1.22               pillar_1.3.1            
[61] igraph_1.2.4             RUnit_0.4.32             biomaRt_2.38.0           geneplotter_1.60.0      
[65] reshape2_1.4.3           XML_3.98-1.19            glue_1.3.1               latticeExtra_0.6-28     
[69] BiocManager_1.30.4       cellranger_1.1.0         gtable_0.3.0             purrr_0.3.2             
[73] assertthat_0.2.1         xfun_0.5                 xtable_1.8-3             survival_2.43-3         
[77] GenomicAlignments_1.18.1 AnnotationDbi_1.44.0     memoise_1.1.0            cluster_2.0.7-1 

Best,
Tim L.

@alyssafrazee
Copy link
Owner

alyssafrazee commented Mar 26, 2019 via email

@orangebromeliad
Copy link
Author

Hi Alyssa,

Yes you are right; it was because of the use of the round function when calculating the reads-per-transcript, if I remove it then the problem goes away. I'm using a genome which has some very short transcripts which were reduced to 0.5 or smaller reads per transcript by the calculation and round was turning this to zero.

Would it be worth inserting an error into the code to check for this an inform of the problem? Something like:

if(any(reads_per_transcript <= 0)){error('Reads_per_transcript contains zero or negative values')}
?

Best wishes and thank you for your help,
Tim,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants