Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAST unable to estimate components when including mixed effect #98

Open
gabprg opened this issue Aug 1, 2018 · 9 comments
Open

MAST unable to estimate components when including mixed effect #98

gabprg opened this issue Aug 1, 2018 · 9 comments

Comments

@gabprg
Copy link

gabprg commented Aug 1, 2018

Hi there!

I've just recently encountered this issue and was hoping to see whether you guys could help with understanding what's going on: I want to identify differentially expressed genes between 2 conditions. We have 4 replicates (samples from different animals) per condition, which were processed in 2 batches (2 from each condition/batch).

The formula I used to fit the model is: ~ Condition + Batch + nGenes + (1|animal) . One of the known differences between the 2 conditions is that a transgene is only expressed in one of them (and the signal is very robust). From previous posts, I understand that the continuous component of the model can't be computed in that scenario, which is what I see if I fit the model with the formula as above, just without the mixed effect. However, when fitting with the mixed effect, the discrete component is no longer estimated, and a coefficient appears for the continuous component (although not significant). I'm trying to understand why this is happening, since this transgene is essentially a positive control for differential expression, and isn't showing up. Do you have any insights on what's causing this (/how to resolve it)?

Here are the relevant rows from the summary object ("sum.lrt <- MAST::summary(mast.zlmFit, doLRT = T)") using the 2 formulas:

~ Condition + Batch + nGenes:
image

~ Condition + Batch + nGenes + (1|animal):
image

Thanks!

Gabriela

@gfinak
Copy link
Member

gfinak commented Aug 1, 2018

This does seem odd.
Can you post a minimal reproducible example (with data) that we could use to investigate?

@gabprg
Copy link
Author

gabprg commented Aug 7, 2018

Hi again!

Sorry for the delay - I've put together a minimal SingleCellAssay object with 3 genes - 2 of those are other random housekeeping genes which are expressed across the board, and then the 3rd is the transgene I mentioned above. Condition 1 corresponds to the one where the transgene is robustly expressed, it's absent in the other cases. I'm attaching the SingleCellAssay object (rds format). You should be able to re-obtain the result above by running the following code (load the rds object as mast.raw):

`
print("Fitting the zlm model...")
mast.zlmFit <- zlm(
formula = formula(~Condition + Batch + (1|animal) + ngeneson),
sca = mast.raw,
parallel = FALSE,
method = 'glmer',
ebayes = FALSE
)

sum.lrt <- MAST::summary(mast.zlmFit, doLRT = T)
summ1 = sum.lrt$datatable

mast.zlmFit2 <- zlm(
formula = formula(~Condition + Batch + ngeneson),
sca = mast.raw,
parallel = FALSE,
method = 'glm',
ebayes = FALSE
)

sum.lrt2 <- MAST::summary(mast.zlmFit2, doLRT = T)
summ2 = sum.lrt2$datatable
`

I suspect that part of the problem is due to the fact that we're trying to fit a complex model with relatively low cell numbers - if that is the case, it'd be great if you had recommendations on alternative ways to run MAST on this data or even other tools that could be used.

Let me know if anything is unclear, and thanks for the help!

mastExampleObject.zip

@gfinak
Copy link
Member

gfinak commented Aug 7, 2018

Could you post your sessionInfo()?
I get the following error reading your SingleCellAssay object:

Error in updateObject(x@rowRanges, check = FALSE) : 
  no slot of name "rowRanges" for this object of class "SingleCellAssay"

Just looking into what changes have occurred in SummarizedExperiment.

I am using SummarizedExperiment v 1.10.1

NEWS file indicates:

    o rowRanges() now is supported on a SummarizedExperiment object that is
      not a RangedSummarizedExperiment, and returns NULL. Also doing
      'rowRanges(x) <- NULL' on a RangedSummarizedExperiment object now is
      supported and degrades it to a SummarizedExperiment instance.

I wonder if there is some backward-incompatibility.

@gfinak
Copy link
Member

gfinak commented Aug 7, 2018

Note: I've managed to rescue the file.

@gabprg
Copy link
Author

gabprg commented Aug 7, 2018

Here's the output:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 13.04

Matrix products: default
BLAS: /usr/lib/openblas-base/libopenblas.so.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] lme4_1.1-15 MAST_1.4.1 SummarizedExperiment_1.6.5 DelayedArray_0.4.1 matrixStats_0.53.1
[6] Biobase_2.36.2 GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 IRanges_2.12.0 S4Vectors_0.16.0
[11] BiocGenerics_0.24.0 Seurat_2.2.1 Matrix_1.2-12 cowplot_0.9.2 ggplot2_2.2.1

loaded via a namespace (and not attached):
[1] backports_1.1.2 Hmisc_4.1-1 fastmatch_1.1-0 VGAM_1.0-5 sn_1.5-1 plyr_1.8.4
[7] igraph_1.2.1 lazyeval_0.2.1 splines_3.4.1 BiocParallel_1.10.1 digest_0.6.15 foreach_1.4.4
[13] htmltools_0.3.6 GOSemSim_2.4.1 lars_1.2 GO.db_3.4.1 gdata_2.18.0 magrittr_1.5
[19] checkmate_1.8.5 memoise_1.1.0 cluster_2.0.6 mixtools_1.1.0 ROCR_1.0-7 sfsmisc_1.1-2
[25] recipes_0.1.2 gower_0.1.2 dimRed_0.1.0 R.utils_2.6.0 colorspace_1.3-2 blob_1.1.0
[31] dplyr_0.7.4 RCurl_1.95-4.10 bindr_0.1.1 survival_2.41-3 iterators_1.0.9 ape_5.1
[37] glue_1.2.0 DRR_0.0.3 gtable_0.2.0 ipred_0.9-6 zlibbioc_1.24.0 XVector_0.16.0
[43] kernlab_0.9-25 ddalpha_1.3.1.1 prabclus_2.2-6 DEoptimR_1.0-8 abind_1.4-5 scales_0.5.0
[49] DOSE_3.4.0 mvtnorm_1.0-7 DBI_0.8 Rcpp_0.12.17 dtw_1.18-1 metap_0.8
[55] htmlTable_1.11.2 tclust_1.3-1 proxy_0.4-21 foreign_0.8-69 bit_1.1-12 mclust_5.4
[61] SDMTools_1.1-221 Formula_1.2-2 tsne_0.1-3 lava_1.6 prodlim_1.6.1 htmlwidgets_1.0
[67] fgsea_1.3.1.9000 FNN_1.1 gplots_3.0.1 RColorBrewer_1.1-2 fpc_2.1-11 acepack_1.4.1
[73] modeltools_0.2-21 ica_1.0-1 pkgconfig_2.0.1 R.methodsS3_1.7.1 flexmix_2.3-14 nnet_7.3-12
[79] caret_6.0-79 tidyselect_0.2.4 rlang_0.2.0 reshape2_1.4.3 AnnotationDbi_1.40.0 munsell_0.4.3
[85] tools_3.4.1 RSQLite_2.0 ranger_0.8.0 ggridges_0.4.1 broom_0.4.3 stringr_1.3.0
[91] yaml_2.1.18 ModelMetrics_1.1.0 knitr_1.20 bit64_0.9-7 robustbase_0.92-8 caTools_1.17.1
[97] purrr_0.2.4 bindrcpp_0.2.2 pbapply_1.3-4 nlme_3.1-131.1 R.oo_1.21.0 RcppRoll_0.2.2
[103] DO.db_2.9 compiler_3.4.1 rstudioapi_0.7 tibble_1.4.2 stringi_1.1.7 lattice_0.20-35
[109] trimcluster_0.1-2 psych_1.7.8 nloptr_1.0.4 diffusionMap_1.1-0 pillar_1.2.1 irlba_2.3.2
[115] data.table_1.10.4-3 bitops_1.0-6 qvalue_2.10.0 R6_2.2.2 latticeExtra_0.6-28 KernSmooth_2.23-15
[121] gridExtra_2.3 codetools_0.2-15 MASS_7.3-49 gtools_3.5.0 assertthat_0.2.0 CVST_0.2-1
[127] withr_2.1.1 mnormt_1.5-5 GenomeInfoDbData_1.0.0 diptest_0.75-7 clusterProfiler_3.6.0 grid_3.4.1
[133] rpart_4.1-13 timeDate_3043.102 tidyr_0.8.0 class_7.3-14 minqa_1.2.4 rvcheck_0.0.9
[139] segmented_0.5-3.0 Rtsne_0.13 numDeriv_2016.8-1 lubridate_1.7.3 scatterplot3d_0.3-41 base64enc_0.1-3

@gfinak
Copy link
Member

gfinak commented Aug 7, 2018

The attached file can be read in by MAST under R > 3.5.0 and associated Bioconductor packages.

mastexample.zip

I can fit the discrete part by hand when passing nAGQ=0 to glmer.
Maybe we need to add the ability to pass arguments to the fit function.

@amcdavid
Copy link
Member

amcdavid commented Aug 8, 2018

Sorry, I am late to the party and haven't read this thread through in detail yet. We do have this ability. We can pass arguments to the discrete and continuous "fit" methods with fitArgsC or fitArgsD = list(...). See

object@fitC <- do.call(cfun, c(list(formula=formC, data=quote(datpos), REML=FALSE), fitArgsC))
for how it's used under the hood. The documentation for this is deficient, but there is something under
`?"LMlike-class"'

@gfinak
Copy link
Member

gfinak commented Aug 8, 2018

library(dplyr)
mast.raw <- readRDS("mast_rescued_example.rds")
summary(zlm(~ Condition + (1|Batch), sca = mast.raw, method = "glmer", ebayes = FALSE, fitArgsD = list(nAGQ = 0)))$datatable %>% 
filter(primerid=="transgene")

Gives the following:

fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients

Done!
Combining coefficients and standard errors
Calculating log-fold changes
    primerid component    contrast       ci.hi        ci.lo          coef
1  transgene         D (Intercept) 6048.941514 -6090.025397 -2.054194e+01
2  transgene         S (Intercept)          NA           NA            NA
3  transgene         C (Intercept)    4.194017     2.904607  3.549312e+00
4  transgene         C  Condition1          NA           NA  3.549312e+00
5  transgene         S  Condition1          NA           NA            NA
6  transgene     logFC  Condition1         NaN          NaN  6.157843e+00
7  transgene         D  Condition1 6091.904201 -6047.062776  2.242071e+01
8  transgene         C  Condition2          NA           NA  3.549312e+00
9  transgene         S  Condition2          NA           NA            NA
10 transgene     logFC  Condition2         NaN          NaN  4.271039e-09
11 transgene         D  Condition2 9159.859938 -9159.856155  1.891544e-03
               z
1  -6.633426e-03
2   7.625161e+00
3   1.079024e+01
4             NA
5             NA
6            NaN
7   7.240120e-03
8             NA
9             NA
10           NaN
11  4.047397e-07

The continuous part has a model.matrix of rank 1 for the transgene as it's only expressed in condition 1.
The discrete part has complete separation between conditions.
I don't think there's much to be done if you want to fit a random effects model at this time.

The following is a little more reasonable:

> summary(zlm(~Condition+Batch,sca=mast.raw,method="bayesglm",ebayes=TRUE))$datatable %>% filter(primerid=="transgene")

Done!
Combining coefficients and standard errors
Calculating log-fold changes
    primerid component    contrast      ci.hi       ci.lo        coef
1  transgene         D (Intercept) -1.8729913 -6.14760085 -4.01029607
2  transgene         S (Intercept)         NA          NA          NA
3  transgene         C (Intercept)  3.4800833  2.69328621  3.08668474
4  transgene         D      Batch2  1.5481903 -0.79105237  0.37856896
5  transgene     logFC      Batch2  0.1757916 -0.07852615  0.04863272
6  transgene         S      Batch2         NA          NA          NA
7  transgene         C      Batch2  1.4946451  0.36616222  0.93040364
8  transgene         C  Condition1         NA          NA          NA
9  transgene         S  Condition1         NA          NA          NA
10 transgene     logFC  Condition1        NaN         NaN         NaN
11 transgene         D  Condition1  7.7171705  3.54865402  5.63291228
12 transgene         C  Condition2         NA          NA          NA
13 transgene         S  Condition2         NA          NA          NA
14 transgene     logFC  Condition2        NaN         NaN         NaN
15 transgene         D  Condition2  2.4696289 -4.63723070 -1.08380089
            z
1  -3.6775456
2   8.2736652
3  15.3782752
4   0.6343776
5   0.7496007
6   2.7338533
7   3.2318748
8          NA
9          NA
10        NaN
11  5.2969948
12         NA
13         NA
14        NaN
15 -0.5977917

and you can add animal and ngeneson fixed effects to the model.

@gabprg
Copy link
Author

gabprg commented Aug 8, 2018

Thanks for looking into this! I'll go ahead and try some of your suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants