You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I ran into some convergence issues when I tried to fit a model with a random effect term. See the attached minimum reproducible rds file and the following code block.
(This may relate to these issues: #147#148#142)
If I'm understanding this correctly, the reason that genes MALAT1 and GPM6A failed to converge on the discrete part of the model is that all the cells are expressing these genes. So when the model is trying to predict a constant value (1 for expressed in this case) from the covariants it raised the error "Error: Response is constant". But this is also true when I ran it with a fixed model (excluding the subject term). How come in that model the discrete part for these two genes fit well? Is it because of the eBayes steps used in the fixed model (which is turned off in the glmer model used with the random effect)?
Another issue I noticed is for some genes like CADM2 and MAGI, MAST reported extremely huge FC and confidence intervals. I'm not sure whether this is also because of the imbalance of the number of cells expressing or not expressing these genes, but there are only < 5 cells that do not express these genes (see the code block below). Is it why the coefficients of the discrete part are so huge and hence the extreme FC and CI?
Genes are expressed in all cells in comparison can still be potentially interesting, especially in scenarios that comparing disease vs. control samples. So what would you recommend to do if I want to recover these genes given the mixed model? Can I compute the logFC and p-value only using the continuous part of the model? If so can you point me to the code block that I need to modify to achieve that? (I don't quite understand how you compute the logFC and p-value in MAST).
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
library(NMF)
#> Loading required package: pkgmaker
#> Loading required package: registry
#> Loading required package: rngtools
#> Loading required package: cluster
#> NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 31/32
#> To enable shared memory capabilities, try: install.extras('
#> NMF
#> ')
library(rsvd)
library(lme4)
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#> expand, pack, unpack
library(MAST)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#>
#> anyMissing, rowMedians
#> The following object is masked from 'package:dplyr':
#>
#> count
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> The following object is masked from 'package:Biobase':
#>
#> rowMedians
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:Matrix':
#>
#> expand
#> The following object is masked from 'package:NMF':
#>
#> nrun
#> The following object is masked from 'package:pkgmaker':
#>
#> new2
#> The following objects are masked from 'package:data.table':
#>
#> first, second
#> The following objects are masked from 'package:dplyr':
#>
#> first, rename
#> The following object is masked from 'package:tidyr':
#>
#> expand
#> The following object is masked from 'package:base':
#>
#> expand.grid
#> Loading required package: IRanges
#>
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:data.table':
#>
#> shift
#> The following objects are masked from 'package:dplyr':
#>
#> collapse, desc, slice
#> The following object is masked from 'package:purrr':
#>
#> reduce
#> Loading required package: GenomeInfoDb
sca <- readRDS("test_sca.rds") # expression values are log2(CPM)
# fitting with a fixed model works fine
zlmCond <- zlm(~disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch,
sca)
#> Warning in .nextMethod(object = object, value = value): Coefficients
#> seq_batchbatch_6 are never estimible and will be dropped.
#>
#> Done!
lrt_term <- "diseaseALS"
summary_cond <- summary(zlmCond, doLRT = lrt_term)
#> Combining coefficients and standard errors
#> Calculating log-fold changes
#> Calculating likelihood ratio tests
#> Refitting on reduced model...
#>
#> Done!
summary_Dt <- summary_cond$datatable
fcHurdle <- merge(summary_Dt[contrast==lrt_term & component=='H',.(primerid, `Pr(>Chisq)`)],
summary_Dt[contrast==lrt_term & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')
fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
fcHurdle_df <- fcHurdle %>%
as_tibble() %>%
arrange(fdr) %>%
dplyr::rename(gene = primerid,
p_value = `Pr(>Chisq)`,
log2FC = coef)
fcHurdle_df
#> # A tibble: 6 x 6
#> gene p_value log2FC ci.hi ci.lo fdr
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 GFAP 2.02e-187 3.20 3.47 2.93 1.21e-186
#> 2 GPM6A 1.84e-105 -0.558 -0.510 -0.607 5.53e-105
#> 3 CADM2 1.07e- 42 -0.368 -0.313 -0.422 2.14e- 42
#> 4 ERBB4 1.58e- 25 -0.302 -0.247 -0.358 2.36e- 25
#> 5 MAGI2 1.26e- 18 -0.244 -0.190 -0.299 1.51e- 18
#> 6 MALAT1 8.08e- 1 0.0108 0.0434 -0.0218 8.08e- 1
# fitting with a mixed model with a random effect term raise issues...
zlmCond_mixed <- zlm(~disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch + (1|subject),
sca,
method = 'glmer',
ebayes = FALSE,
strictConvergence = FALSE,
fitArgsD = list(nAGQ = 0)
)
#> Warning in .nextMethod(object = object, value = value): Coefficients
#> seq_batchbatch_6 are never estimible and will be dropped.
#>
#> Done!
lrt_term <- "diseaseALS"
summary_cond_mixed <- summary(zlmCond_mixed, doLRT = lrt_term, fitArgsD = list(nAGQ = 0))
#> Combining coefficients and standard errors
#> Calculating log-fold changes
#> Calculating likelihood ratio tests
#> Refitting on reduced model...
#>
#> Done!
summary_Dt_mixed <- summary_cond_mixed$datatable
fcHurdle_mixed <- merge(summary_Dt_mixed[contrast==lrt_term & component=='H',.(primerid, `Pr(>Chisq)`)],
summary_Dt_mixed[contrast==lrt_term & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')
fcHurdle_mixed[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
fcHurdle_df_mixed <- fcHurdle_mixed %>%
as_tibble() %>%
arrange(fdr) %>%
dplyr::rename(gene = primerid,
p_value = `Pr(>Chisq)`,
log2FC = coef)
fcHurdle_df_mixed # genes failed to converge (like GPM6A and MALAT1) and genes with extremely huge FC and CI (like CADM2 and MAGI2)
#> # A tibble: 6 x 6
#> gene p_value log2FC ci.hi ci.lo fdr
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 GFAP 0.0000863 3.36 5.80 0.923 0.000345
#> 2 CADM2 0.00189 -10.1 10167. -10187. 0.00324
#> 3 ERBB4 0.00243 -0.240 -0.0962 -0.384 0.00324
#> 4 MAGI2 0.210 -5.46 9654. -9665. 0.210
#> 5 GPM6A NA NaN NaN NaN NA
#> 6 MALAT1 NA NaN NaN NaN NA
# try to manually fit the model as recommended in other Github issues
y <- assay(sca)["MALAT1",]
dat <- colData(sca)
dat <- data.frame(y,dat)
dat$y_disc <- dat$y>0
dat %>% as_tibble() %>% dplyr::count(disease, subject, y_disc)
#> # A tibble: 12 x 4
#> disease subject y_disc n
#> <fct> <chr> <lgl> <int>
#> 1 Control 1069 TRUE 480
#> 2 Control 902 TRUE 871
#> 3 Control 904 TRUE 216
#> 4 Control 906 TRUE 416
#> 5 Control 91 TRUE 286
#> 6 Control 945 TRUE 319
#> 7 ALS 110 TRUE 693
#> 8 ALS 111 TRUE 302
#> 9 ALS 113 TRUE 338
#> 10 ALS 332 TRUE 15
#> 11 ALS 388 TRUE 189
#> 12 ALS 52 TRUE 312
disc_fit <-
glmer(
y_disc ~ disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch + (1 | subject),
data = dat,
family = "binomial",
control = glmerControl(optimizer = "bobyqa"),
nAGQ = 0,
)
#> fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
#> Error: Response is constant
y <- assay(sca)["CADM2",]
dat <- colData(sca)
dat <- data.frame(y,dat)
dat$y_disc <- dat$y>0
dat %>% as_tibble() %>% dplyr::count(disease, subject, y_disc)
#> # A tibble: 14 x 4
#> disease subject y_disc n
#> <fct> <chr> <lgl> <int>
#> 1 Control 1069 TRUE 480
#> 2 Control 902 FALSE 1
#> 3 Control 902 TRUE 870
#> 4 Control 904 TRUE 216
#> 5 Control 906 TRUE 416
#> 6 Control 91 TRUE 286
#> 7 Control 945 TRUE 319
#> 8 ALS 110 FALSE 2
#> 9 ALS 110 TRUE 691
#> 10 ALS 111 TRUE 302
#> 11 ALS 113 TRUE 338
#> 12 ALS 332 TRUE 15
#> 13 ALS 388 TRUE 189
#> 14 ALS 52 TRUE 312
disc_fit <-
glmer(
y_disc ~ disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch + (1 | subject),
data = dat,
family = "binomial",
control = glmerControl(optimizer = "bobyqa"),
nAGQ = 0,
)
#> fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
zlmCond@coefC
#> (Intercept) diseaseALS cn_genes_on cn_nCount_SCT cn_percent_mt
#> GFAP 8.29401 1.12023728 -0.07655072 0.02442343 -0.020249861
#> MALAT1 15.59505 0.01082459 0.12577780 -0.24172040 -0.090441340
#> GPM6A 12.28415 -0.55844186 -0.07206625 0.08239564 -0.009892255
#> ERBB4 11.59645 -0.30211774 -0.10247324 0.21863447 -0.042176694
#> CADM2 12.29975 -0.36309474 -0.07710509 0.11810448 0.005804970
#> MAGI2 11.43671 -0.24065765 -0.13788845 0.11733356 -0.032837481
#> cn_age sexM seq_batchbatch_2 seq_batchbatch_3
#> GFAP -0.20639925 -0.09135056 0.49418686 0.44759747
#> MALAT1 -0.18206980 -0.20952663 -0.18163052 0.03419533
#> GPM6A 0.01503511 0.29918196 0.07135877 0.34234493
#> ERBB4 0.10387150 -0.14493914 -0.04503845 0.26233164
#> CADM2 0.01522204 -0.53816474 -0.55727286 0.30972378
#> MAGI2 0.06218008 0.18165004 -0.03257498 0.08604887
#> seq_batchbatch_4 seq_batchbatch_5
#> GFAP 0.84527533 1.7331418
#> MALAT1 -0.36238682 0.3129492
#> GPM6A 0.04144394 -0.5939891
#> ERBB4 -0.08014849 -0.4053022
#> CADM2 -0.48380033 -0.6011401
#> MAGI2 0.13870640 -0.6291547
zlmCond@coefD
#> (Intercept) diseaseALS cn_genes_on cn_nCount_SCT cn_percent_mt
#> GFAP 0.880975 2.362414e+00 3.703621e-01 2.676908e-01 -7.586761e-02
#> MALAT1 10.638409 -5.465235e-16 1.998504e-16 1.187756e-16 -2.199635e-16
#> GPM6A 10.638409 -5.465235e-16 1.998504e-16 1.187756e-16 -2.199635e-16
#> ERBB4 10.226903 -6.264171e-01 1.674600e+00 -3.311432e-01 -5.580800e-01
#> CADM2 8.050730 -8.027358e-01 1.219512e+00 -2.684928e-01 3.959452e-01
#> MAGI2 7.565095 -5.186919e-01 8.960942e-01 2.237338e-01 1.178158e-01
#> cn_age sexM seq_batchbatch_2 seq_batchbatch_3
#> GFAP -4.452868e-01 -4.945825e-01 1.093009e-01 -1.494435e-02
#> MALAT1 1.520002e-16 2.020446e-16 7.612317e-17 7.561554e-17
#> GPM6A 1.520002e-16 2.020446e-16 7.612317e-17 7.561554e-17
#> ERBB4 3.420845e-01 -1.659656e+00 6.848692e-01 -1.987180e-02
#> CADM2 -4.373242e-01 5.716655e-01 1.263799e+00 3.827985e-01
#> MAGI2 -8.245549e-01 -5.977621e-01 1.830101e+00 9.999911e-01
#> seq_batchbatch_4 seq_batchbatch_5
#> GFAP 1.363749e+00 5.789519e-01
#> MALAT1 4.987686e-17 9.059252e-17
#> GPM6A 4.987686e-17 9.059252e-17
#> ERBB4 2.389449e-01 -2.009024e-01
#> CADM2 1.101994e+00 -5.283315e-01
#> MAGI2 -1.031814e-01 1.404526e+00
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-conda-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.7 LTS
#>
#> Matrix products: default
#> BLAS/LAPACK: /cndd/junhao/anaconda3/envs/r_env_v4/lib/libopenblasp-r0.3.12.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 parallel stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] MAST_1.17.3 SingleCellExperiment_1.12.0
#> [3] SummarizedExperiment_1.20.0 GenomicRanges_1.42.0
#> [5] GenomeInfoDb_1.26.1 IRanges_2.24.0
#> [7] S4Vectors_0.28.1 MatrixGenerics_1.2.0
#> [9] matrixStats_0.58.0 lme4_1.1-25
#> [11] Matrix_1.2-18 rsvd_1.0.3
#> [13] NMF_0.23.0 Biobase_2.50.0
#> [15] BiocGenerics_0.36.0 cluster_2.1.0
#> [17] rngtools_1.5 pkgmaker_0.32.2
#> [19] registry_0.5-1 data.table_1.13.6
#> [21] forcats_0.5.0 stringr_1.4.0
#> [23] dplyr_1.0.2 purrr_0.3.4
#> [25] readr_1.4.0 tidyr_1.1.2
#> [27] tibble_3.0.4 ggplot2_3.3.2
#> [29] tidyverse_1.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-150 bitops_1.0-6 fs_1.5.0
#> [4] lubridate_1.7.9.2 progress_1.2.2 doParallel_1.0.16
#> [7] RColorBrewer_1.1-2 httr_1.4.2 tools_4.0.3
#> [10] backports_1.2.0 utf8_1.1.4 R6_2.5.0
#> [13] DBI_1.1.1 colorspace_2.0-0 withr_2.4.1
#> [16] prettyunits_1.1.1 tidyselect_1.1.0 compiler_4.0.3
#> [19] cli_2.3.0 rvest_0.3.6 xml2_1.3.2
#> [22] DelayedArray_0.16.0 scales_1.1.1 digest_0.6.27
#> [25] minqa_1.2.4 rmarkdown_2.5 XVector_0.30.0
#> [28] pkgconfig_2.0.3 htmltools_0.5.0 dbplyr_2.0.0
#> [31] highr_0.8 rlang_0.4.10 readxl_1.3.1
#> [34] generics_0.1.0 jsonlite_1.7.2 RCurl_1.98-1.2
#> [37] magrittr_2.0.1 GenomeInfoDbData_1.2.4 fansi_0.4.2
#> [40] Rcpp_1.0.6 munsell_0.5.0 abind_1.4-5
#> [43] lifecycle_0.2.0 stringi_1.5.3 yaml_2.2.1
#> [46] zlibbioc_1.36.0 MASS_7.3-53 plyr_1.8.6
#> [49] grid_4.0.3 crayon_1.4.1 lattice_0.20-41
#> [52] haven_2.3.1 splines_4.0.3 hms_0.5.3
#> [55] knitr_1.30 pillar_1.4.7 boot_1.3-25
#> [58] reshape2_1.4.4 codetools_0.2-18 reprex_0.3.0
#> [61] glue_1.4.2 evaluate_0.14 modelr_0.1.8
#> [64] vctrs_0.3.5 nloptr_1.2.2.2 foreach_1.5.1
#> [67] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
#> [70] xfun_0.19 gridBase_0.4-7 xtable_1.8-4
#> [73] broom_0.7.2 iterators_1.0.13 statmod_1.4.35
#> [76] ellipsis_0.3.1
The text was updated successfully, but these errors were encountered:
hoholee
changed the title
Convergene issues due to all cells in comparison expressed certain genes?
Convergence issues due to all cells in comparison expressed certain genes?
Mar 4, 2021
Hi MAST developers,
I ran into some convergence issues when I tried to fit a model with a random effect term. See the attached minimum reproducible rds file and the following code block.
(This may relate to these issues: #147 #148 #142)
If I'm understanding this correctly, the reason that genes MALAT1 and GPM6A failed to converge on the discrete part of the model is that all the cells are expressing these genes. So when the model is trying to predict a constant value (1 for expressed in this case) from the covariants it raised the error "Error: Response is constant". But this is also true when I ran it with a fixed model (excluding the subject term). How come in that model the discrete part for these two genes fit well? Is it because of the eBayes steps used in the fixed model (which is turned off in the glmer model used with the random effect)?
Another issue I noticed is for some genes like CADM2 and MAGI, MAST reported extremely huge FC and confidence intervals. I'm not sure whether this is also because of the imbalance of the number of cells expressing or not expressing these genes, but there are only < 5 cells that do not express these genes (see the code block below). Is it why the coefficients of the discrete part are so huge and hence the extreme FC and CI?
Genes are expressed in all cells in comparison can still be potentially interesting, especially in scenarios that comparing disease vs. control samples. So what would you recommend to do if I want to recover these genes given the mixed model? Can I compute the logFC and p-value only using the continuous part of the model? If so can you point me to the code block that I need to modify to achieve that? (I don't quite understand how you compute the logFC and p-value in MAST).
Any help would be appreciated. Thanks in advance!
Best,
Junhao.
test_sca.rds.zip
The text was updated successfully, but these errors were encountered: