-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cell type annotation chapter broken because of breaking changes in AUCell #8
Comments
For BioC 3.15, one option would be to replace OSCA.basic/inst/book/cell-annotation.Rmd Lines 286 to 287 in f7910c6
with explicit coercion of the input to a dense matrix muraro.rankings <- AUCell_buildRankings(as.matrix(muraro.mat),
plotStats=FALSE, verbose=FALSE) I made this change locally ('local output with Things looked okay up until OSCA.basic/inst/book/cell-annotation.Rmd Lines 255 to 258 in f7910c6
at which point I noticed differences in some of the overlaid densities curves, most notably the dashed curves for the 'interneurons' (pink) and 'pyramidal CA1' and 'pyrimidal SS' (pink and red). Current output (BioC 3.15)Local output with
|
There are some other slight differences between 'current' and 'local copy with Some slight different numbers of genes in example GO terms from the OSCA.basic/inst/book/cell-annotation.Rmd Lines 346 to 361 in f7910c6
I presume this is due to updates to the annotations. There are consequent small differences in OSCA.basic/inst/book/cell-annotation.Rmd Lines 408 to 413 in f7910c6
I presume this is due to the aforementioned slight differences in the gene sets because rows are gene sets in Current output (BioC 3.15)dim(aggregated) # rows are gene sets, columns are cells
## [1] 22607 2772 Local output with
|
I also ran a version locally where OSCA.basic/inst/book/cell-annotation.Rmd Lines 286 to 287 in f7910c6
was changed to use the muraro.rankings <- AUCell_buildRankings(muraro.mat,
plotStats=FALSE, verbose=FALSE, splitByBlocks=TRUE) This is as advised by the AUCell authors for sparse matrix inputs. I made this change locally ('local output with Again, things looked okay up until OSCA.basic/inst/book/cell-annotation.Rmd Lines 255 to 258 in f7910c6
Local output with
|
I now realise that these plots are made before the problematic lines of OSCA.basic/inst/book/cell-annotation.Rmd Lines 286 to 287 in f7910c6
so this change in plotting behaviour is unrelated to the issue of sparse matrix support in AUCell_buildRankings() .
|
However, as previously noted, the OSCA.basic/inst/book/cell-annotation.Rmd Lines 286 to 287 in f7910c6
are actually different depending on whether the input is explicitly coerced to a dense matrix or the > muraro.rankings.as.matrix<- AUCell_buildRankings(
+ as.matrix(muraro.mat),
+ plotStats = FALSE,
+ verbose = FALSE)
>
> muraro.rankings.splitByBlocks <- AUCell_buildRankings(
+ muraro.mat,
+ plotStats = FALSE,
+ verbose = FALSE,
+ splitByBlocks = TRUE)
>
> muraro.rankings.as.matrix
Ranking for 16940 genes (rows) and 2079 cells (columns).
Top-left corner of the ranking:
cells
genes D28-1_1 D28-1_2 D28-1_4 D28-1_13 D28-1_15
A1BG-AS1 12375 7755 15108 16914 10106
A1BG 6971 13236 5730 10117 14253
A1CF 538 9809 1264 10216 716
A2M-AS1 15103 16274 14683 13022 12074
A2ML1 5286 12958 12528 6641 16138
A2M 15909 1434 10091 13875 11194
>
> muraro.rankings.splitByBlocks
Ranking for 16940 genes (rows) and 2079 cells (columns).
Top-left corner of the ranking:
cells
genes D28-1_1 D28-1_2 D28-1_4 D28-1_13 D28-1_15
A1BG-AS1 10989 16401 14765 12735 14289
A1BG 7375 11922 5759 10930 9983
A1CF 521 7797 1199 6793 691
A2M-AS1 13144 7071 16628 6286 6671
A2ML1 5443 12637 9928 5721 12270
A2M 9706 1331 15394 14266 15242 Yet the subsequent heatmaps used in the book look the very similar despite these differences in output (which ultimately influence the inputs to the heatmaps). Current output (BioC 3.15)Local output with
|
In a bit more detail, here's a reprex showing that library(rebook)
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(GSEABase))
library(AUCell)
extractFromPackage("muraro-pancreas.Rmd", package="OSCA.workflows",
chunk="normalization", objects="sce.muraro")
#> <button class="rebook-collapse">View set-up code ([Workflow Chapter 6](http://bioconductor.org/books/3.15/OSCA.workflows/muraro-human-pancreas-cel-seq.html#muraro-human-pancreas-cel-seq))</button>
#> <div class="rebook-content">
#>
#> ```r
#> #--- loading ---#
#> library(scRNAseq)
#> sce.muraro <- MuraroPancreasData()
#>
#> #--- gene-annotation ---#
#> library(AnnotationHub)
#> edb <- AnnotationHub()[["AH73881"]]
#> gene.symb <- sub("__chr.*$", "", rownames(sce.muraro))
#> gene.ids <- mapIds(edb, keys=gene.symb,
#> keytype="SYMBOL", column="GENEID")
#>
#> # Removing duplicated genes or genes without Ensembl IDs.
#> keep <- !is.na(gene.ids) & !duplicated(gene.ids)
#> sce.muraro <- sce.muraro[keep,]
#> rownames(sce.muraro) <- gene.ids[keep]
#>
#> #--- quality-control ---#
#> library(scater)
#> stats <- perCellQCMetrics(sce.muraro)
#> qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
#> batch=sce.muraro$donor, subset=sce.muraro$donor!="D28")
#> sce.muraro <- sce.muraro[,!qc$discard]
#>
#> #--- normalization ---#
#> library(scran)
#> set.seed(1000)
#> clusters <- quickCluster(sce.muraro)
#> sce.muraro <- computeSumFactors(sce.muraro, clusters=clusters)
#> sce.muraro <- logNormCounts(sce.muraro)
#> ```
#>
#> </div>
# Pruning out unknown or unclear labels.
sce.muraro <- sce.muraro[,!is.na(sce.muraro$label) &
sce.muraro$label!="unclear"]
# Downloading the signatures and caching them locally.
library(BiocFileCache)
#> Loading required package: dbplyr
bfc <- BiocFileCache(ask=FALSE)
scsig.path <- bfcrpath(bfc, file.path("http://software.broadinstitute.org",
"gsea/msigdb/supplemental/scsig.all.v1.0.symbols.gmt"))
scsigs <- getGmt(scsig.path)
muraro.mat <- counts(sce.muraro)
rownames(muraro.mat) <- rowData(sce.muraro)$symbol
runAUCell <- function(x, scsigs) {
if (is.matrix(x)) {
rankings <- AUCell_buildRankings(
x,
plotStats = FALSE,
verbose = FALSE)
} else if (is(x, "sparseMatrix")) {
rankings <- AUCell_buildRankings(
x,
plotStats = FALSE,
verbose = FALSE,
splitByBlocks = TRUE)
}
# Applying MsigDB to the Muraro dataset, because it's human:
AUCell_calcAUC(scsigs, rankings)
}
# Run AUCell on the same data stored as (1) dense matrix or (2) sparse matrix.
dense <- suppressMessages(runAUCell(as.matrix(muraro.mat), scsigs))
sparse <- suppressMessages(runAUCell(muraro.mat, scsigs))
# These return different results.
all.equal(dense, sparse)
#> [1] "Attributes: < Component \"assays\": Attributes: < Component \"data\": Attributes: < Component \"listData\": Component \"AUC\": Mean relative difference: 0.02188224 > > >"
# Specifically, the AUC values (the key output of AUCell!) are different.
all.equal(getAUC(dense), getAUC(sparse))
#> [1] "Mean relative difference: 0.02188224"
# Consequently, assigning cells based on the max AUC, as done in OSCA, gives
# different results for a small number of cells.
all.equal(
rownames(dense)[max.col(t(getAUC(dense)))],
rownames(sparse)[max.col(t(getAUC(sparse)))])
#> [1] "21 string mismatches" Created on 2022-10-13 with reprex v2.0.2 Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os macOS Big Sur ... 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Melbourne
#> date 2022-10-13
#> pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> ! package * version date (UTC) lib source
#> P annotate * 1.74.0 2022-04-26 [?] Bioconductor
#> P AnnotationDbi * 1.58.0 2022-04-26 [?] Bioconductor
#> P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.2.0)
#> P AUCell * 1.18.1 2022-05-19 [?] Bioconductor
#> P Biobase * 2.56.0 2022-04-26 [?] Bioconductor
#> P BiocFileCache * 2.4.0 2022-04-26 [?] Bioconductor
#> P BiocGenerics * 0.42.0 2022-04-26 [?] Bioconductor
#> P BiocManager 1.30.18 2022-05-18 [?] CRAN (R 4.2.0)
#> P BiocStyle 2.24.0 2022-04-26 [?] Bioconductor
#> P Biostrings 2.64.1 2022-08-18 [?] Bioconductor
#> P bit 4.0.4 2020-08-04 [?] CRAN (R 4.2.0)
#> P bit64 4.0.5 2020-08-30 [?] CRAN (R 4.2.0)
#> P bitops 1.0-7 2021-04-24 [?] CRAN (R 4.2.0)
#> P blob 1.2.3 2022-04-10 [?] CRAN (R 4.2.0)
#> P cachem 1.0.6 2021-08-19 [?] CRAN (R 4.2.0)
#> P cli 3.4.1 2022-09-23 [?] CRAN (R 4.2.0)
#> P CodeDepends 0.6.5 2018-07-17 [?] CRAN (R 4.2.0)
#> P codetools 0.2-18 2020-11-04 [3] CRAN (R 4.2.1)
#> P crayon 1.5.2 2022-09-29 [?] CRAN (R 4.2.0)
#> P curl 4.3.3 2022-10-06 [?] CRAN (R 4.2.0)
#> P data.table 1.14.2 2021-09-27 [?] CRAN (R 4.2.0)
#> P DBI 1.1.3 2022-06-18 [?] CRAN (R 4.2.0)
#> P dbplyr * 2.2.1 2022-06-27 [?] CRAN (R 4.2.0)
#> P DelayedArray 0.22.0 2022-04-26 [?] Bioconductor
#> P DelayedMatrixStats 1.18.1 2022-09-27 [?] Bioconductor
#> P digest 0.6.29 2021-12-01 [?] CRAN (R 4.2.0)
#> P dir.expiry 1.4.0 2022-04-26 [?] Bioconductor
#> P dplyr 1.0.10 2022-09-01 [?] CRAN (R 4.2.0)
#> P ellipsis 0.3.2 2021-04-29 [?] CRAN (R 4.2.0)
#> P evaluate 0.17 2022-10-07 [?] CRAN (R 4.2.0)
#> P fansi 1.0.3 2022-03-24 [?] CRAN (R 4.2.0)
#> P fastmap 1.1.0 2021-01-25 [?] CRAN (R 4.2.0)
#> P filelock 1.0.2 2018-10-05 [?] CRAN (R 4.2.0)
#> P fs 1.5.2 2021-12-08 [?] CRAN (R 4.2.0)
#> P generics 0.1.3 2022-07-05 [?] CRAN (R 4.2.0)
#> P GenomeInfoDb * 1.32.4 2022-09-06 [?] Bioconductor
#> P GenomeInfoDbData 1.2.8 2022-10-11 [?] Bioconductor
#> P GenomicRanges * 1.48.0 2022-04-26 [?] Bioconductor
#> P glue 1.6.2 2022-02-24 [?] CRAN (R 4.2.0)
#> P graph * 1.74.0 2022-04-26 [?] Bioconductor
#> P GSEABase * 1.58.0 2022-04-26 [?] Bioconductor
#> P highr 0.9 2021-04-16 [?] CRAN (R 4.2.0)
#> P htmltools 0.5.3 2022-07-18 [?] CRAN (R 4.2.0)
#> P httpuv 1.6.6 2022-09-08 [?] CRAN (R 4.2.0)
#> P httr 1.4.4 2022-08-17 [?] CRAN (R 4.2.0)
#> P IRanges * 2.30.1 2022-08-18 [?] Bioconductor
#> P KEGGREST 1.36.3 2022-07-14 [?] Bioconductor
#> P knitr 1.40 2022-08-24 [?] CRAN (R 4.2.0)
#> P later 1.3.0 2021-08-18 [?] CRAN (R 4.2.0)
#> P lattice 0.20-45 2021-09-22 [3] CRAN (R 4.2.1)
#> P lifecycle 1.0.3 2022-10-07 [?] CRAN (R 4.2.0)
#> P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.2.0)
#> P Matrix 1.5-1 2022-09-13 [3] CRAN (R 4.2.0)
#> P MatrixGenerics * 1.8.1 2022-06-30 [?] Bioconductor
#> P matrixStats * 0.62.0 2022-04-19 [?] CRAN (R 4.2.0)
#> P memoise 2.0.1 2021-11-26 [?] CRAN (R 4.2.0)
#> P mime 0.12 2021-09-28 [?] CRAN (R 4.2.0)
#> P pillar 1.8.1 2022-08-19 [?] CRAN (R 4.2.0)
#> P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.2.0)
#> P png 0.1-7 2013-12-03 [?] CRAN (R 4.2.0)
#> P promises 1.2.0.1 2021-02-11 [?] CRAN (R 4.2.0)
#> P purrr 0.3.5 2022-10-06 [?] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [3] CRAN (R 4.2.0)
#> P R.methodsS3 1.8.2 2022-06-13 [?] CRAN (R 4.2.0)
#> P R.oo 1.25.0 2022-06-12 [?] CRAN (R 4.2.0)
#> P R.utils 2.12.0 2022-06-28 [?] CRAN (R 4.2.0)
#> P R6 2.5.1 2021-08-19 [?] CRAN (R 4.2.0)
#> P rappdirs 0.3.3 2021-01-31 [?] CRAN (R 4.2.0)
#> P Rcpp 1.0.9 2022-07-08 [?] CRAN (R 4.2.0)
#> P RCurl 1.98-1.9 2022-10-03 [?] CRAN (R 4.2.0)
#> P rebook * 1.6.0 2022-04-26 [?] Bioconductor
#> P reprex 2.0.2 2022-08-17 [?] CRAN (R 4.2.0)
#> P rlang 1.0.6 2022-09-24 [?] CRAN (R 4.2.0)
#> P rmarkdown 2.17 2022-10-07 [?] CRAN (R 4.2.0)
#> P RSQLite 2.2.18 2022-10-04 [?] CRAN (R 4.2.0)
#> P rstudioapi 0.14 2022-08-22 [?] CRAN (R 4.2.0)
#> P S4Vectors * 0.34.0 2022-04-26 [?] Bioconductor
#> P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.2.0)
#> P shiny 1.7.2 2022-07-19 [?] CRAN (R 4.2.0)
#> P SingleCellExperiment * 1.18.1 2022-10-02 [?] Bioconductor
#> P sparseMatrixStats 1.8.0 2022-04-26 [?] Bioconductor
#> P stringi 1.7.8 2022-07-11 [?] CRAN (R 4.2.0)
#> P stringr 1.4.1 2022-08-20 [?] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [3] CRAN (R 4.2.0)
#> P SummarizedExperiment * 1.26.1 2022-05-01 [?] Bioconductor
#> P tibble 3.1.8 2022-07-22 [?] CRAN (R 4.2.0)
#> P tidyselect 1.2.0 2022-10-10 [?] CRAN (R 4.2.1)
#> P utf8 1.2.2 2021-07-24 [?] CRAN (R 4.2.0)
#> P vctrs 0.4.2 2022-09-29 [?] CRAN (R 4.2.0)
#> P withr 2.5.0 2022-03-03 [?] CRAN (R 4.2.0)
#> P xfun 0.33 2022-09-12 [?] CRAN (R 4.2.0)
#> P XML * 3.99-0.11 2022-10-03 [?] CRAN (R 4.2.0)
#> P xtable 1.8-4 2019-04-21 [?] CRAN (R 4.2.0)
#> P XVector 0.36.0 2022-04-26 [?] Bioconductor
#> P yaml 2.3.5 2022-02-21 [?] CRAN (R 4.2.0)
#> P zlibbioc 1.42.0 2022-04-26 [?] Bioconductor
#>
#> [1] /Users/Peter/Library/Caches/org.R-project.R/R/renv/library/OSCA.basic-52015504/R-4.2/x86_64-apple-darwin17.0
#> [2] /Users/Peter/GitHub/OSCA.basic/renv/sandbox/R-4.2/x86_64-apple-darwin17.0/84ba8b13
#> [3] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> P ── Loaded and on-disk path mismatch.
#>
#> ────────────────────────────────────────────────────────────────────────────── |
To summarise (and apologies if you've already reached these conclusions):
(2) is annoying, but I'm inclined to ignore it. (3a) is definitely a problem, but the AUCell authors don't seem inclined to fix this. (3b) is the killer and is what is preventing a clean build of OSCA.basic in both BioC 3.15 and 3.16. I think our options to fix (3b) are:
I'm inclined to go with (2) and perhaps move to (1) during the next development cycle but could be swayed another way. @alanocallaghan Please let me know your thoughts. |
Very thorough thanks Peter. I think you're right on all points; I am a bit worried by conclusion 3) but I don't really like removing a section without replacing it. As such I think option 2) for now and to consider option 1) next release is a good choice. Particularly as it would give us more time to evaluate options to replace it. |
We opt to explicitly coerce the count matrix to a dense matrix to avoid issues with support for sparse matrices in AUCell. 1. There are some changes to the outputs in the 'Cell type annotation' chapter that are independent of AUCell. These are minor and are fine to ignore. 2. There were changes made to AUCell in the BioC 3.15 release cycle that mean we now get different plots. Perhaps this is correcting previously wrong outputs, I don't now and don't have time to dig into. But this is independent of the issue with `AUCell::AUCell_buildRankings()`. 3. `AUCell::AUCell_buildRankings()` is unreliable. a. It gives different results for dense and sparse inputs. b. It does not automatically/gracefully handle sparse inputs as it should. (2) is annoying, but for now we ignore it (3a) is definitely a problem, but the AUCell authors don't seem inclined to fix this. My assumption (hope?) is that the calculations for dense inputs are correct/reliable, so I'm inclined to use those. (3b) is the killer and is what is preventing a clean build of OSCA.basic in both BioC 3.15 and 3.16. See #8 for further analysis and rationale
We opt to explicitly coerce the count matrix to a dense matrix to avoid issues with support for sparse matrices in AUCell. 1. There are some changes to the outputs in the 'Cell type annotation' chapter that are independent of AUCell. These are minor and are fine to ignore. 2. There were changes made to AUCell in the BioC 3.15 release cycle that mean we now get different plots. Perhaps this is correcting previously wrong outputs, I don't now and don't have time to dig into. But this is independent of the issue with `AUCell::AUCell_buildRankings()`. 3. `AUCell::AUCell_buildRankings()` is unreliable. a. It gives different results for dense and sparse inputs. b. It does not automatically/gracefully handle sparse inputs as it should. (2) is annoying, but for now we ignore it (3a) is definitely a problem, but the AUCell authors don't seem inclined to fix this. My assumption (hope?) is that the calculations for dense inputs are correct/reliable, so I'm inclined to use those. (3b) is the killer and is what is preventing a clean build of OSCA.basic in both BioC 3.15 and 3.16. See #8 for further analysis and rationale
Thanks Alan. In any case, I've pushed both to the BioC git server with a version bump to trigger a build. |
Following my overnight check on my laptop I found that another small tweak was required in BioC 3.16 (e0b0c9a). |
Success! Now building and passing checks in BioC 3.16 as well as BioC 3.15 |
Trying to bring this issue to a close for both BioC 3.15 and 3.16.
The issue was initially noted in #4 (comment)
It was followed up in aertslab/AUCell#28, but the discussion in that issue and lack of follow-up by the AUCell authors leaves me with little hope that they will fix the underlying problem.
The simplest fix, albeit somewhat drastic, is to simply remove usage of AUCell from OSCA, as proposed/implemented in #7.
I'll document here some attempts to work around (if not fix) the issue.
The text was updated successfully, but these errors were encountered: