-
Notifications
You must be signed in to change notification settings - Fork 3
3.1 scDNA: CopyKit
The goal of CopyKit is to help you analyze single cell DNA sequencing datasets for copy number.
This tutorial presents an example workflow of CopyKit. Detailed information for methods used of each function can be found in CopyKit user guide.
The example dataset and Rscript can be downloaded here.
-
Install the stable or development version from github using
devtools
. Note that knn smoothing is only available in development version. -
Install from package archive files, download
v0.1.2
here.
devtools::install_github("navinlabcode/copykit")
devtools::install_github("navinlabcode/copykit@devel")
Check version and load libraries.
packageVersion("copykit")
## [1] '0.1.2'
library(copykit)
Whenever possible, CopyKit uses the BiocParallel framework.
library(BiocParallel)
n_threads = 8
register(MulticoreParam(workers = n_threads, progressbar = F), default = T)
Confirm parameters:
bpparam()
## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 50; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: NA; bpprogressbar: FALSE
## bpexportglobals: TRUE; bpexportvariables: FALSE; bpforceGC: FALSE
## bpfallback: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
Run CBS segmentation from bam files.
This part takes more than 10 mins depending on the sample size, thus will be skipped in the workshop.
tumor <- runVarbin(
dir = "~/path/to/bam/files/",
genome = "hg38",
resolution = "220kb", # c("220kb", "55kb", "110kb", "195kb", "280kb", "500kb", "1Mb", "2.8Mb")
remove_Y = FALSE,
is_paired_end = FALSE,
seed = 17,
BPPARAM = bpparam()
)
An example of basic segmentation and metric calculation of sequencing data from mock data generated from the built-in simulator.
copykit_obj <- mock_bincounts(ncells = 5,
ncells_diploid = 1,
position_gain = c(1,5,1000))
## Running variance stabilization transformation: ft
## Smoothing outlier bins.
## Running segmentation algorithm: CBS for genome hg38
## Merging levels.
## Done.
copykit_obj <- runMetrics(copykit_obj)
## Calculating overdispersion.
## Counting breakpoints.
## Done.
Calculated metric was stored in colData
and can be visualized using
plotMetrics()
.
kable(colData(copykit_obj))
sample | ground_truth | overdispersion | breakpoint_count | |
---|---|---|---|---|
V1 | V1 | FALSE | 0.0002988 | 0 |
V2 | V2 | TRUE | 0.0002742 | 0 |
V3 | V3 | TRUE | 0.0002272 | 0 |
V4 | V4 | TRUE | 0.0001224 | 0 |
V5 | V5 | TRUE | 0.0001481 | 0 |
plotMetrics(copykit_obj, metric = "overdispersion", label = "ground_truth")
First, we load in the example CopyKit object. This data set was generated from scDNA-seq of a human liver sample.
tumor <- readRDS("/DATAPATH/sample_obj.rds")
Then take a look at the data structure of CopyKit object.
head(tumor)
## class: CopyKit
## dim: 6 600
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(6): 1 2 ... 5 6
## rowData names(3): gc_content abspos arm
## colnames(600): PMTC6LiverC103DL6L7S3_S1255_L004_R1_001
## PMTC6LiverC204DL1S7_S588_L002_R1_001 ...
## PMTC6LiverC271AL6L7S5_S1423_L004_R1_001
## PMTC6LiverC136AL6L7S3_S1288_L004_R1_001
## colData names(9): sample reads_assigned_bins ... reads_total
## percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 6 ranges
## Phylo: Phylogenetic tree with 0 tips and nodes
## consensus dim: 0 0
kable(rowData(tumor)[1:5,])
gc_content | abspos | arm |
---|---|---|
0.613786 | 1039596 | p |
0.591817 | 1264158 | p |
0.543128 | 1830107 | p |
0.572288 | 2065015 | p |
0.604703 | 2290536 | p |
kable(colData(tumor)[1:5,])
sample | reads_assigned_bins | reads_unmapped | reads_duplicates | reads_multimapped | reads_unassigned | reads_ambiguous | reads_total | percentage_duplicates | |
---|---|---|---|---|---|---|---|---|---|
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | 506994 | 27749 | 67923 | 0 | 105388 | 115 | 708169 | 0.096 |
PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | 476881 | 36334 | 45716 | 0 | 98424 | 96 | 657451 | 0.070 |
PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | 549322 | 27843 | 50278 | 0 | 111227 | 107 | 738777 | 0.068 |
PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | 812987 | 36743 | 78314 | 0 | 155468 | 179 | 1083691 | 0.072 |
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | 624401 | 35599 | 74604 | 0 | 131717 | 111 | 866432 | 0.086 |
kable(segment_ratios(tumor)[1:5,1:5])
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 |
---|---|---|---|---|
1 | 1 | 0.98 | 0.71 | 1 |
1 | 1 | 0.98 | 0.71 | 1 |
1 | 1 | 0.98 | 0.71 | 1 |
1 | 1 | 0.98 | 0.71 | 1 |
1 | 1 | 0.98 | 0.71 | 1 |
First, we want to detect euploid cells by calculating the sample-wise
coefficient of variation from the segment ratio means. The threshold can
be changed from the automatic detection to a custom threshold with the
argument resolution
. By setting a threshold = 0.1,
findAneuploidCells()
will mark as euploid all cells with a coefficient
of variation less or equal than 0.1.
tumor <- findAneuploidCells(tumor, seed = 17)
## number of iterations= 15
## Copykit detected 244 that are possibly diploid cells using a resolution of: 0.067
## Added information to colData(CopyKit).
Then we detect low-quality samples by a k-nearest-neighbors filtering (default k=5). Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default resolution=0.9).
tumor <- findOutliers(tumor, k = 5, resolution = 0.9)
## Calculating correlation matrix.
## Marked 70 cells as outliers.
## Adding information to metadata. Access with colData(scCNA).
## Done.
kable(colData(tumor)[1:5,])
sample | reads_assigned_bins | reads_unmapped | reads_duplicates | reads_multimapped | reads_unassigned | reads_ambiguous | reads_total | percentage_duplicates | is_aneuploid | find_normal_cv | cell_corr_value | outlier | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | 506994 | 27749 | 67923 | 0 | 105388 | 115 | 708169 | 0.096 | FALSE | 0.02 | 1.000 | FALSE |
PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | 476881 | 36334 | 45716 | 0 | 98424 | 96 | 657451 | 0.070 | FALSE | 0.00 | 1.000 | FALSE |
PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | 549322 | 27843 | 50278 | 0 | 111227 | 107 | 738777 | 0.068 | FALSE | 0.06 | 0.678 | TRUE |
PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | 812987 | 36743 | 78314 | 0 | 155468 | 179 | 1083691 | 0.072 | TRUE | 0.54 | 0.972 | FALSE |
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | 624401 | 35599 | 74604 | 0 | 131717 | 111 | 866432 | 0.086 | FALSE | 0.00 | 1.000 | FALSE |
Visualize the QC results by a heatmap.
plotHeatmap(tumor,
order_cells = "hclust",
label = c('is_aneuploid','outlier'),
row_split = 'outlier',
n_threads = n_threads)
In the last step of the QCm we remove unwanted cells by sub-setting the CopyKit object.
tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]
tumor <- tumor[,colData(tumor)$outlier == FALSE]
The dataset should now be ready to proceed with the analysis.
We apply dimension reduction and embed the data into 2 dimensional space.
tumor <- runUmap(tumor, seed = 17)
kable(reducedDim(tumor)[1:5,])
PMTC6LiverC154AL1S2_S154_L001_R1_001 | -2.8515888 | -0.8837085 |
PMTC6LiverC72AL3S1_S840_L003_R1_001 | -2.0377223 | 0.5955269 |
PMTC6LiverC223AL1S3_S223_L001_R1_001 | 0.5117559 | -0.2887311 |
PMTC6LiverC153AL1S6_S537_L002_R1_001 | -2.5070448 | -0.5821292 |
PMTC6LiverC198DL1S7_S582_L002_R1_001 | -1.1519634 | -0.8655404 |
plotUmap(tumor)
By default, CopyKit uses
hdbscan
as its clustering method and it is recommended to do so. To perform
clustering, k
should be pre-determined. To help with choosing k
,
CopyKit provides the helper function findSuggestedK()
to bootstrap
clustering over a range of k values, and returns the value that
maximizes the Jaccard similarity.
tumor <- findSuggestedK(tumor, k_range = 3:7)
## Calculating Jaccard similarity for k range: 3 4 5 6 7
##
## Suggested k = 7 with median jaccard similarity of: 0.943
plotSuggestedK(tumor)
plotSuggestedK(tumor, geom = 'tile')
Finally, we cluster the cells using hdbscan with our preferred k.
tumor <- findClusters(tumor)
## Using suggested k_subclones = 7
## Finding clusters, using method: hdbscan
## Found 2 outliers cells (group 'c0')
## Found 4 subclones.
## Done.
Check the clustering result.
tumor <- calcConsensus(tumor);tumor <- runConsensusPhylo(tumor)
plotUmap(tumor, label = 'subclones')
plotHeatmap(tumor,
order_cells = "hclust",
row_split = "subclones",
label = 'subclones',
n_threads = n_threads)
It is possible that a subgroup of clustering outliers is identified. Outliers are added to subgroup c0 and should be removed.
tumor_c0 <- tumor[,colData(tumor)$subclones == 'c0']
tumor <- tumor[,colData(tumor)$subclones != 'c0']
colData(tumor)$subclones <- droplevels(colData(tumor)$subclones)
Excluded data can be easily merged back. This can also be used to merge 2 CopyKit objects.
merged <- cbind(tumor, tumor_c0)
CopyKit supports different methods of calculating integer copy number profiles.
- Computational ploidies: scquantum is a method that estimates ploidy, and uses the ploidy estimate to scale the data to absolute copy numbers given bincount data from single-cell copy number profiling.
tumor <- calcInteger(tumor,
method = 'scquantum',
assay = 'bincounts')
- Fixed: a fixed value of ploidy (generally determined using Flow Cytometry) will be used to scale all cells.
tumor <- calcInteger(tumor,
method = 'fixed',
ploidy_value = 4.3)
Plot the integer copy number and check the rounding error
plotHeatmap(tumor,
assay = 'integer',
order_cells = "hclust",
row_split = "subclones",
label = 'subclones',
n_threads = n_threads)
plotHeatmap(tumor,
assay = 'integer',
order_cells = "hclust",
row_split = "subclones",
rounding_error = TRUE,
label = 'subclones',
n_threads = n_threads)
We can also check profile of any cell using the plotRatio()
. Here
shown profile of a cell from c1.
plotRatio(tumor, sample_name = 'PMTC6LiverC281AL6L7S5_S1433_L004_R1_001')
To run a phylogenetic analysis of cells’ copy number profiles, use the
function runPhylo()
. Available methods are Neighbor
Joining and Balanced Minimum
evolution. Trees can also be
visualized by plotPhylo()
tumor <- runPhylo(tumor,
assay = 'segment_ratios',
metric = 'manhattan',
n_threads = n_threads)
phylo(tumor)
##
## Phylogenetic tree with 311 tips and 310 internal nodes.
##
## Tip labels:
## PMTC6LiverC154AL1S2_S154_L001_R1_001, PMTC6LiverC72AL3S1_S840_L003_R1_001, PMTC6LiverC223AL1S3_S223_L001_R1_001, PMTC6LiverC153AL1S6_S537_L002_R1_001, PMTC6LiverC198DL1S7_S582_L002_R1_001, PMTC6LiverC161AL4L5S1_S929_L003_R1_001, ...
##
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones')
Single cell phylogeny is very prone to data noise. By taking the median of all cells from the same subclone, we can build subclonal consensus profiles and construct phylogeny from there. We can also root the consensus tree by a diploid profile.
tumor <- calcConsensus(tumor, assay = 'segment_ratios')
kable(consensus(tumor)[1:5,])
c1 | c2 | c3 | c4 | |
---|---|---|---|---|
V1 | 0.69 | 0.73 | 0.68 | 0.69 |
V2 | 0.69 | 0.73 | 0.68 | 0.69 |
V3 | 0.69 | 0.73 | 0.68 | 0.69 |
V4 | 0.69 | 0.73 | 0.68 | 0.69 |
V5 | 0.69 | 0.73 | 0.68 | 0.69 |
tumor <- runConsensusPhylo(tumor, root = 'neutral')
consensusPhylo(tumor)
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## c1, c2, c3, c4
##
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones', consensus = TRUE)
To further understand which copy number events contribute to the lineage, we can also plot consensus heatmaps annotated by oncogenes.
genes = c("CDKN2A",
"FGFR1",
"TP53",
"PTEN",
"MYC",
"CDKN1A",
"MDM2",
"AURKA",
"PIK3CA",
"CCND1",
"KRAS")
tumor <- calcConsensus(tumor, assay = 'integer')
plotHeatmap(tumor,
consensus = TRUE,
label = 'subclones',
genes = genes,
n_threads = n_threads)
Or more intuitively, check copy number states across genes in different subclones.
plotGeneCopy(tumor,
genes = genes,
label = 'subclones',
dodge.width = 0.8)
Also note that all the plot functions except heatmap are based on
ggplot2
, thus we can further customize them by adding themes, etc.
plotGeneCopy(tumor,
genes = genes,
label = 'subclones',
dodge.width = 0.8) +
ggplot2::ylim(c(0,4)) +
ggplot2::ggtitle("Copy number status across genes") +
ggplot2::theme_linedraw()
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 24.04 LTS
##
## Matrix products: default
## BLAS/LAPACK: /opt/miniconda3/lib/libopenblasp-r0.3.21.so; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: /UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BiocParallel_1.34.2 copykit_0.1.2
## [3] DNAcopy_1.74.1 Rsubread_2.14.2
## [5] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
## [7] Biobase_2.60.0 GenomicRanges_1.52.1
## [9] GenomeInfoDb_1.36.4 IRanges_2.34.1
## [11] S4Vectors_0.38.2 BiocGenerics_0.46.0
## [13] MatrixGenerics_1.12.3 matrixStats_1.3.0
## [15] knitr_1.47
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_1.8.8 shape_1.4.6.1
## [4] magrittr_2.0.3 ggbeeswarm_0.7.2 modeltools_0.2-23
## [7] farver_2.1.2 rmarkdown_2.27 GlobalOptions_0.1.2
## [10] fs_1.6.4 zlibbioc_1.46.0 vctrs_0.6.5
## [13] memoise_2.0.1 RCurl_1.98-1.14 ggtree_3.6.0
## [16] htmltools_0.5.8.1 S4Arrays_1.0.6 forcats_1.0.0
## [19] BiocNeighbors_1.18.0 gridGraphics_0.5-1 sass_0.4.9
## [22] bslib_0.7.0 htmlwidgets_1.6.4 plotly_4.10.4
## [25] cachem_1.1.0 igraph_2.0.3 mime_0.12
## [28] lifecycle_1.0.4 iterators_1.0.14 pkgconfig_2.0.3
## [31] Matrix_1.6-1.1 R6_2.5.1 fastmap_1.2.0
## [34] GenomeInfoDbData_1.2.10 shiny_1.8.1.1 clue_0.3-65
## [37] digest_0.6.35 aplot_0.2.3 colorspace_2.1-0
## [40] ggnewscale_0.4.10 patchwork_1.2.0 irlba_2.3.5.1
## [43] labeling_0.4.3 fansi_1.0.6 httr_1.4.7
## [46] abind_1.4-5 compiler_4.3.3 withr_3.0.0
## [49] doParallel_1.0.17 viridis_0.6.5 highr_0.11
## [52] MASS_7.3-60 DelayedArray_0.26.7 rjson_0.2.21
## [55] bluster_1.12.0 gtools_3.9.5 tools_4.3.3
## [58] vipor_0.4.7 beeswarm_0.4.0 ape_5.8
## [61] prabclus_2.3-3 httpuv_1.6.15 nnet_7.3-19
## [64] glue_1.7.0 dbscan_1.1-12 nlme_3.1-165
## [67] promises_1.3.0 grid_4.3.3 cluster_2.1.6
## [70] generics_0.1.3 gtable_0.3.5 class_7.3-22
## [73] tidyr_1.3.1 data.table_1.15.4 utf8_1.2.4
## [76] XVector_0.40.0 flexmix_2.3-19 foreach_1.5.2
## [79] pillar_1.9.0 yulab.utils_0.1.4 later_1.3.2
## [82] robustbase_0.99-2 circlize_0.4.16 splines_4.3.3
## [85] dplyr_1.1.4 treeio_1.24.3 lattice_0.22-6
## [88] FNN_1.1.4 survival_3.7-0 tidyselect_1.2.1
## [91] ComplexHeatmap_2.16.0 miniUI_0.1.1.1 amap_0.8-19
## [94] gridExtra_2.3 xfun_0.45 mixtools_2.0.0
## [97] diptest_0.77-1 scquantum_1.0.0 DEoptimR_1.1-3
## [100] lazyeval_0.2.2 ggfun_0.1.5 yaml_2.3.8
## [103] evaluate_0.24.0 codetools_0.2-20 kernlab_0.9-32
## [106] tibble_3.2.1 ggplotify_0.1.2 cli_3.6.2
## [109] uwot_0.2.2 xtable_1.8-4 segmented_2.1-0
## [112] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.12
## [115] png_0.1-8 fastcluster_1.2.6 parallel_4.3.3
## [118] ggplot2_3.5.1 mclust_6.1.1 ggalluvial_0.12.5
## [121] bitops_1.0-7 viridisLite_0.4.2 tidytree_0.4.6
## [124] scales_1.3.0 purrr_1.0.2 crayon_1.5.3
## [127] fpc_2.2-12 GetoptLong_1.0.5 rlang_1.1.4
For more info about us click here.