3.1 scDNA: CopyKit

Welcome to the single cell DNA Workshop!

Overview

The goal of CopyKit is to help you analyze single cell DNA sequencing datasets for copy number.

This tutorial presents an example workflow of CopyKit. Detailed information for methods used of each function can be found in CopyKit user guide.

The example dataset and Rscript can be downloaded here.

Installation

Install the stable or development version from github using devtools. Note that knn smoothing is only available in development version.
Install from package archive files, download v0.1.2 here.

devtools::install_github("navinlabcode/copykit")
devtools::install_github("navinlabcode/copykit@devel")

Check version and load libraries.

packageVersion("copykit")

## [1] '0.1.2'

library(copykit)

Parallelization

Whenever possible, CopyKit uses the BiocParallel framework.

library(BiocParallel)
n_threads = 8
register(MulticoreParam(workers = n_threads, progressbar = F), default = T)

Confirm parameters:

bpparam()

## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 50; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: NA; bpprogressbar: FALSE
##   bpexportglobals: TRUE; bpexportvariables: FALSE; bpforceGC: FALSE
##   bpfallback: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK

Segmentation

Run CBS segmentation from bam files.

This part takes more than 10 mins depending on the sample size, thus will be skipped in the workshop.

tumor <- runVarbin(
          dir = "~/path/to/bam/files/",
          genome = "hg38",
          resolution = "220kb", # c("220kb", "55kb", "110kb", "195kb", "280kb", "500kb", "1Mb", "2.8Mb")
          remove_Y = FALSE,
          is_paired_end = FALSE,
          seed = 17,
          BPPARAM = bpparam()
        )

An example of basic segmentation and metric calculation of sequencing data from mock data generated from the built-in simulator.

copykit_obj <- mock_bincounts(ncells = 5,
                              ncells_diploid = 1,
                              position_gain = c(1,5,1000))

## Running variance stabilization transformation: ft

## Smoothing outlier bins.

## Running segmentation algorithm: CBS for genome hg38

## Merging levels.

## Done.

copykit_obj <- runMetrics(copykit_obj)

## Calculating overdispersion.

## Counting breakpoints.

## Done.

Calculated metric was stored in colData and can be visualized using plotMetrics().

kable(colData(copykit_obj))

	sample	ground_truth	overdispersion
V1	V1	FALSE	0.0002988
V2	V2	TRUE	0.0002742
V3	V3	TRUE	0.0002272
V4	V4	TRUE	0.0001224
V5	V5	TRUE	0.0001481

plotMetrics(copykit_obj, metric = "overdispersion", label = "ground_truth")

qc2-1

Example data set

First, we load in the example CopyKit object. This data set was generated from scDNA-seq of a human liver sample.

tumor <- readRDS("/DATAPATH/sample_obj.rds")

Then take a look at the data structure of CopyKit object.

head(tumor)

## class: CopyKit 
## dim: 6 600 
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(6): 1 2 ... 5 6
## rowData names(3): gc_content abspos arm
## colnames(600): PMTC6LiverC103DL6L7S3_S1255_L004_R1_001
##   PMTC6LiverC204DL1S7_S588_L002_R1_001 ...
##   PMTC6LiverC271AL6L7S5_S1423_L004_R1_001
##   PMTC6LiverC136AL6L7S3_S1288_L004_R1_001
## colData names(9): sample reads_assigned_bins ... reads_total
##   percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 6 ranges
## Phylo: Phylogenetic tree with 0 tips and  nodes
## consensus dim: 0 0

kable(rowData(tumor)[1:5,])

gc_content	abspos	arm
0.613786	1039596	p
0.591817	1264158	p
0.543128	1830107	p
0.572288	2065015	p
0.604703	2290536	p

kable(colData(tumor)[1:5,])

	sample	reads_assigned_bins	reads_unmapped	reads_duplicates	reads_unassigned	reads_ambiguous	reads_total	percentage_duplicates
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	506994	27749	67923	105388	115	708169	0.096
PMTC6LiverC204DL1S7_S588_L002_R1_001	PMTC6LiverC204DL1S7_S588_L002_R1_001	476881	36334	45716	98424	96	657451	0.070
PMTC6LiverC258AL1S3_S258_L001_R1_001	PMTC6LiverC258AL1S3_S258_L001_R1_001	549322	27843	50278	111227	107	738777	0.068
PMTC6LiverC154AL1S2_S154_L001_R1_001	PMTC6LiverC154AL1S2_S154_L001_R1_001	812987	36743	78314	155468	179	1083691	0.072
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	624401	35599	74604	131717	111	866432	0.086

kable(segment_ratios(tumor)[1:5,1:5])

PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	PMTC6LiverC204DL1S7_S588_L002_R1_001	PMTC6LiverC258AL1S3_S258_L001_R1_001	PMTC6LiverC154AL1S2_S154_L001_R1_001	PMTC6LiverC247AL4L5S3_S1015_L003_R1_001
1	1	0.98	0.71	1
1	1	0.98	0.71	1
1	1	0.98	0.71	1
1	1	0.98	0.71	1
1	1	0.98	0.71	1

Quality Control

First, we want to detect euploid cells by calculating the sample-wise coefficient of variation from the segment ratio means. The threshold can be changed from the automatic detection to a custom threshold with the argument resolution. By setting a threshold = 0.1, findAneuploidCells() will mark as euploid all cells with a coefficient of variation less or equal than 0.1.

tumor <- findAneuploidCells(tumor, seed = 17)

## number of iterations= 15

## Copykit detected 244 that are possibly diploid cells using a resolution of: 0.067

## Added information to colData(CopyKit).

Then we detect low-quality samples by a k-nearest-neighbors filtering (default k=5). Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default resolution=0.9).

tumor <- findOutliers(tumor, k = 5, resolution = 0.9)

## Calculating correlation matrix.

## Marked 70 cells as outliers.

## Adding information to metadata. Access with colData(scCNA).

## Done.

kable(colData(tumor)[1:5,])

	sample	reads_assigned_bins	reads_unmapped	reads_duplicates	reads_unassigned	reads_ambiguous	reads_total	percentage_duplicates	is_aneuploid	find_normal_cv	cell_corr_value	outlier
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	506994	27749	67923	105388	115	708169	0.096	FALSE	0.02	1.000	FALSE
PMTC6LiverC204DL1S7_S588_L002_R1_001	PMTC6LiverC204DL1S7_S588_L002_R1_001	476881	36334	45716	98424	96	657451	0.070	FALSE	0.00	1.000	FALSE
PMTC6LiverC258AL1S3_S258_L001_R1_001	PMTC6LiverC258AL1S3_S258_L001_R1_001	549322	27843	50278	111227	107	738777	0.068	FALSE	0.06	0.678	TRUE
PMTC6LiverC154AL1S2_S154_L001_R1_001	PMTC6LiverC154AL1S2_S154_L001_R1_001	812987	36743	78314	155468	179	1083691	0.072	TRUE	0.54	0.972	FALSE
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	624401	35599	74604	131717	111	866432	0.086	FALSE	0.00	1.000	FALSE

Visualize the QC results by a heatmap.

plotHeatmap(tumor, 
            order_cells = "hclust",
            label = c('is_aneuploid','outlier'),
            row_split = 'outlier', 
            n_threads = n_threads)

qcht-1

In the last step of the QCm we remove unwanted cells by sub-setting the CopyKit object.

tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]
tumor <- tumor[,colData(tumor)$outlier == FALSE]

The dataset should now be ready to proceed with the analysis.

Clustering

We apply dimension reduction and embed the data into 2 dimensional space.

tumor <- runUmap(tumor, seed = 17)
kable(reducedDim(tumor)[1:5,])

PMTC6LiverC154AL1S2_S154_L001_R1_001	-2.8515888	-0.8837085
PMTC6LiverC72AL3S1_S840_L003_R1_001	-2.0377223	0.5955269
PMTC6LiverC223AL1S3_S223_L001_R1_001	0.5117559	-0.2887311
PMTC6LiverC153AL1S6_S537_L002_R1_001	-2.5070448	-0.5821292
PMTC6LiverC198DL1S7_S582_L002_R1_001	-1.1519634	-0.8655404

plotUmap(tumor)

umap-1

By default, CopyKit uses hdbscan as its clustering method and it is recommended to do so. To perform clustering, k should be pre-determined. To help with choosing k, CopyKit provides the helper function findSuggestedK() to bootstrap clustering over a range of k values, and returns the value that maximizes the Jaccard similarity.

tumor <- findSuggestedK(tumor, k_range = 3:7)

## Calculating Jaccard similarity for k range: 3 4 5 6 7

## 

## Suggested k = 7 with median jaccard similarity of: 0.943

plotSuggestedK(tumor)

findk-1

plotSuggestedK(tumor, geom = 'tile')

findk-2

Finally, we cluster the cells using hdbscan with our preferred k.

tumor <- findClusters(tumor)

## Using suggested k_subclones = 7

## Finding clusters, using method: hdbscan

## Found 2 outliers cells (group 'c0')

## Found 4 subclones.

## Done.

Check the clustering result.

tumor <- calcConsensus(tumor);tumor <- runConsensusPhylo(tumor)

plotUmap(tumor,  label = 'subclones')

visual-1

plotHeatmap(tumor, 
            order_cells = "hclust",
            row_split = "subclones",
            label = 'subclones', 
            n_threads = n_threads)

visual-2

It is possible that a subgroup of clustering outliers is identified. Outliers are added to subgroup c0 and should be removed.

tumor_c0 <- tumor[,colData(tumor)$subclones == 'c0']
tumor <- tumor[,colData(tumor)$subclones != 'c0']
colData(tumor)$subclones <- droplevels(colData(tumor)$subclones)

Excluded data can be easily merged back. This can also be used to merge 2 CopyKit objects.

merged <- cbind(tumor, tumor_c0)

Ploidy and Integer estimation

CopyKit supports different methods of calculating integer copy number profiles.

Computational ploidies: scquantum is a method that estimates ploidy, and uses the ploidy estimate to scale the data to absolute copy numbers given bincount data from single-cell copy number profiling.

tumor <- calcInteger(tumor, 
                     method = 'scquantum', 
                     assay = 'bincounts')

Fixed: a fixed value of ploidy (generally determined using Flow Cytometry) will be used to scale all cells.

tumor <- calcInteger(tumor, 
                     method = 'fixed', 
                     ploidy_value = 4.3)

Plot the integer copy number and check the rounding error

plotHeatmap(tumor, 
            assay = 'integer',
            order_cells = "hclust",
            row_split = "subclones",
            label = 'subclones', 
            n_threads = n_threads)

visualint-1

plotHeatmap(tumor, 
            assay = 'integer',
            order_cells = "hclust",
            row_split = "subclones",
            rounding_error = TRUE,
            label = 'subclones', 
            n_threads = n_threads)

visualint-2

We can also check profile of any cell using the plotRatio(). Here shown profile of a cell from c1.

plotRatio(tumor, sample_name = 'PMTC6LiverC281AL6L7S5_S1433_L004_R1_001')

ratio-1

Phylogeny

To run a phylogenetic analysis of cells’ copy number profiles, use the function runPhylo(). Available methods are Neighbor Joining and Balanced Minimum evolution. Trees can also be visualized by plotPhylo()

tumor <- runPhylo(tumor,
                  assay = 'segment_ratios',
                  metric = 'manhattan',
                  n_threads = n_threads)
phylo(tumor)

## 
## Phylogenetic tree with 311 tips and 310 internal nodes.
## 
## Tip labels:
##   PMTC6LiverC154AL1S2_S154_L001_R1_001, PMTC6LiverC72AL3S1_S840_L003_R1_001, PMTC6LiverC223AL1S3_S223_L001_R1_001, PMTC6LiverC153AL1S6_S537_L002_R1_001, PMTC6LiverC198DL1S7_S582_L002_R1_001, PMTC6LiverC161AL4L5S1_S929_L003_R1_001, ...
## 
## Rooted; includes branch lengths.

plotPhylo(tumor, label = 'subclones')

phylo-1

Single cell phylogeny is very prone to data noise. By taking the median of all cells from the same subclone, we can build subclonal consensus profiles and construct phylogeny from there. We can also root the consensus tree by a diploid profile.

tumor <- calcConsensus(tumor, assay = 'segment_ratios')
kable(consensus(tumor)[1:5,])

	c1	c2	c3	c4
V1	0.69	0.73	0.68	0.69
V2	0.69	0.73	0.68	0.69
V3	0.69	0.73	0.68	0.69
V4	0.69	0.73	0.68	0.69
V5	0.69	0.73	0.68	0.69

tumor <- runConsensusPhylo(tumor, root = 'neutral')
consensusPhylo(tumor)

## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
##   c1, c2, c3, c4
## 
## Rooted; includes branch lengths.

plotPhylo(tumor, label = 'subclones', consensus = TRUE)

consensus-1

To further understand which copy number events contribute to the lineage, we can also plot consensus heatmaps annotated by oncogenes.

genes = c("CDKN2A",
          "FGFR1",
          "TP53",
          "PTEN",
          "MYC",
          "CDKN1A",
          "MDM2",
          "AURKA",
          "PIK3CA",
          "CCND1",
          "KRAS")

tumor <- calcConsensus(tumor, assay = 'integer')
plotHeatmap(tumor, 
            consensus = TRUE,
            label = 'subclones',
            genes = genes,
            n_threads = n_threads)

consensuspl-1

Or more intuitively, check copy number states across genes in different subclones.

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8)

consensusgene-1

Also note that all the plot functions except heatmap are based on ggplot2, thus we can further customize them by adding themes, etc.

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8) +
  ggplot2::ylim(c(0,4)) +
  ggplot2::ggtitle("Copy number status across genes") +
  ggplot2::theme_linedraw()

custplot-1

Session info

sessionInfo()

## R version 4.3.3 (2024-02-29)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 24.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /opt/miniconda3/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: /UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocParallel_1.34.2         copykit_0.1.2              
##  [3] DNAcopy_1.74.1              Rsubread_2.14.2            
##  [5] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
##  [7] Biobase_2.60.0              GenomicRanges_1.52.1       
##  [9] GenomeInfoDb_1.36.4         IRanges_2.34.1             
## [11] S4Vectors_0.38.2            BiocGenerics_0.46.0        
## [13] MatrixGenerics_1.12.3       matrixStats_1.3.0          
## [15] knitr_1.47                 
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      jsonlite_1.8.8          shape_1.4.6.1          
##   [4] magrittr_2.0.3          ggbeeswarm_0.7.2        modeltools_0.2-23      
##   [7] farver_2.1.2            rmarkdown_2.27          GlobalOptions_0.1.2    
##  [10] fs_1.6.4                zlibbioc_1.46.0         vctrs_0.6.5            
##  [13] memoise_2.0.1           RCurl_1.98-1.14         ggtree_3.6.0           
##  [16] htmltools_0.5.8.1       S4Arrays_1.0.6          forcats_1.0.0          
##  [19] BiocNeighbors_1.18.0    gridGraphics_0.5-1      sass_0.4.9             
##  [22] bslib_0.7.0             htmlwidgets_1.6.4       plotly_4.10.4          
##  [25] cachem_1.1.0            igraph_2.0.3            mime_0.12              
##  [28] lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3        
##  [31] Matrix_1.6-1.1          R6_2.5.1                fastmap_1.2.0          
##  [34] GenomeInfoDbData_1.2.10 shiny_1.8.1.1           clue_0.3-65            
##  [37] digest_0.6.35           aplot_0.2.3             colorspace_2.1-0       
##  [40] ggnewscale_0.4.10       patchwork_1.2.0         irlba_2.3.5.1          
##  [43] labeling_0.4.3          fansi_1.0.6             httr_1.4.7             
##  [46] abind_1.4-5             compiler_4.3.3          withr_3.0.0            
##  [49] doParallel_1.0.17       viridis_0.6.5           highr_0.11             
##  [52] MASS_7.3-60             DelayedArray_0.26.7     rjson_0.2.21           
##  [55] bluster_1.12.0          gtools_3.9.5            tools_4.3.3            
##  [58] vipor_0.4.7             beeswarm_0.4.0          ape_5.8                
##  [61] prabclus_2.3-3          httpuv_1.6.15           nnet_7.3-19            
##  [64] glue_1.7.0              dbscan_1.1-12           nlme_3.1-165           
##  [67] promises_1.3.0          grid_4.3.3              cluster_2.1.6          
##  [70] generics_0.1.3          gtable_0.3.5            class_7.3-22           
##  [73] tidyr_1.3.1             data.table_1.15.4       utf8_1.2.4             
##  [76] XVector_0.40.0          flexmix_2.3-19          foreach_1.5.2          
##  [79] pillar_1.9.0            yulab.utils_0.1.4       later_1.3.2            
##  [82] robustbase_0.99-2       circlize_0.4.16         splines_4.3.3          
##  [85] dplyr_1.1.4             treeio_1.24.3           lattice_0.22-6         
##  [88] FNN_1.1.4               survival_3.7-0          tidyselect_1.2.1       
##  [91] ComplexHeatmap_2.16.0   miniUI_0.1.1.1          amap_0.8-19            
##  [94] gridExtra_2.3           xfun_0.45               mixtools_2.0.0         
##  [97] diptest_0.77-1          scquantum_1.0.0         DEoptimR_1.1-3         
## [100] lazyeval_0.2.2          ggfun_0.1.5             yaml_2.3.8             
## [103] evaluate_0.24.0         codetools_0.2-20        kernlab_0.9-32         
## [106] tibble_3.2.1            ggplotify_0.1.2         cli_3.6.2              
## [109] uwot_0.2.2              xtable_1.8-4            segmented_2.1-0        
## [112] munsell_0.5.1           jquerylib_0.1.4         Rcpp_1.0.12            
## [115] png_0.1-8               fastcluster_1.2.6       parallel_4.3.3         
## [118] ggplot2_3.5.1           mclust_6.1.1            ggalluvial_0.12.5      
## [121] bitops_1.0-7            viridisLite_0.4.2       tidytree_0.4.6         
## [124] scales_1.3.0            purrr_1.0.2             crayon_1.5.3           
## [127] fpc_2.2-12              GetoptLong_1.0.5        rlang_1.1.4

For more info about us click here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly