README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->
<!--You'll still need to render `README.Rmd` regularly, to keep `README.md` up-to-date. `devtools::build_readme()` is handy for this. You could also use GitHub Actions to re-render `README.Rmd` every time you push. An example workflow can be found here: <https://github.com/r-lib/actions/tree/master/examples>. -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.path = "inst/figures/README-",
  out.width = "100%"
)
```

<!-- badges: start -->
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![DOI](https://zenodo.org/badge/374398121.svg)](https://zenodo.org/badge/latestdoi/374398121)
<!-- badges: end -->

![](./GRETTA_hex_logo-02.png)

# Introduction 
 
Genetic inteRaction and EssenTiality mApper (GRETTA) is an R package that leverages data generated by the [Cancer Dependency Map (DepMap) project](https://depmap.org/portal/) to perform in-silico genetic knockout screens and map essentiality networks. A manuscript describing this tool is available at  [bioinformatics (Takemon, Y. and Marra, MA., 2023)](https://doi.org/10.1093/bioinformatics/btad381).

The DepMap data used in this tutorial is version 22Q2. This version along with all versions provided in this repository were downloaded through the DepMap data portal, which was distributed and used under the terms and conditions of [CC Attribution 4.0 license](https://creativecommons.org/licenses/by/4.0/). 

## Maintainer

This repository is maintained by [Yuka Takemon](https://github.com/ytakemon), a PhD candidate in [Dr. Marco Marra](https://www.bcgsc.ca/labs/marra-lab)'s laboratory at [Canada's Michael Smith Genome Sciences Centre](https://www.bcgsc.ca/).

## Citations 

When using GRETTA, please cite the manuscript describing GRETTA:
Yuka Takemon, Marco A Marra, GRETTA: an R package for mapping in silico genetic interaction and essentiality networks, Bioinformatics, Volume 39, Issue 6, June 2023, btad381, https://doi.org/10.1093/bioinformatics/btad381

Please also cite the DepMap project and the appropriate data version found on https://depmap.org/portal/:
Tsherniak A, Vazquez F, Montgomery PG, Weir BA, Kryukov G, Cowley GS, Gill S, Harrington WF, Pantel S, Krill-Burger JM, Meyers RM, Ali L, Goodale A, Lee Y, Jiang G, Hsiao J, Gerath WFJ, Howell S, Merkel E, Ghandi M, Garraway LA, Root DE, Golub TR, Boehm JS, Hahn WC. Defining a Cancer Dependency Map. Cell. 2017 Jul 27;170(3):564-576.

## Questions

Please check the [FAQ section](https://github.com/ytakemon/GRETTA/wiki/Frequently-Asked-Questions) for additional information and if you cannot find your answer there or have a request please submit an [issue](https://github.com/ytakemon/GRETTA/issues).

## Requirements

* GRETTA is supported and compatible for R versions >= 4.2.0. 
* 12G of space to store one DepMap data set with and an additional 11G of temporary space to for .tar.gz prior to extraction.

# Installation

> **Warning**
> The new version of [dbplyr (v2.4.0)](https://stackoverflow.com/questions/77370659/error-failed-to-collect-lazy-table-caused-by-error-in-db-collect-using/77370920#77370920
) is currently incompatable with another library used in GRETTA. If you encounter an error message like the one below. Please install the previous working version also shown below.  
>
> Error message:
> 
> ```
> Error in `collect()`:
> ! Failed to collect lazy table.
> Caused by error in `db_collect()`:
> ! Arguments in `...` must be used.
> ✖ Problematic argument:
> • ..1 = Inf
> ℹ Did you misspell an argument name?
> ```
> 
> Solution:
> 
> ```
> install.packages("devtools")
> devtools::install_version("dbplyr", version = "2.3.4")`
> ```

You can install the GRETTA package from [GitHub](https://github.com) with:

``` r
install.packages(c("devtools", "dplyr","forcats","ggplot2"))
devtools::install_github("ytakemon/GRETTA")
```

DepMap 22Q2 data and the data documentation files are provided above and can be extracted directly in terminal using the following bash code (not in R/RStudio). For other DepMap data versions please refer to the [FAQ section](https://github.com/ytakemon/GRETTA/wiki/Frequently-Asked-Questions#q-how-to-download-and-use-other-versions-of-depmap-data).

``` bash
# Make a new directory/folder called GRETTA_project and go into directory
mkdir GRETTA_project
cd GRETTA_project

# Download data from the web
wget https://www.bcgsc.ca/downloads/ytakemon/GRETTA/22Q2/GRETTA_DepMap_22Q2_data.tar.gz

# Extract data and data documentation
tar -zxvf GRETTA_DepMap_22Q2_data.tar.gz
```

A singularity container has also been provided and instructions can be found [here](https://github.com/ytakemon/GRETTA/wiki/Frequently-Asked-Questions#q-how-to-run-singularity).

### Additional DepMap versions

In this example we use DepMap's 2022 data release (22Q2). However, we also provide previous data released in 2020 (v20Q1) and 2021 (v21Q4), which are available at :`https://www.bcgsc.ca/downloads/ytakemon/GRETTA/`. We are hoping to make new data sets available as the are released by DepMap.

# Workflows 

## Genetic interaction mapping

1. Install `GRETTA` and download accompanying data.
1. Select mutant cell lines that carry mutations in the gene of interest and control cell lines. 
    + ([optional specifications](https://github.com/ytakemon/GRETTA/wiki/Frequently-Asked-Questions#q-how-can-context-specific-genetic-screens-or-essentiality-network-analyses-be-performed)) can be used to select cell lines based on disease type, disease subtype, or amino acid change.
1. Determine differential expression between mutant and control cell line groups.
    + (optional but recommended).
1. Perform *in silico* genetic screen.
1. Visualize results.

## Co-essential network mapping

1. Install `GRETTA` and download accompanying data.
1. Run correlation coefficient analysis.
    + ([optional specifications](https://github.com/ytakemon/GRETTA/wiki/Frequently-Asked-Questions#q-how-can-context-specific-genetic-screens-or-essentiality-network-analyses-be-performed:~:text=For%20the%20essentiality%20network%20analysis%2C%20context%2Dspecific%20cell%20lines%20can%20be%20selected%20in%20two%20ways%3A)) can be used to perform analysis on cell lines of a specific disease type(s).
1. Calculate inflection points of negative/positive curve to determine a threshold.
1. Apply threshold.
1. Visualize results.

# Example: Identifying *ARID1A* genetic interactions

*ARID1A* encodes a member of the chromatin remodeling SWItch/Sucrose Non-Fermentable (SWI/SNF) complex and is a frequently mutated gene in cancer. It is known that *ARID1A* and its homolog, *ARID1B*, are synthetic lethal to one another: The dual loss of ARID1A and its homolog, ARID1B, in a cell is lethal; however, the loss of either gene alone is not  ([Helming et al., 2014](https://doi.org/10.1038/nm.3480)). This example will demonstrate how we can identify synthetic lethal interactors of *ARID1A* using `GRETTA` and predict this known interaction.

For this example you will need to call the following libraries. If you they are not installed yet use `install.packages()` (eg. `install.packages("dplyr")`).

```{r libraries}
# Load library
library(tidyverse)
library(GRETTA)
```

## Download example data

A small data set has been created for this tutorial and can be downloaded using the following code. 

```{r download_data}
path <- getwd()
download_example_data(path)
```

Then, assign variable that point to where the `.rda` files are stored and where result files should go.

```{r data_path}
gretta_data_dir <- paste0(path,"/GRETTA_example/")
gretta_output_dir <- paste0(path,"/GRETTA_example_output/")
```

## Exploring cell lines 

One way to explore cell lines that are available in DepMap is through their [portal](https://depmap.org/portal/). However, there are some simple built-in methods in GRETTA to provide users with a way to glimpse the data using the series of `list_available` functions: `list_mutations()`, `list_cancer_types()`, `list_cancer_subtypes()`

Current DepMap data used by default is version 22Q2, which contains whole-genome sequencing or whole-exome sequencing annotations for `1771` cancer cell lines (`1406` cell lines with RNA-seq data, `375` cell lines with quantitative proteomics data, and `1086` cell lines with CRISPR-Cas9 knockout screen data)

```{r listing_available_muts, eval = FALSE}
## Find ARID1A hotspot mutations detected in all cell lines
list_mutations(gene = "ARID1A", is_hotspot = TRUE, data_dir = gretta_data_dir) 
```

```{r listing_available_cancer_types, eval = TRUE}
## List all available cancer types
list_cancer_types(data_dir = gretta_data_dir)

## List all available cancer subtypes
list_cancer_subtypes(input_disease = "Lung Cancer", data_dir = gretta_data_dir)
```

## Selecting mutant and control cell line groups

As default `select_cell_lines()` will identify cancer cell lines with loss-of-function alterations in the gene specified and group them into six different groups. 

Loss-of-function alterations include variants that are annotated as: `"Nonsense_Mutation", "Frame_Shift_Ins", "Splice_Site", "De_novo_Start_OutOfFrame", "Frame_Shift_Del", "Start_Codon_SNP", "Start_Codon_Del",` and `"Start_Codon_Ins"`. Copy number alterations are also taken into consideration and group as `"Deep_del", "Loss", "Neutral",` or `"Amplified"`.

The cell line groups assigned by default are:

* `Control` cell lines do not harbor any single nucleotide variations (SNVs) or insertions and deletions (InDels) with a neutral copy number (CN).
* `HomDel` cell lines harbor one or more homozygous deleterious SNVs or have deep CN loss.
* `T-HetDel` cell lines harbor two or more heterozygous deleterious SNVs/InDels with neutral or CN loss.
* `HetDel` cell lines harbor one heterozygous deleterious SNV/InDel with neutral CN, or no SNV/InDel with CN loss.
* `Amplified` cell lines harbor no SNVs/InDels with increased CN.
* `Others` cell lines harbor deleterious SNVs with increased CN.

```{r select_groups, eval = TRUE}
ARID1A_groups <- select_cell_lines(input_gene = "ARID1A", data_dir = gretta_data_dir)

## Show number of cell lines in each group 
count(ARID1A_groups, Group)
```

### Optional cell line filters

There are several additional filters that can be combined together to narrow down your search. These 

* `input_aa_change` - by amino acid change (eg. "p.Q515*").
* `input_disease` - by disease type (eg. "Pancreatic Cancer")
* `input_disease_subtype` - by disease subtype (eg. "Ductal Adenosquamous Carcinoma")

```{r select_groups_option, eval = TRUE}
## Find pancreatic cancer cell lines with ARID1A mutations
ARID1A_pancr_groups <- select_cell_lines(input_gene = "ARID1A", 
                                         input_disease = "Pancreatic Cancer",
                                         data_dir = gretta_data_dir)

## Show number of cell lines in each group 
count(ARID1A_pancr_groups, Group)
```

## Check for differential expression 

Of the three mutant cancer cell line groups `ARID1A_HomDel`, `ARID1A_T-HetDel`, and `ARID1A_HetDel`, cancer cell lines with `ARID1A_HomDel` mutations are most likely to result in a loss or reduced expression of *ARID1A*. Therefore, we want to check whether cell lines in `ARID1A_HomDel` mutant group have significantly less *ARID1A* RNA or protein expression compared to control cell lines.

```{r Check_expression_rna, eval = TRUE}
## Select only HomDel and Control cell lines
ARID1A_groups_subset <- ARID1A_groups %>% filter(Group %in% c("ARID1A_HomDel", "Control"))

## Get RNA expression 
ARID1A_rna_expr <- extract_rna(
  input_samples = ARID1A_groups_subset$DepMap_ID, 
  input_genes = "ARID1A",
  data_dir = gretta_data_dir)
```

Not all cell lines contain RNA and/or protein expression profiles, and not all proteins were detected by mass spectrometer. (Details on data generation can be found on the [DepMap site](https://depmap.org/portal/).)

```{r Check_expression_protein, eval = FALSE}
## Get protein expression
ARID1A_prot_expr <- extract_prot(
  input_samples = ARID1A_groups_subset$DepMap_ID,
  input_genes = "ARID1A",
  data_dir = gretta_data_dir)

## Produces an error message since ARID1A protein data is not available
```
Using Welch's t-test, we can check to see whether *ARID1A* RNA expression (in TPM) is significantly reduced in `ARID1A_HomDel` cell lines compared to `Controls`. 

```{r Check_expression_rna_stats, eval = TRUE}
## Append groups and test differential expression
ARID1A_rna_expr <- left_join(
  ARID1A_rna_expr,
  ARID1A_groups_subset %>% select(DepMap_ID, Group)) %>%
  mutate(Group = fct_relevel(Group,"Control")) # show Control group first

## T-test 
t.test(ARID1A_8289 ~ Group, ARID1A_rna_expr)

## plot 
ggplot(ARID1A_rna_expr, aes(x = Group, y = ARID1A_8289)) +
  geom_boxplot()
```

## Perform genome-wide *in silico* genetic screen

After determining cell lines in the `ARID1A_HomDel` group has statistically significant reduction in RNA expression compared to `Control` cell lines, the next step is to perform a *in silico* genetic screen using `screen_results()`. This uses the dependency probabilities (or **"lethality probabilities"**) generated from DepMap's genome-wide CRISPR-Cas9 knockout screen. 

**Lethality probabilities** range from 0.0 to 1.0 and is quantified for each gene knock out in every cancer cell line screened (There are 18,334 genes targeted in 739 cancer cell lines). A gene knock out with a lethality probability of 0.0 indicates a non-essential for the cell line, and a gene knock out with a 1.0 indicates an essential gene (ie. very lethal). Details can be found in [Meyers, R., et al., 2017](https://doi.org/10.1038/ng.3984)

At its core, `screen_results()` performs multiple Mann-Whitney U tests, comparing lethality probabilities of each targeted gene between mutant and control groups. This generates a data frame with the following columns:

* `GeneName_ID` - Hugo symbol with NCBI gene ID
* `GeneNames` - Hugo symbol
* `_median, _mean, _sd, _iqr` - Control and mutant group's median, mean, standard deviation (sd), and interquartile range (iqr) of dependency probabilities. Dependency probabilities range from zero to one, where one indicates a essential gene (ie. KO of gene was lethal) and zero indicates a non-essential gene (KO of gene was not lethal)
* `Pval` - P-value from Mann Whitney U test between control and mutant groups.
* `Adj_pval` - BH-adjusted P-value.
* `log2FC_by_median` - Log2 normalized median fold change of dependency probabilities (mutant / control). Dependency probabilities range from 0.0-1.0 where 1.0 indicates high probability of KO leading to lethality, while 0.0 indicates little to no lethality.
* `log2FC_by_mean` - Log2 normalized mean fold change of dependency probabilities (mutant / control). Dependency probabilities range from 0.0-1.0 where 1.0 indicates high probability of KO leading to lethality, while 0.0 indicates little to no lethality.
* `CliffDelta` - Cliff's delta non-parametric effect size between mutant and control dependency probabilities. Ranges between -1 to 1.
* `dip_pval` - Hartigan's dip test p-value. Tests whether distribution of mutant dependency probability is unimodel. If dip test is rejected (p-value < 0.05), this indicates that there is a multimodel dependency probability distribution and that there may be another factor contributing to this separation.
* `Interaction_score` - Combined value generated from signed p-values: -log10(Pval) \* sign(log2FC_by_median). Positive scores indicate possible lethal genetic interaction, and negative scores indicate possible alleviating genetic interaction.

> **Warning**
> This process may take a few hours depending on the number of cores assigned. Our example below `GI_screen()` took ~2 hours to process. To save time, we have preprocessed this step for you.

```{r run_screen, eval = FALSE}
ARID1A_mutant_id <- ARID1A_groups %>% filter(Group %in% c("ARID1A_HomDel")) %>% pull(DepMap_ID)
ARID1A_control_id <- ARID1A_groups %>% filter(Group %in% c("Control")) %>% pull(DepMap_ID)

## See warning above.
## This can take several hours depending on number of lines/cores used. 
screen_results <- GI_screen(
  control_id = ARID1A_control_id, 
  mutant_id = ARID1A_mutant_id,
  core_num = 5, # depends on how many cores you have  
  output_dir = gretta_output_dir, # Will save your results here as well as in the variable
  data_dir = gretta_data_dir,
  test = FALSE) # use TRUE to run a short test to make sure all will run overnight.
```

```{r load_prepared_results, eval = TRUE}
## Load prepared ARID1A screen result
load(paste0(gretta_data_dir,"/sample_22Q2_ARID1A_KO_screen.rda"), envir = environment())
```

We can quickly determine whether any lethal genetic interactions were predicted by `GRETTA`. We use a `Pval` cut off of 0.05 and rank based on the `Interaction_score`. 

```{r checkout_res, eval = TRUE}
screen_results %>% 
  filter(Pval < 0.05) %>%
  arrange(-Interaction_score) %>% 
  select(GeneNames:Mutant_median, Pval, Interaction_score) %>% head
```

We immediately see that *ARID1B*, a known synthetic lethal interaction of *ARID1A*, was a the top of this list.

### Optional: Performing a small-scale screen

To perform a small in silico screen, a list of genes can be provided in the `gene_list =` argument. 

```{r run_screen_subset, eval = FALSE}
small_screen_results <- GI_screen(
  control_id = ARID1A_control_id, 
  mutant_id = ARID1A_mutant_id,
  gene_list = c("ARID1A", "ARID1B", "SMARCA2", "GAPDH", "SMARCC2"),
  core_num = 5, # depends on how many cores you have  
  output_dir = gretta_output_dir, # Will save your results here as well as in the variable
  data_dir = gretta_data_dir) 
```

## Visualize screen results

Finally once the *in silico* screen is complete, results can be quickly visualized using `plot_screen()`. Positive genetic interaction scores indicate potential synthetic lethal genetic interactors, and negative scores indicate potential alleviating genetic interactors. As expected, we identified *ARID1B* as a synthetic lethal interactor of *ARID1A*.

```{r plot, eval = TRUE}
## Visualize results, turn on gene labels, 
## and label three genes each that are predicted to have 
## lethal and alleviating genetic interactions, respectively

plot_screen(result_df = screen_results, 
            label_genes = TRUE, 
            label_n = 3)
```

# Example: Identifying *ARID1A* co-essential genes

Perturbing genes that function in same/synergistic pathways or in the same complex are said to show similar fitness effects, and these that show effects are considered to be "co-essential". The strategy of mapping co-essential gene have been used by several studies to attribute functions to previously annotated genes as well as to identify a novel subunit of a large complex ([Wainberg et al. 2021](https://doi.org/10.1038/s41588-021-00840-z); [Pan et al. 2018](https://doi.org/10.1016/j.cels.2018.04.011)). 

Given that ARID1A is known subunit of the mammalian SWI/SNF complex ([Mashtalir et al. 2018](https://doi.org/10.1016/j.cell.2018.09.032)), we expect that members of the SWI/SNF complex would share co-essentiality with *ARID1A*. This example will demonstrate how we can map *ARID1A*'s co-essential gene network using `GRETTA`.

## Identifying genes with highest correlation coefficients

To determine co-essential genes, we will perform multiple Pearson correlation coefficient analyses between *ARID1A* KO effects and the KO effects of all 18,333 genes. A cut off will be determined by calculating the inflection point of the ranked coefficient curve. As expected find SWI/SNF subunit encoding genes, *SMARCE1* and *SMARCB1*, as the top two co-essential genes. 

> **Warning**
> This process may take several minutes. Our example below `coessential_map()` + `get_inflection_points()` took ~17 minutes to process. To save time we have pre-processed this setp for you.

```{r coess_df, eval = FALSE}
## Map co-essential genes
coess_df <- coessential_map(
  input_gene = "ARID1A", 
  data_dir = gretta_data_dir, 
  output_dir = gretta_output_dir) 

## Calculate inflection points of positive and negative curve using co-essential gene results.
coess_inflection_df <- get_inflection_points(coess_df)
```

```{r load_prepared_results_coess, eval = TRUE, include = FALSE}
# Load pre-processed results
load(paste0(gretta_data_dir,"/sample_22Q2_ARID1A_coessential_result.rda"), envir = environment())
load(paste0(gretta_data_dir,"/sample_22Q2_ARID1A_coessential_inflection.rda"), envir = environment())
```

Next, we annotate the data frame containing the co-essential network data and visualize. 

```{r combine_n_visualize, eval = TRUE}
## Combine and annotate data frame containing co-essential genes
coess_annotated_df <- annotate_df(coess_df, coess_inflection_df)

plot_coess(
  result_df = coess_annotated_df, 
  inflection_df = coess_inflection_df,
  label_genes = TRUE, # Should gene names be labeled?
  label_n = 3) # Number of genes to display from each end
```

We also see that the top ten *ARID1A* co-essential genes include eight known SWI/SNF subunits, namely *ARID1A*, *SMARCB1*, *SMARCE1*, *SMARCC1*, *SS18*, *DPF2*, *SMARCC2*, and *SMARCD2*.

```{r top10_coess, eval = TRUE}
## Show top 10 co-essential genes. 
coess_annotated_df %>% arrange(Rank) %>% head(10)
```

### Optional filter for specific cancer types

Instead of mapping for essentiality across all available cell lines, users can also subset by disease type using the option `input_disease = ""`, or within a pre-selected group of cell lines using the option `input_cell_lines = c()`. Below we provide an example of how ARID1A essential genes are mapped for pancreatic cancers. 

> **Warning**
> Depending on the number of cell lines that are available after the subsetting step, 
> the inflection point calculation and thresholds may not be optimal. 
> Please use caution when interpreting these results.

```{r coess_for_panc, eval = FALSE}
## Map co-essential genes in pancreatic cancers only
coess_df <- coessential_map(
  input_gene = "ARID1A",
  input_disease = "Pancreatic Cancer",
  core_num = 5, ## Depending on how many cores you have access to, increase this value to shorten processing time.
  data_dir = gretta_data_dir, 
  output_dir = gretta_output_dir,
  test = FALSE)
```

### Optional filter for custom cell lines

We can also map essentiality across a manually defined list of cell lines using the `input_cell_lines = c()` option. 

> **Warning**
> Depending on the number of cell lines provided, the inflection point may not be calculated.
> Please use caution when interpreting these results.

```{r coess_for_lines, eval = FALSE}
custom_lines <- c("ACH-000001", "ACH-000002", "ACH-000003",...)

coess_df <- coessential_map(
  input_gene = "ARID1A",
  input_cell_lines = custom_lines,
  core_num = 5, ## Depending on how many cores you have access to, increase this value to shorten processing time.
  data_dir = gretta_data_dir, 
  output_dir = gretta_output_dir,
  test = FALSE)
```

# Session information
```{r sessioninfo}
sessionInfo()
```