-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
11 changed files
with
676 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# `bean count[-samples]`: Count (reporter) screen data | ||
`bean count-samples` (or `bean count` for a single sample) maps guide into guide counts, **allowing for base transition in spacer sequence**. When the matched reporter information is provided, it can count the **target site edits** and **alleles produced by each guide**. Mapping is efficiently done based on [CRISPResso2](https://github.com/pinellolab/CRISPResso2) modified for base-edit-aware mapping. | ||
|
||
|
||
|
||
```python | ||
bean count-samples \ | ||
--input sample_list.csv `# sample with lines 'R1_filepath,R2_filepath,sample_name\n'` \ | ||
-b A `# base that is being edited (A/G)` \ | ||
-f sgRNA_info_table.csv `# sgRNA information` \ | ||
-o . `# output directory` \ | ||
-r `# read edit/allele information from reporter` \ | ||
-t 12 `# number of threads` \ | ||
--name my_sorting_screen `# name of this sample run` \ | ||
``` | ||
```python | ||
bean count --R1 R1.fq --R2 R2.fq -b A -f sgRNA_info_table.csv -r | ||
``` | ||
By default, `bean count[-samples]` assume R1 and R2 are trimmed off of the adapter sequence. You may need to adjust the command arguments according to your read structure. | ||
|
||
<img src="/crispr-bean/assets/sequence_struct.png" alt="Read structuren" width="600"/> | ||
|
||
See full detail [below](#full-parameters). | ||
|
||
# Input file format | ||
See :ref:`input` for input file formats. | ||
|
||
# Output file format | ||
`count` or `count-samples` produces `.h5ad` and `.xlsx` file with guide and per-guide allele counts. | ||
* `.h5ad`: This output file follows annotated matrix format compatible with `AnnData` and is based on `Screen` object in [purturb_tools](https://github.com/pinellolab/perturb-tools). See [Data Structure](#data-structure) section for more information. | ||
* `.xlsx`: This output file contains `.guides`, `.samples`, `.X[_bcmatch,_edits]`. (`allele_tables` are often too large to write into an Excel!) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# `bean create-screen`: Create ReporterScreen object from flat files | ||
```bash | ||
bean create-screen gRNA_library.csv sample_list.csv gRNA_counts_table.csv | ||
``` | ||
## Input | ||
* gRNA_library.csv | ||
* sample_list.csv | ||
* gRNA_counts_table.csv: Table with gRNA ID in the first column and sample IDs as the column names (first row) | ||
`gRNA_library.csv` and `sample_list.csv` should be formatted as :ref:`input`. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# `filter`: Filtering (and optionally translating) alleles | ||
As `tiling` mode of `bean run` accounts for any robustly observed alleles, `bean filter` filters for such alleles. | ||
```bash | ||
bean filter my_sorting_screen_masked.h5ad \ | ||
-o my_sorting_screen_filtered.h5ad `# Output file path` \ | ||
``` | ||
|
||
# Output | ||
Above command produces | ||
* `my_sorting_screen_filtered.h5ad` with filtered alleles stored in `.uns`, | ||
* `my_sorting_screen_filtered.filtered_allele_stats.pdf`, and `my_sorting_screen_filtered.filter_log.txt` that report allele count stats in each filtering step. | ||
|
||
You may want to adjust the flitering parameters to obtain optimal balance between # guides per variant & # variants that are scored. See example outputs of filtering step [here](docs/example_filtering_output/). | ||
|
||
|
||
# Translating alleles | ||
If you want to obtain **amino acid level variant** for coding sequence tiling screens, provide coding sequence positions which variants occuring within the coding sequence will be translated. *This is optional, but **highly recommended** to increase per-(coding)variant support.* | ||
|
||
<img src="/crispr-bean/assets/translation.png" alt="Allele translation" width="500"/> | ||
|
||
|
||
```bash | ||
bean filter my_sorting_screen.h5ad \ | ||
-o my_sorting_screen_masked.h5ad \ | ||
--translate `# Translate coding variants` \ | ||
[ --translate-gene-name GENE_SYMBOL OR | ||
--translate-genes-list path_to_gene_names_file.txt OR | ||
--translate-fasta gene_exon.fa, OR | ||
--translate-fastas-csv gene_exon_fas.csv] | ||
``` | ||
* When library covers a single gene, do either of the following: | ||
1. Feed `--translate-gene-name GENE_SYMBOL` if your `genomic_pos` column of `sgRNA_info_tbl` is compatible with [MANE transcript](https://useast.ensembl.org/info/genome/genebuild/mane.html)'s reference genome. (Per 10/23/2023, GRCh38). This will automatically load the exon positions based on MANE transcript annotation. | ||
2. To use your custom coding sequence and exon positions, feed `--translate-fasta gene_exon.fa` argument where `gene_exon.fa` is the FASTA file with entries of exons. [See full details here](docs/exon_fa_format.md). | ||
* When library covers multiple genes, do either of the following: | ||
1. Feed `--translate-genes-list path_to_gene_names_file.txt` where `path_to_gene_names_file.txt` is file with one gene symbol per line. | ||
2. Feed `--translate-fastas-csv gene_exon_fas.csv` where `gene_exon_fas.csv` is the csv file with lines `gene_id,gene_exon_fasta_path` without header. Each FASTA file in `gene_exon_fasta_path` is formatted [as the single-gene FASTA file](docs/exon_fa_format.md). | ||
* Translation will keep the variants outside the coding sequence as nucleotide-level variants, while aggregating variants leading to the same coding sequence variants. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
This document describes the input files of :ref:`count_samples`. | ||
## sgRNA_info_table.csv | ||
File should contain following columns. | ||
* `name`: gRNA ID column | ||
* `sequence`: gRNA sequence | ||
* `barcode`: R2 barcode to help match reporter to gRNA, written in the sense direction (as in R1) | ||
* In order to use accessibility in the [variant effect quantification](#bean-run-quantify-variant-effects), provide accessibility information in one of two options. (For non-targeting guides, provide NA values (empty cell).) | ||
* Option 1: `chrom` & `genomic_pos`: Chromosome (ex. `chr19`) and genomic position of guide sequence. You will have to provide the path to the bigwig file with matching reference version in `bean run`. | ||
* Option 2: `accessibility_signal`: ATAC-seq signal value of the target loci of each guide. | ||
* For variant library (gRNAs are designed to target specific variants and ignores bystander edits) | ||
* `target`: This column denotes which target variant/element of each gRNA. This is not used in `bean count[-samples]` but required to run `bean run` in later steps. | ||
* `target_group`: If negative/positive control gRNA will be considered in `bean qc` and/or `bean run`, specify as "NegCtrl"/"PosCtrl" in this column. | ||
* `target_pos`: If `--match_target_pos` flag is used, input file needs `target_pos` which specifies 0-based relative position of targeted base within Reporter sequence. | ||
* For tiling library (gRNAs tile coding / noncoding sequences) | ||
* `strand`: Specifies gRNA strand information relative to the reference genome. | ||
* `chrom`: Chromosome of gRNA targeted locus. | ||
* `start_pos`: gRNA starting position in the genome. Required when you provide `strand` column. Should specify the smaller coordinate value among start and end position regardless of gRNA strandedness. | ||
|
||
Also see examples for [variant library](tests/data/test_guide_info.csv) and [tiling library](tests/data/test_guide_info_tiling.csv). | ||
|
||
## sample_list.csv | ||
File should contain following columns with header. | ||
* `R1_filepath`: Path to read 1 `.fastq[.gz]` file | ||
* `R2_filepath`: Path to read 1 `.fastq[.gz]` file | ||
* `sample_id`: ID of sequencing sample | ||
* `replicate`: Replicate # of this sample (Should NOT contain `.`) | ||
* `condition`: Name of the sorting bin (ex. `top`, `bot`), or label of timepoint (ex. `D5`, `D18`) | ||
|
||
For FACS sorting screens: | ||
* `upper_quantile`: FACS sorting upper quantile | ||
* `lower_quantile`: FACS sorting lower quantile | ||
|
||
For proliferation / survival screens: | ||
* `time`: Numeric time following the base editing of each sample. | ||
|
||
|
||
Also see examples for [FACS sorting screen](tests/data/sample_list.csv) and [proliferation / survival screen](tests/data/sample_list_survival.csv). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,150 @@ | ||
# Tiling sorting screen tutorial | ||
Tiling screen that tiles gRNA densely across locus or multiple loci, selected based on FACS signal quantiles. | ||
|
||
<table> | ||
<tr> | ||
<th>Library design</th> | ||
<td>Tiling (gRNAs tile each locus densely) <br> <img src="/crispr-bean/assets/tiling.png" alt="tiling library design" width="300"/> </td> | ||
</tr> | ||
<tr> | ||
<th>Selection</th> | ||
<td>Cells are sorted based on FACS signal quantiles <br> <img src="/crispr-bean//crispr-bean/assets/[email protected]" alt="variant library design" width="300"/></td> | ||
</tr> | ||
</table> | ||
|
||
<br></br> | ||
|
||
## Example workflow | ||
```bash | ||
screen_id=my_sorting_tiling_screen | ||
working_dir=my_workdir | ||
|
||
# 1. Count gRNA & reporter | ||
bean count-samples \ | ||
--input ${working_dir}/sample_list_tiling.csv `# Contains fastq file path; see test file for example.`\ | ||
-b A `# Base A is edited (into G)` \ | ||
-f ${working_dir}/test_guide_info_tiling_chrom.csv `# Contains gRNA metadata; see test file for example.`\ | ||
-o $working_dir `# Output directory` \ | ||
-r `# Quantify reporter edits` \ | ||
-n ${screen_id} `# ID of the screen` \ | ||
--tiling | ||
|
||
# 2. QC samples & guides | ||
bean qc \ | ||
${working_dir}/bean_count_${screen_id}.h5ad `# Input ReporterScreen .h5ad file path` \ | ||
-o ${working_dir}/bean_count_${screen_id}_masked.h5ad `# Output ReporterScreen .h5ad file path` \ | ||
-r ${working_dir}/qc_report_${screen_id} `# Prefix for QC report` \ | ||
|
||
# 3. Filter & translate alleles | ||
bean filter ${working_dir}/bean_count_${screen_id}_masked.h5ad \ | ||
-o ${working_dir}/bean_count_${screen_id}_alleleFiltered \ | ||
--filter-target-basechange `# Filter based on intended base changes. If -b A was provided in bean count, filters for A>G edit. If -b C was provided, filters for C>T edit.`\ | ||
--filter-window --edit-start-pos 0 --edit-end-pos 19 `# Filter based on editing window in spacer position within reporter.`\ | ||
--filter-allele-proportion 0.1 --filter-sample-proportion 0.3 `#Filter based on allele proportion larger than 0.1 in at least 0.3 (30%) of the control samples.` \ | ||
--translate --translate-genes-list ${working_dir}/gene_symbols.txt | ||
|
||
# 4. Quantify variant effect | ||
bean run sorting tiling \ | ||
${working_dir}/bean_count_${screen_id}_alleleFiltered.h5ad \ | ||
-o $working_dir \ | ||
--fit-negctrl \ | ||
--scale-by-acc \ | ||
--accessibility-col accessibility | ||
``` | ||
See more details below. | ||
|
||
## 1. Count gRNA & reporter (:ref:`count_samples`) | ||
``` | ||
screen_id=my_sorting_tiling_screen | ||
working_dir=my_workdir | ||
bean count-samples \ | ||
--input ${working_dir}/sample_list_tiling.csv `# Contains fastq file path; see test file for example.`\ | ||
-b A `# Base A is edited (into G)` \ | ||
-f ${working_dir}/test_guide_info_tiling_chrom.csv `# Contains gRNA metadata; see test file for example.`\ | ||
-o $working_dir `# Output directory` \ | ||
-r `# Quantify reporter edits` \ | ||
-n ${screen_id} `# ID of the screen` \ | ||
--tiling | ||
``` | ||
Make sure you follow the [input file format](../../README#input-file-format) for seamless downstream steps. This will produce `./bean_count_${screen_id}.h5ad`. | ||
|
||
## 2. QC (:ref:`qc`) | ||
Base editing data will include QC about editing efficiency. As QC uses predefined column names and values, beware to follow the [input file guideline](../../README#input-file-format), but you can change the parameters with the full argument list of [`bean qc`](../../README#bean qc-qc-of-reporter-screen-data). (Common factors you may want to tweak is `--ctrl-cond=bulk` and `--lfc-conds=top,bot` if you have different sample condition labels.) | ||
``` | ||
bean qc \ | ||
${working_dir}/bean_count_${screen_id}.h5ad `# Input ReporterScreen .h5ad file path` \ | ||
-o ${working_dir}/bean_count_${screen_id}_masked.h5ad `# Output ReporterScreen .h5ad file path` \ | ||
-r ${working_dir}/qc_report_${screen_id} `# Prefix for QC report` \ | ||
[--tiling] `# Not required if you have passed --tiling in counting step` | ||
``` | ||
|
||
|
||
|
||
If the data does not include reporter editing data, you can provide `--no-editing` flag to omit the editing rate QC. | ||
|
||
## 3. Filter alleles (:ref:`filter`) | ||
As tiling library doesn't have designated per-gRNA target variant, any base edit observed in reporter may be the candidate variant, while having too many variants with very low editing rate significantly decreases the power. Variants are filtered based on multiple criteria in `bean fitler`. | ||
|
||
If the screen targets coding sequence, it's beneficial to translate edits into coding varaints whenever possible for better power. For translation, provide `--translate` and one of the following: | ||
``` | ||
[ --translate-gene-name GENE_SYMBOL OR | ||
--translate-genes-list path_to_gene_names_file.txt OR | ||
--translate-fasta gene_exon.fa, OR | ||
--translate-fastas-csv gene_exon_fas.csv] | ||
``` | ||
where `path_to_gene_names_file.txt` has one gene symbol per line, and gene symbol uses its MANE transcript (hg38) coordinates of exons. In order to use other reference versions or transcript ID, you'll need to feed in fasta file. See detailed formatting of fasta file [here](../../README#translating-alleles). | ||
|
||
Example allele filtering given we're translating based on MANE transcript exons of multiple gene symbols: | ||
|
||
```bash | ||
bean filter ${working_dir}/bean_count_${screen_id}_masked.h5ad \ | ||
-o ${working_dir}/bean_count_${screen_id}_alleleFiltered \ | ||
--filter-target-basechange `# Filter based on intended base changes. If -b A was provided in bean count, filters for A>G edit. If -b C was provided, filters for C>T edit.`\ | ||
--filter-window --edit-start-pos 0 --edit-end-pos 19 `# Filter based on editing window in spacer position within reporter.`\ | ||
--filter-allele-proportion 0.1 --filter-sample-proportion 0.3 `#Filter based on allele proportion larger than 0.1 in at least 0.3 (30%) of the control samples.` \ | ||
--translate --translate-genes-list ${working_dir}/gene_symbols.txt | ||
``` | ||
|
||
Ouptut file `` shows number of alleles per guide and number of guides per variant, where we want high enough values for the latter. See the typical output for dataset with good editing coverage & filtering result [here](../example_filtering_ouptut/). | ||
|
||
## 4. Quantify variant effect (:ref:`run`) | ||
By default, `bean run [sorting,survival] tiling` uses most filtered allele counts table for variant identification and quantification of their effects. **Check [allele filtering output](../example_filtering_ouptut/)** and choose alternative filtered allele counts table if necessary. | ||
|
||
`bean run` can take 3 run options to quantify editing rate: | ||
1. From **reporter + accessibility** | ||
1-1. If your gRNA metadata table (`${working_dir}/test_guide_info.csv` above) included per-gRNA accessibility score, | ||
``` | ||
bean run sorting tiling \ | ||
${working_dir}/bean_count_${screen_id}_alleleFiltered.h5ad \ | ||
-o $working_dir \ | ||
--fit-negctrl \ | ||
--scale-by-acc \ | ||
--accessibility-col accessibility | ||
``` | ||
1-2. If your gRNA metadata table (`${working_dir}/test_guide_info.csv` above) included per-gRNA chromosome & position and you have bigWig file with accessibility signal, | ||
``` | ||
bean run sorting tiling \ | ||
${working_dir}/bean_count_${screen_id}_alleleFiltered.h5ad \ | ||
-o $working_dir \ | ||
--fit-negctrl \ | ||
--scale-by-acc \ | ||
--accessibility-bw accessibility.bw | ||
``` | ||
2. From **reporter** | ||
``` | ||
bean run sorting tiling \ | ||
${working_dir}/bean_count_${screen_id}_alleleFiltered.h5ad \ | ||
-o $working_dir \ | ||
--fit-negctrl | ||
``` | ||
3. No reporter information, assume the same editing efficiency of all gRNAs. | ||
Use this option if your data don't have editing rate information. | ||
``` | ||
bean run sorting tiling \ | ||
${working_dir}/bean_count_${screen_id}_alleleFiltered.h5ad \ | ||
-o $working_dir \ | ||
--fit-negctrl \ | ||
--uniform-edit | ||
``` |
Oops, something went wrong.