Add celltyping to external instructions #502

Merged
merged 47 commits into from
Nov 15, 2023
4d8c4b3
submitter docs and started annotation docs
sjspielman Oct 11, 2023
2407746
add example celltype metadata file
sjspielman Oct 11, 2023
1f2a543
full paths and add rel link to submitter file prep instructions
sjspielman Oct 11, 2023
c414d94
section on repeating
sjspielman Oct 11, 2023
3228b4f
add some links
sjspielman Oct 11, 2023
390d6cc
doctoc
sjspielman Oct 11, 2023
10a510a
caps for table descriptions
sjspielman Oct 11, 2023
6298807
A couple cleanups
sjspielman Oct 11, 2023
3b2aff9
spelling
sjspielman Oct 11, 2023
d7d7ea1
Merge branch 'development' into sjspielman/499-external-celltype-docs
sjspielman Oct 27, 2023
69de0f2
Merge branch 'development' into sjspielman/499-external-celltype-docs
sjspielman Nov 7, 2023
638c761
Merge branch 'development' into sjspielman/499-external-celltype-docs
sjspielman Nov 13, 2023
92a9927
Update external instructions with current naming scheme, with a littl…
sjspielman Nov 13, 2023
6c9f182
update example file
sjspielman Nov 13, 2023
2467ef7
spelling
sjspielman Nov 13, 2023
fb42f6f
reorg and flesh some reference bullets out
sjspielman Nov 13, 2023
2cf5384
doctoc
sjspielman Nov 13, 2023
1072ec9
delete duplicate text that was rewritten here but original remained. …
sjspielman Nov 13, 2023
ffdc7b8
Update some comments with more contextual information about overall w…
sjspielman Nov 14, 2023
057df40
merge in development and fix conflict, and precommit hook tweaked ext…
sjspielman Nov 14, 2023
920e6ae
bullet text and a little rephrasing
sjspielman Nov 14, 2023
2326a58
third bullet
sjspielman Nov 14, 2023
24df4d9
Update a bunch of relative links
sjspielman Nov 14, 2023
84802a6
fix weird underscore
sjspielman Nov 14, 2023
856ba33
submitter file is no longer required; remove it
sjspielman Nov 14, 2023
7361ab9
catch some small fixes from review, and move section to be above spec…
sjspielman Nov 15, 2023
f6ec451
Updates and rearrangements based on review comments
sjspielman Nov 15, 2023
3acec7f
need internet for default cell typing files, and reframe the cell typ…
sjspielman Nov 15, 2023
209b6e5
rewording and remove some essentially duplicated text
sjspielman Nov 15, 2023
5e8aafd
Changed my mind, better indeed to have references first even in this …
sjspielman Nov 15, 2023
fd26234
Apply suggestions from code review
sjspielman Nov 15, 2023
95e5d17
Clean up tables, add toc title, move output section up, and fix some …
sjspielman Nov 15, 2023
891a1e5
now actually add toc title post styling
sjspielman Nov 15, 2023
55627ba
Add button hack for celltype file, and update other buttons to use fi…
sjspielman Nov 15, 2023
38dba37
Add example submitter cell types file and link with button in externa…
sjspielman Nov 15, 2023
94ac08a
remove s3 reference paths
sjspielman Nov 15, 2023
83bec1a
move repeat mapping into a new section for additional settings, and d…
sjspielman Nov 15, 2023
49e210a
remove colon after a verb where it shouldnt be, and leave TODOs to ci…
sjspielman Nov 15, 2023
ffa8edf
Merge branch 'development' into sjspielman/499-external-celltype-docs
sjspielman Nov 15, 2023
447f0c9
Merge branch 'development' into sjspielman/499-external-celltype-docs
sjspielman Nov 15, 2023
2df686b
the triumphant return of the toc title
sjspielman Nov 15, 2023
e8b857d
update README bullets with all example files
sjspielman Nov 15, 2023
864a1df
rephrase and make it a table for cleaner spacing
sjspielman Nov 15, 2023
4246e78
we did not mean to have this, as discussed in #469
sjspielman Nov 15, 2023
df95988
Merge branch 'development' into sjspielman/499-external-celltype-docs
sjspielman Nov 15, 2023
8c9c99b
We do not provide an example submitter file. Also harmonize some tabl…
sjspielman Nov 15, 2023
f946280
Merge branch 'sjspielman/499-external-celltype-docs' of github.com:Al…
sjspielman Nov 15, 2023
1 change: 1 addition & 0 deletions examples/README.md
@@ -3,6 +3,7 @@
## Example files

This directory contains an example [metadata file](../external-data-instructions.md#prepare-the-metadata-file) and [configuration file](../external-data-instructions.md#configuration-files) for the `scpca-nf` workflow.
There is also an example [cell type annotation metadata file](../external-data-instructions.md#performing-cell-type-annotation).
These files should be used as examples of format and content, but note that the values in them may not be applicable or sufficient for running `scpca-nf` directly on your system.

## Testing your setup with example data
4 changes: 4 additions & 0 deletions examples/example_project_celltype_metadata.tsv
@@ -0,0 +1,4 @@
scpca_project_id singler_ref_name singler_ref_file cellassign_ref_name cellassign_ref_file
project01 BlueprintEncodeData BlueprintEncodeData_model.rds blood PanglaoDB-blood.tsv
project02 HumanPrimaryCellAtlasData HumanPrimaryCellAtlasData_model.rds NA NA
project03 NA NA blood PanglaoDB-blood.tsv
83 changes: 78 additions & 5 deletions external-instructions.md
@@ -19,6 +19,10 @@
- [Libraries with additional feature data (ADT or cellhash)](#libraries-with-additional-feature-data-adt-or-cellhash)
- [Multiplexed (cellhash) libraries](#multiplexed-cellhash-libraries)
- [Spatial transcriptomics libraries](#spatial-transcriptomics-libraries)
- [Cell type annotation](#cell-type-annotation)
- [Providing existing cell type labels](#providing-existing-cell-type-labels)
- [Performing cell type annotation](#performing-cell-type-annotation)
- [Repeating cell type annotation](#repeating-cell-type-annotation)
- [Output files](#output-files)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -126,16 +130,18 @@ To run the workflow, you will need to create a tab separated values (TSV) metada
| `assay_ontology_term_id` | [Experimental Factor Ontology](https://www.ebi.ac.uk/ols/ontologies/efo) term id associated with the `tech_version` |
| `seq_unit` | Sequencing unit (one of: `cell`, `nucleus`, `bulk`, or `spot`)|
| `sample_reference`| The name of the reference to use for mapping, available references include: `Homo_sapiens.GRCh38.104` and `Mus_musculus.GRCm39.104` |
| `files_directory` | path/uri to directory containing fastq files (unique per run) |
| `files_directory` | The full path/uri to directory containing fastq files (unique per run) |

The following columns may be necessary for running other data modalities (CITE-seq, spatial transcriptomics) or are optional and can be included in the metadata file if desired:
The following columns may be necessary for running other data modalities (CITE-seq, spatial transcriptomics) or including existing cell type labels.

| column_id | contents |
|-----------------|----------------------------------------------------------------|
| `feature_barcode_file` | path/uri to file containing the feature barcode sequences (only required for ADT and cellhash samples); for samples with ADT tags, this file can optionally indicate whether antibodies are targets or controls. |
| `feature_barcode_file` | The full path/uri to TSV file containing the feature barcode sequences (only required for ADT and cellhash samples); for samples with ADT tags, this file can optionally indicate whether antibodies are targets or controls |
| `feature_barcode_geom` | A salmon `--read-geometry` layout string. <br> See https://github.com/COMBINE-lab/salmon/releases/tag/v1.4.0 for details (only required for ADT and cellhash samples) |
| `slide_section` | The slide section for spatial transcriptomics samples (only required for spatial transcriptomics) |
| `slide_serial_number`| The slide serial number for spatial transcriptomics samples (only required for spatial transcriptomics) |
| `submitter_cell_types_file` | The full path/uri to TSV file containing cell labels if you have cell type annotation results to include. See [instructions below](#providing-existing-cell-type-labels) for more information about preparing this file |


We have provided an example run metadata file for reference.

@@ -408,7 +414,7 @@ This file will contain one row for each library-sample pair (i.e. a library cont

| column_id | contents |
|-----------------|----------------------------------------------------------------|
| `scpca_library_id`| Multiplexed library ID matching values in the metadata file. |
| `scpca_library_id`| Multiplexed library ID matching values in the metadata file |
| `scpca_sample_id` | Sample ID for a sample contained in the listed multiplexed library |
| `barcode_id` | The barcode ID used for the sample within the library, as defined in `feature_barcode_file` |

@@ -428,7 +434,70 @@ As an example, the Dockerfile that we used to build Space Ranger can be found [h
After building the docker image, you will need to push it to a [private docker registry](https://www.docker.com/blog/how-to-use-your-own-registry/) and set `params.SPACERANGER_CONTAINER` to the registry location and image id in the `user_template.config` file.
*Note: The workflow is currently set up to work only with spatial transcriptomic libraries produced from the [Visium Spatial Gene Expression protocol](https://www.10xgenomics.com/products/spatial-gene-expression) and has not been tested using output from other spatial transcriptomics methods.*

## Cell type annotation

### Providing existing cell type labels

If you have already performed cell type annotation and wish to include these labels in the final workflow results, you can include the column `submitter_cell_types_file` in your run metadata file ([see example here](examples/example_run_metadata.tsv)).
This column should be filled with the path or uri to a TSV file with existing cell type labels.

This file _must_ include the following columns:

| column_id | contents |
|-----------------|----------------------------------------------------------------|
| `scpca_library_id`| Library ID matching values in the metadata file |
| `cell_barcode` | The cell id with the given annotation label |
| `cell_type_assignment` | The annotation label for that cell |

Optionally, you can also include a column `cell_type_ontology` with ontology labels corresponding to the given annotation label.
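
As a rough illustration of the format described above, the following sketch checks that a TSV file's header contains the three required columns. The helper name `check_submitter_columns`, the temporary file, and the example library ID, barcode, and label are all hypothetical and are not part of `scpca-nf`; the workflow may apply its own checks.

```sh
# Sketch only: checks the header of a submitter cell types TSV.
# `check_submitter_columns` is a hypothetical helper, not part of scpca-nf.
check_submitter_columns() {
  awk -F '\t' '
    NR == 1 {
      sub(/\r$/, "")                       # tolerate Windows line endings
      for (i = 1; i <= NF; i++) seen[$i] = 1
      n = split("scpca_library_id cell_barcode cell_type_assignment", required, " ")
      for (j = 1; j <= n; j++) {
        if (!(required[j] in seen)) {
          print "missing required column: " required[j] > "/dev/stderr"
          bad = 1
        }
      }
      exit (bad ? 1 : 0)
    }
    END { if (NR == 0) exit 1 }            # an empty file has no header
  ' "$1"
}

# Hypothetical example file with made-up values
labels="$(mktemp)"
printf 'scpca_library_id\tcell_barcode\tcell_type_assignment\n' > "$labels"
printf 'library01\tAAACCTGAGAAACCAT\tT cell\n' >> "$labels"
check_submitter_columns "$labels" && echo "required columns present"
rm -f "$labels"
```

The optional `cell_type_ontology` column is not checked here, since extra columns do not affect the result.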

### Performing cell type annotation

`scpca-nf` can perform cell type annotation using two complementary methods: the reference-based method [`SingleR`](https://bioconductor.org/packages/release/bioc/html/SingleR.html) and the marker-gene based method [`CellAssign`](https://github.com/Irrationone/cellassign).

You can turn on cell type annotation by using the `--perform_celltyping` flag.
You will also need to provide an additional workflow parameter, `celltype_project_metafile`, containing the path/uri to a TSV file that specifies, at the project level, which references to use for cell typing.
This parameter can be specified at the command line (shown below) or defined in your configuration file.

For example, to run from the command line:

```sh
nextflow run AlexsLemonade/scpca-nf \
--perform_celltyping \
--celltype_project_metafile examples/example_project_celltype_metadata.tsv
```

References for use in cell type annotation have been pre-compiled as follows:

+ `SingleR` annotation uses references from the [`celldex` package](https://bioconductor.org/packages/release/data/experiment/html/celldex.html).
Available reference options include `BlueprintEncodeData`, `DatabaseImmuneCellExpressionData`, `HumanPrimaryCellAtlasData`, and `MonacoImmuneData`.
+ Please consult the [`celldex` documentation](https://bioconductor.org/packages/release/data/experiment/vignettes/celldex/inst/doc/userguide.html) to determine which of these references, if any, is most suitable for your dataset.
+ `CellAssign` annotation uses marker gene set references from [PanglaoDB](https://panglaodb.se/).
Available organ-based references include `blood`, `brain`, and `muscle`.
+ TODO do we want to say this? Please reach out to the Data Lab if you require a different set of marker genes for cell type annotation besides those in organs listed.

The file provided to `celltype_project_metafile` should contain these five columns with the following information (see the example file in [`examples/example_project_celltype_metadata.tsv`](examples/example_project_celltype_metadata.tsv)):

| column_id | contents |
|-----------------|----------------------------------------------------------------|
| `scpca_project_id`| Project ID matching values in the metadata file |
| `singler_ref_name` | Reference name for `SingleR` annotation. Must be one of `BlueprintEncodeData`, `DatabaseImmuneCellExpressionData`, `HumanPrimaryCellAtlasData`, or `MonacoImmuneData`. Use `NA` to skip `SingleR` annotation |
| `singler_ref_file` | Path to internal `SingleR` reference file. Must be formatted as `<singler_ref_name>_model.rds`, e.g. `BlueprintEncodeData_model.rds`. Use `NA` to skip `SingleR` annotation |
| `cellassign_ref_name` | Reference name for `CellAssign` annotation. Must be one of `blood`, `brain`, or `muscle`. Use `NA` to skip `CellAssign` annotation |
| `cellassign_ref_file` | Path to internal `CellAssign` reference file. Must be formatted as `PanglaoDB-<cellassign_ref_name>.tsv`, e.g. `PanglaoDB-blood.tsv`. Use `NA` to skip `CellAssign` annotation |
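
The naming rules in this table can be checked before launching the workflow. The sketch below is one possible approach, not something `scpca-nf` provides: the helper `check_celltype_metafile` is hypothetical, it assumes the `.tsv` extension shown in the example `PanglaoDB-blood.tsv`, and it assumes that a reference name of `NA` should be paired with a reference file of `NA`.

```sh
# Sketch only: checks that reference file names follow the documented patterns.
# `check_celltype_metafile` is a hypothetical helper, not part of scpca-nf.
check_celltype_metafile() {
  awk -F '\t' '
    { sub(/\r$/, "") }                     # tolerate Windows line endings
    NR == 1 {
      # map column names to positions so column order does not matter
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split("scpca_project_id singler_ref_name singler_ref_file cellassign_ref_name cellassign_ref_file", required, " ")
      for (j = 1; j <= n; j++) {
        if (!(required[j] in col)) {
          print "missing column: " required[j] > "/dev/stderr"
          bad = 1
        }
      }
      if (bad) exit 1
      next
    }
    {
      sname = $(col["singler_ref_name"]);    sfile = $(col["singler_ref_file"])
      cname = $(col["cellassign_ref_name"]); cfile = $(col["cellassign_ref_file"])
      # assumption: NA for the name implies NA for the file
      want_s = (sname == "NA") ? "NA" : (sname "_model.rds")
      want_c = (cname == "NA") ? "NA" : ("PanglaoDB-" cname ".tsv")
      if (sfile != want_s) {
        printf "line %d: expected singler_ref_file %s, found %s\n", NR, want_s, sfile > "/dev/stderr"
        bad = 1
      }
      if (cfile != want_c) {
        printf "line %d: expected cellassign_ref_file %s, found %s\n", NR, want_c, cfile > "/dev/stderr"
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Rows copied from examples/example_project_celltype_metadata.tsv
meta="$(mktemp)"
printf 'scpca_project_id\tsingler_ref_name\tsingler_ref_file\tcellassign_ref_name\tcellassign_ref_file\n' > "$meta"
printf 'project01\tBlueprintEncodeData\tBlueprintEncodeData_model.rds\tblood\tPanglaoDB-blood.tsv\n' >> "$meta"
printf 'project03\tNA\tNA\tblood\tPanglaoDB-blood.tsv\n' >> "$meta"
check_celltype_metafile "$meta" && echo "reference names and files are consistent"
rm -f "$meta"
```

This sketch does not verify that the reference names are among the supported options listed above; it only checks that each file name matches its reference name.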

#### Repeating cell type annotation

When cell typing is turned on with `--perform_celltyping`, `scpca-nf` will, by default, skip cell type annotation for any libraries whose cell type annotation results exist in the `checkpoints` folder of the output directory.
If the cell type annotation reference versions are unchanged, this will save substantial processing time and cost.
However, you may wish to repeat the cell typing process if there have been other changes to the data.

To force repeating the cell type annotation process, use the `--repeat_celltyping` flag along with the `--perform_celltyping` flag at the command line:

```sh
nextflow run AlexsLemonade/scpca-nf \
--perform_celltyping \
--repeat_celltyping
```

## Output files
I still think "Output files" section should maybe be moved up? People probably want to know about this even with a basic run. Related, I kind of want to demote "Repeating mapping steps" from a top level header... but I don't know if there is a good place to nest it: maybe we can have a section of "Workflow options" and throw a few (or at least one) other things in there?

@sjspielman sjspielman Nov 15, 2023

> maybe we can have a section of "Workflow options" and throw a few (or at least one) other things in there?

I like this concept, but nothing is really jumping out at me as also* belonging in that section. Though, one day I'm sure more things will arise! The only "maybe" is to also move the section on repeating cell typing into it too?


@@ -468,6 +537,8 @@ results

If bulk libraries were processed, a `bulk_quant.tsv` and `bulk_metadata.tsv` summarizing the counts data and metadata across all libraries will also be present in the `results` directory.

If you performed cell type annotation, an additional QC report specific to cell typing results called `library_id_celltype-report.html` will also be present in the `results` directory.

The `checkpoints` folder will contain intermediate files that are produced by individual steps of the workflow, including mapping with `salmon`.
The contents of this folder are used to allow restarting the workflow from internal checkpoints (in particular so the initial read mapping does not need to be repeated, see [repeating mapping steps](#repeating-mapping-steps)), and may contain log files and other outputs useful for troubleshooting or alternative analysis.

@@ -502,5 +573,7 @@ nextflow run AlexsLemonade/scpca-nf \
--publish_fry_outs
```

If genetic demultiplexing was performed, there will also be a folder called `vireo` with the output from running [vireo](https://vireosnp.readthedocs.io/en/latest/index.html) using genotypes identified from the bulk RNA-seq.
If genetic demultiplexing was performed, there will also be a checkpoints folder called `vireo` with the output from running [vireo](https://vireosnp.readthedocs.io/en/latest/index.html) using genotypes identified from the bulk RNA-seq.
Note that we do not output the genotype calls themselves for each sample or cell, as these may contain identifying information.

If cell type annotation was performed, there will also be a checkpoints folder called `celltype` with the output from running `SingleR` and `CellAssign`.