diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index b5cd2f07..d864a7ff 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -22,7 +22,7 @@ repos: rev: v2.2.0 hooks: - id: doctoc - args: [--update-only] + args: [--update-only, --title=**Table of Contents**] - repo: https://github.com/astral-sh/ruff-pre-commit # Ruff for linting and formatting python rev: v0.1.5 @@ -44,6 +44,7 @@ repos: rev: v3.0.3 hooks: - id: prettier + exclude: '\.md$' - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.5.0 hooks: diff --git a/internal-instructions.md b/internal-instructions.md index 4306ee49..b7d346b2 100644 --- a/internal-instructions.md +++ b/internal-instructions.md @@ -13,7 +13,6 @@ - ## Running `scpca-nf` as a Data Lab staff member This section provides instructions for running the main workflow, found in [`main.nf`](main.nf). @@ -39,28 +38,28 @@ nextflow run AlexsLemonade/scpca-nf -profile ccdl,batch There are several flags and/or parameters which you may additionally wish to specify, as follows. -+ Nextflow flags: - + `-resume`: Resume workflow from most recent checkpoint - + `-with-tower`: Use `Nextflow Tower` to monitor workflow (requires separate [Nextflow Tower registration](https://tower.nf/)) -+ Workflow parameters: - + `--run_ids list,of,ids`: A custom comma-separated list of ids (run, library, or sample) for this run. - + `--project list,of,project_ids`: A custom comma-separated list of project ids for this run - [The default](config/profile_ccdl.config) run ids are `"SCPCR000001,SCPCS000101"`. - + `--repeat_mapping`: Use this flag to repeat mapping, even if results already exist. - + By default, the workflow checks whether each library has existing `alevin-fry` or `salmon` mapping results, and skips mapping for libraries with existing results. - Using this flag will override that default behavior and repeat mapping even if the given library's results exist. 
- + For more implementation details, please refer to the [external instructions](external-instructions.md#repeating-mapping-steps). - + `--skip_genetic_demux`: Use this flag to skip genetic demultiplexing, which is turned on by default. - + Genetic demultiplexing requires mapping of both bulk and single-cell data, followed by SNP calling and genetic demultiplexing, which can be quite time consuming. - + When genetic demultiplexing is skipped, the workflow will still perform cellhash-based demultiplexing, if available for a given library. - + `--repeat_genetic_demux`: Use this flag to repeat genetic demultiplexing, even if results already exist. - + By default, the workflow checks whether each library has existing genetic demultiplexing results, and skips genetic demultiplexing for libraries with existing results. - Using this flag will override that default behavior and repeat genetic demultiplexing even if the given library's results exist. - + `--perform_celltyping`: Use this flag to perform cell type annotation, which is turned off by default. - + `--repeat_celltyping`: Use this flag to repeat cell type annotation, even if results already exist. - + By default, the workflow checks whether each library has existing cell type annotation results for `SingleR` and/or `CellAssign` (depending on references for that library). - Using this flag will override that default behavior and repeat cell type annotation even if the given library's results exist. - + This flag is _only considered_ if `--perform_celltyping` is also used. +- Nextflow flags: + - `-resume`: Resume workflow from most recent checkpoint + - `-with-tower`: Use `Nextflow Tower` to monitor workflow (requires separate [Nextflow Tower registration](https://tower.nf/)) +- Workflow parameters: + - `--run_ids list,of,ids`: A custom comma-separated list of ids (run, library, or sample) for this run. 
+ - `--project list,of,project_ids`: A custom comma-separated list of project ids for this run + [The default](config/profile_ccdl.config) run ids are `"SCPCR000001,SCPCS000101"`. + - `--repeat_mapping`: Use this flag to repeat mapping, even if results already exist. + - By default, the workflow checks whether each library has existing `alevin-fry` or `salmon` mapping results, and skips mapping for libraries with existing results. + Using this flag will override that default behavior and repeat mapping even if the given library's results exist. + - For more implementation details, please refer to the [external instructions](external-instructions.md#repeating-mapping-steps). + - `--skip_genetic_demux`: Use this flag to skip genetic demultiplexing, which is turned on by default. + - Genetic demultiplexing requires mapping of both bulk and single-cell data, followed by SNP calling and genetic demultiplexing, which can be quite time consuming. + - When genetic demultiplexing is skipped, the workflow will still perform cellhash-based demultiplexing, if available for a given library. + - `--repeat_genetic_demux`: Use this flag to repeat genetic demultiplexing, even if results already exist. + - By default, the workflow checks whether each library has existing genetic demultiplexing results, and skips genetic demultiplexing for libraries with existing results. + Using this flag will override that default behavior and repeat genetic demultiplexing even if the given library's results exist. + - `--perform_celltyping`: Use this flag to perform cell type annotation, which is turned off by default. + - `--repeat_celltyping`: Use this flag to repeat cell type annotation, even if results already exist. + - By default, the workflow checks whether each library has existing cell type annotation results for `SingleR` and/or `CellAssign` (depending on references for that library). 
+ Using this flag will override that default behavior and repeat cell type annotation even if the given library's results exist. + - This flag is _only considered_ if `--perform_celltyping` is also used. Please refer to [`nextflow.config`](nextflow.config) and [other configuration files](config/) for other parameters which can be modified. @@ -80,7 +79,6 @@ Please refer to our [`CONTRIBUTING.md`](CONTRIBUTING.md#stub-workflows) for more ### Running `scpca-nf` for ScPCA Portal release - When running the workflow for a project or group of samples that is ready to be released on ScPCA portal, please use the tag for the latest release: ``` @@ -116,29 +114,33 @@ Make sure to adjust the settings to make the zip file publicly accessible. ## Maintaining references for `scpca-nf` - Inside the `references` folder are files and scripts related to maintaining the reference files available for use with `scpca-nf`. 1. `ref-metadata.tsv`: Each row of this TSV file corresponds to a reference that is available for mapping with `scpca-nf`. -The columns included specify the `organism` (e.g., `Homo_sapiens`), `assembly`(e.g.,`GRCh38`), and `version`(e.g., `104`) of the `fasta` obtained from [Ensembl](https://www.ensembl.org/index.html) that was used to build the reference files. -This file is used as input to the `build-index.nf` workflow, which will create all required index files for `scpca-nf` for the listed organisms in the metadata file, provided the `fasta` and `gtf` files are stored in the proper location on S3. -See [instructions for adding additional organisms](#adding-additional-organisms) for more details. + The columns included specify the `organism` (e.g., `Homo_sapiens`), `assembly`(e.g.,`GRCh38`), and `version`(e.g., `104`) of the `fasta` obtained from [Ensembl](https://www.ensembl.org/index.html) that was used to build the reference files. 
+ This file is used as input to the `build-index.nf` workflow, which will create all required index files for `scpca-nf` for the listed organisms in the metadata file, provided the `fasta` and `gtf` files are stored in the proper location on S3. + See [instructions for adding additional organisms](#adding-additional-organisms) for more details. 2. `scpca-refs.json`: Each entry of this file contains a supported reference for mapping with `scpca-nf` and the name used to refer to that supported reference, e.g., `Homo_sapiens.GRCh38.104`. -For each supported reference, a list of all the reference files that are needed to run `scpca-nf` will be included. -This file is required as input to `scpca-nf`. + For each supported reference, a list of all the reference files that are needed to run `scpca-nf` will be included. + This file is required as input to `scpca-nf`. -3. `celltype-reference-metadata.tsv`: Each row of this TSV file corresponds to a supported cell type reference available for cell type assignment using `add-celltypes.nf`. -For all references, the following columns will be populated: `celltype_ref_name`, `celltype_ref_source` (e.g., `celldex`), supported `celltype_method` (e.g., `SingleR`). -All references obtained from the `PanglaoDB` source also require an `organs` column containing the list of supported `PanglaoDB` organs to include when building the reference. -This should be a comma-separated list of all organs to include. -To find all possible organs, see the `organs` column of `PanglaoDB_markers_2020-03-27.tsv`. -This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for performing cell type annotation from the main workflow. -See [instructions for adding additional cell type references](#adding-additional-cell-type-references) for more details. +3. 
`celltype-reference-metadata.tsv`: Each row of this TSV file corresponds to a supported cell type reference available for cell type annotation. + This file is required as input to the `build-celltype-ref.nf` workflow to create and/or update cell type references. + For all references, the following columns must be populated: + + - `celltype_ref_name` (e.g., `BlueprintEncodeData` or `blood-compartment`) + - `celltype_ref_source` (e.g., `celldex` or `PanglaoDB`) + - `celltype_method` (e.g., `SingleR` or `CellAssign`) + - All references obtained from the `PanglaoDB` source also require an `organs` column containing the list of supported `PanglaoDB` organs to include when building the reference. + This should be a comma-separated list of all organs to include. + To find all possible organs, see the `organs` column of `PanglaoDB_markers_2020-03-27.tsv`. + + See [instructions for adding additional cell type references](#adding-additional-cell-type-references) for more details. 4. `PanglaoDB_markers_2020-03-27.tsv`: This file is used to build the cell type references from `PanglaoDB`. -This file was obtained from clicking the `get tsv file` button on the [PanglaoDB Dataset page](https://panglaodb.se/markers.html?cell_type=%27choose%27) and replacing the date in the filename with a date in ISO8601 format. -This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for the main workflow to use during cell type annotation. + This file was obtained from clicking the `get tsv file` button on the [PanglaoDB Dataset page](https://panglaodb.se/markers.html?cell_type=%27choose%27) and replacing the date in the filename with a date in ISO8601 format. + This file is required as input to the `build-celltype-ref.nf` workflow, which will create all required cell type references for the main workflow to use during cell type annotation. 
### Adding additional organisms @@ -147,7 +149,7 @@ Adding additional organisms is handled, in part, by the `build-index.nf` workflo Follow the below steps to add support for additional references: 1. Download the desired `fasta` and `gtf` files for the organism of choice from `Ensembl`. -Add these to the `S3://scpca-references` bucket with the following directory structure, where the root directory here corresponds to the `organism` and the subdirectory corresponds to the `Ensembl` version: + Add these to the `S3://scpca-references` bucket with the following directory structure, where the root directory here corresponds to the `organism` and the subdirectory corresponds to the `Ensembl` version: ``` homo_sapiens @@ -166,16 +168,36 @@ homo_sapiens ### Adding additional cell type references -Adding additional organisms is handled, in part, by the `build-celltype-ref.nf` workflow. - +Adding additional references to use for cell type annotation is handled by the `build-celltype-ref.nf` workflow. -Follow the below steps to add support for additional cell type references. +Reference files are created and automatically named by the `build-celltype-ref.nf` workflow. We currently support only `celldex` and `PanglaoDB` as reference sources for `SingleR` and `CellAssign` cell type annotation, respectively. -1. Add the `celltype_ref_name`, `celltype_ref_source`, `celltype_method`, and `organs` (if applicable) for the new reference to `celltype-reference-metadata.tsv`. -2. Generate the new cell type references using `nextflow run build-celltype-ref.nf -profile ccdl,batch` from the root directory of this repository. -3. Ensure that the new reference files are public and in the correct location on S3: - - `SingleR` reference files, which are the full reference datasets from the `celldex` package, should be in `s3://scpca-references/celltype/singler_references` named as `celldex-.rds`.
- - `SingleR` trained model files for the given Nextflow parameter `singler_label_name` should be in `s3://scpca-references/celltype/singler_models` named as `_models.rds`. - - `CellAssign` organ-specific reference gene matrices should be in `s3://scpca-references/celltype/cellassign_references` named as `PanglaoDB-.tsv`. +Follow these steps to add support for additional cell type references. + +1. Add the `celltype_ref_name`, `celltype_ref_source`, `celltype_method`, and `organs` (if applicable) for the new reference to [`celltype-reference-metadata.tsv`](references/celltype-reference-metadata.tsv). + + - `` represents the reference dataset name. + For use with `SingleR`, this should be taken directly from a `celldex` dataset. + For `CellAssign`, names are established by the Data Lab as `-compartment` to represent a set of markers for a given tissue/organ. + - `` represents the reference dataset source. Currently only `celldex` and `PanglaoDB` are supported for `SingleR` and `CellAssign`, respectively. + - `` represents which annotation method to use with the specified reference, either `SingleR` or `CellAssign`. + - `organs` indicates which organs to include when creating references with `PanglaoDB` as the `celltype_ref_source`. + This must be a comma-separated list of all organs to include. + +2. Generate the new cell type reference using `nextflow run build-celltype-ref.nf -profile ccdl,batch` from the root directory of this repository. +3. Ensure that the new reference files are public and in the correct location on S3. + +`SingleR` reference files, which are the full reference datasets from the `celldex` package, should be in `s3://scpca-references/celltype/singler_references` and named as `__.rds`. +Corresponding "trained" model files for use in the cell type annotation workflow should be stored in `s3://scpca-references/celltype/singler_models` and named as `___model.rds`. + + - `` is a given `celldex` dataset.
+ - Note that the workflow parameter `singler_label_name` will determine which `celldex` dataset label is used for annotation; by default, we use `"label.ont"` (ontology labels). + - `` is `celldex`. + - `` is the `celldex` version used during reference building, where we use dashes in place of periods (e.g., version `x.y.z` would be represented as `x-y-z`). + +`CellAssign` organ-specific reference gene matrices should be stored in `s3://scpca-references/celltype/cellassign_references` and named as `__.tsv`. + - `` is a given reference name established by the Data Lab. + - `` is `PanglaoDB`. + - `` is the `PanglaoDB` date, which serves as their version, in ISO8601 format.
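
The metadata rules described in this change (three required columns, `celldex` paired with `SingleR`, `PanglaoDB` paired with `CellAssign`, and an `organs` list for `PanglaoDB` rows) could be checked before running `build-celltype-ref.nf`. The following is only a minimal sketch under those stated rules, not part of the workflow: the helper names `check_row`, `check_metadata`, and `dashed_version` are hypothetical, and the sketch assumes the TSV has a header row with the column names given above.

```python
import csv
import io

# Pairings stated in the instructions: celldex -> SingleR, PanglaoDB -> CellAssign.
SUPPORTED_PAIRS = {"celldex": "SingleR", "PanglaoDB": "CellAssign"}
REQUIRED_COLUMNS = ("celltype_ref_name", "celltype_ref_source", "celltype_method")


def check_row(row):
    """Return a list of problems found in one metadata row (hypothetical helper)."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if not (row.get(column) or "").strip():
            problems.append(f"missing {column}")

    source = (row.get("celltype_ref_source") or "").strip()
    method = (row.get("celltype_method") or "").strip()
    if source and source not in SUPPORTED_PAIRS:
        problems.append(f"unsupported source {source}")
    elif source and method and SUPPORTED_PAIRS[source] != method:
        problems.append(f"{source} cannot be used with {method}")

    # PanglaoDB rows additionally need a comma-separated list of organs.
    if source == "PanglaoDB":
        organs = [o.strip() for o in (row.get("organs") or "").split(",")]
        if not any(organs):
            problems.append("PanglaoDB references need an organs list")
    return problems


def check_metadata(tsv_text):
    """Map each reference name (or row number) to its list of problems."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        (row.get("celltype_ref_name") or f"row {i}"): check_row(row)
        for i, row in enumerate(reader, start=1)
    }


def dashed_version(version):
    """Replace periods with dashes, as described for the celldex version."""
    return version.replace(".", "-")
```

Calling `check_metadata` on the contents of `celltype-reference-metadata.tsv` returns an empty list for each row that satisfies the rules, so any non-empty list points at a row to fix before building references.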