Docs release: adding portal-wide metadata #351

Merged: 13 commits, Sep 24, 2024
40 changes: 21 additions & 19 deletions .github/workflows/spell-check.yml
@@ -1,4 +1,3 @@
-
 name: Spell check Markdown files

 # Controls when the action will run.
@@ -14,29 +13,32 @@ jobs:
   # This workflow contains a single job called "spell check"
   spell-check:
     runs-on: ubuntu-latest
-    container:
-      image: rocker/tidyverse:4.3.2

     # Steps represent a sequence of tasks that will be executed as part of the job
     steps:
-      - uses: actions/checkout@v2
-
-      - name: Install packages
-        run: Rscript --vanilla -e "install.packages('spelling', repos = c(CRAN = 'https://cloud.r-project.org'))"
+      - name: Checkout
+        uses: actions/checkout@v4

-      - name: Run spell check
-        id: spell_check_run
+      - name: Remove files that do not need to be spellchecked
         run: |
-          results=$(Rscript --vanilla "scripts/spell-check.R")
-          echo "::set-output name=sp_chk_results::$results"
-          cat spell_check_errors.tsv
-      - name: Archive spelling errors
-        uses: actions/upload-artifact@v2
+          rm ./LICENSE
+
+      - name: Spell check action
+        uses: alexslemonade/spellcheck@v0
+        id: spell
         with:
-          name: spell-check-results
+          dictionary: components/dictionary.txt
+
+      - name: Upload spell check errors
+        uses: actions/upload-artifact@v4
+        id: artifact-upload-step
+        with:
+          name: spell_check_errors
           path: spell_check_errors.tsv

-      # If there are too many spelling errors, this will stop the workflow
-      - name: Check spell check results - fail if too many errors
-        if: ${{ steps.spell_check_run.outputs.sp_chk_results > 0 }}
-        run: exit 1
+      - name: Fail if there are spelling errors
+        if: steps.spell.outputs.error_count > 0
+        run: |
+          echo "There were ${{ steps.spell.outputs.error_count }} errors"
+          column -t spell_check_errors.tsv
+          exit 1
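
The final gating step of the workflow can be illustrated with a minimal Python sketch: count the rows in an error report and fail when there is at least one. The report layout (a header row followed by one row per misspelled word), the column names, and the sample contents are assumptions for illustration only. In the workflow itself the count comes from the spell check action's `error_count` output rather than from parsing the file.

```python
import csv
import io


def count_spelling_errors(tsv_text: str) -> int:
    """Count data rows in a tab-separated error report (header excluded)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if row]  # skip blank lines
    return max(len(rows) - 1, 0)  # the first row is assumed to be a header


def exit_code_for(error_count: int) -> int:
    """Mirror the workflow's gate: any error at all fails the job."""
    return 1 if error_count > 0 else 0


# Hypothetical report contents; the real file layout may differ.
example_report = (
    "word\tfiles\n"
    "alevn\tdocs/faq.md\n"
    "metdata\tdocs/download_files.md\n"
)

error_count = count_spelling_errors(example_report)
print(f"There were {error_count} errors")
print("exit code:", exit_code_for(error_count))
```

Note that the previous workflow's step name mentioned "too many errors" while its condition was already `> 0`; the sketch follows the new wording, where a single error is enough to fail.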
5 changes: 5 additions & 0 deletions docs/CHANGELOG.md
@@ -12,6 +12,11 @@ For more information about `AlexsLemonade/scpca-nf` versions, please see [the re
<!-- PUT THE NEW CHANGELOG ENTRY RIGHT BELOW THIS -->
<!-------------------------------------------------->

## 2024.09.24

* Metadata for all samples from all projects on the Portal can now be downloaded in a single tab-separated values file.
* For more information on what to expect in the metadata file, see the {ref}`metadata section of the Downloadable files page <download_files:metadata>`.

## 2024.08.13

* A new column, `age_timing`, is now present in the sample metadata tables included with each download.
3 changes: 1 addition & 2 deletions docs/download_files.md
@@ -164,10 +164,8 @@ Metadata for all samples on the Portal is available to download separately from
Each project page has an option to download metadata for all of its samples as a single zip file containing the `metadata.tsv` file and a `README.md` file.
Project-specific metadata will contain all columns listed in [the above table](#metadata) and any additional project-specific columns, such as treatment or outcome.

<!--
Additionally, a single TSV file containing the metadata for all samples from all projects on the Portal is available for download.
The Portal-wide metadata will contain all columns listed in [the above table](#metadata).
-->
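
The relationship between the two metadata files described above can be sketched in Python: the project-specific file is expected to contain every Portal-wide column plus some extras. This is only an illustration. The file contents below are hypothetical, only `age_timing` and the `scpca_*` ID column names appear in these docs, and `treatment` and `outcome` stand in for whatever project-specific columns a given project actually provides.

```python
import csv
import io


def read_header(tsv_text: str) -> list:
    """Return the column names from the first row of a TSV string."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader, [])


def project_specific_columns(portal_header: list, project_header: list) -> list:
    """Columns present in the project file but absent from the Portal-wide file."""
    shared = set(portal_header)
    return [name for name in project_header if name not in shared]


# Hypothetical file contents; every value is a placeholder.
portal_tsv = (
    "scpca_sample_id\tscpca_library_id\tage_timing\n"
    "S1\tL1\tunknown\n"
)
project_tsv = (
    "scpca_sample_id\tscpca_library_id\tage_timing\ttreatment\toutcome\n"
    "S1\tL1\tunknown\tplaceholder\tplaceholder\n"
)

extra = project_specific_columns(read_header(portal_tsv), read_header(project_tsv))
print(extra)
```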

## Multiplexed sample libraries

@@ -185,6 +183,7 @@ Because we do not perform demultiplexing to separate cells from multiplexed libr
For more on the specific contents of multiplexed library `SingleCellExperiment` objects, see the {ref}`Additional SingleCellExperiment components for multiplexed libraries <sce_file_contents:additional singlecellexperiment components for multiplexed libraries>` section.

The [metadata file](#metadata) for multiplexed libraries (`single_cell_metadata.tsv`) will have the same format as for individual samples, but each row will represent a particular sample/library pair, meaning that there may be multiple rows for each `scpca_library_id`, one for each `scpca_sample_id` within that library.
In addition, an estimate of the total number of cells for each sample after demultiplexing will be found in the `sample_cell_estimate` column (as opposed to the `sample_cell_count_estimate` column used for non-multiplexed samples).
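
To make the one-row-per-sample/library-pair layout concrete, here is a minimal Python sketch that groups rows by `scpca_library_id`. The column names come from the text above. The IDs and cell estimates are made up, and the sketch assumes every estimate is a plain integer, which may not hold for real files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a multiplexed `single_cell_metadata.tsv`;
# IDs and values are placeholders, not real Portal data.
example_tsv = (
    "scpca_library_id\tscpca_sample_id\tsample_cell_estimate\n"
    "LIB_A\tSAMPLE_1\t1200\n"
    "LIB_A\tSAMPLE_2\t800\n"
    "LIB_B\tSAMPLE_3\t1500\n"
)


def samples_by_library(tsv_text: str) -> dict:
    """Map each library ID to its samples and their per-sample cell estimates."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        library = row["scpca_library_id"]
        sample = row["scpca_sample_id"]
        grouped[library][sample] = int(row["sample_cell_estimate"])
    return dict(grouped)


for library, samples in samples_by_library(example_tsv).items():
    print(library, samples)
```

In this example `LIB_A` appears in two rows, one per sample, which is the situation the paragraph above describes for multiplexed libraries.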


## Merged object downloads
4 changes: 2 additions & 2 deletions docs/faq.md
@@ -3,7 +3,7 @@
## Why did we use Alevin-fry for processing?

We aimed to process all of the data in the portal such that it is comparable to widely used pipelines, namely Cell Ranger from 10x Genomics.
In our own benchmarking, we found that [Alevin-fry](https://github.com/COMBINE-lab/alevin-fry) produces very similar results to [Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count), while allowing faster, more memory efficient processing of single-cell and single-nuclei RNA-sequencing data.
In our own benchmarking, we found that [Alevin-fry](https://github.com/COMBINE-lab/alevin-fry) ([He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3)) produces very similar results to [Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count), while allowing faster, more memory efficient processing of single-cell and single-nuclei RNA-sequencing data.
In the configuration that we are using ("selective alignment" mapping to a human transcriptome that includes introns), Alevin-fry uses approximately 12-16 GB of memory per sample and completes mapping and quantification in less than an hour.
By contrast, Cell Ranger uses up to 25-30 GB of memory per sample and takes anywhere from 2-8 hours to align and quantify one sample.
Quantification of samples processed with both Alevin-fry and Cell Ranger resulted in similar distributions of mapped UMI count per cell and genes detected per cell for both tools.
@@ -17,7 +17,7 @@ We also compared the mean gene expression reported for each gene by both methods
![](https://github.com/AlexsLemonade/alsf-scpca/blob/c0c2442d7242f6e06a5ac6d1e45bd1951780da14/analysis/docs-figures/plots/gene_exp_correlation.png?raw=true)

Recent reports from others support our findings.
[He _et al._ (2021)](https://doi.org/10.1101/2021.06.29.450377) demonstrated that `alevin-fry` can process single-cell and single-nuclei data more quickly and efficiently than other available methods, while also decreasing the false positive rate of gene detection that is commonly seen in methods that utilize transcriptome alignment.
[He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3) demonstrated that `alevin-fry` can process single-cell and single-nuclei data more quickly and efficiently than other available methods, while also decreasing the false positive rate of gene detection that is commonly seen in methods that utilize transcriptome alignment.
[You _et al._ (2021)](https://doi.org/10.1101/2021.06.17.448895) and [Tian _et al._ (2019)](https://doi.org/10.1038/s41592-019-0425-8) have also noted that results from different pre-processing workflows for single-cell RNA-sequencing analysis tend to result in compatible results downstream.

## How do I use the provided RDS files in R?
10 changes: 5 additions & 5 deletions docs/processing_information.md
@@ -4,15 +4,15 @@

### Mapping and quantification using alevin-fry

We used [`salmon alevin`](https://salmon.readthedocs.io/en/latest/alevin.html) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) to generate gene by cell counts matrices for all single-cell and single-nuclei samples.
We used [`salmon`](https://salmon.readthedocs.io/en/latest) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) to generate gene by cell counts matrices for all single-cell and single-nuclei samples.
In brief, we utilized [selective alignment](#selective-alignment) to the [`splici` index](#reference-transcriptome-index) for all single-cell and single-nuclei samples.

#### Reference transcriptome index

For all samples, we aligned FASTQ files to a reference transcriptome index referred to as the `splici` index.
The [`splici` index](https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/) is built using transcripts from both spliced cDNA and intronic regions.
Inclusion of intronic regions in the index used for alignment allowed us to capture both reads from mature, spliced cDNA and nascent, unspliced cDNA.
Alignment of RNA-seq data to an index containing intronic regions has been shown to reduce spuriously detected genes ([He _et al._ 2021](https://doi.org/10.1101/2021.06.29.450377), [Kaminow _et al._ 2021](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full#sec-5)).
Alignment of RNA-seq data to an index containing intronic regions has been shown to reduce spuriously detected genes ([He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3), [Kaminow _et al._ 2021](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full#sec-5)).
In our hands, we have found that use of the `splici` index led to a more comparable distribution of unique genes found per cell to Cell Ranger than did use of an index obtained from spliced cDNA transcripts only.

#### Selective alignment
@@ -21,7 +21,7 @@ We mapped reads to the transcriptome index using `salmon` with the default "sele
Briefly, selective alignment uses a mapping score validated approach to identify maximal exact matches between reads and the provided index.
For all samples, we used selective alignment to the `splici` index.

A more detailed description of the mapping strategy invoked by `salmon` in conjunction with `alevin-fry` can be found in [Srivastava _et al._ (2020)](https://doi.org/10.1186/s13059-020-02151-8).
More detailed descriptions of the mapping strategy invoked by `salmon` in conjunction with `alevin-fry` can be found in [Srivastava _et al._ (2020)](https://doi.org/10.1186/s13059-020-02151-8) and [He _et al._ (2022)](https://doi.org/10.1038/s41592-022-01408-3).

#### Alevin-fry parameters

@@ -99,7 +99,7 @@ In these cases, the cell type annotations obtained from the submitter will be pr

## ADT quantification from CITE-seq experiments

CITE-seq libraries with reads from antibody-derived tags (ADTs) were also quantified using [`salmon alevin`](https://salmon.readthedocs.io/en/latest/alevin.html) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.
CITE-seq libraries with reads from antibody-derived tags (ADTs) were also quantified using [`salmon`](https://salmon.readthedocs.io/en/latest) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.

Reference indices were constructed from the submitter-provided list of antibody barcode sequences corresponding to each library using the `--features` flag of `salmon index`.
Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with [selective alignment](#selective-alignment) and subsequent [quantification via alevin-fry](#alevin-fry-parameters).
@@ -130,7 +130,7 @@ Multiplexed libraries, or libraries with cells or nuclei from more than one biol

### Hashtag oligonucleotide (HTO) quantification

HTO reads were also quantified using [`salmon alevin`](https://salmon.readthedocs.io/en/latest/alevin.html) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.
HTO reads were also quantified using [`salmon`](https://salmon.readthedocs.io/en/latest) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values.
Reference indices were constructed from the submitter-provided list of HTO sequences corresponding to each library using the `--features` flag of `salmon index`.
Mapping to these indices followed the same procedures as for RNA-seq data, including mapping with [selective alignment](#selective-alignment) and subsequent [quantification via alevin-fry](#alevin-fry-parameters).

27 changes: 0 additions & 27 deletions scripts/spell-check.R

This file was deleted.