Add reference-based gene symbol conversion #14
Merged
16 commits
e5e5b43 Add gene reference notebook (jashapiro)
a8a8b88 update renv (jashapiro)
d4126f7 Update renv itself (jashapiro)
3d5f9a5 Add creation of a reference table in the setup folder (jashapiro)
3e8fda9 Update gene symbol conversion (jashapiro)
febc7f1 Add conversion for full SCE objects (jashapiro)
0e34254 pre-commit update (jashapiro)
dc10c02 move setup code to data-raw (jashapiro)
1a5a7d5 Missed an instance when renaming columns (jashapiro)
5de82df separate gene reference building and evaluation (jashapiro)
9416b78 Add docs for data (jashapiro)
c8d17ab Documentation updates (jashapiro)
fe5b1df add readme (jashapiro)
49f719c Apply suggestions from code review (jashapiro)
2ee3403 readme text updates (jashapiro)
e6ae30a add intro to notebook (jashapiro)
@@ -5,3 +5,4 @@
^\.github$
^\.lintr$
^\.pre-commit-config.yaml$
^data-raw$
@@ -41,3 +41,5 @@ biocViews:
    Transcriptomics,
    SingleCell,
    Clustering
Depends:
    R (>= 4.0)
@@ -0,0 +1,23 @@
# nolint start

#' Conversion table for Ensembl gene ids and gene symbols
#'
#'
#' This table includes the mapping for gene ids to gene symbols from different
#' reference genome gene annotation lists.
#' Included are the original gene symbols and the modified gene symbols that
#' are created when running the `make.unique()` function, as is done when
#' importing data using Seurat.
#'
#' @format
#' A data frame with 7 columns:
#' \describe{
#'   \item{gene_ids}{Ensembl gene ids}
#'   \item{gene_symbol_scpca}{The gene symbol used in the ScPCA reference}
#'   \item{gene_symbol_scpca_unique}{The gene symbol from the ScPCA reference, after `make.unique()`}
#'   \item{gene_symbol_10x2020}{The gene symbol used in the 2020 10x human genome reference}
#'   \item{gene_symbol_10x2020_unique}{The gene symbol from the 2020 10x human genome reference, after `make.unique()`}
#'   \item{gene_symbol_10x2024}{The gene symbol used in the 2024 10x human genome reference}
#'   \item{gene_symbol_10x2024_unique}{The gene symbol from the 2024 10x human genome reference, after `make.unique()`}
#' }
"scpca_gene_reference"
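As a rough illustration of how a table with these columns could be used, the sketch below maps Ensembl ids to symbols with base R. The `toy_reference` rows and the `lookup_symbols()` helper are hypothetical and exist only for this example; they are not the packaged `scpca_gene_reference` data or a function from this pull request.

```r
# Toy stand-in for the packaged table: the ids and symbols are illustrative
# values, not rows copied from scpca_gene_reference.
toy_reference <- data.frame(
  gene_ids = c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003"),
  gene_symbol_scpca = c("GENEA", "GENEB", "GENEB"),
  gene_symbol_scpca_unique = c("GENEA", "GENEB", "GENEB.1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up symbols in a chosen column by Ensembl id,
# keeping the input id when the reference has no entry for it.
lookup_symbols <- function(ids, reference, column = "gene_symbol_scpca") {
  symbols <- reference[[column]][match(ids, reference$gene_ids)]
  ifelse(is.na(symbols), ids, symbols)
}

converted <- lookup_symbols(
  c("ENSG00000000003", "ENSG00000000009"),
  toy_reference,
  column = "gene_symbol_scpca_unique"
)
print(converted)
```

Choosing a `_unique` column keeps the result free of duplicate names, which matters when the symbols are used as row names.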
@@ -0,0 +1,11 @@
This directory contains scripts to download and preprocess data used within the rOpenScPCA package.
There are also notebooks that explore some of the prepared datasets.

## Building Gene References

- The `build_gene_references.R` script creates a table of Ensembl id to gene symbol references.
  The initial table of gene ids and gene symbols is extracted from an example ScPCA-formatted SCE object (`rOpenScPCA/tests/testthat/data/scpca_sce.rds`).
  This is combined with the reference information extracted from example 10x Genomics datasets.
  The full table is saved in `data/scpca_gene_reference.rda` (overwriting any previous file).

- The `explore_gene_references.Rmd` notebook explores the resulting gene reference table to see where conversions differ among the references.
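
The kind of comparison the notebook describes can be sketched with base R on a small made-up table. The rows below are hypothetical and only mimic the column layout of the reference table; the notebook's actual analysis may differ.

```r
# Hypothetical rows that mimic two of the reference table's symbol columns.
toy_reference <- data.frame(
  gene_ids = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  gene_symbol_scpca = c("SYM1", "SYM2", "ENSG_C", "SYM4"),
  gene_symbol_10x2020 = c("SYM1", "SYM2-OLD", "SYM3", NA),
  stringsAsFactors = FALSE
)

# Rows where both references include the gene but the symbols disagree.
both_present <- !is.na(toy_reference$gene_symbol_scpca) &
  !is.na(toy_reference$gene_symbol_10x2020)
differs <- both_present &
  toy_reference$gene_symbol_scpca != toy_reference$gene_symbol_10x2020
mismatches <- toy_reference[differs, ]

# Genes absent from one reference appear as NA in that reference's column.
missing_10x2020 <- toy_reference$gene_ids[
  is.na(toy_reference$gene_symbol_10x2020)
]

print(mismatches)
print(missing_10x2020)
```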
@@ -0,0 +1,76 @@
#!/usr/bin/env Rscript

# This script downloads and stores reference gene lists from 10x Genomics
# datasets, creating a final table of Ensembl ids and corresponding symbols
# including the symbols that would be created on read by Seurat by application
# of the make.unique() function.
suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(dplyr)
})

## Download the reference data -------------------------------------------------


# Read in the test data ScPCA SCE object and extract the row data:
genes_scpca <- readRDS(here::here("tests", "testthat", "data", "scpca_sce.rds")) |>
  rowData() |>
  as.data.frame() |>
  # Use Ensembl ID if gene symbol is missing, then make unique
  mutate(
    gene_symbol_scpca = ifelse(is.na(gene_symbol), gene_ids, gene_symbol),
    gene_symbol_scpca_unique = make.unique(gene_symbol_scpca)
  ) |>
  select(gene_ids, gene_symbol_scpca, gene_symbol_scpca_unique)


# Download and read in a 2020 10x reference dataset and extract the gene symbols.
# Note that the 2020 Cell Ranger reference does not use Ensembl gene IDs for
# missing symbols, but the 2024 reference does.
url_10x2020 <- "https://cf.10xgenomics.com/samples/cell-exp/7.0.1/SC3pv3_GEX_Human_PBMC/SC3pv3_GEX_Human_PBMC_filtered_feature_bc_matrix.h5" # nolint
temp_10x2020 <- tempfile(fileext = ".h5")
download.file(url_10x2020, temp_10x2020, mode = "wb")
on.exit(unlink(temp_10x2020), add = TRUE) # delete when done

genes_10x2020 <- DropletUtils::read10xCounts(temp_10x2020) |>
  rowData() |>
  as.data.frame() |>
  filter(Type == "Gene Expression") |>
  rename(
    gene_ids = ID,
    gene_symbol_10x2020 = Symbol
  ) |>
  # add unique column
  mutate(gene_symbol_10x2020_unique = make.unique(gene_symbol_10x2020)) |>
  select(gene_ids, gene_symbol_10x2020, gene_symbol_10x2020_unique)

# Download and read in a 2024 10x reference dataset and extract the gene symbols.
url_10x2024 <- "https://cf.10xgenomics.com/samples/cell-exp/9.0.0/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5" # nolint
temp_10x2024 <- tempfile(fileext = ".h5")
download.file(url_10x2024, temp_10x2024, mode = "wb")
on.exit(unlink(temp_10x2024), add = TRUE) # delete when done


genes_10x2024 <- DropletUtils::read10xCounts(temp_10x2024) |>
  rowData() |>
  as.data.frame() |>
  filter(Type == "Gene Expression") |>
  rename(
    gene_ids = ID,
    gene_symbol_10x2024 = Symbol
  ) |>
  # add unique column
  mutate(gene_symbol_10x2024_unique = make.unique(gene_symbol_10x2024)) |>
  select(gene_ids, gene_symbol_10x2024, gene_symbol_10x2024_unique)

# Join the gene lists ----------------------------------------------------------
scpca_gene_reference <- genes_scpca |>
  full_join(genes_10x2020, by = "gene_ids") |>
  full_join(genes_10x2024, by = "gene_ids")

## Add the table to package data -----------------------------------------------
usethis::use_data(
  scpca_gene_reference,
  version = 3,
  overwrite = TRUE,
  compress = "xz"
)
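
The two core steps of the script, filling missing symbols before `make.unique()` and joining the per-reference tables on `gene_ids`, can be tried on made-up data. This is a minimal sketch under stated assumptions: the ids and symbols are invented, and base R `merge(..., all = TRUE)` is used here as a stand-in for `dplyr::full_join()` so the example runs without extra packages.

```r
# Invented ids and symbols: one symbol is missing and one is duplicated.
ids <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")
symbols <- c("SYM1", NA, "SYM2", "SYM2")

# Fall back to the Ensembl id when the symbol is missing, then deduplicate.
filled <- ifelse(is.na(symbols), ids, symbols)
unique_symbols <- make.unique(filled)
print(unique_symbols)

# Base R stand-in for dplyr::full_join(): all = TRUE keeps ids that appear
# in either table and fills the other table's columns with NA.
ref_a <- data.frame(
  gene_ids = c("ENSG_A", "ENSG_B"),
  symbol_a = c("SYM1", "SYM2"),
  stringsAsFactors = FALSE
)
ref_b <- data.frame(
  gene_ids = c("ENSG_B", "ENSG_C"),
  symbol_b = c("SYM2", "SYM3"),
  stringsAsFactors = FALSE
)
joined <- merge(ref_a, ref_b, by = "gene_ids", all = TRUE)
print(joined)
```

Because the join keeps every id, a gene that exists in only one reference still gets a row, with `NA` in the columns from the other references.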
Or an SCE, it seems