Scripts and data used for the paper "Identifying Damage-Related Features in scRNA-seq Data"
Quality control (QC) is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis pipelines to ensure data reliability. A critical QC step involves identifying damaged cells using quality metrics such as the percentage of mitochondrial genes or the total number of reads. However, automatically determining the threshold of these metrics for filtering damaged cells can be challenging. Moreover, relying on these metrics alone may result in the removal of biologically meaningful cells. This study aims to find alternative biomarkers to improve the identification of damaged cells, focusing on gene lists other than mitochondrial genes. We hypothesized that genes localized within other organelles, particularly the nucleus, would exhibit enrichment patterns similar to those of mitochondrial genes. To test this hypothesis, we used a public scRNA-seq dataset in which damaged cells were labelled via optical inspection. As potential descriptors, we considered the percentage of genes from various lists, in particular lists of transcripts detected within the nucleus. We built a binary logistic regression model to differentiate damaged cells from good cells and evaluated its performance. Our results showed that the traditional criteria, such as the percentage of mitochondrial genes, the number of genes, and total counts, successfully identified damaged cells but tended to overestimate damage. Our findings suggest that although standard features are effective, their poor precision can be problematic. Incorporating other gene lists, particularly those related to nuclear transcripts, into classification models can improve the prediction of damaged cells. Further investigation is needed to understand the underlying mechanisms driving these relationships.
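The approach described above (gene-list percentages as features, followed by a binary logistic regression) can be illustrated with a minimal sketch. This is not the code used for the paper: the gene lists, the toy cells, and the helper names (`pct_counts`, `fit_logistic`, `precision_recall`) are all assumptions made for illustration, and the model is fitted with plain gradient descent using only the Python standard library.

```python
import math

# Illustrative placeholder gene lists (assumptions, NOT the lists used in the paper).
MITO_GENES = {"MT-CO1", "MT-ND1"}
NUCLEAR_GENES = {"MALAT1", "NEAT1"}


def pct_counts(cell_counts, gene_list):
    """Percentage of a cell's total counts that come from genes in gene_list."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    in_list = sum(n for gene, n in cell_counts.items() if gene in gene_list)
    return 100.0 * in_list / total


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """Predicted probability that a cell with features x is damaged."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=1.0, epochs=5000):
    """Fit a binary logistic regression by batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (damaged) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic toy cells (gene -> count); label 1 = damaged, 0 = good.
cells = [
    ({"MT-CO1": 40, "MALAT1": 30, "ACTB": 30}, 1),
    ({"MT-CO1": 30, "MT-ND1": 20, "NEAT1": 35, "ACTB": 15}, 1),
    ({"MT-ND1": 45, "MALAT1": 25, "GAPDH": 30}, 1),
    ({"MT-CO1": 5, "MALAT1": 10, "ACTB": 85}, 0),
    ({"MT-ND1": 10, "NEAT1": 5, "GAPDH": 85}, 0),
    ({"MT-CO1": 8, "MALAT1": 12, "ACTB": 80}, 0),
]

# Features: fraction of counts from each gene list (percentage rescaled to 0..1).
features = [
    [pct_counts(c, MITO_GENES) / 100.0, pct_counts(c, NUCLEAR_GENES) / 100.0]
    for c, _ in cells
]
labels = [label for _, label in cells]

weights, bias = fit_logistic(features, labels)
predicted = [int(predict_proba(x, weights, bias) >= 0.5) for x in features]
precision, recall = precision_recall(labels, predicted)
print("weights:", weights, "bias:", bias)
print("precision:", precision, "recall:", recall)
```

Precision is reported alongside recall because the abstract's concern is that standard features overestimate damage, which shows up as false positives. On real data the model would be evaluated on held-out cells rather than on the training set as in this toy example, and a library implementation such as scikit-learn's `LogisticRegression` would be a natural substitute for the hand-written fit.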