Pervasive mislocalization of pathogenic coding variants underlying human diseases

Link to biorxiv submission (bioRxiv preprint doi: https://doi.org/10.1101/2023.09.05.556368)

Abstract
Dataset
Analysis
- Transfection Detection
  - Filter out untransfected (unperturbed) single cells as the first analysis step
- Perturbation Level Profile
  - processing CellProfiler single cell outputs to generate per-well level profiles for each perturbation
- Technical replicate reproducibility
  - as a measure of profile quality
- Protein Localization
  - Supervised population level classification of protein localization
- WT/MT impact-score
  - Supervised population level classification of protein localization
- Variant/Variant+Treatment reversal-score
  - Supervised population level classification of protein localization
Results
- Differential Phenotype Exploration
  - Plots exploring distinguishable WT/MT phenotypes
  - Plots for exploring and visualizing the distinguishable/non-distinguishable single-cell subpopulations
  - Impact scores based on each level of perturbation profiles

Abstract

Widespread genome and exome sequencing has yielded thousands of missense variants predicted or confirmed as disease-causing. This creates a new bottleneck: determining the functional impact of each variant - largely a painstaking, customized process undertaken one or a few gene(s) and/or variant(s) at a time. Here, we assay the impact of coding variation on protein localization for over 3,500 missense variants across >1,000 genes and phenotypes using a high-throughput imaging platform. We discovered that mislocalization is a common consequence of coding variation, affecting about one-sixth of all pathogenic missense variants, all cellular compartments, and recessive and dominant disorders alike. Mislocalization is primarily driven by effects on protein stability and membrane insertion rather than disruptions of trafficking signals or specific protein-protein interactions. Furthermore, mislocalization patterns help explain pleiotropy and disease severity and provide insights on variants of unknown significance. The resulting publicly available resource will likely accelerate the understanding of coding variation in human diseases.

Dataset

Summary

This dataset comprises eight batches, summarized in the table below.

Batch	Description	cell painting channels	protein marker channels	Size (max projected images/ single cell sqlite files)
PILOT_1	initial WT/MT screen	`Mito`,`ER`,`DNA`	`Protein`	1.04 TB / 74.12 GB
Cancer_Mutations_Screen	follow up WT/MT screen	`Mito`,`ER`,`DNA`	`Protein`	144.5 GB / 14.61 GB
Common_Variants	follow up WT/MT screen	`Mito`,`ER`,`DNA`	`Protein`	56.08 GB / 4.6 GB
Kinase_Plates	follow up WT/MT screen	`Mito`,`ER`,`DNA`	`Protein`	84.06 GB / 8.26 GB
Replicates_Original_Screen	replicate of initial WT/MT screen	`Mito`,`ER`,`DNA`	`Protein`	376.69 GB / 32.88 GB
2021_05_21_QualityControlPathwayArrayedScreen	compound screen	`ER`,`DNA`	`Protein`,`DsRed`	699.91 GB /
2022_01_12_Batch1	compound screen	`DNA`,`Lysosomes`	`Protein`,`DsRed`	223.68 GB /
2022_01_12_Batch2	compound screen	`DNA`	`Protein`,`DsRed`	83.97 GB /

Download

The data can be downloaded at no cost and without registration using the following command:

aws s3 sync \
  --no-sign-request \
  s3://cellpainting-gallery/cpg0026-lacoste_haghighi-rare-diseases/broad/ .

AWS CLI installation instructions are available in the official AWS documentation.

Folder Structure

The parent structure of the dataset is as follows.

cellpainting-gallery
└── cpg0026-lacoste_haghighi-rare-diseases
    └── broad
        ├── images
        │   ├── PILOT_1
        │   │   ├── illum
        │   │   ├── images_unprojected
        │   │   └── images
        │   ├── Cancer_Mutations_Screen 
        │   ├── Common_Variants
        │   ├── Kinase_Plates
        │   ├── Replicates_Original_Screen
        │   ├── 2021_05_21_QualityControlPathwayArrayedScreen 
        │   ├── 2022_01_12_Batch1     
        │   └── 2022_01_12_Batch2
        └── workspace
            ├── analysis
            ├── backend
            ├── load_data_csv
            ├── metadata
            │   ├── reprocessed
            │   └── raw
            └── profiles
                 ├── singlecell_profiles (Transfected per well single cells)
                 ├── enrichment_profiles
                 └── population_profiles (Average of transfected and untransfected per well)

The general structure of the images, analysis, backend and load_data_csv folders for all projects in cellpainting-gallery can be found at the links below:
- images folder structure
- analysis folder structure
- backend folder structure
- load_data_csv folder structure

Data Level descriptions

Images

General images folder structure for all projects in cellpainting-gallery
Image file name pattern (images_unprojected subfolder)
- File name pattern: r(n)c(n)f(n)p(n)-c(n)sk1fk1fl1.tiff, where each (n) is the number for the variable named by the preceding letter:
  - r = row
  - c = column
  - f = field
  - p = position in the z-stack
  - c (after the hyphen) = channel
Image file name pattern (images subfolder)
- Same as above, except that p(n) is always p01
- These are max-projected images and were used as inputs to CellProfiler for feature extraction
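
As an illustration of the naming scheme above, the following is a minimal sketch of a file name parser. The regular expression, the `ImageName` type and the `parse_image_name` helper are assumptions written for this example and are not part of the repository; the pattern follows the description above literally, including the fixed `sk1fk1fl1` suffix.

```python
import re
from typing import NamedTuple

# Regular expression built from the pattern described above (an assumption of
# this sketch): r(n)c(n)f(n)p(n)-c(n)sk1fk1fl1.tiff
IMAGE_NAME = re.compile(
    r"^r(?P<row>\d+)c(?P<column>\d+)f(?P<field>\d+)p(?P<plane>\d+)"
    r"-c(?P<channel>\d+)sk1fk1fl1\.tiff$"
)


class ImageName(NamedTuple):
    """Hypothetical container for the fields encoded in a file name."""
    row: int
    column: int
    field: int
    plane: int    # position in the z-stack; always 1 for max-projected images
    channel: int


def parse_image_name(name: str) -> ImageName:
    """Split an image file name into its numeric components."""
    match = IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected image file name: {name!r}")
    return ImageName(**{key: int(value) for key, value in match.groupdict().items()})


# The file name below is made up for illustration.
print(parse_image_name("r02c05f03p01-c2sk1fk1fl1.tiff"))
```

Raising on a non-matching name, rather than returning `None`, makes unexpected files visible early when iterating over a whole plate folder.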

CellProfiler software outputs

Pipelines can be found in the load_data_csv folder, which follows the standard load_data_csv folder structure.
CellProfiler generated single-cell profiles and cell outlines
- These follow the standard CellProfiler (CP) output layout.
  - The general structure of the analysis and backend folders for all projects in cellpainting-gallery can be found at the links below:
    - analysis folder structure
    - backend folder structure

Preprocessed perturbation level/single cell level profiles

Single-cell profiles (singlecell_profiles)
- Single-cell profiles contain the transfected single cells in each well. They are saved separately to reduce the computational burden of single-cell analysis and data exploration.
Population profiles (population_profiles)
- Population profiles are formed by first detecting the transfected single cells in each well and then aggregating those cells per well.
Enrichment profiles (enrichment_profiles)
- Enrichment profiles are per-well histograms of cell counts across clusters that are defined at the experiment level.

cellpainting-gallery
└── cpg0026-lacoste_haghighi-rare-diseases
    └── broad
        ├── images
        └── workspace
            └── profiles
                 ├── singlecell_profiles (transfected per well single cells)
                 │   ├── PILOT_1
                 │   │   ├── RC4_IF_01
                 │   │   │   ├── RC4_IF_01_B05.csv.gz
                 │   │   │   ├── RC4_IF_01_B06.csv.gz
                 │   │   │   ...
                 │   │   ├── RC4_IF_02
                 │   │   ...
                 │   ├── Cancer_Mutations_Screen
                 │   ...
                 ├── enrichment_profiles
                 └── population_profiles (average of transfected and untransfected per well)
                    ├── PILOT_1
                    │   ├── RC4_IF_01
                    │   |   └── RC4_IF_01.csv.gz
                    │   ├──  RC4_IF_02
                    |   ...
                    ├── Cancer_Mutations_Screen 
                    ...
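
To make the layout above concrete, here is a small sketch that builds object keys for profile files inside the `cellpainting-gallery` bucket. The two helper functions are hypothetical and the key layout is inferred from the tree above, assuming that per-well single-cell files sit directly inside their plate folder and are named `<plate>_<well>.csv.gz`.

```python
from pathlib import PurePosixPath

# Prefix of the profiles folder inside the cellpainting-gallery bucket,
# taken from the folder tree above.
PROFILES_ROOT = PurePosixPath(
    "cpg0026-lacoste_haghighi-rare-diseases/broad/workspace/profiles"
)


def singlecell_profile_key(batch: str, plate: str, well: str) -> str:
    """Key of the per-well single-cell profile file (layout is an assumption)."""
    return str(
        PROFILES_ROOT / "singlecell_profiles" / batch / plate / f"{plate}_{well}.csv.gz"
    )


def population_profile_key(batch: str, plate: str) -> str:
    """Key of the per-plate population profile file (layout is an assumption)."""
    return str(PROFILES_ROOT / "population_profiles" / batch / plate / f"{plate}.csv.gz")


print(singlecell_profile_key("PILOT_1", "RC4_IF_01", "B05"))
print(population_profile_key("PILOT_1", "RC4_IF_01"))
```

A key built this way can be appended to `s3://cellpainting-gallery/` to fetch a single file instead of syncing the whole dataset.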

Metadata

Each batch has a raw annotation file provided by the wet lab and a reprocessed, standardized version that is used as the input for the analysis.

cellpainting-gallery
└── cpg0026-lacoste_haghighi-rare-diseases
    └── broad
        └── workspace
            └── metadata
               ├── reprocessed
               └── raw

Column	Description	Batches which have this column
Metadata_Batch	one of eight batches	all
Metadata_Plate	plate key	all
Metadata_Well	well key	all
Metadata_Sample	WT+MT string	all
Variant	WT+MT string	all
Gene	WT string	all
MT	MT string	all
Treatment	chemical perturbation added to genetic perturbation	2021_05_21_QualityControlPathwayArrayedScreen, 2022_01_12_Batch1, 2022_01_12_Batch2
Metadata_Sample_Unique	978	1677
Metadata_Location	primary location of protein by visual annotation	PILOT_1

Analysis

Transfection Detection

For most batches, transfection is detected by thresholding a single feature measured on single cells. Raw single-cell feature values are first clipped, plate by plate, at the 0.999 quantile (99.9th percentile) of their per-plate distribution. The clipped values are then rescaled to the range 0 to 1 using per-plate min-max scaling. The feature used is Cells_Intensity_IntegratedIntensity_Protein, except for the 2021_05_21_QualityControlPathwayArrayedScreen batch, for which we used Cells_Intensity_UpperQuartileIntensity_DsRed. For batches 2022_01_12_Batch1 and 2022_01_12_Batch2, segmentation is done on the DsRed channel during image analysis, and any cell detected in that channel is assumed to be transfected. The CellProfiler single-cell outputs for these two batches therefore contain only transfected cells.
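
The clipping, scaling and thresholding steps described above can be sketched as follows for the values of one feature on one plate. This is an illustration only, not the repository's code: it reads the clipping level as the 0.999 quantile, the function names are hypothetical, and the threshold of 0.5 and the synthetic intensities are made up for the example, since the text does not state the threshold values used.

```python
import numpy as np


def scale_plate_feature(values, upper_quantile=0.999):
    """Clip one plate's feature values at an upper quantile, then min-max scale."""
    values = np.asarray(values, dtype=float)
    cap = np.quantile(values, upper_quantile)   # assumed reading of the clipping level
    clipped = np.minimum(values, cap)
    low, high = clipped.min(), clipped.max()
    if high == low:                             # constant feature: avoid dividing by zero
        return np.zeros_like(clipped)
    return (clipped - low) / (high - low)


def label_transfected(values, threshold):
    """Boolean mask of cells whose scaled feature exceeds a chosen threshold."""
    return scale_plate_feature(values) > threshold


# Synthetic plate: 900 dim (untransfected-like) and 100 bright (transfected-like) cells.
rng = np.random.default_rng(0)
intensity = np.concatenate([rng.normal(10, 1, 900), rng.normal(100, 5, 100)])

# threshold=0.5 is a hypothetical value chosen for this synthetic example only.
is_transfected = label_transfected(intensity, threshold=0.5)
print(int(is_transfected.sum()), "cells labeled as transfected")
```

Clipping before scaling keeps a handful of extremely bright cells from compressing the rest of the plate toward zero, which is why the order of the two steps matters.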
- Transfected/Untransfected Profiles:

Forming Perturbation Level Profile

Population (Average) Profiles:
- Per-well population profiles are formed by averaging the profiles of the transfected single cells (labeled in the prior step).
Subpopulation (Enrichment) Profiles:
- To extract subpopulation (enrichment) profiles, we first fit a k-means (k=20) clustering model to a subsample (1000 cells per plate) of transfected single cells pooled from all plates. The subpopulation (enrichment) profile of each well is then the per-cluster abundance of single cells in that well.
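
The two profile types above can be illustrated with the sketch below. It is not the repository's implementation: a plain NumPy version of Lloyd's k-means stands in for whatever clustering library the pipeline uses (a library such as scikit-learn would normally be preferred), all function names are hypothetical, and the data are random numbers from a single synthetic plate.

```python
import numpy as np


def assign_clusters(points, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


def fit_kmeans(points, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means; a stand-in for a library implementation."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_clusters(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members):                # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def population_profiles(features, wells):
    """Per-well mean of the transfected single-cell feature vectors."""
    return {w: features[wells == w].mean(axis=0) for w in np.unique(wells)}


def enrichment_profiles(features, wells, centers):
    """Per-well count of cells assigned to each cluster."""
    labels = assign_clusters(features, centers)
    return {w: np.bincount(labels[wells == w], minlength=len(centers))
            for w in np.unique(wells)}


# Synthetic transfected cells from one plate: 3000 cells, 8 features, 3 wells.
rng = np.random.default_rng(1)
features = rng.normal(size=(3000, 8))
wells = rng.choice(np.array(["B05", "B06", "B07"]), size=3000)

# Subsample 1000 cells from the plate, fit k=20 clusters, then profile every well.
subsample = rng.choice(len(features), size=1000, replace=False)
centers = fit_kmeans(features[subsample], k=20)
population = population_profiles(features, wells)
enrichment = enrichment_profiles(features, wells, centers)
print(enrichment["B05"])
```

With several plates, the subsamples from each plate would be concatenated before fitting, so that all wells are described against the same set of clusters.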

Technical replicate reproducibility

The uniformity of single treatment profiles across distinct experimental batches serves as an indicator of data quality. We assess this uniformity through the following procedure:
- Replicate-level profiles per plate undergo standardization.
- The Pearson correlation coefficient is computed for each pair of replicate-level profiles corresponding to the same perturbation.
- The distribution of these coefficients for each dataset and modality is depicted in Supplementary Figure 1 as red curves.
- The blue curves, juxtaposed to the red curves, represent the null distribution, which displays the correlation coefficient between profile pairs originating from distinct perturbations.
- The non-zero dotted vertical line on the right signifies the 90th percentile of the null distribution.
- Perturbations exhibiting an average replicate correlation exceeding the 90th percentile of the null distribution are deemed high-quality data points, suitable for subsequent analyses.
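
The procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's code: the function names are hypothetical, standardization is taken to be feature-wise z-scoring within each plate, and the two plates of profiles are synthetic, with half of the perturbations given a reproducible signal.

```python
import numpy as np


def standardize(profiles):
    """Z-score each feature across the profiles of one plate."""
    profiles = np.asarray(profiles, dtype=float)
    std = profiles.std(axis=0)
    std[std == 0] = 1.0                       # leave constant features unchanged
    return (profiles - profiles.mean(axis=0)) / std


def replicate_quality(profiles, perturbations, null_percentile=90):
    """Mean replicate correlation per perturbation and the null-based cutoff."""
    profiles = np.asarray(profiles, dtype=float)
    perturbations = np.asarray(perturbations)
    corr = np.corrcoef(profiles)              # Pearson correlation between profiles (rows)
    i, j = np.triu_indices(len(profiles), k=1)
    pair_corr = corr[i, j]
    same = perturbations[i] == perturbations[j]
    # Null distribution: pairs of profiles from different perturbations.
    threshold = np.percentile(pair_corr[~same], null_percentile)
    mean_corr = {}
    for name in np.unique(perturbations):
        mask = same & (perturbations[i] == name)
        if mask.any():
            mean_corr[name] = pair_corr[mask].mean()
    passing = [name for name, value in mean_corr.items() if value > threshold]
    return threshold, mean_corr, passing


# Synthetic data: 30 perturbations measured once on each of two plates, 50 features.
rng = np.random.default_rng(2)
names = np.array([f"p{i:02d}" for i in range(30)])
signal = 3.0 * rng.normal(size=(30, 50))
signal[15:] = 0.0                             # the last 15 perturbations carry no signal
plate_a = signal + rng.normal(size=(30, 50))
plate_b = signal + rng.normal(size=(30, 50))

profiles = np.vstack([standardize(plate_a), standardize(plate_b)])
threshold, mean_corr, passing = replicate_quality(profiles, np.concatenate([names, names]))
print(f"cutoff = {threshold:.2f}, {len(passing)} of {len(names)} perturbations pass")
```

By construction, roughly one in ten perturbations without any real signal will still exceed the 90th percentile of the null, so the cutoff filters noise rather than certifying an effect.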

Protein Localization

Manual annotations: for the following batches of data, we have per-well annotation captured by biologist's visual inspection of data and ...

Results

Differential Phenotype Exploration

Plots exploring distinguishable WT/MT phenotypes
Plots for exploring and visualizing the distinguishable/non-distinguishable single-cell subpopulations
Impact scores based on each level of perturbation profiles

carpenter-singh-lab/2024_LacosteHaghighi_Cell_Mislocalization
