Cell-painting data analysis corresponding to the "Pervasive mislocalization of pathogenic coding variants underlying human diseases" paper

carpenter-singh-lab/2024_LacosteHaghighi_Cell_Mislocalization

Pervasive mislocalization of pathogenic coding variants underlying human diseases

bioRxiv preprint, doi: https://doi.org/10.1101/2023.09.05.556368

Abstract

Widespread genome and exome sequencing has yielded thousands of missense variants predicted or confirmed as disease-causing. This creates a new bottleneck: determining the functional impact of each variant - largely a painstaking, customized process undertaken one or a few gene(s) and/or variant(s) at a time. Here, we assay the impact of coding variation on protein localization for over 3,500 missense variants across >1,000 genes and phenotypes using a high-throughput imaging platform. We discovered that mislocalization is a common consequence of coding variation, affecting about one-sixth of all pathogenic missense variants, all cellular compartments, and recessive and dominant disorders alike. Mislocalization is primarily driven by effects on protein stability and membrane insertion rather than disruptions of trafficking signals or specific protein-protein interactions. Furthermore, mislocalization patterns help explain pleiotropy and disease severity and provide insights on variants of unknown significance. The resulting publicly available resource will likely accelerate the understanding of coding variation in human diseases.

Dataset

Summary

This dataset contains eight batches of data. The table below lists each batch's screen type, imaging channels, and size.

| Batch | Description | Cell painting channels | Protein marker channels | Size (max-projected images / single-cell SQLite files) |
|---|---|---|---|---|
| PILOT_1 | initial WT/MT screen | Mito, ER, DNA | Protein | 1.04 TB / 74.12 GB |
| Cancer_Mutations_Screen | follow-up WT/MT screen | Mito, ER, DNA | Protein | 144.5 GB / 14.61 GB |
| Common_Variants | follow-up WT/MT screen | Mito, ER, DNA | Protein | 56.08 GB / 4.6 GB |
| Kinase_Plates | follow-up WT/MT screen | Mito, ER, DNA | Protein | 84.06 GB / 8.26 GB |
| Replicates_Original_Screen | replicate of initial WT/MT screen | Mito, ER, DNA | Protein | 376.69 GB / 32.88 GB |
| 2021_05_21_QualityControlPathwayArrayedScreen | compound screen | ER, DNA | Protein, DsRed | 699.91 GB / |
| 2022_01_12_Batch1 | compound screen | DNA, Lysosomes | Protein, DsRed | 223.68 GB / |
| 2022_01_12_Batch2 | compound screen | DNA | Protein, DsRed | 83.97 GB / |

Download

The data can be downloaded at no cost and without registration, using the following command:

aws s3 sync \
  --no-sign-request \
  s3://cellpainting-gallery/cpg0026-lacoste_haghighi-rare-diseases/broad/ .
  • AWS CLI installation instructions can be found here.
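
To fetch only part of the dataset, the same command can be pointed at a subfolder of `broad/`. The following is a minimal sketch that builds such a command in Python. The helper name `build_sync_command` and the local destination folder are hypothetical; the flags and the bucket path are the ones shown in the command above.

```python
BUCKET_PREFIX = "s3://cellpainting-gallery/cpg0026-lacoste_haghighi-rare-diseases/broad"


def build_sync_command(subfolder: str, destination: str = ".") -> list[str]:
    """Return an `aws s3 sync` command for one subfolder of the dataset.

    `subfolder` is a path relative to `broad/`, e.g. "workspace/metadata".
    """
    source = f"{BUCKET_PREFIX}/{subfolder.strip('/')}/"
    return ["aws", "s3", "sync", "--no-sign-request", source, destination]


cmd = build_sync_command("workspace/metadata", "metadata")
print(" ".join(cmd))

# To run the download (requires the AWS CLI to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when it is passed to `subprocess.run`.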

Folder Structure

The parent structure of the dataset is as follows.

cellpainting-gallery
└── cpg0026-lacoste_haghighi-rare-diseases
    └── broad
        ├── images
        │   ├── PILOT_1
        │   │   ├── illum
        │   │   ├── images_unprojected
        │   │   └── images
        │   ├── Cancer_Mutations_Screen 
        │   ├── Common_Variants
        │   ├── Kinase_Plates
        │   ├── Replicates_Original_Screen
        │   ├── 2021_05_21_QualityControlPathwayArrayedScreen 
        │   ├── 2022_01_12_Batch1     
        │   └── 2022_01_12_Batch2
        └── workspace
            ├── analysis
            ├── backend
            ├── load_data_csv
            ├── metadata
            │   ├── reprocessed
            │   └── raw
            └── profiles
                 ├── singlecell_profiles (Transfected per well single cells)
                 ├── enrichment_profiles
                 └── population_profiles (Average of transfected and untransfected per well)


The general structure of the images, analysis, backend and load_data_csv folders for all projects in cellpainting-gallery can be found at the links below:

  • images folder structure
  • analysis folder structure
  • backend folder structure
  • load_data_csv folder structure

Data Level descriptions

Images

  • General images folder structure for all projects in cellpainting-gallery

  • Image file name pattern (images_unprojected subfolder)

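    • File name pattern: r(n)c(n)f(n)p(n)-c(n)sk1fk1fl1.tiff, where each (n) is a number giving the value of the variable named by the preceding letter. The letter c appears twice: before the hyphen it is the column, after the hyphen it is the channel.

      • r = row
      • c (before the hyphen) = column
      • f = field
      • p = position in the z-stack
      • c (after the hyphen) = channel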
  • Image file name pattern (images subfolder)

    • Same as above, except that p(n) is always p01.
    • These are max-projected images. They were used as input to the CellProfiler software for feature extraction.
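
The file name pattern above can be parsed with a regular expression. This is only a sketch: the names `ImageName` and `parse_image_name` are hypothetical, and accepting `ch` as well as `c` for the channel prefix is an assumption made for robustness, not something stated in this README.

```python
import re
from typing import NamedTuple


class ImageName(NamedTuple):
    row: int
    column: int
    field: int
    plane: int    # position in the z-stack; always 1 for max-projected images
    channel: int


# r(n)c(n)f(n)p(n)-c(n)sk1fk1fl1.tiff
# Assumption: the channel prefix may be written "c" or "ch".
_PATTERN = re.compile(r"r(\d+)c(\d+)f(\d+)p(\d+)-ch?(\d+)sk1fk1fl1\.tiff$")


def parse_image_name(name: str) -> ImageName:
    """Extract row, column, field, z-position and channel from a file name."""
    match = _PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognized image file name: {name!r}")
    return ImageName(*(int(group) for group in match.groups()))


print(parse_image_name("r02c05f03p01-c4sk1fk1fl1.tiff"))
# ImageName(row=2, column=5, field=3, plane=1, channel=4)
```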

CellProfiler software outputs

Preprocessed perturbation level/single cell level profiles

  • Single-cell profiles (singlecell_profiles)
    • Single-cell profiles contain the transfected single cells of each well. They are saved separately to reduce the computational burden of single-cell analysis and data exploration.
  • Population profiles (population_profiles)
    • Population profiles are formed by first detecting the transfected single cells in each well and then aggregating those cells per well.
  • Enrichment profiles (enrichment_profiles)
    • Enrichment profiles are histograms of cell counts per cluster, where the clusters are defined at the experiment level.
cellpainting-gallery
└── cpg0026-lacoste_haghighi-rare-diseases
    └── broad
        ├── images
        └── workspace
            └── profiles
                 ├── singlecell_profiles (transfected per well single cells)
                 │   ├── PILOT_1
                 │   │   ├── RC4_IF_01
                 │   │   |   ├── RC4_IF_01_B05.csv.gz
                 │   │   |        ├── RC4_IF_01_B05.csv.gz
                 │   │   |        ├── RC4_IF_01_B06.csv.gz
                 │   │   |        ...
                 │   │   ├──  RC4_IF_02
                 │   |   ...
                 │   ├── Cancer_Mutations_Screen
                 │   ...
                 ├── enrichment_profiles
                 └── population_profiles (average of transfected and untransfected per well)
                    ├── PILOT_1
                    │   ├── RC4_IF_01
                    │   |   └── RC4_IF_01.csv.gz
                    │   ├──  RC4_IF_02
                    |   ...
                    ├── Cancer_Mutations_Screen 
                    ...
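
The profile files in the tree above are gzip-compressed CSV files, so they can be read with the Python standard library. The sketch below rests on assumptions: the function names are hypothetical, and treating columns with a `Metadata_` prefix as annotations and every other column as a feature is a convention assumed here, not something this README states about the profile files.

```python
import csv
import gzip


def load_profile(path):
    """Read a gzip-compressed CSV profile into a list of row dictionaries."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


def split_columns(rows):
    """Separate metadata columns (assumed `Metadata_` prefix) from feature columns."""
    columns = list(rows[0].keys()) if rows else []
    metadata = [c for c in columns if c.startswith("Metadata_")]
    features = [c for c in columns if not c.startswith("Metadata_")]
    return metadata, features


# Example usage, assuming the profiles folder was downloaded locally:
#   rows = load_profile("population_profiles/PILOT_1/RC4_IF_01/RC4_IF_01.csv.gz")
#   metadata_columns, feature_columns = split_columns(rows)
```

`csv.DictReader` returns every value as a string, so feature values need to be converted with `float()` before any numerical analysis.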

Metadata

  • Each batch has a raw annotation file provided by the wet lab and a reprocessed, standardized version that is used as the input for the analysis.
cellpainting-gallery
└── cpg0026-lacoste_haghighi-rare-diseases
    └── broad
        └── workspace
            └── metadata
               ├── reprocessed
               └── raw
| Column | Description | Batches which have this column |
|---|---|---|
| Metadata_Batch | one of eight batches | all |
| Metadata_Plate | plate key | all |
| Metadata_Well | well key | all |
| Metadata_Sample | WT+MT string | all |
| Variant | WT+MT string | all |
| Gene | WT string | all |
| MT | MT string | all |
| Treatment | chemical perturbation added to genetic perturbation | 2021_05_21_QualityControlPathwayArrayedScreen, 2022_01_12_Batch1, 2022_01_12_Batch2 |
| Metadata_Sample_Unique | 978 1677 | |
| Metadata_Location | primary location of protein by visual annotation | PILOT_1 |

Analysis

Transfection Detection

  • For most batches, transfected cells are detected by thresholding a single feature at the single-cell level:
    • The raw per-plate single-cell feature values are clipped at the 0.999 quantile (99.9th percentile) of their distribution within each plate.
    • The clipped per-plate values are then scaled to the range 0 to 1 using per-plate min-max scaling.
    • The single-cell feature used is Cells_Intensity_IntegratedIntensity_Protein, except for the 2021_05_21_QualityControlPathwayArrayedScreen batch, for which we used Cells_Intensity_UpperQuartileIntensity_DsRed.
  • For 2022_01_12_Batch1 and 2022_01_12_Batch2, segmentation is done on the DsRed channel in the image analysis step, and any cell detected in that channel is assumed to be transfected. The CellProfiler single-cell outputs for these two batches therefore contain transfected cells only.

    • Transfected/Untransfected Profiles:
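
The per-plate clipping, scaling and thresholding described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the analysis: the function names are hypothetical, the threshold value is not given in this README and is left as a parameter, and labeling cells above the threshold as transfected is assumed.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty sequence, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def scale_plate(intensities, clip_q=0.999):
    """Clip one plate's values at the upper quantile, then min-max scale to [0, 1]."""
    cap = quantile(intensities, clip_q)
    clipped = [min(value, cap) for value in intensities]
    low, high = min(clipped), max(clipped)
    if high == low:
        # Degenerate plate: all cells have the same value.
        return [0.0 for _ in clipped]
    return [(value - low) / (high - low) for value in clipped]


def label_transfected(intensities, threshold):
    """Flag cells whose scaled intensity exceeds `threshold` (hypothetical value)."""
    return [value > threshold for value in scale_plate(intensities)]


# One plate's single-cell values of the chosen intensity feature (made-up numbers).
plate = [1.0, 2.0, 3.0, 4.0, 1000.0]
print(label_transfected(plate, threshold=0.5))
```

Clipping before scaling keeps a few extremely bright cells from compressing the rest of the plate's values toward zero.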

Forming Perturbation Level Profile

  • Population (Average) Profiles:
    • Per-well population profiles are formed by averaging the profiles of the transfected single cells (labeled in a prior step).
  • Subpopulation (Enrichment) Profiles:
    • To extract subpopulation (enrichment) profiles, we first fit a k-means (k=20) clustering model to a subsample (1000 cells per plate) of the transfected single cells from all plates. The enrichment profile of each well is then formed from the per-cluster abundance of single cells in that well.
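
The two kinds of per-well profiles can be illustrated with a short sketch. It assumes the clustering has already been done with some k-means implementation and that each transfected cell carries an integer cluster label from 0 to k-1. The function names are hypothetical, and reporting raw counts rather than fractions is an assumption.

```python
from collections import Counter


def population_profile(cells):
    """Average the feature vectors of the transfected cells in one well.

    `cells` is a non-empty list of equal-length lists of floats.
    """
    n_cells = len(cells)
    return [sum(column) / n_cells for column in zip(*cells)]


def enrichment_profile(cluster_labels, k=20):
    """Count the cells of one well in each of the k clusters.

    `cluster_labels` holds one precomputed cluster index (0..k-1) per cell.
    """
    counts = Counter(cluster_labels)
    return [counts.get(cluster, 0) for cluster in range(k)]


well_cells = [[1.0, 2.0], [3.0, 4.0]]
print(population_profile(well_cells))        # [2.0, 3.0]
print(enrichment_profile([0, 0, 3], k=4))    # [2, 0, 0, 1]
```

Dividing each count by the number of cells in the well would give fractions instead, which may be easier to compare across wells with different cell counts.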

Technical replicate reproducibility

  • The uniformity of single treatment profiles across distinct experimental batches serves as an indicator of data quality. We assess this uniformity through the following procedure:

    • Replicate-level profiles per plate undergo standardization.
    • The Pearson correlation coefficient is computed for each pair of replicate-level profiles corresponding to the same perturbation.
    • The distribution of these coefficients for each dataset and modality is depicted in Supplementary Figure 1 as red curves.
    • The blue curves, juxtaposed to the red curves, represent the null distribution, which displays the correlation coefficient between profile pairs originating from distinct perturbations.
    • The non-zero dotted vertical line on the right signifies the 90th percentile of the null distribution.
    • Perturbations exhibiting an average replicate correlation exceeding the 90th percentile of the null distribution are deemed high-quality data points, suitable for subsequent analyses.
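
The procedure above can be sketched as follows, assuming the replicate-level profiles have already been standardized per plate. The function names and the input layout (a list of `(perturbation, feature_vector)` pairs) are hypothetical, and the percentile uses linear interpolation, which is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def percentile(values, p):
    """Linear-interpolation percentile of a non-empty sequence, 0 <= p <= 100."""
    ordered = sorted(values)
    pos = p / 100 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def reproducible_perturbations(profiles, null_percentile=90):
    """Return (passing perturbations, threshold) for replicate-level profiles."""
    replicate, null = {}, []
    for (pert_a, vec_a), (pert_b, vec_b) in combinations(profiles, 2):
        r = pearson(vec_a, vec_b)
        if pert_a == pert_b:
            replicate.setdefault(pert_a, []).append(r)  # same perturbation
        else:
            null.append(r)                              # different perturbations
    threshold = percentile(null, null_percentile)
    passing = {
        pert for pert, rs in replicate.items() if sum(rs) / len(rs) > threshold
    }
    return passing, threshold


# Toy example with two perturbations, two replicates each (made-up numbers).
profiles = [
    ("A", [1.0, 2.0, 3.0, 4.0]),
    ("A", [2.0, 4.0, 6.0, 8.5]),
    ("B", [4.0, 3.0, 2.0, 1.0]),
    ("B", [1.0, 3.0, 2.0, 4.0]),
]
passing, threshold = reproducible_perturbations(profiles)
print(passing)
```

In this toy example the replicates of A agree closely and pass, while the replicates of B are anti-correlated and fall below the null threshold.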

Protein Localization

  • Manual annotations: for the following batches of data, we have per-well annotation captured by biologist's visual inspection of data and ...

Results

Differential Phenotype Exploration

  • Plots exploring distinguishable WT/MT phenotypes
  • Plots for exploring and visualizing the distinguishable/non-distinguishable single-cell subpopulations
  • Impact scores based on each level of perturbation profiles
