Tool to apply RNA-seq and pathology foundation models to spatial transcriptomics. Currently, this tool only supports extracting expression- and histology-based features from 10x Visium samples and assumes the data have been processed by the Space Ranger pipeline.
- Set the environment variable:
export REPO_ROOT=/path/to/FMs-for-spatialomics
- Create conda environment:
conda env create -f $REPO_ROOT/env.yml
- Clone foundation model repos:
- UNI:
- Request access to UNI on Hugging Face.
- Clone the UNI Hugging Face repo locally (this method uses an SSH key registered with Hugging Face; alternate methods to clone the repo are also possible, e.g. with an access token and/or the HF CLI):
# make sure git-lfs is already installed
git clone [email protected]:MahmoodLab/UNI $REPO_ROOT/models/UNI
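As one such alternative, a minimal sketch using the huggingface_hub CLI (assuming it is installed and your account has been granted UNI access):
huggingface-cli login
huggingface-cli download MahmoodLab/UNI --local-dir $REPO_ROOT/models/UNI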
- UCE:
- Clone the UCE github repo locally:
git clone [email protected]:snap-stanford/UCE.git $REPO_ROOT/models/UCE
- Download the model files:
wget https://figshare.com/ndownloader/articles/24320806/versions/5 -O $REPO_ROOT/models/UCE/model_files/temp.zip
- Unzip model files:
unzip $REPO_ROOT/models/UCE/model_files/temp.zip -d $REPO_ROOT/models/UCE/model_files
- Untar additional model files:
tar -xvf $REPO_ROOT/models/UCE/model_files/protein_embeddings.tar.gz -C $REPO_ROOT/models/UCE/model_files
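- Optionally, remove the downloaded archives once their contents have been extracted:
rm $REPO_ROOT/models/UCE/model_files/temp.zip
rm $REPO_ROOT/models/UCE/model_files/protein_embeddings.tar.gz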
Feature extraction of expression and histology data using modality-specific foundation models. The following instructions extract features per capture area (which we refer to as a slide) and should be repeated for each slide of interest.
- Activate conda environment:
conda activate spatialFM
- Prepare per-slide data:
python $REPO_ROOT/1-convert-to-anndata.py \
    --spatial_path /path/to/spaceranger/output/outs \
    --slide_path /path/to/full/resolution/slide.tif \
    --tile_width N \
    --output_h5ad /path/to/converted.h5ad
where the `tile_width` should typically be an integer within 1 pixel of `spot_diameter`.
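To look up the spot diameter for a given slide, one option (assuming the standard Space Ranger output layout) is to read `spot_diameter_fullres` from the scale-factors file:
python -c "import json; print(json.load(open('/path/to/spaceranger/output/outs/spatial/scalefactors_json.json'))['spot_diameter_fullres'])"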
- Extract expression features:
python $REPO_ROOT/2-extract-expression-features.py \
    --input_h5ad /path/to/converted.h5ad \
    --output_h5ad /path/to/expr.h5ad \
    --model uce_4 \
    --species human
where a non-standard `--species` may also need to be configured within the UCE model, and `--model` can be one of `uce_4` or `uce_33` to use the 4- or 33-layer UCE models, respectively.
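For example, a hypothetical run on a mouse sample with the larger model (assuming mouse is among the species covered by the downloaded UCE protein embeddings) might look like:
python $REPO_ROOT/2-extract-expression-features.py \
    --input_h5ad /path/to/converted.h5ad \
    --output_h5ad /path/to/expr.h5ad \
    --model uce_33 \
    --species mouse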
- Extract histology features:
python $REPO_ROOT/3-extract-histology-features.py \
    --input_h5ad /path/to/converted.h5ad \
    --output_h5ad /path/to/hist.h5ad
- Unify features:
Differing inclusion criteria between the foundation models result in minor differences in which barcoded spots actually get processed. This final step takes the intersection of those spots for further analysis.
python $REPO_ROOT/4-combine-data.py \
    --source_h5ad /path/to/converted.h5ad \
    --expr_h5ad /path/to/expr.h5ad \
    --hist_h5ad /path/to/hist.h5ad \
    --output_h5ad /path/to/extracted.h5ad
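To see how many spots survive the intersection, a quick check (assuming the `anndata` package is available in the environment) is to compare spot counts across the intermediate files:
for f in converted expr hist extracted; do
    python -c "import anndata; print('$f', anndata.read_h5ad('/path/to/$f.h5ad').n_obs)"
done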
- Clean up (optional):
rm /path/to/converted.h5ad
rm /path/to/expr.h5ad
rm /path/to/hist.h5ad
An example script to run all steps of the feature extraction pipeline is located at `run-extract.sh`. It should similarly be run with the conda environment activated.
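Since features are extracted per slide, the four steps can also be wrapped in a loop. A minimal sketch, assuming a hypothetical layout where each slide's Space Ranger output and full-resolution image live under /data/$slide (all paths, sample names, and the tile width are placeholders):
for slide in sample_A sample_B; do
    python $REPO_ROOT/1-convert-to-anndata.py \
        --spatial_path /data/$slide/outs \
        --slide_path /data/$slide/slide.tif \
        --tile_width 55 \
        --output_h5ad /data/$slide/converted.h5ad
    python $REPO_ROOT/2-extract-expression-features.py \
        --input_h5ad /data/$slide/converted.h5ad \
        --output_h5ad /data/$slide/expr.h5ad \
        --model uce_4 \
        --species human
    python $REPO_ROOT/3-extract-histology-features.py \
        --input_h5ad /data/$slide/converted.h5ad \
        --output_h5ad /data/$slide/hist.h5ad
    python $REPO_ROOT/4-combine-data.py \
        --source_h5ad /data/$slide/converted.h5ad \
        --expr_h5ad /data/$slide/expr.h5ad \
        --hist_h5ad /data/$slide/hist.h5ad \
        --output_h5ad /data/$slide/extracted.h5ad
done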
The above inference scripts default to using the first available GPU, if one is detected. To disable or alter this behavior, the simplest method is to set the environment variable `CUDA_VISIBLE_DEVICES`. If running out of GPU memory, consider tuning the batch size with the `--batch_size` flags to the feature extraction scripts. The default batch sizes were tuned for a single V100 16GB GPU.
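For example (the batch size shown is arbitrary, for illustration only):
# force CPU-only inference
export CUDA_VISIBLE_DEVICES=""
# or restrict inference to the second GPU
export CUDA_VISIBLE_DEVICES=1
# and/or reduce the batch size of a feature extraction script
python $REPO_ROOT/3-extract-histology-features.py \
    --input_h5ad /path/to/converted.h5ad \
    --output_h5ad /path/to/hist.h5ad \
    --batch_size 8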
An example notebook for downstream analysis using extracted features can be found in the `evaluations` subdirectory of this repo.
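One way to open it (assuming Jupyter is installed in the `spatialFM` environment):
conda activate spatialFM
jupyter notebook $REPO_ROOT/evaluations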