Automating Digital Pathology

Overview

Tumor proliferation speed, or tumor growth, is an important biomarker for predicting patient outcomes. Proper assessment of this biomarker is crucial for informing the patient's treatment plan. In a clinical setting, the most common method is for a pathologist to count mitotic figures under a microscope. Manual counting is subjective, which makes the results hard to reproduce. This has motivated many efforts to automate the process with advanced ML techniques.

One of the main challenges in automating this task is that whole slide images (WSIs) are large. A single WSI can range from 0.5 GB to 3.5 GB in size, which can slow down the image preprocessing step that any downstream ML application requires.

In this solution accelerator, we walk you through a step-by-step process that uses Databricks capabilities to perform image segmentation and preprocessing on WSIs and to train a binary classifier that produces a metastasis probability map over a whole slide image.



Dataset

The data used in this solution accelerator comes from the Camelyon16 Grand Challenge, along with annotations based on hand-drawn metastasis outlines. We use curated annotations for this dataset obtained from the Baidu Research GitHub repository (NCRF).

Notebooks

We use Apache Spark's parallelization capabilities, through pandas_udf, to generate tumor/normal patches from the annotation data and to extract features with a pre-trained InceptionV3 model. We use the resulting embeddings to explore clusters of patches by visualizing 2D and 3D embeddings with UMAP. We then use transfer learning with PyTorch to train a convnet that classifies tumor vs. normal patches, and later use the resulting model to overlay a metastasis heatmap on a new slide.
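The patch labeling step described above can be sketched in plain Python. This is a minimal illustration, not the accelerator's actual implementation: it assumes non-overlapping square patches, labels each patch by whether its center falls inside an annotated tumor outline, and uses the hypothetical helper names `point_in_polygon` and `label_patches`. In a Spark pipeline, a function like this could plausibly be applied per slide inside a pandas_udf.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_patches(
    width: int, height: int, patch_size: int, tumor_polygons: List[List[Point]]
) -> List[Tuple[int, int, int]]:
    """Tile a slide into non-overlapping patches; label 1 (tumor) or 0 (normal).

    Labeling by patch center is an assumption made for this sketch.
    """
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            cx = left + patch_size / 2
            cy = top + patch_size / 2
            is_tumor = any(point_in_polygon(cx, cy, p) for p in tumor_polygons)
            patches.append((left, top, 1 if is_tumor else 0))
    return patches


# Toy example: a 1024x1024 "slide" with one square tumor outline in the top-left.
outline = [(0, 0), (512, 0), (512, 512), (0, 512)]
patches = label_patches(1024, 1024, 256, [outline])
tumor_count = sum(label for _, _, label in patches)
```

Real slides are far larger than this toy example, so the same per-patch logic would be distributed across many slides and regions rather than run in a single loop.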

This solution accelerator contains the following notebooks:

  • config: configures paths and other settings. When setting up a cluster for patch generation for the first time, use the init script generated by the config notebook to install OpenSlide on your cluster.

  • 1-create-annotation-deltalake: downloads annotations and writes them to Delta.

  • 2-patch-generation: generates patches from WSIs based on the annotations.

  • 3-feature-extraction: extracts image embeddings using InceptionV3 in a distributed manner.

  • 4-unsupervised-learning: performs dimensionality reduction and cluster inspection with UMAP.

  • 5-training: tunes and trains a binary classifier to classify tumor/normal patches with PyTorch, and logs the model with MLflow.

  • 6-metastasis-heatmap: uses the model trained in the previous step to generate a metastasis probability heatmap for a given slide.

  • definitions: contains definitions for functions that are used in multiple places (for example, patch generation and preprocessing).
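The heatmap step in the last pipeline notebook can be pictured as scoring every patch of a slide and arranging the scores on a grid. The sketch below is only an illustration under stated assumptions: `build_heatmap` and `toy_predict` are hypothetical names, the patches do not overlap, and `toy_predict` stands in for the trained classifier, which in the accelerator would be the logged PyTorch model.

```python
from typing import Callable, List


def build_heatmap(
    width: int,
    height: int,
    patch_size: int,
    predict: Callable[[int, int], float],
) -> List[List[float]]:
    """Return a rows x cols grid of tumor probabilities, one cell per patch.

    `predict(left, top)` is assumed to return the tumor probability of the
    patch whose top-left corner is at (left, top).
    """
    rows = height // patch_size
    cols = width // patch_size
    heatmap = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            heatmap[r][c] = predict(c * patch_size, r * patch_size)
    return heatmap


def toy_predict(left: int, top: int) -> float:
    """Hypothetical stand-in for a trained classifier."""
    return 0.9 if left < 512 and top < 512 else 0.1


# Toy example: a 1024x1024 "slide" scored in 256x256 patches.
heatmap = build_heatmap(1024, 1024, 256, toy_predict)
```

Each cell of the grid corresponds to one patch, so the grid can be upscaled and overlaid on a thumbnail of the slide for visual inspection.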

License

Copyright / License info of the notebook. Copyright [2021] the Notebook Authors. The source in this notebook is provided subject to the Apache 2.0 License. All included or referenced third party libraries are subject to the licenses set forth below.

| Library Name | Library License | Library License URL | Library Source URL |
|---|---|---|---|
| Pandas | BSD 3-Clause License | https://github.com/pandas-dev/pandas/blob/master/LICENSE | https://github.com/pandas-dev/pandas |
| Numpy | BSD 3-Clause License | https://github.com/numpy/numpy/blob/main/LICENSE.txt | https://github.com/numpy/numpy |
| Apache Spark | Apache License 2.0 | https://github.com/apache/spark/blob/master/LICENSE | https://github.com/apache/spark/tree/master/python/pyspark |
| Pillow (PIL) | HPND License | https://github.com/python-pillow/Pillow/blob/master/LICENSE | https://github.com/python-pillow/Pillow/ |
| OpenSlide | GNU LGPL version 2.1 | https://github.com/openslide/openslide/blob/main/LICENSE.txt | https://github.com/openslide |
| Open Slide Python | GNU LGPL version 2.1 | https://github.com/openslide/openslide-python/blob/main/LICENSE.txt | https://github.com/openslide/openslide-python |
| pytorch lightning | Apache License 2.0 | https://github.com/PyTorchLightning/pytorch-lightning/blob/master/LICENSE | https://github.com/PyTorchLightning/pytorch-lightning |
| NCRF | Apache License 2.0 | https://github.com/baidu-research/NCRF/blob/master/LICENSE | https://github.com/baidu-research/NCRF |

Author

Databricks Inc.

Disclaimers

Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account. Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.

To run this accelerator, clone this repo into a Databricks workspace. Attach the RUNME notebook to any cluster running DBR 11.0 or later, and execute the notebook via Run-All. A multi-step job describing the accelerator pipeline will be created, and a link to it will be provided. Execute the multi-step job to see how the pipeline runs.

The job configuration is written in the RUNME notebook in JSON format. The cost associated with running the accelerator is the user's responsibility.
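To give a sense of what a multi-step job configuration in JSON might look like, here is a rough sketch. It is not the configuration from the RUNME notebook: the job name, the strictly linear task ordering, and the helper `build_job_config` are assumptions made for illustration, and cluster settings are omitted. The field names (`tasks`, `task_key`, `depends_on`, `notebook_task`, `notebook_path`) follow the Databricks Jobs API.

```python
import json
from typing import Dict, List

# Notebook names taken from the list in this README.
NOTEBOOKS = [
    "1-create-annotation-deltalake",
    "2-patch-generation",
    "3-feature-extraction",
    "4-unsupervised-learning",
    "5-training",
    "6-metastasis-heatmap",
]


def build_job_config(name: str, notebooks: List[str]) -> Dict:
    """Chain the notebooks so that each task depends on the previous one.

    The linear chain is an assumption; the real pipeline may branch.
    """
    tasks = []
    previous = None
    for notebook in notebooks:
        task = {
            "task_key": notebook,
            "notebook_task": {"notebook_path": notebook},
        }
        if previous is not None:
            task["depends_on"] = [{"task_key": previous}]
        tasks.append(task)
        previous = notebook
    return {"name": name, "tasks": tasks}


# Hypothetical job name, used only for this example.
config = build_job_config("digital-pathology-example", NOTEBOOKS)
config_json = json.dumps(config, indent=2)
```

Printing `config_json` shows the task graph as JSON, which can help when comparing against the configuration that the RUNME notebook actually defines.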