
Deep Learning for Digital Pathology (DLDP) - Camelyon 2016 dataset

Weizhe Li, Weijie Chen

Project Overview

Code Overview

 

Code link: Training image extraction

Code link: Image processing

Code link: Neural network training

Code link: Heatmap construction

Code link: Slide-based prediction

Code link: Lesion-based prediction

Code link: WSI-heatmap visualization

Code Documentation  

  • Note: The code was developed on Python 3.5 but also works on Python 3.6.

 

0 - Preparation

0.1 Set up deep learning environment.

  • Setup Python Environment

  • Package Installation for Color Normalization

    Note: SPAMS no longer needs to be installed manually from source, since it can be installed through pip (pip install spams).

  • TensorFlow and Keras versions

    The code here is based on Keras 2.0.0 and TensorFlow 1.9. Compatibility between TensorFlow and Keras, and between TensorFlow and CUDA, is important, especially when the code runs on different machines. Some machines can only run an older version of TensorFlow, which in turn is compatible only with an older version of Keras. When a model trained with a higher version of TensorFlow and Keras is loaded on such a machine, the weights of the trained model are not fully loaded; the model still works for testing, but not for training (e.g., transfer learning). The code needs some changes if a Keras version higher than 2.0.0 is used for model training (see the comments in the model-training code). A minimal loading sketch follows.
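A minimal sketch of loading a trained model across Keras/TensorFlow versions (the file name and the build_googlenet_v1 builder are hypothetical placeholders, not names from this repository); loading weights by name tolerates small serialization differences, which is what makes testing possible even when a full load fails:

```python
# Hedged sketch: load a trained model on a machine with a different
# Keras/TensorFlow version. File name and builder are hypothetical.
from keras.models import load_model

try:
    # Full load: architecture + weights from one HDF5 file.
    model = load_model('googlenet_v1_trained.h5')
except ValueError:
    # Fallback on version mismatch: rebuild the architecture in code,
    # then load only the weights, matching layers by name.
    from googlenet import build_googlenet_v1   # hypothetical builder
    model = build_googlenet_v1(input_shape=(224, 224, 3))
    model.load_weights('googlenet_v1_trained.h5', by_name=True)

# A partial load can be silent; spot-check a few layer weights before use.
print([w.shape for w in model.get_weights()[:2]])
```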

0.2 ASAP installation and image display

OpenSlide


ASAP

0.3 Mask image generation

  • Mask images can be generated from the xml files that store the pathologist's annotations of tumor contours; they serve as ground truth for model training. A mask image is a binary image with normal tissue coded as ‘0’ and tumor tissue coded as ‘1’ for each corresponding pixel of a WSI. So that the masks can be displayed directly, the code here uses 255 (rather than 1) for tumor pixels. See the update for mask file generation on the CAMELYON17 website, and the sketch at the end of this section.

  • Note:

    • The mask file has a pyramid structure corresponding to multiple levels of magnification. Except at the 40x level, at which the xml annotation was made, the mask size may differ slightly from that of the corresponding WSI, because the method used to create the pyramid structure of the mask file may differ from the one used for the WSI.

    • The code below for the CAMELYON16 training slides is based on ASAP code, but we found that it did not work for the testing slides. We therefore wrote our own mask-generation code for the testing slides.

  • Code for Mask file generation - for training (tumor) WSIs

  • Code for Mask file generation - for testing (tumor) WSIs

  • Time consuming

  • WSI and Mask file (Example): tumor_026
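As referenced in the first bullet above, a sketch of mask generation with ASAP's Python bindings (multiresolutionimageinterface), in the style of the CAMELYON17-website workflow; file names are examples, and the label map uses 255 for tumor pixels as described above:

```python
# Sketch of mask generation with ASAP's Python bindings, following the
# CAMELYON17-style workflow referenced above. File names are examples.
import multiresolutionimageinterface as mir

reader = mir.MultiResolutionImageReader()
mr_image = reader.open('tumor_026.tif')

# Load the pathologist's annotation polygons from the xml file.
annotation_list = mir.AnnotationList()
xml_repository = mir.XmlRepository(annotation_list)
xml_repository.setSource('tumor_026.xml')
xml_repository.load()

# Rasterize the polygons into a pyramidal mask image. Annotation groups
# _0/_1 are tumor and _2 is excluded tissue; 255 marks tumor pixels so the
# mask can be displayed directly.
annotation_mask = mir.AnnotationToMask()
label_map = {'_0': 255, '_1': 255, '_2': 0}
conversion_order = ['_0', '_1', '_2']
annotation_mask.convert(annotation_list, 'tumor_026_mask.tif',
                        mr_image.getDimensions(), mr_image.getSpacing(),
                        label_map, conversion_order)
```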

0.4 Data Description (Important)

CAMELYON16 Data Set

  • Note: Some tumor slides were not fully annotated. Normal_86 was originally misclassified and has been renamed Tumor_111. Test_049 is a duplicate slide. Test_114 does not have exhaustive annotations. Test_049 was removed (by the organizers) from the slide-based and lesion-based tasks; Test_114 was removed (by the organizers) from the lesion-based task.

0.5 - WSI Visualization with Annotation

Annotation Visualization

  • Annotation Visualization over Image Based on xml file

2 - Image processing

2.1 Image Segmentation

To reduce computation, the blank regions (no tissue) on the slide are excluded.

  • Switch the color space to HSV

  • Tissue region segmentation (Otsu's method of foreground segmentation)

(code embedded in Patch Extraction; a minimal sketch follows)
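A minimal sketch of this segmentation step, assuming OpenCV and OpenSlide (the file name and the choice of pyramid level are examples):

```python
# Hedged sketch of tissue segmentation: read a coarse WSI level, convert to
# HSV, and apply Otsu's threshold to the saturation channel.
import numpy as np
import cv2
from openslide import OpenSlide

slide = OpenSlide('tumor_026.tif')
level = slide.level_count - 1                       # coarsest pyramid level
thumb = np.array(slide.read_region((0, 0), level,
                                   slide.level_dimensions[level]).convert('RGB'))

hsv = cv2.cvtColor(thumb, cv2.COLOR_RGB2HSV)
# Saturation separates stained tissue from the bright, unsaturated background.
_, tissue_mask = cv2.threshold(hsv[:, :, 1], 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# tissue_mask == 255 marks tissue; patches are sampled only inside it.
```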

2.2 Patch Extraction

Extract normal image patches from normal slides

Extract normal and tumor image patches from tumor slides

  • Note: The code for the following procedures is part of the modules for CNN training.

Step 0: Tumor training images and normal training images were randomly divided into training and validation data sets.

          80% for training data set;
	  20% for validation data set. 
	  images for validation data set: 
	  tumor: tumor_002.tif, tumor_008.tif, tumor_010.tif, tumor_019.tif, tumor_022.tif, tumor_024.tif, tumor_025.tif, 
	         tumor_031.tif, tumor_040.tif, tumor_045.tif, tumor_049.tif, tumor_069.tif, tumor_076.tif, tumor_083.tif,
		 tumor_084.tif, tumor_085.tif, tumor_088.tif, tumor_091.tif, tumor_101.tif, tumor_102.tif, tumor_108.tif, 
		 tumor_109.tif
		 
	  normal: normal_003.tif, normal_013.tif, normal_021.tif, normal_023.tif, normal_024.tif, normal_030.tif, normal_031.tif
	  	  normal_040.tif, normal_045.tif, normal_057.tif, normal_062.tif, normal_066.tif, normal_068.tif,normal_075.tif,
		  normal_076.tif, normal_080.tif, normal_087.tif, normal_099.tif, normal_100.tif, normal_102.tif, normal_106.tif
		  normal_112.tif, normal_117.tif, normal_127.tif, normal_132.tif, normal_139.tif, normal_141.tif, normal_149.tif
		  normal_150.tif, normal_151.tif, normal_152.tif, normal_156.tif

Step 1: Randomly extract 256 x 256 patches from the tissue region at 40x magnification (see the sketch below)

	Tumor slides: 1k positive and 1k negative patches from each of the 111 slides, so the total for tumor tissue is 111k.

	Normal slides: 1k negative patches from each of the 159 slides, so the total for normal tissue is 111k + 159k = 270k.
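A sketch of the sampling in Step 1 (the helper name is an assumption): candidate coordinates come from the tissue mask above, and patches are read at level 0 (40x).

```python
# Hedged sketch of Step 1: randomly sample 256x256 patches at level 0 from
# coordinates known to lie inside the tissue mask. tissue_coords is assumed
# to be precomputed from the segmentation step above.
import random
import numpy as np
from openslide import OpenSlide

def sample_patches(slide_path, tissue_coords, n_patches=1000, size=256):
    """tissue_coords: list of level-0 (x, y) positions inside the tissue mask."""
    slide = OpenSlide(slide_path)
    patches = []
    for x, y in random.sample(tissue_coords, n_patches):
        patch = slide.read_region((x, y), 0, (size, size)).convert('RGB')
        patches.append(np.array(patch))
    return patches
```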

Step 2: Crop 224x224 patches and perform image augmentation

  • Method I: flipping, rotation, and cropping (see the sketch after this step)

      Each 256x256 image patch from Step 1 was flipped and rotated three times, so the original patch became 4 image patches. From each of these 4 256x256 patches, 2 224x224 patches were then randomly cropped, so each original 256x256 tumor patch yielded 8 224x224 patches. For normal image patches, only 1 224x224 patch was randomly cropped per orientation, yielding 4.

      Total image patches for tumor tissue: 111k x 8 = 888k; for normal tissue: 270k x 4 = 1080k.
    

Crop and Flip
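A minimal sketch of Method I under one plausible reading of the counts above (4 orientations per 256x256 patch, then 2 random crops per orientation for tumor patches or 1 for normal patches); the exact flip/rotation combination in the original code may differ:

```python
# Hedged sketch of Method I: each 256x256 patch yields 4 orientations, and
# n_crops random 224x224 crops are taken from each orientation.
import random
import numpy as np

def augment_method_1(patch256, n_crops=2, size=224):
    orientations = [np.rot90(patch256, k) for k in range(4)]  # 4 orientations
    crops = []
    for img in orientations:
        for _ in range(n_crops):
            x = random.randint(0, img.shape[0] - size)
            y = random.randint(0, img.shape[1] - size)
            crops.append(img[x:x + size, y:y + size])
    return crops  # 8 patches for tumor (n_crops=2), 4 for normal (n_crops=1)
```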

  • Method II: stain (color) normalization and adding color noise

        Method II doubles the patch count, to roughly 2 million patches per class. All of the 224x224 patches from Method I are first color-normalized, and a color-noise version of each is then generated, so every 224x224 patch yields two. Totals: 888k x 2 = 1776k for tumor; 1080k x 2 = 2160k for normal tissue.
    
  • Stain (color) normalization

    The color variety among patches

The patches before and after stain normalization

Dayong Wang's method (PathAI) (based on HSV image patches):

If a large value is added to the image patches, pushing some pixel values above 255, the patches after adding color noise look like this (not preferred):

Yun Liu's method (Google) (also called color perturbation):

Patches:

Ground Truth:
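A hedged sketch of color perturbation in the spirit of Yun Liu's method, using PIL; the jitter ranges below are illustrative assumptions, not the values used in this project:

```python
# Hedged sketch of color perturbation: randomly jitter brightness, saturation,
# contrast, and hue. The ranges are assumptions for illustration.
import random
import numpy as np
from PIL import Image, ImageEnhance

def color_perturb(img, max_hue_delta=0.04):
    """img: a PIL RGB image patch."""
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.75, 1.25))
    img = ImageEnhance.Color(img).enhance(random.uniform(0.75, 1.25))     # saturation
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.75, 1.25))
    # Hue jitter: shift the H channel in HSV space (wraps around at 256).
    hsv = np.array(img.convert('HSV'), dtype=np.int16)
    hsv[..., 0] = (hsv[..., 0] + int(255 * random.uniform(-max_hue_delta,
                                                          max_hue_delta))) % 256
    return Image.fromarray(hsv.astype(np.uint8), 'HSV').convert('RGB')
```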

3 - Training Neural Network

Flowchart for Model Training and Prediction

GoogleNet (Inception V1) Training

Step 1: Network Training

Training GoogleNet

  • Optimization method: Stochastic gradient descent

  • Weight initialization: Random sampling from a Gaussian distribution

  • Batch size: 32

  • Batch normalization: No

  • Regularization: L2-regularization (0.0005) and 50% dropout

  • Learning rate: 0.01, multiplied by 0.5 every 50,000 iterations (in this implementation: 0.01, multiplied by 0.1 per epoch)

  • Activation function: ReLU

  • Loss function: Cross-entropy

  • Number of training iterations: 300,000 (a configuration sketch follows)
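The hyperparameters above translate into roughly the following Keras 2.0 configuration. This is a sketch: model, train_generator, and val_generator are assumed to be defined elsewhere, and the momentum value is an assumption.

```python
# Hedged sketch of the training configuration listed above (Keras 2.0 API).
# `model`, `train_generator`, and `val_generator` are assumed to exist;
# L2 regularization (0.0005) and dropout (50%) live in the model definition.
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

sgd = SGD(lr=0.01, momentum=0.9)  # momentum value is an assumption
model.compile(optimizer=sgd, loss='categorical_crossentropy',
              metrics=['accuracy'])

# Per-epoch form of the learning-rate schedule quoted above.
lr_schedule = LearningRateScheduler(lambda epoch: 0.01 * (0.1 ** epoch))

# 6 epochs x 50,000 steps = 300,000 iterations at batch size 32.
model.fit_generator(train_generator, steps_per_epoch=50000, epochs=6,
                    validation_data=val_generator, validation_steps=5000,
                    callbacks=[lr_schedule])
```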

Step 2: Hard Negative Mining

Details (11-06-18): 120k 224x224 image patches were extracted based on the predictions; augmentation was then performed using rotation and horizontal flipping. For example, one 224x224 image patch was rotated 3 times to get 4 image patches; the 4 patches were then flipped horizontally, for a total of 8 distinct patches. In total, about 1 million patches were used for hard negative mining.
To obtain a model with hard negative mining, GoogleNet v1 was retrained by adding the above-mentioned 1 million patches to the original training patches (1 million normal and 1 million tumor 224x224 patches).

  • Some false-positive patches from partially annotated tumor slides are actually positive and were excluded.

Neural Network Training with False Positive Patches

The training uses the same code, but with the folder (hnm_dir) for false-positive patches included.

Step 3: Neural Network Training with False Positive Patches and Normal Patches Near Tumor Regions

Neural Network Training with False Positive Patches and Normal Patches Near Tumor Regions

The training will use the same code, but with the folder ("hnm_dir") for false positive patches and the folder ("pnt_dir") for normal patches near tumor regions included.

4 - Prediction and Evaluation

4.1 Make predictions and Heatmap stitching

prediction on HPC

Test images were divided into non-overlapping small patches; each patch gets a predicted image in which every pixel is assigned a probability. The patch-level predictions are then stitched into a whole-slide heatmap (a minimal sketch follows).
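A minimal stitching sketch (shapes and names are assumptions): each patch-level probability map is written back at its grid position to form the whole-slide heatmap.

```python
# Hedged sketch of heatmap stitching for non-overlapping patches.
# pred_patches maps a (row, col) grid position to that patch's 2-D
# probability map; the output size follows the grid shape.
import numpy as np

def stitch_heatmap(pred_patches, grid_shape, patch_out=224):
    rows, cols = grid_shape
    heatmap = np.zeros((rows * patch_out, cols * patch_out), dtype=np.float32)
    for (r, c), prob in pred_patches.items():
        heatmap[r * patch_out:(r + 1) * patch_out,
                c * patch_out:(c + 1) * patch_out] = prob
    return heatmap
```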

Heatmap Generation

Heatmap Examples (From GoogleNet Prediction Only)

the heatmap for test_075 (the part with score < 0.5 is not shown here)

test_075

the heatmap for test_073 (the part with score < 0.5 is not shown here)

test_073

Comparison of the prediction with the ground truth for tumor_005.

If some tasks on HPC fail

4.2a Slide-based Classification

Global Features Extraction
  1. The ratio between the area of the metastatic regions and the tissue area.
  2. The sum of all cancer-metastasis probabilities detected in the metastasis identification task, divided by the tissue area.
Both features are calculated at 5 thresholds (0.5, 0.6, 0.7, 0.8, 0.9), giving 10 global features in total (a sketch follows).
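A sketch of the global-feature computation (array names are assumptions): each threshold contributes an area-ratio feature and a probability-mass feature.

```python
# Hedged sketch of the 10 global features: at each threshold, the ratio of
# predicted metastatic area to tissue area, and the sum of the probabilities
# above the threshold divided by the tissue area.
import numpy as np

def global_features(heatmap, tissue_mask,
                    thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """heatmap: 2-D probability array; tissue_mask: boolean array, same shape."""
    tissue_area = float(tissue_mask.sum())
    feats = []
    for t in thresholds:
        metastatic = (heatmap >= t) & tissue_mask
        feats.append(metastatic.sum() / tissue_area)           # area ratio
        feats.append(heatmap[metastatic].sum() / tissue_area)  # probability mass
    return feats  # 5 thresholds x 2 = 10 features
```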
Local Features Extraction

Based on the 2 largest metastatic candidate regions (selected using a threshold of 0.5).

10 features were extracted from the 2 largest regions (a sketch follows the list):

  1. Area: the area of connected region
  2. Eccentricity: The eccentricity of the ellipse that has the same second-moments as the region
  3. Extent: The ratio of region area over the total bounding box area
  4. Bounding box area
  5. Major axis length: the length of the major axis of the ellipse that has the same normalized second central moments as the region
  6. Max/mean/min intensity: The max/mean/minimum probability value in the region
  7. Aspect ratio of the bounding box
  8. Solidity: Ratio of region area over the surrounding convex area
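The feature names above match skimage.measure.regionprops properties almost one-to-one; a hedged sketch (the aspect-ratio definition is an assumption):

```python
# Hedged sketch of local feature extraction from the 2 largest candidate
# regions, using skimage.measure.regionprops on the thresholded heatmap.
import numpy as np
from skimage.measure import label, regionprops

def local_features(heatmap, threshold=0.5):
    regions = regionprops(label(heatmap >= threshold), intensity_image=heatmap)
    regions = sorted(regions, key=lambda r: r.area, reverse=True)[:2]
    feats = []
    for r in regions:
        min_row, min_col, max_row, max_col = r.bbox
        aspect = (max_col - min_col) / (max_row - min_row)  # assumed definition
        feats += [r.area, r.eccentricity, r.extent, r.bbox_area,
                  r.major_axis_length, r.max_intensity, r.mean_intensity,
                  r.min_intensity, aspect, r.solidity]
    return feats
```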

Results

Important Features

4.2b Lesion-based Detection

  • Combine the prediction results from Model-1 and Model-2

Model-1 is the model from Step 2 (with hard negative mining) in Section 3, "Training Neural Network";

Model-2 is the model from Step 3 (with hard-negative-mining patches and normal patches near tumor regions) in Section 3, "Training Neural Network".

The x, y coordinates of a predicted tumor lesion come from Model-1; the score of a predicted tumor lesion is the average of the scores from Model-1 and Model-2 (see the sketch below).
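A minimal sketch of the combination rule (the Model-2 score lookup is an assumed helper):

```python
# Hedged sketch: lesion coordinates come from Model-1; the reported score is
# the average of the two models' scores at those coordinates.
def combine_predictions(lesions_m1, score_m2_at):
    """lesions_m1: iterable of (x, y, score) from Model-1;
    score_m2_at(x, y): assumed helper returning Model-2's score there."""
    return [(x, y, (s1 + score_m2_at(x, y)) / 2.0)
            for x, y, s1 in lesions_m1]
```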

4.3 ROC and FROC Generation

ROC

  • Results

ROC curves

FROC

FROC

References