Project progress can be tracked here: To-Do-List
We perform segmentation of medical images to highlight the presence of breast cancer.

This project is part of a course on machine learning operations at the Technical University of Denmark (DTU). We work on segmentation of medical images to highlight the presence of breast cancer. To accomplish this we use the Breast Cancer Semantic Segmentation (BCSS) dataset provided on Kaggle. The model is trained with MONAI, with the intent of using a UNet architecture, which is popular for image segmentation in the medical domain.
We use the Breast Cancer Semantic Segmentation (BCSS) dataset, specifically the 224x224 image size.
The BCSS dataset, derived from TCGA, includes over 20,000 segmentation annotations of breast cancer tissue regions. The annotations are a collaborative effort of pathologists, residents, and medical students, created using the Digital Slide Archive.
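Since the dataset consists of image/mask pairs on disk, loading boils down to matching each image file with the mask that shares its filename. A minimal sketch of that pairing step is shown below; the directory names (`images`, `masks`) and the `.png` extension are assumptions for illustration, not the dataset's guaranteed layout.

```python
from pathlib import Path


def pair_images_and_masks(image_dir: Path, mask_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each image with the mask that shares its filename.

    Images without a matching mask are skipped, so the result only
    contains complete (image, mask) training samples.
    """
    pairs = []
    for image_path in sorted(image_dir.glob("*.png")):
        mask_path = mask_dir / image_path.name
        if mask_path.exists():
            pairs.append((image_path, mask_path))
    return pairs
```

The sorted, filtered pair list can then be handed to a dataset class (e.g. a MONAI `Dataset` with loading transforms) so that images and masks stay aligned by construction.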
For remote data version control we use a GCP storage bucket as a data lake; since we work with image data, we need file storage rather than traditional table storage.
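Wiring DVC to a GCS bucket takes only a few commands. The sketch below is a hypothetical configuration: the remote name, bucket placeholder, and credential path are all stand-ins to be replaced with the project's actual values.

```shell
# Point DVC at a GCS bucket as the default remote (bucket name is a placeholder).
dvc remote add -d gcs_storage gs://<bucket-name>/dvc

# Optional: authenticate with a service-account key instead of gcloud defaults.
dvc remote modify gcs_storage credentialpath path/to/service-account.json

# Track the raw image data and push it to the bucket.
dvc add data/raw
dvc push
```

After this, `dvc pull` on any machine with access to the bucket restores the exact tracked version of the data.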
The framework used to train the model is MONAI, a PyTorch-based framework for medical image analysis that adds a level of abstraction on top of PyTorch. Instead of defining each layer of our own models, we can use entire networks (based on scientific papers from the medical research community) and only need to specify hyperparameters such as channel sizes, dimensions, and loss functions. One such architecture family is UNet. We intend to use a UNet architecture as it is popular for image segmentation. We plan to use the BasicUNet implementation first (based on CNN modules), and later potentially compare its performance to a vision-transformer-based UNet such as UNETR (this, however, is intended for 3D image data, so whether it applies here is yet to be clarified).
The training procedure is containerized with Docker, using a CUDA-specific base image to allow GPU-accelerated training.
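A training image of this kind typically starts from an NVIDIA CUDA base image and layers the Python environment on top. The Dockerfile below is a sketch under assumptions: the exact base-image tag, Python version, and entrypoint script path should be adjusted to the project's actual setup.

```dockerfile
# Sketch of a GPU-ready training image; base tag is an assumption.
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies first so Docker caches this layer across code changes.
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY project_name/ project_name/

ENTRYPOINT ["python3", "project_name/train_model.py"]
```

Running the container with `docker run --gpus all ...` exposes the host GPUs to the CUDA runtime inside the image.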
The project structure was initially created with the cookiecutter template mlops_template from the machine learning operations course.
The directory structure of the project looks like this:
├── Makefile <- Makefile with convenience commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- Documentation folder
│ │
│ ├── index.md <- Homepage for your documentation
│ │
│ ├── mkdocs.yml <- Configuration file for mkdocs
│ │
│ └── source/ <- Source directory for documentation files
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks.
│
├── pyproject.toml <- Project configuration file
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment
│
├── requirements_dev.txt <- The requirements file for reproducing the development environment
│
├── tests <- Test files
│
├── project_name <- Source code for use in this project.
│ │
│ ├── __init__.py <- Makes folder a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ ├── __init__.py
│ │ └── make_dataset.py
│ │
│ ├── models <- model implementations, training script and prediction script
│ │ ├── __init__.py
│ │ └── model.py
│ │
│ ├── visualization <- Scripts to create exploratory and results oriented visualizations
│ │ ├── __init__.py
│ │ └── visualize.py
│ ├── train_model.py <- script for training the model
│ └── predict_model.py <- script for predicting from a model
│
└── LICENSE <- Open-source license if one is chosen