Merge pull request #14 from Forest-Recovery-Digital-Companion/0.0.3
0.0.3
Eve-ning authored Oct 20, 2023
2 parents 4777740 + 8cbefd9 commit 094b645
Showing 51 changed files with 3,512 additions and 987 deletions.
12 changes: 6 additions & 6 deletions .github/workflows/python-package.yml
@@ -1,13 +1,12 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

-name: Python package
+name: Python CI

on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]

jobs:
build:
@@ -35,11 +34,12 @@ jobs:
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'

+# We don't necessarily need to install the CUDA version of torch, so we'll make do with the CPU build
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest poetry
-poetry export --without-hashes -o requirements.txt
+poetry export --with dev --without-hashes -o requirements.txt
pip install -r requirements.txt
- name: Lint with flake8
@@ -50,7 +50,7 @@
flake8 src/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
# env:
# GOOGLE_APPLICATION_CREDENTIALS: ${{ env.GOOGLE_APPLICATION_CREDENTIALS }}
run: |
pytest
4 changes: 3 additions & 1 deletion .gitignore
@@ -167,4 +167,6 @@ rsc/**/*.tif
# Ignore any secrets files
.secrets/
# REMOVE ONLY IF THE SECRET FILES ARE IN .secrets
*.json

**/*/lightning_logs
12 changes: 12 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,12 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/psf/black
rev: 23.10.0
hooks:
- id: black
- repo: https://github.com/PyCQA/flake8
rev: 6.1.0
hooks:
- id: flake8
args: [ --max-line-length=79 ]
58 changes: 39 additions & 19 deletions README.md
@@ -1,8 +1,9 @@
# FRDC-ML

**Forest Recovery Digital Companion** Machine Learning Pipeline Repository

This repository contains all the code for our models.
This is part of the entire E2E pipeline for our product.

_Data Collection -> **FRDC-ML** -> [FRDC-UI](https://github.com/Forest-Recovery-Digital-Companion/FRDC-UI)_

@@ -16,51 +17,70 @@ FRDC-ML/
frdc/ # Package/Component Level code
load/ # Image I/O
preprocess/ # Image Preprocessing
train/ # ML Training
evaluate/ # Model Evaluation
... # ...
main.py # Pipeline Entry Point
tests/ # PyTest Tests
integration-tests/ # Tests that run the entire pipeline
unit-tests/ # Tests for each component
poetry.lock # Poetry managed environment file
pyproject.toml # Project-level information: requirements, settings, name, deployment info
.github/ # GitHub Actions
```

## Our Architecture

This is a classic, simple Python package architecture; however, we **HEAVILY EMPHASIZE** encapsulation of each stage.
That means there should never be data that **IMPLICITLY** persists across stages. We enforce this through our
`src/main.py` entrypoint.

Each function should have a high-level, preferably intuitive English name.

```python
-from frdc.load import load_image
-from frdc.preprocess import watershed, remove_small_blobs
-...
+from torch.optim import Adam
+
+from frdc.load.dataset import FRDCDataset
+from frdc.preprocess.morphology import remove_small_objects
+from frdc.preprocess.morphology import watershed
from frdc.train import train

-ar = load_image("my_img.png")
+ar = FRDCDataset("chestnut", "date", ...)
ar = watershed(ar)
-ar = remove_small_blobs(ar, min_size=50)
-...
+ar = remove_small_objects(ar, min_size=100)
+model = train(ar, lr=0.01, optimizer=Adam)
+...
```

This architecture allows for:

1) Easily legible high-level pipelines
2) Flexibility
   1) Conventional Python signatures can be used to pass arguments
   2) If necessary, we can leverage everything else Python offers
3) Easily replicable pipelines

> Initially, we evaluated a few ML E2E solutions. Despite offering great functionality, their flexibility was
> limited. From a dev perspective, **Active Learning** was a gray area, and we foresaw heavy shoehorning.
> Ultimately, we decided that the risk was too great, so we resorted to creating our own solution.

## Contributing

### Pre-commit Hooks

We use Black and Flake8 as our pre-commit hooks. To install them, run the following commands:

```bash
poetry install
pre-commit install
```

If you're using `pip` instead of `poetry`, run the following commands:

```bash
pip install pre-commit
pre-commit install
```
Empty file added pipeline/__init__.py
Empty file.
14 changes: 14 additions & 0 deletions pipeline/model_tests/README.md
@@ -0,0 +1,14 @@
# Model Tests

This directory contains full tests of model architectures.

## `chestnut_dec_may`

This test is the classic FRDC test used in research papers.
It uses December's data for training and May's data for testing.

The current baseline is 40% accuracy.

### Confusion Matrix

![chestnut_dec_may](chestnut_dec_may/confusion_matrix.png)
9 changes: 9 additions & 0 deletions pipeline/model_tests/chestnut_dec_may/augmentation.py
@@ -0,0 +1,9 @@
import torch
from torchvision.transforms.v2 import RandomHorizontalFlip, RandomVerticalFlip


def augmentation(t: torch.Tensor) -> torch.Tensor:
"""Runs out augmentation on a tensor."""
t = RandomHorizontalFlip()(t)
t = RandomVerticalFlip()(t)
return t
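A quick way to sanity-check this helper — a minimal sketch, assuming the transforms receive a batched `(N, C, H, W)` tensor; the snippet itself is not part of the commit:

```python
import torch

from pipeline.model_tests.chestnut_dec_may.augmentation import augmentation

# Each flip fires independently with p=0.5, and flips only reorder pixels,
# so the output shape must always match the input shape.
x = torch.rand(5, 3, 64, 64)  # dummy (batch, channels, height, width) tensor
y = augmentation(x)
assert y.shape == x.shape
```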
53 changes: 53 additions & 0 deletions pipeline/model_tests/chestnut_dec_may/evaluate.py
@@ -0,0 +1,53 @@
import lightning as pl
import matplotlib.pyplot as plt
import numpy as np
import torch
from seaborn import heatmap
from sklearn.metrics import confusion_matrix

from frdc.train import FRDCDataModule
from frdc.train import FRDCModule
from pipeline.model_tests.chestnut_dec_may.preprocess import preprocess
from pipeline.model_tests.utils import get_dataset

# Get our test dataset
# TODO: Ideally, we should have a separate dataset for testing.
segments, labels = get_dataset(
"chestnut_nature_park", "20210510", "90deg43m85pct255deg/map"
)

# Prepare the datamodule
dm = FRDCDataModule(segments=segments, preprocess=preprocess, batch_size=5)

# TODO: Hacky way to load our LabelEncoder
dm.le.classes_ = np.load("le.npy", allow_pickle=True)
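# (This file is written at the end of main.py via np.save("le.npy",
# dm.le.classes_), so training and evaluation must share the same le.npy.)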

# Load the model
m = FRDCModule.load_from_checkpoint(
"lightning_logs/version_88/checkpoints/epoch=99-step=700.ckpt"
)

# Make predictions
trainer = pl.Trainer(logger=False)
pred = trainer.predict(m, datamodule=dm)
y_pred = torch.concat(pred, dim=0).argmax(dim=1)
y_true = dm.le.transform(labels)

# Plot the confusion matrix
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(10, 10))

heatmap(
cm,
annot=True,
xticklabels=dm.le.classes_,
yticklabels=dm.le.classes_,
cbar=False,
)

plt.tight_layout(pad=3)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.savefig("confusion_matrix.png")
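As an aside, the 40% baseline quoted in the model-tests README can be recovered from this same matrix — a minimal addition to the script above, not part of the commit:

```python
# Correct predictions lie on the diagonal of the confusion matrix,
# so overall accuracy is the trace divided by the total count.
accuracy = np.trace(cm) / cm.sum()
print(f"Accuracy: {accuracy:.2%}")  # the README quotes ~40% as the baseline
```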
Binary file added pipeline/model_tests/chestnut_dec_may/le.npy
Binary file not shown.
96 changes: 96 additions & 0 deletions pipeline/model_tests/chestnut_dec_may/main.py
@@ -0,0 +1,96 @@
""" Tests for the FaceNet model.
This test is done by training a model on the 20201218 dataset, then testing on
the 20210510 dataset.
"""

import lightning as pl
import numpy as np
import torch
from lightning.pytorch.callbacks import (
LearningRateMonitor,
ModelCheckpoint,
EarlyStopping,
)
from torch.utils.data import TensorDataset, Dataset, Subset

from frdc.models import FaceNet
from frdc.train import FRDCDataModule, FRDCModule
from pipeline.model_tests.chestnut_dec_may.augmentation import augmentation
from pipeline.model_tests.chestnut_dec_may.preprocess import preprocess
from pipeline.model_tests.utils import get_dataset


def train_val_test_split(x: TensorDataset) -> list[Dataset, Dataset, Dataset]:
# Defines how to split the dataset into train, val, test subsets.
# TODO: Quite ugly as it uses the global variables segments_0 and
# segments_1. Will need to refactor this.
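# In effect: all 20201218 (December) segments become the training set,
# all 20210510 (May) segments become the validation set, and the test
# split is left empty.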
return [
Subset(x, list(range(len(segments_0)))),
Subset(
x, list(range(len(segments_0), len(segments_0) + len(segments_1)))
),
[],
]


# Prepare the dataset
segments_0, labels_0 = get_dataset("chestnut_nature_park", "20201218", None)
segments_1, labels_1 = get_dataset(
"chestnut_nature_park", "20210510", "90deg43m85pct255deg/map"
)

# Concatenate the datasets
segments = [*segments_0, *segments_1]
labels = [*labels_0, *labels_1]

BATCH_SIZE = 5
EPOCHS = 100
LR = 1e-3

# Prepare the datamodule and trainer
dm = FRDCDataModule(
# Input to the model
segments=segments,
# Output of the model
labels=labels,
# Preprocessing function
preprocess=preprocess,
# Augmentation function (Only on train)
augmentation=augmentation,
# Splitting function
train_val_test_split=train_val_test_split,
# Batch size
batch_size=BATCH_SIZE,
)

trainer = pl.Trainer(
max_epochs=EPOCHS,
# Set the seed for reproducibility
# TODO: Though this is set, the results are still not reproducible.
deterministic=True,
log_every_n_steps=4,
callbacks=[
# Stop training if the validation loss doesn't improve for 4 epochs
EarlyStopping(monitor="val_loss", patience=4, mode="min"),
# Log the learning rate on TensorBoard
LearningRateMonitor(logging_interval="epoch"),
# Save the best model
ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1),
],
)

m = FRDCModule(
# Our model is the "FaceNet" model
# TODO: It's not really the FaceNet model, but a modified version of it.
model_cls=FaceNet,
model_kwargs=dict(n_out_classes=len(set(labels))),
# We use the Adam optimizer
optim_cls=torch.optim.Adam,
# TODO: This is not fine-tuned.
optim_kwargs=dict(lr=LR, weight_decay=1e-4, amsgrad=True),
)

trainer.fit(m, datamodule=dm)
# TODO: Quite hacky, but we need to save the label encoder for prediction.
np.save("le.npy", dm.le.classes_)