Merge pull request #14 from Forest-Recovery-Digital-Companion/0.0.3
0.0.3
Eve-ning authored Oct 20, 2023
2 parents 4777740 + 8cbefd9 commit 094b645
Showing 51 changed files with 3,512 additions and 987 deletions.
12 changes: 6 additions & 6 deletions .github/workflows/python-package.yml
@@ -1,13 +1,12 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

-name: Python package
+name: Python CI

on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]

jobs:
build:
@@ -35,11 +34,12 @@ jobs:
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'

+# We don't necessarily need to install the CUDA version of torch, so we'll make do with the CPU build
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest poetry
-poetry export --without-hashes -o requirements.txt
+poetry export --with dev --without-hashes -o requirements.txt
pip install -r requirements.txt
- name: Lint with flake8
@@ -50,7 +50,7 @@
flake8 src/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
# env:
# GOOGLE_APPLICATION_CREDENTIALS: ${{ env.GOOGLE_APPLICATION_CREDENTIALS }}
run: |
pytest
4 changes: 3 additions & 1 deletion .gitignore
@@ -167,4 +167,6 @@ rsc/**/*.tif
# Ignore any secrets files
.secrets/
# REMOVE ONLY IF THE SECRET FILES ARE IN .secrets
*.json

**/*/lightning_logs
12 changes: 12 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,12 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/psf/black
rev: 23.10.0
hooks:
- id: black
- repo: https://github.com/PyCQA/flake8
rev: 6.1.0
hooks:
- id: flake8
args: [ --max-line-length=79 ]
58 changes: 39 additions & 19 deletions README.md
@@ -1,8 +1,9 @@
# FRDC-ML

**Forest Recovery Digital Companion** Machine Learning Pipeline Repository

This repository contains all the code for our models.
This is part of the entire E2E pipeline for our product.

_Data Collection -> **FRDC-ML** -> [FRDC-UI](https://github.com/Forest-Recovery-Digital-Companion/FRDC-UI)_

@@ -16,51 +17,70 @@ FRDC-ML/
frdc/ # Package/Component Level code
load/ # Image I/O
preprocess/ # Image Preprocessing
train/ # ML Training
evaluate/ # Model Evaluation
... # ...
main.py # Pipeline Entry Point
tests/ # PyTest Tests
integration-tests/ # Tests that run the entire pipeline
unit-tests/ # Tests for each component
poetry.lock # Poetry managed environment file
pyproject.toml # Project-level information: requirements, settings, name, deployment info
.github/ # GitHub Actions
```

## Our Architecture

This is a classic, simple Python package architecture; however, we **HEAVILY EMPHASIZE** encapsulation of each stage.
That means there should never be data that **IMPLICITLY** persists across stages. We enforce this through our
`src/main.py` entrypoint.

Each function should have a high-level, preferably intuitive English name.

```python
-from frdc.load import load_image
-from frdc.preprocess import watershed, remove_small_blobs
-...
+from torch.optim import Adam
+
+from frdc.load.dataset import FRDCDataset
+from frdc.preprocess.morphology import remove_small_objects
+from frdc.preprocess.morphology import watershed
from frdc.train import train

-ar = load_image("my_img.png")
+ar = FRDCDataset("chestnut", "date", ...)
ar = watershed(ar)
-ar = remove_small_blobs(ar, min_size=50)
-...
+ar = remove_small_objects(ar, min_size=100)
+model = train(ar, lr=0.01, optimizer=Adam)
+...
```

This architecture allows for:

1) Easily legible high-level pipelines
2) Flexibility
   1) Conventional Python signatures can be used to pass arguments
   2) If necessary, we can leverage everything else Python offers
3) Easily replicable pipelines

> Initially, we evaluated a few ML E2E solutions. Despite offering great functionality, their flexibility was
> limited. From a dev perspective, **Active Learning** was a gray area, and we foresaw heavy shoehorning.
> Ultimately, we decided that the risk was too great, so we resorted to creating our own solution.

## Contributing

### Pre-commit Hooks

We use Black and Flake8 as our pre-commit hooks. To install them, run the following commands:

```bash
poetry install
pre-commit install
```

If you're using `pip` instead of `poetry`, run the following commands:

```bash
pip install pre-commit
pre-commit install
```
Empty file added pipeline/__init__.py
Empty file.
14 changes: 14 additions & 0 deletions pipeline/model_tests/README.md
@@ -0,0 +1,14 @@
# Model Tests

This directory contains full tests of model architectures.

## `chestnut_dec_may`

This test is the classic FRDC test used in research papers.
It uses December's data for training and May's data for testing.

The current baseline is 40% accuracy.

### Confusion Matrix

![chestnut_dec_may](chestnut_dec_may/confusion_matrix.png)
9 changes: 9 additions & 0 deletions pipeline/model_tests/chestnut_dec_may/augmentation.py
@@ -0,0 +1,9 @@
import torch
from torchvision.transforms.v2 import RandomHorizontalFlip, RandomVerticalFlip


def augmentation(t: torch.Tensor) -> torch.Tensor:
"""Runs out augmentation on a tensor."""
t = RandomHorizontalFlip()(t)
t = RandomVerticalFlip()(t)
return t
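A quick way to sanity-check this helper — a minimal sketch, assuming the transforms receive a batched `(N, C, H, W)` tensor; the snippet itself is not part of the commit:

```python
import torch

from pipeline.model_tests.chestnut_dec_may.augmentation import augmentation

# Each flip fires independently with p=0.5, and flips only reorder pixels,
# so the output shape must always match the input shape.
x = torch.rand(5, 3, 64, 64)  # dummy (batch, channels, height, width) tensor
y = augmentation(x)
assert y.shape == x.shape
```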
53 changes: 53 additions & 0 deletions pipeline/model_tests/chestnut_dec_may/evaluate.py
@@ -0,0 +1,53 @@
import lightning as pl
import matplotlib.pyplot as plt
import numpy as np
import torch
from seaborn import heatmap
from sklearn.metrics import confusion_matrix

from frdc.train import FRDCDataModule
from frdc.train import FRDCModule
from pipeline.model_tests.chestnut_dec_may.preprocess import preprocess
from pipeline.model_tests.utils import get_dataset

# Get our test dataset
# TODO: Ideally, we should have a separate dataset for testing.
segments, labels = get_dataset(
"chestnut_nature_park", "20210510", "90deg43m85pct255deg/map"
)

# Prepare the datamodule
dm = FRDCDataModule(segments=segments, preprocess=preprocess, batch_size=5)

# TODO: Hacky way to load our LabelEncoder
dm.le.classes_ = np.load("le.npy", allow_pickle=True)
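# (This file is written at the end of main.py via np.save("le.npy",
# dm.le.classes_), so training and evaluation must share the same le.npy.)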

# Load the model
m = FRDCModule.load_from_checkpoint(
"lightning_logs/version_88/checkpoints/epoch=99-step=700.ckpt"
)

# Make predictions
trainer = pl.Trainer(logger=False)
pred = trainer.predict(m, datamodule=dm)
y_pred = torch.concat(pred, dim=0).argmax(dim=1)
y_true = dm.le.transform(labels)

# Plot the confusion matrix
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(10, 10))

heatmap(
cm,
annot=True,
xticklabels=dm.le.classes_,
yticklabels=dm.le.classes_,
cbar=False,
)

plt.tight_layout(pad=3)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.savefig("confusion_matrix.png")
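As an aside, the 40% baseline quoted in the model-tests README can be recovered from this same matrix — a minimal addition to the script above, not part of the commit:

```python
# Correct predictions lie on the diagonal of the confusion matrix,
# so overall accuracy is the trace divided by the total count.
accuracy = np.trace(cm) / cm.sum()
print(f"Accuracy: {accuracy:.2%}")  # the README quotes ~40% as the baseline
```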
Binary file added pipeline/model_tests/chestnut_dec_may/le.npy
Binary file not shown.
96 changes: 96 additions & 0 deletions pipeline/model_tests/chestnut_dec_may/main.py
@@ -0,0 +1,96 @@
""" Tests for the FaceNet model.
This test is done by training a model on the 20201218 dataset, then testing on
the 20210510 dataset.
"""

import lightning as pl
import numpy as np
import torch
from lightning.pytorch.callbacks import (
LearningRateMonitor,
ModelCheckpoint,
EarlyStopping,
)
from torch.utils.data import TensorDataset, Dataset, Subset

from frdc.models import FaceNet
from frdc.train import FRDCDataModule, FRDCModule
from pipeline.model_tests.chestnut_dec_may.augmentation import augmentation
from pipeline.model_tests.chestnut_dec_may.preprocess import preprocess
from pipeline.model_tests.utils import get_dataset


def train_val_test_split(x: TensorDataset) -> list[Dataset, Dataset, Dataset]:
# Defines how to split the dataset into train, val, test subsets.
# TODO: Quite ugly as it uses the global variables segments_0 and
# segments_1. Will need to refactor this.
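# In effect: all 20201218 (December) segments become the training set,
# all 20210510 (May) segments become the validation set, and the test
# split is left empty.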
return [
Subset(x, list(range(len(segments_0)))),
Subset(
x, list(range(len(segments_0), len(segments_0) + len(segments_1)))
),
[],
]


# Prepare the dataset
segments_0, labels_0 = get_dataset("chestnut_nature_park", "20201218", None)
segments_1, labels_1 = get_dataset(
"chestnut_nature_park", "20210510", "90deg43m85pct255deg/map"
)

# Concatenate the datasets
segments = [*segments_0, *segments_1]
labels = [*labels_0, *labels_1]

BATCH_SIZE = 5
EPOCHS = 100
LR = 1e-3

# Prepare the datamodule and trainer
dm = FRDCDataModule(
# Input to the model
segments=segments,
# Output of the model
labels=labels,
# Preprocessing function
preprocess=preprocess,
# Augmentation function (Only on train)
augmentation=augmentation,
# Splitting function
train_val_test_split=train_val_test_split,
# Batch size
batch_size=BATCH_SIZE,
)

trainer = pl.Trainer(
max_epochs=EPOCHS,
# Set the seed for reproducibility
# TODO: Though this is set, the results are still not reproducible.
deterministic=True,
log_every_n_steps=4,
callbacks=[
# Stop training if the validation loss doesn't improve for 4 epochs
EarlyStopping(monitor="val_loss", patience=4, mode="min"),
# Log the learning rate on TensorBoard
LearningRateMonitor(logging_interval="epoch"),
# Save the best model
ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1),
],
)

m = FRDCModule(
# Our model is the "FaceNet" model
# TODO: It's not really the FaceNet model, but a modified version of it.
model_cls=FaceNet,
model_kwargs=dict(n_out_classes=len(set(labels))),
# We use the Adam optimizer
optim_cls=torch.optim.Adam,
# TODO: This is not fine-tuned.
optim_kwargs=dict(lr=LR, weight_decay=1e-4, amsgrad=True),
)

trainer.fit(m, datamodule=dm)
# TODO: Quite hacky, but we need to save the label encoder for prediction.
np.save("le.npy", dm.le.classes_)