
0.0.8 #41

Merged 62 commits on Jan 22, 2024
bf92c2c
Remove unused evaluate script
Eve-ning Dec 26, 2023
77ba78a
Make GCS error clearer
Eve-ning Dec 26, 2023
5c4a36c
Fix missing default on exception
Eve-ning Dec 26, 2023
d46f4e3
Add dev container spec
Eve-ning Dec 26, 2023
c2ba141
Delete rsc.dvc
Eve-ning Dec 26, 2023
60e5c2a
Merge branch '0.0.8' into FRML-93
Eve-ning Dec 26, 2023
276fa17
Get api key from host
Eve-ning Dec 26, 2023
70b275e
Add missing lightning dep
Eve-ning Dec 26, 2023
a1d79c1
Add uncommentable local W&B setup
Eve-ning Dec 26, 2023
5d457ab
Update getting started docs for dev container
Eve-ning Dec 26, 2023
3ad231b
Update README.md
Eve-ning Dec 26, 2023
3eb0b40
Update devcontainer.json
Eve-ning Dec 26, 2023
d021af7
Attempt to fix codespace problem
Eve-ning Dec 26, 2023
2636cf1
Update Dockerfile
Eve-ning Dec 26, 2023
bac614a
Force Dockerfile to LF
Eve-ning Dec 27, 2023
399fc54
Force Dockerfile to LF
Eve-ning Dec 27, 2023
f275997
Merge pull request #39 from FR-DC/FRML-93
Eve-ning Dec 27, 2023
e2cab32
Move Dev Container setup to other page
Eve-ning Dec 27, 2023
c8e3d88
Merge branch 'FRML-93' into docs
Eve-ning Dec 27, 2023
64eb808
Add missing page
Eve-ning Dec 27, 2023
2cc3ac9
Merge pull request #38 from FR-DC/FRML-93
Eve-ning Dec 27, 2023
ddc7e1c
Implement Preset Class
Eve-ning Dec 28, 2023
76b4dff
Update debug dataset loading
Eve-ning Dec 28, 2023
2fbd4d4
Update preset loading for chestnut training
Eve-ning Dec 28, 2023
b5a465a
Move import to top
Eve-ning Dec 28, 2023
334daa7
Implement interface to use add op to concat
Eve-ning Dec 28, 2023
ef57cf4
Remove unused import
Eve-ning Dec 28, 2023
f04b9a3
Move warning to func def
Eve-ning Dec 28, 2023
419fd9e
Improve syntax of creating unlabelled datasets
Eve-ning Dec 28, 2023
7d3183e
Implement auto casting of labelled to unlabelled
Eve-ning Dec 28, 2023
a34d367
Refactor unlabelled to use the preset
Eve-ning Dec 28, 2023
58c977d
Refactor the preprocessing step
Eve-ning Dec 28, 2023
2928147
Move common scripts to utils
Eve-ning Dec 28, 2023
60b835d
Migrate references for train
Eve-ning Dec 28, 2023
a78446a
Fix error in documentation signature
Eve-ning Dec 28, 2023
a0583d6
Make wandb online by default
Eve-ning Dec 28, 2023
a64e59a
Migrate Preset classes to preset.py
Eve-ning Dec 29, 2023
314774c
Update docs to prefer preset
Eve-ning Dec 29, 2023
1673227
update html docs
Eve-ning Dec 29, 2023
a46e072
Merge pull request #40 from FR-DC/refactor-dataset
Eve-ning Dec 29, 2023
61093f7
Merge branch 'main' into 0.0.8
Eve-ning Dec 29, 2023
3289d49
Implement Stratified Sampling
Eve-ning Dec 29, 2023
fdfa17a
Add test for Stratified Sampling
Eve-ning Dec 29, 2023
349e7cd
Implement Stratified Sampling on DM
Eve-ning Jan 2, 2024
dc05b35
Allow Stratified Sampling for arbitrary seq types
Eve-ning Jan 2, 2024
a8dcafc
Fix missing imports for pred and plot
Eve-ning Jan 2, 2024
e6f6a9c
Change test to use str list
Eve-ning Jan 2, 2024
86d11df
Implement W&B vis of label spread
Eve-ning Jan 2, 2024
a355c39
Clean up train.py
Eve-ning Jan 2, 2024
dff8378
Make W&B Watch model
Eve-ning Jan 2, 2024
3bb5b6f
Merge pull request #42 from FR-DC/frml-102
Eve-ning Jan 2, 2024
c825f8c
Merge pull request #43 from FR-DC/FRML-81
Eve-ning Jan 2, 2024
53d38e5
Add warning on Label Studio connection issue
Eve-ning Jan 8, 2024
74b15c0
Add dumping script
Eve-ning Jan 8, 2024
13a6118
Update .gitignore
Eve-ning Jan 8, 2024
aeca556
Mount backups to host
Eve-ning Jan 8, 2024
54e214c
Merge pull request #44 from FR-DC/frml-97
Eve-ning Jan 8, 2024
fb93890
Fix issue with Label Studio being None exception
Eve-ning Jan 8, 2024
d96030d
Minor Black formatting
Eve-ning Jan 8, 2024
3794501
Fix issue with WandB hist logger too many bins
Eve-ning Jan 8, 2024
c8b050a
Fix issue with redundant initializing wandb
Eve-ning Jan 8, 2024
c2c48b3
Merge pull request #45 from FR-DC/frml-107
Eve-ning Jan 8, 2024
16 changes: 16 additions & 0 deletions .devcontainer/devcontainer.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"name": "frdc",
"build": {
"dockerfile": "../Dockerfile",
},
"containerEnv": {
"LABEL_STUDIO_HOST": "host.docker.internal",
"LABEL_STUDIO_API_KEY": "${localEnv:LABEL_STUDIO_API_KEY}",
},
"runArgs": [
"--gpus=all",
],
"hostRequirements": {
"gpu": true,
}
}
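The `containerEnv` entry above reads `LABEL_STUDIO_API_KEY` from the host via `${localEnv:...}`, so the variable must be exported on the host before the container is created. A minimal sketch, assuming a POSIX-like shell on the host (the key value here is a placeholder, not a real key):

```shell
# Export on the host BEFORE opening the Dev Container;
# ${localEnv:LABEL_STUDIO_API_KEY} then resolves to this value inside it.
export LABEL_STUDIO_API_KEY="paste-your-label-studio-key-here"

# Sanity check: confirm the variable is set without printing the full key
echo "LABEL_STUDIO_API_KEY is ${#LABEL_STUDIO_API_KEY} characters long"
```

If the variable is unset when the container is created, `${localEnv:LABEL_STUDIO_API_KEY}` resolves to an empty string and Label Studio calls inside the container will fail.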
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
Dockerfile text=auto eol=lf
20 changes: 20 additions & 0 deletions Dockerfile
@@ -0,0 +1,20 @@
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime as torch
WORKDIR /devcontainer

COPY ./pyproject.toml /devcontainer/pyproject.toml

RUN apt update -y && apt upgrade -y
RUN apt install git -y

RUN pip3 install --upgrade pip && \
pip3 install poetry && \
pip3 install lightning

RUN conda init bash \
&& . ~/.bashrc \
&& conda activate base \
&& poetry config virtualenvs.create false \
&& poetry install --with dev --no-interaction --no-ansi

RUN apt install curl -y && curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/root/google-cloud-sdk/bin
10 changes: 2 additions & 8 deletions README.md
@@ -54,14 +54,6 @@ To illustrate this, take a look at how
`tests/model_tests/chestnut_dec_may/train.py` is written. It pulls in relevant
modules from each stage and constructs a pipeline.


> Initially, we evaluated a few ML E2E solutions, despite them offering great
> functionality, their flexibility was
> limited. From a dev perspective, **Active Learning** was a gray area, and we
> foresee heavy shoehorning.
> Ultimately, we decided that the risk was too great, thus we resort to
> creating our own solution.

## Contributing

### Pre-commit Hooks
@@ -80,3 +72,5 @@ If you're using `pip` instead of `poetry`, run the following commands:
pip install pre-commit
pre-commit install
```

Alternatively, you can run Black through your own IDE's integration.
4 changes: 3 additions & 1 deletion Writerside/d.tree
@@ -8,7 +8,9 @@
start-page="Overview.md">

<toc-element topic="Overview.md"/>
<toc-element topic="Getting-Started.md"/>
<toc-element topic="Getting-Started.md">
<toc-element topic="Get-Started-with-Dev-Containers.md"/>
</toc-element>
<toc-element toc-title="Tutorials">
<toc-element topic="Retrieve-our-Datasets.md"/>
</toc-element>
49 changes: 49 additions & 0 deletions Writerside/topics/Get-Started-with-Dev-Containers.md
@@ -0,0 +1,49 @@
# Get Started with Dev Containers

Dev Containers are a great way to get started with a project. They define all
necessary dependencies and environments, so you can start coding in the
container right away.

In this article, we'll only cover the **additional steps** needed for our
project. For more information on how to use Dev Containers, refer to the
official documentation for your IDE. Once you've set up the Dev Container,
come back here to finish the setup:

- [VSCode](https://code.visualstudio.com/docs/remote/containers).
- [IntelliJ](https://www.jetbrains.com/help/idea/connect-to-devcontainer.html)

> If you see the error `Error response from daemon: ... <!DOCTYPE`, you forgot
> the `.git` at the end of the repo URL.
{style='warning'}

## Python Environment

> Do not create a new environment
{style='warning'}

The dev environment is already created and is managed by Anaconda
(`/opt/conda/bin/conda`).
To activate the environment, run the following command:

```bash
conda activate base
```

> Refer to your respective IDE's documentation on how to activate the
> environment.

## Mark as Sources Root (Add to PYTHONPATH)

For `import` statements to work, you need to mark the `src` folder as the
sources root. Optionally, also mark the `tests` folder as the tests root.

> Refer to your respective IDE's documentation on how to mark folders as
> sources root. (Also known as adding to the `PYTHONPATH`)
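If your IDE doesn't offer a sources-root option, the same effect can be
achieved from the shell; the paths below assume you're at the repository root:

```shell
# Make src/ and tests/ importable by appending them to PYTHONPATH,
# the shell equivalent of marking them as sources/tests roots
export PYTHONPATH="$PYTHONPATH:./src:./tests"
```

Note this only lasts for the current shell session; add it to your shell
profile to make it permanent.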

## Additional Setup

Refer to the [Getting Started](Getting-Started.md) guide for additional setup
steps such as:
- Google Cloud Application Default Credentials
- Weights & Biases API Key
- Label Studio API Key
107 changes: 81 additions & 26 deletions Writerside/topics/Getting-Started.md
@@ -1,5 +1,7 @@
# Getting Started

> Want to use a Dev Container? See [Get Started with Dev Containers](Get-Started-with-Dev-Containers.md)

<procedure title="Installing the Dev. Environment" id="install">
<step>Ensure that you have the right version of Python.
The required Python version can be seen in <code>pyproject.toml</code>
@@ -10,7 +12,7 @@
</step>
<step>Start by cloning our repository.
<code-block lang="shell">
git clone https://github.com/Forest-Recovery-Digital-Companion/FRDC-ML.git
git clone https://github.com/FR-DC/FRDC-ML.git
</code-block>
</step>
<step>Then, create a Python Virtual Env <code>pyvenv</code>
@@ -60,6 +62,7 @@
</step>
</procedure>


<procedure title="Setting Up Google Cloud" id="gcloud">
<step>
We use Google Cloud to store our datasets. To set up Google Cloud,
@@ -86,6 +89,48 @@
</step>
</procedure>

<procedure title="Setting Up Label Studio" id="ls">
<tip>This is only necessary if any task requires Label Studio annotations</tip>
<step>
We use Label Studio to annotate our datasets.
We won't go through how to install Label Studio; for contributors, it
should already be running on <code>localhost:8080</code>.
</step>
<step>
Then, retrieve your own API key from Label Studio.
<a href="http://localhost:8080/user/account"> Go to your account page </a>
and copy the API key. <br/></step>
<step> Set your API key as an environment variable.
<tabs>
<tab title="Windows">
On Windows, open "Edit environment variables for
your account" and add a new environment variable named
<code>LABEL_STUDIO_API_KEY</code>.
</tab>
<tab title="Linux">
Export it as an environment variable.
<code-block lang="shell">export LABEL_STUDIO_API_KEY=...</code-block>
</tab>
</tabs>
</step>
</procedure>

<procedure title="Setting Up Weights and Biases" id="wandb">
<step>
We use W&B to track our experiments. To set up W&B,
<a href="https://docs.wandb.ai/quickstart">
install the W&B CLI
</a>
</step>
<step>
Then,
<a href="https://docs.wandb.ai/quickstart">
authenticate your account
</a>.
<code-block lang="shell">wandb login</code-block>
</step>
</procedure>

<procedure title="Pre-commit Hooks" collapsible="true">
<note>This is optional but recommended.
Pre-commit hooks are a way to ensure that your code is formatted correctly.
@@ -98,30 +143,45 @@
</step>
</procedure>

<procedure title="Running the Tests" collapsible="true" id="tests">
<procedure title="Running the Tests" id="tests">
<step>
Run the tests to make sure everything is working
<code-block lang="shell">
pytest
</code-block>
</step>
<step>
In case of errors:
<deflist>
<def title="google.auth.exceptions.DefaultCredentialsError">
If you get this error, it means that you haven't authenticated your
Google Cloud account.
See <a anchor="gcloud">Setting Up Google Cloud</a>
</def>
<def title="ModuleNotFoundError" collapsible="true">
If you get this error, it means that you haven't installed the
dependencies.
See <a anchor="install">Installing the Dev. Environment</a>
</def>
</deflist>
</step>
</procedure>

## Troubleshooting

### ModuleNotFoundError

It's likely that your `src` and `tests` directories are not in `PYTHONPATH`.
To fix this, run the following command:

```shell
export PYTHONPATH=$PYTHONPATH:./src:./tests
```

Or, set it in your IDE, for example, IntelliJ allows setting directories as
**Source Roots**.

### google.auth.exceptions.DefaultCredentialsError

It's likely that you haven't authenticated your Google Cloud account.
See [Setting Up Google Cloud](#gcloud)

### Couldn't connect to Label Studio

Label Studio must be running locally, exposed on `localhost:8080`. Furthermore,
you need to specify the `LABEL_STUDIO_API_KEY` environment variable. See
[Setting Up Label Studio](#ls)

### Cannot login to W&B

You need to authenticate your W&B account. See [Setting Up Weights and Biases](#wandb)
If you're facing difficulties, set the `WANDB_MODE` environment variable to `offline`
to disable W&B.
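As a sketch, for a single shell session:

```shell
# Run W&B in offline mode: runs are recorded locally instead of synced
export WANDB_MODE=offline
```

Offline runs can later be uploaded with `wandb sync`, if needed.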

## Our Repository Structure

@@ -132,15 +192,13 @@ help you understand where to put your code.
graph LR
FRDC -- " Core Dependencies " --> src/frdc/
FRDC -- " Resources " --> rsc/
FRDC -- " Pipeline " --> pipeline/
FRDC -- " Tests " --> tests/
FRDC -- " Repo Dependencies " --> pyproject.toml,poetry.lock
src/frdc/ -- " Dataset Loaders " --> ./load/
src/frdc/ -- " Preprocessing Fn. " --> ./preprocess/
src/frdc/ -- " Train Deps " --> ./train/
src/frdc/ -- " Model Architectures " --> ./models/
rsc/ -- " Datasets ... " --> ./dataset_name/
pipeline/ -- " Model Training Pipeline " --> ./model_tests/
```

src/frdc/
@@ -149,19 +207,16 @@
rsc/
: Resources. These are usually cached datasets

pipeline/
: Pipeline code. These are the full ML tests of our pipeline.

tests/
: PyTest tests. These are unit tests & integration tests.
: PyTest tests. These are unit, integration, and model tests.

### Unit, Integration, and Pipeline Tests

We have 3 types of tests:

- Unit Tests are usually small, single function tests.
- Integration Tests are larger tests that test a mock pipeline.
- Pipeline Tests are the true production pipeline tests that will generate a
- Model Tests are the true production pipeline tests that will generate a
model.

### Where Should I contribute?
@@ -176,9 +231,9 @@ at the <code>src/frdc/</code> directory.
By adding a new component, you'll need to add a new test. Take a look at the
<code>tests/</code> directory.
</def>
<def title="Changing the pipeline">
<def title="Changing the model pipeline">
If you're an ML Researcher, you'll probably be changing the pipeline. Take a
look at the <code>pipeline/</code> directory.
look at the <code>tests/model_tests/</code> directory.
</def>
<def title="Adding a dependency">
If you're adding a new dependency, use <code>poetry add PACKAGE</code> and
15 changes: 8 additions & 7 deletions Writerside/topics/Retrieve-our-Datasets.md
@@ -25,16 +25,17 @@ Here, we'll download and load our
- `labels`: The labels of the trees (segments)

```python
from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
```

### What Datasets are there? {collapsible="true"}

> To know what datasets are available, you can run
> We recommend using FRDCDatasetPreset. However, if you want
> to know what other datasets are available, you can run
> [load.gcs](load.gcs.md)'s `list_gcs_datasets()`
> method

@@ -86,10 +87,10 @@ To segment the data, use [Extract Segments](preprocessing.extract_segments.md).
Here, we'll segment the data by the bounds.

```python
from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset
from frdc.preprocess.extract_segments import extract_segments_from_bounds

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
segments = extract_segments_from_bounds(ar, bounds)
@@ -109,11 +110,11 @@ We can then use these data to plot out the first tree segment.
```python
import matplotlib.pyplot as plt

from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset
from frdc.preprocess.extract_segments import extract_segments_from_bounds
from frdc.preprocess.scale import scale_0_1_per_band

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
segments = extract_segments_from_bounds(ar, bounds)
6 changes: 2 additions & 4 deletions Writerside/topics/load.dataset.md
@@ -17,11 +17,9 @@ version.
For example, to load our Chestnut Nature Park dataset.

```python
from frdc.load import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset

ds = FRDCDataset(site='chestnut_nature_park',
date='20201218',
version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
```

Then, we can use the `ds` object to load objects of the dataset: