Commit
Merge branch 'main' of https://github.com/FR-DC/FRDC-ML
Eve-ning committed Feb 14, 2024
2 parents f43e1eb + ed10c7a commit 1691608
Showing 52 changed files with 840 additions and 764 deletions.
16 changes: 16 additions & 0 deletions .devcontainer/devcontainer.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"name": "frdc",
"build": {
"dockerfile": "../Dockerfile",
},
"containerEnv": {
"LABEL_STUDIO_HOST": "host.docker.internal",
"LABEL_STUDIO_API_KEY": "${localEnv:LABEL_STUDIO_API_KEY}",
},
"runArgs": [
"--gpus=all",
],
"hostRequirements": {
"gpu": true,
}
}
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
Dockerfile text=auto eol=lf
20 changes: 20 additions & 0 deletions Dockerfile
@@ -0,0 +1,20 @@
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime as torch
WORKDIR /devcontainer

COPY ./pyproject.toml /devcontainer/pyproject.toml

RUN apt update -y && apt upgrade -y
RUN apt install git -y

RUN pip3 install --upgrade pip && \
pip3 install poetry && \
pip3 install lightning

RUN conda init bash \
&& . ~/.bashrc \
&& conda activate base \
&& poetry config virtualenvs.create false \
&& poetry install --with dev --no-interaction --no-ansi

RUN apt install curl -y && curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/root/google-cloud-sdk/bin
10 changes: 2 additions & 8 deletions README.md
@@ -54,14 +54,6 @@ To illustrate this, take a look at how
`tests/model_tests/chestnut_dec_may/train.py` is written. It pulls in relevant
modules from each stage and constructs a pipeline.


> Initially, we evaluated a few ML E2E solutions; despite offering great
> functionality, their flexibility was limited. From a dev perspective,
> **Active Learning** was a gray area, and we foresaw heavy shoehorning.
> Ultimately, we decided that the risk was too great, so we resorted to
> creating our own solution.
## Contributing

### Pre-commit Hooks
@@ -80,3 +72,5 @@ If you're using `pip` instead of `poetry`, run the following commands:
pip install pre-commit
pre-commit install
```

Alternatively, you can use Black configured in your own IDE.
4 changes: 3 additions & 1 deletion Writerside/d.tree
@@ -8,7 +8,9 @@
start-page="Overview.md">

<toc-element topic="Overview.md"/>
<toc-element topic="Getting-Started.md"/>
<toc-element topic="Getting-Started.md">
<toc-element topic="Get-Started-with-Dev-Containers.md"/>
</toc-element>
<toc-element toc-title="Tutorials">
<toc-element topic="Retrieve-our-Datasets.md"/>
</toc-element>
49 changes: 49 additions & 0 deletions Writerside/topics/Get-Started-with-Dev-Containers.md
@@ -0,0 +1,49 @@
# Get Started with Dev Containers

Dev Containers are a great way to get started with a project. They define all
the necessary dependencies and environments, so you can start coding within
the container right away.

In this article, we'll only go over **additional steps** to set up with our
project. For more information on how to use Dev Containers, please refer to
the official documentation for each IDE. Once you've set up the Dev Container,
come back here to finish the setup:

- [VSCode](https://code.visualstudio.com/docs/remote/containers)
- [IntelliJ](https://www.jetbrains.com/help/idea/connect-to-devcontainer.html)

> If you see the error `Error response from daemon: ... <!DOCTYPE`, you forgot
> the `.git` at the end of the repo URL.
{style='warning'}

## Python Environment

> Do not create a new environment
{style='warning'}

The dev environment is already created and managed by Anaconda
(`/opt/conda/bin/conda`).
To activate the environment, run the following command:

```bash
conda activate base
```
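
This can be sanity-checked from Python as well; a minimal sketch, assuming the
container's conda lives at `/opt/conda` as in this repo's Dockerfile (the
helper name is ours, not part of the project):

```python
import sys

def in_expected_env(expected_prefix: str = "/opt/conda") -> bool:
    """Return True if the running interpreter lives under expected_prefix."""
    return sys.prefix.startswith(expected_prefix)

# Inside the dev container this should print True; elsewhere, likely False.
print(in_expected_env())
```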

> Refer to your respective IDE's documentation on how to activate the
> environment.

## Mark as Sources Root (Add to PYTHONPATH)

For `import` statements to work, you need to mark the `src` folder as the
sources root. Optionally, also mark the `tests` folder as the tests root.
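
If you prefer not to rely on IDE settings, the same effect can be achieved
programmatically; a hedged sketch (the helper is illustrative, not part of
this repo):

```python
import sys
from pathlib import Path

def add_source_roots(repo_root: str) -> list:
    """Prepend repo_root/src and repo_root/tests to sys.path if missing."""
    added = []
    for sub in ("src", "tests"):
        p = str(Path(repo_root) / sub)
        if p not in sys.path:
            sys.path.insert(0, p)
            added.append(p)
    return added
```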

> Refer to your respective IDE's documentation on how to mark folders as
> sources root. (Also known as adding to the `PYTHONPATH`)

## Additional Setup

Refer to the [Getting Started](Getting-Started.md) guide for additional setup
steps such as:
- Google Cloud Application Default Credentials
- Weights & Biases API Key
- Label Studio API Key
107 changes: 81 additions & 26 deletions Writerside/topics/Getting-Started.md
@@ -1,5 +1,7 @@
# Getting Started

> Want to use a Dev Container? See [Get Started with Dev Containers](Get-Started-with-Dev-Containers.md)

<procedure title="Installing the Dev. Environment" id="install">
<step>Ensure that you have the right version of Python.
The required Python version can be seen in <code>pyproject.toml</code>
@@ -10,7 +12,7 @@
</step>
<step>Start by cloning our repository.
<code-block lang="shell">
git clone https://github.com/Forest-Recovery-Digital-Companion/FRDC-ML.git
git clone https://github.com/FR-DC/FRDC-ML.git
</code-block>
</step>
<step>Then, create a Python Virtual Env <code>pyvenv</code>
@@ -60,6 +62,7 @@
</step>
</procedure>


<procedure title="Setting Up Google Cloud" id="gcloud">
<step>
We use Google Cloud to store our datasets. To set up Google Cloud,
@@ -86,6 +89,48 @@
</step>
</procedure>

<procedure title="Setting Up Label Studio" id="ls">
<tip>This is only necessary if any task requires Label Studio annotations</tip>
<step>
We use Label Studio to annotate our datasets.
We won't go through how to install Label Studio; for contributors, it
should already be up on <code>localhost:8080</code>.
</step>
<step>
Then, retrieve your own API key from Label Studio.
<a href="http://localhost:8080/user/account"> Go to your account page </a>
and copy the API key. <br/></step>
<step> Set your API key as an environment variable.
<tabs>
<tab title="Windows">
On Windows, go to "Edit environment variables for
your account" and add this as a new environment variable with name
<code>LABEL_STUDIO_API_KEY</code>.
</tab>
<tab title="Linux">
Export it as an environment variable.
<code-block lang="shell">export LABEL_STUDIO_API_KEY=...</code-block>
</tab>
</tabs>
</step>
</procedure>
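
In code, the key is read from that environment variable; a minimal
fail-fast sketch (the helper is illustrative, not part of `frdc`):

```python
import os

def get_label_studio_key() -> str:
    """Fetch LABEL_STUDIO_API_KEY, failing loudly when it is unset."""
    key = os.environ.get("LABEL_STUDIO_API_KEY", "")
    if not key:
        raise RuntimeError(
            "LABEL_STUDIO_API_KEY is not set; see 'Setting Up Label Studio'."
        )
    return key
```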

<procedure title="Setting Up Weights and Biases" id="wandb">
<step>
We use W&B to track our experiments. To set up W&B,
<a href="https://docs.wandb.ai/quickstart">
install the W&B CLI
</a>
</step>
<step>
Then,
<a href="https://docs.wandb.ai/quickstart">
authenticate your account
</a>.
<code-block lang="shell">wandb login</code-block>
</step>
</procedure>

<procedure title="Pre-commit Hooks" collapsible="true">
<note>This is optional but recommended.
Pre-commit hooks are a way to ensure that your code is formatted correctly.
@@ -98,30 +143,45 @@
</step>
</procedure>

<procedure title="Running the Tests" collapsible="true" id="tests">
<procedure title="Running the Tests" id="tests">
<step>
Run the tests to make sure everything is working
<code-block lang="shell">
pytest
</code-block>
</step>
<step>
In case of errors:
<deflist>
<def title="google.auth.exceptions.DefaultCredentialsError">
If you get this error, it means that you haven't authenticated your
Google Cloud account.
See <a anchor="gcloud">Setting Up Google Cloud</a>
</def>
<def title="ModuleNotFoundError" collapsible="true">
If you get this error, it means that you haven't installed the
dependencies.
See <a anchor="install">Installing the Dev. Environment</a>
</def>
</deflist>
</step>
</procedure>

## Troubleshooting

### ModuleNotFoundError

It's likely that your `src` and `tests` directories are not in `PYTHONPATH`.
To fix this, run the following command:

```shell
export PYTHONPATH=$PYTHONPATH:./src:./tests
```

Or set it in your IDE; for example, IntelliJ allows setting directories as
**Source Roots**.

### google.auth.exceptions.DefaultCredentialsError

It's likely that you haven't authenticated your Google Cloud account.
See [Setting Up Google Cloud](#gcloud)

### Couldn't connect to Label Studio

Label Studio must be running locally, exposed on `localhost:8080`. Furthermore,
you need to specify the `LABEL_STUDIO_API_KEY` environment variable. See
[Setting Up Label Studio](#ls)

### Cannot log in to W&B

You need to authenticate your W&B account; see [Setting Up Weights and Biases](#wandb).
If you're facing difficulties, set the `WANDB_MODE` environment variable to `offline`
to disable W&B.
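
The variable can also be set from Python before `wandb` is initialised; a
small sketch (`offline` is a documented `WANDB_MODE` value, and
`setdefault` keeps any mode you already exported):

```python
import os

def ensure_offline_default() -> str:
    """Default WANDB_MODE to 'offline' without overriding an explicit choice."""
    return os.environ.setdefault("WANDB_MODE", "offline")

print(ensure_offline_default())
```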

## Our Repository Structure

@@ -132,15 +192,13 @@ help you understand where to put your code.
graph LR
FRDC -- " Core Dependencies " --> src/frdc/
FRDC -- " Resources " --> rsc/
FRDC -- " Pipeline " --> pipeline/
FRDC -- " Tests " --> tests/
FRDC -- " Repo Dependencies " --> pyproject.toml,poetry.lock
src/frdc/ -- " Dataset Loaders " --> ./load/
src/frdc/ -- " Preprocessing Fn. " --> ./preprocess/
src/frdc/ -- " Train Deps " --> ./train/
src/frdc/ -- " Model Architectures " --> ./models/
rsc/ -- " Datasets ... " --> ./dataset_name/
pipeline/ -- " Model Training Pipeline " --> ./model_tests/
```

src/frdc/
Expand All @@ -149,19 +207,16 @@ src/frdc/
rsc/
: Resources. These are usually cached datasets

pipeline/
: Pipeline code. These are the full ML tests of our pipeline.

tests/
: PyTest tests. These are unit tests & integration tests.
: PyTest tests. These are unit, integration, and model tests.

### Unit, Integration, and Model Tests

We have 3 types of tests:

- Unit Tests are usually small, single function tests.
- Integration Tests are larger tests that test a mock pipeline.
- Pipeline Tests are the true production pipeline tests that will generate a
- Model Tests are the true production pipeline tests that will generate a
model.

### Where Should I contribute?
@@ -176,9 +231,9 @@
By adding a new component, you'll need to add a new test. Take a look at the
<code>tests/</code> directory.
</def>
<def title="Changing the pipeline">
<def title="Changing the model pipeline">
If you're a ML Researcher, you'll probably be changing the pipeline. Take a
look at the <code>pipeline/</code> directory.
look at the <code>tests/model_tests/</code> directory.
</def>
<def title="Adding a dependency">
If you're adding a new dependency, use <code>poetry add PACKAGE</code> and
15 changes: 8 additions & 7 deletions Writerside/topics/Retrieve-our-Datasets.md
@@ -25,16 +25,17 @@ Here, we'll download and load our
- `labels`: The labels of the trees (segments)

```python
from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
```

### What Datasets are there? {collapsible="true"}

> To know what datasets are available, you can run
> We recommend using `FRDCDatasetPreset`. However, if you want
> to know what other datasets are available, you can run
> [load.gcs](load.gcs.md)'s `list_gcs_datasets()`
> method
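
Conceptually, such a listing derives dataset names from GCS-style blob paths
of the form `<site>/<date>/...`; the sketch below is illustrative only and is
not the real `list_gcs_datasets` implementation:

```python
def list_datasets(blob_names):
    """Collect unique '<site>/<date>' prefixes from blob paths."""
    datasets = set()
    for name in blob_names:
        parts = name.split("/")
        if len(parts) >= 2:
            datasets.add(f"{parts[0]}/{parts[1]}")
    return sorted(datasets)

blobs = [
    "chestnut_nature_park/20201218/bands/result.tif",
    "chestnut_nature_park/20210510/bands/result.tif",
]
print(list_datasets(blobs))
# → ['chestnut_nature_park/20201218', 'chestnut_nature_park/20210510']
```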
@@ -86,10 +87,10 @@ To segment the data, use [Extract Segments](preprocessing.extract_segments.md).
Here, we'll segment the data by the bounds.

```python
from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset
from frdc.preprocess.extract_segments import extract_segments_from_bounds

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
segments = extract_segments_from_bounds(ar, bounds)
@@ -109,11 +110,11 @@ We can then use these data to plot out the first tree segment.
```python
import matplotlib.pyplot as plt

from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset
from frdc.preprocess.extract_segments import extract_segments_from_bounds
from frdc.preprocess.scale import scale_0_1_per_band

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
segments = extract_segments_from_bounds(ar, bounds)
6 changes: 2 additions & 4 deletions Writerside/topics/load.dataset.md
@@ -17,11 +17,9 @@ version.
For example, to load our Chestnut Nature Park dataset.

```python
from frdc.load import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset

ds = FRDCDataset(site='chestnut_nature_park',
date='20201218',
version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
```

Then, we can use the `ds` object to load objects of the dataset: