Commit
Merge branch 'main' of https://github.com/FR-DC/FRDC-ML
Eve-ning committed Feb 14, 2024
2 parents f43e1eb + ed10c7a commit 1691608
Showing 52 changed files with 840 additions and 764 deletions.
16 changes: 16 additions & 0 deletions .devcontainer/devcontainer.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"name": "frdc",
"build": {
"dockerfile": "../Dockerfile",
},
"containerEnv": {
"LABEL_STUDIO_HOST": "host.docker.internal",
"LABEL_STUDIO_API_KEY": "${localEnv:LABEL_STUDIO_API_KEY}",
},
"runArgs": [
"--gpus=all",
],
"hostRequirements": {
"gpu": true,
}
}
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
Dockerfile text=auto eol=lf
20 changes: 20 additions & 0 deletions Dockerfile
@@ -0,0 +1,20 @@
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime as torch
WORKDIR /devcontainer

COPY ./pyproject.toml /devcontainer/pyproject.toml

RUN apt update -y && apt upgrade -y
RUN apt install git -y

RUN pip3 install --upgrade pip && \
pip3 install poetry && \
pip3 install lightning

RUN conda init bash \
&& . ~/.bashrc \
&& conda activate base \
&& poetry config virtualenvs.create false \
&& poetry install --with dev --no-interaction --no-ansi

RUN apt install curl -y && curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/root/google-cloud-sdk/bin
10 changes: 2 additions & 8 deletions README.md
@@ -54,14 +54,6 @@ To illustrate this, take a look at how
`tests/model_tests/chestnut_dec_may/train.py` is written. It pulls in relevant
modules from each stage and constructs a pipeline.


> Initially, we evaluated a few ML E2E solutions; despite offering great
> functionality, their flexibility was limited. From a dev perspective,
> **Active Learning** was a gray area, and we foresaw heavy shoehorning.
> Ultimately, we decided that the risk was too great, so we resorted to
> creating our own solution.
## Contributing

### Pre-commit Hooks
@@ -80,3 +72,5 @@ If you're using `pip` instead of `poetry`, run the following commands:
pip install pre-commit
pre-commit install
```

Alternatively, you can use Black configured in your own IDE.
4 changes: 3 additions & 1 deletion Writerside/d.tree
@@ -8,7 +8,9 @@
start-page="Overview.md">

<toc-element topic="Overview.md"/>
<toc-element topic="Getting-Started.md"/>
<toc-element topic="Getting-Started.md">
<toc-element topic="Get-Started-with-Dev-Containers.md"/>
</toc-element>
<toc-element toc-title="Tutorials">
<toc-element topic="Retrieve-our-Datasets.md"/>
</toc-element>
49 changes: 49 additions & 0 deletions Writerside/topics/Get-Started-with-Dev-Containers.md
@@ -0,0 +1,49 @@
# Get Started with Dev Containers

Dev Containers are a great way to get started with a project. They define all
the necessary dependencies and environments, so you can start coding within
the container right away.

In this article, we'll only go over **additional steps** to set up with our
project. For more information on how to use Dev Containers, please refer to
the official documentation for each IDE. Once you've set up the Dev Container,
come back here to finish the setup:

- [VSCode](https://code.visualstudio.com/docs/remote/containers)
- [IntelliJ](https://www.jetbrains.com/help/idea/connect-to-devcontainer.html)

> If you see the error `Error response from daemon: ... <!DOCTYPE`, you forgot
> the `.git` at the end of the repo URL.
{style='warning'}

## Python Environment

> Do not create a new environment
{style='warning'}

The dev environment is already created and managed by Anaconda
(`/opt/conda/bin/conda`).
To activate the environment, run the following command:

```bash
conda activate base
```
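
This can be sanity-checked from Python as well; a minimal sketch, assuming the
container's conda lives at `/opt/conda` as in this repo's Dockerfile (the
helper name is ours, not part of the project):

```python
import sys

def in_expected_env(expected_prefix: str = "/opt/conda") -> bool:
    """Return True if the running interpreter lives under expected_prefix."""
    return sys.prefix.startswith(expected_prefix)

# Inside the dev container this should print True; elsewhere, likely False.
print(in_expected_env())
```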

> Refer to your respective IDE's documentation on how to activate the
> environment.

## Mark as Sources Root (Add to PYTHONPATH)

For `import` statements to work, you need to mark the `src` folder as the
sources root. Optionally, also mark the `tests` folder as the tests root.
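
If you prefer not to rely on IDE settings, the same effect can be achieved
programmatically; a hedged sketch (the helper is illustrative, not part of
this repo):

```python
import sys
from pathlib import Path

def add_source_roots(repo_root: str) -> list:
    """Prepend repo_root/src and repo_root/tests to sys.path if missing."""
    added = []
    for sub in ("src", "tests"):
        p = str(Path(repo_root) / sub)
        if p not in sys.path:
            sys.path.insert(0, p)
            added.append(p)
    return added
```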

> Refer to your respective IDE's documentation on how to mark folders as
> sources root. (Also known as adding to the `PYTHONPATH`)

## Additional Setup

Refer to the [Getting Started](Getting-Started.md) guide for additional setup
steps such as:
- Google Cloud Application Default Credentials
- Weights & Biases API Key
- Label Studio API Key
107 changes: 81 additions & 26 deletions Writerside/topics/Getting-Started.md
@@ -1,5 +1,7 @@
# Getting Started

> Want to use a Dev Container? See [Get Started with Dev Containers](Get-Started-with-Dev-Containers.md)

<procedure title="Installing the Dev. Environment" id="install">
<step>Ensure that you have the right version of Python.
The required Python version can be seen in <code>pyproject.toml</code>
@@ -10,7 +12,7 @@
</step>
<step>Start by cloning our repository.
<code-block lang="shell">
git clone https://github.com/Forest-Recovery-Digital-Companion/FRDC-ML.git
git clone https://github.com/FR-DC/FRDC-ML.git
</code-block>
</step>
<step>Then, create a Python Virtual Env <code>pyvenv</code>
@@ -60,6 +62,7 @@
</step>
</procedure>


<procedure title="Setting Up Google Cloud" id="gcloud">
<step>
We use Google Cloud to store our datasets. To set up Google Cloud,
@@ -86,6 +89,48 @@
</step>
</procedure>

<procedure title="Setting Up Label Studio" id="ls">
<tip>This is only necessary if any task requires Label Studio annotations</tip>
<step>
We use Label Studio to annotate our datasets.
We won't go through how to install Label Studio; for contributors, it
should already be up on <code>localhost:8080</code>.
</step>
<step>
Then, retrieve your own API key from Label Studio.
<a href="http://localhost:8080/user/account"> Go to your account page </a>
and copy the API key. <br/></step>
<step> Set your API key as an environment variable.
<tabs>
<tab title="Windows">
On Windows, go to "Edit environment variables for
your account" and add this as a new environment variable with name
<code>LABEL_STUDIO_API_KEY</code>.
</tab>
<tab title="Linux">
Export it as an environment variable.
<code-block lang="shell">export LABEL_STUDIO_API_KEY=...</code-block>
</tab>
</tabs>
</step>
</procedure>
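
In code, the key is read from that environment variable; a minimal
fail-fast sketch (the helper is illustrative, not part of `frdc`):

```python
import os

def get_label_studio_key() -> str:
    """Fetch LABEL_STUDIO_API_KEY, failing loudly when it is unset."""
    key = os.environ.get("LABEL_STUDIO_API_KEY", "")
    if not key:
        raise RuntimeError(
            "LABEL_STUDIO_API_KEY is not set; see 'Setting Up Label Studio'."
        )
    return key
```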

<procedure title="Setting Up Weights and Biases" id="wandb">
<step>
We use W&B to track our experiments. To set up W&B,
<a href="https://docs.wandb.ai/quickstart">
install the W&B CLI
</a>
</step>
<step>
Then,
<a href="https://docs.wandb.ai/quickstart">
authenticate your account
</a>.
<code-block lang="shell">wandb login</code-block>
</step>
</procedure>

<procedure title="Pre-commit Hooks" collapsible="true">
<note>This is optional but recommended.
Pre-commit hooks are a way to ensure that your code is formatted correctly.
@@ -98,30 +143,45 @@
</step>
</procedure>

<procedure title="Running the Tests" collapsible="true" id="tests">
<procedure title="Running the Tests" id="tests">
<step>
Run the tests to make sure everything is working
<code-block lang="shell">
pytest
</code-block>
</step>
<step>
In case of errors:
<deflist>
<def title="google.auth.exceptions.DefaultCredentialsError">
If you get this error, it means that you haven't authenticated your
Google Cloud account.
See <a anchor="gcloud">Setting Up Google Cloud</a>
</def>
<def title="ModuleNotFoundError" collapsible="true">
If you get this error, it means that you haven't installed the
dependencies.
See <a anchor="install">Installing the Dev. Environment</a>
</def>
</deflist>
</step>
</procedure>

## Troubleshooting

### ModuleNotFoundError

It's likely that your `src` and `tests` directories are not in `PYTHONPATH`.
To fix this, run the following command:

```shell
export PYTHONPATH=$PYTHONPATH:./src:./tests
```

Or set it in your IDE; for example, IntelliJ allows setting directories as
**Source Roots**.

### google.auth.exceptions.DefaultCredentialsError

It's likely that you haven't authenticated your Google Cloud account.
See [Setting Up Google Cloud](#gcloud)

### Couldn't connect to Label Studio

Label Studio must be running locally, exposed on `localhost:8080`. Furthermore,
you need to specify the `LABEL_STUDIO_API_KEY` environment variable. See
[Setting Up Label Studio](#ls)

### Cannot log in to W&B

You need to authenticate your W&B account; see [Setting Up Weights and Biases](#wandb).
If you're facing difficulties, set the `WANDB_MODE` environment variable to `offline`
to disable W&B.
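
The variable can also be set from Python before `wandb` is initialised; a
small sketch (`offline` is a documented `WANDB_MODE` value, and
`setdefault` keeps any mode you already exported):

```python
import os

def ensure_offline_default() -> str:
    """Default WANDB_MODE to 'offline' without overriding an explicit choice."""
    return os.environ.setdefault("WANDB_MODE", "offline")

print(ensure_offline_default())
```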

## Our Repository Structure

@@ -132,15 +192,13 @@ help you understand where to put your code.
graph LR
FRDC -- " Core Dependencies " --> src/frdc/
FRDC -- " Resources " --> rsc/
FRDC -- " Pipeline " --> pipeline/
FRDC -- " Tests " --> tests/
FRDC -- " Repo Dependencies " --> pyproject.toml,poetry.lock
src/frdc/ -- " Dataset Loaders " --> ./load/
src/frdc/ -- " Preprocessing Fn. " --> ./preprocess/
src/frdc/ -- " Train Deps " --> ./train/
src/frdc/ -- " Model Architectures " --> ./models/
rsc/ -- " Datasets ... " --> ./dataset_name/
pipeline/ -- " Model Training Pipeline " --> ./model_tests/
```

src/frdc/
Expand All @@ -149,19 +207,16 @@ src/frdc/
rsc/
: Resources. These are usually cached datasets

pipeline/
: Pipeline code. These are the full ML tests of our pipeline.

tests/
: PyTest tests. These are unit tests & integration tests.
: PyTest tests. These are unit, integration, and model tests.

### Unit, Integration, and Model Tests

We have 3 types of tests:

- Unit Tests are usually small, single function tests.
- Integration Tests are larger tests that test a mock pipeline.
- Pipeline Tests are the true production pipeline tests that will generate a
- Model Tests are the true production pipeline tests that will generate a
model.

### Where Should I contribute?
@@ -176,9 +231,9 @@
By adding a new component, you'll need to add a new test. Take a look at the
<code>tests/</code> directory.
</def>
<def title="Changing the pipeline">
<def title="Changing the model pipeline">
If you're a ML Researcher, you'll probably be changing the pipeline. Take a
look at the <code>pipeline/</code> directory.
look at the <code>tests/model_tests/</code> directory.
</def>
<def title="Adding a dependency">
If you're adding a new dependency, use <code>poetry add PACKAGE</code> and
15 changes: 8 additions & 7 deletions Writerside/topics/Retrieve-our-Datasets.md
@@ -25,16 +25,17 @@ Here, we'll download and load our
- `labels`: The labels of the trees (segments)

```python
from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
```

### What Datasets are there? {collapsible="true"}

> To know what datasets are available, you can run
> We recommend using `FRDCDatasetPreset`. However, if you want
> to know what other datasets are available, you can run
> [load.gcs](load.gcs.md)'s `list_gcs_datasets()`
> method
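
Conceptually, such a listing derives dataset names from GCS-style blob paths
of the form `<site>/<date>/...`; the sketch below is illustrative only and is
not the real `list_gcs_datasets` implementation:

```python
def list_datasets(blob_names):
    """Collect unique '<site>/<date>' prefixes from blob paths."""
    datasets = set()
    for name in blob_names:
        parts = name.split("/")
        if len(parts) >= 2:
            datasets.add(f"{parts[0]}/{parts[1]}")
    return sorted(datasets)

blobs = [
    "chestnut_nature_park/20201218/bands/result.tif",
    "chestnut_nature_park/20210510/bands/result.tif",
]
print(list_datasets(blobs))
# → ['chestnut_nature_park/20201218', 'chestnut_nature_park/20210510']
```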
@@ -86,10 +87,10 @@ To segment the data, use [Extract Segments](preprocessing.extract_segments.md).
Here, we'll segment the data by the bounds.

```python
from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset
from frdc.preprocess.extract_segments import extract_segments_from_bounds

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
segments = extract_segments_from_bounds(ar, bounds)
@@ -109,11 +110,11 @@ We can then use these data to plot out the first tree segment.
```python
import matplotlib.pyplot as plt

from frdc.load.dataset import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset
from frdc.preprocess.extract_segments import extract_segments_from_bounds
from frdc.preprocess.scale import scale_0_1_per_band

ds = FRDCDataset(site="chestnut_nature_park", date="20201218", version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
ar, order = ds.get_ar_bands()
bounds, labels = ds.get_bounds_and_labels()
segments = extract_segments_from_bounds(ar, bounds)
6 changes: 2 additions & 4 deletions Writerside/topics/load.dataset.md
@@ -17,11 +17,9 @@ version.
For example, to load our Chestnut Nature Park dataset.

```python
from frdc.load import FRDCDataset
from frdc.load.preset import FRDCDatasetPreset

ds = FRDCDataset(site='chestnut_nature_park',
date='20201218',
version=None)
ds = FRDCDatasetPreset.chestnut_20201218()
```

Then, we can use the `ds` object to load objects of the dataset: