diff --git a/Writerside/topics/Get-Started-with-Dev-Containers.md b/Writerside/topics/Get-Started-with-Dev-Containers.md
index 750bead..342721f 100644
--- a/Writerside/topics/Get-Started-with-Dev-Containers.md
+++ b/Writerside/topics/Get-Started-with-Dev-Containers.md
@@ -47,3 +47,8 @@ steps such as:
- Google Cloud Application Default Credentials
 - Weights & Biases API Key
- Label Studio API Key
+
+> You can set the API keys in the `.env` file at the root of the project.
+> Be careful not to commit the `.env` file to the repository; it should
+> already be ignored by default.
+{style='note'}
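+
+As a minimal sketch of what that file might contain (the variable names
+below are illustrative; only `LABEL_STUDIO_API_KEY` is referenced elsewhere
+in these docs, and real keys should never be committed):
+
+```
+LABEL_STUDIO_API_KEY=your-label-studio-key
+WANDB_API_KEY=your-wandb-key
+```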
\ No newline at end of file
diff --git a/Writerside/topics/Getting-Started.md b/Writerside/topics/Getting-Started.md
index c62ee26..746c93f 100644
--- a/Writerside/topics/Getting-Started.md
+++ b/Writerside/topics/Getting-Started.md
@@ -1,155 +1,161 @@
# Getting Started
-> Want to use a Dev Container? See [Get Started with Dev Containers](Get-Started-with-Dev-Containers.md)
+> Want to use a Dev Container?
+> See [Get Started with Dev Containers](Get-Started-with-Dev-Containers.md)
- Ensure that you have the right version of Python.
- The required Python version can be seen in pyproject.toml
-
- [tool.poetry.dependencies]
- python = "..."
-
-
- Start by cloning our repository.
-
- git clone https://github.com/FR-DC/FRDC-ML.git
-
-
- Then, create a Python Virtual Env pyvenv
-
-
- python -m venv venv/
-
-
- python3 -m venv venv/
-
-
-
-
- Install Poetry
- Then check if it's installed with
- poetry --version
-
- If poetry is not found, it's likely not in the user PATH.
-
-
- Activate the virtual environment
-
-
+ Ensure that you have the right version of Python.
+ The required Python version can be seen in `pyproject.toml`
+
+ [tool.poetry.dependencies]
+ python = "..."
+
+
+ Start by cloning our repository.
+
+ git clone https://github.com/FR-DC/FRDC-ML.git
+
+
+ Then, create a Python virtual environment
+
+
+ python -m venv venv/
+
+
+ python3 -m venv venv/
+
+
+
+
+ Install Poetry
+ Then check if it's installed with
+ poetry --version
+
+ If poetry is not found, it's likely not on the user's PATH.
+
+
+ Activate the virtual environment
+
+
- cd venv/Scripts
- activate
- cd ../..
+ cd venv/Scripts
+ activate
+ cd ../..
-
-
+
+
- source venv/bin/activate
+ source venv/bin/activate
-
-
-
- Install the dependencies. You should be in the same directory as
- pyproject.toml
-
- poetry install --with dev
-
-
- Install Pre-Commit Hooks
-
- pre-commit install
-
-
+
+
+
+ Install the dependencies. You should be in the same directory as
+ pyproject.toml
+
+ poetry install --with dev
+
+
+ Install Pre-Commit Hooks
+
+ pre-commit install
+
+
-
- We use Google Cloud to store our datasets. To set up Google Cloud,
-
- install the Google Cloud CLI
-
-
-
- Then,
-
- authenticate your account
- .
- gcloud auth login
-
-
- Finally,
-
- set up Application Default Credentials (ADC)
- .
- gcloud auth application-default login
-
-
- To make sure everything is working, run the tests.
-
+
+ We use Google Cloud to store our datasets. To set up Google Cloud,
+
+ install the Google Cloud CLI
+
+
+
+ Then,
+
+ authenticate your account
+ .
+ gcloud auth login
+
+
+ Finally,
+
+ set up Application Default Credentials (ADC)
+ .
+ gcloud auth application-default login
+
+
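+ Once ADC is set up, Google Cloud client libraries discover the
+ credentials automatically. Here is a minimal sketch, assuming the
+ google-cloud-storage package; the bucket name is a placeholder, not
+ the project's actual bucket:
+
+ ```python
+ from google.cloud import storage
+
+ # storage.Client() picks up Application Default Credentials on its own.
+ client = storage.Client()
+ bucket = client.bucket("my-dataset-bucket")  # placeholder name
+ print([blob.name for blob in bucket.list_blobs(max_results=5)])
+ ```
+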
+ To make sure everything is working, run the tests.
+
- This is only necessary if any task requires Label Studio annotations
-
- We use Label Studio to annotate our datasets.
- We won't go through how to install Label Studio, for contributors, it
- should be up on localhost:8080.
-
-
- Then, retrieve your own API key from Label Studio.
- Go to your account page
- and copy the API key.
- Set your API key as an environment variable.
-
-
+ This is only necessary if any task requires Label Studio annotations
+
+ We use Label Studio to annotate our datasets.
+ We won't go through how to install Label Studio; for contributors, it
+ should already be up on localhost:8080.
+
+
+ Then, retrieve your own API key from Label Studio.
+ Go to your account page
+ and copy the API key.
+ Set your API key as an environment variable.
+
+
In Windows, go to "Edit environment variables for
your account" and add this as a new environment variable with name
LABEL_STUDIO_API_KEY.
-
-
+
+
Export it as an environment variable.
export LABEL_STUDIO_API_KEY=...
-
-
-
+
+
+ In all cases, you can create a `.env` file in the root of
+ the project and add the following line:
+ LABEL_STUDIO_API_KEY=...
+
+
+
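+ However the variable is set, it is ultimately read from the process
+ environment. A minimal sketch, assuming the python-dotenv package is
+ used to load `.env` (whether this project loads it that way is an
+ assumption):
+
+ ```python
+ import os
+
+ from dotenv import load_dotenv  # pip install python-dotenv
+
+ load_dotenv()  # reads .env from the working directory, if present
+ api_key = os.environ["LABEL_STUDIO_API_KEY"]
+ ```
+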
-
- We use W&B to track our experiments. To set up W&B,
-
- install the W&B CLI
-
-
-
- Then,
-
- authenticate your account
- .
- wandb login
-
+
+ We use W&B to track our experiments. To set up W&B,
+
+ install the W&B CLI
+
+
+
+ Then,
+
+ authenticate your account
+ .
+ wandb login
+
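+ After logging in, runs can be tracked from Python. A minimal sketch;
+ the project name and logged values are placeholders, not this
+ repository's actual setup:
+
+ ```python
+ import wandb
+
+ # Uses the credentials stored by `wandb login`.
+ run = wandb.init(project="frdc-example", config={"lr": 1e-3})
+ run.log({"loss": 0.5})
+ run.finish()
+ ```
+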
- This is optional but recommended.
- Pre-commit hooks are a way to ensure that your code is formatted correctly.
- This is done by running a series of checks before you commit your code.
-
-
-
- pre-commit install
-
-
+ This is optional but recommended.
+ Pre-commit hooks are a way to ensure that your code is formatted correctly.
+ This is done by running a series of checks before you commit your code.
+
+
+
+ pre-commit install
+
+
-
- Run the tests to make sure everything is working
-
- pytest
-
-
+
+ Run the tests to make sure everything is working
+
+ pytest
+
+
## Troubleshooting
@@ -174,13 +180,15 @@ See [Setting Up Google Cloud](#gcloud)
### Couldn't connect to Label Studio
Label Studio must be running locally, exposed on `localhost:8080`. Furthermore,
-you need to specify the `LABEL_STUDIO_API_KEY` environment variable. See
+you need to specify the `LABEL_STUDIO_API_KEY` environment variable. See
[Setting Up Label Studio](#ls)
### Cannot login to W&B
-You need to authenticate your W&B account. See [Setting Up Weight and Biases](#wandb)
-If you're facing difficulties, set the `WANDB_MODE` environment variable to `offline`
+You need to authenticate your W&B account.
+See [Setting Up Weights and Biases](#wandb).
+If you're facing difficulties, set the `WANDB_MODE` environment variable
+to `offline`
to disable W&B.
## Our Repository Structure
diff --git a/docs/HelpTOC.json b/docs/HelpTOC.json
index 59ab580..98bda53 100644
--- a/docs/HelpTOC.json
+++ b/docs/HelpTOC.json
@@ -1 +1 @@
-{"entities":{"pages":{"Overview":{"id":"Overview","title":"Overview","url":"overview.html","level":0,"tabIndex":0},"ML-Architecture":{"id":"ML-Architecture","title":"ML Architecture","url":"ml-architecture.html","level":0,"tabIndex":1},"Getting-Started":{"id":"Getting-Started","title":"Getting Started","url":"getting-started.html","level":0,"pages":["Get-Started-with-Dev-Containers"],"tabIndex":2},"Get-Started-with-Dev-Containers":{"id":"Get-Started-with-Dev-Containers","title":"Get Started with Dev Containers","url":"get-started-with-dev-containers.html","level":1,"parentId":"Getting-Started","tabIndex":0},"-6vddrq_5799":{"id":"-6vddrq_5799","title":"Tutorials","level":0,"pages":["Retrieve-our-Datasets"],"tabIndex":3},"Retrieve-our-Datasets":{"id":"Retrieve-our-Datasets","title":"Retrieve our Datasets","url":"retrieve-our-datasets.html","level":1,"parentId":"-6vddrq_5799","tabIndex":0},"mix-match":{"id":"mix-match","title":"MixMatch","url":"mix-match.html","level":0,"pages":["mix-match-module","custom-k-aug-dataloaders"],"tabIndex":4},"mix-match-module":{"id":"mix-match-module","title":"MixMatch Module","url":"mix-match-module.html","level":1,"parentId":"mix-match","tabIndex":0},"custom-k-aug-dataloaders":{"id":"custom-k-aug-dataloaders","title":"Custom K-Aug Dataloaders","url":"custom-k-aug-dataloaders.html","level":1,"parentId":"mix-match","tabIndex":1},"-6vddrq_5804":{"id":"-6vddrq_5804","title":"Model Tests","level":0,"pages":["Model-Test-Chestnut-May-Dec"],"tabIndex":5},"Model-Test-Chestnut-May-Dec":{"id":"Model-Test-Chestnut-May-Dec","title":"Model Test Chestnut May-Dec","url":"model-test-chestnut-may-dec.html","level":1,"parentId":"-6vddrq_5804","tabIndex":0},"-6vddrq_5806":{"id":"-6vddrq_5806","title":"API","level":0,"pages":["load.dataset","load.gcs","preprocessing.scale","preprocessing.extract_segments","preprocessing.morphology","preprocessing.glcm_padded"],"tabIndex":6},"load.dataset":{"id":"load.dataset","title":"load.dataset","url":"load-dataset.html","level":1,"parentId":"-6vddrq_5806","tabIndex":0},"load.gcs":{"id":"load.gcs","title":"load.gcs","url":"load-gcs.html","level":1,"parentId":"-6vddrq_5806","tabIndex":1},"preprocessing.scale":{"id":"preprocessing.scale","title":"preprocessing.scale","url":"preprocessing-scale.html","level":1,"parentId":"-6vddrq_5806","tabIndex":2},"preprocessing.extract_segments":{"id":"preprocessing.extract_segments","title":"preprocessing.extract_segments","url":"preprocessing-extract-segments.html","level":1,"parentId":"-6vddrq_5806","tabIndex":3},"preprocessing.morphology":{"id":"preprocessing.morphology","title":"preprocessing.morphology","url":"preprocessing-morphology.html","level":1,"parentId":"-6vddrq_5806","tabIndex":4},"preprocessing.glcm_padded":{"id":"preprocessing.glcm_padded","title":"preprocessing.glcm_padded","url":"preprocessing-glcm-padded.html","level":1,"parentId":"-6vddrq_5806","tabIndex":5}}},"topLevelIds":["Overview","ML-Architecture","Getting-Started","-6vddrq_5799","mix-match","-6vddrq_5804","-6vddrq_5806"]}
\ No newline at end of file
+{"entities":{"pages":{"Overview":{"id":"Overview","title":"Overview","url":"overview.html","level":0,"tabIndex":0},"ML-Architecture":{"id":"ML-Architecture","title":"ML Architecture","url":"ml-architecture.html","level":0,"tabIndex":1},"Getting-Started":{"id":"Getting-Started","title":"Getting Started","url":"getting-started.html","level":0,"pages":["Get-Started-with-Dev-Containers"],"tabIndex":2},"Get-Started-with-Dev-Containers":{"id":"Get-Started-with-Dev-Containers","title":"Get Started with Dev Containers","url":"get-started-with-dev-containers.html","level":1,"parentId":"Getting-Started","tabIndex":0},"-6vddrq_6549":{"id":"-6vddrq_6549","title":"Tutorials","level":0,"pages":["Retrieve-our-Datasets"],"tabIndex":3},"Retrieve-our-Datasets":{"id":"Retrieve-our-Datasets","title":"Retrieve our Datasets","url":"retrieve-our-datasets.html","level":1,"parentId":"-6vddrq_6549","tabIndex":0},"mix-match":{"id":"mix-match","title":"MixMatch","url":"mix-match.html","level":0,"pages":["mix-match-module","custom-k-aug-dataloaders"],"tabIndex":4},"mix-match-module":{"id":"mix-match-module","title":"MixMatch Module","url":"mix-match-module.html","level":1,"parentId":"mix-match","tabIndex":0},"custom-k-aug-dataloaders":{"id":"custom-k-aug-dataloaders","title":"Custom K-Aug Dataloaders","url":"custom-k-aug-dataloaders.html","level":1,"parentId":"mix-match","tabIndex":1},"-6vddrq_6554":{"id":"-6vddrq_6554","title":"Model Tests","level":0,"pages":["Model-Test-Chestnut-May-Dec"],"tabIndex":5},"Model-Test-Chestnut-May-Dec":{"id":"Model-Test-Chestnut-May-Dec","title":"Model Test Chestnut May-Dec","url":"model-test-chestnut-may-dec.html","level":1,"parentId":"-6vddrq_6554","tabIndex":0},"-6vddrq_6556":{"id":"-6vddrq_6556","title":"API","level":0,"pages":["load.dataset","load.gcs","preprocessing.scale","preprocessing.extract_segments","preprocessing.morphology","preprocessing.glcm_padded"],"tabIndex":6},"load.dataset":{"id":"load.dataset","title":"load.dataset","url":"load-dataset.html","level":1,"parentId":"-6vddrq_6556","tabIndex":0},"load.gcs":{"id":"load.gcs","title":"load.gcs","url":"load-gcs.html","level":1,"parentId":"-6vddrq_6556","tabIndex":1},"preprocessing.scale":{"id":"preprocessing.scale","title":"preprocessing.scale","url":"preprocessing-scale.html","level":1,"parentId":"-6vddrq_6556","tabIndex":2},"preprocessing.extract_segments":{"id":"preprocessing.extract_segments","title":"preprocessing.extract_segments","url":"preprocessing-extract-segments.html","level":1,"parentId":"-6vddrq_6556","tabIndex":3},"preprocessing.morphology":{"id":"preprocessing.morphology","title":"preprocessing.morphology","url":"preprocessing-morphology.html","level":1,"parentId":"-6vddrq_6556","tabIndex":4},"preprocessing.glcm_padded":{"id":"preprocessing.glcm_padded","title":"preprocessing.glcm_padded","url":"preprocessing-glcm-padded.html","level":1,"parentId":"-6vddrq_6556","tabIndex":5}}},"topLevelIds":["Overview","ML-Architecture","Getting-Started","-6vddrq_6549","mix-match","-6vddrq_6554","-6vddrq_6556"]}
\ No newline at end of file
diff --git a/docs/custom-k-aug-dataloaders.html b/docs/custom-k-aug-dataloaders.html
index 0d084f1..98a565a 100644
--- a/docs/custom-k-aug-dataloaders.html
+++ b/docs/custom-k-aug-dataloaders.html
@@ -1,5 +1,5 @@
Custom K-Aug Dataloaders | Documentation
Documentation 0.1.2 Help
Custom K-Aug Dataloaders
In MixMatch, implementing the data loading methods is quite unconventional.
We need to load multiple augmented versions of the same image into the same batch.
The labelled set is usually too small, causing a premature end to the epoch as it runs out of samples to draw from faster than the unlabelled set.
This can be rather tricky to implement in PyTorch. This tutorial will illustrate how we did it.
Loading Multiple Augmented Versions of the Same Image
See: frdc/load/dataset.py FRDCDataset.__getitem__
In MixMatch, a single train batch must consist of:
A batch of labelled images
K batches of unlabeled images
Keep in mind that the unlabelled batch is a single batch of images, not separate draws of batches. It is then "duplicated" K times, and each copy is augmented differently.
Solution 1: Custom Dataset
To solve this, we need to understand the role of both a Dataset and a DataLoader.
A Dataset represents a collection of data, responsible for loading and returning something.
A DataLoader draws samples from a Dataset and returns batched samples.
The key here is that a Dataset is not limited to returning 1 sample at a time; we can make it return the K augmented versions of the same image.
In code, this is done by subclassing the Dataset class and overriding the __getitem__ method.
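A minimal sketch of such a Dataset (the class and variable names here are illustrative, not the repository's actual FRDCDataset):

```python
from torch.utils.data import Dataset


class TripleDataset(Dataset):
    """A Dataset whose __getitem__ returns 3 duplicates of one image."""

    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Each copy could instead go through a different augmentation.
        return img, img, img


ds = TripleDataset(["img_0", "img_1"])
print(ds[0])  # ('img_0', 'img_0', 'img_0')
```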
In the above example, we have a Dataset that returns 3 duplicate versions of the same image. By leveraging this technique, we can create a Dataset that returns the K augmented versions of the same image as a tuple.
Premature End of Epoch due to Small Labelled Set
See: frdc/train/frdc_datamodule.py
In MixMatch, the definition of an "epoch" is a bit different. Instead of implying that we have seen all the data once, it implies that we've drawn N batches. The N is referred to as the number of iterations per epoch.
Take, for example, a labelled set of numbers [1, 2, 3] and an unlabelled set [4, 5, 6, 7, 8, 9, 10]. With a batch size of 2, we'll run out of labelled samples after 2 iterations, while the unlabelled set still has samples left.
Draw 1: [1, 2], [4, 5]
Draw 2: [3], [6, 7].
Epoch ends.
Solution 2: Random Sampling
To fix this, instead of sequentially sampling the labelled set (and the unlabelled set), we can sample them randomly. This way, the labelled set never runs out.
Draw 1: [1, 3], [7, 5]
Draw 2: [2, 1], [4, 9]
Draw 3: [3, 2], [8, 6]
... and so on.
Luckily, PyTorch's DataLoader supports random sampling. We just need to use RandomSampler instead of SequentialSampler (which is the default).
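A minimal sketch of that swap (the tiny TensorDataset is a stand-in for the actual labelled Dataset):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

labelled_ds = TensorDataset(torch.tensor([1, 2, 3]))  # tiny stand-in

# replacement=True plus a fixed num_samples decouples "epoch length"
# from the dataset size, so the small labelled set never runs out.
n_iters, batch_size = 100, 2
sampler = RandomSampler(
    labelled_ds,
    replacement=True,
    num_samples=n_iters * batch_size,
)
labelled_dl = DataLoader(labelled_ds, batch_size=batch_size, sampler=sampler)
```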