diff --git a/Writerside/topics/Pipeline.md b/Writerside/topics/Pipeline.md
index ee5bc0f..f5bb38d 100644
--- a/Writerside/topics/Pipeline.md
+++ b/Writerside/topics/Pipeline.md
@@ -69,11 +69,11 @@ flowchart TD
YLO --> YC([Concat])
YUPRK --> YC([Concat])
- XC --> X
- YC --> Y
+ XC --> X[X Concat.]
+ YC --> Y[Y Concat.]
- X[X] --> S
- Y[Y] --> S([Shuffler])
+ X --> S([Shuffler])
+ Y --> S
S --> XS[X Shuffled]
S --> YS[Y Shuffled]
X & XS --> MU([Mix Up])
diff --git a/docs/HelpTOC.json b/docs/HelpTOC.json
index 79ae1df..f0e50b1 100644
--- a/docs/HelpTOC.json
+++ b/docs/HelpTOC.json
@@ -1 +1 @@
-{"entities":{"pages":{"home":{"id":"home","title":"Mix Match CIFAR10 on PyTorch","url":"home.html","level":0,"tabIndex":0},"7b32ac5c_135":{"id":"7b32ac5c_135","title":"Pipeline","level":0,"pages":["Pipeline","Data-Preparation","Interleaving"],"tabIndex":1},"Pipeline":{"id":"Pipeline","title":"Pipeline","url":"pipeline.html","level":1,"parentId":"7b32ac5c_135","tabIndex":0},"Data-Preparation":{"id":"Data-Preparation","title":"Data Preparation","url":"data-preparation.html","level":1,"parentId":"7b32ac5c_135","tabIndex":1},"Interleaving":{"id":"Interleaving","title":"Interleaving","url":"interleaving.html","level":1,"parentId":"7b32ac5c_135","tabIndex":2}}},"topLevelIds":["home","7b32ac5c_135"]}
\ No newline at end of file
+{"entities":{"pages":{"home":{"id":"home","title":"Mix Match CIFAR10 on PyTorch","url":"home.html","level":0,"tabIndex":0},"39bb0149_38560":{"id":"39bb0149_38560","title":"Pipeline","level":0,"pages":["Pipeline","Data-Preparation","Interleaving"],"tabIndex":1},"Pipeline":{"id":"Pipeline","title":"Pipeline","url":"pipeline.html","level":1,"parentId":"39bb0149_38560","tabIndex":0},"Data-Preparation":{"id":"Data-Preparation","title":"Data Preparation","url":"data-preparation.html","level":1,"parentId":"39bb0149_38560","tabIndex":1},"Interleaving":{"id":"Interleaving","title":"Interleaving","url":"interleaving.html","level":1,"parentId":"39bb0149_38560","tabIndex":2}}},"topLevelIds":["home","39bb0149_38560"]}
\ No newline at end of file
diff --git a/docs/data-preparation.html b/docs/data-preparation.html
index ccfcfa8..26fa217 100644
--- a/docs/data-preparation.html
+++ b/docs/data-preparation.html
@@ -1,4 +1,4 @@
+ Data Preparation | MixMatch PyTorch Documentation
Data Preparation
The data is split into 3 + K sets, where K is the number of augmentations.
As our labelled data is tiny, it's important to stratify the data on the labels.
It's up to the user to decide how to augment, how much labelled data to use, how much validation, etc.
Data Augmentation
The data augmentation is kept simple.
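For illustration, a light CIFAR10 augmentation could look like the sketch below. This is an assumption for the sake of example; the exact transform used in the repository is not shown on this page.

```python
from torchvision import transforms

# A typical light augmentation for 32x32 CIFAR10 images (illustrative assumption):
# pad-and-crop plus a random horizontal flip, followed by conversion to a tensor.
aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```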
Implementation
We found torchvision's CIFAR10 somewhat awkward to work with, as the dataset is loaded on demand. We implement the approach below, though it's not the only way to do this.
In order to stratify and shuffle the data without loading the entire dataset, we pass indices into the Dataset initialization and pre-emptively shuffle and filter the indices.
from typing import Sequence
import numpy as np
@@ -10,7 +10,7 @@
super().__init__(**kwargs)
self.data = self.data[idxs]
self.targets = np.array(self.targets)[idxs].tolist()
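For context, a self-contained sketch of the full class might look as follows. The class name CIFAR10Subset is ours; only the body of __init__ appears in the snippet above.

```python
from typing import Sequence

import numpy as np
from torchvision.datasets import CIFAR10


class CIFAR10Subset(CIFAR10):
    """CIFAR10 restricted to a pre-computed set of indices (hypothetical name)."""

    def __init__(self, idxs: Sequence[int], **kwargs):
        super().__init__(**kwargs)
        idxs = np.asarray(idxs)
        # Keep only the rows selected by the (already shuffled / stratified) indices.
        self.data = self.data[idxs]
        self.targets = np.array(self.targets)[idxs].tolist()


# e.g. CIFAR10Subset(idxs=lab_idxs, root="data", train=True, download=True)
```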
Notably, torchvision.datasets.CIFAR10 has
50,000 training images
10,000 test images
So, the process is to figure out how to split the range(50000) indices into indices for the labelled and unlabelled data, and to use the targets to stratify that split.
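As a sketch of that idea, the split could be done with scikit-learn's train_test_split; the 250-label budget and the use of scikit-learn here are assumptions for illustration, not necessarily what the repository does.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from torchvision.datasets import CIFAR10

targets = np.array(CIFAR10(root="data", train=True, download=True).targets)
idxs = np.arange(len(targets))  # range(50000)

# Stratify on the targets so every class is equally represented in the labelled set.
lab_idxs, unl_idxs = train_test_split(
    idxs, train_size=250, stratify=targets, random_state=42,
)
```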
Splitting Unlabelled to K Augmentations
Instead of trying to create K datasets, we create a single dataset, and make it return K augmented versions of the same image.
from typing import Sequence, Callable
@@ -24,4 +24,4 @@
def __getitem__(self, item):
img, target = super().__getitem__(item)
return tuple(self.aug(img) for _ in range(self.k_augs)), target
This works by overriding the __getitem__ method and returning a tuple of augmented images. Take note that all downstream code must be updated to handle this.
\ No newline at end of file
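Putting it together, a self-contained sketch of such a dataset, building on the CIFAR10Subset sketch above, could look like this (the class name CIFAR10KAug is hypothetical):

```python
from typing import Callable, Sequence


class CIFAR10KAug(CIFAR10Subset):
    """CIFAR10 subset that returns K augmented views per image (hypothetical name)."""

    def __init__(self, idxs: Sequence[int], aug: Callable, k_augs: int, **kwargs):
        super().__init__(idxs, **kwargs)
        self.aug = aug
        self.k_augs = k_augs

    def __getitem__(self, item):
        img, target = super().__getitem__(item)
        # Return K independently augmented views of the same image.
        return tuple(self.aug(img) for _ in range(self.k_augs)), target
```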
diff --git a/docs/home.html b/docs/home.html
index b29f1ad..60c4de3 100644
--- a/docs/home.html
+++ b/docs/home.html
@@ -1 +1 @@
+ Mix Match CIFAR10 on PyTorch | MixMatch PyTorch Documentation
Mix Match CIFAR10 on PyTorch
This repository showcases using PyTorch to implement MixMatch on the CIFAR10 dataset. Much of the original implementation is based on YU1ut/MixMatch-pytorch; however, we've refactored the code to follow more modern PyTorch practices.
\ No newline at end of file
diff --git a/docs/interleaving.html b/docs/interleaving.html
index 0eec62e..569db37 100644
--- a/docs/interleaving.html
+++ b/docs/interleaving.html
@@ -1,11 +1,11 @@
+ Interleaving | MixMatch PyTorch Documentation
Interleaving
The interleaving step in MixMatch is a little bit of a mystery. The paper doesn't go into detail about it at all; however, the TensorFlow implementation does use it.
This issue thread, google-research/mixmatch/5, attempted to explain and justify the use of Interleave. However, it's unclear whether it's a limitation of TensorFlow or a necessary step for good performance.
The Justification
Following the issue thread, Interleave creates a batch that is representative of the input data, as they "only update batch norm for the first batch, ..." Specifically, this batch should be of shape (B, C, H, W), where B is the batch size. This differs from the input of shape (B, K + 1, C, H, W), where K is the number of augmentations.
Interleave deterministically mixes the labelled and unlabelled data to create a new batch of shape (B, C, H, W).
Interleaving Process
This may sound a bit confusing, so let's review how we arrived at this step.
Recall that we only interleave the X Mix Up before prediction, then reverse the interleaving after prediction.
Remember the shapes of the data.
Mix Up is just a proportional mix of the original data and a shuffled version of the data. Assuming a 32x32 RGB image, the shape of this X Mix is (B, K + 1, 3, 32, 32).
It's confusing with words, so let's visualize what interleave does. Here, each cell is C x H x W, with K = 2 and B = 5:

|                | B=0   | B=1 | B=2 | B=3 | B=4   |
|----------------|-------|-----|-----|-----|-------|
| K=0 Lab.       | CxHxW | ... | ... | ... | ...   |
| K=1 Unl. Aug 1 | ...   | ... | ... | ... | ...   |
| K=2 Unl. Aug 2 | ...   | ... | ... | ... | CxHxW |
Given the constraint mentioned above, we can't simply use the first B x C x H x W for prediction: taking the K=0 slice across the batch would yield only labelled data, which is not representative of X Mix.
Find all elements on the first augmentation, marking them as O:

|            | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|------------|--------|--------|--------|--------|--------|
| Lab.       | O      | O      | O      | O      | O      |
| Unl. Aug 1 |        |        |        |        |        |
| Unl. Aug 2 |        |        |        |        |        |
Find all elements on the "diagonal"; this is usually not a true diagonal, as the table isn't perfectly square. We'll mark these as X:

|            | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|------------|--------|--------|--------|--------|--------|
| Lab.       | X      | X      |        |        |        |
| Unl. Aug 1 |        |        | X      | X      |        |
| Unl. Aug 2 |        |        |        |        | X      |
Swap these elements:

|            | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|------------|--------|--------|--------|--------|--------|
| Lab.       | OX     | OX     | X      | X      | X      |
| Unl. Aug 1 |        |        | O      | O      |        |
| Unl. Aug 2 |        |        |        |        | O      |
Notice that this interleaving process proportionally mixes the labelled and unlabelled data. The first row, of shape (B, C, H, W), is then used to predict the labels of X Mix.
Reverse the interleaving process:

|            | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|------------|--------|--------|--------|--------|--------|
| Lab.       | OX     | OX     | O      | O      | O      |
| Unl. Aug 1 |        |        | X      | X      |        |
| Unl. Aug 2 |        |        |        |        | X      |
This process is perfectly reversible.
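In code, a sketch of such an interleave, modelled on the interleave used in the reference PyTorch implementation the repository is based on, looks like this (the exact group sizes may differ slightly from the tables above; applying it a second time undoes the swap, which is why the process is reversible):

```python
import torch


def interleave_offsets(batch: int, nu: int) -> list:
    # Split `batch` items into nu + 1 nearly equal groups; return cumulative offsets.
    groups = [batch // (nu + 1)] * (nu + 1)
    for x in range(batch - sum(groups)):
        groups[-x - 1] += 1
    offsets = [0]
    for g in groups:
        offsets.append(offsets[-1] + g)
    return offsets


def interleave(xy: list, batch: int) -> list:
    # xy = [labelled batch, unl. aug 1, ..., unl. aug K], each of shape (B, C, H, W).
    # Swap the "diagonal" group of each unlabelled batch with the matching group of
    # the labelled batch, so every returned batch mixes labelled and unlabelled data.
    nu = len(xy) - 1
    offsets = interleave_offsets(batch, nu)
    xy = [[v[offsets[p]:offsets[p + 1]] for p in range(nu + 1)] for v in xy]
    for i in range(1, nu + 1):
        xy[0][i], xy[i][i] = xy[i][i], xy[0][i]
    return [torch.cat(v, dim=0) for v in xy]
```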
Our Preliminary Results
We briefly tried this under 3 conditions:
1. With interleaving (the default implementation)
2. Without interleaving, across all augmentations
3. Batched across all augmentations
Our 2nd experiment is not subject to the limitation mentioned above: we just pass the splits into the model sequentially.
Our 3rd experiment concatenates the splits into a single batch (in our case we just didn't split it). This is the most representative of the data, as the model is trained on the entire batch. Due to our larger batch, we also multiplied the learning rate by (K + 1).
We can see that our 2nd experiment failed to converge, while our 1st and 3rd are comparable. We tease out possible issues in the next section.
BatchNorm is not order invariant
Let's dig a little deeper into BatchNorm, specifically, how the running mean and variance are calculated.
This outputs 0.5 and 0.25 respectively, because the running mean and variance are an exponential moving average of the batch statistics, weighted by the momentum.
One can think of momentum as the weight of the new value, so a lower momentum means a slower change in the running mean and variance. The update only depends on the immediately preceding value, which means the way batches are fed in can be a problem.
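To see this concretely, here is a small illustrative sketch (the momentum value and data are arbitrary assumptions): two identical BatchNorm layers see exactly the same samples, once as a single batch and once split into two sequential mini-batches, and end up with different running means.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(6, 3)                  # 6 samples, 3 features
halves = x.chunk(2, dim=0)             # the same data, split into two mini-batches

bn_whole = nn.BatchNorm1d(3, momentum=0.5)
bn_split = nn.BatchNorm1d(3, momentum=0.5)

bn_whole(x)                            # one pass over the full batch
for h in halves:                       # sequential passes over the halves
    bn_split(h)

print(bn_whole.running_mean)
print(bn_split.running_mean)           # differs: the EMA weights the later half more
```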
Despite seeing the same data, the running means differ. This mirrors our 2nd experiment: when we split the data into multiple batches and pass them through the model sequentially, the running statistics are heavily biased against the first batch, which is our labelled data.
To fix this, instead of passing the batches in sequentially, we can pass them in as a single batch. This is what we did in our 3rd experiment, and it has the same effect as the 1st experiment, where we interleaved the data.
To recap, this is how our 3 experiments passed x_mix into the model.
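As a rough sketch (the names model, x_mix and batch_size are placeholders, and interleave refers to the helper sketched earlier):

```python
import torch
from torch import nn


def forward_x_mix(model: nn.Module, x_mix: torch.Tensor,
                  batch_size: int, mode: str) -> torch.Tensor:
    # x_mix has shape ((K + 1) * B, C, H, W), with the labelled chunk first.
    splits = list(torch.split(x_mix, batch_size))
    if mode == "interleave":                       # 1st experiment
        mixed = interleave(splits, batch_size)
        logits = [model(x) for x in mixed]
        return torch.cat(interleave(logits, batch_size))   # undo the swap
    if mode == "sequential":                       # 2nd experiment (did not converge)
        return torch.cat([model(x) for x in splits])
    return model(x_mix)                            # 3rd experiment: one big batch
```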
It's inconclusive whether interleaving is necessary, as passing the whole batch into the model has the same effect and is, theoretically, more sound. However, it's clear that passing the data in sequentially is not a good idea.
\ No newline at end of file
diff --git a/docs/pipeline.html b/docs/pipeline.html
index ef2919a..4e3e8a5 100644
--- a/docs/pipeline.html
+++ b/docs/pipeline.html
@@ -1 +1 @@
+ Pipeline | MixMatch PyTorch Documentation
Pipeline
There are several crucial details about the MixMatch pipeline that aren't mentioned in the paper but make or break the performance of the model. This document aims to explain the pipeline in detail.
Shortforms
X: input data, in this case, images
Y: labels of the input data
K: number of augmentations, or to refer to the kth augmentation
Lab.: labeled data
Unl.: unlabeled data
Data Preparation
The data is split into 3 + K sets, where K is the number of augmentations.
Training is rather complex. The key steps are illustrated below.
To highlight certain steps, we use the following notation:
This is the pipeline of the training process.
We have both Data and Data List, as the augmentations create a new axis in the data.
A few things to note:
Concat is on the Batch axis, the 1st axis.
Predict uses the model's forward pass.
The Label Guessing Prediction, Predict(X Unl. K), doesn't use gradient.
The Mix Up Shuffling is on the Batch axis, which includes the augmentations: if the data is of shape (B, K, C, H, W), the shuffling happens over both B and K (see the sketch after this list).
CIFAR10 (and most datasets) don't divide evenly into batches; use drop_last on the DataLoader to avoid errors from an uneven final batch.
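A minimal sketch of that shuffle (the shapes here are arbitrary examples):

```python
import torch

B, K = 64, 2
x = torch.randn(B, K, 3, 32, 32)        # stand-in for data of shape (B, K, C, H, W)
x_flat = x.reshape(-1, 3, 32, 32)       # fold K into the batch axis
perm = torch.randperm(x_flat.shape[0])  # permutation over both B and K
x_shuffled = x_flat[perm]               # the Mix Up partner
```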
Sharpening
This is a step to make the Unlabelled Predictions more confident. This is done by raising the predictions to a power, then normalizing the predictions.
A higher tau value will make the predictions more confident.
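A minimal sketch of the sharpening step, with tau used directly as the exponent (the MixMatch paper writes this exponent as 1/T):

```python
import torch


def sharpen(probs: torch.Tensor, tau: float) -> torch.Tensor:
    # Raise the predicted probabilities to a power, then re-normalise to sum to 1.
    # With this parameterisation, a higher tau gives a peakier, more confident guess.
    p = probs ** tau
    return p / p.sum(dim=-1, keepdim=True)
```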
Mix Up
Mix Up mixes the original data with a shuffled version of the data. The ratio of this mix is determined by a modified random sample from a Beta distribution: the modified sample is the maximum of the sample and its complement.
Notably, when we modify the sample, we're effectively always taking the larger value, making the original sample more prevalent during the mix.
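A minimal sketch of this Mix Up, assuming a Beta(alpha, alpha) distribution with alpha as a hyperparameter:

```python
import torch


def mix_up(x, x_shuf, y, y_shuf, alpha: float = 0.75):
    # Sample the mixing ratio and keep the larger of the sample and its complement,
    # so the original (unshuffled) sample always dominates the mix.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1.0 - lam)
    return lam * x + (1 - lam) * x_shuf, lam * y + (1 - lam) * y_shuf
```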
Unlabelled Loss Scaler
The unlabelled loss scaler is a scalar that scales the unlabelled loss. This linearly increases from 0 to 100 over the course of training.
The implementation is simple: take the current epoch, divide it by the total number of epochs, and scale by the maximum weight.
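A minimal sketch, assuming a maximum weight of 100 as stated above:

```python
def unl_loss_scale(epoch: int, total_epochs: int, max_scale: float = 100.0) -> float:
    # Linearly ramp the unlabelled loss weight from 0 to max_scale over training.
    return max_scale * epoch / total_epochs
```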
Interleaving
Interleaving is not a well-documented step in the paper. See our Interleaving document for more details.
Evaluation
The evaluation is simple. We just take the accuracy of the model on the test set.