migrated all possible assets from GCP to repo #3717

Merged 2 commits on Nov 22, 2024
13 changes: 13 additions & 0 deletions CONTRIBUTING.md
@@ -97,6 +97,19 @@ To run the tests in the provided docker containers:
* `pip install -e .`
* `pytest <args>` or `make <args>` to run the desired tests

### Checking documentation

If your changes affect the documentation, please build the docs locally and view the output to verify that the changes are what you intended.

<!--pytest.mark.skip-->
```bash
cd docs
pip install -e '.[docs]'
make clean && make html
make host # open the output link in a browser.
```


## Code Style & Typing

2 changes: 1 addition & 1 deletion composer/algorithms/alibi/README.md
@@ -6,7 +6,7 @@

ALiBi (Attention with Linear Biases) dispenses with position embeddings for tokens in transformer-based NLP models, instead encoding position information by biasing the query-key attention scores proportionally to each token pair’s distance. ALiBi yields excellent extrapolation to unseen sequence lengths compared to other position embedding schemes. We leverage this extrapolation capability by training with shorter sequence lengths, which reduces the memory and computation load.

| ![Alibi](https://storage.googleapis.com/docs.mosaicml.com/images/methods/alibi.png) |
| ![Alibi](../_images/alibi.png) |
|:--:
|*The matrix on the left depicts the attention score for each key-query token pair. The matrix on the right depicts the distance between each query-key token pair. m is a head-specific scalar that is fixed during training. Figure from [Press et al., 2021](https://openreview.net/forum?id=R8sQPpGCv0).*|
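
For concreteness, here is a minimal PyTorch sketch of the bias described above (a bidirectional variant using absolute query-key distance and the power-of-two slope schedule from Press et al.); it is an illustration, not Composer's `alibi` implementation:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Return an (n_heads, seq_len, seq_len) additive bias of -m * |i - j| per head."""
    # Head-specific slopes m: geometric sequence 2^(-8/n_heads), ..., 2^(-8), per Press et al.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()         # |i - j| for every query-key pair
    return -slopes[:, None, None] * distance[None, :, :]   # larger distance => larger penalty

# Example: bias raw attention scores for 8 heads and a 16-token sequence, no position embeddings.
scores = torch.randn(8, 16, 16)                            # (heads, queries, keys)
attn = torch.softmax(scores + alibi_bias(8, 16), dim=-1)
```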

2 changes: 1 addition & 1 deletion composer/algorithms/blurpool/README.md
@@ -10,7 +10,7 @@ BlurPool increases the accuracy of convolutional neural networks for computer vi
nearly the same speed, by applying a spatial low-pass filter before pooling operations and strided convolutions.
Doing so reduces [aliasing](https://en.wikipedia.org/wiki/Aliasing) when performing these operations.

| ![BlurPool](https://storage.googleapis.com/docs.mosaicml.com/images/methods/blurpool-antialiasing.png) |
| ![BlurPool](../_images/blurpool-antialiasing.png) |
|:--:
|*A diagram of the BlurPool replacements (bottom row) for typical pooling and downsampling operations (top row) in convolutional neural networks. In each case, BlurPool applies a low-pass filter before the spatial downsampling to avoid aliasing. This image is Figure 2 in [Zhang (2019)](https://proceedings.mlr.press/v97/zhang19a.html).*|
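
As a rough sketch of the idea (not Composer's `blurpool` module), a fixed binomial low-pass filter can be applied depthwise before a stride-2 downsample; the kernel and module below are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Stride-2 downsampling preceded by a fixed 3x3 binomial (low-pass) filter."""
    def __init__(self, channels: int):
        super().__init__()
        k = torch.tensor([1., 2., 1.])
        kernel = (k[:, None] * k[None, :]) / 16.0                      # 3x3 binomial blur, sums to 1
        self.register_buffer('kernel', kernel[None, None].repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode='reflect')
        # Depthwise convolution applies the same blur to every channel, then stride 2 downsamples.
        return F.conv2d(x, self.kernel, stride=2, groups=self.channels)

# Anti-aliased replacement for nn.MaxPool2d(kernel_size=2, stride=2):
antialiased_pool = nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1), BlurPool2d(channels=64))
out = antialiased_pool(torch.randn(2, 64, 32, 32))                     # -> (2, 64, 16, 16)
```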

2 changes: 1 addition & 1 deletion composer/algorithms/channels_last/README.md
@@ -8,7 +8,7 @@ Channels Last improves the throughput of convolution operations in networks for
NVIDIA GPUs natively perform convolution operations in NHWC format, so storing the tensors this way eliminates transpositions that would otherwise need to take place, increasing throughput.
This is a systems-level method that does not change the math or outcome of training in any way.

| ![ChannelsLast](https://storage.googleapis.com/docs.mosaicml.com/images/methods/channels_last.png) |
| ![ChannelsLast](../_images/channels_last.png) |
|:--:
|*A diagram of a convolutional layer using the standard NCHW tensor memory layout (left) and the NHWC tensor memory layout (right). Fewer operations take place in NHWC format because the convolution operation is natively performed in NHWC format (right); in contrast, the NCHW tensor must be transposed to NHWC before the convolution and transposed back to NCHW after (left). This diagram is from [NVIDIA](https://developer.nvidia.com/blog/tensor-core-ai-performance-milestones/).*|
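
In plain PyTorch, the switch amounts to a memory-format conversion of the model and its inputs; a minimal sketch, assuming a CUDA device and mixed precision (where NHWC kernels help most), not the Composer algorithm itself:

```python
import torch
import torchvision.models as models

model = models.resnet50().cuda().to(memory_format=torch.channels_last)  # weights stored NHWC
images = torch.randn(8, 3, 224, 224, device='cuda')
images = images.contiguous(memory_format=torch.channels_last)           # inputs stored NHWC

with torch.autocast('cuda', dtype=torch.float16):    # NHWC pays off most with mixed precision
    output = model(images)                            # cuDNN picks NHWC kernels, no transposes
```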

2 changes: 1 addition & 1 deletion composer/algorithms/colout/README.md
@@ -8,7 +8,7 @@ ColOut is a data augmentation technique that drops a fraction of the rows or col
If the fraction of rows/columns dropped isn't too large, the image content is not significantly altered but the image size is reduced, speeding up training.
This modification modestly reduces accuracy, but it is a worthwhile tradeoff for the increased speed.

| ![ColOut](https://storage.googleapis.com/docs.mosaicml.com/images/methods/col_out.png) |
| ![ColOut](../_images/col_out.png) |
|:--:
|*Several instances of an image of an apple from the CIFAR-100 dataset with ColOut applied. ColOut randomly removes different rows and columns each time it is applied.*|
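
A minimal sketch of the row/column dropping (illustrative only; the drop fractions here are made-up defaults):

```python
import torch

def colout(img: torch.Tensor, p_row: float = 0.15, p_col: float = 0.15) -> torch.Tensor:
    """Drop ~p_row of rows and ~p_col of columns from a (C, H, W) image tensor."""
    keep_rows = torch.rand(img.shape[1]) > p_row     # Bernoulli keep-mask over rows
    keep_cols = torch.rand(img.shape[2]) > p_col     # Bernoulli keep-mask over columns
    return img[:, keep_rows, :][:, :, keep_cols]     # smaller H and W => cheaper batches

img = torch.rand(3, 32, 32)
print(colout(img).shape)   # e.g. torch.Size([3, 27, 28]) -- varies per call
```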

2 changes: 1 addition & 1 deletion composer/algorithms/cutmix/README.md
@@ -8,7 +8,7 @@ CutMix is a data augmentation technique that modifies images by cutting out a sm
It is a regularization technique that can improve the generalization accuracy of computer
vision models.

| ![CutMix](https://storage.googleapis.com/docs.mosaicml.com/images/methods/cutmix.png) |
| ![CutMix](../_images/cutmix.png) |
|:--:
|*An image with CutMix applied. A picture of a cat has been placed over the top left corner of a picture of a dog. This image is taken from [Figure 1 from Yun et al. (2019)](https://arxiv.org/abs/1905.04899).*|
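
A minimal sketch of the batch-level operation (not Composer's `cutmix` implementation): cut a box whose area follows a Beta-sampled mixing ratio, paste it from a shuffled copy of the batch, and weight the two label sets by the realized area:

```python
import torch

def cutmix(x: torch.Tensor, y: torch.Tensor, alpha: float = 1.0):
    """Paste a random box from a shuffled batch into x; mix labels by the pasted area."""
    x = x.clone()
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    H, W = x.shape[-2:]
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    x[:, :, y1:y2, x1:x2] = x[perm, :, y1:y2, x1:x2]   # paste the patch from shuffled examples
    lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)           # actual mixing ratio after clipping
    return x, y, y[perm], lam                            # loss = lam*ce(y) + (1-lam)*ce(y[perm])

images, labels = torch.randn(8, 3, 32, 32), torch.randint(10, (8,))
mixed, y_a, y_b, lam = cutmix(images, labels)
```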

2 changes: 1 addition & 1 deletion composer/algorithms/cutout/README.md
@@ -7,7 +7,7 @@
Cutout is a data augmentation technique that masks one or more square regions of an input image, replacing them with gray boxes.
It is a regularization technique that improves the accuracy of models for computer vision.

| ![CutOut](https://storage.googleapis.com/docs.mosaicml.com/images/methods/cutout.png) |
| ![CutOut](../_images/cutout.png) |
|:--:
|*Several images from the CIFAR-10 dataset with Cutout applied. Cutout adds a gray box that occludes a portion of each image. This is [Figure 1 from DeVries & Taylor (2017)](https://arxiv.org/abs/1708.04552).*|
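
A minimal sketch of the masking step (illustrative; the square size is an arbitrary default):

```python
import torch

def cutout(img: torch.Tensor, length: int = 8) -> torch.Tensor:
    """Mask one random length x length square of a (C, H, W) image with gray (0.5)."""
    H, W = img.shape[-2:]
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - length // 2, 0), min(cy + length // 2, H)
    x1, x2 = max(cx - length // 2, 0), min(cx + length // 2, W)
    img = img.clone()
    img[:, y1:y2, x1:x2] = 0.5        # gray box occludes part of the image
    return img

augmented = cutout(torch.rand(3, 32, 32))
```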

2 changes: 1 addition & 1 deletion composer/algorithms/factorize/README.md
@@ -8,7 +8,7 @@
Factorize splits a large linear or convolutional layer into two smaller ones that compute a similar function.
This can be applied to models for both computer vision and natural language processing.

| ![Factorize](https://storage.googleapis.com/docs.mosaicml.com/images/methods/factorize-no-caption.png) |
| ![Factorize](../_images/factorize-no-caption.png) |
|:--:
|*Figure 1 of [Zhang et al. (2015)](https://ieeexplore.ieee.org/abstract/document/7332968). (a) The weights `W` of a 2D convolutional layer with `k x k` filters, `c` input channels, and `d` output channels are factorized into two smaller convolutions (b) with weights `W'` and `P` with `d'` intermediate channels. The first convolution uses the original filter size but produces only `d'` channels. The second convolution has `1 x 1` filters and produces the original `d` output channels but has only `d'` input channels. This changes the complexity per spatial position from $O(k^2cd)$ to $O(k^2cd') + O(d'd)$.*|
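
A minimal sketch of the convolutional factorization in the figure (illustrative only: the replacement layers below are freshly initialized, whereas a real factorization would initialize them to approximate the original weights):

```python
import torch
import torch.nn as nn

def factorize_conv(conv: nn.Conv2d, d_latent: int) -> nn.Sequential:
    """Replace a k x k conv (c -> d channels) with a k x k conv (c -> d') plus a 1 x 1 conv (d' -> d)."""
    first = nn.Conv2d(conv.in_channels, d_latent, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=False)
    second = nn.Conv2d(d_latent, conv.out_channels, kernel_size=1, bias=conv.bias is not None)
    return nn.Sequential(first, second)

original = nn.Conv2d(64, 256, kernel_size=3, padding=1)
factorized = factorize_conv(original, d_latent=64)   # ~53k weights vs ~147k in the original
out = factorized(torch.randn(1, 64, 32, 32))         # same output shape as the original conv
```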

2 changes: 1 addition & 1 deletion composer/algorithms/ghost_batchnorm/README.md
@@ -8,7 +8,7 @@ During training, BatchNorm normalizes each batch of inputs to have a mean of 0 a
Ghost BatchNorm instead splits the batch into multiple "ghost" batches, each containing `ghost_batch_size` samples, and normalizes each one to have a mean of 0 and variance of 1.
This causes training with a large batch size to behave similarly to training with a small batch size.

| ![Ghost BatchNorm](https://storage.googleapis.com/docs.mosaicml.com/images/methods/ghost-batch-normalization.png) |
| ![Ghost BatchNorm](../_images/ghost-batch-normalization.png) |
|:--:
|*A visualization of different normalization methods on an activation tensor in a neural network with multiple channels. M represents the batch dimension, C represents the channel dimension, and F represents the spatial dimensions (such as height and width). Ghost BatchNorm (upper right) is a modified version of BatchNorm that normalizes the mean and variance for disjoint sub-batches of the full batch. This image is Figure 1 in [Dimitriou & Arandjelovic, 2020](https://arxiv.org/abs/2007.08554).*|
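
A minimal sketch of the idea (not Composer's `ghost_batchnorm` module): split the incoming batch into fixed-size chunks and normalize each chunk with its own batch statistics:

```python
import torch
import torch.nn as nn

class GhostBatchNorm2d(nn.Module):
    """Apply BatchNorm independently to fixed-size 'ghost' sub-batches of the full batch."""
    def __init__(self, num_features: int, ghost_batch_size: int = 32):
        super().__init__()
        self.ghost_batch_size = ghost_batch_size
        self.bn = nn.BatchNorm2d(num_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return self.bn(x)                     # inference uses the shared running statistics
        chunks = x.split(self.ghost_batch_size, dim=0)
        return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)   # per-chunk batch stats

gbn = GhostBatchNorm2d(64, ghost_batch_size=32)
out = gbn(torch.randn(256, 64, 8, 8))   # a 256-sample batch normalized as 8 ghost batches of 32
```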

2 changes: 1 addition & 1 deletion composer/algorithms/layer_freezing/README.md
@@ -8,7 +8,7 @@
Layer Freezing gradually makes early modules untrainable ("freezing" them), saving the cost of backpropagating to and updating frozen modules.
The hypothesis behind Layer Freezing is that early layers may learn their features sooner than later layers, meaning they do not need to be updated later in training.

<!--| ![LayerFreezing](https://storage.googleapis.com/docs.mosaicml.com/images/methods/layer-freezing.png) |
<!--| ![LayerFreezing](../_images/layer-freezing.png) |
|:--:
|*Need a picture.*|-->
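
As an illustrative sketch only (the schedule below is made up, not Composer's), freezing boils down to turning off `requires_grad` for a growing prefix of the model as training progresses:

```python
import torch
import torchvision.models as models

def freeze_early_layers(model: torch.nn.Module, fraction_done: float, freeze_start: float = 0.5):
    """Freeze a growing prefix of the model's top-level modules as training progresses."""
    layers = list(model.children())
    # No freezing before `freeze_start`; afterwards the frozen prefix grows linearly,
    # but the final module is never frozen.
    progress = max(0.0, (fraction_done - freeze_start) / (1 - freeze_start))
    n_frozen = int(progress * (len(layers) - 1))
    for layer in layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False               # no gradients or updates for frozen modules

model = models.resnet18()
freeze_early_layers(model, fraction_done=0.75)    # 75% through training: freeze part of the trunk
```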

2 changes: 1 addition & 1 deletion composer/algorithms/mixup/README.md
@@ -10,7 +10,7 @@ For any pair of examples, it trains the network on a random convex combination o
To create the corresponding targets, it uses the same random convex combination of the targets of the individual examples.
Training in this fashion improves generalization.

| ![MixUp](https://storage.googleapis.com/docs.mosaicml.com/images/methods/mix_up.png) |
| ![MixUp](../_images/mix_up.png) |
|:--:
|*Two different training examples (a picture of a bird and a picture of a frog) that have been combined by MixUp into a single example. The corresponding targets are a convex combination of the targets for the bird class and the frog class.*|
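
A minimal sketch of the batch-level operation (the Beta parameter is an arbitrary default, not Composer's):

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Return a convex combination of each example (and its targets) with a shuffled partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[perm]      # interpolate the inputs
    return mixed_x, y, y[perm], lam              # loss = lam*ce(pred, y) + (1-lam)*ce(pred, y[perm])

images, labels = torch.randn(16, 3, 32, 32), torch.randint(10, (16,))
mixed, y_a, y_b, lam = mixup(images, labels)
```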

2 changes: 1 addition & 1 deletion composer/algorithms/progressive_resizing/README.md
@@ -7,7 +7,7 @@

Progressive Resizing works by initially training on images that have been downsampled to a smaller size. It slowly grows the images back to their full size by a set point in training and uses full-size images for the remainder of training. Progressive resizing reduces costs during the early phase of training when the network may learn coarse-grained features that do not require details lost by reducing image resolution.

| ![ProgressiveResizing](https://storage.googleapis.com/docs.mosaicml.com/images/methods/progressive_resizing_vision.png) |
| ![ProgressiveResizing](../_images/progressive_resizing_vision.png) |
|:--|
|*An example image as it would appear to the network at different stages of training with progressive resizing. At the beginning of training, each training example is at its smallest size. Throughout the pre-training phase, example size increases linearly. At the end of the pre-training phase, example size has reached its full value and remains at that value for the remainder of training (the fine-tuning phase).*|
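
A minimal sketch of the resizing schedule (the initial scale and fine-tuning fraction below are illustrative defaults, not Composer's):

```python
import torch
import torch.nn.functional as F

def resize_for_progress(x: torch.Tensor, fraction_done: float,
                        initial_scale: float = 0.5, finetune_fraction: float = 0.2) -> torch.Tensor:
    """Downsample a batch early in training; ramp linearly to full size, then hold it there."""
    ramp = min(fraction_done / (1 - finetune_fraction), 1.0)   # 0 -> 1 over the pre-training phase
    scale = initial_scale + (1.0 - initial_scale) * ramp
    if scale >= 1.0:
        return x
    return F.interpolate(x, scale_factor=scale, mode='bilinear', align_corners=False)

batch = torch.randn(8, 3, 224, 224)
print(resize_for_progress(batch, fraction_done=0.1).shape)   # early: spatially downsampled
print(resize_for_progress(batch, fraction_done=0.9).shape)   # fine-tuning phase: full (8, 3, 224, 224)
```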

2 changes: 1 addition & 1 deletion composer/algorithms/randaugment/README.md
@@ -8,7 +8,7 @@ For each data sample, RandAugment randomly samples `depth` image augmentations f
Each augmentation is applied with a context-specific `severity` sampled uniformly from 0 to 10.
Training in this fashion regularizes the network and can improve generalization performance.

| ![RandAugment](https://storage.googleapis.com/docs.mosaicml.com/images/methods/rand_augment.jpg) |
| ![RandAugment](../_images/rand_augment.jpg) |
|:--:|
|*An image of a dog that undergoes three different augmentation chains. Each of these chains is a possible augmentation that might be applied by RandAugment and gets combined with the original image.*|
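
A rough sketch of the sampling loop (illustrative only: the augmentation pool and severity-to-parameter mappings below are made up, not the ones Composer uses):

```python
import random
import torch
import torchvision.transforms.functional as TF

# A small pool of augmentations; `severity` in [0, 10] is mapped to each op's own parameter range.
AUGMENTATIONS = [
    lambda img, s: TF.rotate(img, angle=3.0 * s),              # up to 30 degrees
    lambda img, s: TF.adjust_brightness(img, 1.0 + 0.09 * s),
    lambda img, s: TF.adjust_contrast(img, 1.0 + 0.09 * s),
    lambda img, s: TF.adjust_sharpness(img, 1.0 + 0.09 * s),
]

def rand_augment(img: torch.Tensor, depth: int = 2, severity: int = 9) -> torch.Tensor:
    """Apply `depth` randomly chosen ops, each at a random severity up to the configured maximum."""
    for op in random.choices(AUGMENTATIONS, k=depth):
        img = op(img, random.uniform(0, severity))
    return img

augmented = rand_augment(torch.rand(3, 224, 224))
```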

2 changes: 1 addition & 1 deletion composer/algorithms/selective_backprop/README.md
@@ -7,7 +7,7 @@
Selective Backprop prioritizes examples with high loss at each iteration, skipping backpropagation on examples with low loss.
This speeds up training with limited impact on generalization.

| ![SelectiveBackprop](https://storage.googleapis.com/docs.mosaicml.com/images/methods/selective-backprop.png) |
| ![SelectiveBackprop](../_images/selective-backprop.png) |
|:--|
|*Four examples are forward propagated through the network. Selective backprop only backpropagates the two examples that have the highest loss.*|
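
A simplified sketch of one training step (it keeps the top-loss examples deterministically, whereas the full method selects them probabilistically based on loss):

```python
import torch
import torch.nn.functional as F

def selective_backprop_step(model, optimizer, x, y, keep_fraction: float = 0.5):
    """Score every example with a cheap forward pass, then backprop only the highest-loss ones."""
    with torch.no_grad():
        scores = F.cross_entropy(model(x), y, reduction='none')   # per-example loss, no graph
    k = max(1, int(keep_fraction * x.size(0)))
    keep = scores.topk(k).indices                                   # hardest examples this step

    optimizer.zero_grad()
    loss = F.cross_entropy(model(x[keep]), y[keep])                 # forward + backward on the subset
    loss.backward()
    optimizer.step()
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
selective_backprop_step(model, opt, torch.randn(64, 3, 32, 32), torch.randint(10, (64,)))
```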

2 changes: 1 addition & 1 deletion composer/algorithms/seq_length_warmup/README.md
@@ -7,7 +7,7 @@

Sequence Length Warmup linearly increases the sequence length (number of tokens per sentence) used to train a language model from a `min_seq_length` to a `max_seq_length` over some duration at the beginning of training. The underlying motivation is that sequence length is a proxy for the difficulty of an example, and this method assumes a simple curriculum where the model is trained on easy examples (by this definition) first. Sequence Length Warmup is able to reduce the training time of GPT-style models by ~1.5x while still achieving the same loss as baselines.

| ![SequenceLengthWarmup](https://storage.googleapis.com/docs.mosaicml.com/images/methods/seq_len_warmup.svg)|
| ![SequenceLengthWarmup](../_images/seq_len_warmup.svg)|
|:--|
|*The sequence length used to train a model over the course of training. It increases linearly over the first 30% of training before reaching its full value for the remainder of training.*|
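
A minimal sketch of a truncation-based warmup (the sequence lengths and warmup fraction are illustrative defaults, not Composer's):

```python
import torch

def warmup_seq_length(batch: torch.Tensor, fraction_done: float,
                      min_seq_length: int = 128, max_seq_length: int = 1024,
                      warmup_fraction: float = 0.3) -> torch.Tensor:
    """Truncate (batch, max_seq_length) token batches to the current curriculum length."""
    ramp = min(fraction_done / warmup_fraction, 1.0)               # 0 -> 1 over the warmup period
    cur_len = int(min_seq_length + ramp * (max_seq_length - min_seq_length))
    return batch[:, :cur_len]                                       # shorter sequences => cheaper steps

tokens = torch.randint(50257, (8, 1024))                            # a batch of GPT-style token ids
print(warmup_seq_length(tokens, fraction_done=0.15).shape)   # halfway through warmup: (8, 576)
print(warmup_seq_length(tokens, fraction_done=0.60).shape)   # after warmup: full (8, 1024)
```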

2 changes: 1 addition & 1 deletion composer/algorithms/squeeze_excite/README.md
@@ -6,7 +6,7 @@

Adds a channel-wise attention operator in CNNs. Attention coefficients are produced by a small, trainable MLP that uses the channels' globally pooled activations as input. It requires more work on each forward pass, slowing down training and inference, but leads to higher quality models.

| ![Squeeze-Excite](https://storage.googleapis.com/docs.mosaicml.com/images/methods/squeeze-and-excitation.png) |
| ![Squeeze-Excite](../_images/squeeze-and-excitation.png) |
|:--|
| *After an activation tensor **X** is passed through Conv2d **F**<sub>tr</sub> to yield a new tensor **U**, a Squeeze-and-Excitation (SE) module scales the channels in a data-dependent manner. The scales are produced by a single-hidden-layer, fully-connected network whose input is the global-averaged-pooled **U**. This can be seen as a channel-wise attention mechanism.* |
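
A minimal sketch of the SE module described above (the reduction ratio is an arbitrary default):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel-wise attention: globally pool, pass through a small MLP, rescale the channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.mlp(x.mean(dim=(2, 3)))        # squeeze: global average pool -> (N, C)
        return x * scale[:, :, None, None]          # excite: per-channel gating of the activations

out = SqueezeExcite(64)(torch.randn(2, 64, 32, 32))   # same shape, channels reweighted
```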

2 changes: 1 addition & 1 deletion composer/algorithms/weight_standardization/README.md
@@ -6,7 +6,7 @@

Weight Standardization is a reparametrization of convolutional weights such that the input channel and kernel dimensions have zero mean and unit variance. The authors suggested using this method when the per-device batch size is too small for batch normalization to work well. Additionally, the authors suggest this method enables using other normalization layers instead of batch normalization while maintaining similar performance. We have been unable to verify either of these claims on Composer benchmarks. Instead, we have found weight standardization to improve performance with a small throughput degradation when training ResNet architectures on semantic segmentation tasks. There are a few papers that have found weight standardization useful as well.

| ![WeightStandardization](https://storage.googleapis.com/docs.mosaicml.com/images/methods/weight_standardization.png) |
| ![WeightStandardization](../_images/weight_standardization.png) |
|:--|
| *Comparing various normalization layers applied to activations (blue) and weight standardization applied to convolutional weights (orange). This figure is Figure 2 in [Qiao et al., 2019](https://arxiv.org/abs/1903.10520).* |
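
A minimal sketch of the reparametrization as a standalone layer (illustrative only, not Composer's in-place reparametrization of existing convolutions): each output filter is standardized over its input-channel and kernel dimensions before being used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized (zero mean, unit variance) per output channel."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)          # over in_channels and kernel dims
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

out = WSConv2d(3, 64, kernel_size=3, padding=1)(torch.randn(1, 3, 32, 32))
```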

Binary file added docs/source/_images/alibi.png
Binary file added docs/source/_images/aug_mix.png
Binary file added docs/source/_images/blurpool-antialiasing.png
Binary file added docs/source/_images/channels_last.png
Binary file added docs/source/_images/col_out.png
Binary file added docs/source/_images/cutmix.png
Binary file added docs/source/_images/cutout.png
Binary file added docs/source/_images/factorize-no-caption.png
Binary file added docs/source/_images/ghost-batch-normalization.png
Binary file added docs/source/_images/logo-dark-bg.png
Binary file added docs/source/_images/mix_up.png
Binary file added docs/source/_images/profiler_trace_example.png
Binary file added docs/source/_images/r50_aws_explorer.png
Binary file added docs/source/_images/r50_aws_explorer_recipe.png
Binary file added docs/source/_images/rand_augment.jpg
Binary file added docs/source/_images/scale_schedule.png
Binary file added docs/source/_images/selective-backprop.png
1 change: 1 addition & 0 deletions docs/source/_images/seq_len_warmup.svg
Binary file added docs/source/_images/squeeze-and-excitation.png
Binary file added docs/source/_images/weight_standardization.png
2 changes: 1 addition & 1 deletion docs/source/method_cards/scale_schedule.md
@@ -6,7 +6,7 @@ Scale Schedule changes the number of training steps by a dilation factor and dil
accordingly. Doing so varies the training budget, making it possible to explore tradeoffs between cost (measured in
time or money) and the quality of the final model.

| ![scale_schedule.png](https://storage.googleapis.com/docs.mosaicml.com/images/methods/scale_schedule.png) |
| ![scale_schedule.png](../_images/scale_schedule.png) |
|:--|
|*Scale schedule scales the learning rate decay schedule.*|
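
A minimal sketch of the dilation (illustrative; the `ssr` value and milestones below are made up):

```python
import torch

def apply_scale_schedule(base_max_steps: int, milestones: list, ssr: float = 0.5):
    """Scale the training budget and every LR-schedule milestone by the same ratio `ssr`."""
    scaled_steps = int(base_max_steps * ssr)
    scaled_milestones = [int(m * ssr) for m in milestones]
    return scaled_steps, scaled_milestones

max_steps, milestones = apply_scale_schedule(90_000, [30_000, 60_000, 80_000], ssr=0.5)
model = torch.nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# The step-decay schedule now decays at 15k/30k/40k steps and training stops at 45k.
scheduler = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=milestones, gamma=0.1)
```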

2 changes: 1 addition & 1 deletion docs/source/method_cards/stochastic_depth.md
@@ -4,7 +4,7 @@

Block-wise stochastic depth assigns every residual block a probability of dropping the transformation function, leaving only the skip connection. This regularizes and reduces the amount of computation.

![block_wise_stochastic_depth.png](https://storage.googleapis.com/docs.mosaicml.com/images/methods/block_wise_stochastic_depth.png)
![block_wise_stochastic_depth.png](../_images/block_wise_stochastic_depth.png)
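
A minimal sketch of block-wise dropping (not Composer's implementation): the block's transform is skipped with probability `drop_rate` during training and rescaled at evaluation, following Huang et al.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Wrap a residual block; during training, drop its transform with probability `drop_rate`."""
    def __init__(self, block: nn.Module, drop_rate: float = 0.2):
        super().__init__()
        self.block, self.drop_rate = block, drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(1).item() < self.drop_rate:
            return x                                 # skip connection only: block is "dropped"
        out = self.block(x)
        if not self.training:
            out = out * (1 - self.drop_rate)         # scale at eval to match the training expectation
        return x + out                               # normal residual computation

block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 64, 3, padding=1))
out = StochasticDepthBlock(block, drop_rate=0.2)(torch.randn(2, 64, 16, 16))
```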

## How to Use
