mammoth sharing schemes

Helsinki-NLP · Feb 23, 2024 · 9057c65 · 9057c65
1 parent 338f990
commit 9057c65
Showing 1 changed file with 146 additions and 0 deletions.
diff --git a/docs/source/examples/sharing_schemes.md b/docs/source/examples/sharing_schemes.md
@@ -0,0 +1,146 @@
+# MAMMOTH Sharing Schemes
+MAMMOTH is designed as a flexible modular system, allowing users to configure, train, and test various sharing schemes. This tutorial will guide you through the process of setting up and experimenting with different sharing schemes, including:
+
+- fully shared
+- fully unshared
+- encoder shared
+- decoder shared
+
+The configuration for each scheme is managed through YAML files, ensuring a seamless and customizable experience.
+
+
+## Dataset
+For this tutorial, we will be utilizing the [UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) dataset, which consists of manually translated UN documents spanning the last 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.
+
+
+Before diving into the sharing schemes, we need to preprocess the data. You can download the processed data using the following command:
+```bash
+wget 
+```
+
+Additionally, we require the corresponding vocabularies for the dataset. Download the vocabularies with the following command:
+```bash 
+wget https://mammoth-share.a3s.fi/vocab.tar.gz
+```
+
+
+Now, let's explore an overview of the sharing schemes to better understand their functionalities.
+
+
+## Sharing Schemes Overview
+
+Let's delve into an overview of the MAMMOTH Sharing Schemes, each offering unique configurations for a flexible modular system.
+
+### 1. **Fully Unshared:**
+   - Each language maintains a distinct set of parameters for both encoder and decoder.
+   - No parameter sharing occurs between languages.
+```yaml
+  train_ar-ar:
+    dec_sharing_group:
+    - ar
+    enc_sharing_group:
+    - ar
+```
+- `train_ar-ar`: This denotes the training configuration for the Arabic-to-Arabic language pair.
+- `dec_sharing_group`: Specifies the decoder sharing group, indicating which languages share decoder parameters. In this case, only Arabic (ar) is included, meaning no sharing with other languages for decoding.
+- `enc_sharing_group`: Denotes the encoder sharing group, signifying which languages share encoder parameters. Here, it's also set to only Arabic (ar), indicating no encoder parameter sharing with other languages.
+
+### 2. **Shared Encoder, Separate Decoder:**
+   - Encoder parameters are shared across all languages.
+   - Each language has a separate set of parameters for the decoder.
+```yaml
+  train_ar-ar:
+    dec_sharing_group:
+    - ar
+    enc_sharing_group:
+    - all
+```
+
+### 3. **Separate Encoder, Shared Decoder:**
+   - Each language has a separate set of parameters for the encoder.
+   - Decoder parameters are shared across all languages.
+
+```yaml
+  train_ar-en:
+    dec_sharing_group:
+    - all
+    enc_sharing_group:
+    - ar
+```
+
+### 4. **Fully Shared:**
+   - Both encoder and decoder parameters are shared across all languages.
+   - The entire model is shared among all language pairs.
+```yaml
+  train_ar-ar:
+    dec_sharing_group:
+    - all
+    enc_sharing_group:
+    - all
+```
+
+You can conveniently download the complete configurations using the following command:
+```bash
+wget https://mammoth-share.a3s.fi/configs.tar.gz
+```
+
+These configurations provide a solid foundation for configuring, training, and testing various sharing schemes in the MAMMOTH framework. Ensure to modify the file paths according to your specific compute device configurations. Feel free to experiment and tailor these settings to suit your specific needs.
+
+## Training Modular Systems
+
+
+### 1. **Setup:**
+To initiate the training process for MAMMOTH's modular systems, start by setting up the necessary environment variables:
+
+```bash
+export MAMMOTH=/path/to/mammoth
+export CONFIG=/path/to/configs/config.yaml
+```
+
+#### 2. **Training Command:**
+
+Execute the following command to commence training:
+
+```bash
+srun /path/to/wrapper.sh $MAMMOTH/train.py \
+    -config $CONFIG \
+    -master_ip $SLURMD_NODENAME \
+    -master_port 9969
+```
+
+For the wrapper script, use an example like the one below:
+```bash
+python -u "$@" --node_rank $SLURM_NODEID
+```
+
+This tutorial utilizes SLURM for job scheduling and parallel computing.
+You can tailor the provided commands for your specific needs, adapting them to alternative job scheduling systems or standalone setups.
+Ensure that the `config.yaml` file specifies the desired sharing scheme.
+
+#### 3. **Testing Command:**
+
+After training, use the following command to test the model:
+```bash
+python3 -u $MAMMOTH/translate.py \
+    --config $CONFIG \
+    --model "$checkpoint" \
+    --task_id train_$sl-$tl \
+    --src $processed_data/$lp/$lp.$sl.sp \
+    --output $out_path/$sl-$tl.${base}hyp.sp \
+    --gpu 0 --shard_size 0 \
+    --batch_size 512
+```
+
+Remember to replace `$checkpoint`, `$sl` (source language), `$tl` (target language), `$lp` (language pair), `$processed_data`, and `$out_path` with appropriate values.
+
+We provide the model checkpoint trained using the aforementioned encoder shared scheme.
+```bash
+wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
+```
+
+#### Notes:
+- Make sure to adapt the paths and variables to your specific directory structure.
+- Adjust the `--gpu` flag in the testing command based on your GPU availability.
+- Ensure that the configuration file (`config.yaml`) contains the correct sharing scheme based on your experiment.
+
+This tutorial serves as a general guide, and it is recommended to refer to the specific configuration file for additional details and customization options. Feel free to explore and adapt the commands to suit your specific training and testing requirements, regardless of the job scheduling system you choose to employ.