# MAMMOTH Sharing Schemes
MAMMOTH is designed as a flexible modular system, allowing users to configure, train, and test various sharing schemes. This tutorial will guide you through the process of setting up and experimenting with different sharing schemes, including:

- fully shared
- fully unshared
- encoder shared
- decoder shared

The configuration for each scheme is managed through YAML files, ensuring a seamless and customizable experience.


## Dataset
For this tutorial, we will use the [UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) dataset, which consists of manually translated UN documents from 1990 to 2014 in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.


Before diving into the sharing schemes, we need preprocessed data. You can download the preprocessed data using the following command:
```bash
wget
```

Additionally, we require the corresponding vocabularies for the dataset. Download the vocabularies with the following command:
```bash
wget https://mammoth-share.a3s.fi/vocab.tar.gz
```
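
The archive is a gzipped tarball; unpack it with `tar`, for example:
```bash
# Extract the vocabulary files into the current directory
tar -xzf vocab.tar.gz
```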


Now, let's walk through the sharing schemes to understand how each one works.


## Sharing Schemes Overview

Each scheme offers a different way of allocating encoder and decoder parameters across languages.

### 1. **Fully Unshared:**
- Each language maintains a distinct set of parameters for both encoder and decoder.
- No parameter sharing occurs between languages.
```yaml
train_ar-ar:
  dec_sharing_group:
    - ar
  enc_sharing_group:
    - ar
```
- `train_ar-ar`: This denotes the training configuration for the Arabic-to-Arabic language pair.
- `dec_sharing_group`: Specifies the decoder sharing group, indicating which languages share decoder parameters. In this case, only Arabic (ar) is included, meaning no sharing with other languages for decoding.
- `enc_sharing_group`: Denotes the encoder sharing group, signifying which languages share encoder parameters. Here, it's also set to only Arabic (ar), indicating no encoder parameter sharing with other languages.
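
The same pattern extends to cross-lingual tasks: each side names its own language as the only member of its group. A minimal sketch for an Arabic-to-English task (the pair is chosen purely for illustration):
```yaml
train_ar-en:
  dec_sharing_group:
    - en
  enc_sharing_group:
    - ar
```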

### 2. **Shared Encoder, Separate Decoder:**
- Encoder parameters are shared across all languages.
- Each language has a separate set of parameters for the decoder.
```yaml
train_ar-ar:
  dec_sharing_group:
    - ar
  enc_sharing_group:
    - all
```

### 3. **Separate Encoder, Shared Decoder:**
- Each language has a separate set of parameters for the encoder.
- Decoder parameters are shared across all languages.

```yaml
train_ar-en:
  dec_sharing_group:
    - all
  enc_sharing_group:
    - ar
```

### 4. **Fully Shared:**
- Both encoder and decoder parameters are shared across all languages.
- The entire model is shared among all language pairs.
```yaml
train_ar-ar:
  dec_sharing_group:
    - all
  enc_sharing_group:
    - all
```

You can conveniently download the complete configurations using the following command:
```bash
wget https://mammoth-share.a3s.fi/configs.tar.gz
```

These configurations provide a solid foundation for configuring, training, and testing the various sharing schemes in the MAMMOTH framework. Be sure to adapt the file paths to your own environment, and feel free to experiment with the settings to suit your needs.
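
As a rough illustration, the entries you will most likely edit are the per-task data paths. The field names below are assumptions modeled on MAMMOTH-style configs, so verify them against the downloaded files:
```yaml
# Illustrative sketch: field names (path_src, path_tgt) are assumptions;
# check the downloaded configs for the exact keys.
train_ar-en:
  path_src: /path/to/processed_data/ar-en/ar-en.ar.sp
  path_tgt: /path/to/processed_data/ar-en/ar-en.en.sp
```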

## Training Modular Systems


### 1. **Setup:**
To initiate the training process for MAMMOTH's modular systems, start by setting up the necessary environment variables:

```bash
export MAMMOTH=/path/to/mammoth
export CONFIG=/path/to/configs/config.yaml
```

### 2. **Training Command:**

Execute the following command to commence training:

```bash
srun /path/to/wrapper.sh $MAMMOTH/train.py \
-config $CONFIG \
-master_ip $SLURMD_NODENAME \
-master_port 9969
```

For the wrapper script, a minimal example looks like this:
```bash
#!/bin/bash
python -u "$@" --node_rank $SLURM_NODEID
```

This tutorial utilizes SLURM for job scheduling and parallel computing.
You can tailor the provided commands for your specific needs, adapting them to alternative job scheduling systems or standalone setups.
Ensure that the `config.yaml` file specifies the desired sharing scheme.
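
For reference, a minimal SLURM batch script wrapping the training command might look like the sketch below; the node count, GPU count, and time limit are placeholders to adapt:
```bash
#!/bin/bash
#SBATCH --job-name=mammoth-train
#SBATCH --nodes=2          # placeholder: number of nodes
#SBATCH --gres=gpu:4       # placeholder: GPUs per node
#SBATCH --time=72:00:00    # placeholder: wall-time limit

export MAMMOTH=/path/to/mammoth
export CONFIG=/path/to/configs/config.yaml

srun /path/to/wrapper.sh $MAMMOTH/train.py \
    -config $CONFIG \
    -master_ip $SLURMD_NODENAME \
    -master_port 9969
```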

### 3. **Testing Command:**

After training, use the following command to test the model:
```bash
python3 -u $MAMMOTH/translate.py \
--config $CONFIG \
--model "$checkpoint" \
--task_id train_$sl-$tl \
--src $processed_data/$lp/$lp.$sl.sp \
--output $out_path/$sl-$tl.${base}hyp.sp \
--gpu 0 --shard_size 0 \
--batch_size 512
```

Remember to replace `$checkpoint`, `$sl` (source language), `$tl` (target language), `$lp` (language pair), `$base` (an optional tag in the output filename), `$processed_data`, and `$out_path` with appropriate values.
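
For instance, to test an Arabic-to-English model, the variables could be set as follows; all values below are placeholders:
```bash
# Placeholder values for illustration; adjust to your setup
sl=ar; tl=en; lp=$sl-$tl
base=""                                   # optional tag for the output filename
processed_data=/path/to/processed_data
out_path=/path/to/outputs
checkpoint=/path/to/models/checkpoint.pt  # placeholder checkpoint name
```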

We provide a model checkpoint trained with the encoder-shared scheme described above.
```bash
wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
```
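
After unpacking the archive, point `$checkpoint` at one of the extracted files; the directory and file names below are placeholders:
```bash
tar -xzf encoder-shared-models.tar.gz
checkpoint=encoder-shared-models/checkpoint.pt  # placeholder path
```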

#### Notes:
- Make sure to adapt the paths and variables to your specific directory structure.
- Adjust the `--gpu` flag in the testing command based on your GPU availability.
- Ensure that the configuration file (`config.yaml`) contains the correct sharing scheme based on your experiment.

This tutorial serves as a general guide; refer to the specific configuration files for additional details and customization options. Feel free to adapt the commands to your own training and testing requirements, whatever job scheduling system you employ.
