update docs #59

Merged · 24 commits · Mar 12, 2024
151 changes: 151 additions & 0 deletions docs/source/examples/sharing_schemes.md
@@ -0,0 +1,151 @@
# MAMMOTH Sharing Schemes
MAMMOTH is designed as a flexible modular system, allowing users to configure, train, and test various sharing schemes. This tutorial will guide you through the process of setting up and experimenting with different sharing schemes, including:

- fully shared
- fully unshared
- encoder shared
- decoder shared

The configuration for each scheme is managed through YAML files, making it straightforward to switch between schemes and customize them.


## Dataset
For this tutorial, we will use the [UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) dataset, which consists of manually translated UN documents spanning 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.


Before diving into the sharing schemes, we need preprocessed data. You can download the preprocessed data using the following command:
```bash
wget https://mammoth-share.a3s.fi/unpc.tar
```

Additionally, we require the corresponding vocabularies for the dataset. Download the vocabularies with the following command:
```bash
wget https://mammoth-share.a3s.fi/vocab.tar.gz
```
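
Both archives need to be unpacked before use. Below is a minimal sketch; the directory names produced by the archives are assumptions, so list the contents first if you are unsure:
```bash
# Peek at the archive layout, then unpack the preprocessed data and the vocabularies.
tar -tf unpc.tar | head
tar -xf unpc.tar
tar -xzf vocab.tar.gz
```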


Now, let's look at the sharing schemes in more detail to understand how they work.


## Sharing Schemes Overview

MAMMOTH supports the four sharing schemes listed above, each offering a different configuration of parameter sharing between languages.

### 1. **Fully Unshared:**
- Each language maintains a distinct set of parameters for both encoder and decoder.
- No parameter sharing occurs between languages.
```yaml
train_ar-ar:
  dec_sharing_group:
    - ar
  enc_sharing_group:
    - ar
```
- `train_ar-ar`: This denotes the training configuration for the Arabic-to-Arabic language pair.
- `dec_sharing_group`: Specifies the decoder sharing group, indicating which languages share decoder parameters. In this case, only Arabic (ar) is included, meaning no sharing with other languages for decoding.
- `enc_sharing_group`: Denotes the encoder sharing group, signifying which languages share encoder parameters. Here, it's also set to only Arabic (ar), indicating no encoder parameter sharing with other languages.

### 2. **Shared Encoder, Separate Decoder:**
- Encoder parameters are shared across all languages.
- Each language has a separate set of parameters for the decoder.
```yaml
train_ar-ar:
  dec_sharing_group:
    - ar
  enc_sharing_group:
    - all
```

### 3. **Separate Encoder, Shared Decoder:**
- Each language has a separate set of parameters for the encoder.
- Decoder parameters are shared across all languages.

```yaml
train_ar-en:
  dec_sharing_group:
    - all
  enc_sharing_group:
    - ar
```

### 4. **Fully Shared:**
- Both encoder and decoder parameters are shared across all languages.
- The entire model is shared among all language pairs.
```yaml
train_ar-ar:
  dec_sharing_group:
    - all
  enc_sharing_group:
    - all
```

You can conveniently download the complete configurations using the following command:
```bash
wget https://mammoth-share.a3s.fi/configs.tar.gz
```
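
The archive contains the complete configurations referred to above. As a sketch, you can unpack it and check which sharing groups a given file defines; the file name used in the last command is an assumption, so inspect the extracted contents rather than relying on it:
```bash
# List the archive contents, then extract the configuration files.
tar -tzf configs.tar.gz
tar -xzf configs.tar.gz
# Check the sharing groups defined in one of the configs (file name is an assumption):
grep -A 2 -E '(enc|dec)_sharing_group' configs/sharing_enc.yaml | head -n 20
```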

These configurations provide a solid foundation for configuring, training, and testing the various sharing schemes in the MAMMOTH framework. Be sure to modify the file paths to match your own compute environment, and feel free to experiment and tailor the settings to your needs.

## Training Modular Systems


### 1. **Setup:**
To initiate the training process for MAMMOTH's modular systems, start by setting up the necessary environment variables:

```bash
export MAMMOTH=/path/to/mammoth
export CONFIG=/path/to/configs/config.yaml
```
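
As an optional sanity check (not part of the original recipe), you can verify that both paths resolve before submitting a job:
```bash
# Both paths must resolve for the training command below to work.
test -f "$MAMMOTH/train.py" || echo "check your MAMMOTH path"
test -r "$CONFIG" || echo "check your CONFIG path"
```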

### 2. **Training Command:**

Execute the following command to commence training:

```bash
srun /path/to/wrapper.sh $MAMMOTH/train.py \
-config $CONFIG \
-master_ip $SLURMD_NODENAME \
-master_port 9969
```

An example wrapper script looks like the following:
```bash
#!/bin/bash
# Pass the SLURM node rank to train.py along with any other arguments.
python -u "$@" --node_rank $SLURM_NODEID
```

This tutorial uses SLURM for job scheduling and parallel computing.
You can tailor the provided commands to your needs, adapting them to alternative job scheduling systems or standalone setups.
Ensure that the `config.yaml` file specifies the desired sharing scheme.
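
For reference, the pieces above can be combined into a single SLURM batch script. The sketch below is only illustrative: the `#SBATCH` resource values and the wrapper path are placeholders, not settings prescribed by MAMMOTH, so adjust them to your cluster.
```bash
#!/bin/bash
#SBATCH --job-name=mammoth-sharing-scheme
#SBATCH --nodes=2                 # placeholder: number of nodes
#SBATCH --ntasks-per-node=1       # one train.py launcher per node
#SBATCH --gres=gpu:4              # placeholder: GPUs per node
#SBATCH --time=24:00:00           # placeholder: wall-clock limit

export MAMMOTH=/path/to/mammoth
export CONFIG=/path/to/configs/config.yaml

# wrapper.sh forwards the SLURM node rank to train.py, as shown above.
srun /path/to/wrapper.sh $MAMMOTH/train.py \
    -config $CONFIG \
    -master_ip $SLURMD_NODENAME \
    -master_port 9969
```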

Training can also be run on a single GPU, in which case the wrapper script is not necessary; you can then train with the following command:
```bash
python -u $MAMMOTH/train.py -config $CONFIG -node_rank 0
```

### 3. **Inference Command:**

After training, use the following command to test the model:
```bash
python3 -u $MAMMOTH/translate.py \
--config $CONFIG \
--model "$checkpoint" \
--task_id train_$sl-$tl \
--src $processed_data/$lp/$lp.$sl.sp \
--output $out_path/$sl-$tl.${base}hyp.sp \
--gpu 0 --shard_size 0 \
--batch_size 512
```

Remember to replace `$checkpoint`, `$sl` (source language), `$tl` (target language), `$lp` (language pair), `$processed_data`, and `$out_path` with appropriate values; a worked example follows the checkpoint download below.

We provide a model checkpoint trained with the encoder-shared scheme described above.
```bash
wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
```
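
Below is a worked example that ties the inference command together with concrete values for a single Arabic-to-English task. All paths, including the checkpoint file name, are assumptions about how you unpacked the archives; substitute your actual locations.
```bash
# Unpack the provided encoder-shared checkpoint and inspect its contents first.
tar -xzf encoder-shared-models.tar.gz
ls encoder-shared-models/            # assumed directory name

# Hypothetical values: adjust to your own directory layout.
sl=ar; tl=en; lp=ar-en
processed_data=unpc                  # assumed location of the preprocessed UNPC data
out_path=translations
checkpoint=encoder-shared-models/checkpoint_step_100000.pt   # assumed file name
mkdir -p "$out_path"

python3 -u $MAMMOTH/translate.py \
    --config $CONFIG \
    --model "$checkpoint" \
    --task_id train_$sl-$tl \
    --src $processed_data/$lp/$lp.$sl.sp \
    --output $out_path/$sl-$tl.hyp.sp \
    --gpu 0 --shard_size 0 \
    --batch_size 512
```
Note that the output is SentencePiece-tokenized; detokenize it before computing metrics such as BLEU.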

#### Notes:
- Make sure to adapt the paths and variables to your specific directory structure.
- Adjust the `--gpu` flag in the testing command based on your GPU availability.
- Ensure that the configuration file (`config.yaml`) contains the correct sharing scheme based on your experiment.

This tutorial is intended as a general guide; refer to the specific configuration files for additional details and customization options. Feel free to adapt the commands to your own training and testing requirements, whichever job scheduling system you use.
37 changes: 1 addition & 36 deletions docs/source/examples/train_mammoth_101.md
@@ -31,42 +31,7 @@ Here's a high-level summary of the main processing steps. For each language in 'langs,'

We use a positional argument 'lang' that can accept one or more values, for specifying the languages (e.g., `bg` and `cs` as used in Europarl) to process.


```python
import argparse
import random

import tqdm
import sentencepiece as sp

parser = argparse.ArgumentParser()
parser.add_argument('lang', nargs='+')
langs = parser.parse_args().lang

sp_path = 'vocab/opusTC.mul.64k.spm'
spm = sp.SentencePieceProcessor(model_file=sp_path)

for lang in tqdm.tqdm(langs):
    en_side_in = f'{lang}-en/europarl-v7.{lang}-en.en'
    xx_side_in = f'{lang}-en/europarl-v7.{lang}-en.{lang}'
    with open(xx_side_in) as xx_stream, open(en_side_in) as en_stream:
        data = zip(map(str.strip, xx_stream), map(str.strip, en_stream))
        data = [(xx, en) for xx, en in tqdm.tqdm(data, leave=False, desc=f'read {lang}') if xx and en]  # drop empty lines
    random.shuffle(data)
    en_side_out = f'{lang}-en/valid.{lang}-en.en.sp'
    xx_side_out = f'{lang}-en/valid.{lang}-en.{lang}.sp'
    with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream:
        for xx, en in tqdm.tqdm(data[:1000], leave=False, desc=f'valid {lang}'):
            print(*spm.encode(xx, out_type=str), file=xx_stream)
            print(*spm.encode(en, out_type=str), file=en_stream)
    en_side_out = f'{lang}-en/train.{lang}-en.en.sp'
    xx_side_out = f'{lang}-en/train.{lang}-en.{lang}.sp'
    with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream:
        for xx, en in tqdm.tqdm(data[1000:], leave=False, desc=f'train {lang}'):
            print(*spm.encode(xx, out_type=str), file=xx_stream)
            print(*spm.encode(en, out_type=str), file=en_stream)
```
You're free to skip this step if you directly download the processed data. For details, see [this page](../prepare_data.md#europarl).

## Step 3: Configuration

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -31,6 +31,7 @@ Contents
install.md
prepare_data.md
examples/train_mammoth_101.md
examples/sharing_schemes.md

.. toctree::
:caption: Scripts
7 changes: 2 additions & 5 deletions docs/source/main.md
@@ -8,10 +8,7 @@ This portal provides a detailed documentation of the **MAMMOTH**: Modular Adapta
## Installation

```bash
pip install mammoth-nlp
```

Check out the [installation guide](install) to install in specific clusters.
@@ -58,4 +55,4 @@ We published [FoTraNMT](https://github.com/Helsinki-NLP/FoTraNMT) the ancestor of MAMMOTH
url = "https://aclanthology.org/2023.nodalida-1.24",
pages = "238--247"
}
```
3 changes: 3 additions & 0 deletions docs/source/modular_model.md
@@ -24,6 +24,9 @@ We break down mNMT training into a series of smaller “tasks”
- A task is done on a specific device
- A task corresponds to a specific (parallel) corpus

In short, a task corresponds to a specific model behavior.
In translation settings, a task will therefore correspond to a specific translation direction (say translating from Swahili to Catalan):
All training datapoints for this direction (i) must involve the same modules (pertaining to Swahili encoding and Catalan decoding); (ii) must be preprocessed with the same tokenizers; and (iii) can be grouped into a single bitext.
A centralized manager handles task synchronization.
This manager oversees the parallel execution of tasks, coordinating the flow of information between different modules, devices, and corpora to ensure a cohesive and synchronized training process.

74 changes: 54 additions & 20 deletions docs/source/prepare_data.md
@@ -1,74 +1,108 @@

# Prepare Data

Before running these scripts, make sure that you have [installed](quickstart) MAMMOTH, which includes the dependencies required below.


## Quickstart: Europarl

In the [Quickstart tutorial](quickstart), we assume that you will download and preprocess the Europarl data by following the steps below.

### Step 1: Download the data
[Europarl parallel corpus](https://www.statmt.org/europarl/) is a multilingual resource extracted from European Parliament proceedings and contains texts in 21 European languages. Download Release v7 (a further expanded and improved version of the Europarl corpus, released on 15 May 2012) from the original website, or download the data preprocessed by us:
```bash
wget https://mammoth101.a3s.fi/europarl.tar.gz
mkdir europarl_data
tar -xvzf europarl.tar.gz -C europarl_data
```
Note that the extracted dataset requires around 30 GB of disk space. Alternatively, you can download only the data for the three example languages (666 MB):
```bash
wget https://mammoth101.a3s.fi/europarl-3langs.tar.gz
mkdir europarl_data
tar -xvzf europarl-3langs.tar.gz -C europarl_data
```

We use a SentencePiece tokenizer trained on OPUS Tatoeba Challenge data with 64k vocabulary size. Download the SentencePiece model and the vocabulary:
```bash
# Download the SentencePiece model
wget https://mammoth101.a3s.fi/opusTC.mul.64k.spm
# Download the vocabulary
wget https://mammoth101.a3s.fi/opusTC.mul.vocab.onmt

mkdir vocab
mv opusTC.mul.64k.spm vocab/.
mv opusTC.mul.vocab.onmt vocab/.
```
If you would like to create and use a custom sentencepiece tokenizer, take a look at the OPUS tutorial below.

### Step 2: Tokenization
Then, read the parallel text data, process it, and generate output files for the training and validation sets.
Here's a high-level summary of the main processing steps. For each language in 'langs,'
- read parallel data files.
- clean the data by removing empty lines.
- shuffle the data randomly.
- tokenize the text using SentencePiece and write the tokenized data to separate output files for the training and validation sets.

In the snippet below, the languages to process are hard-coded in the `langs` list (e.g., `bg` and `cs` as used in Europarl); edit the list to cover the languages you need.

You're free to skip this step if you directly download the processed data.

```python
import random
import pathlib

import tqdm
import sentencepiece as sp

langs = ["bg", "cs"]

sp_path = 'vocab/opusTC.mul.64k.spm'
spm = sp.SentencePieceProcessor(model_file=sp_path)

input_dir = 'europarl_data/europarl'
output_dir = 'europarl_data/encoded'

for lang in tqdm.tqdm(langs):
    en_side_in = f'{input_dir}/{lang}-en/europarl-v7.{lang}-en.en'
    xx_side_in = f'{input_dir}/{lang}-en/europarl-v7.{lang}-en.{lang}'
    with open(xx_side_in) as xx_stream, open(en_side_in) as en_stream:
        data = zip(map(str.strip, xx_stream), map(str.strip, en_stream))
        data = [(xx, en) for xx, en in tqdm.tqdm(data, leave=False, desc=f'read {lang}') if xx and en]  # drop empty lines
    random.shuffle(data)
    pathlib.Path(output_dir).mkdir(exist_ok=True)
    en_side_out = f'{output_dir}/valid.{lang}-en.en.sp'
    xx_side_out = f'{output_dir}/valid.{lang}-en.{lang}.sp'
    with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream:
        for xx, en in tqdm.tqdm(data[:1000], leave=False, desc=f'valid {lang}'):
            print(*spm.encode(xx, out_type=str), file=xx_stream)
            print(*spm.encode(en, out_type=str), file=en_stream)
    en_side_out = f'{output_dir}/train.{lang}-en.en.sp'
    xx_side_out = f'{output_dir}/train.{lang}-en.{lang}.sp'
    with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream:
        for xx, en in tqdm.tqdm(data[1000:], leave=False, desc=f'train {lang}'):
            print(*spm.encode(xx, out_type=str), file=xx_stream)
            print(*spm.encode(en, out_type=str), file=en_stream)
```

The script will produce encoded datasets in `europarl_data/encoded` that you can then use for training.
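
If you want a quick look at the result, each output file contains one SentencePiece-tokenized sentence per line, for example:
```bash
# Show the first two encoded lines of the Bulgarian training split.
head -n 2 europarl_data/encoded/train.bg-en.bg.sp
```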

## UNPC
[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) consists of manually translated UN documents spanning 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.
We have preprocessed the data; you can download the processed version with:
```bash
wget https://mammoth-share.a3s.fi/unpc.tar
```
Alternatively, you can use the scripts provided in the tarball to process the data yourself.

If you use this corpus, please cite: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B. (2016). The United Nations Parallel Corpus. Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.


## OPUS 100

In this guide, we will also create a custom SentencePiece tokenizer.

To do that, you will need to build the SentencePiece command-line tools from source in your environment (the pip package alone is not sufficient).
Follow the instructions on [sentencepiece github](https://github.com/google/sentencepiece?tab=readme-ov-file#build-and-install-sentencepiece-command-line-tools-from-c-source).
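
For convenience, here is a condensed sketch of those upstream build steps; treat the linked README as authoritative (it also covers installing without root via a custom prefix):
```bash
# Build and install the SentencePiece command-line tools from source.
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake ..
make -j "$(nproc)"
sudo make install
sudo ldconfig -v
```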

After that, download the OPUS-100 dataset from [OPUS 100](https://opus.nlpl.eu/legacy/opus-100.php).

### Step 1: Set relevant paths, variables and download

@@ -217,4 +251,4 @@ cd $SP_PATH
> $DATA_PATH/zero-shot/$lp/opus.$lp-test.$tl.sp
cd $CUR_DIR
done
```