From 14c56498e5e65433fb466c096b83319a749a8c6f Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 20 Feb 2024 13:29:44 +0200 Subject: [PATCH 01/22] remove data proc --- docs/source/examples/train_mammoth_101.md | 37 +---------------------- 1 file changed, 1 insertion(+), 36 deletions(-) diff --git a/docs/source/examples/train_mammoth_101.md b/docs/source/examples/train_mammoth_101.md index 5618a6c8..db99fbd3 100644 --- a/docs/source/examples/train_mammoth_101.md +++ b/docs/source/examples/train_mammoth_101.md @@ -31,42 +31,7 @@ Here's a high-level summary of the main processing steps. For each language in ' We use a positional argument 'lang' that can accept one or more values, for specifying the languages (e.g., `bg` and `cs` as used in Europarl) to process. -You're free to skip this step if you directly download the processed data. - -```python -import argparse -import random - -import tqdm -import sentencepiece as sp - -parser = argparse.ArgumentParser() -parser.add_argument('lang', nargs='+') -langs = parser.parse_args().lang - -sp_path = 'vocab/opusTC.mul.64k.spm' -spm = sp.SentencePieceProcessor(model_file=sp_path) - -for lang in tqdm.tqdm(langs): - en_side_in = f'{lang}-en/europarl-v7.{lang}-en.en' - xx_side_in = f'{lang}-en/europarl-v7.{lang}-en.{lang}' - with open(xx_side_in) as xx_stream, open(en_side_in) as en_stream: - data = zip(map(str.strip, xx_stream), map(str.strip, en_stream)) - data = [(xx, en) for xx, en in tqdm.tqdm(data, leave=False, desc=f'read {lang}') if xx and en] # drop empty lines - random.shuffle(data) - en_side_out = f'{lang}-en/valid.{lang}-en.en.sp' - xx_side_out = f'{lang}-en/valid.{lang}-en.{lang}.sp' - with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream: - for xx, en in tqdm.tqdm(data[:1000], leave=False, desc=f'valid {lang}'): - print(*spm.encode(xx, out_type=str), file=xx_stream) - print(*spm.encode(en, out_type=str), file=en_stream) - en_side_out = f'{lang}-en/train.{lang}-en.en.sp' - xx_side_out = f'{lang}-en/train.{lang}-en.{lang}.sp' - with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream: - for xx, en in tqdm.tqdm(data[1000:], leave=False, desc=f'train {lang}'): - print(*spm.encode(xx, out_type=str), file=xx_stream) - print(*spm.encode(en, out_type=str), file=en_stream) -``` +You're free to skip this step if you directly download the processed data. For details, see [this page](../prepare_data.md#europarl). ## Step 3: Configuration From b11b9b742743ca4bf60c21c52a7b8c18e349fae9 Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 20 Feb 2024 13:55:21 +0200 Subject: [PATCH 02/22] clarify task --- docs/source/modular_model.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/source/modular_model.md b/docs/source/modular_model.md index 02d4d094..dc77d157 100644 --- a/docs/source/modular_model.md +++ b/docs/source/modular_model.md @@ -24,6 +24,9 @@ We break down mNMT training into a series of smaller “tasks” - A task is done on a specific device - A task corresponds to a specific (parallel) corpus +In short, a task corresponds to a specific model behavior. +In translation settings, a task will therefore correspond to a specific translation direction (say translating from Swahili to Catalan): +All training datapoints for this direction (i) must involve the same modules (pertaining to Swahili encoding and Catalan decoding); (ii) must be preprocessed with the same tokenizers; and (iii) can be grouped into a single bitext. A centralized manager handles tasks synchronization. 
This manager oversees the parallel execution of tasks, coordinating the flow of information between different modules, devices, and corpora to ensure a cohesive and synchronized training process. From beac3fdf1ddb505915ce517d77c575e963c73392 Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 20 Feb 2024 14:54:31 +0200 Subject: [PATCH 03/22] add translate for mammoth101 --- docs/source/quickstart.md | 84 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 82 insertions(+), 2 deletions(-) diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md index f47fefe6..bebbf44e 100644 --- a/docs/source/quickstart.md +++ b/docs/source/quickstart.md @@ -125,7 +125,7 @@ We recommend our [automatic configuration generation tool](config_config) for ge ### Step 3: Start training -Finally, launch the training script, for example, through the Slurm manager, via: +Now that you've prepared your data and configured the settings, it's time to initiate the training of your multilingual machine translation model using Mammoth. Follow these steps to launch the training script, for example, through the Slurm manager: ```bash python -u "$@" --node_rank $SLURM_NODEID -u ${PATH_TO_MAMMOTH}/train.py \ @@ -135,4 +135,84 @@ python -u "$@" --node_rank $SLURM_NODEID -u ${PATH_TO_MAMMOTH}/train.py \ -tensorboard -tensorboard_log_dir ${LOG_DIR}/${EXP_ID} ``` -A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md). \ No newline at end of file +Explanation of Command: + - `python -u "$@"`: Initiates the training script using Python. + - `--node_rank $SLURM_NODEID`: Specifies the node rank using the environment variable provided by Slurm. + - `-u ${PATH_TO_MAMMOTH}/train.py`: Specifies the path to the Mammoth training script. + - `-config ${CONFIG_DIR}/your_config.yml`: Specifies the path to your configuration file. + - `-save_model ${SAVE_DIR}/models/${EXP_ID}`: Defines the directory to save the trained models, incorporating an experiment identifier (`${EXP_ID}`). + - `-master_port 9974 -master_ip $SLURMD_NODENAME`: Sets the master port and IP for communication. + - `-tensorboard -tensorboard_log_dir ${LOG_DIR}/${EXP_ID}`: Enables TensorBoard logging, specifying the directory for TensorBoard logs. + +Your training process has been initiated through the Slurm manager, leveraging the specified configuration settings. Monitor the progress through the provided logging and visualization tools. Adjust parameters as needed for your specific training requirements. You can also run the command on other workstations by modifying the parameters accordingly. + + + +### Step 4: Translate + +Now that you have successfully trained your multilingual machine translation model using Mammoth, it's time to put it to use for translation. + +```bash +python3 -u $MAMMOTH/translate.py \ + --config "${CONFIG_DIR}/your_config.yml" \ + --model "$model_checkpoint" \ + --task_id "train_$src_lang-$tgt_lang" \ + --src "$path_to_src_language/$lang_pair.$src_lang.sp" \ + --output "$out_path/$src_lang-$tgt_lang.hyp.sp" \ + --gpu 0 --shard_size 0 \ + --batch_size 512 +``` + +Follow these steps to translate text with your trained model. +1. **Initiate Translation Script:** + Open your terminal and execute the translation script: + + ```bash + python3 -u $MAMMOTH/translate.py + ``` + +2. 
**Specify Configuration, Model, and Task:** + Provide necessary details using the following options: + + - Configuration File: + ```bash + --config "${CONFIG_DIR}/your_config.yml" + ``` + + - Model Checkpoint: + ```bash + --model "$model_checkpoint" + ``` + + - Translation Task: + ```bash + --task_id "train_$src_lang-$tgt_lang" + ``` + +3. **Specify Source Language File and Output Path:** + + Point to the source language file for translation: + + ```bash + --src "$path_to_src_language/$lang_pair.$src_lang.sp" + ``` + + Define the path for saving the translated output: + + ```bash + --output "$out_path/$src_lang-$tgt_lang.hyp.sp" + ``` + +4. **Specify GPU and Batch Size:** + Adjust GPU and batch size settings based on your requirements: + + ```bash + --gpu 0 --shard_size 0 --batch_size 512 + ``` + +Feel free to adjust these commands as needed for your specific use case. + +Congratulations! You've successfully translated text using your Mammoth model. Adjust the parameters as needed for your specific translation tasks. + +### Further reading +A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md), and a complete example for configuring different sharing schemes is available at [MAMMOTH sharing schemes](examples/sharing_schemes.md) \ No newline at end of file From 2d18b92eb46b8b203c457cc0953970a97578b46b Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 20 Feb 2024 15:09:01 +0200 Subject: [PATCH 04/22] shorten quickstart --- docs/source/quickstart.md | 55 +++++++-------------------------------- 1 file changed, 9 insertions(+), 46 deletions(-) diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md index bebbf44e..c9db37fb 100644 --- a/docs/source/quickstart.md +++ b/docs/source/quickstart.md @@ -163,54 +163,17 @@ python3 -u $MAMMOTH/translate.py \ --batch_size 512 ``` -Follow these steps to translate text with your trained model. -1. **Initiate Translation Script:** - Open your terminal and execute the translation script: +Follow these configs to translate text with your trained model. - ```bash - python3 -u $MAMMOTH/translate.py - ``` +- Provide necessary details using the following options: + - Configuration File: `--config "${CONFIG_DIR}/your_config.yml"` + - Model Checkpoint: `--model "$model_checkpoint"` + - Translation Task: `--task_id "train_$src_lang-$tgt_lang"` -2. **Specify Configuration, Model, and Task:** - Provide necessary details using the following options: - - - Configuration File: - ```bash - --config "${CONFIG_DIR}/your_config.yml" - ``` - - - Model Checkpoint: - ```bash - --model "$model_checkpoint" - ``` - - - Translation Task: - ```bash - --task_id "train_$src_lang-$tgt_lang" - ``` - -3. **Specify Source Language File and Output Path:** - - Point to the source language file for translation: - - ```bash - --src "$path_to_src_language/$lang_pair.$src_lang.sp" - ``` - - Define the path for saving the translated output: - - ```bash - --output "$out_path/$src_lang-$tgt_lang.hyp.sp" - ``` - -4. **Specify GPU and Batch Size:** - Adjust GPU and batch size settings based on your requirements: - - ```bash - --gpu 0 --shard_size 0 --batch_size 512 - ``` - -Feel free to adjust these commands as needed for your specific use case. 
+- Point to the source language file for translation: + `--src "$path_to_src_language/$lang_pair.$src_lang.sp"` +- Define the path for saving the translated output: `--output "$out_path/$src_lang-$tgt_lang.hyp.sp"` +- Adjust GPU and batch size settings based on your requirements: `--gpu 0 --shard_size 0 --batch_size 512` Congratulations! You've successfully translated text using your Mammoth model. Adjust the parameters as needed for your specific translation tasks. From 3df66bc77849d497203ffdfae52f037e586a0480 Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 20 Feb 2024 15:09:33 +0200 Subject: [PATCH 05/22] add toc tree --- docs/source/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/index.rst b/docs/source/index.rst index 6cb98339..f37b5afc 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -31,6 +31,7 @@ Contents install.md prepare_data.md examples/train_mammoth_101.md + examples/sharing_schemes.md .. toctree:: :caption: Scripts From 338f990152e5671004a84ea5ced9e4961f2ff6d9 Mon Sep 17 00:00:00 2001 From: JS Date: Fri, 23 Feb 2024 10:53:59 +0200 Subject: [PATCH 06/22] add unpc --- docs/source/prepare_data.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index 2041dc08..5d6ba40b 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -1,6 +1,16 @@ # Prepare Data +## UNPC +[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. +We preprocess the data. You can download the processed data by: +``` +wget https://mammoth-share.a3s.fi/unpc.tar +``` +Or you can use the scripts provided by the tarball to process the data yourself. + +For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016. + ## Europarl ### Step 1: Download the data From 9057c65fd306de4a5b65cb5abe59298171a4f8d2 Mon Sep 17 00:00:00 2001 From: JS Date: Fri, 23 Feb 2024 10:55:20 +0200 Subject: [PATCH 07/22] mammoth sharing schemes --- docs/source/examples/sharing_schemes.md | 146 ++++++++++++++++++++++++ 1 file changed, 146 insertions(+) create mode 100644 docs/source/examples/sharing_schemes.md diff --git a/docs/source/examples/sharing_schemes.md b/docs/source/examples/sharing_schemes.md new file mode 100644 index 00000000..70d3a361 --- /dev/null +++ b/docs/source/examples/sharing_schemes.md @@ -0,0 +1,146 @@ +# MAMMOTH Sharing Schemes +MAMMOTH is designed as a flexible modular system, allowing users to configure, train, and test various sharing schemes. This tutorial will guide you through the process of setting up and experimenting with different sharing schemes, including: + +- fully shared +- fully unshared +- encoder shared +- decoder shared + +The configuration for each scheme is managed through YAML files, ensuring a seamless and customizable experience. + + +## Dataset +For this tutorial, we will be utilizing the [UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) dataset, which consists of manually translated UN documents spanning the last 25 years (1990 to 2014) for the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish. + + +Before diving into the sharing schemes, we need to preprocess the data. 
You can download the processed data using the following command: +```bash +wget +``` + +Additionally, we require the corresponding vocabularies for the dataset. Download the vocabularies with the following command: +```bash +wget https://mammoth-share.a3s.fi/vocab.tar.gz +``` + + +Now, let's explore an overview of the sharing schemes to better understand their functionalities. + + +## Sharing Schemes Overview + +Let's delve into an overview of the MAMMOTH Sharing Schemes, each offering unique configurations for a flexible modular system. + +### 1. **Fully Unshared:** + - Each language maintains a distinct set of parameters for both encoder and decoder. + - No parameter sharing occurs between languages. +```yaml + train_ar-ar: + dec_sharing_group: + - ar + enc_sharing_group: + - ar +``` +- `train_ar-ar`: This denotes the training configuration for the Arabic-to-Arabic language pair. +- `dec_sharing_group`: Specifies the decoder sharing group, indicating which languages share decoder parameters. In this case, only Arabic (ar) is included, meaning no sharing with other languages for decoding. +- `enc_sharing_group`: Denotes the encoder sharing group, signifying which languages share encoder parameters. Here, it's also set to only Arabic (ar), indicating no encoder parameter sharing with other languages. + +### 2. **Shared Encoder, Separate Decoder:** + - Encoder parameters are shared across all languages. + - Each language has a separate set of parameters for the decoder. +```yaml + train_ar-ar: + dec_sharing_group: + - ar + enc_sharing_group: + - all +``` + +### 3. **Separate Encoder, Shared Decoder:** + - Each language has a separate set of parameters for the encoder. + - Decoder parameters are shared across all languages. + +```yaml + train_ar-en: + dec_sharing_group: + - all + enc_sharing_group: + - ar +``` + +### 4. **Fully Shared:** + - Both encoder and decoder parameters are shared across all languages. + - The entire model is shared among all language pairs. +```yaml + train_ar-ar: + dec_sharing_group: + - all + enc_sharing_group: + - all +``` + +You can conveniently download the complete configurations using the following command: +```bash +wget https://mammoth-share.a3s.fi/configs.tar.gz +``` + +These configurations provide a solid foundation for configuring, training, and testing various sharing schemes in the MAMMOTH framework. Ensure to modify the file paths according to your specific compute device configurations. Feel free to experiment and tailor these settings to suit your specific needs. + +## Training Modular Systems + + +### 1. **Setup:** +To initiate the training process for MAMMOTH's modular systems, start by setting up the necessary environment variables: + +```bash +export MAMMOTH=/path/to/mammoth +export CONFIG=/path/to/configs/config.yaml +``` + +#### 2. **Training Command:** + +Execute the following command to commence training: + +```bash +srun /path/to/wrapper.sh $MAMMOTH/train.py \ + -config $CONFIG \ + -master_ip $SLURMD_NODENAME \ + -master_port 9969 +``` + +For the wrapper script, use an example like the one below: +```bash +python -u "$@" --node_rank $SLURM_NODEID +``` + +This tutorial utilizes SLURM for job scheduling and parallel computing. +You can tailor the provided commands for your specific needs, adapting them to alternative job scheduling systems or standalone setups. +Ensure that the `config.yaml` file specifies the desired sharing scheme. + +#### 3. 
**Testing Command:** + +After training, use the following command to test the model: +```bash +python3 -u $MAMMOTH/translate.py \ + --config $CONFIG \ + --model "$checkpoint" \ + --task_id train_$sl-$tl \ + --src $processed_data/$lp/$lp.$sl.sp \ + --output $out_path/$sl-$tl.${base}hyp.sp \ + --gpu 0 --shard_size 0 \ + --batch_size 512 +``` + +Remember to replace `$checkpoint`, `$sl` (source language), `$tl` (target language), `$lp` (language pair), `$processed_data`, and `$out_path` with appropriate values. + +We provide the model checkpoint trained using the aforementioned encoder shared scheme. +```bash +wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz +``` + +#### Notes: +- Make sure to adapt the paths and variables to your specific directory structure. +- Adjust the `--gpu` flag in the testing command based on your GPU availability. +- Ensure that the configuration file (`config.yaml`) contains the correct sharing scheme based on your experiment. + +This tutorial serves as a general guide, and it is recommended to refer to the specific configuration file for additional details and customization options. Feel free to explore and adapt the commands to suit your specific training and testing requirements, regardless of the job scheduling system you choose to employ. \ No newline at end of file From 9e6b79f04f4347574f0b5857f9956197dacd81f8 Mon Sep 17 00:00:00 2001 From: JS Date: Fri, 23 Feb 2024 11:01:46 +0200 Subject: [PATCH 08/22] minor fixes --- docs/source/examples/sharing_schemes.md | 2 +- docs/source/prepare_data.md | 2 +- docs/source/quickstart.md | 6 +++++- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/source/examples/sharing_schemes.md b/docs/source/examples/sharing_schemes.md index 70d3a361..9734e79c 100644 --- a/docs/source/examples/sharing_schemes.md +++ b/docs/source/examples/sharing_schemes.md @@ -15,7 +15,7 @@ For this tutorial, we will be utilizing the [UNPC](https://opus.nlpl.eu/UNPC/cor Before diving into the sharing schemes, we need to preprocess the data. You can download the processed data using the following command: ```bash -wget +wget https://mammoth-share.a3s.fi/unpc.tar ``` Additionally, we require the corresponding vocabularies for the dataset. Download the vocabularies with the following command: diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index 5d6ba40b..8eaa1dfc 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -4,7 +4,7 @@ ## UNPC [UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. We preprocess the data. You can download the processed data by: -``` +```bash wget https://mammoth-share.a3s.fi/unpc.tar ``` Or you can use the scripts provided by the tarball to process the data yourself. diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md index c9db37fb..8d5d6d85 100644 --- a/docs/source/quickstart.md +++ b/docs/source/quickstart.md @@ -174,8 +174,12 @@ Follow these configs to translate text with your trained model. 
`--src "$path_to_src_language/$lang_pair.$src_lang.sp"` - Define the path for saving the translated output: `--output "$out_path/$src_lang-$tgt_lang.hyp.sp"` - Adjust GPU and batch size settings based on your requirements: `--gpu 0 --shard_size 0 --batch_size 512` +- We provide the model checkpoint trained using the encoder shared scheme described in [this tutorial](examples/sharing_schemes.md). + ```bash + wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz + ``` Congratulations! You've successfully translated text using your Mammoth model. Adjust the parameters as needed for your specific translation tasks. ### Further reading -A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md), and a complete example for configuring different sharing schemes is available at [MAMMOTH sharing schemes](examples/sharing_schemes.md) \ No newline at end of file +A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md), and a complete example for configuring different sharing schemes is available at [MAMMOTH sharing schemes](examples/sharing_schemes.md). \ No newline at end of file From e69d0aefe503f50a79be52295e6d0e66ccd01a11 Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Tue, 5 Mar 2024 10:55:39 +0200 Subject: [PATCH 09/22] Update prepare_data.md to make Europarl work --- docs/source/prepare_data.md | 38 ++++++++++++++++++++++++------------- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index 2041dc08..de0a022d 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -1,6 +1,9 @@ # Prepare Data +Before running these scripts, make sure that you have [installed](quickstart) Mamooth, which includes the dependencies required below. + + ## Europarl ### Step 1: Download the data @@ -8,7 +11,10 @@ download the processed data by us: ```bash wget https://mammoth101.a3s.fi/europarl.tar.gz +mkdir europarl_data +tar –xvzf europarl.tar.gz.1 -C europarl_data ``` +Note that the extracted dataset will require around 30GB of memory. We use a SentencePiece model trained on OPUS Tatoeba Challenge data with 64k vocabulary size. Download the SentencePiece model and the vocabulary: ```bash @@ -16,6 +22,10 @@ We use a SentencePiece model trained on OPUS Tatoeba Challenge data with 64k voc wget https://mammoth101.a3s.fi/opusTC.mul.64k.spm # Download the vocabulary wget https://mammoth101.a3s.fi/opusTC.mul.vocab.onmt + +mkdir vocab +mv opusTC.mul.64k.spm vocab/. +mv opusTC.mul.vocab.onmt vocab/. ``` @@ -27,45 +37,47 @@ Here's a high-level summary of the main processing steps. For each language in ' - shuffle the data randomly. - tokenizes the text using SentencePiece and writes the tokenized data to separate output files for training and validation sets. -We use a positional argument 'lang' that can accept one or more values, for specifying the languages (e.g., `bg` and `cs` as used in Europarl) to process. - You're free to skip this step if you directly download the processed data. 
```python -import argparse import random +import pathlib import tqdm import sentencepiece as sp -parser = argparse.ArgumentParser() -parser.add_argument('lang', nargs='+') -langs = parser.parse_args().lang +langs = ["bg", "cs"] sp_path = 'vocab/opusTC.mul.64k.spm' spm = sp.SentencePieceProcessor(model_file=sp_path) +input_dir = 'europarl_data/europarl' +output_dir = 'europarl_data/encoded' + for lang in tqdm.tqdm(langs): - en_side_in = f'{lang}-en/europarl-v7.{lang}-en.en' - xx_side_in = f'{lang}-en/europarl-v7.{lang}-en.{lang}' + en_side_in = f'{input_dir}/{lang}-en/europarl-v7.{lang}-en.en' + xx_side_in = f'{input_dir}/{lang}-en/europarl-v7.{lang}-en.{lang}' with open(xx_side_in) as xx_stream, open(en_side_in) as en_stream: data = zip(map(str.strip, xx_stream), map(str.strip, en_stream)) data = [(xx, en) for xx, en in tqdm.tqdm(data, leave=False, desc=f'read {lang}') if xx and en] # drop empty lines random.shuffle(data) - en_side_out = f'{lang}-en/valid.{lang}-en.en.sp' - xx_side_out = f'{lang}-en/valid.{lang}-en.{lang}.sp' + pathlib.Path(output_dir).mkdir(exist_ok=True) + en_side_out = f'{output_dir}/valid.{lang}-en.en.sp' + xx_side_out = f'{output_dir}/valid.{lang}-en.{lang}.sp' with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream: for xx, en in tqdm.tqdm(data[:1000], leave=False, desc=f'valid {lang}'): print(*spm.encode(xx, out_type=str), file=xx_stream) print(*spm.encode(en, out_type=str), file=en_stream) - en_side_out = f'{lang}-en/train.{lang}-en.en.sp' - xx_side_out = f'{lang}-en/train.{lang}-en.{lang}.sp' + en_side_out = f'{output_dir}/train.{lang}-en.en.sp' + xx_side_out = f'{output_dir}/train.{lang}-en.{lang}.sp' with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream: for xx, en in tqdm.tqdm(data[1000:], leave=False, desc=f'train {lang}'): print(*spm.encode(xx, out_type=str), file=xx_stream) print(*spm.encode(en, out_type=str), file=en_stream) ``` +The script will produce encoded datasets in `europarl_data/encoded` that you can further use for the training. + ## OPUS 100 To get started, download the opus 100 dataset from [OPUS 100](https://opus.nlpl.eu/opus-100.php) @@ -217,4 +229,4 @@ cd $SP_PATH > $DATA_PATH/zero-shot/$lp/opus.$lp-test.$tl.sp cd $CUR_DIR done -``` \ No newline at end of file +``` From f7ffb9faa834729d6388171074d9228e91fb3a74 Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Tue, 5 Mar 2024 15:11:40 +0200 Subject: [PATCH 10/22] Update quickstart.md to work out-of-box --- docs/source/quickstart.md | 167 +++++++++++++++++++++++++++++--------- 1 file changed, 129 insertions(+), 38 deletions(-) diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md index f47fefe6..811c4de2 100644 --- a/docs/source/quickstart.md +++ b/docs/source/quickstart.md @@ -2,52 +2,81 @@ # Quickstart +Mammoth is a library for training modular language models supporting multi-node multi-GPU training. + +In the example below, we will show you how to configure Mamooth to train a machine translation model with language-specific encoders and decoders. + ### Step 0: Install mammoth ```bash -git clone https://github.com/Helsinki-NLP/mammoth.git -cd mammoth -pip3 install -e . -pip3 install sentencepiece==0.1.97 sacrebleu==2.3.1 +pip install git+https://github.com/Helsinki-NLP/mammoth.git ``` Check out the [installation guide](install) to install in specific clusters. ### Step 1: Prepare the data -Prepare the data for training. You can refer to the data preparation [tutorial](prepare_data) for more details. 
+Before running the training, we will download data for chosen pairs of languages and create a sentencepiece tokenizer for the model. + +**Refer to the data preparation [tutorial](prepare_data) for more details.** +In the following steps, we assume that you already have an encoded dataset containing `*.sp` file for `europarl` dataset, and languages `cs` and `bg`. Thus, your data directory `europarl_data/encoded` should contain 8 files in a format `{train/valid}.{cs/bg}-en.{cs/bg}.sp`. If you use other datasets, please update the paths in the configurations below. ### Step 2: Configurations -You will need to configure your training settings. -Below is a list of configuration examples: + +Mamooth uses configurations to build a new transformer model and configure your training settings, such as which modules are trained with the data from which languages. + +Below are a few examples of training configurations that will work for you out-of-box in a one-node, two-GPU environment.
Task-specific encoders and decoders +In this example, we create a model with encoders and decoders **shared** for the specified languages. This is defined by `enc_sharing_group` and `enc_sharing_group`. + ```yaml +# TRAINING CONFIG +world_size: 2 +gpu_ranks: [0, 1] + +batch_type: tokens +batch_size: 4096 + +# INPUT/OUTPUT VOCABULARY CONFIG + +src_vocab: + bg: vocab/opusTC.mul.vocab.onmt + cs: vocab/opusTC.mul.vocab.onmt + en: vocab/opusTC.mul.vocab.onmt +tgt_vocab: + cs: vocab/opusTC.mul.vocab.onmt + en: vocab/opusTC.mul.vocab.onmt + +# MODEL CONFIG + +model_dim: 512 + tasks: train_bg-en: src_tgt: bg-en enc_sharing_group: [bg] dec_sharing_group: [en] node_gpu: "0:0" - path_src: /path/to/train.bg-en.bg - path_tgt: /path/to/train.bg-en.en + path_src: europarl_data/encoded/train.bg-en.bg.sp + path_tgt: europarl_data/encoded/train.bg-en.en.sp train_cs-en: src_tgt: cs-en enc_sharing_group: [cs] dec_sharing_group: [en] node_gpu: "0:1" - path_src: /path/to/train.cs-en.cs - path_tgt: /path/to/train.cs-en.en + path_src: europarl_data/encoded/train.cs-en.cs.sp + path_tgt: europarl_data/encoded/train.cs-en.en.sp train_en-cs: src_tgt: en-cs enc_sharing_group: [en] dec_sharing_group: [cs] node_gpu: "0:1" - path_src: /path/to/train.cs-en.en - path_tgt: /path/to/train.cs-en.cs + path_src: europarl_data/encoded/train.cs-en.en.sp + path_tgt: europarl_data/encoded/train.cs-en.cs.sp enc_layers: [6] dec_layers: [6] @@ -58,29 +87,52 @@ dec_layers: [6]
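The `node_gpu` assignments in a config like the one above have to stay consistent with `world_size` and `gpu_ranks`, which is easy to break when copying task blocks around. Below is a minimal sanity-check sketch, not part of Mammoth itself; it assumes the config has been saved as `my_config.yaml` and that PyYAML is available.

```python
# Sanity check (illustrative only): do the node_gpu assignments in the
# tasks agree with world_size? Assumes the config was saved as
# my_config.yaml; adjust the path for your own file.
import yaml

with open('my_config.yaml') as f:
    config = yaml.safe_load(f)

# Collect every distinct "node:gpu" pair referenced by a task.
devices = set()
for name, task in config['tasks'].items():
    node, gpu = (int(x) for x in task['node_gpu'].split(':'))
    devices.add((node, gpu))
    print(f'{name}: node {node}, gpu {gpu}')

if len(devices) != config['world_size']:
    print(f"warning: {len(devices)} distinct devices, but world_size is {config['world_size']}")
else:
    print(f'ok: {len(devices)} devices match world_size')
```

For the example above, this would report two devices (`0:0` and `0:1`), matching `world_size: 2`.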
Arbitrarily shared layers in encoders and task-specific decoders +The training and vocab config is the same as in the previous example. + ```yaml +# TRAINING CONFIG +world_size: 2 +gpu_ranks: [0, 1] + +batch_type: tokens +batch_size: 4096 + +# INPUT/OUTPUT VOCABULARY CONFIG + +src_vocab: + bg: vocab/opusTC.mul.vocab.onmt + cs: vocab/opusTC.mul.vocab.onmt + en: vocab/opusTC.mul.vocab.onmt +tgt_vocab: + cs: vocab/opusTC.mul.vocab.onmt + en: vocab/opusTC.mul.vocab.onmt + +# MODEL CONFIG + +model_dim: 512 + tasks: train_bg-en: src_tgt: bg-en enc_sharing_group: [bg, all] dec_sharing_group: [en] node_gpu: "0:0" - path_src: /path/to/train.bg-en.bg - path_tgt: /path/to/train.bg-en.en + path_src: europarl_data/encoded/train.bg-en.bg.sp + path_tgt: europarl_data/encoded/train.bg-en.en.sp train_cs-en: src_tgt: cs-en enc_sharing_group: [cs, all] dec_sharing_group: [en] node_gpu: "0:1" - path_src: /path/to/train.cs-en.cs - path_tgt: /path/to/train.cs-en.en + path_src: europarl_data/encoded/train.cs-en.cs.sp + path_tgt: europarl_data/encoded/train.cs-en.en.sp train_en-cs: src_tgt: en-cs enc_sharing_group: [en, all] dec_sharing_group: [cs] node_gpu: "0:1" - path_src: /path/to/train.cs-en.en - path_tgt: /path/to/train.cs-en.cs + path_src: europarl_data/encoded/train.cs-en.en.sp + path_tgt: europarl_data/encoded/train.cs-en.cs.sp enc_layers: [4, 4] dec_layers: [4] @@ -90,49 +142,88 @@ dec_layers: [4]
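One way to read `enc_layers: [4, 4]` together with a two-element `enc_sharing_group` such as `[cs, all]` is as two stacked encoder blocks: the first four layers live in the language-specific group and the last four in the group shared by every task. The sketch below only lists the distinct encoder modules such a config appears to imply; it is an illustration under that reading, with `my_config.yaml` again standing in for your saved config.

```python
# Illustration only: enumerate the encoder layer stacks implied by pairing
# each task's enc_sharing_group with the entries of enc_layers.
import yaml

with open('my_config.yaml') as f:   # assumed file name
    config = yaml.safe_load(f)

enc_layers = config['enc_layers']   # e.g. [4, 4]
stacks = {}                         # (stack index, sharing group) -> depth
for task in config['tasks'].values():
    groups = task['enc_sharing_group']
    assert len(groups) == len(enc_layers), 'expect one group per enc_layers entry'
    for idx, group in enumerate(groups):
        stacks[(idx, group)] = enc_layers[idx]

for (idx, group), depth in sorted(stacks.items()):
    print(f'encoder stack {idx}: group "{group}", {depth} layers')
```

With the three tasks above, this lists four encoder modules: one four-layer stack each for `bg`, `cs` and `en`, plus one four-layer stack for `all`.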
Non-modular multilingual system +In this example, we share the input/output vocabulary over all languages. Hence, we define a vocabulary for an `all` language, that we use in the definition of the model. + ```yaml +# TRAINING CONFIG +world_size: 2 +gpu_ranks: [0, 1] + +batch_type: tokens +batch_size: 4096 + +# INPUT/OUTPUT VOCABULARY CONFIG + +src_vocab: + all: vocab/opusTC.mul.vocab.onmt +tgt_vocab: + all: vocab/opusTC.mul.vocab.onmt + +# MODEL CONFIG + +model_dim: 512 + tasks: train_bg-en: src_tgt: all-all - enc_sharing_group: [all] - dec_sharing_group: [all] + enc_sharing_group: [shared_enc] + dec_sharing_group: [shared_dec] node_gpu: "0:0" - path_src: /path/to/train.bg-en.bg - path_tgt: /path/to/train.bg-en.en + path_src: europarl_data/encoded/train.bg-en.bg.sp + path_tgt: europarl_data/encoded/train.bg-en.en.sp train_cs-en: src_tgt: all-all - enc_sharing_group: [all] - dec_sharing_group: [all] + enc_sharing_group: [shared_enc] + dec_sharing_group: [shared_dec] node_gpu: "0:1" - path_src: /path/to/train.cs-en.cs - path_tgt: /path/to/train.cs-en.en + path_src: europarl_data/encoded/train.cs-en.cs.sp + path_tgt: europarl_data/encoded/train.cs-en.en.sp train_en-cs: src_tgt: all-all - enc_sharing_group: [all] - dec_sharing_group: [all] + enc_sharing_group: [shared_enc] + dec_sharing_group: [shared_dec] node_gpu: "0:1" - path_src: /path/to/train.cs-en.en - path_tgt: /path/to/train.cs-en.cs + path_src: europarl_data/encoded/train.cs-en.en.sp + path_tgt: europarl_data/encoded/train.cs-en.cs.sp enc_layers: [6] dec_layers: [6] ```
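Whichever of the three schemes you start from, a frequent source of startup errors is a mistyped corpus or vocabulary path. The sketch below is a small pre-flight check, again assuming the config is saved as `my_config.yaml`; it is not a Mammoth tool, just plain Python over the YAML.

```python
# Pre-flight check (illustrative): verify that every vocabulary and corpus
# file referenced in the config actually exists on disk.
import pathlib
import yaml

with open('my_config.yaml') as f:   # assumed file name
    config = yaml.safe_load(f)

missing = []
for side in ('src_vocab', 'tgt_vocab'):
    for lang, path in config.get(side, {}).items():
        if not pathlib.Path(path).is_file():
            missing.append(f'{side}[{lang}]: {path}')

for name, task in config['tasks'].items():
    for key in ('path_src', 'path_tgt'):
        if not pathlib.Path(task[key]).is_file():
            missing.append(f'{name}.{key}: {task[key]}')

print('\n'.join(missing) if missing else 'all referenced files found')
```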
+**To proceed, copy-paste one of these configurations into a new file named `my_config.yaml`.** + +For further information, check out the documentation of all parameters in **[train.py](options/train)**. + +For more complex scenarios, we recommend our [automatic configuration generation tool](config_config) for generating your configurations. + +## Step 3: Start training -We recommend our [automatic configuration generation tool](config_config) for generating your configurations. +The running script will slightly differ depending on whether you want to run the training in a single-node (i.e. a single-machine) or multi-node setting: + +### Single-node training + +If you want to run your training on a single machine, simply run a python script `train.py`, possibly with a definition of your desired GPUs. + +```shell +CUDA_VISIBLE_DEVICES=0,1 python train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir -node_rank 0 +``` +Note that when running `train.py`, you can use all the parameters from [train.py](options/train) as cmd arguments. In the case of duplicate arguments, the cmd parameters override the ones found in your config.yaml. -### Step 3: Start training +### Multi-node training -Finally, launch the training script, for example, through the Slurm manager, via: +For the multi-node training, launch the training script, for example, through the Slurm manager, via: ```bash -python -u "$@" --node_rank $SLURM_NODEID -u ${PATH_TO_MAMMOTH}/train.py \ - -config ${CONFIG_DIR}/your_config.yml \ - -save_model ${SAVE_DIR}/models/${EXP_ID} \ +python -u "$@" --node_rank $SLURM_NODEID -u train.py \ + -config my_config.yaml \ + -save_model output_dir \ -master_port 9974 -master_ip $SLURMD_NODENAME \ -tensorboard -tensorboard_log_dir ${LOG_DIR}/${EXP_ID} ``` + From 0e9209b1fe584803efe5806cd987a45f4b3653bf Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Tue, 5 Mar 2024 15:14:32 +0200 Subject: [PATCH 11/22] Update setup.py --- setup.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/setup.py b/setup.py index 1dc4e9b0..7888ca63 100644 --- a/setup.py +++ b/setup.py @@ -32,6 +32,8 @@ "pytest==7.0.1", "pyyaml", "timeout_decorator", + "sentencepiece==0.1.97", # TODO: do we need these fixed? + "sacrebleu==2.3.1" ], entry_points={ "console_scripts": [ From 210e9670a7b5e671e717ee891b064fad7630d95a Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Fri, 8 Mar 2024 14:26:54 +0200 Subject: [PATCH 12/22] Update prepare_data.md --- docs/source/prepare_data.md | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index de0a022d..93b149ab 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -1,10 +1,11 @@ - # Prepare Data Before running these scripts, make sure that you have [installed](quickstart) Mamooth, which includes the dependencies required below. -## Europarl +## Quickstart: Europarl + +In the quickstart tutorial, we have ### Step 1: Download the data [Europarl parallel corpus](https://www.statmt.org/europarl/) is a multilingual resource extracted from European Parliament proceedings and contains texts in 21 European languages. Download the Release v7 - a further expanded and improved version of the Europarl corpus on 15 May 2012 - from the original website or @@ -16,7 +17,7 @@ tar –xvzf europarl.tar.gz.1 -C europarl_data ``` Note that the extracted dataset will require around 30GB of memory. 
-We use a SentencePiece model trained on OPUS Tatoeba Challenge data with 64k vocabulary size. Download the SentencePiece model and the vocabulary: +We use a SentencePiece tokenizer trained on OPUS Tatoeba Challenge data with 64k vocabulary size. Download the SentencePiece model and the vocabulary: ```bash # Download the SentencePiece model wget https://mammoth101.a3s.fi/opusTC.mul.64k.spm @@ -27,10 +28,10 @@ mkdir vocab mv opusTC.mul.64k.spm vocab/. mv opusTC.mul.vocab.onmt vocab/. ``` - +If you would like to create and use a custom sentencepiece tokenizer, take a look at the OPUS tutorial below. ### Step 2: Tokenization -Then, read parallel text data, processes it, and generate output files for training and validation sets. +Then, read parallel text data, processes it, and generates output files for training and validation sets. Here's a high-level summary of the main processing steps. For each language in 'langs,' - read parallel data files. - clean the data by removing empty lines. @@ -79,8 +80,14 @@ for lang in tqdm.tqdm(langs): The script will produce encoded datasets in `europarl_data/encoded` that you can further use for the training. -## OPUS 100 -To get started, download the opus 100 dataset from [OPUS 100](https://opus.nlpl.eu/opus-100.php) +## OPUS 100 + +In this guideline, we will also create our custom sentencepiece tokenizer. + +To do that, you will also need to compile a sentencepiece installation in your environment (not just pip install). +Follow the instructions on [sentencepiece github](https://github.com/google/sentencepiece?tab=readme-ov-file#build-and-install-sentencepiece-command-line-tools-from-c-source). + +After that, download the opus 100 dataset from [OPUS 100](https://opus.nlpl.eu/opus-100.php) ### Step 1: Set relevant paths, variables and download From d42b89f77c11a591918b742b1b81df68d71b39d7 Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Fri, 8 Mar 2024 14:29:05 +0200 Subject: [PATCH 13/22] Update prepare_data.md --- docs/source/prepare_data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index 93b149ab..3abac3be 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -5,7 +5,7 @@ Before running these scripts, make sure that you have [installed](quickstart) Ma ## Quickstart: Europarl -In the quickstart tutorial, we have +In the [Quickstart tutorial](quickstart), we assume that you will download and preprocess the Europarl data by following the steps below. ### Step 1: Download the data [Europarl parallel corpus](https://www.statmt.org/europarl/) is a multilingual resource extracted from European Parliament proceedings and contains texts in 21 European languages. Download the Release v7 - a further expanded and improved version of the Europarl corpus on 15 May 2012 - from the original website or From ed571596de3b63ef48cf468edea8123f6f200f16 Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Fri, 8 Mar 2024 14:46:21 +0200 Subject: [PATCH 14/22] main.md: single-line installation --- docs/source/main.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/docs/source/main.md b/docs/source/main.md index 702f790c..f8499f99 100644 --- a/docs/source/main.md +++ b/docs/source/main.md @@ -8,10 +8,7 @@ This portal provides a detailed documentation of the **MAMMOTH**: Modular Adapta ## Installation ```bash -git clone https://github.com/Helsinki-NLP/mammoth.git -cd mammoth -pip3 install -e . 
-pip3 install sentencepiece==0.1.97 sacrebleu==2.3.1 +pip install git+https://github.com/Helsinki-NLP/mammoth.git ``` Check out the [installation guide](install) to install in specific clusters. @@ -58,4 +55,4 @@ We published [FoTraNMT](https://github.com/Helsinki-NLP/FoTraNMT) the ancestor o url = "https://aclanthology.org/2023.nodalida-1.24", pages = "238--247" } -``` \ No newline at end of file +``` From 81dfaebc6f619352d3bc487563e51379517f9c9c Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Fri, 8 Mar 2024 15:04:47 +0200 Subject: [PATCH 15/22] prepare_data.md: legacy OPUS 100 link --- docs/source/prepare_data.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index 9cf76ca0..efb9a441 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -1,4 +1,4 @@ -# Prepare Data +![image](https://github.com/stefanik12/mammoth/assets/8227868/4e65b62b-980d-459f-a9bc-ea50e398924f)# Prepare Data Before running these scripts, make sure that you have [installed](quickstart) Mamooth, which includes the dependencies required below. @@ -97,7 +97,7 @@ In this guideline, we will also create our custom sentencepiece tokenizer. To do that, you will also need to compile a sentencepiece installation in your environment (not just pip install). Follow the instructions on [sentencepiece github](https://github.com/google/sentencepiece?tab=readme-ov-file#build-and-install-sentencepiece-command-line-tools-from-c-source). -After that, download the opus 100 dataset from [OPUS 100](https://opus.nlpl.eu/opus-100.php) +After that, download the opus 100 dataset from [OPUS 100](https://opus.nlpl.eu/legacy/opus-100.php) ### Step 1: Set relevant paths, variables and download From afe53e0da94a39b49d78963671d3dfba88a4d6fb Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Fri, 8 Mar 2024 15:15:25 +0200 Subject: [PATCH 16/22] Update prepare_data.md --- docs/source/prepare_data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index efb9a441..a10d3805 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -1,4 +1,4 @@ -![image](https://github.com/stefanik12/mammoth/assets/8227868/4e65b62b-980d-459f-a9bc-ea50e398924f)# Prepare Data +# Prepare Data Before running these scripts, make sure that you have [installed](quickstart) Mamooth, which includes the dependencies required below. From 83c500b30f3c096dffd49b5fd50bcba6db07df71 Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Mon, 11 Mar 2024 16:18:49 +0200 Subject: [PATCH 17/22] pip install mammoth-nlp in quickstart.md --- docs/source/quickstart.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md index e978a949..aba25101 100644 --- a/docs/source/quickstart.md +++ b/docs/source/quickstart.md @@ -9,7 +9,7 @@ In the example below, we will show you how to configure Mamooth to train a machi ### Step 0: Install mammoth ```bash -pip install git+https://github.com/Helsinki-NLP/mammoth.git +pip install mammoth-nlp ``` Check out the [installation guide](install) to install in specific clusters. @@ -270,4 +270,4 @@ Follow these configs to translate text with your trained model. Congratulations! You've successfully translated text using your Mammoth model. Adjust the parameters as needed for your specific translation tasks. 
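The file written by `--output` is still SentencePiece-tokenized. If you want plain text, or a quick BLEU score, the sketch below decodes the pieces with the same SentencePiece model and scores the result with `sacrebleu`; the hypothesis and reference file names are placeholders for your own files.

```python
# Detokenize SentencePiece output and compute corpus BLEU.
# File names are placeholders; point them at your own hypothesis,
# reference, and SentencePiece model.
import sacrebleu
import sentencepiece as sp

spm = sp.SentencePieceProcessor(model_file='vocab/opusTC.mul.64k.spm')

with open('bg-en.hyp.sp') as f:
    hypotheses = [spm.decode(line.strip().split()) for line in f]

with open('test.bg-en.en') as f:   # plain-text reference, one line per segment
    references = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hypotheses, [references]))
```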
### Further reading -A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md), and a complete example for configuring different sharing schemes is available at [MAMMOTH sharing schemes](examples/sharing_schemes.md). \ No newline at end of file +A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md), and a complete example for configuring different sharing schemes is available at [MAMMOTH sharing schemes](examples/sharing_schemes.md). From 2d5c6c95713d763bf5f4213502033279f5169c72 Mon Sep 17 00:00:00 2001 From: Michal Stefanik Date: Mon, 11 Mar 2024 16:19:07 +0200 Subject: [PATCH 18/22] pip install mammoth-nlp in main.md --- docs/source/main.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/main.md b/docs/source/main.md index f8499f99..217bad85 100644 --- a/docs/source/main.md +++ b/docs/source/main.md @@ -8,7 +8,7 @@ This portal provides a detailed documentation of the **MAMMOTH**: Modular Adapta ## Installation ```bash -pip install git+https://github.com/Helsinki-NLP/mammoth.git +pip install mammoth-nlp ``` Check out the [installation guide](install) to install in specific clusters. From aa770c97a97b5dfc58aba017ee2397707d01306d Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 12 Mar 2024 11:11:20 +0200 Subject: [PATCH 19/22] fix docs: quickstart and sharing_schemes. --- docs/source/examples/sharing_schemes.md | 7 ++++++- docs/source/prepare_data.md | 9 +++++++-- docs/source/quickstart.md | 13 +++++++------ 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/docs/source/examples/sharing_schemes.md b/docs/source/examples/sharing_schemes.md index 9734e79c..cf3caa66 100644 --- a/docs/source/examples/sharing_schemes.md +++ b/docs/source/examples/sharing_schemes.md @@ -117,7 +117,12 @@ This tutorial utilizes SLURM for job scheduling and parallel computing. You can tailor the provided commands for your specific needs, adapting them to alternative job scheduling systems or standalone setups. Ensure that the `config.yaml` file specifies the desired sharing scheme. -#### 3. **Testing Command:** +The training can be run on a single GPU in which case the wrapper wouldn't be necessary. In this case, you can train with the following command. +```bash +python -u $MAMMOTH/train.py -config $CONFIG -node_rank 0 +``` + +#### 3. **Inference Command:** After training, use the following command to test the model: ```bash diff --git a/docs/source/prepare_data.md b/docs/source/prepare_data.md index a10d3805..6d467869 100644 --- a/docs/source/prepare_data.md +++ b/docs/source/prepare_data.md @@ -13,9 +13,14 @@ download the processed data by us: ```bash wget https://mammoth101.a3s.fi/europarl.tar.gz mkdir europarl_data -tar –xvzf europarl.tar.gz.1 -C europarl_data +tar –xvzf europarl.tar.gz -C europarl_data +``` +Note that the extracted dataset will require around 30GB of memory. Alternatively, you can only download the data for the three example languages (666M). +```bash +wget https://mammoth101.a3s.fi/europarl-3langs.tar.gz +mkdir europarl_data +tar –xvzf europarl-3langs.tar.gz -C europarl_data ``` -Note that the extracted dataset will require around 30GB of memory. We use a SentencePiece tokenizer trained on OPUS Tatoeba Challenge data with 64k vocabulary size. 
Download the SentencePiece model and the vocabulary: ```bash diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md index aba25101..283e067c 100644 --- a/docs/source/quickstart.md +++ b/docs/source/quickstart.md @@ -2,7 +2,7 @@ # Quickstart -Mammoth is a library for training modular language models supporting multi-node multi-GPU training. +MAMMOTH is specifically designed for distributed training of modular systems in multi-GPUs SLURM environments. In the example below, we will show you how to configure Mamooth to train a machine translation model with language-specific encoders and decoders. @@ -32,6 +32,7 @@ Below are a few examples of training configurations that will work for you out-o Task-specific encoders and decoders In this example, we create a model with encoders and decoders **shared** for the specified languages. This is defined by `enc_sharing_group` and `enc_sharing_group`. +Note that the configs expect you have access to 2 GPUs. ```yaml # TRAINING CONFIG @@ -207,7 +208,7 @@ If you want to run your training on a single machine, simply run a python script Note that the example config above assumes two GPUs available on one machine. ```shell -CUDA_VISIBLE_DEVICES=0,1 python train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir -node_rank 0 +CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir -node_rank 0 ``` Note that when running `train.py`, you can use all the parameters from [train.py](options/train) as cmd arguments. In the case of duplicate arguments, the cmd parameters override the ones found in your config.yaml. @@ -217,7 +218,7 @@ Note that when running `train.py`, you can use all the parameters from [train.py Now that you've prepared your data and configured the settings, it's time to initiate the training of your multilingual machine translation model using Mammoth. Follow these steps to launch the training script, for example, through the Slurm manager: ```bash -python -u "$@" --node_rank $SLURM_NODEID -u train.py \ +python3 --node_rank $SLURM_NODEID -u train.py \ -config my_config.yaml \ -save_model output_dir \ -master_port 9974 -master_ip $SLURMD_NODENAME \ @@ -226,9 +227,9 @@ python -u "$@" --node_rank $SLURM_NODEID -u train.py \ Explanation of Command: - `python -u "$@"`: Initiates the training script using Python. - `--node_rank $SLURM_NODEID`: Specifies the node rank using the environment variable provided by Slurm. - - `-u ${PATH_TO_MAMMOTH}/train.py`: Specifies the path to the Mammoth training script. - - `-config ${CONFIG_DIR}/your_config.yml`: Specifies the path to your configuration file. - - `-save_model ${SAVE_DIR}/models/${EXP_ID}`: Defines the directory to save the trained models, incorporating an experiment identifier (`${EXP_ID}`). + - `${PATH_TO_MAMMOTH}/train.py`: Specifies the path to the Mammoth training script. + - `-config my_config.yml`: Specifies the configuration file. + - `-save_model output_dir`: Defines the directory to save the trained models. - `-master_port 9974 -master_ip $SLURMD_NODENAME`: Sets the master port and IP for communication. - `-tensorboard -tensorboard_log_dir ${LOG_DIR}/${EXP_ID}`: Enables TensorBoard logging, specifying the directory for TensorBoard logs. 
From fa10e023075fa67349e09cd8d3fa5d920d59ac3a Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 12 Mar 2024 11:16:35 +0200 Subject: [PATCH 20/22] project urls --- setup.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/setup.py b/setup.py index 7888ca63..1ae9ae28 100644 --- a/setup.py +++ b/setup.py @@ -14,10 +14,10 @@ version='0.1', packages=find_packages(), project_urls={ - "Documentation": "http://opennmt.net/OpenNMT-py/", - "Forum": "http://forum.opennmt.net/", - "Gitter": "https://gitter.im/OpenNMT/OpenNMT-py", - "Source": "https://github.com/OpenNMT/OpenNMT-py/", + "Documentation": "https://helsinki-nlp.github.io/mammoth/", + # "Forum": "http://forum.opennmt.net/", + # "Gitter": "https://gitter.im/OpenNMT/OpenNMT-py", + "Source": "https://github.com/Helsinki-NLP/mammoth", }, python_requires=">=3.5", install_requires=[ From a4e03cfa527d1e6744309ad26a8172c3a60192df Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 12 Mar 2024 11:34:35 +0200 Subject: [PATCH 21/22] remove multinode command; minor fixes --- docs/source/quickstart.md | 38 ++++++-------------------------------- 1 file changed, 6 insertions(+), 32 deletions(-) diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md index 283e067c..416c3e82 100644 --- a/docs/source/quickstart.md +++ b/docs/source/quickstart.md @@ -4,7 +4,7 @@ MAMMOTH is specifically designed for distributed training of modular systems in multi-GPUs SLURM environments. -In the example below, we will show you how to configure Mamooth to train a machine translation model with language-specific encoders and decoders. +In the example below, we will show you how to configure Mammoth to train a machine translation model with language-specific encoders and decoders. ### Step 0: Install mammoth @@ -31,7 +31,7 @@ Below are a few examples of training configurations that will work for you out-o
Task-specific encoders and decoders -In this example, we create a model with encoders and decoders **shared** for the specified languages. This is defined by `enc_sharing_group` and `enc_sharing_group`. +In this example, we create a model with encoders and decoders **unshared** for the specified languages. This is defined by `enc_sharing_group` and `enc_sharing_group`. Note that the configs expect you have access to 2 GPUs. ```yaml @@ -200,41 +200,15 @@ For more complex scenarios, we recommend our [automatic configuration generation ## Step 3: Start training -The running script will slightly differ depending on whether you want to run the training in a single-node (i.e. a single-machine) or multi-node setting: - -### Single-node training - -If you want to run your training on a single machine, simply run a python script `train.py`, possibly with a definition of your desired GPUs. +You can start your training on a single machine, by simply running a python script `train.py`, possibly with a definition of your desired GPUs. Note that the example config above assumes two GPUs available on one machine. ```shell -CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir -node_rank 0 +CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir ``` Note that when running `train.py`, you can use all the parameters from [train.py](options/train) as cmd arguments. In the case of duplicate arguments, the cmd parameters override the ones found in your config.yaml. -### Multi-node training - -Now that you've prepared your data and configured the settings, it's time to initiate the training of your multilingual machine translation model using Mammoth. Follow these steps to launch the training script, for example, through the Slurm manager: - -```bash -python3 --node_rank $SLURM_NODEID -u train.py \ - -config my_config.yaml \ - -save_model output_dir \ - -master_port 9974 -master_ip $SLURMD_NODENAME \ - -tensorboard -tensorboard_log_dir ${LOG_DIR}/${EXP_ID} -``` -Explanation of Command: - - `python -u "$@"`: Initiates the training script using Python. - - `--node_rank $SLURM_NODEID`: Specifies the node rank using the environment variable provided by Slurm. - - `${PATH_TO_MAMMOTH}/train.py`: Specifies the path to the Mammoth training script. - - `-config my_config.yml`: Specifies the configuration file. - - `-save_model output_dir`: Defines the directory to save the trained models. - - `-master_port 9974 -master_ip $SLURMD_NODENAME`: Sets the master port and IP for communication. - - `-tensorboard -tensorboard_log_dir ${LOG_DIR}/${EXP_ID}`: Enables TensorBoard logging, specifying the directory for TensorBoard logs. - -Your training process has been initiated through the Slurm manager, leveraging the specified configuration settings. Monitor the progress through the provided logging and visualization tools. Adjust parameters as needed for your specific training requirements. You can also run the command on other workstations by modifying the parameters accordingly. 
- ### Step 4: Translate @@ -243,7 +217,7 @@ Now that you have successfully trained your multilingual machine translation mod ```bash python3 -u $MAMMOTH/translate.py \ - --config "${CONFIG_DIR}/your_config.yml" \ + --config "my_config.yml" \ --model "$model_checkpoint" \ --task_id "train_$src_lang-$tgt_lang" \ --src "$path_to_src_language/$lang_pair.$src_lang.sp" \ @@ -255,7 +229,7 @@ python3 -u $MAMMOTH/translate.py \ Follow these configs to translate text with your trained model. - Provide necessary details using the following options: - - Configuration File: `--config "${CONFIG_DIR}/your_config.yml"` + - Configuration File: `--config "my_config.yml"` - Model Checkpoint: `--model "$model_checkpoint"` - Translation Task: `--task_id "train_$src_lang-$tgt_lang"` From bfd8a9905dda2506789f803703a67c0617cfc971 Mon Sep 17 00:00:00 2001 From: JS Date: Tue, 12 Mar 2024 11:34:54 +0200 Subject: [PATCH 22/22] minor fixes --- docs/source/examples/sharing_schemes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/examples/sharing_schemes.md b/docs/source/examples/sharing_schemes.md index cf3caa66..1c46373c 100644 --- a/docs/source/examples/sharing_schemes.md +++ b/docs/source/examples/sharing_schemes.md @@ -119,7 +119,7 @@ Ensure that the `config.yaml` file specifies the desired sharing scheme. The training can be run on a single GPU in which case the wrapper wouldn't be necessary. In this case, you can train with the following command. ```bash -python -u $MAMMOTH/train.py -config $CONFIG -node_rank 0 +python -u $MAMMOTH/train.py -config $CONFIG ``` #### 3. **Inference Command:**
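The `translate.py` call in this section handles one translation direction at a time. When evaluating many directions, it can be convenient to wrap it in a small loop; the sketch below builds the same command for each pair with Python's `subprocess` module. The checkpoint name, language pairs, and directory layout are placeholders to adapt to your own setup.

```python
# Run translate.py over several translation directions in sequence.
# All paths, the checkpoint name, and the language pairs below are
# placeholders; only the command-line flags mirror the inference command
# shown earlier in this tutorial.
import subprocess

mammoth = '/path/to/mammoth'
config = '/path/to/configs/config.yaml'
checkpoint = 'encoder-shared-models/model_step_100000.pt'  # assumed file name
processed_data = '/path/to/unpc/processed'
out_path = 'translations'

pairs = [('ar', 'en'), ('en', 'ar'), ('fr', 'es')]  # example directions

for sl, tl in pairs:
    lp = f'{sl}-{tl}'
    subprocess.run(
        [
            'python3', '-u', f'{mammoth}/translate.py',
            '--config', config,
            '--model', checkpoint,
            '--task_id', f'train_{sl}-{tl}',
            '--src', f'{processed_data}/{lp}/{lp}.{sl}.sp',
            '--output', f'{out_path}/{lp}.hyp.sp',
            '--gpu', '0', '--shard_size', '0',
            '--batch_size', '512',
        ],
        check=True,
    )
```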