Installable tools
Three scripts formerly in the tools directory have been promoted to
mammoth.bin and made into installable scripts.
- mammoth_config_config: Your trusted config Swiss Army knife.
- mammoth_iterate_tasks: Helper for iterating over all tasks in a
  config.
- mammoth_generate_synth_data: Toy data used in quickstart.
Waino committed Dec 9, 2024
1 parent d32140e commit 6271534
Showing 19 changed files with 38 additions and 1,468 deletions.
2 changes: 1 addition & 1 deletion docs/source/config_config.md
@@ -14,7 +14,7 @@ To ease the creation of configs, the config-config tool reads in a human-writabl

## Command
```bash
python3 mammoth/tools/config_config.py config_all --in_config input.yaml --out_config output.yaml
mammoth_config_config config_all --in_config input.yaml --out_config output.yaml
```

## Inputs
64 changes: 24 additions & 40 deletions docs/source/quickstart.md
@@ -7,7 +7,7 @@ MAMMOTH is specifically designed for distributed training of modular systems in
In the example below, we will show you how to configure Mammoth.
We will use two small experiments as examples

1. A simple set of toy tasks with synthetic data. Easy and fast, requiring no resources except for the Mammoth git repo.
1. A simple set of toy tasks with synthetic data. Easy and fast, requiring no resources except for the Mammoth package.
2. A machine translation model with language-specific encoders and decoders.

### Step 0: Install mammoth
@@ -25,61 +25,53 @@ Easy and fast, requiring no resources except for the Mammoth git repo.
This example uses a very small vocabulary, so we can use a "word level" model without sentencepiece.
The opts `--n_nodes`, `--n_gpus_per_node`, `--node_rank`, and `--gpu_rank` are set to use a single GPU.

### Step 1: Set the locations

Set the locations to the directory in which you want to work on this project, and the path to the mammoth git repo.

```bash
export PROJECT_DIR="/path/to/work/dir"
export MAMMOTH_DIR="/path/to/mammoth"
```

### Step 2: Activate your virtual env
### Step 1: Activate your virtual env

```bash
source ~/venvs/mammoth/bin/activate
```

### Step 3: Copy the config template from the Mammoth repo
### Step 2: Copy the config template from the Mammoth repo

```bash
cd $PROJECT_DIR
mkdir config
cp -i ${MAMMOTH_DIR}/examples/synthdata.template.yaml config/synthdata.template.yaml
pushd config
wget "https://raw.githubusercontent.com/Helsinki-NLP/mammoth/refs/heads/main/examples/synthdata.template.yaml"
popd
```

### Step 4: Generate synthetic data
### Step 3: Generate synthetic data

(this might take about 5 min)

```bash
python ${MAMMOTH_DIR}/tools/generate_synth_data.py \
mammoth_generate_synth_data \
--config_path config/synthdata.template.yaml \
--shared_vocab data/synthdata/shared_vocab
```

### Step 5: Generate the actual config from the config template
### Step 4: Generate the actual config from the config template

(this should only take a few seconds)

```bash
python ${MAMMOTH_DIR}/tools/config_config.py \
mammoth_config_config \
config_all \
--in_config config/synthdata.template.yaml \
--out_config config/synthdata.yaml \
--n_nodes 1 \
--n_gpus_per_node 1
```

### Step 6: Train the model
### Step 5: Train the model

(This might take about 1h. To speed things up, train for a shorter time, e.g. `--train_steps 5000 --warmup_steps 600`)

```bash
mammoth_train --config config/synthdata.yaml --node_rank 0 --gpu_rank 0
```

### Step 7: Translate
### Step 6: Translate

(this might take a few minutes)

@@ -101,25 +93,15 @@ mammoth_translate \

## Experiment 2: Machine translation with multi30k

### Step 1: Set the locations

Set the locations to the directory in which you want to work on this project, and the path to the mammoth git repo.

```bash
export PROJECT_DIR="/path/to/work/dir"
export MAMMOTH_DIR="/path/to/mammoth"
```

### Step 2: Activate your virtual env
### Step 1: Activate your virtual env

```bash
source ~/venvs/mammoth/bin/activate
```

### Step 3: Download data
### Step 2: Download data

```bash
cd $PROJECT_DIR
mkdir data/multi30k
pushd data/multi30k

@@ -131,7 +113,7 @@ done
popd
```

### Step 4: Train sentencepiece models
### Step 3: Train sentencepiece models

```bash
mkdir -p models/spm
@@ -142,35 +124,37 @@ for language in cs en de fr; do
done
```

### Step 5: Copy the config template from the Mammoth repo
### Step 4: Copy the config template from the Mammoth repo

```bash
mkdir config
cp -i ${MAMMOTH_DIR}/examples/multi30k.template.yaml config/multi30k.template.yaml
pushd config
wget "https://raw.githubusercontent.com/Helsinki-NLP/mammoth/refs/heads/main/examples/multi30k.template.yaml"
popd
```

### Step 6: Generate the actual config from the config template
### Step 5: Generate the actual config from the config template

(this should only take a few seconds)

```bash
python ${MAMMOTH_DIR}/tools/config_config.py \
mammoth_config_config \
config_all \
--in_config config/multi30k.template.yaml \
--out_config config/multi30k.yaml \
--n_nodes 1 \
--n_gpus_per_node 1
```

### Step 7: Train the model
### Step 6: Train the model

(this might take a while)

```bash
mammoth_train --config config/multi30k.yaml --node_rank 0 --gpu_rank 0
```

### Step 8: Translate
### Step 7: Translate

(this might take a while)

@@ -189,7 +173,7 @@ MODEL="models/${EXP_NAME}_step_${STEP}"

# Translate all language pairs
mkdir -p "translations/${EXP_NAME}/"
python ${MAMMOTH_DIR}/tools/iterate_tasks.py --config ${CONFIG} \
mammoth_iterate_tasks --config ${CONFIG} \
--src "data/${EXP_NAME}/test_2016_flickr.{src_lang}.gz" \
--output "translations/${EXP_NAME}/test_2016_flickr.{task_id}.greedy.trans" \
| while read task_flags; do \
12 changes: 8 additions & 4 deletions tools/config_config.py → mammoth/bin/config_config.py
100644 → 100755
@@ -13,7 +13,7 @@
from itertools import compress
from sklearn.cluster import AgglomerativeClustering

from gpu_assignment import optimize_gpu_assignment
from mammoth.utils.gpu_assignment import optimize_gpu_assignment

logger = logging.getLogger('config_config')

@@ -918,12 +918,12 @@ def extra_copy_gpu_assignment(opts):
opts.in_config[0]['gpu_ranks'] = opts.copy_from[0]['gpu_ranks']


if __name__ == '__main__':
def main():
init_logging()
opts = get_opts()
# if not opts.out_config:
# opts.out_config = opts.in_config[1]
main = {
command = {
func.__name__: func
for func in (
complete_language_pairs,
@@ -941,5 +941,9 @@ def extra_copy_gpu_assignment(opts):
extra_copy_gpu_assignment,
)
}[opts.command]
main(opts)
command(opts)
save_yaml(opts)


if __name__ == '__main__':
main()
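
The hunk above moves the script body into a `main()` function, which lets a `console_scripts` entry point import the module and call it, and renames the dispatch dict from `main` to `command` so it no longer shadows that function. Below is a minimal sketch of the same pattern. The subcommands `greet` and `shout`, the `--name` option, and the `run` helper are hypothetical stand-ins, and `argparse` is used in place of the real `get_opts()`, which this diff does not show.

```python
import argparse


def greet(opts):
    """Hypothetical subcommand, standing in for e.g. config_all."""
    return f"hello {opts.name}"


def shout(opts):
    """Hypothetical subcommand."""
    return opts.name.upper()


# Map each function's name to the function itself, as the diff does.
COMMANDS = {func.__name__: func for func in (greet, shout)}


def run(argv=None):
    """Parse arguments, look up the chosen subcommand, and call it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("command", choices=sorted(COMMANDS))
    parser.add_argument("--name", default="mammoth")
    opts = parser.parse_args(argv)
    command = COMMANDS[opts.command]
    return command(opts)


def main():
    """Entry point target: reads sys.argv, prints, and returns None."""
    print(run())
```

Keeping `main()` free of a return value matters in this sketch because generated console scripts typically pass the return value of the target function to `sys.exit`.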
File renamed without changes.
2 changes: 1 addition & 1 deletion tools/iterate_tasks.py → mammoth/bin/iterate_tasks.py
100644 → 100755
@@ -22,7 +22,7 @@
'--output',
type=str,
default=None,
help='Template for source file paths. Use variables src_lang, tgt_lang, and task_id.',
help='Template for translation output file paths. Use variables src_lang, tgt_lang, and task_id.',
)
@click.option(
'--flag',
Expand Down
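
The `--src` and `--output` options of `mammoth_iterate_tasks` take path templates containing the variables `src_lang`, `tgt_lang`, and `task_id`. The sketch below shows how such templates could be expanded once per task, assuming Python `str.format` semantics; the actual expansion code in `iterate_tasks.py` is not part of this diff. The `expand` helper, the task list, and the value `multi30k` substituted for `EXP_NAME` are all hypothetical.

```python
def expand(template, task_id, src_lang, tgt_lang):
    """Fill a path template with the variables of one task."""
    return template.format(task_id=task_id, src_lang=src_lang, tgt_lang=tgt_lang)


# Hypothetical tasks; a real run would read these from the config.
tasks = [
    ("train_en-de", "en", "de"),
    ("train_de-en", "de", "en"),
]

# Templates modelled on the quickstart, with EXP_NAME assumed to be multi30k.
src_template = "data/multi30k/test_2016_flickr.{src_lang}.gz"
out_template = "translations/multi30k/test_2016_flickr.{task_id}.greedy.trans"

paths = [
    (expand(src_template, *task), expand(out_template, *task))
    for task in tasks
]
for src, out in paths:
    print(src, "->", out)
```

Variables that a template does not mention, such as `tgt_lang` here, are simply ignored by `str.format`, so one helper can serve both templates.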
File renamed without changes.
3 changes: 3 additions & 0 deletions setup.py
@@ -44,6 +44,9 @@
# "onmt_server=mammoth.bin.server:main",
"mammoth_train=mammoth.bin.train:main",
"mammoth_translate=mammoth.bin.translate:main",
"mammoth_config_config=mammoth.bin.config_config:main",
"mammoth_iterate_tasks=mammoth.bin.iterate_tasks:main",
"mammoth_generate_synth_data=mammoth.bin.generate_synth_data:main",
# "onmt_release_model=mammoth.bin.release_model:main",
# "onmt_average_models=mammoth.bin.average_models:main",
# "onmt_build_vocab=mammoth.bin.build_vocab:main",
Expand Down
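
Each `console_scripts` entry in the hunk above has the form `name=module:function`: installing the package creates an executable called `name` that imports `module` and calls `function`. The sketch below splits the three new specs by hand to make that mapping explicit. `parse_entry_point` is a hypothetical helper written for illustration, not part of setuptools or Mammoth.

```python
def parse_entry_point(spec):
    """Split 'name=module:function' into its three parts."""
    name, _, target = spec.partition("=")
    module, _, function = target.partition(":")
    return name.strip(), module.strip(), function.strip()


# The three specs added to setup.py in this commit.
specs = [
    "mammoth_config_config=mammoth.bin.config_config:main",
    "mammoth_iterate_tasks=mammoth.bin.iterate_tasks:main",
    "mammoth_generate_synth_data=mammoth.bin.generate_synth_data:main",
]

parsed = [parse_entry_point(spec) for spec in specs]
for name, module, function in parsed:
    print(f"{name} -> from {module} import {function}")
```

All three specs point at a function named `main`, which is why `config_config.py` needed its `if __name__ == '__main__':` body moved into one.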
4 changes: 1 addition & 3 deletions tools/README.md
@@ -1,3 +1 @@
This directly contains scripts and tools adopted from other open source projects such as Apache Joshua and Moses Decoder.

TODO: credit the authors and resolve license issues (if any)
This directory contains some legacy scripts not deemed worthy to install as part of the mammoth package.