
update docs #59

Merged: 24 commits into main, Mar 12, 2024

Conversation

shaoxiongji (Collaborator):

  • add scripts for translation
  • enhance tutorial

@TimotheeMickus (Collaborator) left a comment:

Minor changes suggested, but they're only suggestions.

Resolved review threads on:
  • docs/source/examples/sharing_schemes.md (×2)
  • docs/source/quickstart.md (×2)
@TimotheeMickus (Collaborator) left a comment:


Feels like we should have a tutorial page about distributed training.

This page should:

  1. mention that MAMMOTH was written with distributed modular training on SLURM clusters in mind
  2. explain world_size (total number of devices) / gpu_ranks (devices visible on the node) in the configs and node_gpu (device on which a task is run) in the task configs (see the config sketch after this list)
  3. provide a wrapper, and explain
    • why it's necessary (i.e., so that we can set variables specific to a SLURM node)
    • and how it works (arguments declared inside wrapper.sh are not evaluated until the wrapper is run on a node, so they are node-specific; arguments declared outside of the wrapper are evaluated globally and first, so they are shared by all nodes).
  4. provide examples that work in multi-GPU and multi-node settings.
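
A rough sketch of what such a page could show, assuming a two-node setup; only the key names world_size, gpu_ranks, and node_gpu come from the points above, the node_gpu string format and all values are illustrative:

```yaml
# Hypothetical excerpt from my_config.yaml: 2 nodes with 2 GPUs each, 4 devices in total
world_size: 4        # total number of devices across all nodes
gpu_ranks: [0, 1]    # devices visible on the node

tasks:
  train_en-de:
    node_gpu: "0:0"  # run this task on node 0, GPU 0
  train_en-fr:
    node_gpu: "1:1"  # run this task on node 1, GPU 1
```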

We might also want a page about how to define a task, and how this links to the vocab, encoder_layers, decoder_layers, and so on:

  1. mention that the config has a dict tasks, where keys are unique task identifiers and values are structured task definitions that define what your model does (we have the (i), (ii), (iii) criteria from the demo paper for Swahili to Catalan that one could include here); see the sketch after this list
  2. explain the sharing groups and how their shapes are defined globally by encoder_layers and decoder_layers
  3. explain how the src_tgt task key links a task to the source and target vocabs defined globally, and mention how to do explicit vocab sharing. We probably want to stress that embeddings are defined per vocab at this stage
  4. explain how the physical compute device is decided with node_gpu and link back to the distributed training page
  5. provide other keys, in particular path_valid_{src,tgt} to define validation loops, and transforms
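
A rough sketch of what such a task definition page could show; only tasks, src_tgt, node_gpu, path_valid_{src,tgt}, encoder_layers/decoder_layers, and transforms are named above, so the remaining key names (src_vocab, enc_sharing_group, path_src, the transform names, ...) are assumptions for illustration:

```yaml
# Hypothetical my_config.yaml excerpt; key names not mentioned in the review are illustrative
src_vocab:
  en: vocabs/en.vocab      # embeddings are defined per vocab at this stage
tgt_vocab:
  de: vocabs/de.vocab

encoder_layers: [6]        # defines the shapes of the encoder sharing groups globally
decoder_layers: [6]        # same for the decoder

tasks:
  train_en-de:                     # unique task identifier
    src_tgt: en-de                 # links the task to the en and de vocabs above
    enc_sharing_group: [en]        # which encoder parameters this task shares
    dec_sharing_group: [de]
    node_gpu: "0:0"                # physical device, see the distributed training page
    path_src: data/train.en
    path_tgt: data/train.de
    path_valid_src: data/valid.en  # enables a validation loop for this task
    path_valid_tgt: data/valid.de
    transforms: [sentencepiece, filtertoolong]
```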

We probably also want to change the 101, quickstart, and sharing scheme pages such that:

  • the example training only expects 1 GPU (currently it shouldn't fail if you have only 1 GPU, but it will silently ignore some of the tasks); see the sketch after this list
  • the wrapper is not presented in the sharing schemes page
  • there are some pointers to the task definition page
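
For the single-GPU example, something along these lines could work; it reuses the hypothetical keys from the sketches above, so all names and values are illustrative:

```yaml
# Hypothetical single-GPU setup: every task is pinned to node 0, GPU 0,
# so no task is silently dropped when only one device is available
world_size: 1
gpu_ranks: [0]

tasks:
  train_en-de:
    node_gpu: "0:0"
  train_en-fr:
    node_gpu: "0:0"
```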

Resolved review threads on:
  • docs/source/quickstart.md (×4)

```bash
python -u "$@" --node_rank $SLURM_NODEID -u ${PATH_TO_MAMMOTH}/train.py \
```
Collaborator comment on this line:

The provided command seems to be a mix between the wrapper's internal call and a default wrapper-free call. You probably want one of the two following:

Either run with a wrapper so it can handle multiple nodes, using the wrapper provided in sharing_schemes.md:

```bash
srun wrapper.sh \
    ${PATH_TO_MAMMOTH}/train.py \
    -config my_config.yaml \
    -master_port 9974 \
    -master_ip ${SLURM_NODENAME}
    # and maybe -tensorboard -tensorboard_log_dir -save_model
```

or the default python call:

```bash
python3 -u \
    ${PATH_TO_MAMMOTH}/train.py \
    -config my_config.yaml
    # and maybe -tensorboard -tensorboard_log_dir -save_model
```

In the latter case -master_port 9974 and -master_ip ${SLURM_NODENAME} should no longer be required.

shaoxiongji (Collaborator, Author):

Mostly fixed. The quickstart removes wrapper.sh (I guess for simplicity). I will leave distributed training for later, maybe as a tutorial on a new page.

@TimotheeMickus (Collaborator) commented Mar 12, 2024:

@stefanik12 does the call to train.py still work if people install it from pip?

@TimotheeMickus merged commit 669e2f4 into main on Mar 12, 2024. 2 checks passed.