update docs #59
Conversation
shaoxiongji commented on Mar 5, 2024
- add scripts for translation
- enhance tutorial
Minor changes suggested, but it's only a suggestion.
Runnable quickstart docs
Feels like we should have a tutorial page about distributed training. This page should:
- mention that MAMMOTH was written with distributed modular training on SLURM clusters in mind
- explain `world_size` (total number of devices) and `gpu_ranks` (devices visible on the node) in the configs, and `node_gpu` (device on which a task is run) in the task configs (a config sketch follows this list)
- provide a wrapper, and explain
  - why it's necessary (i.e., so that we can set variables specific to a SLURM node)
  - how it works (arguments declared inside `wrapper.sh` are not evaluated until the wrapper is run on a node, so they are node-specific; arguments declared outside of the config are evaluated globally and first, so they are shared by all nodes)
- provide examples that work in multi-GPU and multi-node settings.
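For concreteness, such a page could open with a minimal config sketch along these lines. Only `world_size`, `gpu_ranks`, and `node_gpu` are the keys discussed above; the task names and the `node_gpu` value format are illustrative assumptions and should be checked against the actual MAMMOTH docs:

```yaml
# Sketch only: two nodes with two GPUs each; value formats are assumed, not verified.
world_size: 4            # total number of devices across all nodes
gpu_ranks: [0, 1]        # devices visible on the current node
tasks:
  train_sw-en:           # illustrative task name
    node_gpu: "0:0"      # assumed "node:gpu" format, i.e. run this task on node 0, GPU 0
    # ... rest of the task definition ...
  train_en-ca:
    node_gpu: "1:1"      # node 1, GPU 1
    # ... rest of the task definition ...
```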
We might also want a page about how to define a task, and how this links to the vocab, `encoder_layers`, `decoder_layers`, and so on:
- mention that the config has a dict `tasks`, where keys are unique task identifiers and values are structured task definitions that define what your model does (the (i) (ii) (iii) criteria from the demo paper for Swahili to Catalan could be included here)
- explain the sharing groups and how their shapes are defined globally by `encoder_layers` and `decoder_layers`
- explain how the `src_tgt` task key links a task to the source and target vocabs defined globally, and mention how to do explicit vocab sharing; we probably want to stress that embeddings are defined per vocab at this stage
- explain how the physical compute device is decided with `node_gpu`, and link back to the distributed training page
- provide other keys, in particular `path_valid_{src,tgt}` to define validation loops, and `transforms` (see the sketch after this list)
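To make this concrete, the page could include a small annotated example. The sketch below uses the keys mentioned in this thread (`tasks`, `src_tgt`, `node_gpu`, `path_valid_{src,tgt}`, `transforms`, `encoder_layers`, `decoder_layers`); the remaining names (vocab paths, sharing-group keys, transform names) are assumptions and should be checked against the actual config schema:

```yaml
# Sketch only: key names beyond those discussed above are assumed, not verified.
src_vocab:
  sw: vocabs/sw.vocab               # vocabs (and thus embeddings) are defined globally, per language
tgt_vocab:
  ca: vocabs/ca.vocab
encoder_layers: [4, 2]              # defines the shapes of the encoder sharing groups
decoder_layers: [2, 4]              # defines the shapes of the decoder sharing groups
tasks:
  train_sw-ca:                      # unique task identifier
    src_tgt: sw-ca                  # links the task to the sw source vocab and the ca target vocab
    enc_sharing_group: [SW, FULL]   # assumed key name for the encoder sharing groups
    dec_sharing_group: [FULL, CA]   # assumed key name for the decoder sharing groups
    node_gpu: "0:0"                 # physical compute device; link back to the distributed training page
    path_src: data/train.sw
    path_tgt: data/train.ca
    path_valid_src: data/valid.sw   # defines a validation loop for this task
    path_valid_tgt: data/valid.ca
    transforms: [sentencepiece, filtertoolong]   # illustrative transform names
```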
We probably also want to change the 101, quickstart, and sharing scheme pages such that
- the example training only expects 1 GPU (it shouldn't fail if you have only 1 GPU but will silently ignore some of the tasks)
- the wrapper is not presented in the sharing schemes page
- there are some pointers to the task definition page
```bash
python -u "$@" --node_rank $SLURM_NODEID -u ${PATH_TO_MAMMOTH}/train.py \
```
The provided command seems to be a mix between the wrapper's internal call and a default wrapper-free call. You probably want one of the two following:

either run with a wrapper so it can handle multiple nodes, using the wrapper provided in sharing_schemes.md:

```bash
srun wrapper.sh \
    ${PATH_TO_MAMMOTH}/train.py \
    -config my_config.yaml \
    -master_port 9974 \
    -master_ip ${SLURM_NODENAME} \
    # and maybe -tensorboard -tensorboard_log_dir -save_model
```

or the default python call:

```bash
python3 -u \
    ${PATH_TO_MAMMOTH}/train.py \
    -config my_config.yaml \
    # and maybe -tensorboard -tensorboard_log_dir -save_model
```

In the latter case, `-master_port 9974` and `-master_ip ${SLURM_NODENAME}` should no longer be required.
Mostly fixed. The quickstart removes wrapper.sh (I guess for simplicity). I will leave the distributed training for later, maybe as a tutorial on a new page.
@stefanik12 does the call to