reorg main function; add 'hpu' option w/o implementation #329
base: main
Conversation
The `main` training function needed to be broken down into smaller functions for readability/testability. WORLD_SIZE, LOCAL_RANK, and RANK have also been extracted and made global constants since they are set by the administrating multiprocessing launcher (torchrun, in our case). HPU configuration options and checks are also added. Signed-off-by: James Kunstle <[email protected]>
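The extraction of those constants could look roughly like the following sketch. This is a hypothetical illustration, not the PR's actual code: torchrun exports `WORLD_SIZE`, `LOCAL_RANK`, and `RANK` as environment variables for each worker process, so they can be read once at module scope.

```python
import os

# torchrun (the administrating multiprocessing launcher) sets these
# environment variables per worker; defaults cover single-process runs.
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", "1"))  # total worker count
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))  # rank within this node
RANK = int(os.environ.get("RANK", "0"))              # global rank
```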
disable_flash_attn=args.disable_flash_attn, use_dolomite=args.use_dolomite
)

tokenizer = configure_tokenizer(
I would have named this `setup_tokenizer`, but that name would shadow the imported function.
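The shadowing concern can be illustrated with a minimal sketch (the names here are stand-ins, not the module's real imports): a later `def` with the same name as an imported helper rebinds it in the module namespace.

```python
# Stand-in for the imported helper, e.g. `from utils import setup_tokenizer`.
def setup_tokenizer(path):
    return f"imported:{path}"

imported = setup_tokenizer("model")

# A local definition with the same name silently shadows the import;
# every later call in the module resolves to this one instead.
def setup_tokenizer(path):
    return f"local:{path}"

local = setup_tokenizer("model")
```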
train_loader = setup_dataloader(
    dataset,
    tokenizer.pad_token_id,
this happens sometimes when we have more GPUs than data to process. In this case
will fix this docstring
    mock_len=args.mock_len,
)
# will try to make multipack work if possible.
sampler_type: str = "multipack"
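The "make multipack work if possible" fallback might be sketched like this. The function name and the fallback sampler name are assumptions for illustration; the real criterion in the PR may differ.

```python
def choose_sampler(num_samples: int, world_size: int) -> str:
    """Prefer the multipack sampler, but fall back when it can't be used."""
    sampler_type = "multipack"
    if num_samples < world_size:
        # e.g. more GPUs than data to process: multipack cannot give
        # every rank a batch, so fall back to a plain distributed sampler.
        sampler_type = "distributed"
    return sampler_type
```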
The `: str` here is redundant; the type is inferred from the assignment.
From a high level these changes look good, we'll just need to test them to verify.