PR | Type | Ref. Issue(s) | Breaking Changes | PR Description |
---|---|---|---|---|
#141 | Bug Fix | #129 | Yes | Towards stable modalities version |
#154 | Bug Fix | #14 | Yes | Manual SwiGLU implementation
This PR further stabilises the codebase and makes training more robust, in particular w.r.t. loss spikes, which we fixed via scaled weight initialisation and an increased batch size in our experiments. The PR also fixes all failing tests and adds a simple entrypoint for running cpu, single-gpu and multi-gpu tests. The PR contains multiple sub-PRs.
General changes:
- Bug fix: the model evaluation mode is now properly deactivated after evaluation (see PR #131)
- Bug fix: Fixed the implementation of Pre-LN for GPT2 model (see PR #136)
- Enhancement: Added further mixed precision strategies, including one matching MegatronLM's.
- Enhancement: Single, unified entrypoint for running cpu, single-gpu and multi-gpu tests. All tests fixed. (PR #155)
- Enhancement: Previously, we chunked the dataset into `block_size`-long chunks, and each chunk was used for training individually. As a result, the last token of a block was only ever used as a target but never as an input. We changed this such that the last token of a batch is reused as the first token of the subsequent batch (see the sketch after this list). (PR #158)
- Bug fix: The indexing of the original samples in the dataset pbin files had multiple bugs. The index tuples are now always given in bytes, and the first sample in the data section now starts at byte 0 (previously there was a wrong offset). (PR #164)
- Enhancement: Improvements to the current pull request template and addition of several issue templates (bug report, documentation, feature request, blank). (PR #172)
- Components and factories for plain, scaled and scaled_embed initialisation. (PR #161)
- In the GPT2 model training configs, the standard deviation `std` can now be set to the string `auto`, in which case it equals `sqrt(2/(5*hidden_dim))` (see e.g. https://arxiv.org/abs/2312.16903 and the initialisation sketch after the breaking-changes list below). (PR #161)
- The CoCa model, which previously used a hardcoded (and probably not entirely correct) scaled initialization (see #165), can now only use plain initialization. (PR #161)
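To make the new chunking behaviour concrete, here is a minimal sketch (the helper `iter_samples` is hypothetical and not part of the Modalities codebase): each sample holds `sequence_length + 1` tokens, and consecutive samples overlap by exactly one token, so the last token of one sample serves both as a target and as the first input token of the next sample.

```python
from typing import Iterator

def iter_samples(token_stream: list[int], sequence_length: int) -> Iterator[tuple[list[int], list[int]]]:
    # Sketch only: each sample spans block_size = sequence_length + 1 tokens and
    # consecutive samples overlap by one token.
    block_size = sequence_length + 1
    step = sequence_length
    for start in range(0, len(token_stream) - block_size + 1, step):
        block = token_stream[start : start + block_size]
        inputs, targets = block[:-1], block[1:]  # shifted-by-one language-modelling pair
        yield inputs, targets

# With sequence_length=4, tokens 0..9 yield the blocks (0..4) and (4..8):
# token 4 is the last target of the first sample and the first input of the second.
for inputs, targets in iter_samples(list(range(10)), sequence_length=4):
    print(inputs, targets)
```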
Breaking changes:
- Enhancement: Logging is now always based on the number of training steps and the number of consumed tokens (PR #137). This is a breaking change, and the experiment configs need to be adapted as shown here.
- Enhancement: The model parameters are now grouped within the respective model. The optimizer can leverage these groups to, e.g., apply weight decay only to non-layer-norm weights (a generic sketch follows this list). See here for the necessary config changes. (PR #139)
- Enhancement: We now support different attention implementations (manual, PyTorch flash, DAO flash). See here for the respective config changes. (PR #138)
- Enhancement: Replaced `block_size` in `Dataset`, `Model` and `NumberConversion` with `sequence_length`. (PR #158)
- Enhancement: `block_size` is now `sequence_length + 1`, and `sequence_length` should always be specified as a power of 2. (PR #158)
- Enhancement: Restricted the codebase to the officially supported Python versions 3.10 and 3.11. (PR #174)
- All training configs require an additional component for initialization of the raw model (i.e. the model with random weights), as shown here. (PR #161)
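As a rough illustration of the new initialisation options (the function names below are hypothetical; the actual components and factories from PR #161 may differ), `std="auto"` resolves to `sqrt(2/(5*hidden_dim))`, and a scaled variant additionally downscales residual projection weights:

```python
import math
import torch.nn as nn

def resolve_std(std: float | str, hidden_dim: int) -> float:
    # std="auto" resolves to sqrt(2 / (5 * hidden_dim)); see https://arxiv.org/abs/2312.16903.
    if std == "auto":
        return math.sqrt(2.0 / (5.0 * hidden_dim))
    return float(std)

def init_linear_plain(layer: nn.Linear, std: float) -> None:
    # "plain": weights ~ N(0, std^2), biases zero.
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)

def init_linear_scaled(layer: nn.Linear, std: float, num_layers: int) -> None:
    # "scaled": residual projections are additionally divided by sqrt(2 * num_layers),
    # a common GPT-2-style convention; the exact Modalities scheme may differ.
    init_linear_plain(layer, std / math.sqrt(2 * num_layers))
```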
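To illustrate the parameter-grouping idea with a generic PyTorch sketch (this is not the Modalities grouping logic; the name-based predicate is a simplification), weight decay is applied only to parameters that are neither biases nor normalisation weights:

```python
import torch.nn as nn
from torch.optim import AdamW

def build_param_groups(model: nn.Module, weight_decay: float) -> list[dict]:
    # Simplified grouping: decay everything except biases and (Layer)Norm parameters.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# optimizer = AdamW(build_param_groups(model, weight_decay=0.1), lr=3e-4)
```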
Checklist:
- My PR is minimal and addresses one issue / enhancement in isolation
- I have merged main into this feature branch
- I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
- I have run a sample config for model training
- I have fixed all failing tests (`python tests/tests.py`)
This PR adds a manual SwiGLU implementation. The original one from xops was incompatible with activation checkpointing (see issue #14).
General changes:
- Replaces the xops SwiGLU implementation with a custom reimplementation (a sketch is given below).
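For reference, a manual SwiGLU can be written in plain PyTorch roughly as follows (a minimal sketch; layer names, hidden size and bias handling are illustrative and may differ from the actual Modalities implementation). Since it only uses standard modules, it composes with activation checkpointing, unlike the fused xops kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, bias: bool = False) -> None:
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=bias)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=bias)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=bias)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = W_down( SiLU(W_gate x) * (W_up x) )
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```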
Breaking changes:
- Renaming of `fused_swiglu` to `swiglu` in `ActivationType` (see here for the respective config changes and the illustrative snippet below)
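Illustratively, the config-facing change boils down to the enum value (the enum below is a hypothetical stand-in, not the exact Modalities definition):

```python
from enum import Enum

class ActivationType(str, Enum):
    GELU = "gelu"      # unchanged
    SWIGLU = "swiglu"  # previously "fused_swiglu"
```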
This PR removes all code related to Mamba. The latest state of main with Mamba can be found in the branch main_with_mamba.
General changes:
- Removes Mamba-related code
Breaking changes:
- None
This PR mainly addresses the warmstart of model training, e.g., after GPU crashes.
General Changes
- Fixes issue #242
- Warmstarts with changing infrastructure (e.g., a different number of GPUs) are now supported.
- Restructures the settings part of the configs.
- Adds various checks for consistency of model training, e.g., a mismatch between the target tokens and the number of dataset tokens (see the sketch after this list).
- Refactors all configs to be runnable again
- Adds an interactive Jupyter-notebook-based tutorial on how to use Modalities (merged from PR #239).
- Adds a warmstart tutorial
- Adds a TrainingReportGenerator that creates a report on the training setup and prints warnings in case of inconsistencies.
- Activation Checkpointing is now a component
- Added further NumberConversion routines
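One example of the kind of consistency check meant above (function and argument names are hypothetical, not the actual Modalities API): fail early when the configured token target cannot be covered by the dataset.

```python
def check_token_budget(target_tokens: int, dataset_num_tokens: int) -> None:
    # Hypothetical sketch: raise if the training target exceeds the tokens
    # actually available in the dataset (a target/dataset token mismatch).
    if target_tokens > dataset_num_tokens:
        raise ValueError(
            f"Target tokens ({target_tokens}) exceed the dataset tokens "
            f"({dataset_num_tokens}); adjust the target or provide more data."
        )
```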
Breaking Changes
- The settings part of the configs has been completely refactored.
This PR addresses issue #258 (inefficiencies in the dataloader) and additionally introduces a combined dataset, where a dataset can now comprise a list of datasets and iterate over them.
As part of fixing the dataloader inefficiencies, the sample skipping functionality is no longer implemented on the dataloader level but in an adapted version of the PyTorch `DistributedSampler`. I reran a warm start and the learning is equivalent to a full, non-warmstarted run.
General Changes
- Introduced `ResumableDistributedSampler`, a copy of the PyTorch `DistributedSampler` extended with the ability to skip samples (a condensed sketch follows this list). It is now used for warmstarts instead of the `skip_num_samples` in the dataloader. When skipping samples, the dataloader previously had to instantiate a `ResumableBatchSampler`, which internally iterated over all dataset indices (see modalities/src/modalities/dataloader/samplers.py, lines 25 to 28 in b79d04d). For small datasets this was fine, but for larger datasets (in the trillion-token range) this became a bottleneck at instantiation time. The `ResumableDistributedSampler` now skips in O(1), and the `ResumableBatchSampler` was removed from the codebase.
- Replaced the packed index generation routine, which was inefficient due to a for loop (see modalities/src/modalities/dataloader/dataset.py, lines 331 to 334 in b79d04d).
- Added a new `NumberConversion` routine, `num_samples_from_num_tokens`.
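A condensed sketch of the O(1) skipping idea (the real `ResumableDistributedSampler` is a fuller adaptation of the PyTorch `DistributedSampler`; the argument name `skip_num_global_samples` is illustrative):

```python
from typing import Iterator

from torch.utils.data import Dataset, DistributedSampler

class ResumableDistributedSampler(DistributedSampler):
    def __init__(self, dataset: Dataset, skip_num_global_samples: int = 0, **kwargs) -> None:
        super().__init__(dataset, **kwargs)
        # Number of already-seen samples this rank should skip on resumption.
        self.skip_num_local_samples = skip_num_global_samples // self.num_replicas

    def __iter__(self) -> Iterator[int]:
        # Per-rank indices as computed by DistributedSampler; skipping reduces to a
        # single slice instead of iterating over all dataset indices at instantiation time.
        indices = list(super().__iter__())
        return iter(indices[self.skip_num_local_samples:])
```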
Breaking Changes
- Removed `RepeatingDataloader`, a feature for running multiple epochs that was never actively used and required complex maintenance when refactoring the sampling. If needed, we could reimplement it.
- In the settings, the `training_progress` section now has `num_seen_samples` instead of `local_num_seen_batches`, as skipping is now done on the sampler level and no longer on the dataloader level.
- The `batch_size` and `fast_forward_batch_id` fields in the `LLMDataLoader` are not needed anymore and were removed.