v0.0.2
This release mainly includes the following improvements:
- More unit tests.
- Add `.fuse()` and related primitives (see the sketch after this list).
- Improve overall training efficiency of GPT models by adding sequence parallelism, tied-weight support, etc.
- Documentation and tutorials.
- Bug fixes.
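
As a quick reference, below is a minimal sketch of how the new `.fuse()` primitive can be used to fuse a bias-add + GeLU subgraph. The toy model, the pattern function, and the `compiler`/`name` arguments are illustrative assumptions rather than the definitive API; see the documentation added in #63 for authoritative usage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import slapo

# Toy stand-in for a GPT MLP with an explicit bias-add + GeLU pattern.
class MLP(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.fc = nn.Linear(d, d, bias=False)
        self.bias = nn.Parameter(torch.zeros(d))

    def forward(self, x):
        return F.gelu(self.fc(x) + self.bias)

class Block(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.mlp = MLP(d)

    def forward(self, x):
        return self.mlp(x)

sch = slapo.create_schedule(Block())

# Trace the submodule so its dataflow graph is available for matching.
sch["mlp"].trace()

# Pattern describing the subgraph to fuse: bias-add followed by GeLU.
def pattern(x, bias):
    return F.gelu(x + bias)

# Match the pattern and fuse it into a single compiled module.
# The compiler backend and fused-module name are assumed examples.
subgraph = sch["mlp"].find(pattern)
sch["mlp"].fuse(subgraph, compiler="TorchScript", name="FusedBiasGeLU")

# Materialize the scheduled model.
built = slapo.build(sch)
```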
What's Changed
- [Release] Setup wheel and release scripts by @comaniac in #18
- [Pipeline] Drop last batch in DeepSpeed scripts by @comaniac in #19
- [Examples] Add disable_flash_attn by @chhzh123 in #22
- [Bugfix] Fix sequence parallelism by @szhengac in #20
- [Schedule][replace] Transfer hooks when replacing modules by @comaniac in #27
- [Bugfix] Fix GPT script by @szhengac in #26
- [Bugfix] Transfer hooks in pipeline modules by @comaniac in #28
- [Tracer] Add flatten argument to .trace() by @chhzh123 in #29
- [Benchmark] Fix ZeRO-3 step log by @comaniac in #31
- [Bugfix] Fix for sharding TP only by @zarzen in #32
- [Primitive][shard] Use autograd function for all sync ops by @comaniac in #33
- [Bugfix] Using None for mpu when PP > 1 by @zarzen in #34
- [Bugfix] Fix GPT script by @szhengac in #36
- [Schedule] Refactor subgraph matching by @chhzh123 in #35
- [Schedule] Add .fuse() primitive by @chhzh123 in #25
- [Setup] Fix dependency by @chhzh123 in #39
- [Random] Random state management by @comaniac in #38
- [GPT] Use flash-attention and enable dropout by @comaniac in #40
- [Op] Add attention and bias_gelu ops by @comaniac in #41
- [Tracer] Remove SelfAttention renaming by @chhzh123 in #44
- [Model] Add HuggingFace GPT-2 by @comaniac in #45
- [Op] Refactor qkv processing by @comaniac in #46
- Add num_workers to GPT dataloader by @szhengac in #48
- [Op] Add flash-attention CUDA kernel by @comaniac in #49
- [Bugfix] Fix tensor device by @szhengac in #50
- [Example] Use .fuse() primitive when possible by @chhzh123 in #42
- [Refactor] model_dialect -> framework_dialect by @comaniac in #51
- [Test] Add default initialization test by @chhzh123 in #54
- [Schedule] Create subschedule for subgraph replacement by @chhzh123 in #52
- [Schedule] Support partial checkpointing by @chhzh123 in #55
- [DeepSpeed] Support TP=nGPU and PP=DP=1 by @comaniac in #56
- [Examples] Move examples to slapo.model_schedule by @chhzh123 in #53
- [Bugfix] Support tree-like subgraph matching by @chhzh123 in #58
- [Bugfix] Consolidate params with orig size by @comaniac in #59
- [Bugfix] Fix a small device bug by @szhengac in #57
- [README] Temporary remove paper info by @comaniac in #60
- Add param_name to shard infer type and fix consolidate by @comaniac in #62
- [Feature] Layernorm Tag by @szhengac in #61
- [Docs] Add initial documentations by @chhzh123 in #63
- Enable launch training with torchrun by @zarzen in #64
- [Examples] Enable launch with torchrun by @comaniac in #65
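
To make two of the schedule-related entries above more concrete (#29's flatten argument to .trace() and #55's partial checkpointing), here is a hedged sketch under assumed APIs. The toy model, the module names, and the every-other-layer choice are illustrative only, not prescribed by the release.

```python
import torch.nn as nn
import slapo

# Toy two-level layer standing in for a GPT decoder block.
class Inner(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        return self.proj(x)

class Layer(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.inner = Inner(d)

    def forward(self, x):
        return self.inner(x)

model = nn.Sequential(*[Layer() for _ in range(4)])
sch = slapo.create_schedule(model)

# flatten=True (the argument added in #29) is assumed to inline nested
# submodules (here Layer.inner.proj) into a single traced graph rather
# than preserving the module hierarchy.
sch["0"].trace(flatten=True)

# Partial checkpointing (#55): only selected layers are activation-
# checkpointed, trading recomputation for memory. Checkpointing every
# other layer is an arbitrary example choice.
for i in range(0, len(model), 2):
    sch[str(i)].checkpoint()
```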
Full Changelog: v0.0.1...v0.0.2