Releases · microsoft/DeepSpeed
v0.15.4
What's Changed
- Update version.txt after 0.15.3 release by @loadams in #6652
- Fix expert grad scaling problem with ZeRO optimizer by @wyooyw in #6546
- Add attribute check for language_model when replace last linear module by @Yejing-Lai in #6650
- fix init_device_mesh for torch 2.4 by @Lzhang-hub in #6614
- Fix dynamo issue by @oraluben in #6527
- sequence parallel for uneven heads by @inkcherry in #6392
- Add fallback for is_compiling by @tohtana in #6663
- Update profiler registration check by @loadams in #6668
- Add support for H100/sm_90 arch compilation by @loadams in #6669
- Update Gaudi2 docker image by @loadams in #6677
- Update gaudi2 docker version to latest release (1.18) by @raza-sikander in #6648
- Update base docker image for A6000 GPU tests by @loadams in #6681
- Remove packages that no longer need to be updated in the latest container by @loadams in #6682
- Fix training of pipeline based peft's lora model by @xuanhua in #5477
- Update checkout action to latest version by @loadams in #5021
- Add attribute check to support git-base autotp by @Yejing-Lai in #6688
- fix memcpy issue on backward for zero-infinity by @xylian86 in #6670
- Free memory in universal checkpointing tests by @tohtana in #6693
- Explicitly set device when reusing dist env by @tohtana in #6696
- Update URL in README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6706
- Pin transformers to 4.45.2 in nv-ds-chat workflow by @loadams in #6710
- [Bug Fix] Support threads_per_head < 64 for wavefront size of 64 by @jagadish-amd in #6622
- Use one param coordinator for both train/inference scenarios by @tohtana in #6662
- Update yapf version by @loadams in #6721
- Update flake8 version by @loadams in #6722
- Switch what versions of python are supported by @loadams in #5676
Full Changelog: v0.15.3...v0.15.4
v0.15.3
What's Changed
- Update version.txt after 0.15.2 release by @loadams in #6615
- Clean up prefetched parameters by @tohtana in #6557
- AIO CPU Locked Tensor by @jomayeri in #6592
- reduce setting global variables to reduce torch compile graph breaks by @NirSonnenschein in #6541
- Add API to get devices of offload states by @tohtana in #6586
- Ignore reuse_dist_env by @tohtana in #6623
- Add API for updating ZeRO gradients by @tjruwase in #6590
- [compile] Show breakdown of graph break by @delock in #6601
- Accept btl_tcp_if_include option through launcher_args by @diskkid in #6613
- Add first Step in LR Schedulers by @jomayeri in #6597
- Support safetensors export by @xu-song in #6579
- add option to disable logger while compiling to avoid graph breaks by @ShellyNR in #6496
- Lock cache file of HF model list by @tohtana in #6628
- Add README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6588
- Update torch version in workflows by @tohtana in #6631
- Use file store for tests by @tohtana in #6632
- Fix Memory Leak In AIO by @jomayeri in #6630
- [XPU] upgrade xpu max1100 CI workflow to pytorch2.3 by @Liangliang-Ma in #6646
- [XPU] host timer check version from Torch 2.5 to Torch 2.6 by @YizhouZ in #6633
- [XPU] [DeepNVMe] use same cpu_op_desc_t with cuda by @Liangliang-Ma in #6645
Full Changelog: v0.15.2...v0.15.3
v0.15.2 Patch release
What's Changed
- Update version.txt after 0.15.1 release by @loadams in #6493
- HPU: add required ENV vars to accelerator init by @nelyahu in #6495
- Op_builder->is_compatible quiet warning by @terry-for-github in #6093
- fix pipeline eval_batch micro_batches argument for schedule by @nelyahu in #6484
- Fix the broken url link by @rogerxfeng8 in #6500
- fix environment variable export bug for MultiNodeRunner by @TideDra in #5878
- Revert "BF16 optimizer: Clear lp grads after updating hp grads in hook" by @nelyahu in #6508
- wrap include cuda_bf16.h with ifdef BF16_AVAILABLE by @oelayan7 in #6520
- Avoid security issues of subprocess shell by @tjruwase in #6498
- Add conditional on torch version for scaled_dot_product_attention by @loadams in #6517
- Added Intel Gaudi to Accelerator Setup Guide by @ShifaAbu in #6543
- Skip failing newly added tests in accelerate by @loadams in #6574
- Use msgpack for p2p comm by @tohtana in #6547
- DeepNVMe perf tuning by @tjruwase in #6560
- [Accelerator] Cambricon MLU support by @Andy666G in #6472
- Fix gradient accumulation for Z2+offload by @tohtana in #6550
- fix errors when setting zero3 leaf modules with torch.compile by @NirSonnenschein in #6564
- [XPU] Support DeepNVMe new code structure by @Liangliang-Ma in #6532
- Add APIs to offload states of model, optimizer, and engine by @tohtana in #6011
- add bfloat16 to inference support dtypes by @nelyahu in #6528
- [COMPILE] workflow for deepspeed + torch.compile by @YizhouZ in #6570
- Fixes on the accelerate side mean we do not need to skip this test by @loadams in #6583
- Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers by @loadams in #6584
- [ROCm] Fix subprocess error by @jagadish-amd in #6587
- Cleanup CODEOWNERS file to be valid by @loadams in #6603
- Add SSF Best practices badge by @loadams in #6604
- Move V100 workflows from cuda 11.1/11.7 to 12.1 by @loadams in #6607
- Fix SD workflow by @loadams in #6609
- Pin accelerate to fix CI failures/issues by @loadams in #6610
- Add llama3.2 vision autotp by @Yejing-Lai in #6577
- Improve DS logging control by @tjruwase in #6602
- Fix device selection using CUDA_VISIBLE_DEVICES by @tohtana in #6530
- Handle when `backend` is also in compile_kwargs by @oraluben in #6502
- Rearrange inference OPS and stop using builder.load by @oelayan7 in #5490
- Unpin accelerate tests, update lightning with node16 removal. by @loadams in #6611
- Enabled Qwen2-MoE Tensor Parallelism (TP) inference by @gyou2021 in #6551
New Contributors
- @TideDra made their first contribution in #5878
- @ShifaAbu made their first contribution in #6543
- @jagadish-amd made their first contribution in #6587
- @gyou2021 made their first contribution in #6551
Full Changelog: v0.15.1...v0.15.2
v0.15.1 Patch release
What's Changed
- Update version.txt after 0.15.0 release by @loadams in #6403
- Fix Type Mismatch by @jomayeri in #6410
- Fix redundant seq data parallel grp argument in Z3/MiCS by @samadejacobs in #5352
- add Huawei Ascend NPU setup guide by @xuedinge233 in #6445
- Add documentation for launcher without SSH by @dogacancolak-kensho in #6455
- Dtype support check for accelerator in UTs by @raza-sikander in #6360
- Store/Load CIFAR from local/offline by @raza-sikander in #6390
- Add the accelerator setup guide link in Getting Started page by @rogerxfeng8 in #6452
- Allow triton==3.0.x for fp_quantizer by @siddartha-RE in #6447
- Change GDS to 1 AIO thread by @jomayeri in #6459
- [CCL] fix condition issue in ccl.py by @YizhouZ in #6443
- Avoid gds build errors on ROCm by @rraminen in #6456
- TestLowCpuMemUsage UT get device by device_name by @raza-sikander in #6397
- Add workflow to build DS without torch to better test before releases by @loadams in #6450
- Fix patch for parameter partitioning in zero.Init() by @tohtana in #6388
- Add default value to "checkpoint_folder" in "load_state_dict" of bf16_optimizer by @ljcc0930 in #6446
- DeepNVMe tutorial by @tjruwase in #6449
- bf16_optimizer: fixes to different grad acc dtype by @nelyahu in #6485
- print warning if actual triton cache dir is on NFS, not just for default by @jrandall in #6487
- DS_BUILD_OPS should build only compatible ops by @tjruwase in #6489
- Safe usage of popen by @tjruwase in #6490
- Handle an edge case where `CUDA_HOME` is not defined on ROCm systems by @amorehead in #6488
New Contributors
- @xuedinge233 made their first contribution in #6445
- @siddartha-RE made their first contribution in #6447
- @ljcc0930 made their first contribution in #6446
- @jrandall made their first contribution in #6487
- @amorehead made their first contribution in #6488
Full Changelog: v0.15.0...v0.15.1
DeepSpeed v0.15.0
What's Changed
- Update version.txt after 0.14.5 release by @loadams in #5982
- move pynvml install to setup.py by @Rohan138 in #5840
- add moe topk(k>2) gate support by @inkcherry in #5881
- Move inf_or_nan_tracker to cpu for cpu offload by @BacharL in #5826
- Enable dynamic shapes for pipeline parallel engine inputs by @tohtana in #5481
- Add and Remove ZeRO 3 Hooks by @jomayeri in #5658
- DeepNVMe GDS by @jomayeri in #5852
- Pin transformers version on nv-nightly by @loadams in #6002
- DeepSpeed on Windows blog by @tjruwase in #6364
- Bug Fix 5880 by @jomayeri in #6378
- Update linear.py compatible with torch 2.4.0 by @terry-for-github in #5811
- GDS Swapping Fix by @jomayeri in #6386
- Long sequence parallelism (Ulysses) integration with HuggingFace by @samadejacobs in #5774
- reduce cpu host overhead when using moe by @ranzhejiang in #5578
- fix fp16 Qwen2 series model to DeepSpeed-FastGen by @ZonePG in #6028
- Add Japanese translation of Windows support blog by @tohtana in #6394
- Correct op_builder path to xpu files for trigger XPU tests by @loadams in #6398
- add pip install cutlass version check by @GuanhuaWang in #6393
- [XPU] API align with new intel pytorch extension release by @YizhouZ in #6395
- Pydantic v2 migration by @mrwyattii in #5167
- Fix torch check by @loadams in #6402
New Contributors
- @Rohan138 made their first contribution in #5840
- @terry-for-github made their first contribution in #5811
- @ranzhejiang made their first contribution in #5578
Full Changelog: v0.14.5...v0.15.0
v0.14.5 Patch release
What's Changed
- Update version.txt after 0.14.4 release by @mrwyattii in #5694
- Fixed Windows inference build. by @costin-eseanu in #5609
- Fix memory leak from _hp_mapping by @chiragjn in #5643
- Bug fix for the "Link bit16 and fp32 parameters in partition" by @U-rara in #5681
- [CPU] add fp16 support to shm inference_all_reduce by @delock in #5669
- Universal checkpoint for zero stage 3 by @xylian86 in #5475
- inference unit test injectionPolicy split world_size to multiple tests by @oelayan7 in #5687
- ENV var added for recaching in INF Unit tests by @raza-sikander in #5688
- Disable nvtx decorator to avoid graph break by @tohtana in #5697
- Add an argument to enable the injection of missing state during the conversion of universal checkpoints by @xylian86 in #5608
- Change source of CPUAdam for xpu accelerator by @Liangliang-Ma in #5703
- Add additional paths to trigger xpu tests by @loadams in #5707
- Update XPU docker version by @loadams in #5712
- update xpu fusedadam opbuilder for pytorch 2.3 by @baodii in #5702
- DeepSpeed Universal Checkpointing: Blog and Tutorial by @samadejacobs in #5711
- UCP Chinese Blog by @HeyangQin in #5713
- Fix tutorial links by @samadejacobs in #5714
- Update node16 check on self-hosted runners and remove python 3.6 by @loadams in #5756
- fix the missing argument in test and typo by @xylian86 in #5730
- [INF] Enable torch compile for inference by @oelayan7 in #5612
- Update checkout action for nv-human-eval workflow by @loadams in #5757
- Add Windows scripts (deepspeed, ds_report). by @costin-eseanu in #5699
- Unit Test: Add error handling for rate limit exceeded in model list by @HeyangQin in #5715
- Fix memory leak for pipelined optimizer swapper by @mauryaavinash95 in #5700
- Remove duplicated variable by @xu-song in #5727
- Fix phi3 mini 128k load error by @Yejing-Lai in #5765
- [CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph by @delock in #5604
- Added wrappers for hpu tensors based on dtype by @deepcharm in #5771
- [bugfix] promote state in bf16_optimizer by @billishyahao in #5767
- Launcher mode with SSH bypass by @dogacancolak-kensho in #5728
- Update the list of supported models in the Chinese README of fastgen by @beep-bebop in #5773
- Add support for Microsoft Phi-3 model to DeepSpeed-FastGen by @adk9 in #5559
- Misplaced global variable `warned` by @anferico in #5725
- Fixes for latest Huggingface_hub changes on modelId -> id by @loadams in #5789
- reduce all-to-all communication volume when both expert and non-expert are tensor-parallel by @taozhiwei in #5626
- Update Ubuntu version for running python tests by @loadams in #5783
- fix: quantization with DeepSpeed HE by @Atry in #5624
- [INF] Add Qwen2RMSNorm to loaded layers in auto_tp by @oelayan7 in #5786
- Add chatglm2 & chatglm3 autotp by @Yejing-Lai in #5540
- Add new autotp supported model in doc by @Yejing-Lai in #5785
- Fix accuracy error of NPUFusedAdam by @penn513 in #5777
- Update torch version in cpu-torch-latest and nv-torch-latest-v100 tests to 2.4 by @loadams in #5797
- move is_checkpointable call reducing torch.compile Graph breaks by @NirSonnenschein in #5759
- Unpin transformers version by @loadams in #5650
- Update other workflows to run on Ubuntu 22.04 by @loadams in #5798
- [XPU]Use host time to replace xpu time when IPEX version slower than 2.5. by @ys950902 in #5796
- Update MII tests to pull correct torchvision by @loadams in #5800
- Add fp8-fused gemm kernel by @sfc-gh-reyazda in #5764
- Add doc of compressed backend in Onebit optimizers by @Liangliang-Ma in #5782
- fix: handle exception when loading cache file in test_inference.py by @HeyangQin in #5802
- Pin transformers version for MII tests by @loadams in #5807
- Fix op_builder for CUDA 12.5 by @keshavkowshik in #5806
- Find ROCm on Fedora by @trixirt in #5705
- Fix CPU Adam JIT compilation by @lekurile in #5780
- GDS AIO Blog by @jomayeri in #5817
- [ROCm] Get rocm version from /opt/rocm/.info/version by @rraminen in #5815
- sequence parallel with communication overlap by @inkcherry in #5691
- Update to ROCm6 by @loadams in #5491
- Add fp16 support of Qwen1.5MoE models (A2.7B) to DeepSpeed-FastGen by @ZonePG in #5403
- Use accelerator to replace cuda in setup and runner by @Andy666G in #5769
- Link GDS blog to site by @tjruwase in #5820
- Non-reentrant checkpointing hook fix by @ic-synth in #5781
- Fix NV references by @tjruwase in #5821
- Fix docs building guide by @tjruwase in #5825
- Update clang-format version from 16 to 18. by @loadams in #5839
- Add Japanese translation of DeepNVMe blog by @tohtana in #5845
- Fix the bug of deepspeed sequence parallel working with batch size larger than 1 by @YJHMITWEB in #5823
- Upgrade HPU image to v1.16.2. by @vshekhawat-hlab in #5610
- OptimizedLinear updates by @jeffra in #5791
- Log operator warnings only in verbose mode by @tjruwase in #5917
- Use `torch.nan_to_num` to replace the numpy wrapper by @jinyouzhi in #5877
- [Zero2] Reduce the unnecessary all-reduce when tensor size is 0. by @ys950902 in #5868
- Update container version for Gaudi2 CI by @raza-sikander in #5937
- Fix missing ds_id bug by @tjruwase in #5824
- Update LR scheduler configuration by @xiyang-aads-lilly in #5846
- HPUAccelerator: remove support in set_visible_devices_envs by @nelyahu in #5929
- Z3: optimizations for grad norm calculation and gradient clipping by @nelyahu in #5504
- Update xpu-max1100.yml with new config and add some tests by @Liangliang-Ma in #5668
- Add accelerator setup guides by @delock in #5827
- Allow accelerator to instantiate the device by @nelyahu in #5255
New Contributors
- @U-rara made their first contribution in #5681
- @xylian86 made their first contribution in #5475
- @mauryaavinash95 made their first contribution in #5700
- @billishyahao made their first contribution in #5767
- @dogacancolak-kensho made their first contribution in #5728
- @beep-bebop made their first contribution in #5773
- @anferico made their first contribution in #5725
- @Atry made their first contribution in #5624
- @sfc-gh-reyazda made their first contribution in https://github.com/...
v0.14.4 Patch release
What's Changed
- Update version.txt after 0.14.3 release by @mrwyattii in #5651
- [CPU] SHM based allreduce improvement for small message size by @delock in #5571
- _exec_forward_pass: place zeros(1) on the same device as the param by @nelyahu in #5576
- [XPU] adapt lazy_call func to different versions by @YizhouZ in #5670
- fix IDEX dependence in xpu accelerator by @Liangliang-Ma in #5666
- Remove compile wrapper to simplify access to model attributes by @tohtana in #5581
- Fix hpZ with zero element by @samadejacobs in #5652
- Fixing the reshape bug in sequence parallel alltoall, which corrupted all QKV data by @YJHMITWEB in #5664
- enable yuan autotp & add conv tp by @Yejing-Lai in #5428
- Fix latest pytorch '_get_socket_with_port' import error by @Yejing-Lai in #5654
- Fix numpy upgrade to 2.0.0 BUFSIZE import error by @Yejing-Lai in #5680
- Update BUFSIZE to come from autotuner's constants.py, not numpy by @loadams in #5686
- [XPU] support op builder from intel_extension_for_pytorch kernel path by @YizhouZ in #5425
New Contributors
- @YJHMITWEB made their first contribution in #5664
Full Changelog: v0.14.3...v0.14.4
v0.14.3 Patch release
What's Changed
- Update version.txt after 0.14.2 release by @mrwyattii in #5458
- Add getter and setter methods for compile_backend across accelerators. by @vshekhawat-hlab in #5299
- Fix torch.compile error for PyTorch v2.3 by @tohtana in #5463
- Revert "stage3: efficient compute of scaled_global_grad_norm (#5256)" by @lekurile in #5461
- Update ds-chat CI workflow paths to include zero stage 1-3 files by @lekurile in #5462
- Update with ops not supported on Windows by @loadams in #5468
- fix: swapping order of parameters in create_dir_symlink method. by @alvieirajr in #5465
- Un-pin torch version in nv-torch-latest back to latest and skip test_compile_zero tests on v100 by @loadams in #5459
- re-introduce: stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5493
- Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() by @harygo2 in #5464
- Fix compile wrapper by @BacharL in #5455
- enable phi3_mini autotp by @Yejing-Lai in #5501
- Fused adam for HPU by @BacharL in #5500
- [manifest] update manifest to add hpp file in csrc. by @ys950902 in #5522
- enable phi2 autotp by @Yejing-Lai in #5436
- Switch pynvml to nvidia-ml-py by @loadams in #5529
- Switch from double quotes to match single quotes by @loadams in #5530
- [manifest] update manifest to add hpp file in deepspeed. by @ys950902 in #5533
- New integration - CometMonitor by @alexkuzmik in #5466
- Improve _configure_optimizer() final optimizer log by @nelyahu in #5528
- Enhance testing: Skip fused_optimizer tests if not supported. by @vshekhawat-hlab in #5159
- Skip the UT cases that use unimplemented op builders. by @foin6 in #5372
- rocblas -> hipblas changes for ROCm by @rraminen in #5401
- Rocm warp size fix by @rraminen in #5402
- CPUAdam fp16 and bf16 support by @BacharL in #5409
- Optimize zero3 fetch params using all_reduce by @deepcharm in #5420
- Fix the TypeError for XPU Accelerator by @shiyang-weng in #5531
- Fix RuntimeError for moe on XPU: tensors found at least two devices by @shiyang-weng in #5519
- Remove synchronize calls from allgather params by @BacharL in #5516
- Avoid overwrite of compiled module wrapper attributes by @deepcharm in #5549
- Small typos in functions set_none_gradients_to_zero by @TravelLeraLone in #5557
- Adapt doc for #4405 by @oraluben in #5552
- Update to HF_HOME from TRANSFORMERS_CACHE by @loadams in #4816
- [INF] DSAttention allow input_mask to have false as value by @oelayan7 in #5546
- Add throughput timer configuration by @deepcharm in #5363
- Add Ulysses DistributedAttention compatibility by @Kwen-Chen in #5525
- Add hybrid_engine.py as path to trigger the DS-Chat GH workflow by @lekurile in #5562
- Update HPU docker version by @loadams in #5566
- Rename files in fp_quantize op from quantize.* to fp_quantize.* by @loadams in #5577
- [MiCS] Remove the handle print on DeepSpeed side by @ys950902 in #5574
- Update to fix sidebar over text by @loadams in #5567
- DeepSpeedCheckpoint: support custom final ln idx by @nelyahu in #5506
- Update minor CUDA version compatibility by @adk9 in #5591
- Add slide deck for meetup in Japan by @tohtana in #5598
- Fixed the Windows build. by @costin-eseanu in #5596
- estimate_zero2_model_states_mem_needs: fixing memory estimation by @nelyahu in #5099
- Fix cuda hardcode for inference woq by @Liangliang-Ma in #5565
- fix sequence parallel(Ulysses) grad scale for zero0 by @inkcherry in #5555
- Add Compressedbackend for Onebit optimizers by @Liangliang-Ma in #5473
- Updated hpu-gaudi2 tests content. by @vshekhawat-hlab in #5622
- Pin transformers version for MII tests by @loadams in #5629
- WA for Torch-compile-Z3-act-apt accuracy issue from the Pytorch repo by @NirSonnenschein in #5590
- stage_1_and_2: optimize clip calculation to use clamp by @nelyahu in #5632
- Fix overlap communication of ZeRO stage 1 and 2 by @penn513 in #5606
- fixes in _partition_param_sec function by @mmhab in #5613
- assumption of torch.initial_seed function accepting seed arg in DeepSpeedAccelerator abstract class is incorrect by @polisettyvarma in #5569
- pipe/_exec_backward_pass: fix immediate grad update by @nelyahu in #5605
- Monitor was always enabled causing performance degradation by @deepcharm in #5633
New Contributors
- @alvieirajr made their first contribution in #5465
- @harygo2 made their first contribution in #5464
- @alexkuzmik made their first contribution in #5466
- @foin6 made their first contribution in #5372
- @shiyang-weng made their first contribution in #5531
- @TravelLeraLone made their first contribution in #5557
- @oraluben made their first contribution in #5552
- @Kwen-Chen made their first contribution in #5525
- @adk9 made their first contribution in #5591
- @costin-eseanu made their first contribution in #5596
- @NirSonnenschein made their first contribution in #5590
- @penn513 made their first contribution in #5606
Full Changelog: v0.14.2...v0.14.3
v0.14.2 Patch release
What's Changed
- Update version.txt after 0.14.1 release by @mrwyattii in #5413
- Remove dtype(fp16) condition check for residual_add unit test by @raza-sikander in #5329
- [XPU] Use non_daemonic_proc by default on XPU device by @ys950902 in #5412
- Fix a convergence issues in TP topology caused by incorrect grad_norm. by @inkcherry in #5411
- Update 'create-pr' action in release workflow to latest by @loadams in #5415
- Update engine.py to avoid torch warning by @etiennebonnafoux in #5408
- Update _sidebar.scss by @fasterinnerlooper in #5293
- Add more tests into XPU CI by @Liangliang-Ma in #5427
- [CPU] Support SHM based inference_all_reduce in TorchBackend by @delock in #5391
- Add required paths to trigger AMD tests on PRs by @loadams in #5406
- Bug fix in `split_index` method by @bm-synth in #5292
- Parallel map step for `DistributedDataAnalyzer` map-reduce by @bm-synth in #5291
- Selective dequantization by @RezaYazdaniAminabadi in #5375
- Fix sorting of shard optimizer states files for universal checkpoint by @tohtana in #5395
- add device config env for the accelerator by @shiyuan680 in #5396
- 64bit indexing fused adam by @garrett4wade in #5187
- Improve parallel process of universal checkpoint conversion by @tohtana in #5343
- set the default to use set_to_none for clearing gradients in BF16 optimizer. by @inkcherry in #5434
- OptimizedLinear implementation by @jeffra in #5355
- Update README.md by @Jhonso7393 in #5453
- Update PyTest torch version to match PyTorch latest official (2.3.0) by @loadams in #5454
New Contributors
- @etiennebonnafoux made their first contribution in #5408
- @fasterinnerlooper made their first contribution in #5293
- @shiyuan680 made their first contribution in #5396
- @garrett4wade made their first contribution in #5187
- @Jhonso7393 made their first contribution in #5453
Full Changelog: v0.14.1...v0.14.2
v0.14.1 Patch release
What's Changed
- Update version.txt after 0.14.0 release by @mrwyattii in #5238
- Fp6 blog chinese by @xiaoxiawu-microsoft in #5239
- Add contributed HW support into README by @delock in #5240
- Set tp world size to 1 in ckpt load, if MPU is not provided by @samadejacobs in #5243
- Make op builder detection adapt to accelerator change by @delock in #5206
- Replace HIP_PLATFORM_HCC with HIP_PLATFORM_AMD by @rraminen in #5264
- Add CI for Habana Labs HPU/Gaudi2 by @loadams in #5244
- Fix attention mask handling in the Hybrid Engine Bloom flow by @deepcharm in #5101
- Skip 1Bit Compression and sparsegrad tests for HPU. by @vshekhawat-hlab in #5270
- Enabled LMCorrectness inference tests on HPU. by @vshekhawat-hlab in #5271
- Added HPU backend support for torch.compile tests. by @vshekhawat-hlab in #5269
- Average only valid part of the ipg buffer. by @BacharL in #5268
- Add HPU accelerator support in unit tests. by @vshekhawat-hlab in #5162
- Fix loading a universal checkpoint by @tohtana in #5263
- Add Habana Gaudi2 CI badge to the README by @loadams in #5286
- Add intel gaudi to contributed HW in README by @BacharL in #5300
- Fixed Accelerate Link by @wkaisertexas in #5314
- Enable mixtral 8x7b autotp by @Yejing-Lai in #5257
- support bf16_optimizer moe expert parallel training and moe EP grad_scale/grad_norm fix by @inkcherry in #5259
- fix comms dtype by @mayank31398 in #5297
- Modified regular expression by @igeni in #5306
- Docs typos fix and grammar suggestions by @Gr0g0 in #5322
- Added Gaudi2 CI tests. by @vshekhawat-hlab in #5275
- Improve universal checkpoint by @tohtana in #5289
- Increase coverage for HPU by @loadams in #5324
- Add NFS path check for default deepspeed triton cache directory by @HeyangQin in #5323
- Correct typo in checking on bf16 unit test support by @loadams in #5317
- Make NFS warning print only once by @HeyangQin in #5345
- resolve KeyError: 'PDSH_SSH_ARGS_APPEND' by @Lzhang-hub in #5318
- BF16 optimizer: Clear lp grads after updating hp grads in hook by @YangQun1 in #5328
- Fix sort of zero checkpoint files by @tohtana in #5342
- Add `distributed_port` for `deepspeed.initialize` by @LZHgrla in #5260
- [fix] fix typo s/simultanenously /simultaneously by @digger-yu in #5359
- Update container version for Gaudi2 CI by @raza-sikander in #5360
- compute global norm on device by @BacharL in #5125
- logger update with torch master changes by @rogerxfeng8 in #5346
- Ensure capacity does not exceed number of tokens by @jeffra in #5353
- Update workflows that use cu116 to cu117 by @loadams in #5361
- FP [6,8,12] quantizer op by @jeffra in #5336
- CPU SHM based inference_all_reduce improve by @delock in #5320
- Auto convert moe param groups by @jeffra in #5354
- Support MoE for pipeline models by @mosheisland in #5338
- Update pytest and transformers with fixes for pytest>= 8.0.0 by @loadams in #5164
- Increase CI coverage for Gaudi2 accelerator. by @vshekhawat-hlab in #5358
- Add CI for Intel XPU/Max1100 by @Liangliang-Ma in #5376
- Update path name on xpu-max1100.yml, add badge in README by @loadams in #5386
- Update checkout action on workflows on ubuntu 20.04 by @loadams in #5387
- Cleanup required_torch_version code and references. by @loadams in #5370
- Update README.md for intel XPU support by @Liangliang-Ma in #5389
- Optimize the fp-dequantizer to get high memory-BW utilization by @RezaYazdaniAminabadi in #5373
- Removal of cuda hardcoded string with get_device function by @raza-sikander in #5351
- Add custom reshaping for universal checkpoint by @tohtana in #5390
- fix pagable h2d memcpy by @GuanhuaWang in #5301
- stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5256
- Fix the FP6 kernels compilation problem on non-Ampere GPUs. by @JamesTheZ in #5333
New Contributors
- @vshekhawat-hlab made their first contribution in #5270
- @wkaisertexas made their first contribution in #5314
- @igeni made their first contribution in #5306
- @Gr0g0 made their first contribution in #5322
- @Lzhang-hub made their first contribution in #5318
- @YangQun1 made their first contribution in #5328
- @raza-sikander made their first contribution in #5360
- @rogerxfeng8 made their first contribution in #5346
- @JamesTheZ made their first contribution in #5333
Full Changelog: v0.14.0...v0.14.1