Releases: microsoft/DeepSpeed
v0.16.2 Patch Release
What's Changed
- Update pre-commit version by @loadams in #6821
- Update version.txt after 0.16.1 release by @loadams in #6826
- Pin HPU tests by @loadams in #6831
- Flops profiler: support einops.einsum by @lvhoaa in #6755 (see the sketch after this list)
- Pin pytest-subtests version for accelerate tests by @loadams in #6842
- Inference UTs check for triton support from accelerator by @raza-sikander in #6782
- Unpin pytest-subtests now that 0.14.1 is released by @loadams in #6844
- Merge LoCo with Zero++ by @XingyuXie in #6730
- Fix type error in `ZeROOrderedDict` by @oraluben in #6794
- Fix uneven head sequence parallelism bug (#6774) by @Eugene29 in #6797
- Fix nv-torch-nightly test by pinning transformers by @loadams in #6849
- Remove broken links to non-active site by @kaiksi-bb in #6854
- Avoid poisoning the process with CUDA calls at import time by @HollowMan6 in #6810
- Fix xpu tests workflow failure by changing pip index url by @Liangliang-Ma in #6864
- Domino updates by @GuanhuaWang in #6861
- add domino navigation by @GuanhuaWang in #6866
- Update TSC by @tjruwase in #6867
- Remove warnings from autodoc and sphinx by @loadams in #6788
- Update real_accelerator.py by @keiwoo in #6845
- Fix assertion for offloading states by @tohtana in #6855
- Remove pin from transformers version and fix Processing/Threading issues in tests by @loadams in #6822
- Add MLP/lm_head tp grain size setting. by @Yejing-Lai in #6828
- Fix --enable_each_rank_log when used with PDSH multi-node runner by @akeshet in #6863
- Update transformers ops unit tests to use `required_torch_version` by @loadams in #6884
- Don't error out when cpu accelerator doesn't have torch (as default for whl building) by @loadams in #6886
- Add arctic model support by adding w2 to all_reduce by @pi314ever in #6856
- Update code owners by @tjruwase in #6890
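For reference, a minimal sketch of profiling a module whose forward goes through einops.einsum, the case #6755 teaches the flops profiler to count. The `BilinearScore` module and its shapes are illustrative, not from DeepSpeed:

```python
import torch
import torch.nn as nn
from einops import einsum
from deepspeed.profiling.flops_profiler import get_model_profile

class BilinearScore(nn.Module):
    """Toy module (illustrative) whose forward uses einops.einsum."""
    def __init__(self, dim=64):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x):
        # With #6755, this einops.einsum call is counted by the profiler.
        return einsum(x, self.w, "b i, i j -> b j")

flops, macs, params = get_model_profile(
    BilinearScore(),
    input_shape=(8, 64),   # batch of 8 vectors of width 64
    print_profile=False,
    as_string=False,
)
print(flops, macs, params)
```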
New Contributors
- @lvhoaa made their first contribution in #6755
- @XingyuXie made their first contribution in #6730
- @Eugene29 made their first contribution in #6797
- @kaiksi-bb made their first contribution in #6854
- @HollowMan6 made their first contribution in #6810
- @keiwoo made their first contribution in #6845
- @akeshet made their first contribution in #6863
- @pi314ever made their first contribution in #6856
Full Changelog: v0.16.1...v0.16.2
v0.16.1 Patch Release
What's Changed
- Update version.txt after 0.16.0 release by @loadams in #6786
- Domino news update on readme.md by @GuanhuaWang in #6815
- Fix zero checkpoint by @xu-song in #6792
- Update python version but now we need to include setuptools on our own by @loadams in #6787
- Adding the new feature of FPDT by @YJHMITWEB in #6462
- Pin transformers to avoid errors with latest version by @loadams in #6820
- Ulysses offload blog by @samadejacobs in #6814
- add FPDT tutorial by @samadejacobs in #6813
- Update README.md by @samadejacobs in #6824
- Update README.md by @samadejacobs in #6825
- Pin transformers version in cpu-torch-latest due to multiprocessing error. by @loadams in #6823
Full Changelog: v0.16.0...v0.16.1
DeepSpeed v0.16.0
What's Changed
- Update version.txt after 0.15.4 release by @loadams in #6731
- Update GH hosted workflows to 24.04 by @loadams in #6717
- Add COMMITTER file by @tjruwase in #6741
- Update AMD apex version by @loadams in #6739
- Fix Type Name Inconsistency & Typo in cpu_adam by @xylian86 in #6732
- Add Domino code by @zhangsmallshark in #6733
- Add data type check for bf16 by @hwchen2017 in #6742
- add zero3 `module_granularity_threshold` to zero optimization by @inkcherry in #6649
- AIO File Offsets by @jomayeri in #6641
- Update path for BingBertSquad from DeepSpeedExamples by @loadams in #6746
- Sanitize inputs to eval() by @loadams in #6745
- Adding the governance doc by @minjiazhang in #6748
- Add no_sync context manager by @tjruwase in #6675 (see the first sketch after this list)
- Gaudi2 Nightly job for daily check by @raza-sikander in #6753
- Disable failing python tests by @loadams in #6758
- A faster and more memory-efficient implementation of `zero_to_fp32` by @xu-song in #6658 (see the second sketch after this list)
- Pin transformers version to work around latest torch requirements by @loadams in #6759
- make xpu ops compatible with oneapi 2025.0 by @baodii in #6760
- Add explicit parameters for torch.load by @loadams in #6751
- Fix setup.py bash cmd generation to correctly extract git info by @nelyahu in #6762
- Use `json_schema_extra` instead of extra keyword in `Field` by @qgallouedec in #6764 (see the third sketch after this list)
- Fix potential memory issues when using DeepSpeed ZeRO-3 by @wenbinc-Bin in #6726
- Removes unnecessary cloning by @swigls in #6761
- Enable torch compile on _allgather_params by @deepcharm in #6769
- Unpin with latest transformers fixes by @loadams in #6763
- docs: fix HF links by @imba-tjd in #6780
- Fix Doc Error: ZeRO Stage 2 gradient partitioning by @yewentao256 in #6775
- Cleanup code docs warnings by @loadams in #6783
- Domino Blog by @GuanhuaWang in #6776
- Update version.txt before release by @loadams in #6784
- Revert release workflow by @loadams in #6785
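First, a minimal sketch of the no_sync context manager from #6675, assuming it behaves like torch DDP's no_sync (gradient reduction is skipped inside the context); `engine`, `loader`, and `accumulation_steps` are placeholders:

```python
# `engine` is the object returned by deepspeed.initialize(...);
# `loader` and `accumulation_steps` are placeholders.
for step, batch in enumerate(loader):
    if (step + 1) % accumulation_steps != 0:
        # Accumulate gradients locally, skipping the reduce-scatter/allreduce.
        with engine.no_sync():
            loss = engine(batch)
            engine.backward(loss)
    else:
        loss = engine(batch)
        engine.backward(loss)
        engine.step()
```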
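Second, the usual entry points for the consolidation that #6658 speeds up, recovering a full fp32 state dict from a ZeRO checkpoint; the paths are placeholders:

```python
from deepspeed.utils.zero_to_fp32 import (
    convert_zero_checkpoint_to_fp32_state_dict,
    get_fp32_state_dict_from_zero_checkpoint,
)

ckpt_dir = "./checkpoints/my_run"  # placeholder path

# In-memory consolidation; runs on CPU, no GPU required.
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

# Or write the consolidated weights to disk instead.
convert_zero_checkpoint_to_fp32_state_dict(ckpt_dir, "./fp32_out")
```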
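Third, the pydantic v2 idiom behind #6764: arbitrary metadata on a `Field` must move into `json_schema_extra`. A generic sketch, not DeepSpeed's actual config classes:

```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    # pydantic v1 allowed Field(128, new_param="hidden_size"); in v2 such
    # extra keywords are deprecated and belong in json_schema_extra.
    hidden_size: int = Field(128, json_schema_extra={"new_param": "hidden_size"})

print(ExampleConfig.model_json_schema())
```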
New Contributors
- @zhangsmallshark made their first contribution in #6733
- @hwchen2017 made their first contribution in #6742
- @minjiazhang made their first contribution in #6748
- @qgallouedec made their first contribution in #6764
- @wenbinc-Bin made their first contribution in #6726
- @swigls made their first contribution in #6761
- @imba-tjd made their first contribution in #6780
- @yewentao256 made their first contribution in #6775
Full Changelog: v0.15.4...v0.16.0
v0.15.4 Patch Release
What's Changed
- Update version.txt after 0.15.3 release by @loadams in #6652
- Fix expert grad scaling problem with ZeRO optimizer by @wyooyw in #6546
- Add attribute check for language_model when replace last linear module by @Yejing-Lai in #6650
- fix init_device_mesh for torch 2.4 by @Lzhang-hub in #6614
- Fix dynamo issue by @oraluben in #6527
- sequence parallel for uneven heads by @inkcherry in #6392
- Add fallback for is_compiling by @tohtana in #6663 (see the sketch after this list)
- Update profiler registration check by @loadams in #6668
- Add support for H100/sm_90 arch compilation by @loadams in #6669
- Update Gaudi2 docker image by @loadams in #6677
- Update gaudi2 docker version to latest release (1.18) by @raza-sikander in #6648
- Update base docker image for A6000 GPU tests by @loadams in #6681
- Remove packages that no longer need to be updated in the latest container by @loadams in #6682
- Fix training of pipeline-based PEFT LoRA models by @xuanhua in #5477
- Update checkout action to latest version by @loadams in #5021
- Add attribute check to support git-base autotp by @Yejing-Lai in #6688
- fix memcpy issue on backward for zero-infinity by @xylian86 in #6670
- Free memory in universal checkpointing tests by @tohtana in #6693
- Explicitly set device when reusing dist env by @tohtana in #6696
- Update URL in README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6706
- Pin transformers to 4.45.2 in nv-ds-chat workflow by @loadams in #6710
- [Bug Fix] Support threads_per_head < 64 for wavefront size of 64 by @jagadish-amd in #6622
- Use one param coordinator for both train/inference scenarios by @tohtana in #6662
- Update yapf version by @loadams in #6721
- Update flake8 version by @loadams in #6722
- Switch what versions of python are supported by @loadams in #5676
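A hedged sketch of the fallback pattern behind #6663; the exact shim in DeepSpeed may differ:

```python
import torch

def is_compiling() -> bool:
    # Newer torch exposes torch.compiler.is_compiling(); older releases
    # only have the private torch._dynamo.is_compiling().
    if hasattr(torch, "compiler") and hasattr(torch.compiler, "is_compiling"):
        return torch.compiler.is_compiling()
    try:
        return torch._dynamo.is_compiling()
    except AttributeError:
        return False
```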
Full Changelog: v0.15.3...v0.15.4
v0.15.3 Patch Release
What's Changed
- Update version.txt after 0.15.2 release by @loadams in #6615
- Clean up prefetched parameters by @tohtana in #6557
- AIO CPU Locked Tensor by @jomayeri in #6592
- reduce setting global variables to reduce torch compile graph breaks by @NirSonnenschein in #6541
- Add API to get devices of offload states by @tohtana in #6586
- Ignore reuse_dist_env by @tohtana in #6623
- Add API for updating ZeRO gradients by @tjruwase in #6590 (see the sketch after this list)
- [compile] Show breakdown of graph break by @delock in #6601
- Accept btl_tcp_if_include option through launcher_args by @diskkid in #6613
- Add first Step in LR Schedulers by @jomayeri in #6597
- Support safetensors export by @xu-song in #6579
- add option to disable logger while compiling to avoid graph breaks by @ShellyNR in #6496
- Lock cache file of HF model list by @tohtana in #6628
- Add README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6588
- Update torch version in workflows by @tohtana in #6631
- Use file store for tests by @tohtana in #6632
- Fix Memory Leak In AIO by @jomayeri in #6630
- [XPU] upgrade xpu max1100 CI workflow to pytorch2.3 by @Liangliang-Ma in #6646
- [XPU] host timer check version from Torch 2.5 to Torch 2.6 by @YizhouZ in #6633
- [XPU] [DeepNVMe] use same cpu_op_desc_t with cuda by @Liangliang-Ma in #6645
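A sketch of #6590, assuming the new setter is exposed as `safe_set_full_grad` alongside the existing `safe_get_full_grad` helper in deepspeed.utils; `model` and the scaling are illustrative:

```python
from deepspeed.utils import safe_get_full_grad, safe_set_full_grad

# After engine.backward(loss): read and update a parameter's full fp32
# gradient even though ZeRO keeps gradients partitioned across ranks.
for name, param in model.named_parameters():
    grad = safe_get_full_grad(param)           # gathers the full gradient
    if grad is not None:
        safe_set_full_grad(param, grad * 0.5)  # e.g. scale it down
```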
Full Changelog: v0.15.2...v0.15.3
v0.15.2 Patch Release
What's Changed
- Update version.txt after 0.15.1 release by @loadams in #6493
- HPU: add required ENV vars to accelerator init by @nelyahu in #6495
- Quiet op_builder->is_compatible warning by @terry-for-github in #6093
- fix pipeline eval_batch micro_batches argument for schedule by @nelyahu in #6484
- Fix the broken url link by @rogerxfeng8 in #6500
- fix environment variable export bug for MultiNodeRunner by @TideDra in #5878
- Revert "BF16 optimizer: Clear lp grads after updating hp grads in hook" by @nelyahu in #6508
- wrap include cuda_bf16.h with ifdef BF16_AVAILABLE by @oelayan7 in #6520
- Avoid security issues of subprocess shell by @tjruwase in #6498
- Add conditional on torch version for scaled_dot_product_attention by @loadams in #6517
- Added Intel Gaudi to Accelerator Setup Guide by @ShifaAbu in #6543
- Skip failing newly added tests in accelerate by @loadams in #6574
- Use msgpack for p2p comm by @tohtana in #6547
- DeepNVMe perf tuning by @tjruwase in #6560
- [Accelerator] Cambricon MLU support by @Andy666G in #6472
- Fix gradient accumulation for Z2+offload by @tohtana in #6550
- fix errors when setting zero3 leaf modules with torch.compile by @NirSonnenschein in #6564
- [XPU] Support DeepNVMe new code structure by @Liangliang-Ma in #6532
- Add APIs to offload states of model, optimizer, and engine by @tohtana in #6011 (see the sketch after this list)
- add bfloat16 to inference support dtypes by @nelyahu in #6528
- [COMPILE] workflow for deepspeed + torch.compile by @YizhouZ in #6570
- Fixes on the accelerate side mean we do not need to skip this test by @loadams in #6583
- Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers by @loadams in #6584
- [ROCm] Fix subprocess error by @jagadish-amd in #6587
- Cleanup CODEOWNERS file to be valid by @loadams in #6603
- Add SSF Best practices badge by @loadams in #6604
- Move V100 workflows from cuda 11.1/11.7 to 12.1 by @loadams in #6607
- Fix SD workflow by @loadams in #6609
- Pin accelerate to fix CI failures/issues by @loadams in #6610
- Add llama3.2 vision autotp by @Yejing-Lai in #6577
- Improve DS logging control by @tjruwase in #6602
- Fix device selection using CUDA_VISIBLE_DEVICES by @tohtana in #6530
- Handle when `backend` is also in compile_kwargs by @oraluben in #6502
- Rearrange inference OPS and stop using builder.load by @oelayan7 in #5490
- Unpin accelerate tests, update lightning with node16 removal. by @loadams in #6611
- Enabled Qwen2-MoE Tensor Parallelism (TP) inference by @gyou2021 in #6551
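A minimal sketch of the state-offloading APIs from #6011, assuming the engine exposes offload_states/reload_states as described in the PR; `engine` is the object returned by deepspeed.initialize with ZeRO-3:

```python
# Free GPU memory between training phases (hypothetical usage).
engine.offload_states()   # move ZeRO/optimizer states to CPU
# ... run something else memory-hungry on the GPU here ...
engine.reload_states()    # bring the states back before resuming training
```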
New Contributors
- @TideDra made their first contribution in #5878
- @ShifaAbu made their first contribution in #6543
- @jagadish-amd made their first contribution in #6587
- @gyou2021 made their first contribution in #6551
Full Changelog: v0.15.1...v0.15.2
v0.15.1 Patch release
What's Changed
- Update version.txt after 0.15.0 release by @loadams in #6403
- Fix Type Mismatch by @jomayeri in #6410
- Fix redundant seq data parallel grp argument in Z3/MiCS by @samadejacobs in #5352
- add Huawei Ascend NPU setup guide by @xuedinge233 in #6445
- Add documentation for launcher without SSH by @dogacancolak-kensho in #6455
- Dtype support check for accelerator in UTs by @raza-sikander in #6360
- Store/Load CIFAR from local/offline by @raza-sikander in #6390
- Add the accelerator setup guide link in Getting Started page by @rogerxfeng8 in #6452
- Allow triton==3.0.x for fp_quantizer by @siddartha-RE in #6447
- Change GDS to 1 AIO thread by @jomayeri in #6459
- [CCL] fix condition issue in ccl.py by @YizhouZ in #6443
- Avoid gds build errors on ROCm by @rraminen in #6456
- TestLowCpuMemUsage UT get device by device_name by @raza-sikander in #6397
- Add workflow to build DS without torch to better test before releases by @loadams in #6450
- Fix patch for parameter partitioning in zero.Init() by @tohtana in #6388
- Add default value to "checkpoint_folder" in "load_state_dict" of bf16_optimizer by @ljcc0930 in #6446
- DeepNVMe tutorial by @tjruwase in #6449
- bf16_optimizer: fixes to different grad acc dtype by @nelyahu in #6485
- print warning if actual triton cache dir is on NFS, not just for default by @jrandall in #6487
- DS_BUILD_OPS should build only compatible ops by @tjruwase in #6489
- Safe usage of popen by @tjruwase in #6490
- Handle an edge case where `CUDA_HOME` is not defined on ROCm systems by @amorehead in #6488
New Contributors
- @xuedinge233 made their first contribution in #6445
- @siddartha-RE made their first contribution in #6447
- @ljcc0930 made their first contribution in #6446
- @jrandall made their first contribution in #6487
- @amorehead made their first contribution in #6488
Full Changelog: v0.15.0...v0.15.1
DeepSpeed v0.15.0
What's Changed
- Update version.txt after 0.14.5 release by @loadams in #5982
- move pynvml install to setup.py by @Rohan138 in #5840
- add moe topk(k>2) gate support by @inkcherry in #5881
- Move inf_or_nan_tracker to cpu for cpu offload by @BacharL in #5826
- Enable dynamic shapes for pipeline parallel engine inputs by @tohtana in #5481
- Add and Remove ZeRO 3 Hooks by @jomayeri in #5658
- DeepNVMe GDS by @jomayeri in #5852
- Pin transformers version on nv-nightly by @loadams in #6002
- DeepSpeed on Windows blog by @tjruwase in #6364
- Bug Fix 5880 by @jomayeri in #6378
- Update linear.py compatible with torch 2.4.0 by @terry-for-github in #5811
- GDS Swapping Fix by @jomayeri in #6386
- Long sequence parallelism (Ulysses) integration with HuggingFace by @samadejacobs in #5774
- reduce cpu host overhead when using moe by @ranzhejiang in #5578
- fix fp16 Qwen2 series model to DeepSpeed-FastGen by @ZonePG in #6028
- Add Japanese translation of Windows support blog by @tohtana in #6394
- Correct op_builder path to xpu files for trigger XPU tests by @loadams in #6398
- add pip install cutlass version check by @GuanhuaWang in #6393
- [XPU] API align with new intel pytorch extension release by @YizhouZ in #6395
- Pydantic v2 migration by @mrwyattii in #5167
- Fix torch check by @loadams in #6402
New Contributors
- @Rohan138 made their first contribution in #5840
- @terry-for-github made their first contribution in #5811
- @ranzhejiang made their first contribution in #5578
Full Changelog: v0.14.5...v0.15.0
v0.14.5 Patch release
What's Changed
- Update version.txt after 0.14.4 release by @mrwyattii in #5694
- Fixed Windows inference build. by @costin-eseanu in #5609
- Fix memory leak from _hp_mapping by @chiragjn in #5643
- Bug fix for the "Link bit16 and fp32 parameters in partition" by @U-rara in #5681
- [CPU] add fp16 support to shm inference_all_reduce by @delock in #5669
- Universal checkpoint for zero stage 3 by @xylian86 in #5475
- Split inference unit test injectionPolicy world_size into multiple tests by @oelayan7 in #5687
- ENV var added for recaching in INF Unit tests by @raza-sikander in #5688
- Disable nvtx decorator to avoid graph break by @tohtana in #5697
- Add an argument to enable the injection of missing state during the conversion of universal checkpoints by @xylian86 in #5608
- Change source of CPUAdam for xpu accelerator by @Liangliang-Ma in #5703
- Add additional paths to trigger xpu tests by @loadams in #5707
- Update XPU docker version by @loadams in #5712
- update xpu fusedadam opbuilder for pytorch 2.3 by @baodii in #5702
- DeepSpeed Universal Checkpointing: Blog and Tutorial by @samadejacobs in #5711
- UCP Chinese Blog by @HeyangQin in #5713
- Fix tutorial links by @samadejacobs in #5714
- Update node16 check on self-hosted runners and remove python 3.6 by @loadams in #5756
- fix the missing argument in test and typo by @xylian86 in #5730
- [INF] Enable torch compile for inference by @oelayan7 in #5612
- Update checkout action for nv-human-eval workflow by @loadams in #5757
- Add Windows scripts (deepspeed, ds_report). by @costin-eseanu in #5699
- Unit Test: Add error handling for rate limit exceeded in model list by @HeyangQin in #5715
- Fix memory leak for pipelined optimizer swapper by @mauryaavinash95 in #5700
- Remove duplicated variable by @xu-song in #5727
- Fix phi3 mini 128k load error by @Yejing-Lai in #5765
- [CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph by @delock in #5604
- Added wrappers for hpu tensors based on dtype by @deepcharm in #5771
- [bugfix] promote state in bf16_optimizer by @billishyahao in #5767
- Launcher mode with SSH bypass by @dogacancolak-kensho in #5728
- Update the list of supported models in the Chinese README of fastgen by @beep-bebop in #5773
- Add support for Microsoft Phi-3 model to DeepSpeed-FastGen by @adk9 in #5559
- Misplaced global variable `warned` by @anferico in #5725
- Fixes for latest Huggingface_hub changes on modelId -> id by @loadams in #5789
- reduce all-to-all communication volume when both expert and non-expert are tensor-parallel by @taozhiwei in #5626
- Update Ubuntu version for running python tests by @loadams in #5783
- fix: quantization with DeepSpeed HE by @Atry in #5624
- [INF] Add Qwen2RMSNorm to loaded layers in auto_tp by @oelayan7 in #5786
- Add chatglm2 & chatglm3 autotp by @Yejing-Lai in #5540
- Add new autotp supported model in doc by @Yejing-Lai in #5785
- Fix accuracy error of NPUFusedAdam by @penn513 in #5777
- Update torch version in cpu-torch-latest and nv-torch-latest-v100 tests to 2.4 by @loadams in #5797
- move is_checkpointable call reducing torch.compile Graph breaks by @NirSonnenschein in #5759
- Unpin transformers version by @loadams in #5650
- Update other workflows to run on Ubuntu 22.04 by @loadams in #5798
- [XPU] Use host time to replace xpu time when IPEX version is lower than 2.5 by @ys950902 in #5796
- Update MII tests to pull correct torchvision by @loadams in #5800
- Add fp8-fused gemm kernel by @sfc-gh-reyazda in #5764
- Add doc of compressed backend in Onebit optimizers by @Liangliang-Ma in #5782
- fix: handle exception when loading cache file in test_inference.py by @HeyangQin in #5802
- Pin transformers version for MII tests by @loadams in #5807
- Fix op_builder for CUDA 12.5 by @keshavkowshik in #5806
- Find ROCm on Fedora by @trixirt in #5705
- Fix CPU Adam JIT compilation by @lekurile in #5780
- GDS AIO Blog by @jomayeri in #5817
- [ROCm] Get rocm version from /opt/rocm/.info/version by @rraminen in #5815
- sequence parallel with communication overlap by @inkcherry in #5691
- Update to ROCm6 by @loadams in #5491
- Add fp16 support of Qwen1.5MoE models (A2.7B) to DeepSpeed-FastGen by @ZonePG in #5403
- Use accelerator to replace cuda in setup and runner by @Andy666G in #5769
- Link GDS blog to site by @tjruwase in #5820
- Non-reentrant checkpointing hook fix by @ic-synth in #5781
- Fix NV references by @tjruwase in #5821
- Fix docs building guide by @tjruwase in #5825
- Update clang-format version from 16 to 18. by @loadams in #5839
- Add Japanese translation of DeepNVMe blog by @tohtana in #5845
- Fix the bug of deepspeed sequence parallel working with batch size larger than 1 by @YJHMITWEB in #5823
- Upgrade HPU image to v1.16.2. by @vshekhawat-hlab in #5610
- OptimizedLinear updates by @jeffra in #5791
- Log operator warnings only in verbose mode by @tjruwase in #5917
- Use `torch.nan_to_num` to replace the numpy wrapper by @jinyouzhi in #5877 (see the sketch after this list)
- [Zero2] Reduce the unnecessary all-reduce when tensor size is 0. by @ys950902 in #5868
- Update container version for Gaudi2 CI by @raza-sikander in #5937
- Fix missing ds_id bug by @tjruwase in #5824
- Update LR scheduler configuration by @xiyang-aads-lilly in #5846
- HPUAccelerator: remove support in set_visible_devices_envs by @nelyahu in #5929
- Z3: optimizations for grad norm calculation and gradient clipping by @nelyahu in #5504
- Update xpu-max1100.yml with new config and add some tests by @Liangliang-Ma in #5668
- Add accelerator setup guides by @delock in #5827
- Allow accelerator to instantiate the device by @nelyahu in #5255
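For reference on #5877, the torch builtin handles NaN and both infinities in one call:

```python
import torch

x = torch.tensor([float("nan"), float("inf"), -float("inf"), 1.0])
# Replace NaN with 0 and clamp infinities to finite values.
print(torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4))
# -> tensor([0., 10000., -10000., 1.])
```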
New Contributors
- @U-rara made their first contribution in #5681
- @xylian86 made their first contribution in #5475
- @mauryaavinash95 made their first contribution in #5700
- @billishyahao made their first contribution in #5767
- @dogacancolak-kensho made their first contribution in #5728
- @beep-bebop made their first contribution in #5773
- @anferico made their first contribution in #5725
- @Atry made their first contribution in #5624
- @sfc-gh-reyazda made their first contribution in https://github.com/...
v0.14.4 Patch release
What's Changed
- Update version.txt after 0.14.3 release by @mrwyattii in #5651
- [CPU] SHM based allreduce improvement for small message size by @delock in #5571
- _exec_forward_pass: place zeros(1) on the same device as the param by @nelyahu in #5576
- [XPU] adapt lazy_call func to different versions by @YizhouZ in #5670
- fix IDEX dependence in xpu accelerator by @Liangliang-Ma in #5666
- Remove compile wrapper to simplify access to model attributes by @tohtana in #5581
- Fix hpZ with zero element by @samadejacobs in #5652
- Fixing the reshape bug in sequence parallel alltoall, which corrupted all QKV data by @YJHMITWEB in #5664
- enable yuan autotp & add conv tp by @Yejing-Lai in #5428
- Fix latest pytorch '_get_socket_with_port' import error by @Yejing-Lai in #5654
- Fix numpy upgrade to 2.0.0 BUFSIZE import error by @Yejing-Lai in #5680
- Update BUFSIZE to come from autotuner's constants.py, not numpy by @loadams in #5686
- [XPU] support op builder from intel_extension_for_pytorch kernel path by @YizhouZ in #5425
New Contributors
- @YJHMITWEB made their first contribution in #5664
Full Changelog: v0.14.3...v0.14.4