DeepSpeed with ZeRO3 strategy cannot build 'fused_adam' #6892
Comments
@LeonardoZini, can you please share the log showing the
With ZeRO stage 3 model sharding, special handling is required to access parameters. See the following links for more details.
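For illustration, a minimal sketch of that special handling, assuming a ZeRO-3 sharded model and DeepSpeed's deepspeed.zero.GatheredParameters context manager (the helper name inspect_weights is hypothetical):

```python
import deepspeed

def inspect_weights(model):
    # Under ZeRO stage 3 each rank only holds a shard, so a parameter's
    # data looks like an empty tensor outside of a gather context.
    params = list(model.parameters())
    # Temporarily gather the full tensor (here only the first parameter).
    with deepspeed.zero.GatheredParameters(params[:1]):
        p = params[0]
        # Full shape/size is only visible inside the context.
        print(p.shape, p.numel())
```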
The logs are this one and this.
Thank you for the references!
Describe the bug
I am using DeepSpeed with the Hugging Face Trainer to fine-tune an LLM. With the ZeRO-2 strategy I don't have any problems, but I also need to shard the parameters since I'm working on long-context sequences.
When using ZeRO-3, the trainer raises an exception at the beginning of training:
RuntimeError: Error building extension 'fused_adam'
I installed DeepSpeed with the command
TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0" DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext"
(I also tried version 0.15.4.) I also tried
TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0" DS_BUILD_FUSED_ADAM=1 pip install deepspeed --global-option="build_ext"
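To reproduce the build outside the Trainer (for example in an interactive job on the same SLURM node), something along these lines should trigger the same JIT compilation; this is a sketch that assumes DeepSpeed's op_builder interface:

```python
# Reproduces the fused_adam JIT build in isolation, outside the HF Trainer.
from deepspeed.ops.op_builder import FusedAdamBuilder

builder = FusedAdamBuilder()
print("compatible:", builder.is_compatible())  # checks CUDA/compiler prerequisites
fused_adam = builder.load()  # JIT-compiles the extension; should raise the same
                             # "Error building extension 'fused_adam'" if the build fails
print("loaded:", fused_adam)
```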
What changes is that if I specify "offload_optimizer" and "offload_param" in the DeepSpeed JSON config file, it doesn't throw any error, but I lose any reference to the model parameters (the weights are empty tensors). I am using a SLURM scheduler, and one thing I noticed is that the ds_report outputs differ: outside SLURM, fused_adam seems installed, while inside SLURM it is not.
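For context, the relevant ZeRO-3 section looks roughly like the sketch below (written as a Python dict; the values are placeholders, and passing the dict directly to TrainingArguments(deepspeed=...) is just one way the HF integration accepts it, the other being a path to the JSON file):

```python
# Sketch of a ZeRO-3 config with CPU offload; values are placeholders.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
}

# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", deepspeed=ds_config)
```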
pip env
ds_report output
ds_report output in SLURM
output log
As highlighted in these logs, the number of parameters goes from 4568002560 before the training loop to 266240 after the training loop (the "Parameter Offload" entry makes me suspect this is related).
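A hedged way to check whether the shrinking count is just ZeRO-3 partitioning rather than lost weights, assuming the ds_numel attribute that DeepSpeed attaches to partitioned parameters:

```python
def count_params(model):
    # p.numel() is (near) zero for partitioned ZeRO-3 parameters, while
    # p.ds_numel (added by DeepSpeed) reports the full, unpartitioned size.
    local = sum(p.numel() for p in model.parameters())
    full = sum(getattr(p, "ds_numel", p.numel()) for p in model.parameters())
    print(f"local (partitioned) numel: {local}, full numel: {full}")
```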
System info:
Launcher context
I am launching with torchrun:
srun torchrun --nnodes=1 --nproc-per-node=2 --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT --rdzv-id=$SLURM_JOB_NAME --rdzv-backend="c10d" --max_restarts=$MAX_RESTARTS trainer.py