[QUESTIONS]: Some questions about running Domino #6851

yingtongxiong · 2024-12-11T03:26:08Z

Hello, I recently read the Domino paper. I tried to run Domino by following this blog https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-domino/README.md and this README https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/README.md. I only ran the script https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/pretrain_llama_7b.sh. However, I hit the following error:

I think the cdb in deepspeed comes from a client such as Megatron? That leaves me confused about the relationship between DeepSpeed and Megatron. If I want to run Domino, should I use https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/ or Megatron-DeepSpeed? Also, according to the paper, Domino is tensor-parallel only? I would like to know whether Domino supports zero3, since I found that the optimizer comes from Megatron rather than from DeepSpeed.

hwchen2017 · 2024-12-14T00:03:01Z

Hi @yingtongxiong, thanks for reporting this error, and we will fix it soon.

To run the code now, replace deepspeed.comm with torch.distributed in deepspeed/runtime/domino/transformer.py, and use https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/
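As a rough illustration of that workaround, here is a minimal sketch of the text substitution, assuming the file refers to the module either as `deepspeed.comm` or through `from deepspeed import comm as <alias>`. The helper name `swap_comm_backend` and the sample line are hypothetical and are not taken from the actual transformer.py; check the real file before applying anything like this.

```python
import re


def swap_comm_backend(source: str) -> str:
    """Rewrite references to deepspeed.comm so they point at torch.distributed.

    Hypothetical helper for illustration only; it operates on plain text and
    does not import deepspeed or torch.
    """
    # Assumed import style 1: "from deepspeed import comm as dist"
    source = re.sub(
        r"\bfrom\s+deepspeed\s+import\s+comm\s+as\s+(\w+)",
        r"import torch.distributed as \1",
        source,
    )
    # Assumed import style 2 (and any dotted use): "deepspeed.comm"
    return re.sub(r"\bdeepspeed\.comm\b", "torch.distributed", source)


# Hypothetical one-line excerpt, not the real file content.
sample = "import deepspeed.comm as dist\n"
print(swap_comm_backend(sample), end="")
```

The same function could be applied to the text of a local copy of deepspeed/runtime/domino/transformer.py and the result written back, ideally after saving a backup of the original file.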

Domino is tensor-parallel only for now, but we will support zero3 in the future.

GuanhuaWang · 2024-12-16T22:53:12Z

Thanks for the question.

The short answer is that, for now, you should use https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/, which is the minimal Megatron-related dependency that we maintain.

Right now Domino is not integrated with zero3, but that is on our roadmap. Thanks @hwchen2017, and please follow up with help here. Thanks.

yingtongxiong · 2024-12-19T05:58:48Z

yingtongxiong added the enhancement New feature or request label Dec 11, 2024

GuanhuaWang self-assigned this Dec 13, 2024

hwchen2017 self-assigned this Dec 14, 2024

hwchen2017 linked a pull request Dec 16, 2024 that will close this issue

Fix error caused by all_reduce call in domino #6880

Open
