[QUESTIONS]:Some questions about running Domino #6851

Open
yingtongxiong opened this issue Dec 11, 2024 · 3 comments · May be fixed by #6880

Comments

@yingtongxiong

Hello, I recently read the Domino paper. I tried to run Domino by following this blog post, https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-domino/README.md, and this README, https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/README.md. I ran only the shell script https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/pretrain_llama_7b.sh, but I hit the following error:
[Image: screenshot of the error]
I think the cdb in DeepSpeed comes from the client, such as Megatron? That leaves me confused about the relationship between DeepSpeed and Megatron. If I want to run Domino, should I use https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/ or Megatron-DeepSpeed? Also, according to the paper, is Domino tensor-parallel only? I would like to know whether Domino supports ZeRO-3, since I found that the optimizer comes from Megatron rather than from DeepSpeed.

@yingtongxiong yingtongxiong added the enhancement New feature or request label Dec 11, 2024
@GuanhuaWang GuanhuaWang self-assigned this Dec 13, 2024
@hwchen2017
Contributor

Hi @yingtongxiong, thanks for reporting this error, and we will fix it soon.

To run the code now, replace deepspeed.comm with torch.distributed in deepspeed/runtime/domino/transformer.py, and use https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/
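
As a rough illustration of that workaround, here is a minimal sketch of the text substitution in Python. The helper name `swap_comm_module` and the sample source lines are hypothetical; the actual import statements in transformer.py are not shown in this thread, so check the file before applying anything like this.

```python
import re

# Matches the dotted module path `deepspeed.comm` as a whole name,
# so that longer names such as `deepspeed.communication` are left alone.
_COMM_PATTERN = re.compile(r"\bdeepspeed\.comm\b")


def swap_comm_module(source: str) -> str:
    """Return `source` with `deepspeed.comm` replaced by `torch.distributed`.

    Hypothetical helper for illustration only; it is not part of DeepSpeed.
    """
    return _COMM_PATTERN.sub("torch.distributed", source)


# Hypothetical input: the real contents of transformer.py may differ.
sample = (
    "import deepspeed.comm as dist\n"
    "from deepspeed.comm import all_reduce\n"
)
patched = swap_comm_module(sample)
print(patched)
```

This only rewrites the literal dotted path. A line such as `from deepspeed import comm as dist` would not be touched and would need to be edited by hand.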

Domino currently supports tensor parallelism only, but we plan to support ZeRO-3 in the future.

@hwchen2017 hwchen2017 self-assigned this Dec 14, 2024
@GuanhuaWang
Member

Hi, @yingtongxiong

Thanks for the question.

The short answer is that, for now, you should use https://github.com/microsoft/DeepSpeedExamples/blob/master/training/DeepSpeed-Domino/, which contains the minimal set of Megatron-related dependencies that we maintain.

Domino is not yet integrated with ZeRO-3, but it is on our roadmap. Thanks, @hwchen2017; please follow up with help here.

@hwchen2017 hwchen2017 linked a pull request Dec 16, 2024 that will close this issue
@yingtongxiong
Author

@hwchen2017 @GuanhuaWang Thank you very much.
