[QST] MoE auxiliary loss #6816

Open
osayamenja opened this issue Dec 4, 2024 · 0 comments

osayamenja commented Dec 4, 2024

Hello!

My understanding is that the gate layer implements Algorithm 1 of GShard. However, our auxiliary loss computation seems to deviate from the algorithm, and I would like help understanding why.

That is, we compute l_aux here by sum(me * ce) * num_experts, while line 13 of the algorithm specifies mean(me * ce) or, equivalently, sum(me * ce) / num_experts. The two expressions therefore differ by a constant factor of num_experts^2.

Why do we deviate there?
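
To make the comparison concrete, here is a minimal sketch in plain Python that evaluates both expressions on the same inputs. The function name aux_losses and the definitions of me (mean gate probability per expert) and ce (fraction of tokens routed to each expert by top-1 selection) are my assumptions based on the GShard description, not code copied from the repository.

```python
def aux_losses(gates, top1):
    """Return (scaled, mean) forms of the load-balancing loss.

    gates: list of S rows, each a list of E gate probabilities (assumed softmax output).
    top1:  list of S expert indices, the top-1 choice per token (assumed).
    """
    num_tokens = len(gates)
    num_experts = len(gates[0])
    # me[e]: mean gate probability given to expert e across all tokens
    me = [sum(row[e] for row in gates) / num_tokens for e in range(num_experts)]
    # ce[e]: fraction of tokens whose top-1 expert is e
    ce = [sum(1 for t in top1 if t == e) / num_tokens for e in range(num_experts)]
    dot = sum(m * c for m, c in zip(me, ce))
    scaled = dot * num_experts  # the form quoted above: sum(me * ce) * num_experts
    mean = dot / num_experts    # mean(me * ce), i.e. sum(me * ce) / num_experts
    return scaled, mean


# Perfectly balanced routing: 4 tokens, 4 experts, uniform gate probabilities.
gates = [[0.25] * 4 for _ in range(4)]
top1 = [0, 1, 2, 3]
scaled, mean = aux_losses(gates, top1)
print(scaled, mean, scaled / mean)  # 1.0 0.0625 16.0
```

In this balanced case the scaled form evaluates to 1.0 regardless of how the loss is later weighted, and the ratio between the two forms is num_experts^2 = 16. Since that factor is constant for a fixed number of experts, the two forms have gradients in the same direction and differ only in magnitude.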
