Hello!
My understanding is that the gate layer implements Algorithm 1 of GShard; however, our auxiliary loss computation seems to deviate from the algorithm; please help me understand.
That is, we compute `l_aux` here by `sum(me * ce) * num_experts`, while line 13 of the algorithm specifies `mean(me * ce)` or, equivalently, `sum(me * ce) / num_experts`.
Why do we deviate there?
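To make the discrepancy concrete, here is a minimal sketch in plain Python that compares the two forms on a toy input. The gate values are made up, and the definitions of `me` (mean gate probability per expert) and `ce` (fraction of tokens routed to each expert by top-1) are my assumption about what those names stand for; this is not the repository's actual code.

```python
# Hypothetical toy input: 4 tokens, 2 experts. Gate values are made up
# and each row sums to 1, as a softmax output would.
gates = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.8, 0.2],
]
num_tokens = len(gates)
num_experts = len(gates[0])

# me[e]: mean gate probability given to expert e over all tokens (assumed meaning).
me = [sum(row[e] for row in gates) / num_tokens for e in range(num_experts)]

# ce[e]: fraction of tokens whose top-1 expert is e (assumed meaning).
top1 = [max(range(num_experts), key=lambda e: row[e]) for row in gates]
ce = [top1.count(e) / num_tokens for e in range(num_experts)]

dot = sum(m * c for m, c in zip(me, ce))

# Form the issue attributes to the current code.
l_aux_scaled = dot * num_experts
# Form the issue attributes to line 13 of Algorithm 1.
l_aux_mean = dot / num_experts

# The two differ only by a constant factor of num_experts ** 2.
ratio = l_aux_scaled / l_aux_mean
print(l_aux_scaled, l_aux_mean, ratio)
```

Under these assumptions the two expressions differ only by the constant factor `num_experts ** 2`, so they yield the same gradient direction and differ in scale, which could in principle be absorbed into the loss coefficient. Whether that is the intended reason for the deviation is exactly what I am asking.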