Describe the bug
I'm training Llama-3.1-70B-SFT with DPO using LoRA under ZeRO-3. The training log consistently prints, and then gets stuck at, this line: "Invalidate trace cache @ step 10: expected module 11, but got module 19".
Yet the same training configuration works fine with 7B models, completely bug-free.
Hardware
8 × A100 (80 GB)
DeepSpeed Config
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0
  }
}
```
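For what it's worth, the config above already sets `stage3_prefetch_bucket_size`, `stage3_max_live_parameters`, and `stage3_max_reuse_distance` to 0, which is the commonly suggested mitigation for the "Invalidate trace cache" warning. A minimal sanity-check sketch (parsing the config inline rather than from a file, which is an assumption about how it is loaded):

```python
import json

# The ZeRO-3 config from this report, parsed for a quick check
# before launching training. Note that 1e9 is valid JSON exponent
# notation and parses as a float.
ds_config = json.loads("""
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0
  }
}
""")

zero = ds_config["zero_optimization"]
# Prefetching and parameter caching are already disabled (set to 0),
# so the trace-cache invalidation here is not caused by prefetch.
assert zero["stage"] == 3
assert zero["stage3_prefetch_bucket_size"] == 0
assert zero["stage3_max_reuse_distance"] == 0
print("ZeRO stage:", zero["stage"])
```

This also confirms the "auto" placeholders are plain strings at this point; they are only resolved by the training framework at launch time.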