[BUG] Invalidate trace cache @ step 10: expected module 11, but got module 19 #6870

yafuly · 2024-12-14T04:05:40Z

Describe the bug
I'm training Llama-3.1-70B-SFT with DPO using LoRA under ZeRO-3. The training log consistently prints the following line and then hangs: "Invalidate trace cache @ step 10: expected module 11, but got module 19".

Yet the same training configuration works fine with 7B models, without this problem.

Hardware
8 × A100 (80 GB)

DeepSpeed Config
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0
  }
}
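
For anyone comparing configs against this report, here is a minimal sketch that lists which stage-3 tuning values in the config above are explicitly set to 0. The helper name `zeroed_stage3_keys` is hypothetical and not part of DeepSpeed; it only parses the JSON with the standard library. Whether these values have anything to do with the trace cache message is not established here.

```python
import json

# Only the "zero_optimization" part of the config from this report.
CONFIG_TEXT = """
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0
  }
}
"""


def zeroed_stage3_keys(config):
    """Hypothetical helper: return stage3_* keys whose numeric value is 0."""
    zero_cfg = config.get("zero_optimization", {})
    return sorted(
        key
        for key, value in zero_cfg.items()
        if key.startswith("stage3_")
        # Skip "auto" strings and booleans; keep only real numbers.
        and isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value == 0
    )


config = json.loads(CONFIG_TEXT)
zeroed = zeroed_stage3_keys(config)
print(zeroed)
```

The printed list can be pasted into a comment alongside the full config so others can check whether their setup matches.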

JinXins · 2024-12-15T16:49:02Z

Same issue.

tjruwase · 2024-12-18T15:04:32Z

@yafuly, @JinXins can you provide full repro steps, including scripts and command line? Thanks!

yafuly added the "bug" and "training" labels on Dec 14, 2024
