How do I know if ZeRO stage 3 is working when using DeepSpeed? #6877

hwhyyds · 2024-12-16T14:18:07Z

I used accelerate to wrap the model for multiple GPUs, but the full model is copied to every GPU instead of being sharded across them. Why?

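import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
from transformers import AutoModelForCausalLM

# Configure ZeRO stage 3 from the JSON config file.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="./configs/rl_ds_zero_3.json", zero_stage=3)

# model_path: path or hub ID of the model (not shown in the original post).
base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin, mixed_precision="bf16")
base_model = accelerator.prepare(base_model)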

rl_ds_zero_3.json

{
  "train_micro_batch_size_per_gpu": 1,
  "model_parallel_size": 3,
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 5e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "number_checkpoints": 4,
    "synchronize_checkpoint_boundary": false,
    "contiguous_memory_optimization": true
  }
}
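
One way to check whether the parameters were actually partitioned is to inspect them after `accelerator.prepare`. The sketch below is not an official DeepSpeed API. It assumes that ZeRO stage 3 attaches metadata attributes such as `ds_id` and `ds_numel` to each partitioned parameter and leaves a local placeholder whose `numel()` is 0 until the parameter is gathered. The helper name `summarize_zero3_partitioning` and the keys of the returned dictionary are hypothetical.

```python
from typing import Any, Dict, Iterable, Tuple


def summarize_zero3_partitioning(
    named_params: Iterable[Tuple[str, Any]],
) -> Dict[str, int]:
    """Count parameters that carry ZeRO-3 metadata.

    Assumption: a partitioned parameter has a `ds_id` attribute and a
    `ds_numel` attribute holding its full (unpartitioned) element count,
    while `numel()` reports only what is currently materialized locally.
    """
    total = 0
    partitioned = 0
    local_elements = 0
    full_elements = 0
    for _name, param in named_params:
        total += 1
        local = param.numel()
        local_elements += local
        if hasattr(param, "ds_id"):
            partitioned += 1
            # Fall back to the local count if ds_numel is missing.
            full_elements += getattr(param, "ds_numel", local)
        else:
            full_elements += local
    return {
        "total_params": total,
        "partitioned_params": partitioned,
        "local_elements": local_elements,
        "full_elements": full_elements,
    }
```

Calling `summarize_zero3_partitioning(base_model.named_parameters())` on each rank after `accelerator.prepare` and printing the result should show `partitioned_params` equal to `total_params`, with `local_elements` far below `full_elements`, if sharding is in effect. If `partitioned_params` is 0, the model is most likely replicated. Comparing `torch.cuda.memory_allocated()` across ranks is a useful cross-check.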

tjruwase · 2024-12-18T11:21:17Z

@hwhyyds, can you please share more steps to reproduce, including logs, scripts, and command line?
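
When collecting logs for a report like this, it can help to record how each process was launched. A single process started with plain `python` has nothing to shard across, so every parameter stays on one device. The sketch below assumes the launcher (for example `accelerate launch`, `deepspeed`, or `torchrun`) exports the usual `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The helper name `describe_launch_env` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def describe_launch_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Summarize the distributed launch settings visible to this process.

    Assumption: the launcher sets RANK, LOCAL_RANK, and WORLD_SIZE.
    Missing variables are treated as a single-process run.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": world_size,
        "distributed": world_size > 1,
    }


if __name__ == "__main__":
    # Print once per process so the log shows every rank.
    print(describe_launch_env())
```

If every process reports `world_size` of 1, the script was not started through a distributed launcher, and ZeRO stage 3 has no other ranks to partition the parameters across.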

tjruwase added the training label Dec 18, 2024
