Command executed: CUDA_VISIBLE_DEVICES=1,2,5,6 python3 main.py --task=grpo_train --model_name_or_path=outputs/Qwen-7B-SFT-FirstHalf/checkpoint-2802 --bf16 --use_vllm --checkpoint_dir=outputs/Qwen2.5-7B-Instruct-GRPO-SecondHalf --per_device_train_batch_size=2 --save_strategy=epoch
Error output:
CUDA_VISIBLE_DEVICES=1,2,5,6 python3 main.py --task=grpo_train --model_name_or_path=outputs/Qwen-7B-SFT-FirstHalf/checkpoint-2802 --bf16 --use_vllm --checkpoint_dir=outputs/Qwen2.5-7B-Instruct-GRPO-SecondHalf --per_device_train_batch_size=8 --save_strategy=epoch
Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.04it/s]
2025-03-18 19:07:23,351 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from gsm8k. Please make sure that you can trust the external codes.
2025-03-18 19:07:23,712 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
2025-03-18 19:07:23,712 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
2025-03-18 19:07:23,712 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
2025-03-18 19:07:26,240 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
INFO 03-18 19:07:30 init.py:207] Automatically detected platform cuda.
INFO 03-18 19:07:35 config.py:549] This model supports multiple tasks: {'classify', 'reward', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 03-18 19:07:35 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='outputs/Qwen-7B-SFT-FirstHalf/checkpoint-2802', speculative_config=None, tokenizer='outputs/Qwen-7B-SFT-FirstHalf/checkpoint-2802', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=outputs/Qwen-7B-SFT-FirstHalf/checkpoint-2802, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-18 19:07:35 cuda.py:229] Using Flash Attention backend.
INFO 03-18 19:07:36 model_runner.py:1110] Starting to load model outputs/Qwen-7B-SFT-FirstHalf/checkpoint-2802...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.44it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.34it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:01<00:00, 1.89it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.75it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.68it/s]
INFO 03-18 19:07:38 model_runner.py:1115] Loading model weights took 14.2487 GB
INFO 03-18 19:07:40 worker.py:267] Memory profiling takes 1.22 seconds
INFO 03-18 19:07:40 worker.py:267] the current vLLM instance can use total_gpu_memory (79.11GiB) x gpu_memory_utilization (0.40) = 31.64GiB
INFO 03-18 19:07:40 worker.py:267] model weights take 14.25GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 4.38GiB; the rest of the memory reserved for KV Cache is 12.87GiB.
INFO 03-18 19:07:40 executor_base.py:111] # cuda blocks: 15056, # CPU blocks: 4681
INFO 03-18 19:07:40 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 7.35x
INFO 03-18 19:07:42 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:08<00:00, 4.03it/s]
INFO 03-18 19:07:51 model_runner.py:1562] Graph capturing finished in 9 secs, took 0.85 GiB
INFO 03-18 19:07:51 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 12.75 seconds
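
The memory-profiling lines above can be sanity-checked: the KV-cache budget is whatever remains of the 0.40 utilization cap after weights, non-torch allocations, and the activation peak, and the 7.35x concurrency figure follows from the block count (assuming vLLM's default 16-token KV block; that block size is an assumption, not printed in the log):

```python
# Check vLLM's memory accounting and concurrency figures from the log (GiB).
total_gpu, util = 79.11, 0.40
budget = total_gpu * util                # 31.64 GiB usable by this instance
kv_cache = budget - 14.25 - 0.15 - 4.38  # minus weights, non-torch, activations
print(f"{budget:.2f} {kv_cache:.2f}")    # 31.64 12.86 (log rounds to 12.87)

blocks, tokens_per_block, max_seq = 15056, 16, 32768
print(f"{blocks * tokens_per_block / max_seq:.2f}x")  # 7.35x concurrency
```
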
  0%|          | 0/467 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/nfs/largemodel/wangjuan/deepseek-train/main.py", line 215, in <module>
[rank0]: main()
[rank0]: File "/nfs/largemodel/wangjuan/deepseek-train/main.py", line 207, in main
[rank0]: grpo_train(args)
[rank0]: File "/nfs/largemodel/wangjuan/deepseek-train/grpo_train.py", line 58, in train
[rank0]: trainer.train()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2241, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 3692, in training_step
[rank0]: inputs = self._prepare_inputs(inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/trl/trainer/grpo_trainer.py", line 560, in _prepare_inputs
[rank0]: prompt_completion_ids = torch.cat([prompt_ids, completion_ids], dim=1)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
0%| | 0/467 [00:02<?, ?it/s]
[rank0]:[W318 19:08:03.686005855 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
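
The traceback is the actual failure. Per the engine-init line above, vLLM was placed on `device_config=cuda:1`, while the rank-0 training model lives on `cuda:0`; `GRPOTrainer._prepare_inputs` then concatenates the vLLM-generated `completion_ids` with the training-side `prompt_ids` without reconciling devices, and `torch.cat` rejects tensors from two devices. (The sliding-window/sdpa warning at the top comes from the transformers-side model load and is unrelated to this crash.) Below is a minimal sketch of the failure mode and a local workaround, assuming at least two visible GPUs; patching the concatenation site like this is a stopgap, not an official TRL fix:

```python
import torch

# Reproduce the failure mode: torch.cat refuses tensors on different devices.
prompt_ids = torch.randint(0, 100, (2, 8), device="cuda:0")       # training device
completion_ids = torch.randint(0, 100, (2, 16), device="cuda:1")  # vLLM device

# torch.cat([prompt_ids, completion_ids], dim=1)
# -> RuntimeError: Expected all tensors to be on the same device, ...

# Workaround: move the completions onto the prompt tensor's device first.
completion_ids = completion_ids.to(prompt_ids.device)
prompt_completion_ids = torch.cat([prompt_ids, completion_ids], dim=1)
print(prompt_completion_ids.shape)  # torch.Size([2, 24])
```

Depending on the trl release, `GRPOConfig` also exposes vLLM placement and memory knobs (e.g. `vllm_device`, `vllm_gpu_memory_utilization`); pinning vLLM to a GPU the trainer does not use, or upgrading trl (this code path has been reworked in later releases), may avoid the mismatch without patching.
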
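The trailing NCCL warning is harmless for this run, but it can be addressed as the message itself suggests: destroy the process group on normal exit. A minimal sketch:

```python
import torch.distributed as dist

# Explicit teardown on normal exit, per the PyTorch warning above. The guard
# makes the call safe when no process group was ever initialized.
def cleanup_distributed() -> None:
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```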