| name | about | labels |
| --- | --- | --- |
| Bug Report | Use this template for reporting a bug | kind/bug |
pangu-moe-alltoall-pipeline/pangu-moe-alltoall 910A 32p training failure

Network script path: https://e.gitee.com/mind_spore/repos/mindspore/models/tree/master/official/nlp/Pangu_alpha
Hardware Environment (Ascend / GPU / CPU): Ascend 910A
Failing versions:
- Run package: Milan_C17/20240315
- MindSpore version: 2.3.0/B080 r2.3.q1_20240320000457_1b2cb8cd14
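To confirm a reproduction environment matches the failing build, the installed version can be checked directly (a minimal sketch; `run_check` additionally executes a small sanity op on the backend):

```python
# Verify the installed MindSpore matches the failing build in this report.
import mindspore

print(mindspore.__version__)  # the failing build reports 2.3.0 (r2.3.q1 daily)
mindspore.run_check()         # runs a small op to sanity-check the install
```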
Execution Mode (PyNative / Graph): Graph
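For reference, Graph mode on Ascend is typically selected as below; the test's train.py presumably configures this itself, so this is only an illustrative sketch:

```python
# Select graph-compilation mode on the Ascend backend (MindSpore 2.x API).
import mindspore as ms

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
```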
Test case repository path: MindFormers_Test/cases/pangu/train
- test_mf_pangu_1_3b_train_moe_alltoall_check_loss_910_pangudata_32p_0001
- test_mf_pangu_1_3b_train_moe_alltoall_pipeline_check_loss_910_pangudata_32p_0001
Steps to reproduce:
1. Get the code from models.
2. `cd models/official/nlp/Pangu_alpha`
3. Launch training on each node (see the sketch after this list):
   ```bash
   # node1
   bash scripts/run_distributed_train_moe.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 1.3B 2 2 16 0 8 4 1 1
   # node2
   bash scripts/run_distributed_train_moe.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 1.3B 2 2 16 8 8 4 1 1
   # node3
   bash scripts/run_distributed_train_moe.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 1.3B 2 2 16 16 8 4 1 1
   # node4
   bash scripts/run_distributed_train_moe.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 1.3B 2 2 16 24 8 4 1 1
   ```
4. Verify whether the network trains successfully.
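A minimal sketch of the launch pattern above: the only argument that differs across nodes is the rank offset (0, 8, 16, 24). Reading it as `rank_start = node_id * devices_per_node` is an assumption inferred from the values, not from the script's documentation:

```python
# Print the four per-node launch commands; run each on its own node.
DEVICES_PER_NODE = 8
COMMON_ARGS = ["./pangu-data/pangu_30_step_bs64", "./hccl_32p.json",
               "32", "fp32", "1.3B", "2", "2", "16"]

for node_id in range(4):
    rank_start = node_id * DEVICES_PER_NODE  # 0, 8, 16, 24
    cmd = (["bash", "scripts/run_distributed_train_moe.sh"]
           + COMMON_ARGS
           + [str(rank_start), str(DEVICES_PER_NODE), "4", "1", "1"])
    print(f"node{node_id + 1}:", " ".join(cmd))
```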
Expected result: the pangu-moe-alltoall-pipeline network trains normally, with 32p training throughput reaching 73 fps.
Actual result: graph compilation fails with the following error:

```text
Traceback (most recent call last):
  File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/pangu/train/test_mf_pangu_1_3b_train_moe_alltoall_pipeline_check_loss_910_pangudata_32p_0001/train.py", line 558, in <module>
    run_train_pipeline(opt)
  File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/pangu/train/test_mf_pangu_1_3b_train_moe_alltoall_pipeline_check_loss_910_pangudata_32p_0001/train.py", line 543, in run_train_pipeline
    sink_size=callback_size, dataset_sink_mode=True)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1074, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 624, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 662, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 980, in compile_and_run
    self.compile(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 964, in compile
    jit_config_dict=self._jit_config_dict, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1583, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Compile graph kernel_graph1 failed.

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E20007: Failed to run graph fusion pass [BatchMatMulFusionPass]. The pass type is [built-in-ai-core-graph-pass]
Solution: 1. If the pass code is custom, check the error log and the verification logic. 2. If the pass code is not custom, perform a complete or partial dump by using npucollect.sh and then send the dump to Huawei technical support for fault locating.
TraceBack (most recent call last):
RemoveEdge failed because src is nullptr or run Unlink failed.[FUNC:RemoveEdge][FILE:graph_utils.cc][LINE:165]
RemoveEdge transpose-->matmul failed.[FUNC:LinkEdge][FILE:batch_matmul_fusion_pass.cc][LINE:314]
LinkEdge failed.[FUNC:DoTransposeFusion][FILE:batch_matmul_fusion_pass.cc][LINE:285]
DoTransposeFusion right failed.[FUNC:CheckAndDoTransposeFusion][FILE:batch_matmul_fusion_pass.cc][LINE:160]
op[BatchMatMul:Gradients/recompute_Default/network-PanguAlphaTrainPipelineWithLossScaleCell/network-_VirtualDatasetCell/_backbone-PipelineCell/network-MicroBatchInterleaved/network-PanGUAlphaWithLoss/network-PanguAlphaModel/backbone-PanguAlpha_Model/blocks-CellList/8-TransformerEncoderLayer/output-MoE/gradBatchMatMul-expand/BatchMatMul-op101], failed to execute TransposeFusion.[FUNC:Fusion][FILE:batch_matmul_fusion_pass.cc][LINE:66]
Failed to run graph fusion pass [BatchMatMulFusionPass]. The pass type is [built-in-ai-core-graph-pass]
[GraphOpt][FirstRoundFusion] Fail to run graph fusion pass[BatchMatMulFusionPass, built-in-ai-core-graph-pass]. Return value is 4294967295.[FUNC:RunOnePassFusion][FILE:graph_fusion.cc][LINE:1173]
[GraphOpt][FirstRoundFusion] MainGraph[kernel_graph1]: RunGraphFusion unsuccessfully.[FUNC:Fusion][FILE:graph_fusion.cc][LINE:100]
[GraphOpt][AfterFusion]Failed to do graph fusion for graph kernel_graph1. ErrNo is 4294967295.[FUNC:OptimizeOriginalGraph][FILE:fe_graph_optimizer.cc][LINE:346]
Call OptimizeOriginalGraph failed, ret:-1, engine_name:AIcoreEngine, graph_name:kernel_graph1[FUNC:OptimizeOriginalGraph][FILE:graph_optimize.cc][LINE:174]
[Call][PreRun] Failed, graph_id:2, session_id:0.[FUNC:CompileGraph][FILE:graph_manager.cc][LINE:4408]
[Compile][Graph]Compile graph failed, error code:1343225857, session_id:0, graph_id:2.[FUNC:CompileGraph][FILE:ge_api.cc][LINE:1159]
(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:972 CompileGraph
```
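Triage note: the E20007 failure occurs inside the CANN built-in graph fusion pass BatchMatMulFusionPass while it attempts TransposeFusion on the recomputed gradient BatchMatMul in the MoE block (`8-TransformerEncoderLayer/output-MoE/.../BatchMatMul-op101`). One way to confirm that this pass is the culprit is to switch it off with a CANN fusion-switch configuration file. The sketch below only writes such a file; the JSON schema follows CANN's fusion configuration documentation, and how the file is handed to this particular MindSpore/CANN build is an assumption to verify, so treat it as a diagnostic aid rather than a fix:

```python
# Sketch: generate a CANN fusion switch config that disables the failing
# graph fusion pass, to check whether graph compilation succeeds without it.
# Schema per CANN's fusion configuration docs; wiring the file into this
# MindSpore build (context option or environment variable) is an assumption.
import json

fusion_switch = {
    "Switch": {
        "GraphFusion": {
            "BatchMatMulFusionPass": "off"  # pass named in the E20007 log
        }
    }
}

with open("fusion_switch.json", "w", encoding="utf-8") as f:
    json.dump(fusion_switch, f, indent=2)
```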
Route this to 张银霞.
Please assign a maintainer to check this issue.
@zhongjicheng
Thanks for your question. You can comment //mindspore-assistant to get help faster.
The BatchMatMul operator is raising the error; please find the corresponding owner.