name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
bert_large_boost: operator error during training (AllReduce-op397(HcomAllReduce) load task fail)
Hardware Environment (Ascend/GPU/CPU):
Please delete the backend not involved:
/device ascend
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
HiAI/Milan_C17/20240414
master_20240426230938_a8763
Execute Mode (Mandatory) (PyNative/Graph):
Please delete the mode not involved:
/mode graph
test_ms_bert_large_boost_en_wiki_train_infer_epoch_40_8p_0003
The run passes, with normal performance and accuracy.
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.825 [p2p_mgmt.cc:285][4060972][Wait][P2PConnected]connected p2p timeout, timeout:600 s. local logicDevid:5, remote physic id:2.
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.875 [p2p_mgmt.cc:247][4060972]call trace: hcclRet -> 16
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.888 [p2p_mgmt.cc:182][4060972]call trace: hcclRet -> 16
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.902 [comm_factory.cc:2189][4060972][Get][ExchangerNetwork]Enable P2P Failed, ret[16]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.913 [comm_factory.cc:417][4060972][Create][CommOuter]exchangerNetwork create failed
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.933 [hccl_impl.cc:1452][4060972][Create][OuterComm]errNo[0x0000000005000006] tag[HcomAllReduce_6629421139219749105_0], created commOuter fail. commOuter[0] is null
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.055 [hccl_impl.cc:1227][3854446][Create][CommByAlg]CreateComm failed, commType[0], result: Level0[0] Level1[6] Level2[0].
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.087 [hccl_impl.cc:1007][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.101 [hccl_impl.cc:1029][3854446][hcclImpl][CreateComm]create comminfo by tag[HcomAllReduce_6629421139219749105_0] failed. return[4]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.151 [hccl_impl.cc:932][3854446][HcclImpl][PrepareCommRes]errNo[0x0000000005000004] tag[HcomAllReduce_6629421139219749105_0], create comm failed
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.174 [hccl_impl.cc:971][3854446][HcclImpl][PrepareCommRes] failed, tag[HcomAllReduce_6629421139219749105_0], inputMem ptr[0x12427f8fda00] size[75724288], outputMem ptr[0x1244dd0b0400] size[75724288], algType[520], streamId[21], root[4294967295], isP2p[0], isHaveCpuRank[0], return[0x0000000005000004]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.312 [all_reduce_operator.cc:228][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.355 [hccl_impl_base.cc:4228][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.366 [hccl_comm.cc:281][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.380 [hcom.cc:739][3854446][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_0], input_ptr[0x12427f8fda00], output_ptr[0x1244dd0b0400], count[18931072], data_type[4], op[0]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.397 [hcom_ops_kernel_info_store.cc:1161][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.444 [hcom_ops_kernel_info_store.cc:410][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.454 [hcom_ops_kernel_info_store.cc:3132][3854446][Load][Task]errNo[0x0000000005010004] load task failed. (load op[HcomAllReduce] fail)
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.563 [hccl_task_info.cc:327]3854446 Distribute: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD]call hccl op:Default/network/Switch-op5_kernel_graph2/Default/network/network/grad_reducer/AllReduce-op397(HcomAllReduce) load task fail
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.600 [davinci_model.cc:4431]3854446 DistributeTask: ErrorNo: 4294967295(failed) [LOAD][LOAD][Call][Distribute] for Task[2278] fail
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.618 [davinci_model.cc:607]3854446 DoTaskSink: ErrorNo: 4294967295(failed) [LOAD][LOAD][Distribute][Task] failed, model_id: 3.
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.633 [davinci_model.cc:787]3854446 Init: ErrorNo: 4294967295(failed) [LOAD][LOAD][Call][DoTaskSink] failed, model_id: 3.
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.653 [model_manager.cc:450]3854446 LoadModelOnline: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD]DavinciModel Init failed.
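To make the chain in the log above easier to follow, here is a minimal Python sketch that parses a few of these lines and picks out the earliest error. The regular expression and the helper name `parse_log` are illustrative assumptions based only on the line layout shown in this issue; they are not part of any HCCL or GE tooling. The last step checks that the buffer size in the `PrepareCommRes` line equals the element count in the `AllReduce` result line times 4 bytes, which is consistent with a 4-byte element type (the meaning of `data_type[4]` itself is not stated in this issue).

```python
import re

# Three lines copied verbatim from the log in this issue.
SAMPLE_LOG = """\
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.825 [p2p_mgmt.cc:285][4060972][Wait][P2PConnected]connected p2p timeout, timeout:600 s. local logicDevid:5, remote physic id:2.
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.380 [hcom.cc:739][3854446][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_0], input_ptr[0x12427f8fda00], output_ptr[0x1244dd0b0400], count[18931072], data_type[4], op[0]
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.563 [hccl_task_info.cc:327]3854446 Distribute: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD]call hccl op:Default/network/Switch-op5_kernel_graph2/Default/network/network/grad_reducer/AllReduce-op397(HcomAllReduce) load task fail
"""

# Assumed layout: [LEVEL] MODULE(pid,proc):timestamp [file:line]message
LINE_RE = re.compile(
    r"^\[(?P<level>\w+)\] (?P<module>\w+)\((?P<pid>\d+),(?P<proc>\w+)\):"
    r"(?P<ts>[\d\-:.]+) \[(?P<src>[\w.]+):(?P<line>\d+)\](?P<msg>.*)$"
)


def parse_log(text):
    """Return one dict per log line that matches the assumed layout."""
    records = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match:
            records.append(match.groupdict())
    return records


records = parse_log(SAMPLE_LOG)

# Timestamps are fixed-width, so string order equals time order.
first = min(records, key=lambda r: r["ts"])
timeout_s = int(re.search(r"timeout:(\d+) s", first["msg"]).group(1))
print(first["module"], first["src"], "timeout (s):", timeout_s)

# size[75724288] comes from the PrepareCommRes line, count[18931072] from the AllReduce line.
buffer_bytes = 75724288
count = int(re.search(r"count\[(\d+)\]", records[1]["msg"]).group(1))
bytes_per_elem, remainder = divmod(buffer_bytes, count)
print("bytes per element:", bytes_per_elem, "remainder:", remainder)
```

Sorted by timestamp, the first failure is the P2P connection timeout in `p2p_mgmt.cc`; the later `HcomAllReduce` load-task and GE `DavinciModel Init` errors follow it within about a millisecond.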
唐慧康 is currently triaging the issue.
Once triage is complete, it will be transferred to the responsible owner.
Please assign a maintainer to check this issue.
@wanlinhui_A
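Thank you for your question. You can comment //mindspore-assistant to get help faster:
Test case test_ms_bert_large_boost_en_wiki_train_msrun_infer_8p_0001 also has this problem.
bert_large_boost is an acceptance test case for Ascend service compute capacity, so this issue must not be left unresolved.
Verified on version B220: no AllReduce timeout error, but the Equal operator reports an error. 张朱谷承 confirmed this is a known issue; a ticket has been filed to 郑左贺.
Test case test_ms_resnet50_imagenet_pynative_train_check_loss_910_8p_0001 also shows this behavior, intermittently.
Test case test_ms_openpose_coco2017_train_train_infer_8p_0003 also has this problem on Ascend910_Arm+EulerOS.
The Equal operator issue was resolved on the 2nd.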
https://e.gitee.com/mind_spore/issues/list?issue=I9K1EM
Confirmed with 建娥 and 林辉 that the AllReduce problem still occurs; transferring back to 慧康.
0505: passes on 910B with B230.
root cause & fix solution
Passes on B230; the issue does not reproduce.
Need to add UT/ST: No
0505: passes on 910B with B230. FPS: 1111; 172.81 ms/step; acc=72.04
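As a quick sanity check on the 0505 figures, the sketch below multiplies the reported FPS by the reported step time. It assumes that FPS means samples per second summed over all 8 devices (the "8p" in the test case name); that reading is not stated in this issue.

```python
# Figures quoted in the 0505 result.
fps = 1111        # reported FPS
step_ms = 172.81  # reported step time in milliseconds
devices = 8       # "8p" in the test case name

# Assumption: FPS counts samples per second across all devices.
samples_per_step = fps * step_ms / 1000.0
per_device = samples_per_step / devices

print(f"samples per step: {samples_per_step:.2f}")  # 191.99
print(f"per device:       {per_device:.2f}")        # 24.00
```

Under that assumption the two numbers are mutually consistent: about 192 samples per step, or 24 per device.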