[MT][910B][8p] bert_large_boost: operator error during training (AllReduce-op397(HcomAllReduce) load task fail)

DONE
Bug-Report
Created: 2024-04-29 11:43
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

bert_large_boost training fails with an operator error (AllReduce-op397(HcomAllReduce) load task fail).

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    HiAI/Milan_C17/20240414
    master_20240426230938_a8763

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related test case (Mandatory)

test_ms_bert_large_boost_en_wiki_train_infer_epoch_40_8p_0003

Steps to reproduce the issue (Mandatory)

  1. cd solution_test/cases/02network/02nlp/bert/train/
  2. pytest -s test_ms_bert_large_boost_en_wiki_train_infer_epoch_40_8p_0003.py

Describe the expected behavior (Mandatory)

The test case passes, with normal performance and accuracy.

Related log / screenshot (Mandatory)

[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.825 [p2p_mgmt.cc:285][4060972][Wait][P2PConnected]connected p2p timeout, timeout:600 s. local logicDevid:5, remote physic id:2.
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.875 [p2p_mgmt.cc:247][4060972]call trace: hcclRet -> 16
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.888 [p2p_mgmt.cc:182][4060972]call trace: hcclRet -> 16
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.902 [comm_factory.cc:2189][4060972][Get][ExchangerNetwork]Enable P2P Failed, ret[16]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.913 [comm_factory.cc:417][4060972][Create][CommOuter]exchangerNetwork create failed
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.933 [hccl_impl.cc:1452][4060972][Create][OuterComm]errNo[0x0000000005000006] tag[HcomAllReduce_6629421139219749105_0], created commOuter fail. commOuter[0] is null
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.055 [hccl_impl.cc:1227][3854446][Create][CommByAlg]CreateComm failed, commType[0], result: Level0[0] Level1[6] Level2[0].
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.087 [hccl_impl.cc:1007][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.101 [hccl_impl.cc:1029][3854446][hcclImpl][CreateComm]create comminfo by tag[HcomAllReduce_6629421139219749105_0] failed. return[4]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.151 [hccl_impl.cc:932][3854446][HcclImpl][PrepareCommRes]errNo[0x0000000005000004] tag[HcomAllReduce_6629421139219749105_0], create comm failed
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.174 [hccl_impl.cc:971][3854446][HcclImpl][PrepareCommRes] failed, tag[HcomAllReduce_6629421139219749105_0], inputMem ptr[0x12427f8fda00] size[75724288], outputMem ptr[0x1244dd0b0400] size[75724288], algType[520], streamId[21], root[4294967295], isP2p[0], isHaveCpuRank[0], return[0x0000000005000004]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.312 [all_reduce_operator.cc:228][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.355 [hccl_impl_base.cc:4228][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.366 [hccl_comm.cc:281][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.380 [hcom.cc:739][3854446][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_0], input_ptr[0x12427f8fda00], output_ptr[0x1244dd0b0400], count[18931072], data_type[4], op[0]
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.397 [hcom_ops_kernel_info_store.cc:1161][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.444 [hcom_ops_kernel_info_store.cc:410][3854446]call trace: hcclRet -> 4
[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.454 [hcom_ops_kernel_info_store.cc:3132][3854446][Load][Task]errNo[0x0000000005010004] load task failed. (load op[HcomAllReduce] fail)
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.563 [hccl_task_info.cc:327]3854446 Distribute: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD]call hccl op:Default/network/Switch-op5_kernel_graph2/Default/network/network/grad_reducer/AllReduce-op397(HcomAllReduce) load task fail
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.600 [davinci_model.cc:4431]3854446 DistributeTask: ErrorNo: 4294967295(failed) [LOAD][LOAD][Call][Distribute] for Task[2278] fail
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.618 [davinci_model.cc:607]3854446 DoTaskSink: ErrorNo: 4294967295(failed) [LOAD][LOAD][Distribute][Task] failed, model_id: 3.
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.633 [davinci_model.cc:787]3854446 Init: ErrorNo: 4294967295(failed) [LOAD][LOAD][Call][DoTaskSink] failed, model_id: 3.
[ERROR] GE(3831802,python):2024-04-29-10:43:06.747.653 [model_manager.cc:450]3854446 LoadModelOnline: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD]DavinciModel Init failed.
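
The log above is a single failure unwinding through several layers, so the earliest line is the most informative one. As a reading aid, here is a minimal Python sketch that parses lines of this shape and picks out the earliest HCCL entry. The regular expression and the helper names (`parse_line`, `first_error`, `field`) are assumptions made for illustration and are not part of MindSpore, HCCL, or GE. The pattern is fitted only to the HCCL lines quoted in this report; GE lines use a different layout and are skipped.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Hypothetical pattern, fitted only to the HCCL lines quoted in this report.
HCCL_LINE = re.compile(
    r"\[(?P<level>\w+)\] (?P<module>\w+)\((?P<pid>\d+),\w+\):"
    r"(?P<time>[\d\-:.]+) "
    r"\[(?P<file>[\w.]+):(?P<line>\d+)\]"
    r"\[(?P<tid>\d+)\](?P<msg>.*)"
)


class LogEntry(NamedTuple):
    module: str
    time: str      # fixed-width timestamp, so string order is time order
    source: str    # "file:line" of the reporting site
    tid: int
    msg: str


def parse_line(text: str) -> Optional[LogEntry]:
    """Parse one HCCL-style line; return None if the layout differs."""
    m = HCCL_LINE.match(text.strip())
    if m is None:
        return None
    return LogEntry(m["module"], m["time"], f'{m["file"]}:{m["line"]}',
                    int(m["tid"]), m["msg"])


def first_error(lines: Iterable[str]) -> Optional[LogEntry]:
    """Return the earliest parsed entry, the likely start of the chain."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    return min(entries, key=lambda e: e.time) if entries else None


def field(msg: str, name: str) -> Optional[str]:
    """Extract the first `name[value]` field from a message, if present."""
    m = re.search(rf"\b{re.escape(name)}\[([^\]]*)\]", msg)
    return m.group(1) if m else None


# Three lines copied verbatim from the log above.
SAMPLE = [
    "[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.454 [hcom_ops_kernel_info_store.cc:3132][3854446][Load][Task]errNo[0x0000000005010004] load task failed. (load op[HcomAllReduce] fail)",
    "[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.747.380 [hcom.cc:739][3854446][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_0], input_ptr[0x12427f8fda00], output_ptr[0x1244dd0b0400], count[18931072], data_type[4], op[0]",
    "[ERROR] HCCL(3831802,python):2024-04-29-10:43:06.746.825 [p2p_mgmt.cc:285][4060972][Wait][P2PConnected]connected p2p timeout, timeout:600 s. local logicDevid:5, remote physic id:2.",
]

root = first_error(SAMPLE)
print(root.source, "->", root.msg)

# Cross-check the buffer size against the element count.
size_bytes = 75724288  # size[...] from the PrepareCommRes line in the log
count = int(field(parse_line(SAMPLE[1]).msg, "count"))
bytes_per_element = size_bytes // count
print("bytes per element:", bytes_per_element)
```

On these lines the earliest entry is the `connected p2p timeout` message from p2p_mgmt.cc:285, and the buffer size divided by the element count is 4 bytes. That would be consistent with a 32-bit element type, although the issue itself does not say what data_type[4] denotes.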

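Special notes for this issue (Optional)

Tang Huikang (tanghuikang) is currently isolating the fault.
Once that is done, the issue will be transferred to the responsible owner.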

Comments (11)

wanlinhui_A created the Bug-Report
wanlinhui_A added the kind/bug label
wanlinhui_A added the attr/function label
wanlinhui_A added the v2.3.0.rc2 label
wanlinhui_A added the device/ascend label

Please assign a maintainer to check this issue.
@wanlinhui_A

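Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic-graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to see the error stack and help locate it.
  2. For model accuracy tuning problems, see the tuning guide on the official website.
  3. If you are reporting a framework bug, please make sure the issue includes the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, an official link to the training code, and the launch command that reproduces the error.
  4. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.
wanlinhui_A edited the description
wanlinhui_A changed the milestone from B-SIG-TBD to B-SIG-Parallel
wanlinhui_A changed the assignee from wangbixing to tanghuikang
wanlinhui_A added wangbixing as a collaborator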

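Test case test_ms_bert_large_boost_en_wiki_train_msrun_infer_8p_0001 has the same problem.

bert_large_boost is an acceptance test case for Ascend service compute capacity, so this issue must not be left open.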

tanghuikang added tanghuikang as a collaborator
tanghuikang changed the assignee from tanghuikang to zhengzuohe
tanghuikang changed the milestone from B-SIG-Parallel to B-SIG-AKG

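Verified on the B220 build: the AllReduce timeout error no longer appears, but the Equal operator reports an error. 张朱谷承 confirmed that this is a known issue, and the ticket was routed to zhengzuohe (郑左贺).

Test case test_ms_resnet50_imagenet_pynative_train_check_loss_910_8p_0001 shows the same symptom; it occurs intermittently.

Test case test_ms_openpose_coco2017_train_train_infer_8p_0003 also has this problem on Ascend910_Arm + EulerOS.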

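fangwenyi changed the assignee from zhengzuohe to huoxinyou
fangwenyi added zhengzuohe as a collaborator
huoxinyou changed the milestone from B-SIG-AKG to B-SIG-Parallel
huoxinyou added huoxinyou as a collaborator
huoxinyou changed the assignee from huoxinyou to tanghuikang
huoxinyou removed tanghuikang as a collaborator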

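The Equal operator problem was resolved on the 2nd.
https://e.gitee.com/mind_spore/issues/list?issue=I9K1EM
Confirmed with Jian'e (建娥) and Linhui (林辉) that the AllReduce problem still occurs; transferring back to Huikang (tanghuikang).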

xiaoyao changed the assignee from tanghuikang to xiaoyao
xiaoyao added tanghuikang as a collaborator

0505: 910B, B230 build: the test run passed.

Root cause & fix solution
The test passes on B230; the problem does not reproduce.

UT/ST needed: No

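xiaoyao changed the milestone from B-SIG-Parallel to B-SolutionTest
xiaoyao added the rct/bugfix label
xiaoyao added the rca/others label
xiaoyao changed the milestone from B-SolutionTest to B-MDTest
xiaoyao changed the task status from TODO to VALIDATION
wangbixing removed the rct/bugfix label
wangbixing removed the rca/others label
xiaoyao added xiaoyao as a collaborator
xiaoyao changed the assignee from xiaoyao to wanlinhui_A
xiaoyao added the ctl/solutiontest label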

0505: 910B, B230 build: the test run passed. FPS: 1111; 172.81 ms/step; acc=72.04
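
As a rough consistency check on these figures, the sketch below multiplies throughput by step time to estimate how many samples one step processes. It assumes that FPS means samples per second summed over all 8 devices and that the step time is the wall-clock time of one global step. Neither definition is stated in the issue, and the variable names are hypothetical.

```python
# Figures reported in the comment above.
fps = 1111.0            # reported "FPS" (assumed: samples/s over all devices)
step_time_ms = 172.81   # reported ms/step
devices = 8             # from the "8p" in the issue title

# samples per step = (samples per second) * (seconds per step)
samples_per_step = fps * step_time_ms / 1000.0
per_device = samples_per_step / devices

print(f"samples per step ~ {samples_per_step:.1f}")
print(f"samples per step per device ~ {per_device:.1f}")
```

If the assumptions hold, the two reported numbers describe the same run from two angles, so a mismatch between them in a later run would point at a logging or configuration change rather than a performance change.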

wangbixing changed the task status from VALIDATION to DONE
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the master label
