
GVPMindSpore / mindspore


[CI][MS] Joint CI-910B: model export fails, PadV3 operator sharding strategy check reports an error

DONE
Bug-Report Member
Created on 2024-03-14 10:56
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

[CI][MS] Joint CI-910B: model export fails, PadV3 operator sharding strategy check reports an error

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/GPU/CPU/kirin/other chips

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

Steps to reproduce the issue (Mandatory)

  1. Using the Pangu Sigma network script, run the model export flow (run the export flow on 187 by executing lite_infer_71b_main.sh; CI/noah/910B/export2mindir_main.sh or CI/noah/910B/lite_infer_38b_4k_bs1_quant_main.sh also reproduces the issue).

Describe the expected behavior (Mandatory)

Related log / screenshot (Mandatory)

[ERROR] PARALLEL(459695,ffffa32580b0,python):2024-03-14-01:14:09.398.959 [mindspore/ccsrc/frontend/parallel/ops_info/pad_info.cc:101] CheckStrategy] PadV3Info6363: the padding dimension of input can not be split, the strategy of input is [const vector]{1, 8, 1, 1}, and the paddings flag is [const vector]{0, 1, 0, 0}
[ERROR] PARALLEL(459695,ffffa32580b0,python):2024-03-14-01:14:09.399.048 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:1126] InitForCostModelWithAutoRepeatCalc] PadV3Info6363: CheckStrategy failed.
[ERROR] PARALLEL(459695,ffffa32580b0,python):2024-03-14-01:14:09.399.066 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:1068] Init] PadV3Info6363 : Init failed.
Traceback (most recent call last):
File "/data2/wxr/xiaoyi_test/script/pangu_am_deploy/workspace/CI/noah/910B/pangu_am_deploy-release-v0.9.10/pangu_sigma/evaluate.py", line 250, in load_model
predict_layout = model_predict.infer_predict_layout(inputs_np, experts_np, attention_mask, position_ids, init_true, batch_valid_length, lora_ids_np, skip_backend_compile=skip_backend_compile)
File "/home/miniconda3/envs/ci39/lib/python3.9/site-packages/mindspore/train/model.py", line 1879, in infer_predict_layout
predict_net.compile(*predict_data)
File "/home/miniconda3/envs/ci39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 963, in compile
_cell_graph_executor.compile(self, *self._compile_args, phase=self.phase,
File "/home/miniconda3/envs/ci39/lib/python3.9/site-packages/mindspore/common/api.py", line 1584, in compile
result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Failure:operator PadV3 init failed
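
The first error line states the rule that was violated: a dimension that is padded must not be split by the sharding strategy. The sketch below restates that rule in Python so the reported values can be checked by hand. It is not the code in pad_info.cc; the function name is hypothetical, and it assumes that a non-zero entry in the paddings flag marks a padded dimension and that a strategy value other than 1 means the dimension is split.

```python
from typing import List, Sequence


def find_split_padded_dims(strategy: Sequence[int],
                           paddings_flag: Sequence[int]) -> List[int]:
    """Return indices of dimensions that are both padded and split.

    Hypothetical helper for illustration only. Assumptions:
    - paddings_flag[i] != 0 means dimension i is padded;
    - strategy[i] != 1 means dimension i is split across devices.
    """
    if len(strategy) != len(paddings_flag):
        raise ValueError("strategy and paddings_flag must have the same length")
    return [
        i
        for i, (split, padded) in enumerate(zip(strategy, paddings_flag))
        if padded != 0 and split != 1
    ]


# Values taken from the log above: dimension 1 is split 8 ways and padded.
offending = find_split_padded_dims([1, 8, 1, 1], [0, 1, 0, 0])
print(offending)
```

With the values from the log, dimension 1 is the only entry reported, which matches the message "the padding dimension of input can not be split". An empty result would mean the strategy passes this particular rule.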

Special notes for this issue (Optional)

Comments (5)

188******92 created the Bug-Report
188******92 added the kind/bug label
188******92 added the sig/mslite label
188******92 added the sig/parallel label
188******92 added the v2.3.0 label
188******92 added collaborator moran

Please assign maintainer to check this issue.
@188******92

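Thank you for your feedback. You can comment //mindspore-assistant to get help faster; for more labels, see the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch and MindSpore API mapping table
  3. If you hit a PyNative (dynamic graph) problem, you can set mindspore.set_context(pynative_synchronize=True) to view the error stack and help locate it.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  6. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible.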

Root cause analysis:
In MindSpore, padding order starts from the last dimension and goes backward (same as PyTorch), but GE padding order
starts from the first dimension and goes forward. So the purpose of this pass is to adapt MindSpore PadV3 op to Ascend
GE PadV3 op. Namely, reverse the padding order.

Main steps:

  1. Slice according to padding length.
  2. Create a concat vector in reverse order.
  3. Set new concat op as the new padding input for PadV3.
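
The three steps above can be illustrated on a plain Python list. This is only a sketch of the reordering idea, not the graph pass itself: the function names are hypothetical, the paddings are assumed to be a flat vector of (begin, end) pairs, and the shape helper assumes the reordered paddings apply to the trailing dimensions of the input.

```python
from typing import List, Sequence


def reverse_padding_order(paddings: Sequence[int]) -> List[int]:
    """Convert last-dimension-first paddings to first-dimension-first order.

    Hypothetical helper; assumes a flat vector of (begin, end) pairs.
    """
    if len(paddings) % 2 != 0:
        raise ValueError("paddings must hold (begin, end) pairs")
    # Step 1: slice the flat vector into one (begin, end) pair per dimension.
    pairs = [list(paddings[i:i + 2]) for i in range(0, len(paddings), 2)]
    # Step 2: concatenate the pairs in reverse order.
    reordered: List[int] = []
    for pair in reversed(pairs):
        reordered.extend(pair)
    # Step 3: the result would become the new paddings input of PadV3.
    return reordered


def padded_shape(shape: Sequence[int], forward_paddings: Sequence[int]) -> List[int]:
    """Apply first-dimension-first paddings to the trailing dims of shape.

    Illustration only; assumes the paddings cover the last
    len(forward_paddings) // 2 dimensions.
    """
    n = len(forward_paddings) // 2
    split_at = len(shape) - n
    lead = list(shape[:split_at])
    tail = [
        dim + forward_paddings[2 * i] + forward_paddings[2 * i + 1]
        for i, dim in enumerate(shape[split_at:])
    ]
    return lead + tail


# PyTorch-style paddings: last dim padded by (1, 2), second-to-last by (3, 4).
forward = reverse_padding_order([1, 2, 3, 4])
print(forward)
print(padded_shape((2, 3), forward))
```

In the usage lines, the second-to-last dimension's pair (3, 4) moves to the front of the vector, so a consumer that reads paddings from the first dimension forward pads the same dimensions as the original last-dimension-first vector intended.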

ST test cases have been added; see the linked PR.

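youshu changed the task status from TODO to VALIDATION
youshu added the rca/codelogic label
youshu added the rct/bugfix label
youshu added the ctl/co-testing label
youshu added collaborator youshu
youshu changed the assignee from youshu to 188******92
youshu changed the milestone from B-SIG-MSLite to B-ComponentTest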

Verification passed on 3.16: export, static-to-dynamic conversion, and inference all succeeded. For the log, see http://10.90.67.50:8080/jenkins/job/Combined_Pipeline_910B_PanguSigma_Inference/276/console
Software packages used:
MindSpore 2.3.0.B010-20240315231353
Ascend HDK 24.1.RC1.B031-20240307094942-32
Milan-ASL V100R001C17B214

i-robot added the 10 label
188******92 changed the task status from VALIDATION to DONE
