[ST][MS][NET][CV][efficientnet-B3][GPU][graph/pynative][1/8p] Out-of-memory error after changing batch_size from 128 to 256

Status: TODO
Type: RFC
Created: 2023-01-10 15:52
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

With efficientnet-b3 on the imagenet2012 dataset in a GPU V100 environment, changing batch_size to 256 (to align with competing frameworks) causes an out-of-memory error in both graph and pynative modes.
Model location: https://gitee.com/mindspore/models/tree/master/official/cv/Efficientnet/efficientnet-b3
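
As a rough illustration of why doubling batch_size can exhaust device memory: the memory held by activations grows linearly with batch size. The sketch below estimates the size of a single dense NCHW tensor. The feature-map shape is a hypothetical example, not taken from the efficientnet-b3 code, and actual usage also includes weights, gradients, optimizer state, and kernel workspace.

```python
def activation_bytes(batch, channels, height, width, bytes_per_elem=4):
    """Bytes needed for one dense NCHW tensor (float32 by default)."""
    return batch * channels * height * width * bytes_per_elem


# Hypothetical feature-map shape, for illustration only.
shape = dict(channels=40, height=150, width=150)

mib_128 = activation_bytes(128, **shape) / 2**20
mib_256 = activation_bytes(256, **shape) / 2**20
print(f"batch 128: {mib_128:.1f} MiB, batch 256: {mib_256:.1f} MiB")
```

Every activation kept for the backward pass doubles in the same way, so a configuration that fits at batch_size 128 may not fit at 256 on the same device.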

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device /GPU/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) : 2.0.0.20221220
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    commit_id: 470b760e

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

Test case repository path: solution_test/cases/02network/00cv/efficientnetB3/train
Test cases:
test_ms_efficientnetb3_imagenet2012_gpu_check_fps_1p_0001.py
test_ms_efficientnetb3_imagenet2012_gpu_check_loss_8p_0002.py

Steps to reproduce the issue (Mandatory)

  1. Get the code from the models repository
  2. cd models/official/cv/Efficientnet/efficientnet-b3/scripts
  3. In src/config.py, change batch_size to 256
  4. bash run_standalone_train_gpu.sh device_id dataset_path
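
Step 3 can be scripted when the reproduction is run repeatedly. The following is a minimal sketch that rewrites the integer assigned to batch_size in config source text. The layout of src/config.py is not shown in this issue, so the sample text (including the epoch_size value) and the helper name set_batch_size are assumptions made for illustration.

```python
import re


def set_batch_size(config_text, new_value):
    """Replace the integer assigned to batch_size in config source text."""
    # Matches `batch_size = 128` or `'batch_size': 128`, but not `eval_batch_size`.
    pattern = re.compile(r"""((?<![A-Za-z0-9_])['"]?batch_size['"]?\s*[:=]\s*)\d+""")
    updated, count = pattern.subn(lambda m: m.group(1) + str(new_value), config_text)
    if count == 0:
        raise ValueError("no batch_size assignment found")
    return updated


# Hypothetical config content; the real src/config.py may look different.
sample = "config = {\n    'batch_size': 128,\n    'epoch_size': 350,\n}\n"
print(set_batch_size(sample, 256))
```

Raising an error when nothing matches avoids silently running the test with the original batch size.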

Describe the expected behavior (Mandatory)

After changing batch_size, training runs normally.

Related log / screenshot (Mandatory)

[ERROR] DEVICE(92536,7f7485fff700,python):2023-01-10-16:40:36.975.732 [mindspore/ccsrc/runtime/pynative/op_executor.cc:174] WorkerLoop] Run lazy task failed, error message:Malloc for kernel input failed, Memory isn't enough, node:Default/Conv2D-op4

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/run_op_helper.cc:464 LaunchKernels

[WARNING] MD(92536,7f76bb7d8740,python):2023-01-10-16:40:37.029.710 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:93] ~DataQueueOp] preprocess_batch: 4; batch_queue: 0, 0, 0, 0, 0, 0, 0, 0, 16; push_start_time: 2023-01-10-16:40:31.580.877, 2023-01-10-16:40:32.067.214, 2023-01-10-16:40:32.185.923, 2023-01-10-16:40:32.354.475; push_end_time: 2023-01-10-16:40:31.581.664, 2023-01-10-16:40:32.067.924, 2023-01-10-16:40:32.185.948, 2023-01-10-16:40:36.992.537.
Traceback (most recent call last):
 File "./train.py", line 155, in <module>
   model.train(config.epoch_size, dataset, callbacks=cb, dataset_sink_mode=True, sink_size=100)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1051, in train
   initial_epoch=initial_epoch)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper
   func(self, *args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 625, in _train
   cb_params, sink_size, initial_epoch, valid_infos)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 703, in _train_dataset_sink_process
   outputs = train_network(*inputs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 107, in construct
   return self.network(*outputs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/wrap/loss_scale.py", line 336, in construct
   loss = self.network(*inputs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/amp.py", line 244, in construct
   out = self._backbone(data)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/jenkins0/solution_test/cases/02network/00cv/efficientnetB3/train/test_ms_efficientnetb3_imagenet2012_gpu_check_fps_1p_0001/scripts/train_standalone/src/models/effnet.py", line 125, in construct
   x = self.blocks(stem)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/jenkins0/solution_test/cases/02network/00cv/efficientnetB3/train/test_ms_efficientnetb3_imagenet2012_gpu_check_fps_1p_0001/scripts/train_standalone/src/models/effnet.py", line 77, in construct
   return self.layers(x)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/layer/container.py", line 279, in construct
   input_data = cell(input_data)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/jenkins0/solution_test/cases/02network/00cv/efficientnetB3/train/test_ms_efficientnetb3_imagenet2012_gpu_check_fps_1p_0001/scripts/train_standalone/src/models/effnet.py", line 61, in construct
   x = self.project_conv(x)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/layer/container.py", line 279, in construct
   input_data = cell(input_data)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in __call__
   raise err
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 640, in __call__
   output = self._run_construct(cast_inputs, kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 425, in _run_construct
   output = self.construct(*cast_inputs, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/layer/normalization.py", line 191, in construct
   self.moving_variance)[0]
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 296, in __call__
   return _run_op(self, self.name, args)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
   results = fn(*arg, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 733, in _run_op
   output = real_run_op(obj, op_name, args)
RuntimeError: Malloc for kernel input failed, Memory isn't enough, node:Default/Conv2D-op4

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/run_op_helper.cc:464 LaunchKernels
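
Note that the Python traceback above ends in normalization.py while the error message names Default/Conv2D-op4, so the node name in the message is the more direct pointer to the failing allocation. A small helper for pulling that node name out of a log is sketched below. The function name and the abbreviated sample log are illustrative assumptions; only the message text is copied from the log above.

```python
import re

# Message text as it appears in the log above.
OOM_PATTERN = re.compile(r"Memory isn't enough, node:(?P<node>\S+)")


def find_oom_nodes(log_text):
    """Return the distinct node names reported in out-of-memory messages, in order."""
    seen = []
    for match in OOM_PATTERN.finditer(log_text):
        node = match.group("node")
        if node not in seen:
            seen.append(node)
    return seen


# Abbreviated copy of the two relevant log lines.
log = (
    "[ERROR] DEVICE(...): Run lazy task failed, error message:"
    "Malloc for kernel input failed, Memory isn't enough, node:Default/Conv2D-op4\n"
    "RuntimeError: Malloc for kernel input failed, Memory isn't enough, node:Default/Conv2D-op4\n"
)
print(find_oom_nodes(log))
```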

Special notes for this issue (Optional)

Please route this issue to 安正气.

Comments (5)

zhangjie18 created the Bug-Report
zhangjie18 added the kind/bug label
zhangjie18 added the v2.0.0.rc1 label
zhangjie18 added the attr/function label
zhangjie18 added the stage/func-debug label
zhangjie18 added the sig/modelzoo label

Please assign maintainer to check this issue.
@zhangjie18

Please add labels (comp or sig), also you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get your code reviewed as soon as possible, add a component (comp) or special interest group (sig) label to your Pull Request. Labeled PRs are pushed directly to the responsible owner for review.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can also mark the type of the PR, for example a bug fix or a feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands. Go ahead and add them in the comments below.

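xiangjiawei007 edited the description
xiangjiawei007 changed the assignee from xiangjiawei007 to anzhengqi
zhanlijun changed the assignee from anzhengqi to zhongjicheng
zhanlijun added anzhengqi as a collaborator
zhanlijun changed the milestone from B-SIG-ModelZoo to B-SolutionTest
zhanlijun changed the task status from TODO to REJECTED
zhanlijun changed the task status from REJECTED to TODO
zhanlijun changed the milestone from B-SolutionTest to B-SIG-ModelZoo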

This network is owned by 张杰; routing the issue to 张杰.

zhongjicheng changed the assignee from zhongjicheng to zhangjie18
zhanlijun changed the milestone from B-SIG-ModelZoo to B-SIG-Kit
zhanlijun changed the assignee from zhangjie18 to wangcong
zhanlijun added zhangjie18 as a collaborator

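Responsibilities for network maintenance and the algorithm kit have been divided: issues involving comparison with competing frameworks are handled by the algorithm team. This has been agreed among the kit team, the maintenance team, and the version PM.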

zhaoting added the ccb/rfc label

CCB conclusion: this issue comes from testing against competing frameworks; add the rfc label for now.

qiuluyu changed the task type from Bug-Report to RFC
