MindSpore / mindspore

An extra operation on optimizer.parameters is required before the final output of GPU training stops being NaN

DONE
Bug-Report
Opened this issue  
2021-10-22 12:58
name about labels
Bug Report Use this template for reporting a bug kind/bug

Environment

  • Hardware Environment(Ascend/GPU/CPU):

Uncomment only one /device <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/device gpu

  • Software Environment:
    -- MindSpore version (source or binary):1.3.0
    -- Python version (e.g., Python 3.7.5):3.7.5
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):18.04
    -- GCC/Compiler version (if compiled from source):

Related testcase

Steps to reproduce the issue

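  1. While reproducing the DIEN network, training in static graph mode on GPU ends with an output of NaN.
    Workaround: in the TrainOneStep cell, fetch the current optimizer's parameters and perform any operation on them, such as print(parameters); after that, the training output is no longer NaN.
    Request: training on GPU should produce normal output without this magic operation.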

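  2. The code is as follows:

    import mindspore.nn as nn
    from mindspore.ops import functional as F
    from mindspore.ops import operations as P

    class xxxTrainOneStepCell(nn.TrainOneStepCell):
        def __init__(self, network, optimizer, sens=1.0):
            super(xxxTrainOneStepCell, self).__init__(network, optimizer, sens)
            self.fill = P.Fill()
            self.Dtype = P.DType()
            self.shape = P.Shape()

        def construct(self, *all_inputs):
            weights = self.weights
            loss = self.network(*all_inputs)
            # Build the sensitivity tensor with the same dtype and shape as the loss.
            sens = self.fill(self.Dtype(loss), self.shape(loss), self.sens)
            grads = self.grad(self.network, weights)(*all_inputs, sens)
            succ = self.optimizer(grads)
            # Workaround: with these prints in place the output is no longer NaN.
            print("Printing parameters")
            print(weights)
            print("========= End of parameter printing ===========")

            return F.depend(loss, succ)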
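
One way to narrow the problem down without printing the weights inside construct is to record the loss returned by each training step on the host and check where NaN first appears. The sketch below is plain Python and uses no MindSpore API; the helper name `first_nan_step` and the sample `history` values are hypothetical and introduced here only for illustration.

```python
import math


def first_nan_step(losses):
    """Return the index of the first NaN loss, or None if no value is NaN."""
    for step, value in enumerate(losses):
        if math.isnan(value):
            return step
    return None


# Hypothetical loss values, one per training step (not taken from the issue).
history = [0.69, 0.52, float("nan"), float("nan")]
print(first_nan_step(history))
```

Knowing the first affected step makes it easier to match the failure against the execution log and the saved IR files.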
    

Describe the current behavior

Describe the expected behavior

Related log / screenshot

IR file link: https://pan.baidu.com/s/1gOd_3yVTHXdE2K57x_N6TQ
Extraction code: mdy3

Special notes for this issue

Comments (5)

yvlee created Bug-Report

Please assign a maintainer to check this issue. @fangwenyi @chengxiaoli

To get a faster response, please add a component (comp) or special interest group (sig) label to this issue. Labeled issues are pushed directly to the person responsible. More labels are listed at
https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md
Taking a component problem as an example, if you find that the problem is caused by the data component, you can comment:
//comp/data
You can also ask the data SIG for help by writing:
//comp/data
//sig/data
If the problem is simple, you can leave it for newcomers to the community to answer by writing:
//good-first-issue
Congratulations, you have learned how to add labels with commands. Now add a label in the comments below!

i-robot added the kind/bug label

Thank you for the feedback. We will analyze it and reply as soon as possible.

chengxiaoli added the mindspore-assistant label
chengxiaoli changed issue state from TODO to ACCEPTED
chengxiaoli set priority to Main
chengxiaoli set assignee to chengxiaoli
fangwenyi changed assignee from chengxiaoli to limingqi107
fangwenyi assigned collaborator chengxiaoli
fangwenyi added the DFX/start-analysis label
fangwenyi removed the DFX/start-analysis label

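1. The IR file can no longer be retrieved; the extraction code is reported as incorrect.
2. Could you provide a complete script, or a simplified one that reproduces the problem?
3. If it is not convenient to provide an executable script, set export GLOG_v=1, save the execution log, and provide it together with the IR files.
4. Based on the analysis so far, these print calls should not affect the execution result.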
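
The log capture requested in item 3 could be scripted roughly as follows. This is a minimal sketch that sets GLOG_v for a child process and writes its stdout and stderr to a file. The helper name `run_with_glog`, the script name `train.py`, and the log file name are assumptions for illustration, not names from the issue.

```python
import os
import subprocess


def run_with_glog(cmd, log_path, level="1"):
    """Run cmd with GLOG_v set, writing stdout and stderr to log_path."""
    # Copy the current environment and override only GLOG_v.
    env = dict(os.environ, GLOG_v=level)
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            cmd, env=env, stdout=log_file, stderr=subprocess.STDOUT
        )
    return result.returncode


# Hypothetical usage; train.py stands in for the real training script:
# run_with_glog(["python", "train.py"], "train_glog.log")
```

Redirecting stderr into the same file keeps the framework log and the script's own output in one place, so the ordering of messages is preserved.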

fangwenyi added the DFX/start-analysis label
fangwenyi changed issue state from ACCEPTED to WIP

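Please verify with the latest version. This issue is being closed for now; if needed, please open a new issue or change the issue state yourself. Thank you.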

fangwenyi changed issue state from WIP to VALIDATION
fangwenyi changed issue state from VALIDATION to DONE
ms_yan added the usability label
ms_yan added the sig/executor label
ms_yan added the user/individual label
