
MindSpore / course

ValueError: In epoch: 1 step: 91, loss is NAN or INF, training process cannot continue, terminating training.

DONE
Bug-Report
Created on 2022-11-14 22:50

Many thanks to the authors for their open-source contribution!

When I run DeepLabV3 locally on a GPU, training always fails with the following error:

[WARNING] ME(444209:139802082456768,MainProcess):2022-11-14-22:34:45.630.991 [mindspore/context.py:920] For 'context.set_context', 'enable_auto_mixed_precision' parameter is deprecated. For details, please see the interface parameter API comments
the gray file is already exists!
mindrecord file is already exists!
[WARNING] ME(444209:139802082456768,MainProcess):2022-11-14-22:34:45.652.051 [mindspore/common/_decorator.py:38] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
Traceback (most recent call last):
  File "train_script.py", line 157, in <module>
    train(cfg)
  File "train_script.py", line 115, in train
    model.train(args.train_epochs, dataset, callbacks=cbs, dataset_sink_mode=True)
  File "/home/miniconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/train/model.py", line 1049, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/train/model.py", line 623, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/train/model.py", line 706, in _train_dataset_sink_process
    list_callback.on_train_step_end(run_context)
  File "/home/miniconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 381, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/home/miniconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 223, in on_train_step_end
    self.step_end(run_context)
  File "/home/miniconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/train/callback/_loss_monitor.py", line 77, in step_end
    "terminating training.".format(cur_epoch_num, cur_step_in_epoch))
ValueError: In epoch: 1 step: 91, loss is NAN or INF, training process cannot continue, terminating training.
[WARNING] MD(444209,7f263576b4c0,python):2022-11-14-22:37:37.443.531 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:75] ~DeviceQueueOp] preprocess_batch: 94; batch_queue: 15, 16, 15, 16, 15, 16, 15, 16, 15, 16; push_start_time: 2022-11-14-22:37:20.983.697, 2022-11-14-22:37:22.644.780, 2022-11-14-22:37:24.286.797, 2022-11-14-22:37:25.905.246, 2022-11-14-22:37:27.516.902, 2022-11-14-22:37:29.148.713, 2022-11-14-22:37:30.795.690, 2022-11-14-22:37:32.441.732, 2022-11-14-22:37:34.089.076, 2022-11-14-22:37:35.736.965; push_end_time: 2022-11-14-22:37:22.644.756, 2022-11-14-22:37:24.286.764, 2022-11-14-22:37:25.905.190, 2022-11-14-22:37:27.516.867, 2022-11-14-22:37:29.148.689, 2022-11-14-22:37:30.795.662, 2022-11-14-22:37:32.441.689, 2022-11-14-22:37:34.089.039, 2022-11-14-22:37:35.736.930, 2022-11-14-22:37:37.383.640.
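
For context, the traceback shows the ValueError being raised from the loss monitor callback's `step_end` once the reported loss is no longer finite. Below is a minimal sketch of that kind of check in plain Python. It is not MindSpore's actual implementation; `check_loss` is a hypothetical helper name, and only the message text is taken from the log above.

```python
import math


def check_loss(loss, cur_epoch_num, cur_step_in_epoch):
    """Return the loss as a float, or raise ValueError if it is NaN or infinite.

    Hypothetical helper for illustration only: it mimics the message seen in
    the traceback, but it is not MindSpore's own code.
    """
    value = float(loss)
    if math.isnan(value) or math.isinf(value):
        raise ValueError(
            "In epoch: {} step: {}, loss is NAN or INF, training process "
            "cannot continue, terminating training.".format(
                cur_epoch_num, cur_step_in_epoch))
    return value
```

Calling `check_loss(float("nan"), 1, 91)` raises a ValueError with the same wording as the one reported here, while a finite loss is returned unchanged.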

Environment:

  • Ubuntu 20.04 LTS
  • Python 3.7
  • MindSpore 1.8.1
  • NVIDIA RTX A6000 (48 GiB)

I hope the authors can help. Thank you very much!

Comments (5)

blxie created a Question

Please assign maintainer to check this issue.
@fangwenyi @chengxiaoli

Please add labels (comp or sig), also you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
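To get your code reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to the Pull Request; a labeled PR is routed directly to the responsible reviewer.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can additionally mark the type of the PR, for example as a bug fix or a feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands. Go ahead and add a label in the comments below!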

blxie edited the description
blxie changed the task type from Question to Bug-Report
fangwenyi changed the task status from TODO to ACCEPTED
fangwenyi added the mindspore-assistant label

Hello, we have received your issue and have assigned someone to analyze it.

fangwenyi set the assignee to zhangfanghe
fangwenyi set the associated project to MindSpore Issue Assistant

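Did you follow the steps on the official website? You could also try another MindSpore version, such as 1.9.
If the problem persists, please continue the discussion here.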

No feedback for a long time; closing the issue automatically.

zhangfanghe changed the task status from ACCEPTED to VALIDATION
Shawny changed the task status from VALIDATION to DONE
