MindSpore / mindscience

[MDT][MT][MindEarth][Ascend910B] FuXi network single-card training fails with ValueError: In epoch: 1 step: 2, loss is NAN or INF

Status: TODO
Type: Bug-Report
Created: 2024-05-21 11:36
Template: Bug Report (Use this template for reporting a bug)
Labels: kind/bug

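Describe the current behavior (Mandatory)

Single-card training of the FuXi network fails: after step 1 of epoch 1 completes, step 2 raises ValueError because the loss is NAN or INF.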

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

/device Ascend910B

  • Software Environment (Mandatory):
    -- MindSpore version: 2.3.0rc1
    -- Python version: 3.8.0
    -- OS platform and distribution: Linux 10-90-66-192 4.15.0-45-generic #48-Ubuntu
    -- GCC/Compiler version (if compiled from source): 7.5.0

  • Execute Mode (Mandatory) (PyNative/Graph):

/mode graph

Related testcase (Mandatory)

None yet.

Steps to reproduce the issue (Mandatory)

1. git clone https://gitee.com/mindspore/mindscience.git
2. cd mindscience/MindEarth/applications/medium-range/fuxi
3. bash {fuxi_path}scripts/run_standalone_train.sh 1 Ascend ./configs/FuXi.yaml

Describe the expected behavior (Mandatory)

FuXi network training runs normally, with a finite loss at every step.

Related log / screenshot (Mandatory)

epoch: 1 step: 1, loss is 1.40234375
Traceback (most recent call last):
  File "main.py", line 95, in <module>
    train(config, fuxi_model, logger_obj)
  File "main.py", line 61, in train
    trainer.train()
  File "/home/ma-user/modelarts/outputs/train_url_0/MindSporeLab_Test/source_code/mindscience/MindEarth/mindearth/module/pretrain.py", line 283, in train
    self.solver.train(epoch=self.optimizer_params.get("epochs"),
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 1082, in train
    self._train(epoch,
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 630, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 939, in _train_process
    list_callback.on_train_step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_callback.py", line 437, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_callback.py", line 279, in on_train_step_end
    self.step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_loss_monitor.py", line 80, in step_end
    raise ValueError("In epoch: {} step: {}, loss is NAN or INF, training process cannot continue, "
ValueError: In epoch: 1 step: 2, loss is NAN or INF, training process cannot continue, terminating training.
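
The traceback ends in the loss-monitoring callback (_loss_monitor.py, step_end), which aborts training once the reported loss is no longer finite. The following is a minimal sketch of that kind of guard, assuming the loss is available as a plain Python float. check_loss is a hypothetical helper written for illustration, not MindSpore's actual implementation; only the message text is taken from the log above.

```python
import math


def check_loss(epoch: int, step: int, loss: float) -> None:
    """Hypothetical helper: stop as soon as the loss is NaN or infinite."""
    if math.isnan(loss) or math.isinf(loss):
        # Message text mirrors the error shown in the log above.
        raise ValueError(
            "In epoch: {} step: {}, loss is NAN or INF, training process cannot "
            "continue, terminating training.".format(epoch, step)
        )
    print("epoch: {} step: {}, loss is {}".format(epoch, step, loss))


# Step 1 has a finite loss and is only reported.
check_loss(1, 1, 1.40234375)

# Step 2 has a non-finite loss and raises, as in the report.
try:
    check_loss(1, 2, float("nan"))
except ValueError as err:
    print(err)
```

The error is therefore raised where the bad value is detected, not where it originates: the non-finite loss was produced earlier, in the forward or backward pass of step 2.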

Special notes for this issue (Optional)

Comments (0)

rhxry created the Bug-Report
rhxry added the labels kind/bug, sig/mindscience, v2.3.0, attr/function, stage/func-debug
rhxry added collaborator rhxry
rhxry set the assignee to fengxun
rhxry removed collaborator rhxry
rhxry added collaborator rhxry
rhxry changed the title
rhxry changed the title
rhxry changed the assignee from fengxun to Zhou Chuansai
rhxry changed the description
rhxry changed the title
rhxry changed the description
