[MDT][MT][MindEarth][Ascend910B] FuXi network single-card training fails with ValueError: In epoch: 1 step: 2, loss is NAN or INF

name	about	labels
Bug Report	Use this template for reporting a bug	kind/bug

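Describe the current behavior (Mandatory)

Single-card training of the FuXi network fails. Step 1 of epoch 1 completes with a loss of 1.40234375, but at step 2 the loss becomes NaN or INF and training terminates with a ValueError raised from the loss monitor callback.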

Environment (Mandatory)

Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device Ascend910B

Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx): 2.3.0rc1
-- Python version (e.g., Python 3.7.5): 3.8.0
-- OS platform and distribution (e.g., Linux Ubuntu 16.04): Linux 10-90-66-192 4.15.0-45-generic #48-Ubuntu
-- GCC/Compiler version (if compiled from source): 7.5.0

Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

None yet.

Steps to reproduce the issue (Mandatory)

1. git clone https://gitee.com/mindspore/mindscience.git
2. cd MindEarth/applications/medium-range/fuxi
3. bash {fuxi_path}scripts/run_standalone_train.sh 1 Ascend ./configs/FuXi.yaml

Describe the expected behavior (Mandatory)

The FuXi network trains normally, without the loss becoming NaN or INF.

Related log / screenshot (Mandatory)

epoch: 1 step: 1, loss is 1.40234375
Traceback (most recent call last):
  File "main.py", line 95, in <module>
    train(config, fuxi_model, logger_obj)
  File "main.py", line 61, in train
    trainer.train()
  File "/home/ma-user/modelarts/outputs/train_url_0/MindSporeLab_Test/source_code/mindscience/MindEarth/mindearth/module/pretrain.py", line 283, in train
    self.solver.train(epoch=self.optimizer_params.get("epochs"),
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 1082, in train
    self._train(epoch,
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 630, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 939, in _train_process
    list_callback.on_train_step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_callback.py", line 437, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_callback.py", line 279, in on_train_step_end
    self.step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_loss_monitor.py", line 80, in step_end
    raise ValueError("In epoch: {} step: {}, loss is NAN or INF, training process cannot continue, "
ValueError: In epoch: 1 step: 2, loss is NAN or INF, training process cannot continue, terminating training.
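
For triage, the condition behind this error can be reproduced in isolation. The following is only a minimal sketch in plain Python, not MindSpore's actual code: the helper name check_loss is hypothetical, and the real check lives in mindspore/train/callback/_loss_monitor.py as shown in the traceback. It assumes the loss has already been converted to a Python float.

```python
import math


def check_loss(epoch, step, loss):
    """Hypothetical helper: raise ValueError when the loss is NaN or INF.

    The message mirrors the one in the log above; this is an illustration,
    not the MindSpore implementation.
    """
    if math.isnan(loss) or math.isinf(loss):
        raise ValueError(
            "In epoch: {} step: {}, loss is NAN or INF, training process cannot continue, "
            "terminating training.".format(epoch, step)
        )
    return loss


# Step 1 from the log: a finite loss passes through unchanged.
first_loss = check_loss(1, 1, 1.40234375)

# Step 2 from the log: a NaN loss triggers the error.
try:
    check_loss(1, 2, float("nan"))
    message = None
except ValueError as err:
    message = str(err)
```

Because step 1 yields a finite loss and step 2 does not, the non-finite value seems to first appear either in the first parameter update or in the forward and backward pass of the second step. The log alone does not show which.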
