name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
FuXi network fails with an error during single-card training
Hardware Environment (Ascend/GPU/CPU):
Please delete the backend not involved:
/device Ascend910B
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :2.3.0rc1
-- Python version (e.g., Python 3.7.5) :3.8.0
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):Linux 10-90-66-192 4.15.0-45-generic #48-Ubuntu
-- GCC/Compiler version (if compiled from source): 7.5.0
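The fields above can be gathered programmatically when filing a report. The following is a minimal sketch using only the Python standard library; `collect_environment` is a hypothetical helper name, not part of MindSpore, and it does not attempt to detect the GCC version.

```python
import platform
from importlib import metadata


def collect_environment(package="mindspore"):
    """Gather the version fields requested in the Software Environment section.

    `package` is the distribution name to look up; this is a hypothetical
    helper for illustration, not a MindSpore API.
    """
    try:
        pkg_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current interpreter.
        pkg_version = "not installed"
    return {
        "package": package,
        "package_version": pkg_version,
        "python_version": platform.python_version(),
        "os_platform": platform.platform(),
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Running it inside the training environment prints one line per field, which can be pasted into the report.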
Execute Mode (Mandatory) (PyNative/Graph):
Please delete the mode not involved:
/mode graph
None yet.
1. git clone https://gitee.com/mindspore/mindscience.git
2. cd MindEarth/applications/medium-range/fuxi
3. bash {fuxi_path}scripts/run_standalone_train.sh 1 Ascend ./configs/FuXi.yaml
Expected result: the FuXi network trains normally.
epoch: 1 step: 1, loss is 1.40234375
Traceback (most recent call last):
  File "main.py", line 95, in <module>
    train(config, fuxi_model, logger_obj)
  File "main.py", line 61, in train
    trainer.train()
  File "/home/ma-user/modelarts/outputs/train_url_0/MindSporeLab_Test/source_code/mindscience/MindEarth/mindearth/module/pretrain.py", line 283, in train
    self.solver.train(epoch=self.optimizer_params.get("epochs"),
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 1082, in train
    self._train(epoch,
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 630, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 939, in _train_process
    list_callback.on_train_step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_callback.py", line 437, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_callback.py", line 279, in on_train_step_end
    self.step_end(run_context)
  File "/home/ma-user/anaconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/callback/_loss_monitor.py", line 80, in step_end
    raise ValueError("In epoch: {} step: {}, loss is NAN or INF, training process cannot continue, "
ValueError: In epoch: 1 step: 2, loss is NAN or INF, training process cannot continue, terminating training.
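
For readers unfamiliar with this error, the traceback shows the loss-monitoring callback aborting training because the step-2 loss is no longer a finite number. The sketch below illustrates that kind of check on plain Python floats. It is an assumption-laden illustration, not MindSpore's actual `_loss_monitor.py` code, and `check_loss` is a hypothetical name.

```python
import math


def check_loss(loss, epoch, step):
    """Return the loss unchanged, or raise ValueError if it is NaN or infinite.

    Hypothetical helper mirroring the message in the log above; it is not
    the MindSpore implementation.
    """
    if math.isnan(loss) or math.isinf(loss):
        raise ValueError(
            f"In epoch: {epoch} step: {step}, loss is NAN or INF, "
            "training process cannot continue, terminating training."
        )
    return loss


# Step 1 in the log reported a finite loss, so the check passes.
check_loss(1.40234375, epoch=1, step=1)

# A non-finite loss at step 2 triggers the same kind of error as in the log.
try:
    check_loss(float("nan"), epoch=1, step=2)
except ValueError as err:
    message = str(err)
```

The check only reports that the loss became non-finite; it says nothing about which operator or input produced the NaN or INF value.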