| name | about | labels |
| --- | --- | --- |
| Bug Report | Use this template for reporting a bug | kind/bug |
Pipeline parallelism with optimizer sharding enabled and a 2x1 sharding strategy: inference fails with an error
Hardware Environment (Ascend/GPU/CPU) (Mandatory):
/device ascend
Software Environment (Mandatory):
-- MindSpore version: 2.2.14.20240410 (per the version-check warning in the log below)
-- Python version: Python 3.7.5
-- OS platform and distribution: Linux
-- GCC/Compiler version (if compiled from source):
Execute Mode (PyNative/Graph) (Mandatory):
/mode graph
Related test cases:
test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2
test_semi_auto_parallel_single_stage_transform_checkpoints_8p_4x2x1_1x4
Expected behavior: the cases run normally and inference succeeds.
============================= test session starts ==============================
platform linux -- Python 3.7.5, pytest-5.3.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/dfy/MindSporeTest/parallel
plugins: repeat-0.9.1, typeguard-2.13.3, timeout-1.4.2
collecting ... [WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:33.717.479 [mindspore/run_check/_check_version.py:382] MindSpore version 2.2.14.20240410 and "hccl" wheel package version 7.2 does not match. For details, refer to the installation guidelines: https://www.mindspore.cn/install
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:33.717.581 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 3
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:34.718.644 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 2
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:35.719.744 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 1
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:39.587.591 [mindspore/common/api.py:808] 'mindspore.ms_function' will be deprecated and removed in a future version. Please use 'mindspore.jit' instead.
collecting 1 item
collected 1 item
../pipeline_split/test_pipeline_checkpoint_change_2.py [WARNING] HCCL_ADPT(2295439,7fba5c991740,python):2024-04-12-11:17:40.089.776 [mindspore/ccsrc/plugin/device/ascend/hal/hccl_adapter/hccl_adapter.cc:63] GenHcclOptions] The environment variable DEPLOY_MODE is not set. Now set to default value 0
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:40.093.803 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:40.096.044 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:43.338.937 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: hccl_world_group
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:43.343.683 [mindspore/common/parameter.py:786] This interface may be deleted in the future.
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:54.384.353 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter momentum's parameter_shape in param_info is empty
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:54.384.398 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter learning_rate's parameter_shape in param_info is empty
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:54.618.716 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 2-5004544844489628105 [const vector]{0, 1}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.013.907 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 2-5004544844489628105
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.262.942 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 2-5004544844489628105
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.270.208 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 4-6301172352641561019 [const vector]{0, 1, 2, 3}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.414.971 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 4-6301172352641561019
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.689.845 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 4-6301172352641561019
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.693.430 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 1-2297668033614959926 [const vector]{0}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.695.039 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 1-2297668033614959926
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.149.977 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 1-2297668033614959926
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:57.153.123 [mindspore/ccsrc/frontend/parallel/step_parallel.cc:2340] GetSensLossPairs] Can not find the loss cnode
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.153.627 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 2-5208665662337742843 [const vector]{0, 2}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.155.210 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 2-5208665662337742843
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.393.822 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 2-5208665662337742843
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:57.628.153 [mindspore/ccsrc/frontend/parallel/pass/pass_utils.cc:119] ExtractBackwardMatMul] backward_matmul_dx_dw_map size:0
[WARNING] PRE_ACT(2295439,7fba5c991740,python):2024-04-12-11:17:58.233.782 [mindspore/ccsrc/backend/common/pass/adjust_depend_for_parallel_optimizer_recompute_all_gather.cc:84] IncreaseAllgatherFusionId] Increase the duplicated allgather fusion id
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:59.181.962 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter momentum's parameter_shape in param_info is empty
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:59.181.980 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter learning_rate's parameter_shape in param_info is empty
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:59.421.854 [mindspore/ccsrc/frontend/parallel/pass/pass_utils.cc:119] ExtractBackwardMatMul] backward_matmul_dx_dw_map size:0
dfy=============================================================
{'cell.weight', 'moments.cell1.weight', 'moments.cell.weight', 'cell1.weight'}
{'moments.cell3.weight', 'moments.cell.weight', 'moments.cell1.weight', 'cell3.weight', 'cell.weight', 'moments.cell2.weight', 'cell1.weight', 'cell2.weight'}
F
=================================== FAILURES ===================================
___ test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2 ____
    @Author('d30029696')
    @Level1
    @Env_Cards('1x8')
    @Manual
    @AR('SR.c54211e5')
    @SKIP_ENV_CPU()
    @SKIP_MODE_PYNATIVE(reason='pynative only support data parallel')
    def test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2():
        rank_id = contextbase.get_parallel_variable_from_env("RANK_ID")
        # pipeline_stage=2, semi-auto parallel, strategy 2x1; save ckpt and strategy files
        contextbase.set_parallel_context(device_num=8, pipeline_stages=2,
                                         parallel_mode="semi_auto_parallel",
                                         enable_parallel_optimizer=True,
                                         parallel_optimizer_config={'optimizer_weight_shard_size': 2,
                                                                    'parallel_optimizer_threshold': 0,
                                                                    'gradient_accumulation_shard': True},
                                         dataset_strategy='full_batch')
        device_num = context.get_auto_parallel_context("device_num")
        if rank_id == 0:
            context.set_auto_parallel_context(
                strategy_ckpt_config={"save_file": "./pipeline/strategy1/strategy_2x2x2_1.ckpt"})
        elif rank_id == 4:
            context.set_auto_parallel_context(
                strategy_ckpt_config={"save_file": "./pipeline/strategy1/strategy_2x2x2_2.ckpt"})
        net_p1 = PipelineNet(size=(32, 32), strategy=((2, 1), (2, 1)))
        fake_dataset = GeneratorFakeData(size=128, batch_size=64,
                                         image_size=(32,), num_classes=32)
        dataset_p1 = ds.GeneratorDataset(fake_dataset, ["data", "label"])
        # sink_mode=True triggers the error
        create_strategy_file(net_p1, dataset_p1, pipe_flag=True, micro_size=2, save_ckpt_flag=True,
                             ckpt_path=f"./pipeline/rank_{rank_id}", prefix="pipeline", sink_mode=True)

        # pipeline_stage=2, semi-auto parallel, strategy 2x2; save only the strategy file
        context.reset_auto_parallel_context()
        contextbase.set_parallel_context(device_num=8, parallel_mode="semi_auto_parallel",
                                         strategy_ckpt_config={
                                             "save_file": "./strategy_2x2.ckpt"})
        net_p1 = PipelineNet(
            size=(32, 32), strategy=((2, 2), (2, 2)), epoch=0)
        fake_dataset = GeneratorFakeData(size=128, batch_size=4,
                                         image_size=(32,), num_classes=32)
        dataset_p1 = ds.GeneratorDataset(fake_dataset, ["data", "label"])
        create_strategy_file(
            net_p1, dataset_p1, ckpt_path=f"./change/rank_{rank_id}", prefix="pipeline", sink_mode=True)

        # transform the CKPT files, then compare the two inference results
        if rank_id == 0:
            transform_checkpoints(src_checkpoints_dir="./pipeline",
                                  dst_checkpoints_dir="./change",
                                  ckpt_prefix="pipeline_changed",
                                  src_strategy_file="./pipeline/strategy1/strategy_2x2x2_1.ckpt",
>                                 dst_strategy_file="./strategy_2x2.ckpt")
../pipeline_split/test_pipeline_checkpoint_change_2.py:244:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/checkpoint_transform.py:421: in transform_checkpoints
    src_strategy_file, dst_strategy_file)
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/checkpoint_transform.py:269: in _transform_checkpoint_by_stage
    param_type_dict)
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/_parallel_serialization.py:405: in _transform_parallel_checkpoint
    from_dev_matrix, from_tensor_map, from_opt_shard_step, from_opt_shard_size, origin_tensor_shape)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dev_matrix = [2, 2], tensor_map = [1, -1], opt_shard_step = 1
opt_shard_size = 2, origin_full_tensor_shape = (32, 32)
    def _construct_tensor_layout_for_opt_shard(dev_matrix, tensor_map, opt_shard_step, opt_shard_size,
                                               origin_full_tensor_shape):
        """
        dev_mat = [4, 2, 2]
        tensor_map = [2, 1, 0]
        opt_size = 2
        =>
        dev_mat = [opt_size, 4, 2, 2] = [2, 4, 2, 2]
        tensor_map = [2, 3, 1, 0]
        thus new_strategy = [4, 2, 2, 2]
        the tensor_shape should reshape to (model_parallel_size, -1, xx, xx)
        first 4 means the model parallel sharding of data_dim
        second 2 means the opt sharding of data_dim
        And the model parallel sharding dim is the right of opt sharding dim, so it would be 0-1-2-3 model parallel sharding
        then 0-4 optimizer sharding.
        """

        if opt_shard_step == 0 or opt_shard_size == 0:
            return dev_matrix, tensor_map, list(origin_full_tensor_shape)
        tensor_strategy = _get_tensor_strategy(dev_matrix, tensor_map)
        model_parallel_shard_size = np.prod(tensor_strategy)
        if model_parallel_shard_size != opt_shard_step:
            raise ValueError("The optimizer sharding step {} is not equal to the model parallel sharding size {}.".
>                            format(opt_shard_step, model_parallel_shard_size))
E           ValueError: The optimizer sharding step 1 is not equal to the model parallel sharding size 2.

/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/_tensor.py:405: ValueError
============================== 1 failed in 37.74s ==============================
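For reference, the failing check can be reproduced from the values shown in the traceback (dev_matrix = [2, 2], tensor_map = [1, -1], opt_shard_step = 1). A minimal sketch, assuming _get_tensor_strategy reads each tensor_map entry as an index into the device matrix counted from the right, with -1 meaning unsharded:

```python
import numpy as np

# Values reported in the traceback for the 2x1-sharded (32, 32) weight.
dev_matrix = [2, 2]
tensor_map = [1, -1]   # -1: that tensor dimension is not sharded
opt_shard_step = 1

# Assumed semantics of _get_tensor_strategy: entry m picks the device-matrix
# dimension counted from the right; -1 contributes a shard count of 1.
tensor_strategy = [1 if m == -1 else dev_matrix[len(dev_matrix) - 1 - m]
                   for m in tensor_map]               # -> [2, 1]
model_parallel_shard_size = np.prod(tensor_strategy)  # -> 2

# The transform rejects the layout: 2 != 1 raises the ValueError above.
print(model_parallel_shard_size != opt_shard_step)    # True
```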
Test case issue.
Test case adaptation:
1. When the sharding strategy does not fully shard across the device matrix, add_prim_attr("repeated_num_in_dev_matrix_right_", False) must be configured (see the sketch after this list).
2. Feature limitation: when the source strategy uses pipeline, the destination strategy cannot interact with the cell-sharing feature.
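A minimal sketch of item 1, assuming the 2x1-sharded weight multiply inside PipelineNet is a plain ops.MatMul primitive (the net's internals are not shown in this issue, so these names are illustrative):

```python
import mindspore.ops as ops

# Hypothetical stand-in for the 2x1-sharded MatMul in PipelineNet.
matmul = ops.MatMul()
matmul.shard(((2, 1), (2, 1)))
# The 2x1 strategy leaves part of the device matrix unsharded; per the
# resolution above, set this attribute so the repeated factor is not placed
# on the right of the device matrix, letting the optimizer sharding step
# line up with the model parallel sharding size.
matmul.add_prim_attr("repeated_num_in_dev_matrix_right_", False)
```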
Not a bug; the test case needs adaptation. Closing this issue.