[CT][MS][checkpoint_transform] Pipeline parallelism with optimizer sharding enabled and a 2x1 sharding strategy: checkpoint transformation reports an error

DONE
Bug-Report · Member
Created on 2024-04-12 17:16

name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior / 问题描述 (Mandatory / 必填)

With pipeline parallelism, optimizer sharding enabled, and a 2x1 sharding strategy, the checkpoint transformation step before inference reports an error.
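A minimal sketch of the configuration being described, in case it helps frame the setup (assumed API usage; the actual test code and the real PipelineNet appear in the traceback below):

import numpy as np
import mindspore as ms
from mindspore import nn, ops

# Pipeline parallelism with two stages, parallel optimizer (optimizer sharding)
# enabled, full-batch dataset strategy -- as described in this report.
ms.set_auto_parallel_context(device_num=8,
                             parallel_mode="semi_auto_parallel",
                             pipeline_stages=2,
                             enable_parallel_optimizer=True,
                             parallel_optimizer_config={"optimizer_weight_shard_size": 2},
                             dataset_strategy="full_batch")

class MatMulCell(nn.Cell):
    """Stand-in for the test's PipelineNet; only the 2x1 sharding strategy matters here."""
    def __init__(self):
        super().__init__()
        self.matmul = ops.MatMul().shard(((2, 1), (2, 1)))   # 2x1 strategy on both inputs
        self.weight = ms.Parameter(ms.Tensor(np.ones((32, 32)), ms.float32), "weight")

    def construct(self, x):
        return self.matmul(x, self.weight)

# Pipeline-stage assignment, training, checkpoint saving and the later
# transform_checkpoints call are omitted; see the test in the traceback below.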

Environment / 环境信息 (Mandatory / 必填)

  • Hardware Environment(Ascend/GPU/CPU) / 硬件环境:

/device ascend

  • Software Environment / 软件环境 (Mandatory / 必填):
    -- MindSpore version (e.g., 1.7.0.Bxxx) : 2.2.14.20240410 (from the log below)
    -- Python version (e.g., Python 3.7.5) : Python 3.7.5 (from the log below)
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): Linux
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode / 执行模式 (Mandatory / 必填)(PyNative/Graph):

/mode graph

Related testcase / 关联用例 (Mandatory / 必填)

test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2
test_semi_auto_parallel_single_stage_transform_checkpoints_8p_4x2x1_1x4

Steps to reproduce the issue / 重现步骤 (Mandatory / 必填)

  1. cd parallel
  2. bash test_feature_parallel.sh parallel/pipeline_split/test_pipeline_checkpoint_change_2.py test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2

Describe the expected behavior / 预期结果 (Mandatory / 必填)

Runs normally and inference succeeds.

Related log / screenshot / 日志 / 截图 (Mandatory / 必填)

============================= test session starts ==============================
platform linux -- Python 3.7.5, pytest-5.3.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/dfy/MindSporeTest/parallel
plugins: repeat-0.9.1, typeguard-2.13.3, timeout-1.4.2
collecting ... [WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:33.717.479 [mindspore/run_check/_check_version.py:382] MindSpore version 2.2.14.20240410 and "hccl" wheel package version 7.2 does not match. For details, refer to the installation guidelines: https://www.mindspore.cn/install
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:33.717.581 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 3
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:34.718.644 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 2
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:35.719.744 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 1
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:39.587.591 [mindspore/common/api.py:808] 'mindspore.ms_function' will be deprecated and removed in a future version. Please use 'mindspore.jit' instead.
collecting 1 item
collected 1 item

../pipeline_split/test_pipeline_checkpoint_change_2.py [WARNING] HCCL_ADPT(2295439,7fba5c991740,python):2024-04-12-11:17:40.089.776 [mindspore/ccsrc/plugin/device/ascend/hal/hccl_adapter/hccl_adapter.cc:63] GenHcclOptions] The environment variable DEPLOY_MODE is not set. Now set to default value 0
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:40.093.803 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:40.096.044 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:43.338.937 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: hccl_world_group
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:43.343.683 [mindspore/common/parameter.py:786] This interface may be deleted in the future.
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:54.384.353 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter momentum's parameter_shape in param_info is empty
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:54.384.398 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter learning_rate's parameter_shape in param_info is empty
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:54.618.716 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 2-5004544844489628105 [const vector]{0, 1}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.013.907 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 2-5004544844489628105
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.262.942 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 2-5004544844489628105
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.270.208 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 4-6301172352641561019 [const vector]{0, 1, 2, 3}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.414.971 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 4-6301172352641561019
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.689.845 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 4-6301172352641561019
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.693.430 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 1-2297668033614959926 [const vector]{0}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.695.039 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 1-2297668033614959926
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.149.977 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 1-2297668033614959926
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:57.153.123 [mindspore/ccsrc/frontend/parallel/step_parallel.cc:2340] GetSensLossPairs] Can not find the loss cnode
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.153.627 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 2-5208665662337742843 [const vector]{0, 2}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.155.210 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 2-5208665662337742843
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.393.822 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 2-5208665662337742843
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:57.628.153 [mindspore/ccsrc/frontend/parallel/pass/pass_utils.cc:119] ExtractBackwardMatMul] backward_matmul_dx_dw_map size:0
[WARNING] PRE_ACT(2295439,7fba5c991740,python):2024-04-12-11:17:58.233.782 [mindspore/ccsrc/backend/common/pass/adjust_depend_for_parallel_optimizer_recompute_all_gather.cc:84] IncreaseAllgatherFusionId] Increase the duplicated allgather fusion id
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:59.181.962 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter momentum's parameter_shape in param_info is empty
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:59.181.980 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter learning_rate's parameter_shape in param_info is empty
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:59.421.854 [mindspore/ccsrc/frontend/parallel/pass/pass_utils.cc:119] ExtractBackwardMatMul] backward_matmul_dx_dw_map size:0
dfy=============================================================
{'cell.weight', 'moments.cell1.weight', 'moments.cell.weight', 'cell1.weight'}
{'moments.cell3.weight', 'moments.cell.weight', 'moments.cell1.weight', 'cell3.weight', 'cell.weight', 'moments.cell2.weight', 'cell1.weight', 'cell2.weight'}
F

=================================== FAILURES ===================================
___ test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2 ____

    @Author('d30029696')
    @Level1
    @Env_Cards('1x8')
    @Manual
    @AR('SR.c54211e5')
    @SKIP_ENV_CPU()
    @SKIP_MODE_PYNATIVE(reason='pynative only support data parallel')
    def test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2():
        rank_id = contextbase.get_parallel_variable_from_env("RANK_ID")
        # pipeline_stage=2, semi-auto parallel, 2x1 strategy; save ckpt and strategy files
        contextbase.set_parallel_context(device_num=8, pipeline_stages=2,
                                         parallel_mode="semi_auto_parallel",
                                         enable_parallel_optimizer=True,
                                         parallel_optimizer_config={'optimizer_weight_shard_size': 2,
                                                                    'parallel_optimizer_threshold': 0,
                                                                    'gradient_accumulation_shard': True},
                                         dataset_strategy='full_batch')
        device_num = context.get_auto_parallel_context("device_num")
        if rank_id == 0:
            context.set_auto_parallel_context(
                strategy_ckpt_config={"save_file": "./pipeline/strategy1/strategy_2x2x2_1.ckpt"})
        elif rank_id == 4:
            context.set_auto_parallel_context(
                strategy_ckpt_config={"save_file": "./pipeline/strategy1/strategy_2x2x2_2.ckpt"})
        net_p1 = PipelineNet(size=(32, 32), strategy=((2, 1), (2, 1)))
        fake_dataset = GeneratorFakeData(size=128, batch_size=64,
                                       image_size=(32,), num_classes=32)
        dataset_p1 = ds.GeneratorDataset(fake_dataset, ["data", "label"])
        # sink_mode=True triggers the error
        create_strategy_file(net_p1, dataset_p1, pipe_flag=True, micro_size=2, save_ckpt_flag=True,
                             ckpt_path=f"./pipeline/rank_{rank_id}", prefix="pipeline", sink_mode=True)

        # pipeline_stage=2, semi-auto parallel, 2x2 strategy; save only the strategy file
        context.reset_auto_parallel_context()
        contextbase.set_parallel_context(device_num=8, parallel_mode="semi_auto_parallel",
                                         strategy_ckpt_config={
                                             "save_file": "./strategy_2x2.ckpt"})
        net_p1 = PipelineNet(
            size=(32, 32), strategy=((2, 2), (2, 2)), epoch=0)
        fake_dataset = GeneratorFakeData(size=128, batch_size=4,
                                       image_size=(32,), num_classes=32)
        dataset_p1 = ds.GeneratorDataset(fake_dataset, ["data", "label"])
        create_strategy_file(
            net_p1, dataset_p1, ckpt_path=f"./change/rank_{rank_id}", prefix="pipeline", sink_mode=True)

        # transform the ckpt files, then compare the two inference results
        if rank_id == 0:
            transform_checkpoints(src_checkpoints_dir="./pipeline",
                                  dst_checkpoints_dir="./change",
                                  ckpt_prefix="pipeline_changed",
                                  src_strategy_file="./pipeline/strategy1/strategy_2x2x2_1.ckpt",
>                                 dst_strategy_file="./strategy_2x2.ckpt")

../pipeline_split/test_pipeline_checkpoint_change_2.py:244: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/checkpoint_transform.py:421: in transform_checkpoints
    src_strategy_file, dst_strategy_file)
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/checkpoint_transform.py:269: in _transform_checkpoint_by_stage
    param_type_dict)
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/_parallel_serialization.py:405: in _transform_parallel_checkpoint
    from_dev_matrix, from_tensor_map, from_opt_shard_step, from_opt_shard_size, origin_tensor_shape)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

dev_matrix = [2, 2], tensor_map = [1, -1], opt_shard_step = 1
opt_shard_size = 2, origin_full_tensor_shape = (32, 32)

    def _construct_tensor_layout_for_opt_shard(dev_matrix, tensor_map, opt_shard_step, opt_shard_size,
                                               origin_full_tensor_shape):
        """
        dev_mat = [4, 2, 2]
        tensor_map = [2, 1, 0]
        opt_size = 2
        =>
        dev_mat = [opt_size, 4, 2, 2] = [2, 4, 2, 2]
        tensor_map = [2, 3, 1, 0]
        thus new_strategy = [4, 2, 2, 2]
        the tensor_shape should reshape to (model_parallel_size, -1, xx, xx)
        first 4 means the model parallel sharding of data_dim
        second 2 means the opt sharding of data_dim
        And the model parallel sharding dim is the right of opt sharding dim, so it would be 0-1-2-3 model parallel sharding
        then 0-4 optimizer sharding.
        """

        if opt_shard_step == 0 or opt_shard_size == 0:
            return dev_matrix, tensor_map, list(origin_full_tensor_shape)
        tensor_strategy = _get_tensor_strategy(dev_matrix, tensor_map)
        model_parallel_shard_size = np.prod(tensor_strategy)
        if model_parallel_shard_size != opt_shard_step:
            raise ValueError("The optimizer sharding step {} is not equal to the model parallel sharding size {}.".
>                            format(opt_shard_step, model_parallel_shard_size))
E           ValueError: The optimizer sharding step 1 is not equal to the model parallel sharding size 2.

/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/_tensor.py:405: ValueError
============================== 1 failed in 37.74s ==============================
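For reference, the following minimal sketch (not the MindSpore source) walks through the consistency check that raises above, using the layout values printed in the traceback; the helper mirrors how _get_tensor_strategy is commonly assumed to map a tensor_map onto the device matrix.

import numpy as np

def tensor_strategy_from(dev_matrix, tensor_map):
    # Each tensor_map entry indexes the device matrix from the right;
    # -1 means that tensor dimension is not sharded by model parallelism.
    return [1 if m == -1 else dev_matrix[-m - 1] for m in tensor_map]

dev_matrix = [2, 2]       # device matrix of the source layout (from the traceback)
tensor_map = [1, -1]      # 2x1 sharding of the (32, 32) weight
opt_shard_step = 1        # optimizer-shard step recorded in the source strategy
opt_shard_size = 2        # optimizer_weight_shard_size = 2

tensor_strategy = tensor_strategy_from(dev_matrix, tensor_map)   # [2, 1]
model_parallel_shard_size = int(np.prod(tensor_strategy))        # 2

# 2 != 1, which is exactly the ValueError reported above:
# "The optimizer sharding step 1 is not equal to the model parallel sharding size 2."
print(model_parallel_shard_size != opt_shard_step)               # True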

Special notes for this issue/备注 (Optional / 选填)

Comments (4)

杜冯泱 created the Bug-Report
杜冯泱 added collaborator 杜冯泱
杜冯泱 added the v2.2.14 label

Please assign a maintainer to check this issue.
@杜冯泱

Thanks for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
     1. For PyNative (dynamic graph) issues, set set_context(pynative_synchronize=True) to see the error stack and help locate the problem.
     2. For model accuracy tuning, refer to the tuning guide on the official website.
  3. If you are reporting a framework bug, please make sure the issue provides the necessary information for locating it: the MindSpore version, the backend used (CPU, GPU, Ascend), the environment, an official link to the training code, and the launch command that reproduces the error.
  4. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.
杜冯泱 added the sig/parallel label
杜冯泱 added the device/ascend label
杜冯泱 added the v2.3.0 label
杜冯泱 changed the title
杜冯泱 removed the v2.3.0 label
杜冯泱 added the v2.3.0.rc2 label

Appearance & Root Cause

Test-case problem:

  1. In the non-pipeline scenario the network is configured with @lazy_inline, so the subgraph is not expanded and the parameters cannot be found when the strategy file is saved (a hedged illustration follows this list).
  2. When the configured sharding strategy does not fully occupy the devices, the repeated dimension is not configured, so the tensor layout does not match expectations.
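For item 1, a hedged illustration of what @lazy_inline on a cell looks like; the cell below is a stand-in, not the test's actual network. mindspore.lazy_inline keeps the decorated cell's subgraph folded for reuse, which is why parameters inside it may not be found when the strategy file is saved in the non-pipeline run.

from mindspore import lazy_inline, nn

class Block(nn.Cell):
    # @lazy_inline defers inlining of this cell's subgraph (cell reuse);
    # per the root-cause analysis above, in the non-pipeline run this left
    # the parameters unreachable when the strategy file was written.
    @lazy_inline
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(32, 32)

    def construct(self, x):
        return self.dense(x)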

Fix Solution

Adapt the test case.

kairui_kou added the rca/others label
kairui_kou added the rct/newfeature label
kairui_kou added the ctl/componenttest label
kairui_kou changed the task state from TODO to VALIDATION
kairui_kou added collaborator kairui_kou
kairui_kou changed the assignee from kairui_kou to 杜冯泱
kairui_kou removed collaborator 杜冯泱

1. When the sharding strategy does not fully occupy the devices, add_prim_attr("repeated_num_in_dev_matrix_right_", False) must be configured (a hedged sketch follows this comment).
2. Feature limitation: when the source strategy uses pipeline parallelism, the destination strategy cannot interact with the cell-sharing feature.
Not a bug; the test case needs to be adapted. Closing this issue.
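A hedged sketch of the adaptation described in point 1, assuming a MatMul-based cell similar to the test's PipelineNet (which is not shown in this issue); the only point is where add_prim_attr is applied when the 2x1 strategy leaves a repeated device dimension.

import mindspore.nn as nn
import mindspore.ops as ops

class MatMulCell(nn.Cell):
    def __init__(self, strategy=((2, 1), (2, 1))):
        super().__init__()
        self.matmul = ops.MatMul().shard(strategy)
        # Keep the repeated (unused) device dimension on the left of the device
        # matrix so the saved layout matches what transform_checkpoints expects,
        # per the maintainer's comment above.
        self.matmul.add_prim_attr("repeated_num_in_dev_matrix_right_", False)

    def construct(self, x, w):
        return self.matmul(x, w)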

杜冯泱 changed the task state from VALIDATION to DONE
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the v2.3.0 label
