[CT][MS][checkpoint_transform] Pipeline parallelism with optimizer sharding enabled and a 2x1 sharding strategy: checkpoint transformation reports an error

DONE
Bug-Report · Member
Created on 2024-04-12 17:16

name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior / 问题描述 (Mandatory / 必填)

With pipeline parallelism, optimizer sharding enabled, and a 2x1 sharding strategy, the checkpoint transformation step before inference reports an error.
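A minimal sketch of the configuration being described, in case it helps frame the setup (assumed API usage; the actual test code and the real PipelineNet appear in the traceback below):

import numpy as np
import mindspore as ms
from mindspore import nn, ops

# Pipeline parallelism with two stages, parallel optimizer (optimizer sharding)
# enabled, full-batch dataset strategy -- as described in this report.
ms.set_auto_parallel_context(device_num=8,
                             parallel_mode="semi_auto_parallel",
                             pipeline_stages=2,
                             enable_parallel_optimizer=True,
                             parallel_optimizer_config={"optimizer_weight_shard_size": 2},
                             dataset_strategy="full_batch")

class MatMulCell(nn.Cell):
    """Stand-in for the test's PipelineNet; only the 2x1 sharding strategy matters here."""
    def __init__(self):
        super().__init__()
        self.matmul = ops.MatMul().shard(((2, 1), (2, 1)))   # 2x1 strategy on both inputs
        self.weight = ms.Parameter(ms.Tensor(np.ones((32, 32)), ms.float32), "weight")

    def construct(self, x):
        return self.matmul(x, self.weight)

# Pipeline-stage assignment, training, checkpoint saving and the later
# transform_checkpoints call are omitted; see the test in the traceback below.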

Environment / 环境信息 (Mandatory / 必填)

  • Hardware Environment(Ascend/GPU/CPU) / 硬件环境:

/device ascend

  • Software Environment / 软件环境 (Mandatory / 必填):
    -- MindSpore version (e.g., 1.7.0.Bxxx) : 2.2.14.20240410 (from the log below)
    -- Python version (e.g., Python 3.7.5) : Python 3.7.5 (from the log below)
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): Linux
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode / 执行模式 (Mandatory / 必填)(PyNative/Graph):

/mode graph

Related testcase / 关联用例 (Mandatory / 必填)

test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2
test_semi_auto_parallel_single_stage_transform_checkpoints_8p_4x2x1_1x4

Steps to reproduce the issue / 重现步骤 (Mandatory / 必填)

  1. cd parallel
  2. bash test_feature_parallel.sh parallel/pipeline_split/test_pipeline_checkpoint_change_2.py test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2

Describe the expected behavior / 预期结果 (Mandatory / 必填)

Runs normally and inference succeeds.

Related log / screenshot / 日志 / 截图 (Mandatory / 必填)

============================= test session starts ==============================
platform linux -- Python 3.7.5, pytest-5.3.5, py-1.11.0, pluggy-0.13.1
rootdir: /home/dfy/MindSporeTest/parallel
plugins: repeat-0.9.1, typeguard-2.13.3, timeout-1.4.2
collecting ... [WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:33.717.479 [mindspore/run_check/_check_version.py:382] MindSpore version 2.2.14.20240410 and "hccl" wheel package version 7.2 does not match. For details, refer to the installation guidelines: https://www.mindspore.cn/install
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:33.717.581 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 3
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:34.718.644 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 2
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:35.719.744 [mindspore/run_check/_check_version.py:396] Please pay attention to the above warning, countdown: 1
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:39.587.591 [mindspore/common/api.py:808] 'mindspore.ms_function' will be deprecated and removed in a future version. Please use 'mindspore.jit' instead.
collecting 1 item
collected 1 item

../pipeline_split/test_pipeline_checkpoint_change_2.py [WARNING] HCCL_ADPT(2295439,7fba5c991740,python):2024-04-12-11:17:40.089.776 [mindspore/ccsrc/plugin/device/ascend/hal/hccl_adapter/hccl_adapter.cc:63] GenHcclOptions] The environment variable DEPLOY_MODE is not set. Now set to default value 0
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:40.093.803 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: hccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:40.096.044 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: hccl_world_group
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:43.338.937 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: hccl_world_group
[WARNING] ME(2295439:140438394181440,MainProcess):2024-04-12-11:17:43.343.683 [mindspore/common/parameter.py:786] This interface may be deleted in the future.
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:54.384.353 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter momentum's parameter_shape in param_info is empty
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:54.384.398 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter learning_rate's parameter_shape in param_info is empty
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:54.618.716 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 2-5004544844489628105 [const vector]{0, 1}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.013.907 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 2-5004544844489628105
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.262.942 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 2-5004544844489628105
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:55.270.208 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 4-6301172352641561019 [const vector]{0, 1, 2, 3}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.414.971 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 4-6301172352641561019
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.689.845 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 4-6301172352641561019
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.693.430 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 1-2297668033614959926 [const vector]{0}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:56.695.039 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 1-2297668033614959926
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.149.977 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 1-2297668033614959926
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:57.153.123 [mindspore/ccsrc/frontend/parallel/step_parallel.cc:2340] GetSensLossPairs] Can not find the loss cnode
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.153.627 [mindspore/ccsrc/distributed/collective/collective_manager.cc:243] CreateCommunicationGroup] Start to create communication group: 2-5208665662337742843 [const vector]{0, 2}
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.155.210 [mindspore/ccsrc/distributed/collective/collective_manager.cc:301] CreateCommunicationGroup] Begin initialize communication group on the device side: 2-5208665662337742843
[WARNING] DISTRIBUTED(2295439,7fba5c991740,python):2024-04-12-11:17:57.393.822 [mindspore/ccsrc/distributed/collective/collective_manager.cc:310] CreateCommunicationGroup] End initialize communication group on the device side: 2-5208665662337742843
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:57.628.153 [mindspore/ccsrc/frontend/parallel/pass/pass_utils.cc:119] ExtractBackwardMatMul] backward_matmul_dx_dw_map size:0
[WARNING] PRE_ACT(2295439,7fba5c991740,python):2024-04-12-11:17:58.233.782 [mindspore/ccsrc/backend/common/pass/adjust_depend_for_parallel_optimizer_recompute_all_gather.cc:84] IncreaseAllgatherFusionId] Increase the duplicated allgather fusion id
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:59.181.962 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter momentum's parameter_shape in param_info is empty
[WARNING] UTILS(2295439,7fba5c991740,python):2024-04-12-11:17:59.181.980 [mindspore/ccsrc/utils/parallel_context.cc:278] ParallelParameterContextRestoreShape] The parameter learning_rate's parameter_shape in param_info is empty
[WARNING] PARALLEL(2295439,7fba5c991740,python):2024-04-12-11:17:59.421.854 [mindspore/ccsrc/frontend/parallel/pass/pass_utils.cc:119] ExtractBackwardMatMul] backward_matmul_dx_dw_map size:0
dfy=============================================================
{'cell.weight', 'moments.cell1.weight', 'moments.cell.weight', 'cell1.weight'}
{'moments.cell3.weight', 'moments.cell.weight', 'moments.cell1.weight', 'cell3.weight', 'cell.weight', 'moments.cell2.weight', 'cell1.weight', 'cell2.weight'}
F

=================================== FAILURES ===================================
___ test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2 ____

    @Author('d30029696')
    @Level1
    @Env_Cards('1x8')
    @Manual
    @AR('SR.c54211e5')
    @SKIP_ENV_CPU()
    @SKIP_MODE_PYNATIVE(reason='pynative only support data parallel')
    def test_semi_auto_parallel_single_stage_transform_checkpoints_8p_2x2x1_2x2():
        rank_id = contextbase.get_parallel_variable_from_env("RANK_ID")
        # pipeline_stage=2, semi-auto parallel, 2x1 strategy; save ckpt and strategy files
        contextbase.set_parallel_context(device_num=8, pipeline_stages=2,
                                         parallel_mode="semi_auto_parallel",
                                         enable_parallel_optimizer=True,
                                         parallel_optimizer_config={'optimizer_weight_shard_size': 2,
                                                                    'parallel_optimizer_threshold': 0,
                                                                    'gradient_accumulation_shard': True},
                                         dataset_strategy='full_batch')
        device_num = context.get_auto_parallel_context("device_num")
        if rank_id == 0:
            context.set_auto_parallel_context(
                strategy_ckpt_config={"save_file": "./pipeline/strategy1/strategy_2x2x2_1.ckpt"})
        elif rank_id == 4:
            context.set_auto_parallel_context(
                strategy_ckpt_config={"save_file": "./pipeline/strategy1/strategy_2x2x2_2.ckpt"})
        net_p1 = PipelineNet(size=(32, 32), strategy=((2, 1), (2, 1)))
        fake_dataset = GeneratorFakeData(size=128, batch_size=64,
                                       image_size=(32,), num_classes=32)
        dataset_p1 = ds.GeneratorDataset(fake_dataset, ["data", "label"])
        # sink_mode=True triggers the error
        create_strategy_file(net_p1, dataset_p1, pipe_flag=True, micro_size=2, save_ckpt_flag=True,
                             ckpt_path=f"./pipeline/rank_{rank_id}", prefix="pipeline", sink_mode=True)

        # pipeline_stage=2, semi-auto parallel, 2x2 strategy; save only the strategy file
        context.reset_auto_parallel_context()
        contextbase.set_parallel_context(device_num=8, parallel_mode="semi_auto_parallel",
                                         strategy_ckpt_config={
                                             "save_file": "./strategy_2x2.ckpt"})
        net_p1 = PipelineNet(
            size=(32, 32), strategy=((2, 2), (2, 2)), epoch=0)
        fake_dataset = GeneratorFakeData(size=128, batch_size=4,
                                       image_size=(32,), num_classes=32)
        dataset_p1 = ds.GeneratorDataset(fake_dataset, ["data", "label"])
        create_strategy_file(
            net_p1, dataset_p1, ckpt_path=f"./change/rank_{rank_id}", prefix="pipeline", sink_mode=True)

        # transform the ckpt files, then compare the two inference results
        if rank_id == 0:
            transform_checkpoints(src_checkpoints_dir="./pipeline",
                                  dst_checkpoints_dir="./change",
                                  ckpt_prefix="pipeline_changed",
                                  src_strategy_file="./pipeline/strategy1/strategy_2x2x2_1.ckpt",
>                                 dst_strategy_file="./strategy_2x2.ckpt")

../pipeline_split/test_pipeline_checkpoint_change_2.py:244: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/checkpoint_transform.py:421: in transform_checkpoints
    src_strategy_file, dst_strategy_file)
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/checkpoint_transform.py:269: in _transform_checkpoint_by_stage
    param_type_dict)
/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/_parallel_serialization.py:405: in _transform_parallel_checkpoint
    from_dev_matrix, from_tensor_map, from_opt_shard_step, from_opt_shard_size, origin_tensor_shape)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

dev_matrix = [2, 2], tensor_map = [1, -1], opt_shard_step = 1
opt_shard_size = 2, origin_full_tensor_shape = (32, 32)

    def _construct_tensor_layout_for_opt_shard(dev_matrix, tensor_map, opt_shard_step, opt_shard_size,
                                               origin_full_tensor_shape):
        """
        dev_mat = [4, 2, 2]
        tensor_map = [2, 1, 0]
        opt_size = 2
        =>
        dev_mat = [opt_size, 4, 2, 2] = [2, 4, 2, 2]
        tensor_map = [2, 3, 1, 0]
        thus new_strategy = [4, 2, 2, 2]
        the tensor_shape should reshape to (model_parallel_size, -1, xx, xx)
        first 4 means the model parallel sharding of data_dim
        second 2 means the opt sharding of data_dim
        And the model parallel sharding dim is the right of opt sharding dim, so it would be 0-1-2-3 model parallel sharding
        then 0-4 optimizer sharding.
        """

        if opt_shard_step == 0 or opt_shard_size == 0:
            return dev_matrix, tensor_map, list(origin_full_tensor_shape)
        tensor_strategy = _get_tensor_strategy(dev_matrix, tensor_map)
        model_parallel_shard_size = np.prod(tensor_strategy)
        if model_parallel_shard_size != opt_shard_step:
            raise ValueError("The optimizer sharding step {} is not equal to the model parallel sharding size {}.".
>                            format(opt_shard_step, model_parallel_shard_size))
E           ValueError: The optimizer sharding step 1 is not equal to the model parallel sharding size 2.

/root/miniconda3/envs/dfy/lib/python3.7/site-packages/mindspore/parallel/_tensor.py:405: ValueError
============================== 1 failed in 37.74s ==============================
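For reference, the following minimal sketch (not the MindSpore source) walks through the consistency check that raises above, using the layout values printed in the traceback; the helper mirrors how _get_tensor_strategy is commonly assumed to map a tensor_map onto the device matrix.

import numpy as np

def tensor_strategy_from(dev_matrix, tensor_map):
    # Each tensor_map entry indexes the device matrix from the right;
    # -1 means that tensor dimension is not sharded by model parallelism.
    return [1 if m == -1 else dev_matrix[-m - 1] for m in tensor_map]

dev_matrix = [2, 2]       # device matrix of the source layout (from the traceback)
tensor_map = [1, -1]      # 2x1 sharding of the (32, 32) weight
opt_shard_step = 1        # optimizer-shard step recorded in the source strategy
opt_shard_size = 2        # optimizer_weight_shard_size = 2

tensor_strategy = tensor_strategy_from(dev_matrix, tensor_map)   # [2, 1]
model_parallel_shard_size = int(np.prod(tensor_strategy))        # 2

# 2 != 1, which is exactly the ValueError reported above:
# "The optimizer sharding step 1 is not equal to the model parallel sharding size 2."
print(model_parallel_shard_size != opt_shard_step)               # True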

Special notes for this issue/备注 (Optional / 选填)

Comments (4)

杜冯泱 created the Bug-Report
杜冯泱 added collaborator 杜冯泱
杜冯泱 added the v2.2.14 label

Please assign a maintainer to check this issue.
@杜冯泱

Thanks for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
     1. For PyNative (dynamic graph) issues, set set_context(pynative_synchronize=True) to see the error stack and help locate the problem.
     2. For model accuracy tuning, refer to the tuning guide on the official website.
  3. If you are reporting a framework bug, please make sure the issue provides the necessary information for locating it: the MindSpore version, the backend used (CPU, GPU, Ascend), the environment, an official link to the training code, and the launch command that reproduces the error.
  4. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.
杜冯泱 added the sig/parallel label
杜冯泱 added the device/ascend label
杜冯泱 added the v2.3.0 label
杜冯泱 changed the title
杜冯泱 removed the v2.3.0 label
杜冯泱 added the v2.3.0.rc2 label

Appearance & Root Cause

Test-case problem:

  1. In the non-pipeline scenario the network is configured with @lazy_inline, so the subgraph is not expanded and the parameters cannot be found when the strategy file is saved (a hedged illustration follows this list).
  2. When the configured sharding strategy does not fully occupy the devices, the repeated dimension is not configured, so the tensor layout does not match expectations.
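For item 1, a hedged illustration of what @lazy_inline on a cell looks like; the cell below is a stand-in, not the test's actual network. mindspore.lazy_inline keeps the decorated cell's subgraph folded for reuse, which is why parameters inside it may not be found when the strategy file is saved in the non-pipeline run.

from mindspore import lazy_inline, nn

class Block(nn.Cell):
    # @lazy_inline defers inlining of this cell's subgraph (cell reuse);
    # per the root-cause analysis above, in the non-pipeline run this left
    # the parameters unreachable when the strategy file was written.
    @lazy_inline
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(32, 32)

    def construct(self, x):
        return self.dense(x)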

Fix Solution

Adapt the test case.

kairui_kou added the rca/others label
kairui_kou added the rct/newfeature label
kairui_kou added the ctl/componenttest label
kairui_kou changed the task state from TODO to VALIDATION
kairui_kou added collaborator kairui_kou
kairui_kou changed the assignee from kairui_kou to 杜冯泱
kairui_kou removed collaborator 杜冯泱

1. When the sharding strategy does not fully occupy the devices, add_prim_attr("repeated_num_in_dev_matrix_right_", False) must be configured (a hedged sketch follows this comment).
2. Feature limitation: when the source strategy uses pipeline parallelism, the destination strategy cannot interact with the cell-sharing feature.
Not a bug; the test case needs to be adapted. Closing this issue.
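A hedged sketch of the adaptation described in point 1, assuming a MatMul-based cell similar to the test's PipelineNet (which is not shown in this issue); the only point is where add_prim_attr is applied when the 2x1 strategy leaves a repeated device dimension.

import mindspore.nn as nn
import mindspore.ops as ops

class MatMulCell(nn.Cell):
    def __init__(self, strategy=((2, 1), (2, 1))):
        super().__init__()
        self.matmul = ops.MatMul().shard(strategy)
        # Keep the repeated (unused) device dimension on the left of the device
        # matrix so the saved layout matches what transform_checkpoints expects,
        # per the maintainer's comment above.
        self.matmul.add_prim_attr("repeated_num_in_dev_matrix_right_", False)

    def construct(self, x, w):
        return self.matmul(x, w)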

杜冯泱 changed the task state from VALIDATION to DONE
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the v2.3.0 label
