spec-500-502-ref performance issue summary && Huawei follow-up optimization plan

Status: To do
Created: 2021-08-16 10:38

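500 performance issue summary

(1) The hottest function in 500 is S_regmatch. The maple-compiled version of this function is more than 20% slower than the gcc version. Replacing it with the gcc version improves overall performance by 3.3% to 5.3% (see the table below).

User time after replacing the hottest function, S_regmatch, with the gcc version (in seconds):

[Note] With gcc, S_regmatch is a called_once callee, so at O2 it is inlined by default and the standalone function is removed. To keep the symbol, inlining was disabled with gcc -fno-inline, and the gcc version of S_regmatch was then extracted.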

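| 500 ref | gcc | maple | maple with gcc S_regmatch | Overall improvement after replacement |
| --- | --- | --- | --- | --- |
| case 1 | 218.499 | 280.514 | 269.002 | 5.3% |
| case 2 | 218.499 | 178.329 | 170.264 | 3.7% |
| case 3 | 218.499 | 204.044 | 196.889 | 3.3% |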
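
The improvement column appears to be computed relative to the gcc time, that is (maple - maple with gcc S_regmatch) / gcc. This is an inference from the numbers, not something the issue states. A minimal Python sketch that reproduces the column under that assumption (the name `improvement_pct` is illustrative only):

```python
# User times in seconds, copied from the table above:
# (gcc, maple, maple with gcc S_regmatch).
USER_TIME = {
    "case 1": (218.499, 280.514, 269.002),
    "case 2": (218.499, 178.329, 170.264),
    "case 3": (218.499, 204.044, 196.889),
}


def improvement_pct(gcc, maple, replaced):
    """Time saved by the replacement as a percentage of the gcc time.

    Dividing by the gcc time is an assumption inferred from the table.
    """
    return (maple - replaced) / gcc * 100.0


for case, times in USER_TIME.items():
    print(f"{case}: {improvement_pct(*times):.1f}%")
```

Dividing by the maple time instead would give smaller percentages for case 1 and larger ones for cases 2 and 3, which do not match the table.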

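(2) The S_regmatch performance problem is concentrated in register allocation (RA). Taking 500 ref case 1 as an example, the table below compares gcc S_regmatch and maple S_regmatch on total cycles samples, allocated stack frame size, and cycles samples attributed to stack operations. maple spills more variables to the stack.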

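| 500 ref case 1 | gcc S_regmatch | maple S_regmatch | Gap |
| --- | --- | --- | --- |
| cycles samples | 294063 | 415720 | 41.4% |
| Allocated stack frame size (bytes) | 464 | 1232 | 165.5% |
| stack ldr/str samples | 109701 | 195745 | 78.4% |
| Share of stack operations | 37.3% | 47.1% | 26.2% |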
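
As a cross-check on the table above, the following Python sketch recomputes the gap column and the stack-operation share from the raw counts. The helper names (`gap_pct`, `stack_share`) are illustrative and not part of any tool; the formulas are inferred from the table.

```python
# Measurements for 500 ref case 1, copied from the table above.
GCC = {"cycles": 294063, "frame_bytes": 464, "stack_ldst": 109701}
MAPLE = {"cycles": 415720, "frame_bytes": 1232, "stack_ldst": 195745}


def gap_pct(gcc_value, maple_value):
    """Relative excess of the maple value over the gcc value, in percent."""
    return (maple_value - gcc_value) / gcc_value * 100.0


def stack_share(row):
    """Fraction of a function's cycles samples that fall on stack ldr/str."""
    return row["stack_ldst"] / row["cycles"]


for key in GCC:
    print(f"{key}: {gap_pct(GCC[key], MAPLE[key]):.1f}%")

print(f"stack share: gcc {stack_share(GCC):.1%}, maple {stack_share(MAPLE):.1%}")

# Share of maple's extra cycles samples that are extra stack ldr/str samples.
extra_cycles = MAPLE["cycles"] - GCC["cycles"]
extra_stack = MAPLE["stack_ldst"] - GCC["stack_ldst"]
print(f"extra samples on stack ldr/str: {extra_stack / extra_cycles:.1%}")
```

By this arithmetic, about 71% of maple's extra cycles samples in S_regmatch land on stack loads and stores, which is consistent with attributing most of the gap to spilling.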

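(3) Store merging issue; see the linked issue for details.

The details are as follows:

500 ref hot functions

Columns 1 and 2 are the cpu-cycles percentage and the cache-misses percentage.

Columns 3 and 4 are the cpu-cycles samples and the cache-misses samples.

Column 5 is the name of the executable and column 6 is the symbol name.

Only symbols with a cycles percentage greater than 0.5% are shown.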
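
The six-column layout described above can be parsed mechanically. The sketch below is a hypothetical helper (the regular expression, `parse_rows`, and the field names are illustrative, not part of perf) that assumes rows look exactly like the ones pasted in this issue and applies the same 0.5% cutoff:

```python
import re

# One report row: cycles %, cache-misses %, cycles samples,
# cache-misses samples, command, symbol. The leading "+" is optional.
ROW = re.compile(
    r"^\+?\s*(?P<cyc_pct>[\d.]+)%\s+(?P<miss_pct>[\d.]+)%\s+"
    r"(?P<cyc_n>\d+)\s+(?P<miss_n>\d+)\s+(?P<cmd>\S+)\s+\[\.\]\s+(?P<sym>\S+)$"
)


def parse_rows(text, min_cycles_pct=0.5):
    """Return rows whose cpu-cycles percentage exceeds min_cycles_pct."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # header, separator, or blank line
        row = {
            "cycles_pct": float(m["cyc_pct"]),
            "cache_miss_pct": float(m["miss_pct"]),
            "cycles_samples": int(m["cyc_n"]),
            "cache_miss_samples": int(m["miss_n"]),
            "command": m["cmd"],
            "symbol": m["sym"],
        }
        if row["cycles_pct"] > min_cycles_pct:
            rows.append(row)
    return rows


# Three rows copied from the 500 ref case 1 report.
SAMPLE = """\
+   37.26%   5.62%        415720       43157  perlbench_r  [.] S_regmatch
+   11.14%   1.26%        124221       19733  perlbench_r  [.] S_find_byclass
+    0.53%   0.30%          5939        1698  perlbench_r  [.] S_regrepeat
"""

for row in parse_rows(SAMPLE):
    print(row["symbol"], row["cycles_pct"], row["cycles_samples"])
```

Non-matching lines such as the `Samples:` header and the `#####` separators are skipped rather than treated as errors.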

500 ref case 1

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib checkspam.pl 2500 5 25 11 150 1 1 1 1

perf report:

Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 717386962088, DSO: perlbench_r
          Overhead                   Samples  Command      Symbol
+   37.26%   5.62%        415720       43157  perlbench_r  [.] S_regmatch
##### gcc S_regmatch #####
+   32.77%   4.71%        294063       22615  perlbench_r  [.] S_regmatch
##########################
+   11.14%   1.26%        124221       19733  perlbench_r  [.] S_find_byclass
+    9.54%   3.13%        106545       28123  perlbench_r  [.] Perl_leave_scope
+    4.91%   0.77%         54747        9168  perlbench_r  [.] S_regtry
+    2.84%   8.35%         31996       46504  perlbench_r  [.] Perl_hv_common
+    1.71%   0.55%         19115        4868  perlbench_r  [.] Perl_save_pushptr
+    1.53%   3.04%         17275       15945  perlbench_r  [.] Perl_sv_setsv_flags
+    1.50%   3.30%         16938       18312  perlbench_r  [.] Perl_pp_nextstate
+    1.45%   2.56%         16268       14864  perlbench_r  [.] Perl_regexec_flags
+    1.42%   5.73%         16006       33863  perlbench_r  [.] Perl_pp_entersub
+    1.35%   1.52%         15101       10375  perlbench_r  [.] Perl_fbm_instr
+    1.23%   4.35%         13812       23788  perlbench_r  [.] Perl_pp_multideref
+    1.12%   2.79%         12621       15979  perlbench_r  [.] Perl_pp_match
+    1.01%   2.35%         11312       13897  perlbench_r  [.] Perl_pp_and
+    0.95%   0.09%         10548         758  perlbench_r  [.] Perl_ckwarn
+    0.73%   2.67%          8243       13344  perlbench_r  [.] Perl_pp_padsv
+    0.60%   1.62%          6819        7586  perlbench_r  [.] Perl_sv_clear
+    0.57%   1.40%          6389        7741  perlbench_r  [.] Perl_re_intuit_start
+    0.55%   0.44%          6146        2897  perlbench_r  [.] Perl_pp_iter
+    0.53%   1.23%          6001        6175  perlbench_r  [.] Perl_sv_upgrade
+    0.53%   0.30%          5939        1698  perlbench_r  [.] S_regrepeat

500 ref case 2

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib diffmail.pl 4 800 10 17 19 300

perf report:

Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 445421299381, DSO: perlbench_r
          Overhead                   Samples  Command      Symbol
+   18.78%   5.74%        130396       40373  perlbench_r  [.] S_regmatch
##### replaced with gcc S_regmatch #####
+   15.62%   5.29%        103537       35774  perlbench_r  [.] S_regmatch
################################
+    6.04%   5.41%         41928       38142  perlbench_r  [.] Perl_pp_padsv
+    4.14%   0.53%         28786        3703  perlbench_r  [.] Perl_pp_substr
+    4.05%   1.62%         28165       11492  perlbench_r  [.] Perl_leave_scope
+    3.66%   6.10%         25454       42784  perlbench_r  [.] Perl_sv_setsv_flags
+    3.23%   1.09%         22426        7609  perlbench_r  [.] S_regrepeat
+    3.12%   3.71%         21652       26126  perlbench_r  [.] Perl_pp_nextstate
+    3.01%   2.89%         20909       20195  perlbench_r  [.] Perl_pp_and
+    2.18%   1.35%         15164        9522  perlbench_r  [.] Perl_pp_enter
+    2.14%   1.08%         14889        8040  perlbench_r  [.] Perl_sv_setpvn
+    2.13%   5.91%         14812       41528  perlbench_r  [.] Perl_hv_common
+    2.12%   0.71%         14752        4978  perlbench_r  [.] Perl_pp_leave
+    1.80%   0.37%         12513        3988  perlbench_r  [.] Perl_pp_preinc
+    1.66%   1.25%         11526        8877  perlbench_r  [.] Perl_runops_standard
+    1.64%   1.43%         11430        9884  perlbench_r  [.] Perl_sv_upgrade
+    1.58%   0.11%         10948         763  perlbench_r  [.] Perl_pp_ord
+    1.55%   3.73%         10796       26195  perlbench_r  [.] Perl_pp_multideref
+    1.48%   0.29%         10300        2018  perlbench_r  [.] S_setup_EXACTISH_ST_c1_c2
+    1.39%   1.12%          9623        7917  perlbench_r  [.] Perl_pp_const
+    1.33%   0.10%          9247         685  perlbench_r  [.] Perl_translate_substr_offsets
+    1.20%   0.16%          8345        1062  perlbench_r  [.] Perl_pp_eq
+    1.20%   1.39%          8319        8779  perlbench_r  [.] Perl_sv_clear
+    1.11%   0.34%          7739        2365  perlbench_r  [.] Perl_pp_lt
+    1.10%   0.11%          7662         782  perlbench_r  [.] Perl_pp_unstack
+    1.09%   2.93%          7561       20829  perlbench_r  [.] Perl_regexec_flags
+    1.05%   0.70%          7316        4828  perlbench_r  [.] Perl_sv_setiv
+    0.96%   2.56%          6635       17999  perlbench_r  [.] Perl_pp_sassign
+    0.95%   2.83%          6598       20339  perlbench_r  [.] Perl_pp_entersub
+    0.94%   0.29%          6532        2083  perlbench_r  [.] Perl_pp_rv2sv
+    0.85%   1.06%          5901        7537  perlbench_r  [.] S_find_byclass
+    0.67%   0.52%          4642        3636  perlbench_r  [.] Perl_save_strlen
+    0.67%   0.83%          4637        5924  perlbench_r  [.] Perl_push_scope
+    0.59%   0.25%          4079        1780  perlbench_r  [.] S_glob_assign_glob
+    0.58%   1.08%          4031        7654  perlbench_r  [.] Perl_pp_match
+    0.56%   0.91%          3906        6196  perlbench_r  [.] Perl_pp_aassign
+    0.56%   0.31%          3899        2239  perlbench_r  [.] Perl_pop_scope
     0.51%   0.19%          3550        1342  perlbench_r  [.] S_share_hek_flags

500 ref case 3

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib splitmail.pl 6400 12 26 16 100 0

perf report:

Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 525004111325, DSO: perlbench_r
          Overhead                   Samples  Command      Symbol
+   16.62%   1.46%        136048        9904  perlbench_r  [.] S_regmatch
##### replaced with gcc S_regmatch #####
+   14.74%   0.91%        116766        6159  perlbench_r  [.] S_regmatch
################################
+   11.01%   8.10%         90158       41508  perlbench_r  [.] Perl_hv_common
+   10.37%   8.89%         84833       72495  perlbench_r  [.] Perl_hv_common
+    9.56%   2.87%         78186       21477  perlbench_r  [.] Perl_pp_multideref
+    7.47%  12.85%         61105      103171  perlbench_r  [.] Perl_regexec_flags
+    4.99%   4.99%         40880       35093  perlbench_r  [.] Perl_leave_scope
+    3.42%   0.32%         27961        2702  perlbench_r  [.] Perl_pp_unstack
+    3.33%   2.24%         27210       18355  perlbench_r  [.] Perl_av_fetch
+    3.13%   0.81%         25635        4816  perlbench_r  [.] Perl_pp_gvsv
+    2.36%   1.30%         19343       11693  perlbench_r  [.] Perl_pp_or
+    1.98%   2.22%         16174       18391  perlbench_r  [.] S_regtry
+    1.93%   0.23%         15778        1870  perlbench_r  [.] Perl_pp_preinc
+    1.89%   9.10%         15470       74589  perlbench_r  [.] S_cleanup_regmatch_info_aux
+    1.76%   6.79%         14509       19775  perlbench_r  [.] Perl_sv_cmp_flags
+    1.28%   1.63%         10469       13453  perlbench_r  [.] Perl_vivify_ref
+    0.99%   1.46%          8119        7799  perlbench_r  [.] Perl_pp_nextstate
+    0.97%   0.01%          7895         103  perlbench_r  [.] Perl_pp_stub
+    0.93%   1.88%          7707       11304  perlbench_r  [.] Perl_sv_setsv_flags
+    0.83%   0.46%          6797        2361  perlbench_r  [.] Perl_runops_standard
+    0.74%   1.23%          6085        5575  perlbench_r  [.] Perl_pp_padsv
+    0.73%   1.87%          6002       15451  perlbench_r  [.] Perl_save_destructor_x
+    0.65%   0.08%          5355         729  perlbench_r  [.] Perl_ckwarn
+    0.61%   1.18%          5022        8563  perlbench_r  [.] Perl_sv_eq_flags
+    0.61%   1.29%          5009        5938  perlbench_r  [.] Perl_hv_iternext_flags
+    0.60%   2.03%          4946        9101  perlbench_r  [.] Perl_newSVhek
+    0.56%   0.21%          4629        1670  perlbench_r  [.] Perl_pp_iter
+    0.56%   0.74%          4561        3200  perlbench_r  [.] Perl_pp_and
+    0.52%   1.43%          4277        4404  perlbench_r  [.] Perl_sortsv_flags
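
The three 500 ref reports above each list S_regmatch twice: once for the maple build and once with the gcc version swapped in. The sketch below compares the cpu-cycles sample counts of the two. It assumes sample counts from separate runs are a fair proxy for time spent in the function, which is only approximately true; `extra_samples_pct` is an illustrative name.

```python
# cpu-cycles samples for S_regmatch, copied from the reports above:
# (maple build, build with gcc S_regmatch swapped in).
S_REGMATCH_SAMPLES = {
    "case 1": (415720, 294063),
    "case 2": (130396, 103537),
    "case 3": (136048, 116766),
}


def extra_samples_pct(maple, gcc):
    """Extra cycles samples in the maple version, relative to the gcc version."""
    return (maple - gcc) / gcc * 100.0


for case, (maple, gcc) in S_REGMATCH_SAMPLES.items():
    print(f"{case}: maple S_regmatch has {extra_samples_pct(maple, gcc):.1f}% more samples")
```

On these counts the maple version takes roughly 41%, 26% and 17% more samples in cases 1, 2 and 3 respectively.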

502 hot functions

502 ref case 1

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-pp.c -O3 -finline-limit=0 -fif-conversion -fif-conversion2 -o gcc-pp.opts-O3_-finline-limit_0_-fif-conversion_-fif-conversion2.s

perf report:

Samples: 625K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 215393196829, DSO: cpugcc_r
          Overhead                   Samples  Command   Symbol
+    2.12%   2.20%          7125        4695  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+    1.86%   2.17%          6302        6310  cpugcc_r  [.] bitmap_set_bit
+    1.42%   1.47%          4806        4281  cpugcc_r  [.] bitmap_bit_p
+    1.42%   2.13%          4795        5699  cpugcc_r  [.] df_note_compute
+    1.34%   2.04%          4526        4327  cpugcc_r  [.] bitmap_ior_into
+    1.01%   2.94%          3408        6348  cpugcc_r  [.] compute_transp
+    0.92%   1.42%          3102        2952  cpugcc_r  [.] bitmap_ior_and_compl
+    0.88%   1.97%          2972        4091  cpugcc_r  [.] bitmap_and
+    0.87%   0.50%          2950        1675  cpugcc_r  [.] htab_find_slot_with_hash
+    0.85%   0.05%          2855         253  cpugcc_r  [.] record_reg_classes
+    0.72%   2.09%          2421        3721  cpugcc_r  [.] bitmap_and_into
+    0.71%   0.08%          2389         337  cpugcc_r  [.] find_reloads
+    0.71%   0.23%          2380         923  cpugcc_r  [.] sorted_array_from_bitmap_set
+    0.70%   0.61%          2421        2223  cpugcc_r  [.] ggc_alloc_stat
+    0.69%   0.24%          2326         838  cpugcc_r  [.] constrain_operands
+    0.67%   1.16%          2265        2783  cpugcc_r  [.] bitmap_copy
+    0.67%   0.97%          2254        2725  cpugcc_r  [.] fast_dce
+    0.66%   0.95%          2211        2334  cpugcc_r  [.] find_reg_note
+    0.64%   0.90%          2154        2723  cpugcc_r  [.] bitmap_clear_bit
+    0.61%   0.60%          2074        1909  cpugcc_r  [.] pool_alloc
+    0.59%   0.46%          2005        1596  cpugcc_r  [.] extract_insn
+    0.59%   0.49%          1998        1317  cpugcc_r  [.] regstat_bb_compute_ri
+    0.59%   0.87%          1984        1770  cpugcc_r  [.] inverted_post_order_compute
+    0.58%   0.74%          1953        1519  cpugcc_r  [.] bitmap_elt_insert_after
+    0.54%   0.89%          1835        2505  cpugcc_r  [.] df_lr_bb_local_compute
+    0.54%   0.34%          1822        1286  cpugcc_r  [.] df_ref_create_structure

502 ref case 2

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-pp.c -O2 -finline-limit=36000 -fpic -o gcc-pp.opts-O2_-finline-limit_36000_-fpic.s

perf report:

Samples: 720K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 242373601238, DSO: cpugcc_r
          Overhead                   Samples  Command   Symbol
+    4.26%   7.34%         16053       19824  cpugcc_r  [.] compute_transp
+    2.09%   1.92%          7895        5406  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+    1.74%   1.99%          6616        7195  cpugcc_r  [.] bitmap_set_bit
+    1.61%   2.17%          6074        6043  cpugcc_r  [.] bitmap_ior_into
+    1.29%   1.79%          4884        6110  cpugcc_r  [.] df_note_compute
+    1.28%   1.25%          4864        4440  cpugcc_r  [.] bitmap_bit_p
+    1.04%   1.96%          3950        5176  cpugcc_r  [.] bitmap_and
+    1.04%   1.07%          3903        4045  cpugcc_r  [.] ix86_delegitimize_address
+    1.01%   1.47%          3837        4040  cpugcc_r  [.] bitmap_ior_and_compl
+    0.95%   2.22%          3603        4768  cpugcc_r  [.] bitmap_and_into
+    0.91%   0.07%          3453         426  cpugcc_r  [.] find_reloads
+    0.85%   0.96%          3189        3656  cpugcc_r  [.] delegitimize_mem_from_attrs
+    0.81%   0.86%          3049        2337  cpugcc_r  [.] bitmap_elt_insert_after
+    0.75%   0.04%          2861         245  cpugcc_r  [.] record_reg_classes
+    0.75%   0.36%          2862        1539  cpugcc_r  [.] htab_find_slot_with_hash
+    0.71%   0.54%          2733        2430  cpugcc_r  [.] ggc_alloc_stat
+    0.70%   1.06%          2646        3314  cpugcc_r  [.] bitmap_copy
+    0.67%   0.17%          2562         844  cpugcc_r  [.] constrain_operands
+    0.66%   0.83%          2495        2553  cpugcc_r  [.] find_reg_note
+    0.66%   0.64%          2468        2646  cpugcc_r  [.] find_base_term
+    0.62%   0.13%          2340         487  cpugcc_r  [.] rtx_equal_for_memref_p
+    0.62%   0.41%          2349        1888  cpugcc_r  [.] extract_insn
+    0.61%   0.06%          2305         289  cpugcc_r  [.] ao_ref_from_mem
+    0.59%   2.02%          2235        1944  cpugcc_r  [.] pre_expr_reaches_here_p_work
+    0.59%   0.42%          2229        1514  cpugcc_r  [.] regstat_bb_compute_ri
+    0.58%   0.76%          2218        2722  cpugcc_r  [.] fast_dce
+    0.57%   0.48%          2187        1911  cpugcc_r  [.] pool_alloc
+    0.57%   0.70%          2145        1848  cpugcc_r  [.] inverted_post_order_compute
+    0.56%   0.09%          2118         360  cpugcc_r  [.] get_ref_base_and_extent
+    0.56%   0.79%          2129        2804  cpugcc_r  [.] df_lr_bb_local_compute
+    0.56%   0.18%          2106         665  cpugcc_r  [.] memrefs_conflict_p
+    0.55%   0.69%          2091        2638  cpugcc_r  [.] bitmap_clear_bit
     0.51%   0.31%          1941        1832  cpugcc_r  [.] reload
     0.51%   0.09%          1951         413  cpugcc_r  [.] ix86_decompose_address
     0.51%   0.04%          1923         167  cpugcc_r  [.] get_alias_set
     0.50%   0.45%          1904        1832  cpugcc_r  [.] canon_rtx

502 ref case 3

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-smaller.c -O3 -fipa-pta -o gcc-smaller.opts-O3_-fipa-pta.s

perf report:

Samples: 727K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 241888277315, DSO: cpugcc_r
          Overhead                   Samples  Command   Symbol
+   18.62%  18.04%         70999       71934  cpugcc_r  [.] bitmap_ior_into
+    3.91%   2.34%         14901       12303  cpugcc_r  [.] do_complex_constraint
+    2.33%   2.42%          8879        8819  cpugcc_r  [.] bitmap_set_bit
+    1.90%   1.79%          7161        4511  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+    1.23%   1.19%          4644        4173  cpugcc_r  [.] bitmap_bit_p
+    1.04%   2.66%          3936        7043  cpugcc_r  [.] compute_transp
+    1.00%   1.41%          3788        4329  cpugcc_r  [.] df_note_compute
+    0.91%   0.66%          3463        3318  cpugcc_r  [.] find
+    0.85%   1.37%          3210        3331  cpugcc_r  [.] bitmap_ior_and_compl
+    0.81%   1.71%          3054        4181  cpugcc_r  [.] bitmap_and
+    0.73%   0.42%          2786        1667  cpugcc_r  [.] htab_find_slot_with_hash
+    0.72%   0.19%          2744         925  cpugcc_r  [.] sorted_array_from_bitmap_set
+    0.72%   0.87%          2895        2578  cpugcc_r  [.] bitmap_elt_insert_after
+    0.66%   1.85%          2479        3938  cpugcc_r  [.] bitmap_and_into
+    0.60%   0.03%          2274         155  cpugcc_r  [.] record_reg_classes
+    0.59%   0.98%          2254        2773  cpugcc_r  [.] bitmap_copy
+    0.58%   1.05%          2193        2321  cpugcc_r  [.] inverted_post_order_compute
+    0.58%   0.14%          2194         542  cpugcc_r  [.] constrain_operands
+    0.57%   0.05%          2173         228  cpugcc_r  [.] find_reloads
+    0.54%   0.40%          2087        1711  cpugcc_r  [.] ggc_alloc_stat
+    0.53%   0.69%          2024        2276  cpugcc_r  [.] fast_dce
     0.52%   0.74%          1959        2104  cpugcc_r  [.] find_reg_note

502 ref case 4

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./cpugcc_r ref32.c -O5 -o ref32.opts-O5.s

perf report:

Samples: 713K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 231080095228, DSO: cpugcc_r
          Overhead                   Samples  Command   Symbol
+    3.48%  11.50%         12592       12824  cpugcc_r  [.] rtl_split_edge
+    3.30%   2.51%         11951       10183  cpugcc_r  [.] bitmap_set_bit
+    3.03%   2.02%         10912        8214  cpugcc_r  [.] bitmap_bit_p
+    2.79%   2.06%         10061        7459  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+    1.95%   1.79%          7037        7040  cpugcc_r  [.] bitmap_ior_into
+    1.75%   1.63%          6312        5954  cpugcc_r  [.] df_note_compute
+    1.31%   1.78%          4736        5763  cpugcc_r  [.] bitmap_and
+    1.25%   1.32%          4498        5164  cpugcc_r  [.] find_reg_note
+    1.23%   2.13%          4433        5700  cpugcc_r  [.] et_splay
+    1.20%   0.56%          4314        2589  cpugcc_r  [.] vrp_visit_phi_node
+    1.13%   1.22%          4113        4435  cpugcc_r  [.] bitmap_copy
+    1.03%   1.20%          3725        4147  cpugcc_r  [.] bitmap_ior_and_compl
+    1.03%   1.26%          3711        3563  cpugcc_r  [.] inverted_post_order_compute
+    1.01%   0.87%          3684        3026  cpugcc_r  [.] sbitmap_a_or_b
+    0.86%   0.73%          3103        3298  cpugcc_r  [.] fast_dce
+    0.85%   0.81%          3080        2784  cpugcc_r  [.] df_lr_bb_local_compute
+    0.85%   0.60%          3088        2501  cpugcc_r  [.] last_stmt
+    0.82%   1.06%          2975        2946  cpugcc_r  [.] find_unreachable_blocks
+    0.81%   0.97%          2961        2618  cpugcc_r  [.] calc_idoms
+    0.76%   1.05%          2759        3089  cpugcc_r  [.] bitmap_clear
+    0.72%   0.58%          2623        2107  cpugcc_r  [.] pool_alloc
+    0.71%   0.70%          2563        2652  cpugcc_r  [.] gsi_start_phis
+    0.70%   0.78%          2522        2277  cpugcc_r  [.] post_order_compute
+    0.69%   1.33%          2482        3392  cpugcc_r  [.] bitmap_and_into
+    0.67%   0.05%          2411         662  cpugcc_r  [.] record_reg_classes
+    0.66%   0.84%          2409        2409  cpugcc_r  [.] df_live_bb_local_compute
+    0.65%   0.75%          2353        2566  cpugcc_r  [.] remove_unused_locals
+    0.62%   0.79%          2274        2042  cpugcc_r  [.] calc_dfs_tree_nonrec
+    0.57%   0.51%          2079        2139  cpugcc_r  [.] bitmap_clear_bit
+    0.57%   0.14%          2073         717  cpugcc_r  [.] fold_binary_loc
+    0.57%   0.76%          2050        1750  cpugcc_r  [.] flow_loops_find
+    0.57%   0.25%          2051        1464  cpugcc_r  [.] df_ref_create_structure
+    0.55%   0.26%          1988        1220  cpugcc_r  [.] htab_find_slot_with_hash
+    0.54%   0.51%          1948        1761  cpugcc_r  [.] init_alias_analysis
+    0.54%   0.42%          1933        1929  cpugcc_r  [.] regstat_bb_compute_ri
     0.51%   0.50%          1853        1684  cpugcc_r  [.] mark_all_vars_used_1

502 ref case 5

Sampling command:

perf record -e cpu-cycles,cache-misses -g ./cpugcc_r ref32.c -O3 -fselective-scheduling -fselective-scheduling2 -o ref32.opts-O3_-fselective-scheduling_-fselective-scheduling2.s

perf report:

Samples: 878K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 261867226907, DSO: cpugcc_r
          Overhead                   Samples  Command   Symbol
+    3.22%   1.86%         13193       10837  cpugcc_r  [.] bitmap_set_bit
+    3.06%   7.75%         12508       12685  cpugcc_r  [.] rtl_split_edge
+    2.78%   1.38%         11333        7981  cpugcc_r  [.] bitmap_bit_p
+    2.46%   1.35%         10036        7181  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+    1.90%   1.25%          7770        7175  cpugcc_r  [.] bitmap_ior_into
+    1.55%   1.09%          6329        5771  cpugcc_r  [.] df_note_compute
+    1.27%   0.26%          5180        1235  cpugcc_r  [.] sched_analyze_insn
+    1.16%   1.24%          4718        5824  cpugcc_r  [.] bitmap_and
+    1.12%   0.92%          4549        5434  cpugcc_r  [.] find_reg_note
+    1.11%   1.47%          4525        5725  cpugcc_r  [.] et_splay
+    1.07%   0.88%          4392        4590  cpugcc_r  [.] bitmap_copy
+    0.98%   0.38%          3963        2569  cpugcc_r  [.] vrp_visit_phi_node
+    0.94%   0.90%          3823        4541  cpugcc_r  [.] bitmap_ior_and_compl
+    0.90%   0.85%          3679        3532  cpugcc_r  [.] inverted_post_order_compute
+    0.87%   0.49%          3559        2541  cpugcc_r  [.] pool_alloc
+    0.84%   0.83%          3457        3622  cpugcc_r  [.] bitmap_clear
+    0.81%   0.54%          3336        2717  cpugcc_r  [.] sbitmap_a_or_b
+    0.76%   0.53%          3097        2613  cpugcc_r  [.] df_lr_bb_local_compute
+    0.74%   0.68%          3032        2679  cpugcc_r  [.] calc_idoms
+    0.73%   0.52%          2992        3374  cpugcc_r  [.] fast_dce
+    0.71%   0.72%          2919        2910  cpugcc_r  [.] find_unreachable_blocks
+    0.71%   0.42%          2897        2545  cpugcc_r  [.] last_stmt
+    0.64%   0.41%          2615        3021  cpugcc_r  [.] extract_insn
+    0.62%   0.95%          2530        3621  cpugcc_r  [.] bitmap_and_into
     0.61%   0.53%          2492        2251  cpugcc_r  [.] post_order_compute
     0.60%   0.46%          2473        2581  cpugcc_r  [.] gsi_start_phis
     0.60%   0.03%          2428         496  cpugcc_r  [.] record_reg_classes
     0.60%   0.59%          2452        2474  cpugcc_r  [.] df_live_bb_local_compute
     0.59%   0.52%          2413        2636  cpugcc_r  [.] remove_unused_locals
     0.55%   0.54%          2273        2043  cpugcc_r  [.] calc_dfs_tree_nonrec
     0.55%   0.41%          2230        2506  cpugcc_r  [.] bitmap_clear_bit
     0.52%   0.21%          2118        1693  cpugcc_r  [.] df_ref_create_structure
     0.50%   0.09%          2061         713  cpugcc_r  [.] fold_binary_loc

Main performance-optimization development in progress at HW (Huawei)

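1. Global-scope instruction scheduling before RA

2. Mid-end CFG optimization

  1) For scenarios like the one in the link, simplify logical operations as far as possible, reduce the number of conditional checks, and simplify the CFG

  2) Redundant jump simplification based on value range

3. IPA enhancement of alias analysis, or function multi-versioning

4. Enable sink/hoist optimization

5. Back-end support for SR-add opt (link)

6. Alignment for loop/label/jump/data

7. Redundancy elimination with strldopt after RA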

[Appendix] Using perf

perf sampling

perf record -e cpu-cycles,cache-misses -g [cmd]

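-e specifies the events to sample; here cpu-cycles and cache-misses are sampled together. -g records call-graph information so that call stacks can be inspected in perf report. When sampling finishes, perf.data is written to the current directory. View it with perf report: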

perf report -n --no-children --group -f -i perf.data

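-n shows the number of samples for each symbol. --no-children counts only the samples of the caller itself and excludes its callees. --group shows all sampled events in perf.data side by side. -f (--force) skips the ownership check on perf.data. -i specifies the input file.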
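
For scripted runs, the two command lines above can be assembled as argument lists. This is a minimal sketch that uses only the flags described in this appendix; `record_cmd` and `report_cmd` are hypothetical helper names, and the workload is the 500 ref case 1 command from earlier in this issue.

```python
import shlex


def record_cmd(workload, events=("cpu-cycles", "cache-misses")):
    """Build the `perf record` argv for a workload given as an argv list."""
    return ["perf", "record", "-e", ",".join(events), "-g", *workload]


def report_cmd(data_file="perf.data"):
    """Build the `perf report` argv with the options used in this appendix."""
    return ["perf", "report", "-n", "--no-children", "--group", "-f", "-i", data_file]


workload = shlex.split("./perlbench_r -I./lib checkspam.pl 2500 5 25 11 150 1 1 1 1")
print(shlex.join(record_cmd(workload)))
print(shlex.join(report_cmd()))
```

On a machine with perf installed, either list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with workload arguments.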

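In the interactive perf report interface you can see the sample statistics for the hot functions. Typically you select a symbol in the current DSO and then use the shortcut keys d and F. d shows only the symbols of the current DSO, hiding unrelated symbols (such as malloc in libc.so). F recomputes the sample percentages over the symbols currently displayed. Select a symbol and press a to open its assembly code. The assembly is annotated, so you can see how the samples are distributed across the instructions.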

yi_jiang created the task
yi_jiang set the assignee to yi_jiang
yi_jiang added collaborator fredchow
yi_jiang added collaborator Leo Young