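(1) The hottest function in 500 (perlbench_r) is S_regmatch. The maple-compiled version is more than 20% slower than the gcc version; replacing it with the gcc version significantly improves overall performance.
User time (in seconds) after replacing the hottest function, S_regmatch, with the gcc version:
[Note] In gcc, S_regmatch is a called_once callee, so at O2 it is inlined by default and the standalone function is removed. To keep the symbol, inlining was disabled with gcc -fno-inline, and the gcc version of S_regmatch was then extracted.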
500 ref | gcc | maple | maple with gcc S_regmatch | Overall improvement after replacement |
---|---|---|---|---|
case 1 | 218.499 | 280.514 | 269.002 | 5.3% |
case 2 | 218.499 | 178.329 | 170.264 | 3.7% |
case 3 | 218.499 | 204.044 | 196.889 | 3.3% |
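As a cross-check on the last column, the short sketch below recomputes the improvement from the three timing columns. It assumes (this is inferred from the numbers, not stated in the table) that the improvement is expressed relative to the gcc time. That choice reproduces the reported 5.3%, 3.7%, and 3.3%, whereas dividing by the maple time does not.

```python
# User times in seconds, copied from the table above:
# (gcc, maple, maple with gcc S_regmatch).
timings = {
    "case 1": (218.499, 280.514, 269.002),
    "case 2": (218.499, 178.329, 170.264),
    "case 3": (218.499, 204.044, 196.889),
}

def improvement_pct(gcc, maple, replaced):
    """Time saved by the replacement, as a percentage of the gcc time (assumed baseline)."""
    return (maple - replaced) / gcc * 100

improvements = {case: round(improvement_pct(*t), 1) for case, t in timings.items()}
for case, pct in improvements.items():
    print(f"{case}: {pct}%")
```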
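(2) The S_regmatch performance problem is concentrated in register allocation (RA). Taking 500 ref case 1 as an example, the table below compares gcc S_regmatch and maple S_regmatch on total cycles samples, allocated stack frame size, and cycles samples attributed to stack operations. maple spills more variables to the stack.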
500 ref case1 | gcc S_regmatch | maple S_regmatch | Gap |
---|---|---|---|
cycles samples | 294063 | 415720 | 41.4% |
Allocated stack frame size (bytes) | 464 | 1232 | 165.5% |
stack ldr/str samples | 109701 | 195745 | 78.4% |
Stack operation share | 37.3% | 47.1% | 26.2% |
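The gap column can be reproduced from the raw counts. The sketch below (variable names are illustrative) computes each gap as the maple value relative to the gcc value, and derives the stack operation share as stack ldr/str samples divided by total cycles samples.

```python
# Raw figures for S_regmatch in 500 ref case 1, copied from the table above.
gcc = {"cycles": 294063, "frame_bytes": 464, "stack_ldst": 109701}
maple = {"cycles": 415720, "frame_bytes": 1232, "stack_ldst": 195745}

def gap_pct(gcc_value, maple_value):
    """How much larger the maple value is than the gcc value, in percent."""
    return (maple_value / gcc_value - 1) * 100

# Share of cycles samples that land on stack ldr/str instructions.
gcc_share = gcc["stack_ldst"] / gcc["cycles"]
maple_share = maple["stack_ldst"] / maple["cycles"]

gaps = {key: round(gap_pct(gcc[key], maple[key]), 1) for key in gcc}
gaps["stack_share"] = round(gap_pct(gcc_share, maple_share), 1)

print(f"stack share: gcc {gcc_share:.1%}, maple {maple_share:.1%}")
print(gaps)
```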
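(3) Store merging problem; see the issue for details.
The details are as follows:
Columns 1 and 2 are the cpu-cycles percentage and the cache-misses percentage.
Columns 3 and 4 are the cpu-cycles samples and the cache-misses samples.
Column 5 is the name of the current executable, and column 6 is the symbol name.
Only symbols whose cycles percentage is greater than 0.5% are shown.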
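When comparing several reports of this shape, it can help to parse the rows mechanically. The following is a minimal sketch, assuming each row has exactly the six columns described above, optionally preceded by the `+` expand marker from the perf report interface. The function name `parse_row` and the field names are illustrative, not part of perf.

```python
import re

# One report row: optional "+", two percentages, two sample counts,
# the command name, the "[.]" marker, and the symbol name.
ROW = re.compile(
    r"^\+?\s*(?P<cycles_pct>[\d.]+)%\s+(?P<miss_pct>[\d.]+)%\s+"
    r"(?P<cycles>\d+)\s+(?P<misses>\d+)\s+(?P<command>\S+)\s+\[\.\]\s+(?P<symbol>\S+)$"
)

def parse_row(line):
    """Return the six columns of one row as a dict, or None if the line is not a row."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    row["cycles_pct"] = float(row["cycles_pct"])
    row["miss_pct"] = float(row["miss_pct"])
    row["cycles"] = int(row["cycles"])
    row["misses"] = int(row["misses"])
    return row

row = parse_row("+ 37.26% 5.62% 415720 43157 perlbench_r [.] S_regmatch")
print(row["symbol"], row["cycles"], row["cycles_pct"])
```

Header lines such as `Overhead Samples Command Symbol` do not match the pattern and yield `None`, so they can be skipped when iterating over a whole report.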
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib checkspam.pl 2500 5 25 11 150 1 1 1 1
perf report:
Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 717386962088, DSO: perlbench_r
Overhead Samples Command Symbol
+ 37.26% 5.62% 415720 43157 perlbench_r [.] S_regmatch
##### gcc S_regmatch #####
+ 32.77% 4.71% 294063 22615 perlbench_r [.] S_regmatch
##########################
+ 11.14% 1.26% 124221 19733 perlbench_r [.] S_find_byclass
+ 9.54% 3.13% 106545 28123 perlbench_r [.] Perl_leave_scope
+ 4.91% 0.77% 54747 9168 perlbench_r [.] S_regtry
+ 2.84% 8.35% 31996 46504 perlbench_r [.] Perl_hv_common
+ 1.71% 0.55% 19115 4868 perlbench_r [.] Perl_save_pushptr
+ 1.53% 3.04% 17275 15945 perlbench_r [.] Perl_sv_setsv_flags
+ 1.50% 3.30% 16938 18312 perlbench_r [.] Perl_pp_nextstate
+ 1.45% 2.56% 16268 14864 perlbench_r [.] Perl_regexec_flags
+ 1.42% 5.73% 16006 33863 perlbench_r [.] Perl_pp_entersub
+ 1.35% 1.52% 15101 10375 perlbench_r [.] Perl_fbm_instr
+ 1.23% 4.35% 13812 23788 perlbench_r [.] Perl_pp_multideref
+ 1.12% 2.79% 12621 15979 perlbench_r [.] Perl_pp_match
+ 1.01% 2.35% 11312 13897 perlbench_r [.] Perl_pp_and
+ 0.95% 0.09% 10548 758 perlbench_r [.] Perl_ckwarn
+ 0.73% 2.67% 8243 13344 perlbench_r [.] Perl_pp_padsv
+ 0.60% 1.62% 6819 7586 perlbench_r [.] Perl_sv_clear
+ 0.57% 1.40% 6389 7741 perlbench_r [.] Perl_re_intuit_start
+ 0.55% 0.44% 6146 2897 perlbench_r [.] Perl_pp_iter
+ 0.53% 1.23% 6001 6175 perlbench_r [.] Perl_sv_upgrade
+ 0.53% 0.30% 5939 1698 perlbench_r [.] S_regrepeat
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib diffmail.pl 4 800 10 17 19 300
perf report:
Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 445421299381, DSO: perlbench_r
Overhead Samples Command Symbol
+ 18.78% 5.74% 130396 40373 perlbench_r [.] S_regmatch
##### Replaced with gcc S_regmatch #####
+ 15.62% 5.29% 103537 35774 perlbench_r [.] S_regmatch
################################
+ 6.04% 5.41% 41928 38142 perlbench_r [.] Perl_pp_padsv
+ 4.14% 0.53% 28786 3703 perlbench_r [.] Perl_pp_substr
+ 4.05% 1.62% 28165 11492 perlbench_r [.] Perl_leave_scope
+ 3.66% 6.10% 25454 42784 perlbench_r [.] Perl_sv_setsv_flags
+ 3.23% 1.09% 22426 7609 perlbench_r [.] S_regrepeat
+ 3.12% 3.71% 21652 26126 perlbench_r [.] Perl_pp_nextstate
+ 3.01% 2.89% 20909 20195 perlbench_r [.] Perl_pp_and
+ 2.18% 1.35% 15164 9522 perlbench_r [.] Perl_pp_enter
+ 2.14% 1.08% 14889 8040 perlbench_r [.] Perl_sv_setpvn
+ 2.13% 5.91% 14812 41528 perlbench_r [.] Perl_hv_common
+ 2.12% 0.71% 14752 4978 perlbench_r [.] Perl_pp_leave
+ 1.80% 0.37% 12513 3988 perlbench_r [.] Perl_pp_preinc
+ 1.66% 1.25% 11526 8877 perlbench_r [.] Perl_runops_standard
+ 1.64% 1.43% 11430 9884 perlbench_r [.] Perl_sv_upgrade
+ 1.58% 0.11% 10948 763 perlbench_r [.] Perl_pp_ord
+ 1.55% 3.73% 10796 26195 perlbench_r [.] Perl_pp_multideref
+ 1.48% 0.29% 10300 2018 perlbench_r [.] S_setup_EXACTISH_ST_c1_c2
+ 1.39% 1.12% 9623 7917 perlbench_r [.] Perl_pp_const
+ 1.33% 0.10% 9247 685 perlbench_r [.] Perl_translate_substr_offsets
+ 1.20% 0.16% 8345 1062 perlbench_r [.] Perl_pp_eq
+ 1.20% 1.39% 8319 8779 perlbench_r [.] Perl_sv_clear
+ 1.11% 0.34% 7739 2365 perlbench_r [.] Perl_pp_lt
+ 1.10% 0.11% 7662 782 perlbench_r [.] Perl_pp_unstack
+ 1.09% 2.93% 7561 20829 perlbench_r [.] Perl_regexec_flags
+ 1.05% 0.70% 7316 4828 perlbench_r [.] Perl_sv_setiv
+ 0.96% 2.56% 6635 17999 perlbench_r [.] Perl_pp_sassign
+ 0.95% 2.83% 6598 20339 perlbench_r [.] Perl_pp_entersub
+ 0.94% 0.29% 6532 2083 perlbench_r [.] Perl_pp_rv2sv
+ 0.85% 1.06% 5901 7537 perlbench_r [.] S_find_byclass
+ 0.67% 0.52% 4642 3636 perlbench_r [.] Perl_save_strlen
+ 0.67% 0.83% 4637 5924 perlbench_r [.] Perl_push_scope
+ 0.59% 0.25% 4079 1780 perlbench_r [.] S_glob_assign_glob
+ 0.58% 1.08% 4031 7654 perlbench_r [.] Perl_pp_match
+ 0.56% 0.91% 3906 6196 perlbench_r [.] Perl_pp_aassign
+ 0.56% 0.31% 3899 2239 perlbench_r [.] Perl_pop_scope
0.51% 0.19% 3550 1342 perlbench_r [.] S_share_hek_flags
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib splitmail.pl 6400 12 26 16 100 0
perf report:
Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 525004111325, DSO: perlbench_r
Overhead Samples Command Symbol
+ 16.62% 1.46% 136048 9904 perlbench_r [.] S_regmatch
##### Replaced with gcc S_regmatch #####
+ 14.74% 0.91% 116766 6159 perlbench_r [.] S_regmatch
################################
+ 11.01% 8.10% 90158 41508 perlbench_r [.] Perl_hv_common
+ 10.37% 8.89% 84833 72495 perlbench_r [.] Perl_hv_common
+ 9.56% 2.87% 78186 21477 perlbench_r [.] Perl_pp_multideref
+ 7.47% 12.85% 61105 103171 perlbench_r [.] Perl_regexec_flags
+ 4.99% 4.99% 40880 35093 perlbench_r [.] Perl_leave_scope
+ 3.42% 0.32% 27961 2702 perlbench_r [.] Perl_pp_unstack
+ 3.33% 2.24% 27210 18355 perlbench_r [.] Perl_av_fetch
+ 3.13% 0.81% 25635 4816 perlbench_r [.] Perl_pp_gvsv
+ 2.36% 1.30% 19343 11693 perlbench_r [.] Perl_pp_or
+ 1.98% 2.22% 16174 18391 perlbench_r [.] S_regtry
+ 1.93% 0.23% 15778 1870 perlbench_r [.] Perl_pp_preinc
+ 1.89% 9.10% 15470 74589 perlbench_r [.] S_cleanup_regmatch_info_aux
+ 1.76% 6.79% 14509 19775 perlbench_r [.] Perl_sv_cmp_flags
+ 1.28% 1.63% 10469 13453 perlbench_r [.] Perl_vivify_ref
+ 0.99% 1.46% 8119 7799 perlbench_r [.] Perl_pp_nextstate
+ 0.97% 0.01% 7895 103 perlbench_r [.] Perl_pp_stub
+ 0.93% 1.88% 7707 11304 perlbench_r [.] Perl_sv_setsv_flags
+ 0.83% 0.46% 6797 2361 perlbench_r [.] Perl_runops_standard
+ 0.74% 1.23% 6085 5575 perlbench_r [.] Perl_pp_padsv
+ 0.73% 1.87% 6002 15451 perlbench_r [.] Perl_save_destructor_x
+ 0.65% 0.08% 5355 729 perlbench_r [.] Perl_ckwarn
+ 0.61% 1.18% 5022 8563 perlbench_r [.] Perl_sv_eq_flags
+ 0.61% 1.29% 5009 5938 perlbench_r [.] Perl_hv_iternext_flags
+ 0.60% 2.03% 4946 9101 perlbench_r [.] Perl_newSVhek
+ 0.56% 0.21% 4629 1670 perlbench_r [.] Perl_pp_iter
+ 0.56% 0.74% 4561 3200 perlbench_r [.] Perl_pp_and
+ 0.52% 1.43% 4277 4404 perlbench_r [.] Perl_sortsv_flags
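Taking the S_regmatch rows from the three perlbench reports above, the relative gap in cpu-cycles samples between the maple version and the gcc version can be computed as below. This is only a sketch over the sample counts shown above; counts from separate runs are subject to sampling noise, so the percentages are indicative rather than exact.

```python
# S_regmatch cpu-cycles samples from the three reports above:
# (maple S_regmatch, gcc S_regmatch).
samples = {
    "checkspam.pl": (415720, 294063),
    "diffmail.pl": (130396, 103537),
    "splitmail.pl": (136048, 116766),
}

def extra_pct(maple, gcc):
    """Extra cycles samples of the maple version relative to the gcc version, in percent."""
    return (maple / gcc - 1) * 100

extra = {name: round(extra_pct(*pair), 1) for name, pair in samples.items()}
for name, pct in extra.items():
    print(f"{name}: maple S_regmatch takes {pct}% more cycles samples")
```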
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-pp.c -O3 -finline-limit=0 -fif-conversion -fif-conversion2 -o gcc-pp.opts-O3_-finline-limit_0_-fif-conversion_-fif-conversion2.s
perf report:
Samples: 625K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 215393196829, DSO: cpugcc_r
Overhead Samples Command Symbol
+ 2.12% 2.20% 7125 4695 cpugcc_r [.] df_worklist_dataflow_doublequeue
+ 1.86% 2.17% 6302 6310 cpugcc_r [.] bitmap_set_bit
+ 1.42% 1.47% 4806 4281 cpugcc_r [.] bitmap_bit_p
+ 1.42% 2.13% 4795 5699 cpugcc_r [.] df_note_compute
+ 1.34% 2.04% 4526 4327 cpugcc_r [.] bitmap_ior_into
+ 1.01% 2.94% 3408 6348 cpugcc_r [.] compute_transp
+ 0.92% 1.42% 3102 2952 cpugcc_r [.] bitmap_ior_and_compl
+ 0.88% 1.97% 2972 4091 cpugcc_r [.] bitmap_and
+ 0.87% 0.50% 2950 1675 cpugcc_r [.] htab_find_slot_with_hash
+ 0.85% 0.05% 2855 253 cpugcc_r [.] record_reg_classes
+ 0.72% 2.09% 2421 3721 cpugcc_r [.] bitmap_and_into
+ 0.71% 0.08% 2389 337 cpugcc_r [.] find_reloads
+ 0.71% 0.23% 2380 923 cpugcc_r [.] sorted_array_from_bitmap_set
+ 0.70% 0.61% 2421 2223 cpugcc_r [.] ggc_alloc_stat
+ 0.69% 0.24% 2326 838 cpugcc_r [.] constrain_operands
+ 0.67% 1.16% 2265 2783 cpugcc_r [.] bitmap_copy
+ 0.67% 0.97% 2254 2725 cpugcc_r [.] fast_dce
+ 0.66% 0.95% 2211 2334 cpugcc_r [.] find_reg_note
+ 0.64% 0.90% 2154 2723 cpugcc_r [.] bitmap_clear_bit
+ 0.61% 0.60% 2074 1909 cpugcc_r [.] pool_alloc
+ 0.59% 0.46% 2005 1596 cpugcc_r [.] extract_insn
+ 0.59% 0.49% 1998 1317 cpugcc_r [.] regstat_bb_compute_ri
+ 0.59% 0.87% 1984 1770 cpugcc_r [.] inverted_post_order_compute
+ 0.58% 0.74% 1953 1519 cpugcc_r [.] bitmap_elt_insert_after
+ 0.54% 0.89% 1835 2505 cpugcc_r [.] df_lr_bb_local_compute
+ 0.54% 0.34% 1822 1286 cpugcc_r [.] df_ref_create_structure
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-pp.c -O2 -finline-limit=36000 -fpic -o gcc-pp.opts-O2_-finline-limit_36000_-fpic.s
perf report:
Samples: 720K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 242373601238, DSO: cpugcc_r
Overhead Samples Command Symbol
+ 4.26% 7.34% 16053 19824 cpugcc_r [.] compute_transp
+ 2.09% 1.92% 7895 5406 cpugcc_r [.] df_worklist_dataflow_doublequeue
+ 1.74% 1.99% 6616 7195 cpugcc_r [.] bitmap_set_bit
+ 1.61% 2.17% 6074 6043 cpugcc_r [.] bitmap_ior_into
+ 1.29% 1.79% 4884 6110 cpugcc_r [.] df_note_compute
+ 1.28% 1.25% 4864 4440 cpugcc_r [.] bitmap_bit_p
+ 1.04% 1.96% 3950 5176 cpugcc_r [.] bitmap_and
+ 1.04% 1.07% 3903 4045 cpugcc_r [.] ix86_delegitimize_address
+ 1.01% 1.47% 3837 4040 cpugcc_r [.] bitmap_ior_and_compl
+ 0.95% 2.22% 3603 4768 cpugcc_r [.] bitmap_and_into
+ 0.91% 0.07% 3453 426 cpugcc_r [.] find_reloads
+ 0.85% 0.96% 3189 3656 cpugcc_r [.] delegitimize_mem_from_attrs
+ 0.81% 0.86% 3049 2337 cpugcc_r [.] bitmap_elt_insert_after
+ 0.75% 0.04% 2861 245 cpugcc_r [.] record_reg_classes
+ 0.75% 0.36% 2862 1539 cpugcc_r [.] htab_find_slot_with_hash
+ 0.71% 0.54% 2733 2430 cpugcc_r [.] ggc_alloc_stat
+ 0.70% 1.06% 2646 3314 cpugcc_r [.] bitmap_copy
+ 0.67% 0.17% 2562 844 cpugcc_r [.] constrain_operands
+ 0.66% 0.83% 2495 2553 cpugcc_r [.] find_reg_note
+ 0.66% 0.64% 2468 2646 cpugcc_r [.] find_base_term
+ 0.62% 0.13% 2340 487 cpugcc_r [.] rtx_equal_for_memref_p
+ 0.62% 0.41% 2349 1888 cpugcc_r [.] extract_insn
+ 0.61% 0.06% 2305 289 cpugcc_r [.] ao_ref_from_mem
+ 0.59% 2.02% 2235 1944 cpugcc_r [.] pre_expr_reaches_here_p_work
+ 0.59% 0.42% 2229 1514 cpugcc_r [.] regstat_bb_compute_ri
+ 0.58% 0.76% 2218 2722 cpugcc_r [.] fast_dce
+ 0.57% 0.48% 2187 1911 cpugcc_r [.] pool_alloc
+ 0.57% 0.70% 2145 1848 cpugcc_r [.] inverted_post_order_compute
+ 0.56% 0.09% 2118 360 cpugcc_r [.] get_ref_base_and_extent
+ 0.56% 0.79% 2129 2804 cpugcc_r [.] df_lr_bb_local_compute
+ 0.56% 0.18% 2106 665 cpugcc_r [.] memrefs_conflict_p
+ 0.55% 0.69% 2091 2638 cpugcc_r [.] bitmap_clear_bit
0.51% 0.31% 1941 1832 cpugcc_r [.] reload
0.51% 0.09% 1951 413 cpugcc_r [.] ix86_decompose_address
0.51% 0.04% 1923 167 cpugcc_r [.] get_alias_set
0.50% 0.45% 1904 1832 cpugcc_r [.] canon_rtx
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-smaller.c -O3 -fipa-pta -o gcc-smaller.opts-O3_-fipa-pta.s
perf report:
Samples: 727K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 241888277315, DSO: cpugcc_r
Overhead Samples Command Symbol
+ 18.62% 18.04% 70999 71934 cpugcc_r [.] bitmap_ior_into
+ 3.91% 2.34% 14901 12303 cpugcc_r [.] do_complex_constraint
+ 2.33% 2.42% 8879 8819 cpugcc_r [.] bitmap_set_bit
+ 1.90% 1.79% 7161 4511 cpugcc_r [.] df_worklist_dataflow_doublequeue
+ 1.23% 1.19% 4644 4173 cpugcc_r [.] bitmap_bit_p
+ 1.04% 2.66% 3936 7043 cpugcc_r [.] compute_transp
+ 1.00% 1.41% 3788 4329 cpugcc_r [.] df_note_compute
+ 0.91% 0.66% 3463 3318 cpugcc_r [.] find
+ 0.85% 1.37% 3210 3331 cpugcc_r [.] bitmap_ior_and_compl
+ 0.81% 1.71% 3054 4181 cpugcc_r [.] bitmap_and
+ 0.73% 0.42% 2786 1667 cpugcc_r [.] htab_find_slot_with_hash
+ 0.72% 0.19% 2744 925 cpugcc_r [.] sorted_array_from_bitmap_set
+ 0.72% 0.87% 2895 2578 cpugcc_r [.] bitmap_elt_insert_after
+ 0.66% 1.85% 2479 3938 cpugcc_r [.] bitmap_and_into
+ 0.60% 0.03% 2274 155 cpugcc_r [.] record_reg_classes
+ 0.59% 0.98% 2254 2773 cpugcc_r [.] bitmap_copy
+ 0.58% 1.05% 2193 2321 cpugcc_r [.] inverted_post_order_compute
+ 0.58% 0.14% 2194 542 cpugcc_r [.] constrain_operands
+ 0.57% 0.05% 2173 228 cpugcc_r [.] find_reloads
+ 0.54% 0.40% 2087 1711 cpugcc_r [.] ggc_alloc_stat
+ 0.53% 0.69% 2024 2276 cpugcc_r [.] fast_dce
0.52% 0.74% 1959 2104 cpugcc_r [.] find_reg_note
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r ref32.c -O5 -o ref32.opts-O5.s
perf report:
Samples: 713K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 231080095228, DSO: cpugcc_r
Overhead Samples Command Symbol
+ 3.48% 11.50% 12592 12824 cpugcc_r [.] rtl_split_edge
+ 3.30% 2.51% 11951 10183 cpugcc_r [.] bitmap_set_bit
+ 3.03% 2.02% 10912 8214 cpugcc_r [.] bitmap_bit_p
+ 2.79% 2.06% 10061 7459 cpugcc_r [.] df_worklist_dataflow_doublequeue
+ 1.95% 1.79% 7037 7040 cpugcc_r [.] bitmap_ior_into
+ 1.75% 1.63% 6312 5954 cpugcc_r [.] df_note_compute
+ 1.31% 1.78% 4736 5763 cpugcc_r [.] bitmap_and
+ 1.25% 1.32% 4498 5164 cpugcc_r [.] find_reg_note
+ 1.23% 2.13% 4433 5700 cpugcc_r [.] et_splay
+ 1.20% 0.56% 4314 2589 cpugcc_r [.] vrp_visit_phi_node
+ 1.13% 1.22% 4113 4435 cpugcc_r [.] bitmap_copy
+ 1.03% 1.20% 3725 4147 cpugcc_r [.] bitmap_ior_and_compl
+ 1.03% 1.26% 3711 3563 cpugcc_r [.] inverted_post_order_compute
+ 1.01% 0.87% 3684 3026 cpugcc_r [.] sbitmap_a_or_b
+ 0.86% 0.73% 3103 3298 cpugcc_r [.] fast_dce
+ 0.85% 0.81% 3080 2784 cpugcc_r [.] df_lr_bb_local_compute
+ 0.85% 0.60% 3088 2501 cpugcc_r [.] last_stmt
+ 0.82% 1.06% 2975 2946 cpugcc_r [.] find_unreachable_blocks
+ 0.81% 0.97% 2961 2618 cpugcc_r [.] calc_idoms
+ 0.76% 1.05% 2759 3089 cpugcc_r [.] bitmap_clear
+ 0.72% 0.58% 2623 2107 cpugcc_r [.] pool_alloc
+ 0.71% 0.70% 2563 2652 cpugcc_r [.] gsi_start_phis
+ 0.70% 0.78% 2522 2277 cpugcc_r [.] post_order_compute
+ 0.69% 1.33% 2482 3392 cpugcc_r [.] bitmap_and_into
+ 0.67% 0.05% 2411 662 cpugcc_r [.] record_reg_classes
+ 0.66% 0.84% 2409 2409 cpugcc_r [.] df_live_bb_local_compute
+ 0.65% 0.75% 2353 2566 cpugcc_r [.] remove_unused_locals
+ 0.62% 0.79% 2274 2042 cpugcc_r [.] calc_dfs_tree_nonrec
+ 0.57% 0.51% 2079 2139 cpugcc_r [.] bitmap_clear_bit
+ 0.57% 0.14% 2073 717 cpugcc_r [.] fold_binary_loc
+ 0.57% 0.76% 2050 1750 cpugcc_r [.] flow_loops_find
+ 0.57% 0.25% 2051 1464 cpugcc_r [.] df_ref_create_structure
+ 0.55% 0.26% 1988 1220 cpugcc_r [.] htab_find_slot_with_hash
+ 0.54% 0.51% 1948 1761 cpugcc_r [.] init_alias_analysis
+ 0.54% 0.42% 1933 1929 cpugcc_r [.] regstat_bb_compute_ri
0.51% 0.50% 1853 1684 cpugcc_r [.] mark_all_vars_used_1
Sampling command:
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r ref32.c -O3 -fselective-scheduling -fselective-scheduling2 -o ref32.opts-O3_-fselective-scheduling_-fselective-scheduling2.s
perf report:
Samples: 878K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 261867226907, DSO: cpugcc_r
Overhead Samples Command Symbol
+ 3.22% 1.86% 13193 10837 cpugcc_r [.] bitmap_set_bit
+ 3.06% 7.75% 12508 12685 cpugcc_r [.] rtl_split_edge
+ 2.78% 1.38% 11333 7981 cpugcc_r [.] bitmap_bit_p
+ 2.46% 1.35% 10036 7181 cpugcc_r [.] df_worklist_dataflow_doublequeue
+ 1.90% 1.25% 7770 7175 cpugcc_r [.] bitmap_ior_into
+ 1.55% 1.09% 6329 5771 cpugcc_r [.] df_note_compute
+ 1.27% 0.26% 5180 1235 cpugcc_r [.] sched_analyze_insn
+ 1.16% 1.24% 4718 5824 cpugcc_r [.] bitmap_and
+ 1.12% 0.92% 4549 5434 cpugcc_r [.] find_reg_note
+ 1.11% 1.47% 4525 5725 cpugcc_r [.] et_splay
+ 1.07% 0.88% 4392 4590 cpugcc_r [.] bitmap_copy
+ 0.98% 0.38% 3963 2569 cpugcc_r [.] vrp_visit_phi_node
+ 0.94% 0.90% 3823 4541 cpugcc_r [.] bitmap_ior_and_compl
+ 0.90% 0.85% 3679 3532 cpugcc_r [.] inverted_post_order_compute
+ 0.87% 0.49% 3559 2541 cpugcc_r [.] pool_alloc
+ 0.84% 0.83% 3457 3622 cpugcc_r [.] bitmap_clear
+ 0.81% 0.54% 3336 2717 cpugcc_r [.] sbitmap_a_or_b
+ 0.76% 0.53% 3097 2613 cpugcc_r [.] df_lr_bb_local_compute
+ 0.74% 0.68% 3032 2679 cpugcc_r [.] calc_idoms
+ 0.73% 0.52% 2992 3374 cpugcc_r [.] fast_dce
+ 0.71% 0.72% 2919 2910 cpugcc_r [.] find_unreachable_blocks
+ 0.71% 0.42% 2897 2545 cpugcc_r [.] last_stmt
+ 0.64% 0.41% 2615 3021 cpugcc_r [.] extract_insn
+ 0.62% 0.95% 2530 3621 cpugcc_r [.] bitmap_and_into
0.61% 0.53% 2492 2251 cpugcc_r [.] post_order_compute
0.60% 0.46% 2473 2581 cpugcc_r [.] gsi_start_phis
0.60% 0.03% 2428 496 cpugcc_r [.] record_reg_classes
0.60% 0.59% 2452 2474 cpugcc_r [.] df_live_bb_local_compute
0.59% 0.52% 2413 2636 cpugcc_r [.] remove_unused_locals
0.55% 0.54% 2273 2043 cpugcc_r [.] calc_dfs_tree_nonrec
0.55% 0.41% 2230 2506 cpugcc_r [.] bitmap_clear_bit
0.52% 0.21% 2118 1693 cpugcc_r [.] df_ref_create_structure
0.50% 0.09% 2061 713 cpugcc_r [.] fold_binary_loc
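1) As in the scenario in the link, simplify logical operations as far as possible to reduce the number of condition checks and simplify the CFG.
2) Simplify redundant jumps based on value ranges.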
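To make item 2) concrete, here is a toy sketch of value-range-based branch folding. It is not maple's implementation: the `Range` type and the `fold_less_than` and `refine_on_true` helpers are hypothetical names used only for illustration. Given a known inclusive range for `x`, a branch on `x < c` can be folded whenever the range proves the condition always true or always false.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Range:
    """Inclusive integer value range [lo, hi] known for a variable at a program point."""
    lo: int
    hi: int

def fold_less_than(x: Range, c: int) -> Optional[bool]:
    """Decide `x < c` from the range of x; None means the branch must be kept."""
    if x.hi < c:
        return True   # every possible value is below c: branch always taken
    if x.lo >= c:
        return False  # no possible value is below c: branch never taken
    return None       # the range straddles c: the outcome is unknown

def refine_on_true(x: Range, c: int) -> Range:
    """Range of x on the taken edge of `if x < c` (caller ensures the edge is reachable)."""
    return Range(x.lo, min(x.hi, c - 1))

# After `if x < 10` is taken, a later `if x < 20` is redundant and can be folded.
x = Range(0, 255)
x_taken = refine_on_true(x, 10)
print(fold_less_than(x, 20))        # None: cannot fold before the first check
print(fold_less_than(x_taken, 20))  # True: the second check is redundant
```

Once a condition folds to a constant, the conditional jump becomes an unconditional one (or disappears), which in turn removes an edge from the CFG.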
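perf sampling
perf record -e cpu-cycles,cache-misses -g [cmd]
`-e` specifies the events to sample; here both cpu-cycles and cache-misses are sampled. `-g` records call-graph information so that call stacks can be inspected in perf report. When sampling finishes, perf.data is generated in the current directory; view it with perf report:
perf report -n --no-children --group -f -i perf.data
`-n` shows the number of samples for each symbol. `--no-children` counts only the samples of the caller itself, excluding its callees. `--group` displays all the sampled events in perf.data together. `-i` specifies the input file name.
Once in the perf report interactive interface, you can see the sample statistics of the hot functions. Typically you select a symbol from the current DSO and then use the shortcut keys `d` and `F`: `d` shows only the symbols of the current DSO and hides unrelated symbols (such as malloc in libc.so), and `F` recomputes the sample percentages over the symbols currently displayed. Select a symbol and press `a` to open its assembly code; the assembly is annotated, so you can see how the samples are distributed across the instructions.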
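For repeated runs, the two command lines can be assembled by a small helper. The sketch below only builds and prints the argument lists and does not invoke perf. The function names are illustrative, and the flags are exactly the ones used above.

```python
import shlex

def perf_record_cmd(workload, events=("cpu-cycles", "cache-misses")):
    """argv for `perf record` with the given events and call-graph recording (-g)."""
    return ["perf", "record", "-e", ",".join(events), "-g", *workload]

def perf_report_cmd(data_file="perf.data"):
    """argv for `perf report` with the options used above."""
    return ["perf", "report", "-n", "--no-children", "--group", "-f", "-i", data_file]

# Workload taken from the splitmail.pl sampling command earlier in this document.
workload = ["./perlbench_r", "-I./lib", "splitmail.pl", "6400", "12", "26", "16", "100", "0"]
record = perf_record_cmd(workload)
report = perf_report_cmd()
print(shlex.join(record))
print(shlex.join(report))
```

The resulting lists can be passed to `subprocess.run` on a machine where perf is installed.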