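When constants are stored to consecutive memory locations, maple emits one str per store, whereas gcc can usually merge several str instructions into one. An example follows.
Preprocessed C code
SPEC 500, file regexec.c, lines 7297 to 7308.
Function: S_regmatch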
case 42: // case STAR
st->u.curly.paren = 0;
st->u.curly.min = 0;
st->u.curly.max = 32767;
scan = scan + 1;
goto repeat;
case 43: // case PLUS
st->u.curly.paren = 0;
st->u.curly.min = 1;
st->u.curly.max = 32767;
scan = scan + 1;
goto repeat;
maple assembly
These 2 str instructions write to consecutive memory and can be merged into 1.
[Image not available: maple assembly]
gcc assembly
[Image not available: gcc assembly]
me.mpl
@L87554 LOC 2 7296
iassign <* <$regmatch_state>> 101 (regread ptr %485, constval u32 0)
LOC 2 7298
iassign <* <$regmatch_state>> 110 (regread ptr %485, constval i32 0)
LOC 2 7299
iassign <* <$regmatch_state>> 111 (regread ptr %485, constval i32 0x7fff)
LOC 2 7300
regassign ptr %488 (add u64 (regread ptr %488, constval u64 4))
LOC 2 7301
goto @L87810
@L88066 LOC 2 7303
iassign <* <$regmatch_state>> 101 (regread ptr %485, constval u32 0)
LOC 2 7305
iassign <* <$regmatch_state>> 110 (regread ptr %485, constval i32 1)
LOC 2 7306
iassign <* <$regmatch_state>> 111 (regread ptr %485, constval i32 0x7fff)
LOC 2 7307
regassign ptr %488 (add u64 (regread ptr %488, constval u64 4))
LOC 2 7308
goto @L87810
gcc's store-merging
gcc performs this transformation in its store-merging pass. The description of the pass in gimple-ssa-store-merging.c:
The purpose of this pass is to combine multiple memory stores of constant values to consecutive memory locations into fewer wider stores. For example, if we have a sequence performing four byte stores to consecutive memory locations:
[p ] := imm1;
[p + 1B] := imm2;
[p + 2B] := imm3;
[p + 3B] := imm4;
we can transform this into a single 4-byte store if the target supports it:
[p] := imm1:imm2:imm3:imm4 //concatenated immediates according to endianness.
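The transformation described in the gcc comment can be sketched in C. This is only an illustration, not gcc's implementation: the function names store_bytes, store_merged and host_is_little_endian are made up for this example, and the merged form uses memcpy to express a single 4-byte store in portable C.

```c
#include <stdint.h>
#include <string.h>

/* Unmerged form: four separate one-byte stores to p[0..3]. */
void store_bytes(unsigned char *p, uint8_t imm1, uint8_t imm2,
                 uint8_t imm3, uint8_t imm4)
{
    p[0] = imm1;
    p[1] = imm2;
    p[2] = imm3;
    p[3] = imm4;
}

/* Returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Merged form: build one 32-bit immediate according to endianness,
   then write it with a single 4-byte copy. */
void store_merged(unsigned char *p, uint8_t imm1, uint8_t imm2,
                  uint8_t imm3, uint8_t imm4)
{
    uint32_t wide;
    if (host_is_little_endian()) {
        /* The byte at the lowest address is the least significant one. */
        wide = (uint32_t)imm1 | ((uint32_t)imm2 << 8) |
               ((uint32_t)imm3 << 16) | ((uint32_t)imm4 << 24);
    } else {
        /* The byte at the lowest address is the most significant one. */
        wide = ((uint32_t)imm1 << 24) | ((uint32_t)imm2 << 16) |
               ((uint32_t)imm3 << 8) | (uint32_t)imm4;
    }
    memcpy(p, &wide, sizeof wide);
}
```

Both functions leave the same four bytes in memory; the merged form only needs the immediates to be known at compile time so that the wide constant can be precomputed.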
before store-merging:
<L1165> [0.26%]:
st_4388->u.curly.paren = 0;
st_4388->u.curly.min = 0;
st_4388->u.curly.max = 32767;
scan_3887 = scan_2406 + 4;
# DEBUG scan => scan_3887
goto <bb 776> (repeat); [100.00%]
<L1166> [0.26%]:
st_4388->u.curly.paren = 0;
st_4388->u.curly.min = 1;
st_4388->u.curly.max = 32767;
scan_3883 = scan_2406 + 4;
# DEBUG scan => scan_3883
goto <bb 776> (repeat); [100.00%]
after store-merging:
<L1165> [0.26%]:
st_4388->u.curly.paren = 0;
MEM[(union *)st_4388 + 60B] = 140733193388032;
scan_3887 = scan_2406 + 4;
# DEBUG scan => scan_3887
goto <bb 776> (repeat); [100.00%]
<L1166> [0.26%]:
st_4388->u.curly.paren = 0;
MEM[(union *)st_4388 + 60B] = 140733193388033;
scan_3883 = scan_2406 + 4;
# DEBUG scan => scan_3883
goto <bb 776> (repeat); [100.00%]
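The large constants in the merged stores are the two 32-bit values for min and max packed into one 64-bit immediate. A minimal sketch of that arithmetic, assuming a little-endian target and that min and max are adjacent 4-byte fields starting at offset 60, as the dump suggests (merge_le32 is a hypothetical helper name):

```c
#include <stdint.h>

/* Combine two adjacent 32-bit constants into the 64-bit immediate that a
   little-endian target stores at the address of the first field.
   `lo` is the field at the lower address (min), `hi` the next one (max). */
uint64_t merge_le32(int32_t lo, int32_t hi)
{
    return (uint64_t)(uint32_t)lo | ((uint64_t)(uint32_t)hi << 32);
}
```

With min = 0 and max = 32767 (0x7fff) this gives 0x7fff << 32 = 140733193388032, and with min = 1 it gives 140733193388033, matching the two merged stores in the dump.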
The -fno-store-merging option disables this gcc optimization. With it disabled, SPEC 500 built by gcc runs about 1% slower.
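More store-merging examples
C source
The source is at lines 6770 to 6772 of regexec.c in SPEC 500, in the case WHILEM part of function S_regmatch.
The code performs 3 consecutive assignments to fields that are adjacent in memory. The store sizes are 8 bytes, 4 bytes, and 4 bytes. gcc emits a single stp that writes all 16 bytes, while maplec translates the stores one by one.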
st->u.whilem.save_lastloc = cur_curlyx->u.curlyx.lastloc;
st->u.whilem.cache_offset = 0;
st->u.whilem.cache_mask = 0;
Assembly generated by maple
# st in x19
str x2, [x19, #40]
str wzr, [x19, #48]
str wzr, [x19, #52]
Assembly generated by gcc
# st in x19
stp x0, xzr, [x19, #40]
before store-merging
st_4388->u.whilem.save_lastloc = _1399;
st_4388->u.whilem.cache_offset = 0;
st_4388->u.whilem.cache_mask = 0;
after store-merging
st_4388->u.whilem.save_lastloc = _1399;
MEM[(union *)st_4388 + 48B] = 0;
Source: SPEC 500, pp_ctl.c, function: Perl_pp_enter
ldr has similar room for optimization
mpl:
regassign i32 %9 (iread i32 <* <$stackinfo>> 5 (regread ptr %33))   <===== "%33.field 5"
LOC 36 2252
brfalse @@3 (lt i32 i32 (
regread i32 %9,
iread i32 <* <$stackinfo>> 6 (regread ptr %33)))   <===== "%33.field 6"
maple assembly:
ldr w1, [x0,#32] // w1 = PL_curstackinfo->si_cxix
ldr w2, [x0,#36] // w2 = PL_curstackinfo->si_cxmax
gcc assembly:
ldp w0, w1, [x22, 32]
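The same idea applies to loads: two reads of adjacent 4-byte fields can be served by one 8-byte access. A rough C analogue follows, assuming the two fields are contiguous with no padding. The struct pair32 and both function names are hypothetical stand-ins, not the real stackinfo layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two adjacent fields si_cxix and si_cxmax. */
struct pair32 {
    int32_t cxix;
    int32_t cxmax;
};
_Static_assert(sizeof(struct pair32) == 8, "fields must be contiguous");

/* Two separate 4-byte loads, analogous to the ldr + ldr sequence. */
int below_max_separate(const struct pair32 *p)
{
    return p->cxix < p->cxmax;
}

/* One 8-byte read that fetches both fields at once, analogous to ldp. */
int below_max_paired(const struct pair32 *p)
{
    int32_t fields[2];
    memcpy(fields, p, sizeof fields); /* single 8-byte copy */
    return fields[0] < fields[1];
}
```

Both functions return the same result; the paired form only makes explicit that the two fields are fetched with one memory access.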
recog.c------extract_insn
2044 recog_data.n_operands = 0;
2045 recog_data.n_alternatives = 0;
2046 recog_data.n_dups = 0;
LOC 43 2044
iassign <* <$recog_data>> 8 (regread ptr %59, constval u8 0)
LOC 43 2045
iassign <* <$recog_data>> 10 (regread ptr %59, constval u8 0)
LOC 43 2046
iassign <* <$recog_data>> 9 (regread ptr %59, constval u8 0)
adrp x0, recog_data
add x19, x0, #:lo12:recog_data
strb wzr, [x19,#1086]
strb wzr, [x19,#1088]
strb wzr, [x19,#1087]
The two adjacent strb instructions can be merged into one strh
strb wzr, [x19,#1086]
strh wzr, [x19,#1087]
A similar case:
bitmap.c------bitmap_set_bit
466 element->next = ptr->next;
467 element->prev = ptr;
@@50 LOC 49 466
iassign <* <$bitmap_element_def>> 1 (regread ptr %37, regread ptr %26)
LOC 49 467
iassign <* <$bitmap_element_def>> 2 (regread ptr %37, regread ptr %81)
.L.3922__50:
str x0, [x19]
str x3, [x19,#8] <====== can be merged into stp
This optimization, if performed in maple IR, requires changing iassign to iassignoff. Since we only support iassignoff after mapleme, this optimization is best done in maple_be, during or after be_lowering.
Since this optimization will use iassignoff to represent the merged iassigns, we'll also add support for iassignoff in mplcg's instruction selection. This optimization is now in progress.
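The core of such a pass can be sketched as follows. This is a simplified illustration under stated assumptions, not the maple implementation: ConstStore and merge_const_stores are hypothetical names, all stores are assumed to share one base pointer with no intervening aliasing access, the target is assumed little-endian, and alignment is ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of one constant store relative to a common base. */
typedef struct {
    uint64_t offset; /* byte offset from the base pointer */
    unsigned size;   /* store width in bytes: 1, 2, 4 or 8 */
    uint64_t value;  /* constant, zero-extended to 64 bits */
} ConstStore;

static int by_offset(const void *a, const void *b)
{
    const ConstStore *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort by offset, then repeatedly fuse neighbours of equal width that
   touch each other. Returns the number of stores left in `s`. */
size_t merge_const_stores(ConstStore *s, size_t n)
{
    qsort(s, n, sizeof *s, by_offset);
    int changed = 1;
    while (changed) {
        changed = 0;
        size_t out = 0;
        for (size_t i = 0; i < n; ++i) {
            if (i + 1 < n && s[i].size == s[i + 1].size && s[i].size < 8 &&
                s[i].offset + s[i].size == s[i + 1].offset) {
                /* Little-endian: the lower address holds the low bits. */
                ConstStore m = { s[i].offset, s[i].size * 2,
                                 s[i].value |
                                     (s[i + 1].value << (8 * s[i].size)) };
                s[out++] = m;
                ++i; /* two inputs were consumed */
                changed = 1;
            } else {
                s[out++] = s[i];
            }
        }
        n = out;
    }
    return n;
}
```

Applied to the three byte stores at offsets 1086, 1088 and 1087 from the recog.c case, this sketch sorts them and leaves a 2-byte store at 1086 plus a 1-byte store at 1088. Applied to the two 4-byte stores of 0 and 0x7fff at offsets 60 and 64, it produces one 8-byte store of 140733193388032. Each merged record corresponds to what an iassignoff with the wider type would express.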
PR849 on iassign has been integrated and brings a 2% improvement on cases from 500 train.
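While adding support for memset expansion, we found that after a memset on a struct or array is expanded into member-wise assignments, store merge fails to merge them. (We prefer to expand into member-wise assignments in the middle end, and otherwise into iassignoff in the back end, because expanding in the middle end helps alias analysis.) Store merge needs to be enhanced: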
spec 502
file: cpp_pch.c:338
func: cpp_write_pch_deps
me.mpl
callassigned &memset (addrof ptr %z_309_3, constval i32 0, constval u64 8) {}
me.mpl (after memset expand)
dassign %z_309_3 1 (constval u32 0)
dassign %z_309_3 2 (constval u16 0)
dassign %z_309_3 3 (constval u16 0)
Expected store merge result:
iassignoff u64 0 (addrof ptr %z_309_3, constval u64 0)
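The equivalence behind the expected result can be checked with a small C sketch. The struct z8 and its field names are hypothetical; only the field widths (u32, u16, u16) are taken from the expanded dassign statements above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte struct with the same field widths as %z_309_3. */
struct z8 {
    uint32_t f1;
    uint16_t f2;
    uint16_t f3;
};
_Static_assert(sizeof(struct z8) == 8, "fields must be contiguous, no padding");

/* Member-wise form produced by memset expansion: three stores. */
void clear_memberwise(struct z8 *z)
{
    z->f1 = 0;
    z->f2 = 0;
    z->f3 = 0;
}

/* Merged form: one 8-byte store of zero, as in the expected iassignoff. */
void clear_merged(struct z8 *z)
{
    const uint64_t zero = 0;
    memcpy(z, &zero, sizeof zero);
}
```

Because the three fields cover all 8 bytes without padding, both functions leave identical memory contents, which is what allows the three dassign statements to collapse into a single u64 store.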
spec 502
file: tree-eh.c:598
func: record_in_goto_queue
Before memset expansion:
callassigned &memset (regread u64 %6, constval i32 0, constval u64 32) {}
After expansion, the current store merge also merges the iassign statements incompletely:
iassign <* <$goto_queue_node>> 2 (regread u64 %6, constval ptr 0)
iassign <* <$goto_queue_node>> 3 (regread u64 %6, constval ptr 0)
iassign <* <$goto_queue_node>> 4 (regread u64 %6, constval ptr 0)
iassign <* <$goto_queue_node>> 5 (regread u64 %6, constval ptr 0)
iassign <* <$goto_queue_node>> 6 (regread u64 %6, constval ptr 0)
iassignoff u64 24 (regread u64 %6, constval u64 0)
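The dassign work is temporarily on hold because of other work and will resume soon. For iassign, the cause still needs to be investigated. Thanks.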
PR894 -- store merging on iassign enhancement on LHS address arithmetic -- is integrated.
群超: case4 and case5 from 502 show a 3% improvement. The instruction count for case 5 has seen a 0.17 reduction.
PR931 completes the remaining work on memset