500.perlbench regexec.c S_regmatch() findings

Ebo to avoid overlapping live ranges.
The following code was found in S_regmatch(). The physical registers should be viewed as virtual registers. 'mov x23, x0' should terminate the live range of x0. Since there is no later definition of x0, subsequent usages of x0 after the 'move' should all be x23.

        mov     x23, x0         //  [R601]  [R232]
        ldr     x10, [x0,104]          //  [R232]
        str     x10, [sp,208]          //  SPILLcolor vreg: 602
        ldr     x21, [sp,184]          //  RELOADcolor vreg: 603
        str     wzr, [sp,216]          //  SPILLcolor vreg: 605
        str     wzr, [sp,440]          //  SPILL vreg: 606 for caller save in BB 2
        str     wzr, [sp,220]          //  SPILLcolor vreg: 607
        ldr     w1, [x0,96]            //  [R232]  [R233]

Basic block ordering can try to prioritize fall through to avoid extra branching.
In this example from S_regmatch(), the 'cbz' target block 2344__500 should be moved off-line, so the fall through can continue to label '2344__501 to avoid the extra branch.

        cbz     x3, .L.2344__500                //  [R10610]
        mov     x0, x1          //  [R622]  [R230]
        b       .L.2344__501
.L.2344__500:   //label order 2192
        mov     w1, wzr
        bl      Perl_gv_add_by_type
        ldr     x0, [x0,16]            //  [R623]  [R622]
.L.2344__501:   //label order 2193
        ldr     x10, [x0]               //  [R622]

Variable 'ln' initialization to 0 is redundant.
The local variable 'ln' is declared in the beginning of the function and assigned with the initial value of 0. Later uses appears to be all dominated by a def, rendering the initial assignment of 0 redundant. By not removing the assignments, this has the drawback of increasing the live range of 'ln' to the start of the function even it is used only in few switch cases.

A futurewei ticket has been filed against ME for this.

No longer an issue.

Many local variables are declared in the beginning of the S_regmatch() function. These variables are not used until later. These initialization assignments should be sinked to a later point to avoid unnecessary extension of live ranges.

Convert ' ? 0 : 1' assignment of bool left hand side to use xor.
In maple IR, lnot u1 (ne u8 (regread %var), 0) generates

cmp var, 0
cset res, eq
instead xor can be used, as used by gcc.
eor res, var, 1

A futurewei ticket 331 has been filed against ME for this.
PR 903 has the fix.

Should turn on pre-scheduling to shorten dependency distances
Observed from rematerialization, adrp -> add can be separated by a number of instructions. Since adrp most likely is only used by add, placing the two instructions together can avoid extra live ranges.
In general for S_regmatch(), pre-scheduling can shorten dependency distances and get a better usage of available register resources.

conditional branch target and fall through to same target
The conditional brfalse statement can be eliminated in S_regmatch()

  brfalse @@605 (lt i32 a64 (regread ptr %403, regread a64 %30))
@@605 LOC 52 4912

futurewei bug 287

PR 912

switch table using adr
gcc's switch table uses adr, one instruction instead of adrp/add pair. The switch table is in .rodata and the entries is divided by 4.

Maple switch code has more instructions. The switch table uses 8 byte entries.

        add     w2, w0, 0              //  [R10620]  [R2260]
        adrp    x1, .LB_S_regmatch2             //  [R10621]
        add     x1, x1, #:lo12:.LB_S_regmatch2          //  [R10621]  [R10621]
        ldr     x2, [x1,x2,LSL 3]              //  [R10621]  [R10620]  [R10623]
        add     x1, x1, x2              //  [R10622]  [R10621]  [R10623]
        br      x1              //  [R10622]

.LB_S_regmatch2:
        .quad   .L.2344__100 - .LB_S_regmatch2
        .quad   .L.2344__330 - .LB_S_regmatch2

Gcc uses adr with less instructions. The switch table entries are smaller.

        ldrh    w0, [x19,w1,uxtw 1]
        adr     x2, .Lrtx1788
        add     x0, x2, w0, sxth 2
        br      x0
.Lrtx1788:
        .section        .rodata
        .align  0
        .align  2
.L1788:
        .2byte  (.L1893 - .Lrtx1788) / 4
        .2byte  (.L1803 - .Lrtx1788) / 4

add w2, w0, 0 -> can propagate w0 for w2
In the switch code above, w2 can be replaced by w0, and the add can be eliminated if is dead.

spill/mov combining
There are several instances of

        ldr     x17, [sp,504]          //  RELOAD vreg: 642 for caller save in BB 1473
        str     x17, [sp,520]          //  SPILL vreg: 6558 for caller save in BB 1473
        mov     x12, x17                //  [R6558]  [R642]
        b       .L.2344__161

Here, x12 can replace x17 and eliminate the mov.

Use info from logical op to eliminate extensions

        and     w1, w1, 31             //  [R10647]  [R10646]
        sxtw    x1, w1          //  [R10648]  [R10647]

w1 & 0x1f will not need sxtw.

combine load and extension

        ldr     w13, [sp,480]          //  RELOAD vreg: 598 for caller save in BB 44
        uxtb    w0, w13         //  [R2813]  [R598]

ldr can be converted to ldrb (unsigned load byte).

Fred wants to improve icache utilization. Will be working on issue
Improve mplme's me_bb_layout.cpp to maximize icache locality.

futurewei bug 335

Use post index address mode

        ldrb    w0, [x26]               //  [R3294]  [R10799]
        add     x26, x26, 1            //  [R3294]  [R3294]

Can combine

        ldrb    w0, [x26], 1

I am not able to get this pattern in regexec.s

This can be a general case worth looking into, such as
*++p
*p++
and can be any integral types. I'll investigate a bit.

FW Bug 340 with testcase created.

Address computation using 'add dst, src1, src2 lsl 2'
Example, line 4929 of regexec.c

   next = scan + NEXT_OFF(scan);

preprocess shows this.

   next = scan + ((scan)->next_off);   <-  next_off is U16

Maple IR after ME is this.

  @@1331   regassign u32 %33 (iread u32 <* <$regnode>> 3 (regread ptr %6397))
  regassign i32 %34 (cvt i32 u16 (regread u32 %33))
  regassign i32 %35 (mul i32 (regread i32 %34, constval i32 4))
  regassign ptr %274 (add ptr (
      regread ptr %6397,
      cvt ptr i32 (regread i32 %35)))

mplcg generated this.

        ldrh    w0, [x13,2]            //  [R6597]  [R233]
        lsl     w0, w0, 2              //  [R235]  [R233]
        sxtw    x0, w0          //  [R10654]  [R235]
        add     x0, x13, x0             //  [R474]  [R6597]  [R10654]

gcc generated this.

        ldrh    w21, [x22, 2]
        add     x21, x22, x21, lsl 2

PR 905 fixed the bug of combining lsl->sxt->add into add with shift since under some circumstances lsl can set the sign bit and sxt will change the result of the single instruction add with shift, since add with shift will extend first then shift, violating the original intent.

submitted futurewei issue 338 for ME.
See related futurewei issue 332 and PR 905.

Fred has filed issue against mplfe, I4BEL1

gcc has ' mov movk ' pair with no unknown purpose.
The two instruction sequence is seen lots of times but with no use. The purpose of these two instruction sequence is unknown.

        mov     x12, -1229782938247303442
        movk    x12, 0xeeef, lsl 0

Short circuiting of if conditions
Need to determine when it is beneficial to perform short circuiting of conditions.
An example is line 7342.

            if (! HAS_TEXT(next) && ! JUMPABLE(next)) {

Here HAS_TEXT() gets expanded into

! ( (PL_regkind[((next)->type)] == 31) || PL_regkind[((next)->type)] == 51 )

gcc short circuited this into

        cmp     w0, 31
        cset    w2, ne
        cmp     w0, 51
        ccmp    w2, 0, 4, ne
        beq     .L2562

Maple uses the conventional cmp and branch.

        cmp     w0, 31         //  [R434]
        beq     .L.2344__2127
        cmp     w0, 51         //  [R434]
        beq     .L.2344__2128

Maple has less instructions, but gcc has less branches.

many cset for a comparison in .s are useless. It is due to "regassign i32 %5919 (eq i32 i32 (regread u32 %214, constval i32 31))" the like in mpl from FE. It was said that china team knows of this.
ideally, we should be able to get something like:

cmp w0, 31
ccmp w0, 51, 0, ne
beq .L2562

inlining
There are a total of 216 'bl' calls in gcc's S_regmatch() while maple has 245.
Gcc does inline other functions into a larger function.

In addition, gcc inlined S_regmatch() into its only caller S_regtry()

Using memory addressing mode of [base, reg, lsl val]
Gcc has 32 instances of using lsl shift in memory operand while maple has none.
It is surprising since I thought both peep and global opt does this optimization.
Example is line location 5321
gcc

        ldrh    w0, [x24, 76]
        ldrh    w0, [x2, x0, lsl 1]

maple

        ldrh    w2, [x21,#76]           //  [R636]  [R1383]
        lsl     x2, x2, #1              //  [R583]  [R1383]
        ldrh    w0, [x0,x2]             //  [R265]  [R583]  [R585]

FYE looking into this one

Using variation of ubfm for address computation
See http://bugzilla.futurewei.com/show_bug.cgi?id=338
and #I4BEL1:mplfe should avoid using OP_cvt for integer types less than 32 bits
for comment on 600.perlbench's regexec.c S_regmatch. regexec.c:5321
gcc uses

         ubfiz   x0, x0, 2, 16

to extract the u16 offset for use in address calculation.

Remove inlined function body if not used.
mpl2mpl after inlining should remove function bodies that are no longer used to reduced icache requirements.

ME use previous known compare result to eliminate later compare
near line 6590, exits a short circuit code.
This is ME optimized mpl.

LOC 52 6590
  regassign i32 %5753 (ge i32 u32 (regread u32 %1407, regread u32 %1353))
  brfalse @Lshortcircuit.1567 (ge i32 u32 (regread u32 %1407, regread u32 %1353))
  regassign i32 %5753 (ne i32 i64 (
      iread i64 <* <$regexp_paren_pair>> 2 (add a64 (
        iread a64 <* <$regexp>> 22 (regread ptr %377),
        mul a64 (
          cvt a64 i32 (regread u32 %1353),
          constval a64 24))),
      constval i64 -1))
@Lshortcircuit.1567 LOC 52 111115
  regassign i32 %5756 (ne i32 i32 (regread i32 %5753, constval i32 0))
  regassign u8 %399 (ne u32 i32 (regread i32 %5756, constval i32 0))

%5756 is always a single def then a single use like above. The value %5753 is an earlier compare result. If %5753 is 1, then %5756 is (1 != 0) 1, else 0. Therefore, the definition of %5756 is redundant.

futurewei bug 345

multiply instruction vs shift and add optimization
Just a note here that gcc is using mult instruction in S_regmatch.
Maple uses shift and add.
It is not clear why gcc did not optimize for shift and add.
Experiment shows using mult instruction in maple is slower than shift/add as expected.

GVP 方舟编译器 / OpenArkCompiler

内容风险标识

评论 (27)

GVP方舟编译器 / OpenArkCompiler

内容风险标识