8 Replies Latest reply: Feb 7, 2013 12:41 AM by Neverhood RSS

Does inefficiently generated IL code affect overall performance?

Neverhood Newbie
Currently Being Moderated

Hello, everyone!

 

Assume we have this simple kernel:

__kernel void test( __global uint2* data )

{

    uint gid = get_global_id( 0 );

   

    uint2 a = data[ gid ];

    uint2 b = a >> 7;

    a = a << (32 - 7);

   

    a.xy = a.xy | b.yx;   

    data[ gid ] = a;

}

 

AMD APP Kernel Analyzer shows IL code for that kernel:

;ARGEND:__OpenCL_test_kernel

func 1027 ; test                        ; @__OpenCL_test_kernel

; BB#0:                                 ; %entry

    mov r65, r1021.xyz0

    mov r65.x___, r65.x000

    ishl r65.x___, r65.x, l11

    iadd r65.x___, r1.x, r65.x

    uav_raw_load_id(10)_cached_aligned r1011.xy__, r65.x

    mov r66.xy__, r1011.xyxy

    mov r67.xy__, l13

    ushr r68.xy__, r66.xyxy, r67.xyxy

    mov r67.x___, r68.x000

    mov r68.x___, r68.y000

    mov r68.xy__, r68.x

    iadd r67.xy__, r68.x000, r67.0x00

    mov r68.xy__, l14

    ishl r66.xy__, r66.xyxy, r68.xyxy

    ior r66.xy__, r66.xyxy, r67.xyxy

    uav_raw_store_id(10) mem0.xy__, r65.x, r66.xyxy

    ret_dyn

 

I don't know, why this code is so nonoptimal, because for simple code like this one:

    a = a << (32 - 7);

    a.xy = a.xy | b.yx; 

 

we get this:

    mov r67.x___, r68.x000

    mov r68.x___, r68.y000

    mov r68.xy__, r68.x

    iadd r67.xy__, r68.x000, r67.0x00

    mov r68.xy__, l14

    ishl r66.xy__, r66.xyxy, r68.xyxy

    ior r66.xy__, r66.xyxy, r67.xyxy

 

instead of this:

    ishl r66.xy__, r66.xyxy, l14.xyxy

    ior r66.xy__, r66.xyxy, r68.yxyx

 

So basically, I have two questions:

1) Is this normal behavior for OpenCL compiler? Or should I know something in additional to avoid such instructions spelling?

2) Whether the generated IL code has the same performance as my, manually written IL code in this example?

 

All this is important to me, because HD 5750 seems to have better IL-2-ASM compiler, than HD 7850: the latter has a lower performance comparing to fair 64-bit rotating, while the former has a better performance.

 

Best regards, Dmitry.

  • Re: Does inefficiently generated IL code affect overall performance?
    himanshu.gautam Master
    Currently Being Moderated

    Edit -- sorry that was a dumb thing to ask.

    IL is generic. I mistook it to be the ISA. Sorry.

    Regards

    Himanshu , Bruhaspati

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: Does inefficiently generated IL code affect overall performance?
    himanshu.gautam Master
    Currently Being Moderated

    I can make an educated guess here.

     

    Since IL is common to all hardware families -- not all optimizations can be applied at the IL level.

    IL must be generic enough to be translated to different ISAs.

    Possibly, the IL to ISA translation does the final round of optimization.

    Regards

    Himanshu , Bruhaspati

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

    • Re: Does inefficiently generated IL code affect overall performance?
      Neverhood Newbie
      Currently Being Moderated

      I agree, that generated IL code is not optimized in most cases.

      But if I were compiler, I would just map OpenCL code to IL code without optimization.

       

      Generated IL code for just vector component x and y permutation is excessive. IL specification allows to do that much better. And it would not require any optimization from OpenCL compiler. Just precise mapping one operation to another.

       

      And question from topic title is still open...

      • Re: Does inefficiently generated IL code affect overall performance?
        himanshu.gautam Master
        Currently Being Moderated

        I have asked a more knowledge person to take a look at this.
        He may be able to answer you. Request you to wait. Thanks for your patience.

        Regards

        Himanshu , Bruhaspati

        --------------------------------

        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: Does inefficiently generated IL code affect overall performance?
    coordz Newbie
    Currently Being Moderated

    From what I know, the shader compiler will do an excellent job of getting rid of redundant mov instructions as this will part of its basic dependency analysis. I also believe SC prefers "loose" IL rather than "tight" IL as it opens up more optimization opportunities and gives it a better insight into what the original code intention was.

     

    To be concrete to your questions:

     

    1) This is normal behaviour for the OCL compiler.

    2) I suspect the performance will be almost identical between hand written code and this example.

    • Re: Does inefficiently generated IL code affect overall performance?
      Neverhood Newbie
      Currently Being Moderated

      coordz wrote:

       

      shader compiler will do an excellent job of getting rid of redundant mov instructions

      ...

      1) This is normal behaviour for the OCL compiler.

      2) I suspect the performance will be almost identical between hand written code and this example.

      And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...

       

      1) If it is true, than it is very strange, for my opinion.

      2) My experiments shows, that fair 64-bit rotating is faster, than my code. But if I rewrite code like this:

      __kernel void test( __global uint2* data )

      {

          uint gid = get_global_id( 0 );

        

          uint2 a = data[ gid ];

         

          uint bx = a.x >> 7;

          uint by = a.y >> y;

         

          a.x = a.x << (32 - 7);

          a.y = a.y << (32 - 7);

         

          a.x = a.x | b.y;

          a.y = a.y | b.x;

       

          data[ gid ] = a;

      }

       

      it will almost be as fast as fair 64-bit rotating.

       

      As you can see, the logic of code is the same. But that code is more complicated to write on C, than one in my first message.

      I hoped OpenCL will do all optimization for me, instead I have to do it my own and not always in obvious way.

      • Re: Does inefficiently generated IL code affect overall performance?
        german Newbie
        Currently Being Moderated

        Neverhood wrote:

         

        coordz wrote:

         

        shader compiler will do an excellent job of getting rid of redundant mov instructions

        ...

        1) This is normal behaviour for the OCL compiler.

        2) I suspect the performance will be almost identical between hand written code and this example.

        And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...

         

        iadd (the shuffle instruction from LLVM-IR below) should be removed by the shader compiler (finalizer).

        LLVM-IR

          %tmp4 = lshr <2 x i32> %tmp2, <i32 7, i32 7>

          %tmp6 = shl <2 x i32> %tmp2, <i32 25, i32 25>

        %tmp16 = shufflevector <2 x i32> %tmp4, <2 x i32> undef, <2 x i32> <i32 1, i32 0>

          %tmp17 = or <2 x i32> %tmp6, %tmp16

         

        The shader compiler (SC) has more knowledge about HW capabilities. So it was decided that for better performance the final optimization will be done at the SC level, when the actual ISA is generated.

         

        Tahiti ISA
          tbuffer_load_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBD91000 80010100
          s_waitcnt     vmcnt(0)                                    // 00000038: BF8C1F70
          v_lshrrev_b32  v3, 7, v1                                  // 0000003C: 2C060287
          v_lshrrev_b32  v4, 7, v2                                  // 00000040: 2C080487
          v_lshlrev_b32  v1, 25, v1                                 // 00000044: 34020299
          v_lshlrev_b32  v2, 25, v2                                 // 00000048: 34040499
          v_or_b32      v1, v4, v1                                  // 0000004C: 38020304
          v_or_b32      v2, v3, v2                                  // 00000050: 38040503
          tbuffer_store_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBDD1000 80010100

        Cypress ISA
        01 TEX: ADDR(48) CNT(1)
              7  VFETCH R0.xy__, R0.z, fc174  FORMAT(32_32_FLOAT) MEGA(8)
                 FETCH_TYPE(NO_INDEX_OFFSET)
        02 ALU: ADDR(41) CNT(7)
              8  x: LSHR        ____,  R0.y,  7     
                 y: LSHR        ____,  R0.x,  7     
                 z: LSHL        ____,  R0.y,  25     
                 w: LSHL        ____,  R0.x,  25     
              9  x: OR_INT      R0.x,  PV8.x,  PV8.w     
                 y: OR_INT      R0.y,  PV8.y,  PV8.z     
        03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].xy__, R0, ARRAY_SIZE(4)  MARK  VPM 

More Like This

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points