14 Replies Latest reply: Feb 18, 2013 9:34 PM by himanshu.gautam RSS

Small temporary arrays in OpenCL

realhet Novice

Hi,

 

Does OpenCL take advantage of the following techniques when using small local arrays?

- On VLIW -> indexed_temp_arrays (x0[n]) (aka. R55[A0.x] indirect register addressing in ISA)

- On GCN -> v_movrel_b32 instruction

 

Or if OpenCL always uses LDS memory for local arrays, is there an extension to enable those faster techniques?

 

Thanks in advance.

  • Re: Small temporary arrays in OpenCL
    binying Novice

    To find out the answer, I think you can write a simple kernel in which you use a small local array, and compile it using the kernel analyzer. Then check the result in the output window...

    • Re: Small temporary arrays in OpenCL
      realhet Novice

      I'm not that lazy, but all I have right now is an HD4850, and on that OpenCL is terribly beta-ish.

      So now I need 160 dwords of this kind of fast 'memory' for my project (I'm implementing it with amd_il + indexed_temp_array), and I'm just wondering whether OpenCL can do it.

      I know that on GCN I'll have to use some hybrid LDS + register-array scheme to stay inside the 128-dword vreg limit. But on VLIW this register-array thing is just awesome (the limit there is 128x4 regs).

  • Re: Small temporary arrays in OpenCL
    hazeman Novice

    I've tested this feature on 58xx card. OpenCL compiler generates indexed_temp_array in IL.

    The problem is what goes on in the IL compiler. Almost randomly (it depends slightly on the size of the array, and on whether you use parts (.x, .y, ...) of the indexed 4-vector) it will either use the A indexing register or use scratch memory (painfully slow) to implement the array.

    Unfortunately, in my kernel I couldn't trick the IL compiler into reliably using A-register indexing, and I had to change the kernel design so I wouldn't use this feature.

    • Re: Small temporary arrays in OpenCL
      realhet Novice

      Hi,

      That's cool that OCL can use A0 index.

      I've played with it a little and found that on the HD4850 it always uses A0 indexing when the total NumGPRs is <= 118. If NumGPRs would exceed 118 with the array included, it uses scratch instead. And I only used the .x part; I don't think it checks which parts we use, since it only addresses 128-bit array elements. Maybe your kernel is around that NumGPRs limit.

      It's approx 400 instantly accessible dwords...

      But on GCN I think we can address only about 100 dwords (depending on other vreg usage) without running out of 128 vregs. I need 160 dwords total, and that fits neither in LDS (it would be ~40 KB for a wavefront) nor in vregs. I'm afraid I'll have to mix the two if I want to avoid slow memory access.
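
      The LDS estimate above comes from simple sizing arithmetic, which can be sketched like this (a back-of-envelope check, assuming 4-byte dwords and 64-lane wavefronts):

```c
#include <assert.h>

/* Per-wavefront footprint of a per-lane private array: every one of
   the 64 lanes needs its own copy, so the bytes multiply quickly. */
int bytes_per_wavefront(int dwords_per_lane)
{
    return dwords_per_lane * 4 /* bytes per dword */ * 64 /* lanes */;
}
```

      160 dwords per lane gives 40960 bytes (~40 KB) per wavefront, which is why the array cannot simply be spilled to LDS.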

  • Re: Small temporary arrays in OpenCL
    drallan Novice

    GCN can use up to 256 vgprs/thread with 4 waves per CU for full occupancy.

    The maximum vgpr array size then depends on what other vgprs the compiler needs to use.

    In one case, I saw the compiler waste 74 vgprs on temporaries by loading blocks of data before writing them to the array.

    Even here, int array[160] was not a problem.
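
    For reference, the occupancy arithmetic behind this can be sketched as follows (assuming first-generation GCN figures: a 256-entry VGPR file per SIMD, a cap of 10 waves per SIMD, 4 SIMDs per CU):

```c
#include <assert.h>

/* Waves that fit on one SIMD, limited by the 256-entry VGPR file
   and the hardware cap of 10 waves per SIMD. */
int waves_per_simd(int vgprs_per_thread)
{
    int waves = 256 / vgprs_per_thread;
    return waves > 10 ? 10 : waves;
}
```

    At 256 vgprs/thread this gives 1 wave per SIMD, i.e. the 4 waves per CU mentioned above.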

     

    But beware the devil. When the array indices are not known at compile time, both GCN and VLIW will access registers serially, one thread at a time.

     

    GCN scans the 'lanes' for threads looking for an index. It then reads/writes all threads with the same index in parallel using v_movreld/s, and repeats until all threads are processed. Worst case, all 64 indices in a wave are different = 64 read/write loops (yes, branching too). Best case, all indices are the same and there is only one read/write. (Actually, that's pretty cool.) Although VLIW uses the A0 register, it also does something similar to serially access different indices.
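
    That scan can be modeled in C like this (a hypothetical sketch of the hardware loop described above, for illustration only; the pass count equals the number of distinct indices in the wave):

```c
#include <assert.h>

#define WAVE_SIZE 64

/* Model of the serialization: pick a not-yet-serviced lane, service
   every lane holding the same index in one read/write pass, repeat.
   Returns the number of passes needed for the whole wavefront. */
int movrel_passes(const int idx[WAVE_SIZE])
{
    int done[WAVE_SIZE] = {0};
    int passes = 0;
    for (int i = 0; i < WAVE_SIZE; ++i) {
        if (done[i])
            continue;
        for (int j = i; j < WAVE_SIZE; ++j)
            if (idx[j] == idx[i])
                done[j] = 1;   /* handled in this pass */
        ++passes;
    }
    return passes;
}
```

    With a uniform index one pass suffices; with 64 distinct indices the loop degenerates to 64 passes, matching the worst case above.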

     

    LDS might be faster, but there's not enough of it.

    • Re: Small temporary arrays in OpenCL
      realhet Novice

      (Unfortunately I can't accept two answers, though each of them answered part of my question (VLIW & GCN).)

       

      "But beware the devil."

      Where the h3ll did I get the idea that v_movreld and A0 can access all 64 lanes of register memory individually in ONE CYCLE?! That would be so many wires and transistors just for these rare instructions. Thanks for opening my eyes, haha!

       

      Btw, my case of course would be that every lane accesses different regs.

       

      "GCN can use up to 256 vgprs/thread with 4 waves per CU for full occupancy."

      That's only true when the instruction stream is not too dense. Please take a look at these charts:

      http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_4-12dwords.png

      http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_8-16dwords.png

      If your GCN code [uses a few S instructions and also some big 64-bit instructions] AND [you're using more than 128 regs], your kernel can end up twice as slow as its estimated ideal performance. That's why I try to avoid 128+ regs (right now I can't) and aim for under 84 or even 64.

      • Re: Small temporary arrays in OpenCL
        drallan Novice

        "But beware the devil."

        Where the h3ll did I get the idea that v_movreld and A0 can access all 64 lanes of register memory individually in ONE CYCLE?! That would be so many wires and transistors just for these rare instructions. Thanks for opening my eyes, haha!

         

        I know your eyes are wide open, but some might not realize how the compiler implements C in a GPU environment.

        I was a bit surprised when I first saw it.

         

        I look forward to your solution, it's a tough problem!

        • Re: Small temporary arrays in OpenCL
          realhet Novice

          My actual struggle in a picture ->

          indexed_temp_array.JPG

          7 clocks instead of 1. This seemed like an easy 2-3x boost for my prog, but ouch.

          And it's not just the 4xxx, I've noticed it on 6xxx too.

           

          With A0 the exact same thing is around 10% slower:

            ushr r999.x, dwIdx, 2

            iand r999.z, dwIdx, 1

            iand r999.w, dwIdx, 2

            mov  r998  , x0[r999.x]

            cmov_logical r999.xy, r999.ww, r998.zw, r998.xy

            cmov_logical res, r999.z, r999.y, r999.x  

           

          Another discovery: when I compared the above x0[] dword accessing with a uniform index (across the wavefront) against cb0 access (done the same way), cb0 was faster. (It used a VTEX clause, but was still slightly faster than A0.)

        • Re: Small temporary arrays in OpenCL
          realhet Novice

          Finally I had the chance to do some experiments on a 7970:

           

          - v_movrels_b32 does nothing with the contents of the source operand; it only uses its index, so all the lanes will read from the same register. Maybe A0 indexing can access different regs per lane, but now I'm sure that movrel can't.

          - ds_readx2 is pretty effective (with different addresses for all lanes)! I interleaved it with 10-12 vector instructions and all the latency was hidden. (Make sure to set up the M0 register before using DS_ stuff! I wasted like an hour on this lol)

          - The amd_il compiler can't deal with indexed arrays effectively: it always shuttles the contents of the indexed array through unoptimized movs before using them. (x0[const1] += x0[const2] uses 3 movs and an add.)

          • Re: Small temporary arrays in OpenCL
            himanshu.gautam Master

            Hi realhet,

            Can you please share some code which can help us reproduce the issue?

            I will ask someone more knowledgeable for directions here.

            Thanks

            Regards

            Himanshu , Bruhaspati

            --------------------------------

            The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

            • Re: Small temporary arrays in OpenCL
              realhet Novice

              Hi!

               

              I've managed to narrow it down: This is the simple operation it does over and over:

               

                dcl_indexed_temp_array x0[![(bufLen+3)>>2]]

               

                //array initialization goes here 

               

                //shuffle the elements of the array

                forLoop(i,0,10000)  //a loop so big that cannot be unrolled by the optimizer

                  iadd x0[0].w,x0[0].w,x0[0].x

                  iadd x0[0].x,x0[0].x,x0[0].y

                  iadd x0[0].y,x0[0].y,x0[0].z

                  iadd x0[0].z,x0[0].z,x0[0].w

                endloop
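
              In C terms the loop body above is just this dependency chain (a rough equivalent; the iteration count is parameterized here so the result can be checked by hand, and unsigned arithmetic models iadd's wraparound):

```c
#include <assert.h>
#include <stdint.h>

/* x[0..3] stand for the .x/.y/.z/.w components of x0[0]. */
void shuffle(uint32_t x[4], int iters)
{
    for (int i = 0; i < iters; ++i) {
        x[3] += x[0];   /* iadd x0[0].w, x0[0].w, x0[0].x */
        x[0] += x[1];   /* iadd x0[0].x, x0[0].x, x0[0].y */
        x[1] += x[2];   /* iadd x0[0].y, x0[0].y, x0[0].z */
        x[2] += x[3];   /* iadd x0[0].z, x0[0].z, x0[0].w */
    }
}
```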

               

              And the suboptimal code is triggered by the way I initialize the array:

               

              If I do this: mov x0[0], cb0[0]   then it compiles perfect code (only add instructions in the inner loop).

              But if I initialize it with a dword indexing macro:

               

                XWrite(x0, 0, cb0[0].x)

                XWrite(x0, 1, cb0[0].y)

                XWrite(x0, 2, cb0[0].z)

                XWrite(x0, 3, cb0[0].w)

               

              Where the XWrite C-style macro is this (it writes a dword into any array (cb0, x0, ...) at any dword position):

               

              #define XWrite(XName,dwIdx,val)         \\

                ushr r999.x, dwIdx, 2                 \\

                iand r999.y, dwIdx, 3                 \\

                ifieq(r999.y,0) mov XName[r999.x].x, val \ endif \\

                ifieq(r999.y,1) mov XName[r999.x].y, val \ endif \\

                ifieq(r999.y,2) mov XName[r999.x].z, val \ endif \\

                ifieq(r999.y,3) mov XName[r999.x].w, val \ endif \\

               

              #define ifieq(a,b)      \\

              ieq r999.w, a, b        \\

              if_logicalnz r999.w     \\

               

              So if I touch the array with that flexible dword-addressable thing, the compiler does the following:

              - It realizes that the dword address is a constant so the  ushr,  iand   calculations are constant too.

              - It also drops 3 IFs and leaves a specific mov instruction behind

              So it can optimize the whole XWrite(array, const, anything) macro into a single mov instruction, which is great.
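
              The mapping XWrite implements corresponds to this in C (an illustrative model only; vec4 here is a stand-in for the IL 4-vector):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t c[4]; } vec4;   /* c[0..3] = .x .y .z .w */

/* Dword index -> (vector, component): dwIdx >> 2 picks the 4-vector
   and dwIdx & 3 picks the component, exactly like the ushr/iand pair
   in the macro. */
void xwrite(vec4 *arr, int dwIdx, uint32_t val)
{
    arr[dwIdx >> 2].c[dwIdx & 3] = val;
}
```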

              But later, when it gets to the "iadd x0[0].w,x0[0].w,x0[0].x..." main loop, it does this:

                mov tmp1, x0[0].x  

                mov tmp2, x0[0].w

                add tmp2, tmp1, tmp2

                mov x0[0].w, tmp2

              And this is triggered by the XWrite(x0, 0, 1234) macro: the mov resulting from those 4 IFs isn't optimized further, even when the operands are specified exactly (x0[0].x).

               

              ----------------------------------------------------------------------------------------------------

              Bad:

              ; --------  Disassembly --------------------

              00 ALU: ADDR(32) CNT(10) KCACHE0(CB0:0-15)    //initialize with XWrite(x0,0,cb2[0].x) ...and so on

                    0  x: MOV         R0.x,  KC0[0].x     

                       y: MOV         R0.y,  KC0[0].y     

                       z: MOV         R0.z,  KC0[0].z     

                       w: MOV         R0.w,  KC0[0].w     

                    1  x: MOV         R4.x,  R0.x     

                    2  y: MOV         R4.y,  R0.y     

                    3  z: MOV         R4.z,  R0.z     

                    4  w: MOV         R4.w,  R0.w     

                    5  w: MOV         R1.w,  (0xFFFFFFFF, -1.#QNANf).x     

              01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)

                  02 ALU: ADDR(42) CNT(3)

                        6  w: ADD_INT     R1.w,  R1.w,  1     

                        7  x: PREDGE_INT  ____,  10000,  R1.w      UPDATE_EXEC_MASK BREAK UPDATE_PRED

                  03 ALU: ADDR(45) CNT(15)

                        8  x: MOV         R0.x,  R4.x                  // iadd x0[0].w,x0[0].w,x0[0].x  ...and so on

                           w: MOV         R0.w,  R4.w     

                        9  z: MOV         R0.z,  R4.z     

                       10  w: ADD_INT     R0.w,  R0.x,  R0.w     

                       11  w: MOV         R4.w,  R0.w     

                       12  x: MOV         R0.x,  R4.x     

                           y: MOV         R0.y,  R4.y     

                       13  x: ADD_INT     R0.x,  R0.x,  R0.y     

                           z: ADD_INT     R1.z,  R0.w,  R0.z     

                       14  x: MOV         R4.x,  R0.x     

                       15  y: MOV         R0.y,  R4.y     

                           z: MOV         R0.z,  R4.z     

                       16  y: ADD_INT     R0.y,  R0.y,  R0.z     

                       17  y: MOV         R4.y,  R0.y     

                       18  z: MOV         R4.z,  R1.z     

              04 ENDLOOP i0 PASS_JUMP_ADDR(2)

              05 ALU: ADDR(60) CNT(5) KCACHE0(CB1:0-15)

                   19  x: MOV         R0.x,  R4.x     

                   20  y: MULADD_UINT24  R127.y,  0.0f,  4,  KC0[0].x     

                   21  x: LSHR        R1.x,  PV20.y,  2     

              06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R1].x___, R0, ARRAY_SIZE(4)  VPM

              07 END

              END_OF_PROGRAM

               

              Good:

              ; --------  Disassembly --------------------

              00 ALU: ADDR(32) CNT(6) KCACHE0(CB0:0-15)   //initialized with mov x0[0],cb2[0]

                    0  x: MOV         R1.x,  KC0[0].x     

                       y: MOV         R0.y,  KC0[0].y     

                       z: MOV         R0.z,  KC0[0].z     

                       w: MOV         R0.w,  KC0[0].w     

                    1  w: MOV         R1.w,  (0xFFFFFFFF, -1.#QNANf).x     

              01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)

                  02 ALU: ADDR(38) CNT(3)

                        2  w: ADD_INT     R1.w,  R1.w,  1     

                        3  x: PREDGE_INT  ____,  10000,  R1.w      UPDATE_EXEC_MASK BREAK UPDATE_PRED

                  03 ALU: ADDR(41) CNT(4)

                        4  x: ADD_INT     R1.x,  R1.x,  R0.y     

                           y: ADD_INT     R0.y,  R0.y,  R0.z     

                           w: ADD_INT     R0.w,  R1.x,  R0.w     

                        5  z: ADD_INT     R0.z,  R0.z,  PV4.w     

              04 ENDLOOP i0 PASS_JUMP_ADDR(2)

              05 ALU: ADDR(45) CNT(4) KCACHE0(CB1:0-15)

                    6  y: MULADD_UINT24  R127.y,  0.0f,  4,  KC0[0].x     

                    7  x: LSHR        R0.x,  PV6.y,  2     

              06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0].x___, R1, ARRAY_SIZE(4)  VPM

              07 END

              END_OF_PROGRAM

              ---------------------------------------------------------------------------------------------

               

              On the GCN it also does this:

                v_mov_b32     v4, v40                                     // 00001C9C: 7E080328

                v_add_i32     v3, vcc, v37, v4                            // 00001CA0: 4A060925

                v_mov_b32     v40, v3                                     // 00001CA4: 7E500303

                v_mov_b32     v4, v41                                     // 00001CA8: 7E080329

                v_mov_b32     v5, v38                                     // 00001CAC: 7E0A0326

                v_add_i32     v4, vcc, v4, v5                             // 00001CB0: 4A080B04

                v_mov_b32     v41, v4                                     // 00001CB4: 7E520304

                v_mov_b32     v5, v42                                     // 00001CB8: 7E0A032A

                v_mov_b32     v6, v39                                     // 00001CBC: 7E0C0327

                v_add_i32     v5, vcc, v5, v6                             // 00001CC0: 4A0A0D05

                v_mov_b32     v42, v5                                     // 00001CC4: 7E540305

                v_mov_b32     v6, v43                                     // 00001CC8: 7E0C032B

                v_add_i32     v3, vcc, v3, v6                             // 00001CCC: 4A060D03

              But I failed to reproduce it with a small test program. It needs more 'pressure': it could be high vreg usage, big program code, or whatever. For small arrays it optimizes fine.

               

              -----------------------------------------------------------------------------------------------------------------------

              (Attaching an HD6970 compatible AMD_IL code.)

              • Re: Small temporary arrays in OpenCL
                himanshu.gautam Master

                Hi Realhet,

                I will forward this to the appropriate team. Can you let me know some more details:

                1. Platform - win32 / win64 / lin32 / lin64 or some other?

                    Win7, Vista or Win8; similarly for Linux, your distribution

                2. Version of driver

                3. CPU(s) or GPU(s) you worked on. I think this is HD 6970 and HD 7970. Please confirm.

                Regards

                Himanshu , Bruhaspati


                • Re: Small temporary arrays in OpenCL
                  realhet Novice

                  Hi!

                   

                  I've tried with the latest driver also (no changes).

                  Attaching many files to make it easy to reproduce/analyze.

                   

                  Thank You

                   

                  -------------------------------------------------------------------------------------------------------------------------------------------

                  This test in a nutshell:

                   

                  GPU: HD6970

                  OS: win7 64

                  Cat: 12-10 and 13-1 (no differences in result)

                   

                  Have an indexed array x0, length=1.

                   

                  I do the following operation on that in a loop:

                    x0[0].x+=x0[0].y;

                    x0[0].y+=x0[0].x;    //note the constant indexing

                   

                  The compiled ISA loop differs based on the way I use the array.

                   

                  1) When I initialize it, with constant indexing:

                      x0[0].xy=cb2[0].xy 

                    Then it will compile the loop to:

                      3  y: ADD_INT     R0.y,  R1.x,  R0.y     

                      4  x: ADD_INT     R1.x,  R1.x,  PV3.y      //2 cycles is the best time for this dependency chain

                   

                  2) When I initialize it, with register indexing:

                      loop r1.x from 0 to 1 do  

                        if(r1.x%4=0) x0[r1.x/4].x=cb2[r1.x/4].x

                        if(r1.x%4=1) x0[r1.x/4].y=cb2[r1.x/4].y

                        if(r1.x%4=2) x0[r1.x/4].z=cb2[r1.x/4].z

                        if(r1.x%4=3) x0[r1.x/4].w=cb2[r1.x/4].w

                      endloop 

                    This is enough for the compiler to mark the array as variably accessed, and then it compiles the loop to:

                      5  x: MOV         R0.x,  R4.x     

                         y: MOV         R0.y,  R4.y     

                      6  x: MOV         R1.x,  R4.x     

                      7  y: ADD_INT     R0.y,  R0.x,  R0.y     

                      8  x: ADD_INT     R1.x,  PV7.y,  R1.x     

                      9  y: MOV         R4.y,  R0.y     

                     10  x: MOV         R4.x,  R1.x     

                  • Re: Small temporary arrays in OpenCL
                    himanshu.gautam Master

                    Thank you for the test case. I have reported the issue to the AMD OpenCL compiler team. I will update the thread once the issue has been fixed.

                    Regards

                    Himanshu , Bruhaspati

