25 Replies Latest reply: Sep 20, 2013 12:00 AM by madsbuvi RSS

Kernel uses less registers since 13.1 - harming performance of my application

madsbuvi Newbie
Currently Being Moderated

Hello!

Since drivers 13.1 and later, the AMD OpenCL compiler has been rather sparse with allocating registers for my code, and as a result there is massive register spilling and about 3x reduction in code performance.

When compiled with 12.11 or older it would use 244 registers, with no spilling. Still enough registers to achieve full utilization of the GPU since with no register spilling my code would be nearly wholly arithmetically bound while having enough wavefronts to keep the GPU occupied.

But when compiled with 13.1 or later it uses only 131, spilling a lot of registers.

Are there any compiler flags i can pass to allow/force the compiler to be more liberal in allocating registers?

 

My apologies in advance if i've missed any documents specifying this, or if this question has already been answered (i searched but couldn't find any entirely similar questions).

 

The specific code in question can be found here:

https://github.com/madsbuvi/MTY_CL/blob/master/readme.md

Run thought CodeXL should give a complete .cl file

the loop at lines 246-250 in gpu.cl and the whole of des.cl/sboxes.cl is the relevant runtime-critical section.

performance dropped to ~40 million from 125-130 million hashes / second with the new drivers. Edit: with the card 7850, i forgot to mention.

 

(The code also broke completely and generated wrong hashes with driver version 13.1, but this has been fixed in the newest beta driver. Mentioning this in case it might be related.)

 

edit:

http://devgurus.amd.com/message/1286728

seems somewhat relevant in terms of losing performance. But does not seem to be caused by register spilling.

  • Kernel uses less registers since 13.1 - harming performance of my application
    nou Expert
    Currently Being Moderated

    try add

    __attribute__((work_group_size_hint(64, 1, 1))) or __attribute__((reqd_work_group_size(64, 1, 1)))

  • Re: Kernel uses less registers since 13.1 - harming performance of my application
    himanshu.gautam Master
    Currently Being Moderated

    Thanks for reporting it. I will try to reproduce it at our end. Is the testcase 32-bit or 64-bit. It contains DLLs so i assume you are using Windows. Win7 or Win8?

    Regards

    Himanshu , Bruhaspati

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: Kernel uses less registers since 13.1 - harming performance of my application
    Raistmer Apprentice
    Currently Being Moderated

    Well, I see the same for my app too,

    With 13.1 app started to cause driver restarts. Comparing  ISA for too long running kernel I found that under 13.1 it uses only 5 registers while on 12.8 (where no driver restart) it uses 12 GPRs:

     

    SQ_PGM_RESOURCES:NUM_GPRS     = 5

    vs

    SQ_PGM_RESOURCES:NUM_GPRS     = 12

     

    So, register spilling inevitable under 13.1 that slows down kernel in such big degree that it causes driver restarts.

  • Re: Kernel uses less registers since 13.1 - harming performance of my application
    madsbuvi Newbie
    Currently Being Moderated

    With drivers 13.4 and 13.5 beta the program crashes with exception code c0000005 at the call to clBuildProgram.

    Not sure if i should make a separate thread about this. I'll edit / post again if i can figure more precisely what is causing the crash.

    • Re: Kernel uses less registers since 13.1 - harming performance of my application
      himanshu.gautam Master
      Currently Being Moderated

      btw.. How many compiler options are you passing?

      Regards

      Himanshu , Bruhaspati

      --------------------------------

      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

      • Re: Kernel uses less registers since 13.1 - harming performance of my application
        madsbuvi Newbie
        Currently Being Moderated

        Just "-cl-opt-disable" or "-O0". Disabling this eliminates the crash. In regards to the original questions trying with or without this made no difference. I had disabled optimizations since the code produced with -O0 was not noticeable slower and the program frequently needs to recompile so reducing compilation time is desirable (but not crucial).

        I was able to produce a fairly small case that reproduces the crash in KernelAnalyzer 2 (2.1.671) with either "-O0" or "-cl-opt-disable" enabled.

         

        typedef uchar     uint8_t;
        typedef ushort     uint16_t;
        typedef uint     uint32_t;
        typedef ulong     uint64_t;
        typedef char    int8_t;
        typedef short    int16_t;
        typedef int        int32_t;
        typedef long    int64_t;
        
        struct WDW
        {
          uint64_t w;
          uint16_t xp;
        };
        
        inline uint64_t
        xpize(uint64_t m, uint32_t xp, uint32_t len)
        {
            int i;
        
            for (i = 0; i < len; i++){
                if (xp & (1 << i))
                {
                    uint32_t c = (m >> 6 * i) & 077;
                    uint32_t s;
                    if (c >= 046)
                      s = 26;
                    else if (c == 001)
                      s = 1;
                    else
                      continue;
                    m -= (uint64_t)s << 6 * i;
                }
            }
        
            return m;
        }
        
        __kernel void crypt25(
                              __global struct WDW * wdw_pool
                                                    ){    
            
            int index = get_global_id(0);
            uint64_t word = wdw_pool[index].w;
            uint64_t xi = xpize(index,wdw_pool[index].xp,10);
        
        }
        
        

         

         

        Good news is, without -O0, the code is back to full speed (tested only on beta driver 3.5)

        • Re: Kernel uses less registers since 13.1 - harming performance of my application
          roger512 Newbie
          Currently Being Moderated

          Hi,

           

          I will just report a personnal experience I had, that is alike to yours. I was also working on a very big kernel, using a lot of registers. The code worked perfectly on Nvidia cards, when i tested it on a HD7950, it didn't produce the good result.

          Disabling optimizations made it functionnal but really slow. I eventually found that adding volatile qualifier fixed the problem, even if there were no reason to put a volatile (it was just a simple float variable).

           

          Then the code got bigger and the AMD compatibility didn't follow, at the moment it only works with NVidia cards which is quite annoying for an opencl code. I'm currently waiting for a code generation improvement in AMD IL, because I think it's why my kernel isn't working with AMD cards

          • Re: Kernel uses less registers since 13.1 - harming performance of my application
            himanshu.gautam Master
            Currently Being Moderated

            Hi roger,

            Can you explain what you mean by compatibility did not follow. Has the kernel stopped compiling for AMD cards? It will help if you can share some cut-down version of your kernels.

            Regards

            Himanshu , Bruhaspati

            --------------------------------

            The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

            • Re: Kernel uses less registers since 13.1 - harming performance of my application
              roger512 Newbie
              Currently Being Moderated

              Hi Himanshu, thank you for answering.

               

              Well, i'm sorry but I can't really post the code, it's a really big kernel more than 2000 lines. I wont post cut down version for the moment either, my companie wouldn't allow it anyway.

               

              When I said the compatibility did not follow, I meant, it compiles fine, it just doesn't produce the right result when NVidia cards are able to.

               

              I can give you the followind details :

               

              I used the casting vector type to pointer trick to access a vector type component dynamically.

               

              example :

               

              float4 vec;

              float f =     ((float*)&vec)[x];

               

              That kind of code didn't work well with HD7950 card. I replaced it with that trick :

               

              http://developer.amd.com/community/blog/tips-tricks-a-code-snipit/

               

              It fixed couple of problems, from what I remember the vector type to pointer trick worked with float4 that were not inside a structure (???), but maybe it is pure coincidence.

               

              From there it started to work better, and I finally found that adding a volatile on some random float variable made the kernel functional, that's mainly why I think there is something broken in the OpenCL to AMD IL process.

               

              I never had any problems with small and medium size kernels, I still think AMD Opencl is very reliable but it seems there is some kind of glitchs happening when kernels become quite complex, the kernel I encounter difficulties with has 4/5 nested loop, many break and a load of conditionals statements .

               

              The kernel also relies on warp/wavefront lockstep principal... I used 32/64 to define their width, so it should be ok there.

               

              Roger

  • Re: Kernel uses less registers since 13.1 - harming performance of my application
    madsbuvi Newbie
    Currently Being Moderated

    While the speed is back to normal, the new drivers seems to have other issues.

    I've been unable to produce a smaller case of where something is going wrong, but the drivers 13.4 and 13.5 are producing garbage instead of correct hashes.

    Attached are snapshots of searching for any hashes containing the string "Green", where testing under 12.10 shows correct output while 13.4 produces garbage (and 13.5 has the same issue).

    This problem was also present in 13.1, but not in any of the beta drivers in between the releases of 13.1 and 13.4.

    I'm sorry that i can't provide a smaller case. But what i did find was that changing the randomly generated keys to a hardcoded key would produce the correct output (presumably due to constant propagation hiding the actual issue...).

More Like This

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points