48 Replies Latest reply: Jun 5, 2013 10:52 PM by himanshu.gautam RSS

AMD 79xx GPUs skip kernel execution for certain indices

timchist Novice
Currently Being Moderated

I'm experiencing a strange problem that occurs on 7950 and 7970 cards, but does not happen on 5850 and 6870.

 

My application processes images in tiles. For each tile a series of OpenCL kernels is called. When tile size becomes relatively small (say, 128x128), some parts of output image may be not fully processed. I simplified my algorithm so that it is only executing the following operations for each tile:

 

  1. Temp1 = 100
  2. Temp2 = 30
  3. Temp1 = Temp1 + Temp2
  4. Dst = Src + Temp1

 

(Temp1, Temp2, Src, Dst are all vectors of 128x128).

 

After that I call clFinish and copy Temp1, Temp2, Src and Dst to host memory for checking. For those tiles that have been calculated incorrectly, I have found out that:

 

  • Temp1 is equal to 130 for all vector components
  • Temp2 is equal to 30 for all vector components
  • Dst is not equal to Src + Temp1 (Src + 130) for some vector components, but is rather equal to Src + 100

 

The number of incorrect vector components is often (but not always) divisible by 64, so it seems that under some circumstances whole wavefronts get skipped.

 

Even though the problem is 100% reproducible in this simplified version of our application, it does not show up when I try to write a standalone test, even when it very accurately models the behaviour of the application. Apparently there are some other factors that trigger the problem that I'm not aware of.

 

I'm attaching a screenshot showing a fragment of the output from our application. Grid indicates the tile boundary. If the output was correct, all the image would be equally pink, without any stripes.

 

The larger tiles become, the less is the likelihood of the problem to appear.

 

My best uneducated guess is that something wrong is happening when kernels are scheduled to hardware either on driver or on firmware level.

 

I tried several versions of driver, specifically: Catalyst 12.4, 12.8, 12.10, 13.1, 13.3 beta, 13.4. I also tried two different 7970s in two computers (one based on AMD FX 8350, the other one with i7 3770K). I also tried a 7950 in a compute based with i7 3930K. On all computers Windows 7 x64 was used. We did not check that under Linux or Mac OS. In all these configurations the problem did occur.

 

Does that ring a bell?

  • Re: AMD 79xx GPUs skip kernel execution for certain indices
    vmiura Newbie
    Currently Being Moderated

    Are you using any complex control flow?


    I ran into 2 bugs that I could reproduce on 13.3 beta:

     

    while(a && b) {} // loops even when b is false

     

    do
    {

       store some debug

      if(a) return;  // <- having a return inside do while loop caused register clobbering, and weird data was stored to my debug buffer
    }while(b);

  • Re: AMD 79xx GPUs skip kernel execution for certain indices
    timchist Novice
    Currently Being Moderated

    So far I have two possible reasons of why this problem may occur:

    • as task size is small and not all compute units are utilised, GPU may attempt to schedule next kernel to free compute units while the previous kernel is still not finished. This may be caused by an error in dependency analysis
    • cache coherency problem: second call to Add (Dst = Src + Temp1) is executed on the compute unit that has previously executed the first Fill with 100 (Temp1 = 100) and for some reason the cache of this compute unit did not get updated with a subsequent value of 130 (after Temp1 = Temp1 + Temp2 was executed)
  • Re: AMD 79xx GPUs skip kernel execution for certain indices
    timchist Novice
    Currently Being Moderated

    I have just got a confirmation that the behaviour I experience is caused by executing two kernels that have dependencies in parallel. Please see two attached screenshots, one showing timeline from a correct tile, the other one -- from an incorrect tile.

     

    As you can see, for a correct tile GPU executes Temp1 = 100 and Temp2 = 30 in parallel. That's ok, there is no dependencies. Temp1 = Temp1 + Temp2 and Dst = Src + Temp1 are executed sequentially, as the first kernel modifies Temp1, so the second one depends on the results of the first one.

     

    For a tile that is calculated incorrectly the timeline is different: Temp1 = 100 and Temp2 = 30 are executed sequentially, but two Add calls are incorrectly executed in parallel.

     

    Is there a workaround?

    • Re: AMD 79xx GPUs skip kernel execution for certain indices
      vmiura Newbie
      Currently Being Moderated

      Is it an asynchronous queue?

       

      Technically you should execute clEnqueueBarrier() or clEnqueueMarker() beween kernels if you don't want them to execute in parallel, although I thought that the current drivers don't support asynchronous execution.

      • Re: AMD 79xx GPUs skip kernel execution for certain indices
        timchist Novice
        Currently Being Moderated

        No, this queue is synchronous. In addition, as far as I know AMD OpenCL does not support asynchronous queues.

        In synchronous queues kernels MUST execute sequentially without executing clEnqueueBarrier, clEnqueueMarker or any other explicit synchronization points ("CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE  Determines whether the commands queued in the command-queue are executed in-order or out-of-order. If set, the commands in the command-queue are executed out-of-order. Otherwise, commands are executed in-order.").

         

        I'd say it's OK to execute commands in parallel even in a synchronous queue, but only if there is 100% no dependencies between them. Which is not true in my case.

        • Re: AMD 79xx GPUs skip kernel execution for certain indices
          vmiura Newbie
          Currently Being Moderated

          Yeah, it shouldn't need extra synchronization.

           

          I've seen some unexpected results overlapping in CodeXL kernel tracing though, so I'm not sure you can trust that they are actually overlapping.

           

          Do you get the same bug if you use clFinish() to force sync between the kernel dispatches?

          • Re: AMD 79xx GPUs skip kernel execution for certain indices
            timchist Novice
            Currently Being Moderated

            No, inserting clFinish helps to avoid the errors, but with a performance penalty of ~30%.

            • Re: AMD 79xx GPUs skip kernel execution for certain indices
              himanshu.gautam Master
              Currently Being Moderated

              Can you post a small repro case so that we can take this issue up?

              Regards

              Himanshu , Bruhaspati

              --------------------------------

              The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

            • Re: AMD 79xx GPUs skip kernel execution for certain indices
              himanshu.gautam Master
              Currently Being Moderated

              Hi Timchist,

              Are you able to create a repro case, after figuring out the asynchronous issue with the kernels. I can send the test case to AMD Engginers, to fix the issue.

              Regards

              Himanshu , Bruhaspati

              --------------------------------

              The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

              • Re: AMD 79xx GPUs skip kernel execution for certain indices
                timchist Novice
                Currently Being Moderated

                Please see my reply just above.

                • Re: AMD 79xx GPUs skip kernel execution for certain indices
                  himanshu.gautam Master
                  Currently Being Moderated

                  I did read that you were not able to reproduce it with a minimal testcase. But I cannot help you without a repro case.

                  I just asked you for test case again, as you have done quite a work since the first post. Sorry for the confusion.

                  Regards

                  Himanshu , Bruhaspati

                  --------------------------------

                  The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

                  • Re: AMD 79xx GPUs skip kernel execution for certain indices
                    timchist Novice
                    Currently Being Moderated

                    Dear Himanshu,

                    even though I don't have a working example demonstrating the problem, I'm pretty sure the problem exists.

                     

                    I found a workaround for the problem: if for kernels such as Add(Src1, Src2, Dst) I never pass the same pointer as both Src and Dst, the problem does not appear. My guess is that the dependency analyzer somehow deduces that kernel Add only reads from Src1 and Src2, but only writes to Dst, but does not take into account the fact that I can pass the same pointer in both arguments. This does not happen always (otherwise I would be able to reproduce this in a simple program very quickly), but the bug does seem to be present under some circumstances, which I can't figure out with the tools I have and with my level of understanding of the hardware.

                     

                    I'd appreciate if you pass all the data I posted above (the description of the workflow, the screenshots from APP profiler, and if required -- I can also post APP trace files along with the source of kernels I was using) to AMD engineers (even though you can't reproduce it), as such an incorrect behavior is basically a show stopper.

                     

                    Regards,

                    Tim

                    • Re: AMD 79xx GPUs skip kernel execution for certain indices
                      himanshu.gautam Master
                      Currently Being Moderated

                      Reminds me of "fno-alias" compiler option. Previously we had to pass it to the compiler. At some point, it was made to be default.... May be, Passing a compiler flag to consider aliasing might help...

                      Just a thought... Can't remember what that option is (or if at all such an option exists).

                      Will check on Monday to see if I can find something.

                      Regards

                      Himanshu , Bruhaspati

                      --------------------------------

                      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

                      • Re: AMD 79xx GPUs skip kernel execution for certain indices
                        timchist Novice
                        Currently Being Moderated

                        Hi Himanshu,

                         

                        are you talking about cl-strict-aliasing?


                        This option is documented in Khronos specs (see clBuildProgram), however, AMD APP Programming Guide says that AMD compiler only supports -I and -D (see section 2.1.3). Neither does it mention any default options. It also seems that this option was only present in OpenCL 1.0 and 1.1 (it is not mentioned in OpenCL 1.2), but I also can't find any info on whether this option is now enabled by default.

                         

                        If this option is enabled by default, then yes, this could explain the behavior I observe.

                         

                        Regards,

                        Tim

                        • Re: AMD 79xx GPUs skip kernel execution for certain indices
                          himanshu.gautam Master
                          Currently Being Moderated

                          Nice find.. OpenCL 1.2 spec specifically says that "cl-strict-aliasing" is not supported from 1.1 (Appendix F)

                           

                          Anyway. all these aliasing fundas work only within a kernel (for compilation purposes)....and does not work out for dependencies among kernel launches. I don't know how the RT handles it..

                           

                          Coming to your problem -->

                          1) 2 kernels cannot execute in parallel in current AMD implementations (as on today)

                          2) out of order processing of command queue does not happen with AMD Runtime (even if you had enabled it)

                           

                          btw.. Are the dependent kernels scheduled from 2 different command queues? or a single command queue?

                          Please post a quick smallest reproduction case for us to pursue with AMD engg.

                          Regards

                          Himanshu , Bruhaspati

                          --------------------------------

                          The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

                          • Re: AMD 79xx GPUs skip kernel execution for certain indices
                            timchist Novice
                            Currently Being Moderated

                            > 2 kernels cannot execute in parallel in current AMD implementations (as on today)

                            Strange. That's a quote from APP SDK Guide (see section 5.9, page 5-34) about Southern Islands GPUs:

                            "Execution of kernel dispatches can overlap if there are no dependencies between them and if there are resources available in the GPU."

                             

                            The whole application uses a single command queue. This queue was created with 0 passed as the value of the properties argument of clCreateCommandQueue.

                             

                            I'm attaching the source code of kernels I have used as well as the trace files produced with AMD APP profiler. When inspecting the timeline, an example of a correct tile starts at 7151.988, while the incorrect tile starts at 7181.370.

                             

                            Unfortunately, as I have explained in the initial post, I failed to produce a complete working example.

                            • Re: AMD 79xx GPUs skip kernel execution for certain indices
                              himanshu.gautam Master
                              Currently Being Moderated

                              Thanks for attaching a case. Will try it out.

                               

                              ANd, the APP Programming guide, most likely was talking about the Hardware capability.

                              Hardware has it... but the currnet OpenCL RT does not use it.

                              Regards

                              Himanshu , Bruhaspati

                              --------------------------------

                              The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: AMD 79xx GPUs skip kernel execution for certain indices
    glimberg Newbie
    Currently Being Moderated

    I've come across a very similar issue in an application I'm working on.

     

    In my case, the images are not processed in tiles, but rather as full image buffers.  Like your case, everything works fine on 5xxx and 6xxx based cards and we're only seeing the issue on 7950 and 7970, though I still need to track down our 7750 and see if it's an issue there, too.  I have also tried the 7950 and 7970 on Mac OS 10.8.3 and the issue does NOT appear there.  I can't say for certain as I haven't tried Linux yet, but it sure seems like a Windows only issue.

     

    Basically, the moment that there are more than 2 kernels queued, I start to get incorrect results, but only on buffers above a certain size.

     

    The only way I've been able to circumvent the issue is to disable the OpenCL optimizer by passing "-cl-opt-disable" to the options field of clBuildProgram().

    • Re: AMD 79xx GPUs skip kernel execution for certain indices
      himanshu.gautam Master
      Currently Being Moderated

      It will be helpful, if you can post a small repro case.

      Regards

      Himanshu , Bruhaspati

      --------------------------------

      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

    • Re: AMD 79xx GPUs skip kernel execution for certain indices
      timchist Novice
      Currently Being Moderated

      Did you try the workaround I have suggested above? Specifically, avoid passing the same pointers in different parameters to kernels, paying special attention to not passing the same pointer in both input and output parameters.

      • Re: AMD 79xx GPUs skip kernel execution for certain indices
        glimberg Newbie
        Currently Being Moderated

        We don't have that specific case as far as I can tell.  We do have a few kernels that accept only a single buffer for input and output, i.e. the input buffer is modified in place, but none where the same buffer is passed to two different kernel args.  Said buffer is then passed to other kernels for additional operations.

         

        I will try to modify those kernels tomorrow and see if adding a destination buffer argument helps things.

      • Re: AMD 79xx GPUs skip kernel execution for certain indices
        glimberg Newbie
        Currently Being Moderated

        Another thing to note.  There is a difference between your issue and mine.  No amount of calling clFinish() between my kernels helps in this case. The only thing thus far that has helped me is completely disabling optimization on the kernels.  And optimization must be disabled for all kernels in the chain of 8 or so.  If it's enabled for a single one of the kernels, I start getting incorrect results.

        • Re: AMD 79xx GPUs skip kernel execution for certain indices
          timchist Novice
          Currently Being Moderated

          Could be a different issue then. Do you see any overlap happening in APP Profiler timeline or kernel executions look sequential (both graphically and when inspecting start and end times)?

          • Re: AMD 79xx GPUs skip kernel execution for certain indices
            glimberg Newbie
            Currently Being Moderated

            Yeah, no overlapping kernels on my end as far as I can tell.  Kernels just aren't running in their entirety before the next one starts up when the optimizer is enabled.  It skips the last 1/3rd or so of the buffer once the buffer size starts to approach 39,321,600 bytes (5120 x 2560 x 3 channels).

            • Re: AMD 79xx GPUs skip kernel execution for certain indices
              himanshu.gautam Master
              Currently Being Moderated

              Hi glimberg,

              I request you to start a new thread for your question. The two topics being discussed here, does not seem related.

              Also as i understand, you question is not at all related to multiple kernel launches, but more likely related to some incorrect kernel optimization. Please provide a suitable testcase if possible to reproduce it at our end.

              Regards

              Himanshu , Bruhaspati

              --------------------------------

              The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

              • Re: AMD 79xx GPUs skip kernel execution for certain indices
                glimberg Newbie
                Currently Being Moderated

                Actually it's very related to multiple kernel launches. If we onlyrun the first kernel in the 8 kernel chain, everything works fine.  Once a second kernel is added is when the issue surfaces.  The commonality with Tim's issue is that the first few kernels in the chain are operations in place on the input buffer via a single input/output parameter to the kernel. This is slightly different in kernel code form from Tim's issue, but likely compiles down to the same thing. Tim sometimes does an operation in place by passing in the same memory pointer to different input and output parameters on the kernel function.

                 

                Just to demonstrate the issue, which I have done in a minimal sample here, but have not been authorized to release the sample:

                 

                1) Create a memory buffer that can encapsulate a 5120x2160 16-Bit RGB image.  Approximately 66MB.

                 

                2) before uploading the buffer to the card, memset the entire buffer to 0.

                 

                3) Upload the host buffer to the card.

                 

                4) Run a memset() kernel like the following:

                 

                __kernel void cl_memset(__global unsigned char *buffer,

                                        const unsigned char value)

                {

                     buffer[get_global_id(0)] = value;

                }

                 

                and set the value kernel arg to 255 and have it run over every byte in the buffer.

                 

                5) Run any other kernel with an operation that takes a separate input and output buffer.  In my case in my minimal sample, it was a pixel type conversion from RGB-16 to BGRA-8.

                 

                6) Download the kernel back to a host memory buffer and ensure each byte is equal to 255.

                 

                At around buffer index 33,400,000 (not exact), you will start getting char values of 0 in the host buffer downloaded back from the card.

                 

                IF you just run the cl_memset kernel without any other kernels queued after it, everything will work just fine.  The moment there's another kernel enqueued after is when everything starts going to hell. 

                 

                No amount of enqueuing barriers, markers, or waiting for events changes anything.

  • Re: AMD 79xx GPUs skip kernel execution for certain indices
    vmiura Newbie
    Currently Being Moderated

    Did you try putting clEnqueueBarrier after all calls to clEnqueueNDRangeKernel?

     

    Since you don't want kernels to overlap, this is what you want anyway.  In the worst case it will cost nothing, but if it fixes the bugs it might confirm the idea that kernels are running in parallel.

    • Re: AMD 79xx GPUs skip kernel execution for certain indices
      timchist Novice
      Currently Being Moderated

      No, I didn't, but it's worth giving a try. I'll post my findings here after I have a chance to make this experiment.

       

      UPDATE (May 23rd): events do not seem to help. Neither do barriers.

    • Re: AMD 79xx GPUs skip kernel execution for certain indices
      jeff_golds Novice
      Currently Being Moderated

      Kernels can run in parallel, but we have dependency checking to make sure that kernels with dependencies don't overlap, i.e. if the output from one kernel is used as the input or output to another, then they can't run in parallel.

       

      If you find a bug with this, please let us know.

      Jeff Golds

      AMD, Inc.

      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

      • Re: AMD 79xx GPUs skip kernel execution for certain indices
        timchist Novice
        Currently Being Moderated

        Thank Jeff. Apparently I've found a bug, please see the posts above.

        • Re: AMD 79xx GPUs skip kernel execution for certain indices
          jeff_golds Novice
          Currently Being Moderated

          If you are using profiling, then concurrent execution is disabled.  If you aren't getting correct results with profiling enabled, then it's not an issue with dependency checking.

           

          Which kernel is generating incorrect data?  There are 80 kernels in your trace file.

          Jeff Golds

          AMD, Inc.

          The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

          • Re: AMD 79xx GPUs skip kernel execution for certain indices
            timchist Novice
            Currently Being Moderated

            I am executing these 4 steps for every of ~80 tiles:

            1. Temp1 = 100
            2. Temp2 = 30
            3. Temp1 = Temp1 + Temp2
            4. Dst = Src + Temp1

            The results for incorrect tiles look like steps 3 and 4 are executing in parallel (some elements in Dst contain Src + 100 and some of them contain Src + 130). When inspecting the timeline in my trace file, an example of a correct tile starts at 7151.988, while the incorrect tile starts at 7181.370.

            • Re: AMD 79xx GPUs skip kernel execution for certain indices
              himanshu.gautam Master
              Currently Being Moderated

              Hi tim,

              Can you confirm events are not able to fix your issue? You should be able to force required kernel scheduling using events.

              Regards

              Himanshu , Bruhaspati

              --------------------------------

              The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

              • Re: AMD 79xx GPUs skip kernel execution for certain indices
                timchist Novice
                Currently Being Moderated

                Confirmed. Returning an event from launches 1, 2, 3 with subsequent waiting for that event in following launches 2, 3, 4 does not fix the issue. I also tried to enqueue barriers (clEnqueueBarrier/clEnqueueBarrierWithWaitList) with no success. Inserting clFlush or clFinish between steps 3 and 4, however, does fix the problem (with a performance penalty though).

                • Re: AMD 79xx GPUs skip kernel execution for certain indices
                  himanshu.gautam Master
                  Currently Being Moderated

                  Can you confirm if enabling profiling fixes the problem?  -- Then, we can be sure this is related to concurrent kernel execution.

                  Also, Can you confirm if this behaviour is seen on other OSes as well?

                  Regards

                  Himanshu , Bruhaspati

                  --------------------------------

                  The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

                  • Re: AMD 79xx GPUs skip kernel execution for certain indices
                    timchist Novice
                    Currently Being Moderated

                    The problem exists both when profiling is enabled and disabled. Unfortunately, I don't have access to computers running Linux or Mac OS that have a 79xx GPU.

                    • Re: AMD 79xx GPUs skip kernel execution for certain indices
                      himanshu.gautam Master
                      Currently Being Moderated

                      Oh well... Then this means that the kernels are not executing concurrently...and still you are facing problems.

                      Hmm.....

                       

                      I see following potential reasons for this bug:

                      1) Software issue with your program (well...Just to cover all cases....)

                      2) A catalyst driver bug that runs kernels concurrently even if profiling is enabled.

                      3) A failed OS primitive that Catalyst driver is relying on (thats the reason I want you to check on other OSes) - which is causing concurrent execution when it should not.

                           If not linux, Can you try with Win8?

                           The fact that it does not occur on 5xxx and 6xxx need not necessarily imply a hardware issue.

                           7xxx cards are much faster and the quickness can induce races or timing issues in other software.

                      4) Bug in dependency checks in Catalyst driver

                            However, if thats the case -- enabling profiling should have fixed the issue.

                            So, I dont think the bug is lurking here.... but as I said before... just to cover all cases.

                      5) Hardware bug  (Well....just to cover all cases)

                       

                      Please post a smallest reproducible test-case. Otherwise, I fear we will go nowhere in this thread....

                      Regards

                      Himanshu , Bruhaspati

                      --------------------------------

                      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: AMD 79xx GPUs skip kernel execution for certain indices
    timchist Novice
    Currently Being Moderated

    The problem is no longer reproducible with the Catalyst 13.6 beta driver. Apparently, something has been fixed. Thank you all for your help.

More Like This

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points