28 Replies. Latest reply: Feb 12, 2013 6:18 PM by himanshu.gautam

Bug in OpenCL (GCN only)

pswwsp Newbie

Hello,

I have a working OpenCL kernel that calculates the SHA-256 hash of a VERY long string, and it takes too much time. I decided to split this kernel into several parts and save the intermediate results in a global buffer. This means the same kernel is called several times; if the calculation is not complete, the previous intermediate context is loaded from this buffer, more of the hash is calculated, and the new intermediate result is saved again.

Unfortunately, the new split kernel does not work: the global buffer holding the intermediate result always contains zeros. However:

  1. It fails only on the GCN architecture (I’ve tested on Capeverde). I've tested Catalyst versions from 12.3 up to 12.11. It works fine on VLIW5 and NVIDIA GPUs.
  2. If I print the intermediate buffer using printf(), the code works correctly.
  3. If I comment out some lines in the code, the buffer is also non-zero.

A minimal sample is attached. If there is a proper way to report such a bug, please let me know; I couldn’t find one. Thanks.
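For readers without the attachment, the save/resume pattern described in this post can be sketched in plain C. This is only an illustration under stated assumptions: `ctx_t`, `run_step` and `run_all` are hypothetical names, the "hash" is a toy checksum rather than SHA-256, and an ordinary function call stands in for each kernel launch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical saved state; a real SHA-256 context would also hold
   the eight 32-bit chaining values and a partial block. */
typedef struct {
    uint32_t state;   /* running value (toy checksum, NOT SHA-256) */
    size_t   offset;  /* how many input bytes are already consumed */
} ctx_t;

/* One "launch": load the context, consume at most max_bytes more
   input, store the context back. Returns 1 when input is exhausted. */
int run_step(ctx_t *saved, const uint8_t *data, size_t len, size_t max_bytes)
{
    assert(saved != NULL && (data != NULL || len == 0));
    ctx_t c = *saved;                      /* load intermediate context */
    size_t end = c.offset + max_bytes;
    if (end > len || end < c.offset)       /* clamp, also on wrap-around */
        end = len;
    for (; c.offset < end; ++c.offset)
        c.state = c.state * 31u + data[c.offset];
    *saved = c;                            /* save intermediate context */
    return c.offset == len;
}

/* Host-side loop: call the step repeatedly until it reports completion. */
uint32_t run_all(const uint8_t *data, size_t len, size_t max_bytes)
{
    assert(max_bytes > 0);
    ctx_t saved = {0u, 0};
    while (!run_step(&saved, data, len, max_bytes))
        ;
    return saved.state;
}
```

Whatever the step size, the final value should be the same, which is the property the split kernel is expected to have; a saved context that reads back as all zeros breaks it.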

  • Re: Bug in OpenCL (GCN only)
    binying Novice

    And are you using the latest driver?

  • Re: Bug in OpenCL (GCN only)
    yurtesen Apprentice

    What do you mean when you say it doesn't work? Does it crash, or do you get wrong results?

    • Re: Bug in OpenCL (GCN only)
      pswwsp Newbie

      It prints all zeros instead of the real results. If you compile the sample, you will see either zeros (incorrect) or a non-zero result (correct). If you use printf, the result is always correct.

      • Re: Bug in OpenCL (GCN only)
        yurtesen Apprentice

        I had a quick look at your code. If you use CL_MEM_ALLOC_HOST_PTR with clCreateBuffer, shouldn't you be using map/unmap? Also, according to the AMD documentation, this memory object will be in host memory and not in device memory (which is perhaps all right, unless that is not what you want). Did you try making cglobal a pointer and then using map/unmap to access it? That way you can avoid allocating the memory twice and the unnecessary copy operations in between.

         

        Also, if the device is doing async operations, the buffer could be read before the kernel finishes execution; try a cl_wait after the kernel enqueue. There is an example of this in the OpenCL PDF (see page 1-20):

        http://developer.amd.com.php53-23.ord1-1.websitetestlink.com/wordpress/media/2012/10/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

         

        The printf might be adding enough time for the kernel to complete and for the correct values to be read.

        • Re: Bug in OpenCL (GCN only)
          binying Novice

          Indeed, I can reproduce this problem on 79xx (GCN Architecture) on Win7 & Catalyst 12.11.

           

          I downloaded gcn-error.zip from the link pswwsp provided.  The code compiles and runs well without any modification. It prints out non-zero "ctx", which is correct according to pswwsp. 

           

          After either of the two "printf"s (Line 200 and Line 220 in zerobuf-gcn.cl) is uncommented, the output shows that the values of "ctx" are all zeros, which is incorrect.

        • Re: Bug in OpenCL (GCN only)
          pswwsp Newbie

          Thanks for looking at my code. As for CL_MEM_ALLOC_HOST_PTR, it really is the wrong flag, because the buffer should be placed on the device. Unfortunately, removing this flag doesn't help.

          As for clWaitForEvents (did you mean this?), I guess it's not needed because reading the buffer via clEnqueueReadBuffer is blocking. Anyway, I've tried inserting clWaitForEvents before the buffer read, and it doesn't fix the bug.

          I've attached the fixed code, and it behaves exactly the same as the previous version. That's why I think it's not my bug, but a bug in the OpenCL compiler or runtime.

          • Re: Bug in OpenCL (GCN only)
            binying Novice

            Removing CL_MEM_ALLOC_HOST_PTR or inserting clWaitForEvents doesn't help.

             

            --Yes, this is what I've seen.

            • Re: Bug in OpenCL (GCN only)
              binying Novice

              I narrowed the problem down a little. Please replace the zerobuf-gcn.cl inside gcn-error.zip with the attached zerobuf-gcn.cl. With the modified zerobuf-gcn.cl, you will find that it doesn't matter whether the "printf" inside the kernel is commented out or not.

               

              I would say something inside the kernel triggers the problem. It may or may not be a bug...

          • Re: Bug in OpenCL (GCN only)
            yurtesen Apprentice

            AFAIK you need to wait for kernel execution to finish with clWaitForEvents or clFinish.

             

            You have:

            #define THREADS_PER_BLOCK    128
            #define MAX_BLOCKS        512

            in your .cpp file:

            int cglobal [MAX_BLOCKS * THREADS_PER_BLOCK * 8];

            Use calloc for cglobal... (I don't trust this allocation)

            int grid = 32;
            globalWorkSize = THREADS_PER_BLOCK * grid;

             

            In your kernel:

            #define SHA_LONG unsigned int
                __global SHA_LONG *c_global,

            First of all, you allocate int and then use uint (I guess it doesn't matter in this case, but...)

            Then you access it inside your kernel with tid * 8 + [0-7], which gives a maximum index of 32767 (4096 work-items * 8 - 1), while you allocated 524288 ints (MAX_BLOCKS * THREADS_PER_BLOCK * 8). Sounds unnecessarily high? (unless I calculated something wrong?)
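The arithmetic behind this comparison can be checked with a few lines of C. This is a sketch using the values quoted in this post; it assumes tid runs from 0 to globalWorkSize - 1 (the kernel source is not shown here), and the helper names and WORDS_PER_ITEM are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define THREADS_PER_BLOCK 128
#define MAX_BLOCKS        512
#define WORDS_PER_ITEM    8   /* from the access pattern tid * 8 + [0-7] */

/* Number of elements in the host array cglobal. */
size_t allocated_elements(void)
{
    return (size_t)MAX_BLOCKS * THREADS_PER_BLOCK * WORDS_PER_ITEM;
}

/* Highest index touched, assuming tid runs from 0 to global_work_size - 1. */
size_t max_index_touched(size_t global_work_size)
{
    assert(global_work_size > 0);
    return (global_work_size - 1) * WORDS_PER_ITEM + (WORDS_PER_ITEM - 1);
}
```

With grid = 32, max_index_touched(THREADS_PER_BLOCK * 32) evaluates to 32767 and allocated_elements() to 524288, so under that assumption a launch touches only the first sixteenth of the array.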

             

            Of course probably none of these are the cause of the problem...

             

            I guess I should compile your code and test it, but I am a bit busy; I will try tomorrow.

            • Re: Bug in OpenCL (GCN only)
              pswwsp Newbie

              Thanks for trying to find a bug in my code. As for clWaitForEvents, it doesn't help (see above).

              As for MAX_BLOCKS, the code is designed to use a variable number of blocks, up to 512. Unfortunately, the problem is somewhere else, probably in the OpenCL compiler/runtime.

              • Re: Bug in OpenCL (GCN only)
                yurtesen Apprentice

                OK, let's recap: I ran your code (the latest one), and I should see zeros?

                 

                ~/temp/test$ ./zerobuf
                1 platforms detected
                Platform 0:
                        Vendor: Advanced Micro Devices, Inc.
                        Name: AMD Accelerated Parallel Processing

                1 devices detected
                Device 0:
                        Device: Advanced Micro Devices, Inc.
                        Name: Tahiti
                        Max threads: 256
                        Max cores: 32

                        Max Threads (by kernel): 256
                        Multiply (by kernel): 64
                        Compiled threads (by kernel): 0
                Device #0, Block size is: 32 x 128 (-m32), step = 2
                ctx host   = 2f0f1c 2f0f1c 30101d 2d0d1a
                ctx host   = 2f0f1c 2f0f1c 30101d 2d0d1a
                If ctx host is all zeros, this is the bug!

                NOTE: Ah, right, I got zeros the second time I ran it. Is that the problem?

                • Re: Bug in OpenCL (GCN only)
                  pswwsp Newbie

                  I haven't tested under Linux (I will try in a few hours). Under Windows I get zeros on every launch. You got zeros only once, right?

                • Re: Bug in OpenCL (GCN only)
                  yurtesen Apprentice

                  Also, am I supposed to be getting random results on each run?

                  • Re: Bug in OpenCL (GCN only)
                    yurtesen Apprentice

                    Not really... I am getting different results at each run

                     

                    eyurtese@extremum-desktop:~/temp/test$ ./zerobuf  |grep 'ctx host  '
                    ctx host   = 0000 0000 0000 01fe
                    ctx host   = 0000 0000 0000 01fe
                    eyurtese@extremum-desktop:~/temp/test$ ./zerobuf  |grep 'ctx host  '
                    ctx host   = e5382000 ffff8803 0000 01fe
                    ctx host   = e5382000 ffff8803 0000 01fe
                    eyurtese@extremum-desktop:~/temp/test$

                    • Re: Bug in OpenCL (GCN only)
                      yurtesen Apprentice

                      status = clEnqueueReadBuffer(cmdQueue, d_cglobal, CL_TRUE, 0,
                              sizeof (int) * MAX_BLOCKS * THREADS_PER_BLOCK * 8, &cglobal,
                              0, NULL, NULL);

                       

                      You are passing the address of the pointer, not the address it points to... shouldn't this be cglobal instead of &cglobal? Just saying. I think there are other problems in your code too; I wouldn't blame the SDK just yet.
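Whether cglobal and &cglobal differ depends on how cglobal is declared: as an array (the declaration quoted earlier in the thread) or as a pointer (which it becomes after the calloc suggestion). A small plain C sketch of the two cases, with hypothetical helper names and a 16-element size chosen only for the illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Case 1: cglobal declared as an array. The array name decays to the
   address of its first element, and &cglobal is that same address
   (only the pointer type differs). */
int array_addresses_match(void)
{
    static int cglobal[16];
    return (uintptr_t)(void *)cglobal == (uintptr_t)(void *)&cglobal;
}

/* Case 2: cglobal is a pointer to heap memory (e.g. after switching to
   calloc). Now &cglobal is the address of the pointer variable itself. */
int pointer_addresses_match(void)
{
    int *cglobal = calloc(16, sizeof *cglobal);
    assert(cglobal != NULL);
    int same = (uintptr_t)(void *)cglobal == (uintptr_t)(void *)&cglobal;
    free(cglobal);
    return same;
}
```

In the array case both expressions name the same memory, so the read lands in the buffer either way. In the pointer case, a read of that size into &cglobal would overwrite the pointer variable and whatever follows it in memory, so plain cglobal is the safe spelling in both cases.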

                      • Re: Bug in OpenCL (GCN only)
                        pswwsp Newbie

                        Maybe it's a small bug in the code, because this clEnqueueReadBuffer call was inserted just for debugging and printing purposes. As I said, I didn't test it under Linux/gcc. Please remove the '&' before cglobal.

                        • Re: Bug in OpenCL (GCN only)
                          yurtesen Apprentice

                          Well, I give up. Maybe you are right. I tried on a CPU (Bulldozer) and it crashed, yet your program seems to work on an Intel CPU with AMD OpenCL, also with the Intel OCL SDK on an Intel CPU (I did not test on Bulldozer with this), and on some Tesla cards.

                           

                           

                          I tried to figure it out by not printing, but instead putting values into the output array:

                          c_global[0] = HashC_end
                          c_global[1] = HashRounds

                          Funnily, although these looked different, it didn't seem to enter

                                      if (HashC_end != HashRounds) {

                          (where I had assigned something to c_global[2]).

                           

                          You know, you can try running it with CodeXL and see what it does (at least, I hear it is supposed to be able to debug OpenCL code line by line).

                          Also, perhaps you should use atomic_add or atomic_inc (according to Khronos, these are the 32-bit versions).

                           

                          Sorry that I couldn't be of more help. Maybe you are right, and it might be a bug. It is strange that it gets worse on the CPU.

                           

                          You should definitely report this to AMD...

  • Re: Bug in OpenCL (GCN only)
    pswwsp Newbie

    Unfortunately, the 13.1 drivers have the same bug. I sent a bug report via the http://www.amdsurveys.com site, but it seems it was not submitted successfully. Could anyone help submit this bug to the AMD team?

    • Re: Bug in OpenCL (GCN only)
      himanshu.gautam Master

      I will check this out and raise the issue with the engineering team, if needed (or if it's not already being tracked).

      Regards

      Himanshu , Bruhaspati

      --------------------------------

      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: Bug in OpenCL (GCN only)
    himanshu.gautam Master

    I could reproduce it.

    It looks like the compiler was optimizing the code away.

    1. Making the ‘HashRounds’ variable volatile works around the issue temporarily (file zerobuf-gcn.cl, line 206).

    2. Alternatively, disabling optimization by passing "-cl-opt-disable" to clBuildProgram() also works around the issue.

    Regards

    Himanshu , Bruhaspati
