25 Replies. Latest reply: Feb 9, 2014 6:58 AM by nou

Multiple contexts parallel allocating or writing to memory of a single device

chevydevil Newbie

Hello, I have a program which uses OpenMP to schedule work in parallel to one OpenCL device, i.e. a GPU. Right now this is done by using multiple contexts, each of which has its own queues and buffers. The program stops after some iteration steps. I mean it just stops, without exiting, without a segmentation fault or anything else. Could it be that allocation from multiple contexts is not thread-safe? Do I have to use one context and one queue per thread (which is my choice for the future anyway)? By the way, this only happens on a GPU device; CPU devices work fine.

 

Thanks in advance.

  • Re: Multiple contexts parallel allocating or writing to memory of a single device
    himanshu.gautam Master

    Operating on a context from multiple threads is supported from OpenCL 1.1 onwards. All OpenCL calls are thread-safe except clSetKernelArg. Even with this API, multiple threads can still work with unique cl_kernel objects; however, they cannot work with the same cl_kernel object at the same time. So allocating a cl_kernel object per thread will help overcome this issue.

    Check Appendix A.2 of the OpenCL spec. So, as long as your platform is OpenCL 1.1 or later, you can use just one context and allow all your OpenMP threads to work with it.

     

    However, if multiple threads read/write shared cl_mem objects across multiple command queues, this can result in undefined behaviour. Check Appendix A.1 of the OpenCL spec; that should resolve all your doubts.
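
    As a rough sketch of that layout (a sketch only, assuming a context, device and built program already exist, and using "my_kernel" as a placeholder kernel name), each OpenMP thread can create its own command queue, cl_kernel and buffer, so clSetKernelArg is never shared between threads:

    /* Sketch: one shared context/program, per-thread queue, kernel and buffer.
       ctx, dev and prog are assumed to exist; "my_kernel" is a placeholder. */
    #include <CL/cl.h>
    #include <omp.h>

    void run_threads(cl_context ctx, cl_device_id dev, cl_program prog)
    {
        #pragma omp parallel
        {
            cl_int err;
            cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
            cl_kernel k = clCreateKernel(prog, "my_kernel", &err);
            cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                        1024 * sizeof(float), NULL, &err);

            size_t gws = 1024;
            /* Safe: this thread is the only user of this cl_kernel object. */
            clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
            clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL);
            clFinish(q);

            clReleaseMemObject(buf);
            clReleaseKernel(k);
            clReleaseCommandQueue(q);
        }
    }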

     

    Now, coming to the issue you are facing:

    I am not sure what you mean by "the program stops" with no seg-fault. You may want to first find out up to which point the application is running, or post your sources as a standalone zip file which we can use to reproduce the issue here.

    You also need to specify the following:

    1. Platform: win32 / win64 / lin32 / lin64 or something else? (Win7, Vista or Win8; similarly for Linux, your distribution)

    2. Driver version

    3. CPU or GPU target?

    4. CPU/GPU details of your hardware


    Thanks,

    Regards,

    Himanshu, Bruhaspati


  • Re: Multiple contexts parallel allocating or writing to memory of a single device
    chevydevil Newbie

    It has been a while, but my problem still exists. My responses above weren't accurate because the remote access didn't use the GPU but only found the CPU: the classic headless problem. I am now able to access the GPU remotely, but then my "stopping" problem appears again. I believe a deadlock happens when a memory object is released in the scenario with multiple command queues used by multiple threads. Here is part of my debug log, taken when the execution stops:

    [debug]#0  0x00007ffff582d420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
    [debug]#1  0x00007fffef1f9ba0 in amd::Semaphore::wait() () from /usr/lib/libamdocl64.so
    [debug]#2  0x00007fffef1f6162 in amd::Monitor::finishLock() () from /usr/lib/libamdocl64.so
    [debug]#3  0x00007fffef21f6fc in gpu::Device::ScopedLockVgpus::ScopedLockVgpus(gpu::Device const&) () from /usr/lib/libamdocl64.so
    [debug]#4  0x00007fffef242c3e in gpu::Resource::free() () from /usr/lib/libamdocl64.so
    [debug]#5  0x00007fffef243207 in gpu::Resource::~Resource() () from /usr/lib/libamdocl64.so
    [debug]#6  0x00007fffef22fd3d in gpu::Memory::~Memory() () from /usr/lib/libamdocl64.so
    [debug]#7  0x00007fffef23123f in gpu::Buffer::~Buffer() () from /usr/lib/libamdocl64.so
    [debug]#8  0x00007fffef1e8998 in amd::Memory::~Memory() () from /usr/lib/libamdocl64.so
    [debug]#9  0x00007fffef1e9607 in amd::Buffer::~Buffer() () from /usr/lib/libamdocl64.so
    [debug]#10 0x00007fffef1f41eb in amd::ReferenceCountedObject::release() () from /usr/lib/libamdocl64.so
    [debug]#11 0x00007fffef1c5a37 in clReleaseMemObject () from /usr/lib/libamdocl64.so
    

    I will try to finally reproduce this by focusing on threaded allocation and release of memory in a minimal example. Hopefully this leads somewhere. It would be nice to solve this so I can convince my boss to buy some of the 7990 cards for our computing.
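
    Something along these lines is what I have in mind for the minimal example (a sketch only; ITERS, BUF_SIZE and the thread count are arbitrary, and the context and device are assumed to be created beforehand):

    /* Stress sketch: several threads repeatedly allocate, write and release
       buffers against one shared context, each through its own queue. */
    #include <CL/cl.h>
    #include <omp.h>
    #include <stdlib.h>

    #define ITERS    10000
    #define BUF_SIZE (1 << 20)

    void stress_alloc_release(cl_context ctx, cl_device_id dev)
    {
        #pragma omp parallel num_threads(8)
        {
            cl_int err;
            cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
            char *host = (char *)malloc(BUF_SIZE);

            for (int i = 0; i < ITERS; ++i) {
                cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, BUF_SIZE,
                                            NULL, &err);
                clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, BUF_SIZE, host,
                                     0, NULL, NULL);
                clReleaseMemObject(buf);   /* where the backtrace above ends up */
            }

            free(host);
            clReleaseCommandQueue(q);
        }
    }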

    • Re: Multiple contexts parallel allocating or writing to memory of a single device
      himanshu.gautam Master

      Thanks for the update. We look forward to your test case.

      I would suggest going through Appendix A of the OpenCL programming guide for some guidance.

      Regards

      Himanshu, Bruhaspati


    • Re: Multiple contexts parallel allocating or writing to memory of a single device
      soylentgraham Newbie

      Benjamin Bendig
      Did you get any further with this? I'm pretty sure I have the same problem (I don't think it's happening on the CPU).

       

      I have...
      1 Context
      10 threads

      10 queues (one per thread)

      20 kernels (a cl_kernel is instanced per use for each thread, so they're not shared)

       

      I'm blocking all writes and reads, and blocking my executions with clWaitForEvents immediately after clEnqueueNDRangeKernel.
      (Things seem to hang much earlier if I don't block everything, but I'm not sure yet if it's the same issue.)
      The faster my code works, and the more work I throw at it, the quicker it hangs (more memory object allocation/deallocation).


      Whenever it stops (just as described above), one thread ALWAYS just happens to be releasing a memory object (the others are usually reading/writing).

       

      I understand the object release is thread-safe... (I'm doing it VERY regularly, say, 10 times per kernel, per thread.)
      In my case, should I have *any* mutexes? Currently I only have some for management on the host side.

       

      Windows 7; the driver version in device manager is 13.200.0.0 (I think I'm still using beta drivers).

      • Re: Multiple contexts parallel allocating or writing to memory of a single device
        himanshu.gautam Master

        Are the others reading/writing the *same* cl_mem object that you are trying to release?

        Regards

        Himanshu, Bruhaspati


        • Re: Multiple contexts parallel allocating or writing to memory of a single device
          soylentgraham Newbie

          I am sure it's not the same object; they're not shared across threads.

          • Re: Multiple contexts parallel allocating or writing to memory of a single device
            chevydevil Newbie

            For my problem, I think this is the source:

             

            http://devgurus.amd.com/message/1300289#1300289

             

            Still, up to this point I was not able to reproduce the problem in a simple example, but I also don't have much time to invest in this. Anyway, since the problem only occurs with the AMD GPU runtime, it seems to be driver-related. It happens either if I have one context created by the main thread and accessed by multiple different threads, or if I have multiple contexts created by the main thread and accessed by multiple different threads. Note also that in the latter case no shared memory objects or kernels are used at all.

          • Re: Multiple contexts parallel allocating or writing to memory of a single device
            soylentgraham Newbie

            I realised my image-memory objects weren't using the correct queue (they were all using a "default" queue which the kernels weren't using). I'm not sure why the system still worked, but this may be the cause, even though the hang/deadlock wasn't related to any memory objects or kernels that were using the image objects at the time.

             

            I added a host-side mutex when releasing memory objects; no help.

            I then used that mutex when reading/writing any memory object, which is when I discovered my issue with image-memory objects.

            I'll update shortly if my problem has gone away, but currently my driver crashes before it hangs (though it runs for a lot longer), which I think is an out-of-bounds memory access, as it gives me a memory violation when I execute on the CPU instead of the GPU.

            • Re: Multiple contexts parallel allocating or writing to memory of a single device
              nou Expert

              Buffers are automatically copied between devices, but the OpenCL runtime will place a buffer on the device whose queue it is associated with.

            • Re: Multiple contexts parallel allocating or writing to memory of a single device
              himanshu.gautam Master

              Thanks for posting back and for the quick experiments.

              We will await a nice repro case so that we can start working on this.

              Thanks,

              Regards

              Himanshu, Bruhaspati


              • Re: Multiple contexts parallel allocating or writing to memory of a single device
                soylentgraham Newbie

                Okay, I wrote a big long-winded reply, but I think I've solved my problems now. No more access violations (which I thought were out-of-bounds accesses) and no deadlocks (or at least, not for the last few hours) on GPU or CPU.

                 

                My new setup:

                100 host threads, 100 queues (one each)

                N kernels, all instanced per-thread (no cross-queue/cross-thread kernels). All kernels on a thread use the same queue.

                All writes are now non-blocking

                All executions are non-blocking.

                All reads are blocking.

                [then the kernel and data are disposed]

                 

                The problem, I realised in the end, was that whenever I made a write or execution non-blocking, the data on the queue for that kernel wasn't ready. Perhaps more threads and queues just highlighted a problem that was already there, or perhaps (as I read somewhere) having more than one queue for a context warrants more calls to clFinish. (clFinish before execution also worked, but clFlush still resulted in access violations.)

                 

                Anyway, now, for all my non-blocking writes I store the cl_event...

                Before execution (though after clSetKernelArg) I do clWaitForEvents on all the events relevant to this kernel/queue.

                All my crashes and deadlocks have gone away. I have NO mutexes on the host side related to OpenCL, and execution is faster.

                 

                I wrongly assumed an execution (blocking or non-blocking) would ensure the relevant data write[s] on the queue had finished, but it seems not.
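
                For what it's worth, the pattern boils down to something like this (a sketch only; the queue, kernel, buffer and sizes are placeholders for my real ones):

                /* Sketch: non-blocking write, keep the cl_event, wait on it before
                   launching the kernel that reads the buffer. */
                #include <CL/cl.h>

                void write_then_run(cl_command_queue queue, cl_kernel kernel,
                                    cl_mem buf, const float *host_ptr, size_t n)
                {
                    cl_event write_done;

                    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                                         host_ptr, 0, NULL, &write_done);

                    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

                    /* Ensure the data is on the device before the kernel runs. */
                    clWaitForEvents(1, &write_done);
                    clReleaseEvent(write_done);

                    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                                           0, NULL, NULL);
                }

                (An alternative would presumably be to pass write_done in the event_wait_list argument of clEnqueueNDRangeKernel instead of waiting on the host.)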

                 

                 

                • Re: Multiple contexts parallel allocating or writing to memory of a single device
                  himanshu.gautam Master

                  Oh, Thanks!

                  From what I infer from your post, the bug was due to a misunderstanding of asynchronous execution and had nothing to do with AMD's OpenCL runtime. Please confirm.

                   

                  And yes, good luck, and thanks for taking the time to post your experience here!

                  It can be a great time-saver for someone...

                  And I hope your code runs for many more hours to come and then one day terminates normally...!

                   

                  Best,

                  Bruhaaa........

                  Regards

                  Himanshu, Bruhaspati


                  • Re: Multiple contexts parallel allocating or writing to memory of a single device
                    tugrul_512bit Newbie

                    Does all this mean that I can do the operations below?

                     

                    1) Create a single context.

                    2) Create a single ordered queue for all kernels.

                    3) Create an ordered queue for each write/read operation. So if I have N read and M write operations, I create N+M queues.

                     

                    4) Start.

                     

                    5) From an OpenMP body, simultaneously do:

                     

                    chunk0 { clEnqueueWriteBuffer(queue0, buffer0); queue0.finish(); }

                    chunk1 { clEnqueueWriteBuffer(queue1, buffer1); queue1.finish(); }

                    ...

                    chunkN { clEnqueueWriteBuffer(queueN, bufferN); queueN.finish(); }

                     

                    6) All writes/reads are done, so I can start computing on the GPU:

                    clEnqueueNDRangeKernel(queueCompute, blabla)
                    queueCompute.finish();

                     

                    7) Do a very similar thing for reading the results, as in step 5.

                     

                    8) Repeat from step 5.

                     

                     

                     

                    This way, can I get full PCI Express read/write bandwidth?

                     

                    Right now I'm using only a single ordered queue for all read/write/compute operations, and I have a single singleQueue.clFinish() at the very end. This lets me reach only 1.4 GB/s for read/write buffer operations. I'm hoping for something like 4-5 GB/s on my Gigabyte 990-xa-ud3 motherboard.
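
                    For reference, the write step in 5) would look roughly like this (a sketch only; the queues, buffers and host pointers are assumed to be created elsewhere against the same context):

                    /* Sketch: one write queue per buffer, each finished independently
                       from an OpenMP loop. queues[], buffers[] and host_ptrs[] are
                       assumed to exist and share one context. */
                    #include <CL/cl.h>
                    #include <omp.h>

                    void parallel_writes(cl_command_queue *queues, cl_mem *buffers,
                                         void **host_ptrs, size_t bytes, int n)
                    {
                        #pragma omp parallel for
                        for (int i = 0; i < n; ++i) {
                            clEnqueueWriteBuffer(queues[i], buffers[i], CL_FALSE, 0,
                                                 bytes, host_ptrs[i], 0, NULL, NULL);
                            clFinish(queues[i]);   /* finish only this queue */
                        }
                    }

                    Whether that actually reaches full PCIe bandwidth probably also depends on things like pinned host memory, not just the number of queues.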
