I have a functioning OpenCL application right now that uses two command queues so that I can run a kernel and a DMA transfer concurrently. It works on multiple systems with discrete GPUs (NVIDIA and AMD).
However, when I try to run it on my system with an AMD A10 APU, the kernel locks up and freezes. Is this simply not possible on this architecture, or is there some kind of exception I need to handle?
I can provide an example program privately if an AMD developer can help.
If I recall correctly, CUDA requires multiple streams (within a CUDA context) to overlap DMA with kernel execution.
On AMD, however, I don't think you need multiple command queues. Just make sure that the kernel and the buffer copy are enqueued one after another, that they have no dependency on each other, and that the buffer uses pinned memory. That should suffice.
Please give me some time to experiment with this; I will let you know what I find.
So do I understand correctly that DMA is only used when pinned memory is used? And do I remember correctly that pinned memory is used only when a buffer is smaller than 32 MB and is moved by clEnqueueMapBuffer? I recall reading about this a while back, and if I remember correctly, mapping a buffer returns a pointer to pinned memory if the buffer is small enough. I only ask because I'm writing a prototype of a GPU-cluster-capable physics simulation with MPI, and CUDA has RDMA implemented (most likely not ported to OpenCL), so my best chance with AMD is using pinned buffers.
Also, does AMD plan on implementing something similar on the Red side of the force? (Namely RDMA, with Infini, or simply within a host.)
I really hope there isn't a hard cap that small on the size of pinned memory; I haven't checked. I'm also curious about whether there are plans for RDMA in OpenCL, but not very hopeful, as that is probably an architecture-specific thing that NVIDIA is doing (it only appears to be available on newer Tesla models).
Himanshu - Sorry I haven't responded to the main replies here; I've had to move forward with an alternate approach, but I am still curious whether this can be done on APU hardware (concurrent DMA and kernel execution). If you come up with a very simple example that works on Trinity hardware, I'd be very appreciative to see it. Thanks for your time.
I think the 32 MB limit comes from Table 4.2 in the AMD APP Programming Guide. That applies to normal, regular buffers (which are not pinned and usually reside in device memory), and the guide is describing the behaviour of clEnqueueMapBuffer.
But if you want to use DMA, you have to pin the buffer. Pinning usually happens when you use USE_HOST_PTR. Either the host application's pages are pinned directly, or they are copied to a temporary pinned buffer for a one-shot transfer, or they are transferred chunk by chunk using DMA and double-buffering. The runtime decides when the transfer happens (mostly depending on first use). Until you MAP that buffer, the OpenCL runtime owns your host pointer. When you map it, you own it and can write to it. When you UNMAP it, control returns to the OpenCL runtime.
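The ownership handoff described above can be sketched in host code roughly as follows. This is a fragment, not a complete program; it assumes `ctx`, `queue`, and a page-backed host allocation `host_data` of `nbytes` already exist, and error handling is abbreviated.

```c
#include <CL/cl.h>

/* Hedged sketch: USE_HOST_PTR pinning and map/unmap ownership.
   Assumes ctx, queue, host_data, and nbytes are set up elsewhere. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            nbytes, host_data, &err);

/* From here, the runtime owns host_data; do not touch it directly. */

/* Map to regain ownership of the pages before writing from the host. */
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                             0, nbytes, 0, NULL, NULL, &err);
/* ... write input data through p ... */

/* Unmap: ownership returns to the OpenCL runtime, which may transfer
   (pin-and-DMA, or copy) the data to the device as it sees fit. */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```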
When you use ALLOC_HOST_PTR, pinned memory is allocated if zero-copy is supported. The kernel can read this data directly through a pointer, so data transfer and kernel execution occur together -- which is not a great way to overlap data transfer and kernel execution (the GPU is too fast and will often stall waiting for data to arrive from system memory across PCIe).
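A minimal sketch of that zero-copy pattern, under the same assumptions as above (`ctx`, `queue`, `kernel`, `nbytes`, and `gsize` exist; error handling omitted):

```c
#include <CL/cl.h>

/* Hedged sketch: zero-copy buffer via CL_MEM_ALLOC_HOST_PTR.
   The kernel reads host memory directly over PCIe; no explicit copy. */
cl_int err;
cl_mem zc = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                           nbytes, NULL, &err);

/* Map to fill the (pinned, host-resident) buffer from the CPU. */
float *host = (float *)clEnqueueMapBuffer(queue, zc, CL_TRUE, CL_MAP_WRITE,
                                          0, nbytes, 0, NULL, NULL, &err);
/* ... fill host[] ... */
clEnqueueUnmapMemObject(queue, zc, host, 0, NULL, NULL);

/* The kernel accesses zc in place; the "transfer" happens on demand,
   interleaved with execution, which is exactly why the GPU can stall. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &zc);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
```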
When you use the PERSISTENT_AMD flag, the buffer is allocated in GPU memory and the CPU gets a pointer (which reads/writes across the PCIe bus). In this case, a memcpy and kernel execution can happen together, but the memcpy is PIO and cannot be called DMA.
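For reference, a sketch of that pattern using the AMD extension flag `CL_MEM_USE_PERSISTENT_MEM_AMD` (declared in `CL/cl_ext.h`); same assumptions as the earlier fragments, with `src` a host array of `nbytes`:

```c
#include <string.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Hedged sketch: device-resident buffer with a host-visible pointer.
   Writes through the mapped pointer are CPU-driven PIO over PCIe, not DMA. */
cl_int err;
cl_mem persist = clCreateBuffer(ctx,
                                CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                                nbytes, NULL, &err);

float *wc = (float *)clEnqueueMapBuffer(queue, persist, CL_TRUE, CL_MAP_WRITE,
                                        0, nbytes, 0, NULL, NULL, &err);
memcpy(wc, src, nbytes);   /* CPU pushes the bytes across the bus (PIO) */
clEnqueueUnmapMemObject(queue, persist, wc, 0, NULL, NULL);
```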
The best way to overlap a transfer with kernel execution is to first allocate a pinned buffer (using ALLOC_HOST_PTR), map it to get a pointer, and write something into it. Then allocate another normal buffer (which resides on the GPU). Now do a clEnqueueWrite* from the pinned buffer to the normal buffer. That write is pure DMA.
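The steps above can be sketched as follows (a hedged fragment, assuming `ctx`, `queue`, `kernel`, `nbytes`, and `gsize` exist; error handling omitted). Note the kernel that consumes the device buffer must wait on the transfer; only work that does not depend on it can actually overlap with the DMA:

```c
#include <CL/cl.h>

cl_int err;

/* 1. Pinned staging buffer on the host side. */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                               nbytes, NULL, &err);
float *staging = (float *)clEnqueueMapBuffer(queue, pinned, CL_TRUE,
                                             CL_MAP_WRITE, 0, nbytes,
                                             0, NULL, NULL, &err);
/* ... fill staging[] with input data ... */

/* 2. Normal buffer that resides in GPU memory. */
cl_mem device_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, nbytes, NULL, &err);

/* 3. Non-blocking write from pinned host memory: this is the DMA transfer. */
cl_event xfer;
err = clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, 0, nbytes, staging,
                           0, NULL, &xfer);

/* 4. A kernel on independent data, enqueued now, is the candidate for
   overlapping with the DMA. The kernel below instead consumes device_buf,
   so it must wait on the transfer event. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &device_buf);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 1, &xfer, NULL);
```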
It is this DMA that I would like to overlap with Kernel execution. I am still investigating whether this is possible or not.
Will post an update next week.
Here is some sample code to showcase asynchronous DMA using AMD GPUs. It should compile on both Windows and Linux.