14 Replies, latest reply Feb 22, 2013 2:59 PM by bobrog

streaming large datasets through the GPU

bobrog Newbie

Hi all:

 

This is my first stab at GPU computing.  I am running on 64-bit Linux with 32GB of memory
and an AMD 7770 GPU with 2GB of memory.  The data set is large (28GB for the largest mesh)
and will be streamed through the GPU in pieces for computation.  In the best of all worlds
a simple 3-buffer scheme, with each buffer controlled by a separate queue, would allow the
transfers to and from the GPU as well as the GPU computation to run concurrently.

 

To set up the CPU buffers I have tried two methods:

 

    /* Method 1 (UHP): allocate page-aligned host memory, pin it, and hand it to the runtime */
    float* Data_s = (float*) valloc( size_S );
    if( mlock( Data_s, size_S ) != 0 )    printf("*Data_s not locked\n" );
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);

or

    /* Method 2 (AHP): let the runtime allocate pinned host memory, then map it for a CPU pointer */
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
    float* Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );

 

To set up the GPU buffers:

    cl_mem Buffer_s1 = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);
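
Roughly, the streaming loop I have in mind looks like the sketch below (an illustration
only; NCHUNK, chunk_size, Kernel, global and the chunk indexing are placeholders, not the
actual code, and all error checking is omitted):

    cl_mem           Buffer_gpu[3];              /* three device-side staging buffers */
    cl_command_queue Queue_buf[3];               /* one queue per buffer              */
    for( int b = 0; b < 3; b++ ) {
        Buffer_gpu[b] = clCreateBuffer(context, CL_MEM_READ_WRITE, chunk_size, NULL, &status);
        Queue_buf[b]  = clCreateCommandQueue(context, device, 0, &status);
    }

    for( int i = 0; i < NCHUNK; i++ ) {
        int b = i % 3;                           /* rotate through the three buffers  */
        /* host -> GPU: next piece of DATA_S */
        clEnqueueWriteBuffer(Queue_buf[b], Buffer_gpu[b], CL_FALSE, 0, chunk_size,
                             Data_s + (size_t)i*chunk_size/sizeof(float), 0, NULL, NULL);
        /* compute on that piece */
        clSetKernelArg(Kernel, 0, sizeof(cl_mem), &Buffer_gpu[b]);
        clEnqueueNDRangeKernel(Queue_buf[b], Kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        /* GPU -> host: result back into the same region of DATA_S */
        clEnqueueReadBuffer(Queue_buf[b], Buffer_gpu[b], CL_FALSE, 0, chunk_size,
                            Data_s + (size_t)i*chunk_size/sizeof(float), 0, NULL, NULL);
    }
    for( int b = 0; b < 3; b++ )  clFinish(Queue_buf[b]);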

 

The runs indicate that kernel execution is overlapped, but reads do not overlap writes.
The latter is a disappointment, but not totally unexpected.  For small meshes the
ALLOC_HOST_PTR method runs at full speed (same as the BufferBandwidth sample), but the
USE_HOST_PTR method only runs at roughly 2/3 of that speed.

For larger meshes the ALLOC_HOST_PTR method fails at the MapBuffer call (error -12,
CL_MAP_FAILURE, which the OpenCL 1.2 spec explicitly states cannot happen for this kind
of buffer!), but the slower USE_HOST_PTR method will handle the largest mesh (28GB).

 

Since the CPU <--> GPU transfers are the bottleneck for the code, I need a method that
gives the full transfer rates over the largest mesh.  There are several posts on this
forum about Maps of CPU buffers requiring buffer allocation on the GPU.  Has this been
fixed or a work-around provided?  Also, since the GCN family has dual bi-directional DMA
engines, does AMD expect to implement concurrent bi-directional transfers in the future?

 

catalyst-13.1-linux-x86.x86_64    AMD-APP-SDK-v2.8-lnx64

  • Re: streaming large datasets through the GPU
    german Newbie
    • AHP from your code is USWC memory and UHP is cacheable.  Performance from the GPU side should be identical, but USWC doesn't pollute the CPU cache.  I would suggest creating a simple test just for the data transfer and seeing if you can reproduce the performance difference.
    • What's the size of the AHP buffer?  A USWC allocation by default doesn't request a CPU virtual address; that requires an extra operation, map(), which may fail for whatever reason.  The error code can be changed, but that doesn't mean the runtime won't fail the call.
    • I'm not sure about the allocation issue.  Is it VM/zero-copy?  The HD7770 supports VM, so there are no allocations on the GPU for CPU memory.
    • Bidirectional transfers should work in the latest driver.  Usually Windows is the main target for all performance tuning, because MS has the advanced tools for GPU profiling; however, I don't expect any major issues under Linux.  The OpenCL runtime pairs 2 CPs (command processors) with 2 DMA engines.  So if the application creates 3 queues, then 2 queues will be assigned to CP0 and DMA0 and 1 queue to CP1 and DMA1.  The application has to make sure read and write transfers go to different DMA engines, without any synchronization between them; a rough sketch of such a queue arrangement follows below.
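
    A minimal sketch of that arrangement (queue, buffer and kernel names are illustrative
    only; which queue ends up on which CP/DMA engine is decided by the runtime, so some
    experimenting may be needed):

        /* one queue per function, instead of one queue per buffer */
        cl_command_queue writeQ = clCreateCommandQueue(context, device, 0, &status);  /* host -> GPU */
        cl_command_queue execQ  = clCreateCommandQueue(context, device, 0, &status);  /* kernels     */
        cl_command_queue readQ  = clCreateCommandQueue(context, device, 0, &status);  /* GPU -> host */

        /* per chunk: write, compute and read back on different queues, with events
           ordering the three stages of the same chunk                               */
        cl_event wdone, kdone;
        clEnqueueWriteBuffer(writeQ, Buffer_s1, CL_FALSE, 0, size_s, src, 0, NULL, &wdone);
        clEnqueueNDRangeKernel(execQ, kernel, 1, NULL, &global, NULL, 1, &wdone, &kdone);
        clEnqueueReadBuffer(readQ, Buffer_s1, CL_FALSE, 0, size_s, dst, 1, &kdone, NULL);
        clReleaseEvent(wdone);  clReleaseEvent(kdone);
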
  • Re: streaming large datasets through the GPU
    himanshu.gautam Master

    The first thing to find out is whether VM is enabled or not.  You can check that by running
    "clinfo" and looking at the driver version string (it should be something like
    "1182.2 (VM)").  The presence of the "VM" string is what you should look for.
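
    If you prefer to check from inside the program, something like this (a rough sketch;
    the device handle is assumed to exist and error checks are omitted) prints the same
    driver string that clinfo reports:

        char drv[256];
        clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof(drv), drv, NULL);
        printf("Driver version: %s\n", drv);   /* look for "(VM)" in this string */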

     

    Assuming VM is enabled, an AHP buffer will be directly accessed by the OpenCL kernel,
    i.e. the kernel's pointer accesses translate to PCIe transactions which in turn access
    pinned host memory.  This means that the kernel is not doing any work most of the time
    and is stalling (very badly) on memory operations.  So the overlap that you intend to
    achieve is probably not happening.  I suggest you allocate a buffer on the GPU and
    "enqueueWriteBuffer" to it.
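
    Something along these lines (an illustration only; kernel, queue and devBuf are
    placeholder names, not your code):

        /* slow: pinned host buffer passed straight to the kernel; every access crosses PCIe */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &DATA_S);

        /* preferred: stage the piece into a device-resident buffer first */
        cl_mem devBuf = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);
        clEnqueueWriteBuffer(queue, devBuf, CL_FALSE, 0, size_s, Data_s, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &devBuf);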

     

    I had earlier noted that using a "UHP" buffer directly as a kernel argument slows things
    down very badly.  You may want to build a small prototype to probe this.

    Regards

    Himanshu, Bruhaspati


  • Re: streaming large datasets through the GPU
    bobrog Newbie

    Thanks German and Himanshu for your rapid response.

     

    In response to Himanshu: the driver version from clinfo is 1084.4 (VM).  The test code
    is a modified version of the HelloWorld sample and is quite simple.  The kernel simply
    modifies the GPU buffer to show that the data was actually moved to and from the GPU.
    The large buffer on the CPU (DATA_S) is streamed in pieces to and from the buffers on
    the GPU (Buffer_s1, etc.) using WriteBuffer and ReadBuffer.  The host code must be able
    to access DATA_S but never accesses Buffer_s1, etc.  Likewise the kernel accesses
    Buffer_s1 but not DATA_S.  The only communication of data between CPU and GPU is via
    the Write/Read calls.

     

    In response to German: I changed the command queues from separate queues for the three
    buffers to separate queues for write, read, and execute as you suggested, and that did
    enable read/write overlap.  The round-trip bandwidth increased from ~12 GB/sec to
    ~16 GB/sec, less than the ~22 GB/sec I hoped for, but a very promising start.  The
    different allocations implied by AHP and UHP probably do account for the lower bandwidth
    of UHP (cache thrashing on the CPU), so I may be forced to use AHP for DATA_S.  But since
    the host needs a pointer with which to access DATA_S, it seems I must Map it, because
    using a query to get an AHP pointer is explicitly prohibited by the OpenCL 1.2 reference.
    So that seems to imply that I must somehow overcome the problem of mapping a large buffer.
    The problem may be in the AMD software or in Linux.  Is it possible to get more informative
    error information from Map?  It might also be useful to try the code under Windows to see
    if Linux is the problem; if you think so, I can send you the code to try.  The code runs
    with DATA_S = 1.21 GB but fails at 1.23 GB.

    • Re: streaming large datasets through the GPU
      german Newbie

      22 GB/sec?  Do you have a PCIe Gen3 system?  Attach the code for Windows and I'll tell
      you whether you can improve performance and how.

      1.23GB doesn't look big, but originally the runtime didn't allow single allocations
      > 512MB even for AHP.  The Linux base driver could be failing something; I'll try to
      check that.  I assume you run a 64-bit build of your test?  You shouldn't see this
      issue under Windows.

      • Re: streaming large datasets through the GPU
        bobrog Newbie

        German:

        I will try to attach a zip file with everything needed for Linux.  For Windows you can
        substitute a suitable timer for WALLTIME; other than that, the .cpp and .cl files should
        work under Windows.  To control the buffer sizes change NX, NY, NZ, and BATCH.  Running
        "transfer2 1" uses AHP, "transfer2 2" uses UHP.  The largest mesh I can run is
        NX=NY=NZ=1024 (with BATCH=16), a 16 GB DATA_S.  My machine is an i7-3820 (32GB) with an
        HD7770 (2GB) on PCIe Gen3.

        I am now trying to avoid Mapping the large structure DATA_S and instead Mapping each
        piece of it before Writing/Reading it, and UnMapping it after.  Lots of Map/UnMap.
        Getting seg faults at the first Write at this point ... probably my bad.
        Happy hunting.

        • Re: streaming large datasets through the GPU
          bobrog Newbie

          Here is the code.

          • Re: streaming large datasets through the GPU
            german Newbie

            1. You still have to prepin memory even for the UHP allocations.  In theory it's not
            really necessary; however, the runtime uses the same path for AHP and UHP.  Also, I
            believe the OpenCL 1.2 spec requires a map call for CPU access even for UHP
            allocations.  So call clEnqueueMapBuffer for UHP just as for AHP, and that should fix
            the app performance.  Also don't forget the unmap calls :-)

            2. I can confirm that both transfers run asynchronously in HW, but when they run
            together DMA engine 1 is slower than DMA engine 0.  On top of that, even DMA0 is
            slightly slower than a single transfer on either DMA0 or DMA1.  So I would say 16GB/s
            is the best you can get for now.

             

            • Re: streaming large datasets through the GPU
              bobrog Newbie

              German:

              Yes, I did lock (prepin) DATA_S by calling mlock.  On my machine any user can lock
              up to 4GB, and above that I run as root.  I had also tried Map for the UHP case,
              and that Map failed just like the one for AHP.  Map simply fails if the buffer to
              be mapped is too large (> ~1.2GB).  As I said in my last post, I tried Mapping each
              separate piece (<= 0.25GB) of DATA_S to get the pointer for Read/Write, and that
              worked as before with DATA_S < 1.23GB but failed with map error -12 for larger
              DATA_S.  So the Map failure seems to be triggered by the size of the buffer being
              mapped rather than the size of the region of that buffer that is Mapped.  The UHP
              case runs at the same rate whether or not it is prepinned and whether or not it is
              Mapped.  Does Windows do any better?

              • Re: streaming large datasets through the GPU
                german Newbie

                You don't have to call mlock.  The base driver will lock memory when the UHP
                allocation is created.  mlock has nothing to do with clEnqueueMapBuffer().

                That's correct: there is a limit on the allocated AHP/UHP size in the Linux base
                driver.  The pools have to be preallocated during boot.  As far as I heard, the
                limitation comes from the Linux kernel and has to be worked around.  Windows
                should allow half of system memory for AHP/UHP allocations.  The reason it works
                without clEnqueueMapBuffer is that the runtime has deferred memory allocations:
                basically clCreateBuffer does nothing and the runtime allocates memory on first
                access, so when you call clEnqueueMapBuffer the actual allocation occurs (the
                error code can be fixed).  Without a clEnqueueMapBuffer call the runtime doesn't
                know that the pointer passed to read/write buffer is a UHP allocation, so it will
                pin system memory in small chunks and perform multiple transfers.  There are
                optimizations in the runtime that try to hide the pinning cost, but performance
                may vary depending on the CPU speed and OS.  Currently the pinning cost in Linux
                is quite a bit higher than in Windows, and in general it's much less efficient
                than prepinning (the clEnqueueMapBuffer call).  In Windows with prepinning the
                performance is identical; I ran with smaller buffers (my systems don't have 32GB
                of RAM).
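
                In other words, something like the following (illustrative only; the buffer and
                queue names are placeholders, error handling reduced to a single check) forces
                the deferred allocation right away, so a failure shows up at startup rather than
                at first use:

                    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                                size_S, NULL, &status);
                    /* first access: this is where the pinned allocation really happens */
                    void* p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                                 0, size_S, 0, NULL, NULL, &status);
                    if( status != CL_SUCCESS )  printf("allocation/map failed: %d\n", status);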

                 

                Please note: there are more limitations with big allocations.  Currently no buffer
                allocation (AHP/UHP/Global) can exceed a 4GB address space, but only if it is used
                in kernels.  The runtime can work with >4GB AHP/UHP allocations for data
                upload/download, because those transfers are done with the DMA engines and don't
                require a single address space.

                • Re: streaming large datasets through the GPU
                  bobrog Newbie

                  German:

                   

                  OK, I think I understand most of your response.  I will look into the Linux
                  kernel/boot pool issue.  The remaining mystery is why, even with DATA_S small
                  enough that Map does not fail in either the AHP or UHP case, the UHP setup
                  (with Mapped DATA_S) does not run as fast as AHP (~7GB/sec vs ~16GB/sec).

                   

                  AHP:
                      DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
                      Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );

                  UHP:
                      Data_s = (float*) valloc( size_S );
                      DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);

                      Data_sx = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );
                      status |= clEnqueueUnmapMemObject(Queue_1, DATA_S, Data_sx, 0, NULL, NULL );

                   

                  Your reply indicated that Windows, using calls as above, runs both AHP and UHP
                  at the same high rate.  Have you also tried this on a Linux system?

                  • Re: streaming large datasets through the GPU
                    german Newbie

                    There are keys for the Linux base driver to increase the pools.  I don't
                    recall them, and I don't know whether they are publicly available.

                    You have to remove the clEnqueueUnmapMemObject call after the map.  As soon
                    as you call unmap, the runtime no longer considers the UHP buffer a prepinned
                    allocation (no CPU access from the app).  Call unmap at the end, before the
                    memory is released.  Basically, as I mentioned before, AHP and UHP have the
                    same behavior in the runtime.

                    BTW, the runtime guarantees (Data_s == Data_sx) for UHP.

                    • Re: streaming large datasets through the GPU
                      bobrog Newbie

                      German:

                       

                      Yes, I had checked that Data_s == Data_sx, but removing the Unmap still
                      gives the lower ~7 GB/sec rate.  If we can get UHP up to AHP speed, I might
                      be able to get around the Map size limit by:

                      1) getting Data_s from valloc or equivalent (page aligned)
                      2) forming other pointers from Data_s (e.g. p1, p2, ... one for each
                         Read/Write transfer)
                      3) CreateBuffer( UHP ) a small buffer for each (an array of CPU buffers)
                      4) Mapping each one (small size, ~0.25 GB) just before the Read/Write
                      5) Read/Write
                      6) Unmapping the small buffer

                      Silly idea?  A rough sketch of what I mean follows below.
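
                      Something like this (an illustration only; NPIECE, piece_size, Queue_w and
                      the kernel step are placeholders, error checks omitted):

                          float*  Data_s = (float*) valloc( size_S );           /* 1) big host array    */
                          cl_mem  Piece[NPIECE];                                /* 3) array of UHP bufs */
                          for( int i = 0; i < NPIECE; i++ )
                              Piece[i] = clCreateBuffer(context,
                                             CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, piece_size,
                                             (char*)Data_s + (size_t)i*piece_size,  /* 2) per-piece ptr */
                                             &status);

                          for( int i = 0; i < NPIECE; i++ ) {
                              void* p = clEnqueueMapBuffer(Queue_w, Piece[i], CL_TRUE,       /* 4) prepin   */
                                            CL_MAP_READ|CL_MAP_WRITE, 0, piece_size, 0, NULL, NULL, &status);
                              clEnqueueWriteBuffer(Queue_w, Buffer_s1, CL_FALSE, 0,          /* 5) transfer */
                                            piece_size, p, 0, NULL, NULL);
                              /* ... kernel and ReadBuffer for this piece ... */
                              clEnqueueUnmapMemObject(Queue_w, Piece[i], p, 0, NULL, NULL);  /* 6) unmap    */
                          }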

                      • Re: streaming large datasets through the GPU
                        german Newbie

                        I still have to run your test under Linux; I didn't have time yet.
                        Basically I forgot about an extra limitation under Linux: the cacheable
                        pool is much smaller than the USWC pool, and UHP allocations go to the
                        cacheable pool.  Personally I don't see any reason to limit UHP
                        allocations to any pool, but that's how the memory manager under Linux
                        works.  Windows also has limitations, but at a much bigger size.  Anyway,
                        try reducing the UHP allocations to 128MB to see if you get 16GB/s.  In
                        case of a UHP allocation failure the runtime may disable zero-copy if
                        necessary, so some tests could still work; that may explain your numbers.
                        The pool size limitation under Linux can be fixed in the future, but I
                        don't know the time frame.

                        Your pseudo code isn't optimal and will introduce bubbles between CPU/GPU
                        executions.  Any UHP allocation requires memory pinning.  Pinning involves
                        GPU page table updates, so GPU stalls are possible.  I believe under
                        Windows the VidMM scheduling thread will disable any submissions during
                        that operation, and I doubt Linux will be any more optimal than that.  To
                        be honest, I'm not sure there is an optimal solution to bypass the UHP
                        size limit, which shouldn't really exist in the first place.  Then again,
                        it depends on the system configuration and the amount of memory requested
                        for pinning.

                        I would suggest implementing a double copy to see if you can get better
                        performance overall, running the CPU copy asynchronously with the GPU
                        transfers (a rough sketch follows below).  Otherwise I think your new code
                        won't be any faster than the current 7GB/s, just without the size
                        "limits".
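
                        Roughly like this (an illustration only; chunk, nchunk, big_array, the
                        Buffer_s1 handling and the missing kernel/read-back side are placeholders,
                        error checks omitted).  Two small pinned AHP staging buffers are mapped
                        once, and the CPU copies the next chunk into one of them while the DMA
                        transfer of the previous chunk is still in flight:

                            cl_mem  stage[2];
                            float*  sptr[2];
                            for( int b = 0; b < 2; b++ ) {
                                stage[b] = clCreateBuffer(context,
                                               CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                               chunk, NULL, &status);
                                sptr[b]  = (float*) clEnqueueMapBuffer(queue, stage[b], CL_TRUE,
                                               CL_MAP_READ|CL_MAP_WRITE, 0, chunk,
                                               0, NULL, NULL, &status);
                            }

                            cl_event done[2] = { NULL, NULL };
                            for( int i = 0; i < nchunk; i++ ) {
                                int b = i % 2;
                                /* wait until this staging buffer's previous DMA has finished */
                                if( done[b] ) { clWaitForEvents(1, &done[b]); clReleaseEvent(done[b]); }
                                /* CPU copy from the big pageable array into pinned staging ...  */
                                memcpy(sptr[b], big_array + (size_t)i*chunk/sizeof(float), chunk);
                                /* ... overlapping with the previous chunk's transfer on the GPU */
                                clEnqueueWriteBuffer(queue, Buffer_s1, CL_FALSE, 0, chunk,
                                                     sptr[b], 0, NULL, &done[b]);
                                clFlush(queue);
                            }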

                        • Re: streaming large datasets through the GPU
                          bobrog Newbie

                          German:

                          You are correct ... reducing DATA_S to 128MB has both AHP and UHP
                          running at ~12GB/sec, probably lower than 16GB/sec because of the
                          overhead relative to the smaller transfers.  I will try my silly idea
                          just to see what happens.  I am willing to reserve a large pool in
                          physical memory at boot, but I do not know how to configure it so that
                          UHP will recognize it.  For now, since I have a workable, if slow, UHP
                          method, I will proceed to the more interesting job of the kernels.  If
                          you have any further thoughts on this, let me know ... and thanks for
                          your help.
