8 Replies Latest reply: Oct 4, 2012 11:40 PM by Martin Nilsson

unified memory in Trinity?

dominik_g Newbie

I have a question about the memory model on AMD Fusion devices:

As far as I understand, on Llano the CPU and GPU work on separate areas of the same physical memory. So a copy is still needed if the CPU works on a buffer first and then the GPU works on the same buffer. Is that right?

 

Has this situation changed with AMD Trinity? On Intel's Ivy Bridge platform it seems that copying data is no longer necessary. As far as I know, you can simply create a buffer in the CPU-GPU context which can then be accessed by both the CPU and the GPU device without being copied. Is this similar on Trinity?

  • Re: unified memory in Trinity?
    tzachi.cohen Moderator

    Resources created with 'CL_MEM_ALLOC_HOST_PTR' are accessed directly by the GPU and CPU with no copy in between.

    Tzachi Cohen

    Advanced Micro Devices Inc.

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.


    • Re: unified memory in Trinity?
      dominik_g Newbie

      Thanks for your reply! Is this a new feature of Trinity or has this already been the case with Llano?

      • Re: unified memory in Trinity?
        tzachi.cohen Moderator

        This feature is relevant to all platforms supporting zero copy: APUs and discrete GPUs.

        Discrete GPU access to host memory is slower than on APUs, since it passes through the PCIe bus.

        Tzachi Cohen

        Advanced Micro Devices Inc.



        • Re: unified memory in Trinity?
          dominik_g Newbie

          What about the performance on the APUs? I thought on Llano there's still some overhead when the buffer is not explicitly copied to GPU memory. Is that right? If so, has that situation changed on Trinity?

          • Re: unified memory in Trinity?
            jross Newbie

            This is the state of Llano found on page 35 of the presentation Memory System on Fusion APUs:

            http://developer.amd.com/afds/assets/presentations/1004_final.pdf

             

            Llano Memory State   Local      Uncached     Cacheable
            GPU Read             17 GB/s    6-12 GB/s    4.5 GB/s
            GPU Write            12 GB/s    6-12 GB/s    5.5 GB/s
            CPU Read             < 1 GB/s   < 1 GB/s     8-13 GB/s
            CPU Write            8 GB/s     8-13 GB/s    8-13 GB/s

             

            What a programmer would like is a state that has full bi-directional bandwidth performance for both the CPU and GPU so the programmer isn't constantly worried about this performance table and which flags they need to use on buffer allocation.

             

            I don't think dominik_g's question was fully answered. Dominik_g is correct that there is a penalty on Llano if the data is not copied from CPU Cacheable memory to GPU Local memory.

             

            Does Trinity perform differently?  Is there another chart like the one above?
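
            To make that penalty concrete, here is a rough back-of-the-envelope sketch using only the Llano figures from the table above. It assumes a purely bandwidth-bound kernel that reads its input a given number of times, and it assumes the one-time copy into Local memory runs at about 8 GB/s (the CPU Write to Local figure). That copy rate, the function names, and the model itself are illustrative assumptions, not something taken from the presentation.

```python
# Llano bandwidths in GB/s, taken from the table in this post.
GPU_READ_LOCAL = 17.0       # GPU reading GPU Local memory
GPU_READ_CACHEABLE = 4.5    # GPU reading CPU Cacheable memory (zero copy)

# Assumption: a CPU-driven copy into Local memory is limited by the
# "CPU Write / Local" entry of the table (8 GB/s). Hypothetical choice.
ASSUMED_COPY_GBPS = 8.0


def zero_copy_time(size_gb, passes):
    """Seconds for the GPU to read the buffer `passes` times in place."""
    return passes * size_gb / GPU_READ_CACHEABLE


def explicit_copy_time(size_gb, passes, copy_gbps=ASSUMED_COPY_GBPS):
    """Seconds to copy the buffer once, then read it `passes` times from Local."""
    return size_gb / copy_gbps + passes * size_gb / GPU_READ_LOCAL


def break_even_passes(copy_gbps=ASSUMED_COPY_GBPS):
    """Number of read passes at which both strategies take the same time."""
    # Solve passes / 4.5 = 1 / copy_gbps + passes / 17 for passes.
    return (1.0 / copy_gbps) / (1.0 / GPU_READ_CACHEABLE - 1.0 / GPU_READ_LOCAL)


if __name__ == "__main__":
    size_gb = 1.0
    for passes in (1, 2, 4):
        print(passes,
              round(zero_copy_time(size_gb, passes), 3),
              round(explicit_copy_time(size_gb, passes), 3))
    print("break-even passes:", round(break_even_passes(), 3))
```

            Under these assumptions the explicit copy already wins for a single pass over the data. The model ignores latency, caching, write traffic, and any overlap of copy and compute, so it only illustrates the shape of the trade-off, not real Llano or Trinity behaviour.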

          • Re: unified memory in Trinity?
            jross Newbie

            In the (attached) presentation "Assessing the relevance of APU for high performance scientific computing" from AFDS12, all of the benchmarks listed for Trinity still use the same memory system found in Llano.

             

            The highest performance option for the benchmarks is to explicitly copy input data from "CPU memory" to "GPU memory" and then copy the output data from "GPU memory" to "CPU memory".  This appears to be no different than what is done with discrete GPUs.

             

            So it appears that nothing has changed between Llano and Trinity.  What a disappointment.

            • Re: unified memory in Trinity?
              cadorino Newbie

              I'm currently running some benchmarks (matrix addition, multiplication, reduction, convolution) using an A8 APU and a 5870 GPU.
              I measure the completion time while varying the device and the allocation strategy (ALLOC_HOST, USE_PERSISTENT_MEM, no flags, ...), where completion time includes allocating and initializing the input, executing the kernel and retrieving the output.
              I determined that for both the discrete and the integrated GPU, zero-copy input allocation on the host (ALLOC_HOST | READ_ONLY) or zero-copy in the visible device memory leads to the best performance.

               

              I think that the best allocation/data-transfer strategy depends strictly on the memory access patterns of the kernel and of the host.

    • Re: unified memory in Trinity?
      Martin Nilsson Newbie

      tzachi.cohen wrote:

       

      Resources created with 'CL_MEM_ALLOC_HOST_PTR' are accessed directly by the GPU and CPU with no copy in between.

       

      When you say CPU, is that limited to accesses through an OpenCL kernel running on the CPU device, or is the zero copy also true when accessing the same memory from C/C++ host code after a clEnqueueMapBuffer call?
