7 Replies · Latest reply: May 22, 2013 4:50 AM by kd2

Asynchronous DMA  + Kernel Execution using AMD GPUs

himanshu.gautam Master

Hi all,

We have recently worked on this code to showcase asynchronous DMA + kernel execution on AMD GPUs. Please go through it and give us your feedback. We hope it helps many developers and students achieve better performance on AMD hardware.

Courtesy: Andryeyev German

 


Regards

Himanshu, Bruhaspati

--------------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
    himanshu.gautam Master

    Very good work AMD!

    Regards

    Himanshu, Bruhaspati


  • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
    kd2 Newbie

    Many thanks for this. For me, the main point is to use two command queues simultaneously.

     

    But running this code, it reports that I get at most 4 GB/s of total throughput. I believe that number, and it shows I'm not getting the most out of the hardware. If the graphics card has GDDR5, my system has PC3-8500 RAM, and they're connected by a x16 PCIe 2.0 link, shouldn't this code give throughput closer to 8 GB/s (and double that if the code can be changed so that the reads and writes are pinned on different memory sticks)?

     

    Part of the slowdown may be that even with this program pushing half a gigabyte of memory back and forth, the card (a Tahiti) doesn't seem to want to ramp up to the full 16 PCIe lanes; it seems stuck at 8. For example, during this AsyncDMA program's run, aticonfig still shows 8 lanes being utilized...

     

    # aticonfig --pplib-cmd "get activity"

    Current Activity is Core Clock: 950MHZ

    Memory Clock: 1425MHZ

    VDDC: 1170

    Activity: 53 percent

    Performance Level: 2

    Bus Speed: 5000

    Bus Lanes: 8

    Maximum Bus Lanes: 16

     

    (and if you're wondering, I do have the system bios configured so that 16 lanes go directly to this card's PCI slot)

    # lspci -vv -s 04:00.0 | grep LnkCap

                    LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us

     

    Is there any way to use the full 16 lanes in a program such as this?

    • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
      himanshu.gautam Master

      This looks more like an OS issue. Can you check performance under Windows?

      Regards

      Himanshu, Bruhaspati


      • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
        kd2 Newbie

        You're absolutely right. It turned out that even though my BIOS allowed me to select x16, the PCIe riser card in the machine was only x8-capable. I swapped in a true x16 riser, and the AsyncDMA program now reports the 8.0 GB/s I had expected in the allocated-host-pointer case (so the system's RAM is now my bottleneck):

         

        Write/Read operation 2 queue; profiling disabled using AHP: 8.01996 GB/s

        ----------- Time frame 16569.652 (us), scale 1:207

        BufferWrite - W>; KernelExecution - X#; BufferRead - R<;

        CommandQueue #0

        <<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<

        CommandQueue #1

        ------W>>>>>>>>>>>>------W>>>>>>>>>>>>------W>>>>>>>>>>>-------W>>>>>>>>>>>>----

        Write/Read operation 2 queue; profiling enabled using AHP: 7.76999 GB/s

         

        # aticonfig --pplib-cmd "get activity"

        Current Activity is Core Clock: 950MHZ

        Memory Clock: 1425MHZ

        VDDC: 1170

        Activity: 63 percent

        Performance Level: 2

        Bus Speed: 5000

        Bus Lanes: 16

        Maximum Bus Lanes: 16

        • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
          sajis997 Newbie

          Hi forum,

           

          I am going through the attached source code, and the following snippet is not clear to me. Inside ProfileQueue::findMinMax(...) you are calculating the times taken by each operation (read, write or kernel execution):

           

          [code]
          clGetEventProfilingInfo(events_[op][0], CL_PROFILING_COMMAND_START,
                                  sizeof(cl_long), &time, NULL);
          if (0 == *min_time)
          {
              *min_time = time;
          }
          else
          {
              *min_time = std::min<cl_long>(*min_time, time);
          }

          clGetEventProfilingInfo(events_[op][events_[op].size() - 1],
                                  CL_PROFILING_COMMAND_END, sizeof(cl_long), &time, NULL);
          if (0 == *max_time)
          {
              *max_time = time;
          }
          else
          {
              *max_time = std::max<cl_long>(*max_time, time);
          }
          [/code]

           

           

          As you can see, you are considering two separate events; should it not be the same event?

          Some explanation would be helpful.

           

           

          Regards

          Sajjad

          • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
            himanshu.gautam Master

            events_ is a 2D array of cl_events. findMinMax needs to find the total time taken by all the events together across all command queues, so we take the min of the START time of the first command in each queue (events_[op][0]) and the max of the END time of the last command in each queue (events_[op][events_[op].size() - 1]).

            Regards

            Himanshu, Bruhaspati


          • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
            kd2 Newbie

            My two cents: don't get too tied up with the profiling setup and the display of profiling results in the original code. The profiling was not the major point of the code, but it takes a lot of effort to get past it. The point is that on GCN chips it is really simple to use three things asynchronously at once: the two DMA engines and kernel execution. I had to strip the code down before I understood that there's not much to it -- I'll attach the result.

             

            For example,

            ./a.out

            ...

            TEST use_kernel 1, n_bufs 3

            Write/Kernel/Read  3 queues,  ALLOC_HOST,  6.34 GB/s.

               0: W   0.2-  2.6  2.5 ms, X   8.5-  9.3  0.7 ms, R   9.5- 12.3  2.8 ms,

               1: W   2.6-  5.1  2.4 ms, X   9.4- 10.1  0.7 ms, R  12.4- 17.5  5.1 ms,

               2: W   5.1-  7.6  2.6 ms, X  10.3- 11.0  0.7 ms, R  18.0- 22.7  4.8 ms,

               3: W  12.4- 17.5  5.1 ms, X  18.2- 19.0  0.7 ms, R  22.8- 27.9  5.1 ms,

               4: W  17.6- 22.7  5.1 ms, X  22.9- 23.7  0.7 ms, R  27.9- 33.0  5.0 ms,

            ...

            The printout is the time in milliseconds of each Write, eXecution and Read (measured from the time of the first event's queueing). So, for example, using 3 command queues, you can see that while one DMA engine is reading back the results of dataset #2 (18.0-22.7 ms), the kernel is executing on dataset #3 (18.2-19.0 ms), and the second DMA engine is writing dataset #4 (17.6-22.7 ms).
