43 Replies Latest reply: Feb 6, 2014 2:00 PM by ISR RAS RSS

OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).

ISR RAS Newbie
Currently Being Moderated

Pavel Bogdanov, Institute of System Research Russian Academy of Sciences (NIISI), bogdanov@niisi.msk.ru

 

INTRO

 

Nowadays heterogeneous computing becomes more and more popular. In november 2011 three of top5 supercomputers had hybrid architecture. In the last version (june 2012) of this list they were pushed out by computers with new IBM BlueGene/Q architecture, built with standard multicore processors. But according to experts, it is impossible to construct an exaflop-class supercomputer using only multicore CPUs, so these computers may have hybrid architecture.

Modern hybrid machines are built with massively-parallel accelerators, such as graphic cards (GPGPU). Intel MIC and others. How can these devices be programmed? At the moment there are two stable programming models: CUDA and OpenCL. CUDA is an NVidia proprietary product and can be used only with NVidia hardware. OpenCL is an open model, it is supported by a number of hardware vendors such as AMD, Intel, IBM, including NVidia, and the code written using this model is highly portable between these devices.

Now the next question: how to use these powerful massively-parallel devices to accelerate existimg MPI/OpenMP codes? The simple approach — to use optimized for these devices numerical libraries — will not work: the amount of changes to the existing code will be large, and it is as difficult as entirely rewrite the code.

One of the possible ways out of this problem is to create a special infrastructure (possibly made as a system-level service), which would on the one side control the devices and regulate access to them and on the other provide a comfortable API for the developer. The main concept of such infrastructure is to minimize the changes which must be made to an existing MPI/OpenMP code to work with OpenCL model.

Our work has two main purposes: to create such infrastructure for heterogeneous computing and to accumulate numerical libraries with standard interfaces, written with this infrastructure. The priorities are: the best performance, scalability to all devices in hybrid machine, easy coding and automatic transfer-overlap when possible.

Further, we'll describe the first version of the infrastructue and three model methods of linear algebra: DGEMM — general dense matrix multiplication, DTRSM — a triangle system of linear equiations solver, and DGETRF — LU-decomposition.

 

SCHEDULER

 

Concept

We introduce two definitions: register and instruction.

Register is an arbitrary linear region in a device global memory. So, the global memory is divided into several registers, their number and size are bounded only with total amount of memory and some OpenCL restrictions (such as maximum allocation size and etc).

Instruction is an arbitrary compute kernel on a device. The whole device can be interpreted as a virtual machine which runs instructions on registers.

The devices are operated by a special scheduler. Scheduler initializes the available resources, and provides a developer with following abilities: to control the registers, to transfer data between host and devices, and to launch compute kernels on devices. These abilities are used by the means of  control commands. Commands and dependencies between them form a dependency graph, which is a program for a scheduler.

At the moment, the scheduler works statically. It means, we firstly create the whole program, then execute it.

 

Commands

There are three types of commands: data transfer, kernel execution and auxiliary commands.

LD command reads a rectangular region from host memory and sends it into register. At the moment, it stores the lines sequentally into a small buffer and sends it to device via OpenCL call. If the device is connectd to a PCI-Express slot, memcpy and transfer are performed in parallel. LD command receives as arguments a register and region parameters.

ST command stores data from register to a rectangle region. The algorithm and arguments are the same as with LD command.

EXEC command (an instruction in our terminology) executes an arbitrary compute kernel on a device. It takes registers and scalar control parameters as arguments.

Auxiliary commands are presented with MARK (marker) and BAR (barrier) commands used for synchronization. MARK is an empty command, from which dependencies can be created, and BAR simply waits for all commands it depends of.

 

Queues

The scheduler consists of three queues: queue of LD commands, queue of ST commands and queue of EXEC commands. Auxiliary commands can be included in every queue. All three queues work in parallel, each queue works in a separate thread, the thread syncronization is performed by standard system-level objects and calls. EXEC queue works strictly in-order, and the next command is not launched until previous command is finished. LD and ST work asyncronously: commands are launched in order they were added, but the next command may start before previous finishes.

 

TESTING MACHINE

 

Tests were launched in Institute of System Resarch Russian Academy of Science (NIISI) on the following stand:

CPU               2 x AMD Opteron 6176

Memory          64 Gb DDR3

GPU               3 x AMD Radeon 6990

Operating system     Ubuntu Linux, Windows 7

Video driver           AMD Catalyst 12.3

 

DGEMM

 

Task

Calculate C = alpha*A*B + beta*C, A, B, C — general dense matrices of an arbitrary size.

 

Logic

We use a standard blocking algorithm with two levels of blocking. In the host memory matrices are cut into big blocks (in regular case 5184x5184 elems in double precision, which takes approx. 205 MBytes of memory), which are transferred to a device and there are cut into smaller blocks. One thread    computes a 6x6 block of the result matrix. The blocks on the host are taken consequentally.

When there are several devices, we divide rows of resulting matrix equally between devices (one device multiplies part of rows of A on the whole matrix B), so algorithm is data-parallel.

In irregular case we use padding: fill matrices with additional zeros to get appropriate sizes.

Pics clBLAS A4 - 3.jpg

Scheduler program

Algorithm: first, we should divide matrix C into blocks, then write programs, which calculate each blocks, and then join them into one program. If a block from C fits into one register, a program for a scheduler which compute it, looks like this:

 

LD     C → rc

FOR (k = 0; k < nblocks; k++)

LD     Ak → ra

LD     Bk → rb

EXEC     pad ra → rpa

EXEC     pad rb → rpb

EXEC     dgemm rpa, rpb, rc

END FOR

ST     rc → C

 

Padding depends on LD's, gemm depends on padding. The dependencies on register access are put automatically, EXEC queue is executed consequentally, so no additional dependencies should be done.

If OpenCL driver and a device suppor transfer overlapping, LD commands are hidden behind EXEC dgemm. To hide LD C and ST C we use double-buffering.

So, algorithm requires five registers to work and six to use double buffering.

Pics clBLAS A4 - 2.jpg

 

Computational kernel

One thread computes 6x6 block of C (it multiplies six rows on six columns): in a cycle we load 6x2 block from A and 2x6 block from B into registers and multiply them. LDS is not used. Performance: we have to load 24 doubles and make 72 mads on them, so theoretical peak is 75% of card peak. We achieve 63% of card peak on a kernel.

Pics clBLAS A4 - 1.jpg

 

Tuning

There is a question: how to split matrix C into blocks? The goal was to get as smooth curve of performance as possible. Let's assume that matrix has m rows, and regular block has r rows. We'll make m/r-1 rows of blocks (each containing r rows), and the last (r+m mod r) rows we divide into two block rows with equal size. The columns are splitted in the same way. This partition provides the stable performance dependency from size.

 

Results

Pics clBLAS A4 - 4.jpg

 

DTRSM

 

Task

DTRSM one of triangle systems: op(A)*X = B or X*op(A) = B. Result X is placed into B.

 

Logic

We use the parallel form of the back substitution part of the Gaussian elimination, applied to the blocked matrix. The algorithm is the following:

FOR (k = 0; k < nblocks; k++)

Inverse Akk

Calculate Xk. For example, if DTRSM parameters are LUNN, Xk = Akk^-1 * Bk

if (k != nblocks-1)

Update trailing matrix: Bkk -= Ak*Xk

END FOR

So we see, that DTRSM performance is asymptotically equial to DGEMM performance.

Triangle matrix Akk is inversed in a block way. Firstly, we use a standard algorithm to inverse diagonal 32x32 blocks in-place, and then apply the same method to inverse block matrix.

When we have several cards, we split matrix B equally between cards (data-parallel). All cards do the same matrix inversion, but it is not resource-consuming, so it doesn't affect the result performance.

 

Scheduler program

Akk is inversed on the card, so the original matrix is not changed. Answer Xkl is written on the place of Bkl. In the first cycle we get the inversed block Akk, in the second — part of the answer Xk.

LD     Akk → rpakk

EX     inverse_diag_blocks rpakk

FOR     (l = 0; l < nblocks; l++)

EX     dgemm_step1     rpakk, l

EX     dgemm_step2     rpakk, l

END FOR

FOR     (l = 0; l < npages; l++)

LD     Bl → rbl

EX     pad rbl → rpbl

EX     dgemm rpakk, rpbl, rc

ST     rc → Bl

END FOR

 

Additional dependencies here are not nesessary too: all required dependencies will be set automatically.

If Akk is not the last block, we shoul update trailing matrix via DGEMM.

Pics clBLAS A4 - 5.jpg

 

Computational kernel

Diagonal 32x32 blocks are inversed by a kernel with a standard algorithm (netlib). We use LDS and a «one thread — one row of 32 elements» principle. Other kernels — our dense dgemm.

For triangle matrix multiplication we use optimized for this matrix structure dgemm kernels. From asymptotic point of view, it doesn't matter, which kernel to use, but optimized kernels allow to achieve better performance on smaller matrices.

 

Results

Pics clBLAS A4 - 7.jpg

 

DGETRF

 

Task

DGETRF performs an LU decomposition of matrix A with partial pivoting: A = PLU.

 

Logic

As a model task we implemented a simpliest case of LU decomposition: decomposition without pivoting. Our goal was to investigate, whether our approach can be applied to such problems. This method can be used with the narrow class of matrices, but it's theoretical performance is almost the same as performance of gneral method.

Algorithm can be written as a sequience of standard calls:

 

FOR (k = 0; k < nblocks, k++)

CALL          CPU_DGETRF (Akk)

IF (k != nblocks - 1)

CALL          GPU_DTRSM(L, L, N, U, Akk, A')

CALL          GPU_DTRSM(R, U, N, N, Akk, A1)

CALL          GPU_DGEMM(A', A1, Ak)

END IF

END FOR

 

GPU calls can be done in parallel on several devices, CPU code works consequentially. Overall loss of such use of CPU increases with increasing number of GPUs. Asymptotically, the performance of the call is equal to DGEMM on stripes, but CPU code makes it to converge slowly.

There is a way to hide CPU code behind GPU calculation, but it makes algorithm more complex and is not required for our goal.

Pics clBLAS A4 - 8.jpg

Algorithm with pivoting

An algorithm with string pivoting has two major differences from simple method: DGETRF_CPU is called not on a square region, but on a whole block column, and we should switch rows according to pivoting array. Asymptotically, these operations are much cheaper than DGEMM, so they can be hidden behind it.

 

Results

Pics clBLAS A4 - 9.jpg

 

CONCLUSION

 

In short, we can make several conclusions from our work: firstly, massively-parallel accelerators can be effectively applied for such tasks in mathematical modelling; secondly, infrastructure satisfies the goals, and it can be used to program hybrid nodes.

In future, we plan to develop this infrastructure on one hand, and accumulate numerical libraries written with it on the other. Now we are working on operations with sparse matrices: SpMV and numerical methods using it.

 

 

We send our best regards to AMD, and a few requests:

- Please allow us to allocate 100% of GPU space under linux!

- Please make OpenCL driver work with multi-GPU properly with any number of GPUs (at least, 8 or 16)!

- Please make WriteBufferRect work with transfer overlap!

- Please provide full and correct support of OpenCL 1.2!

 

From Russia with love,

Pavel, Anton.

 

Message was edited by: anton efremov

  • Re: OpenCL programming infrastructure for heterogeneous computing and its applications: multi-GPU DGEMM, DTRSM, DGETRF
    Novice
    Currently Being Moderated

    Thank you for your feedback. I've passed it on to the right people at AMD.

     

    Cheers!

    Kristen

    Check out the open source project Aparapi - an API for Java devs that converts Java bytecode to OpenCL

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

  • Re: OpenCL programming infrastructure for heterogeneous computing and its applications: multi-GPU DGEMM, DTRSM, DGETRF
    ISR RAS Newbie
    Currently Being Moderated

    Hi everybody, a little update. We managed to launch the codes on Radeon 7970 (at the moment, we have three of them), and here are the first results, without additional optimizations.

     

    Pics clBLAS A4 - 10.jpg

    Pics clBLAS A4 - 11(1).jpg

    Pics clBLAS A4 - 12(1).jpg

    A few global remarks about performance:

    1. To cover all sizes and transposition cases, we use padding procedures, and it cost ~1.5% of performance

    2. It seems that transfer overlap is not free of charge. We lose ~3.5% of performance on a kernel when using transfer overlap compared to consequient code. Synthetic tests show the same performance loss.

    3. When we launch kernels consequently, the average kernel performance is significantly lower than in single launches. We are trying to work out why it happens. Expensive NDRange call? Or the card should just relax for some time after a 0.6 secs of hard work?

    4. When using several devices, an overhead (transfer first and last block via PCI-express, which could not be hidden) increases in proportion to the number of devices.

     

    There is a simple recipe how to increase overall performance: to increase gemm kernel performance. A good approach can be found here:

    http://gpgpu.org/2012/07/11/code-generator-for-fast-matrix-multiplication-in-opencl

     

    About scaling. Three 7970 give us 2,77 TFlop/s peak in double precision, which is close to 5 6970 cards, and the reasons why we don't see good scaling on three devices are the same as there: we don't have appropriate amount of work for such computational power (+ multidevice overhead).

     

    From Russia with love,

    Pavel, Anton.

  • Re: OpenCL programming infrastructure for heterogeneous computing and its applications: multi-GPU DGEMM, DTRSM, DGETRF
    ISR RAS Newbie
    Currently Being Moderated

    Hi everybody, we're here again with greetings from windy and cold Mother Russia!

    Agenda: some fresh results with 8 GPUs and a humble request for proper data transfer routines in AMD OpenCL drivers.

     

    amd_dgemm_res.png

    amd_stand.png

    We see that scaling is good up to 7 devices. Now a few words about the curves (why they are not as smooth as they should be) and scaling: we can't achieve an acceptable data transfer speed. We face serious problems with transferring data from host memory (allocated in DDR via operators like malloc, new and etc) to global memory on a device. Now some details.

     

    Data transfer.

    Pseudocode:

    double *src = new double [ 200 MB ]

    cl_mem buf = clCreateBuffer( flags, 200 MB);

    double t1 = gettime();

    [data transfer]

    double time = gettime() - t1;

    speed = 200MB / time  = ~1.5 GB/sec

     

    [data transfer] could be each of 1-5 scenarios from Programming Guide (we also tried several "custom" scenarios - splitting buffer into smaller pieces and etc.) - with the same results.

    One again, we tried APP SDK 2.6, APP SDK 2.7, AMD CATALYST from 12.1 to 12.9 - on all possible configurations for the last year.

     

    More than that, AMD SDK BufferBandwidth example show the same results. For example, assume we make map buffer, memcpy and unmap. We see the following picture:

    memcpy        2.5 GB/sec

    map/unmap    5.7 GB/sec

     

    Now, average speed = size / time = size / (memcpytime + maptime) = size / ( size/memcpyspeed + size/mapspeed) = 1 / (1/memcpyspeed + 1/mapspeed) = mapspeed*memcpyspeed / (mapspeed + memcpyspeed)

    Let's apply it: 2.5*5.7 / (2.5 + 5.7) = 1.7 GB/sec in average.

     

    Transfer Overlap.

    Transfer overlap (kernel execution and PCIe-data transfer at the same time) in all drivers could be achieved only with clEnqueueWriteBuffer OpenCL API call. So, if we want to transfer rect, we have to merge it into a linear region (lots of memcpy's) and then transfer via WriteBuffer. In this case, average is significantly slower than already slow linear transfer.

    If we use other calls (such as clEnqueueReadBuffer), total performance of transfer and kernel execution is equial to performance of consequent launches, so there is no transfer overlap effect.

    Transfer overlap example from SDK does three things: memcpy, PCIe transfer and kernel execution. We tried all launch cases, and we see only memcpy & kernel execution working in parallel. PCIe transfer and kernel execution work consequently.

     

    Summary.

    We are grateful for all advices, but to read Programming Guide and install latest drivers is not the advice we expected to get.

     

    Once again, there are two requests:

    1) FULL PCIe bandwidth on standard OpenCL calls, or, at least, on  scenarios from Programming Guide.

    double *src = new double [ 200 MB ]

    cl_mem buf = clCreateBuffer( flags, 200 MB);

    double t1 = gettime();

    clEnqueueWriteBuffer( flags, 200MB)

    double time = gettime() - t1;

    speed = 200MB / time ~ 5.7 GB/sec     (!!)

     

    2)Transfer overlap working properly(!) with clEnqueueWriteBuffer, clEnqueueReadBuffer, clEnqueueWriteBufferRect, clEnqueueReadBufferRect and/or clEnqueueMapBuffer & clEnqueueUnmapMemObject on full bandwidth. "Properly" means "hide the whole PCIe transfer over kernel execution"

     

    Ideal situation is the following: GPU executes kernels one after another without gaps, and data transfer is completly hidden behind this execution. We are working hard to provide such system, but AMD drivers at the moment are not good enough to support it.

     

    And, as the proper bug report should end with machine configuration, here is ours:

    2 x AMD Opteron 6176 SE

    SuperMicro MNL-H8DG6-F

    64 Gb DDR3

    8 x AMD Radeon 7970

    Ubuntu 12.04 LTS

     

    From Russia with love,

    Pavel, Anton.

  • Re: OpenCL 8 GPU DGEMM. Programming infrastructure for heterogeneous computing and its applications.
    ISR RAS Newbie
    Currently Being Moderated

    Hello from Russia!

    We continue our uphill battle with AMD OpenCL drivers! We struggle to prove that AMD GPU chips can be used in HPC,and there is some light at end of tunnel.

    Tonight in the show:

    - 4,4 TFlops double precision on 8 GPU's: new dgemm results, blocksize and CPU influence

    - clEnqueueWriteBuffer to GPU performance in context with CPU device

    - some thoughts about new Tesla K20

     

    19-09-2012-2.png

    Let's start from new dgemm results. There are some moments i'd like to mention. In short, we use the following algorithm. We cut matrices in the host memory to squares (say, 5760 x 5760), memcpy them to linear pieces, send them to GPU via clEnqueueWriteBuffer, pad them with zeros to proper size, do dgemm and read results via MapBuffer and memcpy's from linear piece on device to square in the host.

    Everything could be much more simple if clEnqueueWriteBufferRect and clEnqueueReadBufferRect worked asynchronously with kernels, even on slow speed...

    So, the first thing is that if we put the CPU (AMD Opteron 6176 SE) to the performance mode ("performance" > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor), we get significantly faster results. This is a little strange effect, because CPU does only memcpy's and some AMD OpenCL driver calls.

    The second thing is about blocking size. We found out that kernels work 2x slower on size 6144x6144. Why this is not good: imagine, we do 48384 (7x7 blocks 6912x6912) on 4 devices. One device gets 48384 / 4 = 12096 rows of a matrix. The program cuts it into two block rows of 6048, and then it is padded to multiple of 192 (to fit workgroup size), resulting in 6144. And there are lots of other examples when this size occur.

     

    dgemm_profiler_1_gpu.png

    dgemm_profiler_8_gpu_1.png

    dgemm_profiler_8_gpu_2.png

    Once again (see previous post), average PCIe speed is low, and to decrease this effect, we have to use bigger block size (n^3 operations vs n^2 data transfer on dgemm) - limited only with 3Gb device memory. We see that enqueueWriteBuffer calls are overlapped with kernels, but sometimes data is not transferred in time. To illustrate this effect, we provide some pictures from profiler, for one device (all good) and 8 devices (transfer is not fast enough to feed 8 GPUs) - 2 slides, didn't fit in one screen. Our simple algorithm doesn't work good on 8 devices, but we see that it can be tuned and optimized.

     

    19-09-2012-1.png

    We accidentally found out why the transfers on 200 Mb were so slow (see previous post). It is an impact of CPU device in a context. If the context contains 8 GPUs, everything works perfectly. If the context contain one GPU and one CPU, we get a very strange transfer rate loss when transfer 192-255Mb buffer to GPU.

     

    A few words about new NVidia chip. This week specifications of new K20 compute processor were released. What do we see: 1.2 TFlop/s double precision with 200GB/s memory bandwidth. In fact, a-yer-ago AMD Radeon 7970 Ghz edition has the same flops and better bandwidth. In codes we do (CFD, quantum chemistry, cryptography and numbers theory, etc.) performance totally depends of memory subsystem speed. So, AMD is one generation forward NVidia in flops and bandwidth.

     

    In conclusion. At the moment, we do dgemm just for fun in spare time (to test the framework we apply in real tasks) and in common GPGPU research purposes. The algorithms are very simple, but they demonstrate that these devices can show brilliant performance if properly supported with good software (drivers and etc).

    Do the good drivers, AMD Team, and HPC sector will be yours!

    From Russia with love,

    Pavel, Anton.

  • Re: OpenCL 8 GPU DGEMM (4,4 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications.
    ISR RAS Newbie
    Currently Being Moderated

    We send our greetings to everyone reading this topic.

     

    Contents for today:

    - Catalyst GPU drivers comparison: transfer speed & overlap, max memalloc and command queues

    - Facts about cl_event management

    - Overlap and multiple command queues

     

    This post was inspired by the following 8 AMD FirePro S10000s machine:

    http://fireuser.com/blog/8_amd_firepro_s10000s_16_gpus_achieve_8_tflops_real_world_double_precision_/

     

    That topic is about the 16 GPU cluster and 8,2 TFlop/s DGEMM launched on it. To achieve such a result, one should solve two major problems: hardware and software. First of all, the hardware should meet rather strict conditions: it should be able to launch operationg system with corrlectly initialized 16 GPU devices and it should provide enough PCIe speed to feed GPUs fast enough to hide all transfers behind calculations. Secondly, the AMD driver (the weakest link in chain) should be good enough to use all these powers and utilize PCIe bus effectively (speed + overlap).

    http://fireuser.com/images/uploads/sc12-quantum.jpg - on this photo you can see the specification of host side: 2 x Intel Xeon Processor 5500/ 5600 Series, which gives us 6 memory channels, ~64 GB/s theoretical peak host memory bandwidth. In real, ~80% of bandwidth can be achieved on linear read/write, it is ~51 GB/s. When transferring data to device, a temporary buffers are used, so we have to read and write from host memory => peak memcpy bandwidth is 51/2=25,5 GB/s. So, if there are 16 GPUs, 25,5/16=1,59 GB/s of host bandwidth per card. It is an ideal and unreachable situation with linear-only read/writes and peaceful GPU's (not fighting for host memory channels).

     

    GPU-PCIE.png

    At the moment, full transfer path (see below)  from DDR cannot be faster than host memcpy, and if you transfer rectangles to several devices simultaneously, it's very difficult to get even this speed.

    On our system with 8 GPUs (2xOpteron 6176 SE, 8 host memory channels, peak 80 GB/s bandwidth), peak bandwidth is 5GB/s per GPU. When launched simultaneously, real bandwidth is ~1.6 GB/s per GPU. So, we make the following conclusion: when we do 16 parallel memcpy's, peak bandwidth will be less than 1,59/2=800 MB/s - our tests confirm it.

    Peak maxalloc on a device on all drivers is 2,7 GB. To do full overlap, double bufferization is required, so it comes for 450 MB per buffer. It gives us square block 7680x7680, which dgemms in 1.723s (on one of 16 GPUs, which give 0.7*750=525 GFlop/s). Data transfer is 3*450=1350 MB (sometimes an extra 450 MB store), which happens in 1,685s with 800 MB/s speed.

    These numbers (1,68 & 1,72) are almost equal in the ideal scenario, when all involved OpenCL API calls give peak performance, overlap and etc. Dear developers (DEVELOPERS, DEVELOPERS, Steve Balmer), make your own conclusions.

     

    We decided to clarify the question, which drivers are capable of doing such hard work. We have a program written in good-old C++, which runs some tests on an OpenCL device, and we have Radeon 7970. We have an src array in host memory, size 200 mb, allocated via malloc, possibly aligned (on these sizes we find no difference between aligned or not aligned memory). The task is to transfer it to device as a linear region or a rectangular (rect is n x n of elements in double precision, where n is max possible n: n*n*sizeof(double) <= 200Mb), and read data from device to src. Rectangular scenarios were built in the similar way to linear ones (1st uses EnqueueWriteBufferRect, 2nd uses EnqueueCopyBufferRect, 3rd and 4th uses Maps and lots of memcpys to copy rect as an array of rows). Speed was measured as a speed of full path including pinning, memcpy (where required) and etc. Memcpy is single-threaded.

     

    A few words about optimal device usage. The maximum performance is achieved in the following case: we load the first block of data, then device launches compute kernels consequently without gaps, data transfers are totally hidden, and then we store results of the last kernel. See diagram below

    dgemm_profiler_1_gpu.png

     

    Thereby, data transfer with 5th scenary isn't good for optimal usage with huge transfers, because we spend too much kernel time for this.

     

    So, here is the table.

     

    PCIE-TRANSFER 1.png

    PCIE-TRANSFER 2.png

     

    PCIE-TRANSFER 3.png

    PCIE-TRANSFER 4.png

    PCIE-TRANSFER 5.png

    What can we see. The 4th scenario passes verification with non-standard usage: we map the buffer on a device ONCE in the beginning and unmap it ONCE in the end. An illustration:

     

    mapped_mem = clEnqueueMapBuffer(persistent_buffer)

    for (i = 0; i < niters; i++) {

    memcpy (mapped_mem, src, 200Mb)

    // after this memcpy call data is somehow ready on a device...

    clEnqueueNDRange(kernel_that_uses_persistend_buffer)

    update(src)

    }

    clUnmapMemObject(persistent_buffer, mapped_mem)

     

    This is the simpliest usage case, and it shows brilliant results (great speed and full overlap), but two problems: non-standard & very small amount of persistent memory (real amount we can use never exceeds 130 MB, even if it is possible to allocate more). If memory is not AMD_PERSISTENT, verification on this case fails. Usage with mapping & unmapping AMD_PERSISTENT memory on each iteration show the same performance as 3rd scenario.

     

    Full overlap is achieved on 3rd scenario in both ways (on low speed) and on 1st scenario on linear regions on high speed.

     

    cl_event_block.png

    The final test we'd like to speak is the max number of available command queues. AMD driver doesn't allow to use more than 50 command queues total! Why it is important: when device is busy with calculations, AMD OpenCL driver manages cl_events in a strange way (see illustration). For example, if we do writing a small buffer and executing a huge compute kernel simultaneously, cl_event associated with writing becomes CL_COMPLETE only when kernel computation completes. So, if we put two transfers in one queue and wish to overlap them with a kernel, only one transfer will actually be overlapped. The second transfer would be done after kernel finishes.

    [diag. cl_event dirung kernel]

     

    As for example, for the toughest case of big DGEMM C <- a*A*B + b*C, we have to do three loads (next parts of matrices A, B, C) and one read (part of resulting C). So, to acheve full overlap we have to use 5 command queues (4 transfers and one kernel execution at the same time), or to merge parts of different host memory and transfer them as one piece, etc.

    In most general case, one has to use at least two queues for transfer overlap (or three if do reading and writing) per device. With our DGEMM algorithm (5 queues per device), we use 5*8=40 queues for 8 devices.

    If one uses 16 devices, max three queues per device can be allocated, and it is really not obvious, how to completly overlap several transfers...

     

    As for a conclusion, a little remark. We see that new drivers work worse and worse. A year ago, when Radeons 7xxx were not presented, we had the same situation with drivers for Radeons 6970 & 6990. And there is an opinion (as we say in Russia) that AMD possibly wants to split graphics and computations: Radeons for graphics only, FirePro for GPGPU. It's a dark side of Force, and what we see in NVidia approves that.

    So, farewell and may the Force be with you.

    From Russia with love,

    Pavel, Anton.

  • Re: OpenCL 8 GPU DGEMM (4,4 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications.
    ISR RAS Newbie
    Currently Being Moderated

    Hi there!

    The winter doesn't seem to end in Russia, so we decided to melt some snow with the AMD GPU heat.

    The plan for today:

    1) New results in linear algebra

    2) Some thoughts about AMD drivers

     

    In dgemm we managed to conquer the new height of 5 TFl/s on 8 devices. Also, the slightly improved algorithm allowed us to scale dgetrf up to 6 devices. Here are the pictures:

    слайд1.png

    слайд2.png

    First, the hardware. We upgraded our stand, so it has two Intel Xeon E5-2670 CPUs and 128 Gb Ram. Host memory bandwidth has a theoretical peak of 100 Gb/s (=> 50 Gb/s on memcpy), and this platform has a PCIe 3.0.

    Second, the NUMA. We found out that the "manual usage" of libnuma library (malloc and thread affinity) can dramatically increase performance. Devices in PCIe slots are controlled by different NUMA nodes. It is important to init and run command queue on the same NUMA node with the device it controls (maybe, because cl_command_queue runs in a separate thread). Also, clEnqueueTransfer calls can work faster if host data and a device belong to one NUMA node.

    So, in memcpy tests with all these optimizations on AMD platform we managed to achieve 13 GB/s on interleaved memory and 20 Gb/s on local (out of 40 Gb/s peak). On Intel platform the numbers are 27 Gb/s on interleaved and 36 Gb/s on local (out of 50 Gb/s peak).

     

    Here is the comparison with Magma results:

    слайд3.png

    слайд4.png

    The second part of the post is about AMD drivers. We are chasing the peak performance, and it can be achieved in one case: when we have kernels on a device running on peak performance without gaps. As we see, all drivers after 12.6 (starting from 12.8 and up to 13.2) are not able to run kernels in a proper way. On the other hand, 12.6 drivers allow transfer overlap on two calls (clEnqueueWriteBuffer and clEnqueueReadBuffer) on PCI Express 2.0, but they don't work with overlap on PCIe 3.0. So, unfortunately, we have to state that at the moment AMD do not have appropriate drivers for a full-fledged work with OpenCL on a modern hardware.

    Our tests show that NVidia GPUs don't have any problems at all, they are just slow...

     

    See you at the GTC 2013!

     

    From Russia with love,

    Pavel,

    Anton.

  • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications
    ISR RAS Newbie
    Currently Being Moderated

    Hi everybody,

    Here is the video with brief overview of our results for the last several months

     

     

    And sorry for that accent...

     

     

     

    From Russia with love,

    Pavel,

    Anton.

  • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications
    dirtrider Newbie
    Currently Being Moderated

    Sorry to bother you...you are clearly way smarter than me...I haven't a clue what you are up to there...but I am having trouble getting more than 4 amd HIS 7950s to be recognized with any drivers including 13.3 beta and was hoping you might have something to share based on an admin telling me to hit you up.  Thanks...

    http://devgurus.amd.com/message/1295428#1295428

  • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL.
    ISR RAS Newbie
    Currently Being Moderated

    Hi everybody,

         this post is devoted to the Heterogeneous HPL test launches. Our code is written in OpenCL and MPI programming models and (in theory) should work on arbitrary distributed systems using all CPUs and accelerators with OpenCL support. The HPL test is a complex benchmark for a distributed system. It is used to form the Top 500 list (http://www.top500.org). Test solves the dense system of linear equations via LU-decomposition, and this is a highly resource-consuming operation with intense communications. As it is said by J. Dongarra, the good HPL results don't guarantee the quality of the system, otherwise bad results almost always indicate problems. Speaking math, Good system => Good HPL results. More info could be found at http://www.netlib.org/benchmark/hpl/

         The main requirement to the test is the unchanging source code. The test uses external libraries (BLAS and MPI), which could be optimized to achieve the best performance. We decided to use standard MPI libraries (OpenMPI and MPICH2 on different systems) and to build custom Heterogeneous BLAS from OpenBLAS (ex-GotoBLAS) and our custom accelerated calls written in OpenCL.

         HPL uses 10 BLAS calls (in double precision):

    • HPL_idamax — index of max abs value
    • HPL_dscal — x = a*x
    • HPL_dswap — swap x and y
    • HPL_dcopy — copy x into y
    • HPL_daxpy — y = a*x + y
    • HPL_dgemv — matrix vector multiply
    • HPL_dger — performs the rank 1 operation A := alpha*x*y' + A
    • HPL_dtrsv — solving triangular matrix problems
    • HPL_dgemm — matrix matrix multiply
    • HPL_dtrsm — solving triangular matrix with multiple right hand sides

         The prevailing part of the work is done in DGEMM and DTRSM calls (more than 95% time), so we decided to build the BLAS library in the following way: DGEMM and DTRSM are hybrid - custom OpenCL DGEMM and DTRSM run on the accelerators, in the same time CPU executes OpenBLAS dgemm & dtrsm, other 8 calls are taken from the OpenBLAS directly. To build Heterogeneous BLAS, we compiled OpenBLAS library, then changed cblas_dgemm and cblas_dtrsm symbols to cpu_dgemm and cpu_dtrsm, then created our own cblas_dgemm and cblas_dtrsm symbols (with usage of cpu_dgemm and cpu_dtrsm) and injected them into OpenBLAS library. So we created the library with standard BLAS interface for HPL, with accelerated DGEMM and DTRSM calls and old cpu dgemm and dtrsm left for internal use. The device initialization and cleanup are done in two calls, cblas_init() and cblas_finalize(), which are added to HPL test near the MPI_Init and MPI_finalize() calls.

         Some details about accelerated calls. OpenCL versions of DGEMM and DTRSM are described higher in the topic. These algorithms are data-parallel, so we can slice the matrix and process two parts simultaneously by CPU and accelerators. For example, on the machine with two Intel Xeon E5-2670 and two Radeons we give 28 threads to the BLAS, and 4 threads are used to maintain the GPU calculations. There are several ways how to split matrix between CPU and accelerators, for example, do the main part on accelerators with best possible performance, and process the rest on CPUs. We decided to make CPUs and accelerators work with the same time. The slice ratios is obtained from performance of the algorithms on these processors. For example, if DGEMM works on CPU with 250 GFlops and on two GPUs with 1300 GFlops, we give 250/(250+1300) ~ 15% of the matrix to the CPU, and 85% to GPUs.

     

    =========================================

    Theoretical peak for one node of K10 supercomputer:

    1 GPU FERMI: 2 Xeon E5-2660 + 1 M2090 = 2 * 2.2Ghz * 8ops * 8cores  +   666 GFlops = 948 GFlops

    2 GPU FERMI: 2 Xeon E5-2660 + 2 M2090 = 2 * 2.2Ghz * 8ops * 8cores  + 2*666 GFlops = 1614 GFlops

    3 GPU FERMI: 2 Xeon E5-2660 + 3 M2090 = 2 * 2.2Ghz * 8ops * 8cores  + 3*666 GFlops = 2280 GFlops

     

    HPL results (efficiency in %) on matrix size 106496:

    1 GPU FERMI: 480 GFlops (50%)

    2 GPU FERMI: 664 GFlops (41%)

    3 GPU FERMI: 810 GFlops (35%)

     

    =========================================

    Theoretical peak for one node of K100 supercomputer:

    1 GPU FERMI: 2 Xeon X5670 + 1 M2050 = 2 * 2.9Ghz * 4ops * 6cores +   550 GFlops = 670 GFlops

    2 GPU FERMI: 2 Xeon X5670 + 2 M2050 = 2 * 2.9Ghz * 4ops * 6cores + 2*550 GFlops = 1240 GFlops

    3 GPU FERMI: 2 Xeon X5670 + 3 M2050 = 2 * 2.9Ghz * 4ops * 6cores + 3*550 GFlops = 1790 GFlops

     

    HPL results (efficiency in %) on matrix size 81920:

    1 GPU FERMI: 295 GFlops (44%)

    2 GPU FERMI: 446 GFlops (36%)

    3 GPU FERMI: 520 GFlops (29%)

     

    =========================================

    Theoretical peak for one node with 2xSandy Bridge E5-2670 + 2xRadeon Tahiti Ghz Ed:

    2 GPU Radeon: 2 Xeon E5-2670 + 2 Radeon 7970 Ghz Ed = 2 * 2.6Ghz * 8ops * 8cores  + 2*1034 GFlops = 2400 GFlops

     

    HPL results (efficiency in %) on matrix size 106496:

    2 GPU Radeon: 890 GFlops (37%)

     

    =========================================

    Theoretical peak for one node with 2xSandy Bridge E5-2670 + 3xGeforce Titan:

    1 GPU TITAN: 2 Xeon E5-2670 + 1 nVidia TITAN = 2 * 2.6Ghz * 8ops * 8cores  +   1500 GFlops = 1833 GFlops

    2 GPU TITAN: 2 Xeon E5-2670 + 2 nVidia TITAN = 2 * 2.6Ghz * 8ops * 8cores  + 2*1500 GFlops = 3333 GFlops

    3 GPU TITAN: 2 Xeon E5-2670 + 3 nVidia TITAN = 2 * 2.6Ghz * 8ops * 8cores  + 3*1500 GFlops = 4833 GFlops

     

    HPL results (efficiency in %) on matrix size 106496:

    1 GPU TITAN: 620 GFlops (34%) (with very weak DGEMM 650 GFlops)

     

    If we had the kernel with performance 1.3TFlops (out of 1.5TFlops, CUDA kernel version from nVidia) on TITAN, we would expect the following performance and efficiency:

    1 GPU TITAN: ~1000 GFlops (55%)

    2 GPU TITAN: ~1433 GFlops (43%)

    3 GPU TITAN: ~1836 GFlops (38%)

     

    In the attachments there are the full logs for several HPL launches on different machines.

     

    And some extras in the end. We didn't manage to insert pictures (how?..), so all files are in attachments.

     

    Here are the kernel codes for MAGMA kernel ported to OpenCL (fermiDgemm_v2_ocl, 650/1500 GFlops on NVidia TITAN)

    __kernel void fermiDgemm_v2_ocl

        (

            __global double *A,

            __global double *B,

            __global double *C,

            double alpha,

            double beta,

            unsigned int m,

            unsigned int n,

            unsigned int k,

            unsigned int lda,

            unsigned int ldb,

            unsigned int ldc,

            unsigned int offsetA,

            unsigned int offsetB,

            unsigned int offsetC

        )

    {

    #define OFFSET 16

    #define BLOCK_SIZE 64

     

            const uint2 lid = (uint2)(get_local_id(0), get_local_id(1));

            const uint  idt = lid.y * BLOCK_SIZE + lid.x;

     

            const uint2 bid = (uint2)(get_group_id(0), get_group_id(1))*BLOCK_SIZE;

            const uint2 tid = (uint2)(idt%OFFSET, idt/OFFSET);

     

            __local double  Ab[BLOCK_SIZE][OFFSET+1];

            __local double  Bb[BLOCK_SIZE][OFFSET+1];

     

            A += offsetA + bid.x +  tid.y * lda            + tid.x;

            B += offsetB +         (tid.y * 4 + bid.y)*ldb + tid.x;

            C += offsetC + bid.x + (tid.y     + bid.y)*ldc + tid.x;

     

            double4 xxA;

            double4 xxB;

     

            double16 Cb = 0;

     

            int k1;

            for(k1=0; k1<k; k1+=OFFSET)

          {

                    xxB.s0 = B[    0];

                xxB.s1 = B[  ldb];

                    xxB.s2 = B[2*ldb];

                    xxB.s3 = B[3*ldb];

     

                    xxA.s0 = A[       0];

                    xxA.s1 = A[  OFFSET];

                    xxA.s2 = A[2*OFFSET];

                    xxA.s3 = A[3*OFFSET];

     

                    A += OFFSET*lda;

                    B += OFFSET;

     

                    barrier(CLK_LOCAL_MEM_FENCE);

                    Ab[tid.x           ][tid.y] = xxA.s0;

                    Ab[tid.x +   OFFSET][tid.y] = xxA.s1;

                    Ab[tid.x + 2*OFFSET][tid.y] = xxA.s2;

                    Ab[tid.x + 3*OFFSET][tid.y] = xxA.s3;

     

                    Bb[tid.y*4         ][tid.x] = xxB.s0;

                    Bb[tid.y*4 + 1     ][tid.x] = xxB.s1;

                    Bb[tid.y*4 + 2     ][tid.x] = xxB.s2;

                    Bb[tid.y*4 + 3     ][tid.x] = xxB.s3;

                    barrier(CLK_LOCAL_MEM_FENCE);

     

                    #pragma unroll

                for( int j1=0;j1<OFFSET;j1++)

                {

                            xxA.s0 = Ab[tid.x           ][j1];

                            xxA.s1 = Ab[tid.x +   OFFSET][j1];

                            xxA.s2 = Ab[tid.x + 2*OFFSET][j1];

                            xxA.s3 = Ab[tid.x + 3*OFFSET][j1];

     

                      xxB.s0 = Bb[tid.y           ][j1];

                            xxB.s1 = Bb[tid.y +   OFFSET][j1];

                            xxB.s2 = Bb[tid.y + 2*OFFSET][j1];

                            xxB.s3 = Bb[tid.y + 3*OFFSET][j1];

     

                    Cb.s0123 = mad((double4)xxA.s0, xxB, Cb.s0123);

                    Cb.s4567 = mad((double4)xxA.s1, xxB, Cb.s4567);

                    Cb.s89ab = mad((double4)xxA.s2, xxB, Cb.s89ab);

                    Cb.scdef = mad((double4)xxA.s3, xxB, Cb.scdef);

                    }

            }

     

            C[       0] = alpha*Cb.s0 + beta * C[       0];

            C[  OFFSET] = alpha*Cb.s4 + beta * C[  OFFSET];

            C[2*OFFSET] = alpha*Cb.s8 + beta * C[2*OFFSET];

            C[3*OFFSET] = alpha*Cb.sc + beta * C[3*OFFSET];

     

            C += ldc*OFFSET;

            C[       0] = alpha*Cb.s1 + beta * C[       0];

            C[  OFFSET] = alpha*Cb.s5 + beta * C[  OFFSET];

            C[2*OFFSET] = alpha*Cb.s9 + beta * C[2*OFFSET];

            C[3*OFFSET] = alpha*Cb.sd + beta * C[3*OFFSET];

     

            C += ldc*OFFSET;

            C[       0] = alpha*Cb.s2 + beta * C[       0];

            C[  OFFSET] = alpha*Cb.s6 + beta * C[  OFFSET];

            C[2*OFFSET] = alpha*Cb.sa + beta * C[2*OFFSET];

            C[3*OFFSET] = alpha*Cb.se + beta * C[3*OFFSET];

     

            C += ldc*OFFSET;

            C[       0] = alpha*Cb.s3 + beta * C[       0];

            C[  OFFSET] = alpha*Cb.s7 + beta * C[  OFFSET];

            C[2*OFFSET] = alpha*Cb.sb + beta * C[2*OFFSET];

            C[3*OFFSET] = alpha*Cb.sf + beta * C[3*OFFSET];

    }

     

    Here is our kernel (gemm_buffer_8x1_1x4, 745/1034 GFlops on Radeon 7970 Ghz Ed):

    #define DATATYPE   double

    #define DATATYPE2  double2

    #define DATATYPE4  double4

    #define DATATYPE8  double8

    #define DATATYPE16 double16

     

    #define DATATYPE_SIZE 8

     

    #define _define(type, name, N1)\

      type name##N1 = 0;

     

    #define _prefetch(mem, type, dst, addr, N1)\

      type dst##N1 = *((mem type*)(addr##N1));

     

    #define _store(mem, type, dst, src, N1)\

      *((mem type*)(dst##N1)) = src##N1;

     

    #define _mul(type, dst,dsuf, src1, ssuf1, src2, N1)\

      dst##N1##dsuf = src1##N1##ssuf1 * src2;

     

    #define _mad(type, dst,dsuf, src1, ssuf1, src2, N1)\

      dst##N1##dsuf = mad((type)src1##N1##ssuf1, src2, dst##N1##dsuf);

     

     

    #define _define2(type, name,N1,N2)\

             _define(type, name,N1)\

             _define(type, name,N2)

    #define _prefetch2(mem, type,dst,addr,N1,N2)\

             _prefetch(mem, type,dst,addr,N1)\

             _prefetch(mem, type,dst,addr,N2)

    #define _store2(mem, type,dst,src,N1,N2)\

             _store(mem, type,dst,src,N1)\

             _store(mem, type,dst,src,N2)

    #define _mul2(type, dst,dsuf,src1,ssuf1,src2,N1,N2) \

             _mul(type, dst,dsuf,src1,ssuf1,src2,N1) \

             _mul(type, dst,dsuf,src1,ssuf1,src2,N2)

    #define _mad2(type, dst,dsuf,src1,ssuf1,src2,N1,N2) \

             _mad(type, dst,dsuf,src1,ssuf1,src2,N1) \

             _mad(type, dst,dsuf,src1,ssuf1,src2,N2)

     

    #define prefetch(mem,type,dst,addr)\

           _prefetch(mem,type,dst,addr,0)

     

    #define define2(type,name)\

           _define2(type,name,0,1)

    #define prefetch2(mem,type,dst,addr)\

           _prefetch2(mem,type,dst,addr,0,1)

    #define store2(mem,type,dst,src)\

           _store2(mem,type,dst,src, 0,1)

    #define mul2(type,dst,dsuf,src1,ssuf1,src2)\

           _mul2(type,dst,dsuf,src1,ssuf1,src2,0,1)

    #define mad2(type,dst,dsuf,src1,ssuf1,src2)\

           _mad2(type,dst,dsuf,src1,ssuf1,src2,0,1)

     

     

    #define define4(type,name)\

            define2(type, name)\

           _define2(type, name,2,3)

    #define prefetch4(mem,type,dst,addr)\

            prefetch2(mem,type,dst,addr)\

           _prefetch2(mem,type,dst,addr,2,3)

    #define store4(mem,type,dst,src)\

            store2(mem,type,dst,src)\

           _store2(mem,type,dst,src,2,3)

    #define mul4(type,sum,suf,vec1,el,vec2)\

            mul2(type,sum,suf,vec1,el,vec2)\

           _mul2(type,sum,suf,vec1,el,vec2,2,3)

    #define mad4(type,sum,suf,vec1,el,vec2)\

            mad2(type,sum,suf,vec1,el,vec2)\

           _mad2(type,sum,suf,vec1,el,vec2,2,3)

     

     

    #define define6(type,name)\

            define4(type, name)\

           _define2(type, name,4,5)

    #define prefetch6(mem,type,dst,addr)\

            prefetch4(mem,type,dst,addr)\

           _prefetch2(mem,type,dst,addr,4,5)

    #define store6(mem,type,dst,src)\

            store4(mem,type,dst,src)\

           _store2(mem,type,dst,src,4,5)

    #define mul6(type,sum,suf,vec1,el,vec2)\

            mul4(type,sum,suf,vec1,el,vec2)\

           _mul2(type,sum,suf,vec1,el,vec2,4,5)

    #define mad6(type,sum,suf,vec1,el,vec2)\

            mad4(type,sum,suf,vec1,el,vec2)\

           _mad2(type,sum,suf,vec1,el,vec2,4,5)

     

     

    #define define8(type,name)\

            define6(type, name)\

           _define2(type, name,6,7)

    #define prefetch8(mem,type,dst,addr)\

            prefetch6(mem,type,dst,addr)\

           _prefetch2(mem,type,dst,addr,6,7)

    #define store8(mem,type,dst,src)\

            store6(mem,type,dst,src)\

           _store2(mem,type,dst,src,6,7)

    #define mul8(type,sum,suf,vec1,el,vec2)\

            mul6(type,sum,suf,vec1,el,vec2)\

           _mul2(type,sum,suf,vec1,el,vec2,6,7)

    #define mad8(type,sum,suf,vec1,el,vec2)\

            mad6(type,sum,suf,vec1,el,vec2)\

           _mad2(type,sum,suf,vec1,el,vec2,6,7)

     

     

     

    #define DEFINE4x2(acc)  define4(DATATYPE2, acc);

     

    #define DEFINE4x4(acc)  define4(DATATYPE4, acc);

     

    #define DEFINE4x8(acc)  define4(DATATYPE8, acc);

     

    #define DEFINE6x4(acc)  define6(DATATYPE4, acc);

     

    #define DEFINE8x4(acc)  define8(DATATYPE4, acc);

     

    #define DEFINE4x6(acc)  define4(DATATYPE4, acc);\

                            define4(DATATYPE2, acc##5);

     

    #define DEFINE6x6(acc)  define6(DATATYPE4, acc);\

                            define6(DATATYPE2, acc##5);

     

     

     

    #define GEMS2(type, alpha, AB, beta, addr) \

      prefetch2(__global, type, C, addr);\

      AB##0 = AB##0 * (type)alpha;\

      AB##1 = AB##1 * (type)alpha;\

      AB##0 = mad((type)beta, type##C0, AB##0);\

      AB##1 = mad((type)beta, type##C1, AB##1);

     

    #define GEMS4(type, alpha, AB, beta, addr) \

      prefetch4(__global, type, type##C, addr);\

      AB##0 *= alpha;\

      AB##1 *= alpha;\

      AB##2 *= alpha;\

      AB##3 *= alpha;\

        AB##0 = mad((type)beta, type##C0, AB##0);\

        AB##1 = mad((type)beta, type##C1, AB##1);\

        AB##2 = mad((type)beta, type##C2, AB##2);\

        AB##3 = mad((type)beta, type##C3, AB##3);

     

    #define GEMS6(type, alpha, AB, beta, addr) \

      prefetch6(__global, type,type##C, addr);\

      AB##0 *= alpha;\

      AB##1 *= alpha;\

      AB##2 *= alpha;\

      AB##3 *= alpha;\

      AB##4 *= alpha;\

      AB##5 *= alpha;\

      AB##0 = mad((type)beta, type##C0, AB##0);\

      AB##1 = mad((type)beta, type##C1, AB##1);\

      AB##2 = mad((type)beta, type##C2, AB##2);\

      AB##3 = mad((type)beta, type##C3, AB##3);\

      AB##4 = mad((type)beta, type##C4, AB##4);\

      AB##5 = mad((type)beta, type##C5, AB##5);

     

    #define GEMS8(type, alpha, AB, beta, addr) \

      prefetch8(__global, type,type##C, addr);\

      AB##0 *= alpha;\

      AB##1 *= alpha;\

      AB##2 *= alpha;\

      AB##3 *= alpha;\

      AB##4 *= alpha;\

      AB##5 *= alpha;\

      AB##6 *= alpha;\

      AB##7 *= alpha;\

      AB##0 = mad((type)beta, type##C0, AB##0);\

      AB##1 = mad((type)beta, type##C1, AB##1);\

      AB##2 = mad((type)beta, type##C2, AB##2);\

      AB##3 = mad((type)beta, type##C3, AB##3);\

      AB##4 = mad((type)beta, type##C4, AB##4);\

      AB##5 = mad((type)beta, type##C5, AB##5);\

      AB##6 = mad((type)beta, type##C6, AB##6);\

      AB##7 = mad((type)beta, type##C7, AB##7);

     

     

    #define GEMS4x2(alpha, AB, beta, addr)  GEMS4(DATATYPE2, alpha, AB, beta, addr)

     

    #define GEMS4x4(alpha, AB, beta, addr)  GEMS4(DATATYPE4, alpha, AB, beta, addr)

     

    #define GEMS4x8(alpha, AB, beta, addr)  GEMS4(DATATYPE8, alpha, AB, beta, addr)

     

    #define GEMS6x4(alpha, AB, beta, addr)  GEMS6(DATATYPE4, alpha, AB, beta, addr)

     

    #define GEMS8x4(alpha, AB, beta, addr)  GEMS8(DATATYPE4, alpha, AB, beta, addr)

     

    #define GEMS4x6(alpha, AB, beta, addr)  GEMS4(DATATYPE4, alpha, AB, beta, addr) \

                                            GEMS4(DATATYPE2, alpha, AB##5, beta, 4*DATATYPE_SIZE+addr)

     

    #define GEMS6x6(alpha, AB, beta, addr)  GEMS6(DATATYPE4, alpha, AB, beta, addr) \

                                            GEMS6(DATATYPE2, alpha, AB##5, beta, 4*DATATYPE_SIZE+addr)

     

     

    #define STORE4x2(dst,src)   store4(global, DATATYPE2, dst, src);

     

    #define STORE4x4(dst,src)   store4(global, DATATYPE4, dst, src);

     

    #define STORE4x8(dst,src)   store4(global, DATATYPE8, dst, src);

     

    #define STORE6x4(dst,src)   store6(global, DATATYPE4, dst, src);

     

    #define STORE8x4(dst,src)   store8(global, DATATYPE4, dst, src);

     

    #define STORE4x6(dst,src)   store4(global, DATATYPE4, dst, src);\

                                store4(global, DATATYPE2, 4*DATATYPE_SIZE + dst, src##5);

     

    #define STORE6x6(dst,src)   store6(global, DATATYPE4, dst, src);\

                                store6(global, DATATYPE2, 4*DATATYPE_SIZE + dst, src##5);

     

     

     

    #define ADDR1(addr,matrix, offset, ld) \

      uint2 addr = (uint2)(uint)matrix + (offset + (uint2)(0,1)*ld)*DATATYPE_SIZE;

     

    #define ADDR2(addr,matrix, offset, ld) \

      uint2 addr = (uint2)(uint)matrix + (offset + (uint2)(0,1)*ld)*DATATYPE_SIZE;

     

    #define ADDR4(addr,matrix, offset, ld) \

      uint4 addr = (uint4)(uint)matrix + (offset + (uint4)(0,1,2,3)*ld)*DATATYPE_SIZE ;

     

    #define ADDR8(addr,matrix, offset, ld) \

      uint8 addr = (uint8)(uint)matrix + (offset + (uint8)(0,1,2,3,4,5,6,7)*ld)*DATATYPE_SIZE;

     

    #define ADDR6(addr,matrix, offset, ld) ADDR8(addr,matrix,offset,ld)

     

     

     

    ////////////////////////////////

    //                            //

    //           GLOBAL           //

    //                            //

    ////////////////////////////////

     

     

    #define gemm_buffer(AY, AX, BY, BX, KERNEL) \

    __kernel void gemm_buffer_##AY##x##AX##_##BY##x##BX##( \

       __global DATATYPE *matrixA,            \

       __global DATATYPE *matrixB,            \

       __global DATATYPE *matrixC,            \

          const DATATYPE alpha,               \

          const DATATYPE beta,                \

          const unsigned int m,               \

          const unsigned int n,               \

          const unsigned int k,               \

          const unsigned int lda,             \

          const unsigned int ldb,             \

          const unsigned int ldc,             \

          const unsigned int offsetA,         \

          const unsigned int offsetB,         \

          const unsigned int offsetC)         \

    {\

      uint2 offset = (uint2)(AX*lda,BY*ldb)*DATATYPE_SIZE;\

      ADDR##AX##(addA,matrixA, offsetA+get_global_id(1)*AY                          ,lda);\

      ADDR##BY##(addB,matrixB, offsetB+get_global_id(0)*BX                          ,ldb);\

      ADDR##BX##(addC,matrixC, offsetC+get_global_id(0)*BX*ldc + get_global_id(1)*AY,ldc);\

      DEFINE##BX##x##AY##(sum);\

      for(int i=0; i<k; i+= BY) { KERNEL##(mad); }\

      GEMS##BX##x##AY##(alpha,sum,beta,addC.s);\

      STORE##BX##x##AY##(addC.s,sum);\

    }

     

     

    //////////////////////////////////////////////

    //                                          //

    //        KERNEL gemm_buffer_8x1_1x4        //

    //                                          //

    //    A - colmaj, B - rowmaj, C - colmaj    //

    //                                          //

    //////////////////////////////////////////////

    #define MMUL_1x8t_1x4(f, A, B, C) \

      C##0  = mad( (DATATYPE8)B##0, A##0 , C##0 );\

      C##1  = mad( (DATATYPE8)B##1, A##0 , C##1 );\

      C##2  = mad( (DATATYPE8)B##2, A##0 , C##2 );\

      C##3  = mad( (DATATYPE8)B##3, A##0 , C##3 );

     

    #define mmul_b1x8t_b1x4(f) \

      prefetch(__global, DATATYPE8, tempA, addA.s);\

      prefetch(__global, DATATYPE4, tempB, addB.s);\

      MMUL_1x8t_1x4(f, tempA, tempB0.s, sum)\

      addA += offset.x;\

      addB += offset.y;

     

    gemm_buffer(8, 1, 1, 4, mmul_b1x8t_b1x4)

    //////////////////////////////////////////////

     

     

     

    //////////////////////////////////////////////

    //                                          //

    //        KERNEL gemm_local_buffer_8x2_2x4  //

    //                                          //

    //    A - colmaj, B - colmaj, C - colmaj    //

    //                                          //

    //////////////////////////////////////////////

    #define MMUL_8x2_2x4(f, A, B, C)\

      C##0 = mad( (DATATYPE8)B##0.s0, A##0, C##0);\

      C##1 = mad( (DATATYPE8)B##1.s0, A##0, C##1);\

      C##2 = mad( (DATATYPE8)B##2.s0, A##0, C##2);\

      C##3 = mad( (DATATYPE8)B##3.s0, A##0, C##3);\

      C##0 = mad( (DATATYPE8)B##0.s1, A##1, C##0);\

      C##1 = mad( (DATATYPE8)B##1.s1, A##1, C##1);\

      C##2 = mad( (DATATYPE8)B##2.s1, A##1, C##2);\

      C##3 = mad( (DATATYPE8)B##3.s1, A##1, C##3);

     

    #define AY 8

    #define AX 2

    #define BY 2

    #define BX 4

     

    #define LWSX 8

    #define LWSY 32

     

    #define FROWS  (BX*LWSX)

    #define FCOLS  (BY*LWSY)

     

    __kernel void gemm_local_buffer_8x2_2x4(

       __global DATATYPE *matrixA,          

       __global DATATYPE *matrixB,          

       __global DATATYPE *matrixC,          

          const DATATYPE alpha,             

          const DATATYPE beta,              

          const unsigned int m,             

          const unsigned int n,             

          const unsigned int k,             

          const unsigned int lda,           

          const unsigned int ldb,           

          const unsigned int ldc,           

          const unsigned int offsetA,       

          const unsigned int offsetB,       

          const unsigned int offsetC)       

    {

      uint2 offset = (uint2)(AX*lda,BY)*DATATYPE_SIZE;

      ADDR8(addA,matrixA, offsetA+get_global_id(1)*AY, lda);

     

      DEFINE4x8(sum);

     

      uint4 addrB = (uint4)(uint)matrixB  +

        (

         offsetB +

         (get_global_id(0)*BX + (uint4)(0,1,2,3))*ldb +

         (get_local_id(1)*BY)

        )*DATATYPE_SIZE;

     

     

      __local double FRAME_B [FROWS][FCOLS+1];

     

      uint2 lid  = (uint2)(get_local_id(0)*BX, get_local_id(1)*BY);

      uint4 lrow = (uint4)(0,1,2,3) + lid.x;

      for(int frame=0; frame<k/FCOLS; frame++)

      {

        double2 tempB0 = *((__global double2*)(addrB.s0));

        double2 tempB1 = *((__global double2*)(addrB.s1));

        double2 tempB2 = *((__global double2*)(addrB.s2));

        double2 tempB3 = *((__global double2*)(addrB.s3));

     

        FRAME_B[lid.x + 0][lid.y + 0] = tempB0.s0; FRAME_B[lid.x + 0][lid.y + 1] = tempB0.s1;

        FRAME_B[lid.x + 1][lid.y + 0] = tempB1.s0; FRAME_B[lid.x + 1][lid.y + 1] = tempB1.s1;

        FRAME_B[lid.x + 2][lid.y + 0] = tempB2.s0; FRAME_B[lid.x + 2][lid.y + 1] = tempB2.s1;

        FRAME_B[lid.x + 3][lid.y + 0] = tempB3.s0; FRAME_B[lid.x + 3][lid.y + 1] = tempB3.s1;

     

        addrB += FCOLS*DATATYPE_SIZE;

     

        barrier(CLK_LOCAL_MEM_FENCE);

     

        uint2 lcol = (uint2)(0,1);

     

        for(int i=0; i<LWSY; i++)

        {

          prefetch2(__global, DATATYPE8, tempA, addA.s);

     

          tempB0.s0 = FRAME_B[lrow.s0][lcol.s0]; tempB0.s1 = FRAME_B[lrow.s0][lcol.s1];

          tempB1.s0 = FRAME_B[lrow.s1][lcol.s0]; tempB1.s1 = FRAME_B[lrow.s1][lcol.s1];

          tempB2.s0 = FRAME_B[lrow.s2][lcol.s0]; tempB2.s1 = FRAME_B[lrow.s2][lcol.s1];

          tempB3.s0 = FRAME_B[lrow.s3][lcol.s0]; tempB3.s1 = FRAME_B[lrow.s3][lcol.s1];

     

          MMUL_8x2_2x4(mad, tempA, tempB, sum);

     

          addA += offset.x;

          lcol += BY;

        }

     

        barrier(CLK_LOCAL_MEM_FENCE);

      }

      ADDR4(addC,matrixC, offsetC+get_global_id(0)*BX*ldc + get_global_id(1)*AY,ldc);

      GEMS4x8(alpha,sum,beta,addC.s);

      STORE4x8(addC.s,sum);

    }

    //////////////////////////////////////////////

     

     

    Here are some codes from Heterogeneous BLAS library

    //    ===============
    //    INIT & FINALIZE
    //    ===============
    
    // add to testing/ptest/HPL_pddriver.c:
    void cblas_init()
    {
        d = new Daemon();
        d->init_all_gpus();
        d->build_programs("clblas_kernels.cl");
        d->print_info();
    }
    
    
    void cblas_finalize()
    {
    #ifdef PRINT_GPUBLAS_PROFILING
        print_gpublas_profiling();
    #endif
        d->finish();
    }
    
    
    //    ===========
    //    CPU THREADS
    //    ===========
    
    
    class CpuDgemmThread : public ppThread {
    
        enum CBLAS_ORDER Order;
        enum CBLAS_TRANSPOSE TransA;
        enum CBLAS_TRANSPOSE TransB;
        blasint M;
        blasint N;
        blasint K;
        double alpha;
        double *A;
        blasint lda;
        double *B;
        blasint ldb;
        double beta;
        double *C;
        blasint ldc;
    
    public:
    
        CpuDgemmThread(enum CBLAS_ORDER Order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANSPOSE TransB, blasint M, blasint N, blasint K,
                double alpha, double *A, blasint lda, double *B, blasint ldb, double beta, double *C, blasint ldc) :
                Order(Order), TransA(TransA), TransB(TransB), M(M), N(N), K(K), alpha(alpha), A(A), lda(lda), B(B), ldb(ldb), beta(beta), C(C), ldc(ldc)
                {}
    
        void Execute() {
                double gflop = 1e-9*2*M*N*K;
                double t2 = gettime();
                openblas_set_num_threads(DGEMM_CPU_THREADS);
                cpu_dgemm(Order, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
                double time = gettime() - t2;
                prof[GPUBLAS_CPU].inc(time, gflop);
        }
    
    };
    
    
    
    //    =====
    //    CALLS
    //    =====
    
    
    void cblas_dgemm(enum CBLAS_ORDER Order, enum CBLAS_TRANSPOSE TransA, enum CBLAS_TRANSPOSE TransB, blasint M, blasint N, blasint K,
                double alpha, double *A, blasint lda, double *B, blasint ldb, double beta, double *C, blasint ldc)
    {
        char        gta    = TransA == CblasNoTrans ? 'n' : 't';
        char        gtb    = TransB == CblasNoTrans ? 'n' : 't';
        size_t    gm    = (size_t) M;
        size_t    gn    = (size_t) N;
        size_t    gk    = (size_t) K;
        size_t    glda    = (size_t) lda;
        size_t    gldb    = (size_t) ldb;
        size_t    gldc    = (size_t) ldc;
    
        if (gm <= 0 || gn <= 0 || gk <= 0)
            return;
    
        double gflop    = 1e-9*2*M*N*K;
    
        double time = 0.0;
    
        if (gflop < MAX_CPU_ONLY_GFLOP_DGEMM) {
            openblas_set_num_threads(MAX_CPU_THREADS);
            double t1 = gettime();
            cpu_dgemm(Order, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
            time = gettime() - t1;
            prof[GPUBLAS_CPU].inc(time, gflop);
        }
        else {
            unsigned int dpc        = gflop < CPU_GPU_GFLOP_MEDIAN ? DGEMM_PART_CPU_S(M) : DGEMM_PART_CPU(M);       
            unsigned int dpg        = gflop < CPU_GPU_GFLOP_MEDIAN ? DGEMM_PART_GPU_S(M) : DGEMM_PART_GPU(M);       
    
            blasint cpu_m        = dpc;
            size_t cpu_offset_a    = dpg * (gta == 'n' ? 1 : lda);
            size_t cpu_offset_c    = dpg;
    
    #ifdef DGEMM_HYBRID
            CpuDgemmThread t(Order, TransA, TransB, cpu_m, N, K, alpha, A + cpu_offset_a, lda, B, ldb, beta, C + cpu_offset_c, ldc);
            t.Start();
    #endif
            double t1 = gettime();
            {
    #ifdef DGEMM_HYBRID
                size_t gpu_m    = gflop < CPU_GPU_GFLOP_MEDIAN ? DGEMM_PART_GPU_S(gm) : DGEMM_PART_GPU(gm);
    #else
                size_t gpu_m    = gm;
    #endif
                double gflop = 1e-9*2*gpu_m*N*K;
                double t2 = gettime();
                clblas::dgemm(d, DGEMM_N_DEV, gta, gtb, gpu_m, gn, gk, alpha, A, glda, B, gldb, beta, C, gldc);
                double time = gettime() - t2;
                prof[GPUBLAS_GPU].inc(time, gflop);
            }
    
    #ifdef DGEMM_HYBRID
            t.Join();
    #endif
            time = gettime() - t1;
        }
        prof[GPUBLAS_DGEMM].inc(time, gflop);
        openblas_set_num_threads(MAX_CPU_THREADS);
    }
    
    
    
    
    
    
    
    
    
    
    

     

     

    dgemm3x3x1.png : here is the dependency graph for scheduler for DGEMM on (m, n, k) = (3, 3, 1) block matrices.
    dgemm_3x3x1_nvidia.png , dgemm_3x3x1_radeon.png : here are the timelines for execution on AMD Radeon and nVidia TITAN.
    The timeline for AMD Radeon 7970 has the spaces between kernels because... if we create the command queue with CL_PROFILING_ENABLE flag, we get the consequent execution with no overlap. The best results on Radeon are achieved on 12.6 drivers with config files from Catalyst 13.15.100.1 drivers. Как говорится, чудеса на виражах .

     

    From Russia with love,

     

    Pavel,
    Anton.

    • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
      himanshu.gautam Master
      Currently Being Moderated

      Anton,

       

      Thanks for this great post invigorating the community here.

       

      In an earlier post,  you had reported 1229 GFLOPs with 2 AMD 7970 GPUs.

      Later you were able to reach 1380 GFLOPs on the same GPUs with a new NUMA system.

       

      Now, you have reported 890GFLOPs after including CPUs to work as well.

       

      As I understand, you program CPU using optimized BLAS code written in C and GPUs using OpenCL. So, This configuration does not include the memory-transfer issues you had reported earlier while including CPU device in the context. btw...Does this problem exist with latest drivers (say 13.4 or 13.6)?

       

      Is the 15% reduction in problem-size contributing to decreased GFLOP output?

      Can you elaborate on your findings here, Please?

       

      Also, Have you tried the latest 13.4 and 13.6 Beta drivers? (or) 12.6 betters both of them in performance??

      Do you have any specific finding on these?

      Like -- Is it the memory-transfer delays in the drivers causing you these problems?

      If you have some benchmark programs on memory transfers / compute -- Will it be possible for you to share it with us?

      I can forward that to the driver team to see to that the numbers improve with the driver revisions.

       

      Many Thanks,

      Bruhaspati

      Regards

      Himanshu , Bruhaspati

      --------------------------------

      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

      • Re: Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
        ISR RAS Newbie
        Currently Being Moderated

        Hi Himanshu,

         

        1229 - performance on the reference Radeons 7970 (925 MHz core clocks)

        1380 - performance of the SAME algorithm on Radeons 7970 GHz ed (with increased frequencies, 1010 MHz core clock)

         

        At the moment we have the same 1380+ GFlops on 2 GPU dgemm, 1550 GFlops on 2 GPU + CPU dgemm (on big enough matrices). So we don't have the decreased output The number 890 is overall HPL performance. It is much lower than peak dgemm performance due to the following reasons:

        1) 1550 on CPU+GPU is the peak performance on huge calls. HPL makes a lot of smaller dgemm and dtrsm calls, where performance is much lower. So, average dgemm for all calls on 2 GPUs + CPU is 1300

        2) DTRSM works slower  than DGEMM

        3) the algorithm uses several other BLAS methods which couldn't be accelerated

        4) the algorithm does some work in internal routines which don't use BLAS calls.

        So, average performance of all these calls is 890 GFlops. It would be much easier to understand these statements if you look at the verbose log with performance of all calls (with matrix sizes), average calls performances and timing statistics at the end of the log, for example:

         

        ================================================================================

        T/V                N    NB    P    Q              Time                Gflops

        --------------------------------------------------------------------------------

        WR00C2L64      106496  4096    1    1            907.09              8.877e+02     ( <<<<----- The Overall HPL performance and time - 907s )

        HPL_pdgesv() start time Wed Jul 31 14:50:40 2013

        HPL_pdgesv() end time  Wed Jul 31 15:05:54 2013

        --------------------------------------------------------------------------------

        ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0046169 ...... PASSED

        ================================================================================

        Finished      1 tests with the following results:

                      1 tests completed and passed residual checks,

                      0 tests completed and failed residual checks,

                      0 tests skipped because of illegal input values.

        --------------------------------------------------------------------------------

        End of Tests.

        ================================================================================

        =============================================

        Name                Time    GFlop    Gflops

        ---------------------------------------------

        cblas_dgemm      601.055  782225.4    1301.4     ( <<<<----- Average,  in the log there are 1550 GFlops values on the huge calls )

        cblas_dtrsm        47.701  22632.0    474.5

        cblas_dgemv        19.012      0.0      0.0

        cblas_dswap        0.000      0.0      0.0

        cblas_dscal        14.830      0.0      0.0

        cblas_dger          0.000      0.0      0.0

        cblas_dcopy        36.520      0.0      0.0

        cblas_daxpy        0.000      0.0      0.0

        cblas_dtrsv        0.429      0.0      0.0

        cblas_idamax        7.148      0.0      0.0

        CPU              617.829  144873.9    234.5

        GPU              563.699  659983.4    1170.8

        Total            726.694  804857.4    1107.6     ( <<<<----- cblas calls work for 726s )

         

        Please notice, that there is a big part of the test (907 - 726 = ~180 s) for some internal HPL routines. In general, the flops on HPL don't increase in linear way when we add new devices. The Amdahl's law comes into play: HPL has the "non-acceleratable" part, so speed improvement caused by 2nd device is lower than by 1st, improvement by 3rd device is lower than by 2nd and etc.

         

        Yes, we create one OpenCL context containing ONLY gpus, the CPU is controlled by optimized C (or asm?) code. There are two reasons for that:

        1) this old problem with transfer speed when a CPU is in the context with gpu...

        2) OpenBLAS has one of the fastest implementations of dgemm and dtrsm

         

        I think we will prepare the brief listing of the main problems in drivers we encountered and post it here soon. In short, all the problems on AMD GPUs are caused by two things: memory transfers and decreasing kernel performance when launched in series (possible CLOCK_GATING?)

         

        From Russia with love,

        Pavel,

        Anton.

        • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
          himanshu.gautam Master
          Currently Being Moderated

          Thanks for your time in clarifying the HPL FLOP calculations. Truly appreciate your  extra effort and time here.

           

          >> I think we will prepare the brief listing of the main problems in drivers we encountered and post it here soon.

          >> In short, all the problems on AMD GPUs are caused by two things: memory transfers and

          >> decreasing kernel performance when launched in series (possible CLOCK_GATING?)

           

          This will be an invaluable feedback to the driver team.

          Also, if you  can send us your benchmark programs (or) repro-cases for the problems you encounter - we can make a strong case.

          In fact, we can benchmark new driver beta releases against this and see how we are progressing from release to release.

           

          Many Thanks,

          Best Regards,

          Bruhaspati

          Regards

          Himanshu , Bruhaspati

          --------------------------------

          The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

    • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
      moosingin3space Newbie
      Currently Being Moderated

      Is there somewhere I could download this build of High-Performance LINPACK from? I'm very interested in running it on my cluster of four 7970s to see how it compares.

       

      Nathan Moos

  • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
    himanshu.gautam Master
    Currently Being Moderated

    >>>> Transfer overlap (kernel execution and PCIe-data transfer at the same time) in

    >>>> all drivers could be achieved only with clEnqueueWriteBuffer OpenCL API call.

    >>>> So, if we want to transfer rect, we have to merge it into a linear region (lots of memcpy's)

    >>>> and then transfer via WriteBuffer. In this case, average is significantly slower than already slow linear transfer.

     

    This is not completely true (anymore?). "clEnqueueCopyBuffer" can also use DMA.

    Please check pre-pinned buffer support in AMD APP Programming Guide.

    Just create a UHP (use host ptr) buffer out of a chunk of application memory (make sure it is aligned to 256 bytes... or better 4096 bytes). AMD Runtime will Pin this memory as it is and then use clEnqueueCopyBuffer() to DMA to transfer to GPU.

    This should work fast and can also overlap with Kernel execution.

     

    Sometime back, we uploaded a sample program showing how to overlap PCIe transfer with kernel execution in the forums.

    Asynchronous DMA  + Kernel Execution using AMD GPUs

    I know you guys are much more knowledgeable than me on this. I am just pointing you guys - may be, it rings a bell!

     

    Best Regards,

    Bruha....

    Regards

    Himanshu , Bruhaspati

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

    • Re: Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
      ISR RAS Newbie
      Currently Being Moderated

      Yes, we use prepinned buffers in all transfers (in addition, it works good on NVidia too). The general model looks like this:

      1) we create a cl_mem prepinned_buffer (ALLOC_HOST_PTR, prepinned_mem)

      2) then do an explicit memcpy(prepinned_memory, host_data_ptr, sizeof(data))

      3) and then call enqWB(dev_buf, prepinned_memory)

       

      The brief list of problems with AMD drivers (under Linux):

      X server

      1) OpenCL doesn't recognize accelerators without active X session

      2) Drivers don't allow to monitor temperatures, clocks and fan control without X

      Kernels

      12.6 - all good with kernels. On the latter drivers:

      1) The work time of heavy kernels on series of consequent launches dramatically increases

      2) There are unclear gaps between kernel launches in series

      3) Sometimes kernel could hang when launched in series

      Positions 1) and 2) result in 10-20% decrease of kernel performance

      5) It would be very good if you improve the compiler! It's a frequent situation when you have to do some magic tricks (manually change instruction order and etc) to achieve the best performance

      Events

      1) The event status of overlapped transfer command changes to CL_COMPLETE only when the kernel ends (we haven't deeply researched this case yet)

      Transfers

      1) The speed of all transfer calls (enqWB, enqRB, enqWBRect, enqRBRect) is unstable and depends on call and transferring data size. For example, even on the same call and the same transfer size there could be up to 25-30% dispersion in speed (from 3,5 to 5 Gb/s on PCI-e 2.0 16x)

      2) enqWBRect and enqRBRect are significantly slower than enqWB and enqRB

      3) When there are several simultaneous independent transfers on a device, the total speed can be lower than single-call speed.

      4) Transfer overlap is not good: it is unstable, it doesn't work on rects correctly and it doesn't go on the full speed

       

      In summary, to provide overlap we have to do some magic tricks with command queues, memcpy rect to linear buffer before transfers and etc.

      It's important that NVidia OpenCL drivers don't have these issues and they work well and stable without ANY additional magic... This fact proves that there is a possibility to overcome these problems.

       

      Here is the illustration decreasing kernel performance on MSI Radeon 7970 1010MHz clock. The kernel - gemm_buffer_8x1_1x4 from the post about HPL. The same picture shows up on all other kernels as well.

      On good old 12.6 drivers:

      yefremov@xeon:~/Software/Casestudy/DMMUL_tABt_Ct$ ./test -qo -l1 -m7168 -n7168 -k7168 -X16 -Y16 -i 50

       

      Number of platforms:                2

        Platform Profile:                 FULL_PROFILE

        Platform Version:                 OpenCL 1.2 AMD-APP (938.1)

        Platform Name:                    AMD Accelerated Parallel Processing

        Platform Vendor:                  Advanced Micro Devices, Inc.

        Platform Extensions:              cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

        Platform Profile:                 FULL_PROFILE

        Platform Version:                 OpenCL 1.2 LINUX

        Platform Name:                    Intel(R) OpenCL

        Platform Vendor:                  Intel(R) Corporation

        Platform Extensions:              cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread

       

      ************************************************

      Platform 0: Advanced Micro Devices, Inc. (default)

          Device 0: Tahiti (default)

          Device 1: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

      Platform 1: Intel(R) Corporation

          Device 0: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (default)

      7168 7168

      ------------------------------------------------

      Series from size m,k,n=(7168,7168,7168) to (7168,7168,7168): (7168,7168,7168)

      gemm_buffer_8x1_1x4: G[1792,896] L[16, 16]:

      ------------------------------------------------

      I   CL(sec)       SYS(sec)       GFlop/sec     

      ------------------------------------------------

      0   1.029643      1.030861       714.69

      1   0.975778      0.976895       754.17

      2   1.030160      1.031280       714.39

      3   1.029737      1.032412       713.61

      4   0.975416      0.978116       753.22

      5   1.030054      1.032763       713.37

      6   1.029759      1.032398       713.62

      7   1.030001      1.032676       713.43

      8   1.030075      1.032808       713.34

      9   1.030351      1.033005       713.20

      10   1.029534      1.032280       713.70

      11   1.030309      1.033047       713.17

      12   1.029799      1.032516       713.54

      13   1.029050      1.031759       714.06

      14   1.030015      1.032758       713.37

      15   1.029978      1.032717       713.40

      16   0.976210      0.978933       752.60

      17   1.030151      1.032783       713.36

      18   1.030121      1.032849       713.31

      19   1.029537      1.032237       713.73

      20   1.029812      1.032512       713.54

      21   1.029975      1.032717       713.40

      22   1.030484      1.033268       713.02

      23   1.029488      1.032227       713.74

      24   0.976048      0.978750       752.74

      25   1.030873      1.033587       712.80

      26   0.976426      0.979358       752.27

      27   1.030297      1.033018       713.19

      28   1.030074      1.032804       713.34

      29   1.030754      1.033611       712.78

      30   1.030456      1.033384       712.94

      31   1.030288      1.032980       713.22

      32   1.029422      1.032131       713.81

      33   1.030312      1.032951       713.24

      34   1.030376      1.033098       713.14

      35   1.030265      1.032955       713.24

      36   0.975986      0.978781       752.71

      37   0.976276      0.979024       752.53

      38   1.029328      1.032046       713.86

      39   1.029848      1.032566       713.51

      40   1.029311      1.032058       713.86

      41   1.030018      1.032753       713.38

      42   1.030446      1.033031       713.18

      43   1.030015      1.032689       713.42

      44   1.029929      1.032612       713.47

      45   1.029971      1.032681       713.43

      46   0.976377      0.979210       752.38

      47   1.030562      1.033290       713.01

      48   1.030109      1.032783       713.36

      49   1.029791      1.033039       713.18

                                                        Avg:719.74    Max:754.17

       

       

      On the latest 13.15.100.1 drivers (the same picture on all drivers after 12.6):

      yefremov@xeon:~/Software/Casestudy/DMMUL_tABt_Ct$ ./test -qo -l1 -m7168 -n7168 -k7168 -X16 -Y16 -i 50

       

      Number of platforms:                2

        Platform Profile:                 FULL_PROFILE

        Platform Version:                 OpenCL 1.2 AMD-APP (1268.1)

        Platform Name:                    AMD Accelerated Parallel Processing

        Platform Vendor:                  Advanced Micro Devices, Inc.

        Platform Extensions:              cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

        Platform Profile:                 FULL_PROFILE

        Platform Version:                 OpenCL 1.2 LINUX

        Platform Name:                    Intel(R) OpenCL

        Platform Vendor:                  Intel(R) Corporation

        Platform Extensions:              cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread

       

      ************************************************

      Platform 0: Advanced Micro Devices, Inc. (default)

          Device 0: Tahiti (default)

          Device 1: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

      Platform 1: Intel(R) Corporation

          Device 0: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (default)

      7168 7168

      ------------------------------------------------

      Series from size m,k,n=(7168,7168,7168) to (7168,7168,7168): (7168,7168,7168)

      gemm_buffer_8x1_1x4: G[1792,896] L[16, 16]:

      ------------------------------------------------

      I   CL(sec)       SYS(sec)       GFlop/sec     

      ------------------------------------------------

      0   0.963114      0.963674       764.51

      1   0.972941      0.973270       756.98

      2   0.978294      0.978627       752.83

      3   0.979179      0.979503       752.16

      4   0.979069      0.979384       752.25

      5   0.984987      0.985306       747.73

      6   0.989396      0.989708       744.40

      7   0.995792      0.996110       739.62

      8   0.998844      0.999122       737.39

      9   0.999154      0.999427       737.16

      10   1.002955      1.003257       734.35

      11   1.005051      1.005349       732.82

      12   1.013999      1.014303       726.35

      13   1.018131      1.018412       723.42

      14   1.022920      1.023218       720.02

      15   1.023512      1.023821       719.60

      16   1.023695      1.023954       719.51

      17   1.027477      1.027771       716.83

      18   1.035647      1.035943       711.18

      19   1.041343      1.041648       707.28

      20   1.049543      1.049815       701.78

      21   1.051407      1.051698       700.53

      22   1.053856      1.054157       698.89

      23   1.053736      1.054176       698.88

      24   1.055031      1.055518       697.99

      25   1.064649      1.064951       691.81

      26   1.071969      1.072408       687.00

      27   1.074647      1.074960       685.37

      28   1.079190      1.079518       682.47

      29   1.087648      1.087979       677.16

      30   1.088860      1.089158       676.43

      31   1.089783      1.090080       675.86

      32   1.089824      1.090552       675.57

      33   1.089499      1.089797       676.04

      34   1.089685      1.089981       675.92

      35   1.098615      1.098891       670.44

      36   1.105597      1.105891       666.20

      37   1.116764      1.117096       659.51

      38   1.118467      1.118761       658.53

      39   1.128375      1.128707       652.73

      40   1.134080      1.134385       649.46

      41   1.136028      1.136333       648.35

      42   1.135480      1.135774       648.67

      43   1.136379      1.136709       648.14

      44   1.137923      1.138285       647.24

      45   1.136555      1.136865       648.05

      46   1.136110      1.136420       648.30

      47   1.135715      1.136063       648.50

      48   1.135781      1.136027       648.52

      49   1.135943      1.136175       648.44

                                                        Avg:695.78    Max:764.51

       

      Notice the Avg performance and the performance distribution:

      1) stable on 12.6 drivers

      2) very good in the beginning on 13.15.100.1 and then decreasing to the lower peak which is ~10% worse than on stable driver.

       

      The first transfer tests on 13.15.100.1 show good results, but we're not ready to say something definitely - they require better testing.

       

      Our testers widely use the internal structures, so I think we'll rewrite some of them in plain OpenCL and post them here in several days.

       

      From Russia with love,

      Pavel,

      Anton.

      • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
        himanshu.gautam Master
        Currently Being Moderated

        Thanks for your feedback on AMD drivers. I will pass it on to AMD.

        Btw...Do these drivers work well on Windows? i.e. Linux-only issues?


        Hi,

        It seems that it is possible to insert images using the "Advanced Editor" (Camera Icon).

        Please see the forum post here.

        How to add an image when replying to a forum post

        Regards

        Himanshu , Bruhaspati

        --------------------------------

        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

      • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
        german Newbie
        Currently Being Moderated

        Events

        1) The event status of overlapped transfer command changes to CL_COMPLETE only when the kernel ends (we haven't deeply researched this case yet)

        Runtime submits the commands to GPU in batches and currently runtime doesn't have an instant feedback from HW when the command is completed, unless OCL profiling is enabled. But even with profiling enabled CL_COMPLETE can be set about at the same time, since it's done by CPU and not GPU. clFlush or a wait for events can help with better granularity, but that will introduce smaller batches - hence lower performance.

         

        Transfers

        1) The speed of all transfer calls (enqWB, enqRB, enqWBRect, enqRBRect) is unstable and depends on call and transferring data size. For example, even on the same call and the same transfer size there could be up to 25-30% dispersion in speed (from 3,5 to 5 Gb/s on PCI-e 2.0 16x)

        We had some changes in the base driver to remove fluctuation on read/write performance under Linux. You have to check the latest driver. Also we will have another generic improvement soon. Basically HW had conservative settings for power management and clock could not always go up on DMA transfers.

         

        2) enqWBRect and enqRBRect are significantly slower than enqWB and enqRB

        This issue should be no longer valid with the latest drivers. However there are restrictions in HW on the rectangle alignments and OCL runtime may fall back to a slower copy path.

         

        3) When there are several simultaneous independent transfers on a device, the total speed can be lower than single-call speed.

        4) Transfer overlap is not good: it is unstable, it doesn't work on rects correctly and it doesn't go on the full speed

        I don't have any explanation to those issues without the sample code. Do you see them with 13.15.100.1?

      • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
        himanshu.gautam Master
        Currently Being Moderated

        Anton,

        CodeXL 1.2 should be able to show you overlapped memory transfers visually. Can you check, if that helps?

        Regards

        Himanshu , Bruhaspati

        --------------------------------

        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
    ISR RAS Newbie
    Currently Being Moderated

    Hi everybody,

    here are some testing codes that we use to measure drivers. In addition, we provide some explanations and brief summary about the latest drivers.

     

    The project consists of three parts:

         1) general utilities, such as: parsing command line arguments, time measurements, OpenCL status checks and etc.

         2) the main driver: initialize OpenCL, then start a loop "get test arguments -> set test parameters -> call the corresponding test"

         3) a set of testing codes

         At first we would like to mention that 1) and 2) are written in a dirty C/C++ mix - these codes are not essential. The only things that matter are the testing procedures.

    The driver can read test parameters in two ways:

         - from stdin (pipe, or launch and insert manually)

         - from file

         It accepts several flags: test number, some ints and doubles, and boolean flags. The parameter semantics may vary from test to test. The output is written to stdout. More details can be found if launched with -h flag.

    This project contains three tests, written in plain OpenCL: the kernel test, the transfer test and the verification via square block dgemm multiplication.

         The kernel test determines how kernels are executed on a device. The main goal is to determine, wheter the performance is stable from call to call or not. Also, it allows to detect gaps between kernels, if there are any.

         How does it work: we measure on the host the whole work time of series of identical calls and measure each kernel's work time via OpenCL timer. Then we compare two numbers. In addition, the test allows to get the kernel performance via OpenCL timer and host timer.

    Accepted parameters:

         -t = 7, test number

         -s block size for the kernel (total 2*s^3 operations on the kernel)

         -n number of calls in series

         -A if the flag is set, we execute consequently (do synchronization after each call). Otherwise we send all n commands at the same time.

         -B if the flag is set, we do a synchroniztion via clWaitForEvents call. Otherwise, we do clFinish.

         -C turnes the profiling on (CL_QUEUE_PROFILING_ENABLE flag). It allows to get each kernel's time.

     

         The transfer test allows to know details about speed and overlap on four OpenCL calls: clEnqueueWriteBuffer, clEnqueueReadBuffer, clEnqueueWriteBufferRect, clEnqueueReadBufferRect. Also, it allows to check overlap on different combinations of Linear and Rect calls.

         How does it work. We create several execution "streams", each stream does several iterations "block of writes -> one NDRange -> block of reads". We assume that while first execution stream does the NDRange, other streams do writes and reads, so the kernels from different streams work consequently without gaps.

         How to interpret results. We measure the C-time of the whole program execution. The speed is calculated as one_transfer_size*(num_writes+num_reads)*niters*nstreams/total_time. The overlap availability is determined by three calls. We set the fixed kernel execution time, then launch three tests: full kernels with small (1MB) transfers ("kernels_only"), small kernels with full transfers ("transfers_only"), full kernels with full transfers ("whole_program") . If the kernels_only time is close to whole_program time, then we have good overlap. If the whole_program time is equal to kernels_only time + transfers_only time, then we don't have an overlap. It is also possible to calculate the percentage of hidden transfers.

    Accepted parameters:

         -t = 160, test number

         -s the transfer size in mbytes. If rects, we send the maximum square of doubles which fits in s mbytes.

         -n number of iterations

         -k number of "execution streams"

         -e sets the number of writes (first digit) and reads (second digit). For example, -e32 will do "3 writes -> 1 NDRange -> 2 reads"

         -p sets the kernel execution time in seconds. Now it is set up so p=1.0 works 1.0 seconds on Radeon 7970 GHz edition

         -A if this flag is set, we use Rect calls for transfers (otherwise, we use linear calls)

         -B if this flag is set, we use prepinned host memory

         -C turns the profiling on and prints all call timings

     

         The dgemm test is designed to evaluate the overall driver work. It calculates the result of square block matrices multiplication (in double precision) and does the verification. Also this test allows to check the overlap on the "close-to-real" task.

    We use the same algorithm as usual.

         How do we know about the overlap: if the whole_program time is close to kernels_only time, then we say that transfers are overlapped. Otherwise, there is no overlap. On Catalyst 12.6 and Radeon 7970 we get 730 GFlops on kernel, 680+ GFlops on overlapped linear calls and 560 GFlops on non-overlapped calls. It is important that overlap depends on queue management...

    Accepted parameters:

         -t = 300, test number

         -s block size in doubles

         -n number of blocks (total matrix size is s*n doubles)

         -A turnes on rects

         -B turnes on prepinned memory

         -C turnes on profiling

         -D turnes on verification. Now it uses OpenBLAS cblas_dgemm function (the simple implementation is commented, if there is no OpenBLAS in the system).

     

     

    Some examples. Here is the lists/current file:

    yefremov@xeon:~/Casestudy/tester/trunk$ cat lists/current

    #whole program

    -t160 -p0.5 -e21 -s250 -k2 -n20

     

    #kernels only

    -t160 -p0.5 -e21 -s1   -k2 -n20

     

    #transfers only

    -t160       -e21 -s250 -k2 -n20

    Here is the program output:

    yefremov@xeon:~/Casestudy/tester/trunk$ ./tester lists/current

    Advanced Micro Devices, Inc.

    =====================

    Devices in context: 1

    ---------------------

    GPU    Tahiti               OpenCL 1.2 AMD-APP (1268.1)

    =====================

     

    #whole program

     

    Running -t160 -p0.5 -e21 -s250 -k2 -n20

          Mbytes    LD    ST        Time       Speed        Perf

             250     2     1     20.9793   1429.8142    706.5817

     

    #kernels only

     

    Running -t160 -p0.5 -e21 -s1   -k2 -n20

          Mbytes    LD    ST        Time       Speed        Perf

               1     2     1     20.1973      5.9401    733.9402

     

    #transfers only

     

    Running -t160       -e21 -s250 -k2 -n20

          Mbytes    LD    ST        Time       Speed        Perf

             250     2     1     11.3395   2645.3007      0.0000

    yefremov@xeon:~/Casestudy/tester/trunk$

         The interpretation. We launch test with id 160 with kernel work time 0.5 sec, 2 writes and 1 reads, one transfer size is 250 mbytes and we launch 2 execution streams with 20 iterations (in we do one stream, we always get consequent execution). What do we see: the whole_program time works for 20.97s, kernels work for 20.19s with avg perf 733 GFlops, and transfers work for 11.33s with average bandwidth 2645 MB/s. So we see that we get a good overlap here, 93% of transfers were hidden behind calculations (93% = ((11.33 - (20.97 - 20.19)) / 11.33 * 100%).

     

     

    Another example:

    yefremov@xeon:~/Casestudy/tester/trunk$ echo "-t7 -s5632 -n10 -C" | ./tester

    Advanced Micro Devices, Inc.

    =====================

    Devices in context: 1

    ---------------------

    GPU    Tahiti               OpenCL 1.2 AMD-APP (1268.1)

    =====================

     

    Running -t7 -s5632 -n10 -C

    0     0.770920

    1     0.775234

    2     0.776484

    3     0.779655

    4     0.781146

    5     0.780192

    6     0.781455

    7     0.779243

    8     0.777731

    9     0.774533

             Block     Total       Avg      Perf

    C         5632    7.9890    0.7989  447.2252

    CL        5632    7.7766    0.7777  459.4398

    yefremov@xeon:~/Casestudy/tester/trunk$

    The interpretation. The kernel used in dgemm show 450 GFlops performance (730 GFlops on 12.6)

     

    What do we see on dgemm with that kernel:

    yefremov@xeon:~/Casestudy/tester/trunk$ echo "-t300 -n3 -s5632 -AD" | ./tester

    Advanced Micro Devices, Inc.

    =====================

    Devices in context: 1

    ---------------------

    GPU    Tahiti               OpenCL 1.2 AMD-APP (1268.1)

    =====================

     

    Running -t300 -n3 -s5632 -AD

           Block      Matrix        Time        Perf

            5632       16896     27.2921    353.4640

    ||CL - REF||_c    =              0.0000000000

    ||CL - REF||_1    =              0.0000000022

    ||CL - REF||_oo    =              0.0000000025

    yefremov@xeon:~/Casestudy/tester/trunk$

    So, there is no overlap with rects, but the verification passes.

     

         There are sample scenarios in lists folder. The ./run.sh script works for about an hour and collects all the information required to determine how kernel work on different block sizes, speed and overlap on different sizes and configurations, are the transfers and kernels correct (on dgemm) and do the drivers provide overlap on "close-to-real" task.

     

    The list of problems on new drivers:

         1) the enqWRRect hang the driver on transfer with size 32MB and 128MB

         2) extremely low kernel performance

         3) dgemm verification fails on rects on one block (all passed on 12.6)

    tester1.png

    tester2(1).png

     

    tester3(1).png

    dgemm_2x2x2_graph.png

    There may be some bugs or problems in logic of the tester, if so, please write about them.

     

    From Russia with love,

    Pavel,

    Anton.

  • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
    ISR RAS Newbie
    Currently Being Moderated

    We say hello to our dear heterogeneous computing friends! Today we will discuss the recent news from the battlefield, and unfortunately they are not the cheerful news.

     

    For almost two years we have been obtaining all our results with a dozen old trusty AMD Radeons 7970 GHz Ed. New scientific plans leaded us to a choice of a hardware platform for the next 1-1,5 years. In addition, the new HAWAII cards seemed inspiring. Let's run the benchmarks, do some calculations and make the conclusions!

     

    The soldiers of Applied Science are interested in at least two major properties of a device: peak performance in double precision and a global memory bandwidth.

    We have tested the following accelerators (+prices in Moscow, written in $$$):

    - Radeon 7970 Ghz Ed (1010MHz) (~350$)

    - Radeon 7990 Reference (~650$)

    - Radeon R9 290X Reference (~700$)

    - Geforce TITAN Reference (~1000$)

     

    All AMDs had 12.6 and 14.1b drivers, 331.38 for TITAN.

     

    Here are the theoretical peaks of these devices and the best real performance (we used the synthetic OpenCL kernels from the clpeak project: https://github.com/krrishnarraj/clpeak):

     

    benchmarks1.png

     

    After that, we launched a mini-stress-test: 100 iterations with 10 kernel launches in each iteration. Resulting performance of an iteration is an average of 10 launches. That's how Marsellus Wallace the performance degradation looks like. Isn't it beautiful?

     

    benchmarks3.png

     

    After looking at these diagrams, several questions arise.

    1. It is unclear why one half of 7990 has 733 GFlop/s while 7970 has 1007 GFlop/s? As far as we know, these cards are equal, and they have the same frequencies and amount of SPUs.

    2. Why does the kernel performance decrease on Tahiti as time goes by? This issue keeps appearing with all drivers after 12.6, and driver developers seem to do nothing about that.

    3. Why the HAWAII card is so slow? The fact that it has DP/SP = 1/8 (compare with 1/4 on Tahiti) made us frustrated. In addition, we notice that our non-optimized DGEMM kernels get 650 out 704 GFlop/s, which is 92% - unrealistic number. So we hypothesize that the chip has full 1,4 TFlop/s performance, which is artificially (by software?) limited.

     

    The next chart contains theoretical peaks of global memory bandwidth:

     

    benchmarks2.png

     

    The real bandwidth of all devices was measured with GlobalMemoryBandwidth and MemoryOptimization tests from AMD APP SDK:

     

    benchmarks_tables_1.png

    benchmarks_tables_2.png

     

    Important moments:

    1. In some aspects AMD drivers become better and better, and these improvements greatly affect the overall performance.

    2. All AMD GPUs demonstrate an exсellent memory subsystem work.

     

    Unfortunately, we have to state that HAWAII is less suitable for scientific computations than Tahiti, and the time of cheap GPGPUs for scientists has ended.

     

    As we think, up to this point AMD hardware was much better than hardware from NVidia. On the other hand, these advantages were neutralized by unstable drivers. At the moment, the AMD drivers became better, but there are some old bugs (like performance degradation) and some new ones, so some codes which worked with 12.6, don't work on the new drivers.

     

    We have also tested the FirePro card (based on Tahiti architecture), and the situation was the same: unstable drivers and a lot of difficulties with multi-GPU systems.

     

    So, the question is: what does AMD plan to do in the computational sector? Previously we had a cheap and fast hardware, which could be used despite the awful drivers. At the moment the drivers are still not fully operational, and soon the FirePro cards will become very expensive (who will buy them?..).

     

    On the other hand, in some performance metrics NVidia devices are not as good as AMD ones, but they have the brilliant software support. We have tested a lot of different NVidia GPUs (GeForce and Teslas) and never encountered serious problems.

     

    Almost all computational problems we investigate are memory-bound, and in this aspect all these GPUs are approximately equal. In the real launches AMD 7970s are usually better than TITANs, so we decided to stay on 7970. But we expect that NVidia will start selling something more fast in a year or so.

     

    There is an opinion (not only ours), that if AMD will not change their slighting attitude to GPGPU sphere, a lot of scientific researches will start choosing NVidia.

     

    From Russia with love,

    Pavel,

    Anton.

    • Re: OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).
      moozoo Newbie
      Currently Being Moderated

      Thanks Anton, I just wanted to show my appreciation for your work and for reporting your findings.

       

      My opencl application is heavily dependent on double precision performance.

       

      A lot of people out there have the attitude that you only need double precision for a few scientific problems.

      The fact that every major computer language uses double precision for calculations and float is just a storage format eludes them. Ditto Excel using 16 digits of precision.

      It would be great if someone could write a paper as to why double precision is so important for all uses.

       

      Someone needs to invent a bitcoin that depends on double precision (or higher..) performance. (finding locations of a certain pattern in a mandlebot?)

      Perhaps this would raise the desire for high DP performance in the general user base.

More Like This