I've been using clAmdBlas with clMAGMA 1.0 and am quite impressed with what it does!
The clAmdBlas v1.8.239 and v1.10.274 release notes claim to have fixed "Failures in *trsm functions with clMAGMA tests," yet with the latest clAmdBlas v1.10.274 (Win32+64) installed I'm still experiencing problems with Strsm(), both clAmdBlasStrsm() and clAmdBlasStrsmEx(): depending on the combination of input argument flags, the [A] matrix does not remain unchanged on exit, and the [B] matrix contains incorrect data on exit. In one particular case (the combination of "left," "upper triangular," "no transpose," and "non unit") I found a limited work-around: trsm( Right, Upper, NoTrans ){ [A], [B] } = [ trsm( Left, Upper, Trans ){ [A], [B]_T } ]_T, where _T is an external transposition. That work-around fails, however, when N (# cols of [B]) is even and > 3 and M (# rows of [B]) > 3, and it also carries the cost of the extra memory required to hold a copy of [B] in order to produce its transpose [B]_T.
It would be nice to get the clAmdBlasStrsm() subroutines working as originally intended, at their intended efficiency. Any suggestions as to what might be wrong with my build would be appreciated; or, if this is a clAmdBlas bug, is a fix planned for a newer release?
P.S. The same problem occurs with calls to the other types, ...Dtrsm() and ...Ctrsm(), as well, but ...Ztrsm() seems to work just fine -- at least on one of my Intel systems.
Could you please provide the system configuration (software versions, etc.) with which you tested this particular sample? Also, please give us a test case so we can reproduce the issue.
Thanks. Here you go:
I have a detailed description of my configuration in a posting on the MAGMA Forum: MAGMA Forum • View topic - clMAGMA v1.0 build on WinXPHE 32bit with Visual Studio & MKL. [Since that original posting I've upgraded to VS 2008 Pro SP1, MKL v11.0.2, GNU gfortran 4.8, AMD APP SDK v2.4.595.10 (Catalyst 11.4), and clAmdBlas v1.10.274 (Win32+64).] I have four builds on four different systems (two with AMD CPUs, two with Intel CPUs), all running WinXP-HE (32-bit). Three systems have an HD5850 and the fourth an HD6850. The systems experience the problem slightly differently: on one Intel system Ztrsm seems to work perfectly (but not S/C/Dtrsm), and on one AMD system (PhenomII x6) Ztrsm fails (in addition to S/C/D) with a certain combination of argument flags passed into the clAmdBlasZtrsmEx() call: Left/Upper/Trans/NonUnit. The Strsm version fails with more combinations of input argument flags than the other types do.
To try to give you enough info to reproduce the problem, the remainder of this msg provides a few test cases, with code nucleus and data in/out: The clMAGMA v1.0 sgetrf_gpu.cpp and sgetrs_gpu.cpp codes each call on magma_strsm(), which is a wrapper for clAmdBlasStrsmEx().
Hyperlinks:
1. First case (for reference, works properly, using Lapack STRSM) - Does not call clAmdBlasStrsmEx(). [A] unchanged on exit, and [B] changed on exit, replaced with "answer."
2. Second case (clMAGMA, calling clAmdBlasStrsmEx() ) - Matrix [A] corrupted and did not remain "unchanged on exit." [B] was supposed to be changed on exit, but wasn't.
3. Third case (my revised clMAGMA code, calling clAmdBlasStrsmEx(), trying to use the "work-around" I mentioned in my original posting, but which doesn't seem to work all that well in this case.) - [A] remains unchanged on exit (good), but matrix [B] was corrupted on exit (lost pointer?).
4. Fourth case (an experiment, varying the clAmdBlasStrsmEx args, trying to find a combination that works, but didn't.) - Matrix [A] corrupted and did not remain "unchanged on exit." [B] was supposed to be changed on exit, but wasn't.
Regards,
Art Densmore
Hi Art~
I wanted to make two comments for you:
Kent
Thanks for the links, Kent; I'll look into those. You're right: I'm using WinXP, and I believe Catalyst 11.4 was the last version to support it. Although I'd still prefer to get the clAmdBlas lib working properly, I've solved my present problem by coding my own OpenCL versions of the Xtrsm GPU BLAS routines, one of which I've copied below. This routine may not be fully optimized, but it runs much faster than the Netlib reference code and seems to run at a speed comparable to the clAmdBlas library routines.
/*
* strsm_gpu_lunn1() single-precision triangular matrix solver:
*
* specifically for the case of: Left / Upper / NoTrans / NonUnitTriangular / alpha=1.
*
* Written as an alternative to the clAmdBlas v.1.10.274 Win32+64 clAmdBlasStrsmEx() routine.
*
* clMAGMA's Xgetrf_gpu.cpp and Xgetrs_gpu.cpp were both originally written to rely on clAmdBlasStrsmEx(),
* so I've coded this C/OpenCL code, based on Netlib's TRSM.f source, to patch the clMAGMA build v.1.0.0.
*
* 2013-09-16 Art Densmore: University of California, Los Angeles. This code may be used as long as this title block remains unchanged.
*
* -- Supplement to clMAGMA (version 1.0.0) --
* The .cpp subroutine is declared in include/magmablas_s.h, its kernel label is entered into the kernel list in interface_opencl/CL_MAGMA_RT.cpp, and
* both the .cpp file and the accompanying .cl file are listed in interface_opencl/Makefile.src
*
* The preprocessor code below is meant to provide general support for all clMAGMA precisions: s d c z
*/
#define PRECISION_s
#if defined(PRECISION_d) || defined(PRECISION_z)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#if defined(PRECISION_z)
typedef double2 magmaDoubleComplex;
#endif
#define Zero 0.
#else
#define Zero 0.f
#endif
#define __mul24( x, y ) ((x)*(y))
/* AMD GPUs use workgroup size of 64 */
#define SIZE_LOCAL_WORKGROUP 64
__kernel void strsm_gpu_lunn1( int m, int n, __global float *A, int offsetA, int lda, __global float *B, int offsetB, int ldb)
{
int inx = get_local_id(0);
int ibx = get_group_id(0) * SIZE_LOCAL_WORKGROUP;
int tx = ibx + inx;
int i, j, k, pa, pb;
int kXlda;
int jXldb;
int itemp;
float ftemp;
A += offsetA;
B += offsetB;
// Each work item (parallel "thread") processes one of the B matrix columns.
j = tx;
if( j < n ) { // Only act on workitem "threads" that represent columns that fall inside the range of matrix B.
jXldb = j * ldb;
itemp = m - 1;
kXlda = itemp * lda;
for( k = itemp; k >= 0; k-- ) {
pa = kXlda;
pb = jXldb;
itemp = k + jXldb;
ftemp = B[ itemp ]; // use ftemp to minimize memory accesses in the inner loop
if( ftemp != Zero ) {
ftemp /= A[ k + kXlda ];
B[ itemp ] = ftemp;
for( i = 0; i < k; i++ ) { // for k == 0 this loop is empty, so that row is only scaled
B[ pb++ ] -= A[ pa++ ] * ftemp;
}
// No barrier is needed here: each work-item owns one full column of B and no
// local memory is shared. (A barrier inside this divergent branch would in
// fact be undefined behavior in OpenCL, since not all work-items reach it.)
}
kXlda -= lda;
}
}
}
/*
// code from Netlib TRSM.f (Left / Upper / NoTrans case; the ALPHA-scaling
// IF block at the top of the J loop is elided here, since alpha = 1)
      DO 60 J = 1,N
          ...
          END IF
          DO 50 K = M,1,-1
              IF (B(K,J).NE.ZERO) THEN
                  B(K,J) = B(K,J)/A(K,K)
                  DO 40 I = 1,K - 1
                      B(I,J) = B(I,J) - B(K,J)*A(I,K)
   40             CONTINUE
              END IF
   50     CONTINUE
   60 CONTINUE
*/
Buddy...,
TRSM is a BLAS-3 operation. It can be orchestrated using the GEMM operation and can then reach near-peak performance.
The code you posted above can serve as a stopgap, but it will never deliver the performance the hardware is capable of.
FYI...
Best Regards,
Bruhaspati
Thank you.
Where can I find out more about such "orchestration using GEMM"? Would I review the GEMM code in the GitHub set that Kent pointed out above?
Hi Artdensmore,
Sorry for the delay. The BLAS code on GitHub is a bit complicated and has a learning curve; some of the kernels are generated dynamically, which can make it a nightmare to figure out what is actually happening.
You can check TRSV in GitHub -- it is less complicated and orchestrates using GEMV.
I will tell you a simple design that can use BLAS.
Consider AX = B
where A -- Coefficient Triangular Matrix (let us consider an upper triangular matrix)
X = Variable Matrix (1 column for 1 equation)
B -- Value Matrix
1. Solve the lower tip of triangle A for some of the variables (say 16). You can use one workgroup to solve for one lower column-block of the Variable matrix.
2. Now, if you look at the X matrix, a small lower portion (16xN) contains valid variables.
3. Next, multiply the (M-16)x16 portion of the A matrix (the right-side rectangle just above the triangular tip you used for solving) by the
partial matrix in X (16xN) and subtract the product from the (M-16)xN portion of the Value Matrix (B).
It is as simple as saying:
Since we have solved for few variables, we will substitute the variable values in all remaining un-solved equations and subtract the dot-product from the actual result...
You just need to do this process recursively to find your way..
HTH,
Best Regards,
Bruhaspati
Thank you - that's insightful. And I got the clBLAS from GitHub working.
Turns out the main issue appears to be a Catalyst driver issue with XP: the clAmdBlas calls appear to be OK when called alone, outside of the clMAGMA context, and I (somewhat by trial and error) got the clMAGMA code working with the original clAmdBlasXtrsmEx calls (C/D/S/Z), without the bug reported above, on three of four XP systems (two Intel, one AMD, all with an HD5850), simply by updating the drivers to specifically and only this combination: display driver ver 8.911 (Cat 11.11), AMD APP SDK (OpenCL) runtime ver 10.0.873.1 (Cat 12.3?), and SDK Developer ver 2.4.595.10.
A fourth system (AMD) still shows some symptoms of the problem (erroneous Xgesv_gpu results with N odd) even after that specific driver update. However, all four systems avoid the problem entirely, under any reasonable driver combination, once the driver-sensitive clAmdBlasXtrsmEx calls noted above are replaced with the customized Xtrsm OpenCL code shown above. The tradeoff, as you noted, is that the custom code is rather slow.
If anyone would like a copy of my clMAGMA v1.0 build on WinXP (make cleanall), send me an email at adensmore@ucla.edu with subject line "Requesting clMAGMA source".