Hello,
I have an OpenCL kernel that implements a dot product between two float arrays. The first is an array of size*n elements and the second is an array of n elements.
Here is a sample of the code:
__kernel
void evaluate_product(__global const float *pFirstArray,
                      const int n,
                      __global const float *pSecondArray,
                      __global float *pOutput)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    if (gid >= 0 && gid < size)
    {
        float output = 0.0f;
        for (int k = 0; k < n; k++)
        {
            output += pFirstArray[gid + k*size] * pSecondArray[k];
        }
        pOutput[gid] = output;
    }
}
If I execute the same operations on the CPU, I get different results, mostly from the 6th or 7th decimal digit onwards. Why this strange behaviour? The Khronos OpenCL specification (v1.2) says that x+y and x*y are correctly rounded and IEEE 754 compliant.
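For concreteness, an equivalent plain-C version of the same computation (a sketch; the function name is mine, the data layout is the one the kernel uses, with `size` standing in for the global work size) would look like this:

```c
/* Plain-C equivalent of the kernel: pFirst holds size*n elements so
   that element (i, k) lives at pFirst[i + k*size]; pSecond holds n
   elements.  Each output i is a dot product accumulated in float,
   exactly as in the kernel. */
void evaluate_product_cpu(const float *pFirst, int n,
                          const float *pSecond, float *pOut, int size)
{
    for (int i = 0; i < size; i++) {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += pFirst[i + k*size] * pSecond[k];
        pOut[i] = acc;
    }
}
```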
Any ideas?
Lately this is such a common "problem" that I'll just repost my answer from another thread.
First of all, if we talk about basic operations ( +, -, /, * ), AMD GPUs give exactly the same results as the CPU ( with the exception of native double division ). For fused mad the accuracy is even higher than what CPUs can do.
Most people simply forget that the CPU/FPU uses 80-bit precision for its internal registers and all operations. Only when you store float/double values to memory are they rounded to the proper size/representation.
The difference is not due to GPU inaccuracy but because you are comparing results from 80-bit math with results from 32- or 64-bit math.
There are two ways to get the same results on the CPU. You can force basic operations to store their results to memory before they are reused ( e.g. by overloading operators in C++ ). Or you can switch to SSE, because it doesn't use this archaic 80-bit FPU mode ( you can force gcc to use SSE instead of the FPU with the flag -fmfpmath=sse ).
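A tiny illustration of why storing intermediates matters (the function name and values are mine): at 32 bits, adding 1.0 to 1.0e8 is completely absorbed, while 64-bit (let alone 80-bit) math keeps it:

```c
/* Add two floats with the result rounded to 32 bits, the way a store
   to memory (or SSE single-precision math) behaves.  The volatile
   local forces a genuine 32-bit store even under x87 extended
   evaluation. */
float add_rounded32(float a, float b)
{
    volatile float r = a + b;
    return r;
}
```

Calling `add_rounded32(1.0e8f, 1.0f)` returns exactly 1.0e8f, because the spacing between floats near 1e8 is 8.0, so the +1.0 is rounded away; the same sum in double (or in an 80-bit register) keeps the +1.0, which is exactly the kind of discrepancy described above.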
Isn't there even more to it? For example, where the compiler places a division, and so on? I have a numerical simulation in which many sin, cos, tan and log operations occur. When I compare the results of a C version and an OpenCL version after a few iterations, there are differences of up to 1e-8, despite using fp64 and the SSE instructions.
AMD GPUs don't have true/native double-precision transcendental ( sin, cos, log, tan, ... ) functions. Native double functions just return the single-precision result converted to double precision ( with the lowest bits set to 0 ). To improve accuracy the GPU must use basic operations. Achieving full accuracy ( error <= 0.5 ulp ) takes too many instructions and isn't practical. That's why OpenCL allows quite a large error for transcendental functions ( 2-3 ulp or more ).
Also, native single-precision transcendental functions on AMD's GPUs can return results with quite a large error ( though it depends on the function ).
Usually the CPU gives a much lower error for these functions ( although, if I remember correctly, it isn't always fully accurate either, so in some cases the error is > 0.5 ulp ).
But, as I said, with the exception of native double division ( which is a single-precision division converted to double precision by appending zeros ), all of AMD's GPU basic operations ( add, sub, div, mul, ... ) give an error <= 0.5 ulp.
This means they are exactly rounded and are the same as on CPU.
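For completeness, ulp error of this kind can be measured by counting the representable floats between a result and the correctly rounded reference. A sketch (helper names are mine; whole-ulp counts only, finite same-magnitude values):

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto a monotonically increasing integer
   line, so that adjacent floats map to adjacent integers (and both
   zeros map to 0). */
static int64_t float_to_line(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof i);
    return i < 0 ? (int64_t)INT32_MIN - i : i;
}

/* Whole-ulp distance between a computed float and a double-precision
   reference; the reference is first rounded to the nearest float. */
int64_t ulp_distance(float result, double reference)
{
    int64_t d = float_to_line(result) - float_to_line((float)reference);
    return d < 0 ? -d : d;
}
```

With this, an "exactly rounded" operation is one whose result is always 0 ulp from the correctly rounded reference.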
Of course, we should remember that any change in the order of floating-point operations can give a different result ( and the difference can be quite drastic ). So sometimes enabling compiler optimizations can be an issue.
That is good to know. Could you please give me the sources for this information?
I have seen the compiler change multiply+add operations into mad/fma when compiling ( independent of the relaxed-math options; do you use them? ). Maybe you are not even doing separate multiply and add operations. Check with the KernelAnalyzer; then you might have an easy explanation. You should probably compare your results against an AMD Bulldozer processor, which supports fma4 ( or fma3/fma4 ).
Quoting the earlier reply: "Also, native single-precision transcendental functions on AMD's GPUs can return results with quite a large error ( though it depends on the function ). Usually the CPU gives a much lower error for these functions ( although, if I remember correctly, it isn't always fully accurate either, so in some cases the error is > 0.5 ulp )."
This is true of the older VLIW5 and VLIW4 devices, but the GCN architecture seems to have full 32-bit floating-point accuracy, pretty much equal to anything on a PC. The GCN native sin function, v_sin_f32, compared to gcc's sinf() library function over a full range of values, gives:
native v_sin_f32 average error: 2.78e-8
gcc float sinf() average error: 2.62e-8
The other 32-bit hardware transcendentals ( exp, log, sqrt, rsqrt, pow, and div/reciprocal ) have roughly the same accuracy, at least for the values I've tested. With a little effort, GPUs can do a wide range of floating-point calculations almost entirely in hardware, while the PC/CPU must use long library routines for most transcendental calculations. Although CPUs do have hardware acceleration for sqrt and sin/cos, it's very slow compared to the GPU's single-cycle ( 4-clock ) transcendentals. The real problem with GPUs is integer division!
It seems you made a mistake in the name of the option for using SSE instead of the FPU: it's not -fmfpmath=sse but -mfpmath=sse.