16 Replies · Latest reply: Nov 14, 2013 11:36 PM by inferrna

Results differ for GPU and CPU devices depending on data.

inferrna Newbie

I wrote a simple propagation example, implemented both on the CPU and in OpenCL. The OpenCL results differ from the CPU results depending on the data structure and on which GPU is used. The first device, Devastator (7560D), gives the right result only when all layers have the same size. The second device, Pitcairn (HD 7870), gives a wrong result when all layers have a size equal to WORK_GROUP_SIZE (256 for both). The third layer (the second of the mutable layers) is always the one that is wrong. My question: is something wrong with my code, or with the fglrx driver, or is this task impossible in OpenCL, so that I must make a separate kernel call for each propagation step?

Attached is a test program that shows the difference between the CPU and GPU results.

Compile options: gcc -O0 -std=c99 test_prop.c test_prop_cl.c -o test_prop -lm -lOpenCL -ggdb -L/usr/lib/

Here is the first variant of the kernel; it gives the same results as the version in the attachment.

__kernel void test_prop( __global float* values, __global __read_only float* conns, __global unsigned int* sz)
{
    __global float *pvalues = values;   // previous layer's values
    __global float *cvalues = values;   // current layer's values
    __local float sums[256];
    float sum;
    unsigned int i, nr, pnr, cn, clc, k, offc = 0, lcn, szp;
    for(k=1; k<3; k++){                 // propagate layer 0 -> 1, then 1 -> 2
        szp = sz[k-1];                  // previous layer's size
        cvalues += szp;
        clc = sz[k]*szp;                // number of connections for this layer pair
        cn  = select((uint)0, (uint)get_global_id(0), get_global_id(0)<clc);
        lcn = get_local_id(0);
        if(cn<clc){
            nr  = cn / szp;             // current neuron
            pnr = cn % szp;             // prev layer's connected neuron
            sums[lcn] = conns[cn+offc]*pvalues[pnr];
            barrier(CLK_LOCAL_MEM_FENCE);
            for(i=2; i<=szp; i<<=1){    // stride-doubling reduction of partial sums
                sum = select((float) 0.0, sums[lcn + i/2], (uint)(clc % i == 0));
                sums[lcn] += sum;
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            cvalues[nr] = sums[lcn];
        }
        sums[lcn] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        offc += clc;
        pvalues = cvalues;
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

  • Re: Results differ for GPU and CPU devices depending on data.
    himanshu.gautam Master

    Before I get into the details: I see a "barrier" inside a conditional. Are you sure all work-items of a work-group take the same branch?
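For reference, the pattern that avoids this problem is to make every work-item of the group reach every barrier, and to predicate only the loads, the work, and the store. A generic reduction sketch in OpenCL C (illustrative only, not the poster's kernel; group_sum and its arguments are made up for the example):

```
// Every work-item executes every barrier(); only the memory work is predicated.
__kernel void group_sum(__global const float *in, __global float *out, uint n)
{
    __local float tile[256];
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    tile[lid] = (gid < n) ? in[gid] : 0.0f;   // predicated load, no branch around barrier
    barrier(CLK_LOCAL_MEM_FENCE);             // reached by the whole work-group

    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)                          // only the work is conditional...
            tile[lid] += tile[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);         // ...the barrier is not
    }

    if (lid == 0)
        out[get_group_id(0)] = tile[0];       // single writer per group
}
```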

     

    Floating-point math is tricky: small differences between CPU and GPU executions will always be there. The size of the error depends on how many operations you perform. These differences are normal and cannot be avoided; that is the nature of floating-point numbers. The result depends on the order of computation, and since the order changes when you decompose the work in parallel, the result changes by a small amount (every time it happens).

    Try it with integer-like float data that is small, say less than 5.0f (assuming you have no divisions).

     

    Are you using the -cl-fast-relaxed-math compile option?

     

    Best,
    Himanshu (Bruhaspati)

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

    • Re: Results differ for GPU and CPU devices depending on data.
      inferrna Newbie

      Here is a modified kernel in which I try to avoid barriers inside a conditional and replace the interleaved writes to global memory with async_work_group_copy. It gives the same results.

      [C] test_prop.cl - Pastebin.com

      > Are you using the -cl-fast-relaxed-math compile option?

      I have tried changing the options; it made no difference.

      > Small errors will be there between CPU and GPU executions.

      I have seen those, and they were insignificant. What I am seeing now is significant: 16.001 vs -58.54, for example.

      Also, an unstable situation is possible (depending on the data structure): sometimes it gives the right result, sometimes not. For the 7870 that structure is 256/128/64 (layer sizes must be powers of 2 for the reduction to work).

      > Try it with integer-like float data that is small, say less than 5.0f.

      Interesting advice. I need float data, but I will try this to see the difference.

      • Re: Results differ for GPU and CPU devices depending on data.
        himanshu.gautam Master

        Barriers, in a literal sense, are barriers to performance as well. As a parallel programmer you should not use so many barriers in your code.

        Anyway, if the errors are insignificant, then it is just the regular floating-point deviation.

        Also, I see that in the new code you are not writing into global memory; you are merely doing some pointer math. I don't understand what this code is trying to do.

        -
        Bruhaspati


        • Re: Results differ for GPU and CPU devices depending on data.
          inferrna Newbie

          Here is an illustration of what the kernel does: http://borgu.org/test_prop.pdf

          In short: layer 0 is immutable; its data propagates (with the coefficients stored in conns) to layer 1, and from layer 1 to layer 2. The first propagation, from layer 0 to layer 1, is always right, but the second produces wrong data depending on the structure, as you can see in the video.

          > Barriers, in a literal sense, are actually barriers to performance as well.

          Right now accurate results are more important. Once that is done, I will optimize the code.

          > I see that in the new code you are not writing into global memory.

          It does write to global memory, with async_work_group_copy (and it reuses the results in the next iteration).

          • Re: Results differ for GPU and CPU devices depending on data.
            himanshu.gautam Master

            Thanks. In your video, at 1:16 (after the clang compilation), pause it. If you examine the results, they are still wrong (for example, look at 119 and above).

            To me it looks like a race condition (because of the async copy and the for loops). We have seen before that a barrier in the middle of a for loop causes a race between the upper and lower halves. I will check your code now to see whether some such assumption is broken.

             

            Best,

            Bruhaspati


          • Re: Results differ for GPU and CPU devices depending on data.
            himanshu.gautam Master

            if(pnr == 0 && cn<clc) lvalues[lnr] = sums[lcn];

            How many work-items execute the statement above? I hope only one; if not, many work-items would be writing to the same location, which is non-deterministic.

            And in that case, I believe you should be copying from "&lvalues[lnr]" to global memory, instead of from "&lvalues[0]".

             

            Best,

            Bruhaspati

             


  • Re: Results differ for GPU and CPU devices depending on data.
    inferrna Newbie

    I manually unrolled the main loop and replaced async_work_group_copy with simple copying inside a single thread. The results are the same. This unrolled and more heavily commented version of test_prop.cl is in the attachment.
