9 Replies Latest reply: Jul 30, 2013 8:49 AM by himanshu.gautam RSS

Profiling time for blocking and non-blocking execution

lenjyco Newbie
Currently Being Moderated

Hi all,

 

I'm new to OpenCL dev and i want to understand some mechanics.

 

I've a simple matrix multiplication kernel, and i want to see the impact of the blocking option for the clEnqueue* instructions.

So, i compute one time with blocking write&read and the other non blocking.

 

When I look to the profiling times of execution, I've, for the blocking version, a sequential order for each time (enqueue, submit, kernel start, kernel end) but in non blocking i got the execution of the kernel before that it's submitted and queued.

 

Can someone explain me this behaviour, thank you very much.

  • Re: Profiling time for blocking and non-blocking execution
    nou Expert
    Currently Being Moderated

    blocking call is equivalent of clFinish() after that call so it finish executing before return. unless you have out of order queue (which is not supported) you will always see execution in order.

  • Re: Profiling time for blocking and non-blocking execution
    himanshu.gautam Master
    Currently Being Moderated

    lenjyco wrote:

     

    kernel end) but in non blocking i got the execution of the kernel before that it's submitted and queued.

     

    Can someone explain me this behaviour, thank you very much.

    Getting a negative number for (END-START) is certainly strange. You can submit a test case for that.

    also you should notice that non-blocking call returns in very less time(just after queueing your command), as the execution will be started at some later time. This should in general result in very high submit time (submit-queued), other times should be more or less comparable.

    Regards

    Himanshu , Bruhaspati

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: Profiling time for blocking and non-blocking execution
    lenjyco Newbie
    Currently Being Moderated

    Hey,

     

    thanks for the response,

     

    @nou : i'm using IN-ORDER queue right now, i will test OUT-OF-ORDER after.

     

    @himanshu.gautam : I think you misunderstood me, it's not the difference between END-START that is negative .

     

    For example, i get the 4 differents time (Queueing, submiting, sarting and ending) in my non-blocking version and i get from my output :

     

    - Queue = 27095955722029

    - Submit = 27095177522023
    - Start =  27095177553977

    - End = 27095946659941

     

    (Sorry for the use of raw data)

     

    You can obverse that the execution starts before it's queued, it's nonsense for me but i'm sure i'm missing something.

     

    Regards,

    • Re: Profiling time for blocking and non-blocking execution
      himanshu.gautam Master
      Currently Being Moderated

      ok. cl_event counters are very trust worthy. Can you attach you testcase, i can forward it to AMD Engg Team.

      For now,  use some high precision CPU Timers only.

      Regards

      Himanshu , Bruhaspati

      --------------------------------

      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

      • Re: Profiling time for blocking and non-blocking execution
        lenjyco Newbie
        Currently Being Moderated

        "Can you attach you testcase" : You mean that you want my code where i get those profiling time ?

        • Re: Profiling time for blocking and non-blocking execution
          himanshu.gautam Master
          Currently Being Moderated

          Yes I mean can you attach a minimal host code + kernel which gives such weird counter values. BTW, on the last reply i meant , the counters are NOT very trustworthy. Sorry for the typo.

          I would suggest you to use High precision CPU Timers only. But AMD would certainly want to make their counters reliable, and your test case can help us greatly.

          Regards

          Himanshu , Bruhaspati

          --------------------------------

          The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

          • Re: Profiling time for blocking and non-blocking execution
            lenjyco Newbie
            Currently Being Moderated

            Here is my kernel :

             

            kernel void matrixMultiplication(__global float* A, __global float* B, __global float* C,  int widthA, int widthB, int k )
            {
                int i = get_global_id(0);
                int j = get_global_id(1);
                float value = 0;
                for (int tmp =  0; tmp < k; tmp++)
                {
                    for ( int x = 0; x < widthA; x++)
                    {
                        value = value + A[x + j * widthA] * B[x*widthB + i];
                    }
                    C[i + widthA * j] = value;
                }
            }

             

             

             

             

            and here is my hostcode which is a little bit heavy so i just attached the queueing instructions :

             

             

             

             

            oclCopyHostToDeviceSynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu);

            oclCopyHostToDeviceSynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu);

             

             

                //EXECUTE the kernel

                ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

                if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

             

                //Wait the end of computation, used to get time of execution

                clWaitForEvents(1, event_list_execute);

             

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

             

            //WRITE the data

                oclCopyHostToDeviceSynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu);

                oclCopyHostToDeviceSynch(queue_gpu, B, widthB * heightB * sizeof (float), b_in_gpu);

             

             

                //EXECUTE the kernel

                ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

                if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

             

                //Wait the end of computation

                clWaitForEvents(1, &event_list_execute[0]);

             

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

             

            oclCopyHostToDeviceAsynch(queue_cpu, A, widthA * heightA * sizeof (float), a_in_cpu, &event_list_write_cpu[0]);

                oclCopyHostToDeviceAsynch(queue_cpu, B, widthB * heightB * sizeof (float), b_in_cpu, &event_list_write_cpu[1]);

             

                oclCopyHostToDeviceAsynch(queue_gpu, A, widthA * heightA * sizeof (float), a_in_gpu, &event_list_write_gpu[0]);

                oclCopyHostToDeviceAsynch(queue_gpu, B, widthA * heightA * sizeof (float), b_in_gpu, &event_list_write_gpu[1]);

             

             

                //Wait for the end of writting

                clWaitForEvents(2, event_list_write_gpu);

                clWaitForEvents(2, event_list_write_cpu);

             

             

                //EXECUTE the kernel

                ret = clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[0]);

                if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

                ret = clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 2, NULL, gWorkSize, lWorkSize, 0, NULL, &event_list_execute[1]);

                if (checkCLError(ret, __FILE__, __FUNCTION__, __LINE__) == 1)return 1;

             

             

             

                //Wait the end of computation

                clWaitForEvents(2, event_list_execute);

             

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                clGetEventProfilingInfo(event_list_execute[0], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

                for (int i = 0; i < 4; i++)

                    profiling_time_tab[2][i] = time_buff[i];

             

                clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_QUEUED, sizeof (cl_ulong), &time_buff[0], NULL);

                clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_SUBMIT, sizeof (cl_ulong), &time_buff[1], NULL);

                clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_START, sizeof (cl_ulong), &time_buff[2], NULL);

                clGetEventProfilingInfo(event_list_execute[1], CL_PROFILING_COMMAND_END, sizeof (cl_ulong), &time_buff[3], NULL);

             

             

             

            The oclCopyHostToDeviceAsynch & oclCopyHostToDeviceSynch are just functions more easier to understand and the difference is that Asynch is non blocking while Synch is.

  • Re: Profiling time for blocking and non-blocking execution
    lenjyco Newbie
    Currently Being Moderated

    I've found the error, i think maybe it will interest someone. If you use clWaitforEvents with a list of event which belongs to differents contexts an error occur (i forgot to check the return value).

    • Re: Profiling time for blocking and non-blocking execution
      himanshu.gautam Master
      Currently Being Moderated

      Thanks for coming back on this... You save our time for sure...Secondly, it is also a great help to the community..

       

      Yes, Context owns buffers, images (data), kernel (code) and events (synch)... You cannot do cross-context stuff like this. Thanks for coming back...

       

      - Bruhaspati

      Regards

      Himanshu , Bruhaspati

      --------------------------------

      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

More Like This

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points