I'm using clAmdFft a lot in my code. About 75% of the execution time is spent in fft_fwd and fft_back. According to CodeXL, the occupancy of these kernels is only 33%, and the limiting factor is the (local) work-group size of 64. Is the local work-group size algorithm-specific, or can I somehow increase it?
Soft-/hardware that I'm using:
AMD APP SDK 2.8
Driver Packaging Version: 9.012-121219a-151962C-ATI
This is a bit strange. You are probably making many kernel calls, and the lower half of the image is for the overall application (or maybe the hot-spot kernel). Can you also attach the profiler counter output?
To answer your question, I would think only the library developers will be able to fix any kernel occupancy issues, but I'm not sure, as I have not used this library so far.
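For reference, here is a rough sketch of how a work-group size of 64 can cap occupancy when nothing else (registers, local memory) is the limit. The function name and the per-compute-unit limits below (8 resident work-groups, 24 resident wavefronts) are assumptions picked only to illustrate the arithmetic. They are not values taken from CodeXL or from the hardware documentation for this card. The wavefront size of 64 is the usual value on AMD GPUs of this class.

```python
import math

def estimate_occupancy(work_group_size, wavefront_size,
                       max_groups_per_cu, max_waves_per_cu):
    """Rough occupancy estimate when only the work-group size is limiting.

    All limits are passed in by the caller; nothing here is queried
    from a real device.
    """
    # A work-group is split into whole wavefronts.
    waves_per_group = math.ceil(work_group_size / wavefront_size)
    # Resident groups are capped by the group-slot limit and by how many
    # groups fit into the wavefront limit.
    groups_by_waves = max_waves_per_cu // waves_per_group
    resident_groups = min(max_groups_per_cu, groups_by_waves)
    resident_waves = resident_groups * waves_per_group
    return resident_waves / max_waves_per_cu

# Hypothetical per-compute-unit limits, for illustration only.
MAX_GROUPS_PER_CU = 8
MAX_WAVES_PER_CU = 24

for size in (64, 128, 256):
    occ = estimate_occupancy(size, 64, MAX_GROUPS_PER_CU, MAX_WAVES_PER_CU)
    print(size, f"{occ:.0%}")
# prints:
# 64 33%
# 128 67%
# 256 100%
```

With these assumed limits, a group of 64 work-items is a single wavefront, so the 8 group slots fill up long before the wavefront slots do. Higher occupancy does not by itself guarantee a faster kernel, so treat this only as a way to read the profiler number.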
@himanshu.gautam: I've attached the CSV file to the initial post. Is that what you meant?
PS: I was not able to reply to your post directly (not authorized) ...
The work-group sizes were chosen empirically for maximum performance. The library was tuned to work well on the 59xx cards, and that tuning applies to a large extent to the 5770 as well. Please note that your GPU family is three generations old, and support for such cards typically dwindles over time.
What FFT problems are you running? What size transforms etc?
Also, we recently open-sourced the code. It is called clMath, and it is available on GitHub at https://github.com/clMathLibraries/clFFT
You can now browse through the code if needed.