I'm using clAmdFft a lot in my code. About 75% of the execution time is spent in fft_fwd and fft_back. According to CodeXL, the occupancy of these kernels is only 33%, and the limiting factor is the (local) work group size of 64. Is the local work group size specific to the algorithm, or can I somehow increase it?
Software and hardware that I'm using:
AMD APP SDK 2.8
Driver Packaging Version: 9.012-121219a-151962C-ATI
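To illustrate how a work group size of 64 can cap occupancy, as described in the question above, here is a minimal sketch of the arithmetic. It assumes occupancy is limited only by the number of resident wavefronts and resident work groups per compute unit, and it ignores register and local memory usage. The `CuLimits` struct, the function names, and the example limits (wavefront size 64, 24 wavefronts, 8 work groups) are illustrative assumptions, not verified figures for any particular card; real limits should be taken from the profiler or the hardware documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical per-compute-unit limits (assumptions for illustration only).
struct CuLimits {
    std::size_t wavefrontSize;  // work-items per wavefront
    std::size_t maxWavefronts;  // resident wavefronts per compute unit
    std::size_t maxWorkGroups;  // resident work groups per compute unit
};

// Number of wavefronts one work group occupies (rounded up).
std::size_t wavefrontsPerGroup(std::size_t workGroupSize, const CuLimits& cu) {
    assert(cu.wavefrontSize > 0 && workGroupSize > 0);
    return (workGroupSize + cu.wavefrontSize - 1) / cu.wavefrontSize;
}

// Occupancy in percent, considering only the wavefront and work group
// limits. Registers and local memory are deliberately ignored here.
double occupancyPercent(std::size_t workGroupSize, const CuLimits& cu) {
    const std::size_t perGroup = wavefrontsPerGroup(workGroupSize, cu);
    const std::size_t groupsByWaves = cu.maxWavefronts / perGroup;
    const std::size_t groups = std::min(cu.maxWorkGroups, groupsByWaves);
    return 100.0 * static_cast<double>(groups * perGroup)
                 / static_cast<double>(cu.maxWavefronts);
}
```

With the assumed limits `CuLimits{64, 24, 8}`, a work group of 64 work-items fills only 8 of the 24 wavefront slots (about 33%), while a work group of 256 would fill all of them. Whether a larger work group actually runs faster is a separate question that only measurement can answer.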
This is a bit strange. You are probably making many kernel calls, and the lower half of the image is for the overall application (or maybe for the hot-spot kernel). Can you also attach the profiler counter output?
To answer your question, I would think that only the library developers can fix kernel occupancy issues, but I am not sure, as I have not used this library so far.
@himanshu.gautam: I've attached the CSV file to the initial post. Is that what you meant?
PS: I was not able to reply to your post directly (not authorized) ...
The workgroup sizes were chosen empirically for maximum performance. The library was tuned to work well on the 59xx cards, and that tuning largely applies to the 5770 card as well. Please note that your GPU family is three generations old, and support for such older cards typically dwindles.
What FFT problems are you running? What transform sizes, etc.?
Also, we recently open-sourced the code. The project is called clMath, and the FFT library is available on GitHub at https://github.com/clMathLibraries/clFFT
You can now browse through the code if needed.
If I change SIMD_WIDTH=64 to SIMD_WIDTH=128 or any other number in plan.h (even SIMD_WIDTH=100000 compiles and runs fine), the performance does not change. Moreover, SIMD_WIDTH doesn't appear anywhere else in the clFFT source code. So how can I tune this parameter?
(By the way, I'm using a HD7970 now. I've also successfully changed the maximum DP FFT size from 2^22 to 2^24.)
If SIMD_WIDTH does not appear anywhere else in the code, it is unused and probably stale code. It should be removed. The only parameter that you can change programmatically is the kernel work group size. The file stockham.generator.cpp has a class/constructor called KernelCoreSpecs where WorkGroupSize is set for a particular transform. You can change this value and experiment with it. Keep in mind that this value applies only to the 1D transforms that any problem eventually gets broken down into.
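As a rough picture of what "WorkGroupSize is set for a particular transform" can look like, here is a minimal, self-contained sketch of a per-length lookup table with a fallback value. It is not the clFFT code: `KernelSpec`, `SpecTable`, and their members are hypothetical names invented for illustration, and the actual KernelCoreSpecs layout in stockham.generator.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-in for a per-length kernel specification.
// This is an illustration, not the structure used inside clFFT.
struct KernelSpec {
    std::size_t workGroupSize;  // work-items per work group for this length
};

// Maps a 1D transform length to its kernel specification.
class SpecTable {
public:
    explicit SpecTable(KernelSpec fallback) : fallback_(fallback) {}

    // Register or override the spec for one transform length,
    // e.g. to try a different work group size in an experiment.
    void set(std::size_t length, KernelSpec spec) {
        assert(spec.workGroupSize > 0);
        specs_[length] = spec;
    }

    // Return the spec for a length, or the fallback if none is registered.
    KernelSpec get(std::size_t length) const {
        const auto it = specs_.find(length);
        return it != specs_.end() ? it->second : fallback_;
    }

private:
    KernelSpec fallback_;
    std::map<std::size_t, KernelSpec> specs_;
};
```

For example, constructing `SpecTable table(KernelSpec{64})` and calling `table.set(1024, KernelSpec{128})` would make length 1024 use 128 work-items while every other length keeps 64. Changing one length at a time and re-profiling keeps each experiment easy to attribute.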