APP : 2.8
OS : Kubuntu 12.04 x64
// for (i = 0; i < 25; i++)
When I comment out the loop, the cache hit rate (measured with CodeXL 1.1) is 99%. But as soon as I uncomment it, the cache hit rate drops to 23% and the kernel execution time increases 50-fold, when it should increase only 25-fold. The function encrypt() is too large to fit into the I-cache, yet with no loop the cache hit rate is still 99%. As soon as I increase the number of iterations, i.e. anything more than 1 iteration, the cache hit rate drops to 23% and the performance penalty is 2x, where x is the number of iterations.
Called once, the function encrypt() might simply get inlined into the kernel. There may be several optimizations that reduce the number of variables needed, resulting in high performance. A big function called over multiple iterations is highly unlikely to get inlined, which would require a lot of variable fetching and stack management.
Anyway, it is interesting, and I will seek some expert advice.
Can you try checking the performance once again, compiling the kernel with the "-cl-opt-disable" flag?
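For reference, the flag is passed as the build-options string of clBuildProgram on the host side. A minimal sketch (the `program` and `device` handles here are hypothetical, assumed to come from the usual clCreateProgramWithSource / clGetDeviceIDs setup):

```c
/* Disable all optimizations so we can compare I-cache behavior.
 * "-cl-opt-disable" is a standard OpenCL build option. */
const char *build_opts = "-cl-opt-disable";
cl_int err = clBuildProgram(program, 1, &device, build_opts, NULL, NULL);
```

If the cache-hit rate changes significantly with optimizations off, that points at a compiler code-generation issue rather than a profiler bug.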
Generally, goto statements produce only warnings, which are mostly harmless. Also, the kernel I have attached doesn't have any goto statements. The compiler also seems to auto-inline all the functions, which I really don't want. Is there any possible way to reduce the code length (ISA length)?
goto is not that big a problem, but the number of iterations is. Can you report your results after running the kernel for multiple iterations? The cache-hit counter might be buggy (and in that case the issue should go to the CodeXL team), but we need to make sure performance is indeed getting worse. In that case, it becomes an OpenCL compiler/runtime issue.
CodeXL seems to be reporting correctly, because the drop in cache hits is accompanied by an increase in fetch size and memory-unit busy, which can only be explained by increased cache misses. I also ran the kernel 10 times inside a loop on the host side, and the performance counters were almost identical for each kernel call. This looks like a compiler problem to me.
This code is not going to run well because it’s too large for the instruction cache. Even the code without the loop takes 35072 bytes, which is too large. Combine that with the fact that we can only get 4 waves per CU, due to the VGPR usage, and we can’t hide the latency of the I$ fetches. Perhaps with the user’s particular driver the code without the loop fits in the I$, but with the driver I am testing, both kernels are too large for the I$.
The developer should also be aware that, as far as I can see, adding the loop does nothing to the algorithm: the first thing done in encrypt() is to set out to in, which undoes all of the previous iteration's computation. The compiler can’t see this.
Courtesy: Jeff Golds