CPU: turion ultra ZM-86 / 4GB
GPU: radeon HD 4570 / 512MB
Catalyst 13.1 for legacy GPU
linux kernel 3.2.43
I have a very weird problem while iterating over an OpenCL kernel:
when the program runs the kernel more than 2 times in a loop, the Xorg server crashes.
I can run the program more than 2 times in a row when it does not iterate.
I can also run the program in iteration mode when launching it from the console: the system is perfectly stable when Xorg is not running.
A minor remark: during kernel execution, Xorg also 'locks up'. If you put a clock on the desktop before launching the program in non-iterative mode, it keeps showing the launch time until the program finishes, and only then is the screen redrawn.
Is there any way I can instruct the program to be less 'aggressive' with its resources? I already tried a clFinish at the end of each iteration to make sure I did not forget any reads / writes in the command queue, but that didn't help either.
1. How long does your kernel run?
2. IMHO, this seems more like a bug in your code. Please post your code.
1. about a minute
2. I really hoped I wouldn't have to do so :s the code is still in the 'under investigation' stage and as such not documented at all
anyway, the C source file is up here: http://184.108.40.206:/debugCL.c
the 'iteration' is in the function 'continue_work', line 473
the device setup is in the function 'progress_task_proposal_advertisement', line 556
If you hog the GPU for 1 minute, it can potentially trigger a watchdog timer, which may attempt a forceful recovery.
I know this used to be the case with Windows a long time back... Not too sure what happens on Linux.
But it is only natural to expect a 1-minute-long kernel on a display card to invite some watchdogs/monitors.
That could certainly be some kind of problem ... I'll verify it.
*edit* It does not seem to be the problem; the drivers here have no watchdog engine:
Note that by "crash" I also do not mean that the screen blanks; rather, any I/O becomes impossible and the screen is no longer redrawn, as described earlier in the thread. Between 2 kernel executions, however, it still should be.
Currently, OpenCL on the GPU is tied to X on Linux.
AMD is working to de-couple the two so that you can run your programs even without the X-server running.
Hogging the GPU for 1 minute will certainly invite trouble.
I will ask around to see what can be done...
But your best bet would be to break your kernel down into multiple kernel launches, each running for probably a second or two.
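To make the "multiple shorter launches" suggestion concrete, here is a minimal host-side sketch in C. It only computes the slices of a 1-D global range; the helper names (chunk_t, chunk_count, nth_chunk) are made up for illustration and are not taken from debugCL.c. The OpenCL calls are shown in a comment only, and the right slice size for a one-to-two-second launch has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* One slice of a 1-D global range: work-items [offset, offset + size). */
typedef struct {
    size_t offset;
    size_t size;
} chunk_t;

/* Number of launches needed to cover `total` work-items with slices of
 * at most `chunk` work-items each (hypothetical helper). */
size_t chunk_count(size_t total, size_t chunk)
{
    assert(chunk > 0);
    return (total + chunk - 1) / chunk;
}

/* Slice number `index` (0-based). Slices past the end come back empty. */
chunk_t nth_chunk(size_t total, size_t chunk, size_t index)
{
    chunk_t c;
    assert(chunk > 0);
    c.offset = index * chunk;
    if (c.offset >= total) {
        c.offset = total;
        c.size = 0;
        return c;
    }
    c.size = (total - c.offset < chunk) ? total - c.offset : chunk;
    return c;
}

/* Intended host loop (sketch only, not compiled here):
 *
 *   for (i = 0; i < chunk_count(total, chunk); i++) {
 *       chunk_t c = nth_chunk(total, chunk, i);
 *       clEnqueueNDRangeKernel(queue, kernel, 1, &c.offset, &c.size,
 *                              &local, 0, NULL, NULL);
 *       clFinish(queue);   // give the display a chance between slices
 *   }
 *
 * Assumptions: a non-NULL global_work_offset needs OpenCL 1.1; on an
 * OpenCL 1.0 device pass NULL and hand c.offset to the kernel as an
 * argument instead. c.size must also be a multiple of `local`, so pick
 * `chunk` as a multiple of the work-group size and pad the last slice.
 */
```

With total = 1000 and chunk = 256 this gives four launches, the last one covering 232 work-items starting at offset 768.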
Then I believe I found an unexpected feature (or a very serious program malfunction):
The program runs without X running.
Even if X has not been loaded (so at startup, before I load X), the program runs.
The only dependency is that the ati kernel module must be loaded, which the kernel does by itself.
BTW: I don't know if I should change the thread subject, as running the program inside a minimalistic environment (no OpenGL, DirectFB, ...) works fine, with no screen freezes at all! So the problem is not with X; it might be Qt or another rendering engine that messes up the situation.
Nou's right ... without X running, the code (as a normal user) indeed only ran on the CPU. As root (yes, I somehow have the habit of always logging in as root on a tty terminal), the calculations were offloaded to the GPU. So what happened: as a normal user the program just selected another device (the CPU) to do the job ... as root, it indeed stays on the GPU.
I optimized my kernel a bit (replaced % with & where possible, decreased memory usage, etc ...), and gained a 15% performance increase, so that's nice. But the problem does not seem to solve itself, so I guess I'll just have to find myself a GPU which is better suited for this stuff (the bottleneck is memory bandwidth). Case can be closed.
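For readers wondering about the % to & replacement mentioned above: it is only valid for unsigned values and a divisor that is a power of two. A minimal sketch in plain C (the function names are illustrative; the same expression works unchanged in OpenCL C on uint values):

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if n is a power of two (n != 0 and exactly one bit set). */
int is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* x % n computed with a bit mask. Only correct when n is a power of
 * two and x is unsigned; the assert guards the first condition. */
uint32_t mod_pow2(uint32_t x, uint32_t n)
{
    assert(is_pow2(n));
    return x & (n - 1);
}
```

For example, mod_pow2(1000, 256) and 1000 % 256 both give 232. For a divisor such as 12 the trick does not apply and % has to stay.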
I have a question concerning this topic:
I have a pc which is capable of running 3 VGA cards, and used this one to experiment with this program.
Currently, this pc is populated with NVIDIA 8800/9500 cards (bought way before the actual Phenom X6), and some signs of a solution seem to appear. Of course, the lack of GPU capabilities (8800 cores were actually never designed to be OpenCL 1.0 compatible) will not fix the problem.
A card replacement may let me continue my research, but I got one question that annoys me a bit:
In the OpenCL benchmarks, AMD cards seem to perform way better on integer operations (the ones I am performing) than NVIDIA cards at the same price, so that would be the obvious choice. However, the work-group size limit is only 256 compared to 1024 with NVIDIA cards. Is this a driver software limit which AMD might change in one of its next releases? It would be useful if I could put a few more items in the same work-group.
I do not think that limit will change any time soon. And I do not see a big reason to support 1024 work-items in a compute unit anyhow. The intent of GPU computing is to use the available GPU resources to their maximum. You can always break your work down to fit within 256 threads instead of 1024 by assigning 4 times more work to each thread.
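The "4 times more work per thread" idea can be sketched as an index mapping. The code below is a host-side simulation in plain C, not a kernel: lid stands in for get_local_id(0), and the strided layout (pass k handles item k * 256 + lid) is one possible choice, assumed here because it keeps neighbouring work-items on neighbouring elements.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE      256  /* work-group size limit discussed above */
#define ITEMS_PER_WI 4    /* 4 * 256 = 1024 logical items */

/* Logical item handled by work-item `lid` on its k-th pass (k = 0..3). */
size_t logical_index(size_t lid, size_t k)
{
    assert(lid < WG_SIZE && k < ITEMS_PER_WI);
    return k * WG_SIZE + lid;
}

/* Simulates one work-group on the host and counts how often each of
 * the 1024 logical items is visited. */
void simulate_group(unsigned visits[WG_SIZE * ITEMS_PER_WI])
{
    size_t i, lid, k;
    for (i = 0; i < WG_SIZE * ITEMS_PER_WI; i++)
        visits[i] = 0;
    for (lid = 0; lid < WG_SIZE; lid++)
        for (k = 0; k < ITEMS_PER_WI; k++)
            visits[logical_index(lid, k)]++;
}
```

In a real kernel the inner loop over k would live inside the kernel body, so a single launch still covers all 1024 logical items per work-group.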
True, but that could imply I have to enqueue a few extra kernel executions. The process is mainly coordinated by get_local_id(i) and get_group_id(i), and a work-group size of 1024 would let me use a 2nd work-group dimension from time to time (32*32). Isn't this an overhead worth thinking about? (I have absolutely no clue, so if it's a stupid question, just tell me.)
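One possible way to keep a logical 32*32 layout without extra kernel launches, sketched under assumptions: use a real local size of 32 x 8 (256 work-items, assuming the device accepts that shape) and let each work-item walk 4 rows of the tile. Here lx and ly stand in for get_local_id(0) and get_local_id(1); the names and the row assignment are illustrative only, and whether the in-kernel loop costs less than extra launches would need to be measured.

```c
#include <assert.h>

#define TILE    32                /* logical tile: TILE x TILE = 1024 items */
#define LOCAL_Y 8                 /* real work-group: 32 x 8 = 256 work-items */
#define PASSES  (TILE / LOCAL_Y)  /* rows handled per work-item: 4 */

typedef struct {
    unsigned x;
    unsigned y;
} tile_pos_t;

/* Position inside the 32 x 32 tile that work-item (lx, ly) handles on
 * pass number `pass` (0..3). */
tile_pos_t tile_pos(unsigned lx, unsigned ly, unsigned pass)
{
    tile_pos_t p;
    assert(lx < TILE && ly < LOCAL_Y && pass < PASSES);
    p.x = lx;
    p.y = pass * LOCAL_Y + ly;
    return p;
}
```

Every (x, y) position of the tile is produced exactly once across all work-items and passes, so the logical 2-D coordinates stay available inside a single launch.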