In the document 'AMD Accelerated Parallel Processing OpenCL Programming Guide' provided here
Table 5.3 gives the instructions per cycle (IPC) ratings for various instructions from which we may calculate the peak FLOPS for both single and double precision calculations. Using the table I calculate the double precision peak FLOPS as
dp_add_flops = total_alu_count * clock_rate * dp_add_ipc
= 2048 * 1.05 GHz * 0.5
= 1.0752 TFLOPS
which is roughly in line with the advertised performance, however for single precision I have
sp_add_flops = total_alu_count * clock_rate * sp_add_ipc
= 2048 * 1.05 GHz * 4
= 8.6016 TFLOPS
which is exactly double the advertised performance. What am I missing here? If the single precision add IPC is reduced to 2, the numbers are spot on, but that does not agree with the specs in the document identified above. Also, is there a place where I can find very detailed hardware specifications for my card specifically?
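For reference, the arithmetic above can be reproduced with a short Python sketch. The function name `peak_tflops` and the constant names are made up for illustration; the ALU count, clock rate, and IPC values are the ones quoted in this post.

```python
# Hypothetical helper that only reproduces the arithmetic in the post.
ALU_COUNT = 2048      # total ALU count used in the post
CLOCK_HZ = 1.05e9     # 1.05 GHz, as used in the post


def peak_tflops(alu_count, clock_hz, ipc):
    """Peak rate in TFLOPS for a given instructions-per-cycle rating."""
    return alu_count * clock_hz * ipc / 1e12


dp_add = peak_tflops(ALU_COUNT, CLOCK_HZ, 0.5)      # DP ADD IPC used in the post
sp_add_table = peak_tflops(ALU_COUNT, CLOCK_HZ, 4)  # SP ADD IPC as read from Table 5.3
sp_add_halved = peak_tflops(ALU_COUNT, CLOCK_HZ, 2) # SP ADD IPC reduced to 2

print(f"DP ADD:           {dp_add:.4f} TFLOPS")
print(f"SP ADD (IPC = 4): {sp_add_table:.4f} TFLOPS")
print(f"SP ADD (IPC = 2): {sp_add_halved:.4f} TFLOPS")
```

With an IPC of 4 the single precision figure comes out at twice the value obtained with an IPC of 2, which is the discrepancy being asked about.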
I went through the table and I am equally confused. I will ask around.
Here are some of my thoughts:
4 operations per cycle per stream processor looks like a huge task. Usually this can be done only if the instruction is operating on a vector.
Also, the guide calculates DP Peak assuming that the Tahiti is a "one-quarter double precision speed device". No idea why it is so....
Will ask around and get back to you.
I think the table is probably correct, but a little interpretation is needed.
Cards like Tahiti 7970, 7950 are "Full Speed Double Precision devices", so they are in the right column.
Full Speed only means the best the architecture can do, no specific speed.
The word "cycle" here means 4 clocks. Stream processors issue a wave in 4 parts with a minimum
4 clock latency which is considered 1 cycle. However, instructions can be issued
on each clock of a cycle thus 4 insns/cycle.
Most FP instructions are 4/cycle (1/clock), which is impressive for FMA and the like.
The transcendentals (rcp, sin, log, sqrt, rsqrt) are 1/cycle.
Most all DP is 1/cycle except ADD, where they manage to squeak out 2.
(note they choose ADD to calculate a performance for DP).
Also, "peak" performance is almost always based on multiply + add insns (MAD)
which count as 2 instructions per instruction, which gives a FACTOR of 2.
Using clocks, not cycles, peak performance would be:
(1 insn/clock)*(2048)*(1.0e9)*FACTOR = 4.096 TFLOPS for most FP and Int.
(1/4 insn/clock)*(2048)*(1.0e9)*FACTOR = 1.024 TFLOPS DP.
Using cycles would be 4 or 1 insn/cycle and a cycle speed of 0.25e9, which comes out the same.
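As a rough check that the two ways of counting agree, here is a minimal Python sketch. It assumes, as in this post, that one table "cycle" is 4 clocks, a 1.0e9 Hz clock, and a FACTOR of 2 for multiply + add. The function and constant names are invented for illustration.

```python
ALU_COUNT = 2048
CLOCK_HZ = 1.0e9        # clock rate used in the post's arithmetic
CLOCKS_PER_CYCLE = 4    # assumption: one table "cycle" spans 4 clocks
FACTOR = 2              # multiply + add counted as 2 operations


def peak_flops_by_clock(insn_per_clock):
    """Peak FLOPS when the rate is given in instructions per clock."""
    return insn_per_clock * ALU_COUNT * CLOCK_HZ * FACTOR


def peak_flops_by_cycle(insn_per_cycle):
    """Peak FLOPS when the rate is given in instructions per 4-clock cycle."""
    cycle_hz = CLOCK_HZ / CLOCKS_PER_CYCLE  # 0.25e9 cycles per second
    return insn_per_cycle * ALU_COUNT * cycle_hz * FACTOR


sp_clock = peak_flops_by_clock(1.0)   # most FP: 1 insn/clock
sp_cycle = peak_flops_by_cycle(4)     # same rate expressed as 4 insns/cycle
dp_clock = peak_flops_by_clock(0.25)  # most DP: 1/4 insn/clock
dp_cycle = peak_flops_by_cycle(1)     # same rate expressed as 1 insn/cycle

print(sp_clock / 1e12, sp_cycle / 1e12)  # both in TFLOPS
print(dp_clock / 1e12, dp_cycle / 1e12)
```

Either unit gives the same peak, so the only thing that changes the result is which meaning of "cycle" the IPC number is paired with.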
Basic int operations are 4/cycle with the big exception the 32 bit mul and mad reduce to 1/cycle.
However there are 24 bit accuracy versions of mad and mul that run at 4 insns/cycle.
Presumably the reason is that the 24 bit insns use the fast FP multipliers, which only have to
multiply 24 bit mantissas. At least that was always my guess.
Edit: fixed the ambiguous phrase
"many instructions can be issued on each clock thus 4 insns/cycle".
Noting of course that ADD alone and MAD with the double multiplier give the same quarter-clock-rate throughput of DP ops.
If your explanation is correct (and it seems sound as you write it, though I'd have to read thoroughly to find where else 'cycle' is used in that way), then I think the problem is that the parts of the programming guide I rewrote use "cycle" to mean "clock cycle", and like most people here I'm naturally reading the table the same way. Some clarification is in order to make that table more consistent with the rest of the chapter. I will make a note.
Thanks for the feedback.
Yes, I only assume that meaning for cycles from looking at the table and already knowing the answers.
It is the only meaning that makes sense. But the text just a few lines below the table states
... Table 5.3, a Tahiti device can perform one double-precision ADD operations/2 cycles in each stream core.
whereas the table clearly says 2 DP ADDs per "cycle" for each stream core.
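One way to see how the two statements could both hold is to convert each to a per-clock rate. The sketch below is only an illustration: it assumes the table's "cycle" is 4 clocks (the reading proposed earlier in this thread) and that the sentence below the table uses "cycle" to mean a single clock. The names are hypothetical.

```python
from fractions import Fraction

CLOCKS_PER_TABLE_CYCLE = 4  # assumption: the table's "cycle" spans 4 clocks


def per_clock(ops_per_table_cycle):
    """Convert a table rating (ops per 4-clock cycle) to ops per clock."""
    return Fraction(ops_per_table_cycle, CLOCKS_PER_TABLE_CYCLE)


# Table: 2 DP ADDs per "cycle" per stream core.
table_dp_add = per_clock(2)

# Text: one DP ADD per 2 cycles, reading "cycle" here as one clock.
text_dp_add = Fraction(1, 2)

print(table_dp_add, text_dp_add, table_dp_add == text_dp_add)
```

Under those two assumptions both statements describe half a DP ADD per clock per stream core, which suggests the table and the text differ only in what they call a cycle.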
Thank you very much for the explanation, it clears things up quite a bit. A few points of clarification though. When you say that many instructions can be issued on each clock thus 4 insns/cycle, did you mean 1ins/clock and thus 4ins/cycle or is there actually something superscalar in nature going on here? Also, in your calculations I think it is TFLOPS and not GFLOPS, (1)*(2048)*(1E9)*(2)/(10**12) = 4.096.
Each of the 32 cores on the chip (CUs in diagram terminology) as a whole issues multiple operations per cycle, so it is certainly superscalar. The 2048 figure counts ALUs, which is already below the superscalar level of the chip. Each ALU issues one floating point operation per clock cycle at peak, or fewer if the operation is multi-cycled in some way (however that is practically implemented).
A few points of clarification though. When you say that many instructions can be issued on each clock thus 4 insns/cycle, did you mean 1ins/clock and thus 4ins/cycle or is there actually something superscalar in nature going on here? [...] I think it is TFLOPS and not GFLOPS, (1)*(2048)*(1E9)*(2)/(10**12) = 4.096.
Oops, fixed both, thanks.
It should be "most instruction types can be issued on each clock" or just "instructions can be issued on...."
(I can see an award for ambiguity here)