9 Replies Latest reply: Apr 12, 2013 1:37 AM by kittyshen2013

HD7970ghz Peak TFLOPS calculation

rcgoodfellow Newbie

In the document 'AMD Accelerated Parallel Processing OpenCL Programming Guide' provided here 

 

http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/

 

Table 5.3 gives the instructions per cycle (IPC) ratings for various instructions from which we may calculate the peak FLOPS for both single and double precision calculations.  Using the table I calculate the double precision peak FLOPS as

 

     dp_add_flops = total_alu_count * clock_rate * dp_add_ipc
                  = 2048 * 1.05 GHz * 0.5
                  = 1.0752 TFLOPS

 

which is roughly in line with the advertised performance. For single precision, however, I have

 

     sp_add_flops = total_alu_count * clock_rate * sp_add_ipc
                  = 2048 * 1.05 GHz * 4
                  = 8.6016 TFLOPS

 

which is exactly double the advertised performance.  What am I missing here? If the single-precision add IPC is reduced to 2 then the numbers are spot on; however, that does not agree with the specs provided in the document identified above.  Also, is there a place where I can find very detailed hardware specifications for my card specifically?
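For reference, the arithmetic above can be reproduced with a short script. This is only a sketch of the calculation as posted: the function name `peak_tflops` is made up for illustration, and the per-cycle rates (0.5, 4 and the hypothetical 2) are the values quoted in this post, not independently checked against Table 5.3.

```python
def peak_tflops(alu_count, clock_ghz, ops_per_cycle):
    """Naive peak: ALUs * clock (GHz) * operations per cycle, in TFLOPS."""
    return alu_count * clock_ghz * ops_per_cycle / 1000.0

# Values taken from the post; the IPC figures are the poster's reading.
dp_add = peak_tflops(2048, 1.05, 0.5)
sp_add = peak_tflops(2048, 1.05, 4)
sp_add_if_ipc_2 = peak_tflops(2048, 1.05, 2)   # the "what if IPC were 2" case

print(f"DP add:              {dp_add:.4f} TFLOPS")
print(f"SP add:              {sp_add:.4f} TFLOPS")
print(f"SP add with IPC = 2: {sp_add_if_ipc_2:.4f} TFLOPS")
```

The last line is the figure that matches the advertised number, which is what prompts the question.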

 

Thanks

~ry

  • Re: HD7970ghz Peak TFLOPS calculation
    himanshu.gautam Master

    I went through the table and I am equally confused. I will ask around.

     

    Here are some of my thoughts:

    4 operations per cycle per stream processor looks like a huge task. Usually this can be done only if the instruction is operating on a vector.

    Also, the guide calculates DP Peak assuming that the Tahiti is a "one-quarter double precision speed device". No idea why it is so....

     

    Will ask around and get back to you,

    Regards

    Himanshu , Bruhaspati

    --------------------------------

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied

  • Re: HD7970ghz Peak TFLOPS calculation
    drallan Novice

    I think the table is probably correct, but a little interpretation is needed.

     

    Cards like Tahiti 7970, 7950 are "Full Speed Double Precision devices", so they are in the right column.

    Full Speed only means the best the architecture can do, no specific speed.

    The word "cycle" here means 4 clocks. Stream processors issue a wave in 4 parts with a minimum 4-clock latency, which is considered 1 cycle. However, instructions can be issued on each clock of a cycle, thus 4 insns/cycle.
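A rough sketch of where the "4 clocks" might come from. The wavefront size of 64 and the SIMD width of 16 lanes are assumptions that do not appear in this thread; the constant names are purely illustrative.

```python
WAVEFRONT_SIZE = 64   # assumption: work-items per wavefront
SIMD_LANES = 16       # assumption: lanes (stream cores) per SIMD unit

# One wave instruction is issued in 4 parts, one part per clock.
clocks_per_cycle = WAVEFRONT_SIZE // SIMD_LANES

# Each lane handles one work-item per clock, so over one 4-clock "cycle"
# a single lane executes 4 operations of a full-rate instruction.
ops_per_lane_per_clock = 1
ops_per_lane_per_cycle = ops_per_lane_per_clock * clocks_per_cycle

print(clocks_per_cycle, ops_per_lane_per_cycle)
```

Under those assumptions, "4 insns/cycle" per stream core is the same thing as 1 per clock.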

     

    Most FP instructions are 4/cycle (1/clock), which is impressive for FMA and the like.

    The transcendentals  (rcp, sin, log, sqrt, rsqrt) are 1/cycle.

    Almost all DP is 1/cycle except ADD, where they manage to squeak out 2.

    (note they choose ADD to calculate a performance for DP).

     

    Also, "peak" performance is almost always based on multiply + add insns (MAD), which count as 2 floating-point operations per instruction, which gives a FACTOR of 2.

     

    Using clocks, not cycles, peak performance would be:

        (1 insn/clock)*(2048)*(1.0e9)*FACTOR   = 4.096 TFLOPS for most FP and Int.

        (1/4 insn/clock)*(2048)*(1.0e9)*FACTOR = 1.024 TFLOPS DP.

    Using cycles would be 4 or 1 insn/cycle and a cycle speed of 0.25e9, which comes out the same.
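To check that the two conventions really come out the same, here is a minimal sketch of both calculations. It uses the round 1.0e9 clock from the lines above; the function names are hypothetical.

```python
ALUS = 2048
CLOCK_HZ = 1.0e9          # round 1 GHz figure used in the post
CLOCKS_PER_CYCLE = 4      # the post's reading of "cycle" in the table
FACTOR = 2                # a multiply-add counts as 2 floating-point ops

def peak_from_clocks(insns_per_clock):
    """Peak TFLOPS when the rate is given per clock."""
    return insns_per_clock * ALUS * CLOCK_HZ * FACTOR / 1e12

def peak_from_cycles(insns_per_cycle):
    """Peak TFLOPS when the rate is given per 4-clock cycle."""
    cycle_hz = CLOCK_HZ / CLOCKS_PER_CYCLE   # 0.25e9 cycles per second
    return insns_per_cycle * ALUS * cycle_hz * FACTOR / 1e12

sp_clocks = peak_from_clocks(1.0)
sp_cycles = peak_from_cycles(4)
dp_clocks = peak_from_clocks(0.25)
dp_cycles = peak_from_cycles(1)

print(sp_clocks, sp_cycles)
print(dp_clocks, dp_cycles)
```

Either way the unit bookkeeping cancels; only the label on the rate changes.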

     

    Basic int operations are 4/cycle, with the big exception that 32-bit mul and mad drop to 1/cycle. However, there are 24-bit-accuracy versions of mad and mul that run at 4 insns/cycle. Presumably the reason is that the 24-bit insns use the fast FP multipliers, which only have to multiply 24-bit mantissas. At least that was always my guess.
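The "24-bit mantissa" part can at least be illustrated on the number-format side. The sketch below only shows that IEEE 754 single precision carries a 24-bit significand (23 stored bits plus the implicit leading 1); it says nothing about how the hardware multipliers are actually shared, which remains a guess. The helper `to_float32` is made up for this example.

```python
import struct

def to_float32(x):
    """Round a Python number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

max_24bit = 2**24 - 1          # largest 24-bit unsigned integer
beyond = 2**24 + 1             # needs 25 significant bits

fits_exactly = to_float32(max_24bit) == max_24bit
loses_a_bit = to_float32(beyond) != beyond

print(fits_exactly, loses_a_bit)
```

So a multiplier sized for single-precision significands is exactly wide enough for 24-bit integer operands and no wider.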

     

    Edit: fixed the ambiguous phrase

      "many instructions can be issued on each clock thus 4 insns/cycle".

    • Re: HD7970ghz Peak TFLOPS calculation
      dmeiser Novice

      Nice explanation of this confusing topic.

    • Re: HD7970ghz Peak TFLOPS calculation
      LeeHowes Apprentice

      Noting of course that ADD alone and MAD with the double multiplier give the same quarter-clock-rate throughput of DP ops.
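A quick sketch of that equivalence, assuming the rates given earlier in the thread (DP ADD at 2 per 4-clock cycle, DP MAD at 1 per cycle, SP MAD at 4 per cycle) and the 1.05 GHz clock from the original question. The function name is hypothetical.

```python
ALUS = 2048
CLOCK_GHZ = 1.05
CLOCKS_PER_CYCLE = 4       # assumed reading of the table's "cycle"

def tflops(insns_per_cycle, flops_per_insn):
    """Peak TFLOPS for a rate given per 4-clock cycle."""
    per_clock = insns_per_cycle / CLOCKS_PER_CYCLE
    return ALUS * CLOCK_GHZ * per_clock * flops_per_insn / 1000.0

dp_add = tflops(2, 1)      # ADD: 2 per cycle, 1 flop each
dp_mad = tflops(1, 2)      # MAD: 1 per cycle, 2 flops each
sp_mad = tflops(4, 2)      # MAD: 4 per cycle, 2 flops each

print(dp_add, dp_mad, sp_mad)
print(dp_add / sp_mad)     # DP peak as a fraction of SP peak
```

Both DP figures land on the same number, a quarter of the single-precision MAD peak.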

       

      If your explanation is correct (and it seems sound as you write it, though I'd have to read thoroughly to find where else 'cycle' is used in that way), then I think the problem is that the parts of the programming guide I rewrote use "cycle" to mean "clock cycle", and like most people here I'm naturally reading the table the same way. Some clarification is in order to make that table more consistent with the rest of the chapter. I will make a note.

      Lee Howes
      Advanced Micro Devices Inc.


      • Re: HD7970ghz Peak TFLOPS calculation
        drallan Novice

        Thanks for the feedback.

         

        Yes, I only assume that meaning for cycles from looking at the table and already knowing the answers.

        It is the only meaning that makes sense. But just a few lines below the table states

        ... Table 5.3, a Tahiti device can perform one double-precision ADD operations/2 cycles in each stream core.

        whereas the table clearly says 2 DP ADDs per "cycle" for each stream core.
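One way the two statements could be reconciled, sketched below: read the table's "cycle" as 4 clocks and the sentence's "cycles" as plain clock cycles. Both readings are assumptions, not something the guide states.

```python
from fractions import Fraction

CLOCKS_PER_CYCLE = 4                  # assumed meaning of "cycle" in the table

table_rate = Fraction(2, 1)           # table: 2 DP ADDs per "cycle"
adds_per_clock = table_rate / CLOCKS_PER_CYCLE
clocks_per_add = 1 / adds_per_clock   # text: one DP ADD per this many clocks

print(adds_per_clock)   # 1/2
print(clocks_per_add)   # 2
```

With that reading, "2 per cycle" in the table and "one per 2 cycles" in the text describe the same rate in different units.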

         

         

         

         


    • Re: HD7970ghz Peak TFLOPS calculation
      rcgoodfellow Newbie

      Hi drallan,

       

      Thank you very much for the explanation, it clears things up quite a bit.  A few points of clarification though. When you say that many instructions can be issued on each clock thus 4 insns/cycle, did you mean 1ins/clock and thus 4ins/cycle or is there actually something superscalar in nature going on here?  Also, in your calculations I think it is TFLOPS and not GFLOPS, (1)*(2048)*(1E9)*(2)/(10**12) = 4.096.

       

      thanks

      ~ry

      • Re: HD7970ghz Peak TFLOPS calculation
        LeeHowes Apprentice

        Each of the 32 cores on the chip (CUs in diagram terminology) as a whole issues multiple operations per cycle, so it is certainly superscalar. When we arrive at the number 2048 we are counting ALUs, so that is already below the superscalar level of the chip. Each ALU issues one floating-point operation per clock cycle at peak, fewer if the operations are multi-cycled in some way (however that is practically implemented).
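As a rough sketch, the ALU count can be broken down per CU like this. The 1.05 GHz clock is the one from the original question, and counting a multiply-add as 2 floating-point operations follows the convention used earlier in the thread; the variable names are illustrative only.

```python
TOTAL_ALUS = 2048
COMPUTE_UNITS = 32
CLOCK_GHZ = 1.05            # GHz Edition clock from the original question
FLOPS_PER_MAD = 2           # multiply + add counted as 2 operations

alus_per_cu = TOTAL_ALUS // COMPUTE_UNITS

# One MAD per ALU per clock at peak.
peak_gflops_per_cu = alus_per_cu * CLOCK_GHZ * FLOPS_PER_MAD
peak_tflops_chip = peak_gflops_per_cu * COMPUTE_UNITS / 1000.0

print(alus_per_cu)
print(f"{peak_gflops_per_cu:.1f} GFLOPS per CU")
print(f"{peak_tflops_chip:.4f} TFLOPS for the chip")
```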

        Lee Howes
        Advanced Micro Devices Inc.


      • Re: HD7970ghz Peak TFLOPS calculation
        drallan Novice

        A few points of clarification though. When you say that many instructions can be issued on each clock thus 4 insns/cycle, did you mean 1ins/clock and thus 4ins/cycle or is there actually something superscalar in nature going on here?  [...] I think it is TFLOPS and not GFLOPS, (1)*(2048)*(1E9)*(2)/(10**12) = 4.096.

        Oops, fixed both, thanks.

        It should be "most instruction types can be issued on each clock" or just "instructions can be issued on...."

        (I can see an award for ambiguity here)

  • Re: HD7970ghz Peak TFLOPS calculation
    kittyshen2013 Newbie

    Thank you for sharing.
