Wednesday, June 14, 2006

T1 CPI

(7) T1 overlaps latency across instruction execution, so its memory model determines how effective it is with or without memory stalls across the memory hierarchy. The empirical data set indicates the size of the problem, and the hidden factors contributing to CPU efficiency require further investigation.

(6) Someone may agree. A CPI of >=4 as tested on a Niagara would indicate linear thread scaling, but the same data found on a USIII would not necessarily lead to the same conclusion. A CPI of 1 - 2 on a USIII could actually be >=4 on a T1 because of the USIII's superscalarness. The amount of thread level parallelism in an instruction stream is somewhat limited. So the real question is: is it possible for a *realistic* instruction stream to have a number of stalling instructions that would give >4 CPI on a T1 but closer to 1 CPI on a USIII, due to the USIII's superscalarness masking the stalls? I'm not so convinced that this is true. But data would be good.

(5) Someone questioned why a CPI > 4 (as seen on a USIII) is required for a workload to scale linearly on T1: "most of the kernels don't meet the T1 requirement of a cpi of 4 to get thread scaling". USIII has instruction level parallelism, so (theoretically) a CPI greater than 4 (as seen on USIII) should not be a necessary condition to scale linearly on T1.


(4) Data on which specint tests contain heavy FP: I don't have the data, but I suspect the twolf test also has decent FP, as it's another place and route test like vpr. Eon is a graphics visualization test. Also see Brian's comments about Niagara's CPI for specint, indicating that even if you discount the performance on the FP heavy workloads, specint still won't do as well as a "real world" workload that has an average CPI > 4.

(3) the SPECint_rate FP data set

Percent FP
vpr      dataset 1 -> 5.6%
         dataset 2 -> 8%
eon      dataset 1 -> 15.9%
         dataset 2 -> 15.2%
         dataset 3 -> 16.3%
Only 0.1% of the instructions in eon are sqrt, so fixing
sqrt will help single core, but not significantly change
the rate result.

CPI
gzip     ds1 -> 1.14
         ds2 -> 1.10
         ds3 -> 0.97
         ds4 -> 0.97
         ds5 -> 1.17
vpr      ds1 -> 1.34
         ds2 -> 2.37
gcc      ds1 -> 2.19
         ds2 -> 1.37
         ds3 -> 1.50
         ds4 -> 1.64
         ds5 -> 1.46
mcf      ds1 -> 5.81
crafty   ds1 -> 1.00
eon      ds1 -> 1.20
         ds2 -> 1.25
         ds3 -> 1.30
perlbmk  ds1 -> 1.27
         ds2 -> 1.08
         ds3 -> 1.74
         ds4 -> 1.03
         ds5 -> 1.07
         ds6 -> 1.04
         ds7 -> 1.06
gap      ds1 -> 1.46
vortex   ds1 -> 1.38
         ds2 -> 1.24
         ds3 -> 1.39
bzip2    ds1 -> 1.11
         ds2 -> 0.91
         ds3 -> 0.97
twolf    ds1 -> 1.94

(data collected by Darryl Gove on a US3 1056MHz system)

So, for this benchmark, most of the kernels don't meet the T1
requirement of a cpi of 4 to get thread scaling. That, along
with a single-issue processor, makes it impossible to get good
numbers on this benchmark.

So, the problem really is, int_rate doesn't stall on memory enough
for the T1 processor.
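
The "cpi of 4" argument above can be sketched with a deliberately idealized model. It assumes that a T1 core is one single-issue pipeline shared by 4 hardware threads, that one thread alone issues an instruction every `cpi` cycles, and that other threads' instructions fit perfectly into its stall cycles. It also treats the US3 CPI numbers as a stand-in for single-thread CPI on T1, which is exactly the assumption questioned in notes (5) and (6). The function names are made up for this illustration.

```python
# US3 CPI per dataset, copied from the table above. Using these as a proxy
# for single-thread CPI on T1 is an assumption, not a measurement.
US3_CPI = {
    "gzip": [1.14, 1.10, 0.97, 0.97, 1.17],
    "vpr": [1.34, 2.37],
    "gcc": [2.19, 1.37, 1.50, 1.64, 1.46],
    "mcf": [5.81],
    "crafty": [1.00],
    "eon": [1.20, 1.25, 1.30],
    "perlbmk": [1.27, 1.08, 1.74, 1.03, 1.07, 1.04, 1.06],
    "gap": [1.46],
    "vortex": [1.38, 1.24, 1.39],
    "bzip2": [1.11, 0.91, 0.97],
    "twolf": [1.94],
}

THREADS_PER_CORE = 4  # hardware threads sharing one single-issue pipeline


def ideal_thread_speedup(cpi, threads=THREADS_PER_CORE):
    """Core throughput with `threads` threads relative to one thread.

    Idealized: the pipeline issues at most one instruction per cycle in
    total, and stalls of different threads overlap perfectly.
    """
    single = min(1.0, 1.0 / cpi)        # IPC of one thread alone
    combined = min(1.0, threads / cpi)  # IPC of all threads together
    return combined / single


def datasets_meeting_threshold(table, threshold=4.0):
    """Return (benchmark, dataset number, cpi) for every cpi >= threshold."""
    return [
        (name, ds, cpi)
        for name, cpis in table.items()
        for ds, cpi in enumerate(cpis, start=1)
        if cpi >= threshold
    ]


if __name__ == "__main__":
    for name, ds, cpi in datasets_meeting_threshold(US3_CPI):
        print(name, "ds%d" % ds, cpi)
    for name, cpis in US3_CPI.items():
        speedups = ", ".join("%.2f" % ideal_thread_speedup(c) for c in cpis)
        print("%-8s %s" % (name, speedups))
```

Under this model the speedup from 4 threads is capped at the single-thread CPI, so only a CPI of 4 or more reaches the full 4x. Of the 32 datasets in the table, only mcf (5.81) clears that bar.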

I noticed a reply from you to niagara-interest saying that according to folks at SAE, specint is ~20% floating point. Do you know where I might be able to find a breakdown of FP % for each of the 12 benchmarks?

I have a partner that uses specint results to compare platforms internally and is doing some Niagara testing. They're aware that specint contains floating point instructions and are willing to take suggestions from us on how the different benchmarks should be weighted to emphasize integer performance (I'm hoping, of course, that there are some benchmarks with little to no FP).
