Wednesday, 8 October 2014

Core confusion - round II

Introduction

In the previous article we explained how cores should be counted so that Intel and AMD processors can be compared, and we obtained some performance numbers from a simplistic (and unrealistic) "floating point" benchmark (more on this later). We ran the tests on two server machines we have in production:

Intel Xeon E5620 2.40GHz
AMD Opteron 6328 3.2 GHz

In terms of operations per GHz per second, we can summarize the previously obtained results as follows:
Intel single thread: 24/2.4 = 10
Intel two threads: 28/2.4 = 11.67
AMD single thread: 24/3.2 = 7.5
AMD two threads: 40/3.2 = 12.5
Normalizing the results to Intel's single-thread score and sorting them in ascending order, we get
AMD single thread: 0.75
Intel single thread: 1
Intel two threads: 1.17
AMD two threads: 1.25
This normalization lets us evaluate both single-thread performance and SMT scalability. Per GHz, Intel's single-thread performance is significantly better than AMD's (1 versus 0.75). Intel's Hyper-Threading improves throughput by 17%, whereas AMD's core subdivision delivers 67% more throughput than the single-thread case.
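The normalization above can be reproduced with a few lines of Python. This is only a sketch: the raw scores and clock speeds are the ones quoted in this article, while the names RESULTS, per_ghz and normalize are invented here for illustration.

```python
# Raw integer-benchmark scores and clock speeds (GHz), as quoted above.
RESULTS = {
    "Intel single thread": (24.0, 2.4),
    "Intel two threads": (28.0, 2.4),
    "AMD single thread": (24.0, 3.2),
    "AMD two threads": (40.0, 3.2),
}


def per_ghz(results):
    """Divide each raw score by the clock frequency of its processor."""
    return {name: score / ghz for name, (score, ghz) in results.items()}


def normalize(scores, baseline):
    """Express every score relative to the chosen baseline entry."""
    ref = scores[baseline]
    return {name: value / ref for name, value in scores.items()}


scores = per_ghz(RESULTS)
relative = normalize(scores, "Intel single thread")

# Print in ascending order, like the list in the text.
for name, value in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.2f}")
```

Passing a different baseline name to normalize, for example "Intel two threads", gives the same scores expressed relative to that entry instead.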

If we assume that we want most of our machines to be busy running a large number of threads, we might prefer to use Intel's two-thread performance as the reference. In that case, we obtain
AMD single thread: 0.64
Intel single thread: 0.86
Intel two threads: 1
AMD two threads: 1.07
These scores can be regarded as the relative efficiencies of each processor. From here we can derive an expression for the optimal per-GHz throughput of a fully busy machine, i.e., a machine that is executing at least 2 threads per core:
Per GHz perf level = N * n * S

N - number of processors
n - number of cores per processor
S - processor score running two threads
If we need the total performance of a system, we can multiply the previous formula by the processor clock frequency F (in GHz):
Perf level = N * n * S * F
Note: the previous formula assumes that we are dividing AMD's advertised number of cores by two, as explained in the previous article.
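As a minimal sketch, the formula can be written as a small Python function. The function name perf_level and the dual-socket configurations below are assumptions made for illustration (the article does not state how many sockets the test machines have); the scores S are the two-thread efficiencies listed above.

```python
def perf_level(processors, cores_per_processor, score, clock_ghz=1.0):
    """Return N * n * S * F.

    Leaving clock_ghz at its default of 1.0 yields the per-GHz perf level.
    """
    return processors * cores_per_processor * score * clock_ghz


# Hypothetical dual-socket Opteron 6328 box at 3.2 GHz: the 8 advertised
# cores per socket are counted as 4, and S = 1.07 (two-thread score).
amd = perf_level(2, 4, 1.07, 3.2)

# Hypothetical dual-socket Xeon E5620 box at 2.4 GHz: 4 cores per socket,
# and S = 1 because Intel's two-thread score is the reference.
intel = perf_level(2, 4, 1.0, 2.4)

print(f"AMD: {amd:.2f}  Intel: {intel:.2f}")
```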

We have seen so far that, for the tested workload, the difference in performance between Intel and AMD is not significant if the system is running at least two threads per core. But what about the financials?

We recently had the opportunity to compare two different server machines. All other things equal, we had

AMD dual Opteron 6320 x 4 core 2.8 GHz machine = 2708 EUR
Intel dual E5-2620 v2 x 6 core 2.1 GHz machine = 2890 EUR

Assuming the efficiencies are the same as the ones found on the test machines (more on this below) we would find:

Throughput per EUR
TPE_AMD = 2 * 4 * 2.8 * 1.07 / 2708 = 0.008850
TPE_INTEL = 2 * 6 * 2.1 * 1 / 2890 =  0.008720
Thus, we get nearly a match:
TPE_AMD = 1.015 * TPE_INTEL

Further testing

It turns out, however, that our simplistic "floating point" benchmark was actually performing only integer calculations, because the bc command-line calculator works internally with integers. This was pointed out by Henry Wong, of stuffedcow.net, during a discussion of his own test results. By replacing bc with a simple loop of libc math operations, compiled with gcc, we were able to test pure (though still very simplistic) floating-point performance on the same machines as before.
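To give an idea of what such a loop looks like, here is a rough Python sketch. It is not the mathtest program used for the measurements (that one is written in C and linked in the references), the choice of operations is an assumption, and interpreter overhead means the absolute numbers are not comparable with the C results. It also times a single process only.

```python
import math
import time


def float_loop(iterations):
    """Run a fixed number of floating-point math-library operations."""
    acc = 0.0
    for i in range(1, iterations + 1):
        x = float(i)
        # sqrt, sin and log all operate on hardware floating-point values.
        acc += math.sqrt(x) * math.sin(x) / math.log(x + 1.0)
    return acc


def ops_per_second(iterations=200_000):
    """Time one run of float_loop and return loop iterations per second."""
    start = time.perf_counter()
    float_loop(iterations)
    elapsed = time.perf_counter() - start
    return iterations / elapsed


print(f"{ops_per_second():.0f} loop iterations per second")
```

To approximate the two-thread case, one would start two such processes pinned to the two hardware threads of the same core and add up their rates.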

In terms of operations per GHz per second, we found
Intel single thread: 3.33/2.4 = 1.39
Intel two threads: 5.59/2.4 = 2.33
AMD single thread: 2.86/3.2 = 0.89
AMD two threads: 4.41/3.2 = 1.38
By performing the same normalizations as before we got
AMD single thread: 0.64
AMD two threads: 0.99
Intel single thread: 1
Intel two threads: 1.68 
and, relative to Intel's two-thread score,
AMD single thread: 0.38
AMD two threads: 0.59
Intel single thread: 0.60
Intel two threads: 1
The results look very disappointing for the AMD machine: the single-thread gap is even wider than in the previous test, and this time the extra scalability that compensated for the weak single-thread performance is not there. Intel shows a Hyper-Threading bonus of 68%, whereas AMD gains only about 54%.
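The scalability figures for both tests follow directly from the raw scores. The sketch below recomputes them; the function name smt_bonus and the RUNS table are introduced here for illustration only.

```python
def smt_bonus(single_thread, two_threads):
    """Return the extra throughput of two threads per core, as a fraction."""
    return two_threads / single_thread - 1.0


# Raw scores quoted in this article: (single thread, two threads).
RUNS = {
    "Intel integer": (24.0, 28.0),
    "AMD integer": (24.0, 40.0),
    "Intel floating point": (3.33, 5.59),
    "AMD floating point": (2.86, 4.41),
}

for name, (single, double) in RUNS.items():
    print(f"{name}: {smt_bonus(single, double):+.0%}")
```

The clock frequency cancels out in the ratio, so the raw scores can be used without dividing by GHz first.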

The testing we performed was meant to allow some intuition to be gained into the subject. But we know that floating point performance is a subtle topic and that we must be careful about drawing conclusions from basic testing. Therefore, we decided to have a look at an industry standard benchmark.

Reference results from spec.org

Looking at real benchmark output at spec.org, we again found different results. For the throughput of floating-point operations we have the following base scores:
Dell Inc. - PowerEdge M710 (Intel Xeon E5620, 2.40 GHz)
two thread score (16 threads on 8 cores): 164 / 8 = 20.50

Advanced Micro Devices - Supermicro A+ Server 1022G-NTF, AMD Opteron 6328
two thread score (16 threads on 8 cores, counted Intel's way): 289 / 8 = 36.13
Please note that spec.org uses AMD's core counting in its score table. In terms of operations per GHz per second, the values above translate to:
Intel two threads: 20.5/2.4 = 8.54
AMD two threads: 36.13/3.2 = 11.29
Normalizing we get:
Intel two threads: 1
AMD two threads: 1.32
Combining these SPEC efficiency figures for the processor models we tested directly with the quotes for the new machines, the floating-point throughput per EUR would be
TPE_AMD = 2 * 4 * 2.8 * 1.32 / 2708 = 0.0109
TPE_INTEL = 2 * 6 * 2.1 * 1 / 2890 =  0.0087
TPE_AMD = 1.25 * TPE_INTEL
Fortunately, we can get exact numbers from spec.org for the processors we got quotes for:
Dell Inc. - PowerEdge R720 (Intel Xeon E5-2620 v2, 2.10 GHz)
two thread score (24 threads on 12 cores): 375 / 12 = 31.25

Advanced Micro Devices - Supermicro A+ Server 4022G-6F, AMD Opteron 6320
two thread score (16 threads on 8 cores, counted Intel's way): 268 / 8 = 33.50
In terms of operations per GHz per second:
Intel two threads: 31.25/2.1 = 14.88
AMD two threads: 33.5/2.8 = 11.96
Normalizing we get:
Intel two threads: 1
AMD two threads: 0.80
That would mean
TPE_AMD = 2 * 4 * 2.8 * 0.8 / 2708 = 0.0066
TPE_INTEL = 2 * 6 * 2.1 * 1 / 2890 =  0.0087
TPE_AMD = 0.76 * TPE_INTEL
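The whole chain, from SPEC score to throughput per EUR, can be sketched in Python as follows. The Machine class and the tpe function are names made up for this illustration; the scores, core counts, clock speeds and prices are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    spec_score: float   # SPEC floating-point throughput base score
    cores: int          # total cores in the machine, counted Intel's way
    clock_ghz: float
    price_eur: float

    def per_core_per_ghz(self):
        """Score per core per GHz, the efficiency used in the text."""
        return self.spec_score / self.cores / self.clock_ghz


def tpe(machine, reference):
    """Throughput per EUR: (N * n) * F * S / price, with S relative to reference."""
    s = machine.per_core_per_ghz() / reference.per_core_per_ghz()
    return machine.cores * machine.clock_ghz * s / machine.price_eur


intel = Machine("dual Xeon E5-2620 v2", 375, 12, 2.1, 2890)
amd = Machine("dual Opteron 6320", 268, 8, 2.8, 2708)

ratio = tpe(amd, intel) / tpe(intel, intel)
print(f"TPE_AMD = {ratio:.2f} * TPE_INTEL")
```

Because S is itself the score divided by cores and clock, the ratio reduces to the SPEC score per EUR of one machine over the other. Working with unrounded efficiencies, the individual TPE_AMD value comes out marginally higher than the 0.0066 obtained above with S rounded to 0.8, but the ratio still rounds to 0.76.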

What this means is that Intel has dramatically increased its throughput per GHz, at least on the SPEC throughput benchmark, which runs 2 threads per core: 14.88 for the E5-2620 v2 against 8.54 for the E5620. Therefore, efficiency factors measured on older processors can hardly be used for TPE comparisons.

Note: Unfortunately, we couldn't find a way to compare single-thread scores with multi-thread scores for these CPUs, because spec.org runs different tests for speed (single thread) and throughput (multiple threads). In the floating-point speed test, AMD delivers just slightly less per GHz (96%) than Intel for these specific CPUs. A comparison of the two test types is available here.

Conclusion

Since multicore processors are standard nowadays and multiprocessor machines are becoming more and more affordable, comparing total throughput per EUR is more important than comparing maximum single-thread performance.

Virtualization is here to stay and therefore the parallel throughput of current processors is of paramount importance - if the system doesn't perform well enough one can buy another one or a larger one. Unless, of course, one runs a small number of non-parallel workloads where peak single thread throughput is the defining variable.

However, we have seen that both single thread processor performance and double-thread scalability are highly dependent on the workload. The difference between integer and floating point calculations became evident from a pair of very simple cpu tests.

The most important conclusion is that, in the face of the inherent complexity of the subject and the artificial complexity introduced by certain marketing teams (the core confusion...), it is very hard to base purchase decisions on third-party benchmarks.

For mission critical computing situations we should certainly test our specific workload on different processors and calculate the specific TPE (throughput per EUR) for the candidate systems.

References

Integer calculation script
Floating point calculation script and aux C loop (mathtest)
