Monday, January 27, 2014

Core confusion

Keywords

cpuinfo, hyperthreading, intel, amd, cores, modules, numa

Introduction

Comparing processor features and performance is a complicated subject. Back in the day we had RISC vs CISC, then the growing market of turtle-slow PCs vs large scale machines from SUN, IBM... then PCs got acceptably fast, then they got more cores at lower clock speeds, then came Hyperthreading, then clock speeds went up again, then servers got even faster, and at the end of 2011 AMD launched a processor architecture called Bulldozer whose computing units are something between an Intel core and an Intel thread but which AMD also calls cores.

Needless to say, for the high level IT manager this is a confusing mess - a mess perhaps driven by marketing - that must be carefully looked at, first from the procurement side and later from the operations side.

Core counting

Before purchasing a server we usually look at the available options for number of CPUs (N), CPU clock frequency (F) and number of cores per CPU (n). When our applications are both CPU intensive and parallel in nature it makes sense to estimate the total available computing power by calculating P=N*F*n. To the guaranteed computing power, calculated in this manner, we would add a non-guaranteed, workload dependent extra that comes from Hyperthreading, if the processor is Intel. The ability to run a second separate thread in the same core in fact translates into a performance boost of something between 10 and 30%, depending on the workload.

If the processors are recent Opterons from AMD we can't use the same formula. We must adapt it to look like this:
P=N*F*n*S
where S is a scaling factor that compensates for the fact that what AMD advertises as a core is not as independent as an Intel core. In their new terminology AMD talks about modules, inside which two cores are present. They could, instead, have counted each module as a core and invented some word, e.g. Astrothreading, that would make things easily comparable - and it seems that there is a word for that. But the effect on sales might not be as good :-)

Given the described physical situation - 2 "cores" sharing resources inside a module - it is tempting to propose, as an educated guess, S=0.5. In fact, we have seen a similar value used in an HPC tender to allow for the comparison of recent Intel and AMD systems with a simple N*F*n formula. It was something like:

Intel E5-26XX    S = 1.00
Intel E5-26XXv2  S = 1.00
Intel E5-24XX    S = 1.00
Intel E5-46XX    S = 1.00
Intel E7-28XX    S = 0.71
AMD Opteron 62XX S = 0.47
AMD Opteron 63XX S = 0.47

That being so, for a crude estimation of the guaranteed performance we could go back to the initial formula
P=N*F*n
where n would now be the number of Intel cores, or the number of advertised AMD cores multiplied by 0.47 (roughly a division by 2).
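
As a quick illustration, here is a back-of-the-envelope shell calculation for two hypothetical 2-socket configurations (the clock speeds and core counts are made up for the example):
N=2; F=2.4; n=8                               # hypothetical Intel box: 2 CPUs, 2.4 GHz, 8 cores each
echo "P_Intel = $(echo "$N*$F*$n" | bc)"      # prints 38.4
N=2; F=3.2; n=16; S=0.47                      # hypothetical AMD box: 16 advertised cores, scaled by 0.47
echo "P_AMD   = $(echo "$N*$F*$n*$S" | bc)"   # prints about 48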

We should keep in mind that the non-guaranteed extra performance will, in principle, be greater in AMD Opterons than what Intel's Hyperthreading has to offer. We should also keep in mind that the S factor mentioned above was calculated empirically for a specific HPC scenario, even if it matches our educated guess. It means that, for that specific HPC scenario, AMD processors seem to behave as if they had only half of the advertised cores.

We will see, later in this article, how well this calculation holds up on a simplistic benchmark we have performed - even more closely when the comparison takes into account the extra performance given by Intel's Hyperthreading and AMD's core 'duplication'.

Linux classification of computing units

We will now analyze two different systems, both exposing 16 virtual processors to the operating system. This is an example of an Intel Xeon E5620 CPU, as seen from dmidecode.
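A listing like the one below can typically be obtained with something like the following (root privileges are usually required; the listings in this post show only an excerpt of the full processor entry):
sudo dmidecode -t processor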
 Version: Intel(R) Xeon(R) CPU    E5620  @ 2.40GHz
 Voltage: Unknown
 External Clock: 133 MHz
 Max Speed: 2400 MHz
 Current Speed: 2400 MHz
 Status: Populated, Enabled
 Upgrade: Other
 L1 Cache Handle: 0x0005
 L2 Cache Handle: 0x0006
 L3 Cache Handle: 0x0007
 Serial Number: To Be Filled By O.E.M.
 Asset Tag: To Be Filled By O.E.M.
 Part Number: To Be Filled By O.E.M.
 Core Count: 4
 Core Enabled: 4
 Thread Count: 8
And this is an example of an AMD Opteron 6328:
 Version: AMD Opteron(tm) Processor 6328
 Voltage: 1.1 V
 External Clock: 200 MHz
 Max Speed: 3200 MHz
 Current Speed: 3200 MHz
 Status: Populated, Enabled
 Upgrade: Socket G34
 L1 Cache Handle: 0x0005
 L2 Cache Handle: 0x0006
 L3 Cache Handle: 0x0007
 Serial Number: To Be Filled By O.E.M.
 Asset Tag: To Be Filled By O.E.M.
 Part Number: To Be Filled By O.E.M.
 Core Count: 8
 Core Enabled: 8
 Thread Count: 8
From these examples it would seem that Linux has adopted AMD's terminology regarding core counts. However, dmidecode only displays what is stored in the computer's BIOS and made available through the DMI interface (man dmidecode). Let's see what is present in /proc/cpuinfo.

For the Intel processor we have
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping        : 2
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
whereas for the AMD Opteron we see
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6328
stepping        : 0
cpu MHz         : 1400.000
cache size      : 2048 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
From what is present in /proc/cpuinfo we see that Linux has NOT adopted AMD's terminology. In fact, it treats AMD's core duplication the same way it treats Intel's Hyperthreading: the number of cores is exactly the same and the total number of core threads, called "siblings", is twice as large. That is, in both cases we have 2 threads per core and each of those appears to the operating system as a virtual CPU. We will see as many /proc/cpuinfo entries as the number of siblings per processor times the number of processors. But not all entries are equal: each core with Hyperthreading (Intel) or "double core in a module" (AMD) represents two entries that share resources and, thus, do not behave as independent processors.
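
A quick way to see these numbers on a running system is to pull the relevant fields straight out of /proc/cpuinfo, for example:
grep -c "^processor" /proc/cpuinfo      # number of virtual CPUs exposed to the kernel
grep -m1 "siblings"  /proc/cpuinfo      # threads exposed per physical CPU
grep -m1 "cpu cores" /proc/cpuinfo      # cores reported per physical CPU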

Note: the displayed cpu MHz values are not the maximum ones - the energy saving mechanism lowers the frequency when the system is idle and increases it on demand.

Distinguishing independent and coupled virtual CPUs

We have seen that anytime the number of siblings is twice the number of cores the system will contain pairs of virtual CPUs that are dependent on each other. The Linux kernel is aware of that and spreads the load over the available physical cores before allowing the second thread of each core to be put to use. But the user might want to manually bind certain long running CPU intensive processes to specific CPUs. How can the user know which virtual CPUs depend on each other?

We can group the relevant processor info by running something like
 cat /proc/cpuinfo | egrep "processor|physical id|core id" | sed 's/^processor/\nprocessor/g'
 For the Intel system we have
processor       : 0
physical id     : 0
core id         : 0

processor       : 1
physical id     : 0
core id         : 1

processor       : 2
physical id     : 0
core id         : 9

processor       : 3
physical id     : 0
core id         : 10

processor       : 4
physical id     : 1
core id         : 0

processor       : 5
physical id     : 1
core id         : 1

processor       : 6
physical id     : 1
core id         : 9

processor       : 7
physical id     : 1
core id         : 10

processor       : 8
physical id     : 0
core id         : 0

processor       : 9
physical id     : 0
core id         : 1

processor       : 10
physical id     : 0
core id         : 9

processor       : 11
physical id     : 0
core id         : 10

processor       : 12
physical id     : 1
core id         : 0

processor       : 13
physical id     : 1
core id         : 1

processor       : 14
physical id     : 1
core id         : 9

processor       : 15
physical id     : 1
core id         : 10
For the AMD system we have
processor       : 0
physical id     : 0
core id         : 0

processor       : 1
physical id     : 0
core id         : 1

processor       : 2
physical id     : 0
core id         : 2

processor       : 3
physical id     : 0
core id         : 3

processor       : 4
physical id     : 0
core id         : 0

processor       : 5
physical id     : 0
core id         : 1

processor       : 6
physical id     : 0
core id         : 2

processor       : 7
physical id     : 0
core id         : 3

processor       : 8
physical id     : 1
core id         : 0

processor       : 9
physical id     : 1
core id         : 1

processor       : 10
physical id     : 1
core id         : 2

processor       : 11
physical id     : 1
core id         : 3

processor       : 12
physical id     : 1
core id         : 0

processor       : 13
physical id     : 1
core id         : 1

processor       : 14
physical id     : 1
core id         : 2

processor       : 15
physical id     : 1
core id         : 3
In each of the previous listings, one example of two threads belonging to a single physical core is the pair of processors 0 and 8 on the Intel system and the pair of processors 0 and 1 on the AMD system. We see that even with the same number of cores and threads per processor, the output of /proc/cpuinfo is different for these two systems. For the Intel processor the two threads of the same core are the ones that share the same physical id and core id, whereas for the AMD processor the thread pairs have adjacent processor ids.

We can reach this conclusion empirically by writing a simple CPU performance test program, say called cputest.sh, and running two instances of it on an otherwise idle machine. With the taskset command we can bind each instance to a different virtual CPU and evaluate the performance. Whenever we detect a performance penalty due to the second instance we have hit a pair of core threads.
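
The original cputest.sh is not reproduced in this post; the following is a minimal sketch of what such a script could look like (the bc-based floating point workload and the batch size of 100 are assumptions made for illustration - only the periodic "ops per sec" output matters):
#!/bin/bash
# cputest.sh - minimal sketch of a CPU-bound test loop (illustrative, not the original)
# usage: taskset -c <cpu> ./cputest.sh <label>
label=${1:-0}
batch=100                                   # hypothetical number of "operations" per measurement
while true; do
    start=$(date +%s.%N)
    for ((i = 0; i < batch; i++)); do
        echo "scale=500; 4*a(1)" | bc -l > /dev/null    # trivial floating point work (pi via arctan)
    done
    end=$(date +%s.%N)
    elapsed=$(echo "$end - $start" | bc)
    rate=$(echo "$batch / $elapsed" | bc)   # integer operations per second
    printf "proc id %s time %.2f ops per sec %s\n" "$label" "$elapsed" "$rate"
done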

Example for the Intel CPU - different physical cores
[user@srv1 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.10 ops per sec 24
proc id 0 time 4.10 ops per sec 24
proc id 0 time 4.11 ops per sec 24
proc id 0 time 4.09 ops per sec 24

[user@srv1 ~]$ taskset -c 7 ./cputest.sh 7
proc id 7 time 4.10 ops per sec 24
proc id 7 time 4.10 ops per sec 24
proc id 7 time 4.11 ops per sec 24
proc id 7 time 4.09 ops per sec 24
Example for the Intel CPU - same physical core
[user@srv1 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.09 ops per sec 24
proc id 0 time 6.06 ops per sec 16
proc id 0 time 7.05 ops per sec 14
proc id 0 time 7.02 ops per sec 14
[user@srv1 ~]$ taskset -c 8 ./cputest.sh 8
proc id 8 time 7.05 ops per sec 14
proc id 8 time 7.03 ops per sec 14
proc id 8 time 7.03 ops per sec 14
proc id 8 time 7.01 ops per sec 14
Example for the AMD CPU - different physical cores
[user@terminalserver01 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.00 ops per sec 25
proc id 0 time 4.01 ops per sec 24
proc id 0 time 4.02 ops per sec 24
proc id 0 time 4.02 ops per sec 24

[user@terminalserver01 ~]$ taskset -c 3 ./cputest.sh 3
proc id 3 time 4.00 ops per sec 25
proc id 3 time 4.01 ops per sec 24
proc id 3 time 4.00 ops per sec 25
proc id 3 time 4.00 ops per sec 25
Example for the AMD CPU - same physical core
[user@terminalserver01 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.01 ops per sec 24
proc id 0 time 4.00 ops per sec 25
proc id 0 time 4.79 ops per sec 20
proc id 0 time 5.00 ops per sec 20

[user@terminalserver01 ~]$ taskset -c 1 ./cputest.sh 1
proc id 1 time 5.02 ops per sec 19
proc id 1 time 5.00 ops per sec 20
proc id 1 time 5.01 ops per sec 19
proc id 1 time 5.00 ops per sec 20
In the previous examples we see, as expected, a per-program performance drop when two CPU intensive programs run on virtual CPUs that share common resources, i.e. programs running on different threads of the same core. We tried to normalize the examples by translating AMD's new terminology into the usual Intel case - otherwise we would have to say, in the AMD case, that the performance penalty happens whenever we run two programs on the cores belonging to the same module.

It is also interesting to compare the performance penalties or, put another way, the amount of extra computing power arising from Hyperthreading or AMD's core duplication. Even though we see a per-program performance penalty, the total number of operations increases when the two programs are running on the same core.

On the Intel system we saw the number of operations per second increase from 24 to 28 (14+14) as we added a second instance of cputest.sh on the same core. On the AMD processor that number increased from 24 to 40 (20+20). So, for this particular workload - which is a rather trivial series of floating point operations - Hyperthreading increases performance by 16% whereas AMD's core duplication increases performance by 67%. On the other hand we see that Intel's 2.4 GHz processor performs the same single-thread 24 operations per second that AMD's 3.2 GHz processor is capable of. With two threads per core the AMD processor performs 67% more operations. These numbers have no absolute value in real world scenarios - for more rigorous benchmarking you can look at this article.

Still, it is at least instructive to work out some trivial numbers. If we were to compare systems made of these two processors, for the specific simple calculations we referred to, we could write
P=N*F*n*S
For the Intel system, considering the advertised 4 cores without making use of Hyperthreading and using this system as the reference (S=1) we would have
P_Intel=N*2.4*4*1
and for the AMD system, considering the advertised 8 'cores' put to use, we would have
P_AMD=P_Intel*(40/24)
Since we also have
P_AMD=N*3.2*8*S
we conclude that S = (2.4*4*(40/24))/(3.2*8) = 0.625.

Of course this comparison is not fair since we are treating AMD's 'cores' as if they were real cores and ignoring the extra performance that would come from Intel's Hyperthreading. Fairer comparisons can and should also be performed.

Let us consider single threaded performance first. Since in that case the number of operations per second is the same for both tested processors we have
P_AMD=P_Intel
P_AMD=N*3.2*4*S
and therefore S = 2.4/3.2 = 0.75. In this case S accounts only for the single thread efficiency of each processor - Intel computes more per available MHz.

Now if we prefer to consider the more efficient, and more realistic, scenario where we run two threads on each core (Intel's terminology), we should write
P_AMD=P_Intel*(40/28)
P_AMD=N*3.2*A*S
from where we obtain S = 30/(7*A). In the previous expression A is what we consider to be AMD's core count. If we use AMD's marketing core count (A=8) we find S ≈ 0.54. Otherwise, i.e. using A=4, we have S ≈ 1.07.
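
As a sanity check of the arithmetic, the scale factors derived above can be recomputed in one go, for example with awk:
awk 'BEGIN {
    printf "all 8 AMD cores counted, no HT:  S = %.3f\n", (2.4*4*(40/24))/(3.2*8)
    printf "single thread comparison:        S = %.3f\n", 2.4/3.2
    printf "two threads per core, A=8:       S = %.3f\n", 30/(7*8)
    printf "two threads per core, A=4:       S = %.3f\n", 30/(7*4)
}'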

Thus, if we count the cores the same way Intel and the Linux kernel do, we see that the multi-threaded throughput per MHz of the AMD Opteron processor is roughly equivalent to the throughput per MHz of the Intel processor, at least for this particular workload.

Counting cores in that way enables us to directly use the formula
 P=N*F*n
Surprisingly, the net effect of Intel's better single threaded throughput per MHz and AMD's higher performance bonus from using more than one thread per core (Intel's terminology) is such that, for this simple workload, a counting correction is enough to enable fair comparisons.

Therefore, on fully busy systems the number of correctly counted cores and their working frequency are enough to define the expected performance.

What we have found using a very simple benchmark is consistent with our first "educated guess" and with the values we mentioned before, as seen in a Linux HPC system tender.

Another important conclusion we arrive at from this analysis is that /proc/cpuinfo is no longer a consistent source of information. What is present there depends on the specific kernel CPU driver, and there have been public discussions between Intel and AMD engineers about what should be exposed in cpuinfo in the face of the new processor architectures. Apparently /proc/cpuinfo is seen as deprecated, its replacement being the information available at
 /sys/devices/system/node/nodeX/cpuY/topology
where X is the NUMA node and Y the virtual CPU id. For example, by looking at the contents of
 /sys/devices/system/node/node0/cpu0/topology/thread_siblings_list
on the AMD system we would have seen that virtual processors 0 and 1 are in fact a pair of siblings.
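
A minimal check on a live system, using the canonical per-CPU sysfs path (which exposes the same information), could be:
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # on the AMD system above this should print 0-1 (or 0,1)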

NUMA is the architecture that replaced SMP - one we expect to be present in all new multi-processor machines - where each NUMA "node" has faster access to a specific memory region. We won't go into details in this post beyond mentioning that a summary of NUMA related information can be seen by running
numactl --hardware
The same tool can be used to bind tasks to a NUMA node. This might be useful if the tasks are not only CPU intensive but also very demanding in terms of memory traffic.
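
As a sketch, binding both the CPUs and the memory allocations of a job to NUMA node 0 could look like this (./myapp is a placeholder for the real workload):
numactl --cpunodebind=0 --membind=0 ./myapp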

Conclusions

A short summary of what we learned / reviewed:
  • performance is not totally comparable by clock speed values
  • performance is not totally measurable by number of cores
  • the advertised number of cores is a very different thing for Intel processors and current AMD Opterons
To compare the guaranteed parallel CPU performance of different machines one needs to calculate N*F*n and run a single threaded test application on each one in order to derive a performance scale factor.

The first value estimates the theoretical parallel computing power, which needs to be normalized across machines by a value proportional to the effective number of "operations" (whatever those are) per second per MHz that each machine is capable of.

The same thing can be done with two or more working instances of a test application in order to calculate the total throughput of a physical core, exploiting its internal subdivisions.

If testing is not possible, a good starting point is to calculate N*F*n for the candidate systems, dividing the advertised value of n by 2 for each AMD system in the list.

From there, we can estimate the cost per "effective parallel MHz" and become conscious buyers. All conclusions will, of course, be very specific to the tested workloads.

While this article is focused on CPU performance, please note that I/O is just as important in many real world scenarios.
