Monday, 27 January 2014

Core confusion

Keywords

cpuinfo, hyperthreading, intel, amd, cores, modules, numa

Introduction

Comparing processor features and performance is a complicated subject. Back in the day we had RISC vs CISC, then the growing market of turtle-slow PCs vs large scale machines from SUN, IBM... then PCs got acceptably fast, then they got more cores at a lower clock speed, then came Hyperthreading, then clock speeds went up again, then servers got even faster and, at the end of 2011, AMD launched a processor architecture called Bulldozer in which the computing units are something between an Intel core and an Intel thread - but which AMD also calls cores.

Needless to say, for the high level IT manager this is a confusing mess - a mess perhaps driven by marketing, one that must be looked at carefully, first from the procurement side and later from the operations side.

Core counting

Before purchasing a server we usually look at the available options for the number of CPUs (N), the CPU clock frequency (F) and the number of cores per CPU (n). When our applications are both CPU intensive and parallel in nature it makes sense to estimate the total available computing power as P=N*F*n. To the guaranteed computing power, calculated in this manner, we would add a non-guaranteed, workload dependent extra that comes from Hyperthreading, if the processor is an Intel one. The ability to run a second separate thread in the same core translates, in practice, into a performance boost of 10 to 30%, depending on the workload.
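
As a concrete illustration, take a dual-socket machine built from the Intel Xeon E5620 that we analyze below (N=2, F=2.4 GHz, n=4 real cores per CPU):

P = 2 * 2.4 * 4 = 19.2 "GHz-cores"

Adding the 10 to 30% Hyperthreading bonus, the workload dependent total would land somewhere between roughly 21 and 25 "GHz-cores".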

If the processors are recent Opterons from AMD we can't use the same formula. We must adapt it to look like this:
P=N*F*n*S
where S is a scaling factor that compensates for the fact that what AMD advertises as a core is not as independent as an Intel core. In its new terminology AMD talks about modules, inside each of which two cores are present. They could, instead, have counted each module as a core and invented some word, e.g. Astrothreading, that would make things easily comparable - and it seems that there is a word for that. But the result on sales might not be as good :-)

Given the described physical situation - two "cores" sharing resources inside a module - it is tempting to propose, as an educated guess, S=0.5. In fact, we have seen a similar value used in an HPC tender in order to allow for the comparison of recent Intel and AMD systems with a simple N*F*n formula. It was something like:

Intel E5-26XX     S = 1.00
Intel E5-26XXv2   S = 1.00
Intel E5-24XX     S = 1.00
Intel E5-46XX     S = 1.00
Intel E7-28XX     S = 0.71
AMD Opteron 62XX  S = 0.47
AMD Opteron 63XX  S = 0.47

That being so, for a crude estimate of the guaranteed performance we could go back to the initial formula
P=N*F*n
where n would now be the number of Intel cores, or the number of advertised AMD cores multiplied by 0.47 (roughly a division by 2).

We should keep in mind that the non-guaranteed extra performance will, in principle, be greater on AMD Opterons than what Intel's Hyperthreading has to offer. And we should also keep in mind that the S factor mentioned above was empirically calculated for a specific HPC scenario, while recognizing that it matches our educated guess. It means that, for that specific HPC scenario, AMD processors behave as if they had only half of the advertised cores.

We will see, later in this article, how well this calculation holds up in a simplistic benchmark we have performed - and even more closely so when the comparison takes into account the extra performance given by Intel's Hyperthreading and AMD's core "duplication".

Linux classification of computing units

We will now analyze two different systems, both exposing 16 virtual processors to the operating system. This is an Intel Xeon E5620 CPU, as seen from dmidecode:
 Version: Intel(R) Xeon(R) CPU    E5620  @ 2.40GHz
 Voltage: Unknown
 External Clock: 133 MHz
 Max Speed: 2400 MHz
 Current Speed: 2400 MHz
 Status: Populated, Enabled
 Upgrade: Other
 L1 Cache Handle: 0x0005
 L2 Cache Handle: 0x0006
 L3 Cache Handle: 0x0007
 Serial Number: To Be Filled By O.E.M.
 Asset Tag: To Be Filled By O.E.M.
 Part Number: To Be Filled By O.E.M.
 Core Count: 4
 Core Enabled: 4
 Thread Count: 8
And this is an AMD Opteron 6328:
 Version: AMD Opteron(tm) Processor 6328
 Voltage: 1.1 V
 External Clock: 200 MHz
 Max Speed: 3200 MHz
 Current Speed: 3200 MHz
 Status: Populated, Enabled
 Upgrade: Socket G34
 L1 Cache Handle: 0x0005
 L2 Cache Handle: 0x0006
 L3 Cache Handle: 0x0007
 Serial Number: To Be Filled By O.E.M.
 Asset Tag: To Be Filled By O.E.M.
 Part Number: To Be Filled By O.E.M.
 Core Count: 8
 Core Enabled: 8
 Thread Count: 8
From these examples it would seem that Linux has adopted AMD's terminology regarding core counts. However, dmidecode only displays what is stored in the computer's BIOS and made available through the DMI interface (man dmidecode). Let's see what is present in /proc/cpuinfo.

For the Intel processor we have
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping        : 2
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
whereas for the AMD Opteron we see
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Opteron(tm) Processor 6328
stepping        : 0
cpu MHz         : 1400.000
cache size      : 2048 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
From what is present in /proc/cpuinfo we see that Linux has NOT adopted AMD's terminology. In fact it treats AMD's core duplication the same way it treats Intel's Hyperthreading: the number of cores is exactly the same and the total number of core threads, called "siblings", is twice as large. That is, in both cases we have 2 threads per core and each of those appears to the operating system as a virtual CPU. We will see as many entries as the number of siblings in each processor times the number of processors. But not all entries are equal: each core with Hyperthreading (Intel) or each two-core module (AMD) produces two entries that share resources and, thus, don't behave as independent processors.

Note: the displayed cpu MHz values are not the maximum ones - the energy saving mechanism lowers the frequency when the system is idle and raises it on demand.

Distinguishing independent and coupled virtual CPUs

We have seen that whenever the number of siblings is twice the number of cores the system will contain pairs of virtual CPUs that depend on each other. The Linux kernel is aware of that and spreads the load over the available physical cores before allowing the second thread of any core to be put to use. But the user might want to manually bind certain long running CPU intensive processes to specific CPUs. How can the user know which virtual CPUs depend on each other?

We can group the relevant processor info by running something like
cat /proc/cpuinfo | egrep "processor|physical id|core id" | sed 's/^processor/\nprocessor/g'
For the Intel system we have
processor       : 0
physical id     : 0
core id         : 0


processor       : 1
physical id     : 0
core id         : 1

processor       : 2
physical id     : 0
core id         : 9

processor       : 3
physical id     : 0
core id         : 10

processor       : 4
physical id     : 1
core id         : 0

processor       : 5
physical id     : 1
core id         : 1

processor       : 6
physical id     : 1
core id         : 9

processor       : 7
physical id     : 1
core id         : 10

processor       : 8
physical id     : 0
core id         : 0


processor       : 9
physical id     : 0
core id         : 1

processor       : 10
physical id     : 0
core id         : 9

processor       : 11
physical id     : 0
core id         : 10

processor       : 12
physical id     : 1
core id         : 0

processor       : 13
physical id     : 1
core id         : 1

processor       : 14
physical id     : 1
core id         : 9

processor       : 15
physical id     : 1
core id         : 10
For the AMD system we have
processor       : 0
physical id     : 0
core id         : 0


processor       : 1
physical id     : 0
core id         : 1


processor       : 2
physical id     : 0
core id         : 2

processor       : 3
physical id     : 0
core id         : 3

processor       : 4
physical id     : 0
core id         : 0

processor       : 5
physical id     : 0
core id         : 1

processor       : 6
physical id     : 0
core id         : 2

processor       : 7
physical id     : 0
core id         : 3

processor       : 8
physical id     : 1
core id         : 0

processor       : 9
physical id     : 1
core id         : 1

processor       : 10
physical id     : 1
core id         : 2

processor       : 11
physical id     : 1
core id         : 3

processor       : 12
physical id     : 1
core id         : 0

processor       : 13
physical id     : 1
core id         : 1

processor       : 14
physical id     : 1
core id         : 2

processor       : 15
physical id     : 1
core id         : 3
In each of the previous listings, note the two threads belonging to a single physical core: processors 0 and 8 on the Intel system, and processors 0 and 1 on the AMD system. We see that, even with the same number of cores and threads per processor, the output differs between the two systems. For the Intel processor the two threads of the same core are the ones that share the same physical id and core id, whereas for the AMD processor the thread pairs have adjacent processor ids.

We can reach this conclusion empirically by developing a simple cpu performance test program, say called cputest.sh, and running two instances of it on an otherwise idle machine. With the taskset command we can bind each instance to a different virtual cpu and evaluate the performance. Whenever we detect a performance penalty due to the second instance we have hit a pair of core threads.
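
We have not reproduced cputest.sh here, but a minimal sketch of such a tester - assuming it simply times a fixed batch of floating point operations for a few rounds and echoes its argument back as the "proc id" (the batch size of 100 and the four rounds are placeholders chosen to match the outputs below) - could look like this:

#!/bin/bash
# cputest.sh - minimal sketch of a CPU tester; prints one timing line per round.
# $1 is only echoed back so that each output line identifies the run.
ID=$1
OPS=100   # placeholder batch size: 100 floating point "operations" per round
for round in 1 2 3 4; do
    START=$(date +%s.%N)
    for i in $(seq $OPS); do
        # one floating point "operation": compute pi with bc's math library
        echo "scale=10; 4*a(1)" | bc -l > /dev/null
    done
    END=$(date +%s.%N)
    T=$(echo "$END - $START" | bc)
    printf "proc id %s time %.2f ops per sec %d\n" "$ID" "$T" "$(echo "$OPS / $T" | bc)"
done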

Example for the Intel CPU - different physical cores
[user@srv1 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.10 ops per sec 24
proc id 0 time 4.10 ops per sec 24
proc id 0 time 4.11 ops per sec 24
proc id 0 time 4.09 ops per sec 24


[user@srv1 ~]$ taskset -c 7 ./cputest.sh 7
proc id 7 time 4.10 ops per sec 24
proc id 7 time 4.10 ops per sec 24
proc id 7 time 4.11 ops per sec 24
proc id 7 time 4.09 ops per sec 24
Example for the Intel CPU - same physical core
[user@srv1 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.09 ops per sec 24
proc id 0 time 6.06 ops per sec 16
proc id 0 time 7.05 ops per sec 14
proc id 0 time 7.02 ops per sec 14
[user@srv1 ~]$ taskset -c 8 ./cputest.sh 8
proc id 8 time 7.05 ops per sec 14
proc id 8 time 7.03 ops per sec 14
proc id 8 time 7.03 ops per sec 14
proc id 8 time 7.01 ops per sec 14
Example for the AMD CPU - different physical cores
[user@terminalserver01 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.00 ops per sec 25
proc id 0 time 4.01 ops per sec 24
proc id 0 time 4.02 ops per sec 24
proc id 0 time 4.02 ops per sec 24

  
[user@terminalserver01 ~]$ taskset -c 3 ./cputest.sh 3
proc id 3 time 4.00 ops per sec 25
proc id 3 time 4.01 ops per sec 24
proc id 3 time 4.00 ops per sec 25
proc id 3 time 4.00 ops per sec 25
Example for the AMD CPU - same physical core
[user@terminalserver01 ~]$ taskset -c 0 ./cputest.sh 0
proc id 0 time 4.01 ops per sec 24
proc id 0 time 4.00 ops per sec 25
proc id 0 time 4.79 ops per sec 20
proc id 0 time 5.00 ops per sec 20


[user@terminalserver01 ~]$ taskset -c 1 ./cputest.sh 1
proc id 1 time 5.02 ops per sec 19
proc id 1 time 5.00 ops per sec 20
proc id 1 time 5.01 ops per sec 19
proc id 1 time 5.00 ops per sec 20
In the previous examples we see, as expected, a per-program performance drop when two CPU intensive programs run on virtual CPUs that share common resources, i.e. on different threads of the same core. We tried to normalize the examples by translating AMD's new terminology into the usual Intel one - otherwise we would have to say, in the AMD case, that the performance penalty happens whenever we run two programs on cores belonging to the same module.

It is also interesting to compare the performance penalties or, put another way, the amount of extra computing power arising from Hyperthreading or AMD's core duplication. Even though we see a per-program performance penalty, the total number of operations increases when the two programs are running on the same core.

On the Intel system we saw the number of operations per second increase from 24 to 28 (14+14) as we added a second instance of cputest.sh on the same core. On the AMD processor that number increased from 24 to 40 (20+20). So, for this particular workload - a rather trivial series of floating point operations - Hyperthreading increases performance by about 16% whereas AMD's core duplication increases it by about 67%. On the other hand we see that Intel's 2.4 GHz processor performs the same single-thread 24 operations per second that AMD's 3.2 GHz processor is capable of. With two threads per core the AMD processor performs 67% more operations. These numbers have no absolute value in real world scenarios - for more rigorous benchmarking you can look at this article.

Still, it is at least instructive to work out some trivial numbers. If we were to compare systems made of these two processors, for the specific simple calculations we referred to, we could write
P=N*F*n*S
For the Intel system, considering the advertised 4 cores without making use of Hyperthreading, and using this system as the reference (S=1), we would have
P_Intel=N*2.4*4*1
and for the AMD system, considering the advertised 8 'cores' put to use, we would have
P_AMD=P_Intel*(40/24)
Since we also have
P_AMD=N*3.2*8*S
we conclude that S=0.625.

Of course this comparison is not fair, since we are treating AMD's "cores" as if they were real cores and ignoring the extra performance that would come from Intel's Hyperthreading. Fairer comparisons can and should also be performed.

Let us consider single threaded performance first. Since in that case the number of operations per second is the same for both tested processors we have
P_AMD=P_Intel
P_AMD=N*3.2*4*S
and therefore S = 2.4/3.2 = 0.75. In this case S accounts only for the single thread efficiency of each processor - Intel computes more per available MHz.

Now, if we prefer to consider the more efficient, and more realistic, scenario where we run two threads on each core (Intel's terminology), we should write
P_AMD=P_Intel*(40/28)
P_AMD=N*3.2*A*S
from which we obtain S = 30/(7A). In this expression A is what we consider to be AMD's core count. If we use AMD's marketing core count (A=8) we find S = 0.54. Otherwise, i.e. using A=4, we have S ≈ 1.

Thus, if we count the cores the same way Intel and the Linux kernel do, we see that the multi-threaded throughput per MHz of the AMD Opteron processor is roughly equivalent to the throughput per MHz of the Intel processor, at least for this particular workload.

Counting cores in that way enables us to directly use the formula
P=N*F*n
Surprisingly, the net effect between Intel's better single threaded throughput per MHz and AMD's higher performance bonus from using more than one thread per core (Intel's terminology) is such that, for this simple workload, a counting correction is enough to enable fair comparisons.

Therefore, on fully busy systems the number of correctly counted cores and their working frequency are enough to define the expected performance.

What we have found using a very simple benchmark is consistent with our initial "educated guess" and with the values, mentioned earlier, that we saw in a Linux HPC system tender.

Another important conclusion from this analysis is that /proc/cpuinfo is no longer a consistent source of information. What is present there depends on the specific kernel CPU driver, and there were public discussions between Intel and AMD engineers about what should be exposed in cpuinfo given the new processor architectures. Apparently /proc/cpuinfo is seen as deprecated; its replacement is the information available at
/sys/devices/system/node/nodeX/cpuY/topology
where X is the NUMA node and Y is the virtual CPU id. For example, by looking at the contents of
/sys/devices/system/node/node0/cpu0/topology/thread_siblings_list
on the AMD system we would have seen that virtual processors 0 and 1 are in fact a pair of siblings.
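
The same information is exposed per virtual CPU under /sys/devices/system/cpu, so a quick loop prints the sibling map for the whole machine:

for c in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$c: $(cat $c/topology/thread_siblings_list)"
done

On the AMD system cpu0 and cpu1 would report the same sibling list, while on the Intel system cpu0 would be paired with cpu8.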

NUMA is the architecture that replaced SMP - and we expect it to be present in all new multi-processor machines - in which each NUMA "node" has faster access to a specific memory region. We won't go into details in this post besides mentioning that a summary of NUMA related information can be seen by running
numactl --hardware
The same command can be used to bind tasks to a NUMA node. This might be useful if the tasks are not only CPU intensive but also very intensive in terms of memory I/O.
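
For example, to keep a task on node 0, using only that node's cores and local memory, one could run

numactl --cpunodebind=0 --membind=0 ./heavy_task

where ./heavy_task stands for the actual workload.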

Conclusions

A short summary of what we learned / reviewed:
  • performance is not totally comparable by clock speed values
  • performance is not totally measurable by number of cores
  • the advertised number of cores is a very different thing for Intel processors and current AMD Opterons
To compare the guaranteed parallel CPU performance of different machines one needs to calculate N*F*n and run a single threaded test application on each one in order to derive a performance scale factor.

The first value estimates the theoretical parallel computing power, which then needs to be normalized across machines by a value proportional to the effective number of "operations" (whatever those are) per second per MHz that each machine is capable of.

The same thing can be done with two or more working instances of a test application in order to calculate the total throughput of a physical core, exploring its internal subdivisions.

If testing is not possible, a good starting point is to calculate N*F*n for the candidate systems, dividing n by 2 for each AMD system in the list.

From there, we can estimate the cost per "effective parallel MHz" and become conscious buyers. All conclusions will, of course, be very specific to the tested workloads.

While this article is focused on CPU performance, please note that I/O is just as important in many real world scenarios.

Saturday, 30 November 2013

Fair share user scheduling on Linux terminal servers

Keywords

Linux terminal server, LTSP, xrdp, kernel, scheduler, cgroups.

Summary

The default Linux scheduling configuration divides CPU time on a per-process basis. If a system accepts multiple interactive users that is not the desired situation, since a user running many processes will get a larger CPU share than the others. However, it is possible to configure fair share user scheduling on Linux using cgroups together with some scripting. That will distribute CPU time, as well as other resources, in a flexible, user based manner. NFS I/O escapes the control group mechanisms, but we will provide a method for handling such situations. The motivation for this article is the need to ensure that, on a correctly configured server, a single user is unable to cause system wide performance degradation.

Introduction

Once upon a time there was a kernel option called FAIR_USER_SCHED that divided the CPU time slices on a per user basis. For some time, later on, that option was called just USER_SCHED. Eventually it got removed because its existence meant a lot of code to maintain. As a result, each Google search about "fair share user scheduling" now brings up old discussion threads and college homework assignments. But as this option was removed a hot replacement called cgroups was ready to rock... well, sort of.

In system administration nothing is ever that simple. A kernel option that would automatically set the scheduler to work on a per user basis got replaced by a much more general process group approach that needs manual configuration and real time process classification from user space to achieve the same effect. Less code to maintain in the kernel - lots of homemade scripts to be written worldwide, and no documentation on how to do it properly. There is, however, a great thing about the new approach: it is not only the CPU that can be fairly shared. We can now control the scheduling of other resources such as I/O on block devices and network activity.

Resource sharing on terminal servers

There are different systems that can benefit from this, such as mass hosting web servers running Apache with suEXEC, large university development machines that take thousands of SSH logins, and terminal servers. We will focus on terminal servers in this article.

The case for terminal servers is a rather trivial one, in terms of system administration efficiency. But configuration and maintenance require a deep understanding of many aspects of OS and network operations. Part of that will be the subject of a future article. For now, we would like to stress that by terminal servers we don't mean the single application remote session running on a simplified desktop (or no desktop at all). Nor do we mean a virtual remote single-user PC. We are talking about a complete remote desktop server where N users deal simultaneously with KDE, Firefox, LibreOffice, Gwenview, Gimp, Okular, Ark, rdesktop and others, as if they had a local machine with all that software installed.


From time to time Linus Torvalds tells the media that one of the remaining challenges for the Linux kernel is desktop workloads. So, the greatest possible challenge must be having N desktop users on the same machine sharing the same Linux kernel. We will now show you how to optimize the scheduling for fair sharing of CPU and block I/O. Everything described in this article was successfully tested on Ubuntu 12.04 LTS using xrdp as the remote desktop server.

Kernel control groups

Control groups, aka cgroups, are sets of processes to which the Linux kernel assigns specific resource shares. When you install cgroups on Ubuntu
apt-get install cgroup-bin libcgroup1 libpam-cgroup
you will get a default configuration where all running processes belong to the same cgroup. That is a particular configuration in which the kernel behaves just as if the cgroup feature were not enabled - since there is a single group, it takes all the resources and shares them internally on a per-process basis. From that point on we can start building a custom configuration.


1. The first step is creating a cgroup for each user that will ever access the terminal server.

Essentially, this is done using
getent group REMOTEUSERS
and pushing the resulting list to the cgcreate command, which belongs to libcgroup.

There are some details related to the fact that the cgroups must be refreshed from time to time, to keep the terminal server in sync with the list of users stored on the authentication server. Since there may be thousands of users and, in that case, we don't want to recreate the already existing cgroups, we will provide you with a script that takes care of such details.
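
Stripped of those corner cases, the essence of the refresh logic is something like this sketch (REMOTEUSERS is the group that holds all the terminal server users):

# enumerate the group members and create only the missing per-user cgroups
USERLIST=$(getent group REMOTEUSERS | cut -d: -f4 | tr ',' ' ')
for u in $USERLIST; do
    if [ ! -e /sys/fs/cgroup/cpu/$u ]; then
        cgcreate -g cpu:/$u
        cgcreate -g blkio:/$u
    fi
done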

In case you find this over-engineered, we refer you back to the Introduction, where we warned that the "much more general group approach" needed "manual configuration...". "Generality comes at this price", the kernel team must have thought.

2. The second step is having PAM and the cgroup utilities classify the post-login processes of each user into the right cgroup.

/etc/pam.d/common-session
[...]
session required        pam_script.so
session optional        pam_cgroup.so
session sufficient      pam_winbind.so
session optional        pam_ck_connector.so nox11
[...]
/etc/cgrules.conf
@REMOTEUSERS   cpu,blkio   %U
root           cpu,blkio   system
xrdp           cpu,blkio   system
*              cpu,blkio   others
With this configuration, users belonging to the group REMOTEUSERS will each have an individual cgroup. On the other hand, users that are allowed to log in but don't belong to that group will compete for resources within a group called "others". Processes belonging to the users root and xrdp will have a reserved share via a group called "system".

3. The final step is allocating the available resources to the created groups.

That is achieved by using the cgset command on the existing cgroups. What we do in this case is give each user cgroup the same CPU and I/O share while leaving a fixed share for the "system" and "others" cgroups.

Something like this:
for i in $USERLIST; do
  cgcreate -g cpu:/$i
  cgcreate -g blkio:/$i
  cgset -r blkio.weight=$USERBLKIO /$i
done

# other users that have processes running
if [ ! -e /sys/fs/cgroup/cpu/others ]; then
  cgcreate -g cpu:/others
  cgcreate -g blkio:/others
  cgset -r blkio.weight=$USERBLKIO /others
fi

# for anything that runs as root and anything else that cgrules.conf puts in this group
if [ ! -e /sys/fs/cgroup/cpu/system ]; then
  cgcreate -g cpu:/system
  cgcreate -g blkio:/system
  # reserved for root ssh actions - default shares value is 1024
  cgset -r cpu.shares=$SYSCPU /system
  # default value is 500, max value is 1000
  cgset -r  blkio.weight=$SYSBLKIO /system
fi
Note: if you are seeing pointless cgroup messages in the syslog, add
LOG="--no-log"
to /etc/cgred.conf.

Automating setup and maintenance

To automate the synchronization between the terminal server and the authentication directory it connects to (be it Active Directory, LDAP or Samba...) we suggest this script. It makes cgroup management a much easier task and can be edited to suit your needs.

Testing the setup

After the system is configured you should see one directory named after each user under /sys/fs/cgroup/cpu and /sys/fs/cgroup/blkio/. Inside each of those directories there will be a file called "tasks" that lists the processes of the corresponding user. You can compare the content of that file with the output of "ps -u USERNAME" and check that it matches. You can also see that, as another user logs in, their processes are automatically inserted into both /sys/fs/cgroup/cpu/ANOTHERUSER/tasks and /sys/fs/cgroup/blkio/ANOTHERUSER/tasks.

You can query the properties of each cgroup with commands such as the following:
cgget USERNAME
cgget system
cgget others
Once you are sure that group configuration and process classification are working you can run CPU and block I/O sharing tests.

Imagine that you have a single core VM. If a certain CPU intensive test process takes 1 minute to complete for user A while pushing the virtual CPU to 100%, two of the same processes running - one for user A and another for user B - shall take 2 minutes to complete. Each of the users will wait 2 minutes for the result. Now, what if user B decides to run 2 instances of the same process? The answer is easy: it will take a total of 3 minutes for all the processes to finish. But here you will see the important difference: without our cgroup configuration both users would wait 3 minutes for all their processes to complete, whereas with our configuration user A would wait 2 minutes and B would wait 3. That is, each user gets half of the CPU time regardless of the number of processes they run. When user A's process is done, after 2 minutes, user B gets the whole CPU and finishes the remaining work in a single minute. The evolution of the CPU shares over time is summarized in the following tables.

Without fair sharing   1st minute   2nd minute   3rd minute
A                      1/3          1/3          1/3
B                      2/3          2/3          2/3

With fair sharing      1st minute   2nd minute   3rd minute
A                      1/2          1/2          0
B                      1/2          1/2          1
We should stress that for this behaviour to be observable the test must fully occupy the available CPUs. On a single core machine that is easy to do with a single test process. To perform this simple test on a multicore machine you need to run extra CPU intensive processes just to keep all but one core at 100%, thus enabling competition between the test processes - cgroups have no effect if there is no resource scarcity.
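
On an otherwise idle single core VM the whole experiment can be reproduced with something along these lines, where md5sum /dev/zero is just a convenient never-ending CPU burner and USERA/USERB stand for real cgroup names:

cgexec -g cpu:USERA md5sum /dev/zero &
cgexec -g cpu:USERB md5sum /dev/zero &
cgexec -g cpu:USERB md5sum /dev/zero &
# watch the 50/50 split between the two users in top, then: kill %1 %2 %3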

To test block I/O resource sharing the procedure is similar. We suggest that you saturate disk read access by using dd to read through large files. Running concurrent dd processes must not allow their owner to read at a higher aggregate rate. After preparing a few large files with
dd if=/dev/zero of=/tmp/zerofile1 bs=1024 count=4096000
cp -f /tmp/zerofile1 /tmp/zerofile2
cp -f /tmp/zerofile1 /tmp/zerofile3
one simple test would be executing something like
clear
sync
echo 3 > /proc/sys/vm/drop_caches

cgexec -g blkio:ONEUSER dd if=/tmp/zerofile1 of=/dev/null &
cgexec -g blkio:ANOTHERUSER dd if=/tmp/zerofile2 of=/dev/null &
cgexec -g blkio:ANOTHERUSER dd if=/tmp/zerofile3 of=/dev/null &
Just as with the CPU test you would conclude that ANOTHERUSER would not see an advantage in running 2 concurrent processes.

CPU and block I/O can be monitored with the atop utility - use shift+d to sort by DSK activity and shift+c to sort by CPU usage. If you'd like to see CPU usage grouped per user you can use this script.

NFS - not fair shareable

If everything above is correctly set up you should have a system where, if pressure mounts, each user becomes bound to a certain amount of CPU and block I/O. Thus, the impact of misbehaviour or accidental misuse is limited.

Still, problems may arise if you are using NFS to store home directories. If you are doing that, probably because you have several terminal servers balanced across a large group of users, you should be aware that NFS usage is not fair shareable. This suggests a new funky acronym for NFS, but that is purely accidental and we won't go further that way.

The problem is that NFS I/O does not count as block I/O - NFS is a network filesystem, not a block device. Furthermore, network transfers on NFS are done as root, so the network related cgroup mechanisms can't be used either.

To mitigate this problem we developed a simple I/O governor that permanently monitors suspect processes - which must be defined on a case-by-case basis - and calms them down if they are doing too much I/O. By calming them down we mean sending them SIGSTOP and, after a while, SIGCONT. What the governor actually does to an I/O intensive process, after leaving it alone for a configurable grace period, is put it to sleep for a decreasing fraction of its running time until its I/O activity reaches a specific pre-configured limit.
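
The essence of the stop/continue idea fits in a few lines of bash. In this sketch PID and LIMIT (bytes per second) are given on the command line; the real governor, linked below, adds process discovery, the grace period and the decreasing sleep fraction:

#!/bin/bash
# Throttle one process by pausing it whenever its I/O rate exceeds LIMIT.
PID=$1
LIMIT=$2
# total bytes read and written so far, from the per-process I/O counters
iototal() { awk -F': ' '/^read_bytes|^write_bytes/ {s+=$2} END {print s}' /proc/$PID/io; }
prev=$(iototal)
while kill -0 $PID 2>/dev/null; do
    sleep 0.5
    cur=$(iototal)
    rate=$(( (cur - prev) * 2 ))   # approximate bytes per second over the window
    prev=$cur
    if [ "$rate" -gt "$LIMIT" ]; then
        kill -STOP $PID            # pause the process...
        sleep 0.5
        kill -CONT $PID            # ...and let it run again
    fi
done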

The usual suspects are processes that tend to saturate the I/O bandwidth to the physical storage by reading or writing at a very high rate, causing large delays to the short reads and writes of other applications. For example, we wouldn't want a user copying a 1 GB folder to a shared network drive to degrade the startup time of LibreOffice or Firefox for all the others. But that could happen, since NFS is not fair shareable, and that is precisely what the I/O governor avoids. The original idea came from this CPU governor (which we don't need, since cgroups are doing the CPU governing job) and the working logic is probably not very different from that of the Process Lasso Windows tool that we mentioned here.

We usually monitor kio_file processes, which are responsible for handling file copy operations performed with Dolphin on KDE. That, of course, needs to be adapted to the particular use of each terminal server. You can take a look at the governor here. Please note that it should be running at all times and that it depends on the helper script iolim.sh, which you can find here.

Conclusion

Resource governing can be achieved on terminal servers by using a combination of the cgroups kernel feature and a couple of bash scripts. Performance loss situations, which could become stressful to the users, can be effectively avoided. Runaway processes that consume too much CPU, block I/O or NFS I/O can be controlled automatically.

References

http://www.janoszen.com/2013/02/06/limiting-linux-processes-cgroups-explained
http://utcc.utoronto.ca/~cks/space/blog/linux/CGroupsPerUser
https://www.kernel.org/doc/Documentation/cgroups

Saturday, 23 November 2013

Ali há link, ali há lata (roughly: "there's a link there, there's some nerve there")

One might perhaps think that the tragicomedy around the discussions of digital content and piracy in Portugal was already exhausted. We now see that would be a naive idea, since it is perfectly possible for a subject to be exhausted in substance while not being exhausted, at all, in irritance.

That the word "irritance" does not exist matters little to us. In fact, it may well end up becoming a noun precisely applicable to the gift of those who do nothing substantial to solve a problem of their own, yet recurrently complain about it.

Against that gift of irritance stands the pertinence of the blogger Jonasnuts who, perceptively, shows us that the usual crowd once again rushed to conclusions and that MP Galamba is innocent, at least of this crime and until proven otherwise.

We take this opportunity to ask whether "there is a link", not for the Porto-Sporting match but for some of the works that vaguely relate to this never-ending soap opera. Is there a link for "The Usual Suspects"? Is there a link for "2001: A Space Odyssey"? (the monolith scene is especially relevant). Is there a link for "The Empire Strikes Back"?

In this case, to avoid controversy, we will state explicitly that we are only interested in links where one can make a payment and, in return, obtain the right to watch the work or to download it.

Having made this brief comment we will remain silent on the subject until "there are links" or someone wakes us up with an overdose of wit. Galambagate, really? :-) It won't be easy to top that.

Tuesday, 20 August 2013

Silly season: the attributes of a Portuguese coffee

Coffee is a complex science.

And no, this is not yet another Java joke. Coffee in Portugal is appreciated for the subtlety of its configurations and for the lack of subtlety of some waiters.

Coffee in Portugal is perfectly mastered by an experienced waiter but potentially confusing for a caffeine-deprived programmer.

Here is a rudimentary implementation in Python.
 
#!/usr/bin/python
# coding=utf-8

class Length:
        short, normal, long = range (3)

class Sweetner:
        sugar, sweetner = range (2)

class Mixer:
        spoon, cinnamon_stick = range (2)

class Type:
        normal, decaf = range (2)

class Coffee:
        "representation of a portuguese coffee"

        # the members are "semi private" - not meant to be accessed directly    
        def __init__(self, ctype = Type.normal , length = Length.normal , milk_spot = False , cup_preheating = False , sweetening = Sweetner.sugar ,  mixer_type = Mixer.spoon ):
                self.__type = ctype
                self.__length = length
                self.__milk_spot = milk_spot
                self.__cup_preheating = cup_preheating
                self.__sweetening = sweetening
                self.__mixer_type = mixer_type
                # homework: validate input and raise ValueError if it's wrong :-)

        # this function translates the internal representation of Coffee attributes into an order that can be sent to the waiter
        def __str__(self):
                result = "Um %s %s %s %s %s %s" % ( self.get_type_pt(), self.get_length_pt(), self.get_milk_spot_pt(), self.get_cup_preheating_pt(), self.get_sweetening_pt(), self.get_mixer_type() )
                # clear redundant whitespace on return
                return ' '.join(result.split())+'.'

        def get_type_pt(self):
                if self.__type == Type.normal:
                        return "café"
                return "descafeinado"

        def get_length_pt(self):
                if self.__length == Length.short:
                        return "curto"
                if self.__length == Length.normal:
                        return ""
                if self.__length == Length.long:
                        return "cheio"
                return "erro"

        def get_milk_spot_pt(self):
                if self.__milk_spot == True:
                        return "pingado"
                return ""

        def get_cup_preheating_pt(self):
                if self.__cup_preheating == True:
                        return "em chávena escaldada"
                return ""

        def get_sweetening_pt(self):
                if self.__sweetening == Sweetner.sweetner:
                        return "com adoçante"
                return ""

        def get_mixer_type(self):
                if self.__mixer_type == Mixer.cinnamon_stick:
                        return "e pauzinho de canela"
                return ""

Enough said. That settled, it is time to place the orders. Try these.

# the default café - for machos
cafe = Coffee()
print cafe

# ultrademanding customer
cafe = Coffee(Type.normal, Length.long, True, True, Sweetner.sweetner, Mixer.cinnamon_stick)
print cafe

# no comments
cafe = Coffee(Type.decaf, Length.normal, False, False, Sweetner.sweetner, Mixer.spoon)
print cafe

 

Thursday, 18 April 2013

Random youtube goodness

We just found out that it is possible to disable the greatest YouTube annoyance of all time: autoplay. Yes, as with many other browsing annoyances... there is a script for that.

YousableTube is a script for the Greasemonkey extension that can disable autoplay (while keeping autobuffering!), allow easy downloading in several formats, and much more. It's just a click away after installing Greasemonkey. We tested it using YouTube's HTML5 mode and it worked great.

And as if that weren't enough, it seems that many people use YouTube to play music at parties. That must feel very clumsy :-) even now that we know how to disable the dreadful autoplay feature. But hey, there's a web app for that!

Turnetubelist is a crossfader application that can search YouTube and manage a pair of playlists. I wouldn't call it a DJ application :-) but it can certainly improve ad-hoc parties.

Wednesday, 10 April 2013

A belated April Fools'

There wasn't even time to finish the analysis of the national digital content "market", now with the inclusion of the Optimus Clix online video club, because Zon has already come forward with a strongly innovative and competitive offer. In Portugal, innovation happens at a dizzying speed, with the telecom companies leading the progress.

And it really is true. It seems that Zon set up an online video club, on Facebook (how modern!), and inaugurated it with a catalogue of two (2) films. Or at least that is what several media outlets have been assuring.

A quick visit to the video club's page leads us to a never before seen abundance of films available "no Videoclube da tua TV" ("on your TV's video club").

"Ah... so it wasn't for renting over the Internet after all, boss?!"

Well, that part is quite simple. You just have to trawl through the dozens of posts that are there until you find one of the two (2) that can supposedly be rented.

And then you just click a "bit.ly"-style URL - while logged in to your Facebook account, of course - without fear of what may happen next. After all, this is trustworthy... it's a Zon club...

But is it really?

In reality, there is no reference whatsoever to Zon on that page. And in the profile description one reads

The TV video club service is available from the various TV operators - ZON, MEO, Cabovisão, Optimus and Vodafone [...] For more information about the terms of this service and the applicable prices, contact your TV operator.
Can anyone guarantee who manages this service?

But Zon certainly has something to do with it. Or at least it will in the future. To confirm that, one just has to read the official Press Release, dated April 2103.

In short, to access this futuristic service from 2103, which lets you rent two (2) films over the Internet, you just have to log in with your Facebook account (if you don't have one, please get one), go through dozens of posts one by one, pick one of the two films - or else the other one, click a "bit.ly" link, grant access to a Facebook application, pray that your privacy suffers no intrusions and, if all goes well, pay and watch a film (or else the other one).

The revolution is ON. And with online services of this calibre, piracy will soon be OFF :-)


Sunday, 7 April 2013

Xorg, vesa, modelines, intel and 915resolution

TLDR


This post is perhaps a bit too technical and probably too long, but it is important that certain things get indexed by Google, for the Linux community to work better. That has been helpful to us many times in the past, so we are glad to do our share.

The Intel Cedarview Xorg driver seems to be quite slow. It is outperformed by the Xorg Vesa driver, at least for basic operations. You can use any graphics mode (including wide modes such as 1920x1080) with the Vesa driver thanks to a small program called 915resolution. For multiple devices connected to different monitors you can use Vesa + 915resolution and control the display resolution using DHCP.

Intel's Cedarview mess


It seems that the Intel Cedarview chipset had a tough start on Linux devices. The stupidity of outsourcing the GPU to PowerVR, without ensuring that working drivers would be ready in time, caused many complaints (example) from users around the world. That meant lots of hardware working at the wrong resolution or with sub-optimal performance. And it happened even though the very same Intel employs a highly trained Linux driver team. That team, unfortunately, can't do anything for rebranded 3rd-party hardware if no specifications are delivered.

At some point, Intel finally made official drivers available. The Xorg driver is called pvr and can be found here. But guess what the Release Notes say:

 * Poor 2D rendering performance through Xlib protocol. (#9 #125 #126)

Having tried the version that landed in the Ubuntu 12.04 official repositories (apt-get install cedarview) I must say the Release Notes are right. The driver seems unusable, as dragging and maximizing windows is painfully slow. While this would be embarrassing enough for Intel, I must add something to the situation. Their driver is outperformed by the Vesa driver and, with an up-to-date kernel such as the default kernel from Ubuntu 12.04.2, by the Modesetting driver. Yes, it's true. The late-coming specific driver from Intel is beaten hands down by two generic Linux drivers, one of which is several years old.

I must say we haven't tried OpenGL, XV or VAAPI with Intel's pvr driver. Even if those worked and performed well, it would still be pointless, since the user can't properly drag a window without waiting one second for each repaint.

So, as a company that is responsible for thousands of EUR in Intel related purchases every year, we would like to greet Intel with the famous words Linus Torvalds once gave NVIDIA, and kindly suggest that they use their own driver team to have proper drivers written in time.


The Vesa driver


The Vesa driver is a generic Xorg driver that uses one of the VESA graphics modes present in the graphics card's BIOS. This driver doesn't know how to program the low level details of the card and doesn't accelerate any operations. However, given a powerful enough CPU, the Vesa driver is more than acceptable for daily work.

But there is a problem: the official VESA modes are not suitable for current displays. None of the currently used wide modes is available in the cards' BIOSes and the driver will ignore any modelines added to xorg.conf.

To help with that, a utility called 915resolution that works with Intel's cards was created. An updated version that works with the Cedarview chipset can be found here.

To run any graphics mode using the Vesa driver, you just need to load it into the card BIOS before Xorg is started. Some examples:


[root@host ubuntupxe]#./915resolution -c Cedarview 3c 1920 1080 16 2576 1120

[root@host ubuntupxe]#./915resolution -c Cedarview  3c 1600 900 16 2112 934

[root@host ubuntupxe]#./915resolution -c Cedarview 3c 1600 900 16 1904 934

The first argument after the chipset override (-c Cedarview), 3c, is the code of the standard mode that will be overwritten. You can see a list of modes using the -l flag.

The remaining arguments are X, Y, DEPTH, HTOTAL and VTOTAL. Even though HTOTAL and VTOTAL are said to be optional, they are in fact necessary for the picture to have the right size and placement on screen. You can find values for HTOTAL and VTOTAL by generating a modeline with cvt, as in the following example

[root@host ubuntupxe]# cvt 1600 900 60

In this example, we requested a modeline for 1600x900 @ 60 Hz. The result is

# 1600x900 59.95 Hz (CVT 1.44M9) hsync: 55.99 kHz; pclk: 118.25 MHz
Modeline "1600x900_60.00"  118.25  1600 1696 1856 2112  900 903 908 934 -hsync +vsync

The values for HTOTAL and VTOTAL are always the 7th and 11th fields of the Modeline (2112 and 934 in the example above).
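
If you need to script this, the two values can be extracted directly from the cvt output:

cvt 1600 900 60 | awk '/Modeline/ {print $7, $11}'

which prints "2112 934" for the example above.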

A generic xorg.conf for multiple devices


Once you ensure each of your devices has the best mode for its display made available using 915resolution, you only need a generic xorg.conf. That file must contain "vesa" as the graphics driver and a list of wide modes. In the example below, xorg.conf contains a set of wide modes, all of which are ignored except the one loaded into the card's BIOS using 915resolution.

[...]
Section "Screen"
        Identifier "Screen0"
        Device     "Card0"
        Monitor    "Monitor0"
        DefaultDepth 24
        SubSection "Display"
                Modes      "1600x900" "1440x900" "1366x768" "1360x768"
                Viewport   0 0
                Depth     24
        EndSubSection
EndSection
If you must support multiple identical Intel devices connected to different monitors, you can even send the resolution, colour depth, HTOTAL and VTOTAL via a DHCP custom variable, to configure their resolutions centrally. Please note that although the mode written to the card's BIOS is "selected" automatically (i.e. the others are rejected) with the above xorg.conf example, not all combinations of colour depth and resolution are possible. It may happen that 1600x900 at 24 bpp is possible whereas the maximum depth for 1920x1080 is 16 bpp.
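
As a sketch of the DHCP idea: assuming a boot script has already fetched the custom option into a variable MODE, formatted as "X Y DEPTH HTOTAL VTOTAL" (the retrieval itself depends on your dhclient setup and is not shown), applying it before Xorg starts is a one-liner:

# MODE came from a site-specific DHCP option, e.g. "1600 900 16 2112 934"
read X Y DEPTH HT VT <<< "$MODE"
./915resolution -c Cedarview 3c $X $Y $DEPTH $HT $VT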

Acknowledgments

Having the Vesa driver working at any resolution is due to the work of Steve Tomljenovic, author of the original 915resolution, and of user vtaylor, who posted a modified version at slitaz.org.