Saturday, 30 November 2013

Fair share user scheduling on Linux terminal servers


Linux terminal server, LTSP, xrdp, kernel, scheduler, cgroups.


The default Linux scheduling configuration divides CPU time on a per-process basis. On a system that accepts multiple interactive users this is not desirable, since a user running many processes gets a larger CPU share than the others. However, it is possible to configure fair share user scheduling on Linux using cgroups together with some scripting. That distributes CPU time, as well as other resources, in a flexible user-based manner. NFS I/O escapes the control group mechanisms, but we will provide a method for handling such situations. The motivation for this article is the need to ensure that, on a correctly configured server, a single user is unable to cause system-wide performance degradation.


Introduction

Once upon a time there was a kernel option called FAIR_USER_SCHED that divided the CPU time slices on a per-user basis. Later on, for some time, that option was called just USER_SCHED. Eventually it was removed because its existence meant a lot of code to maintain. As a result, a Google search for "fair share user scheduling" now brings up old discussion threads and college homework assignments. But when this option was removed, a hot replacement called cgroups was ready to rock... well, sort of.

In system administration nothing is ever that simple. A kernel option that would automatically set the scheduler to work on a per user basis got replaced by a much more general process group approach that needs manual configuration and user space real time process classification to achieve the same effect. Less code to maintain in the kernel - lots of homemade scripts to be written worldwide and no documentation on how to do it properly. There is, however, a great thing in the new approach: it is not only the CPU that can be fairly shared. We can now control the scheduling of other resources such as I/O on block devices and network activity.

Resource sharing on terminal servers

Different kinds of systems can benefit from this, such as mass hosting web servers running Apache with Suexec, large university development machines that take thousands of SSH logins, and terminal servers. We will focus on terminal servers in this article.

The case for terminal servers is a rather trivial one in terms of system administration efficiency. But configuration and maintenance require a deep understanding of many aspects of OS and network operations. Part of that will be the subject of a future article. For now, we would like to stress that when talking about terminal servers we don't mean a single-application remote session running on a simplified desktop (or no desktop at all). Nor do we mean a virtual remote single-user PC. We are talking about a complete remote desktop server where N users work simultaneously with KDE, Firefox, LibreOffice, Gwenview, Gimp, Okular, Ark, rdesktop and others, as if they had a local machine with all that software installed.

From time to time Linus Torvalds tells the media that one of the remaining challenges for the Linux kernel is desktop workloads. So, the greatest possible challenge must be having N desktop users on the same machine sharing the same Linux kernel. We will now show you how to optimize the scheduling for fair sharing of CPU and block I/O. Everything described in this article was successfully tested on Ubuntu 12.04 LTS using XRDP as the remote desktop server.

Kernel control groups

Control groups, aka cgroups, are sets of processes to which the Linux kernel assigns specific resource shares. When you install cgroups on Ubuntu
apt-get install cgroup-bin libcgroup1 libpam-cgroup
you will get a default configuration where all the running processes belong to the same cgroup. That is a particular configuration where the kernel behaves just as if the cgroup feature was not enabled - since there is a single group it takes all the resources and shares them internally on a per-process basis. From that point on we can start building a custom configuration.

1. The first step is creating a cgroup for all users that will ever access the terminal server.

Essentially, this is done by listing the group members with
getent group REMOTEUSERS
and passing that list to the cgcreate command, which belongs to libcgroup.

There are some details related to the fact that the cgroups must be refreshed from time to time to keep the terminal server in sync with the list of users stored on the authentication server. Since there may be thousands of users and, in that case, we don't want to recreate the already existing cgroups, we will provide you a script that takes care of such details.
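
As a rough illustration of the idea (not the script mentioned above), the step can be sketched as follows. The function names and the CGROOT variable are hypothetical, and the sketch assumes that getent lists the members in the fourth colon-separated field and that cgroups are mounted under /sys/fs/cgroup as elsewhere in this article.

```shell
#!/bin/sh
# Sketch only: group_members, ensure_user_cgroups and CGROOT are
# hypothetical names, not part of libcgroup.

# Read one "getent group" line (name:passwd:gid:user1,user2,...) from
# stdin and print one member per line. An empty member list prints nothing.
group_members() {
    cut -d: -f4 | tr ',' '\n' | sed '/^$/d'
}

# Read user names from stdin and create the cpu and blkio cgroups only
# for users that do not have them yet.
ensure_user_cgroups() {
    cgroot=${CGROOT:-/sys/fs/cgroup}
    while read -r user; do
        [ -d "$cgroot/cpu/$user" ]   || cgcreate -g "cpu:/$user"
        [ -d "$cgroot/blkio/$user" ] || cgcreate -g "blkio:/$user"
    done
}

# Usage (as root):
#   getent group REMOTEUSERS | group_members | ensure_user_cgroups
```

Skipping users whose directories already exist keeps a periodic refresh cheap, even with thousands of accounts.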

In case you find this over-engineered, we refer you back to the Introduction, where we warned that the "much more general process group approach" needs "manual configuration...". "Generality comes at this price", the kernel team must have thought.

2. The second step is having PAM and the cgroup utilities classify the post login processes, for each user, into the right cgroup.

PAM session configuration:
session required        pam_script.so
session optional        pam_cgroup.so
session sufficient      pam_winbind.so
session optional        pam_ck_connector.so nox11

Rules in cgrules.conf:
@REMOTEUSERS    cpu,blkio    %U
root            cpu,blkio    system
xrdp            cpu,blkio    system
*               cpu,blkio    others

With this configuration users belonging to the group REMOTEUSERS will have an individual cgroup. On the other hand, users that are allowed to log in but don't belong to that group will compete for resources within a group called "others". Processes belonging to users root and xrdp will have a reserved share via a group called "system".

3. The final step is allocating the available resources to the created groups.

That is achieved by using the cgset command on the existing cgroups. In this case we give each user cgroup the same CPU and I/O share while leaving a fixed share for the "system" and "others" cgroups.

Something like this:
# USERLIST, USERBLKIO, SYSCPU and SYSBLKIO are expected to be set earlier in the script
for i in $USERLIST; do
  cgcreate -g "cpu:/$i"
  cgcreate -g "blkio:/$i"
  cgset -r blkio.weight="$USERBLKIO" "/$i"
done

# other users that have processes running
if [ ! -e /sys/fs/cgroup/cpu/others ]; then
  cgcreate -g cpu:/others
  cgcreate -g blkio:/others
  cgset -r blkio.weight="$USERBLKIO" /others
fi

# for anything that runs as root and anything else that cgrules.conf puts in this group
if [ ! -e /sys/fs/cgroup/cpu/system ]; then
  cgcreate -g cpu:/system
  cgcreate -g blkio:/system
  # reserved for root ssh actions - default shares value is 1024
  cgset -r cpu.shares="$SYSCPU" /system
  # default value is 500, max value is 1000
  cgset -r blkio.weight="$SYSBLKIO" /system
fi
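
To get a feeling for what a given cpu.shares value means, remember that shares are relative: under contention a cgroup receives its own shares divided by the sum of the shares of all busy sibling cgroups. The small helper below is a sketch of that arithmetic; cpu_share_percent is a hypothetical name and the values in the example are illustrative, not the ones used on our servers.

```shell
#!/bin/sh
# cpu_share_percent MINE ALL...
# Prints the integer percentage of CPU time that a cgroup with
# cpu.shares=MINE gets while every cgroup listed in ALL (MINE included)
# is busy at the same time.
cpu_share_percent() {
    mine=$1
    shift
    total=0
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo $(( 100 * mine / total ))
}

# Hypothetical example: three busy users at the default 1024 shares
# plus a "system" group at 2048 shares.
cpu_share_percent 2048 1024 1024 1024 2048   # system group: prints 40
cpu_share_percent 1024 1024 1024 1024 2048   # each user: prints 20
```

Idle cgroups do not count towards the sum, so the percentages only apply while all the listed groups are competing for the CPU.
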
Note: if you are seeing pointless cgroup messages on the syslog add
to /etc/cgred.conf.

Automating setup and maintenance

To automate the synchronization between the terminal server and the authentication directory it connects to (be it Active Directory, LDAP or Samba...) we suggest this script. It makes cgroup management a much easier task and can be edited to suit your needs.

Testing the setup

After the system is configured you should see one directory named after each user under /sys/fs/cgroup/cpu/ and /sys/fs/cgroup/blkio/. Inside each of those directories there is a file called "tasks" that lists the processes of the corresponding user. You can compare the content of that file with the output of "ps -u USERNAME" and check that they match. When another user logs in, that user's processes are automatically added to both /sys/fs/cgroup/cpu/ANOTHERUSER/tasks and /sys/fs/cgroup/blkio/ANOTHERUSER/tasks.
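
The comparison can be automated with a few lines of shell. The sketch below uses a hypothetical helper name, missing_from_tasks, and a hypothetical user "alice"; it prints the PIDs that ps reports for the user but that are absent from the tasks file, so an empty output means every process was classified.

```shell
#!/bin/sh
# missing_from_tasks TASKS_FILE PIDS_FILE
# Prints each PID from PIDS_FILE (one per line) that is not listed in
# TASKS_FILE. Prints nothing when every PID is present.
missing_from_tasks() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxv -f "$1" "$2" || true
}

# Usage (as root), with "alice" as a hypothetical user name:
#   ps -u alice -o pid= | tr -d ' ' > /tmp/alice.pids
#   missing_from_tasks /sys/fs/cgroup/cpu/alice/tasks /tmp/alice.pids
```

Short-lived processes may start or exit between the two reads, so an occasional stray PID is not necessarily a classification problem.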

You can query the properties of each cgroup with commands such as the following:
cgget system
cgget others
Once you are sure that group configuration and process classification are working you can run CPU and block I/O sharing tests.

Imagine that you have a single core VM. If a certain CPU intensive test process takes 1 minute to complete for user A while pushing the virtual CPU to 100%, two of the same processes running - one for user A and another for user B - shall take 2 minutes to complete. Each of the users will wait 2 minutes for the result. Now, what if user B decides to run 2 instances of the same process? The answer is easy: it would take a total of 3 minutes for all the processes to be ready. But here you will see the important difference: without our cgroup configuration both users would wait 3 minutes for all their processes to complete whereas with our configuration user A would wait 2 minutes and B would wait 3. That is, each user will get half of the CPU time regardless of the number of processes it runs. When user A's process is done, after 2 minutes, user B will get the whole CPU and finish the remaining work in a single minute. The evolution of CPU shares in time is summarized in the following table.

Without fair sharing
            1st minute   2nd minute   3rd minute
User A      1/3          1/3          1/3
User B      2/3          2/3          2/3

With fair sharing
            1st minute   2nd minute   3rd minute
User A      1/2          1/2          0 (finished)
User B      1/2          1/2          1

We should stress that for this behaviour to be verifiable the test must fully occupy the available CPUs. On a single core machine that is easy to do with a single test process. To perform this simple test on a multicore machine you need to run extra CPU intensive processes just to keep all but one core at 100%, thus forcing the test processes to compete - cgroups have no effect if there is no resource scarcity.
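
The waiting times of the single core example can be checked with a bit of shell arithmetic. The sketch below assumes exactly two users, identical jobs that each need one CPU minute, and a single fully loaded core; the function names are hypothetical.

```shell
#!/bin/sh
# per_process_wait MINE OTHER
# Default scheduling: all identical jobs get the same share, so they
# all finish together after MINE + OTHER minutes.
per_process_wait() {
    echo $(( $1 + $2 ))
}

# fair_wait MINE OTHER
# Fair sharing between two users: each user gets half of the core while
# both are busy, and the whole core once the other user is done.
fair_wait() {
    mine=$1
    other=$2
    if [ "$mine" -le "$other" ]; then
        echo $(( 2 * mine ))        # finishes while still sharing the core
    else
        echo $(( mine + other ))    # finishes last, once all work is done
    fi
}

# User A runs 1 job, user B runs 2 jobs:
per_process_wait 1 2   # A without fair sharing: prints 3
fair_wait 1 2          # A with fair sharing: prints 2
fair_wait 2 1          # B with fair sharing: prints 3
```

The total amount of work is the same in both cases; fair sharing only changes who waits for it.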

To test block I/O resource sharing the procedure is similar. We suggest that you saturate disk read access by using dd to read through a large file. Running concurrent dd processes must not allow the owner of those processes to read at a higher rate. After preparing a few large files with
dd if=/dev/zero of=/tmp/zerofile1 bs=1024 count=4096000
cp -f /tmp/zerofile1 /tmp/zerofile2
cp -f /tmp/zerofile1 /tmp/zerofile3
one simple test would be executing something like
echo 3 > /proc/sys/vm/drop_caches

cgexec -g blkio:ONEUSER dd if=/tmp/zerofile1 of=/dev/null &
cgexec -g blkio:ANOTHERUSER dd if=/tmp/zerofile2 of=/dev/null &
cgexec -g blkio:ANOTHERUSER dd if=/tmp/zerofile3 of=/dev/null &
Just as with the CPU test you would conclude that ANOTHERUSER would not see an advantage in running 2 concurrent processes.

CPU and block I/O can be monitored with the atop utility - use shift+d to sort by DSK activity and shift+c to sort by CPU usage. If you'd like to see CPU usage grouped per user you can use this script.
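
For a quick look without extra tooling, per-user CPU usage can also be summed from ps output. This is a minimal sketch with a hypothetical function name, sum_cpu_by_user, and it is not the script referred to above. Keep in mind that the pcpu column of ps is an average over each process lifetime, not an instantaneous reading.

```shell
#!/bin/sh
# Read "USER %CPU" pairs from stdin and print the total %CPU per user,
# highest total first.
sum_cpu_by_user() {
    LC_ALL=C awk '{ cpu[$1] += $2 }
        END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' |
        LC_ALL=C sort -k2,2nr -k1,1
}

# Usage:
#   ps -eo user:32,pcpu --no-headers | sum_cpu_by_user
```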

NFS - not fair shareable

If everything above is set up correctly you should have a system where, if pressure mounts, each user is bound to a certain amount of CPU and block I/O. Thus, the impact of misbehaviour or accidental misuse is limited.

Still, problems may arise if you are using NFS to store home directories. If you are doing that, probably due to having several terminal servers balanced across a large group of users, you should be aware that NFS usage is not fair shareable. This suggests a new funky acronym for NFS but that is purely accidental and we won't go further that way.

The problem is that NFS I/O does not count as block I/O - NFS is a network filesystem, not a block device. Furthermore, network transfers on NFS are done as root, so the network related cgroup mechanisms can't be used either.

To mitigate this problem we developed a simple I/O governor that permanently monitors suspect processes - which must be defined on a case-by-case basis - and calms them down if they are taking too much I/O. By calming them down, we mean sending them SIGSTOP and, after a while, SIGCONT. What the governor actually does to an I/O intensive process, after leaving it alone for a configurable grace period, is put it to sleep for a decreasing fraction of its running time until its I/O activity drops to a pre-configured limit.
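
The core of the mechanism can be illustrated with a much simplified sketch. It is not the governor itself: it has no grace period and uses a fixed pause instead of a decreasing one. The function names are hypothetical, and the sketch assumes that the rchar and wchar counters of /proc/PID/io are a good enough measure of a process's I/O, since they count the bytes passed through read and write system calls regardless of the filesystem underneath.

```shell
#!/bin/sh
# io_bytes FILE
# Sum the rchar and wchar counters of a /proc/PID/io style file.
io_bytes() {
    awk '$1 == "rchar:" || $1 == "wchar:" { total += $2 }
        END { printf "%.0f\n", total }' "$1" 2>/dev/null
}

# throttle_once PID LIMIT INTERVAL PAUSE
# Measure the I/O rate of PID over INTERVAL seconds. If it exceeds LIMIT
# bytes per second, stop the process for PAUSE seconds and resume it.
throttle_once() {
    pid=$1 limit=$2 interval=$3 pause=$4
    before=$(io_bytes "/proc/$pid/io") || return
    sleep "$interval"
    after=$(io_bytes "/proc/$pid/io") || return
    rate=$(( (after - before) / interval ))
    if [ "$rate" -gt "$limit" ]; then
        kill -STOP "$pid"
        sleep "$pause"
        kill -CONT "$pid"
    fi
}

# Usage (as root), with hypothetical limits of 20 MB/s, a 2 second
# sample and a 1 second pause:
#   while true; do
#       for pid in $(pgrep -x kio_file); do
#           throttle_once "$pid" 20000000 2 1
#       done
#   done
```

A production version should also make sure that a stopped process is always resumed, for example when the governor itself is interrupted.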

The usual suspects are processes that tend to saturate the I/O bandwidth to the physical storage by reading or writing at a very high rate, causing large delays to the short reads and writes of other applications. For example, we would not want a user copying a 1GB folder to a shared network drive to degrade the startup time of LibreOffice or Firefox for everyone else. But that could happen, since NFS is not fair shareable, and that is precisely what the I/O governor avoids. The original idea came from this CPU governor (which we don't need since cgroups are doing the CPU governing job) and the working logic is probably not very different from the Process Lasso Windows tool that we mentioned here.

We usually monitor kio_file processes which are responsible for handling file copy operations performed with Dolphin, on KDE. That, of course, needs to be adapted to the particular use of each terminal server. You can take a look at the governor here. Please note that it should be running at all times and depends on the helper script iolim.sh that you can find here.


Resource governing can be achieved on terminal servers by using a combination of the cgroups kernel feature and a couple of bash scripts. Performance loss situations, that could become stressful to the users, can be effectively avoided. Runaway processes that potentially consume too much CPU, block I/O or NFS I/O can be controlled automatically.

Here is how an 8 VCPU virtual machine, backed by 8 real CPU cores, handles 15 concurrent full desktop users. Note that we are delivering, on average, half a core and 400MB of RAM to each user with excellent performance. The number of users could still grow significantly without causing work delays.


