Monday, April 20, 2015

Docker test drive - the sane way

Introduction

Would you like to try Docker effortlessly on a standard CentOS or Ubuntu machine? The summary below will take you there. These instructions work on both physical and virtual machines and will get Docker containers running in a couple of minutes.

Installation

CentOS 6.x
Requirements: EPEL repo, kernel >= 2.6.32-431
yum install docker-io
service docker start
chkconfig docker on
Ubuntu 14.04
wget -qO- https://get.docker.com/ | sh
Test run
docker run hello-world

Ubuntu 14.04 minimal guest

Download the image, create a container that stays running, and run a shell in it:
docker pull ubuntu
MYHOSTNAME=ubuntu01
docker create --name=$MYHOSTNAME --hostname=$MYHOSTNAME ubuntu sleep infinity
docker start $MYHOSTNAME
docker exec -it $MYHOSTNAME bash
Create a non-root user and enable SSH in the container (the sed line works around a PAM issue with pam_loginuid that would otherwise block SSH logins inside a container):
apt-get update
apt-get install -y openssh-server
useradd -m -s /bin/bash myuser
passwd myuser
sed -ri 's/^session\s+required\s+pam_loginuid.so$/session optional pam_loginuid.so/' /etc/pam.d/sshd
service ssh start
exit
Stop the container in order to test the startup procedure:
docker stop $MYHOSTNAME
Test the container startup procedure and try to access it using SSH:
docker start $MYHOSTNAME
docker exec $MYHOSTNAME service ssh start
GUESTIP=`docker exec $MYHOSTNAME ip -4 -o addr list eth0 label eth0 | awk '{print $4}' | awk -F/ '{print $1}'`
ssh myuser@$GUESTIP
Done! We have a simple procedure to create as many Ubuntu 14.04 general purpose guests as needed using Docker technology.
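To avoid repeating the manual SSH setup for every guest, one option is to commit the configured container as a reusable local image and loop over it; a minimal sketch (the ubuntu-sshd image name and guest names are illustrative):
docker commit $MYHOSTNAME ubuntu-sshd
for NAME in ubuntu02 ubuntu03 ubuntu04; do
    docker create --name=$NAME --hostname=$NAME ubuntu-sshd sleep infinity
    docker start $NAME
    docker exec $NAME service ssh start
done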

Final notes

This article aims to be the Docker test drive I haven't found elsewhere. It provides a basic OpenSSH server installation and deals with the exit-after-start nonsense discussed here - the "sleep infinity" command ensures the container keeps running until it is explicitly stopped. For other purposes the default behaviour might be better, but this is what makes sense in a Docker test drive, especially for people with a virtualization background.

Saturday, April 18, 2015

Linux load average - the definitive summary

What is the Linux load average?

This is not exactly an orphan question but, like many other questions we have tried to address in this blog, it is surrounded by misconceptions and incorrect information. Every time one starts discussing load averages, either in person or online, confusion steps in... and refuses to leave. We will try to provide an explanation that is "as simple as possible, but not simpler", as Einstein once said, and also short enough to be worth reading.

Definition 1

We will call the instantaneous load of a system the number of tasks (processes and threads) that are willing to run at a given time t.

Tasks willing to run are either in state R or D: they are either running or runnable (waiting for a CPU), or blocked in uninterruptible sleep (typically waiting on I/O) while waiting for an opportunity to run. The instantaneous number of such tasks can be determined using the following command

ps -eL h -o state | egrep "R|D" | wc -l

(see footnote [1] for more info on this)

Definition 2

We will call the load average of a system a specific averaging function of the instantaneous load value and all the previous ones.

For historical reasons the Linux kernel adopted the recursive function

a(t,A) = a(t-1,A) * exp(-5/(60A)) + l(t) * (1 - exp(-5/(60A)))

where the parameter A takes the values 1, 5 and 15 (in minutes), 5 is the sampling interval in seconds, and l(t) is the instantaneous load. We call the above set of three functions, corresponding to the three values of A, the 1m, 5m and 15m load averages. In the limit A -> 0 the exponential factor vanishes and a(t,0) = l(t), recovering definition 1; in that sense, l(t) would be the 0m load average.

The load average values are calculated by the kernel every 5 seconds using a(t,A).
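For example, with A=1 we have exp(-5/60) ≈ 0.92, so if the previous 1m value was 2.00 and the current instantaneous load is 4, the new 1m value is 2.00 * 0.92 + 4 * (1 - 0.92) = 2.16.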

Discussion

First of all we should stress that the load average from definition 2 is just a generalization of definition 1.

While their values are similar in nature, the larger the value of A, the lower the contribution of the instantaneous load compared to the contribution of the historic load average value. The main purpose of using an "averaging" function is to smooth out fast oscillations that would render human inspection of load values nearly impossible. The timespan of that smoothing effect is controlled by the parameter A.

The load average can be calculated from a bash or Python script, using definitions 1 and 2, just as the Linux kernel does (see /proc/loadavg and https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/proc.c). Here is the example output of one such calculation, using ps and a(t,1) to estimate the 1m load average:

[Figure: kernel vs script 1m load calculation]
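For reference, here is a minimal sketch of such a script (assuming bc is available); it applies a(t,1) every 5 seconds and prints its estimate next to the kernel's 1m value:
#!/bin/bash
# Sketch: estimate the 1m load average using definitions 1 and 2,
# sampling the instantaneous load every 5 seconds, as the kernel does.
E=$(echo "e(-5/60)" | bc -l)    # decay factor exp(-5/(60*1))
a=0
while true; do
    l=$(ps -eL h -o state | egrep -c "R|D")    # definition 1
    a=$(echo "$a * $E + $l * (1 - $E)" | bc -l)
    printf "script: %.2f  kernel: %s\n" "$a" "$(cut -d' ' -f1 /proc/loadavg)"
    sleep 5
done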
 
Secondly, we need to argue that there is no such thing as a too high load average, in absolute terms. In fact, the number of tasks willing to run on a given system depends on:
  • the architecture of the software that is running (is it mostly monolithic? is it prone to spawn many processes? how interdependent are those processes?)
  • the CPU throughput requested by the software that is running
  • the I/O throughput requested by the software that is running
  • the CPU performance of that system
  • the I/O performance of that system
  • the number of available cores
Therefore, we can only say "the load average is too high on that system" if we know the normal value for that system. The "normal value" is an empirically discovered value under which that system usually runs and is known to perform acceptably. The normal value could well be 2 for a server with a low number of cores that runs an interactive web application, or 50 for a server that runs (non-interactive) numeric simulation jobs during the night.
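A trivial way to discover that normal value is to log the kernel's averages over a typical working period and inspect the result afterwards; a minimal sketch (the log path is illustrative):
# Append a timestamped load average sample once per minute
while true; do
    echo "$(date '+%F %T') $(cut -d' ' -f1-3 /proc/loadavg)" >> /var/tmp/loadavg.log
    sleep 60
done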

Furthermore...
  • for the same requested I/O effort and the same hardware, a software implementation that spreads the computation across many processes or threads will generate a higher load average; even though the actual throughput is the same, 10 processes trying to write 10MB each on an I/O-starved system generate a higher load average than one process trying to write 100MB on the same system
  • given software that sets all the existing CPU cores to 100% while running on a specific machine, its execution on a system with a smaller number of cores, or slower cores, will generate a higher load average; whether that higher load is a problem or not depends on the use case (if it means your numeric simulation or your file server backup takes 10 more minutes during the night but is still ready in the morning, then no harm is done)
To finish the article we should describe the important relationship between load average values and the CPU usage values that can be seen with utilities like top or iostat (%usr, %sys, %wait, %idle). As we have seen, load average values don't have an absolute numerical meaning, unlike CPU usage values, which are expressed in % of CPU time:
%usr
Time spent running non-kernel code. (user time, including nice time)

%sys
Time spent running kernel code. (system time)

%wait
Time spent waiting for IO. Note: %wait is not an indication of the amount of IO going on; it only indicates the extra %usr time the system would show if IO transfers weren't delaying code execution.

%idle
Time spent idle.
For systems running below their limits, CPU usage values are much more useful than load average values, since their numeric interpretation is universal. But once the limits are hit, i.e., CPU %idle time becomes nearly zero, load average values allow us to see how far beyond its limits the system is running... once we establish a baseline, which is the normal load average for that system (software+hardware combination).

We summarize the load average / CPU usage relationship with a short list of true statements:
  1. if all system cores are running at %sys+%usr=100, the instantaneous load is equal to or higher than the number of cores
  2. an instantaneous load higher than the number of cores doesn't mean all cores are running at %sys+%usr=100, since many processes may be waiting on I/O (state D)
  3. an instantaneous load higher than the number of cores implies that the system can't be mostly idle; at least some of the cores will be seen, for a relevant fraction of the time, in sys, usr or wait states
  4. a system can be slow / unresponsive even with an instantaneous load below the number of cores, because a small number of I/O intensive processes may become a bottleneck
  5. in a pure CPU-intensive scenario (negligible I/O, no processes in state D) where %idle > 0, the instantaneous load is equal to ((100 - %idle)/100) * NCORES; for example, on a 4-core system at a steady %sys+%usr=90 we would have an instantaneous load of ((100-10)/100)*4 = 3.6
Statement 5) can be easily tested by running

stress -c X
while watching the output of top on a different terminal and waiting for the 1m load average to stabilize. It is easy to see that the above formula holds until X=NCORES, at which point %idle reaches 0.
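For example, on an otherwise idle 4-core system we would expect:
stress -c 2    # %idle settles near 50, 1m load near ((100-50)/100)*4 = 2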

We haven't discussed Hyperthreading, or the equivalent AMD feature, to avoid complicating the discussion, but wherever we say NCORES above it could be read as the number of virtual CPUs, including CPU threads. Of course, each additional % of usage on the second thread of an already busy core doesn't yield a proportional throughput.

Footnotes

[1] - The same result should be obtainable by parsing /proc/loadavg (4th field) or /proc/stat (procs_running, procs_blocked), but we have seen from experience that multiple processes in state D are shown by ps yet not counted in /proc/loadavg, and that neither /proc/loadavg (4th field) nor /proc/stat includes threads in the task counters, even though threads are taken into account in the load average numbers exposed by the kernel.
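A quick way to observe the discrepancy on a system running threaded workloads:
ps -eL h -o state | egrep -c "R|D"    # definition 1, threads included
cut -d' ' -f4 /proc/loadavg           # kernel counter (the number before the slash)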

References

The load average bible in 3 volumes:

http://www.teamquest.com/import/pdfs/whitepaper/ldavg1.pdf
http://www.teamquest.com/import/pdfs/whitepaper/ldavg2.pdf
http://www.teamquest.com/files/3114/2049/9759/ldavg3.pdf

Sunday, February 8, 2015

Multi Wan OpenVPN - the setup

keywords: multi wan openvpn, double wan openvpn, double openvpn

We recently realized that the "Multi Wan OpenVPN" question was either an orphaned question or one with too low a signal-to-noise ratio. We intend to clear things up with this short post.

Facts:
  • some VPNs need multiple entry points with different public IPs for redundancy 
  • if more than one public IP belongs to the same VPN gateway, OpenVPN is a problem because it sends all reply packets through the system's default route instead of respecting the routing rules; this prevents the connection from being established except on the IP of the interface the default route is assigned to - here is the bug report
  • there is a pfSense document at the top of Google's results suggesting that the OpenVPN process should bind to localhost and that one should set up port forwardings to 127.0.0.1 from both WAN links - as far as we know this doesn't work, at least with Linux + iptables
  • the problem can be worked around easily by replicating the server configuration file and letting the init script start several instances of OpenVPN, each one bound to a single WAN interface
  • there is an even easier workaround, which is changing the protocol from UDP to TCP (but UDP is certainly the default protocol for a reason - changing to TCP should be a last-resort measure; some discussion here)
Below you will find an example of an OpenVPN bridge configuration on a multihomed CentOS based gateway with two public IPs. The parameters that differ between the two instances (dev and local) are the ones directly related to the Multi Wan configuration.

 Server configuration for the first WAN interface - /etc/openvpn/server.conf
plugin /usr/lib/openvpn/plugin/lib/openvpn-auth-pam.so passwd
client-cert-not-required
username-as-common-name
proto udp
dev tap0
local wan1.mycompany.com

server-bridge 192.168.5.254 255.255.255.0 192.168.5.230 192.168.5.253
ifconfig-pool-persist /var/tmp/ipp.txt
push "redirect-gateway bypass-dhcp bypass-dns"
push "dhcp-option DNS 192.168.5.254"
client-to-client
keepalive 10 120
persist-key
persist-tun
dh /etc/openvpn/keys/dh1024.pem
ca /etc/openvpn/keys/ca.crt
cert /etc/openvpn/keys/server.crt
key /etc/openvpn/keys/server.key
port 1194
user nobody
comp-lzo
verb 0

 Server configuration for the second WAN interface - /etc/openvpn/server2.conf
plugin /usr/lib/openvpn/plugin/lib/openvpn-auth-pam.so passwd
client-cert-not-required
username-as-common-name
proto udp
dev tap1
local wan2.mycompany.com

server-bridge 192.168.5.254 255.255.255.0 192.168.5.230 192.168.5.253
ifconfig-pool-persist /var/tmp/ipp.txt
push "redirect-gateway bypass-dhcp bypass-dns"
push "dhcp-option DNS 192.168.5.254"
client-to-client
keepalive 10 120
persist-key
persist-tun
dh /etc/openvpn/keys/dh1024.pem
ca /etc/openvpn/keys/ca.crt
cert /etc/openvpn/keys/server.crt
key /etc/openvpn/keys/server.key
port 1194
user nobody
comp-lzo
verb 0

Bridge setup script
#!/bin/bash

#################################
# Set up Ethernet bridge on Linux
# Requires: bridge-utils
#################################

# Define Bridge Interface
br="br0"

# Define list of TAP interfaces to be bridged, one per WAN interface
tap="tap0 tap1"

# Define physical ethernet interface to be bridged
# with TAP interface(s) above.
eth="eth0"
eth_ip="192.168.5.254"
eth_netmask="255.255.255.0"
eth_broadcast="192.168.5.255"

brctl addbr $br
brctl addif $br $eth

for t in $tap; do
    brctl addif $br $t
done

for t in $tap; do
    ifconfig $t 0.0.0.0 promisc up
done

ifconfig $eth 0.0.0.0 promisc up

ifconfig $br $eth_ip netmask $eth_netmask broadcast $eth_broadcast
To start up the OpenVPN instances we only need to run:
/usr/local/AS/bin/bridge-start.sh
/etc/init.d/openvpn start
One can check that the init script launched two independent instances as follows:
[root@gateway ~]# netstat --udp -nlp | grep openvpn
udp        0      0 WANIP1:1194        0.0.0.0:*        1806/openvpn
udp        0      0 WANIP2:1194        0.0.0.0:*        1795/openvpn
[root@gateway ~]#

For everything to work we need, of course, the Linux multihoming configuration to be correctly implemented first. We wrote about that some years ago.

On the client side, fault tolerance can be implemented simply by using multiple "remote" statements in the configuration file, as in the following example:
remote wan1.mycompany.com
remote wan2.mycompany.com
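For context, here is a minimal client configuration sketch matching the server settings above (file paths and names are illustrative):
client
dev tap
proto udp
# entry points are tried in order; add remote-random to randomize
remote wan1.mycompany.com 1194
remote wan2.mycompany.com 1194
resolv-retry infinite
auth-user-pass
ca ca.crt
comp-lzo
verb 3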
Alternatively, fault tolerance can be controlled by pointing a DNS entry, e.g. vpn.mycompany.com, to the WAN interface to be used. If that interface fails, a DNS update will enable a reconnection through a different one, given a small enough TTL on the DNS record.
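For instance, a zone entry with a 60 second TTL (the address is illustrative):
vpn.mycompany.com. 60 IN A 203.0.113.1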

Wednesday, October 8, 2014

Core confusion - round II

Introduction

In the previous article we explained how core counting should be done for comparisons between Intel and AMD processors to be possible, and we got some performance numbers from a simplistic (and non-realistic) "floating point" benchmark (more on this later). We performed tests on two different server machines we have available in production:

Intel Xeon E5620 2.40 GHz
AMD Opteron 6328 3.2 GHz

In terms of operations per GHz per second we can summarize the previously obtained results as follows:
Intel single thread: 24/2.4 = 10
Intel two threads: 28/2.4 = 11.67
AMD single thread: 24/3.2 = 7.5
AMD two threads: 40/3.2 = 12.5
Normalizing the results, taking Intel's single thread score as the reference, we get:
AMD single thread: 0.75
Intel single thread: 1
Intel two threads: 1.17
AMD two threads: 1.25
This normalization serves the purpose of evaluating both single thread performance and SMT scalability. We see that Intel's single thread performance is significantly better than AMD's. We also see that Intel's Hyperthreading enhances performance by 17%, whereas AMD's core subdivision allows for 67% more throughput compared to the single thread situation.

If we assume that we want most of our machines to be busy running a large number of threads, we might prefer to use Intel's two thread performance as a reference. In that case, we obtain
AMD single thread: 0.64
Intel single thread: 0.86
Intel two threads: 1
AMD two threads: 1.07
These scores can be regarded as relative efficiencies of each processor. From here we can derive an expression for the optimal per GHz throughput of a fully busy machine, i.e., a machine that is executing at least 2 threads per core:
Per GHz perf level = N * n * S

N - number of processors
n - number of cores per processor
S - processor score running two threads
If we need to calculate the total performance of a system, we can just multiply the previous formula by the processor clock frequency F:
Perf level = N * n * S * F
Note: the previous formula assumes that we are dividing AMD's advertised number of cores by two, as explained in the previous article.

We have seen so far that, for the tested workload, the difference in performance between Intel and AMD is not significant IF the system is running at least two threads per core. But what about the financials?

We recently had the opportunity to compare two different server machines. All other things equal, we had

AMD dual Opteron 6320 x 4 core 2.8 GHz machine = 2708 EUR
Intel dual E5-2620V2 x 6 core 2.1 GHz machine = 2890 EUR

Assuming the efficiencies are the same as the ones found on the test machines (more on this below) we would find:

Throughput per EUR
TPE_AMD = 2 * 4 * 2.8 * 1.07 / 2708 = 0.008850
TPE_INTEL = 2 * 6 * 2.1 * 1 / 2890 =  0.008720
Thus, we get nearly a match
TPE_AMD = 1.01 * TPE_INTEL
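These figures are straightforward to reproduce on the command line (prices and scores as quoted above):
# TPE = N processors * n cores * F GHz * S score / price in EUR
echo "2 * 4 * 2.8 * 1.07 / 2708" | bc -l    # AMD
echo "2 * 6 * 2.1 * 1.00 / 2890" | bc -l    # Intel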
Further testing

But it turns out that our simplistic floating-point benchmark was actually doing only integer calculations, due to the use of the bc command line calculator, which works internally with integers. This was pointed out by Henry Wong, from stuffedcow.net, during a discussion of his own test results. By replacing bc with a simple loop of libc math operations, compiled with gcc, we were able to test pure (still very simplistic) floating-point performance on the same machines tested before.

In terms of operations per GHz per second we found:
Intel single thread: 3.33/2.4 =1.39
Intel two threads: 5.59/2.4 =  2.33
AMD single thread: 2.86/3.2 = 0.89
AMD two threads: 4.41/3.2 = 1.38
By performing the same normalizations as before we got
AMD single thread: 0.64
AMD two threads: 0.99
Intel single thread: 1
Intel two threads: 1.68 
and
AMD single thread: 0.38
AMD two threads: 0.59
Intel single thread: 0.60
Intel two threads: 1
The results seem very disappointing for the AMD machine: the single thread performance gap is even wider than in the previous test, and this time the extra scalability that compensated for the weak single thread performance is not there. Intel shows a hyperthreading bonus of 68% whereas for AMD we see only about 55%.

The testing we performed was meant to provide some intuition about the subject. But we know that floating point performance is a subtle topic and that we must be careful about drawing conclusions from basic testing. Therefore, we decided to have a look at an industry standard benchmark.

Reference results from spec.org

Looking at real benchmark output at spec.org we found, again, different results. For the throughput of floating point operations we have the following base scores:
Dell Inc. - PowerEdge M710 (Intel Xeon E5620, 2.40 GHz)
two thread score (16 threads on 8 cores): 164 / 8 = 20.50

Advanced Micro Devices - Supermicro A+ Server 1022G-NTF, AMD Opteron 6328
two thread score (16 threads on 8 cores, counted Intel's way): 289 / 8 = 36.13
Please note that spec.org uses AMD's core counting in the score table. In terms of operations per GHz per second, the values above translate to:
Intel two threads: 20.5/2.4 = 8.54
AMD two threads: 36.13/3.2 = 11.29
Normalizing we get:
Intel two threads: 1
AMD two threads: 1.32
In terms of floating point throughput per EUR we would find, by combining the performance numbers from our directly tested processors with the quotes for the new machines:
TPE_AMD = 2 * 4 * 2.8 * 1.32 / 2708 = 0.0109
TPE_INTEL = 2 * 6 * 2.1 * 1 / 2890 =  0.0087
TPE_AMD = 1.25 * TPE_INTEL
Fortunately, we can get exact numbers from spec.org for the processors we got quotes for:
Dell Inc. - PowerEdge R720 (Intel Xeon E5-2620 v2, 2.10 GHz)
two thread score (24 threads on 12 cores): 375 / 12 = 31.25

Advanced Micro Devices - Supermicro A+ Server 4022G-6F, AMD Opteron 6320
two thread score (16 threads on 8 cores, counted Intel's way): 268 / 8 = 33.50
In terms of operations per GHz:
Intel two threads: 31.25/2.1 = 14.88
AMD two threads: 33.5/2.8 = 11.96
Normalizing we get:
Intel two threads: 1
AMD two threads: 0.80
That would mean
TPE_AMD = 2 * 4 * 2.8 * 0.8 / 2708 = 0.0066
TPE_INTEL = 2 * 6 * 2.1 * 1 / 2890 =  0.0087
TPE_AMD = 0.76 * TPE_INTEL

What this means is that Intel has dramatically increased its throughput per GHz, at least for the SPEC benchmark, which uses 2 threads per core. Therefore, efficiency factors from older processors can hardly be used for TPE comparisons.

Note: unfortunately we couldn't find a way to compare single thread scores with multi thread scores for these CPUs, because spec.org runs different tests for speed (single thread) and throughput (multiple threads). In the floating point speed test AMD delivers just slightly less per GHz (96%) than Intel, for these specific CPUs. A comparison of the two different test types is available here.

Conclusion

Since multicore processors are standard nowadays and multiprocessor machines are becoming more and more affordable, it is more important to compare total throughput per EUR than maximum single thread performance.

Virtualization is here to stay, and therefore the parallel throughput of current processors is of paramount importance - if the system doesn't perform well enough, one can buy another one or a larger one. Unless, of course, one runs a small number of non-parallel workloads where peak single thread throughput is the defining variable.

However, we have seen that both single thread processor performance and two-thread scalability are highly dependent on the workload. The difference between integer and floating point calculations became evident from a pair of very simple CPU tests.

The most important conclusion is that, in the face of the inherent complexity of the subject and the artificial complexity introduced by certain marketing teams (the core confusion...), it is very hard to base purchase decisions on third-party benchmarks.

For mission critical computing situations we should certainly test our specific workload on different processors and calculate the specific TPE (throughput per EUR) for the candidate systems.

References

Integer calculation script
Floating point calculation script and aux C loop (mathtest)


Friday, September 26, 2014

VDI Out of the box

At this year's edition of the well known Linux event in Lisbon we presented our Linux based multiuser VDI solution to a specialized audience. Under the "VDI Out of the box" mantra we explained how the new generation of VDI technology is cheaper and much easier to manage than any set of traditional PCs.

We also ran a hands-on demo with diskless clients connected to a single multiuser VM capable of running a full KDE desktop and playing fullscreen 1080p video with sound.


More info about this solution is available here.

Thursday, September 18, 2014

Copyright Confusion

Regarding the absurd proposal to revise the Private Copying Law (PL-246), after what was explained here, the widespread confusion is confirmed beyond any doubt here.

Given that more than 6000 citizens have signed the petition against PL-246, an appeal is currently under way for the President of the Assembleia da República (AR) to postpone the discussion of the bill until the petition can be debated.

This blog stands with that appeal:

In view of the more than 6000 signatories of the petition against PL-246, it is urgent to remind the President of the Assembleia da República that the vote should not take place before the petition is debated in plenary, as the Assembly's own rules require. It is therefore imperative to suspend the vote until the debate is broadened and the confusion orbiting this subject is completely eliminated. All signatories of the petition are invited to contact the President through the link

http://tinyurl.com/stop246

Contact should be made in objective and cordial terms. We remind readers that the President of the AR is not responsible for PL-246, that we are asking the President only for a schedule revision, and that in politics knowing how to express opinions in proper terms is as important as being right.

In the general interest of the country, please take part!

References:

http://jonasnuts.com/lei-da-copia-privada-o-que-fazer-494368
https://blog.1407.org/2014/09/17/por-favor-pecam-para-suspender-pl246/

Friday, August 29, 2014

An ice bucket for copyright / strange priorities

The private copying discussion has quietly re-entered the stage, in the form of a bill approved by the Council of Ministers. I still hoped it was a silly-season rumour, but apparently not: the thing is for real.

One can already imagine how pretty the results will be. The first is that people will convince themselves they can now pirate at will, because everything is already covered by the levy - the deliberate confusion between private copying and pirate copying can hardly have any other outcome. The second is that the artificial increase in the cost of electronic devices may revive an activity that should, in principle, already be extinct: smuggling. The Government risks killing, in one blow, the already moribund respect for copyright along with the feeble moral credibility of the Portuguese tax system.

With the executive so keen on opening hostilities on new fronts, one would think its popularity enjoyed ample slack and that the remaining lines of digital policy were all well on track. The questions that occur to me most frequently are the following:
  • what is the current status of the ICT rationalization?
  • what is the state of Open Standards adoption?
  • how many millions have already been saved via the RNID, the PGERRTIC and Open Source technology?
  • how is the security of the State's information systems doing?
I don't know, and I don't know whether anyone does. Perhaps at Evento Linux 2014 we will get an official or semi-official comment, but so far nobody has explained where the 500M of yearly savings forecast for these (laudable and necessary) initiatives have gone.

In this situation, it does not seem fair that businesses and citizens should once again be the ones financing the uncontrollable spending the State has grown used to over time. Instead of creating yet more obstacles to economic activity, it is high time the Government showed finished work on what should be the priority: reducing public expenditure.

The private copying subject has been analysed here several times and is being followed closely by the indispensable Maria João Nogueira. Reading this FAQ is essential.

As a final note I suggest that, if the issue actually at stake is piracy, we end this private copying farce and start a broad discussion on that subject. Dot the i's and call things by their names.