How To Install Ganeti Clusters and Instances

suppositions

Suppose that there are two identical hosts: foo.debian.org and bar.debian.org.

They are running stable and have been integrated into Debian infrastructure.

They will serve as nodes in a ganeti cluster named foobar.debian.org.

They have a RAID1 array exposing three partitions: c0d0p1 for /, c0d0p2 for swap and c0d0p3 for lvm volume groups to be used by ganeti via drbd.

They have two network interfaces: eth0 (public) and eth1 (private).

The public network is A.B.C.0/24 with gateway A.B.C.254.

The private network is E.F.G.0/24 with no gateway.

Suppose that the first instance to be hosted on foobar.debian.org is qux.debian.org.

The following DNS records exist:

    foobar.debian.org.                  IN A   A.B.C.1
    foo.debian.org.                     IN A   A.B.C.2
    bar.debian.org.                     IN A   A.B.C.3
    qux.debian.org.                     IN A   A.B.C.4
    foo.debprivate-hoster.debian.org.   IN A   E.F.G.2
    bar.debprivate-hoster.debian.org.   IN A   E.F.G.3

install required packages

On each node, install the required packages:

    # maybe: apt-get install drbd-utils
    apt-get install ganeti ganeti-instance-noop qemu-kvm

configure kernel modules

On each node, ensure that the required kernel modules are loaded at boot:

    ainsl /etc/modules 'drbd'
    echo options drbd minor_count=255 usermode_helper=/bin/true > /etc/modprobe.d/local-drbd.conf
    ainsl /etc/modules 'hmac'
    ainsl /etc/modules 'tun'
    ainsl /etc/modules 'ext3'
    ainsl /etc/modules 'ext4'
    ainsl /etc/modules 'dm_mod'
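
To load the modules immediately without rebooting, something like this can be used (a sketch; same module list as above):

    for m in drbd hmac tun ext3 ext4 dm_mod; do modprobe $m; done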

configure networking

On each node, ensure that br0 (not eth0) and eth1 are configured.

The bridge interface, br0, is used by the guest virtual machines to reach the public network.

If the guest virtual machines need to access the private network, then br1 should be configured rather than eth1.

To prevent the bridge's link-layer (MAC) address from changing when virtual machines start or stop, explicitly set it.

This is the interfaces file for foo.debian.org:

    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.2
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto eth1
    iface eth1 inet static
      address E.F.G.2
      netmask 255.255.255.0

This is the interfaces file for bar.debian.org:

    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.3
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto eth1
    iface eth1 inet static
      address E.F.G.3
      netmask 255.255.255.0

configure lvm

On each node, configure lvm to ignore drbd devices and to prefer /dev/cciss device names over /dev/block device names (https://code.google.com/p/ganeti/issues/detail?id=93):

    ssed -i \
      -e 's#^\(\s*filter\s\).*#\1= [ "a|.*|", "r|/dev/drbd[0-9]+|" ]#' \
      -e 's#^\(\s*preferred_names\s\).*#\1= [ "^/dev/dm-*/", "^/dev/cciss/" ]#' \
      /etc/lvm/lvm.conf
    service lvm2 restart

create lvm volume groups

On each node, create a volume group:

    vgcreate vg_ganeti /dev/cciss/c0d0p3
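
A quick check that the volume group exists and has the expected size:

    vgs vg_ganeti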

exchange ssh keys

On each node:

    mkdir -m 0700 -p /root/.ssh &&
    ln -s /etc/ssh/ssh_host_rsa_key /root/.ssh/id_rsa

configure iptables (via ferm)

The nodes must be able to connect to each other over both the public and private networks for a number of reasons; see the ganeti2 module in puppet.
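
As a rough sketch of the intent (the real rules are managed by ferm via puppet, not by hand), each node allows all traffic from the other node on both networks; on foo that would amount to something like:

    # sketch only; puppet/ferm manages the real rules
    iptables -A INPUT -s A.B.C.3 -j ACCEPT
    iptables -A INPUT -s E.F.G.3 -j ACCEPT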

instantiate the cluster

On the master node (foo) only:

    gnt-cluster init \
      --master-netdev br0 \
      --vg-name vg_ganeti \
      --secondary-ip E.F.G.2 \
      --enabled-hypervisors kvm \
      --nic-parameters link=br0 \
      --mac-prefix 00:16:37 \
      --no-ssh-init \
      --no-etc-hosts \
      --uid-pool=29400-29439 \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      foobar.debian.org

Note the following:

On hosting locations with a limited number of IPv4 addresses, it might be worth using --primary-ip-version=6. This asks ganeti to use an IPv6 master address and also enforces IPv6 node addresses; the secondary addresses, however, still need to be IPv4 addresses. For a single-node cluster, the public IPv4 address can be used as the secondary address. Note that this cannot easily be changed after cluster creation.
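
A sketch of what a single-node, IPv6-primary initialisation might look like (assumptions: foobar.debian.org has an AAAA record, and the node's public IPv4 address is reused as the secondary address):

    gnt-cluster init \
      --master-netdev br0 \
      --vg-name vg_ganeti \
      --primary-ip-version=6 \
      --secondary-ip A.B.C.2 \
      --enabled-hypervisors kvm \
      --nic-parameters link=br0 \
      --no-ssh-init \
      --no-etc-hosts \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      foobar.debian.org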

add slave nodes

For each slave node (only bar for this example):

On the slave, append the master's /etc/ssh/ssh_host_rsa_key.pub to /etc/ssh/userkeys/root. This is only required temporarily; once everything works, puppet will put it there and keep it there.
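
One way to do that, assuming root ssh access to both machines from an admin workstation (a sketch only; copy-pasting the key by hand works just as well):

    ssh root@foo.debian.org cat /etc/ssh/ssh_host_rsa_key.pub | \
      ssh root@bar.debian.org 'cat >> /etc/ssh/userkeys/root'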

On the master node (foo):

    gnt-node add \
      --secondary-ip E.F.G.3 \
      --no-ssh-key-check \
      bar.debian.org

More cluster configuration:

    gnt-cluster modify --reserved-lvs='vg0/local-swap.*'
    # maybe: gnt-cluster modify --nic-parameters mode=openvswitch

verify cluster

On the master node (foo):

    gnt-cluster verify

If everything has been configured correctly, no errors should be reported.
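
Another quick sanity check is to list the nodes and confirm they all show up online:

    gnt-node list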

create the 'noop' variant

Ensure that the ganeti-os-noop is installed.
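
To see which OS definitions (and variants) the cluster knows about, the noop one included:

    gnt-os list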


How To Install Ganeti Instances

Suppose that qux.debian.org will be an instance (a virtual machine) hosted on the foobar.debian.org ganeti cluster.

Before adding the instance, an LDAP entry must be created so that an A record for the instance (A.B.C.4) exists.
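
A quick way to confirm the record exists before proceeding:

    dig +short qux.debian.org A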

create the instance

On the master node (foo):

    gnt-instance add \
      --node foo:bar \
      --disk-template drbd \
      --os-size 4GiB \
      --os-type debootstrap+dsa \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=A.B.C.4 \
      qux.debian.org


Variations

If the instances require access to the private network, then there are two modifications necessary.

re-configure networking

On the nodes, ensure that br1 is configured (rather than eth1).

This is the interfaces file for foo.debian.org:

    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.2
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto br1
    iface br1 inet static
      bridge_ports eth1
      bridge_maxwait 0
      bridge_fd 0
      address E.F.G.2
      netmask 255.255.255.0
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

This is the interfaces file for bar.debian.org:

    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.3
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto br1
    iface br1 inet static
      bridge_ports eth1
      bridge_maxwait 0
      bridge_fd 0
      address E.F.G.3
      netmask 255.255.255.0
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

create or update the instance

When creating the instance, indicate both networks:

    gnt-instance add \
      --node foo:bar \
      --disk-template drbd \
      --os-size 4GiB \
      --os-type debootstrap+dsa \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=A.B.C.4 \
      --net 1:link=br1,ip=E.F.G.4 \
      qux.debian.org

When updating an existing instance, add the interface:

    gnt-instance shutdown qux.debian.org
    gnt-instance modify \
      --net add:link=br1,ip=E.F.G.4 \
      qux.debian.org
    gnt-instance startup qux.debian.org

Please note that the hook scripts are run only at instance instantiation. When adding interfaces to an instance, the guest operating system must be updated manually.
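
For example, after adding a NIC on the private network, a stanza along these lines would have to be added to the guest's /etc/network/interfaces by hand (a sketch; assumes the new NIC shows up as eth1 inside the guest):

    auto eth1
    iface eth1 inet static
      address E.F.G.4
      netmask 255.255.255.0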

Sometimes an instance is created by adopting existing logical volumes as plain (non-DRBD) disks, for example:

    gnt-instance add -t plain --os-type debootstrap+dsa-wheezy \
      --disk 0:adopt=lully-boot \
      --disk 1:adopt=lully-root \
      --disk 2:adopt=lully-swap \
      --disk 3:adopt=lully-log  \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=82.195.75.99 -n clementi.debian.org  lully.debian.org

Afterwards, convert it to DRBD and fail it over to the other cluster node, to confirm that DRBD is working correctly:

    gnt-instance shutdown lully.debian.org
    gnt-instance modify -t drbd -n czerny.debian.org lully.debian.org
    gnt-instance failover lully.debian.org
    gnt-instance startup lully.debian.org

To make an instance use IDE disks instead of the cluster default:

    gnt-instance modify --hypervisor-parameters disk_type=ide fils.debian.org

To create an instance by adopting existing block devices (for example multipath SAN LUNs):

    gnt-instance add -t blockdev --os-type debootstrap+dsa \
      --disk 0:adopt=/dev/disk/by-id/scsi-reger-boot \
      --disk 1:adopt=/dev/disk/by-id/scsi-reger-root \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=206.12.19.124 -n rossini.debian.org reger.debian.org

Add new LUN to MSA and export to all blades

  Log into the MSA controller.

  Choose which vdisk to use; "show vdisks" lists them.

Add the volume:
  # create volume vdisk msa2k-2-500gr10 size 5G donizetti

Find a free LUN:

  # show lun-maps

or (if we assume they are all the same):

  # show host-maps 3001438001287090

Make a note of the next free LUN

Generate map commands for all blades, all ports, run locally:

  $ for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "map volume donizetti lun 27 host bm-bl$bl-p$p" ; done ; done

Paste the output into the MSA shell

Find the WWN by doing "show host-maps" and looking for the volume name.
Transform it using the sed command given at the top of /etc/multipath.conf:

  echo "$WWN" | sed -re 's#(.{6})(.{6})0000(.{2})(.*)#36\1000\2\3\4#'

Then rescan the cciss bus on all cluster nodes:

  gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"

Then reload multipath-tools on the gnt master (normally bm-bl1):

  service multipath-tools reload

Add the WWNs to dsa-puppet/modules/multipath/files/multipath-bm.conf, define the alias, and commit that file to git.

Then run puppet on all cluster nodes:

  gnt-cluster command "puppet agent -t"

This will update the multipath config on all cluster nodes. Without doing this, you cannot migrate VMs between nodes.

Remove LUNs

Order is important, or else things get very, very confused and the world needs a reboot.

* Make sure nothing uses the volume anymore.

* Make sure we do not have any partitions lying around for it:

  gnt-cluster command "ls -l /dev/mapper/backuphost*"
  # and maybe:
  gnt-cluster command "kpartx -v -p -part -d /dev/mapper/backuphost"

* flush the device, remove the multipath mapping, flush all backing devices:

  root@bm-bl1:~# cat flush-mp-device
  #!/bin/sh

  dev="$1"

  if [ -z "$dev" ] || ! [ -e "$dev" ]; then
    echo >&2 "Device $dev does not exist."
    exit 1
  fi

  devs=$(multipath -ll "$dev" | grep cciss | sed -e 's/.*cciss!//; s/ .*//;')

  if [ -z "$devs" ]; then
    echo >&2 "No backends found for $dev."
    exit 1
  fi

  set -e

  blockdev --flushbufs "$dev"
  multipath -f "$dev"
  for d in $devs; do
    blockdev --flushbufs "/dev/cciss/$d"
  done
  echo done.

Then run it on all cluster nodes:

  gnt-cluster command "/root/flush-mp-device /dev/mapper/backuphost"

* Immediately afterwards, paste the output of the following into the MSA console. It is best to prepare this beforehand, so you can do it quickly, before anything rescans the bus, reloads or restarts multipathd, and the devices become used again.

  for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "unmap volume DECOMMISSION-backuph host bm-bl$bl-p$p" ; done ; done

* Lastly, rescan the scsi bus on all hosts. Do not forget this step: hpacucli and the monitoring tools might lock up the machine if they try to check the status of a device that no longer exists but that the system still thinks is present.

  gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"

DRBD optimization

The default DRBD parameters are not optimized, which results in very slow (re)syncing. The following commands can help make it faster. The maximum rate can of course be raised further if both the network and the disks allow it.

    gnt-cluster modify -D drbd:net-custom="--max-buffers 36k --sndbuf-size 1024k --rcvbuf-size 2048k"
    gnt-cluster modify -D drbd:c-min-rate=32768
    gnt-cluster modify -D drbd:c-max-rate=98304
    gnt-cluster modify -D drbd:resync-rate=98304

Change the disk cache

When using raw volumes or partitions, it is best to avoid the host cache completely to reduce data copies and bus traffic. This can be done using:

    gnt-cluster modify -H kvm:disk_cache=none

Change the CPU type

Modern processors come with a wide variety of additional instruction sets (SSE, AES-NI, etc.) which vary from processor to processor but can greatly improve performance depending on the workload. Ganeti and QEMU default to a compatible subset of CPU features called qemu64, so that if the host processor is changed, or a live migration is performed, the guest sees its CPU features unchanged. This is great for compatibility but comes at a performance cost.

On x86

The CPU presented to the guests can easily be changed using the cpu_type option in the Ganeti hypervisor options. However, to still be able to live-migrate VMs from one host to another, the CPU presented to the guest should be the lowest common denominator of all hosts in the cluster; otherwise a live migration between two different CPU types could crash the instance.

For homogeneous clusters it is possible to use the host cpu type:

  gnt-cluster modify -H kvm:cpu_type='host'

Otherwise QEMU provides a set of generic CPU models for each generation, which can be queried this way:

$ qemu-system-x86_64 -cpu ?

x86           qemu64  QEMU Virtual CPU version 2.1.2
x86           phenom  AMD Phenom(tm) 9550 Quad-Core Processor
x86         core2duo  Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
x86            kvm64  Common KVM processor
x86           qemu32  QEMU Virtual CPU version 2.1.2
x86            kvm32  Common 32-bit KVM processor
x86          coreduo  Genuine Intel(R) CPU           T2600  @ 2.16GHz
x86              486
x86          pentium
x86         pentium2
x86         pentium3
x86           athlon  QEMU Virtual CPU version 2.1.2
x86             n270  Intel(R) Atom(TM) CPU N270   @ 1.60GHz
x86           Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
x86           Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86          Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)
x86         Westmere  Westmere E56xx/L56xx/X56xx (Nehalem-C)
x86      SandyBridge  Intel Xeon E312xx (Sandy Bridge)
x86          Haswell  Intel Core Processor (Haswell)
x86        Broadwell  Intel Core Processor (Broadwell)
x86       Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)
x86       Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)
x86       Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
x86       Opteron_G4  AMD Opteron 62xx class CPU
x86       Opteron_G5  AMD Opteron 63xx class CPU
x86             host  KVM processor with all supported host features (only available in KVM mode)

Recognized CPUID flags:
  pbe ia64 tm ht ss sse2 sse fxsr mmx acpi ds clflush pn pse36 pat cmov mca pge mtrr sep apic cx8 mce pae msr tsc pse de vme fpu
  hypervisor rdrand f16c avx osxsave xsave aes tsc-deadline popcnt movbe x2apic sse4.2|sse4_2 sse4.1|sse4_1 dca pcid pdcm xtpr cx16 fma cid ssse3 tm2 est smx vmx ds_cpl monitor dtes64 pclmulqdq|pclmuldq pni|sse3
  smap adx rdseed rtm invpcid erms bmi2 smep avx2 hle bmi1 fsgsbase
  3dnow 3dnowext lm|i64 rdtscp pdpe1gb fxsr_opt|ffxsr mmxext nx|xd syscall
  perfctr_nb perfctr_core topoext tbm nodeid_msr tce fma4 lwp wdt skinit xop ibs osvw 3dnowprefetch misalignsse sse4a abm cr8legacy extapic svm cmp_legacy lahf_lm
  invtsc
  pmm-en pmm phe-en phe ace2-en ace2 xcrypt-en xcrypt xstore-en xstore
  kvmclock-stable-bit kvm_pv_unhalt kvm_pv_eoi kvm_steal_time kvm_asyncpf kvmclock kvm_mmu kvm_nopiodelay kvmclock
  pfthreshold pause_filter decodeassists flushbyasid vmcb_clean tsc_scale nrip_save svm_lock lbrv npt

For example, on a cluster using both Sandy Bridge and Haswell CPUs, the following command can be used:

  gnt-cluster modify -H kvm:cpu_type='SandyBridge'

Here is a typical improvement one can get on the OpenSSL AES benchmarks.

With the default qemu64 CPU type:

  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  aes-128-cbc     175481.21k   195151.55k   199307.09k   201209.51k   201359.36k
  aes-128-gcm      49971.64k    57688.17k   135092.14k   144172.37k   146511.19k
  aes-256-cbc     130209.34k   141268.76k   142547.54k   144185.00k   144777.22k
  aes-256-gcm      39249.19k    44492.61k   114492.76k   123000.83k   125501.44k

With the SandyBridge CPU type:

  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  aes-128-cbc     376040.16k   477377.32k   484083.37k   391323.31k   389589.67k
  aes-128-gcm     215921.26k   592407.87k   777246.21k   836795.39k   835971.75k
  aes-256-cbc     309840.39k   328612.18k   330784.68k   324245.16k   328116.91k
  aes-256-gcm     160820.14k   424322.20k   557212.50k   599435.61k   610459.65k

On POWER

There are two KVM implementations on POWER: KVM-PR (kvm-pr.ko), which uses the "PRoblem state" of the PPC CPU to run the guests, and KVM-HV (kvm-hv.ko), which uses the hardware virtualization support of the POWER CPU. In the latter case the guest CPU has to be of the same type as the host CPU. However, it is at least possible to run the guest in a backward-compatibility mode of the previous CPU generation by using the compat parameter:

  gnt-cluster modify -H kvm:cpu_type='host\,compat=power8'

Add a virtio-rng device

VirtIO RNG (random number generator) is a paravirtualized device that is exposed as a hardware RNG to the guest; the guest kernel reads from it to fill its entropy pool. Unfortunately Ganeti does not support it natively, so the kvm_extra option has to be used. Ganeti forces the allocation of PCI devices to specific slots, which means it is not possible to use the QEMU auto-allocation and an explicit PCI slot has to be provided. There are 32 possible slots on the default QEMU machine, so we can use one of the last ones, for example 0x1e.

The final command to add a virtio-rng device cluster-wise is therefore:

  gnt-cluster modify -H kvm:kvm_extra="-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000"

The max-bytes and period options limit the entropy rate a guest can get to 1kB/s.
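
Inside a Linux guest, one can check that the device was picked up (the sysfs path may vary with kernel version):

  cat /sys/devices/virtual/misc/hw_random/rng_current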

Remove the floppy drive

The default x86 machine type ("pc") emulated by QEMU has a floppy drive. Unfortunately Ganeti currently does not allow one to emulate another machine type. However it is possible to disable the floppy drive with the following trick (possibly combined with the kvm_extra setting above):

  gnt-cluster modify -H kvm:kvm_extra="-global isa-fdc.fdtypeA=none"

POWER specific settings

On POWER, Ganeti doesn't enable the KVM module by default, so the -enable-kvm option has to be passed.

In addition, disabling the video card makes the guest (firmware, GRUB, kernel) automatically use the serial console. This can be done with the -vga none option.

The command to set up KVM cluster-wide on POWER is therefore the following (possibly combined with the virtio-rng one):

  gnt-cluster modify -H kvm:kvm_extra="-enable-kvm -vga none"

Site specific Notes

At grnet, we are managing IP address space using gnt-network.

The network was added using

  grnet-node03:~# gnt-network add \
    --network=194.177.211.192/27 \
    --gateway=194.177.211.193 \
    --add-reserved-ips=194.177.211.195,194.177.211.196,194.177.211.199 \
    --network6 2001:648:2ffc:deb::/64 \
    --gateway6 2001:648:2ffc:deb::1 \
    gnt-grnet-01
  Wed May 19 17:23:07 2021  - INFO: Reserved IP address of node 'grnet-node03.debian.org' (194.177.211.197)
  Wed May 19 17:23:07 2021  - INFO: Reserved IP address of node 'grnet-node04.debian.org' (194.177.211.198)
  Wed May 19 17:23:07 2021  - INFO: Reserved cluster master IP address (194.177.211.194)
  grnet-node03:~#
  grnet-node03:~# gnt-network list
  Network      Subnet             Gateway         MacPrefix GroupList Tags
  gnt-grnet-01 194.177.211.192/27 194.177.211.193 -

Then we connected the new network to the default node group:

  grnet-node03:~# gnt-network connect --nic-parameters=link=br-inet,mode=openvswitch gnt-grnet-01 default

For migrating an existing setup, we then set the correct addresses on all existing instances, e.g.:

  # gnt-instance modify --net 0:modify,network=gnt-grnet-01,ip=194.177.211.208 melartin.debian.org
  Wed May 19 17:33:25 2021  - INFO: OVS links are currently not checked for correctness
  Wed May 19 17:33:25 2021  - INFO: Reserving IP 194.177.211.208 in network gnt-grnet-01
  Modified instance melartin.debian.org
   - nic.ip/0 -> 194.177.211.208
   - nic.network/0 -> gnt-grnet-01
   - nic.link/0 -> br-inet
   - nic.mode/0 -> openvswitch
   - nic.vlan/0 ->
  Please don't forget that most parameters take effect only at the next (re)start of the instance initiated by ganeti; restarting from within the instance will not be enough.

To install a new host, run gnt-instance add. E.g.:

    gnt-instance add \
      -o debootstrap+buster \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-grnet-01 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --backend-parameters memory=4g,vcpus=2 \
      -n grnet-node03:grnet-node04 \
      test-01.debian.org

In `/var/log/ganeti/os/` the output of the bootstrap process can be found. Note: This is on the host that ran the OS setup script. This is the instance's primary node and not necessarily the ganeti master node. The information includes the initial root password and ssh host keys. To get them, use something like

  egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)

Images are cached for a few days, so if you want to force a new base image, remove the file in `/var/cache/ganeti-instance-debootstrap`.
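
For example (a sketch; run it on the node that will build the instance, and adjust the glob to the cached files actually present):

  ls -l /var/cache/ganeti-instance-debootstrap/
  rm -v /var/cache/ganeti-instance-debootstrap/*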