Managing updates to debian.org systems

Initial setup

Clone the dsa-misc repository.

Create links to scripts/multi-tool/* in ~/bin.
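
The linking step can be sketched as below. This is only an illustration: link_scripts is a hypothetical helper name, and the clone location used in the usage note (~/dsa-misc) is an assumption, so adjust it to wherever you cloned the repository.

```shell
#!/bin/sh
# Symlink every regular file in a source directory into a bin directory.
# The source path should be absolute so that the links resolve from anywhere.
link_scripts() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for script in "$src"/*; do
        # Skip subdirectories and an unmatched glob.
        [ -f "$script" ] || continue
        ln -sf "$script" "$dest/$(basename "$script")"
    done
}
```

With the assumed clone location, the call would be link_scripts "$HOME/dsa-misc/scripts/multi-tool" "$HOME/bin".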

These scripts assume that you have:

a root SSH key in dsa-puppet
configured your SSH client so that you can access all debian.org systems, either directly or via a pre-configured jumphost

Most of the scripts will connect to hosts as root, so will need to be able to access your root SSH key. An exception is debian-upgrade-prepare, which connects as your normal user and will therefore need access to your standard key.

Security updates (and point releases)

Run debian-upgrade-prepare. This will connect to each system, analyse the available updates and present them for confirmation. If more than one system requires exactly the same updates then these will be grouped together, for confirmation as a unit. Finally, a debian-upgrade command will be output to the console that will install those upgrades that were confirmed.
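
To illustrate the grouping idea only, the sketch below collects hosts that share exactly the same set of pending updates. It is not how debian-upgrade-prepare is implemented. The input format (one line per host, host name and package list separated by a tab, with the package list already in a canonical order) and the function name group_by_updates are assumptions made for the example.

```shell
#!/bin/sh
# Read "host<TAB>package list" lines on stdin and print one line per
# distinct package list, followed by a tab and the hosts that share it.
group_by_updates() {
    awk -F '\t' '
        {
            if ($2 in hosts)
                hosts[$2] = hosts[$2] " " $1
            else
                hosts[$2] = $1
        }
        END {
            for (updates in hosts)
                printf "%s\t%s\n", updates, hosts[updates]
        }
    ' | sort
}
```

Hosts with identical package lists end up on a single output line, which corresponds to one group that could be confirmed as a unit.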

The install process will launch a tmux process, with one window per system to upgrade. If there are any unconfirmed samhain alerts for the system, you will need to confirm these before the upgrades will be installed. After installing the upgrades, a check will be made for packages that are flagged for autoremoval. Finally, the samhain database will be updated and where possible any affected processes will be restarted.

Note that if a pending kernel reboot is detected, the process restart step will be skipped.

Reboots

General purpose systems

Systems with a reboot policy of justdoit configured in LDAP can be rebooted without warning.

For a mass reboot of such systems (e.g. following a point release) run the debian-reboot-simple helper script.

Buildds

To reboot all buildds, e.g. following a point release, run the debian-reboot-buildd helper script.

To reboot an individual buildd, connect to the system and run (in a screen):

buildd-reboot [-h] <REASON>

This will wait for any running build to end, ask the buildd process to terminate and then reboot the system. The optional -h flag will request a halt rather than reboot.

If the host is a VM running in a Ganeti cluster, then requesting a halt will result in Ganeti automatically restarting the system. This is particularly important for instances where e.g. the QEMU or KVM process needs to be restarted, which a reboot of the VM will not achieve.

Porter boxes

Run the debian-reboot-porterboxes helper script.

Redundant services

Several services are provided by more than one machine. In these cases, it is possible to reboot the nodes separately, ideally with a delay to ensure that the system is removed from the relevant rotation(s) first. The debian-reboot-rotation helper script can facilitate this.

snapshot.debian.org

The snapshot service consists of three clusters, hosted at 31173 Services, Sanger and Leaseweb / manda. Both the Leaseweb and 31173 clusters offer the snapshot.debian.org service over HTTP, with updates to the data happening at 31173.

31173 Services

Prerequisites

the DNS rotation for snapshot.debian.org includes lw07
snapshot-mlm-01:/srv/snapshot.debian.org/log/snapshot.log does not indicate that an import is currently running
dinstall was more than an hour ago

Schedule a reboot for snapshot-mlm-01 with a 15 minute delay:

FIRSTWAIT=5 debian-reboot-many snapshot-mlm-01.debian.org
Hosts: 5:snapshot-mlm-01.debian.org
Continue (or ^C)?

Sanger

Prerequisites

none

Schedule reboots for sibelius with an initial 20 minute delay, and sallinen 10 minutes later (to ensure that sibelius is back up before sallinen):

FIRSTWAIT=10 HOSTWAIT=10 debian-reboot-many sibelius.debian.org sallinen.debian.org
Hosts: 10:sibelius.debian.org 20:sallinen.debian.org
Continue (or ^C)?

Leaseweb

Prerequisites

the DNS rotation for snapshot.debian.org includes snapshot-mlm-01.debian.org

Schedule reboots with an initial 20 minute delay, and then a 5 minute interval between hosts. A suitable invocation is:

    FIRSTWAIT=10 HOSTWAIT=5 debian-reboot-many lw01.debian.org lw02.debian.org lw03.debian.org lw04.debian.org lw09.debian.org lw10.debian.org lw08.debian.org lw07.debian.org
    Hosts: 10:lw01.debian.org 15:lw02.debian.org 20:lw03.debian.org 25:lw04.debian.org 30:lw09.debian.org 35:lw10.debian.org 40:lw08.debian.org 45:lw07.debian.org
    Continue (or ^C)?

The order matters: all of the storage servers must be back up before lw07 and lw08, and the database must be back up before lw07.

Note that FIRSTWAIT=10 results in a delay of 10 minutes before a shutdown -r +10 is issued, thus creating a 20 minute delay from the initial invocation.
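
The arithmetic behind the Hosts: lines and the resulting delays can be sketched as follows. This is not the debian-reboot-many implementation. The function names are hypothetical, and the sketch assumes that the shutdown -r +10 delay is fixed regardless of FIRSTWAIT, which is consistent with the 15 minute figure given for FIRSTWAIT=5 above.

```shell
#!/bin/sh
# Minutes between a host's trigger offset and its actual reboot, taken
# from the "shutdown -r +10" in the note above (assumed to be fixed).
SHUTDOWN_DELAY=10

# Print the per-host trigger offsets in the same "offset:host" form as
# the Hosts: line. Arguments: FIRSTWAIT HOSTWAIT host...
reboot_offsets() {
    firstwait=$1
    hostwait=$2
    shift 2
    offset=$firstwait
    line="Hosts:"
    for host in "$@"; do
        line="$line $offset:$host"
        offset=$((offset + hostwait))
    done
    printf '%s\n' "$line"
}

# Minutes from invocation until a host with the given offset reboots.
reboot_minute() {
    echo $(( $1 + SHUTDOWN_DELAY ))
}
```

For the Sanger example, reboot_offsets 10 10 sibelius.debian.org sallinen.debian.org reproduces the offsets 10 and 20, and reboot_minute 10 gives the 20 minute initial delay.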

Ganeti clusters

2 node x86 / ARM

Connect to the master node, and run ganeti-reboot-cluster in a root screen.

If the hosts do not require rebooting, but their QEMU processes require restarting, this can be achieved by running ganeti-shuffle-cluster [currently in adsb's home directory but should be integrated into ganeti-reboot-cluster].

3-or-more node ARM

Connect to the master node, and for each other node in turn:

gnt-node migrate -f $node to migrate any VMs on the node to their secondary node
reboot the node
once the node has rebooted, wait for DRBD to be synced on all nodes

Finally apply the above steps to the master node, and:

hbal -L -C -v -v --no-disk-moves -X to move VMs back to the node
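
One possible way to check the "wait for DRBD to be synced" step on a node is sketched below. It assumes the DRBD 8.x style /proc/drbd format, where each device line carries cs: and ds: fields. The function name drbd_all_synced is hypothetical, and the check only covers the node it runs on, so it would need to be repeated on every node.

```shell
#!/bin/sh
# Read /proc/drbd style text on stdin. Exit 0 if no configured device is
# disconnected or out of date, and 1 otherwise. Unconfigured minors are
# ignored. Assumes DRBD 8.x output.
drbd_all_synced() {
    awk '
        /^ *[0-9]+: / {
            if (index($0, "cs:Unconfigured"))
                next
            if (!index($0, "cs:Connected") || !index($0, "ds:UpToDate/UpToDate"))
                bad++
        }
        END {
            if (bad > 0)
                exit 1
            exit 0
        }
    '
}
```

A polling loop such as until drbd_all_synced < /proc/drbd; do sleep 30; done would then block until the local devices report UpToDate/UpToDate.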

UBC x86

Only three nodes of the cluster should have running VMs, with either the first or last node being empty.

Reboot the empty node.
Connect to the master node and migrate VMs to the empty node from its nearest neighbour.
Repeat until all nodes have been rebooted.

ppc64el

We have two single node Ganeti clusters running ppc64el - pijper and prokofiev.

As there is only one node in each cluster, the VMs must be shut down in order to reboot the host.

gnt-cluster watcher pause <SECONDS>
halt the VMs (with an appropriate delay if they are part of a rotation)
gnt-cluster watcher continue
reboot the host as soon as possible (before the watcher has a chance to restart the VMs)

The rest

Hosts not covered by the above sections will tend to be marked as "rebootPolicy:manual" in LDAP.

ftp-master

dak holds the reboot lock while it is running. A generally reliable process is:

check for running codesigning
check how close dinstall is
if you can acquire the reboot lock, schedule a reboot for 5 minutes' time

image-building machines

If you can acquire the reboot lock, schedule a reboot for 5 minutes' time.

Hetzner

The storage server at Hetzner uses full disk encryption, with dropbear in the initramfs.

After rebooting, connect as root using IPv4 and unlock the filesystem.

Mirrors [needs improving]

Don't reboot during dinstall or mirror pushes.

Only reboot one at a time (scheduling multiple hosts in sequence is OK as long as each should be back up before the next shuts down).

Schedule reboots with enough lead time for the host to be removed from any DNS rotations.