Managing updates to debian.org systems
Initial setup
Clone the dsa-misc repository.
Create links to scripts/multi-tool/* in ~/bin.
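The linking step could be done along the following lines. This is only a sketch: link_multi_tool is a hypothetical helper, and the default clone location of ~/dsa-misc is an assumption, so pass your own paths if they differ.

```shell
# Hypothetical helper: symlink every script in scripts/multi-tool into a bin
# directory. The default clone location (~/dsa-misc) is an assumption.
# Pass an absolute repository path so the resulting links are not dangling.
link_multi_tool() {
    src=${1:-"$HOME/dsa-misc"}/scripts/multi-tool
    dest=${2:-"$HOME/bin"}
    mkdir -p "$dest"
    for script in "$src"/*; do
        # Skip the unexpanded glob if the directory is empty or missing.
        [ -e "$script" ] || continue
        ln -sf "$script" "$dest/"
    done
}
```

Calling link_multi_tool with no arguments uses the assumed defaults; link_multi_tool /path/to/dsa-misc /path/to/bin overrides both.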
These scripts assume that you have
- a root SSH key in dsa-puppet
- configured your SSH client so that you can access all debian.org systems, either directly or via a pre-configured jumphost
Most of the scripts connect to hosts as root, so they need access to your root SSH key. The exception is debian-upgrade-prepare, which connects as your normal user and therefore needs access to your standard key.
Security updates (and point releases)
Run debian-upgrade-prepare. This will connect to each system, analyse the available updates and present them for confirmation. If more than one system requires exactly the same updates, those systems will be grouped together and confirmed as a unit. Finally, a debian-upgrade command will be printed to the console; running it installs the upgrades that were confirmed.
The install process will launch a tmux process, with one window per system to upgrade. If there are any unconfirmed samhain alerts for the system, you will need to confirm these before the upgrades will be installed. After installing the upgrades, a check will be made for packages that are flagged for autoremoval. Finally, the samhain database will be updated and where possible any affected processes will be restarted.
Note that if a pending kernel reboot is detected, the process restart step will be skipped.
Reboots
General purpose systems
Systems with a reboot policy of justdoit configured in LDAP can be rebooted without warning.
For a mass reboot of such systems (e.g. following a point release), run the debian-reboot-simple helper script.
Buildds
To reboot all buildds, e.g. following a point release, run the debian-reboot-buildd helper script.
To reboot an individual buildd, connect to the system and run (in a screen):
buildd-reboot [-h] <REASON>
This will wait for any running build to end, ask the buildd process to terminate and then reboot the system. The optional -h flag will request a halt rather than a reboot.
If the host is a VM running in a Ganeti cluster, then requesting a halt will result in Ganeti automatically restarting the system. This is particularly important for instances where e.g. the QEMU or KVM process needs to be restarted, which a reboot of the VM will not achieve.
Porter boxes
Run the debian-reboot-porterboxes helper script.
Redundant services
Several services are provided by more than one machine. In these cases, it is possible to reboot the nodes separately, ideally with a delay to ensure that the system is removed from the relevant rotation(s) first. The debian-reboot-rotation helper script can facilitate this.
snapshot.debian.org
The snapshot service consists of three clusters, hosted at 31173 Services, Sanger and Leaseweb / manda. Both the Leaseweb and 31173 clusters offer the snapshot.debian.org service over HTTP, with updates to the data happening at 31173.
31173 Services
Prerequisites
- the DNS rotation for snapshot.debian.org includes lw07
- snapshot-mlm-01:/srv/snapshot.debian.org/log/snapshot.log does not indicate that an import is currently running
- dinstall was more than an hour ago
Schedule a reboot for snapshot-mlm-01 with a 15 minute delay:
FIRSTWAIT=5 debian-reboot-many snapshot-mlm-01.debian.org
Hosts: 5:snapshot-mlm-01.debian.org
Continue (or ^C)?
Sanger
Prerequisites
- none
Schedule reboots for sibelius with an initial 20 minute delay, and sallinen 10 minutes later (to ensure that sibelius is back up before sallinen):
FIRSTWAIT=10 HOSTWAIT=10 debian-reboot-many sibelius.debian.org sallinen.debian.org
Hosts: 10:sibelius.debian.org 20:sallinen.debian.org
Continue (or ^C)?
Leaseweb
Prerequisites
- the DNS rotation for snapshot.debian.org includes snapshot-mlm-01.debian.org
Schedule reboots with an initial 20 minute delay, and then a 5 minute interval between hosts. A suitable invocation is:
FIRSTWAIT=10 HOSTWAIT=5 debian-reboot-many lw01.debian.org lw02.debian.org lw03.debian.org lw04.debian.org lw09.debian.org lw10.debian.org snapshotdb-manda-01.debian.org lw08.debian.org lw07.debian.org
Hosts: 10:lw01.debian.org 15:lw02.debian.org 20:lw03.debian.org 25:lw04.debian.org 30:lw09.debian.org 35:lw10.debian.org 40:snapshotdb-manda-01.debian.org 45:lw08.debian.org 50:lw07.debian.org
Continue (or ^C)?
(the primary goal is to ensure that all of the storage servers are back up before lw07 and lw08, and that the database is back up before lw07.)
Note that FIRSTWAIT=10 results in a delay of 10 minutes before a shutdown -r +10 is issued, thus creating a 20 minute delay from the initial invocation.
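As a rough illustration of the arithmetic, the sketch below reproduces the offsets shown on the "Hosts:" lines from a first wait and a per-host wait. reboot_schedule is a hypothetical helper written for this page, not part of dsa-misc, and it makes no claim about how debian-reboot-many is implemented. Each offset is the number of minutes before shutdown -r +10 is issued, so the host actually goes down 10 minutes after the printed figure.

```shell
# Hypothetical helper: print "minutes:host" pairs for a given first wait and
# per-host wait, in the same form as the "Hosts:" confirmation line.
# Usage: reboot_schedule FIRSTWAIT HOSTWAIT host...
reboot_schedule() {
    firstwait=$1
    hostwait=$2
    shift 2
    offset=$firstwait
    out=""
    for host in "$@"; do
        # Append "offset:host", separated by a space after the first entry.
        out="${out:+$out }${offset}:${host}"
        offset=$((offset + hostwait))
    done
    printf '%s\n' "$out"
}
```

For example, reboot_schedule 10 10 sibelius.debian.org sallinen.debian.org prints 10:sibelius.debian.org 20:sallinen.debian.org, matching the Sanger example.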
Ganeti clusters
2 node x86 / ARM
Connect to the master node, and run ganeti-reboot-cluster in a root screen.
If the hosts do not require rebooting, but their QEMU processes require restarting, this can be achieved by running ganeti-shuffle-cluster [currently in adsb's home directory but should be integrated into ganeti-reboot-cluster].
3-or-more node ARM
Connect to the master node, and for each other node in turn:
- gnt-node migrate -f $node to migrate any VMs on the node to their secondary node
- reboot the node
- once the node has rebooted, wait for DRBD to be synced on all nodes
Finally apply the above steps to the master node, and:
- hbal -L -C -v -v --no-disk-moves -X to move VMs back to the node
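The "wait for DRBD to be synced" step can be checked by hand, or with something like the sketch below. It assumes the DRBD 8.x /proc/drbd format, in which each device line carries a ds: field that reads UpToDate/UpToDate once both sides are in sync. drbd_synced is a hypothetical helper, not an existing DSA script.

```shell
# Hypothetical helper: read /proc/drbd-style text on stdin and succeed only if
# every device line that reports a disk state shows ds:UpToDate/UpToDate.
# Assumes the DRBD 8.x /proc/drbd layout.
drbd_synced() {
    ! grep 'ds:' | grep -qv 'ds:UpToDate/UpToDate'
}
```

On each node this could be polled with, for instance, until drbd_synced < /proc/drbd; do sleep 30; done before moving on to the next node.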
UBC x86
Only three nodes of the cluster should have running VMs, with either the first or last node being empty.
- Reboot the empty node.
- Connect to the master node and migrate VMs to the empty node from its nearest neighbour.
- Repeat until all nodes have been rebooted.
ppc64el
We have two single node Ganeti clusters running ppc64el - pijper and prokofiev.
As there is only one node in each cluster, the VMs must be shut down in order to reboot the host.
- gnt-cluster watcher pause <SECONDS>
- halt the VMs (with an appropriate delay if they are part of a rotation)
- gnt-cluster watcher continue
- reboot the host as soon as possible (before the watcher has a chance to restart the VMs)
The rest
Hosts not covered by the above sections will tend to be marked as "rebootPolicy:manual" in LDAP.
ftp-master
dak holds the reboot lock while it is running. A generally reliable process is:
- check for running codesigning
- check how close dinstall is
- if you can acquire the reboot lock, schedule a reboot for 5 minutes' time
image-building machines
If you can acquire the reboot lock, schedule a reboot for 5 minutes' time.
Hetzner
The storage server at Hetzner uses full disk encryption, with dropbear in the initramfs.
After rebooting, connect as root using IPv4 and unlock the filesystem.
Mirrors [needs improving]
Don't reboot during dinstall or mirror pushes.
Only reboot one at a time (scheduling multiple hosts in sequence is OK as long as each should be back up before the next shuts down).
Schedule reboots with enough lead time for the host to be removed from any DNS rotations.