Postgres backup

Backing up postgres consists of two main pieces: backing up the WALs (write ahead logs) and regular backups of the base. See https://www.postgresql.org/docs/9.4/static/continuous-archiving.html

The WALs are copied from the postgres server to the backup hosts using ssh with the pg-backup-file script. Base backups are pulled on the backup hosts using postgres-make-base-backups. The former requires the postgres servers be able to ssh to the backup servers, and the latter requires the postgres server listen on the network, have ssl set up correctly, access is allowed in the firewall, a postgres user with replication privileges exists and is configured to allowed to connect in pg_hba.

server config

(2018-02 - This documentation has not really been tested since it was written – weasel)

There are two possible ways to configure the server.

Single cluster on a host

If there is only one cluster on the host, we can use puppet's postgresql::server to configure the cluster and any databases on it, see modules/salsa/manifests/database.pp for an example. In particular, you want to set archive_command and the ssl options in pg.conf, as well as set listen_address correctly.

Add a postgres::backup_cluster stanza to get it backed up.

Multiple clusters/compatibility mode

Since we often have more than one cluster, we cannot use the puppet postgresql::server class for most things.

Add the server to the roles::postgresql::server class role in hiera. This will cause some scripts to be installed on the host, as well as an ssh key to be created for the postgres user.

Add these to /etc/postgresql/9.6/main/postgresql.conf or equivalent

  track_counts = yes
  archive_mode = on
  wal_level = archive
  max_wal_senders = 3
  archive_timeout = 1h
  archive_command = '/usr/local/bin/pg-backup-file mXXXXXX-CLUSTERNAMEHERE-XXXXain WAL %p'

Run puppet on the postgresql server,

If the server is a replication receiver, it needs read access to the sender's WALs on the backup host (to recover from situations where the source might no longer have the WALs.) This can be configured via hiera as well. Example:

[git|master] weasel@orinoco:~/projects/debian/d-a/dsa-puppet$ cat data/nodes/snapshotdb-manda-01.debian.org.yaml
classes:
  - roles::snapshot_db
  - roles::postgresql::server

postgres::backup_server::register_backup_clienthost::allow_read_hosts: ['sallinen']

base backup config

Run puppet on the backup hosts (storace and backuphost as of 2019).

On the db server, create a role. Find the password to use on the backup host in ~debbackup/.pgpass:
sudo -u postgres createuser -D -E -P -R -S debian-backup
Give the role replication access:
sudo -u postgres psql -c 'ALTER ROLE "debian-backup" REPLICATION;'
Add an entry to pg_hba to allow access:
hostssl replication debian-backup 5.153.231.12/32 md5 # backuphost hostssl replication debian-backup 2001:41c8:1000:21::21:12/128 md5 # backuphost hostssl replication debian-backup 93.94.130.161/32 md5 # storace hostssl replication debian-backup 2a02:158:380:280::161/128 md5 # storace
Ensure pg is listening on *.
Ensure the server is using ssl and a proper debian auto-ca cert.
Reload db server.
Test running "postgres-make-base-backups host:port".
You should see a tarball and WALs

Nagios warnings

BASE-IS-OLD

(2018-02) Our nagios check warns us when a backup server has not successfully fetched a base backup recently. The causes often are that either the postgres server or the backup host went down or was down during the time of the weekly cronjob.

To re-run a base backup for a specific cluster, log into the backup server(s) (storace and/or backuphost), cat /etc/dsa/postgresql-backup/base-backup-clusters to see the port for the cluster, and run

  sudo -u debbackup /usr/local/bin/postgres-make-base-backups <host>:<port>

probably best to do that in a screen as it might take a while.

MISSING-BASE

e.g.:

sudo -u debbackup /usr/lib/nagios/plugins/dsa-check-backuppg  | grep BASE
[fasolo, dak] MISSING-BASE: dak.BASE.backuphost.debian.org-20180211-012002-fasolo.debian.org-dak-9.6-backup.tar.gz

This means that we started doing a base backup (as witnessed by a .backup file next to a WAL), but for some reason we don't have the corresponding base file.

root@backuphost:/srv/backups/pg/fasolo# ls -l *backup*
-rw------- 1 debbackup debbackup 9201093916 Jan 14 06:18 dak.BASE.backuphost.debian.org-20180114-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
-rw------- 1 debbackup debbackup 9227651542 Jan 21 06:25 dak.BASE.backuphost.debian.org-20180121-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
-rw------- 1 debbackup debbackup 9266306750 Jan 28 07:59 dak.BASE.backuphost.debian.org-20180128-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
-rw------- 1 debbackup debbackup 9312602089 Feb  5 11:00 dak.BASE.backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
-rw------- 1 debbackup debbackup 9346830509 Feb 12 10:25 dak.BASE.backuphost.debian.org-20180212-094930-fasolo.debian.org-dak-9.6-backup.tar.gz
-rw------- 1 debbackup debbackup        353 Jan 14 06:18 dak.WAL.0000000100000033000000A6.00000028.backup
-rw------- 1 debbackup debbackup        350 Jan 20 11:20 dak.WAL.00000001000000350000008C.00000028.backup
-rw------- 1 debbackup debbackup        353 Jan 21 06:25 dak.WAL.000000010000003600000068.00000028.backup
-rw------- 1 debbackup debbackup        353 Jan 28 07:59 dak.WAL.0000000100000038000000E3.00000028.backup
-rw------- 1 debbackup debbackup        353 Feb  5 11:00 dak.WAL.000000010000003B00000090.00000028.backup
-rw------- 1 debbackup debbackup        350 Feb  5 15:49 dak.WAL.000000010000003B0000009B.00000108.backup
-rw------- 1 debbackup debbackup        353 Feb 11 10:09 dak.WAL.000000010000003D000000AC.00000028.backup
-rw------- 1 debbackup debbackup        353 Feb 12 10:25 dak.WAL.000000010000003E00000027.00000178.backup

.backup files are created on the postgres server and shipped to the backup hosts whenever a base backup is initiated. We do some labelling, so we know which backup host the corresponding tarball should end up with.

e.g.:

root@backuphost:/srv/backups/pg/fasolo# cat dak.WAL.000000010000003B00000090.00000028.backup
START WAL LOCATION: 3B/90000028 (file 000000010000003B00000090)
STOP WAL LOCATION: 3B/97CF2138 (file 000000010000003B00000097)
CHECKPOINT LOCATION: 3B/90000098
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2018-02-05 10:25:28 UTC
LABEL: backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup
STOP TIME: 2018-02-05 10:59:50 UTC

To fix this, verify we have a later base tarball, or that we are fine for some other reason, and remove the corresponding .backup file from the backup host. In the case above, we would remove dak.WAL.000000010000003D000000AC.00000028.backup.

WAL-MISSING-AFTER

e.g.:

[bmdb1, main] WAL-MISSING-AFTER: bmdb1/main.WAL.0000000100001340000000DB

If it's just one WAL file missing, it can be recovered from the other backup host. If more logs are missing, check the server's logs for archive errors.