Recently we got alerted by our NetOps team about a replication lag in one PostgreSQL database. That by itself is not a major problem, nevertheless something which deserves investigation. NetOps is using Zabbix, and the replication lag in the database system in question is measured using a plpgsql function provided by an open source template. In the replica, not in the primary.

DO LANGUAGE plpgsql $$
DECLARE
	ver integer;
	res text;
BEGIN
	SELECT current_setting('server_version_num') INTO ver;

	IF (ver >= 100000) THEN
		SELECT * INTO res from (
			SELECT
				CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
					THEN 0
					ELSE COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::integer, 0)
				END
			) T;

	ELSE
		SELECT * INTO res from (
			SELECT
				CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
					THEN 0
					ELSE COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::integer, 0)
				END
			) T;
	END IF;

	perform set_config('zbx_tmp.repl_lag_res', res, false);
END $$;

Turns out that using pg_last_xact_replay_timestamp() in the replica is not a good idea: it returns the timestamp of the last transaction which was replayed during recovery.

Usually our database systems are rather busy, and there is always some activity. In such cases, the above function works well. However when there is no activity in the primary database, this will be shown as delay, even though there is no actual delay.

We changed this system and are now using pg_stat_replication on the primary to figure out if there is a replication delay, and how big it is.

SELECT
	pid,
	client_addr,
	application_name,
	state,
	COALESCE(pg_current_wal_lsn() - sent_lsn, 0) AS sent_lag_bytes,
	COALESCE(sent_lsn - write_lsn, 0) AS write_lag_bytes,
	COALESCE(write_lsn - flush_lsn, 0) AS flush_lag_bytes,
	COALESCE(flush_lsn - replay_lsn, 0) AS replay_lag_bytes,
	COALESCE(pg_current_wal_lsn() - replay_lsn, 0) AS total_lag_bytes
FROM
	pg_stat_replication;

The approach using the replica has several faults:

  • The replica doesn’t know the current transaction position of the primary, and therefore can’t figure out how big the actual delay is. Using pg_last_xact_replay_timestamp() did always only measure the data which already arrived on the replica.
  • We always checked the replication lag on each replica. This can be streamlined by doing this once on the primary, and possibly only raise an alert if the lag is too high for anyone of the connected replicas. This reduces overall overhead, and also reduces the number of Zabbix entries.
  • We had to enable this for each replica, and create Zabbix configurations for each replica. Checking this in the primary needs only one Zabbix configuration entry.

Summary

Check replication lag at origin, to ensure that all parameters are taken into account. Measuring lag on the replica can only measure the data which is already on the replica.

References