PostgreSQL 16 replication lag spikes—any ideas?
Running PostgreSQL 16 on Hetzner dedicated servers (2 primaries + 3 replicas). Lately we're seeing replication lag jump from <100ms to 2-3 seconds randomly, then settle back down. No obvious load spikes or network issues visible.
WAL archiving looks fine, pg_stat_replication shows healthy flush_lsn. We're using streaming replication with synchronous_commit = remote_apply. Disk I/O is normal (~40% utilization).
Has anyone encountered this? Could it be checkpoint-related or something with the replica query load? Any diagnostic queries I should run?
TIA
Check your checkpoint_timeout and max_wal_size settings—random lag spikes often point to checkpoint storms eating I/O. With synchronous_commit = remote_apply, the primary waits for every synchronous replica to replay each commit, so a big checkpoint hammering disk on either side can stall everyone. Try increasing max_wal_size to 4-8GB so checkpoints aren't forced early by WAL volume, and spread the remaining ones out with checkpoint_completion_target (or shorten checkpoint_timeout if you'd rather have more frequent, smaller checkpoints than rare huge ones). Also verify replicas aren't hitting OOM or swap during these spikes—use watch 'free -h && vmstat 1 2' to confirm. Hetzner's dedicated gear usually doesn't have these issues, so it's almost always config-related rather than hardware.
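To see whether checkpoints actually line up with the spikes, something like this should work (on PG 16 the checkpoint counters still live in pg_stat_bgwriter; they only moved to pg_stat_checkpointer in 17):

```sql
-- How often are checkpoints WAL-forced (checkpoints_req) vs. timed?
-- A high checkpoints_req share suggests max_wal_size is too small.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,  -- ms spent writing dirty buffers
       checkpoint_sync_time,   -- ms spent in fsync at checkpoint end
       stats_reset
FROM pg_stat_bgwriter;

-- Log every checkpoint with timings so you can correlate them with spikes:
ALTER SYSTEM SET log_checkpoints = on;
-- Raise the WAL-forced checkpoint threshold (sighup, no restart needed):
ALTER SYSTEM SET max_wal_size = '8GB';
SELECT pg_reload_conf();
```

With log_checkpoints on, each checkpoint's write/sync durations land in the server log, so you can eyeball whether the 2-3s lag windows match checkpoint end times.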
Thanks for that! Yeah, I didn't think about checkpoint storms—we've got max_wal_size at only 4GB, which might be on the low end for our write volume. Gonna bump it up and monitor the lag. Good point about remote_apply making the primary wait on replica replay, that could definitely explain the spikes.
Before tweaking checkpoint settings, also check if your replicas are falling behind on query load. With remote_apply, a long-running SELECT on a replica can hold up WAL replay (recovery conflicts, bounded by max_standby_streaming_delay), which in turn blocks commits on the primary. Try running SELECT pid, state, wait_event, query FROM pg_stat_activity; on replicas during lag spikes—you might have a long-running query holding things up.
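For the replica-side check, a slightly fuller version of that query (plus the cumulative conflict counters) is what I'd run during a spike—columns are stock pg_stat_activity / pg_stat_database_conflicts:

```sql
-- On a replica during a spike: what is running, and what is it waiting on?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;

-- Recovery conflicts per database (lock, snapshot, buffer-pin conflicts):
SELECT datname, confl_lock, confl_snapshot, confl_bufferpin
FROM pg_stat_database_conflicts;
```

If the conflict counters tick up during the lag windows, replay is being held back by replica queries and it's not a checkpoint problem at all.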
Have you checked wal_keep_size—or whether you're using replication slots? Without slots, if one replica temporarily lags, the primary can recycle WAL segments out from under it and force a catch-up from archive; with slots, a lagging consumer instead pins WAL on the primary. Also worth checking pg_stat_activity on replicas during a spike—look for long-running queries or CommitTs waits. fwiw, synchronous_commit = remote_apply is pretty strict for 5-node setups; consider remote_write if you don't absolutely need read-your-writes on the replicas.
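If you are on slots, you can see how much WAL each one is pinning directly, instead of guessing at wal_keep_size—sketch using the stock pg_replication_slots view, run on the primary:

```sql
-- WAL retained on behalf of each replication slot; a large or inactive
-- slot means one consumer is forcing the primary to keep old segments.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```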
One thing nobody mentioned yet—check your backend_flush_after and wal_writer_delay settings. If WAL writer is batching too aggressively, replicas might see bursty flushes instead of steady streaming. Also, with remote_apply on 3 replicas, network variance between them matters more than you'd think. A single slow replica causes the primary to wait. Run SELECT client_addr, state, sync_state, reply_time FROM pg_stat_replication; during a spike to see which replica(s) are lagging. The answer might be datacenter/network tuning rather than DB config.
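Building on that query—pg_stat_replication has carried per-stage lag columns (write_lag, flush_lag, replay_lag) since PG 10, and with remote_apply the replay_lag of your slowest sync standby is roughly what commits on the primary are waiting on. Run on the primary during a spike:

```sql
-- Per-replica lag broken down by stage: network/write, fsync, and replay.
-- With remote_apply, watch replay_lag on sync_state = 'sync' standbys.
SELECT client_addr,
       application_name,
       sync_state,
       write_lag,
       flush_lag,
       replay_lag,
       reply_time
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;
```

If replay_lag spikes while write_lag stays flat, the bottleneck is on the replica (replay/conflicts); if write_lag spikes too, look at the network or the primary's WAL I/O.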