HostingArtisan Community for Web Artisans
SQL: MySQL, PostgreSQL, MSSQL

PostgreSQL 16 replication lag spikes—any ideas?

5 replies · 3 views
#1 — Original Post
26 Mar 2026, 07:20
M
mysql_dba

Running PostgreSQL 16 on Hetzner dedicated servers (2 primaries + 3 replicas). Lately we're seeing replication lag jump from <100ms to 2-3 seconds randomly, then settle back down. No obvious load spikes or network issues visible.

WAL archiving looks fine, pg_stat_replication shows healthy flush_lsn. We're using streaming replication with synchronous_commit = remote_apply. Disk I/O is normal (~40% utilization).

Has anyone encountered this? Could it be checkpoint-related or something with the replica query load? Any diagnostic queries I should run?

TIA

Edited at 26 Mar 2026, 08:11

#2
26 Mar 2026, 07:25
P
pipeliner

Check your checkpoint_timeout and max_wal_size settings—random lag spikes often point to checkpoint storms eating I/O. With synchronous_commit = remote_apply, replicas block on fsync, so a big checkpoint on primary can stall everyone. Try increasing max_wal_size to 4-8GB and lowering checkpoint_timeout to trigger more frequent, smaller checkpoints instead of rare huge ones. Also verify replicas aren't hitting OOM or swap during these spikes—use watch 'free -h && vmstat 1 2' to confirm. Hetzner's dedicated gear usually doesn't have these issues, so it's almost always config-related rather than hardware.

#3
26 Mar 2026, 07:30
M
mysql_dba

Thanks for that! Yeah, I didn't think about checkpoint storms—we've got max_wal_size set to 4GB which might be too aggressive. Gonna lower that and monitor the lag. Good point about synchronous_commit making replicas wait on fsync, that could definitely explain the spikes.

#4
26 Mar 2026, 07:35
D
dcops

Before tweaking checkpoint settings, also check if your replicas are falling behind on query load. With remote_apply, even a slow SELECT on replica can block writes on primary. Try running SELECT pid, query FROM pg_stat_activity; on replicas during lag spikes—you might have a long-running query holding things up.

#5
26 Mar 2026, 07:40
P
pipe_grep

Have you checked wal_keep_size? With 3 replicas and remote_apply, if one replica temporarily lags, the primary might be recycling WAL segments faster than they can be consumed. Also worth checking pg_stat_activity on replicas during a spike—look for long-running queries or CommitTs waits. fwiw, synchronous_commit = remote_apply is pretty strict for 5-node setups; consider remote_write if you don't absolutely need it.

#6
26 Mar 2026, 07:45
S
shardmaster

One thing nobody mentioned yet—check your backend_flush_after and wal_writer_delay settings. If WAL writer is batching too aggressively, replicas might see bursty flushes instead of steady streaming. Also, with remote_apply on 3 replicas, network variance between them matters more than you'd think. A single slow replica causes the primary to wait. Run SELECT client_addr, state, sync_state, reply_time FROM pg_stat_replication; during a spike to see which replica(s) are lagging. The answer might be datacenter/network tuning rather than DB config.

You need to be logged in to reply.

Log in to Reply

Cookie Preferences

We use cookies to improve your experience and analyse traffic. You can accept all or use only essential cookies.

Essential Always on
Analytics Optional
Marketing Optional
Privacy · Terms ·