PostgreSQL 16 replication lag spikes—any ideas?
Running PostgreSQL 16 on Hetzner dedicated servers (2 primaries + 3 replicas). Lately we're seeing replication lag jump from <100ms to 2-3 seconds randomly, then settle back down. No obvious load spikes or network issues visible.
WAL archiving looks fine, pg_stat_replication shows healthy flush_lsn. We're using streaming replication with synchronous_commit = remote_apply. Disk I/O is normal (~40% utilization).
Has anyone encountered this? Could it be checkpoint-related or something with the replica query load? Any diagnostic queries I should run?
TIA
Check your checkpoint_timeout and max_wal_size settings—random lag spikes often point to checkpoint storms eating I/O. With synchronous_commit = remote_apply, the primary waits for every synchronous replica to replay each commit, so a big checkpoint hammering disk on either side can stall everyone. Try increasing max_wal_size to 4-8GB so checkpoints aren't forced early by WAL volume, and spread the remaining ones out with checkpoint_completion_target (or shorten checkpoint_timeout if you'd rather have more frequent, smaller checkpoints than rare huge ones). Also verify replicas aren't hitting OOM or swap during these spikes—use watch 'free -h && vmstat 1 2' to confirm. Hetzner's dedicated gear usually doesn't have these issues, so it's almost always config-related rather than hardware.
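To see whether checkpoints actually line up with the spikes, something like this should work (on PG 16 the checkpoint counters still live in pg_stat_bgwriter; they only moved to pg_stat_checkpointer in 17):

```sql
-- How often are checkpoints WAL-forced (checkpoints_req) vs. timed?
-- A high checkpoints_req share suggests max_wal_size is too small.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,  -- ms spent writing dirty buffers
       checkpoint_sync_time,   -- ms spent in fsync at checkpoint end
       stats_reset
FROM pg_stat_bgwriter;

-- Log every checkpoint with timings so you can correlate them with spikes:
ALTER SYSTEM SET log_checkpoints = on;
-- Raise the WAL-forced checkpoint threshold (sighup, no restart needed):
ALTER SYSTEM SET max_wal_size = '8GB';
SELECT pg_reload_conf();
```

With log_checkpoints on, each checkpoint's write/sync durations land in the server log, so you can eyeball whether the 2-3s lag windows match checkpoint end times.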
Thanks for that! Yeah, I didn't think about checkpoint storms—we've got max_wal_size at only 4GB, which might be on the low end for our write volume. Gonna bump it up and monitor the lag. Good point about remote_apply making the primary wait on replica replay, that could definitely explain the spikes.
Before tweaking checkpoint settings, also check if your replicas are falling behind on query load. With remote_apply, a long-running SELECT on a replica can hold up WAL replay (recovery conflicts, bounded by max_standby_streaming_delay), which in turn blocks commits on the primary. Try running SELECT pid, state, wait_event, query FROM pg_stat_activity; on replicas during lag spikes—you might have a long-running query holding things up.
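For the replica-side check, a slightly fuller version of that query (plus the cumulative conflict counters) is what I'd run during a spike—columns are stock pg_stat_activity / pg_stat_database_conflicts:

```sql
-- On a replica during a spike: what is running, and what is it waiting on?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;

-- Recovery conflicts per database (lock, snapshot, buffer-pin conflicts):
SELECT datname, confl_lock, confl_snapshot, confl_bufferpin
FROM pg_stat_database_conflicts;
```

If the conflict counters tick up during the lag windows, replay is being held back by replica queries and it's not a checkpoint problem at all.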
Have you checked wal_keep_size—or whether you're using replication slots? Without slots, if one replica temporarily lags, the primary can recycle WAL segments out from under it and force a catch-up from archive; with slots, a lagging consumer instead pins WAL on the primary. Also worth checking pg_stat_activity on replicas during a spike—look for long-running queries or CommitTs waits. fwiw, synchronous_commit = remote_apply is pretty strict for 5-node setups; consider remote_write if you don't absolutely need read-your-writes on the replicas.
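If you are on slots, you can see how much WAL each one is pinning directly, instead of guessing at wal_keep_size—sketch using the stock pg_replication_slots view, run on the primary:

```sql
-- WAL retained on behalf of each replication slot; a large or inactive
-- slot means one consumer is forcing the primary to keep old segments.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```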
One thing nobody mentioned yet—check your backend_flush_after and wal_writer_delay settings. If WAL writer is batching too aggressively, replicas might see bursty flushes instead of steady streaming. Also, with remote_apply on 3 replicas, network variance between them matters more than you'd think. A single slow replica causes the primary to wait. Run SELECT client_addr, state, sync_state, reply_time FROM pg_stat_replication; during a spike to see which replica(s) are lagging. The answer might be datacenter/network tuning rather than DB config.
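Building on that query—pg_stat_replication has carried per-stage lag columns (write_lag, flush_lag, replay_lag) since PG 10, and with remote_apply the replay_lag of your slowest sync standby is roughly what commits on the primary are waiting on. Run on the primary during a spike:

```sql
-- Per-replica lag broken down by stage: network/write, fsync, and replay.
-- With remote_apply, watch replay_lag on sync_state = 'sync' standbys.
SELECT client_addr,
       application_name,
       sync_state,
       write_lag,
       flush_lag,
       replay_lag,
       reply_time
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;
```

If replay_lag spikes while write_lag stays flat, the bottleneck is on the replica (replay/conflicts); if write_lag spikes too, look at the network or the primary's WAL I/O.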