Where database blog posts get flame-broiled to perfection
Oh, this is just a fantastic piece of theoretical literature. A truly delightful read for anyone who enjoys designing systems on a whiteboard, far, far away from the warm glow of a production terminal at 3 AM. It's always refreshing to see such a well-articulated preview of my next root cause analysis meeting.
I especially appreciate the section on the Postgres approach. It's described with the loving detail of an artisan crafting a ship in a bottle. You have this beautiful, delicate primary, and these two standbys in semi-synchronous replication. And then you have the CDC client, which (and I love this part) "polls every few hours." It's the intermittent-fasting approach to data pipelines. What could possibly go wrong?
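If you want to recreate the whiteboard at home, the primary side of that diagram boils down to something like the sketch below. This is my reading of the article's topology, not a recommendation; the standby names are hypothetical.

```sql
-- Quorum commit against either standby, plus logical decoding for the CDC slot.
ALTER SYSTEM SET wal_level = 'logical';                                -- takes effect after a restart
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_a, standby_b)';
SELECT pg_reload_conf();
```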
The explanation of how a logical replication slot works is a masterpiece of understatement. It "pins WAL on the primary until the CDC client advances." That's a very polite way of saying it holds your primary database hostage. It's not a bug, it's a feature that teaches you the importance of disk space alerts. We had a saying back in my last shop: the slowest consumer is your new primary. Sounds like that's still the gospel.
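For anyone who would rather get the disk-space alert before the hostage situation, the usual check is a query along these lines against the primary. A monitoring sketch, nothing more:

```sql
-- How much WAL each slot is pinning, and whether it is already past saving.
SELECT slot_name,
       active,
       wal_status,        -- 'reserved', 'extended', 'unreserved', or 'lost'
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```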
But the real stroke of genius is Postgres 17's failover logic. Let me see if I have this right:
A standby only becomes eligible to carry the slot after the subscriber has actually advanced the slot at least once while that standby is receiving the slot metadata.
This is beautiful. It's a philosophical purity test for your replicas. A node can't just say it's ready for failover; it must have experienced true data progression. It's not a replica; it's a spiritual apprentice on a journey to enlightenment. So, the disaster recovery plan for my primary failing is to... wait six hours for the batch job to run and bless one of the standbys? Brilliant. I'll just tell the C-suite we're "observing a period of quiet contemplation" during the outage.
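For what it's worth, the enlightenment ritual in Postgres 17 looks roughly like this. A sketch under my assumptions (hypothetical slot name, pgoutput as the decoding plugin), not a battle-tested runbook:

```sql
-- On the primary: mark the logical slot as a failover candidate.
-- (A subscription can do the same via CREATE SUBSCRIPTION ... WITH (failover = true).)
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput', false, false, true);

-- On each standby: let it mirror failover-enabled slots from the primary.
-- This also needs hot_standby_feedback and a primary_slot_name pointing at the primary.
ALTER SYSTEM SET sync_replication_slots = on;
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();

-- On a standby: the "state of grace" check before you trust it with a failover.
-- The slot is only usable after promotion once it shows up here as synced.
SELECT slot_name, failover, synced
FROM pg_replication_slots
WHERE failover;
```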
The explicit failure scenarios section reads like a list of my team's greatest hits.
Then we get to the MySQL approach. It's almost... disappointingly straightforward. The connector just whispers its last known GTID to any available server, and life goes on. There's no eligibility gate, no existential dread about whether your replica has achieved the proper state of grace. Where's the challenge? Where's the adrenaline rush of realizing your entire HA strategy is coupled to an external consumer you don't control? It lacks the artisanal, hand-crafted failure modes I've come to expect. You're telling me you can just... promote a replica? And it just... works? Sounds like vendor-sponsored propaganda to me.
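And yes, the boring resume really is that boring. Before re-pointing the connector at whichever server answered the phone, the only sanity check is whether its saved GTID set (the value below is made up) is contained in what that server has already executed:

```sql
-- 1 means every transaction the connector has seen exists on this server;
-- 0 means it was promoted with gaps and the connector would miss events.
SELECT GTID_SUBSET('3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5617',
                   @@GLOBAL.gtid_executed) AS safe_to_resume;
```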
This whole Postgres setup has the same vibe as a few stickers on my laptop from companies that no longer exist. They all promised a revolution in data management. What I got was a collection of vinyl rectangles and a very detailed PagerDuty incident history. This article has expertly captured why. You've tied your database's core function (accepting writes and staying online) to the behavior of the flakiest, most unpredictable part of any architecture: the downstream consumer.
But no, really, keep writing these. It's great work. It gives us ops folks something to read on our phones at 3 AM on Memorial Day weekend while we're manually running pg_drop_replication_slot() on a read-only primary just to get the site back up. Builds character. Truly.
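For anyone following along from their own wedged primary, that 3 AM incantation is, regrettably, about as short as it sounds (slot name hypothetical); it frees the pinned WAL and takes the CDC pipeline down with it:

```sql
SELECT pg_drop_replication_slot('cdc_slot');
```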