Where database blog posts get flame-broiled to perfection
Alright, gather 'round the warm glow of the terminal, kids. Alex here. I've just finished reading this... aspirational document about the "PXC Replication Manager script/tool." My laptop lid, a veritable graveyard of vendor stickers from databases that promised the world and delivered a 2 AM PagerDuty alert, has a little space waiting. Let's break down this latest pamphlet promising a digital panacea, shall we?
First, let's talk about the word "facilitates." It claims this tool "facilitates both source and replica failover." In my experience, "facilitates" is marketing-speak for "it runs a few bash commands we duct-taped together, but you, dear operator, get to manually verify all 87 steps, figure out why it failed silently, and then perform the actual recovery yourself." This isn't a robust replication manager; it's a glorified gaggle of grep commands with a README that hasn't been updated since it was a gist on some intern's GitHub.
They dangle the carrot of handling "database version upgrades" across clusters. I love this one. It's the DevOps equivalent of a magician sawing a person in half. It looks great on stage, but you know there's a trick. The unspoken part is that this "seamless" process has zero tolerance for the real world: things like network latency between your DCs, a minor schema mismatch, or a slightly different my.cnf setting. The promise is zero-downtime, but the reality is "zero-downtime for the first 30 seconds before the replication lag spirals into infinity and you hit a data-drift disaster that'll take a week to reconcile."
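Just so we're clear about what "tolerance for the real world" would even look like, here's a back-of-the-napkin Python sketch of the pre-flight checks I'd demand before anything gets promoted. The thresholds, the connection details, and the pymysql usage are all mine; nothing here is claimed to be in the tool:

```python
# Hypothetical pre-flight checks a failover tool ought to run.
# Everything below (names, thresholds) is my assumption, not their code.
import pymysql

MAX_LAG_SECONDS = 5  # assumed tolerance; pick your own poison


def preflight(host, user, password):
    """Refuse to even think about failover unless the basics line up."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            # 1. Know exactly which server version you're about to promote.
            cur.execute("SELECT VERSION() AS v")
            version = cur.fetchone()["v"]

            # 2. Confirm replication exists and is not already spiraling.
            #    (Use SHOW REPLICA STATUS on MySQL 8.0.22+.)
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status is None:
                raise RuntimeError(f"{host}: no replication configured")
            lag = status["Seconds_Behind_Master"]
            if lag is None or lag > MAX_LAG_SECONDS:
                raise RuntimeError(f"{host}: lag is {lag}s; aborting")
    finally:
        conn.close()
    return version
```

If your "seamless" upgrade path can't pass even this, it isn't a path. It's a plank.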
The very phrase "script/tool" sets off every alarm bell I own. Which is it? Is it a "script" when it fails and overwrites your primary's data with a backup from last Tuesday? And a "tool" when it's featured in the sales deck? This tells me it has no persistent state management, no idempotent checks, and its entire concept of a "cluster-wide lock" is probably a file dropped by touch /tmp/failover.lock that won't work across different machines anyway.
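For the record, here's roughly what I suspect that "lock" amounts to. A minimal Python sketch; the path and the function are my guesses, not their code:

```python
# The classic anti-pattern: a local file lock posing as cluster coordination.
import fcntl


def acquire_failover_lock(path="/tmp/failover.lock"):
    # An exclusive, non-blocking lock: a second process on THIS machine
    # gets a BlockingIOError. A process on any other machine sees nothing.
    fd = open(path, "w")
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fd  # hold the handle open; close it and the "lock" evaporates
```

A lock that actually spans a cluster needs shared state somewhere: an etcd or Consul lease, a ZooKeeper znode, even a GET_LOCK() held on a surviving database node. A file in /tmp is a promise you make to yourself.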
I see a lot about what this thing does, but absolutely nothing about how I'm supposed to know what it's doing. Where are the Prometheus exporters? The Grafana dashboard templates? The configurable alert hooks? Oh, I see. The monitoring strategy is, as always, an afterthought. It's me, staring at a tail -f of some obscure log file, trying to decipher cryptic error messages at 400 lines per second. This isn't observability; it's a paltry pageant of print statements masquerading as operational insight.
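For contrast, here's the floor, the absolute bare minimum I'd accept as observability: a couple of replication gauges exported for Prometheus. This is a sketch with my invented metric names and a stubbed poll loop, not anything from their docs:

```python
# A minimal Prometheus exporter sketch; metric names are my invention.
import time

from prometheus_client import Gauge, start_http_server

replica_lag = Gauge("pxc_replica_lag_seconds",
                    "Replication lag as last measured by the tool")
failover_active = Gauge("pxc_failover_in_progress",
                        "1 while a failover attempt is underway")


def measure_lag():
    # Stand-in for a real SHOW REPLICA STATUS query.
    return 0.0


if __name__ == "__main__":
    start_http_server(9104)  # arbitrary port; 9104 nods to mysqld_exporter
    while True:
        replica_lag.set(measure_lag())
        time.sleep(15)
```

Ten minutes of work, and a Grafana panel plus an alert rule fall out of it for free. And yet.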
So here's my prediction, based on the scar tissue from a dozen similar "solutions." It'll be 3:15 AM on the Saturday of a long weekend. A minor network flutter between your data centers will cause a 5-second blip in asynchronous replication. The "facilitator" will heroically declare the primary dead, initiate a "failover," and promote a replica that's 200 crucial transactions behind. You'll wake up to a split-brain scenario with two active primaries, both cheerfully accepting writes and corrupting your data into a transactional Jackson Pollock painting.
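And here, reconstructed from pure scar tissue, is the kind of health check that makes that prediction come true. The hostname, the timeout, and the whole flow are hypothetical, but I'd wager a sticker it's close:

```python
# A naive "is the primary dead?" check: the split-brain starter kit.
import socket


def primary_is_dead(host="db-primary.dc1.example", port=3306, timeout=2.0):
    # One TCP connect with a 2-second timeout is all that separates a
    # routine WAN flutter from an unplanned promotion.
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return False
    except OSError:
        return True  # or the network blinked for five seconds; same answer


def promote_replica():
    print("Promoting a replica that may be hundreds of transactions behind...")


if primary_is_dead():
    promote_replica()  # and now you have two primaries
```

Note what's missing: retries, a quorum of observers, fencing the old primary. Any one of those would have saved your long weekend.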
And I'll be there, fueled by stale coffee and pure spite, untangling your "facilitated" future. Now if you'll excuse me, I need to go clear-coat the spot for that new sticker.