đŸ”„ The DB Grill đŸ”„

Where database blog posts get flame-broiled to perfection

Performing Standby Datacentre Promotions of a Patroni Cluster
Originally from percona.com/blog/feed/
November 20, 2025 ‱ Roasted by Marcus "Zero Trust" Williams

Ah, yes. A lovely piece. I have to applaud the sheer, unadulterated bravery on display here. It’s not every day you see someone publish a blog post that reads like the "pre-incident" section of a future data breach notification. It’s truly a masterclass in transparency.

It’s just so charming how we start with the premise that Patroni offers automatic failovers, a comforting little security blanket for the C-suite. But then, like a magician pulling away the tablecloth, you reveal the real trick: "...this is not the case when dealing with inter-datacentre failovers." Beautiful. You’ve built an airbag that only deploys in a fender bender, but requires the driver to manually assemble it from a kit while careening off a cliff. What could possibly go wrong?
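For anyone who hasn't had the pleasure: the usual shape of this arrangement (a sketch of the common pattern, not necessarily the article's exact setup) is that the DR datacentre runs as a Patroni "standby cluster", whose dynamic configuration tells it to follow the primary site. Inside each datacentre, Patroni will cheerfully elect a new leader; across that standby_cluster boundary, nothing promotes itself automatically. The hostnames, slot name, and endpoint below are illustrative.

```python
import requests

# Hypothetical endpoint; Patroni's REST API listens on port 8008 by default.
DR_PATRONI = "http://patroni-dr-1.example.internal:8008"

# Dynamic configuration that turns the DR site into a standby cluster: its
# "standby leader" replicates from the primary datacentre instead of accepting
# writes, and Patroni's automatic failover never crosses this boundary.
standby_cluster_config = {
    "standby_cluster": {
        "host": "primary-dc-vip.example.internal",  # entry point of the primary site
        "port": 5432,
        "primary_slot_name": "dr_site",             # replication slot on the primary
    }
}

# PATCH /config merges the payload into the cluster's dynamic configuration
# stored in the DCS (etcd, Consul, ...).
resp = requests.patch(f"{DR_PATRONI}/config", json=standby_cluster_config, timeout=10)
resp.raise_for_status()

# Read the dynamic configuration back to confirm the standby_cluster section stuck.
print(requests.get(f"{DR_PATRONI}/config", timeout=10).json())
```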

I especially admire the description of the "mechanisms required to perform such a procedure." I love a good manual, artisanal, hand-crafted disaster recovery plan. Nothing inspires more confidence than knowing the entire fate of your production database rests on a sleep-deprived on-call engineer at 3 AM, frantically trying to follow a 27-step wiki page while the world burns. It’s a fantastic way to stress-test your team’s ability to correctly type complex commands under duress. I'm sure there's zero chance of a fat-fingered rm -rf or accidentally promoting the wrong standby, exposing stale data to the world. Zero chance.
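And in fairness to our hypothetical 3 AM engineer, the heart of that 27-step wiki page usually boils down to one nerve-wracking change: remove the standby_cluster section from the DR cluster's dynamic configuration and let its standby leader promote itself. Here is a hedged sketch of that step via Patroni's REST API (patronictl edit-config is the other common route); endpoints and names are illustrative, not the article's actual runbook.

```python
import requests

DR_PATRONI = "http://patroni-dr-1.example.internal:8008"  # hypothetical endpoint

# 1. Sanity check: make sure this really is the standby cluster we mean to
#    promote, not some other cluster an on-call engineer pasted by accident.
config = requests.get(f"{DR_PATRONI}/config", timeout=10).json()
if "standby_cluster" not in config:
    raise SystemExit("No standby_cluster section here; refusing to 'promote' this cluster.")

# 2. The promotion itself: setting standby_cluster to null removes the section
#    from the dynamic configuration, and Patroni promotes the standby leader
#    into a real, writable primary.
resp = requests.patch(f"{DR_PATRONI}/config", json={"standby_cluster": None}, timeout=10)
resp.raise_for_status()

# 3. Verify: the former standby leader should now report itself as the leader
#    of a normal, writable cluster.
print(requests.get(f"{DR_PATRONI}/cluster", timeout=10).json())
```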

This whole setup is a beautiful, hand-written invitation for any attacker. You’re not just building a system; you’re authoring a playbook for chaos. An insider threat, or anyone who’s breached your perimeter, now has a documented, step-by-step guide on how to trigger a catastrophic state change in your most critical infrastructure during a moment of maximum confusion. It’s less of a DR plan and more of a feature. Let's call it "User-Initiated Unscheduled Disassembly."

And the compliance implications... it’s breathtaking. I can already see the SOC 2 auditors drooling.

"So, let me get this straight. Your primary datacentre fails, a P1 incident is declared, and your documented recovery process involves a human manually running a series of privileged commands over a WAN link? Can you show me the immutable audit logs for the last three times this 'procedure' was executed successfully and securely in an emergency?"

The silence that follows will be deafening. You’ve essentially created a compliance black hole, a singularity where auditability goes to die. Every manual step is a deviation, every human decision a potential finding. Each time this runs, you're basically rolling the dice on whether you'll be spending the next six months explaining yourselves to regulators.
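If you wanted those auditors merely unimpressed instead of delighted, the obvious move is to stop free-handing privileged commands and wrap the promotion in something that leaves a trail: who ran it, when, against which endpoint, and what the configuration looked like before and after. A minimal, illustrative sketch of that idea; the log path and endpoint are assumptions, and none of this appears in the original article.

```python
import getpass
import json
import time

import requests

DR_PATRONI = "http://patroni-dr-1.example.internal:8008"  # hypothetical endpoint
AUDIT_LOG = "/var/log/dr/promotions.jsonl"                # assumed append-only audit log


def promote_standby_cluster() -> None:
    """Promote the DR standby cluster and leave an audit record behind."""
    before = requests.get(f"{DR_PATRONI}/config", timeout=10).json()

    # Removing the standby_cluster section promotes the standby leader.
    resp = requests.patch(f"{DR_PATRONI}/config", json={"standby_cluster": None}, timeout=10)
    resp.raise_for_status()

    after = requests.get(f"{DR_PATRONI}/config", timeout=10).json()

    # One JSON line per execution: who, when, where, and the configuration
    # before and after, so there is something to hand the auditor besides silence.
    entry = {
        "action": "standby_cluster_promotion",
        "operator": getpass.getuser(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "endpoint": DR_PATRONI,
        "config_before": before,
        "config_after": after,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    promote_standby_cluster()
```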

Honestly, this isn't just a process for failing over a database. It's a rich, fertile ecosystem for novel vulnerabilities. A whole new class of CVEs is just waiting to be born from this.

It’s a truly impressive feat: take a tool designed for reliability, find its single most fragile, explosive failure mode, and then document it for the world as a “how-to” guide. A real gift to the community.

Sigh. And we wonder why we can't have nice things. Back to my Nessus scans. At least those failures are predictable.