Daily Database Roasts

The principles of extreme fault tolerance

Originally from planetscale.com/blog/feed.atom

July 3, 2025

Alright, gather 'round, folks, because PlanetScale has apparently cracked the code on database reliability! And by "cracked the code," I mean they've eloquently restated principles that have been foundational to any competent distributed system for the past two decades. You heard it here first: "PlanetScale is fast and reliable!" Truly groundbreaking stuff, I tell ya. Who knew a database company would aspire to that? My mind is simply blown.

They kick off by telling us their "shared nothing architecture" makes them the "best in the cloud." Because, you know, no one else has ever thought to use local storage. It's a miracle! Then they pivot to reliability, promising "principles, processes, and architectures that are easy to understand, but require painstaking work to do well." Ah, the classic corporate paradox: it's simple, but we're brilliant for doing it. Pick a lane, chief.

Then, brace yourselves, because they reveal their "principles," which, they admit, "are neither new nor radical. You may find them obvious." They're not wrong! They've basically pulled out a textbook on distributed systems circa 2005 and highlighted "Isolation," "Redundancy," and "Static Stability." Wow. Next, they'll be telling us about data integrity and ACID properties like they just invented the wheel. My favorite part is "Static stability: When something fails, continue operating with the last known good state." So, when your database is actively failing, it… tries to keep working? What revolutionary concept is this?! Did they stumble upon this by accident, perhaps after a particularly vigorous game of Jenga with their servers?
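For anyone who wants the "revolutionary concept" spelled out, static stability fits in about twenty lines. What follows is a minimal sketch under stated assumptions, not anything PlanetScale has published: the class name `StaticallyStableConfig` and the `fetch` callable (standing in for a call to some control plane) are hypothetical. The idea is only that a failed refresh leaves the last known good state in place rather than taking the caller down with it.

```python
from typing import Callable, Optional


class StaticallyStableConfig:
    """Serve the last known good config when the source is unavailable."""

    def __init__(self, fetch: Callable[[], dict]) -> None:
        self._fetch = fetch  # hypothetical control-plane call (assumption)
        self._last_good: Optional[dict] = None

    def get(self) -> dict:
        try:
            # Happy path: refresh the cached state from the source.
            self._last_good = self._fetch()
        except Exception:
            # Source is down: keep going with whatever we fetched last.
            if self._last_good is None:
                raise  # nothing cached yet, so there is no safe fallback
        return self._last_good
```

Note the one honest caveat baked into the sketch: if the very first fetch fails, there is no last known good state to fall back on, so the error propagates.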

Their "Architecture" section is equally thrilling, introducing the "Control plane" (the admin stuff) and the "Data plane" (the actual database stuff). More mind-bending jargon for basic components. The "Data plane" is "extremely critical" and has "extremely few dependencies." So critical, in fact, they had to say it twice. Like a child trying to convince you their imaginary friend is really real.

But the real gem, the absolute crown jewel of their "Processes," is the wonderfully alarming "Always be Failing Over." Let me repeat that: "Always be Failing Over." They "exercise this ability every week on every customer database." Let that sink in. They're intentionally failing your databases every single week just to prove they can fix them. It's like a mechanic who regularly punctures your tires just to show off how fast they can change a flat. And they claim "Query buffering minimizes or eliminates disruption." So, not eliminates then? Just "minimizes or eliminates." Good to know my business-critical application might just experience "some" disruption during their weekly reliability charade. Synchronous replication? Progressive delivery? These are standard practices, not Nobel-Prize-winning innovations. They’re just... how you run a competent cloud service.
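To be fair to the "minimizes or eliminates" hedge, here is roughly what query buffering during a failover looks like, and why the hedge exists. This is a deliberately simplified, synchronous sketch under my own assumptions, not PlanetScale's actual implementation: `BufferingProxy`, `max_buffered`, and the callables standing in for the old and new primaries are all hypothetical names. Queries that arrive mid-failover are held and replayed against the new primary; once the buffer fills up, the rest are rejected, which is where "minimizes" parts ways with "eliminates."

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class BufferingProxy:
    """Hold queries during a failover, then replay them on the new primary."""

    def __init__(self, primary: Callable[[str], str], max_buffered: int = 100) -> None:
        self._primary = primary            # stand-in for the current primary
        self._max_buffered = max_buffered  # hypothetical buffer limit
        self._failing_over = False
        self._buffer: Deque[str] = deque()

    def begin_failover(self) -> None:
        self._failing_over = True

    def execute(self, query: str) -> Optional[str]:
        if not self._failing_over:
            return self._primary(query)
        if len(self._buffer) >= self._max_buffered:
            # Buffer is full: this caller does see the disruption.
            raise RuntimeError("buffer full during failover")
        self._buffer.append(query)
        return None  # result is produced when the failover completes

    def complete_failover(self, new_primary: Callable[[str], str]) -> List[str]:
        # Switch to the new primary and replay held queries in arrival order.
        self._primary = new_primary
        self._failing_over = False
        results = []
        while self._buffer:
            results.append(self._primary(self._buffer.popleft()))
        return results
```

A real proxy would block each caller until its own result comes back and would also time out held queries; returning `None` and a batch of results keeps the sketch short.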

And finally, the "Failure modes." They proudly announce that "Non-query-path failures" don't impact customer queries. Because, you know, a well-designed system's control plane shouldn't take down the data plane. Who knew decoupling was a thing?! And for "Cloud provider failures," their solution is... wait for it... to fail over to a healthy instance or zone. Shocking! Who knew redundancy would protect you from failures? And the truly heartwarming admission: "PlanetScale-induced failures." They say a bug "rarely impacts more than 1-2 customers." Oh, so it does impact customers? Just a couple? And infrastructure changes "very rarely" have a bigger impact. "Very rarely." That's the kind of confidence that makes me want to immediately migrate all my data.

Honestly, after this breathtaking exposé of fundamental engineering principles rebranded as revolutionary insights, I fully expect their next announcement to be "PlanetScale: We Plug Our Servers Into Walls! A Groundbreaking Approach to Power Management!" Don't worry, it'll be "extremely critical" and have "extremely few dependencies." You can count on it. Or, you know, "very rarely" count on it.

🔥 The DB Grill 🔥