Where database blog posts get flame-broiled to perfection
Alright, settle down, you whippersnappers, and let ol' Rick pour you a cup of lukewarm coffee from the pot that's been on since this morning. I just read this... post-mortem, and I haven't seen this much self-congratulatory back-patting over a fourteen-hour face-plant since a marketing intern managed to plug in their own monitor. You kids and your "resilience." Let me tell you what's resilient: a 200-pound tape drive and the fear of God.
You think you've reinvented the wheel, but all you've done is build a unicycle out of popsicle sticks and called it "cloud-native." Let's break down this masterpiece of modern engineering, shall we?
You're mighty proud of your "strong separation of control and data planes." You write about it like you just discovered fire. Back in my day, we called that "the master console" and "the actual database." One was for the operator to yell at, the other was for the COBOL programs to feed. This wasn't a feature, kid, it was just... how you built things so the whole shebang didn't crash when someone fat-fingered a command. We were doing this on DB2 on MVS before your parents met. The fact that your management interface going down for hours is considered a win tells me everything I need to know about the state of your architecture.
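Since you apparently need it spelled out, here's the whole grand idea as a toy Python sketch. Every name and number in it is mine, invented for the chalkboard, not lifted from anybody's actual system:

```python
class ControlPlane:
    """The master console: provisioning, dashboards, the stuff you yell at."""
    def create_database(self, name):
        raise ConnectionError("control plane unreachable (blame us-east-1)")

class DataPlane:
    """The actual database. It keeps a last-known-good copy of its config,
    so it answers queries even while the front office is dark."""
    def __init__(self, cached_config):
        self.config = cached_config

    def query(self, sql):
        return f"rows for {sql!r} (served from {self.config['region']})"

control = ControlPlane()
data = DataPlane(cached_config={"region": "the-basement"})

try:
    control.create_database("shiny_new_db")  # the half of the shop that's down
except ConnectionError as err:
    print(f"operator console: {err}")

print(data.query("SELECT 1"))  # the half that had better keep running
```

If the second print is the only one your customers care about, then the first one failing for fourteen hours isn't a "win," it's the bare minimum.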
Let's talk about this beautiful chain of dependencies. Your service for making databases goes down because your secret service goes down because S3 goes down because STS goes down because DynamoDB stubbed its toe. That's not a dependency chain, that's a Jenga tower built on a fault line during an earthquake. I once spent three days restoring a customer database from a reel-to-reel tape that a junior op had stored next to a giant magnet. That was one point of failure. I could see it. I could yell at it. You're trying to debug a ghost by holding a digital seance with five other ghosts.
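And don't take my word for the Jenga tower, do the arithmetic yourself. If the database-maker needs every link standing at once, the availabilities multiply. Here's the back-of-the-envelope in Python, with numbers I made up to be generous:

```python
# Four upstream dependencies, each a respectable-sounding 99.9% available.
# Need ALL of them at once? Availabilities multiply -- they don't average.
deps = {"secrets": 0.999, "S3": 0.999, "STS": 0.999, "DynamoDB": 0.999}

chain = 1.0
for name, availability in deps.items():
    chain *= availability

hours_per_year = 8760
print(f"each link: 99.9%  ->  whole chain: {chain:.3%}")
print(f"expected downtime: {(1 - chain) * hours_per_year:.0f} hours/year")
# each link: 99.9%  ->  whole chain: 99.601%
# expected downtime: 35 hours/year (vs. under 9 for a single link)
```

Every block you stack on that tower hands somebody else about nine more hours a year of your downtime.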
Your "interventions" were a real hoot. You stopped creating new databases, delayed backups, and started "bin-packing" processes more tightly. Congratulations, you rediscovered what we called "running out of resources." Advising customers to "shed whatever load they could" is a cute way of saying "please stop using our product so it doesn't fall over." Back in '89, we didn't have "diurnal autoscaling," we had a guy named Frank who knew to provision more CICS regions before the morning batch jobs hit. And our backups? We took the system down for an hour at 2 AM, wrote everything to physical tape, and drove a copy to a salt mine in another state. Your process involves spinning up more of your fragile infrastructure just to avoid slowing things down. It's like trying to put out a fire with a bucket of gasoline.
Ah, "network partitions." The boogeyman of the cloud. You say they're "one of the hardest failure modes to reason about." I'll tell you what's hard to reason about: figuring out which of the 3,000 punch cards in a C++ compiler deck was off by one column. A network partition? That's just someone tripping over the damn Token Ring cable. The fact that your servers in the same building can't talk to each other but can still talk to the internet is the kind of nonsense that only happens when you let twenty layers of abstraction do your thinking for you.
But the real kicker, the part that made me spit out my coffee, was this little gem:
"PlanetScale weathered this incident well."

You were down or degraded for half a business day. Your control plane was offline, your dashboard was dead, SSO failed, and you couldn't even update your own status page to tell anyone what was going on, because it was broken too! That's not weathering a storm, son. That's your ship sinking while the captain stands on the bridge announcing how well the deck chairs are holding up against the waves.
You kids and your "Principles of Extreme Fault Tolerance." Here's a principle for you: build something that doesn't collapse if someone in another company sneezes.
Now if you'll excuse me, I think there's a JCL job that needs optimizing. At least when it breaks, I know who to blame.