Where database blog posts get flame-broiled to perfection
Alright, team, gather 'round the warm glow of the Grafana dashboard. Someone just sent me this... this trip down memory lane. An origin story for a piece of code that has, I'm sure, contributed to the graying of my temples. "I invented this," he says. Fantastic. I've got a whole drawer full of vendor stickers from geniuses who "invented this." Clustrix, RethinkDB, FoundationDB before Apple bought it... they make a nice, colorful memorial to things that were supposed to change the world and instead just changed my on-call rotation.
So, a new in-memory sort algorithm. "Orasort." Cute. Let's look at the features, shall we? This is like reading the marketing brochure for a car I know is about to be recalled.
"Common prefix skipping." Sounds clever. It also sounds like the perfect way to introduce a subtle, data-dependent bug that only triggers when a user from a specific non-latin character set tries to sort a billion-row table full of product descriptions. I can already see the bug report: Sorting works for "apple," "apply," but fails for "applÄ" and "applĂž." And of course, there will be no logs for it.
"Adaptive." Oh, I love that word. It's corporate-speak for "unpredictable." It switches between quicksort and radix sort? Wonderful. So when I'm trying to profile a slow query, the execution plan will be different every single time based on the data distribution in the cache at that exact nanosecond. My monitoring tools won't know what to make of it. Is it slow? Is it fast? Is it just thinking about which algorithm to use? Itâs a black box inside another black box, and my job is to guess whatâs happening inside while the Vice President of Sales is breathing down my neck about the quarterly report being late.
"Key substring caching." My favorite. Another "improvement" that happens deep in the CPU where my tools can't see it. The promise is fewer cache misses. The reality is that when it goes wrong, all I'll see is CPU_WAIT pegged at 100% with absolutely zero indication as to why. Itâs the database equivalent of "have you tried turning it off and on again?"
But this... this is the real gem:
produces results before sort is done
This is the kind of feature that sounds revolutionary in a design meeting and becomes a cascading failure in production. You're telling me the query is streaming results while still actively performing a massive sort in memory? So when that query gets cancelled by a panicking user, or the connection drops, or a pod gets rescheduled by Kubernetes... what happens to that half-finished sort? Does it clean up the memory gracefully? Or does it leave behind a ten-gigabyte ghost allocation that slowly bleeds the server dry until the whole node falls over at 3 AM on the Saturday of a long holiday weekend? I don't need a Scheme interpreter to calculate the probability on that one; it's 1.
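To be fair to the idea, streaming from an unfinished sort is a real technique; a heap does it naturally. Here is a sketch (mine, not theirs, since the post never says how Orasort managed it), along with the exact failure mode I am worried about:

```python
import heapq

def streaming_sorted(rows):
    # One way to "produce results before sort is done": heapify is O(n),
    # then each pop emits the next row in order while most of the
    # sorting work is still in the future.
    heap = list(rows)
    heapq.heapify(heap)            # nothing is fully sorted yet
    while heap:
        yield heapq.heappop(heap)  # rows stream out in order immediately

gen = streaming_sorted(range(1_000_000, 0, -1))
print(next(gen))  # the caller sees a row long before the sort "finishes"
gen.close()       # if cancellation never reaches this line, that
                  # million-entry heap is the ghost allocation I get paged about
```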
And the implementation details! He doesn't remember how they addressed the stable sort issue. HE DOESN'T REMEMBER. I can tell you what happened: they didn't, or they put in a hacky workaround, and some poor developer in accounting spent six years wondering why their financial reconciliation report was always off by a few cents in a completely non-reproducible way.
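For anyone who does want to remember, the textbook fix is decorate-sort-undecorate: break ties on the original row number and any unstable sort becomes stable. Whether that is what they actually did, your guess is as good as the inventor's.

```python
def stabilize(unstable_sort, rows, key):
    # Sort (key, original_index, row) tuples: equal keys resolve by input
    # position, so the result is stable no matter how the underlying
    # algorithm shuffles its partitions.
    decorated = [(key(r), i, r) for i, r in enumerate(rows)]
    return [r for _, _, r in unstable_sort(decorated)]

rows = [("smith", 300), ("jones", 100), ("smith", 100)]
print(stabilize(sorted, rows, key=lambda r: r[0]))
# [('jones', 100), ('smith', 300), ('smith', 100)] -- the smiths keep their order
```

Skip the index and equal keys land in whatever order the partitioning felt like that day, which is how a report ends up off by a few cents and never the same way twice.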
Then there's the "bad, but unlikely, worst-case." In operations, "unlikely" means "it will happen next Tuesday." All it takes is one perfectly crafted, malicious queryâor, more likely, a ridiculously stupid one from the new BI internâto hit that worst-case pivot selection every single time. And just like that, a query that should take five seconds will run for five hours, consuming all CPU, and bringing the entire cluster to its knees. The "5x performance improvement" becomes an infinity-x performance degradation.
He got a short email from Larry Ellison and then left the company. Of course he did. He lit the fuse, walked away in slow motion, and left people like me to deal with the explosion. He went on to make MySQL better, which is great. I've been paged for MySQL, too.
So, congratulations on your patent, buddy. I hope it brought you joy. I'll go ahead and print out your blog post and add it to the runbook for "Unexplained High CPU on Oracle Prod Cluster." I'm sure it'll be a comfort to the on-call engineer at 3 AM, reading about the theoretical elegance of the very algorithm that's currently setting their world on fire. Now, if you'll excuse me, I need to go proactively increase the memory allocation on our oldest Oracle instance. I have a hunch.