Where database blog posts get flame-broiled to perfection
Alright, let's see what fresh hell the thought leaders have cooked up for us this week. Oh, perfect. A lovely, detailed post on how we can finally understand MongoDB's storage internals with "simple queries." Simple. That's the first red flag. Nothing that requires a multi-page explanation with six different ways to run the same query is ever "simple." This isn't a blog post; it's an advance copy of the incident report for a migration that hasn't even been approved yet.
So, we've got a new magic wand: the RecordId. It's an "internal key," a "monotonically increasing 64-bit integer" that gives us physical data independence. Riiight. Because abstracting away the physical layer has never, ever come back to bite anyone. I can already feel the phantom buzz of my on-call pager. It’s the ghost of migrations past, whispering about that one "simple" switch to a clustered index in Postgres that brought the entire payment system to its knees because of write amplification that the whitepaper swore wasn't an issue.
This whole article is a masterclass in repackaging old problems. We're not dealing with heap tables and VACUUM, no, that's for dinosaurs. We have a WiredTiger storage engine with a B+Tree structure. It's better because it's "reusing space and splitting pages as needed." That sounds suspiciously like what every other database has tried to do for thirty years, but with more syllables.
And the examples, my god, the examples.
"I generate ten documents and insert them asynchronously, so they may be written to the database in a random order."
Ten. Documents. Let me just spin up my 10-document production environment and test this out. I'm sure the performance characteristics I see with a dataset that fits in a single CPU cache line will scale beautifully to our 8-terabyte collection with 500,000 writes per minute. Showing that a COLLSCAN on ten items returns them out of _id order isn't a profound technical insight; it's what happens when you throw a handful of confetti in the air.
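To be fair to the confetti, here is the effect in miniature. This is a toy Python model, not MongoDB code: the ToyCollection class and its methods are hypothetical, and the only assumptions are the ones the post itself states, namely that each insert receives the next RecordId and that a collection scan walks records in RecordId order. The shuffle stands in for asynchronous inserts arriving in arbitrary order.

```python
import random


class ToyCollection:
    """Hypothetical stand-in for a collection keyed by an increasing RecordId."""

    def __init__(self):
        self._next_record_id = 1
        self._records = {}  # RecordId -> document

    def insert(self, doc):
        # Each insert gets the next RecordId, whatever its _id happens to be.
        record_id = self._next_record_id
        self._next_record_id += 1
        self._records[record_id] = doc
        return record_id

    def collscan(self):
        # A full scan walks records in RecordId order, i.e. arrival order.
        return [self._records[rid] for rid in sorted(self._records)]


def insert_out_of_order(n=10, seed=42):
    """Insert n documents in shuffled order to mimic asynchronous writes."""
    docs = [{"_id": i, "val": i * 10} for i in range(1, n + 1)]
    arrival = docs[:]
    random.Random(seed).shuffle(arrival)
    coll = ToyCollection()
    for doc in arrival:
        coll.insert(doc)
    return coll, arrival


coll, arrival = insert_out_of_order()
scan_ids = [doc["_id"] for doc in coll.collscan()]
print("COLLSCAN order of _id:", scan_ids)
print("sorted by _id:        ", sorted(scan_ids))
```

The scan hands documents back in the order they landed, not in _id order, which is all the ten-document demo establishes. It says nothing about how any of this behaves at 8 terabytes.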
And then we get to the best part: the new vocabulary for why your queries are slow. It's not a full table scan anymore, sweetie, it's a COLLSCAN. It sounds so much more... intentional. And if you don't like it, you can just .hint() the query planner. You know, the all-powerful query planner that's supposed to offer data independence, but you, the lowly application developer, have to manually tell it how to do its job. I see a future filled with:
"... $natural here?"
"... IXSCAN on un-selective index."
Oh, and covering indexes! I love this game. To get a real index-only scan, you need to either explicitly drop _id from your projection (something every new hire will forget to do) or, even better, create another index that includes _id. So now we have val_1 and val_1__id_1. Fantastic. I can't wait for the inevitable moment when we have val_1__id_1, val_1__user_1__id_1, and val_1__id_1__user_1 because no one can remember which permutation is the right one, and they're all just eating up memory.
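The covering-index trap reduces to a set check. A minimal sketch in Python, assuming the rule as the post describes it: a query can be answered from the index alone only if every field it filters on and returns is part of the index key, and an inclusion-style projection returns _id unless it is explicitly excluded. The names is_covered and projected_fields are hypothetical helpers, not a MongoDB API, and the real planner has more conditions than this toy checks.

```python
def projected_fields(projection):
    """Fields an inclusion-style projection returns; _id comes back unless set to 0."""
    fields = {name for name, include in projection.items() if include}
    if projection.get("_id", 1):
        fields.add("_id")
    return fields


def is_covered(index_keys, filter_fields, projection):
    """True if every filtered and returned field lives in the index key."""
    needed = set(filter_fields) | projected_fields(projection)
    return needed <= set(index_keys)


val_1 = ["val"]                # index on val only
val_1__id_1 = ["val", "_id"]   # index on val and _id

# The new hire's query: _id is projected implicitly, so val_1 cannot cover it.
print(is_covered(val_1, ["val"], {"val": 1}))             # False
# Explicitly dropping _id makes val_1 sufficient.
print(is_covered(val_1, ["val"], {"val": 1, "_id": 0}))   # True
# Or pay for a second index that carries _id.
print(is_covered(val_1__id_1, ["val"], {"val": 1}))       # True
```

Every extra permutation of that second index passes the same check for some query, which is exactly how you end up with three of them.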
But the absolute chef's kiss, the pièce de résistance of this entire thing, is the section on clustered collections. They let the database behave like an index-organized table, which is great! Fast access! It's the solution! Except, wait... what's this tiny little sentence here?
"It is not advisable to use it widely because it was introduced for specific purposes and used internally."
You cannot make this up. They're dangling the keys to the kingdom in front of us and then saying, "Oh, you don't want to use these. These are the special keys. For us. You just stick to the slow way, okay?" This isn't a feature; it's a landmine with a "Do Not Touch" sign written in invisible ink.
So let me just predict the future. Some VP is going to read the headline of this article, ignore the 3,000 words of caveats, and declare that we're moving to MongoDB because of its flexible schema and efficient space management. We'll spend six months on a "simple" migration. The first on-call incident will be because a developer relied on the "natural order" that works perfectly on their 10-document test collection but explodes in a sharded environment. The second will be when we discover that RecordId being different on each replica means our custom diagnostic tools are giving us conflicting information.
And a year from now, I'll be awake at 3 AM, staring at an execution plan that says EXPRESS_CLUSTERED_IXSCAN, wondering why it's still taking 5 seconds, while drinking coffee that has long since gone cold. The only difference is that the new problems will have cooler, more marketable names.
I'm going to go ahead and bookmark this. It'll make a great appendix for the eventual post-mortem.