Where database blog posts get flame-broiled to perfection
Alright, settle down, grab your kombucha. I just read the latest dispatch from the engineering-as-marketing department, and it's a real piece of work. "How we built vector search in a relational database." You can almost hear the triumphant orchestral score, can't you? It starts with the bold proclamation that vector search has become table stakes. Oh, you don't say? Welcome to two years ago, glad you could make it. The rest of us have been living with the fallout while you were apparently discovering fire.
The whole premise is just... chef's kiss. They were surprised to find no existing papers on implementing a vector index inside a transactional, disk-based relational database. Shocked, I tell you! It's almost as if people who design high-performance, in-memory graph algorithms weren't thinking about the glacial pace of B-tree I/O and ACID compliance. It's like being surprised your race car doesn't have a tow hitch. They're different tools for different jobs, you absolute titans of innovation.
And the tone! This whole, "we had to invent everything from scratch" routine. I remember meetings just like this. Someone scribbles a diagram on a whiteboard, reinvents a concept from a 1998 research paper, and the VP of Engineering declares it "novel solutions." What they're really saying is, "Our core architecture is fundamentally unsuited for this workload, but the roadmap says we have to ship it, so we built a skyscraper of hacks on top of it."
They spend half the article giving a condescendingly simple explanation of HNSW, complete with a little jab at us poor mortals trapped in our "cursed prison of flesh." Real cute. Then they explain that HNSW is a mostly static data structure and doesn't fit in RAM. Again, groundbreaking stuff. This is the database equivalent of a car company publishing a whitepaper titled, "Our Discovery: Engines Require Fuel."
But this is where it gets good. This is where you see the scar tissue. Their grand design philosophy is that a vector index should behave like any other index.
We don't think this is a reasonable approach when implementing a vector index for a relational database. Beyond pragmatism, our guiding light behind this implementation is ensuring that vector indexes in a PlanetScale MySQL database behave like you'd expect any other index to behave.
I can tell you exactly how that meeting went. The engineers proposed the easy way: "It's approximate anyway, a little eventual consistency never hurt anyone." And then marketing and sales had a collective aneurysm, shrieking about ACID compliance until the engineers were forced into this corner. This "guiding light" wasn't a moment of philosophical clarity; it was a surrender to the sales deck.
So what's the solution to this problem they "discovered"? A glorious, totally-not-over-engineered Hybrid Vector Search. It's part in-memory HNSW, part on-disk blobs in InnoDB. And my favorite part is their "research" into alternatives. They mention the SPANN paper and say, "It is not clear to us why HNSW was not evaluated in the paper." Translation: "We already had an HNSW implementation from a hack week project and we weren't about to throw it out." Then they dismiss a complex clustering algorithm in favor of random sampling, because "the law of large numbers ensures that our random sampling is representative." That's the most academic-sounding way of saying, "We tried the right way, it was too hard, and this was good enough to pass the benchmark tests marketing wanted."
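For anyone who hasn't watched this shortcut get taken in person, here is roughly what "random sampling instead of clustering" amounts to. This is a minimal Python sketch under my own assumptions, not their code: `pick_heads` and `assign_to_heads` are hypothetical names, the vectors are toy 2-D points, and squared Euclidean distance is a stand-in for whatever metric they actually use.

```python
import random

def pick_heads(vectors, k, seed=0):
    """Pick k head vectors by uniform random sampling (no clustering pass)."""
    rng = random.Random(seed)
    return rng.sample(range(len(vectors)), k)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_to_heads(vectors, head_ids):
    """Map each head index to the indices of the vectors nearest to it."""
    postings = {h: [] for h in head_ids}
    for i, v in enumerate(vectors):
        nearest = min(head_ids, key=lambda h: sq_dist(v, vectors[h]))
        postings[nearest].append(i)
    return postings

# Toy data: 1000 random points in the unit square (an assumption for the demo).
data_rng = random.Random(42)
vectors = [(data_rng.random(), data_rng.random()) for _ in range(1000)]

head_ids = pick_heads(vectors, k=10)
postings = assign_to_heads(vectors, head_ids)
```

With enough vectors the sampled heads tend to land where the data is dense, which is the whole "law of large numbers" argument. Nothing here guarantees balanced posting lists, though, and that is presumably where the background janitors come in.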
And now for the main event. The part where they admit their entire foundation is made of quicksand. They lay out, in excruciating detail, why appending data to a blob in InnoDB is a performance catastrophe. It's a beautiful, eloquent explanation of why a B-tree is the wrong tool for this job. And then they discover... LSM trees! They write a love letter to LSMs, explaining how they're a "match made in heaven" for this exact problem. You can feel the hope, the excitement!
And then, the punchline. They can't use it.
Because their customers are on InnoDB and forcing them to switch would be an "unacceptable barrier to adoption." So instead of using the right tool, they decided to build a clattering, wheezing, duct-taped emulation of an LSM tree... on top of a B-tree. This isn't engineering; it's a dare. It's building a submarine out of screen doors because you've already got a surplus of screen doors.
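If you are wondering what LSM cosplay on a B-tree looks like, here is my best guess as a sketch. It uses SQLite rather than InnoDB purely so it runs anywhere. The table name `postings`, the helper functions, and the schema are all hypothetical; only the `(head_vector_id, sequence)` key shape comes from the post. The presumed idea: instead of rewriting one ever-growing blob per head, each append becomes a new row under a composite key, and a compaction job merges the rows back together later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE postings (
        head_vector_id INTEGER NOT NULL,
        sequence       INTEGER NOT NULL,
        chunk          BLOB    NOT NULL,
        PRIMARY KEY (head_vector_id, sequence)
    )
""")

def append_chunk(conn, head_id, chunk):
    """Append by inserting a new row; no existing blob is rewritten."""
    (next_seq,) = conn.execute(
        "SELECT COALESCE(MAX(sequence), -1) + 1 FROM postings "
        "WHERE head_vector_id = ?",
        (head_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO postings VALUES (?, ?, ?)", (head_id, next_seq, chunk)
    )
    return next_seq

def read_posting(conn, head_id):
    """Reassemble a posting list by scanning its chunks in sequence order."""
    rows = conn.execute(
        "SELECT chunk FROM postings WHERE head_vector_id = ? ORDER BY sequence",
        (head_id,),
    ).fetchall()
    return b"".join(row[0] for row in rows)

def compact(conn, head_id):
    """The janitor: merge all chunks for one head back into a single row."""
    merged = read_posting(conn, head_id)
    conn.execute("DELETE FROM postings WHERE head_vector_id = ?", (head_id,))
    conn.execute("INSERT INTO postings VALUES (?, 0, ?)", (head_id, merged))

append_chunk(conn, 7, b"aaa")
append_chunk(conn, 7, b"bbb")
before = read_posting(conn, 7)
compact(conn, 7)
after = read_posting(conn, 7)
row_count = conn.execute("SELECT COUNT(*) FROM postings").fetchone()[0]
```

Appends stay cheap, reads pay for a range scan, and somebody has to run `compact` before the row count gets out of hand. That last part is the bit that pages you.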
From there, it's just a cavalcade of complexity to paper over this original sin. We don't just have an index; we have a swarm of background maintenance jobs to keep the whole thing from collapsing.
(head_vector_id, sequence) hack creates so much fragmentation you need another janitor to clean up after the other janitors.
They call this the LIRE protocol. We used to call it "technical debt containment." Every one of these background jobs is a new lock, a new race condition, a new way for the database to fall over at 3 AM. And the solution for making the in-memory part crash-resilient? A custom Write Ahead Log, on top of InnoDB's WAL. It's WALs all the way down! They even admit they have to pause all the background jobs to compact this thing. I can just picture the SREs' faces when they read that. "So, the self-healing slows down... to heal itself?"
Look, it's a monumental achievement in over-engineering. They've successfully built a wobbly, seven-layer Jenga tower of compensations to make their relational database do something it was never designed to do, all while pretending it was a principled philosophical choice.
So, bravo. You did it. You shipped the feature on the roadmap. It's a testament to what you can accomplish with enough bright engineers, a stubborn architectural constraint, and a complete disregard for operational simplicity.
Try it out. Happy (approximate) firefighting.