Where database blog posts get flame-broiled to perfection
Alright, settle down, kids. Let me get my reading glasses. Someone forwarded me this... article... from "CedarDB." Sounds like a brand of mulch. Let's see what the latest revolution is this week.
"Why you should care about strings."
Oh, good heavens. We're starting there? Are we? I haven't seen a headline that basic since the manual for my first Commodore PET. Back in my day, we didn't "care" about strings, we feared them. We had fixed-length COBOL records defined with a PICTURE clause, and you got your 80 characters, and you liked it. If you wanted variable length, you prayed to the VSAM gods and prepared for a week of debugging pointer errors in Assembler. These kids act like they just discovered that people write things down.
"...roughly 50% of data is stored as strings. This is largely because strings are flexible and convenient: you can store almost anything as astring."
You don't say. You can also use a wrench to hammer a nail. Doesn't make it a good idea. This isn't a feature; it's a confession. It's admitting your entire user base has the data discipline of a toddler with a box of crayons. Storing UUIDs as text? I've got JCL scripts older and smarter than that. We were storing structured data in hierarchical databases before your parents met.
And of course, some professor is quoted. “In database systems, you don’t primarily compress data to reduce size, but to improve query performance.” Deep. Truly profound. We figured that out around 1983 when we realized swapping tapes for the monthly payroll run took longer than the heat death of the universe. Smaller data meant fewer tapes to mount. It's not about "better bandwidth utilization," you slide-deck jockey, it's about not having to call Barry from operations to go find reel 7B in a salt mine in Kansas.
So, let's get to the meat of it. Dictionary Compression. They explain it like they're unveiling the secrets of the pyramids. Storing unique values and replacing them with integers. Welcome to 1985, fellas. DB2 was doing this while you were still learning to use a fork.
The attentive reader may have noticed two things. First, we store the offsets to the strings... Second, our dictionary data is lexicographically ordered.
The attentive reader? Son, the comatose reader noticed that. If you don't store offsets, it's not a dictionary, it's a grocery list. And if you don't sort it, you can't binary search, which is the whole point. This is like a chef proudly announcing his secret technique for boiling water is to apply heat.
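Fine. Since apparently nobody can picture a lookup table without a diagram anymore, here is the entire "technique" in a few lines of Python. This is my own back-of-the-napkin sketch, mind you, not CedarDB's code or anyone else's; a real column store keeps the dictionary as one contiguous byte buffer plus an offset array rather than a Python list, but the idea is the same and it's forty years old.

```python
import bisect

def dictionary_encode(values):
    """Build a sorted dictionary and replace each string with an integer code.
    A napkin sketch; real engines store one big byte buffer plus offsets."""
    dictionary = sorted(set(values))               # lexicographically ordered, as the article notes
    code_of = {s: i for i, s in enumerate(dictionary)}
    codes = [code_of[s] for s in values]           # the column is now just integers
    return dictionary, codes

def lookup(dictionary, needle):
    """Binary search only works because the dictionary is sorted. That's the whole point."""
    i = bisect.bisect_left(dictionary, needle)
    return i if i < len(dictionary) and dictionary[i] == needle else None

cities = ["Berlin", "Munich", "Berlin", "Hamburg", "Berlin"]
dictionary, codes = dictionary_encode(cities)
print(dictionary)                    # ['Berlin', 'Hamburg', 'Munich']
print(codes)                         # [0, 2, 0, 1, 0]
print(lookup(dictionary, "Munich"))  # 2
```

That's it. That's the pyramid secret.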
And then they get to the big reveal. The "problem" with dictionaries. They don't work well on high-cardinality data. No kidding. That's why we had DBAs, not just a button that says "compress." You had to actually know your data. What a concept.
So, what's their silver bullet? FSST. Fast Static Symbol Table. It replaces common substrings with tokens. My God, they've reinvented Huffman coding and given it a marketing budget. "In FSST-Lingo, substrings are called 'symbols' and tokens are called 'codes'." How precious. It's still just a lookup table, you've just made it for pieces of strings instead of whole ones. Congratulations on inventing a fractal dictionary.
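And since the marketing department won't show you what that actually means, here's the trick as a toy: a hand-rolled table of common substrings and a greedy matcher. The real FSST learns its symbol table from a sample of the data and packs everything into one-byte codes; my sketch does neither, and the symbol table below is made up purely for illustration.

```python
# A toy substring-to-code table in the spirit of FSST. Not the real algorithm:
# the genuine article learns up to 255 symbols from the data and emits one-byte
# codes, while this hard-codes a table just to show the mechanics.
SYMBOLS = ["http://", "www.", ".com", ".org"]        # "symbols": common substrings
CODES = {sym: i for i, sym in enumerate(SYMBOLS)}    # "codes": the tokens that replace them

def toy_fsst_encode(s):
    out, i = [], 0
    while i < len(s):
        for sym in SYMBOLS:                          # greedy: take the first matching symbol
            if s.startswith(sym, i):
                out.append(CODES[sym])
                i += len(sym)
                break
        else:
            out.append(s[i])                         # no symbol matches: keep the raw character
            i += 1
    return out

def toy_fsst_decode(tokens):
    return "".join(SYMBOLS[t] if isinstance(t, int) else t for t in tokens)

url = "http://www.example.com"
encoded = toy_fsst_encode(url)
assert toy_fsst_decode(encoded) == url
print(encoded)   # [0, 1, 'e', 'x', 'a', 'm', 'p', 'l', 'e', 2]
```

A lookup table for pieces of strings. Hold the applause.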
And the best part, the absolute chef's kiss of this whole thing, is when they realize that querying this FSST-encoded gibberish is a nightmare.
One naive way to evaluate this query on the data would be to decompress each compressed string and then perform a string comparison. This is quite slow.
Ya think? So what's the revolutionary solution they landed on after all this brainpower?
The key idea is to create a dictionary from the input data and then use FSST to compress only the dictionary.
Stop the presses. Hold the phone. You went through all that, just to land on... compressing the dictionary? You built a whole new engine, wrote a blog post, and your grand innovation is to add a second layer of compression to the thing you just said had problems? And then, you have the nerve to footnote that DuckDB did it six months ago. You're not even the first kid on the block to rediscover this "new" idea!
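And here is the grand unification, sketched with the toy pieces from above (it leans on dictionary_encode, toy_fsst_encode, and toy_fsst_decode, so it won't run on its own). My sketch, not their engine, and the prefix filter is just an example predicate I picked. The whole "insight" is that the predicate touches each dictionary entry once, and the column itself stays a dumb array of integers.

```python
# Sketch of the "compress only the dictionary" trick, reusing the toy pieces above.
def build_column(values):
    dictionary, codes = dictionary_encode(values)
    compressed_dict = [toy_fsst_encode(s) for s in dictionary]  # FSST only touches the dictionary
    return compressed_dict, codes

def filter_prefix(compressed_dict, codes, prefix):
    # Decompress each dictionary entry once and decide whether it qualifies...
    qualifying = {code for code, entry in enumerate(compressed_dict)
                  if toy_fsst_decode(entry).startswith(prefix)}
    # ...then scan the cheap integer codes, never touching string data again.
    return [row for row, code in enumerate(codes) if code in qualifying]

urls = ["http://www.example.com", "http://www.example.org", "http://www.example.com"]
compressed_dict, codes = build_column(urls)
print(filter_prefix(compressed_dict, codes, "http://www.example.c"))   # [0, 2]
```

Two layers of indirection to avoid reading your own strings. Groundbreaking.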
Let's look at these benchmarks. They're so proud.
That's lovely. Now let's look at the fine print, shall we?
So let me get this straight. You made things faster when you have to go fetch the data from the digital equivalent of the tape library. But once the data is actually in memory, where real work gets done, your "improvement" makes the query almost three times slower. You've optimized for the one-off report that the CEO asks for once a year, while kneecapping the operational queries that run a thousand times a second. Brilliant. Absolutely brilliant. As they say, "there’s no free lunch, you can’t beat physics." No, but you can apparently build a really slow car and brag about its fuel efficiency when it's being towed.
Their proposed solution? "...one way to improve the performance... might be to cache decompressed strings in the buffer manager."
Just add another layer of caching. The answer to and cause of all of life's problems. So now we have the original data, a dictionary of the data, an FSST-compressed dictionary of the data, the FSST symbol table, and now you want to add a cache of the decompressed FSST-compressed dictionary of the data. This isn't a database system; it's a Russian nesting doll of lookup tables. At some point, you're spending more CPU cycles figuring out how to read the data than actually reading it.
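To complete the nesting doll, here's roughly what that "might be to cache decompressed strings" hand-wave amounts to in my toy setup: a memo of decompressed dictionary entries bolted on top of everything else. A crude lru_cache, not a buffer manager; the blog doesn't say more, so neither will I.

```python
from functools import lru_cache

# The layer cake in miniature: raw strings -> sorted dictionary -> FSST-compressed
# dictionary -> and now a cache of decompressed entries on top. Reuses toy_fsst_decode
# and the compressed_dict/codes from the sketch above.
class CachedDictionary:
    def __init__(self, compressed_dict):
        self.compressed_dict = compressed_dict
        self._decode = lru_cache(maxsize=1024)(self._decode_uncached)

    def _decode_uncached(self, code):
        return toy_fsst_decode(self.compressed_dict[code])   # undo FSST, once per entry

    def __getitem__(self, code):
        return self._decode(code)                            # and pray it's already cached

cached = CachedDictionary(compressed_dict)
print(cached[codes[0]])   # decompressed once, served from the cache ever after
```

Count the layers on the way down. I'll wait.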
It's always the same. Every decade, a new generation of bright-eyed programmers stumbles out of university, reinvents a concept from a 40-year-old IBM System R paper, gives it a four-letter acronym, and writes a blog post like this. They talk about trade-offs and resource usage like it's some new revelation. We've been making these trade-offs since our servers had less memory than your wristwatch.
They end by asking me to download the community version. I think I'll pass. Now if you'll excuse me, I've got a tape library that needs dusting. At least I know it works.