🔥 The DB Grill 🔥

Where database blog posts get flame-broiled to perfection

Atlas Search score details (the BM25 calculation)
Originally from dev.to/feed/franckpachot
December 19, 2025 • Roasted by Sarah "Burnout" Chen

Oh, this is just... wonderful. Another deep dive into the arcane mysteries of why one black box behaves slightly differently than another black box. I truly appreciate the meticulous effort here. It’s comforting to know the exact mathematical formula that will be waking me up at 3 AM.

It’s especially reassuring to see the test case built on such a robust, production-ready dataset of fruit emojis. That’s exactly what our multi-terabyte, horribly structured, user-generated content looks like. The clean insertMany with nine documents gives me a warm, fuzzy feeling, reminding me of that one time we had to migrate a sharded cluster with petabytes of data over a weekend. That was a "simple" copy-paste too, they said.
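For anyone who wants to recreate the crime scene at home, here's a minimal sketch of that toy setup in mongosh. The collection name and the exact nine fruit documents are my guesses, not the original post's:

```js
// Hypothetical reconstruction of the nine-document "production" dataset.
db.fruits.insertMany([
  { name: "apple 🍎" },
  { name: "crispy apple 🍏" },
  { name: "banana 🍌" },
  { name: "cherry 🍒" },
  { name: "grapes 🍇" },
  { name: "lemon 🍋" },
  { name: "peach 🍑" },
  { name: "pear 🍐" },
  { name: "watermelon 🍉" }
]);
```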

And the transparency! How thoughtful of them to provide the scoreDetails: true flag. It’s like getting a detailed autopsy report before the patient is even dead. I can already picture the Slack thread:

PM: "Why is 'Crispy Apple' not the top result for 'apple'?" Me: "Well, you see, the idf computed as log(1 + (N - n + 0.5) / (n + 0.5)) gives us 1.897, but the tf is only 0.539 because of the length normalization parameter b and..." PM: "...so can you fix it by lunch?"

I'm thrilled to have this level of detail. It will be an invaluable tool for explaining to stakeholders why their perfectly reasonable search query returns garbage, all while the PagerDuty alert screams in my other ear.
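And for the stakeholders who demand receipts, the napkin math in plain JavaScript. N = 9 and n = 1 fall straight out of the quoted 1.897; k1 and b are Lucene's defaults; the field lengths are illustrative values I picked to land near the quoted 0.539, since Lucene quantizes document length into a one-byte norm and the real numbers drift anyway:

```js
// Back-of-the-napkin BM25, modern Lucene flavor: score = idf * tf.
const N = 9; // documents in the index (our nine fruits)
const n = 1; // documents containing the query term
const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5)); // ≈ 1.897

const k1 = 1.2, b = 0.75;  // Lucene's default BM25 parameters
const freq = 1;            // term occurrences in this document
const dl = 4, avgdl = 6.5; // illustrative field lengths, not the post's actual values
const tf = freq / (freq + k1 * (1 - b + b * (dl / avgdl))); // ≈ 0.539

console.log({ idf, tf, score: idf * tf }); // score ≈ 1.022; PM remains unhappy
```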

But my absolute favorite part, the line that truly speaks to my soul, is this little gem:

However, this has no practical impact because scores are only used to order results, so the relative ranking of documents remains the same.

Ah, yes. "No practical impact." My new favorite sentence to whisper to myself as I fall asleep, right after "this migration script is fully idempotent" and "the rollback plan is tested."

Of course, it has no impact, unless you do something exotic like comparing scores across two engines (or two versions of the same one), alerting on an absolute score threshold, or blending lexical and vector scores in a hybrid pipeline on the assumption that they share a scale. You know, things nobody ever does.

It’s fantastic that the discrepancy is just a simple matter of one vendor using a formula from this decade and others... not. This total lack of standardization across "Lucene-based" systems is my favorite kind of surprise. It’s not a bug; it’s a historical implementation detail. I can't wait to be the one to explain this nuance to the data science team when their models break, and they accuse my service of sending bad data.
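I don't know exactly which pair of formulas the post compares, but the classic flavor of this divergence is the (k1 + 1) factor that older Lucene kept in the BM25 numerator and Lucene 8 dropped, precisely because it scales every score by the same constant. Same ranking, different numbers, one very confused data science team; the idf/tf values below are illustrative:

```js
// Two "Lucene-based" BM25s, one ranking. Dropping the (k1 + 1) numerator
// factor multiplies every score by the same constant, so the ordering is
// untouched while the absolute values quietly change.
const k1 = 1.2;
const modern = (idf, tf) => idf * tf;            // Lucene 8+
const legacy = (idf, tf) => idf * tf * (k1 + 1); // older Lucene

const docs = [
  { name: "apple 🍎",        idf: 1.897, tf: 0.539 },
  { name: "crispy apple 🍏", idf: 1.897, tf: 0.455 },
];

for (const d of docs) {
  console.log(d.name, modern(d.idf, d.tf).toFixed(3), legacy(d.idf, d.tf).toFixed(3));
}
// Identical ranking either way; every legacy score is just 2.2x bigger.
// "No practical impact."
```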

So, thank you for this meticulous, well-researched post. It’s a perfect-bound, first-edition copy of the incident report I’ll be writing in six to nine months. It's so important to understand precisely how the new magic box is going to fail in a slightly different, more academically interesting way than the last one.

Keep up the great work. Someone has to document the new and exciting ways our systems will betray us. It gives the rest of us something to read on the toilet... while on call.