Math Did Not Beat the Sutras. It Gave Them Unit Tests.

June 2026 · Obsta Labs

We mined contemplative texts for search algorithms and found TF-IDF. The detour did not produce the algorithm. It forced the benchmark — and the benchmark revealed the boring fix that was sitting there the whole time.

I wanted the mystical version to be true.

Maybe the old contemplative texts contained a lost search algorithm. Maybe "see without naming" mapped to lexical de-anchoring. Maybe "experience without a center" mapped to embedding isotropy. Maybe centuries of people optimizing human attention had left behind operators we could borrow for machine retrieval.

So we tested it.

The result was both more boring and more interesting than I expected: almost every mystical-looking operator collapsed into an information-retrieval primitive that already had a name, and the thing that actually doubled our recall was TF-IDF — an old information-retrieval technique, older than most modern embedding stacks, the kind of thing you meet early in any serious IR course.

This is not a post about ancient wisdom beating modern math. It is a post about two very different traditions noticing the same constraint: attention is finite, and the signal that appears everywhere is usually noise.

The setup

We had a search problem. A growing corpus of structured records — each with a title and a body — and search that kept missing. You'd look for a record you knew existed, by a concept you knew it was about, and get nothing, or the wrong thing.

The embedder underneath was a deterministic feature-hash: turn the text into a 256-dimension vector, compare by cosine. No model in the query path — a deliberate constraint we care about (determinism, reproducibility, no per-query network call). The recall was bad and we wanted to know why.
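For readers who have not met a feature-hash embedder, here is a minimal sketch of the idea, not our production code. The tokenizer, the use of `zlib.crc32` as the hash, and the function names are assumptions made for illustration; only the 256-dimension width and the cosine comparison come from the description above.

```python
import math
import re
import zlib

DIM = 256  # vector width stated in the post


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def bucket(token: str, dim: int = DIM) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(token.encode("utf-8")) % dim


def embed(text: str, dim: int = DIM) -> list[float]:
    # Bag of words: every token adds one vote to its bucket.
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[bucket(tok, dim)] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Note that every token gets the same vote of 1.0 here, which is the property the rest of this post ends up caring about.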

I had a hunch — an unserious one — that the answer might be hiding in contemplative literature. These traditions are, functionally, centuries of practice on a fixed-capacity attention system: drop the mental commentary, hold the bare signal, perceive without naming. That is, structurally, a library of operations on a bandwidth-limited representation. Maybe some of them were real transforms in disguise.

So we did the disciplined version of a silly idea: translate each intuition into a deterministic operator, and measure it. Provenance scores zero. A passage only counts if it reduces to fixed math with a falsifiable prediction.

What failed (most of it)

We mined eleven texts. Roughly seventy-five candidate operators. Here is the honest scoreboard.

Center-removal looked like a win, then wasn't. "Experience without a center" maps cleanly onto subtracting the mean vector from an embedding space. That is real, published prior art (Mu & Viswanath, "All-but-the-Top," 2018), which improves embeddings by removing dominant common directions. On a toy set of nine vectors it beat our baseline. On a real corpus it fell below baseline. The toy result had lied. Not invented here, and not a win at our scale.
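The mean-subtraction step can be sketched in a few lines. This is only the centering half of the idea; the published method also removes top principal components, which this sketch leaves out. The toy vectors are invented for illustration and are not the nine vectors from our experiment.

```python
import math


def center(vectors):
    """Subtract the corpus mean vector from every vector."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy vectors that share one large common component (the "center").
vectors = [[10.0, 1.0, 0.0], [10.0, 0.0, 1.0], [10.0, -1.0, -1.0]]
centered = center(vectors)

before = cosine(vectors[0], vectors[1])   # dominated by the shared component
after = cosine(centered[0], centered[1])  # only the distinguishing parts remain
```

On vectors like these the shared component makes everything look similar, and centering removes it. That is the effect that looked promising on the toy set and did not hold up on the real corpus.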

Word order hurt. "It is not what we say, it is how we say it" translates to capturing adjacency — n-grams. Adding them halved recall: a title and its body say the same thing in different word order, so demanding the order line up shrinks the matching surface. The vocabulary-mismatch problem is also a structure-mismatch problem.
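A small sketch shows why adjacency features shrink the matching surface. The title and body strings are made up for illustration; the point is only that identical vocabulary in a different order shares no bigrams.

```python
def tokens(text):
    return text.lower().split()


def bigrams(toks):
    # Adjacent token pairs, order-sensitive.
    return set(zip(toks, toks[1:]))


def coverage(query_feats, doc_feats):
    """Fraction of the query's features that also occur in the document."""
    query_feats = set(query_feats)
    if not query_feats:
        return 0.0
    return len(query_feats & set(doc_feats)) / len(query_feats)


title = tokens("upload retry policy")
body = tokens("the policy says to retry each upload twice")

unigram_coverage = coverage(title, body)                   # every title word is in the body
bigram_coverage = coverage(bigrams(title), bigrams(body))  # no adjacent pair lines up
```

Every title word appears in the body, so unigram coverage is complete, while none of the title's word pairs appear in the body at all.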

Restructuring the input failed too. "Cook the record into a specific structure before the algorithm touches it" is a beautiful idea — and the right diagnosis. But on a bag-of-words hash, weighting fields or tagging them either did nothing (the hash is additive) or broke the match (a tagged token lands in a different bucket than the plain query word). Structure can't help if the reader can't read structure.
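The tagging failure is easy to reproduce in miniature. In this sketch the `title:` prefix and the example strings are hypothetical; the mechanism is that a tagged token is a different string from the plain query word, so a hash sends it to an unrelated bucket and the two no longer match.

```python
def plain_features(text):
    return set(text.lower().split())


def tagged_features(field, text):
    # Prefix every token with its field name, e.g. "title:retry".
    return {f"{field}:{tok}" for tok in text.lower().split()}


query = plain_features("retry policy")  # queries arrive untagged
doc_plain = plain_features("upload retry policy")
doc_tagged = tagged_features("title", "upload retry policy")

shared_plain = query & doc_plain    # the query words match
shared_tagged = query & doc_tagged  # nothing matches once the document is tagged
```

Tagging both sides identically would restore the match, but then the tag carries no information for a bag-of-words reader, which is the "did nothing" half of the result.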

A wall of negative results. Every operator either no-op'd, underperformed, or needed relevance labels we'd ruled out. It was starting to look like the whole detour was a wash.

What worked (the boring thing)

Then we looked at why a record's own title couldn't find its own body. We measured the vocabulary: a title shares about 74% of its words with its own body. The words overlap fine. So why does retrieval miss?

Because the shared words are the wrong words.

The most common words in our titles were the words in everyone's body. Connective tissue and house style — the "and," "for," "the," plus a handful of project-boilerplate terms that show up in 60–90% of all records. The one or two words that actually distinguish this record from its neighbors were getting exactly the same vote as the words that distinguish nothing.
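The measurement behind that paragraph is a document-frequency count. Here is a minimal sketch on an invented four-record corpus; the 0.6 threshold mirrors the low end of the 60–90% range mentioned above, and everything else is illustrative.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


docs = [
    "the retry policy for uploads",
    "the billing report for march",
    "the cache eviction policy",
    "notes for the onboarding call",
]

df = document_frequency(docs)
# Words present in at least 60% of records: candidates for "distinguishes nothing".
everywhere = {word for word, count in df.items() if count / len(docs) >= 0.6}
```

On this toy corpus the set comes out as `the` and `for`, the same kind of connective tissue that was outvoting the distinguishing words in our titles.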

The fix is the oldest trick in information retrieval. Inverse document frequency: down-weight words that appear in many documents, up-weight words that appear in few. Count how common each word is across the corpus, divide its influence by that, done. Deterministic. No model. No labels. Computed once at write time.
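As a hedged sketch of what IDF weighting on a feature hash can look like: the smoothed logarithmic formula below is one common textbook variant, not necessarily the exact formula we shipped, and the tokenizer, the `crc32` hash, the `power` parameter name, and the default weight for unseen words are all assumptions for illustration.

```python
import math
import re
import zlib
from collections import Counter

DIM = 256


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs, power=1.0):
    """Per-word weights, computed once over the corpus at write time.

    Uses the smoothed form log((1 + N) / (1 + df)) + 1 (an assumed variant).
    power=0.5 gives a softened weighting in the spirit of idf^0.5.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {
        tok: (math.log((1 + n) / (1 + count)) + 1.0) ** power
        for tok, count in df.items()
    }


def embed(text, idf, default=1.0, dim=DIM):
    # Same feature hash as before, but each token votes with its IDF weight.
    # `default` is a hypothetical weight for words never seen at fit time.
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += idf.get(tok, default)
    return vec


corpus = [
    "the retry policy for uploads",
    "the billing report for march",
    "the cache eviction policy",
    "notes for the onboarding call",
]
idf = fit_idf(corpus)
soft_idf = fit_idf(corpus, power=0.5)
```

In this toy corpus `the` appears in every record and gets the minimum weight, while `retry` appears once and gets the largest. Nothing in the query path changes except the size of each word's vote.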

Weighting                  recall@5   MRR
feature-hash baseline      45%        0.346
idf-weighted cosine        91%        0.798
idf, softened (idf^0.5)    92%        0.819

Recall doubled. The mean reciprocal rank more than doubled. After eleven mined texts and roughly seventy-five candidate operators, the thing that moved the number was a weighting scheme older than the people who built the embedder.

The benchmark was intentionally dumb: for each record, use the title as the query and ask the system to retrieve that record's body from the corpus of all records. It is not a universal search benchmark — the title and body genuinely share most of their vocabulary, which is exactly why the failure was interesting — but it was enough to test whether a change improved the failure mode we actually had.
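The benchmark loop itself fits in a dozen lines. This is a sketch of the procedure described above rather than our committed harness: `evaluate`, the `score` callback, the `shared_words` scorer, and the three records are all illustrative names and data.

```python
def evaluate(records, score, k=5):
    """Title-finds-own-body benchmark.

    records: list of (title, body) pairs.
    score(query, body) -> float, higher means a better match.
    Returns (recall@k, mean reciprocal rank).
    """
    hits = 0
    reciprocal_ranks = 0.0
    for i, (title, _body) in enumerate(records):
        scores = [score(title, body) for _title, body in records]
        # sorted() is stable, so ties keep corpus order.
        ranked = sorted(range(len(records)), key=lambda j: scores[j], reverse=True)
        rank = ranked.index(i) + 1  # 1-based position of the record's own body
        hits += rank <= k
        reciprocal_ranks += 1.0 / rank
    n = len(records)
    return hits / n, reciprocal_ranks / n


def shared_words(query, body):
    # Stand-in scorer for the demo: count of words in common.
    return len(set(query.lower().split()) & set(body.lower().split()))


records = [
    ("retry policy", "the retry policy for uploads"),
    ("billing report", "the billing report for march"),
    ("cache eviction", "the cache eviction policy"),
]
recall_at_1, mrr = evaluate(records, shared_words, k=1)
```

Because `score` is a parameter, the same loop can compare a baseline and a candidate on identical records. One caveat of this sketch is that tied scores are broken by corpus order, which a real harness should handle deliberately.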

Update — the production numbers. The table above is from a stripped-down term-frequency probe we used to isolate the effect. We then shipped IDF into the real production embedder — the 256-dimension feature-hash with all its character n-grams and concept layers — and re-ran a committed benchmark against it. The win held: MRR 0.73 → 0.90, recall@1 0.58 → 0.84. One nuance the probe hid: on the real embedder the vector mass lives in the n-gram and concept features, so weighting only the bare tokens barely moved the needle (+0.06 MRR); weighting every layer by IDF was what delivered the full lift (+0.17). Same conclusion, sharper: down-weight the common signal — everywhere it appears, not just where it is convenient.
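To make "weighting every layer" concrete, here is a hedged sketch in which word features and character n-gram features get IDF weights from the same document-frequency count. The production feature set is not reproduced in this post, so the `features` function, the `w:` and `c3:` prefixes, the `^` and `$` padding markers, and the smoothed IDF formula are all hypothetical; concept features are left out entirely.

```python
import math
from collections import Counter


def features(text, n=3):
    """Word features plus character n-grams, each tagged with its layer."""
    feats = []
    for tok in text.lower().split():
        feats.append(f"w:{tok}")
        padded = f"^{tok}$"  # mark word boundaries
        feats.extend(f"c{n}:{padded[i:i + n]}" for i in range(len(padded) - n + 1))
    return feats


def fit_idf(docs):
    # Document frequency is counted per feature, so n-grams that occur
    # everywhere are down-weighted just like words that occur everywhere.
    total = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(features(doc)))
    return {f: math.log((1 + total) / (1 + c)) + 1.0 for f, c in df.items()}


docs = ["the retry policy", "the billing report", "the cache policy"]
idf = fit_idf(docs)
```

Here the layer prefixes are applied by the same `features` function on both the query side and the document side, so they do not cause the bucket mismatch described earlier for field tags.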

The part that is actually about the sutras

Here it would be easy to overclaim, so let me not.

The sutras did not contain TF-IDF. There is no hidden search algorithm in the Gita. The texts were not secretly doing information retrieval.

What is true is smaller and, I think, more interesting: when we found a real win, it resembled an old contemplative instruction in a way that was not evidential, but was clarifying.

IDF says: do not let the words that appear everywhere dominate the representation. A contemplative tradition might say: do not let the habitual object dominate attention. Those are not the same claim. But they rhyme, because they are solving the same kind of constraint: finite attention under noisy abundance.

The "no center" instruction rhymes with mean-removal. The Middle Way rhymes with calibration — softening the weights beat raw IDF by a hair, not too sharp, not too soft. Non-attachment rhymes with downweighting popularity. None of this proves the ancients knew search. It suggests something more basic: two independent search processes — centuries of people refining attention, decades of people refining retrieval — kept arriving at the same small family of operations, because bounded systems have only a few ways to stop drowning in noise.

The eye and the camera both arrived at the lens. Neither copied the other. They were solving the same constraint.

Clarity begins by subtracting what appears everywhere. The sutras say it as a path. Information retrieval says it as a weighting scheme. They are not the same tradition, but they point at the same constraint. One of them just compiles.

The actual lesson

The actual lesson was not that math beat the sutras. It was that math made the intuition testable.

The contemplative intuition — discipline is partly the act of not attending to what is common — turned out to be useful, but only after it was made falsifiable. The translation into deterministic operators is what let us throw away the ninety percent that didn't survive contact with a benchmark and keep the ten percent that did. The detour did not produce the algorithm. The detour forced the benchmark, and the benchmark found the fix.

And the fix was humbling: we were treating search as a problem of missing semantic depth. It was a problem of basic hygiene. Every word was voting equally, and the words that appear everywhere were outvoting the words that mean something.

Stop paying attention to what appears everywhere.

If you take one thing from this: before you reach for a learned model, build the dumbest honest benchmark and try the oldest boring baseline. The exotic fix is usually downstream of the boring one. Sometimes the boring one is the win.

The sutras did not beat math.

Math gave them unit tests.