I’ve found myself hitting ctrl+f on this blog enough that I figured it’s about time to add some search functionality to it. While there are certainly prefab solutions out there, this task is simple enough and fairly instructive. I had a few requirements, though:

  1. The search needs to be fast, useful, and aesthetically pleasing.
  2. The search should run entirely in the browser. Standing up a server is a lot of extra work, and it's overkill since I only have about 30 articles so far.

I did some experiments with small neural networks deployed using ONNX, but ultimately they didn’t seem to be a good fit for this blog. The search experience was not as snappy as I’d have liked, and while I was able to get the model under 10MB, it still added a good amount of bloat to the page. Further, it wasn’t clear to me that the search results were significantly better, and in some cases they were worse. In any case, the advantages were not enough to justify the added page size.

I did learn a lot from this experiment, which probably deserves a blog post of its own at some point. I might also revisit semantic matching later.

BM25

Rather than loading a neural network, I decided to index the text of my blog and use BM25 to rank documents.

One difficulty was finding an implementation of the algorithm that wasn’t buried in a ton of extra code. I was only interested in the scoring function itself, not all of the boilerplate floating around it. In the end I broke down and rewrote it myself.
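The core of it is small. Here’s a minimal sketch of a BM25 scoring function along the lines of what I ended up with; the data layout (plain token arrays, a Map of term frequencies) and the k1/b defaults are illustrative choices rather than a faithful copy of my implementation.

```typescript
// Minimal BM25 sketch. `docs` is an array of token arrays (one per document)
// and `query` is an array of query tokens. Returns one score per document.
// k1 and b are the usual BM25 tuning constants.
function bm25Scores(docs: string[][], query: string[], k1 = 1.5, b = 0.75): number[] {
  const N = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / N;

  // Document frequency: how many documents contain each query token.
  const df = new Map<string, number>();
  for (const token of new Set(query)) {
    df.set(token, docs.filter((d) => d.includes(token)).length);
  }

  return docs.map((doc) => {
    // Term frequency within this document.
    const tf = new Map<string, number>();
    for (const token of doc) tf.set(token, (tf.get(token) ?? 0) + 1);

    let score = 0;
    for (const token of query) {
      const n = df.get(token) ?? 0;
      if (n === 0) continue;
      // Standard BM25 IDF, with the +1 inside the log to keep it non-negative.
      const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
      const f = tf.get(token) ?? 0;
      score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}
```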

Tokenizing Text

BM25 works by scoring document tokens against query tokens. Documents with more relevant tokens should be scored higher than those with fewer relevant tokens.

This raises the question of how to convert a query string into a group of tokens. Initially, I simply split by word. That is,

"hello my friend" => ['hello', 'my', 'friend']

Although this works pretty well, it has the downside that different forms of a word will be missed. For example, “friends” will not match “friend”. To solve this problem, I decided to use fixed-size tokens of three characters instead:

"hello my friend" => ['hel', 'ell', 'llo', 'lo#', '#my', 'my#', '#fr', 'fri', 'rie', 'ien', 'end']

In this way “friend” will be tokenized very similarly to “friends”, so the penalty for mismatches between different forms of a word is minimized. Subjectively, I felt that this yielded a much more fluid and versatile ranking as well.
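One way to produce tokens like the example above is to join the words with a “#” boundary marker, slide a three-character window across the result, and drop any window that straddles two words. Here’s a sketch along those lines; the lowercasing and the choice of boundary character are just conventions, and queries shorter than three characters would need separate handling.

```typescript
// Character-trigram tokenizer. Words are joined with '#' so that word
// boundaries show up in the edge trigrams, while any trigram with a
// boundary in the middle (i.e. one straddling two words) is dropped.
function tokenize(text: string): string[] {
  const joined = text
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .join('#');

  const tokens: string[] = [];
  for (let i = 0; i + 3 <= joined.length; i++) {
    const gram = joined.slice(i, i + 3);
    if (gram[1] !== '#') tokens.push(gram);
  }
  return tokens;
}

// tokenize("hello my friend")
// => ['hel', 'ell', 'llo', 'lo#', '#my', 'my#', '#fr', 'fri', 'rie', 'ien', 'end']
```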

Multi-Factor Scoring

My blog posts have title, body, and tags. I could simply concatenate them all together and score the combined string, and it would work pretty well. But this seems to throw away the fact that the title and tags may have more concentrated information than the article body does. They deserve to be weighted differently. I decided to weight the title/body/tags components as 30/50/20 in the final ranking.