Friday, December 15, 2023

A new Lucene learning resource

My introduction to Lucene

In 2011, I was working for a small startup in Belgium, developing a data pipeline that would crawl news from around the world, extract named entities using natural language processing, and send an alert whenever a new article matched a user’s interests. At the time, we were struggling with latency from the multi-table joins that we were executing across our fully-normalized database to surface articles discussing named entities related to the logged-in user’s profile.

Taking the train to work every day, I would read various programming books, trying to find things that might help us improve our architecture. One of those books was Lucene in Action, Second Edition. At some point while reading that book, I realized that we didn’t have a database problem — we had a search problem. 

Since we were already using Apache Solr in a corner of our architecture (to deduplicate articles), it was easy to add another index with articles and named entities weighted with payloads (though I would now probably use custom term frequencies). With 2-3 days of work, we were able to reduce our page load time from ~20 seconds to under 100 milliseconds. 

At that point, I became a huge fan of search in general and Apache Lucene in particular. For more than 12 years now, my work has revolved around Lucene and distributed services built on top of Lucene to support fast, scalable, highly-available search systems.

The present day

Today, I work on OpenSearch, a distributed search service powered by Apache Lucene. We have started an OpenSearch Lucene Study Group (see https://www.meetup.com/opensearch/events/) to discuss Lucene changes and exchange knowledge. In those meetings and in day-to-day work, OpenSearch developers ask me to recommend books or other resources to learn Lucene. Although it is over 13 years old, I still find myself pointing to Lucene in Action, Second Edition as the best Lucene book.

It is a great book that covers the functionality available in Lucene 3.0, but Lucene has since gained major new features (codecs, two-phase iterators, block-max WAND, block joins) and new index data structures (doc values, points, vectors) that enable query capabilities that were unimaginable in 2010.

Lucene university

To help OpenSearch developers and Lucene enthusiasts in 2023, including me, I created a project on GitHub that I’ve dubbed “Lucene University”. It’s a collection of worked examples, made up of self-contained classes, verbosely documented. The examples loosely assume that readers have read Lucene in Action, but that is not a hard requirement.

I imagine people will learn from these examples in one of the following ways:
  • Hands-on: Check out the repository and load it into the IDE of your choice. Pick a sample that sounds interesting and run it under a debugger. As you step through the code, the comments are your tour guide. It’s also a great idea to step into the Lucene calls to see how they work.
  • Reading the code: The comments should make it easy to read the code on GitHub if you just need an example with some explanation.
  • Read it like a book: Go through code with explanations in the margin.
For that last one, I wanted to copy the beautifully-rendered worked examples from Tantivy (a Rust search library inspired by Lucene), which I saw were generated from annotated source code. I looked into how they did it and learned about Docco, a Node.js tool that converts code with Markdown comments into pretty HTML.

With my toolchain selected, I created my first example: a simple search application that would index a few text documents and search for the word “fox”. It provides a basic introduction to IndexWriter, IndexSearcher, TermQuery, and StoredFields — enough to get started with Lucene. Then I ran it through Docco and learned about GitHub static hosting to produce some pretty output.

Since then, I’ve added the following examples:
  • SimpleSearchWithTermsEnum: This does the same thing as the SimpleSearch example, but it uses lower-level APIs to (partly) dig into how IndexSearcher and TermQuery work.
  • AnalyzerBasics: This one doesn’t create an index at all, but uses StandardAnalyzer to produce a stream of tokens from an input string, to explain how terms are derived from a text field before getting indexed.
  • VisualizePointTree: I wanted to understand how 1-D numeric points get stored in a K-d tree. So, I wrote this example that indexes 20,000 documents with 10,000 different numeric point values and prints the resulting point tree. I also tried splitting the documents across multiple segments and tried writing the same point value for all documents, to see how the output would change.
  • PointTreeRangeQuery: After the previous example, I wanted to understand (and explain) how Lucene uses a K-d tree to efficiently implement numeric range queries. This one indexes the same 20,000 documents, but runs a range query over 2,000 points three different ways: passing a range query to an IndexSearcher, passing a custom IntersectVisitor to the PointTree’s intersect method, and implementing the intersect logic myself (more or less doing what the real tree does).
  • DirectoryFileContents: Walks through a SimpleTextCodec representation of a Lucene index with a text field and an IntField, exploring doc values, points, stored fields, norms, and postings.
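
To give a feel for the intersect logic that the PointTreeRangeQuery example explores, here is a rough sketch in plain Java. It is not code from the repository and it does not use Lucene at all: the class name PointTreeSketch, the LEAF_SIZE constant, and the sample data are all made up for illustration, and a sorted array stands in for a 1-D point tree. The idea it tries to show is that whole subtrees can be skipped or counted in bulk, and only cells that straddle the query boundary need a value-by-value check.

```java
import java.util.Arrays;

public class PointTreeSketch {
    // Hypothetical leaf size; a real index uses much larger leaves.
    static final int LEAF_SIZE = 4;

    // Counts values in sorted[from, to) that fall within [lo, hi] (inclusive).
    // Each recursive call treats sorted[from, to) as one "cell" of the tree.
    static int countInRange(int[] sorted, int from, int to, int lo, int hi) {
        if (from >= to) {
            return 0;
        }
        int min = sorted[from];
        int max = sorted[to - 1];
        if (max < lo || min > hi) {
            // Cell lies entirely outside the query: skip the whole subtree.
            return 0;
        }
        if (min >= lo && max <= hi) {
            // Cell lies entirely inside the query: every value matches.
            return to - from;
        }
        if (to - from <= LEAF_SIZE) {
            // Leaf cell crosses the query boundary: check each value.
            int count = 0;
            for (int i = from; i < to; i++) {
                if (sorted[i] >= lo && sorted[i] <= hi) {
                    count++;
                }
            }
            return count;
        }
        // Inner cell crosses the query boundary: recurse into both halves.
        int mid = (from + to) >>> 1;
        return countInRange(sorted, from, mid, lo, hi)
            + countInRange(sorted, mid, to, lo, hi);
    }

    // Sorts a copy of the values (standing in for building the tree) and counts matches.
    static int count(int[] values, int lo, int hi) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        return countInRange(sorted, 0, sorted.length, lo, hi);
    }

    public static void main(String[] args) {
        // Made-up data: 20,000 "documents" sharing 10,000 distinct values.
        int[] values = new int[20000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 10000;
        }
        // 2,000 distinct values in range, two documents each.
        System.out.println(count(values, 1000, 2999)); // prints 4000
    }
}
```

The three branches correspond loosely to the CELL_OUTSIDE_QUERY, CELL_INSIDE_QUERY, and CELL_CROSSES_QUERY relations that a Lucene IntersectVisitor reports; the real tree also tracks document IDs and supports multiple dimensions, which this sketch leaves out.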
I’m also thrilled to call out the examples contributed by Sam Herman:
  • FunctionQuerySearchExample: Builds on SimpleSearch by wrapping the TermQuery in a FunctionScoreQuery to replace BM25 scores with the value of a floating point field.
  • KnnSearchExample: Indexes some documents with vectors and runs a k-nearest neighbors query to find the documents with vector values closest to a search vector.
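
For readers new to vector search, the following plain-Java sketch shows what "k-nearest neighbors" means by brute force. It is not taken from KnnSearchExample and does not use Lucene: the class name BruteForceKnn, the choice of squared Euclidean distance, and the sample vectors are assumptions for illustration. Lucene itself avoids comparing the query against every document by searching an HNSW graph, which returns approximate nearest neighbors.

```java
import java.util.Arrays;
import java.util.Comparator;

public class BruteForceKnn {
    // Squared Euclidean distance; ordering by it matches ordering by true distance.
    static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the IDs (array positions) of the k documents closest to the query,
    // nearest first, by comparing the query against every document vector.
    static int[] nearest(float[][] docs, float[] query, int k) {
        Integer[] ids = new Integer[docs.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, Comparator.comparingDouble(
            (Integer id) -> squaredDistance(docs[id], query)));
        int n = Math.min(k, ids.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up 2-D vectors, one per "document".
        float[][] docs = {{0f, 0f}, {1f, 1f}, {5f, 5f}, {1f, 0f}};
        float[] query = {0.9f, 0.9f};
        System.out.println(Arrays.toString(nearest(docs, query, 2))); // prints [1, 3]
    }
}
```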

What’s next?

The examples above are the result of a couple of weeks of part-time effort and are just scratching the surface of what can be done with Lucene. Going forward, I plan to add one or two worked examples per week. Over time, I hope that it grows to where it makes sense to organize content into “chapters”, such that it becomes a complete “course” or “book” on modern Lucene, available for free under the Apache 2 license.

Contributions are welcome! If you have an idea for a self-contained Lucene example, please fork the repository and submit a pull request. I take requests! If there’s something you would like to learn about in Lucene, please open an issue.