Friday, December 15, 2023

A new Lucene learning resource

My introduction to Lucene

In 2011, I was working for a small startup in Belgium, developing a data pipeline that would crawl news from around the world, extract named entities using natural language processing, and provide alerts when a new article related to a user’s interests. At the time, we were struggling with latency from the multi-table joins that we were executing across our fully-normalized database to surface articles discussing named entities that related to the logged-in user’s profile. 

Taking the train to work every day, I would read various programming books, trying to find things that might help us improve our architecture. One of those books was Lucene in Action, Second Edition. At some point while reading that book, I realized that we didn’t have a database problem — we had a search problem. 

Since we were already using Apache Solr in a corner of our architecture (to deduplicate articles), it was easy to add another index with articles and named entities weighted with payloads (though I would now probably use custom term frequencies). With 2-3 days of work, we were able to reduce our page load time from ~20 seconds to under 100 milliseconds. 

At that point, I became a huge fan of search in general and Apache Lucene in particular. For more than 12 years now, my work has revolved around Lucene and distributed services built on top of Lucene to support fast, scalable, highly-available search systems.

The present day

Today, I work on OpenSearch, a distributed search service powered by Apache Lucene. We have started an OpenSearch Lucene Study Group (see https://www.meetup.com/opensearch/events/) to discuss Lucene changes and exchange knowledge. In those meetings and day-to-day, OpenSearch developers ask me to recommend books or other resources to learn Lucene. While it’s over 13 years old, I still find myself pointing to Lucene in Action, Second Edition as the best Lucene book.

While it’s a great book that covers the functionality available in Lucene 3.0, there have been major new features (codecs, two-phase iterators, block-max WAND, block joins) and new index data structures (doc values, points, vectors) that add new query capabilities that were unimaginable in 2010.

Lucene University

To help OpenSearch developers and Lucene enthusiasts in 2023, including me, I created a project on GitHub that I’ve dubbed “Lucene University”. It’s a collection of worked examples, made up of self-contained classes, verbosely documented. It’s written assuming that readers have read Lucene in Action, but that’s not a hard requirement.

I imagine people will learn from these examples in one of the following ways:
  • Hands-on: Check out the repository and load it into the IDE of your choice. Pick a sample that sounds interesting and run it under a debugger. As you step through the code, the comments are your tour guide. It’s also a great idea to step into the Lucene calls to see how they work.
  • Reading the code: The comments should make it easy to read the code on GitHub if you just need an example with some explanation.
  • Read it like a book: Go through code with explanations in the margin.
For that last one, I wanted to copy the beautifully-rendered worked examples from Tantivy (a Rust search library inspired by Lucene), which I saw were generated from annotated source code. I looked into how they did it and learned about Docco, a NodeJS tool to convert code with Markdown comments into pretty HTML.

With my toolchain selected, I created my first example: a simple search application that would index a few text documents and search for the word “fox”. It provides a basic introduction to IndexWriter, IndexSearcher, TermQuery, and StoredFields — enough to get started with Lucene. Then I ran it through Docco and learned about GitHub static hosting to produce some pretty output.
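
Since that first example is the natural starting point, here’s a minimal sketch along the same lines. To be clear, this is not the actual Lucene University source: it assumes a recent Lucene 9.x (where IndexSearcher exposes storedFields()), and the document text is made up.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class SimpleSearchSketch {
        public static void main(String[] args) throws Exception {
            try (Directory directory = new ByteBuffersDirectory()) {
                // Index a few documents with an analyzed, stored text field.
                try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()))) {
                    for (String text : new String[]{
                            "The quick fox jumped over the lazy, brown dog",
                            "Lorem ipsum, dolor sit amet",
                            "The rain in Spain falls mainly on the plain"}) {
                        Document doc = new Document();
                        doc.add(new TextField("text", text, Field.Store.YES));
                        writer.addDocument(doc);
                    }
                }
                // Search for the word "fox" and print the stored text of each hit.
                try (DirectoryReader reader = DirectoryReader.open(directory)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    TopDocs topDocs = searcher.search(new TermQuery(new Term("text", "fox")), 10);
                    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                        System.out.println(searcher.storedFields().document(scoreDoc.doc).get("text"));
                    }
                }
            }
        }
    }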

Since then, I’ve added the following examples:
  • SimpleSearchWithTermsEnum: This does the same thing as the SimpleSearch example, but it uses lower-level APIs to (partly) dig into how IndexSearcher and TermQuery work.
  • AnalyzerBasics: This one doesn’t create an index at all, but uses StandardAnalyzer to produce a stream of tokens from an input string, to explain how terms are derived from a text field before getting indexed. (There’s a minimal analyzer sketch right after this list.)
  • VisualizePointTree: I wanted to understand how 1-D numeric points get stored in a K-d tree. So, I wrote this example that indexes 20,000 documents with 10,000 different numeric point values and prints the resulting point tree. I also tried splitting the documents across multiple segments and tried writing the same point value for all documents, to see how the output would change.
  • PointTreeRangeQuery: After the previous example, I wanted to understand (and explain) how Lucene uses a K-d tree to efficiently implement numeric range queries. This one indexes the same 20,000 documents, but runs a range query over 2000 points three different ways: passing a range query to an IndexSearcher, passing a custom IntersectVisitor to the PointTree’s intersect method, and implementing the intersect logic myself (more or less doing what the real tree does).
  • DirectoryFileContents: Walks through a SimpleTextCodec representation of a Lucene index with a text field and an IntField, exploring doc values, points, stored fields, norms, and postings.
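
As promised above, here’s a minimal sketch in the spirit of the AnalyzerBasics example. Again, it’s a hedged stand-in rather than the repository code itself:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerBasicsSketch {
        public static void main(String[] args) throws Exception {
            try (Analyzer analyzer = new StandardAnalyzer();
                 TokenStream stream = analyzer.tokenStream("text", "The quick fox jumped over the lazy, brown dog")) {
                CharTermAttribute termAttribute = stream.addAttribute(CharTermAttribute.class);
                stream.reset();  // required before the first call to incrementToken()
                while (stream.incrementToken()) {
                    System.out.println(termAttribute);  // one analyzed term per token
                }
                stream.end();
            }
        }
    }
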
I’m also thrilled to call out the examples contributed by Sam Herman:
  • FunctionQuerySearchExample: Builds on SimpleSearch by wrapping the TermQuery in a FunctionScoreQuery to replace BM25 scores with the value of a floating point field.
  • KnnSearchExample: Indexes some documents with vectors and runs a k-nearest neighbors query to find the documents with vector values closest to a search vector.

What’s next?

The examples above are the result of a couple of weeks of part-time effort and are just scratching the surface of what can be done with Lucene. Going forward, I plan to add one or two worked examples per week. Over time, I hope that it grows to where it makes sense to organize content into “chapters”, such that it becomes a complete “course” or “book” on modern Lucene, available for free under the Apache 2 license.

Contributions are welcome! If you have an idea for a self-contained Lucene example, please fork the repository and submit a pull request. I take requests! If there’s something you would like to learn about in Lucene, please open an issue.

Monday, March 3, 2014

And I vs And me


This past week, I've seen three of my friends from university use "and I" in a sentence where it should have been "and me". This has always been a particular grammatical pet peeve of mine, especially since it's so easy to correct.

An example sentence (not from any of these specific cases) could be "Jane talked with Jim and I about the proposed design". This should, of course, be "Jane talked with Jim and me...".

This mistake is a little unusual in my experience, since it seems to be pretty exclusive to native English speakers. In six years of living in Romania, I don't think I ever heard a Romanian make the wrong choice between "I" and "me", even when a conjunction was involved. Nor do I remember hearing it from non-native speakers from other parts of the world (whether the native language is Latin-based, Germanic, or non-European). Furthermore, it's a mistake that comes up surprisingly frequently among well-educated English speakers.

In the context of these three university friends who independently made the same mistake, they are all native English speakers, raised in Canada, with degrees in mathematics and/or computer science. I heard far too frequently in university that math/CS people don't need to know details of grammar, since they don't need to do a great deal of writing. I always found that argument pretty silly, given that math and computer science both benefit from fairly precise use of language (whether it be the extensive jargon and overloading of terms in math, or the languages we use to program, which tend not to tolerate mistakes in syntax -- with the notable exceptions of JavaScript and PHP).

How to do it correctly?

There's a pretty simple rule, passed on to me by my mother, who learned it from her father: eliminate the other person. If you think "Jane talked with Jim and I about the proposed design" sounds fine, try eliminating Jim from the sentence: "Jane talked with I about the proposed design". That's obviously wrong. So, it's very easy to correct yourself.

If I recall correctly, my mother taught me that rule when I was six. I remember reciting it to my sister when I was eight and she was five. It's a simple enough rule for five- and six-year-olds to understand, and yet it's not taught in public schools across the English-speaking world. I vaguely remember correcting a friend at age 10 — yes, I was one of those kids — and being told that his teacher had told him to always use "and I", regardless of context (which is completely ridiculous). In fact, I definitely remember that my mother taught me that rule by telling me a story of when her father proofread something she had written, switched "and I" to "and me", and her teacher had marked it as incorrect. She taught me two very important lessons that day: when to use "and me" versus "and I", and the fact that sometimes teachers are wrong.

Why does the rule work?

Studying math and computer science, I always loved it when I could apply the elegant formal theory that we studied to something from the "real world". In general, systems built around axioms, construction rules, and higher-order truth tend to fall apart when you're dealing with messy, human systems (like natural language).

One of the things I love about this particular case is that, so far as I can tell, there is a simple, elegant rule. The word "and" (or "or") is an operator. That is, it takes two elements from the same set as operands, and returns an element from that same set. In this case, the possible sets are simply parts of speech, or "roles" of words in a sentence.

For example, "table or red" doesn't make a lot of sense in most contexts, since table is a noun and red is an adjective. (The exceptions could be where "table" is used as an adjective, as in "table wine", or we use "red" as a noun by referring to the word "red" or the colour it represents.)

Of course, in the context of a sentence, nouns aren't just nouns. They end up being either the subject or object of a verb. That is, they produce the action or they are acted upon.

Names are a little tricky, since they work either as subjects or objects (English lacks the precision that comes with having accusative, dative, and prepositional noun declensions for names, or indeed nouns in general). For example, in "Jim talked to Jane", "Jim" is the subject and "Jane" is the object. Obviously, we can swap them, and their roles in the sentence swap as well.

Fortunately, two words (pronouns, in fact) that very rarely change context are "I" and "me". Specifically, "I" (when used as such) is the first person singular subject in a sentence, while "me" is used as the object of a clause's action.

As such, since "and" is an operator, we know that "Jim and I" must be the subject of a verb, while "Jim and me" must be the object, since they are both constrained to the given sets by the presence of "I" or "me".

Hey, isn't this a programming blog?

I got to thinking, "Hey, since this is such a simple rule, couldn't we express it in code, even in a language like Java with basic type inference?" The answer I've been able to find so far is "kind of".

I wanted to make something that would work using infix notation (so that dropping dots and parentheses would result in a readable English sentence), which means using methods, which means using objects, which means using classes. Unfortunately, that means that we're constrained by Java's inheritance rules.

Luckily, the inheritance rules were relaxed a little in Java 1.8, so I can provide an example with less boilerplate than would be required with 1.7 or earlier.
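
The original gist isn't preserved here, so what follows is a hedged reconstruction based on the description below. The names Subject, Obj, and Name come from the post; the exact bodies are my guess.

    public class Grammar {
        public interface Subject {
            // "and" combines two subjects into something that is still a subject.
            default Subject and(Subject other) {
                return other;
            }
        }

        public interface Obj {  // "Obj" rather than Object, to avoid java.lang.Object
            default Obj and(Obj other) {
                return other;
            }
        }

        // A name can act as either the subject or the object of a verb, so it
        // implements both interfaces. The body is empty: the default methods
        // supply all the behaviour.
        public static class Name implements Subject, Obj {
        }

        // "I" is only ever a subject; "me" is only ever an object.
        public static final Subject I = new Subject() {};
        public static final Obj me = new Obj() {};
    }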

By using default method implementations, we're able to provide an empty implementation of the "Name" class, as well as the anonymous class instances "I" and "me". It doesn't help a lot, but keeps the code short. The important thing to notice is that both "Subject" and "Obj" (used in place of Object, to avoid confusion with the Java base class) define methods called "and", but they only act on operands of the same type and return a result of the same type.

Let's see what the compiler does with different inputs:
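
The original snippet isn't shown, but a hypothetical check against the types above might look like this:

    public class GrammarCheck {
        public static void main(String[] args) {
            Grammar.Name jim = new Grammar.Name();

            Grammar.Subject s = jim.and(Grammar.I);  // "Jim and I" can only be a subject
            Grammar.Obj o = jim.and(Grammar.me);     // "Jim and me" can only be an object

            // Neither of these compiles, which is exactly the point:
            // Grammar.Obj bad = jim.and(Grammar.I);  // error: Subject cannot be converted to Obj
            // Grammar.I.and(Grammar.me);             // error: no and(Obj) method on Subject
        }
    }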

We can add a few more definitions to grammar-check pop songs from 1987:
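
The specific lyric from the original post isn't reproduced here, so as a hypothetical stand-in: "you", like a name, works as either subject or object, which lets us check both forms.

    public class PopLyricCheck {
        public static void main(String[] args) {
            Grammar.Name you = new Grammar.Name();

            Grammar.Subject s = you.and(Grammar.I);  // "you and I" belongs in subject position
            Grammar.Obj o = you.and(Grammar.me);     // "you and me" belongs in object position

            // A lyric that puts "you and I" in object position won't compile:
            // Grammar.Obj bad = you.and(Grammar.I);
        }
    }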

So, what now?

In short, please stop making this mistake. We've seen an explanation that a five-year-old could understand, a set-theoretic explanation of why it works, and finally an example in code (along with a reference to a song that appeared on the Dirty Dancing soundtrack). I hope I've made it clear that "But I'm a math/CS person, I don't need proper grammar!" is not a valid excuse for sounding/writing like an idiot.

Wednesday, May 1, 2013

Pattern Matching and Functional Structure

Taking a break (or possibly moving on) from Project Lambda posts, I've decided that it's a good time to look at pattern matching, a common approach in functional programming that leads to a similar end result as object-oriented inheritance polymorphism. That said, I'll use Project Lambda (JDK8) features when writing Java code (since it's much nicer). I'll also diverge from most of my previous posts by using Scala here, since the Java compiler doesn't support (elegant) pattern matching (though I've just discovered and may want to play around with JMatch, which apparently extends Java with pattern matching support).

An object-oriented example

For comparison, here is an object-oriented implementation of an immutable list in Java. An immutable list is either an EmptyList (a singleton that holds nothing) or a NonEmptyList (that holds a value and a reference to the rest of the list). To hide the implementation details from calling code, I've implemented the two cases as private static classes and exposed empty() as a factory method to get the EmptyList. The other operations (head, tail, prepend, and map) are instance methods of ImmutableList.
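
The original Java gist isn't preserved here; this is a hedged reconstruction that matches the description above:

    import java.util.function.Function;

    public abstract class ImmutableList<T> {
        public abstract T head();
        public abstract ImmutableList<T> tail();
        public abstract <R> ImmutableList<R> map(Function<T, R> f);

        // prepend works the same way for both cases, so it lives in the base class.
        public ImmutableList<T> prepend(T value) {
            return new NonEmptyList<>(value, this);
        }

        @SuppressWarnings("unchecked")
        public static <T> ImmutableList<T> empty() {
            return (ImmutableList<T>) EmptyList.INSTANCE;
        }

        private static class EmptyList<T> extends ImmutableList<T> {
            private static final EmptyList<Object> INSTANCE = new EmptyList<>();

            public T head() { throw new UnsupportedOperationException("empty list"); }
            public ImmutableList<T> tail() { throw new UnsupportedOperationException("empty list"); }
            public <R> ImmutableList<R> map(Function<T, R> f) { return empty(); }
        }

        private static class NonEmptyList<T> extends ImmutableList<T> {
            private final T head;
            private final ImmutableList<T> tail;

            private NonEmptyList(T head, ImmutableList<T> tail) {
                this.head = head;
                this.tail = tail;
            }

            public T head() { return head; }
            public ImmutableList<T> tail() { return tail; }
            public <R> ImmutableList<R> map(Function<T, R> f) {
                return new NonEmptyList<>(f.apply(head), tail.map(f));
            }
        }
    }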

Algebraic Data Types

This linked list is a form of algebraic data type. Algebraic data types are made up of "products" (groups of fields — records or structs) or "sums" (disjoint unions) of other types. In particular, the ImmutableList type is the sum of EmptyList and NonEmptyList (since an instance comes from one set or the other), while NonEmptyList is the product of its field types (the generic type T for head and ImmutableList<T> for tail). EmptyList is a single-element set. Note that "product", in this context effectively means Cartesian product. Really, algebraic data types describe a set of possible values (in the mathematical definition of "set").

In mathematics, we tend not to talk about elements of a set "doing" things. Instead, we have functions and operators that act on elements of a set and produce other elements (either from the same set or from some other set). Thus, the object-oriented notion of member functions or methods doesn't really map well. Instead, we might define a function over a sum type by specifying partial functions over the underlying types and producing a resulting total function. We basically say, "If you live in this bucket, return something, whereas if you live in this other disjoint bucket, return something else".

Pattern Matching

In the orthodox functional programming world, this matching by partial functions is exactly what happens, by matching the "patterns" that describe the underlying types of a sum type. Before digging deeper, I think it helps to look at the same example above, written in Scala in a more purely functional way. (Since Scala was built to compile to JVM bytecode, it still has classes and objects, and our types are still effectively Java classes.)

There are several things to note in comparison to the Java code above:

  • Neither EmptyList nor NonEmptyList has any methods. They don't "do" anything.
  • EmptyList is a singleton by virtue of the Scala object keyword. It really is a single-element set.
  • The sealed trait keyword pair basically means "Everything that implements this interface is defined in this file". Since the MyImmutableList interface defines no methods, you can consider it purely a marker interface. Thus, we have the same level of implementation-hiding as we had with our private inner classes above. As a bonus, it's illegal for a class in another Scala file to announce that it implements MyImmutableList, which cannot be said for our abstract class in the Java example above.
  • The operations head, tail, prepend, and map are defined separately operating on a MyImmutableList as input.
  • For head, tail, and map, the implementations need to decide which particular part of the disjoint union of the sum type applies. They do this by pattern matching against the underlying types. This is logically equivalent to inheritance polymorphism in the object-oriented example.
  • For prepend no pattern match is necessary. The logic is the same in both cases. This is equivalent to defining the implementation in the abstract base class in the object-oriented example.
  • Pattern matching captures variables against the fields of the product type NonEmptyList, such that we can use them on the right-hand side of the case expression. By convention, if we don't care about a particular field, we use _ as the variable name.

In the end, it's another way of saying the same thing. That said, let's look at what each approach gives us. For this next bit, I'm totally stealing from Professor Dan Grossman and material I learned from his Programming Languages course on Coursera. As a "thank you" for that knowledge he gave to me for free, I encourage you to take his future courses and use up more of his time. I'm sure he will appreciate it.

First, let's consider the types and the operations, since that's what we want to implement:

               head   tail   prepend   map
EmptyList       ?      ?        ?       ?
NonEmptyList    ?      ?        ?       ?

In the object-oriented case, we basically implement the question marks by rows. That is, we implement one of EmptyList or NonEmptyList, filling in all of the methods, and then implement the other. If a particular column (or most of it) shares code, we implement it in the base class (overriding when needed).

In the functional case (using pattern matching), we implement the question marks by column. For each operation, we describe how it will work on each of the underlying types. If a particular operation doesn't need to distinguish between the types (as with prepend), we don't pattern match at all.

Another way of looking at this table is that the columns are verbs, and the rows are nouns.

Conclusions

Which approach is better? Given that my blog is about little other than functional programming, you might expect me to say "the functional approach". The answer, of course, is "It depends". Given that we spend way more time modifying code, refactoring it, and updating it, you need to ask yourself, looking at the above table, "Will I be adding more rows, or will I be adding more columns?". In most cases, you'll be adding more columns — that is, describing more operations on your data. There are, however, cases where you add more rows. In particular, I believe that it's no mistake that object-oriented programming took over at the same time as the "GUI revolution". Specifically, most graphical elements (widgets) can be summed up by two operations: paint and handleEvent. In that case, you are more likely to be adding rows, as you define different kinds of widgets. That said, I consider it unfortunate that today's "server revolution" and service-oriented architectures continue to shoehorn things into object-orientation, even when better approaches exist (and predate object-oriented programming).

In short, what would you rather your code do? Make things (create nouns)? Or do things (create verbs)? In practice, I believe most of us would rather focus on building actions, but work in development environments where we develop objects.

For some fun reading, I suggest that anyone who works in an object-oriented language (and especially folks working in Java) read Execution in the Kingdom of Nouns, a nice tongue-in-cheek look at Java anti-patterns like AbstractSingletonProxyFactoryBean (though I believe that article predates that particular monstrosity).

In the end, I am being quite disingenuous, and downright unfair to object-oriented programming. There are cases where grouping types under a single hierarchy (as one does in object-oriented programming) makes sense, even in the functional world. Specifically, my purely-functional Scala example is not how the Scala standard library implements immutable lists. Instead, they take an object-oriented approach, since it means that map can be implemented in terms of base traits like Iterable, so the same code can be reused for other types. Similarly, Haskell (the purest of semi-mainstream functional programming languages) has typeclasses, which are not object-oriented classes by any stretch of the imagination, but they allow you to define functions that take an instance of any type, so long as the type has certain associated functions that are defined in the given typeclass. (Okay... they're kind of like saying, "I will take an instance of any type that implements this particular interface".) All that to say, while I'm gently suggesting that the object-oriented mindset is not always the correct one, it still definitely has its place, even in the functional programming world.

Appendix

Are you curious about what the above Scala example compiles to? It turns out that it's actually not very exciting (using Scala 2.9). It's pretty much like what you would hand-write in Java (except that the Scala compiler doesn't actually call Java 1.7's Objects.equals, but rather inlines similar logic):


Wednesday, April 24, 2013

F%$*ing Lambdas, How Do They Work?

Update: A friend was reading over this and noted that I should have linked to my two previous posts on Project Lambda, to provide more context:

On my Twitter feed, I recently got a link to this document by Brian Goetz explaining how JDK8 translates lambdas into bytecode. I was surprised, since I had assumed that lambdas were simply syntactic sugar for anonymous inner classes that implement one of the functional interfaces, much as anonymous functions in Scala currently work. That is, I thought that Project Lambda was basically doing the same thing as I've been doing with my functional types, but eliminating the annoying boilerplate that goes along with an anonymous class and using better type inference.

It turns out that it's considerably more interesting than that, and makes use of the InvokeDynamic bytecode instruction added in JDK7. It inspired me to dig a little deeper with javap and try writing up an explanation in simpler terms (and only covering the simple case), since that article is pretty heavy (though I have to admit that I spent a solid hour on a Saturday night going through it with glee — I'll claim that my standards for a fun Saturday night have lowered considerably with a small child at home, rather than admitting that I'm a huge nerd).

How don't lambdas work?

I think it's clearer to understand how lambdas work by first considering how they don't work. That is, they're not just a more succinct way of writing anonymous inner classes.

Throughout the rest of this post, I'll use the following example:
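
The gist with the example isn't shown here. Based on the discussion below (the captured Integer i, the addition in lambda$6, and the boxed 5 passed to apply), it presumably looked something like this; the literal values are my guesses:

    import java.util.function.Function;

    public class SimpleLambda {
        public static void main(String[] args) {
            Integer i = 10;
            Function<Integer, Integer> f = n -> n + i;  // captures i from the enclosing scope
            System.out.println(f.apply(5));
        }
    }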

While only a few lines in total, this example could desugar to something a fair bit messier. It's useful to note that our simple lambda is capturing the variable i from its scope.

Suppose lambdas were simply a nicer way to write anonymous classes. The above example would be equivalent to:
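
Again hedging (the original gist isn't shown), the equivalent anonymous-class version would be along these lines:

    import java.util.function.Function;

    public class AnonymousFunction {
        public static void main(String[] args) {
            final Integer i = 10;
            Function<Integer, Integer> f = new Function<Integer, Integer>() {
                @Override
                public Integer apply(Integer n) {
                    return n + i;  // the captured variable must be (effectively) final
                }
            };
            System.out.println(f.apply(5));
        }
    }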

If you've ever looked at the output directory from compiling code with anonymous functions, you would notice that the above code would compile to AnonymousFunction.class and AnonymousFunction$1.class. If we run javap AnonymousFunction\$1.class (escaping the $ so the shell doesn't think I'm referencing a variable), we get the following output:

So, we have a class that holds the variable i as a field (passed to the constructor), and two versions of the apply method, with and without type erasure. (The Object version just casts its input to an Integer and invokes the Integer version.)

More explicitly, anonymous functions are also a kind of syntactic sugar. In effect, AnonymousFunction.java compiles down to something like the following:
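
A hedged sketch of that desugaring, consistent with the InnerClassFunction$One class file mentioned below:

    import java.util.function.Function;

    public class InnerClassFunction {
        private static class One implements Function<Integer, Integer> {
            private final Integer i;

            One(Integer i) {
                this.i = i;  // the captured variable arrives via the constructor
            }

            @Override
            public Integer apply(Integer n) {
                return n + i;
            }
        }

        public static void main(String[] args) {
            Integer i = 10;
            Function<Integer, Integer> f = new One(i);
            System.out.println(f.apply(5));
        }
    }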

The output of running javap AnonymousFunction\$1.class looks nearly identical (save the One vs. 1 difference) to javap InnerClassFunction\$One.class.

So, how do lambdas actually work?

It's worth noting that the contents of my compiled output directory (with file sizes) are as follows:

Interesting... no secondary classes are produced by compiling the original SimpleLambda code. So, how does it work?

The code of the lambda is moved to a private static method called (in my case) lambda$6, which takes n and i as parameters and does the addition. That's the easy part. What does main actually compile to? Let's ask javap (using the -v flag):

I think of this bytecode as divided into three parts:

  1. Bytecode lines (distinct from GitHub gist lines) 0 through 6 take care of assigning local variable i and putting it on the stack.
  2. Lines 7, 8, and 13 load i from the stack, do an invokedynamic to actually create the Function, then push it onto the stack.
  3. Lines 14 through 27 take care of retrieving the constant System.out, loading the Function from the stack, boxing 5 into an Integer, calling our Function.apply method, and calling println.

We're most interested in line 8, which triggers the creation of the Function. Let's see what #3 points to (near the top of the javap output):

#3 = InvokeDynamic #0:#40 // #0:lambda$:(Ljava/lang/Integer;)Ljava/util/function/Function;

So, it's pointing to #0 with the expectation of a CallSite that takes an Integer and returns a Function (effectively a Function factory). Let's look at #0:

This is basically binding the last three arguments for a call to LambdaMetaFactory.metaFactory (samMethod, implMethod, and instantiatedMethodType) to our desired "single abstract method" (#37 Function.apply), the implementation method (#38, the previously-mentioned private static lambda$6), and the type arguments for the Function.apply method (#39 — it takes an Integer (within parentheses) and returns an Integer (outside the parentheses)). The first three arguments are supplied by the invokedynamic call itself. (I believe arguments 2 and 3, the lambda$ method name and the (Ljava/lang/Integer;)Ljava/util/function/Function; type, are bound by #3, above.)

Right now, if I look at the code for LambdaMetaFactory.metaFactory, it's basically constructing a new InnerClassLambdaMetafactory and asking it to generate a CallSite. So, at runtime, we're generating an inner class that implements Function<Integer, Integer>, takes an Integer as a constructor parameter, and implements apply by calling lambda$6 with its field and the argument to apply.

Let's confirm this by creating a class that outputs this stuff at runtime:
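
The original class isn't shown; a stand-in that surfaces the same information (the runtime class of the lambda, its captured field, and the call path into the lambda body) could be:

    import java.lang.reflect.Field;
    import java.util.function.Function;

    public class LambdaRuntimeInspection {
        public static void main(String[] args) {
            Integer i = 10;
            Function<Integer, Integer> f = n -> {
                Thread.dumpStack();  // shows which generated method calls our lambda body
                return n + i;
            };
            System.out.println("Runtime class: " + f.getClass());
            for (Field field : f.getClass().getDeclaredFields()) {
                System.out.println("Captured field: " + field);  // the Integer i
            }
            System.out.println(f.apply(5));
        }
    }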

And here is the output from running that:

In this case, our static method got called lambda$0, but you can see that it's called from the apply method of an inner class that holds an Integer field, which it probably received through its constructor (since it's a final field). Effectively, at runtime, we're getting the behaviour from InnerClassFunction.java from above (except that the actual code makes one more method call to the static method), which in turn is the same as AnonymousFunction.java, even though that's not what lambdas compile to. The one "major" difference is that the generated class doesn't implement the Integer version of apply, sticking purely to the erased version (which makes sense, since the checked cast is already required before calling lambda$0).

Wait... what?

In a very roundabout way, I hope that I've shown you that lambdas are not equivalent to anonymous classes (which are equivalent to inner classes that capture local variables via constructor arguments), but they are actually equivalent at runtime... for now. The truth is that this somewhat convoluted invokedynamic approach leaves things open for improvements in the future.

The current implementation of Project Lambda (at the time of writing) does, in fact, generate inner classes at runtime. However, by using the invokedynamic call to LambdaMetaFactory, future runtimes will be able to provide a different implementation that may have better performance, and still work with code compiled now (or at least after JDK8 eventually ships). As a secondary bonus, by not performing this expansion at compile time, compilation should be faster and the generated bytecode will be smaller. (Even with this trivial example, SimpleLambda compiled to 20% less bytecode than InnerClassFunction. I believe the difference would be more significant with more lambdas.)

Conclusions

I hope that this post has made you think more about how lambdas will work, along with how anonymous classes and inner classes already work.

It's worth noting that Scala currently compiles functions to inner classes, though there has been some discussion about using the invokedynamic approach there as well.

Thursday, February 7, 2013

More fun with Project Lambda

What's new in Project Lambda this week?

It seems that the Project Lambda developers have conspired to ensure that my blog will always be out of date. I recently wrote about my first impressions from using Project Lambda, based on build b74. Now that I'm writing this, b75 is out and they've renamed the Block, IntBlock, DoubleBlock, and various BiBlock interfaces to replace the word Block with Consumer. I suppose that's more consistent with Supplier, since a Consumer consumes a value and returns nothing, while a Supplier takes no input and produces something. It also ties in well with the pipeline metaphor they're using for Streams (which will still need to wait for a later post, since I got sidetracked playing with default methods in interfaces this week).

Additionally, they've added several more interfaces to java.util.function to accommodate Functions and BiFunctions that return primitive values (still to avoid the overhead of boxing, presumably). It's interesting to compare these specialized (and specially-named) interfaces to Scala's use of the @specialized annotation (discussed here, for example) to avoid the same problem.

Pimping Collections

I'm horribly abusing a term that has grown popular in the Scala community for the case where you appear to add functionality to an existing type by providing an implicit view that converts objects of the existing type to objects of the enhanced type. This is referred to as the pimp my library pattern (a term that goes back several years, used by Prof. Odersky himself here). What I'm doing here isn't implicit, but it does show a way that we can use interfaces with default methods to layer functionality onto library types with relatively little ceremony.

In some of my previous posts (like this one and especially this one), I talked about some nice higher-order functions on collections that most functional programming language environments provide as part of their standard library. The fact is that most of these higher-order functions can be implemented in terms of iterating through elements. Since Java 1.5, the standard collections (except Map and its implementations) all implement the Iterable interface. Wouldn't it be great if we could add these higher-order functions as methods on existing collection types? We can't really do that (without compiling our own version of the standard library), but we can inject the iterator method from the built-in types as a lambda to construct a type that does have this functionality.

Yo dawg, I herd you like iterators

Let's extend the Iterable interface with an AugmentedIterable interface that defines additional functionality based on the iterator method:
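
A hedged sketch of such an interface (the original defined more operations; map, foldLeft, and toList are enough to show the shape, and augment is the utility method discussed below):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.function.BiFunction;
    import java.util.function.Function;

    public interface AugmentedIterable<T> extends Iterable<T> {

        // Lazy: the returned iterable's iterator (defined by a lambda)
        // delegates to this iterable's iterator.
        default <R> AugmentedIterable<R> map(Function<? super T, ? extends R> f) {
            return () -> {
                Iterator<T> source = this.iterator();
                return new Iterator<R>() {
                    @Override
                    public boolean hasNext() {
                        return source.hasNext();
                    }

                    @Override
                    public R next() {
                        return f.apply(source.next());
                    }
                };
            };
        }

        // Eager: consumes the underlying iterator immediately.
        default <R> R foldLeft(R initial, BiFunction<R, ? super T, R> f) {
            R accumulator = initial;
            for (T element : this) {
                accumulator = f.apply(accumulator, element);
            }
            return accumulator;
        }

        // Eager: copies the elements into a List.
        default List<T> toList() {
            List<T> result = new ArrayList<>();
            for (T element : this) {
                result.add(element);
            }
            return result;
        }

        // Inject the iterator method of any existing Iterable (as a method
        // reference) to pick up the extra functionality.
        static <T> AugmentedIterable<T> augment(Iterable<T> iterable) {
            return iterable::iterator;
        }
    }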

As you can see, many of the methods return a new AugmentedIterable whose iterator method (defined by a lambda) delegates to the iterator method of the original AugmentedIterable. I put an iterator in your iterator, so you can iterate while you iterate.

Note that this implementation with delegated iterators has the side-effect of making most of these methods lazy (except toList and foldLeft, which don't produce new AugmentedIterables).

Let's add a quick utility method to construct an AugmentedIterable from an existing Iterable, and test our methods:
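
With the augment helper from the sketch above, a quick test might look like:

    import java.util.Arrays;
    import java.util.List;

    public class AugmentedIterableTest {
        public static void main(String[] args) {
            List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
            AugmentedIterable<Integer> augmented = AugmentedIterable.augment(numbers);

            System.out.println(augmented.map(n -> n * n).toList());         // [1, 4, 9, 16, 25]
            System.out.println(augmented.foldLeft(0, (sum, n) -> sum + n)); // 15
        }
    }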

Conclusions

This still isn't quite as nice as the "pimp my library" pattern, as we still need an explicit call to augment to convert our Collection into an AugmentedIterable. Furthermore, I didn't really need an interface with default methods to do this, as I believe I could have made AugmentedIterable an abstract class. Where things get interesting is that we could add more functionality derived from Iterable in another interface and then mix them together in a new interface that inherits from both. Alternatively, I could create new collection types that extend the built-in types but also add the AugmentedIterable behaviour:
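
For example (hypothetical, following the sketch above), such a type could be as simple as:

    import java.util.ArrayList;

    // ArrayList already provides iterator(), so the default methods of
    // AugmentedIterable come along for free.
    public class AugmentedArrayList<T> extends ArrayList<T> implements AugmentedIterable<T> {
    }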


Tuesday, January 29, 2013

Project Lambda - First Impressions

Before launching into the actual post, I would like to make a quick plug for Martin Odersky's "Functional Programming Principles in Scala" course on Coursera. It will begin March 25th and run for 7 weeks. I took the course during its previous run (September-November last year), and really enjoyed it. No prior knowledge of Scala is required. The material is well-presented and quite interesting. Sign up now!

Moving on to today's topic

I created an impromptu poll on Google+ to see what I should write about next, not really expecting any suggestions (especially since most of my Google+ friends only seem to check in roughly once per month or so). As it happens, I got a vote for Project Lambda, the upcoming support for functional programming features planned for JDK8. So, with that unanimous support, I downloaded a prerelease (b74) and started porting the functional programming framework I've been developing in earlier posts.

Mapping concepts

Taking a look at that new java.util.function package, you find a fair number of interfaces (I count 38 in b74), but they're mostly specializations of the following few core interfaces:

  • Supplier: Declares a get() method that returns a value. Effectively interface-equivalent to Function0 from my previous posts. There are specialized versions (BooleanSupplier, DoubleSupplier, IntSupplier, LongSupplier, but no love for floats) to return unboxed primitive values.
  • Function: Has an apply method that takes an input of type T and returns an output of type R. Analogous to Function1 from my previous posts. Has subinterface UnaryOperator (for functions where the return type is the same as the input type), and many primitive type specializations (to avoid boxing), including Predicate (for functions that return boolean), IntFunction, LongFunction, DoubleFunction, plus many combinations of single-input functions for different input/output types (where at least one is primitive).
  • BiFunction: Analogous to the Function2 type I described in earlier posts, these take two arguments and return a value (where, in the most general case, each input and output type may be different). Has a subinterface BinaryOperator for the special case when all three types are the same, plus specialized versions of BinaryOperator for each of the primitive types.
  • Block and BiBlock: Like Function and BiFunction, but covering the case where nothing is returned (so the code must produce a side-effect). I handled these cases in my previous post by using functions that return the empty tuple (Tuple0). Again, there are several specialized interfaces to handle primitive inputs to the block.

Lambda syntax

By far, the biggest change you'll find in JDK8 from Project Lambda is the new syntax added to make working with the above interfaces much less verbose. (My previous posts serve as an example of how ugly it can be to create anonymous classes to wrap up a little bit of logic in a function object.) Here is a "before and after" example, using an anonymous class versus the lambda syntax:
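
The original gist isn't shown, so here's a hypothetical before-and-after of my own:

    import java.util.function.Function;

    public class BeforeAndAfter {
        public static void main(String[] args) {
            // Before: an anonymous class implementing the Function interface
            Function<String, Integer> lengthBefore = new Function<String, Integer>() {
                @Override
                public Integer apply(String s) {
                    return s.length();
                }
            };

            // After: the same function, as a lambda
            Function<String, Integer> lengthAfter = s -> s.length();

            System.out.println(lengthBefore.apply("hello"));  // 5
            System.out.println(lengthAfter.apply("hello"));   // 5
        }
    }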

I think we can all agree that the second conveys just as much useful information without the noise of the anonymous class boilerplate. IntelliJ even gives me a warning, telling me that the first version can be replaced with a lambda. This syntax can be applied to any anonymous implementation of an interface where there's a single method to implement (which includes a number of AWT or Swing EventListeners, as well as Runnable and Callable). I would argue that the lambda syntax will be pretty useful even if you're not interested in functional programming.

Since many existing codebases probably have static utility methods that behave as functions, there is also syntax to use these methods as function objects:
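
For example (a hypothetical snippet, not the original gist):

    import java.util.function.Function;

    public class StaticMethodReference {
        public static void main(String[] args) {
            // An existing static method, used directly as a function object:
            Function<Integer, Integer> abs = Math::abs;
            System.out.println(abs.apply(-5));  // 5
        }
    }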

Similarly, an instance method without arguments can implicitly be treated as a function where the instance is the input and the result of invoking the method on the instance is the output (note that IntFunction<String> takes a String as input and returns an int):
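
Another hypothetical snippet, using the boxed Function for simplicity (for what it's worth, the b74-era IntFunction naming mentioned above ended up shipping in the final JDK8 as ToIntFunction):

    import java.util.function.Function;

    public class InstanceMethodReference {
        public static void main(String[] args) {
            // The instance becomes the input: "lambda".length() == 6
            Function<String, Integer> length = String::length;
            System.out.println(length.apply("lambda"));
        }
    }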

Default methods

If you paid close attention to my previous posts (or looked at the code in GitHub), you may have noticed that most of my function types were classes. This was because I wanted to be able to implement certain helper methods that would work if you provided the actual logic of a function. For example, Function1 has a compose method that would allow you to define composition of functions. Similarly, the function classes have the logic for partial application and currying.

This is kind of annoying, as you can only inherit from a single class, whereas you can implement multiple interfaces. It would be much nicer to be able to define some logic that comes "for free" once the method(s) of an interface are defined. In particular, this is how Scala's traits work (and was one of the features that made my Why I like Scala list). As it happens, this is another feature coming to Java with Project Lambda.

There is a new keyword, default, that can be added before an interface method that allows you to define a body for the method. From what I can tell, while this introduces some multiple inheritance of behaviour to Java, it still isn't quite as flexible as Scala's multiple inheritance of traits (which avoids the diamond problem by defining a linearization of inheritance). For example, the following doesn't compile for me:
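
A reconstruction consistent with the compiler error below (the original gist isn't preserved):

    public class DefaultTest {
        interface Hello {
            default void print() { System.out.println("Hello"); }
        }

        interface World {
            default void print() { System.out.println("World"); }
        }

        // Does not compile: HelloWorld inherits unrelated defaults for print().
        static class HelloWorld implements Hello, World {
        }
    }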

The compiler error is

java: class DefaultTest.HelloWorld inherits unrelated defaults for print() from types DefaultTest.Hello and DefaultTest.World

That said, default methods still allow some rich inheritance of interfaces, and could be used to significantly improve the collections API (much as the use of traits has given Scala a much better collections API). Based on this FAQ answer, which refers to new interfaces java.util.Sized and java.util.Traversable, it looks like this cleanup was planned at some point, but those interfaces don't exist in the b74 build of Project Lambda. I don't know if they backtracked on improving collections, or if those changes are excluded from current builds.

New collection types

In a previous post, I covered the Option type (taking the name from Scala). As it happens, Project Lambda includes the java.util.Optional type, which covers pretty much the same niche.

They've also added a hefty java.util.stream package that handles lazily-evaluated sequences, complete with evaluation pipelines (designed to accommodate concurrent evaluation). I won't go into details on streams, for the following reasons:

  • I want to discuss streams (as a type) in a later post, since they're pretty easy to implement.
  • This post is already going to be too long, and streams would roughly double the length.
  • I haven't had time to fully investigate pipelines, so anything I write wouldn't be doing them justice.

What's missing?

So far, Project Lambda doesn't seem to include helpers for partial application of functions, or lazy evaluation. Luckily, those are both pretty easy to implement, and they're much nicer to write with the new lambda syntax.

Lazy Evaluation

To implement lazy evaluation, let's revisit the Function0 idea, but this time, we'll make it an interface. Of course, interfaces still can't hold data, so we'll "remember" earlier evaluations using a static WeakHashMap (relying on identity equality) and take advantage of the hot new computeIfAbsent default method added to Map.
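
A hedged sketch of what that might look like (the real gist isn't shown; note that a bare WeakHashMap isn't thread-safe, which is fine for a toy):

    import java.util.Map;
    import java.util.WeakHashMap;
    import java.util.function.Supplier;

    public interface Function0<R> extends Supplier<R> {
        // Interfaces can't hold per-instance state, so memoize results in a
        // static map, keyed on the Function0 instance itself (identity equality).
        Map<Function0<?>, Object> RESULTS = new WeakHashMap<>();

        R evaluate();  // the actual (possibly expensive) computation

        @Override
        @SuppressWarnings("unchecked")
        default R get() {
            return (R) RESULTS.computeIfAbsent(this, key -> evaluate());
        }
    }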

Next, let's create a Function1 that extends Function and adds the ability to take a Supplier (which may or may not be a Function0) as input to apply. This will allow us to eagerly evaluate a previous lazy result. Whereas previously, I made the return value always lazy (where you could then force evaluation with get()), the built-in Function interface we're extending performs eager evaluation. So, we'll expose lazy evaluation via the applyLazy methods. I'll use the T,R ordering of type arguments, since my earlier R,T ordering was silly and inconsistent with what everyone else does.
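
Something along these lines (hedged; applyLazy matches the description above, the rest is my guess):

    import java.util.function.Function;
    import java.util.function.Supplier;

    public interface Function1<T, R> extends Function<T, R> {
        // Nothing is computed until someone calls get() on the result; a lazy
        // input (e.g. a Function0) is only forced at that point too.
        default Function0<R> applyLazy(Supplier<T> lazyInput) {
            return () -> apply(lazyInput.get());
        }

        default Function0<R> applyLazy(T input) {
            return () -> apply(input);
        }
    }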

Partial Application

In my previous implementation, I had an abstract base class called Curryable which implemented partial application in terms of currying. Then, it was the responsibility of each FunctionN class to implement curry to produce a Function1 that binds the first argument and returns a FunctionN-1. Unfortunately, while writing this post, I discovered a fatal flaw in that implementation that actually invalidates one of my examples in my previous post. Specifically, that implementation of partial evaluation always evaluated the partial arguments eagerly, so one of the page downloads started in the middle of constructing the lazy result of the Function2. (In retrospect, I should have been suspicious when I found that my asynchronous implementation was consistently faster than my earlier concurrent implementation. Of course it was, it had already downloaded a page before the timer started.) We'll do partial application more directly this time, so that laziness works.

Anyway, let's define Function2 in terms of BiFunction, with the additional partial application methods. Function3 and up are analogous, but need to be defined from scratch, since there are no built-in interfaces for functions of more than two arguments.
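
A hedged sketch of Function2; bind1 (my name for it) partially applies the first argument, with a Supplier overload so that laziness is preserved:

    import java.util.function.BiFunction;
    import java.util.function.Supplier;

    public interface Function2<T1, T2, R> extends BiFunction<T1, T2, R> {
        // Partially apply the first argument lazily: nothing is forced until
        // the resulting Function1 is actually applied.
        default Function1<T2, R> bind1(Supplier<T1> first) {
            return t2 -> apply(first.get(), t2);
        }

        default Function1<T2, R> bind1(T1 first) {
            return t2 -> apply(first, t2);
        }
    }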

Let's confirm that this works, and play some more with lambda syntax by implementing foldLeft as a static method, binding it to a function object (a Function3, specifically), and then using partial application to produce a Function1 that concatenates strings:
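
A hedged version of that exercise, with a minimal Function3 defined inline:

    import java.util.Arrays;

    public class FoldLeftExample {
        interface Function3<T1, T2, T3, R> {
            R apply(T1 t1, T2 t2, T3 t3);

            default Function2<T2, T3, R> bind1(T1 first) {
                return (t2, t3) -> apply(first, t2, t3);
            }
        }

        static <T, R> R foldLeft(Function2<R, T, R> f, R initial, Iterable<T> elements) {
            R accumulator = initial;
            for (T element : elements) {
                accumulator = f.apply(accumulator, element);
            }
            return accumulator;
        }

        public static void main(String[] args) {
            // Bind the static method to a Function3 object...
            Function3<Function2<String, String, String>, String, Iterable<String>, String> fold =
                    FoldLeftExample::foldLeft;
            Function2<String, String, String> concat = (a, b) -> a + b;
            // ...then partially apply the folding function and the seed,
            // leaving a Function1 that concatenates any strings we give it.
            Function1<Iterable<String>, String> concatAll = fold.bind1(concat).bind1("");
            System.out.println(concatAll.apply(Arrays.asList("fold", "Left", "!")));  // foldLeft!
        }
    }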

That's considerably easier than the hoops I jumped through to implement a higher-order version of foldLeft previously.

Conclusions

So far, Project Lambda looks like it could actually make functional programming much more viable on Java. While my examples here haven't done anything that I haven't been able to do with my existing toy framework, the syntax alone is a massive improvement. As an example of the impact of good syntax, I immediately spotted my mistake with my earlier Curryable implementation when I started to write it with lambdas. Previously, I had missed the mistake amidst all of the anonymous function boilerplate.

I'm still not convinced that it's on par with Scala, though, since it largely seems to be catching up to where Scala was a few years ago, and you're still stuck with the old "everything is mutable" collections framework. Various features of the Scala compiler (e.g. type inference, special handling of apply) allow you to write even less boilerplate than what you get with Project Lambda. Still, if Java developers get used to this stuff (and like it), it might make it easier to convince more of them to drop the uglier legacy parts of Java and move to Scala (which still keeps the good old JVM and JIT compiler).

That said, since I imagine that I'll be stuck mostly working in Java for the foreseeable future, I can only hope that JDK8 comes out soon, so I can use these goodies in my professional life.

Finally, remember that what I've covered here is only scratching the surface. This is what I've been able to pick up playing with Project Lambda in about a week, between when my daughter falls asleep and heading to bed myself. For more details, check out the Project Lambda homepage, and the resources linked from there.


Sunday, January 13, 2013

Asynchronous Evaluation

As promised at the end of my previous post, I've added (what I think is) a fairly simple asynchronous execution structure.

It's worth noting that this post actually has nothing to do with functional programming (especially since I will be producing side-effects). It just happens that what I've written works because of lazy evaluation, which itself is not a functional programming concept, so much as something that just happens to come up in some functional programming environments (most notably Haskell). The beauty of programming without side-effects is that your logic remains the same, regardless of the actual evaluation semantics. In turn, this makes it easier to "make it work, make it right, make it fast". This post largely deals with the last part.

Adding callbacks to a concurrent evaluation

Originally, when I wrote this, I had planned to write an AsyncFunction0 class, which would decorate a Function0 with asynchronous callback evaluation, and still support Function0 semantics. Upon further thought, I figured it would be easier to keep the asynchronous evaluation behaviour totally separate. Instead, I opted to create an AsyncEvaluator class which would:

  • be based on a Function0 and an Executor, and be explicitly invokable to trigger evaluation of the result lazily defined by the Function0 on the Executor.
  • be able to accept callbacks, which will be called (on the same thread as evaluation) once the Function0 has been evaluated.
  • support convenient chaining of Function0 evaluations, so evaluation of one result could trigger evaluation of later results.

For callbacks, rather than defining a new type, we'll use a function that takes the result of evaluation as input, but it doesn't really make sense to return something -- to whom will we be returning it? Instead, the callback should "do" something. That is, it should have a side-effect. So, the callback should return void, or something to that effect. Since we already have tuples, we can depict the lack of a value by the empty tuple, or Tuple0. This is a singleton:
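
A sketch of Tuple0 (the original class isn't shown, but there isn't much room for variation):

    public final class Tuple0 {
        public static final Tuple0 INSTANCE = new Tuple0();

        private Tuple0() {
            // the empty tuple: exactly one value exists
        }
    }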

With that, we can move on to the core code of AsyncEvaluator:
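
The core class isn't preserved here, so this is a hedged sketch. It reuses the Function0, Function1, and Tuple0 sketches from earlier in this archive, and (anachronistically, for brevity) JDK8 lambda syntax; the real 2013 code would have used anonymous classes.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Executor;

    public class AsyncEvaluator<T> {
        private final Function0<T> function;
        private final Executor executor;
        private final List<Function1<T, Tuple0>> callbacks = new ArrayList<>();
        private T result;
        private boolean invoked = false;
        private boolean done = false;

        public AsyncEvaluator(Function0<T> function, Executor executor) {
            this.function = function;
            this.executor = executor;
        }

        // Explicitly trigger evaluation of the lazy value on the Executor (at most once).
        public synchronized void invoke() {
            if (invoked) {
                return;
            }
            invoked = true;
            executor.execute(() -> {
                T value = function.get();
                List<Function1<T, Tuple0>> toNotify;
                synchronized (this) {
                    result = value;
                    done = true;
                    toNotify = new ArrayList<>(callbacks);
                }
                // Callbacks run on the same thread that performed the evaluation.
                for (Function1<T, Tuple0> callback : toNotify) {
                    callback.apply(value);
                }
            });
        }

        // Register a callback; if the result is already available, call it immediately.
        public void addCallback(Function1<T, Tuple0> callback) {
            boolean alreadyDone;
            synchronized (this) {
                alreadyDone = done;
                if (!alreadyDone) {
                    callbacks.add(callback);
                }
            }
            if (alreadyDone) {
                callback.apply(result);
            }
        }
    }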

This is simple enough, and still useful. You could write an asynchronous event handler that would respond to a UI event (like a button click) by creating a Function0 that returns a result, wrapping it in an AsyncEvaluator, and then adding a callback that writes the result to the UI. Before your event handler returns, it would call invoke to trigger the background processing.

Chaining function evaluations

Suppose you've written a nice, working program that runs well on a single thread, but you see various pieces that could run concurrently. Assuming you have lazy values representing certain "checkpoints" in the process, it would be nice to be able to say, "When result A has been calculated, start evaluating B and C (which depend on A), then when B finishes, kick off D. When C and D are both done, invoke E." We can do that, without changing any actual logic. Effectively, we just need to trigger evaluation of the Function0s based on the completed evaluation of earlier values.

First, we'll look at creating simple or branching chains. That is, from a single Function0 that has been wrapped in an AsyncEvaluator, we can trigger one or more later Function0 evaluations:
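
Hedged again; these could be two more methods on the AsyncEvaluator class above (chainEvaluation is my guess at the name):

    // When this value has been evaluated, trigger evaluation of `next`.
    public <R> AsyncEvaluator<R> chainEvaluation(Function0<R> next) {
        return chainEvaluation(next, executor);  // default: reuse this Executor
    }

    public <R> AsyncEvaluator<R> chainEvaluation(Function0<R> next, Executor nextExecutor) {
        AsyncEvaluator<R> nextEvaluator = new AsyncEvaluator<>(next, nextExecutor);
        addCallback(ignored -> {
            nextEvaluator.invoke();
            return Tuple0.INSTANCE;
        });
        return nextEvaluator;
    }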

We may not want all evaluations to occur on the same Executor, so we allow an alternative Executor to be specified for the chained evaluation. If no Executor is specified, we'll use the same one used to evaluate the current result.

Here is a silly unit test that counts to ten asynchronously:
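
A stand-in for that test (not the original), blocking on a CountDownLatch so the main thread doesn't exit before the chain finishes:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class CountToTen {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService executor = Executors.newFixedThreadPool(2);
            CountDownLatch done = new CountDownLatch(1);

            AsyncEvaluator<Integer> first = new AsyncEvaluator<>(() -> print(1), executor);
            AsyncEvaluator<Integer> current = first;
            for (int i = 2; i <= 10; i++) {
                final int n = i;
                current = current.chainEvaluation(() -> print(n));
            }
            current.addCallback(lastValue -> {
                done.countDown();
                return Tuple0.INSTANCE;
            });

            first.invoke();  // kicks off the whole chain
            done.await();    // only a test should ever block like this
            executor.shutdown();
        }

        private static Integer print(int n) {
            System.out.println(n);
            return n;
        }
    }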

What if we have a value that is constructed based on multiple earlier asynchronously computed values? In that case, we want the new value to be computed as soon as the last of the previous values is available:
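
One hedged way to express that join: count down the outstanding dependencies and invoke the dependent evaluator when the count hits zero.

    import java.util.concurrent.atomic.AtomicInteger;

    public final class AsyncJoin {
        // Trigger `dependent` once every dependency has produced its value.
        // (The dependent's Function0 presumably reads those values lazily.)
        public static void invokeWhenAll(AsyncEvaluator<?> dependent,
                                         AsyncEvaluator<?>... dependencies) {
            AtomicInteger remaining = new AtomicInteger(dependencies.length);
            for (AsyncEvaluator<?> dependency : dependencies) {
                whenDone(dependency, () -> {
                    if (remaining.decrementAndGet() == 0) {
                        dependent.invoke();  // the last dependency just finished
                    }
                });
            }
        }

        private static <D> void whenDone(AsyncEvaluator<D> dependency, Runnable action) {
            dependency.addCallback(ignored -> {
                action.run();
                return Tuple0.INSTANCE;
            });
        }
    }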

With that, we can write an asynchronous version of our example from the previous post, where we fetch two webpages and sum their byte counts:

As mentioned in the code comments, neither of these tests really reflects what you would actually want to do with an asynchronous execution path. In the real world, your final callback would push the result as a network response or output it to the user, such that you would never actually need to block while waiting for an answer.

Conclusion

Asynchronous processing is a pretty powerful tool, made quite popular (and certainly more familiar) thanks to AJAX calls from the browser. Now that many of us are comfortable with this approach, it's not a bad idea to add more asynchronous processing to our server-side or desktop software. In particular, while a blocked thread uses fewer resources than a busy-waiting thread (or a thread that periodically polls for a result), it's better to only use a thread when there's stuff to be done. In the second unit test example above, we still have two threads blocking on the HttpURLConnection responses. A better (more scalable) program would use Java NIO to fire off the requests and react to the responses asynchronously. Unfortunately, that's outside the scope of the current post.