Saturday, June 30, 2012

Immutable Hash Trie Maps in Java

In this post, I aim to introduce another immutable data structure, following up on my posts on immutable binary trees ([1] and [2]) and immutable lists. We'll be looking at hash tries, which provide an efficient way of implementing an immutable HashMap-like structure. Like a HashMap, we can provide more or less constant time operations in the absence of hash collisions, and linear time when collisions occur. The catch is that, in practice, our constants will be considerably larger. That said, we'll have a data structure that requires no synchronization when accessed from multiple threads, since it cannot be modified in-place.

Our hash trie map will be constructed as a tree with a high branching factor. If you haven't already read the posts on immutable binary trees, I suggest you do so, especially the second one. In particular, we're going to be relying heavily on specialized node types to save memory and split our logic into different cases based on polymorphic behaviour.

What is a trie?

Tries are a kind of tree where the path from the root to a given node encodes some earlier information. A nice use for tries is to store dictionaries in a relatively efficient way. Your root node represents the empty string, and it has 26 children, for the letters of the alphabet. Each node could be implemented as an array of 26 (possibly null) elements, plus a boolean to indicate whether that node represents a valid word. The word 'cat' would be encoded by the third-level node reached by going to 'c', then 'a', then 't', from the root. Since 'cat' is a word, the boolean value at that node would be true. By comparison, the node for 'coug' would be false, but would need to exist as a parent for 'cough'.
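The dictionary trie described above can be sketched in a few lines of (mutable, purely illustrative) Java -- 26 child slots per node, plus the boolean word flag:

```java
// Illustrative sketch of a dictionary trie: each node has 26 child
// slots (one per lowercase letter) and a flag marking whether the path
// from the root to this node spells a valid word. This is a mutable
// toy, not part of the immutable structure built later in the post.
public class DictionaryTrie {
    private final DictionaryTrie[] children = new DictionaryTrie[26];
    private boolean isWord;

    public void add(String word) {
        DictionaryTrie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                node.children[i] = new DictionaryTrie();
            }
            node = node.children[i];
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        DictionaryTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children[c - 'a'];
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}
```

Adding 'cough' creates the intermediate node for 'coug' with its flag set to false, exactly as described above.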

Our hash trie map will work on a somewhat similar principle, except that our valid "words" will be hash codes for the keys contained in the map.

An overview of what our tree will look like

To implement our trie structure, we'll have nodes that branch over part of the hash code of the elements (BranchedArrayHashNode), a singleton EmptyHashNode that represents a lack of children for a given partial hash, an EntryHashNode that represents a single key/value pair, and a ListHashNode that contains a linked list of key/value pairs whose keys' hashes collide. As a useful optimization, we'll implement a specialized SingleArrayHashNode for internal nodes who only have one child, so we don't unnecessarily allocate an entire array.

Like with the immutable binary trees, an add or remove operation will descend the tree until it reaches a (possibly empty) leaf node, which will provide its own replacement. Then, as the recursive calls return, the parents will provide their replacements (which will reuse the other children), until we get a new root node. In this way, we can guarantee (in the absence of a collision) running time that is logarithmic over the possible hash codes. Since the space of possible hash codes is fixed, the maximum depth of the tree is log(2^32) (which would be 32 with a branching factor of 2, but we'll use a branching factor of 2^5 = 32, guaranteeing a maximum depth of 7). This is where we get the claim of constant running time in the absence of collisions. (In the event of a collision, we hit a ListHashNode and we add a linear traversal of the list.)

An example of how we'll store our data

The above image depicts a hash trie map with 5 values stored in it (represented by the five EntryHashNode leaves). We break up the 32-bit hash codes of the keys into 5-bit chunks (with a 2-bit chunk at the deepest level), and use the sequence of 5/10/15/20/25/30-bit prefixes in our trie. It's probably clearest if I describe the five entries' hash codes:

  • The (1) entry has a hash code that begins with 11001. It is the only node with this prefix, so it can be stored directly below the root.
  • Entries (2) and (3) both have hash codes that begin with 10110, but entry (2)'s hash code starts with 10110 00011, while entry (3)'s hash code starts with 10110 10101.
  • Entries (4) and (5) both have hash codes that start with 01101 00101. Since we have at most 32 branches at each level, each level can only encode 5 bits of hash code, so we stick in a SingleArrayHashNode to avoid wasting 31 empty array elements on the shared 01101 prefix. You can see that the first 15 bits of entry (4)'s hash code are therefore 01101 00101 01011, while entry (5)'s hash code begins with 01101 00101 11101.

I hope this example explains the basic principle a little more clearly. (I've read a number of explanations of hash tries, but none have struck me as particularly good. In particular, several of them try to "simplify" down to 1 bit per level, creating a binary tree. I actually found that more confusing, personally.)

The interface and implementation

The basic ImmutableMap interface is similar to that of java.util.Map, except that the put and remove operations need to return the new map. Also, rather than having get return null (when no entry is found), I've decided to make it return an Option (described in a previous post).
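As a sketch, the interface looks something like the following, with a bare-bones Option standing in for the one from the earlier post (the exact signatures here are illustrative):

```java
// Minimal stand-in for the Option type from the earlier post.
abstract class Option<T> {
    abstract boolean isDefined();
    abstract T get();

    static <T> Option<T> some(final T value) {
        return new Option<T>() {
            boolean isDefined() { return true; }
            T get() { return value; }
        };
    }

    static <T> Option<T> none() {
        return new Option<T>() {
            boolean isDefined() { return false; }
            T get() { throw new IllegalStateException("None.get"); }
        };
    }
}

// Sketch of the ImmutableMap interface: put and remove return the new
// map rather than mutating in place, and get returns an Option rather
// than null.
interface ImmutableMap<K, V> {
    ImmutableMap<K, V> put(K key, V value);
    ImmutableMap<K, V> remove(K key);
    Option<V> get(K key);
}
```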

My implementation starts with an abstract base class, ImmutableHashTrieMap, which implements these operations based on our sliding 5-bit window looking at hash codes:

As you can see, the window size is hardcoded as a constant, so we can experiment to find the optimal branch factor. (Smaller branch factors mean deeper trees, while bigger branch factors lead to larger array copies when replacing a branched node.)

Then we just need to implement each of the different concrete node types, with an understanding of how they get replaced when adding and removing nodes.

EmptyHashNode

The simplest node type is EmptyHashNode, which is a singleton (since all empty nodes are the same after type erasure).

The operations on EmptyHashNode are as follows:

  • put: Returns a new EntryHashNode containing the given key/value pair.
  • remove: Returns this, since there is nothing to remove.
  • get: Returns None, since no entry may be found here.
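The singleton trick deserves a quick sketch: since all empty nodes look identical after type erasure, one instance can be shared across all type parameters, with the unchecked cast confined to a factory method (the names here are my assumptions):

```java
// Sketch of the EmptyHashNode singleton: one shared instance serves
// every <K, V> instantiation, since type erasure makes them all the
// same object at runtime. The unchecked cast is safe because the node
// holds no data of either type.
public final class EmptyHashNode<K, V> {
    private static final EmptyHashNode<?, ?> INSTANCE = new EmptyHashNode<>();

    private EmptyHashNode() {}

    @SuppressWarnings("unchecked")
    public static <K, V> EmptyHashNode<K, V> instance() {
        return (EmptyHashNode<K, V>) INSTANCE;
    }
}
```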

EntryHashNode

Each EntryHashNode holds a single key/value pair. It defines the operations as follows:

  • put:
    • If the given key is equal to this entry's key, return a new EntryHashNode, replacing this node.
    • If the given key is not equal to this entry's key, but they have the same hash code, return a new ListHashNode containing this entry and the new entry.
    • Otherwise, return a new ArrayHashNode (Branched or Single, depending on whether the current hash code windows collide) with this entry and the new entry as children.
  • remove: If the given key matches this entry's key, return the EmptyHashNode. Otherwise, the given key was not found, so return this.
  • get: If the given key equals this entry's key, return this entry's value. Otherwise, return None.

ListHashNode

ListHashNode is used in the unfortunate event of a hash collision (that is, different keys, same hash code). It is implemented as an immutable list of key/value pairs (using the Tuple2 class from a previous post).

The operations on ListHashNode are as follows:

  • put:
    • If the given key's hash code is different from that of the entries in this node (since they all have the same hash code), return a new ArrayHashNode (Branched or Single, depending on whether the current hash code windows collide) with this node and a new EntryHashNode (for the new entry) as children.
    • If the given key is equal to the key of an entry in this node, return a new ListHashNode of the same size, with the new entry replacing the old entry.
    • If the given key is not equal to the key of any entry in this node, but the hash code is the same as that of the entries in this node, return a new ListHashNode with the new entry prepended to this node's entries.
  • remove:
    • If the given key is not equal to the key of any entries in this node, return this.
    • If the given key is equal to the key of an entry in this node, return a new ListHashNode containing the list of entries minus the removed one. However, if the new list would have size 1, return a new EntryHashNode instead.
  • get: If the given key matches the key of an entry in this node, return that node's value. Otherwise, return None.

This is the last of the leaf nodes. Next, we get into the internal nodes, BranchedArrayHashNode and SingleArrayHashNode. Neither of these actually holds data. They simply hold references to deeper nodes, grouped by partial hash codes. I'm going to leave off the diagrams for those nodes, since they get pretty complex (the internal nodes have children, so the pictures grow quickly).

ArrayHashNode

Both SingleArrayHashNode and BranchedArrayHashNode inherit from ArrayHashNode, a marker abstract class that indicates that they are non-leaf nodes. ArrayHashNodes need to know how to compute a partial hash code from a full 32-bit hash code and a shift parameter. The formula I used is hash >> shift & 0x1F, which actually takes the lowest 5 bits at each shift level. (So, in my example tree above, I lied a little -- we're not actually looking at prefixes, but rather a sequence of 5-bit suffixes. This is usually more effective for many of Java's builtin hashCode functions, since they tend to vary heavily on the low bits and less on the high bits. This makes sense when your hash-based structures rely on modular arithmetic, like the builtin implementation of HashMap. For more randomly-distributed hashCode implementations, the choice of how you slice your bits makes no difference.)
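A small demonstration of the partial hash formula, walking up the shift levels on an example 20-bit hash code (the method name is mine):

```java
// Demonstrates the sliding 5-bit window: hash >> shift & 0x1F extracts
// a 5-bit slice, starting from the low bits. (>> binds tighter than &,
// so this parses as (hash >> shift) & 0x1F. The & 0x1F mask also keeps
// the result in the 0-31 range even for negative hash codes.)
public class PartialHashDemo {
    static int partialHash(int hash, int shift) {
        return hash >> shift & 0x1F;
    }

    public static void main(String[] args) {
        int hash = 0b11101_01011_00101_01101; // 20 example bits
        // Level 0 sees the lowest 5 bits, level 1 the next 5, and so on.
        System.out.println(partialHash(hash, 0));  // 0b01101 = 13
        System.out.println(partialHash(hash, 5));  // 0b00101 = 5
        System.out.println(partialHash(hash, 10)); // 0b01011 = 11
        System.out.println(partialHash(hash, 15)); // 0b11101 = 29
    }
}
```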

SingleArrayHashNode

This represents an internal node that has a single child (so we don't waste a 32-element array on that one child). That child is in one of 32 possible branches, so we need to keep track of which branch it's in, given by the partial hash code that it and all of its children share. Therefore, SingleArrayHashNode stores the child index (in the 0-31 range) and the child node. The child of a SingleArrayHashNode should always be an ArrayHashNode, whether another SingleArrayHashNode or a BranchedArrayHashNode. If a SingleArrayHashNode had a leaf node child, that child could simply be promoted toward the root, replacing the SingleArrayHashNode. By this axiom, we can guarantee that a chain of SingleArrayHashNodes ends in a BranchedArrayHashNode.

Here are the operations on a SingleArrayHashNode:

  • put:
    • If the 5-bit partial hash code of the given key matches this node's partial hash code, then delegate the put to this node's child and return a new SingleArrayHashNode with the result of that operation as a child.
    • Otherwise return a new BranchedArrayHashNode with this node's child and a new EntryHashNode (to hold the new entry) as children.
  • remove:
    • If the 5-bit partial hash code of the given key does not match this node's partial hash, return this. There is nothing to remove.
    • Otherwise, delegate the remove call to this node's child. If the resulting node is not an instance of ArrayHashNode, return it (since it's a leaf node and doesn't need a SingleArrayHashNode as a parent). Otherwise, return a new SingleArrayHashNode with the new child node (and the same partial hash).
  • get: If the 5-bit partial hash code of the given key does not match this node's partial hash, return None. Otherwise, return the result of delegating the call to this node's child.

BranchedArrayHashNode

This is the node type that actually makes the whole structure work. It doesn't store any data itself, but it provides a dispatch mechanism to zero in on the subtree that contains the relevant (partial) hash code. The BranchedArrayHashNode has an array of 32 ImmutableHashTrieMaps (meaning nodes), where unoccupied branches point to the EmptyHashNode. Since we have SingleArrayHashNode, we guarantee that BranchedArrayHashNode has at least two non-empty children.

The operations are comparatively expensive, since puts and removes usually involve creating a copy of the 32-element array and replacing a single child.

Here are the operations:

  • put: Delegate the put call to the relevant child, based on the 5-bit partial hash code of the given key. Return a new BranchedArrayHashNode with that child replaced with the outcome of the delegated call. If the relevant child was previously empty, the size of the new BranchedArrayHashNode is incremented by 1.
  • remove: Determine the relevant child based on the 5-bit partial hash code of the given key. If the relevant child is already empty, return this. Otherwise, delegate the call to the relevant child. If the returned node is empty (i.e. the child was an EntryNode that was removed), the new size will be 1 less than the current size, otherwise the new size will be the old size. If the new size is 1, return a new SingleArrayHashNode containing the remaining child. Otherwise, return a new BranchedArrayHashNode with the relevant child replaced by the outcome of the delegated remove call.
  • get: Return the result of delegating the call to the relevant child, given by the 5-bit partial hash code. (If the relevant child is the EmptyHashNode, it will take care of returning None.)
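The copy-on-write step behind a branched node's put can be sketched on its own: copy the 32-slot child array, replace one slot, and build the new node around the copy, leaving the old array (and therefore the old node) untouched. The helper name here is mine:

```java
import java.util.Arrays;

// Sketch of the array copy-and-replace used when a BranchedArrayHashNode
// provides its replacement: the original array is never modified, so
// the old version of the map remains valid and thread-safe.
public class CopyOnWriteChildren {
    public static Object[] withChildReplaced(
            Object[] children, int index, Object newChild) {
        Object[] copy = Arrays.copyOf(children, children.length);
        copy[index] = newChild;
        return copy;
    }
}
```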

Conclusions

You can check out the complete implementation on GitHub, here.

I ran some microbenchmarks, adding and removing the integers from 0 to 1 million (in random order), comparing against the builtin Java HashMap. I found that my ImmutableHashTrieMap was about 2.5 times slower. Increasing to 10 million, I was 5 times slower.

I think I might be able to improve on my time by adding classes to handle other internal nodes with a small number of branches (like what I did with SingleArrayHashNode). I may try implementing specialized internal node classes with up to four child branches, since I believe the cost of checking each child would be less expensive than the additional array copies involved in working with BranchedArrayHashNode.


Monday, June 11, 2012

Intro to Functional Programming (with examples in Java) - Part 5

Tuples

In this post, we're going to cover a less-exciting, but still surprisingly useful concept from the world of functional programming. Tuples are a way of producing "Cartesian products" of types. Thinking about it mathematically, types can be characterized as sets of valid values. Like a mathematical Cartesian product, where we define a composite value that draws a component from each of the underlying sets, a tuple is an object that encapsulates objects of each of the underlying types.

What are tuples?

Tuples are objects that group other objects together according to some logical need. They have very little in the way of functionality on their own. That is, you don't want a tuple that "does stuff". It's just data. In a way, Java beans can be considered a form of tuple -- they group together a fixed set of (heterogeneous) fields so they can be passed around. I kind of think of them like structs in C. They have fields but no "behaviour". (Our tuples will have methods, but they will be convenience things, like equals, hashCode, and toString.)

What are tuples not?

Tuples are not general-purpose collections. They are not lists of objects of some type, since they are made up of a fixed-sized set of heterogeneous components.

What are tuples good for?

Tuples are useful for returning a (fixed-size, related) group of values without having to create a new class. Suppose you have a private method that needs to return a String and an int. Normally in Java, you would need to create a small wrapper class to encapsulate the two return values, which isn't necessarily a bad thing. If the two values go together, the "thing" that they represent together should probably have a name, to improve readability. That said, since this hypothetical method is private, you know it's only going to be used within the current class, so you can choose instead to return a Tuple2<String, Integer>. Similarly, there are times within a method body where you may end up constructing a pair of arrays of different types, and iterating through them with a common index (with checks and double-checks to make sure that the arrays really have the same length). Instead, if you create a single array of tuples, your new "composite type" still doesn't escape the scope of the method.

A simple Tuple class

Here is a simple, immutable 2-dimensional Tuple class:
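A sketch of such a class follows -- the accessor names (getFirst/getSecond) are my choice here, and the null-handling details are one reasonable option among several:

```java
// A simple, immutable two-element tuple: two final fields, accessors,
// and equals/hashCode/toString defined in terms of the elements.
public final class Tuple2<T1, T2> {
    private final T1 first;
    private final T2 second;

    public Tuple2(T1 first, T2 second) {
        this.first = first;
        this.second = second;
    }

    public T1 getFirst() { return first; }
    public T2 getSecond() { return second; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Tuple2)) return false;
        Tuple2<?, ?> that = (Tuple2<?, ?>) o;
        return equal(first, that.first) && equal(second, that.second);
    }

    private static boolean equal(Object a, Object b) {
        return a == null ? b == null : a.equals(b);
    }

    @Override
    public int hashCode() {
        return 31 * (first == null ? 0 : first.hashCode())
                + (second == null ? 0 : second.hashCode());
    }

    @Override
    public String toString() {
        return "(" + first + ", " + second + ")";
    }
}
```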

It's pretty simple -- it's like an immutable struct in C, where the fields don't have names, but they have positions. Let's see an example of this tuple class in action:

In this test, we managed to do a simple operation on a 2D integral vector (returning a 2D integral vector) and operate on a "person", without defining an IntegerVector2D class or a Person class. That said, in practice, if you're frequently dealing with vectors or people, you should probably have classes. In particular, Tuple2<String, Integer> doesn't tell you what the String and Integer types represent. You could totally pass a product SKU and inventory count to describePerson and have it tell you "DV014533423H is 56 years old". As I mentioned above, tuples are most useful for those cases where you want to group values at a very local scope (i.e. as local variables, where a private static class would, by necessity, escape the method scope, or, worst-case, as a return value of a private method).

Making tuple construction less painful

The construction of tuples in the above test case is pretty verbose, partly thanks to the repeated type parameters (which are largely eliminated in Java 1.7), but also because of the new keyword and the specification of the exact size of tuple we want. Instead, we can make factory methods in a TupleUtils class as follows:
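The factory-method idea can be sketched like this -- type inference on static methods spares callers from repeating the type parameters (the nested Tuple2 is a minimal stand-in for the full class above):

```java
// Sketch of a TupleUtils factory class: overloaded static tuple(...)
// methods let the compiler infer type parameters at the call site.
public final class TupleUtils {
    private TupleUtils() {}

    public static <T1, T2> Tuple2<T1, T2> tuple(T1 first, T2 second) {
        return new Tuple2<T1, T2>(first, second);
    }
    // ...similar overloads for each arity: tuple(a, b, c), etc.

    // Minimal stand-in for the Tuple2 class shown earlier.
    public static final class Tuple2<T1, T2> {
        public final T1 first;
        public final T2 second;

        Tuple2(T1 first, T2 second) {
            this.first = first;
            this.second = second;
        }
    }
}
```

Compare TupleUtils.tuple("Alice", 30) with new Tuple2<String, Integer>("Alice", 30) -- the factory call carries no type-parameter noise at all.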

This allows us to rewrite the above test case a bit more succinctly:

Where are the functions?

Since this is a series on functional programming, we should actually see some functions, right? Well, there's an interesting analogue between tuples and function parameter lists. In particular, function parameter lists are heterogeneous, positional values grouped together for a common purpose. Instead of passing a list of parameters to a function, you could pass a single tuple value. In effect, every FunctionN<R, T1, ..., TN> is isomorphic to a Function1<R, TupleN<T1, ..., TN>>. Furthermore, this isomorphism can be described in code by decorating FunctionN with a tupled Function1 and vice-versa, as follows:
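For the two-argument case, the isomorphism can be sketched as follows, with minimal stand-ins for the Function1/Function2 and Tuple2 types from this series:

```java
// Minimal stand-ins for the function interfaces from earlier posts.
interface Function1<R, T> { R apply(T t); }
interface Function2<R, T1, T2> { R apply(T1 t1, T2 t2); }

// Minimal stand-in for Tuple2.
final class Tuple2<A, B> {
    final A _1;
    final B _2;
    Tuple2(A a, B b) { _1 = a; _2 = b; }
}

// Sketch of the tupled/untupled isomorphism: a Function2 becomes a
// Function1 over a Tuple2, and vice versa, without losing anything.
final class TupleFunctions {
    static <R, T1, T2> Function1<R, Tuple2<T1, T2>> tupled(
            final Function2<R, T1, T2> f) {
        return new Function1<R, Tuple2<T1, T2>>() {
            public R apply(Tuple2<T1, T2> t) {
                return f.apply(t._1, t._2);
            }
        };
    }

    static <R, T1, T2> Function2<R, T1, T2> untupled(
            final Function1<R, Tuple2<T1, T2>> f) {
        return new Function2<R, T1, T2>() {
            public R apply(T1 t1, T2 t2) {
                return f.apply(new Tuple2<T1, T2>(t1, t2));
            }
        };
    }
}
```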

There are the functions! Now what?

Recall from my third post in this series, where I presented the MemoizedFunctionN subclasses of FunctionN, and pointed out that other functionality could be added by decorating existing functions. Unfortunately, for every new piece of functionality you want to add according to that pattern, you need to generate wrapped versions of all FunctionN, potentially making use of custom currying on the functions of 2 or more parameters.

Using tupled and untupled, you can define your decorator in a Function1 subclass, and tuple your FunctionN down to a Function1, decorate it, then untuple the decorated class.

As an example, let's create a decorator that logs the input and return values for every call to a given Function1. It will need a Function1 to wrap, a stream to which it will log, and a name for the function (so the log messages will have some context).
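A sketch of that decorator (class and parameter names are my assumptions, with a stand-in Function1 interface for self-containment):

```java
import java.io.PrintStream;

// Minimal stand-in for the Function1 interface from earlier posts.
interface Function1<R, T> { R apply(T t); }

// Decorator that logs every call to the wrapped function: the argument
// on entry, and the result on exit, tagged with the function's name.
class LoggingFunction1<R, T> implements Function1<R, T> {
    private final Function1<R, T> wrapped;
    private final PrintStream log;
    private final String name;

    LoggingFunction1(Function1<R, T> wrapped, PrintStream log, String name) {
        this.wrapped = wrapped;
        this.log = log;
        this.name = name;
    }

    public R apply(T t) {
        log.println(name + "(" + t + ") called");
        R result = wrapped.apply(t);
        log.println(name + "(" + t + ") returned " + result);
        return result;
    }
}
```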

Let's confirm that this works with a Function1, before trying functions of more parameters.

Now, let's try it with a Function2 that is tupled and untupled:

I wanted to rewrite memoization in terms of tupling and untupling functions, but my previous work at implementing all functions of more than one parameter via currying gets in the way (along with the whole notion of needing a recursion target). That said, memoization is a bit of an anomaly in terms of the complexity of decorating functionality. Most other cases I can think of are like this logging example, where tupling and untupling "just work", and you don't need to create your own curried chain of LoggingFunctionN equivalents.

Conclusions

The major takeaway from this post is the fact that tuples provide another (arguably simpler) way of looking at functions of more than one parameter as being equivalent to Function1 (compared to currying). Tuples can also provide a convenient way of returning more than one value without having to create a new class to hold the composite value.

To avoid writing TupleN classes by hand, I created a TupleGenerator class. You give it a source directory, a target package, and a maximum number of tuple dimensions (Scala supports up to 22), and it will generate for each dimension a reasonable TupleN class with hashCode, equals, and toString defined in terms of the underlying elements. It also creates the TupleUtils utility class with overloaded static methods for tuple (to instantiate new tuples), tupled (to turn a FunctionN<R, T1, ..., TN> into a Function1<R, TupleN<T1, ..., TN>>), and untupled (to turn a Function1<R, TupleN<T1, ..., TN>> into a FunctionN<R, T1, ..., TN>).