Saturday, June 30, 2012

Immutable Hash Trie Maps in Java

In this post, I aim to introduce another immutable data structure, following up on my posts on immutable binary trees ([1] and [2]) and immutable lists. We'll be looking at hash tries, which provide an efficient way of implementing an immutable HashMap-like structure. Like a HashMap, we can provide more or less constant time operations in the absence of hash collisions, and linear time when collisions occur. The catch is that, in practice, our constants will be considerably larger. That said, we'll have a data structure that requires no synchronization when accessed from multiple threads, since it cannot be modified in-place.

Our hash trie map will be constructed as a tree with a high branching factor. If you haven't already read the posts on immutable binary trees, I suggest you do so, especially the second one. In particular, we're going to be relying heavily on specialized node types to save memory and split our logic into different cases based on polymorphic behaviour.

What is a trie?

Tries are a kind of tree where the path from the root to a given node encodes some earlier information. A nice use for tries is to store dictionaries in a relatively efficient way. Your root node represents the empty string, and it has 26 children, for the letters of the alphabet. Each node could be implemented as an array of 26 (possibly null) elements, plus a boolean to indicate whether that node represents a valid word. The word 'cat' would be encoded by the third-level node reached by going to 'c', then 'a', then 't', from the root. Since 'cat' is a word, the boolean value at that node would be true. By comparison, the node for 'coug' would be false, but would need to exist as a parent for 'cough'.

Our hash trie map will work on a somewhat similar principle, except that our valid "words" will be hash codes for the keys contained in the map.

An overview of what our tree will look like

To implement our trie structure, we'll have nodes that branch over part of the hash code of the elements (BranchedArrayHashNode), a singleton EmptyHashNode that represents a lack of children for a given partial hash, an EntryHashNode that represents a single key/value pair, and a ListHashNode that contains a linked list of key/value pairs whose keys' hashes collide. As a useful optimization, we'll implement a specialized SingleArrayHashNode for internal nodes who only have one child, so we don't unnecessarily allocate an entire array.

Like with the immutable binary trees, an add or remove operation will descend the tree until it reaches a (possibly empty) leaf node, which will provide its own replacement. Then, as the recursive calls return, the parents will provide their replacements (which will reuse the other children), until we get a new root node. In this way, we can guarantee (in the absence of a collision) running time that is logarithmic over the possible hash codes. Since the space of possible hash codes is fixed, the maximum depth of the tree is log(2^32) (which would be 32 with a branching factor of 2, but we'll use a branching factor of 2^5 = 32, guaranteeing a maximum depth of 7). This is where we get the claim of constant running time in the absence of collisions. (In the event of a collision, we hit a ListHashNode and we add a linear traversal of the list.)

An example of how we'll store our data

The above image depicts a hash trie map with 5 values stored in it (represented by the five EntryHashNode leaves). We break up the 32-bit hash codes of the keys into 5-bit chunks (with a 2-bit chunk at the deepest level), and use the sequence of 5/10/15/20/25/30-bit prefixes in our trie. It's probably clearest if I describe the five entries' hash codes:

  • The (1) entry has a hash code that begins with 11001. It is the only node with this prefix, so it can be stored directly below the root.
  • Entries (2) and (3) both have hash codes that begin with 10110, but entry (2)'s hash code starts with 10110 00011, while entry (3)'s hash code starts with 10110 10101.
  • Entries (4) and (5) both have hash codes that start with 01101 00101. Since we have at most 32 branches at each level, so can only encode 5 bits of hash code, we stick in a SingleArrayHashNode to avoid wasting 31 empty array elements on the shared 01101 prefix. You can see that the first 15 bits of entry (4)'s hash code are therefore 01101 00101 01011, while entry (5)'s hash code begins with 01101 00101 11101.

I hope this example explains the basic principle a little more clearly. (I've read a number of explanations of hash tries, but none have struck me as particularly good. In particular, several of them try to "simplify" down to 1 bit per level, creating a binary tree. I actually found that more confusing, personally.)

The interface and implementation

The basic ImmutableMap interface is similar to that of java.util.map, except that the put and remove operations need to return the new map. Also, rather than having get return null (when no entry is found), I've decided to make it return an Option (described in a previous post).

My implementation starts with an abstract base class, ImmutableHashTrieMap, which implements these operations based on our sliding 5-bit window looking at hash codes:

As you can see, the window size is hardcoded as a constant, so we can experiment to find the optimal branch factor. (Smaller branch factors mean deeper trees, while bigger branch factors lead to larger array copies when replacing a branched node.)

Then we just need to implement each of the different concrete node types, with an understanding of how they get replaced when adding and removing nodes.

EmptyHashNode

The simplest node type is EmptyHashNode, which is a singleton (since all empty nodes are the same after type erasure).

The operations on EmptyHashNode are as follows:

  • put: Returns a new EntryHashNode containing the given key/value pair.
  • remove: Returns this, since there is nothing to remove.
  • get: Returns None, since no entry may be found here.

EntryHashNode

Each EntryHashNode holds a single key/value pair. It defines the operations as follows:

  • put:
    • If the given key is equal to this entry's key, return a new EntryHashNode, replacing this node.
    • If the given key is not equal to this entry's key, but they have the same hash code, return a new ListHashNode containing this entry and the new entry.
    • Otherwise, return a new ArrayHashNode (Branched or Single, depending on whether the current hash code windows collide) with this entry and the new entry as children.
  • remove: If the given key matches this entry's key, return the EmptyHashNode. Otherwise, the given key was not found, so return this.
  • get: If the given key equals this entry's key, return this entry's value. Otherwise, return None.

ListHashNode

ListHashNode is used in the unfortunate event of a hash collision (that is, different keys, same hash code). It is implemented as an immutable list of key/value pairs (using the Tuple2 class from a previous post).

The operations on ListHashNode are as follows:

  • put:
    • If the given key's hash code is different from that of the entries in this node (since they all have the same hash code), return a new ArrayHashNode (Branched or Single, depending on whether the current hash code windows collide) with this node and a new EntryHashNode (for the new entry) as children.
    • If the given key is equal to the key of an entry in this node, return a new ListHashNode of the same size, with the new entry replacing the old entry.
    • If the given key is not equal to the key of any entry in this node, but the hash code is the same as that of the entries in this node, return a new ListHashNode with the new entry prepended to this node's entries.
  • remove:
    • If the given key is not equal to the key of any entries in this node, return this.
    • If the given key is equal to the key of an entry in this node, return a new ListHashNode containing the list of entries minus the removed one. However, if the new list would have size 1, return a new EntryHashNode instead.
  • get: If the given key matches the key of an entry in this node, return that node's value. Otherwise, return None.

This is the last of the leaf nodes. Next, we get into the internal nodes, BranchedArrayHashNode and SingleArrayHashNode. Neither of these actually hold data. They simply hold references to deeper nodes, grouped by partial hash codes. I'm going to leave off the diagrams for those nodes, since they get pretty complex (since the internal nodes have children).

ArrayHashNode

Both SingleArrayHashNode and BranchedArrayHashNode inherit from ArrayHashNode, a marker abstract class that indicates that they are non-leaf nodes. ArrayHashNodes need to know how to compute a partial hash code from a full 32-bit hash code and a shift parameter. The formula I used is hash >> shift & 0x1F, which actually takes the lowest 5 bits at each shift level. (So, in my example tree above, I actually lied -- we're not actually looking at prefixes, but rather a sequence of 5-bit suffixes. This is usually more efficient for many of Java's builtin hashCode functions, since they tend to vary heavily on the low bits and less on the high bits. This makes sense when your hash-based structures rely on modular arithmetic, like the builtin implementation of HashMap. For more randomly-distributed hashCode implementations, the choice of how you slice your bits makes no difference.)

SingleArrayHashNode

This represents an internal node that has a single child (so we don't waste a 32-element array on that one child). That child is in one of 32 possible branches, so we need to keep track of which branch it's in, given by the partial hash code that it and all of its children share. Therefore, SingleArrayHashNode stores the child index (in the 0-31 range) and the child node. The child of a SingleArrayHashNode should always be an ArrayHashNode, whether another SingleArrayHashNode or a BranchedArrayHashNode. If a SingleArrayHashNode had a leaf node child, that child could simply be promoted toward the root, replacing the SingleArrayHashNode. By this axiom, we can guarantee that a chain of SingleArrayHashNodes ends in a BranchedArrayHashNode.

Here are the operations on a SingleArrayHashNode:

  • put:
    • If the 5-bit partial hash code of the given key matches this node's partial hash code, then delegate the put to this node's child and return a new SingleArrayHashNode with the result of that operation as a child.
    • Otherwise return a new BranchedArrayHashNode with this node's child and a new EntryHashNode (to hold the new entry) as children.
  • remove:
    • If the 5-bit partial hash code of the given key does not match this node's partial hash, return this. There is nothing to remove.
    • Otherwise, delegate the remove call to this node's child. If the resulting node is not an instance of ArrayHashNode, return it (since it's a leaf node and doesn't need a SingleArrayHashNode as a parent). Otherwise, return a new SingleArrayHashNode with the new child node (and the same partial hash).
  • get: If the 5-bit partial hash code of the given key does not match this node's partial hash, return None. Otherwise, return the result of delegating the call to this node's child.

BranchedArrayHashNode

This is the node type that actually makes the whole structure work. It doesn't store any data itself, but it provides a dispatch mechanism to zero in on the subtree that contains the relevant (partial) hash code. The BranchedArrayHashNode has an array of 32 ImmutableHashTrieMaps (meaning nodes), where unoccupied branches point to the EmptyHashNode. Since we have SingleArrayHashNode, we guarantee that BranchedArrayHashNode has at least two non-empty children.

The operations are comparatively expensive, since puts and removes usually involve creating a copy of the 32-element array and replacing a single child.

Here are the operations:

  • put: Delegate the put call to the relevant child, based on the 5-bit partial hash code of the given key. Return a new BranchedArrayHashNode with that child replaced with the outcome of the delegated call. If the relevant child was previously empty, the size of the new BranchedArrayHashNode is incremented by 1.
  • remove: Determine the relevant child based on the 5-bit partial hash code of the given key. If the relevant child is already empty, return this. Otherwise, delegate the call to the relevant child. If the returned node is empty (i.e. the child was an EntryNode that was removed), the new size will be 1 less than the current size, otherwise the new size will be the old size. If the new size is 1, return a new SingleArrayHashNode containing the remaining child. Otherwise, return a new BranchedArrayHashNode with the relevant child replaced by the outcome of the delegated remove call.
  • get: Return the result of delegating the call to the relevant child, given by the 5-bit partial hash code. (If the relevant child is the EmptyHashNode, it will take care of returning None.)

Conclusions

You can check out the complete implementation on GitHub, here.

I ran some microbenchmarks, adding and removing the integers from 0 to 1 million (in random order), comparing against the builtin Java HashMap. I found that my ImmutableHashTrieMap was about 2.5 times slower. Increasing to 10 million, I was 5 times slower.

I think I might be able to improve on my time by adding classes to handle other internal nodes with a small number of branches (like what I did with SingleArrayHashNode). I may try implementing specialized internal node classes with up to four child branches, since I believe the cost of checking each child would be less expensive than the additional array copies involved in working with BranchedArrayHashNode.


No comments:

Post a Comment

Note: Only a member of this blog may post a comment.