In this post, I aim to introduce another immutable data structure, following up on my posts on immutable binary trees ([1] and [2]) and immutable lists. We'll be looking at hash tries, which provide an efficient way of implementing an immutable HashMap
-like structure. Like a HashMap
, we can provide more or less constant time operations in the absence of hash collisions, and linear time when collisions occur. The catch is that, in practice, our constants will be considerably larger. That said, we'll have a data structure that requires no synchronization when accessed from multiple threads, since it cannot be modified in-place.
Our hash trie map will be constructed as a tree with a high branching factor. If you haven't already read the posts on immutable binary trees, I suggest you do so, especially the second one. In particular, we're going to be relying heavily on specialized node types to save memory and split our logic into different cases based on polymorphic behaviour.
What is a trie?
Tries are a kind of tree where the path from the root to a given node encodes some earlier information. A nice use for tries is to store dictionaries in a relatively efficient way. Your root node represents the empty string, and it has 26 children, for the letters of the alphabet. Each node could be implemented as an array of 26 (possibly null) elements, plus a boolean to indicate whether that node represents a valid word. The word 'cat' would be encoded by the third-level node reached by going to 'c', then 'a', then 't', from the root. Since 'cat' is a word, the boolean value at that node would be true. By comparison, the node for 'coug' would be false, but would need to exist as a parent for 'cough'.
Our hash trie map will work on a somewhat similar principle, except that our valid "words" will be hash codes for the keys contained in the map.
An overview of what our tree will look like
To implement our trie structure, we'll have nodes that branch over part of the hash code of the elements (BranchedArrayHashNode
), a singleton EmptyHashNode
that represents a lack of children for a given partial hash, an EntryHashNode
that represents a single key/value pair, and a ListHashNode
that contains a linked list of key/value pairs whose keys' hashes collide. As a useful optimization, we'll implement a specialized SingleArrayHashNode
for internal nodes who only have one child, so we don't unnecessarily allocate an entire array.
Like with the immutable binary trees, an add or remove operation will descend the tree until it reaches a (possibly empty) leaf node, which will provide its own replacement. Then, as the recursive calls return, the parents will provide their replacements (which will reuse the other children), until we get a new root node. In this way, we can guarantee (in the absence of a collision) running time that is logarithmic over the possible hash codes. Since the space of possible hash codes is fixed, the maximum depth of the tree is log(2^32) (which would be 32 with a branching factor of 2, but we'll use a branching factor of 2^5 = 32, guaranteeing a maximum depth of 7). This is where we get the claim of constant running time in the absence of collisions. (In the event of a collision, we hit a ListHashNode
and we add a linear traversal of the list.)
An example of how we'll store our data
The above image depicts a hash trie map with 5 values stored in it (represented by the five EntryHashNode
leaves). We break up the 32-bit hash codes of the keys into 5-bit chunks (with a 2-bit chunk at the deepest level), and use the sequence of 5/10/15/20/25/30-bit prefixes in our trie. It's probably clearest if I describe the five entries' hash codes:
- The (1) entry has a hash code that begins with 11001. It is the only node with this prefix, so it can be stored directly below the root.
- Entries (2) and (3) both have hash codes that begin with 10110, but entry (2)'s hash code starts with 10110 00011, while entry (3)'s hash code starts with 10110 10101.
- Entries (4) and (5) both have hash codes that start with 01101 00101. Since we have at most 32 branches at each level, so can only encode 5 bits of hash code, we stick in a
SingleArrayHashNode
to avoid wasting 31 empty array elements on the shared 01101 prefix. You can see that the first 15 bits of entry (4)'s hash code are therefore 01101 00101 01011, while entry (5)'s hash code begins with 01101 00101 11101.
I hope this example explains the basic principle a little more clearly. (I've read a number of explanations of hash tries, but none have struck me as particularly good. In particular, several of them try to "simplify" down to 1 bit per level, creating a binary tree. I actually found that more confusing, personally.)
The interface and implementation
The basic ImmutableMap interface is similar to that of java.util.map
, except that the put
and remove
operations need to return the new map. Also, rather than having get
return null
(when no entry is found), I've decided to make it return an Option
(described in a previous post).
My implementation starts with an abstract base class, ImmutableHashTrieMap
, which implements these operations based on our sliding 5-bit window looking at hash codes:
As you can see, the window size is hardcoded as a constant, so we can experiment to find the optimal branch factor. (Smaller branch factors mean deeper trees, while bigger branch factors lead to larger array copies when replacing a branched node.)
Then we just need to implement each of the different concrete node types, with an understanding of how they get replaced when adding and removing nodes.
EmptyHashNode
The simplest node type is EmptyHashNode
, which is a singleton (since all empty nodes are the same after type erasure).
The operations on EmptyHashNode
are as follows:
- put: Returns a new
EntryHashNode
containing the given key/value pair. - remove: Returns
this
, since there is nothing to remove. - get: Returns
None
, since no entry may be found here.
EntryHashNode
Each EntryHashNode
holds a single key/value pair. It defines the operations as follows:
- put:
- If the given key is equal to this entry's key, return a new
EntryHashNode
, replacing this node. - If the given key is not equal to this entry's key, but they have the same hash code, return a new
ListHashNode
containing this entry and the new entry. - Otherwise, return a new
ArrayHashNode
(Branched
orSingle
, depending on whether the current hash code windows collide) with this entry and the new entry as children.
- If the given key is equal to this entry's key, return a new
- remove: If the given key matches this entry's key, return the
EmptyHashNode
. Otherwise, the given key was not found, so returnthis
. - get: If the given key equals this entry's key, return this entry's value. Otherwise, return
None
.
ListHashNode
ListHashNode
is used in the unfortunate event of a hash collision (that is, different keys, same hash code). It is implemented as an immutable list of key/value pairs (using the Tuple2
class from a previous post).
The operations on ListHashNode
are as follows:
- put:
- If the given key's hash code is different from that of the entries in this node (since they all have the same hash code), return a new
ArrayHashNode
(Branched
orSingle
, depending on whether the current hash code windows collide) with this node and a newEntryHashNode
(for the new entry) as children. - If the given key is equal to the key of an entry in this node, return a new
ListHashNode
of the same size, with the new entry replacing the old entry. - If the given key is not equal to the key of any entry in this node, but the hash code is the same as that of the entries in this node, return a new
ListHashNode
with the new entry prepended to this node's entries.
- If the given key's hash code is different from that of the entries in this node (since they all have the same hash code), return a new
- remove:
- If the given key is not equal to the key of any entries in this node, return
this
. - If the given key is equal to the key of an entry in this node, return a new
ListHashNode
containing the list of entries minus the removed one. However, if the new list would have size 1, return a newEntryHashNode
instead.
- If the given key is not equal to the key of any entries in this node, return
- get: If the given key matches the key of an entry in this node, return that node's value. Otherwise, return
None
.
This is the last of the leaf nodes. Next, we get into the internal nodes, BranchedArrayHashNode
and SingleArrayHashNode
. Neither of these actually hold data. They simply hold references to deeper nodes, grouped by partial hash codes. I'm going to leave off the diagrams for those nodes, since they get pretty complex (since the internal nodes have children).
ArrayHashNode
Both SingleArrayHashNode
and BranchedArrayHashNode
inherit from ArrayHashNode
, a marker abstract class that indicates that they are non-leaf nodes. ArrayHashNode
s need to know how to compute a partial hash code from a full 32-bit hash code and a shift
parameter. The formula I used is hash >> shift & 0x1F
, which actually takes the lowest 5 bits at each shift level. (So, in my example tree above, I actually lied -- we're not actually looking at prefixes, but rather a sequence of 5-bit suffixes. This is usually more efficient for many of Java's builtin hashCode
functions, since they tend to vary heavily on the low bits and less on the high bits. This makes sense when your hash-based structures rely on modular arithmetic, like the builtin implementation of HashMap
. For more randomly-distributed hashCode
implementations, the choice of how you slice your bits makes no difference.)
SingleArrayHashNode
This represents an internal node that has a single child (so we don't waste a 32-element array on that one child). That child is in one of 32 possible branches, so we need to keep track of which branch it's in, given by the partial hash code that it and all of its children share. Therefore, SingleArrayHashNode
stores the child index (in the 0-31 range) and the child node. The child of a SingleArrayHashNode
should always be an ArrayHashNode
, whether another SingleArrayHashNode
or a BranchedArrayHashNode
. If a SingleArrayHashNode
had a leaf node child, that child could simply be promoted toward the root, replacing the SingleArrayHashNode
. By this axiom, we can guarantee that a chain of SingleArrayHashNode
s ends in a BranchedArrayHashNode
.
Here are the operations on a SingleArrayHashNode
:
- put:
- If the 5-bit partial hash code of the given key matches this node's partial hash code, then delegate the
put
to this node's child and return a newSingleArrayHashNode
with the result of that operation as a child. - Otherwise return a new
BranchedArrayHashNode
with this node's child and a newEntryHashNode
(to hold the new entry) as children.
- If the 5-bit partial hash code of the given key matches this node's partial hash code, then delegate the
- remove:
- If the 5-bit partial hash code of the given key does not match this node's partial hash, return
this
. There is nothing to remove. - Otherwise, delegate the
remove
call to this node's child. If the resulting node is not an instance ofArrayHashNode
, return it (since it's a leaf node and doesn't need aSingleArrayHashNode
as a parent). Otherwise, return a newSingleArrayHashNode
with the new child node (and the same partial hash).
- If the 5-bit partial hash code of the given key does not match this node's partial hash, return
- get: If the 5-bit partial hash code of the given key does not match this node's partial hash, return
None
. Otherwise, return the result of delegating the call to this node's child.
BranchedArrayHashNode
This is the node type that actually makes the whole structure work. It doesn't store any data itself, but it provides a dispatch mechanism to zero in on the subtree that contains the relevant (partial) hash code. The BranchedArrayHashNode
has an array of 32 ImmutableHashTrieMap
s (meaning nodes), where unoccupied branches point to the EmptyHashNode
. Since we have SingleArrayHashNode
, we guarantee that BranchedArrayHashNode
has at least two non-empty children.
The operations are comparatively expensive, since put
s and remove
s usually involve creating a copy of the 32-element array and replacing a single child.
Here are the operations:
- put: Delegate the
put
call to the relevant child, based on the 5-bit partial hash code of the given key. Return a newBranchedArrayHashNode
with that child replaced with the outcome of the delegated call. If the relevant child was previously empty, the size of the newBranchedArrayHashNode
is incremented by 1. - remove: Determine the relevant child based on the 5-bit partial hash code of the given key. If the relevant child is already empty, return
this
. Otherwise, delegate the call to the relevant child. If the returned node is empty (i.e. the child was anEntryNode
that was removed), the new size will be 1 less than the current size, otherwise the new size will be the old size. If the new size is 1, return a newSingleArrayHashNode
containing the remaining child. Otherwise, return a newBranchedArrayHashNode
with the relevant child replaced by the outcome of the delegatedremove
call. - get: Return the result of delegating the call to the relevant child, given by the 5-bit partial hash code. (If the relevant child is the
EmptyHashNode
, it will take care of returningNone
.)
Conclusions
You can check out the complete implementation on GitHub, here.
I ran some microbenchmarks, adding and removing the integers from 0 to 1 million (in random order), comparing against the builtin Java HashMap
. I found that my ImmutableHashTrieMap
was about 2.5 times slower. Increasing to 10 million, I was 5 times slower.
I think I might be able to improve on my time by adding classes to handle other internal nodes with a small number of branches (like what I did with SingleArrayHashNode
). I may try implementing specialized internal node classes with up to four child branches, since I believe the cost of checking each child would be less expensive than the additional array copies involved in working with BranchedArrayHashNode
.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.