In this post, I aim to introduce another immutable data structure, following up on my posts on immutable binary trees ([1] and [2]) and immutable lists. We'll be looking at **hash tries**, which provide an efficient way of implementing an immutable `HashMap`

-like structure. Like a `HashMap`

, we can provide more or less constant time operations in the absence of hash collisions, and linear time when collisions occur. The catch is that, in practice, our constants will be considerably larger. That said, we'll have a data structure that requires no synchronization when accessed from multiple threads, since it cannot be modified in-place.

Our hash trie map will be constructed as a tree with a high branching factor. If you haven't already read the posts on immutable binary trees, I suggest you do so, especially the second one. In particular, we're going to be relying heavily on specialized node types to save memory and split our logic into different cases based on polymorphic behaviour.

### What is a trie?

Tries are a kind of tree where the path from the root to a given node encodes some earlier information. A nice use for tries is to store dictionaries in a relatively efficient way. Your root node represents the empty string, and it has 26 children, for the letters of the alphabet. Each node could be implemented as an array of 26 (possibly null) elements, plus a boolean to indicate whether that node represents a valid word. The word 'cat' would be encoded by the third-level node reached by going to 'c', then 'a', then 't', from the root. Since 'cat' is a word, the boolean value at that node would be true. By comparison, the node for 'coug' would be false, but would need to exist as a parent for 'cough'.

Our hash trie map will work on a somewhat similar principle, except that our valid "words" will be hash codes for the keys contained in the map.

### An overview of what our tree will look like

To implement our trie structure, we'll have nodes that branch over part of the hash code of the elements (`BranchedArrayHashNode`

), a singleton `EmptyHashNode`

that represents a lack of children for a given partial hash, an `EntryHashNode`

that represents a single key/value pair, and a `ListHashNode`

that contains a linked list of key/value pairs whose keys' hashes collide. As a useful optimization, we'll implement a specialized `SingleArrayHashNode`

for internal nodes who only have one child, so we don't unnecessarily allocate an entire array.

Like with the immutable binary trees, an add or remove operation will descend the tree until it reaches a (possibly empty) leaf node, which will provide its own replacement. Then, as the recursive calls return, the parents will provide their replacements (which will reuse the other children), until we get a new root node. In this way, we can guarantee (in the absence of a collision) running time that is logarithmic over the possible hash codes. Since the space of possible hash codes is fixed, the maximum depth of the tree is log(2^32) (which would be 32 with a branching factor of 2, but we'll use a branching factor of 2^5 = 32, guaranteeing a maximum depth of 7). This is where we get the claim of constant running time in the absence of collisions. (In the event of a collision, we hit a `ListHashNode`

and we add a linear traversal of the list.)

### An example of how we'll store our data

The above image depicts a hash trie map with 5 values stored in it (represented by the five `EntryHashNode`

leaves). We break up the 32-bit hash codes of the keys into 5-bit chunks (with a 2-bit chunk at the deepest level), and use the sequence of 5/10/15/20/25/30-bit prefixes in our trie. It's probably clearest if I describe the five entries' hash codes:

- The (1) entry has a hash code that begins with 11001. It is the only node with this prefix, so it can be stored directly below the root.
- Entries (2) and (3) both have hash codes that begin with 10110, but entry (2)'s hash code starts with 10110 00011, while entry (3)'s hash code starts with 10110 10101.
- Entries (4) and (5) both have hash codes that start with 01101 00101. Since we have at most 32 branches at each level, so can only encode 5 bits of hash code, we stick in a
`SingleArrayHashNode`

to avoid wasting 31 empty array elements on the shared 01101 prefix. You can see that the first 15 bits of entry (4)'s hash code are therefore 01101 00101 01011, while entry (5)'s hash code begins with 01101 00101 11101.

I hope this example explains the basic principle a little more clearly. (I've read a number of explanations of hash tries, but none have struck me as particularly good. In particular, several of them try to "simplify" down to 1 bit per level, creating a binary tree. I actually found that more confusing, personally.)

### The interface and implementation

The basic ImmutableMap interface is similar to that of `java.util.map`

, except that the `put`

and `remove`

operations need to return the new map. Also, rather than having `get`

return `null`

(when no entry is found), I've decided to make it return an `Option`

(described in a previous post).

My implementation starts with an abstract base class, `ImmutableHashTrieMap`

, which implements these operations based on our sliding 5-bit window looking at hash codes:

As you can see, the window size is hardcoded as a constant, so we can experiment to find the optimal branch factor. (Smaller branch factors mean deeper trees, while bigger branch factors lead to larger array copies when replacing a branched node.)

Then we just need to implement each of the different concrete node types, with an understanding of how they get replaced when adding and removing nodes.

`EmptyHashNode`

The simplest node type is `EmptyHashNode`

, which is a singleton (since all empty nodes are the same after type erasure).

The operations on `EmptyHashNode`

are as follows:

**put**: Returns a new`EntryHashNode`

containing the given key/value pair.**remove**: Returns`this`

, since there is nothing to remove.**get**: Returns`None`

, since no entry may be found here.

`EntryHashNode`

Each `EntryHashNode`

holds a single key/value pair. It defines the operations as follows:

**put**:- If the given key is equal to this entry's key, return a new
`EntryHashNode`

, replacing this node. - If the given key is not equal to this entry's key, but they have the same hash code, return a new
`ListHashNode`

containing this entry and the new entry. - Otherwise, return a new
`ArrayHashNode`

(`Branched`

or`Single`

, depending on whether the current hash code windows collide) with this entry and the new entry as children.

- If the given key is equal to this entry's key, return a new
**remove**: If the given key matches this entry's key, return the`EmptyHashNode`

. Otherwise, the given key was not found, so return`this`

.**get**: If the given key equals this entry's key, return this entry's value. Otherwise, return`None`

.

`ListHashNode`

`ListHashNode`

is used in the unfortunate event of a hash collision (that is, different keys, same hash code). It is implemented as an immutable list of key/value pairs (using the `Tuple2`

class from a previous post).

The operations on `ListHashNode`

are as follows:

**put**:- If the given key's hash code is different from that of the entries in this node (since they all have the same hash code), return a new
`ArrayHashNode`

(`Branched`

or`Single`

, depending on whether the current hash code windows collide) with this node and a new`EntryHashNode`

(for the new entry) as children. - If the given key is equal to the key of an entry in this node, return a new
`ListHashNode`

of the same size, with the new entry replacing the old entry. - If the given key is not equal to the key of any entry in this node, but the hash code is the same as that of the entries in this node, return a new
`ListHashNode`

with the new entry prepended to this node's entries.

- If the given key's hash code is different from that of the entries in this node (since they all have the same hash code), return a new
**remove**:- If the given key is not equal to the key of any entries in this node, return
`this`

. - If the given key is equal to the key of an entry in this node, return a new
`ListHashNode`

containing the list of entries minus the removed one. However, if the new list would have size 1, return a new`EntryHashNode`

instead.

- If the given key is not equal to the key of any entries in this node, return
**get**: If the given key matches the key of an entry in this node, return that node's value. Otherwise, return`None`

.

This is the last of the leaf nodes. Next, we get into the internal nodes, `BranchedArrayHashNode`

and `SingleArrayHashNode`

. Neither of these actually hold data. They simply hold references to deeper nodes, grouped by partial hash codes. I'm going to leave off the diagrams for those nodes, since they get pretty complex (since the internal nodes have children).

`ArrayHashNode`

Both `SingleArrayHashNode`

and `BranchedArrayHashNode`

inherit from `ArrayHashNode`

, a marker abstract class that indicates that they are non-leaf nodes. `ArrayHashNode`

s need to know how to compute a partial hash code from a full 32-bit hash code and a `shift`

parameter. The formula I used is `hash >> shift & 0x1F`

, which actually takes the lowest 5 bits at each shift level. (So, in my example tree above, I actually lied -- we're not actually looking at prefixes, but rather a sequence of 5-bit suffixes. This is usually more efficient for many of Java's builtin `hashCode`

functions, since they tend to vary heavily on the low bits and less on the high bits. This makes sense when your hash-based structures rely on modular arithmetic, like the builtin implementation of `HashMap`

. For more randomly-distributed `hashCode`

implementations, the choice of how you slice your bits makes no difference.)

`SingleArrayHashNode`

This represents an internal node that has a single child (so we don't waste a 32-element array on that one child). That child is in one of 32 possible branches, so we need to keep track of which branch it's in, given by the partial hash code that it and all of its children share. Therefore, `SingleArrayHashNode`

stores the child index (in the 0-31 range) and the child node. The child of a `SingleArrayHashNode`

should always be an `ArrayHashNode`

, whether another `SingleArrayHashNode`

or a `BranchedArrayHashNode`

. If a `SingleArrayHashNode`

had a leaf node child, that child could simply be promoted toward the root, replacing the `SingleArrayHashNode`

. By this axiom, we can guarantee that a chain of `SingleArrayHashNode`

s ends in a `BranchedArrayHashNode`

.

Here are the operations on a `SingleArrayHashNode`

:

**put**:- If the 5-bit partial hash code of the given key matches this node's partial hash code, then delegate the
`put`

to this node's child and return a new`SingleArrayHashNode`

with the result of that operation as a child. - Otherwise return a new
`BranchedArrayHashNode`

with this node's child and a new`EntryHashNode`

(to hold the new entry) as children.

- If the 5-bit partial hash code of the given key matches this node's partial hash code, then delegate the
**remove**:- If the 5-bit partial hash code of the given key does not match this node's partial hash, return
`this`

. There is nothing to remove. - Otherwise, delegate the
`remove`

call to this node's child. If the resulting node is not an instance of`ArrayHashNode`

, return it (since it's a leaf node and doesn't need a`SingleArrayHashNode`

as a parent). Otherwise, return a new`SingleArrayHashNode`

with the new child node (and the same partial hash).

- If the 5-bit partial hash code of the given key does not match this node's partial hash, return
**get**: If the 5-bit partial hash code of the given key does not match this node's partial hash, return`None`

. Otherwise, return the result of delegating the call to this node's child.

`BranchedArrayHashNode`

This is the node type that actually makes the whole structure work. It doesn't store any data itself, but it provides a dispatch mechanism to zero in on the subtree that contains the relevant (partial) hash code. The `BranchedArrayHashNode`

has an array of 32 `ImmutableHashTrieMap`

s (meaning nodes), where unoccupied branches point to the `EmptyHashNode`

. Since we have `SingleArrayHashNode`

, we guarantee that `BranchedArrayHashNode`

has at least two non-empty children.

The operations are comparatively expensive, since `put`

s and `remove`

s usually involve creating a copy of the 32-element array and replacing a single child.

Here are the operations:

**put**: Delegate the`put`

call to the relevant child, based on the 5-bit partial hash code of the given key. Return a new`BranchedArrayHashNode`

with that child replaced with the outcome of the delegated call. If the relevant child was previously empty, the size of the new`BranchedArrayHashNode`

is incremented by 1.**remove**: Determine the relevant child based on the 5-bit partial hash code of the given key. If the relevant child is already empty, return`this`

. Otherwise, delegate the call to the relevant child. If the returned node is empty (i.e. the child was an`EntryNode`

that was removed), the new size will be 1 less than the current size, otherwise the new size will be the old size. If the new size is 1, return a new`SingleArrayHashNode`

containing the remaining child. Otherwise, return a new`BranchedArrayHashNode`

with the relevant child replaced by the outcome of the delegated`remove`

call.**get**: Return the result of delegating the call to the relevant child, given by the 5-bit partial hash code. (If the relevant child is the`EmptyHashNode`

, it will take care of returning`None`

.)

### Conclusions

You can check out the complete implementation on GitHub, here.

I ran some microbenchmarks, adding and removing the integers from 0 to 1 million (in random order), comparing against the builtin Java `HashMap`

. I found that my `ImmutableHashTrieMap`

was about 2.5 times slower. Increasing to 10 million, I was 5 times slower.

I think I might be able to improve on my time by adding classes to handle other internal nodes with a small number of branches (like what I did with `SingleArrayHashNode`

). I may try implementing specialized internal node classes with up to four child branches, since I believe the cost of checking each child would be less expensive than the additional array copies involved in working with `BranchedArrayHashNode`

.

## No comments:

## Post a Comment