# A Probing Hash Table Framework

Posted: January 29, 2016

Data locality is very important. Keeping data close together in memory is a huge win on basically every modern computing device due to our systems’ inherently multi-level memory hierarchy. Contiguous data leads to speed, and even some perhaps counter-intuitive results (like insert-and-shift into the middle of a dynamic array actually being faster than the equivalent linked list insertion1).

Despite these facts, interestingly enough the C++11 (and beyond) standard library provides a hash table implementation in std::unordered_map that is a separate chaining hash table, with linked list buckets. This implementation implies a strong trade-off: while elements inserted into the map are guaranteed to be stable in memory (never needing to be moved or copied), we now must chase pointers when walking through the buckets in the hash table.

What if we allow for keys to be moved/copied around as the hash table grows? If we relax the requirement that we never move key/value pairs after insertion, we now open the door for implementing a hash table using open addressing. Here, the table is stored as a single huge array containing either a key/value pair or nothing. On an insertion where a collision occurs, an empty slot is located by “probing” through the array looking for an empty slot, following some probing strategy. The most well known is linear probing, which has developed a bit of a bad reputation. But is the disdain deserved?

In this post, I’m going to walk you through some results that motivated the development of an open source (insertion-only) probing hash table framework that’s distributed as part of the MeTA toolkit, and attempt to prove to you that naive, linear probing (or, at least, blocked probing) strategies are not nearly so bad as you might initially think. In particular, we’ll be benchmarking hash tables with both integer and string keys, taking a look at how they perform in terms of building time, memory consumption, and query throughput.

# Integer Hash Functions

First, let’s address something about hashing. Hash tables depend very strongly on their hash function behaving close enough to a random function (as in, the output of the function over the key space is uniform over the output space) in order to give their nice $O(1)$ amortized guarantees2.

Furthermore, if your hash function is predictable (as in, someone can guess what your hash function might be), people will find out and will craft inputs to take your server/application down. Having a good (nearly uniform), unpredictable hash function is very important.

With these criteria in mind, let’s see how two C++ standard library implementations fare in the case of hashing integers. I tested with the latest libstdc++ from GCC (packaged with 5.3.0 as of the time of writing) and libc++ (packaged with llvm/clang 3.7.1 as of the time of writing).

Here’s a very basic snippet:

Can you guess what the output is? To my surprise, it was:

0
1
2
3


for both standard library implementations I have available.

Every run.

Yes my friends, the standard library hash function for integral types for the most commonly used standard libraries on Linux and OS X, respectively, is the identity function and has no randomization whatsoever, making it perfectly predictable.

This is capital-B Bad and is a wide open door for attacks.

OK, so fine, we’ll just not use std::hash and write one ourselves, right?

And we’ll definitely be able to write a good one, right?

And we’ll definitely not forget one of the 30 different specializations, right?

And you’ll definitely notice that I’m strongly hinting to you that this is a Very Bad Idea(TM), right?

So what do we do? We certainly need to replace the hash function that’s being used by default if we care about minimizing our hash table attack surface, first and foremost. But you should never, ever write your own hash function3, if you can help it.

Fortunately, the C++ standard proposal paper N3980: Types Don’t Know # has got this problem figured out (seriously, read that paper if you’re a C++ person: it’s absolutely brilliant). Hinnant et al. provide an easily extensible framework that decouples the hashing problem into two steps: providing to the implementation what to hash and then hashing those elements with a generic, byte-oriented hash function. This allows you to easily substitute your hash algorithm with a different one without having to rewrite all of your 30+ specializations to integrate it properly.

You write, once, the definition of what it means for your specific type to be hashed (via a namespace-local, potentially friend, overload of hash_append() for your type, implemented in terms of hash_append() for the member elements of that type that need to contribute to generating the hash code for it).

You write, once, the definition of what it means to has a sequence of bytes using your chosen hash function. Changing the hash function to another one at a later date is trivial, and you do not have to write any specializations beyond providing the overload for hash_append() to specify how to hash your type.

As our starting point, I implemented this framework and provided several randomly-seeded hash functions that we can use for implementing our hash tables. By default, we’re currently using FarmHash for the default hash function on 64-bit machines, and Murmur34 as the default hash function on 32-bit machines. But the wonderful thing is that I can change my mind at any date, change one line of code, and be off to the races using any of the other hash function’s I’ve implemented. This is impossible if you’re using std::hash.

# Probing Hash Tables for Integers

Now that we’ve settled on a hash function, let’s do some benchmarking, shall we? The code for this benchmark is available on Github. For each hash table I compared against, I generated data for insertion calls by randomly generating uint64_t elements from the range $[0, n)$, where $n$ was varied from 1,000,000 to 100,000,000 in increments of 1,000,000.

To evaluate the tables, we’ll take three looks on efficiency: building time, memory consumption, and query throughput.

Our probing hash table in this case consists of a single probing array that contains the key/value pairs directly, ensuring that (in this case) the cells are exactly 16 bytes. However, we do need some way of telling whether a particular cell in that array is occupied by a valid key/value pair. To do this, and not waste any space, we sacrifice a single key value (called the sentinel) that will be stored in the cells that are not occupied by valid key/value data. The particular value that’s sacrificed to the hash table deities can be configured via key_traits. (And, if sacrificing a specific key value is not an option, you can use an alternative storage strategy that I’ll talk about later in the section on string hash tables.)

## Building Time

For the running time, there’s an interesting trade-off here. If you aren’t concerned about std::hash<uint64_t> being the identity function (maybe you can carefully control the input to your hash table, for example, to guarantee no evilness), it’s clear that just using unordered_map<uint64_t, uint64_t> is fine. In fact, it’s the fastest among the tables we tested.

But if you’re at least a little concerned about those DoS attacks I mentioned earlier, you can see that the probing hash table (probe_map in the graph) is the clear winner.

So then I asked the question: what if I use std::hash in a probe_map? Clearly, evaluating the hash function is more expensive when using meta::hashing::hash<> (FarmHash, by default), so maybe I can beat out unordered_map if I use a faster hash function?

…and after about 10 hours of running what was supposed to be a “quick” test, I realized the problem: as soon as the table expands to larger than my random number input range, the keys are essentially guaranteed to coalesce into the front half of the table. The only case where elements can end up in cells $n$ and beyond is if there was a collision at cell $n - 1$. So, because of our naive hash function, we’ve got the primal clustering problem on steroids. Not good.

So don’t do that.

I also compared gcc against clang here, and there was no discernible difference (the lines are literally right on top of one another in the graph, so I don’t present these results here).

## Memory Consumption

Next, let’s have a look at the memory consumption during building. I measured this by using getrusage() on my Linux-based testing environment and extracting the ru_maxrss field from the struct to get the number of kilobytes used by the program at its peak memory consumption.

Here we can see one clear advantage of using the probe_map: it (most of the time) consumes less memory than the equivalent unordered_map. You’ll also see the obvious resizing jumps occurring more often, reflecting the library’s choice of a default resizing ratio of 1.5x instead of the commonly used 2x (though you have full control over this with probe_map, unlike unordered_map’s hardcoded default).

I also compared against gcc again and, to no one’s surprise, it is exactly the same as clang here, so I didn’t bother cluttering the graph.

## Query Throughput

So far so good. Now let’s consider the query throughput: how quickly can I look up every element (including the duplicates) that I inserted in the table? We report the number of queries one can perform in a second here, in millions.

Since you can finally tell the difference between gcc and clang here, I report both in the graph (though their differences are insignificant). At the very top we have probe_map, which seems to remain relatively constant at around 1.56 million lookups/second after it dips around the 10 million insertion mark. Below that, perhaps predictably, are unordered_map with std::hash and unordered_map with meta::hashing::hash<>. The throughput is hurt significantly by having a more sophisticated hash function. Fortunately, it seems that you can easily get the best of both worlds by using probe_map.

# Probing Hash Tables with String Keys

Strings are perhaps the most common key type used in hash tables, and represent a significantly different challenge compared to hashing integer keys. A C++11 compliant std::string implementation is significantly larger than a uint64_t: as of the time of writing libc++ uses 24 bytes and libstdc++ uses 32 bytes. This makes using a simplistic probing hash table significantly less efficient. When using integer keys and values, we could get away with just storing them directly in the array used for probing due to their small size (the probing elements were only 16 bytes). If we store a probing element that contains a std::string key and a uint64_t value, we waste somewhere between 32 and 40 bytes per empty cell, and there will typically be a lot of empty cells in a probing hash table (common wisdom suggests maximum load factors of around 0.7 for probing tables5; we’ve opted for a more aggressive 0.85 default, but these values are configurable in probe_map just as they are for unordered_map.).

To get around this, the probe_map implementation detects whether your key is “inlineable” or not using a key_traits class. Right now, only integral values are considered “inlineable” and everything else is not. If a key type is not “inlineable”, we change our representation of the table up slightly. We still have a large array for probing, which contains two integers: one that indicates the hash code of the element that occupies that cell, and one that indicates the index into another array where that key/value pair can be found, plus 1 (if the index value is 0, this signifies an empty cell).

This has a couple advantages. First, these kinds of nontrivial keys tend to have more complicated equality operators than just a single integer comparison. In the case of a std::string, this is a standard lexicographic comparison against a byte buffer that could be located in a non-local part of memory provided the string is long enough. If we call operator== on every string as we probe, each call could potentially result in a cache-miss even if the cells are right next to each other since the “real” data (the byte buffer) is elsewhere. Instead, we’ll first check whether the hash code of the element we’re looking for is equal to the hash code that we’ve cached at that cell in the probing array (which won’t incur an additional cache miss). Only if those hash codes are equal do we go ahead and perform potentially expensive operator== call.

Second, it allows the wasted space for empty cells in the probing table to continue to be only 16 bytes as opposed to 32 to 40. This can save quite a bit of space since the probing table is likely to have a lot of empty cells. Unfortunately, it’s not all space savings, since the secondary array storing the key/value pairs is also likely to have many empty cells, since we use an exponential-resizing std::vector for their storage.

With that explanation out of the way, let’s compare our probe_map against unordered_map again, using the same criteria as we did for the integer case: building speed, memory consumption, and query throughput. Here, we’ve generated string keys by taking the exact same data as we used in the integer benchmarks, but converting the integers to their string representations first.

## Building Time

Building time is essentially the same across the tables, with probe_map performing slightly worse in some spots due to the more frequent resizing cost. However, it’s interesting to see that unordered_map with std::hash actually starts to perform worse than unordered_map with meta::hashing::hash<> towards the very end of the benchmark (at around 85,000,000 insertions). It’s also interesting to note that unordered_map and probe_map converge to about the same performance around 95,000,000 insertions if they are both using FarmHash.

## Memory Consumption

probe_map loses the memory battle to unordered_map by significant margins at many insertion counts, and we suspect the secondary array’s blank cells account for a lot of the extra weight. Fortunately, the extra memory usage is frequently followed by a long period of growth that is at a significantly lower rate than unordered_map, allowing it to “catch up” at a couple of insertion count points.

## Query Throughput

Here we can start to see some interesting differences between clang and gcc. At the top of the graph is probe_map, which has a higher query throughput than any of the standard library implementations by a good margin. Interestingly, though, gcc seems to deteriorate faster than clang starting around the 50,000,000 insertion mark (and I have no explanation for why it suddenly jumps back up at the very last 100,000,000 insertion count, especially because it doesn’t appear to be due to variance).

gcc, however, does redeem itself when comparing the unordered_map implementations. When both are using std::hash, gcc starts to pull away from clang at around the 55,000,000 insertion mark. Here, clang with std::hash begins a steady decline that makes it even slower than the unordered_map implementations with FarmHash at around 70,000,000 insertions.

# Conclusions

These results to me speak the following trade-offs for inlineable and non-inlineable keys:

• For small (integer) keys, probe_map trades build time for query throughput. It would be interesting to see whether increasing the resize ratio or maximum load factor could change the incline of the build time graph.

• For large (string) keys, probe_map trades memory usage for query throughput. Investigating different secondary storage types might be helpful in reducing the wasted space, but may imply other trade-offs.

If you have a need for a hash table with very fast query speed, using probe_map over unordered_map might make sense (provided you won’t miss erase(), which is currently unimplemented because we simply don’t use it often enough to care).

# Acknowledgments

I’d like to thank Sean Massung and Chelsea Neely for their feedback on a draft version of this post, and Jeff Erickson’s Advanced Data Structures course for inspiring the work.

# Edits

• 2016-01-31: Added a paragraph that discusses the use of a sentinel value for inlineable keys to signify the empty cell.
1. Seriously. Check out Bjarne Stroustrup’s GoingNative 2012 keynote at about 44 minutes.

2. For linear probing (and blocked probing strategies in general), it’s been shown that your hash function needs to be five-way independent in order to guarantee good performance. Fortunately, one can construct hash functions that satisfy this property relatively easily. Unfortunately, nobody really seems to be using functions that guarantee this property in practice (including me).

3. Or your own crypto system. These are both Very Bad Ideas(TM).

4. Although, since then, it’s been discovered that Murmur3 hash has seed-independent collision problems, meaning that even in the face of a randomly-generated seed at program startup, people can still break your stuff. The best alternative thus far seems to be SipHash, which can be easily integrated into this framework.

5. With the exception of quadratic probing, which isn’t guaranteed to find an empty slot in the table if it’s more than half full!

comments powered by Disqus