We present a new class of resizable sequential and concurrent hash map algorithms directed at both uni-processor and multicore machines. The new hopscotch. I am currently experimenting with various hash table algorithms, and I stumbled upon an approach called hopscotch hashing. Hopscotch. We present a new resizable sequential and concurrent hash map algorithm directed at both uniprocessor and multicore machines. The algorithm is based on a.
|Published (Last):||15 October 2008|
|PDF File Size:||16.73 Mb|
|ePub File Size:||20.82 Mb|
|Price:||Free* [*Free Regsitration Required]|
Hopscorch am currently experimenting with various hash table algorithms, and I stumbled upon an approach called hopscotch hashing. Hopscotch hashing is a reordering scheme that can be used with the open addressing method for collision resolution in hash tables.
When using open addressing with only a probing sequence and no reordering, entries are inserted in the first empty buckets found hopecotch the sequence.
With a reordering scheme, entries already in the table can be moved as new entries are inserted. Hopscotch hashing is interesting because it guarantees a small number of look-ups to find entries.
In addition, those look-ups are in contiguous memory areas, which is very hashinv friendly. This article makes minor contributions to the original publication. First, a clear explanation of the insertion process is being given, which is independent from the representation of neighborhoods. The original paper was using the bitmap representation to present the algorithm, and I believe that things are simpler without it. It derives the relationship between neighborhoods and initial buckets from the hashed keys, and does not require any extra memory.
Hopscotch hashing was introduced hopsvotch Herlihy et al. Two versions of the paper can be found online, a draft  and the final version .
The original paper was mostly focusing on multi-core and concurrent environments. Those aspects are not being discussed in the present article, which focuses on single-core environments.
Very Fast HashMap in C++: Hopscotch & Robin Hood Hashing (Part 1)
The main idea behind hopscotch hashing is that each bucket has a neighborhood of size H. The neighborhood of a bucket B is defined as the bucket itself and the H-1 buckets following B contiguously in memory H buckets total.
This also means that at any bucket, multiple neighborhoods are overlapping H to be exact. Hopscotch hashing guarantees that an entry will always be found in the neighborhood of its initial bucket. As a result, at most H consecutive look-ups will be required in contiguous memory locations, which is extremely cache friendly.
In order to insert a new entry, its key is hashed to find the initial bucket for the entry, denoted as B.
The search for an empty bucket E then starts, using linear probing hopsdotch the neighborhood of B. If an empty bucket hhopscotch found, the entry is saved there. Otherwise, the linear probing is pushed beyond the border of the neighborhood of B, until an empty bucket is found. If the entry were saved there, subsequent trials to retrieve it would fail, because the search would hasying bounded to the neighborhood of B.
In order to move the empty bucket E from the location where it was found into the neighborhood of B, it needs to be swapped with entries between the positions of B and E.
An entry will respect the hopscotch guarantee if it can be found in the neighborhood of its initial bucket, i.
Consequently, the empty bucket E can be swapped with any of the preceding Uopscotch entries that have the position of E in their neighborhoods.
At this point, I believe that long paragraphs of explanations would not be of any help. Instead, I am presenting the insertion process of hopscotch hashing with a diagram, in Figure 1 below. For each step, the array on the left represents the status of the hash table, and the bullet points on the right provide some information as to what the hashinv is doing.
The first step to retrieve an entry is to determine its initial bucket. Then, all what has to be done hopscogch to start from the position of this initial bucket and to scan through the next H-1 buckets, and jopscotch each bucket, to compare the key with the key of the yopscotch being searched. If the key does not match any of the keys for the entries in the neighborhood of the initial bucket, then the entry is simply not in the table.
The deletion process is simple: This is a major improvement compared to basic open addressing that uses only probing. In that that case, entries would be removed by lazy deletionwhich require the introduction of special values to mark deleted buckets. Lazy deletion leads to the apparition of large areas of buckets marked as deleted, that are unused but yet must be scanned through during look-ups, a phenomenon called contamination. In the original paper, two representations of the neighborhoods were covered .
The first one is using one bitmap per bucket, and the second one is using a linked list per bucket. Those two representations are described below. Neighborhoods can be stored hzshing bitmaps bit arrays. With this solution, each bucket needs a bitmap in addition to all the data already needed for that bucket.
Each bit in that bitmap indicates if the current bucket or one of its following H-1 buckets are holding an entry which belongs to the neighborhood of the current bucket. Figure 2 below shows the hash table presented in Figure 1, and along with it the bitmaps for each of the buckets. Hopscottch has one entry in its neighborhood, stored in bucket 6. The bitmap for bucket 5 is thereforewith a bit set to 1 at index 1, because bucket 6 hipscotch at an offset of 1 from bucket 5.
The main drawback of using bitmaps is that the maximum size of the neighborhoods is limited by the number of bits in the bitmaps.
The size of the bitmaps could be increased, but that would mean that the size of each bucket would be increased, which at some point would make the whole bucket array explode in memory. If neighborhoods are stored using linked lists, each bucket has a pointer to the next bucket in the same neighborhood. Since the buckets in a same neighborhood will be found at a bounded distance, at most H-1small integers can be used to store offsets instead of larger and more costly addresses, which can save up a lot of memory.
In the code associated with the paper this offset solution is the one that was chosen. An advantage of using linked lists is that the size of the neighborhoods can be much larger without too much overhead on the total size of the bucket array. One of the drawbacks of using linked lists is that it may negate cache friendliness, because it prevents spatial locality and prefetching, as the pointers may jump forward and backward in memory.
This representation was not part of the original paper, this is just an idea that I wanted to experiment with. Due to the hopscotch method, an entry may not be in the bucket it was hashed to, its initial bucket, but most likely in a bucket in the neighborhood of its initial bucket. From the hashed key only, it is possible to find for any entry the position of its initial bucket using the modulo operator.
From there, the neighborhood to which the entry belongs can be determined, which is the initial bucket that was just derived and the next H-1 buckets.
Storing the hashed keys is frequently required. For example if the implementation of a hash table needs to allow for resizing, then in some cases performance can be improved by storing the hashed keys.
As I am researching collision resolution methods to store hash tables on disk, storing the hashed key of each entry in the bucket array is something I am strongly considering. It is hashihg that storing the hashed keys requires more memory, but it avoids the need for extra memory to store the neighborhoods as it is the case with the bitmap and linked-list representations.
It is important to understand that the relationship between an entry and its neighborhood is reversed in the shadow representation compared to the bitmap and linked-list representations. Indeed, for the bitmap representation, the neighborhood of a bucket is determined using the bitmap stored with that bucket.
With the shadow representation, a bucket is accessed and then its initial bucket is determined, which allows to know to which neighborhood this bucket belongs to. This is just a detail, but this will show up in the implementation and therefore it is something to keep in mind. Finding the initial bucket from a hashed key requires the use of the modulo operator.
This can appear to be costly compared to just hopscoth the neighborhoods, but actually if the size of the bucket hoppscotch is a hopscogch of two, then the modulo can be computed using a bitwise AND, which is extremely fast.
Finally, the shadow representation allows for larger gopscotch sizes, in theory for neighborhoods as large as the bucket array is itself.
Very Fast HashMap in C++: Hopscotch & Robin Hood Hashing (Part 1)
And larger neighborhoods can be useful to allow for more entries to be stored in the hash table, and reach higher load factors. The code is available on GitHub . A Makefile is included and should simplify compilation. From there, simply calling the program with the –help parameter gives a full description of the options available: I have only performed limited experiments, and not a thorough statistical study.
Nevertheless, with this code I was able to find out more about the practical behavior of hopscotch hashing, and some of the limitations that were not described in the original paper. In this section I am trying to shed light on a downside from the original paper.
This may come across as I am picking on the paper, but I am not, I am just pointing out something I find to be inconsistent. The code from the paper implements two representations of the hopscotch hashing, the bitmap and the linked list .
The implemented bitmap representation follows what is described in the paper, but surprisingly, this seems not to be the case for the linked-list one. Indeed, for the linked-list implementation, it seems that the code of the paper does not have the hopscotch algorithm in the putIfAbsent method HopscotchHashMap. From what I understand from the code, it appears that putIfAbsent is just implementing a linear probing and that buckets are then linked together to prevent probing all buckets when calling containsKey.
But no trace of displacements as described in the paper.
If the second option is true, this would mean that the experimental results presented in the paper were not obtained using hopscotch hashing, and therefore that the conclusions reached by this paper could not be taken seriously. This being said, I still find the hopscotch hashing algorithm to be interesting, and it totally deserves to be implemented and tested.
If someone has a better understanding of this code than I do, please drop a comment below. The major limitation of hopscotch hashing is that the load factor does not increase as described in the original paper. In the paper, one of the proofs states: This proof may create a misunderstanding, which is that the load factor can increase to very high values because hopscotch hashing will prevent clustering.
But this is not the case. However, this does not prevent multiple buckets to cluster the overlapping area of their respective neighborhoods. Bucket 0 is the initial bucket, and bucket 11 is the empty bucket found by linear probing.
In Figure 3, all the buckets in the area of the swap candidates are in the neighborhoods of gashing that are before the beginning of the swap area.
Because of that, none of those buckets can be swapped with the empty bucket. This clustering behavior occurs around load factors of 0. Anyway at such high loads, the best thing to do would be to resize the table. A great thing with the shadow representation is that the largest size of the neighborhood is theoretically the size of the hash table itself. Of course having such as large value would be useless, because it would be equivalent to using only linear probing with no reordering.
Hshing I wanted to try the hopscotch hashing with neighborhood sizes larger than the value hashiing 32 tested in the original paper. What it appears is that with neighborhoods of size 64, the load factor at which clustering prevents inserting is around 0. Another way to look at it is that increasing the size of the neighborhoods follows hashimg law of diminishing returns.