Dr. Carlo Pescio Minimal Perfect Hashing |
Published in Dr. Dobb's Journal, No. 249, July 1996.
Hashing is a powerful and simple technique for accessing information:
each key is mathematically converted into a number, which is then
used as an index into a table. In typical applications, it's common
for two different keys to map to the same indexrequiring some
strategy for collision resolution. However, many applications
use a static set of records; for instance, the table of
keywords for a compiler, the contents of a dictionary, or the
topics in a help file. In these cases, it's nice to have a perfect
hashing function, one that maps different keys into different
numbers. With a perfect function, you could retrieve the record
(or detect that no record exists) in exactly one probe, without
worring about collisions. A special case of perfect hashing is
minimal perfect hashing, were no empty locations in the
table exist. Minimal perfect hashing satisfies the ideal of fast
performance and low memory usage. However, finding perfect hashing
functions is difficult. In this article, I'll present a simple
algorithm to find perfect (and optionally minimal) hashing functions
for a static set of data.
Cichelli's functions
A family of perfect hashing functions was introduced by R. Cichelli
in [1]. Cichelli suggests an exhaustive search to find such functions,
and discusses some heuristics to speed the search. I have extended
the heuristics to obtain a general implementation and speed up
the original algorithm by more than an order of magnitude.
Figure 1:
Minimal perfect hashing for days of the week
sunday | 0 |
monday | 5 |
tuesday | 3 |
wednesday | 6 |
thursday | 4 |
friday | 1 |
saturday | 2 |
f | -8 |
m | -4 |
s | -9 |
t | -7 |
w | -6 |
y | 3 |
Cichelli's hashing functions are very simple. Given a word w,
the hash value is Value( First_Letter( w ) ) + Value( Last_Letter(
w ) ) + Length( w ). The mapping function Value()
is normally implemented as a table that maps characters into numbers.
Figure 1 is an example of table and hash values of a minimal perfect
hashing functions for the days of the week. Note that you only
need to keep a small table for Value(), since only first and last
letters need to be mapped. Calculating the hash value takes only
two accesses in the table and two additions.
Cichelli's algorithm
The original algorithm uses an exhaustive approach with backtracking.
Every possible mapping is tried for Value(), and whenever
an assignment results in a collision, the search path is pruned
and another solution is tried. The algorithm is easy to code,
but can take exponential time (in the number of words) to calculate.
Cichelli suggested two ways to speed up this approach. One way
is to sort the keys: compute the frequency of the first and last
letter, then sort the words according to the sum of the frequency
for the first and last letters. This guarantees that the most
common letters are assigned values early, and increases the probability
that collisions are found when the depth of the search tree is
still small. This reduces the cost of backtracking.
This isn't always sufficient, however. Consider the set of keys:
{aa, ab, ac, ad, ae, af, ag, bd, cd, ce, de, fg},
which sorts to
{aa, ad, ac, ae, ab, af, ag, cd, de, bd, ce, fg}.
Now, the value of the key cd is determined as soon as ac
is given a value, but a test for collision on cd is not
done until five more branches in the search tree are taken. This
makes a collision on cd much more costly than it ought
to be.
Cichelli suggested a second heuristic to address the problem.
If a key in this sorted list has letters that have already occurred,
the key is placed immediately after the one that completes its
mapping. In the aforementioned example, cd will be moved
immediately after ac. While doing so, you must try to maintain
the original sorting order whenever possible: listing 1 is a snippet
of my C++ implementation.
There are five steps to Cichelli's algorithm:
(1) Calculate the frequency of first and last letters.
(2) Order the keys in decreasing order of the sum of the frequencies.
(3) Reorder the keys from the beginning of the list, so that if
a key's first and last letters have already appeared, that key
is placed next to the completing key in the list.
(4) Search cycle: for each key, if both values for first and last
letters are already assigned, try to place the key in the table
(step 5); otherwise, if only one value needs to be assigned, try
all the possible assignments. If both values must be assigned,
use two nested loops to try all the possible assignments for the
two letters.
(5) Given the key, the length, and the Value mapping, compute
the address in the table. Backtrack to step (4) if there is a
collision; otherwise place the key in the table. If is the last
letter, a perfect hashing has been found; otherwise, recursively
go into step (4) for the next key.
The table size is not fixed. If the table size is equal to the
number of keys, we are looking for a minimal perfect hashing;
otherwise, some empty buckets remain at the end of the process.
Improvements
Cichelli did not suggest any way to restrict the range of values
to try for each letter. This range plays an important role in
the algorithm. A small range may result in no solutions, and an
overly large range may increase the width of the search tree unmanageably.
I've developed a general rule that works fine for most sets, resulting
in a moderately small range that does not exclude solutions. Since
you add the length of the word and you want to obtain small index
values (including zero), most of the letters will be assigned
to negative values. Also, the letter values will never be very
large becasue the size of the table is an upper bound for the
index.
I chose a range of [-MaxLen, TableSize - MinLen),
where MaxLen is the maximum length of a key belonging to
the set; MinLen the minimum length of a key; and TableSize
the size of the hash table. The range is normally small (for the
keys in figure 1, assuming minimal hashing, it is equal to [-9,1))
but still effective.
This can be furthher improved by adapting the range dynamically.
Consider what happens when one of the two letters has already
been assigned a value. This occurs twice in the algorithm: when
only one value must be assigned and in the inner of the two nested
loops. In this case, you already know one of the values (say v)
and the length of the word, so you can recalculate the range as
[-v - Len, TableSize - v - Len],
where Len is the length of the key to which you are trying
to assign an index. This optimization alone accounted for more
than an order-of-magnitude improvement in the performance of the
algorithm: I've frequently seen a speedup of 40 or more. Note
that the dynamic range may occasionally extend the original one:
for instance, in figure 1 y is assigned a value of 3, that
is outside the static range. Further optimization of the dynamic
range can be obtained by considering which buckets are free. If
you keep track of the first and last free buckets in the hash
table, the dynamic range can be restricted to [FirstFree -v
- Len, LastFree - v - Len]. Listing
2 is my C++ implementation of steps (4) and (5). I also added
some logic to stop when the first solution is found.
The result is a fairly quick algorithm: a minimal perfect hash
for the 32 keywords of ANSI C (see figure 2) can be found in less
than one second on a Pentium/90, and the 46 noncolliding keywords
of the C++ need about four seconds. The complete C++ implementation
of my algorithm is available in the Dr. Dobb's forum of directly
from me via email. It has been compiled and tested on several
different architectures and operating systems, and it should work
without changes on most systems.
Figure 2:
Minimal perfect hashing for ANSI C keywords
a | -5 | l | 29 |
b | -8 | m | -4 |
c | -7 | n | -1 |
d | -10 | o | 22 |
e | 4 | r | 5 |
f | 12 | s | 7 |
g | -7 | t | 10 |
h | 4 | u | 26 |
i | 2 | v | 19 |
k | 31 | w | 2 |
auto | 21 | int | 15 |
break | 28 | long | 26 |
case | 1 | register | 18 |
char | 2 | return | 10 |
const | 8 | short | 22 |
continue | 5 | signed | 3 |
default | 7 | sizeof | 25 |
do | 14 | static | 6 |
double | 0 | struct | 23 |
else | 12 | switch | 17 |
enum | 4 | typedef | 29 |
extern | 9 | union | 30 |
float | 27 | unsigned | 24 |
for | 20 | void | 13 |
goto | 19 | volatile | 31 |
if | 16 | while | 11 |
Collisions resolution and extensions
Some words, as the C++ keywords double and delete,
results into a collision no matter which assignment is selected.
There are several ways to overcome this problem: the first is
to enhance the algorithm to consider letter positions other than
the first and last. Another is to consider more than two letters.
For instance [2] Haggard and Karplus tried using all letters with
encouraging results.
Another idea is to partition the set of words in two or more sets.
This is effective also when there are many words, and the algorithm
takes hours (or days) to calculate a perfect hashing. The best
idea would be to partition the set of keys into disjoint sets
of first and last letters. This allows you to use a single hash
table simply by adding an offset to each partition. Unfortunately,
the distribution of letters usually does not allow such a partition,
and you must often resort to multiple tables, and consequently
to multiple probes.
I'm currently investigating two very promising ideas, one based
on the concept of articulation point in a graph, which should
reduce backtracking, and another based on a look-ahead technique
to prune dead paths more quickly. Both techniques should reduce
the computational complexity of the algorithm in the average case,
so that larger set of keys can be conveniently mapped. If you
are interested in the results of my research, drop me an email
message.
Bibliography
[1] R. Cichelli, "Minimal Perfect Hashing Made Simple",
Comm. ACM Vol. 23 No. 1, Jan. 1980.
[2] Haggard and Karplus, "Finding Minimal Perfect Hash
Functions", ACM SICCSE Bull., Vol. 18, No. 1, Feb. 1986
Biography
Carlo Pescio holds a doctoral degree in Computer
Science and is a consultant and mentor for various European companies
and corporations, including the Directorate of the European Commission.
He specializes in object oriented technologies and is a member
of IEEE Computer Society, the ACM, and the New York Academy of
Sciences. He lives in Savona, Italy and can be contacted at pescio@eptacom.net.
Listing 1 void WordSet :: Reorder() { // reorder items in the set, so that if an item // first and last letter occurred before, the // item is moved upward in the list. This heuristic // maximize the likelihood that collisions are found // early in the assignment of values. const unsigned NO_OCCURRENCE = MAXINT ; for( unsigned i = 1; i < card; i++ ) { const Word& w = *(set[ i ]) ; char first = w.First() ; char last = w.Last() ; unsigned firstOccurredIndex = NO_OCCURRENCE ; unsigned lastOccurredIndex = NO_OCCURRENCE ; for( unsigned j = 0; j < i; j++ ) { const Word& prev = *(set[ j ]) ; if( lastOccurredIndex == NO_OCCURRENCE && ( prev.Last() == last || prev.First() == last ) ) lastOccurredIndex = j ; if( firstOccurredIndex == NO_OCCURRENCE && ( prev.First() == first || prev.Last() == first ) ) firstOccurredIndex = j ; if( firstOccurredIndex != NO_OCCURRENCE && lastOccurredIndex != NO_OCCURRENCE ) { // both letters occurred before: move the item upward, by // pushing down the ones between the maximum occurence // index and the item index itself. This preserve the // original ordering as much as possible unsigned firstToMoveDown = max( firstOccurredIndex, lastOccurredIndex ) + 1 ; unsigned lastToMoveDown = i - 1 ; Word* itemToMoveUp = set[ i ] ; for( unsigned k = lastToMoveDown; k >= firstToMoveDown; k-- ) set[ k + 1 ] = set[ k ] ; set[ firstToMoveDown ] = itemToMoveUp ; break ; // item 'i' handled } } } } bool HashFinder :: Search( unsigned index ) { const Word& w = words[ index ] ; if( value[ w.First() ] != NO_VALUE && value[ w.Last() ] != NO_VALUE ) { // try to insert into the table int h = Hash( w ) ; if( h < 0 || h >= maxTableSize || ! buckets.IsFree( h ) ) return( FALSE ) ; else { buckets.MarkUsed( h ) ; index++ ; int card = words.Card() ; if( index == card ) { cout << "------------\n" ; cout << "perfect hashing found\n" ; for( int i = 0; i < CHAR_MAX; i++ ) if( value[ i ] != NO_VALUE ) cout << (char)i << " = " << value[ i ] << " " ; cout << endl ; for( i = 0; i < card; i++ ) cout << words[ i ] << " = " << Hash( words[ i ] ) << endl ; // calculate and print loading factor int used = 0 ; for( i = 0; i < maxTableSize; i++ ) if( ! buckets.IsFree( i ) ) used++ ; cout << "loading factor: " << (float)used/maxTableSize << endl; // decide if it is time to stop now stopNow = stopFirst || ( used == maxTableSize && stopMinimal ); buckets.MarkFree( h ) ; // restore for backtracking return( TRUE ) ; } else { bool res = Search( index ) ; buckets.MarkFree( h ) ; // restore for backtracking return( res ) ; } } } else if( value[ w.First() ] == NO_VALUE && value[ w.Last() ] == NO_VALUE ) { // must find values for both first and last; int firstValToAssign = w.First() ; int lastValToAssign = w.Last() ; bool found = FALSE ; for( int i = minVal; i < maxVal && ! stopNow; i++ ) { value[ firstValToAssign ] = i ; // here we can change the heuristics: we already assigned one of // the values, so we have modified the boundaries. // Now we must have FirstFree <= to-assign + 'i' + len <= LastFree int minVal2 = buckets.FirstFree() - i - w.Len() ; int maxVal2 = buckets.LastFree() - i - w.Len() + 1 ; for( int j = minVal2; j < maxVal2 && ! stopNow; j++ ) { value[ lastValToAssign ] = j ; bool res = Search( index ) ; found = found || res ; } } value[ firstValToAssign ] = NO_VALUE ; // restore for backtracking value[ lastValToAssign ] = NO_VALUE ; return( found ) ; } else { // must find valid values for the one without values int valToAssign ; int valAssigned ; if( value[ w.First() ] == NO_VALUE ) { valToAssign = w.First() ; valAssigned = w.Last() ; } else { valToAssign = w.Last() ; valAssigned = w.First() ; } // see above for heuristics used here int minVal2 = buckets.FirstFree() - value[ valAssigned ] - w.Len(); int maxVal2 = buckets.LastFree() -value[valAssigned] - w.Len() + 1; assert( value[ valToAssign ] == NO_VALUE ) ; bool found = FALSE ; for( int i = minVal2; i < maxVal2 && ! stopNow; i++ ) { value[ valToAssign ] = i ; bool res = Search( index ) ; // will take first branch this time found = found || res ; } value[ valToAssign ] = NO_VALUE ; // restore for backtracking return( found ) ; } }