Demythifying the Aho-Hofcroft-Ullman Hash Function

Bruno Haible `<bruno-antispam@antispam.clisp.org> 2001-07-04`

Abstract: Although frequently used by naive programmers, the Aho-Hopcroft-Ullman Hash function for strings is shown to be a major performance bottleneck for some applications, and some variants are proposed which resolve this bottleneck.

The following function, taken from [Aho, Sethi, Ullman], computes the hash code of a string, seen as a sequence of bytes in memory.

/* Original hash function */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
size_t cnt;
unsigned long int hval, g;

cnt = 0;
hval = keylen;
while (cnt < keylen)
    {
      hval <<= 4;
      hval += (unsigned long int) *(((char *) key) + cnt++);
      g = hval & ((unsigned long) 0xf << (LONGBITS - 4));
      if (g != 0)
        {
          hval ^= g >> (LONGBITS - 8);
          hval ^= g;
        }
    }
return hval != 0 ? hval : ~((unsigned long) 0);
}

This function is used in the 'localedef' program that belongs to the GNU C Library version 2.2. The command that was benchmarked was the creation of a Unicode locale: "localedef -i de_DE -f UTF-8 de_DE.UTF-8". compute_hashval is used by both the primary and secondary hash functions of two hash tables. The meaning of primary and secondary hash function becomes clear from the following hash table lookup function:

static size_t
lookup (hash_table *htab, const void *key, size_t keylen, unsigned long hval)
{
unsigned long hash;
size_t idx;
/* Lookup call counter L */
hash_entry *table = (hash_entry *) htab->table;

/* First hash function: simply take the modul but prevent zero. */
hash = 1 + hval % htab->size;

idx = hash;

if (table[idx].used)
    {
      if (table[idx].used == hval && table[idx].keylen == keylen
          && memcmp (key, table[idx].key, keylen) == 0)
        return idx;

      /* Collision counter C */
      /* Second hash function as suggested in [Knuth]. */
      hash = 1 + hval % (htab->size - 2);

      do
        {
          if (idx <= hash)
            idx = htab->size + idx - hash;
          else
            idx -= hash;

          /* If entry is found use it. */
          if (table[idx].used == hval && table[idx].keylen == keylen
              && memcmp (key, table[idx].key, keylen) == 0)
            return idx;
          /* Round-looping counter R */
        }
      while (table[idx].used);
    }
return idx;
}

The two hash tables have keys of the following form:

The character-by-name table has 230000 keys of the form "Uxxxx" and "U000xxxxx" where the 'x' represent hexadecimal digits in upper case.
The character-by-byte-sequence table has 230000 keys that correspond to valid UTF-8 sequences with up to four bytes. For instance, among those with length 3, a large portion of all possible bit patterns of the form 1110xxxx 10xxxxxx 10xxxxxx (each x either 0 or 1) is used.

One can certainly claim that these situations are extreme uses and not the average case. But on the other hand, it cannot be denied that this is a real-life situation, and that there is no obviously better data structure than a hash table that could be used without constraining the form of the input data. Well designed hash tables must also apply to extreme cases like this one.

At the marked position three counters were added:

L counts the number of calls to the lookup function,
C counts the number of times a collision is encountered, i.e. the second hash function is used,
R counts the number of round-looping over occupied entries in the hash table.

The following table shows these counts, taken over the total runtime of the application. The first column denotes the fullness at which the hash table is resized (at which its size is increased by a factor of 2). The second column shows L, which is of course independent of the "resize at" column. The next two columns denote the collisions and round-looping that occur with Aho-Hopcroft-Ullman's original hash function.

Resize at		Original hashfunc		Modified hashfunc		Best hashfunc
%	L	C	R	C	R	C	R
99	3967749	3675179	81368723	3411436	38852940	2878792	21111366
98	3967749	3673939	80417219	3409981	38222653	2875586	20758471
95	3967749	3672883	78027125	3405394	37278047	2864306	20284108
90	3967749	3660798	75756023	3388510	35800720	2838975	19679561
85	3967749	3326789	57423689	2851028	16483724	1469015	4341997
80	3967749	3319496	54165729	2834775	16047321	1439351	4094990
75	3967749	3311030	50568145	2819586	15658912	1411443	3909274
70	3967749	3305998	48357944	2804174	15330091	1385422	3764736
65	3967749	3307809	47748304	2790929	15169761	1365246	3652689
60	3967749	3299227	47180698	2777389	14913818	1344251	3554541
55	3967749	3303693	46376191	2767100	14202239	1334525	3531360
50	3967749	3306849	45578422	2756588	14105465	1326398	3490387
45	3967749	3304367	44779595	2745398	13964985	1313752	3375041
40	3967749	2888722	37253590	2493803	12063451	608490	1248833
35	3967749	2893043	35589986	2485646	11974145	599018	1221599
30	3967749	2892003	35223235	2478289	11865595	586111	1187839
25	3967749	2898534	34987248	2470076	11608237	571787	1139588
20	3967749	2635739	33145946	2420612	11010101	412730	711494
15	3967749	2631859	32882465	2405928	10942076	402464	562713

A very small "resize at" percentage means a very large table with many holes. One would expect that C and R get smaller at the bottom. But this is hardly the case: the R/L ratio (= average number of round-looping per lookup function call) decreases from 20 (top, resize at 99%) only to 8 (bottom, resize at 15%). The reason for this behaviour is

The hash function has many collisions, namely it produces the same hash code for about 16 hash keys,
Both the secondary and the primary hash function are computed from the same hash value. Knuth actually recommends using two independent hash functions.

Now here is a modification of the hash function. It uses a shift by 5 instead of 4 bits.

/* Modified hash function */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
size_t cnt;
unsigned long int hval, g;

cnt = 0;
hval = keylen;
while (cnt < keylen)
    {
      hval <<= 5;
      hval += (unsigned long int) *(((char *) key) + cnt++);
      g = hval & ((unsigned long) 0x1f << (LONGBITS - 5));
      if (g != 0)
        {
          hval ^= g >> (LONGBITS - 8);
          hval ^= g;
        }
    }
return hval != 0 ? hval : ~((unsigned long) 0);
}

The fifth and sixth columns show the collision and round-loop counts for this hash function. One can see that in this case, the R/L ratio decreases from 9.8 to about 2.75. But the gross behaviour remains the same:

The hash function has many collisions, namely the same hash code for about 5 to 6 hash keys.

Two further small modification of the hash function are: shift by 9 bits instead of 5, and keep the top level bits set instead of clearing them.

/* Best hash function */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
size_t cnt;
unsigned long int hval;

cnt = 0;
hval = keylen;
while (cnt < keylen)
    {
      hval = (hval << 9) | (hval >> (LONGBITS - 9));
      hval += (unsigned long int) *(((char *) key) + cnt++);
    }
return hval != 0 ? hval : ~((unsigned long) 0);
}

The results of these function are displayed in the seventh and eighth column of the table.

The collisions now decrease as the table size grows.

So, what was wrong for the first two hash functions? They perform extraordinarily bad on short bit patterns like 1110xxxx 10xxxxxx 10xxxxxx. The first maps the two last xx bits of the first byte and the two first xx bits of the second byte to the same positions in the final hash code, and similarly for the second and third byte. Thus four bits are effectively killed while computing the hash code, which explains the collision list length of 16 observed above.

This wasn't even the worst possible result: For other bit patterns, like 1110xxxx xxxxxxxx xxxx0101, the amount of killed bits would have been 8, giving rise to collision lists of length 256.

Let's now look at the computation time.

Resize at %	Original	Modified	Best	Best, Memory
90	19.9 sec	11.1 sec	9.4 sec	76.8 MB
85	19.8 sec	9.7 sec	8.0 sec	98.4 MB
80			7.9 sec	98.4 MB
40		10.0 sec	8.5 sec	141.7 MB
20		12.1 sec	10.8 sec	239.3 MB

It is also dependent on the resize fullness. If this factor is larger than 85%, the number of collisions is quite high (see first table), eating CPU time in the round-looping. If this factor is less than 40%, the program eats a whole lot more memory, slowing down the program through cache misses and paging. So a good resize fullness is between 40% and 85%. Fur further timings, I chose 75%.

The final speed achievement, through the changed hash function and the changed resize threshold, is as follows.

	Original	Best
Total time	19.9 sec	7.9 sec
Time spent in `lookup()`	72.5%	26.9%

Demythifying the Aho-Hofcroft-Ullman Hash Function

Bruno Haible <bruno-antispam@antispam.clisp.org> 2001-07-04

Bruno Haible `<bruno-antispam@antispam.clisp.org> 2001-07-04`