Demythifying the Aho-Hopcroft-Ullman Hash Function

Bruno Haible <bruno-antispam@antispam.clisp.org>
2001-07-04

Abstract: Although frequently used by naive programmers, the Aho-Hopcroft-Ullman hash function for strings is shown to be a major performance bottleneck for some applications; some variants are proposed which resolve this bottleneck.


The following function, taken from [Aho, Sethi, Ullman], computes the hash code of a string, seen as a sequence of bytes in memory.

/* Original hash function.  LONGBITS denotes the number of bits in an
   'unsigned long'.  */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
  size_t cnt;
  unsigned long int hval, g;

  cnt = 0;
  hval = keylen;
  while (cnt < keylen)
    {
      /* Shift the accumulated hash left by 4 bits and add the next byte.  */
      hval <<= 4;
      hval += (unsigned long int) *(((char *) key) + cnt++);
      /* Fold the top 4 bits, which would otherwise eventually be
         shifted out, down into bits 4..7, then clear them.  */
      g = hval & ((unsigned long) 0xf << (LONGBITS - 4));
      if (g != 0)
        {
          hval ^= g >> (LONGBITS - 8);
          hval ^= g;
        }
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}

This function is used in the 'localedef' program that belongs to the GNU C Library version 2.2. The command that was benchmarked was the creation of a Unicode locale: "localedef -i de_DE -f UTF-8 de_DE.UTF-8". compute_hashval is used by both the primary and secondary hash functions of two hash tables. The meaning of primary and secondary hash function becomes clear from the following hash table lookup function:

static size_t
lookup (hash_table *htab, const void *key, size_t keylen, unsigned long hval)
{
  unsigned long hash;
  size_t idx;
  /* Lookup call counter L */
  hash_entry *table = (hash_entry *) htab->table;

  /* First hash function: simply take the modulo, but prevent zero.  */
  hash = 1 + hval % htab->size;

  idx = hash;

  if (table[idx].used)
    {
      if (table[idx].used == hval && table[idx].keylen == keylen
          && memcmp (key, table[idx].key, keylen) == 0)
        return idx;

      /* Collision counter C */
      /* Second hash function as suggested in [Knuth].  */
      hash = 1 + hval % (htab->size - 2);

      do
        {
          if (idx <= hash)
            idx = htab->size + idx - hash;
          else
            idx -= hash;

          /* If entry is found use it.  */
          if (table[idx].used == hval && table[idx].keylen == keylen
              && memcmp (key, table[idx].key, keylen) == 0)
            return idx;
          /* Round-looping counter R */
        }
      while (table[idx].used);
    }
  return idx;
}
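
To illustrate the probe sequence that this double-hashing scheme generates, here is a minimal, self-contained sketch; the hash value and table size below are arbitrary examples of mine, not data from the benchmark.

#include <stdio.h>

/* Print the first few slots probed for a given hash value, mirroring
   the arithmetic of lookup() above.  */
int
main (void)
{
  unsigned long hval = 123456789UL;   /* example hash value */
  unsigned long size = 1021;          /* example table size */
  unsigned long idx  = 1 + hval % size;         /* primary hash */
  unsigned long step = 1 + hval % (size - 2);   /* secondary hash */
  int i;

  printf ("%lu", idx);
  for (i = 0; i < 5; i++)
    {
      /* Step backwards through the table, wrapping around, exactly
         as the do-while loop in lookup() does.  */
      idx = (idx <= step ? size + idx - step : idx - step);
      printf (" -> %lu", idx);
    }
  printf ("\n");
  return 0;
}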

The two hash tables have keys of the following form:

One can certainly claim that these situations are extreme uses and not the average case. But on the other hand, it cannot be denied that this is a real-life situation, and that there is no obviously better data structure than a hash table that could be used without constraining the form of the input data. Well-designed hash tables must also cope with extreme cases like this one.

At the marked positions in the code above, three counters were added:

  1. L, counting the calls of the lookup function,
  2. C, counting the collisions, i.e. the cases where the primary hash slot was occupied by a different entry,
  3. R, counting the iterations of the round-looping with the secondary hash function.

The following table shows these counts, taken over the total runtime of the application. The first column denotes the fullness at which the hash table is resized (at which its size is increased by a factor of 2). The second column shows L, which is of course independent of the "resize at" column. The next two columns denote the collisions and round-looping that occur with Aho-Hopcroft-Ullman's original hash function.
 
                            Original hashfunc     Modified hashfunc      Best hashfunc
Resize at %        L          C         R            C         R           C        R
99 3967749 3675179 81368723 3411436 38852940 2878792 21111366
98 3967749 3673939 80417219 3409981 38222653 2875586 20758471
95 3967749 3672883 78027125 3405394 37278047 2864306 20284108
90 3967749 3660798 75756023 3388510 35800720 2838975 19679561
85 3967749 3326789 57423689 2851028 16483724 1469015 4341997
80 3967749 3319496 54165729 2834775 16047321 1439351 4094990
75 3967749 3311030 50568145 2819586 15658912 1411443 3909274
70 3967749 3305998 48357944 2804174 15330091 1385422 3764736
65 3967749 3307809 47748304 2790929 15169761 1365246 3652689
60 3967749 3299227 47180698 2777389 14913818 1344251 3554541
55 3967749 3303693 46376191 2767100 14202239 1334525 3531360
50 3967749 3306849 45578422 2756588 14105465 1326398 3490387
45 3967749 3304367 44779595 2745398 13964985 1313752 3375041
40 3967749 2888722 37253590 2493803 12063451 608490 1248833
35 3967749 2893043 35589986 2485646 11974145 599018 1221599
30 3967749 2892003 35223235 2478289 11865595 586111 1187839
25 3967749 2898534 34987248 2470076 11608237 571787 1139588
20 3967749 2635739 33145946 2420612 11010101 412730 711494
15 3967749 2631859 32882465 2405928 10942076 402464 562713

A very small "resize at" percentage means a very large table with many holes. One would therefore expect C and R to become much smaller towards the bottom of the table. But this is hardly the case: the R/L ratio (the average number of round-looping iterations per call of the lookup function) decreases only from 20 (resize at 99%) to 8 (resize at 15%). The reasons for this behaviour are:

  1. The hash function has many collisions: it produces the same hash code for about 16 keys.
  2. Both the secondary and the primary hash function are computed from the same hash value, whereas Knuth actually recommends using two independent hash functions (see the sketch after this list).
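
As a hedged illustration of Knuth's recommendation (the function below and its constants are my own invention, not code from localedef), the secondary hash could be derived from an independent hash function such as:

#include <stddef.h>

/* A second, independent hash function: multiplicative mixing, so that
   keys colliding under the shift-based compute_hashval rarely collide
   here as well.  The multiplier 31 is an arbitrary illustrative choice.  */
static unsigned long
compute_hashval2 (const void *key, size_t keylen)
{
  size_t cnt;
  unsigned long hval = keylen;

  for (cnt = 0; cnt < keylen; cnt++)
    hval = hval * 31 + ((const unsigned char *) key)[cnt];
  return hval;
}

In lookup(), the probe step would then be computed as hash = 1 + compute_hashval2 (key, keylen) % (htab->size - 2), at the cost of a second pass over the key.
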
Now here is a modification of the hash function. It uses a shift by 5 bits instead of 4.

/* Modified hash function: identical to the original, except for the
   5-bit shift and the correspondingly wider mask.  */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
  size_t cnt;
  unsigned long int hval, g;

  cnt = 0;
  hval = keylen;
  while (cnt < keylen)
    {
      hval <<= 5;
      hval += (unsigned long int) *(((char *) key) + cnt++);
      g = hval & ((unsigned long) 0x1f << (LONGBITS - 5));
      if (g != 0)
        {
          hval ^= g >> (LONGBITS - 8);
          hval ^= g;
        }
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}

The fifth and sixth columns show the collision and round-loop counts for this hash function. One can see that in this case, the R/L ratio decreases from 9.8 to about 2.75. But the overall behaviour remains the same:

  1. The hash function still has many collisions: it produces the same hash code for about 5 to 6 keys.
Two further small modifications of the hash function yield the best variant: shift by 9 bits instead of 5, and rotate the top bits back in instead of clearing them.

/* Best hash function */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
  size_t cnt;
  unsigned long int hval;

  cnt = 0;
  hval = keylen;
  while (cnt < keylen)
    {
      /* Rotate hval left by 9 bits, so that no bits are lost, then
         add the next byte.  */
      hval = (hval << 9) | (hval >> (LONGBITS - 9));
      hval += (unsigned long int) *(((char *) key) + cnt++);
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}

The results of this function are displayed in the seventh and eighth columns of the table. One improvement is immediately visible:

  1. The collisions now decrease as the table size grows.
So, what was wrong with the first two hash functions? They perform extraordinarily badly on byte patterns like 1110xxxx 10xxxxxx 10xxxxxx, which is exactly the shape of a three-byte UTF-8 character. With a 4-bit shift, the first function maps the last two x bits of the first byte and the first two x bits of the second byte to the same positions in the final hash code, and similarly for the second and third byte. Thus four bits are effectively killed while computing the hash code, which explains the collision list length of 16 observed above.

This wasn't even the worst possible result: for other bit patterns, like 1110xxxx xxxxxxxx xxxx0101, the number of killed bits would have been 8, giving rise to collision lists of length 256.
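
To make this concrete, here is a small self-contained test of my own (hash_orig and hash_best are renamed copies of the original and best functions above; the two keys are the UTF-8 encodings of U+4400 and U+3800, chosen as an illustration). The keys differ by -1 in the lead byte and +0x10 in the first continuation byte, a difference that cancels out under the 4-bit shift, so the original function hashes both to the same value, while the rotating variant keeps them apart.

#include <stdio.h>
#include <stddef.h>
#include <limits.h>

#define LONGBITS (sizeof (unsigned long) * CHAR_BIT)

/* Original 4-bit-shift hash function, as shown above.  */
static unsigned long
hash_orig (const void *key, size_t keylen)
{
  size_t cnt = 0;
  unsigned long hval = keylen, g;

  while (cnt < keylen)
    {
      hval <<= 4;
      hval += (unsigned long) *(((char *) key) + cnt++);
      g = hval & ((unsigned long) 0xf << (LONGBITS - 4));
      if (g != 0)
        {
          hval ^= g >> (LONGBITS - 8);
          hval ^= g;
        }
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}

/* 9-bit-rotate hash function, as shown above.  */
static unsigned long
hash_best (const void *key, size_t keylen)
{
  size_t cnt = 0;
  unsigned long hval = keylen;

  while (cnt < keylen)
    {
      hval = (hval << 9) | (hval >> (LONGBITS - 9));
      hval += (unsigned long) *(((char *) key) + cnt++);
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}

int
main (void)
{
  static const char k1[] = "\xE4\x90\x80";  /* UTF-8 for U+4400 */
  static const char k2[] = "\xE3\xA0\x80";  /* UTF-8 for U+3800 */

  /* The original function maps both keys to the same value ...  */
  printf ("orig: %lx %lx\n", hash_orig (k1, 3), hash_orig (k2, 3));
  /* ... the rotating variant keeps them apart.  */
  printf ("best: %lx %lx\n", hash_best (k1, 3), hash_best (k2, 3));
  return 0;
}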

Let's now look at the computation time.

Resize at %   Original    Modified    Best       Memory
90 19.9 sec 11.1 sec 9.4 sec 76.8 MB
85 19.8 sec 9.7 sec 8.0 sec 98.4 MB
80     7.9 sec 98.4 MB
40   10.0 sec 8.5 sec 141.7 MB
20   12.1 sec 10.8 sec 239.3 MB

The computation time also depends on the resize fullness. If this factor is larger than 85%, the number of collisions is quite high (see the first table), eating CPU time in the round-looping. If this factor is less than 40%, the program eats a whole lot more memory, slowing it down through cache misses and paging. So a good resize fullness is between 40% and 85%. For the further timings, I chose 75%.
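
As a hedged sketch of this policy (the names below are invented for illustration, not taken from localedef), the resize check could look like this:

#include <stddef.h>

/* RESIZE_AT is the fullness percentage discussed above; 75 lies in
   the good range between 40 and 85.  */
#define RESIZE_AT 75

/* Hypothetical helper: report whether a table with 'filled' occupied
   slots out of 'size' should be grown (by a factor of 2, as above).  */
static int
needs_resize (size_t filled, size_t size)
{
  return 100 * filled > (size_t) RESIZE_AT * size;
}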

The final speedup, achieved through the changed hash function and the changed resize threshold, is as follows.

                          Original    Best
Total time                19.9 sec    7.9 sec
Time spent in lookup()    72.5%       26.9%