2001-07-04

The following function, taken from [Aho, Sethi, Ullman], computes the hash code of a string, seen as a sequence of bytes in memory.

```c
/* Original hash function */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
  size_t cnt;
  unsigned long int hval, g;

  cnt = 0;
  hval = keylen;
  while (cnt < keylen)
    {
      hval <<= 4;
      hval += (unsigned long int) *(((char *) key) + cnt++);
      g = hval & ((unsigned long) 0xf << (LONGBITS - 4));
      if (g != 0)
        {
          hval ^= g >> (LONGBITS - 8);
          hval ^= g;
        }
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}
```

This function is used in the 'localedef' program that belongs to the GNU C Library version 2.2. The command that was benchmarked was the creation of a Unicode locale: "`localedef -i de_DE -f UTF-8 de_DE.UTF-8`". `compute_hashval` is used by both the primary and secondary hash functions of two hash tables. The meaning of the primary and secondary hash functions becomes clear from the following hash table lookup function:

```c
static size_t
lookup (hash_table *htab, const void *key, size_t keylen, unsigned long hval)
{
  unsigned long hash;
  size_t idx;
  /* Lookup call counter L */
  hash_entry *table = (hash_entry *) htab->table;

  /* First hash function: simply take the modulo but prevent zero. */
  hash = 1 + hval % htab->size;

  idx = hash;

  if (table[idx].used)
    {
      if (table[idx].used == hval && table[idx].keylen == keylen
          && memcmp (key, table[idx].key, keylen) == 0)
        return idx;

      /* Collision counter C */
      /* Second hash function as suggested in [Knuth]. */
      hash = 1 + hval % (htab->size - 2);

      do
        {
          if (idx <= hash)
            idx = htab->size + idx - hash;
          else
            idx -= hash;

          /* If entry is found use it. */
          if (table[idx].used == hval && table[idx].keylen == keylen
              && memcmp (key, table[idx].key, keylen) == 0)
            return idx;
          /* Round-looping counter R */
        }
      while (table[idx].used);
    }
  return idx;
}
```

The two hash tables have keys of the following form:

- The character-by-name table has 230000 keys of the form "Uxxxx" and "U000xxxxx", where each 'x' stands for an upper-case hexadecimal digit.
- The character-by-byte-sequence table has 230000 keys that correspond to valid UTF-8 sequences with up to four bytes. For instance, among those with length 3, a large portion of all possible bit patterns of the form 1110xxxx 10xxxxxx 10xxxxxx (each x either 0 or 1) is used.

At the marked positions, three counters were added:

- L counts the number of calls to the lookup function,
- C counts the number of times a collision is encountered, i.e. the second hash function is used,
- R counts the number of round-looping over occupied entries in the hash table.

| Resize at % | L | Original C | Original R | Modified C | Modified R | Best C | Best R |
|---|---|---|---|---|---|---|---|
| 99 | 3967749 | 3675179 | 81368723 | 3411436 | 38852940 | 2878792 | 21111366 |
| 98 | 3967749 | 3673939 | 80417219 | 3409981 | 38222653 | 2875586 | 20758471 |
| 95 | 3967749 | 3672883 | 78027125 | 3405394 | 37278047 | 2864306 | 20284108 |
| 90 | 3967749 | 3660798 | 75756023 | 3388510 | 35800720 | 2838975 | 19679561 |
| 85 | 3967749 | 3326789 | 57423689 | 2851028 | 16483724 | 1469015 | 4341997 |
| 80 | 3967749 | 3319496 | 54165729 | 2834775 | 16047321 | 1439351 | 4094990 |
| 75 | 3967749 | 3311030 | 50568145 | 2819586 | 15658912 | 1411443 | 3909274 |
| 70 | 3967749 | 3305998 | 48357944 | 2804174 | 15330091 | 1385422 | 3764736 |
| 65 | 3967749 | 3307809 | 47748304 | 2790929 | 15169761 | 1365246 | 3652689 |
| 60 | 3967749 | 3299227 | 47180698 | 2777389 | 14913818 | 1344251 | 3554541 |
| 55 | 3967749 | 3303693 | 46376191 | 2767100 | 14202239 | 1334525 | 3531360 |
| 50 | 3967749 | 3306849 | 45578422 | 2756588 | 14105465 | 1326398 | 3490387 |
| 45 | 3967749 | 3304367 | 44779595 | 2745398 | 13964985 | 1313752 | 3375041 |
| 40 | 3967749 | 2888722 | 37253590 | 2493803 | 12063451 | 608490 | 1248833 |
| 35 | 3967749 | 2893043 | 35589986 | 2485646 | 11974145 | 599018 | 1221599 |
| 30 | 3967749 | 2892003 | 35223235 | 2478289 | 11865595 | 586111 | 1187839 |
| 25 | 3967749 | 2898534 | 34987248 | 2470076 | 11608237 | 571787 | 1139588 |
| 20 | 3967749 | 2635739 | 33145946 | 2420612 | 11010101 | 412730 | 711494 |
| 15 | 3967749 | 2631859 | 32882465 | 2405928 | 10942076 | 402464 | 562713 |

A very small "resize at" percentage means a very large table with many holes. One would expect C and R to become smaller towards the bottom of the table, but this is hardly the case: the R/L ratio (the average number of round-loops per call of the lookup function) decreases only from 20 (top, resize at 99%) to 8 (bottom, resize at 15%). The reasons for this behaviour are:

- The hash function has many collisions: it produces the same hash code for about 16 distinct keys.
- Both the primary and the secondary hash function are computed from the same hash value, whereas Knuth actually recommends using two independent hash functions.

```c
/* Modified hash function */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
  size_t cnt;
  unsigned long int hval, g;

  cnt = 0;
  hval = keylen;
  while (cnt < keylen)
    {
      hval <<= 5;
      hval += (unsigned long int) *(((char *) key) + cnt++);
      g = hval & ((unsigned long) 0x1f << (LONGBITS - 5));
      if (g != 0)
        {
          hval ^= g >> (LONGBITS - 8);
          hval ^= g;
        }
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}
```

The fifth and sixth columns show the collision and round-loop counts for this hash function. One can see that in this case the R/L ratio decreases from 9.8 to about 2.75. But the overall behaviour remains the same:

- The hash function still has many collisions: it produces the same hash code for about 5 to 6 keys.

```c
/* Best hash function */
static unsigned long
compute_hashval (const void *key, size_t keylen)
{
  size_t cnt;
  unsigned long int hval;

  cnt = 0;
  hval = keylen;
  while (cnt < keylen)
    {
      hval = (hval << 9) | (hval >> (LONGBITS - 9));
      hval += (unsigned long int) *(((char *) key) + cnt++);
    }
  return hval != 0 ? hval : ~((unsigned long) 0);
}
```

The results of this function are displayed in the seventh and eighth columns of the table.

- The collisions now decrease as the table size grows.

This wasn't even the worst possible result: for other bit patterns, like 1110xxxx xxxxxxxx xxxx0101, the number of killed bits would have been 8, giving rise to collision lists of length 256.

Let's now look at the computation time.

| Resize at % | Original | Modified | Best | Memory (Best) |
|---|---|---|---|---|
| 90 | 19.9 sec | 11.1 sec | 9.4 sec | 76.8 MB |
| 85 | 19.8 sec | 9.7 sec | 8.0 sec | 98.4 MB |
| 80 | | | 7.9 sec | 98.4 MB |
| 40 | | 10.0 sec | 8.5 sec | 141.7 MB |
| 20 | | 12.1 sec | 10.8 sec | 239.3 MB |

The computation time also depends on the resize fullness. If this factor is larger than 85%, the number of collisions is quite high (see the first table), eating CPU time in the round-looping. If it is less than 40%, the program uses a whole lot more memory, slowing it down through cache misses and paging. So a good resize fullness lies between 40% and 85%. For further timings, I chose 75%.

The final speed achievement, through the changed hash function and the
changed resize threshold, is as follows.

| | Original | Best |
|---|---|---|
| Total time | 19.9 sec | 7.9 sec |
| Time spent in lookup() | 72.5% | 26.9% |