Data Structures with Generic Types in C++/

...

/

Hash Codes

The hash tables are used to associate data with integer keys consisting of $w$ bits. In many cases, we have keys that are not integers. They may be strings, objects, arrays, or other compound structures. To use hash tables for these types of data, we must map these data types to $w-\text{bit}$ hash codes. Hash code mappings should have the following properties:

If x and y are equal, then x.hashCode() and y.hashCode() are equal.
If x and y are not equal, then the probability that x.hashCode() = y.hashCode() should be small (close to $1/2^{w} )$ .

The first property ensures that if we store x in a hash table and later look up a value y equal to x, then we will find x—as we should. The second property minimizes the loss from converting our objects to integers. It ensures that unequal objects usually have different hash codes and so are likely to be stored at different locations in our hash table.

Hash codes for primitive data types

Small primitive data types like char, byte, int, and float are usually easy to find hash codes for. These data types always have a binary representation and this binary representation usually consists of $w$ or fewer bits. (For example, in C++, byte is an 8-bit type and float is a 32-bit type.) In these cases, we just treat these bits as the representation of an integer in the range $\{0,...,2^{w} − 1\}$ . If two values are different, they get different hash codes. If they are the same, they get the same hash code.

A few primitive data types are made up of more than $w$ bits, usually $cw$ bits for some constant integer $c$ . (Java’s long and double types are examples of this with $c = 2$ .) These data types can be treated as compound objects made of $c$ parts, as described in the next section.

Hash codes for compound objects

For a compound object, we want to create a hash code by combining the individual hash codes of the object’s constituent parts. This is not as easy as it sounds. Although one can find many hacks for this (for example, combining the hash codes with bitwise exclusive-or operations), many of these hacks are easy to foil. However, if one is willing to do arithmetic with $2w$ bits of precision, then there are simple and robust methods available. Suppose we have an object made up of several parts $P_{0},...,P_{r−1}$ whose hash codes are ${x_{0},...,x_{r−1}}$ . Then we can choose mutually independent random $w-$ bit integers $z_{0},...,z_{r−1}$ and a random $2w-$ bit odd integer $z$ and compute a hash code for our object with

h\left ( x_{0},\ldots,x_{r-1} \right ) = \left [ \left [ z \sum_{i=0}^{r-1}z_{i}x_{i} \right] \text{mod}\ 2^{2w} \right ] \text{div}\ 2^w

Note that this hash code has a final step (multiplying by $z$ and dividing by $2^{w} )$ that uses the multiplicative hash function to take the $2w-$ bit intermediate result and reduce it to a $w$ -bit final result. Here is an example of this method applied to a simple compound object with three parts x0, x1, and x2:

The following theorem shows that, in addition to being straightforward to implement, this method is probably good:

Theorem 1: Let $x_{0},...,x_{r−1}$ and $y_{0},...,y_{r−1}$ each be sequences of $w$ bit integers in $\{0,...,2^{w} −1\}$ and assume $x_{i} \neq y_{i}$ for at least one index $i \in \{0,...,r−1\}$ . Then

\Pr\{h(x_{0},...,x_{r−1}) = h(y_{0} ,...,y_{r−1})\} \le 3/2^{w}

Proof: We will first ignore the final multiplicative hashing step and see how that step contributes later. Define:

h^{\prime}\left ( x_{o},...,x_{r-1} \right ) = \left [ \sum_{j=0}^{r-1} z_{j} x_{j}\right ] \text{mod } 2^{2w}

Suppose that $h^{\prime} (x_{0},...,x_{r−1}) = h^{\prime}(y_{0} ,...,y_{r−1})$ ...

Overview

Array-Based Lists

Linked Lists

Skiplists

Hash Tables

Binary Trees

Random Binary Search Trees

Scapegoat Trees

Red-Black Trees

Heaps

Sorting Algorithms

Graphs

Data Structures for Integers

External Memory Searching

Wrap Up

Hash Codes

Hash codes for primitive data types

Hash codes for compound objects