Many applications require a dynamic set that supports only the dictionary operations INSERT, SEARCH, and DELETE. For example, a compiler for a computer language maintains a symbol table, in which the keys of elements are arbitrary character strings that correspond to identifiers in the language. A hash table is an effective data structure for implementing dictionaries. Although searching for an element in a hash table can take as long as searching for an element in a linked list--Θ(*n*) time in the worst case--in practice, hashing performs extremely well. Under reasonable assumptions, the expected time to search for an element in a hash table is *O*(1).

Direct addressing is a simple technique that works well when the universe *U* of keys is reasonably small. Suppose that an application needs a dynamic set in which each element has a key drawn from the universe *U* = {0,1, . . . , m - 1}, where *m* is not too large. We shall assume that no two elements have the same key.

To represent the dynamic set, we use an array, or *direct-address table*, *T*[0 . . *m* - 1], in which each position, or *slot*, corresponds to a key in the universe *U*.

The dictionary operations are trivial to implement.

DIRECT-ADDRESS-SEARCH(T, k)
  return T[k]

DIRECT-ADDRESS-INSERT(T, x)
  T[key[x]] ← x

DIRECT-ADDRESS-DELETE(T, x)
  T[key[x]] ← NIL

Each of these operations is fast: only *O*(1) time is required.
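The three operations can be sketched in Python as follows (the class layout and `Element` type are illustrative assumptions; any object with a `key` field works):

```python
from collections import namedtuple

# A stand-in for a stored element: a key plus satellite data.
Element = namedtuple("Element", ["key", "data"])

class DirectAddressTable:
    """Direct-address table T[0..m-1]; each slot holds an element or None (NIL)."""

    def __init__(self, m):
        self.slots = [None] * m   # one slot per key in the universe {0, ..., m-1}

    def search(self, k):
        return self.slots[k]      # DIRECT-ADDRESS-SEARCH: return T[k]

    def insert(self, x):
        self.slots[x.key] = x     # DIRECT-ADDRESS-INSERT: T[key[x]] <- x

    def delete(self, x):
        self.slots[x.key] = None  # DIRECT-ADDRESS-DELETE: T[key[x]] <- NIL
```

Each method touches exactly one array slot, which is the source of the *O*(1) bound.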

A *bit vector* is simply an array of bits (0's and 1's). A bit vector of length *m* takes much less space than an array of *m* pointers. Describe how to use a bit vector to represent a dynamic set of distinct elements with no satellite data. Dictionary operations should run in *O*(1) time.
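As a sketch, a dynamic set of distinct keys drawn from {0, . . . , *m* - 1} can be packed into a Python `bytearray` used as a bit vector (the class name and packing scheme are illustrative):

```python
class BitVector:
    """Bit vector of length m: bit k is 1 iff key k is in the set."""

    def __init__(self, m):
        self.bits = bytearray((m + 7) // 8)    # m bits packed 8 per byte

    def insert(self, k):
        self.bits[k // 8] |= 1 << (k % 8)      # set bit k

    def delete(self, k):
        self.bits[k // 8] &= ~(1 << (k % 8))   # clear bit k

    def search(self, k):
        return (self.bits[k // 8] >> (k % 8)) & 1 == 1
```

All three operations touch a single byte, so each runs in *O*(1) time, and the table uses one bit per key of the universe rather than one pointer.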

The difficulty with direct addressing is obvious: if the universe *U* is large, storing a table *T* of size |*U*| may be impractical, or even impossible, given the memory available on a typical computer. Furthermore, the set *K* of keys *actually stored* may be so small relative to *U* that most of the space allocated for *T* would be wasted.

With direct addressing, an element with key *k* is stored in slot *k*. With hashing, this element is stored in slot *h*(*k*); that is, a **hash function** *h* is used to compute the slot from the key *k*. Here *h* maps the universe *U* of keys into the slots of a **hash table** *T*[0 . . *m* - 1]:

*h*: *U* → {0, 1, . . . , *m* - 1} .

We say that an element with key *k* *hashes* to slot *h*(*k*); we also say that *h*(*k*) is the *hash value* of key *k*.

The fly in the ointment of this beautiful idea is that two keys may hash to the same slot--a **collision**. Fortunately, there are effective techniques for resolving the conflict created by collisions.

In *chaining*, we put all the elements that hash to the same slot in a linked list, as shown in Figure 12.3. Slot *j* contains a pointer to the head of the list of all stored elements that hash to *j*; if there are no such elements, slot *j* contains NIL.

CHAINED-HASH-INSERT(T, x)
  insert x at the head of list T[h(key[x])]

CHAINED-HASH-SEARCH(T, k)
  search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
  delete x from the list T[h(key[x])]
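A compact Python sketch of chaining (Python lists stand in for the linked lists, and `hash(k) % m` is a stand-in for the hash function *h*; names are illustrative):

```python
class ChainedHashTable:
    """Hash table with collisions resolved by chaining."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return hash(k) % self.m               # stand-in for h(k)

    def insert(self, k, v):
        # Insert at the head of the chain, as in CHAINED-HASH-INSERT.
        self.table[self._h(k)].insert(0, (k, v))

    def search(self, k):
        # Scan the one chain the key can live in.
        for key, v in self.table[self._h(k)]:
            if key == k:
                return v
        return None

    def delete(self, k):
        chain = self.table[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                del chain[i]
                return
```

With a genuine doubly linked list and a pointer to the element, deletion is *O*(1); the Python-list version above pays *O*(chain length) instead, but the slot structure is the same.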

How well does hashing with chaining perform? In particular, how long does it take to search for an element with a given key?

Given a hash table *T* with *m* slots that stores *n* elements, we define the *load factor* α for *T* as *n*/*m*, that is, the average number of elements stored in a chain.

The average performance of hashing depends on how well the hash function *h* distributes the set of keys to be stored among the *m* slots, on the average. Section 12.3 discusses these issues, but for now we shall assume that any given element is equally likely to hash into any of the *m* slots, independently of where any other element has hashed to. We call this the assumption of *simple uniform hashing*.

In a hash table in which collisions are resolved by chaining, a successful search takes time Θ(1 + α), on the average, under the assumption of simple uniform hashing.

Suggest how storage for elements can be allocated and deallocated within the hash table itself by linking all unused slots into a free list. Assume that one slot can store a flag and either one element plus a pointer or two pointers. All dictionary and free-list operations should run in *O*(1) expected time. Does the free list need to be doubly linked, or does a singly linked free list suffice?

In this section, we discuss some issues regarding the design of good hash functions and then present three schemes for their creation: hashing by division, hashing by multiplication, and universal hashing.

Unfortunately, it is generally not possible to check this condition, since *P* is usually unknown.

For example, if the keys are known to be random real numbers *k* independently and uniformly distributed in the range 0 ≤ *k* < 1, the hash function

*h*(*k*) = ⌊*km*⌋

can be shown to satisfy equation (12.1).

In practice, heuristic techniques can be used to create a hash function that is likely to perform well. Qualitative information about *P* is sometimes useful in this design process. For example, consider a compiler's symbol table, in which the keys are arbitrary character strings representing identifiers in a program. It is common for closely related symbols, such as pt and pts, to occur in the same program. A good hash function would minimize the chance that such variants hash to the same slot.

In the *division method* for creating hash functions, we map a key *k* into one of *m* slots by taking the remainder of *k* divided by *m*. That is, the hash function is

*h*(*k*) = *k* mod *m* .

*h*(*k*) = *k* mod 701 .
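In Python the division method is a one-liner; the default *m* = 701 below follows the example above (the function name is illustrative):

```python
def h_division(k, m=701):
    """Division-method hash: h(k) = k mod m.

    m is typically chosen to be a prime not too close to a power of 2,
    so that the hash depends on all the bits of the key.
    """
    return k % m
```

For instance, `h_division(123456)` maps the key 123456 into slot 123456 mod 701 = 80.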

The *multiplication method* for creating hash functions operates in two steps. First, we multiply the key *k* by a constant *A* in the range 0 < *A* < 1 and extract the fractional part of *kA*. Then, we multiply this value by *m* and take the floor of the result. That is, the hash function is

*h*(*k*) = ⌊*m*(*kA* mod 1)⌋ ,

where "*kA* mod 1" means the fractional part of *kA*, that is, *kA* - ⌊*kA*⌋.

Knuth suggests that *A* ≈ (√5 - 1)/2 = 0.6180339887 . . . (equation 12.2) is likely to work reasonably well.

As an example, if we have *k* = 123456, *m* = 10000, and *A* as in equation (12.2), then

h(k) = ⌊10000 · (123456 · 0.61803 . . . mod 1)⌋

= ⌊10000 · (76300.0041151 . . . mod 1)⌋

= ⌊10000 · 0.0041151 . . .⌋

= ⌊41.151 . . .⌋

= 41 .
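The worked computation above can be checked in Python (`math.floor` implements ⌊·⌋; the default *A* is the constant from equation (12.2)):

```python
import math

def h_multiplication(k, m, A=0.6180339887):
    """Multiplication-method hash: h(k) = floor(m * (k*A mod 1)).

    (k * A) % 1 extracts the fractional part of kA; multiplying by m
    and taking the floor yields a slot in {0, ..., m-1}.
    """
    return math.floor(m * ((k * A) % 1))
```

Running `h_multiplication(123456, 10000)` reproduces the example's answer of 41.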

If a malicious adversary chooses the keys to be hashed, then he can choose *n* keys that all hash to the same slot, yielding an average retrieval time of Θ(*n*). Any fixed hash function is vulnerable to this sort of worst-case behavior; the only effective way to improve the situation is to choose the hash function *randomly* in a way that is *independent* of the keys that are actually going to be stored. This approach, called *universal hashing*, yields good performance on the average, no matter what keys are chosen by the adversary.

The main idea behind universal hashing is to select the hash function at random at run time from a carefully designed class of functions. As in the case of quicksort, randomization guarantees that no single input will always evoke worst-case behavior. Because of the randomization, the algorithm can behave differently on each execution, even for the same input. This approach guarantees good average-case performance, no matter what keys are provided as input. Returning to the example of a compiler's symbol table, we find that the programmer's choice of identifiers cannot now cause consistently poor hashing performance. Poor performance occurs only if the compiler chooses a random hash function that causes the set of identifiers to hash poorly, but the probability of this occurring is small and is the same for any set of identifiers of the same size.
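For illustration, here is a sketch of one well-known universal class, *h*_{a,b}(*k*) = ((*ak* + *b*) mod *p*) mod *m* with *p* prime and *a*, *b* drawn at random. This is not the class defined by equations (12.3) and (12.4), but it realizes the same idea of selecting the hash function at random at run time:

```python
import random

def make_universal_hash(m, p=2_147_483_647):
    """Draw a hash function at random from the class
    h_{a,b}(k) = ((a*k + b) mod p) mod m,
    where p is a prime larger than any key (2^31 - 1 here, an
    illustrative choice), a in {1,...,p-1}, and b in {0,...,p-1}.
    """
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m
```

Each call to `make_universal_hash` corresponds to one run-time choice of hash function; the adversary cannot pick bad keys in advance because it does not know which function will be drawn.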

E[*C*_{x}] = Σ_{*y*∈*S*-{*x*}} 1/*m* = (*n* - 1)/*m* .

Since *n* ≤ *m*, we have E[*C*_{x}] < 1.

The class defined by equations (12.3) and (12.4) is a universal class of hash functions.

To see this property, note that since *m* is prime, the nonzero quantity *x*_{0} - *y*_{0} has a multiplicative inverse modulo *m*, and thus there is a unique solution for *a*_{0} modulo *m*. (See Section 33.4.) Therefore, each pair of keys *x* and *y* collides for exactly *m*^{r} values of *a*, since they collide exactly once for each possible value of ⟨*a*_{1}, *a*_{2}, . . . , *a*_{r}⟩.

Suppose we wish to search a linked list of length *n*, where each element contains a key *k* along with a hash value *h*(*k*). Each key is a long character string. How might we take advantage of the hash values when searching the list for an element with a given key?
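One plausible answer, sketched in Python (function and variable names are assumptions): since comparing two integers is far cheaper than comparing two long strings, store *h*(*k*) with each key and compare hash values first, paying for the full string comparison only when the hashes agree:

```python
def search_with_hints(items, k, h):
    """items: list of (key, h(key)) pairs, as in a chained list that
    caches each key's hash value. Compare the cheap integer hashes
    first; only a hash match triggers the expensive string comparison.
    """
    hk = h(k)
    for key, hkey in items:
        if hkey == hk and key == k:   # int test filters most mismatches
            return key
    return None
```

If the hash values rarely collide, almost every non-matching element is rejected with a single integer comparison instead of a character-by-character scan.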

In *open addressing*, all elements are stored in the hash table itself. That is, each table entry contains either an element of the dynamic set or NIL. When searching for an element, we systematically examine table slots until the desired element is found or it is clear that the element is not in the table. There are no lists and no elements stored outside the table, as there are in chaining. Thus, in open addressing, the hash table can "fill up" so that no further insertions can be made; the load factor can never exceed 1.

To perform insertion using open addressing, we successively examine, or *probe*, the hash table until we find an empty slot in which to put the key. Instead of being fixed in the order 0, 1, . . . , *m* - 1 (which requires Θ(*n*) search time), the sequence of positions probed depends upon the key being inserted. To determine which slots to probe, we extend the hash function to include the probe number (starting from 0) as a second input. Thus, the hash function becomes

*h*: *U* × {0, 1, . . . , *m* - 1} → {0, 1, . . . , *m* - 1} .

With open addressing, we require that for every key *k*, the **probe sequence**

⟨*h*(*k*, 0), *h*(*k*, 1), . . . , *h*(*k*, *m* - 1)⟩

be a permutation of ⟨0, 1, . . . , *m* - 1⟩, so that every hash-table position is eventually considered as a slot for a new key as the table fills up.

HASH-INSERT(T, k)
1  i ← 0
2  repeat j ← h(k, i)
3         if T[j] = NIL
4            then T[j] ← k
5                 return j
6            else i ← i + 1
7  until i = m
8  error "hash table overflow"

HASH-SEARCH(T, k)
1  i ← 0
2  repeat j ← h(k, i)
3         if T[j] = k
4            then return j
5         i ← i + 1
6  until T[j] = NIL or i = m
7  return NIL
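A direct Python transcription of HASH-INSERT and HASH-SEARCH; the pseudocode leaves *h*(*k*, *i*) abstract, so linear probing is used here as a concrete probe function purely for demonstration:

```python
def probe(k, i, m):
    """Linear probing, for illustration: h(k, i) = (h'(k) + i) mod m,
    with the auxiliary function h'(k) = k mod m."""
    return (k % m + i) % m

def hash_insert(T, k):
    """HASH-INSERT: probe until an empty (None) slot is found."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is None:          # NIL slot: store the key here
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k):
    """HASH-SEARCH: probe the same sequence; stop at the key or at NIL."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] == k:
            return j
        if T[j] is None:          # an empty slot proves k is absent
            return None
    return None
```

Because search probes the slots in the same order as insertion did, hitting an empty slot proves the key was never inserted.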

Deletion from an open-address hash table is difficult. When we delete a key from slot *i*, we cannot simply mark that slot as empty by storing NIL in it. Doing so might make it impossible to retrieve any key *k* during whose insertion we had probed slot *i* and found it occupied. One solution is to mark the slot by storing in it the special value DELETED instead of NIL. We would then modify the procedure HASH-SEARCH so that it keeps on looking when it sees the value DELETED, while HASH-INSERT would treat such a slot as if it were empty so that a new key can be inserted. When we do this, though, the search times are no longer dependent on the load factor α, and for this reason chaining is more commonly selected as a collision resolution technique when keys must be deleted.

In our analysis, we make the assumption of *uniform hashing*: we assume that each key considered is equally likely to have any of the *m*! permutations of ⟨0, 1, . . . , *m* - 1⟩ as its probe sequence.

Given an ordinary hash function *h*': *U* → {0, 1, . . . , *m* - 1}, the method of *linear probing* uses the hash function

*h*(*k*, *i*) = (*h*'(*k*) + *i*) mod *m*

Linear probing is easy to implement, but it suffers from a problem known as *primary clustering*. Long runs of occupied slots build up, increasing the average search time. For example, if we have *n* = *m*/2 keys in the table, where every even-indexed slot is occupied and every odd-indexed slot is empty, then the average unsuccessful search takes 1.5 probes. If the first *n* = *m*/2 locations are the ones occupied, however, the average number of probes increases to about *n*/4 = *m*/8.

*Quadratic probing* uses a hash function of the form

*h*(*k*, *i*) = (*h*'(*k*) + *c*_{1}*i* + *c*_{2}*i*^{2}) mod *m* ,

where *h*' is an auxiliary hash function, *c*_{1} and *c*_{2} ≠ 0 are auxiliary constants, and *i* = 0, 1, . . . , *m* - 1.

Double hashing is one of the best methods available for open addressing because the permutations produced have many of the characteristics of randomly chosen permutations. *Double hashing* uses a hash function of the form

*h*(*k*, *i*) = (*h*_{1}(*k*) + *ih*_{2}(*k*)) mod *m* ,

*h*_{1}(*k*) = *k* mod *m* ,

*h*_{2}(*k*) = 1 + (*k* mod *m*') ,
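A Python sketch of the double-hashing probe sequence, using the illustrative values *m* = 13 (prime) and *m*' = 11 (slightly less than *m*); these specific constants are assumptions for the demo:

```python
def double_hash_probe(k, i, m=13, m2=11):
    """Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m,
    with h1(k) = k mod m and h2(k) = 1 + (k mod m2).
    h2(k) is always nonzero, and with m prime it is relatively
    prime to m, so the probe sequence visits every slot.
    """
    h1 = k % m
    h2 = 1 + (k % m2)
    return (h1 + i * h2) % m
```

For the key *k* = 14 this gives *h*_{1} = 1 and *h*_{2} = 4, so the probe sequence starts 1, 5, 9, . . . and covers all 13 slots.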

Our analysis of open addressing, like our analysis of chaining, is expressed in terms of the load factor of the hash table, as *n* and *m* go to infinity. Recall that if *n* elements are stored in a table with *m* slots, the average number of elements per slot is α = *n*/*m*. Of course, with open addressing, we have at most one element per slot, and thus *n* ≤ *m*, which implies α ≤ 1.

*p*_{i} = Pr {exactly *i* probes access occupied slots} ,

To evaluate equation (12.6), we define

*q*_{i} = Pr {at least *i* probes access occupied slots}

for *i* = 0, 1, 2, . . . . We can then use identity (6.28):

Theorem 12.5 gives us the performance of the HASH-INSERT procedure almost immediately.

Computing the expected number of probes for a successful search requires a little more work.

for a bound on the expected number of probes in a successful search.

Consider inserting the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table of length *m* = 11 using open addressing with the primary hash function *h*'(*k*) = *k* mod *m*. Illustrate the result of inserting these keys using linear probing, using quadratic probing with *c*_{1} = 1 and *c*_{2} = 3, and using double hashing with *h*_{2}(*k*) = 1 + (*k* mod (*m* - 1)).

Write pseudocode for HASH-DELETE as outlined in the text, and modify HASH-INSERT and HASH-SEARCH to incorporate the special value DELETED.

Suppose that we use double hashing to resolve collisions; that is, we use the hash function *h*(*k*, *i*) = (*h*_{1}(*k*) + *ih*_{2}(*k*)) mod *m*. Show that the probe sequence *h*(*k*, 0), *h*(*k*, 1), . . . , *h*(*k*, *m* - 1) is a permutation of the slot sequence 0, 1, . . . , *m* - 1 if and only if *h*_{2}(*k*) is relatively prime to *m*. (*Hint*: See Chapter 33.)

The bound on the harmonic series can be improved to

12-1 Longest-probe bound for hashing

A hash table of size *m* is used to store *n* items, with *n* ≤ *m*/2. Open addressing is used for collision resolution.

*c.* Show that Pr {*X* > 2 lg *n*} = *O*(1/*n*), where the random variable *X* denotes the length of the longest probe sequence.

*d.* Show that the expected length of the longest probe sequence is *O*(lg *n*).

You are asked to implement a dynamic set of *n* elements in which the keys are numbers. The set is static (no INSERT or DELETE operations), and the only operation required is SEARCH. You are given an arbitrary amount of time to preprocess the *n* elements so that SEARCH operations run quickly.

12-3 Slot-size bound for chaining

Suppose that we have a hash table with *n* slots, with collisions resolved by chaining, and suppose that *n* keys are inserted into the table. Each key is equally likely to be hashed to each slot. Let *M* be the maximum number of keys in any slot after all the keys have been inserted. Your mission is to prove an *O*(lg *n*/lg lg *n*) upper bound on E[*M*], the expected value of *M*.

*a.* Argue that the probability *Q*_{k} that exactly *k* keys hash to a particular slot is given by *Q*_{k} = (1/*n*)^{k} (1 - 1/*n*)^{n-k} (*n* choose *k*).

*c.* Use Stirling's approximation, equation (2.11), to show that

Conclude that E[*M*] = *O*(lg *n*/lg lg *n*).

Suppose that we are given a key *k* to search for in a hash table with positions 0, 1, . . . , *m* - 1, and suppose that we have a hash function *h* mapping the key space into the set {0, 1, . . . , *m* - 1}. The search scheme is as follows.

1. Compute the value *i* ← *h*(*k*), and set *j* ← 0.

2. Probe position *i* for the key *k*. If you find it, or if this position is empty, terminate the search.

3. Set *j* ← (*j* + 1) mod *m* and *i* ← (*i* + *j*) mod *m*, and return to step 2.

Assume that *m* is a power of 2.

* b.* Prove that this algorithm examines every table position in the worst case.

Let *H* be a class of hash functions in which each *h* maps the universe *U* of keys to {0, 1, . . . , *m* - 1}. We say that *H* is *k-universal* if, for every fixed sequence of *k* distinct keys ⟨*x*_{1}, *x*_{2}, . . . , *x*_{k}⟩ and for any *h* chosen at random from *H*, the sequence ⟨*h*(*x*_{1}), *h*(*x*_{2}), . . . , *h*(*x*_{k})⟩ is equally likely to be any of the *m*^{k} sequences of length *k* with elements drawn from {0, 1, . . . , *m* - 1}.

*a.* Show that if *H* is 2-universal, then it is universal.

*b.* Show that the class *H* defined in Section 12.3.3 is not 2-universal.

*h*_{a,b}(*x*) = *ax* + *b* ,