- Hash Tables
A hash table is an effective data structure for implementing dictionaries (insert, search, delete).
Under reasonable assumptions, the average time to search for an element in a hash table is $O(1)$.
Although searching for an element in a hash table can take as long as searching for an element in a linked list (i.e.,
$O(n)$ time in the worst case), in practice, hashing performs extremely well. Common applications include:
- Lookup table.
- Caching and memoization.
- Preventing duplicate entries.
A hash table maps keys to values. It uses a hash function to figure out where to store elements.
When the set $K$ of keys stored in a dictionary is much smaller than the universe $U$ of all possible keys, a hash table requires much less storage than a direct-address table. Specifically, we can reduce the storage requirement to $\Theta(|K|)$ while maintaining the benefit that searching still takes only $O(1)$ time on average.
With direct addressing, an element with key $k$ is stored in slot $k$. With hashing, this element is stored in slot $h(k)$; that is, we use a hash function $h$ to compute the slot from the key $k$.
We say that an element with key $k$ hashes to slot $h(k)$, and that $h(k)$ is the hash value of key $k$.
There is one hitch: two keys may hash to the same slot, which we call a collision.
When the number of keys actually stored is small relative to the total number of possible keys, hash tables become an effective alternative to directly addressing an array, since a hash table typically uses an array of size proportional to the number of keys actually stored.
Instead of using the key as an array index directly, the array index is computed from the key.
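The idea above can be sketched in Python. The table size `m`, the multiplier `31`, and the character-folding scheme below are illustrative assumptions, not a prescribed hash function:

```python
m = 13  # number of slots (illustrative choice)

def h(key: str) -> int:
    """Compute an array index from a string key: fold the characters
    into an integer, then reduce modulo the table size m."""
    total = 0
    for ch in key:
        total = total * 31 + ord(ch)  # 31 is a common, arbitrary multiplier
    return total % m

index = h("pt")  # the slot where an element with key "pt" would be stored
```

The same key always maps to the same slot, and every result lands in the valid index range `0..m-1`.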
Direct addressing is a simple technique that works well when the universe $U$ of keys is reasonably small.
To represent the dynamic set, we use an array, or direct-address table, denoted by $T[0..m-1]$, in which each position, or slot, corresponds to a key in the universe $U$.
Each operation (insert, search, delete) takes only $O(1)$ time.
The downside is that if the universe $U$ is large, storing a table $T$ of size $|U|$ may be impractical, or even impossible, given the memory available on a typical computer.
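A direct-address table can be sketched as a plain array, assuming the keys are drawn from a small universe $\{0, 1, \ldots, m-1\}$; the class and method names below are illustrative:

```python
class DirectAddressTable:
    """Direct addressing: the element with key k lives in slot k."""

    def __init__(self, universe_size: int):
        self.table = [None] * universe_size  # one slot per possible key

    def insert(self, key: int, value) -> None:
        self.table[key] = value   # O(1): store directly in slot `key`

    def search(self, key: int):
        return self.table[key]    # O(1): a single array access

    def delete(self, key: int) -> None:
        self.table[key] = None

t = DirectAddressTable(16)
t.insert(5, "five")
assert t.search(5) == "five"
```

Every operation is a single array index, which is exactly why the table must have one slot for each of the $|U|$ possible keys.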
A good hash function satisfies (approximately) the assumption of simple uniform hashing: each key is equally likely to hash to any of the $m$ slots, independently of where any other key has hashed to.
In practice, we can often employ heuristic techniques to create a hash function that performs well.
Qualitative information about the distribution of keys may be useful in the design process. For example, consider a compiler's symbol table, in which the keys are character strings representing identifiers in a program. Closely related symbols, such as pt and pts, often occur in the same program. A good hash function would minimize the chance that such variants hash to the same slot.
Some applications of hash functions might require stronger properties than are provided by simple uniform hashing. For example, we might want keys that are "close" in some sense to yield hash values that are far apart.
With the division method for creating hash functions, we map a key $k$ into one of $m$ slots by taking the remainder of $k$ divided by $m$:
$ h(k) = k \bmod m $
A prime not too close to an exact power of 2 is often a good choice for $m$.
For example, if we want to allocate a hash table, with collisions resolved by chaining, to hold roughly 2000 character strings, and we wouldn't mind examining an average of 3 elements in an unsuccessful search, we can allocate a hash table of size
$m = 701$, because 701 is a prime near $2000/3$ but not near any power of 2, so we have $h(k) = k \bmod 701$.
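The division-method example above is a one-liner in Python; the stored values and load-factor arithmetic are just a sanity check on the numbers from the text:

```python
m = 701  # prime near 2000/3, not near a power of 2

def h(k: int) -> int:
    """Division method: the remainder of k divided by m."""
    return k % m

# With roughly 2000 keys spread over 701 slots, chaining gives an
# average of about 2000 / 701 ≈ 2.85 elements per slot, matching the
# "examine about 3 elements per unsuccessful search" budget.
load_factor = 2000 / m
```
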
The multiplication method for creating hash functions operates in two steps. First, we multiply the key $k$ by a constant $A$ in the range $0 < A < 1$ and extract the fractional part of $kA$. Then, we multiply this value by $m$ and take the floor of the result:
$ h(k) = \lfloor m\,(kA \bmod 1) \rfloor $
The value of the constant $A$ affects how well the method works; Knuth suggests that $A \approx (\sqrt{5} - 1)/2 = 0.6180339887\ldots$ is likely to work reasonably well. An advantage of this method is that the value of $m$ is not critical, so we typically choose it to be a power of 2.
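The two steps translate directly into floating-point Python; the table size below is an illustrative power of 2, which the multiplication method permits:

```python
import math

A = (math.sqrt(5) - 1) / 2   # Knuth's suggested constant, ≈ 0.618...
m = 2 ** 14                  # table size; a power of 2 is fine here

def h(k: int) -> int:
    frac = (k * A) % 1.0         # step 1: fractional part of k*A
    return math.floor(m * frac)  # step 2: floor of m times that fraction
```

Note that floating-point rounding makes this a sketch of the idea; production implementations usually use the fixed-point word-size variant instead.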
Any fixed hash function is vulnerable to an adversary choosing $n$ keys that all hash to the same slot, yielding $\Theta(n)$ retrieval time. The only effective way to improve the situation is to choose the hash function randomly, in a way that is independent of the keys that are actually going to be stored. This approach is called universal hashing.
At the beginning of execution, we select the hash function at random from a carefully designed class of functions.
We begin by choosing a prime number $p$ large enough that every possible key $k$ is in the range $0$ to $p - 1$. For any $a \in \{1, 2, \ldots, p-1\}$ and any $b \in \{0, 1, \ldots, p-1\}$, we define the hash function
$ h_{ab}(k) = ((ak + b) \bmod p) \bmod m $
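The family $h_{ab}$ can be sketched directly; the concrete values of $p$ and $m$ below are illustrative, and $a, b$ are drawn once at the start of execution, as the text describes:

```python
import random

p = 8191   # a prime (2^13 - 1); every key k must satisfy 0 <= k < p
m = 701    # number of slots

# Choose the hash function at random, once, at the start of execution.
a = random.randrange(1, p)   # a in {1, ..., p-1}
b = random.randrange(0, p)   # b in {0, ..., p-1}

def h(k: int) -> int:
    """A member h_ab of the universal family ((a*k + b) mod p) mod m."""
    return ((a * k + b) % p) % m
```

Re-running the program picks a different member of the family, so no fixed set of keys can force bad behavior on every run.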
In chaining, we place all the elements that hash to the same slot into the same linked list.
Slot $j$ contains a pointer to the head of the list of all stored elements that hash to $j$; if there are no such elements, slot $j$ contains NIL.
    CHAINED-HASH-INSERT(T, x)    // O(1) worst-case time
        insert x at the head of list T[h(x.key)]

    CHAINED-HASH-SEARCH(T, k)    // Θ(n) worst-case time, plus the time to hash
        search for an element with key k in list T[h(k)]

    CHAINED-HASH-DELETE(T, x)    // O(1) if the lists are doubly linked
        delete x from the list T[h(x.key)]
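The three chained operations above can be sketched in runnable Python, with plain lists standing in for the linked lists (and Python's built-in `hash` as an assumed hash function):

```python
class ChainedHashTable:
    """Collision resolution by chaining: one list ("chain") per slot."""

    def __init__(self, m: int = 8):
        self.m = m
        self.slots = [[] for _ in range(m)]   # each slot holds a chain

    def _h(self, key) -> int:
        return hash(key) % self.m             # illustrative hash function

    def insert(self, key, value) -> None:
        # Insert at the head of the chain, as in CHAINED-HASH-INSERT: O(1).
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):
        # Walk the chain for this slot: Θ(chain length) in the worst case.
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key) -> None:
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return

t = ChainedHashTable()
t.insert("pt", 1)
t.insert("pts", 2)
assert t.search("pt") == 1
```

A Python list's `del` is $O(\text{chain length})$, which is why the pseudocode prefers doubly linked lists for true $O(1)$ deletion given a pointer to the element.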
In open addressing, all elements occupy the hash table itself. That is, each table entry contains either an element of the dynamic set or NIL. Thus, in open addressing, the hash table can fill up so that no further insertions can be made. The advantage is that it avoids pointers altogether.
When searching for an element, we systematically examine table slots until either we find the desired element or we have ascertained that the element is not in the table. Instead of following pointers, we compute the sequence of slots to be examined. The extra memory freed by not storing pointers provides the hash table with a larger number of slots for the same amount of memory, potentially yielding fewer collisions and faster retrieval.
To perform insertion using open addressing, we successively examine, or probe, the hash table until we find an empty slot in which to put the key. The sequence of positions probed depends upon the key being inserted. To determine which slots to probe, we extend the hash function to include the probe number as a second input.
We require that for every key $k$, the probe sequence $\langle h(k, 0), h(k, 1), \ldots, h(k, m-1) \rangle$ be a permutation of $\langle 0, 1, \ldots, m-1 \rangle$, so that every slot is eventually considered as a slot for the key.
    HASH-INSERT(T, k)
        i = 0
        repeat
            j = h(k, i)
            if T[j] == NIL
                T[j] = k
                return j
            else i = i + 1
        until i == m
        error "hash table overflow"
    HASH-SEARCH(T, k)
        i = 0
        repeat
            j = h(k, i)
            if T[j] == k
                return j
            i = i + 1
        until T[j] == NIL or i == m
        return NIL
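The two procedures above can be made runnable in Python. Linear probing, $h(k, i) = (k + i) \bmod m$, is used here purely as an illustrative probe sequence, and None stands in for NIL:

```python
m = 11
T = [None] * m   # the table itself holds the keys; None means empty

def h(k: int, i: int) -> int:
    return (k + i) % m   # linear probing (illustrative probe sequence)

def hash_insert(T, k):
    for i in range(m):           # probe h(k,0), h(k,1), ... until empty
        j = h(k, i)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k):
    for i in range(m):           # probe the same sequence as insertion
        j = h(k, i)
        if T[j] == k:
            return j
        if T[j] is None:         # an empty slot proves k is absent
            return None
    return None

hash_insert(T, 14)   # 14 mod 11 = 3, so slot 3
hash_insert(T, 25)   # 25 also hashes to 3; probing places it in slot 4
```

Because search follows exactly the probe sequence insertion used, it can stop at the first empty slot: the key could never have been placed beyond it.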
Deletion from an open-address hash table is difficult, because when we delete a key from slot $i$, we cannot simply mark that slot as empty by storing NIL in it. If we did, we might be unable to retrieve any key $k$ during whose insertion we had probed slot $i$ and found it occupied. We can solve this problem by marking the slot, storing in it the special value DELETED instead of NIL.
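The DELETED trick can be sketched as follows, again assuming a linear-probing scheme for illustration. A deleted slot stops neither search nor the probe sequence of keys inserted after it:

```python
DELETED = object()   # sentinel distinct from every key and from None (NIL)

m = 11
T = [None] * m

def h(k, i):
    return (k + i) % m   # linear probing (illustrative)

def hash_delete(T, k):
    for i in range(m):
        j = h(k, i)
        if T[j] == k:
            T[j] = DELETED   # mark, don't empty: later keys probed past here
            return
        if T[j] is None:
            return           # k was never in the table

def hash_search(T, k):
    for i in range(m):
        j = h(k, i)
        if T[j] == k:
            return j
        if T[j] is None:     # only a truly empty slot stops the search
            return None
    return None

T[3], T[4] = 14, 25    # both keys hash to slot 3; 25 was probed past 14
hash_delete(T, 14)
# Slot 3 is now DELETED, not None, so searching for 25 still succeeds.
```

Had we stored None in slot 3 instead, the search for 25 would stop at slot 3 and wrongly report the key absent.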
Linear probing uses a hash function of the form:
$ h(k, i) = (h'(k) + i) \bmod m $
where $h'$ is an auxiliary hash function.
Quadratic probing uses a hash function of the form:
$ h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m $
where $h'$ is an auxiliary hash function and $c_1$, $c_2$ are positive auxiliary constants.
Double hashing offers one of the best methods available for open addressing because the permutations produced have many of the characteristics of randomly chosen permutations. It uses a hash function of the form:
$ h(k, i) = (h_1(k) + i\,h_2(k)) \bmod m $
where $h_1$ and $h_2$ are auxiliary hash functions.
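A double-hashing probe sequence can be sketched like this. The auxiliary functions below are illustrative; the essential property is that $h_2(k)$ is relatively prime to $m$, which choosing $m$ prime and $1 \le h_2(k) < m$ guarantees:

```python
m = 13  # prime table size

def h1(k: int) -> int:
    return k % m

def h2(k: int) -> int:
    return 1 + (k % (m - 1))   # always in {1, ..., m-1}, so coprime to prime m

def h(k: int, i: int) -> int:
    return (h1(k) + i * h2(k)) % m

# Because gcd(h2(k), m) == 1, the probe sequence for any key visits
# every one of the m slots exactly once:
seq = [h(42, i) for i in range(m)]
```

Unlike linear probing, where all keys with the same initial slot follow the same path, here two keys share a probe sequence only if both $h_1$ and $h_2$ agree on them.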
Hashing can provide excellent worst-case performance when the set of keys is static (once stored in the table, they never change). Some applications naturally have static sets of keys, like words in a programming language, or set of file names on a CD-ROM.
We call a hashing technique perfect hashing if O(1) memory accesses are required to perform a search in the worst case.
To create a perfect hashing scheme, we use two levels of hashing, with universal hashing at each level.
The first level is essentially the same as for hashing with chaining: we hash the $n$ keys into $m$ slots using a hash function $h$ carefully selected from a family of universal hash functions.
Instead of making a linked list of the keys hashing to slot $j$, however, we use a small secondary hash table $S_j$ with an associated hash function $h_j$. By choosing the $h_j$ carefully, we can guarantee that there are no collisions at the secondary level.
In order to guarantee this, we need to let the size $m_j$ of hash table $S_j$ equal the square of the number $n_j$ of keys hashing to slot $j$.
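The two-level scheme can be sketched for a static key set, using the universal family $((ak + b) \bmod p) \bmod m$ from earlier at both levels. The retry loop relies on the fact that with $m_j = n_j^2$, a randomly chosen $h_j$ is collision-free with probability at least $1/2$:

```python
import random

p = 8191  # prime larger than every key (illustrative)

def make_hash(m):
    """Draw one member of the universal family ((a*k + b) mod p) mod m."""
    a, b = random.randrange(1, p), random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys):
    m = len(keys)
    h = make_hash(m)                         # first level: like chaining
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[h(k)].append(k)
    secondary = []
    for bucket in buckets:
        mj = len(bucket) ** 2                # m_j = n_j^2 (quadratic size)
        while True:                          # retry until collision-free
            hj = make_hash(mj) if mj else None
            table = [None] * mj
            ok = True
            for k in bucket:
                slot = hj(k)
                if table[slot] is not None:  # collision: redraw h_j
                    ok = False
                    break
                table[slot] = k
            if ok:
                break
        secondary.append((hj, table))
    return h, secondary

def search(h, secondary, k):
    """Exactly two hash evaluations: O(1) worst-case memory accesses."""
    hj, table = secondary[h(k)]
    if hj is None:                           # empty first-level slot
        return False
    return table[hj(k)] == k

keys = [10, 22, 37, 40, 52, 60, 70, 75]
h, sec = build_perfect(keys)
```

Every search touches exactly one first-level slot and one second-level slot, which is the worst-case $O(1)$ guarantee that defines perfect hashing.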