Hashing

Hashing

Overview

hashing is search technique

also known as key-to-address translation

can easily achieve O(1) searching in average case

compare to linear search O(n) and binary search O(log₂n)

if not careful, hashing yields O(n) worst case!

Hashing quality measures

number of collisions (multiple keys hash to same address)

storage space required

High quality means:

few or no collisions
little wasted storage space

Note: these are mutually contradictory!

reduce collisions by providing wider range of address values

wider range of address values requires more storage (e.g. array)

Good design of hash function yields uniformity over address space

Some hash table size selection methods

1. size is prime number

2. Size is 1.5 times the expected maximum number of required addresses.

3. Experiments and experience show that these work well

Collisions occur in two situations

1. Insert new key, and hash yields address already occupied.

2. Retrieve based on key, and key value at hashed address does not match

These are equivalent situations, so we will concentrate on insertion

What to do when collisions occur upon insertion?

There are several plausible alternatives.

1. Open-address Hashing (a.k.a. open hashing, linear probe)

upon collision, proceed down table sequentially starting with the hashed address until open slot is found.
Example: hash yields 47. If entry 47 is occupied, check 48. If occupied, proceed to 49, etc. Drop into first available slot.
Prone to clustering (clusters of filled slots followed by some empty ones)
Average number of probes for successful search depends on LOAD FACTOR (proportion of table which is filled, must be >=0 and < 1). Approximate formula is (1+(1/(1-LoadFactor)))/2

2. Quadratic processing

upon collision, try slot (orig hash) +(# collisions)² mod table size.

Example: hash yields 47. Slot 47 occupied, so try 47 + 1² = 48. Slot 48 occupied, so try 47 + 2² = 51. Slot 51 occupied, so try 47 + 3² = 56.etc.
Reduces clustering.
Problem: if two keys hash to same address, each subsequent attempt by both to resolve collision will also yield same address!

Remind me to tell you about Ethernet and Binary Exponential Backoff.

3. Double Hashing.

upon collision, rehash the key using a different hash function.
Try (orig hash) + (second hash) * (# collisions) mod table size until empty slot found
Reduces clustering

4. Chained Hashing (a.k.a. separate chaining)

each slot has associated linked list
if collision occurs, insert as new linked list entry
Worst case yields one long linked list: O(n) for subsequent retrieve
Average case yields a bunch of short ones

5. Bucket hashing

Table subdivided into several multirecord buckets.
all buckets same size
hash yields bucket number, then linear probe from there.
Note: sometimes optimized by matching buckets to disk tracks. (bucket = one track)

Two notes about hashing in Java

1. All java classes derived from Object have a method called hashCode which will translate the object value into an integer suitable for sending into a hash function.

2. Java provides the Hashtable class so you can use hashing without having to roll your own.

Some hash function design methods

1. Base on sound random number generation techniques

uniform distribution
random order
reproducible (specific key always produces same address)

2. folding

translate non-integer key into integer by operating on bit representation
example:

ASCII values of JOE are J=74, O=79, E=69.
key = 74 + 79 + 69 = 222

3. digit/character extraction

before applying hash, remove patterns known to occur frequently
reduces probability that more than one key will hash to same address
example:

remove first 3 digits of SS number, then hash on the rest

4. modulo

if addresses in range 0 to limit-1 are desired,
apply address = key % limit

5. combinations of 1-4.

[ Lectures | CSC 132 | Peter Sanderson | Computer Science | SMSU ]

Last reviewed: 29 November 2000

Peter Sanderson ( PeteSanderson@smsu.edu )