Hashing
Overview
hashing is search technique
also known as key-to-address translation
can easily achieve O(1) searching in average case
compare to linear search O(n) and binary search O(log2n)
if not careful, hashing yields O(n) worst case!
Hashing quality measures
number of collisions (multiple keys hash to same address)
storage space required
High quality means:
- few or no collisions
- little wasted storage space
- Note:
these are mutually contradictory!
reduce collisions by providing wider range of address values
wider range of address values requires more storage (e.g. array)
Good design of hash function yields uniformity over address space
Some hash table size selection methods
1. size is prime number
2. Size is 1.5 times the expected maximum number of required addresses.
3. Experiments and experience show that these work well
Collisions occur in two situations
1. Insert new key, and hash yields address already occupied.
2. Retrieve based on key, and key value at hashed address does not match
These are equivalent situations, so we will concentrate on insertion
What to do when collisions occur upon insertion?
There are several plausible alternatives.
1. Open-address Hashing (a.k.a. open hashing, linear probe)
- upon collision, proceed down table sequentially starting with the hashed address until open slot is found.
- Example: hash yields 47. If entry 47 is occupied, check 48. If occupied, proceed to 49, etc. Drop into first available slot.
- Prone to clustering (clusters of filled slots followed by some empty ones)
- Average number of probes for successful search depends on LOAD FACTOR (proportion of table which is filled, must be >=0 and < 1). Approximate formula is (1+(1/(1-LoadFactor)))/2
2. Quadratic processing
- upon collision, try slot (orig hash) +(# collisions)2 mod table size.
- Example
: hash yields 47. Slot 47 occupied, so try 47 + 12 = 48. Slot 48 occupied, so try 47 + 22 = 51. Slot 51 occupied, so try 47 + 32 = 56.etc.
- Reduces clustering.
- Problem: if two keys hash to same address, each subsequent attempt by both to resolve collision will also yield same address!
- Remind me to tell you about Ethernet and Binary Exponential Backoff
.
3. Double Hashing.
- upon collision, rehash the key using a different hash function.
- Try (orig hash) + (second hash) * (# collisions) mod table size until empty slot found
- Reduces clustering
4. Chained Hashing (a.k.a. separate chaining)
- each slot has associated linked list
- if collision occurs, insert as new linked list entry
- Worst case yields one long linked list: O(n) for subsequent retrieve
- Average case yields a bunch of short ones
5. Bucket hashing
- Table subdivided into several multirecord buckets.
- all buckets same size
- hash yields bucket number, then linear probe from there.
- Note: sometimes optimized by matching buckets to disk tracks. (bucket = one track)
Two notes about hashing in Java
1. All java classes derived from Object have a method called
hashCode which will translate the object value into an integer suitable for sending into a hash function.
2. Java provides the
Hashtable class so you can use hashing without having to roll your own.
Some hash function design methods
1. Base on sound random number generation techniques
- uniform distribution
- random order
- reproducible (specific key always produces same address)
2. folding
- translate non-integer key into integer by operating on bit representation
- example:
- ASCII values of JOE are J=74, O=79, E=69.
- key = 74 + 79 + 69 = 222
3. digit/character extraction
- before applying hash, remove patterns known to occur frequently
- reduces probability that more than one key will hash to same address
- example:
- remove first 3 digits of SS number, then hash on the rest
4. modulo
- if addresses in range 0 to limit-1 are desired,
apply address = key % limit
5. combinations of 1-4.
[ Lectures | CSC 132 | Peter Sanderson | Computer Science | SMSU ]
Last reviewed: 29 November 2000
Peter Sanderson (
PeteSanderson@smsu.edu )