C SC 205 Lecture 17: Hash Maps and Hashing
major resources: Data Structures and the Java Collections Framework Second Edition,
William Collins, McGraw-Hill, 2005
Introduction to Algorithms Second Edition,
Cormen, Leiserson, Rivest, and Stein, McGraw-Hill, 2003
Monday November 12 (week 10)
The java.util.HashMap<K,V> class
- If you need a refresher on maps, see TreeMap notes.
- Like TreeMap, it extends AbstractMap
- Like TreeMap, it implements Map
- Unlike TreeMap, it does not implement SortedMap
- This means there is no guaranteed ordering from iterators obtained from collection views
- The important Map methods, get(), put(), remove(), containsKey(), all have linear worst case running time -- O(n)!
- Why would you use it? No sorted iterations and linear running time!
- The worst case running time is deceptive, just like quicksort.
- Recall quicksort is worst case O(n²), but average O(n log₂ n) and whips other O(n log₂ n) algorithms in timed runs!
- HashMap methods are similarly deceptive; average case running time is O(1), constant!
- And, like quicksort, the worst case rarely happens
- Its methods are so fast because they use the technique of hashing
- Sneak peek at implementation: The HashMap data structure is an array!
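The basic usage pattern is worth seeing before the implementation details. The sketch below (names and values are my own, hypothetical) exercises the four Map methods named above; note that no call to a collection view would yield a predictable ordering.

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<String, Integer>();
        ages.put("Alice", 30);                       // average O(1)
        ages.put("Bob", 25);
        ages.put("Alice", 31);                       // same key: replaces old value
        System.out.println(ages.get("Alice"));       // 31
        System.out.println(ages.containsKey("Bob")); // true
        ages.remove("Bob");
        System.out.println(ages.size());             // 1
        // Iteration order over ages.keySet() is NOT guaranteed
    }
}
```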
Related Classes HashSet<E> and Hashtable<K,V>
- HashSet<E> implements Set<E> and the set concept,
as you would expect, and is implemented using a HashMap instance variable.
- Hashtable<K,V> has been around since Java 1.0 and Dictionary is its
parent class. It is not, however, labeled obsolete, and has been "retrofitted" to implement Map.
- Hashtable is to HashMap as Vector is to ArrayList -- older but updated, and thread-safe (unlike its newer brethren).
Hashing
- The basics of hashing:
- Define an array to contain collection
- Define a small constant time function that transforms an element or key value into a valid
index for that array -- this is the hash
- Use that index to store or retrieve the element directly.
- Upside: Average case O(1) store and retrieve
- Downside: Worst case O(n) if the array needs to be expanded or if collisions occur
- Collision occurs when two or more key or element values hash to the same index.
- If O(n) of the n keys hash to the same index, that is pretty poor hash function design!
- We'll cover collisions in more detail below.
- Ideally, every key will hash to a different index and keys will be evenly
distributed throughout the table -- uniform hashing
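The basics above can be sketched in a few lines. This toy table (a hypothetical class, not from the JDK) stores integer keys directly at index key mod length; it ignores collisions entirely, so it only illustrates the hash-then-index idea.

```java
public class SimpleHash {
    // Toy table: store an Integer key at index (key mod table.length).
    // Collisions are ignored -- this shows only the basic hashing idea.
    private Integer[] table = new Integer[11];

    public void put(int key) {
        table[Math.floorMod(key, table.length)] = key;   // O(1) hash, then store
    }

    public boolean contains(int key) {
        Integer stored = table[Math.floorMod(key, table.length)];
        return stored != null && stored == key;          // O(1) hash, then probe
    }

    public static void main(String[] args) {
        SimpleHash h = new SimpleHash();
        h.put(42);
        System.out.println(h.contains(42)); // true
        System.out.println(h.contains(7));  // false
    }
}
```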
Designing the Hash algorithm
- I'll use the term key to refer to the value being hashed. HashSet superficially
has only an element, but a set requires its elements to have the same uniqueness property as keys.
- Simplest hash: index = key mod array.length
- Key in this formula must be integer, but an integer equivalent can be devised for any key type.
- All Java classes inherit int hashCode() from the Object class to return such an integer.
- Like equals(), the default version of hashCode() is not terribly sophisticated: "This is
typically implemented by converting the internal address of the object into an integer." (Sun's Java API)
- Many classes override the inherited hashCode(), e.g. Integer simply returns
the int value itself, and String applies Horner's rule: repeatedly multiply the running sum by 31, then add the next character's Unicode value.
- If designing a class of objects to be used as hash table keys, it is recommended that you override the inherited hashCode() method, because
the hashing classes will use it in their hashes. Here is how it is used...
- java.util.HashMap internally uses this hashing function:
static int hash(Object x) {
    int h = x.hashCode();
    h += ~(h << 9);
    h ^= (h >>> 14);
    h += ~(h << 4);
    h ^= (h >>> 10);
    return h;
}
- (~ is bitwise complement, ^ is bitwise XOR, << is left shift zero fill, >>> is right shift zero fill)
- From the method comments: "In addition to the object's own hashCode, this method applies a "supplemental hash function,"
which defends against poor quality hash functions. The shift distances in this function were chosen as the
result of an automated search over the entire four-dimensional search space."
- After calling the above to refine the hash code, the HashMap methods (or delegates)
perform the hash itself by calling this method:
// h is the hash code, length is the hashtable length
static int indexFor(int h, int length) {
return h & (length-1); // bitwise AND
}
- Bitwise AND is much faster than MOD (%), but only works if hashtable length is a power of 2.
- If length is a power of 2, then length-1 in binary is all 0's followed by
log2(length) 1's. This serves as a mask to return a value with only the low order
log2(length) bits of the hash value -- effectively MOD.
- Example: length = 256, length-1 = 255 = 00000000 00000000 00000000 11111111 in binary.
When bitwise ANDing this with any value, the high order 24 bits all go to 0 and the 8 low order bits pass through as is. Same result as applying % 256, only faster.
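The two methods above can be packaged together and the masking claim checked directly. This sketch reproduces hash() and indexFor() as shown in the notes and confirms that, for a power-of-two length, the bitwise AND gives the same index as mod.

```java
public class HashDemo {
    // Supplemental hash function, as shown in the notes (JDK 1.4-era HashMap).
    static int hash(Object x) {
        int h = x.hashCode();
        h += ~(h << 9);
        h ^= (h >>> 14);
        h += ~(h << 4);
        h ^= (h >>> 10);
        return h;
    }

    // Map a hash code to a table index; length must be a power of 2.
    static int indexFor(int h, int length) {
        return h & (length - 1);     // bitwise AND as a fast substitute for mod
    }

    public static void main(String[] args) {
        int h = hash("example");
        // For power-of-two lengths, masking equals mod (even for negative h)
        System.out.println(indexFor(h, 256) == Math.floorMod(h, 256)); // true
    }
}
```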
Collisions
- If every possible key value hashed to a different index, collisions would not occur.
- Collision probability can be reduced by increasing the table size so it is much greater than
the number of keys to store, then design the hash to assure it can generate any table index.
This wastes space!
- Collision probability can be reduced through intelligent hash code design with the goal of
uniform hashing, but this may not be feasible for a given set of possible keys.
- So collisions will occur! How can we deal with them? One of these approaches is normally used:
- Chaining : Each hashtable entry points to the head of a (singly) linked list.
- Upon insertion, new entry is added as new linked list head. O(1)
- When searching, must be linear through linked list. O(n) worst case
- Upon removal, entry is removed from linked list. O(n) worst case to find it, O(1) to remove.
- Average time for all three is O(1) since list will be very short with good hashing.
- Open-address : If hashtable entry hashIndex is occupied by different key, try a different entry. Here are some techniques.
- Linear Probe : Try the next entry.
- ( hashIndex + 1 ) MOD table.length
- In place of offset 1, you can use any positive value relatively prime to table.length.
- If table.length is a power of 2, the offset from hashIndex can be any odd value.
- If table.length is prime, the offset from hashIndex can be any positive value less than table.length.
- The offset can vary from one attempt to the next, e.g. ( hashIndex + numberOfCollisions )
- Linear Probing can lead to clustering of entries rather than uniform distribution.
- Also note that initial hash index "fixes" the sequence, so all keys that hash to this index
will follow the same sequence of re-tries.
- Quadratic Probe : Offset for next try is based on a squared value
- ( hashIndex + numberOfCollisions² ) MOD table.length
- Better results than linear, but is slower and can lead to secondary clustering
- Again, the initial hash index "fixes" the sequence.
- Double hashing : Offset for next try is based on second hashing function
- ( hashIndex + indexFromSecondaryHash ) MOD table.length
- Secondary hash should be independent of primary, so two keys hashing to same index
will have different secondary hash
- The offset must be relatively prime to table.length
- Almost as good as randomly-generated offset
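Chaining, the first approach above, can be sketched in a small class (hypothetical, not the JDK's implementation): each slot holds the head of a singly linked chain, insertion prepends in O(1), and search walks the chain linearly.

```java
public class ChainedTable<K, V> {
    // One node per key/value pair; each table slot holds a singly linked chain.
    private static class Node<K, V> {
        K key; V value; Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private Node<K, V>[] table = (Node<K, V>[]) new Node[16]; // power-of-2 length

    private int indexFor(K key) {
        return key.hashCode() & (table.length - 1);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; } // replace value
        }
        table[i] = new Node<K, V>(key, value, table[i]);        // O(1): new head
    }

    public V get(K key) {
        for (Node<K, V> n = table[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;  // linear scan of the chain
        }
        return null;                                // key not present
    }
}
```

With good hashing the chains stay very short, so both methods run in O(1) on average even though get() is O(n) in the worst case.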
Performance as Table Fills
- What happens to performance (number of probes) as a hash table becomes "full"? Depends.
- Chained hash tables technically can never become full, but chains can get long. Performance degrades linearly.
- Open Addressed hash tables can definitely fill. As the table nears capacity (particularly above 80%), performance degrades dramatically -- the expected number of probes grows roughly as 1/(1 - load factor).
- What is involved in re-sizing the hash table?
- Allocate a larger table
- Traverse the original table and, for each entry, rehash and insert into new table
- Requires linear, O(n), time on average, but if the new table is not much larger
and the hash was poor, it could go O(n²).
- Resizing is triggered when load factor exceeds a specified threshold.
- Threshold for chained table may be defined differently than for open addressed table, since resizing
is not as critical.
- If threshold set too low, access will be faster but too much time is spent re-sizing.
- If threshold set too high, access can become too slow.
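The resize steps above can be sketched with an open-addressed table of integer keys (a hypothetical class using linear probing): when the load factor would exceed the threshold, allocate a larger table, then rehash and reinsert every entry.

```java
public class ResizableTable {
    // Open-addressed table of Integer keys with linear probing; doubles in
    // size when the load factor would exceed THRESHOLD. Illustrative sketch
    // (assumes distinct keys; no removal).
    private static final double THRESHOLD = 0.75;
    private Integer[] table = new Integer[8];
    private int size = 0;

    public void add(int key) {
        if ((size + 1.0) / table.length > THRESHOLD) resize();
        insert(table, key);
        size++;
    }

    public boolean contains(int key) {
        int i = Math.floorMod(key, table.length);
        while (table[i] != null) {           // probe until an empty slot
            if (table[i] == key) return true;
            i = (i + 1) % table.length;      // linear probe: try next slot
        }
        return false;
    }

    private void resize() {
        Integer[] bigger = new Integer[table.length * 2]; // allocate larger table
        for (Integer key : table) {
            if (key != null) insert(bigger, key);         // rehash every entry
        }
        table = bigger;
    }

    private static void insert(Integer[] t, int key) {
        int i = Math.floorMod(key, t.length);
        while (t[i] != null) i = (i + 1) % t.length;      // probe for empty slot
        t[i] = key;
    }
}
```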
HashMap Implementation
- Uses an array for the hash table.
- Uses chaining with singly-linked list.
- Doubles array size when number of entries exceeds 75% of array length. Due to chaining, less than 75% of array elements are occupied.
- Does not offer an iterator per se (this is a Map)
- You can obtain a collection view, then get an iterator from it.
- Iteration sequence? Traverses the array backward and for each non-null member will traverse
linked list head to tail.
- Why backward? Loop continuation condition will be based on 0 rather than table.length, so is a bit faster
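Since a Map has no iterator() of its own, iteration goes through a collection view, as sketched below (map contents are my own hypothetical example).

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class ViewIteration {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("a", 1);
        map.put("b", 2);

        // No map.iterator() exists; get an iterator from a collection view.
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> e = it.next();
            // Order reflects the internal array traversal, not insertion order
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```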
Using Hash Table versus Binary Search Tree
- Choice depends on how you will use the information
- If you need ordered access or access to successor/predecessor, use a BST
- If you don't need ordered access, a hash table will probably be better
- If you do a lot of insertions relative to later access, chained hash table is faster -- guaranteed O(1) insert
- If you need consistent access times, a BST may be better. Hash table access times may vary widely
depending on chain lengths or number of probes, and re-sizing could occur at any time.
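The ordered-access difference is easy to demonstrate: the same keys inserted into a TreeMap (a red-black BST) and a HashMap iterate very differently. The keys below are a hypothetical example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class OrderedVsUnordered {
    public static void main(String[] args) {
        Map<String, Integer> tree = new TreeMap<String, Integer>();
        Map<String, Integer> hash = new HashMap<String, Integer>();
        for (String k : new String[] { "pear", "apple", "fig" }) {
            tree.put(k, k.length());
            hash.put(k, k.length());
        }
        System.out.println(tree.keySet()); // [apple, fig, pear] -- sorted order
        System.out.println(hash.keySet()); // some unspecified order
    }
}
```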
Peter Sanderson (PSanderson@otterbein.edu)