C SC 205 Lecture 17: Hash Maps and Hashing
major resources: Data Structures and the Java Collections Framework Second Edition,
William Collins, McGraw-Hill, 2005
Introduction to Algorithms Second Edition,
Cormen, Leiserson, Rivest, and Stein, McGraw-Hill, 2003
Monday November 12 (week 10)
The java.util.HashMap<K,V> class
- If you need a refresher on maps, see TreeMap notes.
- Like TreeMap, it extends AbstractMap
- Like TreeMap, it implements Map
- Unlike TreeMap, it does not implement SortedMap
- This means there is no guaranteed ordering from iterators obtained from collection views
- The important Map methods, get(), put(), remove(), containsKey(), all have linear worst case running time -- O(n)!
- Why would you use it? No sorted iterations and linear running time!
- The worst case running time is deceptive, just like quicksort.
- Recall quicksort is worst case O(n²), but average O(n log₂ n) and whips other O(n log₂ n) algorithms in timed runs!
- HashMap methods are similarly deceptive; average case running time is O(1), constant!
- And, like quicksort, the worst case rarely happens
- Its methods are so fast because they use the technique of hashing
- Sneak peek at implementation: The HashMap data structure is an array!
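The basic usage pattern is worth seeing before the implementation details. The sketch below (names and values are my own, hypothetical) exercises the four Map methods named above; note that no call to a collection view would yield a predictable ordering.

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<String, Integer>();
        ages.put("Alice", 30);                       // average O(1)
        ages.put("Bob", 25);
        ages.put("Alice", 31);                       // same key: replaces old value
        System.out.println(ages.get("Alice"));       // 31
        System.out.println(ages.containsKey("Bob")); // true
        ages.remove("Bob");
        System.out.println(ages.size());             // 1
        // Iteration order over ages.keySet() is NOT guaranteed
    }
}
```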
Related Classes HashSet<E> and Hashtable<K,V>
- HashSet<E> implements Set<E> and the set concept,
as you would expect, and is implemented using a HashMap instance variable.
- Hashtable<K,V> has been around since Java 1.0 and Dictionary is its
parent class. It is not, however, labeled obsolete, and has been "retrofitted" to implement Map.
- Hashtable is to HashMap as Vector is to ArrayList -- older but updated, and thread-safe (unlike its newer brethren).
Hashing
- The basics of hashing:
- Define an array to contain collection
- Define a small constant time function that transforms an element or key value into a valid
index for that array -- this is the hash
- Use that index to store or retrieve the element directly.
- Upside: Average case O(1) store and retrieve
- Downside: Worst case O(n) if the array needs to be expanded or if collisions occur
- Collision occurs when two or more key or element values hash to the same index.
- If O(n) of the n keys hash to the same index, that is pretty poor hash function design!
- We'll cover collisions in more detail below.
- Ideally, every key will hash to a different index and keys will be evenly
distributed throughout the table -- uniform hashing
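The basics above can be sketched in a few lines. This toy table (a hypothetical class, not from the JDK) stores integer keys directly at index key mod length; it ignores collisions entirely, so it only illustrates the hash-then-index idea.

```java
public class SimpleHash {
    // Toy table: store an Integer key at index (key mod table.length).
    // Collisions are ignored -- this shows only the basic hashing idea.
    private Integer[] table = new Integer[11];

    public void put(int key) {
        table[Math.floorMod(key, table.length)] = key;   // O(1) hash, then store
    }

    public boolean contains(int key) {
        Integer stored = table[Math.floorMod(key, table.length)];
        return stored != null && stored == key;          // O(1) hash, then probe
    }

    public static void main(String[] args) {
        SimpleHash h = new SimpleHash();
        h.put(42);
        System.out.println(h.contains(42)); // true
        System.out.println(h.contains(7));  // false
    }
}
```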
Designing the Hash algorithm
- I'll use the term key to refer to the value being hashed. HashSet superficially
has only an element, but a set requires its elements to have the same uniqueness property as keys.
- Simplest hash: index = key mod array.length
- Key in this formula must be integer, but an integer equivalent can be devised for any key type.
- All Java classes inherit int hashCode() from the Object class to return such an integer.
- Like equals(), the default version of hashCode() is not terribly sophisticated: "This is
typically implemented by converting the internal address of the object into an integer." (Sun's Java API)
- Many classes override the inherited hashCode(), e.g. Integer simply returns
the int value itself, and String applies Horner's rule: repeatedly multiply the running sum by 31, then add the next character's Unicode value.
- If designing a class of objects to be used as hash table keys, it is recommended that you override the inherited hashCode() method, because
the hashing classes will use it in their hashes. Here is how it is used...
- java.util.HashMap internally uses this hashing function:
static int hash(Object x) {
    int h = x.hashCode();
    h += ~(h << 9);
    h ^= (h >>> 14);
    h += ~(h << 4);
    h ^= (h >>> 10);
    return h;
}
- (~ is bitwise complement, ^ is bitwise XOR, << is left shift zero fill, >>> is right shift zero fill)
- From the method comments: "In addition to the object's own hashCode, this method applies a "supplemental hash function,"
which defends against poor quality hash functions. The shift distances in this function were chosen as the
result of an automated search over the entire four-dimensional search space."
- After calling the above to refine the hash code, the HashMap methods (or delegates)
perform the hash itself by calling this method:
// h is the hash code, length is the hashtable length
static int indexFor(int h, int length) {
return h & (length-1); // bitwise AND
}
- Bitwise AND is much faster than MOD (%), but only works if hashtable length is a power of 2.
- If length is a power of 2, then length-1 in binary is all 0's followed by
log2(length) 1's. This serves as a mask to return a value with only the low order
log2(length) bits of the hash value -- effectively MOD.
- Example: length = 256, length-1 = 255 = 00000000 00000000 00000000 11111111 in binary.
When bitwise ANDing this with any value, the high order 24 bits all go to 0 and the 8 low order bits pass through as is. Same result as applying % 256, only faster.
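The two methods above can be packaged together and the masking claim checked directly. This sketch reproduces hash() and indexFor() as shown in the notes and confirms that, for a power-of-two length, the bitwise AND gives the same index as mod.

```java
public class HashDemo {
    // Supplemental hash function, as shown in the notes (JDK 1.4-era HashMap).
    static int hash(Object x) {
        int h = x.hashCode();
        h += ~(h << 9);
        h ^= (h >>> 14);
        h += ~(h << 4);
        h ^= (h >>> 10);
        return h;
    }

    // Map a hash code to a table index; length must be a power of 2.
    static int indexFor(int h, int length) {
        return h & (length - 1);     // bitwise AND as a fast substitute for mod
    }

    public static void main(String[] args) {
        int h = hash("example");
        // For power-of-two lengths, masking equals mod (even for negative h)
        System.out.println(indexFor(h, 256) == Math.floorMod(h, 256)); // true
    }
}
```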
Collisions
- If every possible key value hashed to a different index, collisions would not occur.
- Collision probability can be reduced by increasing the table size so it is much greater than
the number of keys to store, then design the hash to assure it can generate any table index.
This wastes space!
- Collision probability can be reduced through intelligent hash code design with the goal of
uniform hashing, but this may not be feasible for a given set of possible keys.
- So collisions will occur! How can we deal with them? One of these approaches is normally used:
- Chaining : Each hashtable entry points to the head of a (singly) linked list.
- Upon insertion, new entry is added as new linked list head. O(1)
- When searching, must be linear through linked list. O(n) worst case
- Upon removal, entry is removed from linked list. O(n) worst case to find it, O(1) to remove.
- Average time for all three is O(1) since list will be very short with good hashing.
- Open-address : If hashtable entry hashIndex is occupied by different key, try a different entry. Here are some techniques.
- Linear Probe : Try the next entry.
- ( hashIndex + 1 ) MOD table.length
- In place of offset 1, you can use any positive value relatively prime to table.length.
- If table.length is a power of 2, the offset from hashIndex can be any odd value.
- If table.length is prime, the offset from hashIndex can be any positive value less than table.length.
- The offset can vary from one attempt to the next, e.g. ( hashIndex + numberOfCollisions )
- Linear Probing can lead to clustering of entries rather than uniform distribution.
- Also note that initial hash index "fixes" the sequence, so all keys that hash to this index
will follow the same sequence of re-tries.
- Quadratic Probe : Offset for next try is based on a squared value
- ( hashIndex + numberOfCollisions² ) MOD table.length
- Better results than linear, but is slower and can lead to secondary clustering
- Again, the initial hash index "fixes" the sequence.
- Double hashing : Offset for next try is based on second hashing function
- ( hashIndex + indexFromSecondaryHash ) MOD table.length
- Secondary hash should be independent of primary, so two keys hashing to same index
will have different secondary hash
- The offset must be relatively prime to table.length
- Almost as good as randomly-generated offset
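Chaining, the first approach above, can be sketched in a small class (hypothetical, not the JDK's implementation): each slot holds the head of a singly linked chain, insertion prepends in O(1), and search walks the chain linearly.

```java
public class ChainedTable<K, V> {
    // One node per key/value pair; each table slot holds a singly linked chain.
    private static class Node<K, V> {
        K key; V value; Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private Node<K, V>[] table = (Node<K, V>[]) new Node[16]; // power-of-2 length

    private int indexFor(K key) {
        return key.hashCode() & (table.length - 1);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; } // replace value
        }
        table[i] = new Node<K, V>(key, value, table[i]);        // O(1): new head
    }

    public V get(K key) {
        for (Node<K, V> n = table[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;  // linear scan of the chain
        }
        return null;                                // key not present
    }
}
```

With good hashing the chains stay very short, so both methods run in O(1) on average even though get() is O(n) in the worst case.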
Performance as Table Fills
- What happens to performance (number of probes) as a hash table becomes "full"? Depends.
- Chained hash tables technically can never become full, but chains can get long. Performance degrades linearly.
- Open Addressed hash tables can definitely fill. As the table nears capacity (particularly above 80%), performance degrades dramatically -- the expected number of probes grows roughly as 1/(1 - load factor).
- What is involved in re-sizing the hash table?
- Allocate a larger table
- Traverse the original table and, for each entry, rehash and insert into new table
- Requires linear, O(n), time on average, but if the new table is not much larger
and the hash was poor, it could go O(n²).
- Resizing is triggered when load factor exceeds a specified threshold.
- Threshold for chained table may be defined differently than for open addressed table, since resizing
is not as critical.
- If threshold set too low, access will be faster but too much time is spent re-sizing.
- If threshold set too high, access can become too slow.
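The resize steps above can be sketched with an open-addressed table of integer keys (a hypothetical class using linear probing): when the load factor would exceed the threshold, allocate a larger table, then rehash and reinsert every entry.

```java
public class ResizableTable {
    // Open-addressed table of Integer keys with linear probing; doubles in
    // size when the load factor would exceed THRESHOLD. Illustrative sketch
    // (assumes distinct keys; no removal).
    private static final double THRESHOLD = 0.75;
    private Integer[] table = new Integer[8];
    private int size = 0;

    public void add(int key) {
        if ((size + 1.0) / table.length > THRESHOLD) resize();
        insert(table, key);
        size++;
    }

    public boolean contains(int key) {
        int i = Math.floorMod(key, table.length);
        while (table[i] != null) {           // probe until an empty slot
            if (table[i] == key) return true;
            i = (i + 1) % table.length;      // linear probe: try next slot
        }
        return false;
    }

    private void resize() {
        Integer[] bigger = new Integer[table.length * 2]; // allocate larger table
        for (Integer key : table) {
            if (key != null) insert(bigger, key);         // rehash every entry
        }
        table = bigger;
    }

    private static void insert(Integer[] t, int key) {
        int i = Math.floorMod(key, t.length);
        while (t[i] != null) i = (i + 1) % t.length;      // probe for empty slot
        t[i] = key;
    }
}
```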
HashMap Implementation
- Uses an array for the hash table.
- Uses chaining with singly-linked list.
- Doubles array size when number of entries exceeds 75% of array length. Due to chaining, less than 75% of array elements are occupied.
- Does not offer an iterator per se (this is a Map)
- You can obtain a collection view, then get an iterator from it.
- Iteration sequence? Traverses the array backward and for each non-null member will traverse
linked list head to tail.
- Why backward? Loop continuation condition will be based on 0 rather than table.length, so is a bit faster
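Since a Map has no iterator() of its own, iteration goes through a collection view, as sketched below (map contents are my own hypothetical example).

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class ViewIteration {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("a", 1);
        map.put("b", 2);

        // No map.iterator() exists; get an iterator from a collection view.
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> e = it.next();
            // Order reflects the internal array traversal, not insertion order
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```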
Using Hash Table versus Binary Search Tree
- Choice depends on how you will use the information
- If you need ordered access or access to successor/predecessor, use a BST
- If you don't need ordered access, a hash table will probably be better
- If you do a lot of insertions relative to later access, chained hash table is faster -- guaranteed O(1) insert
- If you need consistent access times, a BST may be better. Hash table access times may vary widely
depending on chain lengths or number of probes, and re-sizing could occur at any time.
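The ordered-access difference is easy to demonstrate: the same keys inserted into a TreeMap (a red-black BST) and a HashMap iterate very differently. The keys below are a hypothetical example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class OrderedVsUnordered {
    public static void main(String[] args) {
        Map<String, Integer> tree = new TreeMap<String, Integer>();
        Map<String, Integer> hash = new HashMap<String, Integer>();
        for (String k : new String[] { "pear", "apple", "fig" }) {
            tree.put(k, k.length());
            hash.put(k, k.length());
        }
        System.out.println(tree.keySet()); // [apple, fig, pear] -- sorted order
        System.out.println(hash.keySet()); // some unspecified order
    }
}
```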
Peter Sanderson (PSanderson@otterbein.edu)