Evil Lies about Hash Tables

Greetings, readers. There are lies being told in Computer Science, lies you probably believe. Today I want to discuss the lie that is the constant-time lookup or insertion of our friend the hash table. Don't get me wrong: I love hash tables (some of my best functions use them), but they are far from constant time.

So, the usual statement is: hash tables have constant-time lookups.

This is a lie. They have average case constant time lookups, worst case linear time lookups, and O is about worst case analysis. 

Let’s dig into this statement.

[Image: Faye trying to understand hash tables.]

How do hash tables work?

Hash tables store key-value pairs: you tell them to use some object type as keys (for example strings) and some other objects as values (for example a Person object).

This way we can have:

(let ((persons (make-hash-table :test #'equalp)))
  (setf (gethash "Lyra" persons) (list :name "Lyra" :height 36 :weight 34))
  (setf (gethash "Lrya" persons) (list :name "Lrya" :height 34 :weight 36))
  (setf (gethash "Faye" persons) (list :name "Faye" :height 24 :weight 20))
  (print (gethash "Lyra" persons))
  …)

So `persons` is a hash table mapping each name to a list of person attributes.

Their power is in hashing: the use of a function that maps an object to an integer. Taking strings as an example, a simple hash function could be:

(defun hash-string (str)
  "Sum the integer values of STR's characters, e.g. 408 for \"Lyra\"."
  (reduce #'+ str :key #'char-int))

This adds up the integer values of each character, where "a" is 97, "b" is 98…

When we run

 (gethash "Lyra" persons)

First we hash the string "Lyra" and get 408. The constant-time lookup structure we all know is the array, so, as one would expect, hash tables can (and should) be backed by arrays.

This leads to two problems:

  1. The possible values of our hash function are unbounded (bignums); a hash table can't be backed by an array of arbitrary size.
  2. What if two keys hash to the same value?

The first is simple: we limit the hash table to a certain size, so it has m possible indices. Then we just take (mod (hash-string "foo") m) to get the array index.
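
As a rough sketch (the names `*bucket-count*` and `bucket-index` are mine, not from any library), assuming the `hash-string` function from above, mapping a key to an array index looks like this:

(defparameter *bucket-count* 16)   ; m: the number of slots in the backing array

(defun bucket-index (key)
  "Map KEY to an index into the backing array of *BUCKET-COUNT* buckets."
  (mod (hash-string key) *bucket-count*))

;; (bucket-index "Lyra") => (mod 408 16) => 8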

The second is also fairly simple: we store a list of entries at each index. So "Ly" and "yL" (which have the same character sum) would live at the same array index, or, in hash-table parlance, in the same bucket. Each entry in a bucket is the full key-value pair, so in the bucket containing "Lyra" in persons we might see

(list (cons "Lyra" (list :name "Lyra" :height 36 :weight 34))
      (cons "Lrya" (list :name "Lrya" :height 34 :weight 36)))

Then we are left with comparing the key portion of each pair against the given key using our hash table's test function, and finally returning the matching value.
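
To make that concrete, here is a minimal lookup sketch; `my-gethash`, the vector-of-association-lists representation, and the reuse of `hash-string` from above are my own assumptions, not how any real Lisp implements gethash:

(defun my-gethash (key buckets test)
  "Look up KEY in BUCKETS, a vector of association lists.
Returns the value and a second value saying whether KEY was found."
  (let* ((index (mod (hash-string key) (length buckets)))
         (pair  (assoc key (aref buckets index) :test test)))
    (values (cdr pair) (and pair t))))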

Side note: If you have a way to compare the elements in your bucket you can use a balanced binary tree instead of a list. This takes the worst-case lookup time from O(k), where k is the number of elements in your bucket, down to O(log k). Java 8 did this for large bucket sizes in its HashMap (at least in some implementations).

Why is this not just O(n) or O(log n)?

Well, it is: if your hash function is constant (or just bad) you will get lots of hash collisions and turn your nice shiny hash table into a balanced binary tree or a plain list. But what we really care about is the average lookup time.
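
Just to make the degenerate case concrete, here is a deliberately terrible hash function of my own invention; with it every key lands in bucket 0 and every lookup becomes a linear scan:

(defun constant-hash (str)
  "A worst-case hash function: every key maps to the same bucket."
  (declare (ignore str))
  0)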

If we fix our number of buckets m, then hash tables are back to bad: as the number of entries n grows, the average bucket holds n/m of them, and lookups creep back toward linear time… The good news is we can resize the bucket array. This requires us to rehash all of the elements and build new lists. That could be expensive, but it happens rarely, and choosing good sizes for the hash table greatly improves things. You can find the math in any CS data structures book, or on Wikipedia.

With proper rehashing and a good hash function you're back down to constant-time lookups!
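
Here's a minimal sketch of that resize-and-rehash step, again using the hypothetical vector-of-association-lists representation and the `hash-string` function from above (real implementations do this differently and more cleverly):

(defun rehash (buckets new-size)
  "Return a fresh vector of NEW-SIZE buckets holding every pair from BUCKETS."
  (let ((new-buckets (make-array new-size :initial-element nil)))
    (loop for bucket across buckets
          do (loop for (key . value) in bucket
                   do (push (cons key value)
                            (aref new-buckets (mod (hash-string key) new-size)))))
    new-buckets))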

So please, don’t tell me your maps are O(1) without at least alluding to the fact that this is not quite O…

Another lie:

Quicksort is not O(n log n).

It's actually O(n^2), but it almost always outperforms its O(n log n) rivals.

2 thoughts on "Evil Lies about Hash Tables"

  1. >This is a lie. They have average case constant time lookups, worst case linear time lookups, and O is about worst case analysis.

    Ordo, omega, and theta are not about "worst" or "best" case analysis; they simply describe the bounds of some real-valued function. You can take any T(n) and find an O(f(n)) for which T(n) ∈ O(f(n)).

    We establish these T(n) by using a cost model for analysing our algorithms. Furthermore, these T(n) depend on the type of input the algorithm receives, for example a set of keys that produces collisions for our hash table.

    1. True! (Though some definitions, again see Wikipedia, may argue.) In general I ask candidates for the worst-case time of an algorithm, and we say O, but one of my biggest flaws as a mathematician is my occasional lack of exactness…
