This tutorial will show you how to implement a hash table with separate chaining. It’s not the most efficient method, but it is the simplest way to get started and create a fully functioning hash table.
Hash tables are indispensable tools for solving a wide assortment for so many interesting programming problems. I always love to incorporate a hash table into a problem; they can provide a clean solution to an issue that would be a mess otherwise.
For the longest time, I wondered how hash tables were created. I wanted to make my own, but I had no clue as to how they worked. Luckily, I found James Routley’s awesome post detailing how to implement one in C. For anyone interested, I highly recommend it.
Using this knowledge, I ported the hash table to Python. By the end of this tutorial, you will understand the basic ideas behind the hash table. Perhaps more importantly, you will have implemented your very own!
If you’ve ever used a dictionary in Python or an associative array in a language like PHP, You’ve probably used a hash table before. Features such as the dictionary in Python or the associative array in PHP are often implemented using a hash table. Even more straightforward is the HashTable class available in Java.
Why would we need such a structure? Well, sometimes a flat area just isn’t enough. To make sense of the problem at hand, you may need to store and access your data by a key, a definite step up from the rudimentary integer index provided by flat arrays.
The hash table we build will be used like this:
How does this really work under the hood? As it turns out, your key (
phoneDirectory in this example) is converted into an index. This index is used for storing and retrieving the data value from the hash table’s internal array. All those messy details are hidden from the user - they just have to worry about
Our hash table will need a few fields to keep it together. It needs a
size, which will be the number of elements that have been inserted. It needs a
capacity, which will determine the size of our internal array. Last, it needs
buckets - this is the internal array, storing each inserted value in a “bucket” based on the provided key.
INITIAL_CAPACITY variable, arbitrarily set to 50 in my example class. This defines the size of our internal array. In a more complex hash table implementation (i.e. an open-addressed, double-hashed hash table), it’s important that the capacity is prime, and that it can be changed. On the other hand, our separate chaining hash table sets the capacity once and never changes it, regardless of how many elements are stored. This is good for simplicity, but bad for scalability.
If you thought you were getting a break from the internal Node structure, you were wrong! Our hash table will need its own version of a Node:
Look familiar? Node has a
next field because it’s actually part of a LinkedList. Because the hash table uses separate chaining, each bucket will actually contain a LinkedList of nodes containing the objects stored at that index. This is one method of collision resolution.
Whenever two keys have the same hash value, it is considered a collision. What should our hash table do? If it just wrote the data into the location anyway, we would be losing the object that is already stored under a different key.
With separate chaining, we create a Linked List at each index of our
buckets array, containing all keys for a given index. When we need to look up one of those items, we iterate the list until we find the Node matching the requested key.
There are other, far more efficient ways of handling collisions, but separate chaining is likely the simplest method.
Now we can really get started. Let’s jump into our hash table’s methods.
Our hash method needs to take our key, which will be a string of any length, and produce an index for our internal
We will be creating a hash function to convert the string to an index. There are many properties of a good hash function, but for our purposes the most important characteristic for our function to have is uniformity. We want our hash values to be as evenly distributed among our buckets as possible, to take full advantage of each bucket and avoid collisions. The ideal case is pictured below:
On the other hand, an uneven distribution will defeat the purpose of the hash table altogether, yielding nothing more than a bloated LinkedList.
Consider an extreme case: Our hash function will be
h(x) = 1. That’s right, each input produces the same constant value. So, what happens? Every time we hash a key, the output is 1, meaning that we assign that node to bucket 1. The result would look something like this:
Not pretty! We’ll just have to make sure we avoid this bottleneck at all costs.
Here’s the code for our hash function:
While fairly arbitrary, this function will provide an acceptable degree of uniformity for our purposes.
To insert a key/value pair into our hash table, we will follow these steps:
Increment size of hash table.
indexof key using hash function.
If the bucket at
indexis empty, create a new node and add it there.
Otherwise, a collision occurred: there is already a linked list of at least one node at this index. Iterate to the end of the list and add a new node there.
This is reflected in the following code:
After storing data in our hash table, we will surely need to retrieve it at some point. To do this, we’ll perform the following steps:
indexfor the provided key using the hash function.
Go to the bucket for that
Iterate the nodes in that linked list until the key is found, or the end of the list is reached.
Return the value of the found node, or None if not found.
This idea would be expressed in code like this:
Removing an element from a hash table is similar to removing an element from a linked list. This method will return the data value removed, or None if the requested node was not found.
Compute hash for the key to determine
Iterate linked list of nodes. Continue until end of list or until key is found.
If the key is not found, return None.
Otherwise, remove the node from the linked list and return the node value.
This would be reflected in code as such:
For more information about removing a node from a linked list, see my LinkedList article.
Hash tables can be useful in a wide variety of computer science applications. Once you learn how to use them, you won’t be able to stop! It seems at every turn there is a new application for the hash table.
Below are a few problems you can attempt to solve using your new hash table:
Write a function to determine whether a string contains repeated characters.
Given a string of any length, find the most-used character in the string.
Write a function to determine whether two strings are anagrams.
Thank you for reading. Check out the full source code for what we did today below!
Enjoyed the read?
Sign up for our mailing list to get notified about new posts and exclusive information!