INST326 In-Class Exercises

20170620, Hashing

This exercise will develop some simple hashing functions to handle various data types.


Hashing

We've started using Python's built-in dictionary type, and it is very useful. But how does such a construct actually work? Looking up values by their keys is an extremely quick operation in Python, so it can't be a typical linear find() procedure.

Instead, dictionaries use a hashing function to convert keys to integers, which are then used as indices into a list or set of lists. With a sophisticated implementation of this approach, key lookups can run in constant time!
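As a rough sketch of the idea (not CPython's actual dict implementation), a key can be hashed and then reduced to a bucket index with the modulo operator; the `put`/`get` helpers below are hypothetical names for illustration:

```python
# Minimal sketch of hash-based lookup: hash the key, then use
# modulo to reduce the hash to a valid bucket index.
M = 8  # number of buckets
buckets = [[] for _ in range(M)]  # each bucket holds [key, value] pairs

def put(key, value):
    index = hash(key) % M          # hash the key, reduce to an index
    for pair in buckets[index]:
        if pair[0] == key:         # key already present: update in place
            pair[1] = value
            return
    buckets[index].append([key, value])

def get(key):
    index = hash(key) % M          # same index computation as put()
    for k, v in buckets[index]:
        if k == key:
            return v
    raise KeyError(key)

put("apple", 3)
put("pear", 7)
print(get("apple"))  # 3
```

Because the index is computed directly from the key, a lookup only scans the (typically short) list in one bucket rather than the whole table.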

The following exercises introduce you to some simple hashing techniques, so you get a feel for how this operation might work.


Exercise 1 - Simple Numeric Hashing

Suppose that we have an application where the keys are U.S. social security numbers. A social security number such as 123-45-6789 is a 9-digit number divided into three fields. The first field identifies the geographical area where the number was issued (for example, numbers whose first field is 035 are from Rhode Island, and numbers whose first field is 214 are from Maryland), and the other two fields identify the individual.

There are a billion different social security numbers, but suppose that our application will need to process just a few hundred keys, so that we could use a hash table of size M = 1000. One possible approach to implementing a hash function is to use three digits from the SSN. Using three digits from the field on the right is likely to be preferable to using the three digits in the field on the left (since customers may not be equally dispersed over geographic areas), but a better approach is to use all nine digits to make an int value, then consider hash functions for integers.
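As a quick illustration of pulling fields out of an SSN stored as a single int, integer division and modulo can extract the leftmost and rightmost digits (the variable names here are just illustrative):

```python
ssn = 123456789  # the SSN 123-45-6789 stored as a single int

area = ssn // 10**6        # leftmost field: 123 (geographic area)
last_three = ssn % 1000    # three rightmost digits: 789

print(area, last_three)    # 123 789
```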

For this exercise, implement a hashing function for integers, my_hash1(). The function should take an integer $i$ to be hashed and a hashtable size $M$ as arguments, and the function should return the hashed value. Remember, the returned hash value should be between 0 and $M - 1$, inclusive.

For my_hash1(), use the modulo operator on the input $i$ with hashtable size $M$ to determine the hashed value.

I have provided a driver function generate_ssn() that will generate a random SSN-like number. Use this function in a loop to generate a list of 100 random SSNs, then create a list containing your hash function applied to each random SSN.

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
In [ ]:
import random

def generate_ssn():
    """Generates random Social Security-Like Numbers"""
    part1RandInt = random.randint(0,999)
    part2RandInt = random.randint(0,999)
    part3RandInt = random.randint(0,9999)
    
    return (part1RandInt * 10**7) + (part2RandInt * 10**4) + part3RandInt
In [ ]:
def my_hash1(i, m):
    """Generates a hashed version of i"""
    
    hashedValue = i % m # Implement here
    
    return hashedValue # Update to return the hashed value

# A default value for M
M = 113

# Some test social security numbers
original_ssns = []

while ( len(original_ssns) < 100 ):
    # Call generate_ssn() to get a new SSN
    new_ssn = generate_ssn()
    
    if ( new_ssn not in original_ssns ):
        original_ssns.append(new_ssn)

hashed_ssns = []

for i in range(100):
    # Call my_hash1() on this new SSN and append to hashed_ssns
    new_ssn = original_ssns[i]
    hashed_ssn = my_hash1(new_ssn, M)
    hashed_ssns.append(hashed_ssn)

Now plot a histogram showing the frequencies of the hashed values. That is, call plt.hist() on the hashed_ssns list. What do you notice about this histogram?

In [ ]:
plt.hist(hashed_ssns, bins=M)
plt.show()

Hash Collisions

An important property any good hashing algorithm should have is a low frequency of hash collisions. A hash collision occurs when two input values hash to the same output value. Such collisions reduce hashtable efficiency and complicate checking for uniqueness.
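For example, under the simple modulo hash from Exercise 1 with $M = 113$, any two keys that differ by a multiple of 113 collide:

```python
M = 113

# 5 and 118 differ by exactly M, so i % M maps both to slot 5:
# a hash collision.
print(5 % M, 118 % M)  # 5 5
```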

As an example, the hash() function is often used to determine whether two files are identical. If your hash function often has collisions, however, you may have little faith that the two files are identical even if they have the same hash value.

Hash functions are also used for secure password storage. Rather than store your password directly, companies will store a hashed copy of your password and check if the password you enter when you try to log in hashes to the same value.

Exercise 1.b. Count Collisions

Write a function, called collision_counter() that takes a list of hashed values and the table size. Call the function with the hashed_ssns list to count the number of collisions, where the collision formula is as follows:

$\text{collisions} = \sum_{i=0}^{M-1} \max(0, \text{count}(\text{hashed\_ssns} == i) - 1)$

ASIDE: Is the table size argument really necessary here?

In [ ]:
def collision_counter(hashedList, tableSize):
    """Count collisions in a hashed list"""
    
    collisionCount = 0
    # Count collisions
    for i in range(tableSize):
        localCount = 0
        for hashedValue in hashedList:
            if ( i == hashedValue ):
                localCount += 1
                
        collisionCount += max(0, localCount - 1)
        
    return collisionCount

print("Collision Count:", collision_counter(hashed_ssns, M))
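As a hint for the ASIDE above: the table size isn't strictly necessary, because we only need to count how often each hashed value repeats. One way to see this, using collections.Counter (the function name collision_counter_v2 is just for illustration):

```python
from collections import Counter

def collision_counter_v2(hashedList):
    """Count collisions without knowing the table size: every repeat
    of an already-seen hash value is one collision."""
    counts = Counter(hashedList)
    return sum(c - 1 for c in counts.values())

# Four values, two of which share slot 3 -> one collision.
print(collision_counter_v2([3, 7, 3, 9]))  # 1
```

This version also runs in O(n) rather than O(M·n), since it makes a single pass over the hashed values.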

Exercise 1.c. Load Factor

You may expect the number of collisions to be intimately related to the size of your hash table. This observation is true, and one way hash table implementations deal with this issue is to automatically expand the size of the hash table when the number of elements in the table reaches a certain level.

This level is called the load factor and is simply the ratio of the number of elements $n$ to the size of the table $M$:

Load Factor $\alpha = \frac{n}{M}$

Write a function, called load_factor(), that takes the number of elements you've hashed and your hash table size $M$ as arguments and returns the load factor.

In [ ]:
def load_factor(n, M):
    """Calculate load factor for a table of size M with n elements"""
    alpha = 0
    
    # Calculate alpha
    alpha = n / M
    
    return alpha

print("Load Factor:", load_factor(len(hashed_ssns), M))

Some Observations About Collisions and Load Factor

Answer the following questions:

  • What is the minimum number of collisions one may have if the load factor $\alpha > 1$?
    • The minimum number of collisions is 1. By the pigeonhole principle, if the load factor is greater than 1 we have more items than slots in the list, so at least one item must collide with another (i.e., must be placed in the same slot as a previous item).
  • Give a brief summary of what you think the connection is between collisions and load factor.
    • As load factor increases, the likelihood of collisions also increases.
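One way to quantify this connection (a standard balls-in-bins estimate, not required for the exercise): if $n$ keys hash uniformly into $M$ slots, the expected number of occupied slots is $M(1 - (1 - 1/M)^n)$, so the expected number of collisions is $n$ minus that quantity, which grows quickly with the load factor:

```python
def expected_collisions(n, M):
    """Expected collisions when n keys hash uniformly into M slots:
    every key beyond the first in a slot is one collision."""
    expected_occupied = M * (1 - (1 - 1/M) ** n)
    return n - expected_occupied

M = 113
for n in (50, 100, 200):
    print(f"alpha = {n / M:.2f}, "
          f"expected collisions ~ {expected_collisions(n, M):.1f}")
```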

Exercise 2 - Hashing Strings

Now that you have a function for hashing integers, we can extend this to hash strings. Write a function my_hash_str() that takes a string $s$ and a hashtable size $M$ and returns a hashed integer between 0 and $M - 1$. Recall that you can convert a character to an integer using the ord() function.

To write my_hash_str(), create an accumulator variable, use a for loop to convert each character of the input string into an integer, and add this integer value to the accumulator. Then, use the modulo operator to find the remainder between this accumulated sum and the hashtable size. Return this value as the hash of the string.

In [ ]:
def my_hash_str(s, m):
    """Hash a string into an integer value"""
    
    # Sum up all character int values
    accum = 0
    for ch in s:
        accum += ord(ch)

    # Calculate the hash
    hashedValue = accum % m
    
    return hashedValue
In [ ]:
# Here are some sample strings you can test. They're tuples with strings and 
#  hash values assuming M=13
sampleStrings = [
    ("damn kid.  tying up the phone line again.  they're all alike...", 1),
    ("smith, i didn't show my work.  i did it in my head...", 11),
    ("another one got caught today, it's all over the papers.", 11),
    ("or feels threatened by me...", 11),
    ("without nationality, without religious bias... and you call us criminals.", 2),
]

M = 13
for x, y in sampleStrings:
    hashedValue = my_hash_str(x, M)
    print(x, hashedValue == y)

Observations

Below, I have code that applies your hashing function to the text in the Hack Manifesto and prints a histogram of the resulting hash values.

You should see that a large number of lines get mapped to the same value. These are called hash collisions.

If we were to build a hashtable with a list and this hash function, we would need to handle these collisions or risk data loss.

We could write a better hash function that distributes things more evenly, or we could increase our hashtable size, or we could do a mix of the two.
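One common improvement (a sketch of the polynomial rolling scheme used by, e.g., Java's String.hashCode, not the function this exercise asks for) multiplies the accumulator by a small prime at each step, so character order matters and anagrams no longer automatically collide:

```python
def my_hash_str_poly(s, m):
    """Polynomial rolling hash: h = (((c0*31 + c1)*31 + c2)...) % m.
    Unlike a plain sum of ord() values, this is order-sensitive."""
    accum = 0
    for ch in s:
        accum = accum * 31 + ord(ch)
    return accum % m

M = 113
# Anagrams always collide under an additive hash, but usually not here.
print(my_hash_str_poly("listen", M), my_hash_str_poly("silent", M))
```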

In [ ]:
# Simple code for reading the hacker manifesto text from the Internet
import requests # So we can pull from a website
import string # for ascii letters

# Get the text of the Hack Manifesto, "The Conscience of a Hacker"
manifestoText = set(requests.get("https://dl.dropboxusercontent.com/u/66442241/manifesto.txt").text.splitlines())
print("Number of lines of text:", len(manifestoText))

# Given a hashtable size M, hash each line of the manifesto
M = 113

hashed_values = [my_hash_str(x, M) for x in manifestoText]
print("Number of Collisions:", collision_counter(hashed_values, M))
print("Load Factor:", load_factor(len(hashed_values), M))

plt.hist(hashed_values, bins=M)
plt.show()

Exercise 3 - Develop Your Own String Hashing Algorithm

As mentioned, there are many ways one could implement a hash function. For instance, one could add more weight to letters that appear early in the string, or one could add more weight to letters that appear rarely in English.

Work with your partner to design your own function for hashing strings.

Describe your hashing algorithm below, and implement it.

Use the manifestoText file, collision_counter() function, and the load_factor() function to test your hashing function with various hashtable sizes.

Hashing Algorithm Description:

Give extra weight to rare letters in the corpus. We do this by creating a weighting dictionary in which weights are set to the inverse frequency of a token. The more often a token appears, the less weight it gets.

We then multiply this weight by the character's ordinal value when we sum up the ordinal values of a string.

In [ ]:
# Put your implementation here

from collections import Counter

freqDict = Counter(' '.join(manifestoText))
hashingDictionary = {}
for char in sorted(freqDict, key=freqDict.get, reverse=True):
    charFreq = freqDict[char]
    hashingDictionary[char] = 1 / charFreq

    
def my_hash_str2(s, m):
    # Sum up all (weighted) character int values
    accum = 0
    for ch in s:
        if ch in hashingDictionary:
            accum += ord(ch) * hashingDictionary[ch]
        else:
            accum += ord(ch)

    # Calculate the hash; int() keeps the weighted sum a valid index
    hashedValue = int(accum) % m

    return hashedValue

Use the code below to test your hashing function compared to the my_hash_str() function I had you implement earlier.

If your function achieves fewer collisions, you've developed a better hashing function.

In [ ]:
loadFactors = []
collisions = []
for M in range(60, 1000, 37):
    
    hashed_values = [my_hash_str2(x, M) for x in manifestoText]
    alpha = load_factor(len(hashed_values), M)
    collisionCount = collision_counter(hashed_values, M)
    
    loadFactors.append(alpha)
    collisions.append(collisionCount)
    
plt.scatter(loadFactors, collisions)
plt.xlabel("Load Factor")
plt.ylabel("Collisions")
plt.show()