\n", "\n", "# Hashing\n", "\n", "We've started using Python's built-in dictionary type, and it is most useful. But how does such a construct actually work? Looking up values by their keys is an extremely quick operation in Python, so it isn't your like your typical `find()` procedure.\n", "\n", "Instead, dictionaries use a hashing function convert keys to integers, which are then used as indices into a list or set of lists. With a sophisticated implementation of this approach, one can make key lookups available in constant time!\n", "\n", "The following exercises introduce you to some simple hashing techniques, so you get a feel for how this operation might work.\n", "\n", "

\n", "\n", "# Exercise 1 - Simple Numeric Hashing\n", "\n", "Suppose that we have an application where the keys are U.S. social security numbers. A social security number such as 123-45-6789 is a 9-digit number divided into three fields. The first field identifies the geographical area where the number was issued (for example number whose first field are 035 are from Rhode Island and numbers whose first field are 214 are from Maryland) and the other two fields identify the individual. \n", "\n", "There are a billion different social security numbers, but suppose that our application will need to process just a few hundred keys, so that we could use a hash table of size M = 1000. One possible approach to implementing a hash function is to use three digits from the SSN. Using three digits from the field on the right is likely to be preferable to using the three digits in the field on the left (since customers may not be equally dispersed over geographic areas), but a better approach is to use all nine digits to make an int value, then consider hash functions for integers.\n", "\n", "__For this exercise__, implement a hashing function for integers, `my_hash1()`. The function should take an integer $i$ to be hashed and a hashtable size $M$ as arguments, and the function should return the hashed value. Remember, the returned hash value should be between 0 and the hashtable size.\n", "\n", "For `my_hash1()`, use the modulo operator on the input $i$ with hashtable size $M$ to determine the hashed value.\n", "\n", "I have provided a driver function `generate_ssn()` that will generate a random SSN-like number. Use this function and a for loop to generate a list of 100 random SSNs and create a list containing your hash function applied to each random SSN." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random\n", "\n", "def generate_ssn():\n", " \"\"\"Generates random Social Security-Like Numbers\"\"\"\n", " part1RandInt = random.randint(0,999)\n", " part2RandInt = random.randint(0,999)\n", " part3RandInt = random.randint(0,9999)\n", " \n", " return (part1RandInt * 10**7) + (part2RandInt * 10**4) + part3RandInt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def my_hash1(i, m):\n", " \"\"\"Generates a hashed version of i\"\"\"\n", " \n", " hashedValue = i % m # Implement here\n", " \n", " return hashedValue # Update to return the hashed value\n", "\n", "# A default value for M\n", "M = 113\n", "\n", "# Some test social security numbers\n", "original_ssns = []\n", "\n", "while ( len(original_ssns) < 100 ):\n", " # Call generate_ssn() to get a new SSN\n", " new_ssn = generate_ssn()\n", " \n", " if ( new_ssn not in original_ssns ):\n", " original_ssns.append(new_ssn)\n", "\n", "hashed_ssns = []\n", "\n", "for i in range(100):\n", " # Call my_hash1() on this new SSN and append to hashed_ssns\n", " new_ssn = original_ssns[i]\n", " hashed_ssn = my_hash1(new_ssn, M)\n", " hashed_ssns.append(hashed_ssn)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now use your histogram code from above to show the frequencies of the hashed values. That is, call plt.hist() on the hashed_ssns list. What do you notice about this histogram?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.hist(hashed_ssns, bins=M)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hash Collisions\n", "\n", "An important property any good hashing algorithm should have is a low frequency of __hash collisions__. A __hash collision__ occurs when two input values hash to the same output value. Such collisions reduce hashtable efficiency and complicate checking for uniqueness.\n", "\n", "As an example, the `hash()` function is often used to determine whether two files are identical. If your hash function often has collisions, however, you may have little faith that the two files are identical even if they have the same hash value.\n", "\n", "Hash functions are also used for secure password storage. Rather than store your password directly, companies will store a hashed copy of your password and check if the password you enter when you try to log in hashes to the same value.\n", "\n", "## Exercise 1.b. Count Collisions\n", "\n", "Write a function, called `collision_counter()` that takes a list of hashed values and the table size. Call the function with the `hashed_ssns` list to count the number of collisions, where the collision formula is as follows: \n", "\n", "$\\text{collisions} = \\sum_{i=0}^{M} \\text{max}(0, \\text{count}(\\text{hashed_ssns} == i) - 1)$ \n", "\n", "__ASIDE__: Is the table size argument really necessary here?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def collision_counter(hashedList, tableSize):\n", " \"\"\"Count collisions in a hashed list\"\"\"\n", " \n", " collisionCount = 0\n", " # Count collisions\n", " for i in range(tableSize):\n", " localCount = 0\n", " for hashedValue in hashedList:\n", " if ( i == hashedValue ):\n", " localCount += 1\n", " \n", " collisionCount += max(0, localCount - 1)\n", " \n", " return collisionCount\n", "\n", "print(\"Collision Count:\", collision_counter(hashed_ssns, M))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1.c. Load Factor\n", "\n", "You may expect the number of collisions to be intimately related to the size of your hash table. This observation is true, and one way hash table implementations deal with this issue is to automatically expand the size of the hash table when the number of elements in the table reaches a certain level.\n", "\n", "This level is called the __load factor__ and is simply the ratio of the number of elements $n$ to the size of the table $M$:\n", "\n", "Load Factor $\\alpha = \\frac{n}{M}$\n", "\n", "Write a function, called `load_factor()`, that takes the number of elements you've hashed and your hash table size $M$ as arguments and return the load factor." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def load_factor(n, M):\n", " \"\"\"Calculate load factor for a table of size M with n elements\"\"\"\n", " alpha = 0\n", " \n", " # Calculate alpha\n", " alpha = n / M\n", " \n", " return alpha\n", "\n", "print(\"Load Factor:\", load_factor(len(hashed_ssns), M))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some Observations About Collisions and Load Factor\n", "\n", "Answer the following questions:\n", "\n", "- What is the minimum number of collisions one may have if the load factor $\\alpha > 1$? \n", " - Minimum number of collisions is 1. Pigeon hole principle states that if load factor is greater than 1, we have more items than slots in a list, so at least one item must collide with another item (e.g., must be placed in the same spot as a previous item).\n", "- Give a brief summary of what you think the connection is between collisions and load factor. \n", " - As load factor increases, the likelihood of collisions also increases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "\n", "# Exercise 2 - Hashing Strings\n", "\n", "Now that you have a function for hashing integers, we can extend this to hash strings. Write a function `my_hash_str()` that takes a string $s$ and a hashtable size $M$ and returns a hashed integer in 0 to M. Recall that you can convert a character to an integer using the `ord()` function.\n", "\n", "To write `my_hash_str()`, create an accumulator variable, use a for loop to convert each character of the input string into an integer, and add this integer value to the accumulator. Then, use the modulo operator to find the remainder between this accumulated sum and the hashtable size. Return this value as the hash of the string." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def my_hash_str(s, m):\n", " \"\"\"Hash a string into an integer value\"\"\"\n", " \n", " # Sum up all character int valuee\n", " accum = 0\n", " for i in range(len(s)):\n", " accum += ord(s[i])\n", "\n", " # Calculate the hash\n", " hashedValue = accum % m\n", " \n", " return hashedValue" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Here are some sample strings you can test. They're tuples with strings and \n", "# hash values assuming M=13\n", "sampleStrings = [\n", " (\"damn kid. tying up the phone line again. they're all alike...\", 1),\n", " (\"smith, i didn't show my work. i did it in my head...\", 11),\n", " (\"another one got caught today, it's all over the papers.\", 11),\n", " (\"or feels threatened by me...\", 11),\n", " (\"without nationality, without religious bias... and you call us criminals.\", 2),\n", "]\n", "\n", "M = 13\n", "for x, y in sampleStrings:\n", " hashedValue = my_hash_str(x, M)\n", " print(x, hashedValue == y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Observations\n", "\n", "Below, I have code that applies your hashing function to the text in the Hack Manifesto and prints a histogram of the resulting hash values.\n", "\n", "You should see that a large number of lines get mapped to the same value. These are called hash collisions.\n", "\n", "If we were to build a hashtable with a list and this hash function, we would need to handle these collisions or risk data loss.\n", "\n", "We could write a better hash function that distributes things more evenly, or we could increase our hashtable size, or we could do a mix of the two." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Simple code for reading the hacker manifesto text from the Internet\n", "import requests # So we can pull from a website\n", "import string # for ascii letters\n", "\n", "# Get the text of the Hack Manifesto, \"The Conscience of a Hacker\"\n", "manifestoText = set(requests.get(\"https://dl.dropboxusercontent.com/u/66442241/manifesto.txt\").text.splitlines())\n", "print(\"Number of lines of text:\", len(manifestoText))\n", "\n", "# Given a hashtable size M, hash each line of the manifesto\n", "M = 113\n", "\n", "hashed_values = [my_hash_str(x, M) for x in manifestoText]\n", "print(\"Number of Collisions:\", collision_counter(hashed_values, M))\n", "print(\"Load Factor:\", load_factor(len(hashed_values), M))\n", "\n", "plt.hist(hashed_values, bins=M)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "

\n", "\n", "# Exercise 3 - Develop Your Own String Hashing Algorithm\n", "\n", "As mentioned, there are many ways one could implement a hash function. For instance, one could add more weight to letters that appear early in the string, or one could add more weight to letters that appear rarely in English.\n", "\n", "__Work with your partner to design your own function for hashing strings.__ \n", "\n", "Describe your hashing algorithm below, and implement it.\n", "\n", "Use the `manifestoText` file, `collision_counter()` function, and the `load_factor()` function to test your hashing function with various hashtable sizes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hashing Algorithm Description:\n", "\n", "Give extra weight to rare letters in the corpus. We do this by creating a weighting dictionary in which weights are set to the inverse frequency of a token. The more often a token appears, the less weight it gets.\n", "\n", "We then multiple this weight by the letter when calculating its ordinal value when we sum up the ordinal values of a string." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Put your implementation here\n", "\n", "from collections import Counter\n", "\n", "freqDict = Counter(' '.join(manifestoText))\n", "hashingDictionary = {}\n", "for char in sorted(freqDict, key=freqDict.get, reverse=True):\n", " charFreq = freqDict[char]\n", " hashingDictionary[char] = 1 / charFreq\n", "\n", " \n", "def my_hash_str2(s, m):\n", " # Sum up all character int valuee\n", " accum = 0\n", " for i in range(len(s)):\n", " if i in hashingDictionary:\n", " accum += ord(s[i]) * hashingDictionary[i]\n", " else:\n", " accum += ord(s[i])\n", "\n", " # Calculate the hash\n", " hashedValue = accum % m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the code below to test your hashing function compared to the `my_hash_str()` function I had you implement earlier.\n", "\n", "If your function achieves gets fewer collisions, you've developed a better hashing function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "loadFactors = []\n", "collisions = []\n", "for M in range(60, 1000, 37):\n", " \n", " hashed_values = [my_hash_str2(x, M) for x in manifestoText]\n", " alpha = load_factor(len(hashed_values), M)\n", " collisionCount = collision_counter(hashed_values, M)\n", " \n", " loadFactors.append(alpha)\n", " collisions.append(collisionCount)\n", " \n", "plt.scatter(loadFactors, collisions)\n", "plt.xlabel(\"Load Factor\")\n", "plt.ylabel(\"Collisions\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }