INST728E - Module 3. JSON - JavaScript Object Notation

Much of the data with which we will work comes in the JavaScript Object Notation (JSON) format. JSON is a lightweight text format that describes objects as keys and values without requiring a schema up front (in contrast to schema-oriented formats like XML).

Many "RESTful" APIs available on the web today return data in JSON format, and the data we have stored from Twitter follows this rule as well.

Python's JSON support is relatively robust and is included in the language as the standard-library json module. This module allows us to read and write JSON to/from a string or file and converts between JSON and many of Python's native types.

In [ ]:
import gzip # for handling compressed files
import os # For finding paths to files

# For parsing JSON
import json

# For FreqDist
import nltk
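As a quick sanity check of the round trip, json.dumps() serializes a Python object to a JSON string and json.loads() parses it back (a minimal sketch with made-up data):

```python
import json

# Serialize a dictionary to a JSON string, then parse it back
original = {"count": 3, "tags": ["a", "b"]}
roundTrip = json.loads(json.dumps(original))

print(roundTrip == original)  # True: the round trip preserves the data
```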

JSON and Keys/Values

The main idea here is that JSON allows one to specify a key, or name, for some data and then that data's value as a string, number, or object.

An example line of JSON might look like:

{"key": "value"}

In [ ]:
jsonString = '{"car_make": "bmw"}'

# Parse the JSON string
dictFromJson = json.loads(jsonString)

# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dictFromJson)

# Will print the value
print ("Data stored in \"car_make\":\n", dictFromJson["car_make"])

# This will cause a KeyError! "bmw" is a value in this
#  dictionary, not a key
print ("Data stored in \"bmw\":\n", dictFromJson["bmw"])
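If you are not sure a key exists, the dictionary's get() method returns a default instead of raising a KeyError. A minimal sketch using the same made-up car_make example:

```python
import json

dictFromJson = json.loads('{"car_make": "bmw"}')

# get() returns the supplied default instead of raising KeyError
print(dictFromJson.get("bmw", "<no such key>"))
print(dictFromJson.get("car_make", "<no such key>"))
```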

Multiple Keys and Values

A JSON string/file can have many keys and values, but every key must have a value. Arrays let us have values without keys, though this can be awkward.

An example of JSON string with multiple keys is below:

{ "name": "Cody", "occupation": "Student", "goal": "PhD" }

Note the comma after each of the first two values. These commas are required for valid JSON and separate one key/value pair from the next.

In [ ]:
jsonString = '{ "name": "Cody", "occupation": "PostDoc", "goal": "Tenure" }'

# Parse the JSON string
dictFromJson = json.loads(jsonString)

# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dictFromJson)
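To illustrate values without keys, an entire JSON document can also be a bare array; json.loads() then returns a plain Python list. A small sketch with made-up values:

```python
import json

arrString = '["bmw", "audi", "ford"]'
carList = json.loads(arrString)

# json.loads() returns a Python list for a top-level JSON array
print("Resulting list:", carList)
print("First element:", carList[0])
```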

JSON and Arrays

The above JSON string describes an object whose name is "Cody". How would we describe a list of similar students? Arrays are useful here and are denoted with "[]" rather than the "{}" object notation. For example:

{ "people": [ { "name": "Cody", "occupation": "Student", "goal": "PhD" }, { "name": "Scott", "occupation": "Student", "goal": "Masters" } ] }

Again, note the comma between the "}" and "{" separating the two student objects and how they are both surrounded by "[]".

In [ ]:
jsonString = '{"people": [{"name": "Cody", "occupation": "PostDoc", "goal": "Tenure"}, {"name": "Scott", "occupation": "Student", "goal": "Masters"}]}'

# Parse the JSON string
dictFromJson = json.loads(jsonString)

# Python now has a dictionary representing this data
print ("Resulting array:\n", dictFromJson)

print ("Each person:")
for student in dictFromJson["people"]:
    print (student)

More JSON + Arrays

A couple of things to note:

  1. JSON does not need a name for the array; the top-level element can itself be an array.
  2. The student objects need not be identical.

As an example:

[ { "name": "Cody", "occupation": "Student", "goal": "PhD" }, { "name": "Scott", "occupation": "Student", "goal": "Masters", "completed": true } ]

In [ ]:
jsonString = '[{"name": "Cody","occupation": "PostDoc","goal": "Tenure"},{"name": "Scott","occupation": "Student","goal": "Masters","completed": true}]'

# Parse the JSON string
arrFromJson = json.loads(jsonString)

# Python now has an array representing this data
print ("Resulting array:\n", arrFromJson)

print ("Each person:")
for student in arrFromJson:
    print (student)

Nested JSON Objects

We've shown that an array can be a value, and the same holds for objects. In fact, one of the strengths of JSON is its essentially unlimited nesting depth and expressiveness. You can very easily nest objects within objects, and JSON in the wild relies on this heavily.

An example:

{ "disasters" : [ { "event": "Nepal Earthquake", "date": "25 April 2015", "casualties": 8964, "magnitude": 7.8, "affectedAreas": [ { "country": "Nepal", "capital": "Kathmandu", "population": 26494504 }, { "country": "India", "capital": "New Delhi", "population": 1276267000 }, { "country": "China", "capital": "Beijing", "population": 1376049000 }, { "country": "Bangladesh", "capital": "Dhaka", "population": 168957745 } ] } ] }

In [ ]:
jsonString = '{"disasters" : [{"event": "Nepal Earthquake","date": "25 April 2015","casualties": 8964,"magnitude": 7.8,"affectedAreas": [{"country": "Nepal","capital": "Kathmandu","population": 26494504},{"country": "India","capital": "New Delhi","population": 1276267000},{"country": "China","capital": "Beijing","population": 1376049000},{"country": "Bangladesh","capital": "Dhaka","population": 168957745}]}]}'

disasters = json.loads(jsonString)

for disaster in disasters["disasters"]:
    print (disaster["event"])
    print (disaster["date"])
    print (disaster["casualties"])
    
    for country in disaster["affectedAreas"]:
        print (country["country"])
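Once parsed, nested JSON is ordinary Python data, so we can compute over it directly. A sketch that sums affected-country populations over a trimmed copy of the example above:

```python
import json

jsonString = ('{"disasters": [{"event": "Nepal Earthquake", '
              '"affectedAreas": [{"country": "Nepal", "population": 26494504}, '
              '{"country": "India", "population": 1276267000}]}]}')
disasters = json.loads(jsonString)

# Sum populations across every affected country of every disaster
total = sum(country["population"]
            for disaster in disasters["disasters"]
            for country in disaster["affectedAreas"])

print("Total affected population:", total)
```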

From Python Dictionaries to JSON

We can also go from a Python object to JSON with relative ease.

In [ ]:
exObj = {
    "event": "Nepal Earthquake",
    "date": "25 April 2015",
    "casualties": 8964,
    "magnitude": 7.8
}

print ("Python Object:", exObj, "\n")

# now we can convert to JSON
print ("Object JSON:")
print (json.dumps(exObj), "\n")

# We can also pretty-print the JSON
print ("Readable JSON:")
print (json.dumps(exObj, indent=4)) # Indent adds space
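Two related options worth knowing: sort_keys=True makes dumps() output deterministic, and json.dump() (no "s") writes directly to an open file. A sketch using a temporary file path:

```python
import json
import os
import tempfile

exObj = {"magnitude": 7.8, "event": "Nepal Earthquake"}

# sort_keys=True orders keys alphabetically for stable output
print(json.dumps(exObj, indent=4, sort_keys=True))

# json.dump() writes to a file handle instead of returning a string
path = os.path.join(tempfile.gettempdir(), "example.json")
with open(path, "w", encoding="utf8") as out_file:
    json.dump(exObj, out_file)
```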

Reading Twitter JSON

We should now have all the tools necessary to understand how Python can read Twitter JSON data. To show this, we'll read in a single tweet from the Ferguson, MO protests, review its format, and parse it with Python's JSON loader.

In [ ]:
tweetFilename = "single_tweet.json"

# Use Python's os.path.join to account for Windows, OSX/Linux differences
tweetFilePath = os.path.join("..", "Datasets", tweetFilename)

print ("Opening", tweetFilePath)

# Open the file with an explicit UTF-8 encoding, which
# supports the full Unicode character set
tweetFile = open(tweetFilePath, "r", encoding="utf8")

# Read in the whole file, which contains ONE tweet and close
tweetFileContent = tweetFile.read()
tweetFile.close()

# Print the raw json
print ("Raw Tweet JSON:\n")
print (tweetFileContent)
In [ ]:
# Convert the JSON to a Python object
tweet = json.loads(tweetFileContent)
print ("Tweet Object:\n")
print (tweet)

# We could have done this in one step with json.load() 
# called on the open file, but our data files have
# a single tweet JSON per line, so this is more consistent
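For completeness, here is what the one-step json.load() call looks like; this sketch writes a small stand-in file first (hypothetical content) so it can run on its own:

```python
import json
import os
import tempfile

# Create a stand-in file holding one JSON document
path = os.path.join(tempfile.gettempdir(), "one_tweet.json")
with open(path, "w", encoding="utf8") as out_file:
    out_file.write('{"text": "hello", "lang": "en"}')

# json.load() reads and parses the open file in one step
with open(path, "r", encoding="utf8") as in_file:
    demoTweet = json.load(in_file)

print(demoTweet["text"])
```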

Twitter JSON Fields

This tweet is pretty big, but we can still see some of the fields it contains. Note it also has many nested fields. We'll go through some of the more important fields below.

In [ ]:
# What fields can we see?
print ("Keys:")
for k in sorted(tweet.keys()):
    print ("\t", k)
In [ ]:
print ("Tweet Text:", tweet["text"])
print ("User Name:", tweet["user"]["screen_name"])
print ("Author:", tweet["user"]["name"])
print("Source:", tweet["source"])
print("Retweets:", tweet["retweet_count"])
print("Favorites:", tweet["favorite_count"])
print("Tweet Location:", tweet["place"])
print("Tweet GPS Coordinates:", tweet["coordinates"])
print("Twitter's Guessed Language:", tweet["lang"])
In [ ]:
# Tweets have a list of hashtags, mentions, URLs, and other
# attachments in "entities" field
print ("Entities:")
for eType in tweet["entities"]:
    print ("\t", eType)
    
    for e in tweet["entities"][eType]:
        print ("\t\t", e)

Tokenizing Tweets

Twitter language is different from standard news articles (e.g., hashtags, emoji, URLs, etc.). To account for these differences, NLTK includes a Twitter-specific tokenizer, creatively named TweetTokenizer.

In [ ]:
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokenizer.tokenize(tweet["text"])

Reading Multiple Tweets

Twitter data generally comes in files containing many JSON entries. The standard format places each tweet as a separate JSON document on its own line.

As a consequence, we can't read in the entire file in one call to json.load(), which would be bad for memory consumption anyway. Instead, we need to open the file, read it line by line, and call json.loads() on each line.

Also, JSON files generally compress really well (by an order of magnitude), so JSON files for Twitter, reddit, etc. are generally stored as gzipped files. We'll need to deal with that too.
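The pattern looks like the sketch below, which writes its own tiny gzipped file (hypothetical filename) so it can run standalone:

```python
import gzip
import json
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo_tweets.json.gz")

# Write two documents, one JSON object per line, gzip-compressed
with gzip.open(path, "w") as out_file:
    for obj in [{"text": "first"}, {"text": "second"}]:
        out_file.write((json.dumps(obj) + "\n").encode("utf8"))

# Read the file back line by line, decoding each line before parsing
demo_tweets = []
with gzip.open(path, "r") as in_file:
    for line in in_file:
        demo_tweets.append(json.loads(line.decode("utf8")))

print("Read", len(demo_tweets), "tweets")
```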

In [ ]:
# Note the .gz at the end
tweetFilename = "ferguson_sample.json.gz"

# Use Python's os.path.join to account for Windows, OSX/Linux differences
tweetFilePath = os.path.join("..", "Datasets", tweetFilename)

# Array for storing tweet objects
tweet_list = []

# Since we have a gzipped file, we can't do the normal open()
#  Instead, we use the gzip library to open the file
with gzip.open(tweetFilePath, "r") as in_file:
    
    # Now read each line like we normally would
    for line in in_file:
        
        # Convert byte stream to UTF8 string
        line_str = line.decode("utf8")
        
        # Here, we would generally do some analysis with the tweet,
        #  but given the small number in this sample file, we'll
        #  go ahead and store each tweet in an array
        tweet_list.append(json.loads(line_str))
In [ ]:
print("Tweet Count:", len(tweet_list))

Simple Analysis

Now that we've read in tweets, what can we do with them?

For homework, you'll do some of your own analysis based on the previous module. As an example though, let's see the most used hashtag and most prolific user.

In [ ]:
# Recall that hashtags are stored in tweet["entities"]["hashtags"]

# This list comprehension iterates through the tweet_list list, and for each
#  tweet, it iterates through the hashtags list
htags = [
        hashtag["text"].lower() 
         for tweet in tweet_list 
             for hashtag in tweet["entities"]["hashtags"]
        ]

# You could rewrite this comprehension as follows, but these
#  loops are generally slower than Python's list comprehensions:
# htags = []
# for tweet in tweet_list:
#     for hashtag in tweet["entities"]["hashtags"]:
#         htags.append(hashtag["text"].lower())
In [ ]:
print("Total Hashtag Count:", len(htags))
print("Unique Hashtag Count:", len(set(htags)))
In [ ]:
htag_freq = nltk.FreqDist(htags)

for ht, count in htag_freq.most_common(20):
    print(ht, count)
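If NLTK is unavailable, the standard library's collections.Counter provides the same most_common() interface. A sketch on a toy hashtag list standing in for the real htags variable:

```python
from collections import Counter

# Toy list standing in for the real htags variable
toy_htags = ["ferguson", "justice", "ferguson", "news", "ferguson"]

toy_freq = Counter(toy_htags)
for ht, count in toy_freq.most_common(2):
    print(ht, count)
```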

Most Active Users

In [ ]:
# This list comprehension pulls the lower-cased screen name
#  from each tweet in the tweet_list list
authors = [tweet["user"]["screen_name"].lower() for tweet in tweet_list]

print("Tweet Count:", len(tweet_list))
print("Total Author Count:", len(authors))
print("Unique Author Count:", len(set(authors)))

author_freq = nltk.FreqDist(authors)

print("\nActive Users:")
for author, count in author_freq.most_common(20):
    print(author, count)