INST728E - Module 9. Topic Modeling

Along with sentiment analysis, a question often asked of social networks is "What are people talking about?" We can answer this question using tools from topic modeling and natural language processing. During crises, people respond in many ways, from sharing specific information about the event, to offering condolences, to opening their homes to those in need.

To generate these topic models, we will use the Gensim package's implementation of Latent Dirichlet Allocation (LDA), which constructs a set of topics, each described as a probability distribution over the words in our tweets. Several other topic-modeling methods exist as well.

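To see what a "topic" looks like concretely before we touch the tweet data, here is a minimal, self-contained sketch on a toy corpus (the tiny documents below are made up purely for illustration). With so little data the topics themselves are not meaningful, but the output shows that each topic is just a list of (word, probability) pairs.

In [ ]:
# Toy example only: train a 2-topic LDA model on a handful of made-up documents
from gensim.corpora import Dictionary
from gensim.models import LdaModel

toy_docs = [
    ["explosion", "airport", "brussels"],
    ["thoughts", "prayers", "brussels"],
    ["airport", "closed", "explosion"],
    ["prayers", "victims", "families"],
]
toy_dictionary = Dictionary(toy_docs)
toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]

toy_lda = LdaModel(toy_corpus, id2word=toy_dictionary, num_topics=2)

# Each topic is a distribution over words; show the top words and their probabilities
for topic_id, word_probs in toy_lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print("Topic", topic_id, word_probs)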
In [ ]:
%matplotlib inline

import datetime
import json
import string
import os

import numpy as np

# For plotting
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

Event Description

In [ ]:
crisisInfo = {
    
    "brussels": {
        "name": "Brussels Transit Attacks",
        "time": 1458629880, # Timestamp in seconds since 1/1/1970, UTC
                            # 22 March 2016, 6:58 UTC to 08:11 UTC
        "directory": "brussels",
        "keywords": ["brussels", "bomb", "belgium", "explosion"],
        "box": {
            "lowerLeftLon": 2.54563,
            "lowerLeftLat": 49.496899,
            "upperRightLon": 6.40791,
            "upperRightLat": 51.5050810,
        }
    },
}
In [ ]:
# Replace the name below with your selected crisis
selectedCrisis = "brussels"

Reading Relevant Tweets

Re-read our relevant tweets...

In [ ]:
in_file_path = "/Users/cbuntain/relevant_tweet_output.json" # Replace this as necessary

relevant_tweets = []
with open(in_file_path, "r") as in_file:
    for line in in_file:
        relevant_tweets.append(json.loads(line))
        
print("Relevant Tweets:", len(relevant_tweets))

Temporal Ordering

In [ ]:
# Twitter's time format, for parsing the created_at date
timeFormat = "%a %b %d %H:%M:%S +0000 %Y"

# Frequency map for tweet-times
rel_frequency_map = {}
for tweet in relevant_tweets:
    # Parse time
    currentTime = datetime.datetime.strptime(tweet['created_at'], timeFormat)

    # Flatten this tweet's time to the minute
    currentTime = currentTime.replace(second=0)

    # If our frequency map already has this time, use it, otherwise add
    extended_list = rel_frequency_map.get(currentTime, [])
    extended_list.append(tweet)
    rel_frequency_map[currentTime] = extended_list
    
# Fill in any gaps
times = sorted(rel_frequency_map.keys())
firstTime = times[0]
lastTime = times[-1]
thisTime = firstTime

# We want to look at per-minute data, so we fill in any missing minutes
timeIntervalStep = datetime.timedelta(0, 60)    # One-minute step (60 seconds)
while ( thisTime <= lastTime ):

    rel_frequency_map[thisTime] = rel_frequency_map.get(thisTime, [])
        
    thisTime = thisTime + timeIntervalStep

# Count the number of minutes
print ("Start Time:", firstTime)
print ("Stop Time:", lastTime)
print ("Processed Times:", len(rel_frequency_map))
In [ ]:
fig, ax = plt.subplots()
fig.set_size_inches(11, 8.5)

plt.title("Tweet Frequencies")

sortedTimes = sorted(rel_frequency_map.keys())
postFreqList = [len(rel_frequency_map[x]) for x in sortedTimes]

smallerXTicks = range(0, len(sortedTimes), 30)
plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)

xData = range(len(sortedTimes))

ax.plot(xData, postFreqList, color="blue", label="Posts")

ax.grid(True, which='major')
ax.legend()

plt.show()

Hashtags as Topics

Hashtags generally have a topical connotation, so let's regenerate the common hashtags we've seen before.

In [ ]:
# This list comprehension iterates through the tweet_list list, and for each
#  tweet, it iterates through the hashtags list
htags = [
        hashtag["text"].lower() 
         for tweet in relevant_tweets 
             for hashtag in tweet["entities"]["hashtags"]
        ]

print("\nTotal Hashtag Count:", len(htags))
print("Unique Hashtag Count:", len(set(htags)))

htags_freq = nltk.FreqDist(htags)

print("\nFrequent Hashtags:")
for tag, count in htags_freq.most_common(20):
    print(tag, count)

Topic Modeling with Gensim

A big part of topic modeling is pre-processing your data.

For our context, that means answering questions like:

  • Which tokens are used so frequently as to be useless?
  • Which tokens are so rare as to be uninformative?
  • How should we handle phrases versus single words (called n-grams)?

We'll explore this feature extraction below.

In [ ]:
# Gotta pull in a bunch of packages for this

# Actual LDA implementation
import gensim.models.ldamulticore

# Author-topic model (ATM) implementation
import gensim.models.atmodel

# CountVectorizer turns tokens into numbers for us
from sklearn.feature_extraction.text import CountVectorizer

# Gensim models
from gensim.corpora import Dictionary  # All the words that appear in our dataset
from gensim.models import TfidfModel # For down-weighting frequent tokens
from gensim.models.phrases import Phrases # For building bigrams

Now we build a stop-word list from NLTK's English, French, and Spanish lists, plus other tokens we don't care about (retweet markers, URL fragments, and the crisis keywords themselves).

In [ ]:
# But first, read in stopwords
enStop = stopwords.words('english')
frStop = stopwords.words('french')
esStop = stopwords.words('spanish')

# Skip stop words, retweet signs, @ symbols, and URL headers
stopList = enStop +\
    frStop + esStop +\
    ["http", "https", "rt", "@", ":", "co", "amp", "&amp;", "...", "\n", "\r"] +\
    crisisInfo[selectedCrisis]["keywords"]
stopList.extend(string.punctuation)

For memory and performance reasons, we don't want to carry strings of characters around when doing topic modeling. Instead, we convert each tweet into a "bag of words" (BoW): the set of words it contains along with how often each one occurs. Each word is then replaced with a unique integer index. The BoW model loses information about which words occur before or after one another, but bigrams recover some of that context (to a degree).

As an example, consider the following sets of tweets:

  1. "my best friend lives in brussels and my friend isn’t responding",
  2. "Wish all but the best out to Brussels this morning",
  3. "So horrible. My thoughts are with the people in Brussels",

We can extract the following unique words from these tweets: {'.', 'all', 'and', 'are', 'best', 'brussels', 'but', 'friend', 'horrible', 'in', 'isn', 'lives', 'morning', 'my', 'out', 'people', 'responding', 'so', 't', 'the', 'this', 'thoughts', 'to', 'wish', 'with', '’'}

From there, we can replace these tokens with indices: [(0, 'friend'), (1, 'lives'), (2, 'all'), (3, 'my'), (4, 'the'), (5, 'this'), (6, 'but'), (7, 'thoughts'), (8, 'best'), (9, 'and'), (10, 'are'), (11, 'so'), (12, '’'), (13, 't'), (14, 'in'), (15, '.'), (16, 'brussels'), (17, 'responding'), (18, 'wish'), (19, 'people'), (20, 'morning'), (21, 'with'), (22, 'out'), (23, 'to'), (24, 'isn'), (25, 'horrible')]

Then we can convert tweets into the BoW model:

  1. "my best friend lives in brussels and my friend isn’t responding" --> [(0, 2), (1, 1), (3, 2), (8, 1), (9, 1), (12, 1), (14, 1), (16, 1), (17, 1), (24, 1)]
    • "my" and "friend" occur twice, and their pairs (0, 2) and (3, 2) reflect this.
  2. "Wish all but the best out to Brussels this morning", --> [(2, 1), (4, 1), (5, 1), (6, 1), (8, 1), (16, 1), (18, 1), (20, 1), (22, 1), (23, 1)]
In [ ]:
vectorizer = CountVectorizer(strip_accents='unicode', 
                             tokenizer=TweetTokenizer(preserve_case=False).tokenize,
                             stop_words=stopList)

# Build the Analyzer
analyze = vectorizer.build_analyzer() 

# For each tweet, tokenize it according to the CountVectorizer
analyzed_text = [analyze(tweet["text"]) for tweet in relevant_tweets]

# As an example, note the removed stopwords
print(relevant_tweets[0]["text"])
print(analyzed_text[0])
In [ ]:
# Make bigrams from the text, but only for really common bigrams
bigram = Phrases(analyzed_text, min_count=5)
bi_analyzed_text = [bigram[x] for x in analyzed_text]

# As an example, compare the tokenization before and after bigram detection
print(relevant_tweets[0]["text"])
print(analyzed_text[0])
print(bi_analyzed_text[0])
In [ ]:
# Build a dictionary from this text
dictionary = Dictionary(bi_analyzed_text)

# Filter out words that occur too frequently or too rarely.
# Disregarding stop words, this dataset has a very high number of low frequency words.
max_freq = 0.5
min_count = 10
dictionary.filter_extremes(no_below=min_count, no_above=max_freq)

# Accessing an element forces Gensim to populate dictionary.id2token
_ = dictionary[0]

# Create a map for vectorizer IDs to words
id2WordDict = dictionary.id2token
word2IdDict = dict(map(lambda x: (x[1], x[0]), id2WordDict.items()))

# Create a bag-of-words corpus from the bigram-annotated text
corpus = [dictionary.doc2bow(text) for text in bi_analyzed_text]

# Train TFIDF model
tfidf_model = TfidfModel(corpus)

# Build the TFIDF-transformed corpus
tfidf_corpus = [tfidf_model[text] for text in corpus]
In [ ]:
tfidf_corpus[0]
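
Since the raw (token ID, weight) pairs are hard to read, we can map the IDs back to words. This is just a small readability aid using the id2WordDict mapping we built above; if the first tweet had all of its tokens filtered out, the list will simply be empty.

In [ ]:
# Map the TF-IDF (token ID, weight) pairs for the first tweet back to words
print([(id2WordDict[token_id], round(weight, 3)) for token_id, weight in tfidf_corpus[0]])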

Above, we used the vectorizer's analyzer to tokenize each tweet and Gensim's Dictionary to turn those tokens into a bag-of-words corpus: essentially a table with a row per tweet, a column per token, and each cell counting how often that token appears in that tweet. The TfidfModel then down-weights tokens that appear in many tweets.

Now we apply LDA to this TF-IDF-weighted corpus, ask for k topics, and print the 10 words that best describe each topic.

In [ ]:
k = 5

lda = gensim.models.LdaMulticore(tfidf_corpus, 
                                 id2word=id2WordDict,
                                 num_topics=k) # ++ iterations for better results

ldaTopics = lda.show_topics(num_topics=k, 
                            num_words=10, 
                            formatted=False)

for (i, tokenList) in ldaTopics:
    print ("Topic %d:" % i, ' '.join([pair[0] for pair in tokenList]))
    print()

Visualized Topics

In [ ]:
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
In [ ]:
pyLDAvis.gensim.prepare(lda, tfidf_corpus, dictionary)

Author-Topic Models

Social media messages have an additional dimension we can leverage for identifying topics: their authors. If we make the assumption that, given all possible topics, a single user is only likely to tweet about a small number of them, then we can get some additional insight based on this data.

We'll try this below.

In [ ]:
# Simple pipeline for analyzing tweet text: tokenize, apply bigrams, convert to BoW, apply TF-IDF
def analysis_pipeline(text):
    a1 = analyze(text)
    a2 = bigram[a1]
    a3 = dictionary.doc2bow(a2)
    a4 = tfidf_model[a3]

    return a4

# Run each tweet through the pipeline, pairing it with its author's user ID,
#  and drop any tweets whose text is empty after filtering
analyzed_tweet_pairs = list(
    filter(lambda x: len(x[0]) > 0,
           [(analysis_pipeline(tweet["text"]), tweet["user"]["id"]) 
            for tweet in relevant_tweets])
)

# Documents for the model, plus a map from document index to its (single-element) author list
atm_docs = [x[0] for x in analyzed_tweet_pairs]
doc_to_author = dict([(i, [pair[1]]) for i, pair in enumerate(analyzed_tweet_pairs)])
In [ ]:
k = 10

atm = gensim.models.atmodel.AuthorTopicModel(corpus=atm_docs, 
                                             id2word=id2WordDict,
                                             doc2author=doc_to_author,
                                             num_topics=k) # ++ iterations for better results

atmTopics = atm.show_topics(num_topics=k, 
                            num_words=10, 
                            formatted=False)

for (i, tokenList) in atmTopics:
    print ("Topic %d:" % i, ' '.join([pair[0] for pair in tokenList]))
    print()
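
Because the author-topic model ties topics to authors, we can also ask what mix of topics a given author (here, a Twitter user ID) tends to write about. The short sketch below just inspects the author of the first analyzed document as an example.

In [ ]:
# Look up the topic distribution for one author
# (user IDs are the "author names" here; assumes at least one tweet survived filtering)
example_author = doc_to_author[0][0]    # author of the first analyzed document
print("Author:", example_author)
print(atm.get_author_topics(example_author))    # list of (topic ID, probability) pairs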

Per-Topic Times

Now we can graph each topic over time.

In [ ]:
# One per-minute count list for each LDA topic
topic_counter = {x: [0] * len(rel_frequency_map) for x in range(lda.num_topics)}
In [ ]:
for (i, d) in enumerate(sortedTimes):
    tweets = rel_frequency_map[d]
    
    for tweet in tweets:
        text = tweet["text"]
        topic_dist = lda.get_document_topics(analysis_pipeline(text))
        
        # Assign the tweet to its most probable topic
        top_topic = sorted(topic_dist, key=lambda x: x[1])[-1][0]
        
        topic_counter[top_topic][i] += 1
In [ ]:
fig, ax = plt.subplots()
fig.set_size_inches(11, 8.5)

plt.title("Per-Topic Tweet Frequencies")

smallerXTicks = range(0, len(sortedTimes), 30)
plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)

xData = range(len(sortedTimes))

for this_k in range(lda.num_topics):
    plt.plot(xData, topic_counter[this_k], label="Topic %d" % (this_k))

ax.grid(True, which='major')
ax.legend()

plt.show()