INST728E - Module 10. Graph and Network Modeling

Information flows and social networks are important considerations during crises, when people are trying to get updates on safe spaces, loved ones, places of shelter, etc. Twitter is noisy, though, and much of the data may be irrelevant, uninformative, or simply condolences and thoughts expressed by celebrities. Using network analysis, we can get some idea of who the most important Twitter users were during this time and how people split into groups online.

For this analysis, we'll use the NetworkX package to construct a social graph of how people interact. Each person in our Twitter data will be a node in our graph, and edges in the graph will represent mentions during this timeframe. Then we will explore a few simple analytical methods in network analysis, including:

  • Central accounts
  • Visualization
  • Community analysis
In [1]:
%matplotlib inline

import datetime
import json
import string
import os

import numpy as np

# For plotting
import matplotlib.pyplot as plt

# Network analysis
import networkx as nx

import nltk # Used for FreqDist

Event Description

In [2]:
crisisInfo = {
    
    "brussels": {
        "name": "Brussels Transit Attacks",
        "time": 1458629880, # Timestamp in seconds since 1/1/1970, UTC
                            # 22 March 2016, 6:58 UTC to 08:11 UTC
        "directory": "brussels",
        "keywords": ["brussels", "bomb", "belgium", "explosion"],
        "box": {
            "lowerLeftLon": 2.54563,
            "lowerLeftLat": 49.496899,
            "upperRightLon": 6.40791,
            "upperRightLat": 51.5050810,
        }
    },
}
In [3]:
# Replace the name below with your selected crisis
selectedCrisis = "brussels"

Reading Relevant Tweets

Re-read our relevant tweets...

In [4]:
in_file_path = "/Users/cbuntain/relevant_tweet_output.json" # Replace this as necessary

relevant_tweets = []
with open(in_file_path, "r", encoding="utf8") as in_file:
    for line in in_file:
        # Each line holds one tweet as a JSON object
        relevant_tweets.append(json.loads(line))
        
print("Relevant Tweets:", len(relevant_tweets))
Relevant Tweets: 4687

Graph Building

To limit the amount of data we're looking at, we'll only build the network for people who tweeted about a relevant keyword and the people they mention. We build this network by iterating through all the tweets in our relevant list and extracting the "user_mentions" list from the "entities" section of each tweet object. For each mention a user makes, we add an edge from that user to the user they mentioned; repeated mentions between the same pair of users increase that edge's weight.

In [5]:
# We'll use a directed graph since mentions/retweets are directional
graph = nx.DiGraph()
    
for tweet in relevant_tweets:
    userName = tweet["user"]["screen_name"].lower()
    graph.add_node(userName)

    mentionList = tweet["entities"]["user_mentions"]

    for otherUser in mentionList:
        otherUserName = otherUser["screen_name"].lower()

        # add_edge() creates any missing endpoint nodes automatically,
        #  so no separate add_node() call is needed here
        if graph.has_edge(userName, otherUserName):
            # Repeat mention: increment the existing edge's weight
            graph[userName][otherUserName]["weight"] += 1
        else:
            graph.add_edge(userName, otherUserName, weight=1)

print("Number of Users:", graph.number_of_nodes())
Number of Users: 6633
In [6]:
# For debugging, print edges with higher weights
for u, v, data in graph.edges(data=True):
    if data["weight"] > 1:
        print((u, v), data)
('plagiat_buruk', 'intlspectator') {'weight': 2}
('herecomesthefox', 'theblaze') {'weight': 2}
('karntna_bua', 'ntvde') {'weight': 2}
('stellacreasy', 'foreignoffice') {'weight': 2}
('samiamtimet', 'centrnews') {'weight': 2}
('ukinbelgium', 'foreignoffice') {'weight': 2}
('aliceclarke2', 'brusselsairport') {'weight': 2}
('archlcltd1', 'dfatirl') {'weight': 2}
('sanatandesh', 'ravenhuwolf') {'weight': 2}
('messi89minou', 'tsalgerie') {'weight': 2}
('le_nordiste_59', 'itele') {'weight': 2}
('odogsosa', 'itele') {'weight': 2}
('monsieurmouraz', 'itele') {'weight': 2}
('awerkoff', 'itele') {'weight': 2}

Central Users

In network analysis, "centrality" is used to measure the importance of a given node. Many different types of centrality exist, though, each describing a different kind of importance. Examples include "closeness centrality," which measures how close a node is to all other nodes in the network, and "betweenness centrality," which measures how many shortest paths run through the given node. Nodes with high closeness centrality are important for rapidly disseminating information or spreading disease, whereas nodes with high betweenness are more important for keeping the network connected.
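
To make the two measures concrete, here is a minimal sketch on a made-up toy network (the node names and edges are hypothetical and not drawn from the Twitter data). It uses NetworkX's closeness_centrality() and betweenness_centrality() on two triangles joined through a single "bridge" node.

```python
import networkx as nx

# Hypothetical toy network: two triangles joined through one "bridge" node
toy = nx.Graph()
toy.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first triangle
    ("c", "bridge"), ("bridge", "d"),     # bridge linking the two groups
    ("d", "e"), ("e", "f"), ("d", "f"),   # second triangle
])

# Closeness: based on the average shortest-path distance to all other nodes
closeness = nx.closeness_centrality(toy)

# Betweenness: share of shortest paths passing through a node (normalized)
betweenness = nx.betweenness_centrality(toy)

for node in sorted(toy.nodes):
    print(f"{node:>6}  closeness={closeness[node]:.3f}  "
          f"betweenness={betweenness[node]:.3f}")

top_betweenness = max(betweenness, key=betweenness.get)
print("Highest betweenness:", top_betweenness)
```

In this toy graph, every shortest path between the two triangles runs through "bridge", so it has the highest betweenness, while a corner node like "a" lies on no shortest paths between other nodes and scores zero.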

PageRank is another algorithm for measuring importance; it was proposed by Sergey Brin and Larry Page for an early version of Google's search engine. NetworkX has an implementation of the PageRank algorithm that we can use to look at the most important/authoritative users on Twitter based on their connections to other users.
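
To give a feel for what PageRank computes, the following is a simplified power-iteration sketch in plain Python. It is not NetworkX's implementation: the function name pagerank_sketch, the damping value of 0.85, the fixed iteration count, and the example users are all assumptions chosen for illustration, and edge weights are ignored.

```python
def pagerank_sketch(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict of node -> list of linked nodes."""
    # Collect every node that appears as a source or a target
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start with rank spread evenly across all nodes
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared evenly
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes its damped rank equally to the nodes it links to
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank

    return rank


# Hypothetical mentions: three users mention "newsdesk", who mentions no one
mentions = {
    "alice": ["newsdesk"],
    "bob": ["newsdesk", "alice"],
    "carol": ["newsdesk"],
}

scores = pagerank_sketch(mentions)
for user in sorted(scores, key=scores.get, reverse=True):
    print(user, round(scores[user], 4))
```

The account that is mentioned most ("newsdesk") ends up with the highest score, and "alice" outranks "carol" because she receives a mention while "carol" receives none.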

In [7]:
# Prune for performance: repeatedly remove nodes with fewer than two
#  edges (in-degree plus out-degree). Removing nodes lowers their
#  neighbors' degrees, so we make up to five passes.

for i in range(5):
    nodeList = [n for n, d in graph.degree() if d < 2]

    if len(nodeList) == 0:
        break

    print("Nodes to Delete:", len(nodeList))

    graph.remove_nodes_from(nodeList)
    print("Number of Remaining Users:", graph.number_of_nodes())
Nodes to Delete: 5681
Number of Remaining Users: 952
Nodes to Delete: 533
Number of Remaining Users: 419
Nodes to Delete: 42
Number of Remaining Users: 377
Nodes to Delete: 14
Number of Remaining Users: 363
Nodes to Delete: 6
Number of Remaining Users: 357
In [8]:
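# This may take a while
# Note: pagerank_numpy() was removed in NetworkX 3.0; on newer
#  versions, use nx.pagerank(graph) instead
pageRankList = nx.pagerank_numpy(graph)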
In [9]:
highRankNodes = sorted(pageRankList.keys(), key=pageRankList.get, reverse=True)
for x in highRankNodes[:20]:
    print (x, pageRankList[x])
    
conflicts 0.03974968021670997
aahronheim 0.023649905435166766
brusselsairport 0.018134816380318466
sebgorka 0.014953351190288781
skynews 0.013615419767999456
plantu 0.010939556923421545
infos140 0.010939556923421537
humanbeingone 0.01049357978265909
122751v 0.01049357978265901
kris_sacrebleu 0.010493579782659001
josephhayat 0.010493579782658883
22june1956 0.010493579782658882
bomberosgc 0.010493579782658816
iamdjcosmo 0.010493579782658802
birba_ste 0.010493579782658802
sevouuuu 0.0104935797826588
jvrjitsings 0.010493579782658791
loabrynjulfs 0.010493579782658708
rt_com 0.009713119786323339
euranetplus 0.008263694078843595
In [10]:
#plt.hist([x for x in pageRankList.values()])
# Plot the PageRank scores in ascending order to see their distribution
plt.plot(range(len(pageRankList)), sorted(pageRankList.values()))

plt.grid()
plt.show()

Visualize the Graph

In [11]:
plt.figure(figsize=(8,8))
pos = nx.spring_layout(graph, scale=200, iterations=100, k=0.2)
# pos = nx.fruchterman_reingold_layout(graph, weight="weight", iterations=100)
# pos = nx.random_layout(graph)
nx.draw(graph, 
        pos, 
        node_color='#A0CBE2', 
        width=1, 
        with_labels=False,
        node_size=50)

# Get the highest ranking nodes...
hrNames = highRankNodes[:10]

# Min-max normalize the scores of these high-ranked nodes to [0, 1]
scores = list(pageRankList.values())
min_val = min(scores)
max_val = max(scores)
hrValues = [(pageRankList[x] - min_val) / (max_val - min_val) for x in hrNames]

# Draw our high-rank nodes with a larger size and different color
nx.draw_networkx_nodes(graph, pos, nodelist=hrNames,
                       node_size=200,
                       node_color=hrValues,
                       cmap=plt.cm.winter)

# Dummy dictionary that maps usernames to themselves
#  (we'll use this to set node labels)
hrDict = dict(zip(hrNames, hrNames))

# Add labels, so we can see them
nx.draw_networkx_labels(graph,
                        pos,
                        labels=hrDict,
                        font_size=12,
                        font_color="g")

plt.axis('off')
plt.show()

Community Analysis

While the graph above shows many connections among these users, we can evaluate the graph's density to determine what fraction of the possible edges actually exist in this graph. This metric also gives some insight into how tightly connected these users are.
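
As a minimal sketch of the density calculation, the example below builds a small, made-up directed mention graph (the users and edges are hypothetical) and compares a by-hand calculation against NetworkX's density() function. In a directed graph, every ordered pair of distinct users is a possible edge.

```python
import networkx as nx

# Hypothetical directed mention graph: 4 users, 3 mentions
toy = nx.DiGraph()
toy.add_edges_from([
    ("alice", "newsdesk"),
    ("bob", "newsdesk"),
    ("bob", "alice"),
])
toy.add_node("carol")  # a user who neither mentions nor is mentioned

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Directed density: existing edges divided by n * (n - 1) possible edges
manual_density = m / (n * (n - 1))

print("Manual density:  ", manual_density)
print("NetworkX density:", nx.density(toy))
```

The same nx.density() call can be applied to the pruned mention graph built earlier; values near 0 indicate a sparse network, and values near 1 indicate that almost every user mentions every other user.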

From there, we can also look at subgroups of users, or communities. These communities are groups of users who are more interconnected with each other than with others in the network. They may show us groups of news organizations versus regular users, or users who tweet in the same language.

NetworkX has built-in support for community analysis, and as with centrality, many methods exist for detecting communities.
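
As a small illustration of one such method, the sketch below runs NetworkX's girvan_newman() function on a made-up toy graph (the nodes and edges are hypothetical). Girvan-Newman repeatedly removes the edge that the most shortest paths pass through, and each item produced by the iterator is a progressively finer split of the graph into communities.

```python
import networkx as nx
from networkx.algorithms import community

# Hypothetical toy graph: two triangles connected by a single edge
toy = nx.Graph()
toy.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first tightly knit group
    ("d", "e"), ("e", "f"), ("d", "f"),   # second tightly knit group
    ("c", "d"),                           # lone edge between the groups
])

levels = community.girvan_newman(toy)

# First item: the coarsest split, produced once the c-d edge is removed
first_split = next(levels)
groups = sorted(sorted(group) for group in first_split)
print("First split:", groups)

# Next item: one of those communities is divided further
second_split = next(levels)
print("Communities at next level:", len(second_split))
```

Here the first split separates the two triangles, since the edge between "c" and "d" carries every shortest path between the groups.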

In [12]:
from networkx.algorithms import community # Community analysis functions
In [13]:
# Use Girvan-Newman algorithm to find top-level community structure
community_iter = community.girvan_newman(graph)

# The first set of communities is the top level. Subsequent elements
#  in this iterator describe subgroups within communities. We'll 
#  only use level 1 for now.
top_level_comms = next(community_iter)
In [14]:
def draw_graph(graph):
    """
    Function for drawing a given graph using the spring layout
    algorithm.
    """
    
    plt.figure(figsize=(8,8))
    pos = nx.spring_layout(graph, scale=200, iterations=100, k=0.2)
    # pos = nx.fruchterman_reingold_layout(graph, weight="weight", iterations=100)
    # pos = nx.random_layout(graph)
    nx.draw(graph, 
            pos, 
            node_color='#A0CBE2', 
            width=1, 
            with_labels=False,
            node_size=50)

    plt.axis('off')
    plt.show()
    
    
def find_auth_nodes(graph, limit=5):
    """
    Given a NetworkX Graph structure, use PageRank to find the most
    authoritative nodes in the graph.
    """
    
    # This may take a while
    local_pg_rank = nx.pagerank_numpy(graph)
    
    # Rank the users by their PageRank score, and reverse the list
    #  so we can get the top users in the front of the list
    local_auths = sorted(local_pg_rank.keys(), key=local_pg_rank.get, reverse=True)
    
    # Take only the first few users
    local_targets = local_auths[:limit]

    # Print user name and PageRank score
    print("\tTop Users:")
    for x in local_targets:
        print ("\t", x, local_pg_rank[x])
        
    # In case we want to use these usernames later
    return local_targets

def user_hashtags(user_list, tweet_list, limit=5):
    """
    Simple function that finds all tweets by a given set of users,
    and prints the top few most frequent hashtags
    """
    
    # Keep only tweets authored by someone in our user set
    target_tweets = filter(
        lambda tweet: tweet["user"]["screen_name"].lower() in user_list, tweet_list)
    
    # This list comprehension iterates through the filtered tweets, and
    #  for each tweet, it iterates through that tweet's hashtags list
    htags = [
            hashtag["text"].lower() 
             for tweet in target_tweets 
                 for hashtag in tweet["entities"]["hashtags"]
            ]

    htags_freq = nltk.FreqDist(htags)

    print("\tFrequent Hashtags:")
    for tag, count in htags_freq.most_common(limit):
        print("\t", tag, count)
In [15]:
# Iterate through the communities, skipping the small ones
for i, comm in enumerate(top_level_comms):
    
    # We'll skip small communities
    if ( len(comm) < 10 ):
        continue
        
    print("Community: %d" % (i+1))
    print("\tUser Count: %d" % len(comm))
    
    # Use the username set produced by our community generator to 
    #  create a subgraph of only these users and the connections
    #  between them.
    subg = graph.subgraph(comm)
    
    # Given the subgraph...
    #  find the most authoritative nodes,
    find_auth_nodes(subg)
    
    #  the most frequent hashtags, and
    user_hashtags(comm, relevant_tweets, limit=10)
    
    #  then visualize the network
    draw_graph(subg)
Community: 1
	User Count: 56
	Top Users:
	 skynews 0.10462471569370708
	 sebgorka 0.09603234773818523
	 rt_com 0.06882318254569952
	 alexrossisky 0.04877432398281531
	 foxandfriends 0.03588577204953248
	Frequent Hashtags:
	 brussels 29
	 brusselsairport 3
	 brusselsattack 2
	 belgium 2
	 update 2
	 terrormonitor 2
	 prayforbrussels 1
	 breaking 1
	 vrtnieuws 1
	 zaventem 1
Community: 2
	User Count: 219
	Top Users:
	 conflicts 0.06744945713950533
	 aahronheim 0.04226432585755497
	 brusselsairport 0.028024920144792728
	 infos140 0.019549887834276695
	 bbcbreaking 0.014050602839167212
	Frequent Hashtags:
	 brussels 86
	 belgium 17
	 zaventem 14
	 bruxelles 11
	 malbeek 8
	 stib 7
	 mivb 7
	 brusselsattack 6
	 maelbeek 5
	 vrtnieuws 5