ENPM809G - Collecting Network Data from Wikipedia

Graphs can be found most anywhere. Here's an example with Wikipedia.

In [1]:
# First we need to make sure we have wikipedia installed...
import pip

pip.main(['install', 'wikipedia'])
Requirement already satisfied: wikipedia in /Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages
Requirement already satisfied: beautifulsoup4 in /Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages (from wikipedia)
Requirement already satisfied: requests<3.0.0,>=2.0.0 in /Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages (from wikipedia)
Requirement already satisfied: idna<2.7,>=2.5 in /Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages (from requests<3.0.0,>=2.0.0->wikipedia)
Requirement already satisfied: certifi>=2017.4.17 in /Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages (from requests<3.0.0,>=2.0.0->wikipedia)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages (from requests<3.0.0,>=2.0.0->wikipedia)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages (from requests<3.0.0,>=2.0.0->wikipedia)
Out[1]:
0
In [2]:
import wikipedia

Set Up NetworkX for Wikipedia Network

We want to focus on Wikipedia articles related to complex networks.

We'll construct a network centered on the "Complex Network" article and look at all articles within 1 hop of this article and their connections (i.e., the 1.5-degree ego net).

In [3]:
"""
Construct a network of Wikipedia pages anchored at Complex Networks
"""
from operator import itemgetter
import networkx as nx
import wikipedia
In [4]:
# Where will we start
SEED = "Complex network".title()

# Wikipedia Articles to Ignore
STOPS = ("International Standard Serial Number",
         "International Standard Book Number",
         "National Diet Library",
         "International Standard Name Identifier",
         "International Standard Book Number (Identifier)",
         "Pubmed Identifier", "Pubmed Central",
         "Digital Object Identifier", "Arxiv",
         "Proc Natl Acad Sci Usa", "Bibcode",
         "Library Of Congress Control Number", "Jstor")
In [5]:
# Data structures for tracking what articles we've read
todo_lst = [(0, SEED)] # The SEED is in the degree 0
todo_set = set(SEED)   # The SEED itself
done_set = set()       # Nothing is done yet
In [6]:
# We want a directed graph since links only go one way
F = nx.DiGraph()
In [ ]:
 
In [7]:
# Initialize our search
layer, page = todo_lst[0]

# Look for all pages less than 2 hops away
while layer < 2:
    # remove the current article to process (we've already pulled it from this list)
    del todo_lst[0] #(1)
    
    # Add this article to the list of articles we've processed
    done_set.add(page)
    print(layer, page) # Show progress

    # Try to load the wikipedia page of this name
    try: #(2)
        wiki = wikipedia.page(page)
    except:
        layer, page = todo_lst[0]
        print("Could not load", page)
        continue

    # Now traverse links on this page
    for link in wiki.links: #(3)
        link = link.title()
        
        # Skip links we decided to ignore or that start with "List Of"
        if link not in STOPS and not link.startswith("List Of"):
            
            # If we haven't already processed or queued this link, add it
            if link not in todo_set and link not in done_set:
                todo_lst.append((layer + 1, link))
                todo_set.add(link)
                
            # Add an edge from the current page to this link
            F.add_edge(page, link)

    # Get the next layer and page
    layer, page = todo_lst[0] #(4)
    
print("{} nodes, {} edges".format(len(F), nx.number_of_edges(F)))
# 11597 nodes, 21331 edges
0 Complex Network
1 Adjacency List
1 Adjacency Matrix
1 Agent-Based Model
1 Albert-László Barabási
1 Artificial Neural Network
1 Assortativity
1 Autonomous System (Internet)
1 Balance Theory
1 Barabási–Albert Model
1 Betweenness Centrality
1 Biological Network
1 Biology
1 Bipartite Graph
1 Boolean Network
1 Branching Process
1 Centrality
1 Climate
1 Clique (Graph Theory)
1 Closeness (Graph Theory)
1 Clustering Coefficient
1 Combinatorial Optimization
1 Community Structure
1 Complete Graph
1 Complex Adaptive System
1 Complex Contagion
1 Complex Systems
1 Computer Network
1 Computer Science
1 Connected Component (Graph Theory)
1 Cut (Graph Theory)
1 Cycle (Graph Theory)
1 Degree (Graph Theory)
1 Degree Distribution
1 Dependency Network
1 Directed Graph
1 Distance (Graph Theory)
1 Dual-Phase Evolution
1 Duncan J. Watts
1 Dynamic Network Analysis
1 Edge (Graph Theory)
1 Efficiency (Network Science)
1 Entropy (Information Theory)
1 Epidemic Model
1 Epidemiology
1 Erdős–Rényi Model
1 Evolving Networks
1 Exponential Random Graph Models
1 Fitness Model (Network Theory)
1 Flow Network
1 Frigyes Karinthy
1 Graph (Abstract Data Type)
1 Graph (Discrete Mathematics)
1 Graph Drawing
1 Herbert A. Simon
1 Hierarchical Network Model
1 Hierarchy
1 Homophily
1 Hyperbolic Geometric Graph
1 Hypercube
1 Hypergraph
1 Incidence List
1 Incidence Matrix
1 Interdependent Networks
1 Lancichinetti–Fortunato–Radicchi Benchmark
1 Lattice Graph
1 Link Analysis
1 Loop (Graph Theory)
1 Mathematics
1 Matthew Effect (Sociology)
1 Metrics (Networking)
1 Modularity (Networks)
1 Multigraph
1 Neighbourhood (Graph Theory)
1 Network Controllability
1 Network Effect
1 Network Motif
1 Network Science
1 Network Theory
1 Pagerank
1 Path (Graph Theory)
1 Percolation
1 Percolation Theory
1 Physics
1 Poisson Distribution
1 Power-Law
1 Power Law
1 Preferential Attachment
1 Random Graph
1 Reciprocity (Network Science)
1 Reciprocity In Network
1 Sir Model
1 Scale-Free Network
1 Scale-Free Networks
1 Scientific Collaboration Network
1 Semantic Network
1 Six Degrees Of Separation
1 Small-World Network
1 Small-World Networks
1 Small-World Phenomenon
1 Small World Networks
1 Social Capital
1 Social Influence
1 Social Network
1 Social Network Analysis Software
1 Sociology
1 Spatial Network
1 Stanley Milgram
1 Steven Strogatz
1 Stochastic Block Model
1 Telecommunications Network
1 Topological
1 Transitive Relation
1 Transport Network
1 Triadic Closure
1 Trophic Coherence
1 Vertex (Graph Theory)
1 Watts And Strogatz Model
1 Weighted Network
1 World Wide Web
1 Yule-Simon Distribution
11762 nodes, 21465 edges
In [ ]:
 
In [8]:
# How dense is this network?
print("Density:", nx.density(F))
Density: 0.00015516918096161693
In [ ]:
 
In [9]:
# Delete all self-loops
F.remove_edges_from(F.selfloop_edges())

# We occasionally see "Complex Network" and "Complex Networks"
#  in Wikipedia. Let's collapse those.
duplicates = [(node, node + "s") for node in F if node + "s" in F]
for dup in duplicates:
    
    # For each pair of duplicates, collapse into one node
    F = nx.contracted_nodes(F, *dup, self_loops=False)
    
# More duplicate removal with "Complex-Networks" and "Complex Networks"
duplicates = [(x, y) for x, y 
              in [(node, node.replace("-", " ")) for node in F]
              if x != y and y in F]
for dup in duplicates:
    F = nx.contracted_nodes(F, *dup, self_loops=False)
    
# Clear all contraction attributes (necessary for graphml export)
nx.set_node_attributes(F, 0, "contraction")
In [10]:
print("{} nodes, {} edges".format(len(F), nx.number_of_edges(F)))
# 11597 nodes, 21331 edges

# How dense is this network?
print("Density:", nx.density(F))
11641 nodes, 20378 edges
Density: 0.00015038976765083478
In [ ]:
 
In [11]:
# Only keep articles with more than 1 degree
core = [node for node, deg in F.degree() if deg >= 2]

# Get a subgraph with these core articles
G = nx.subgraph(F, core)
print("{} nodes, {} edges".format(len(G), nx.number_of_edges(G)))
# 2995 nodes, 11817 edges
2816 nodes, 11553 edges
In [ ]:
 
In [ ]:
 
In [12]:
# What nodes have the highest in-degree?
top_indegree = sorted(G.in_degree(),
                      reverse=True, key=itemgetter(1))[:100]
print("\n".join(map(lambda t: "{} {}".format(*reversed(t)), top_indegree)))
61 Graph (Discrete Mathematics)
54 Vertex (Graph Theory)
49 Directed Graph
48 Social Network
44 Degree (Graph Theory)
44 Network Theory
44 Graph Theory
40 Graph Drawing
39 Adjacency Matrix
39 Complete Graph
39 Edge (Graph Theory)
38 Bipartite Graph
38 Graph (Abstract Data Type)
38 Scale-Free Network
37 Complex Network
37 Incidence Matrix
36 Loop (Graph Theory)
36 Network Science
36 Social Capital
36 Social Network Analysis Software
35 Centrality
35 Complex Contagion
35 Cycle (Graph Theory)
35 Multigraph
35 Preferential Attachment
35 Random Graph
35 Small-World Network
34 Adjacency List
34 Hypergraph
34 Path (Graph Theory)
33 Distance (Graph Theory)
32 Betweenness Centrality
32 Clique (Graph Theory)
32 Clustering Coefficient
32 Connected Component (Graph Theory)
32 Degree Distribution
32 Flow Network
32 Percolation Theory
31 Computer Network
31 Watts And Strogatz Model
30 Agent-Based Model
30 Barabási–Albert Model
30 Combinatorial Optimization
30 Cut (Graph Theory)
30 Erdős–Rényi Model
30 Semantic Network
30 Telecommunications Network
30 Transitive Relation
30 Weighted Network
29 Artificial Neural Network
29 Assortativity
29 Boolean Network
29 Closeness (Graph Theory)
29 Epidemic Model
29 Fitness Model (Network Theory)
29 Homophily
29 Incidence List
29 Interdependent Networks
29 Link Analysis
29 Mathematics
29 Neighbourhood (Graph Theory)
29 Network Motif
29 Pagerank
29 Triadic Closure
28 Balance Theory
28 Biological Network
28 Community Structure
28 Dependency Network
28 Evolving Networks
28 Lancichinetti–Fortunato–Radicchi Benchmark
28 Metrics (Networking)
28 Network Controllability
28 Network Effect
28 Sir Model
28 Social Influence
28 Stochastic Block Model
28 Transport Network
27 Efficiency (Network Science)
27 Hierarchical Network Model
27 Hyperbolic Geometric Graph
27 Modularity (Networks)
27 Scientific Collaboration Network
27 Spatial Network
27 Exponential Random Graph Model
26 Reciprocity (Network Science)
24 Social Network Analysis
23 Computer Science
23 Integrated Authority File
21 Statistic
18 Wayback Machine
15 Biology
15 Sociology
15 World Wide Web
15 Internet
15 Glossary Of Graph Theory
14 Six Degrees Of Separation
13 Power-Law
13 Machine Learning
13 Assortative Mixing
12 Algorithm
In [ ]:
 
In [13]:
# different centrality types
centrality_pairs = [
    ("d_cent", nx.degree_centrality),
    ("c_cent", nx.closeness_centrality),
    ("b_cent", nx.betweenness_centrality),
]

# Keep track of different centralities
centrality_map = {}

# For each type of centrality, apply the node attributes
for cent_label, cent_func in centrality_pairs:
    centrality = cent_func(G)
    centrality_map[cent_label] = centrality
    nx.set_node_attributes(G, centrality, cent_label)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-13-8df35514b067> in <module>()
     11 # For each type of centrality, apply the node attributes
     12 for cent_label, cent_func in centrality_pairs:
---> 13     centrality = cent_func(G)
     14     centrality_map[cent_label] = centrality
     15     nx.set_node_attributes(G, centrality, cent_label)

/Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages/networkx/algorithms/centrality/closeness.py in closeness_centrality(G, u, distance, wf_improved, reverse)
    121             # normalize to number of nodes-1 in connected part
    122             if wf_improved:
--> 123                 s = (len(sp) - 1.0) / (len(G) - 1)
    124                 closeness_centrality[n] *= s
    125         else:

/Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages/networkx/classes/graph.py in __len__(self)
    410 
    411         """
--> 412         return len(self._node)
    413 
    414     def __getitem__(self, n):

/Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages/networkx/classes/coreviews.py in __len__(self)
    287 
    288     def __len__(self):
--> 289         return sum(1 for n in self._atlas if self.NODE_OK(n))
    290 
    291     def __iter__(self):

/Users/cbuntain/Development/thirdparty/anaconda3/lib/python3.6/site-packages/networkx/classes/coreviews.py in <genexpr>(.0)
    287 
    288     def __len__(self):
--> 289         return sum(1 for n in self._atlas if self.NODE_OK(n))
    290 
    291     def __iter__(self):

KeyboardInterrupt: 
In [ ]:
for cent_label, cent_vals in centrality_map.items():
    print(cent_label)
    tops = [(x, cent_vals[x]) for x in sorted(cent_vals, key=cent_vals.get, reverse=True)]
    
    for t in tops[:10]:
        print("\t", t)
In [ ]:
 
In [ ]:
nx.write_graphml(G, "cna.graphml")
In [ ]: