ENPM809G - Jupyter Notebooks and Matplotlib

Common Python Packages

Popular packages that come standard with Anaconda include:

  • Matplotlib
  • NumPy
  • pandas

Data Analysis Packages

Matplotlib, NumPy, and pandas are well suited to data analysis: Matplotlib for plotting, NumPy for numerical arrays, and pandas for tabular data.

Matplotlib for Plotting

Matplotlib is a package for standard plots (line, scatter, histogram), like the charts you might build in Excel.

In [1]:
# Magic line to ensure plotting happens in Jupyter
%matplotlib inline
In [2]:
# Matplotlib Graphing Library
import matplotlib

# pyplot is Matplotlib's state-based (MATLAB-style) plotting interface
import matplotlib.pyplot as plt

x_vals = list(range(8))
y1_vals = [x for x in x_vals]
y2_vals = [2*x for x in x_vals]
y3_vals = [x**0.5 for x in x_vals]

# Plot the three datasets
plt.plot(x_vals, y1_vals, label="identity")
plt.plot(x_vals, y2_vals, label="double")
plt.plot(x_vals, y3_vals, label="sqrt")

# Set axis labels (you should always do this)
plt.xlabel("x")
plt.ylabel("f(x)")

# Activate legend in graph
plt.legend()

# Best practice is to use plt.show() to force rendering
plt.show()
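
The cell above calls plotting functions on plt directly, which act on an implicit "current" figure. Matplotlib also has an explicit interface in which the figure and axes are objects you hold on to. The following is a minimal sketch of a similar plot in that style; the data simply mirrors the cell above and is not part of the original notebook.

```python
import matplotlib.pyplot as plt

x_vals = list(range(8))

# Create a figure and a single axes object explicitly
fig, ax = plt.subplots()

# Plot on the axes object rather than through the implicit pyplot state
ax.plot(x_vals, [x for x in x_vals], label="identity")
ax.plot(x_vals, [x**0.5 for x in x_vals], label="sqrt")

# Axis labels and the legend are set through methods on the axes
ax.set_xlabel("x")
ax.set_ylabel("f(x)")
ax.legend()

plt.show()
```

The explicit style tends to pay off once a figure has several subplots, because each call names the axes it applies to.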
In [3]:
# For Python's random number generator
import random

# Generate one random y value per x value
y_r_vals = [random.random() for _ in range(len(x_vals))]

# Plot a scatter plot
plt.scatter(x_vals, y_r_vals)

# Set axis labels (you should always do this)
plt.xlabel("x")
plt.ylabel("f(x)")

# And show it
plt.show()
In [6]:
# Generate 1000 random numbers in [0, 1)
y_r_vals = [random.random() for _ in range(1000)]

# Plot a histogram of random numbers
plt.hist(y_r_vals)

# Set axis labels (you should always do this)
plt.xlabel("Random Value Bins")
plt.ylabel("Count")

# Show grid lines
plt.grid()

# And show it
plt.show()
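
Because the values are random, the histogram above changes on every run. If a repeatable plot is wanted, one option is to seed the generator first. The sketch below assumes an arbitrary seed (809, chosen only for illustration) and also fixes the bin count and range so the bin edges do not depend on the sample.

```python
import random

import matplotlib.pyplot as plt

# Seeding makes the sequence of random numbers repeatable (809 is arbitrary)
random.seed(809)
y_r_vals = [random.random() for _ in range(1000)]

# plt.hist returns the per-bin counts, the bin edges, and the drawn patches
counts, bin_edges, _ = plt.hist(y_r_vals, bins=10, range=(0, 1))

plt.xlabel("Random Value Bins")
plt.ylabel("Count")
plt.grid()
plt.show()

# Every sample falls in exactly one bin, so the counts sum to the sample size
print("Total count:", int(counts.sum()))
print("Number of bin edges:", len(bin_edges))
```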

Much more can be done with Matplotlib; see https://matplotlib.org/gallery/index.html for more examples.

A nice YouTube tutorial on Matplotlib is available here: https://www.youtube.com/watch?v=q7Bo_J8x_dw&list=PLQVvvaa0QuDfefDfXb9Yf0la1fPDKluPF

In [7]:
from IPython.display import HTML

# Youtube
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/q7Bo_J8x_dw" ' + 
     'frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen></iframe>')
Out[7]:

NumPy for Numerical Calculations

Working with lists of numbers can be simplified with NumPy. In fact, most non-trivial data analysis packages rely on NumPy directly.

In [8]:
# Numpy for fast numeric computation
import numpy as np

# Want to add a value to each element of the list
py_list = [1, 2, 3, 4, 5, 6, 7]
print(py_list)

# Could do this, but it's a little verbose
print([x+5 for x in py_list])

# Makes more sense, but this will throw an error
print(py_list + 5)
[1, 2, 3, 4, 5, 6, 7]
[6, 7, 8, 9, 10, 11, 12]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-0240d2909f5c> in <module>()
     10 
     11 # Makes more sense, but this will throw an error
---> 12 print(py_list + 5)

TypeError: can only concatenate list (not "int") to list
In [9]:
arr1 = np.array(py_list) # create a numpy array
print(arr1)

# Can add value in element-wise operation
#  also works for multiply, divide, subtract, etc.
print(arr1 + 5)
[1 2 3 4 5 6 7]
[ 6  7  8  9 10 11 12]
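
The same element-wise behavior applies to the other arithmetic operators and to comparisons. The following short sketch, using the same seven values, shows a few of them; it is an illustration rather than part of the original notebook.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Arithmetic with a scalar is applied to every element
print(arr - 1)
print(arr * 2)
print(arr / 2)   # true division produces floats
print(arr ** 2)

# Arithmetic between two arrays of the same shape pairs up elements
print(arr + arr)

# Comparisons are element-wise too, and the boolean result can filter the array
mask = arr > 4
print(arr[mask])
```

Note that `arr + arr` adds element by element, whereas `py_list + py_list` on a plain list would concatenate the two lists.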
In [10]:
# NumPy arrays can be passed directly to Matplotlib

plt.plot(range(7), arr1, label="Original")
plt.plot(range(7), arr1 * 2, label="Doubled")

# Set axis labels (you should always do this)
plt.xlabel("x")
plt.ylabel("f(x)")

plt.legend()
plt.show()

Pandas and R-like Dataframes

A DataFrame is a table of labeled rows and columns, much like an Excel spreadsheet, but it can be manipulated programmatically in Python.

In [11]:
# Pandas for R-like DataFrames
import pandas as pd

# Test tab-separated file for reading data
#  The adjacency list for the risk board from HW2
tsv_file = "risk.adj"


# Read in TSV file and convert to DataFrame
df = pd.read_csv(tsv_file, sep='\t', header=0)

# A bare expression on the last line of a cell is displayed as a formatted table
df
Out[11]:
    SOURCE                 SINK                   ATTR
0   Alaska                 Northwest Territory    {}
1   Alaska                 Alberta                {}
2   Alaska                 Kamchatka              {}
3   Alberta                Northwest Territory    {}
4   Alberta                Ontario                {}
5   Alberta                Western United States  {}
6   Central America        Western United States  {}
7   Central America        Eastern United States  {}
8   Central America        Venezuela              {}
9   Eastern United States  Eastern Canada         {}
10  Eastern United States  Western United States  {}
11  Greenland              Northwest Territory    {}
12  Greenland              Iceland                {}
13  Greenland              Ontario                {}
14  Greenland              Eastern Canada         {}
15  Northwest Territory    Ontario                {}
16  Ontario                Eastern Canada         {}
17  Ontario                Western United States  {}
18  Argentina              Peru                   {}
19  Argentina              Brazil                 {}
20  Brazil                 Venezuela              {}
21  Brazil                 Peru                   {}
22  Brazil                 North Africa           {}
23  Peru                   Venezuela              {}
24  Great Britain          Scandinavia            {}
25  Great Britain          Northern Europe        {}
26  Great Britain          Western Europe         {}
27  Great Britain          Iceland                {}
28  Iceland                Scandinavia            {}
29  Northern Europe        Ukraine                {}
... ...                    ...                    ...
51  Egypt                  Middle East            {}
52  Madagascar             South Africa           {}
53  Afghanistan            Middle East            {}
54  Afghanistan            India                  {}
55  Afghanistan            China                  {}
56  Afghanistan            Ural                   {}
57  China                  India                  {}
58  China                  Siam                   {}
59  China                  Mongolia               {}
60  China                  Siberia                {}
61  China                  Ural                   {}
62  India                  Middle East            {}
63  India                  Siam                   {}
64  Irkutsk                Mongolia               {}
65  Irkutsk                Kamchatka              {}
66  Irkutsk                Yakutsk                {}
67  Irkutsk                Siberia                {}
68  Japan                  Mongolia               {}
69  Japan                  Kamchatka              {}
70  Kamchatka              Mongolia               {}
71  Kamchatka              Yakutsk                {}
72  Mongolia               Siberia                {}
73  Siam                   Indonesia              {}
74  Siberia                Yakutsk                {}
75  Siberia                Ural                   {}
76  Eastern Australia      New Guinea             {}
77  Eastern Australia      Western Australia      {}
78  Indonesia              New Guinea             {}
79  Indonesia              Western Australia      {}
80  New Guinea             Western Australia      {}

81 rows × 3 columns

In [12]:
# How many rows and columns does this dataset have?
print("Dataset Size:", df.shape)
Dataset Size: (81, 3)
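
The risk.adj file is not bundled with this text, so the following sketch reads a three-row stand-in from an in-memory string instead. The variable names tsv_text and toy_df are hypothetical, and the rows are copied from the first few lines of the output above. It shows a few common ways to inspect a DataFrame.

```python
import io

import pandas as pd

# A tiny tab-separated stand-in for risk.adj (three rows from the table above)
tsv_text = (
    "SOURCE\tSINK\tATTR\n"
    "Alaska\tNorthwest Territory\t{}\n"
    "Alaska\tAlberta\t{}\n"
    "Alberta\tOntario\t{}\n"
)

# read_csv accepts any file-like object, not only a path
toy_df = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=0)

# Column names and the first rows
print(toy_df.columns.tolist())
print(toy_df.head(2))

# Select rows with a boolean condition on a column
alaska_rows = toy_df[toy_df.SOURCE == "Alaska"]
print(alaska_rows)
```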

Adjacency Matrices

Use the adjacency list to build a symmetric adjacency matrix, where entry (i, j) is 1 if countries i and j share a border and 0 otherwise.

In [13]:
# Collect every country that appears as a source or a sink
countries = set(df.SOURCE).union(set(df.SINK))
country_count = len(countries)
print("Countries:", country_count)

# Map country names to IDs
country_map = dict(zip(countries, range(country_count)))

# Initialize the adjacency matrix
adj_matrix = np.zeros((country_count, country_count))

# Populate the adjacency matrix (symmetric, since borders are undirected)
for idx, row in df.iterrows():
    i = country_map[row.SOURCE]
    j = country_map[row.SINK]

    adj_matrix[i, j] = 1
    adj_matrix[j, i] = 1

# Each undirected edge is stored twice (i->j and j->i), so halve the sum
print("Number of Edges:", int(np.sum(adj_matrix) / 2))
Countries: 42
Number of Edges: 81
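
Two details of the cell above are worth illustrating on a small example. First, iterating over a Python set gives no guaranteed order, so the ID assigned to each country can differ between runs; sorting the names first makes the mapping repeatable. Second, NumPy index arrays can fill the matrix without an explicit loop. The sketch below uses a hypothetical four-node edge list (nodes A to D) rather than the Risk data.

```python
import numpy as np

# Hypothetical undirected edge list, each edge listed once
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Sort the node names so the name-to-ID mapping is the same on every run
nodes = sorted({name for edge in edges for name in edge})
node_map = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Translate the edge list into two parallel arrays of row/column indices
src = np.array([node_map[s] for s, _ in edges])
dst = np.array([node_map[t] for _, t in edges])

# Fill both directions at once using integer array indexing
adj = np.zeros((n, n), dtype=int)
adj[src, dst] = 1
adj[dst, src] = 1

# An undirected graph has a symmetric matrix; each edge appears twice in it
print("Symmetric:", bool((adj == adj.T).all()))
print("Number of edges:", adj.sum() // 2)
```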

Degree for Each Country

Calculate the degree (number of neighbors) of each country, which is the sum of that country's row in the adjacency matrix.

In [14]:
# Invert the country map, so we can take a matrix row
#  and convert it to the country name
inv_country_map = {idx: name for name, idx in country_map.items()}

# Get the degree for each country
for i in range(country_count):
    print("Country:", inv_country_map[i], "Degree:", np.sum(adj_matrix[i]))
Country: Eastern Australia Degree: 2.0
Country: Irkutsk Degree: 4.0
Country: Japan Degree: 2.0
Country: Alaska Degree: 3.0
Country: New Guinea Degree: 3.0
Country: Central America Degree: 3.0
Country: Afghanistan Degree: 5.0
Country: Scandinavia Degree: 4.0
Country: Madagascar Degree: 2.0
Country: Western Australia Degree: 3.0
Country: Kamchatka Degree: 5.0
Country: Brazil Degree: 4.0
Country: Egypt Degree: 4.0
Country: Western Europe Degree: 3.0
Country: Argentina Degree: 2.0
Country: Northwest Territory Degree: 4.0
Country: Siberia Degree: 5.0
Country: Western United States Degree: 4.0
Country: Venezuela Degree: 3.0
Country: Greenland Degree: 4.0
Country: Northern Europe Degree: 5.0
Country: North Africa Degree: 6.0
Country: Peru Degree: 3.0
Country: Siam Degree: 3.0
Country: Ukraine Degree: 6.0
Country: Eastern United States Degree: 3.0
Country: Mongolia Degree: 5.0
Country: South Africa Degree: 3.0
Country: East Africa Degree: 6.0
Country: Indonesia Degree: 3.0
Country: India Degree: 4.0
Country: Iceland Degree: 3.0
Country: Alberta Degree: 4.0
Country: Eastern Canada Degree: 3.0
Country: Congo Degree: 3.0
Country: Ontario Degree: 5.0
Country: Great Britain Degree: 4.0
Country: Yakutsk Degree: 3.0
Country: Ural Degree: 4.0
Country: China Degree: 6.0
Country: Middle East Degree: 6.0
Country: Southern Europe Degree: 5.0
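
As a cross-check on the matrix-based degrees, the degree of a node can also be counted straight from the adjacency list: it is the number of rows in which the node appears as either SOURCE or SINK. The sketch below assumes each undirected edge is listed only once, and uses a hypothetical three-row DataFrame (toy_df) in place of the full data.

```python
import pandas as pd

# Hypothetical three-edge adjacency list with the same column names
toy_df = pd.DataFrame({
    "SOURCE": ["Alaska", "Alaska", "Alberta"],
    "SINK": ["Alberta", "Kamchatka", "Ontario"],
})

# Stack both endpoint columns, then count how often each name occurs
degree_series = pd.concat([toy_df.SOURCE, toy_df.SINK]).value_counts()

# Print in alphabetical order for a stable listing
for country, degree in degree_series.sort_index().items():
    print("Country:", country, "Degree:", degree)
```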

Degree Distributions

In [15]:
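# Get the degree for each row (i.e., country)
degrees = np.sum(adj_matrix, axis=1)

# Build a histogram of degrees with one bin per integer degree
#  (the extra edge keeps the two largest degrees from sharing the last bin)
max_degree = int(np.max(degrees))
plt.hist(degrees, bins=np.arange(1, max_degree + 2))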

plt.xlabel("Degree")
plt.ylabel("Degree Frequency")

plt.grid()
plt.show()
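
The counts behind a degree histogram can also be tabulated directly, which is handy for checking the plot. The following sketch uses a hypothetical array of degrees (toy_degrees) and np.unique with return_counts=True.

```python
import numpy as np

# Hypothetical degrees for seven nodes
toy_degrees = np.array([2, 3, 3, 4, 4, 4, 6])

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(toy_degrees, return_counts=True)

for value, count in zip(values, counts):
    print("Degree:", value, "Frequency:", count)
```

Degrees that never occur (5 in this toy array) are simply absent from the result, whereas a histogram would show them as an empty bin.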

Degree Centrality

In [16]:
# Degree centrality: each country's degree divided by the maximum possible degree (n - 1)
degree_centrality = degrees / (country_count - 1)

# Map countries to their centralities
d_cent_map = {inv_country_map[i]:degree_centrality[i] for i in range(country_count)}

# Sort countries by centrality
sorted_countries = sorted(d_cent_map, key=d_cent_map.get, reverse=True)
for c in sorted_countries[:10]:
    print(c, d_cent_map[c])
North Africa 0.14634146341463414
Ukraine 0.14634146341463414
East Africa 0.14634146341463414
China 0.14634146341463414
Middle East 0.14634146341463414
Afghanistan 0.12195121951219512
Kamchatka 0.12195121951219512
Siberia 0.12195121951219512
Northern Europe 0.12195121951219512
Mongolia 0.12195121951219512
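
To get a feel for the scale of these values, it can help to compute degree centrality on a graph where the answer is obvious. The sketch below builds a hypothetical five-node star graph (node 0 connected to every other node): the hub reaches the maximum centrality of 1, and each leaf has a single neighbor out of four possible.

```python
import numpy as np

# Hypothetical star graph: node 0 is the hub, nodes 1..4 are leaves
n = 5
star = np.zeros((n, n))
star[0, 1:] = 1
star[1:, 0] = 1

# Degree is the row sum; centrality divides by the n - 1 possible neighbors
star_degrees = star.sum(axis=1)
star_centrality = star_degrees / (n - 1)

print("Degrees:", star_degrees)
print("Centrality:", star_centrality)
```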

Centralization

In [17]:
max_centrality = np.max(degree_centrality)

# Sum each country's gap to the most central country
centralization = np.sum(max_centrality - degree_centrality)
print("Unnormalized Centralization:", centralization)
Unnormalized Centralization: 2.1951219512195124
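
The unnormalized value is hard to interpret on its own. One common convention (Freeman's degree centralization) divides it by the largest value any graph with n nodes could produce, which is attained by a star graph and equals n - 2 when the centralities are already divided by n - 1. The sketch below assumes that convention; the function name degree_centralization and the two test graphs are hypothetical.

```python
import numpy as np


def degree_centralization(adj):
    """Freeman-style degree centralization of an undirected graph, in [0, 1]."""
    n = adj.shape[0]
    centrality = adj.sum(axis=1) / (n - 1)
    unnormalized = np.sum(centrality.max() - centrality)
    # A star graph gives the maximum possible unnormalized value, n - 2
    return unnormalized / (n - 2)


n = 5

# Star graph: one hub connected to every other node (most centralized)
star = np.zeros((n, n))
star[0, 1:] = 1
star[1:, 0] = 1

# Ring graph: every node has exactly two neighbors (no node stands out)
ring = np.zeros((n, n))
for i in range(n):
    j = (i + 1) % n
    ring[i, j] = 1
    ring[j, i] = 1

print("Star:", degree_centralization(star))
print("Ring:", degree_centralization(ring))
```

Under this convention, the notebook's unnormalized value of about 2.195 with 42 countries would be divided by 40, giving roughly 0.055, which suggests the Risk board is far from star-like.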