ENPM809G - Jupyter Notebooks and Matplotlib¶

Common Python Packages¶

Popular packages that come standard with Anaconda include:

Matplotlib
numpy
pandas

Data Analysis Packages¶

Matplotlib, NumPy, and pandas cover the core data-analysis tasks: plotting, numerical arrays, and tabular data.

Matplotlib for Plotting¶

Matplotlib is a package for standard plots (line, scatter, histogram), like the charts you might build in Excel.

# Magic line to ensure plotting happens in Jupyter
%matplotlib inline

# Matplotlib Graphing Library
import matplotlib

# pyplot is matplotlib's state-based (MATLAB-style) plotting interface
import matplotlib.pyplot as plt 

x_vals = list(range(8))
y1_vals = [x for x in x_vals]
y2_vals = [2*x for x in x_vals]
y3_vals = [x**0.5 for x in x_vals]

# Plot the three datasets
plt.plot(x_vals, y1_vals, label="identity")
plt.plot(x_vals, y2_vals, label="2x")
plt.plot(x_vals, y3_vals, label="sqrt")

# Set axis labels (you should always do this)
plt.xlabel("x")
plt.ylabel("f(x)")

# Activate legend in graph
plt.legend()

# Best practice is to use plt.show() to force rendering
plt.show()

# For Python's random number generator
import random

# Generate several random numbers
y_r_vals = [random.random() for x in range(len(x_vals))]

# Plot a scatter plot
plt.scatter(x_vals, y_r_vals)

# Set axis labels (you should always do this)
plt.xlabel("x")
plt.ylabel("f(x)")

# And show it
plt.show()

# Generate several random numbers
y_r_vals = [random.random() for x in range(1000)]

# Plot a histogram of random numbers
plt.hist(y_r_vals)

# Set axis labels (you should always do this)
plt.xlabel("Random Value Bins")
plt.ylabel("Count")

# Show grid lines
plt.grid()

# And show it
plt.show()

Way more can be done with Matplotlib. See https://matplotlib.org/gallery/index.html for more examples.
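
As one illustration of going further, the sketch below uses Matplotlib's object-oriented interface, where each panel is an explicit Axes object rather than the implicit "current" plot that the plt.* calls act on. The names ax_left and ax_right and the x-squared data are made up for this example.

```python
import matplotlib.pyplot as plt

x_vals = list(range(8))
squares = [x**2 for x in x_vals]

# One figure holding two side-by-side Axes objects
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: line plot
ax_left.plot(x_vals, squares, label="x squared")
ax_left.set_xlabel("x")
ax_left.set_ylabel("f(x)")
ax_left.legend()

# Right panel: bar chart of the same values
ax_right.bar(x_vals, squares)
ax_right.set_xlabel("x")
ax_right.set_ylabel("f(x)")

# Keep the axis labels of the two panels from overlapping
fig.tight_layout()
# In a notebook, finish with plt.show() as in the cells above
```

Methods on an Axes mirror the plt.* functions (set_xlabel instead of xlabel, and so on), which tends to be easier to follow once a figure has more than one panel.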

A nice YouTube tutorial on Matplotlib is available here: https://www.youtube.com/watch?v=q7Bo_J8x_dw&list=PLQVvvaa0QuDfefDfXb9Yf0la1fPDKluPF

from IPython.display import HTML

# Youtube
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/q7Bo_J8x_dw" ' + 
     'frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen></iframe>')

NumPy for Numerical Calculations¶

Working with lists of numbers can be simplified with NumPy. In fact, most non-trivial data analysis packages rely on NumPy directly.

# Numpy for fast numeric computation
import numpy as np

# Want to add a value to each element of the list
py_list = [1, 2, 3, 4, 5, 6, 7]
print(py_list)

# Could do this, but it's a little verbose
print([x+5 for x in py_list])

# Makes more sense, but this will throw an error
print(py_list + 5)

[1, 2, 3, 4, 5, 6, 7]
[6, 7, 8, 9, 10, 11, 12]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-0240d2909f5c> in <module>()
     10 
     11 # Makes more sense, but this will throw an error
---> 12 print(py_list + 5)

TypeError: can only concatenate list (not "int") to list

arr1 = np.array(py_list) # create a numpy array
print(arr1)

# Can add value in element-wise operation
#  also works for multiply, divide, subtract, etc.
print(arr1 + 5)

[1 2 3 4 5 6 7]
[ 6  7  8  9 10 11 12]
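
The comment above mentions that other operators work the same way. A minimal sketch of a few of them, using a fresh array named arr so it does not depend on earlier cells:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr - 1)       # subtract a scalar from every element
print(arr * 2)       # multiply every element by a scalar
print(arr / 2)       # true division yields floats
print(arr + arr)     # two arrays of equal length combine element by element
print(arr[arr > 4])  # a boolean mask keeps only the elements greater than 4
```

The last line shows that comparisons are element-wise too: arr > 4 yields an array of booleans, which can then be used as an index.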

# And works natively with matplotlib

plt.plot(range(7), arr1, label="Original")
plt.plot(range(7), arr1 * 2, label="Doubled")

# Set axis labels (you should always do this)
plt.xlabel("x")
plt.ylabel("f(x)")

plt.legend()
plt.show()

Pandas and R-like Dataframes¶

A dataframe is quite similar to your standard Excel spreadsheet but can be manipulated more easily in Python.

# Pandas for R-like DataFrames
import pandas as pd

# Test tab-separated file for reading data
#  The adjacency list for the risk board from HW2
tsv_file = "risk.adj"


# Read in TSV file and convert to DataFrame
df = pd.read_csv(tsv_file, sep='\t', header=0)

# Pretty-Print the dataframe automatically
df

# How many rows and columns does this dataset have?
print("Dataset Size:", df.shape)

Dataset Size: (81, 3)
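
If risk.adj is not at hand, the same read_csv call can be tried on an in-memory string. The sketch below assumes the file layout matches the DataFrame display in this notebook (a header row, then tab-separated SOURCE, SINK, and ATTR columns) and uses only its first three rows; sample_tsv and sample_df are names made up for the example.

```python
import io
import pandas as pd

# Three rows in the same tab-separated layout as the DataFrame display
sample_tsv = (
    "SOURCE\tSINK\tATTR\n"
    "Alaska\tNorthwest Territory\t{}\n"
    "Alaska\tAlberta\t{}\n"
    "Alaska\tKamchatka\t{}\n"
)

# io.StringIO lets read_csv treat the string as if it were a file
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=0)

print("Sample Size:", sample_df.shape)
print("Columns:", sample_df.columns.tolist())

# Boolean filtering: every territory listed as a SINK of Alaska
print(sample_df[sample_df.SOURCE == "Alaska"].SINK.tolist())
```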

Adjacency Matrices¶

Use the adjacency list to create an adjacency matrix.

# How many countries
countries = set(df.SOURCE).union(set(df.SINK))
country_count = len(countries)
print("Countries:", country_count)

# Map country names to IDs
country_map = dict(zip(countries, range(country_count)))

# Initialize the adjacency matrix
adj_matrix = np.zeros((country_count, country_count))

# Populate the Adjacency Matrix
for idx, row in df.iterrows():
    i = country_map[row.SOURCE]
    j = country_map[row.SINK]
    
    adj_matrix[i][j] = 1
    adj_matrix[j][i] = 1

# The matrix is symmetric, so each undirected edge is counted twice
print("Number of Edges:", np.sum(adj_matrix) / 2)

Countries: 42
Number of Edges: 81.0
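
A few sanity checks are worth running on any adjacency matrix built this way. The sketch below applies them to a hypothetical four-node graph rather than the Risk board; the edge list and the names edges, nodes, node_map, and adj are made up for the example.

```python
import numpy as np

# Hypothetical four-node undirected graph given as an edge list
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Sorting the names gives a reproducible name-to-index mapping
nodes = sorted({name for edge in edges for name in edge})
node_map = {name: idx for idx, name in enumerate(nodes)}

adj = np.zeros((len(nodes), len(nodes)), dtype=int)
for src, dst in edges:
    i, j = node_map[src], node_map[dst]
    adj[i, j] = 1
    adj[j, i] = 1

# An undirected graph must give a symmetric matrix
print("Symmetric:", np.array_equal(adj, adj.T))
# No self-loops means an all-zero diagonal
print("Self-loops:", int(np.trace(adj)))
# Every edge is stored twice, once per direction
print("Edges:", int(adj.sum()) // 2)
```

Iterating over a plain set, as the cell above does, can order the names differently from run to run, so sorting first is one way to keep the row indices stable.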

Degree for Each Country¶

Calculate the degree (number of neighbors) of each country by summing its row of the adjacency matrix.

# Invert the country map, so we can take a matrix row 
#  and convert it to the country name
inv_country_map = {idx: name for name, idx in country_map.items()}

# Get the degree for each country
for i in range(country_count):
    print("Country:", inv_country_map[i], "Degree:", np.sum(adj_matrix[i]))

Country: Eastern Australia Degree: 2.0
Country: Irkutsk Degree: 4.0
Country: Japan Degree: 2.0
Country: Alaska Degree: 3.0
Country: New Guinea Degree: 3.0
Country: Central America Degree: 3.0
Country: Afghanistan Degree: 5.0
Country: Scandinavia Degree: 4.0
Country: Madagascar Degree: 2.0
Country: Western Australia Degree: 3.0
Country: Kamchatka Degree: 5.0
Country: Brazil Degree: 4.0
Country: Egypt Degree: 4.0
Country: Western Europe Degree: 3.0
Country: Argentina Degree: 2.0
Country: Northwest Territory Degree: 4.0
Country: Siberia Degree: 5.0
Country: Western United States Degree: 4.0
Country: Venezuela Degree: 3.0
Country: Greenland Degree: 4.0
Country: Northern Europe Degree: 5.0
Country: North Africa Degree: 6.0
Country: Peru Degree: 3.0
Country: Siam Degree: 3.0
Country: Ukraine Degree: 6.0
Country: Eastern United States Degree: 3.0
Country: Mongolia Degree: 5.0
Country: South Africa Degree: 3.0
Country: East Africa Degree: 6.0
Country: Indonesia Degree: 3.0
Country: India Degree: 4.0
Country: Iceland Degree: 3.0
Country: Alberta Degree: 4.0
Country: Eastern Canada Degree: 3.0
Country: Congo Degree: 3.0
Country: Ontario Degree: 5.0
Country: Great Britain Degree: 4.0
Country: Yakutsk Degree: 3.0
Country: Ural Degree: 4.0
Country: China Degree: 6.0
Country: Middle East Degree: 6.0
Country: Southern Europe Degree: 5.0

Degree Distributions¶

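# Get the degree for each row (i.e., country)
degrees = np.sum(adj_matrix, axis=1)

# Build a histogram of degrees with one bin per integer degree;
#  the extra edge at max+1 keeps the top two degrees in separate bins
max_degree = int(np.max(degrees))
plt.hist(degrees, bins=np.arange(1, max_degree + 2))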

plt.xlabel("Degree")
plt.ylabel("Degree Frequency")

plt.grid()
plt.show()
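
The counts behind a degree histogram can also be read off numerically. A minimal sketch using np.unique on a hypothetical degree sequence (toy_degrees is made up for the example, not taken from the Risk board):

```python
import numpy as np

# Hypothetical degree sequence for a small graph
toy_degrees = np.array([1, 2, 2, 3, 3, 3, 4])

# return_counts=True pairs each distinct degree with how many nodes have it
values, counts = np.unique(toy_degrees, return_counts=True)

for value, count in zip(values, counts):
    print("Degree:", value, "Count:", count)
```

Printing the exact counts is a quick way to confirm that the bin edges of a histogram are grouping degrees the way you intended.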

Degree Centrality¶

# Calculate the degree centrality for all countries
degree_centrality = degrees / (country_count - 1)

# Map countries to their centralities
d_cent_map = {inv_country_map[i]:degree_centrality[i] for i in range(country_count)}

# Sort countries by centrality
sorted_countries = sorted(d_cent_map, key=d_cent_map.get, reverse=True)
for c in sorted_countries[:10]:
    print(c, d_cent_map[c])

North Africa 0.14634146341463414
Ukraine 0.14634146341463414
East Africa 0.14634146341463414
China 0.14634146341463414
Middle East 0.14634146341463414
Afghanistan 0.12195121951219512
Kamchatka 0.12195121951219512
Siberia 0.12195121951219512
Northern Europe 0.12195121951219512
Mongolia 0.12195121951219512
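
The calculation above matches the usual definition of normalized degree centrality: a node's degree divided by the largest degree it could have. Written out, with n the number of nodes (42 countries here) and a degree-6 country such as China as the worked case:

```latex
C_D(v) = \frac{\deg(v)}{n - 1},
\qquad
C_D(\text{China}) = \frac{6}{42 - 1} \approx 0.146
```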

Centralization¶

max_centrality = np.max(degree_centrality)

centralization = np.sum([max_centrality - x for x in degree_centrality])
print("Unnormalized Centralization:", centralization)

Unnormalized Centralization: 2.1951219512195124
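
The value above is left unnormalized. One common convention, Freeman's degree centralization, divides by the largest sum any graph of the same size can reach, which is n - 2 when the per-node centralities are already divided by n - 1. The sketch below assumes that convention; the function name and the two toy graphs are made up for the example.

```python
import numpy as np

def degree_centralization(adj):
    """Normalized degree centralization of an undirected graph (n >= 3)."""
    n = adj.shape[0]
    centrality = adj.sum(axis=1) / (n - 1)
    # A star graph maximizes the summed gap at (n - 2), which is the normalizer
    return np.sum(centrality.max() - centrality) / (n - 2)

# Hypothetical five-node star: node 0 is connected to every other node
star = np.zeros((5, 5))
star[0, 1:] = 1
star[1:, 0] = 1

# Hypothetical five-node ring: every node has degree 2
ring = np.zeros((5, 5))
for i in range(5):
    ring[i, (i + 1) % 5] = 1
    ring[(i + 1) % 5, i] = 1

print("Star:", degree_centralization(star))
print("Ring:", degree_centralization(ring))
```

The star scores 1 (one node dominates) and the ring scores 0 (all nodes are equal). Under the same assumption, the Risk board value would be centralization / (country_count - 2), roughly 0.055.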

DataFrame display produced by the df cell above (pandas elides rows 30-50):

	SOURCE	SINK	ATTR
0	Alaska	Northwest Territory	{}
1	Alaska	Alberta	{}
2	Alaska	Kamchatka	{}
3	Alberta	Northwest Territory	{}
4	Alberta	Ontario	{}
5	Alberta	Western United States	{}
6	Central America	Western United States	{}
7	Central America	Eastern United States	{}
8	Central America	Venezuela	{}
9	Eastern United States	Eastern Canada	{}
10	Eastern United States	Western United States	{}
11	Greenland	Northwest Territory	{}
12	Greenland	Iceland	{}
13	Greenland	Ontario	{}
14	Greenland	Eastern Canada	{}
15	Northwest Territory	Ontario	{}
16	Ontario	Eastern Canada	{}
17	Ontario	Western United States	{}
18	Argentina	Peru	{}
19	Argentina	Brazil	{}
20	Brazil	Venezuela	{}
21	Brazil	Peru	{}
22	Brazil	North Africa	{}
23	Peru	Venezuela	{}
24	Great Britain	Scandinavia	{}
25	Great Britain	Northern Europe	{}
26	Great Britain	Western Europe	{}
27	Great Britain	Iceland	{}
28	Iceland	Scandinavia	{}
29	Northern Europe	Ukraine	{}
...	...	...	...
51	Egypt	Middle East	{}
52	Madagascar	South Africa	{}
53	Afghanistan	Middle East	{}
54	Afghanistan	India	{}
55	Afghanistan	China	{}
56	Afghanistan	Ural	{}
57	China	India	{}
58	China	Siam	{}
59	China	Mongolia	{}
60	China	Siberia	{}
61	China	Ural	{}
62	India	Middle East	{}
63	India	Siam	{}
64	Irkutsk	Mongolia	{}
65	Irkutsk	Kamchatka	{}
66	Irkutsk	Yakutsk	{}
67	Irkutsk	Siberia	{}
68	Japan	Mongolia	{}
69	Japan	Kamchatka	{}
70	Kamchatka	Mongolia	{}
71	Kamchatka	Yakutsk	{}
72	Mongolia	Siberia	{}
73	Siam	Indonesia	{}
74	Siberia	Yakutsk	{}
75	Siberia	Ural	{}
76	Eastern Australia	New Guinea	{}
77	Eastern Australia	Western Australia	{}
78	Indonesia	New Guinea	{}
79	Indonesia	Western Australia	{}
80	New Guinea	Western Australia	{}