The assignments below comprise your fourth homework assignment. Provide the implementations requested.
You will be graded on the following rubric:
NOTE While you may work in groups to discuss strategies for solving these problems, you are expected to do your own work. Excessive similarity among answers may result in punitive action.
You are provided a file, movie_metadata.csv
, that contains data from the Internet Movie Database (IMDb) for around 5,000 movies. This file came from Kaggle's datasets. Use this file to answer the following questions.
pandas
library from 126, and I encourage you to use pandas
here to read the CSV file. # Read the movie data file in here as you will reference it often during this homework
## IMPLEMENT HERE ##
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
# Read data into data frame, dropping rows with not-a-number values
df = pd.read_csv("movie_metadata.csv", header=0).dropna(axis=0)
The movie_metadata.csv
file contains 28 columns of data, but for this question, we'll only look at the following columns:
Use the data file to find the top 3 highest grossing movie titles. For each movie, print its title, rank, gross income, rating, and genres.
As an example (this output shows formatting only; Hackers is not the highest grossing movie):
Rank 1: Hackers, $7564000, PG-13, Comedy|Crime|Drama|Thriller Rank 2: Hackers 2, $7564000, PG-13, Comedy|Crime|Drama|Thriller Rank 3: Hackers 3, $7564000, PG-13, Comedy|Crime|Drama|Thriller
Repeat this procedure for IMDb score and Facebook Likes.
## Implement your code here
print("By gross...")
rank = 1
for row in df.sort_values(by="gross", ascending=False).head(3).itertuples(False):
print("Rank %s:" % rank, row.movie_title, ",", row.gross, ",", row.content_rating, ",", row.genres)
rank += 1
print()
print("By IMDb Score...")
rank = 1
for row in df.sort_values(by="imdb_score", ascending=False).head(3).itertuples(False):
print("Rank %s:" % rank, row.movie_title, ",", row.imdb_score, ",", row.content_rating, ",", row.genres)
rank += 1
print()
print("By Facebook Likes...")
rank = 1
for row in df.sort_values(by="movie_facebook_likes", ascending=False).head(3).itertuples(False):
print("Rank %s:" % rank, row.movie_title, ",", row.movie_facebook_likes, ",", row.content_rating, ",", row.genres)
rank += 1
Construct a plot that relates release years to total income and average income for that year.
NOTE: Gross income may dominate the graph. To adjust, you could use the plt.semilogy()
plot instead of a standard plot. This function plots the y-axis in a logarithmic scale.
# Get each unique year in our dataset
years = set(df.title_year)
yearEarningDict = {}
# For every year, find all movies released that year, and calc
# their sum and avg incomes
for year in years:
# all movies released this year
thisYearDf = df[df.title_year == year]
# Sum across the gross column
sumGross = sum(thisYearDf.gross)
# Divide sum by number of movies
avgGross = sumGross / thisYearDf.shape[0]
# Update dictionary with the sum and avg for this year
yearEarningDict[year] = (sumGross, avgGross)
# Convert dictionary to sorted lists for display
sortedYears = sorted(years)
totalIncome = []
avgIncome = []
for year in sortedYears:
totalIncome.append(yearEarningDict[year][0])
avgIncome.append(yearEarningDict[year][1])
# Construct graphs over years and income
plt.semilogy(sortedYears, totalIncome, label="Total")
plt.semilogy(sortedYears, avgIncome, label="Average")
plt.legend()
plt.grid()
plt.show()
# Get the best years by sorting by sum and avg in our dictionary
bestSumYear = sorted(yearEarningDict, key=lambda x: yearEarningDict[x][0])[-1]
bestAvgYear = sorted(yearEarningDict, key=lambda x: yearEarningDict[x][1])[-1]
print("Highest Sum Year:", bestSumYear, yearEarningDict[bestSumYear])
print("Highest Avg Year:", bestAvgYear, yearEarningDict[bestAvgYear])
Plot movie languages to total income and average income for each language.
NOTE You can use different graphs here, but the matplotlib.bar()
function might be best. You can provide it range for x coordinates, and income data for y coordinates. I also suggest you plot these on two different plots (recall you can call plt.show()
to create a plot and can call it multiple times to create multiple plots).
To make the graph more legible, refer to the plt.xticks()
function, which takes a list of indices, a list of string labels, and an optional rotation
argument that lets you rotate the labels. For example, the following will make vertically-rotated labels:
plt.xticks(range(languageDf.shape[0]), languageDf.language, rotation="vertical")
# What are the unique languages?
languages = set(df.language)
# For each language, calculate sum and gross as we did with years
languageData = []
for lang in languages:
thisLangDf = df[df.language == lang]
sumGross = sum(thisLangDf.gross)
avgGross = sumGross / thisLangDf.shape[0]
# Add results as a small dictionary to list.
# Each element has the language, its sum, and its gross
languageData.append({"language":lang, "total":sumGross, "avg":avgGross})
# Construct dataframe from this list of languages, then we can sort
# like we did for gross, likes, and IMDb score
languageDf = pd.DataFrame(languageData)
print("By Total...")
for row in languageDf.sort_values(by="total", ascending=False).head(3).itertuples(False):
print(row)
print("By Avg...")
for row in languageDf.sort_values(by="avg", ascending=False).head(3).itertuples(False):
print(row)
plt.bar(range(len(languageData)), languageDf.total, label="Total")
plt.xticks(range(languageDf.shape[0]), languageDf.language, rotation="vertical")
plt.legend()
plt.grid()
plt.show()
plt.bar(range(len(languageData)), languageDf.avg, label="Average")
plt.xticks(range(languageDf.shape[0]), languageDf.language, rotation="vertical")
plt.legend()
plt.grid()
plt.show()
Plot movie ratings to total income and average income for each rating.
## Implement your code here
ratings = set(df.content_rating)
ratingData = []
for rating in ratings:
thisRatingDf = df[df.content_rating == rating]
sumGross = sum(thisRatingDf.gross)
avgGross = sumGross / thisRatingDf.shape[0]
ratingData.append({"rating":rating, "total":sumGross, "avg":avgGross})
ratingDf = pd.DataFrame(ratingData)
print("By Total...")
for row in ratingDf.sort_values(by="total", ascending=False).head(3).itertuples(False):
print(row)
print("By Avg...")
for row in ratingDf.sort_values(by="avg", ascending=False).head(3).itertuples(False):
print(row)
plt.plot(range(len(ratingDf)), ratingDf.total, label="Total")
plt.xticks(range(ratingDf.shape[0]), ratingDf.rating, rotation="vertical")
plt.legend()
plt.grid()
plt.show()
plt.plot(range(len(ratingDf)), ratingDf.avg, label="Average")
plt.xticks(range(ratingDf.shape[0]), ratingDf.rating, rotation="vertical")
plt.legend()
plt.grid()
plt.show()
Plot movie genres to total income and average income for each genre.
NOTE You will need to use string functions to split genres along the "|" delimiter for movies with mutliple genres.
## Implement your code here
genreStrings = set(df.genres)
genres = set()
for gStr in genreStrings:
for g in gStr.split("|"):
genres.add(g)
genreData = []
for genre in genres:
thisGenreDf = df[df.genres.str.contains(genre)]
sumGross = sum(thisGenreDf.gross)
avgGross = sumGross / thisRatingDf.shape[0]
genreData.append({"genre":genre, "total":sumGross, "avg":avgGross})
genreDf = pd.DataFrame(genreData)
print("By Total...")
for row in genreDf.sort_values(by="total", ascending=False).head(3).itertuples(False):
print(row)
print("By Avg...")
for row in genreDf.sort_values(by="avg", ascending=False).head(3).itertuples(False):
print(row)
plt.bar(range(len(genreDf)), genreDf.total, label="Total")
plt.xticks(range(genreDf.shape[0]), genreDf.genre, rotation="vertical")
plt.legend()
plt.grid()
plt.show()
plt.bar(range(len(genreDf)), genreDf.avg, label="Average")
plt.xticks(range(genreDf.shape[0]), genreDf.genre, rotation="vertical")
plt.legend()
plt.grid()
plt.show()
Plot movie gross income to Facebook Likes. You can do this using plt.scatter()
or with Pandas.
No, there appears to be little correlation between gross income and likes. You see movies like Avatar with the highest income and few likes, or Interstellar with the highest likes but relatively low gross.
## Implement your code here
df.plot.scatter("gross", "movie_facebook_likes")
Plot movie gross income to IMDb Score.
It appears better than Facebook likes, but the majority of IMDb's high-scoring mass is located in low-grossing movies. At least highly grossing movies tend to have higher IMDb scores, which suggests some valid, positive correlation.
## Implement your code here
df.plot.scatter("gross", "imdb_score")
To determine a movie's profits, subtract the movie's budget from its gross earnings. Use this formula to identify the top 3 most profitable movies.
## Implement your code here
df["profit"] = df["gross"] - df["budget"]
print("By gross...")
rank = 1
for row in df.sort_values(by="profit", ascending=False).head(3).itertuples(False):
print("Rank %s:" % rank, row.movie_title, ",", row.profit, ",", row.content_rating, ",", row.genres)
rank += 1
The large numbers for highest grossing films may dominate profitability analysis. A better test is to look at movie return as a percent of the budget. That is, a low-budget movie that makes a lot has a better return than a high-budget film with the same gross income.
Calculate the ratio between gross income and budget to determine the percent return for each movie, and print the top 3 highest returning movies.
## Implement your code here
df["percent"] = df["gross"] / df["budget"]
print("By gross...")
rank = 1
for row in df.sort_values(by="percent", ascending=False).head(3).itertuples(False):
print("Rank %s:" % rank, row.movie_title, ",", row.precent, ",", row.content_rating, ",", row.genres)
rank += 1
Using the percent return you calculated above, plot the average percent return for each genre.
## Implement your code here
genreData = []
for genre in genres:
thisGenreDf = df[df.genres.str.contains(genre)]
genreData.append({"genre":genre, "percentRet":thisGenreDf.percent.mean()})
genreDf = pd.DataFrame(genreData)
print("By Return...")
for row in genreDf.sort_values(by="percentRet", ascending=False).head(3).itertuples(False):
print(row)
plt.bar(range(len(genreDf)), genreDf["percentRet"], label="Average")
plt.xticks(range(genreDf.shape[0]), genreDf.genre, rotation="vertical")
plt.legend()
plt.grid()
plt.show()
The movie_metadata.csv
also contains columns, actor_X_name
, for the first three top-billed actors in each movie. Use these columns for the following questions.
As with finding the top-grossing movies, print the top 3 highest grossing actors/actresses. These people will have the highest sum of gross movie income. That is, for each actor, sum all the gross
numbers for each movie in which he/she has appeared in the top 3 billings, and order the actors by this sum.
As with the top-grossing movies, print the top 3 actors and their summed gross income in a format similar to the following:
Rank 1: Angelina Jolie, $7564000 Rank 2: Johnny Lee Miller, $7564000 Rank 3: Brad Pitt, $7564000
NOTE: Depending on how efficient your code is, it may take a minute to execute.
# A set of unique actors
actors = set()
# restrict ourselves to the actor columnes
actorColumns = ["actor_1_name", "actor_2_name", "actor_3_name"]
# For each row, add each actor to the set of actors
for row in df[actorColumns].itertuples(False):
for actor in row:
actors.add(actor)
# Dictionary of actor names -> gross
actorDict = {}
# For each actor, find all movies in which that actor appeared...
for actor in actors:
bill = df[(df["actor_1_name"] == actor) | (df["actor_2_name"] == actor) | (df["actor_3_name"] == actor)]
# And set the summed gross to that actor's value
actorDict[actor] = bill["gross"].sum()
# Sort the actors by their gross, reversing to the get the highest first
sortedActors = sorted(actorDict, key=actorDict.get, reverse=True)
# Temp variable for counting rank
rank = 1
# For top three actors, print rank, name, and gross
for actor in sortedActors[:3]:
print("Rank %s:" % rank, actor, ", $", actorDict[actor])
rank+=1
Iterate through the movies file and construct a dictionary that maps each star to his/her costars. Use this dictionary to print all the costars of the top 3 highest grossing actors.
An example format might be:
Rank 1: Angelina Jolie Costar: Johnny Lee Miller Costar: Brad Pitt Costar: Antonio Banderas Rank 2: etc...
NOTE Be sure and exclude the target actor from his/her list of costars. E.g., Robert Downey Jr. is not a costar of himself.
# Dictionary mapping actor to actor's costars
costarDict = {}
# For each actor in our list of sorted actors...
for actor in sortedActors:
# Find all movies in which that actor appeared, as before
bill = df[(df["actor_1_name"] == actor) | (df["actor_2_name"] == actor) | (df["actor_3_name"] == actor)]
# Create a set of costars
costars = set()
# For each movie, add fellow actors (i.e., costars) to dictionary
for movieIndex, movie in bill.iterrows():
for actorCol in actorColumns:
costars.add(movie[actorCol])
# Remove empty sets
costarDict[actor] = costars - set()
# Print costars for top three actors
rank = 1
for actor in sortedActors[:3]:
print("Rank %d:" % rank, actor)
for costar in costarDict[actor]:
print("\tCostar:", costar)
In the 90’s before Google was around, people used to play a game called Six Degrees from Kevin Bacon. Kevin Bacon was a relatively well-known actor at the time who was said to be connected to almost anyone in the movie industry.
The game was played by asking your friend to name a random actor. Using that random actor as a starting point, you’d choose a movie the actor was in, and then choose a costar – a first-degree costar, so to speak – from that movie. Then you’d choose a second-degree costar who shared a different movie with the first-degree costar, and so on. The goal of the game was to connect back to Kevin Bacon within six degrees (i.e. within six costars away), and since Google wasn’t around, it was pretty impressive if someone could. Here’s a video on Youtube that demonstrates it.
This question attempts to re-create that game, except it will only go to two degrees because our computers will (figuratively) explode if they try to do six degrees.
Using the code and the dictionary you developed above, write a function find_cocostars()
that takes as an argument an actor's name target_actor
and the costar dictionary you just made and returns a list of tuples. Each tuple should contain the costar's degree from the given target actor and the costar's name. This degree is a number indicating whether the actor is a direct co-star $degree == 1$ or a co-start of a co-star $degree == 2$.
Use this function to print all the first- and second-degree co-stars of Angelina Jolie Pitt.
A sample output may look like:
Actor: Angeline Jolie Pitt Degree 1: Brad Pitt Degree 1: Johnny Depp ... Degree 2: Charlize Theron Degree 2: Meryl Streep Degree 2: Bob Hoskins ...
NOTE There will be a lot of degree-2 co-stars.
## Implement your code here
def find_cocostars(target_actor, costar_dict):
"""Looks up all the co-stars for a target actor. Returns a list of
tuples containing (degree, costar) where degree equals 1 or 2.
Parameters:
target_actor: Target actor name
costar_dict: dictionary of costars
Usage Examples:
>>> target = "Angelina Jolie Pitt"
>>> twoDegreeCostars = find_cocostars(target, costarDict)
>>> print("Number of <= 2-degree costars:", len(twoDegreeCostars))
>>> print("Actor:", target)
>>> for degree, costar in twoDegreeCostars:
>>> if ( degree == 1 ):
>>> print("\tDegree 1:", costar)
>>> else:
>>> print("\t\tDegree 2:", costar)
Number of <= 2-degree costars: 832
Actor: Angelina Jolie Pitt
Degree 1: August Diehl
Degree 1: Stockard Channing
...
Degree 1: Martin Scorsese
Degree 2: Nathan Lane
Degree 2: Dianne Wiest
Degree 2: Joe Mantegna
...
"""
# List of costars for the target actor
costars = []
# Set to hold unique costars, which we use for ensuring we only
# show an actor once
firstDegreeCostars = set()
# For each costar of this target actor...
for costar in costar_dict[target_actor]:
# Add a tuple ot this list of costars with degree and name
costars.append((1, costar))
# Add costar name to set of costars
firstDegreeCostars.add(costar)
# Iterate through first-degree costars, adding co-costars
for costar in firstDegreeCostars:
# For each costar of this costar...
for cocostar in costar_dict[costar]:
# Check if we haven't already seen this actor and this actor
# isn't the target actor
if ( cocostar not in firstDegreeCostars and cocostar != target_actor):
# If unseen and not target, add tuple to costar list
costars.append((2, cocostar))
# Return list of costars and degrees
return costars
target = "Angelina Jolie Pitt"
twoDegreeCostars = find_cocostars(target, costarDict)
print("Number of <= 2-degree costars:", len(twoDegreeCostars))
print("Actor:", target)
for degree, costar in twoDegreeCostars:
if ( degree == 1 ):
print("\tDegree 1:", costar)
else:
print("\t\tDegree 2:", costar)