INST728E - Module 6. Extracting Multimedia

Having extracted a set of relevant tweets in the previous module, we will now explore the media posted in these tweets.

  • Random sample of images
  • Most retweeted multimedia
In [1]:
%matplotlib inline

import datetime
import json
import sys
import os

# For displaying HTML
from IPython.display import HTML, display

We'll keep using the same event description as before.

In [2]:
crisisInfo = {
    "brussels": {
        "name": "Brussels Transit Attacks",
        "time": 1458629880, # Timestamp in seconds since 1/1/1970, UTC
                            # 22 March 2016, 6:58 UTC to 08:11 UTC
        "directory": "brussels",
        "keywords": ["brussels", "bomb", "belgium", "explosion"],
        "box": {
            "lowerLeftLon": 2.54563,
            "lowerLeftLat": 49.496899,
            "upperRightLon": 6.40791,
            "upperRightLat": 51.5050810,
        }
    },
}
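The "time" field above is a Unix timestamp, so it can be checked against the date in the comment. A minimal sketch using only the standard library; the variable names crisis_time and crisis_dt are illustrative and not part of the original notebook:

```python
from datetime import datetime, timezone

# Epoch timestamp copied from the "time" field of the brussels entry
crisis_time = 1458629880

# Convert to a timezone-aware UTC datetime
crisis_dt = datetime.fromtimestamp(crisis_time, tz=timezone.utc)
print(crisis_dt.isoformat())  # 2016-03-22T06:58:00+00:00
```

Passing tz=timezone.utc avoids the conversion silently using the local time zone of the machine running the notebook.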
In [3]:
# Replace the name below with your selected crisis
selectedCrisis = "brussels"

Reading Relevant Tweets

We stored relevant tweets in a file at the end of the last module, one JSON object per line. To use that data here, we read the file back into a list.

In [4]:
in_file_path = "/Users/cbuntain/relevant_tweet_output.json" # Replace this as necessary

relevant_tweets = []
# Open as UTF-8 text; json.loads accepts the decoded string directly
with open(in_file_path, "r", encoding="utf8") as in_file:
    for line in in_file:
        relevant_tweets.append(json.loads(line))
In [5]:
len(relevant_tweets)
Out[5]:
4687

Extracting Media

Tweets that contain images or video carry a list of media entities under the "media" key of their entities field.

We'll use that to extract URLs that point to media files, which we can then use to find frequently shared images.
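In the Twitter v1.1 tweet format, the media list under entities reports every item's type as "photo", even for video; the native type and the full list of attached media appear under extended_entities. The sketch below prefers extended_entities when it is present. Whether the tweets saved in the previous module include that field is an assumption, and get_media_list and sample_tweet are hypothetical names used only for illustration:

```python
def get_media_list(tweet):
    """Return a tweet's media entries, preferring extended_entities."""
    extended = (tweet.get("extended_entities") or {}).get("media")
    if extended:
        return extended
    # Fall back to the basic entities field, or an empty list
    return (tweet.get("entities") or {}).get("media", [])

# Hypothetical, hand-built tweet for illustration only
sample_tweet = {
    "entities": {"media": [{"id": 1, "type": "photo"}]},
    "extended_entities": {"media": [{"id": 1, "type": "video"}]},
}

media_types = [m["type"] for m in get_media_list(sample_tweet)]
print(media_types)  # ['video']
```

A helper like this could stand in for the direct tweet["entities"]["media"] lookup if distinguishing photos from videos matters.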

CAUTION: MAY CONTAIN DISTURBING IMAGERY

In [6]:
# A map for media counts
media_map = {}

# For mapping image IDs to data
media_info_map = {}

# For each tweet, check if it has a media entity
for tweet in relevant_tweets:
    
    # If no "media" field, skip
    if "media" not in tweet["entities"]:
        continue

    # Get a list of shared media
    mediaList = tweet["entities"]["media"]

    # For each piece of media, count it by ID and keep its metadata
    for media in mediaList:
        media_id = media["id"]
        media_map[media_id] = media_map.get(media_id, 0) + 1
        media_info_map[media_id] = media


print("Unique Media:", len(media_map))
Unique Media: 1097
In [7]:
# Rank media IDs by how often they were shared, most frequent first
sortedMedia = sorted(media_map, key=media_map.get, reverse=True)

print ("Top Media:")
for media_id in sortedMedia[:30]:
    media = media_info_map[media_id]
    print("\tID:", media_id, "Count:", media_map[media_id], "Type:", media["type"])
    print("\t%s" % media["expanded_url"])
Top Media:
	ID: 712177667823570944 Count: 74 Type: photo
	http://twitter.com/AAhronheim/status/712177856768569344/video/1
	ID: 712208274175729665 Count: 10 Type: photo
	http://twitter.com/EuranetPlus/status/712208382896246785/photo/1
	ID: 712181069055975424 Count: 9 Type: photo
	http://twitter.com/Tabagari/status/712181078677700608/photo/1
	ID: 712205990804983808 Count: 9 Type: photo
	http://twitter.com/BBCBreaking/status/712205991513870336/photo/1
	ID: 712182954856812544 Count: 8 Type: photo
	http://twitter.com/BBCBreaking/status/712182955544809473/photo/1
	ID: 712193424498278400 Count: 8 Type: photo
	http://twitter.com/Conflicts/status/712193425467117568/photo/1
	ID: 712208482238386177 Count: 8 Type: photo
	http://twitter.com/BBCBreaking/status/712208993314328576/photo/1
	ID: 712217157174697984 Count: 8 Type: photo
	http://twitter.com/LeVraiHoroscope/status/712217158168780800/photo/1
	ID: 712178492453085184 Count: 7 Type: photo
	http://twitter.com/AmichaiStein1/status/712178836948033536/video/1
	ID: 712179810722680832 Count: 7 Type: photo
	http://twitter.com/wardmarkey/status/712179811964346369/photo/1
	ID: 712180998210002944 Count: 7 Type: photo
	http://twitter.com/intlspectator/status/712180999564738560/photo/1
	ID: 712210894936285188 Count: 7 Type: photo
	http://twitter.com/BBCBreaking/status/712210895427014656/photo/1
	ID: 712174739884806144 Count: 6 Type: photo
	http://twitter.com/News_Executive/status/712174740753018880/photo/1
	ID: 712187815648436224 Count: 6 Type: photo
	http://twitter.com/OnlineMagazin/status/712188304150679552/video/1
	ID: 712179890141982720 Count: 6 Type: photo
	http://twitter.com/AAhronheim/status/712179940788195328/photo/1
	ID: 712176801188089856 Count: 5 Type: photo
	http://twitter.com/airlivenet/status/712176808318390272/photo/1
	ID: 712184530258501632 Count: 5 Type: photo
	http://twitter.com/RT_com/status/712184531244208129/photo/1
	ID: 712193012638605313 Count: 5 Type: photo
	http://twitter.com/SkyNews/status/712193076618465280/photo/1
	ID: 712242210121637888 Count: 5 Type: photo
	http://twitter.com/BBCBreaking/status/712242210679463936/photo/1
	ID: 712188176161333248 Count: 5 Type: photo
	http://twitter.com/bala_Bomb/status/712188215394828290/photo/1
	ID: 712174091038564353 Count: 4 Type: photo
	http://twitter.com/virginieleyssen/status/712174102962950145/photo/1
	ID: 712179266532802560 Count: 4 Type: photo
	http://twitter.com/SkyNews/status/712180408847351808/photo/1
	ID: 712182658713911296 Count: 4 Type: photo
	http://twitter.com/RT_com/status/712182659695452160/photo/1
	ID: 712189669186904064 Count: 4 Type: photo
	https://twitter.com/AbraxasSpa/status/712189779736207360/video/1
	ID: 712196641462329344 Count: 4 Type: photo
	http://twitter.com/AhronYoung/status/712196653122514944/photo/1
	ID: 712210894860722176 Count: 4 Type: photo
	http://twitter.com/BBCWorld/status/712210895498317824/photo/1
	ID: 712206166386991104 Count: 4 Type: photo
	http://twitter.com/RFERL/status/712206166936440832/photo/1
	ID: 712222100455616512 Count: 4 Type: photo
	http://twitter.com/cnnbrk/status/712222101017661440/photo/1
	ID: 712226132733464576 Count: 4 Type: photo
	http://twitter.com/f_cancellara/status/712226141839466496/photo/1
	ID: 712245030367449090 Count: 4 Type: photo
	http://twitter.com/BBCBreaking/status/712245288744951809/photo/1
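The counting and ranking steps above can also be sketched with collections.Counter from the standard library, which combines the tally and the sort. The IDs below are made-up placeholders, not values from the dataset:

```python
from collections import Counter

# Hypothetical media IDs, one entry per tweet that shared the item
shared_ids = [101, 102, 101, 103, 101, 102]

# Counter tallies occurrences; most_common returns them in descending order
media_counts = Counter(shared_ids)
top_two = media_counts.most_common(2)
print(top_two)  # [(101, 3), (102, 2)]
```

This is an alternative to the dictionary-plus-sorted approach, not a change to it; both give the same ranking.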
In [8]:
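# Display the top images (the files are fetched from Twitter when rendered)
for media_id in sortedMedia[:30]:
    media = media_info_map[media_id]
    print("\tID:", media_id)
    # Prefer the HTTPS URL so browsers don't block the image as mixed content
    image_url = media.get("media_url_https", media["media_url"])
    display(HTML("<img src=\"%s\"/>" % image_url))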
	ID: 712177667823570944
	ID: 712208274175729665
	ID: 712181069055975424
	ID: 712205990804983808
	ID: 712182954856812544
	ID: 712193424498278400
	ID: 712208482238386177
	ID: 712217157174697984
	ID: 712178492453085184
	ID: 712179810722680832
	ID: 712180998210002944
	ID: 712210894936285188
	ID: 712174739884806144
	ID: 712187815648436224
	ID: 712179890141982720
	ID: 712176801188089856
	ID: 712184530258501632
	ID: 712193012638605313
	ID: 712242210121637888
	ID: 712188176161333248
	ID: 712174091038564353
	ID: 712179266532802560
	ID: 712182658713911296
	ID: 712189669186904064
	ID: 712196641462329344
	ID: 712210894860722176
	ID: 712206166386991104
	ID: 712222100455616512
	ID: 712226132733464576
	ID: 712245030367449090
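The outline at the top of this module also lists a random sample of images. A minimal sketch of one way to draw such a sample, assuming a dictionary shaped like media_info_map; the sample_media_info data and example.com URLs are hypothetical placeholders, and the fixed seed is there only to make the draw repeatable:

```python
import random

# Hypothetical stand-in for media_info_map (media ID -> media metadata)
sample_media_info = {
    1: {"media_url": "http://example.com/a.jpg"},
    2: {"media_url": "http://example.com/b.jpg"},
    3: {"media_url": "http://example.com/c.jpg"},
    4: {"media_url": "http://example.com/d.jpg"},
    5: {"media_url": "http://example.com/e.jpg"},
}

# Seeded generator so the same sample comes back on each run
rng = random.Random(42)

# Never ask for more items than exist
sample_size = min(3, len(sample_media_info))

# random.sample needs a sequence, so pass the IDs as a sorted list
sampled_ids = rng.sample(sorted(sample_media_info), sample_size)

for media_id in sampled_ids:
    print("\tID:", media_id, sample_media_info[media_id]["media_url"])
```

In the notebook, the sampled IDs could be passed to the same display(HTML(...)) loop used for the top images.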