Social Media

Seminar-style course that covers the design and impact of computer-based systems for human communication, including email and IM, discussion boards, Computer-Supported Cooperative Work (CSCW), Group Decision Support Systems (GDSS), and Social Networking Systems. Topics include alternative design structures, impacts of primarily text-based group communication, and recent empirical studies of virtual teams, online communities, and systems used for social networking, including 3-D worlds such as Second Life and “microblogging” systems such as Twitter.

Course Assignments

This course has individual module-level assignments in the form of Medium posts, posted to the course’s Medium Publication.

Course Objectives

At the completion of this course, students will be able to:

  1. Extract graph structures from social media data
  2. Implement and analyze standard algorithms for graph analysis, community detection, network visualization, and other social media mining tasks
  3. Characterize graph construction and evolution using temporal network analysis, including statistical change detection, exponential random graph modeling, and stochastic actor-oriented modeling
  4. Describe social and ethical considerations around the use of social media data for mining, analysis, and inference
  5. Work with a team to design and execute a social media analysis project on data that is not already structured for the analysis task, and to compare and evaluate the design choices

Textbooks

  • ???
  • And selected readings
  • http://www.socialmediamining.info/slides/

Class Projects

  • Dataset creation
  • Modeling and prediction
    • Link prediction
  • Search engine
    • E.g., Lucene
    • Elasticsearch

Class Overview

  • Social media data
    • Semistructured Data
      • JSON
        • Newline-delimited JSON
  • Collecting social media content
    • APIs
      • Twitter
      • Reddit
      • YouTube
      • FB/Instagram
  • Social media data as graphs
    • Constructing graphs from web data
    • Adjacency lists
    • Adjacency matrices
    • Friend/follower graphs
    • Interaction graphs
    • Visualizing graphs
  • Graph analysis
    • Centrality metrics
    • PageRank
    • HITS
  • Clustering
    • k-Means
    • Spectral clustering
    • Hierarchical Clustering
  • Community detection
    • Graph partitioning
    • Modularity
    • Girvan-Newman
  • Graph Mining
    • Friend-of-friend recommendation
    • Preference mining
    • Collaborative filtering
  • Text mining
    • Sentiment analysis
  • Geolocation
    • From digital trace data
  • Image mining
    • pHash
    • pre-trained image models
  • Predicting Phenomena in Social Media
    • Social Ties and Link Prediction
      • Friend recommendation
      • Trust inference
    • Health forecasting (Google Flu)
    • Public perception/Election prediction
  • Online Diffusion and Contagion
    • Information propagation
    • Emotional contagion
  • Cross-platform alignment
    • Cold- vs. Warm-Start for recommendation
  • Social media as streams
    • Stream processing
    • Computing Metrics for Streams
  • Inauthenticity in Social Media
    • Bots and their detection
    • Sock puppets and Trolls
    • Rumors, Misinformation, and Disinformation
      • Political disinformation, astroturfing
      • Rumors during crises
  • Ethical considerations in social media data
    • Privacy concerns
      • Predicting private attributes
    • Content Creators’ Agency (deleted content)
    • Bias and representative data
      • Population bias
      • System bias (e.g., racial bias in dating apps)
    • Impacts of social media
  • Social responsibility of platforms

Modules

Module 1 - What is Social Media and Why Do We Care?

  • What?
  • Why do people use it?
  • What benefits and hazards does it have?
  • Course Logistics
    • All readings should be reviewed prior to watching the lecture videos. I will assume you are familiar with them beforehand.
    • Writing a Medium post
      • Sign up:
        • https://help.medium.com/hc/en-us/articles/115004915268
        • Send me your account info, so I can add you as a collaborator to the class’s publication
      • https://help.medium.com/hc/en-us/categories/200058025-Writing
      • https://help.medium.com/hc/en-us/articles/225168768-Write-a-post
      • https://blog.medium.com/best-practices-for-writing-on-medium-386506ae62b9
      • Where to find media:
        • https://unsplash.com/
        • WikiMedia Commons
    • Potential ethical issues on which we can present

Module 2 - Collecting and Analyzing Social Media Data

  • Ethnographic and Qualitative Analysis
  • Computer-Assisted Quantitative Analysis
    • Large-scale text, graph, and image analysis
  • Data Collection and APIs
    • RESTful APIs and JSON
      • Twitter
      • Reddit
      • YouTube
      • FB/Instagram
  • Bias and ethical questions
    • What sort of data might we miss?
    • Who might be oversampled?
    • What regions might we miss/oversample?
    • Are users okay with having their data collected?
    • Is data collection allowed (GDPR, ToS)?
  • Homework questions
    • Download data from a social media platform
    • Define and extract simple statistics from your data (top-k…)
      • Most popular users
      • Most frequent hashtags
      • Most frequent domains
      • Most frequent words
    • Build time series of data
      • Tweets per hour/day/week
      • Posts including a particular hashtag per hour/day/week
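
The top-k and time series homework steps above can be sketched with the standard library alone. This is a minimal illustration, not a required approach: the sample records and the field names `created_at`, `user`, and `hashtags` are assumptions, and real API payloads (Twitter, Reddit, etc.) will use different structures.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical sample: one JSON object per line (newline-delimited JSON).
# The field names "created_at", "user", and "hashtags" are assumptions.
SAMPLE_NDJSON = """\
{"created_at": "2020-09-01T14:05:00", "user": "alice", "hashtags": ["nlp", "graphs"]}
{"created_at": "2020-09-01T14:40:00", "user": "bob", "hashtags": ["graphs"]}
{"created_at": "2020-09-01T15:10:00", "user": "alice", "hashtags": ["graphs", "ethics"]}
{"created_at": "2020-09-02T09:00:00", "user": "carol", "hashtags": []}
"""


def parse_posts(ndjson_text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]


def top_k(posts, key, k):
    """Return the k most common values of a field; list fields are flattened."""
    counts = Counter()
    for post in posts:
        value = post.get(key)
        if isinstance(value, list):
            counts.update(value)
        elif value is not None:
            counts[value] += 1
    return counts.most_common(k)


def posts_per_bucket(posts, fmt="%Y-%m-%d %H:00", hashtag=None):
    """Count posts per time bucket; optionally keep only posts with a hashtag."""
    series = Counter()
    for post in posts:
        if hashtag is not None and hashtag not in post.get("hashtags", []):
            continue
        ts = datetime.fromisoformat(post["created_at"])
        series[ts.strftime(fmt)] += 1
    return dict(sorted(series.items()))


posts = parse_posts(SAMPLE_NDJSON)
top_hashtags = top_k(posts, "hashtags", 2)      # most frequent hashtags
top_users = top_k(posts, "user", 1)             # most active users
hourly = posts_per_bucket(posts)                # posts per hour
daily_graphs = posts_per_bucket(posts, fmt="%Y-%m-%d", hashtag="graphs")
```

Changing `fmt` switches between hourly and daily buckets; weekly buckets would need a different key (for example, the ISO week number).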

Module 3 - Building Graphs from Social Media

  • Social media data as graphs
    • Constructing graphs from social media
      • Friend/follower graphs
      • Interaction graphs
  • Graph types
    • Directed and undirected
    • Weighted and unweighted
    • Non-negative
    • Bipartite graphs
  • Graph representation
    • Adjacency lists
    • Adjacency matrices
    • Node/Edge lists
  • Visualizing graphs
    • NetworkX
    • Gephi
  • Graph metrics
    • Degree
    • Degree distribution
    • Path length
    • Diameter
  • Subgraphs
    • Egocentric networks
      • 1-degree, 1.5-degree, 2-degree, etc.
  • Measuring Influence
    • Centrality metrics
      • Degree
      • Betweenness
      • Closeness
      • Eigenvector
    • PageRank
    • HITS
  • Community Detection
    • Graph partitioning
    • Modularity
    • Girvan-Newman
    • Louvain
  • Bias and ethical questions
    • Do all ties mean the same thing? (weak vs. strong)
    • Do people have the same number of connections?
  • Homework questions
    • Extract a graph from a provided dataset
      • What type of graph did you extract?
      • Does your graph have weights?
      • Provide stats on the graph
    • Construct a visualization of the graph
      • Explain an important observation you want your visualization to convey
    • Pick a random node in your graph and answer the following:
      • What is this node’s degree? Is this degree higher or lower than the average degree?
      • Construct 5 egocentric networks, from the 1-degree to the 5-degree network
        • Build a plot of the percent of all nodes in the graph that fall within each ego network
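
As a rough sketch of the ego-network homework steps above, the following builds an adjacency list from an edge list and uses breadth-first search to collect the nodes within k hops of a chosen node. The edge list and node names are hypothetical, the graph is assumed to be undirected and unweighted, and only integer radii are handled (a 1.5-degree network, which adds ties among the ego's neighbors, would need edge-level logic).

```python
from collections import defaultdict, deque

# Hypothetical undirected edge list, e.g. "user A replied to user B".
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e"), ("e", "f"), ("g", "h")]


def build_adjacency(edges):
    """Build an undirected adjacency list (node -> set of neighbors)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def ego_nodes(adj, ego, radius):
    """Return the nodes within `radius` hops of `ego` (ego included), via BFS."""
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)


adj = build_adjacency(EDGES)
degree = {node: len(neighbors) for node, neighbors in adj.items()}
avg_degree = sum(degree.values()) / len(degree)

# Fraction of all nodes covered by the 1- to 5-degree ego networks of node "a".
coverage = {k: len(ego_nodes(adj, "a", k)) / len(adj) for k in range(1, 6)}
```

In this toy graph the coverage levels off below 100% because nodes `g` and `h` sit in a separate component that no ego network of `a` can reach.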

Module 4 - Graph Mining in Social Media

  • Two main topics
    • Predicting links
    • Predicting attributes
  • Homophily
    • Preference mining
    • “Birds of a Feather” paper
  • NOTE TO SELF: Each task should have a task definition and evaluation
  • Friend recommendation/Link Prediction
    • Forbidden Triads
  • Geolocation
  • Bias and ethical questions
    • How might the different centrality metrics introduce bias?
    • What happens when we’re wrong about these inferences?
      • E.g., recommending an assailant to their victim, since many assaults occur between people who know each other
    • How might trust be impacted by these systems?
  • Homework questions
    • Pick a new dataset, extract the graph, calculate each centrality metric, and show the top 20 nodes ranked by each metric
      • Describe differences you see in these rankings and explain why they might be different
    • For your graph, apply two community detection algorithms
      • For each algorithm’s output, generate a visualization that shows these community assignments
    • For your graph, identify a random node.
      • Construct two lists of 10 suggested “friends”, and describe the strategy you used to construct each list
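
One possible way to approach the friend-suggestion question above is to rank friends-of-friends under two different scores. The sketch below uses common-neighbor counts and the Jaccard coefficient; these are only two of many possible strategies, and the edge list and node names are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected friendship edges; "hub" is a high-degree node.
EDGES = [
    ("a", "b"), ("a", "c"),
    ("hub", "b"), ("hub", "c"),
    ("hub", "x1"), ("hub", "x2"), ("hub", "x3"), ("hub", "x4"),
    ("e", "b"),
]


def build_adjacency(edges):
    """Build an undirected adjacency list (node -> set of neighbors)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, u, v):
    """Number of friends that u and v share."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared friends divided by the size of the combined friend sets."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def recommend(adj, node, score, k=10):
    """Rank friends-of-friends who are not already friends with `node`."""
    candidates = set()
    for friend in adj[node]:
        candidates |= adj[friend]
    candidates -= adj[node] | {node}
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(candidates, key=lambda c: (-score(adj, node, c), c))[:k]


adj = build_adjacency(EDGES)
by_common = recommend(adj, "a", common_neighbors)
by_jaccard = recommend(adj, "a", jaccard)
```

Here the two strategies disagree: common neighbors favors `hub`, which shares two friends with `a`, while Jaccard favors `e`, because it penalizes `hub` for its many unrelated ties. Differences like this are worth explaining when you describe your strategy.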

Module 5 - Content Mining in Social Media

  • Text Mining
    • Vectorization
      • TF-IDF
      • Embeddings
    • Language Models
    • Normalization
    • Emoji
    • Sentiment Analysis and Emotion
      • When does sentiment fail?
      • SemEval
    • Topic modeling
      • LDA
      • ATM
  • Digital Story Telling
    • Community characterization
    • Sentiment analysis in communities
    • Image Mosaics
  • Image Analysis
    • pHash
  • Bias and ethical questions
    • What happens when we’re wrong about these inferences?
      • Sentiment may be wrong?
    • Who owns likenesses in images?
  • Homework Questions
    • Pick a dataset
      • Extract top-20 hashtags from data
      • Extract top-level topics
      • Extract graph, and identify communities in that graph
        • For each community, show the top-20 hashtags
        • For each community, what is the distribution of topics?
        • Rank the communities by average sentiment
          • Averaged across text messages
          • Averaged across users
      • Extract a mosaic of images from data
        • Redo this for each community
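
The two averaging options in the sentiment-ranking question above can produce different community rankings. The sketch below illustrates this with made-up records; the community labels, user names, and sentiment scores are all hypothetical, and the scores are assumed to come from whatever sentiment tool you choose.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (community, user, sentiment score in [-1, 1]).
MESSAGES = [
    ("c1", "alice", 0.8), ("c1", "alice", 0.6), ("c1", "alice", 1.0),
    ("c1", "bob", -0.8),
    ("c2", "carol", 0.2), ("c2", "dave", 0.0),
]


def mean_by_message(messages):
    """Average sentiment per community, with every message weighted equally."""
    by_community = defaultdict(list)
    for community, _user, score in messages:
        by_community[community].append(score)
    return {c: mean(scores) for c, scores in by_community.items()}


def mean_by_user(messages):
    """Average sentiment per community, with every user weighted equally."""
    per_user = defaultdict(list)
    for community, user, score in messages:
        per_user[(community, user)].append(score)
    by_community = defaultdict(list)
    for (community, _user), scores in per_user.items():
        by_community[community].append(mean(scores))
    return {c: mean(user_means) for c, user_means in by_community.items()}


def rank(averages):
    """Communities ordered from most positive to most negative."""
    return sorted(averages, key=averages.get, reverse=True)


by_message = mean_by_message(MESSAGES)
by_user = mean_by_user(MESSAGES)
ranking_by_message = rank(by_message)
ranking_by_user = rank(by_user)
```

In this toy data one prolific, positive user lifts community `c1` above `c2` when averaging across messages, but the order flips when each user counts once.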

Module 6 - Temporal Dynamics in Social Media

  • Time series analysis
    • Event detection
    • Trending Topics
  • Generating graphs
    • Erdős-Rényi/Random graph
    • Watts-Strogatz Graph
    • Barabási-Albert (BA) graph/preferential attachment
  • Scale-Free Graphs
    • What does scale-free mean?
    • Scale-free vs. log-normal
  • Dynamic graphs
  • Propagation and Information cascades
    • Epidemiology models (SI, SIS, SIRS)
  • Bias questions
    • Again, what happens when we’re wrong in propagation?
  • Homework Questions
    • Pick a dataset
    • Identify top-k hashtags
      • Build time series of data
        • Tweets per hour/day/week
        • Posts including a particular hashtag per hour/day/week
      • Identify peaks in this time series data and provide an explanation about what is driving that peak
    • Extract a graph from your dataset and generate descriptive statistics on it
      • Number of nodes, average degree, degree distribution, diameter
      • Use at least three different graph generators to try to replicate your chosen graph’s statistics
    • Build a visualization of graph spread
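
As one illustration of the graph-generator question above, the sketch below generates an Erdős-Rényi G(n, p) random graph whose expected average degree matches a target value, then reports basic statistics. The observed node count and average degree are hypothetical placeholders for your own graph's values, diameter is omitted, and this covers only one of the three generators the homework asks for.

```python
import random
from collections import Counter
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    adj = {node: set() for node in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def describe(adj):
    """Basic descriptive statistics for an undirected adjacency list."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    return {
        "nodes": len(adj),
        "edges": sum(degrees) // 2,
        "avg_degree": sum(degrees) / len(adj),
        "max_degree": max(degrees),
        "degree_hist": dict(sorted(Counter(degrees).items())),
    }


# Hypothetical statistics measured on an observed social media graph.
N_OBSERVED = 200
AVG_DEGREE_OBSERVED = 6.0

# In G(n, p) the expected degree is p * (n - 1), so solve for p.
p = AVG_DEGREE_OBSERVED / (N_OBSERVED - 1)
stats = describe(erdos_renyi(N_OBSERVED, p, seed=42))
```

Matching the average degree does not mean the degree distribution matches: comparing `degree_hist` against your observed graph is a simple way to see where a random graph falls short and to motivate trying the Watts-Strogatz and preferential-attachment generators.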

Module 7 - Anti-Social Social Media

  • Health misinformation
    • Vaccine information
    • Eating disorders
    • Mental health/suicidal ideation
  • Online Harassment
  • Rumors during crises
  • Inauthenticity in Social Media
    • Bots and their detection
    • Sock puppets and Trolls
    • Rumors, Misinformation, and Disinformation
      • Political disinformation, astroturfing
  • Content moderation
  • Bias questions
    • What are the social responsibilities of platforms?
    • Who should be responsible for the information on platforms?
    • Who might be disproportionately impacted by anti-social behaviors online?

Module 8 - Economics of Social Media

  • Covered in Module 7
    • What behaviors do platforms incentivize?
    • Platform Affordances
      • “In Their Own Words” paper
  • Influencers and Content Creators
    • YouTube
    • Instagram
    • TikTok
  • Advertising
    • How are these platforms able to afford the technology they provide?
      • E.g., how expensive is YouTube’s infrastructure? What’s being monetized?
    • Influencers
  • Hunt Allcott’s paper on paying people to stay off social media
  • Social media platforms and news sources
    • Impact of changes to social media platforms’ recsys and news revenue/views
  • Bias questions
    • Do platforms have a pro- or anti-social impact on society?
    • Which affordances do you think are pro- or anti-social and why?
  • Homework Questions
    • Describe a new social media platform that would incentivize pro-social behaviors.
    • Describe a new social media platform that would incentivize anti-social behaviors.

Class Sessions

Session 1 - Tuesday, Sept. 1 - Module 1, Intro

Session 2 - Tuesday, Sept. 8 - Module 2, Collecting Social Media Data

Session 3 - Monday, Sept. 14 - Module 3, Graphs from Social Media

  • [Project] Proposal due, Sept 18

Session 4 - Monday, Sept. 21 - Module 3, cont.

  • Assignment 1 due

Session 5 - Monday, Sept. 28 - Module 4, graph mining

  • [Project] Data collection report due, Oct 2

Session 6 - Monday, Oct. 5 - Module 4, cont.

  • Assignment 2 due

Session 7 - Monday, Oct. 12 - Module 5, Content Mining in Social Media

Session 8 - Monday, Oct. 19 - Module 5, cont.

  • Assignment 3 due

Session 9 - Monday, Oct. 26 - Module 5, cont.

  • [Project] Intermediate Report due, Oct 30

Session 10 - Monday, Nov. 2 - Module 6, Temporal Dynamics in Social Media

  • Assignment 4 due

Session 11 - Monday, Nov. 9 - Module 6, cont.

Session 12 - Monday, Nov. 16 - Module 7, Anti-Social Social Media

  • Assignment 5 due

Session 13 - Monday, Nov. 30 - Module 8, Economics of Social Media

  • Assignment 6 due

Session 14 - Monday, Dec. 7 - Final project presentations

  • [Project] Virtual presentation due, Dec 7
  • [Project] Final report due, Dec 11

Assignments

Assignment 1 (Module 2): A 1000-word blog post on retrieving and analyzing social media data. You must collect this data yourself; you cannot use pre-existing or published data (Kaggle, data.world, etc.).

  • Motivate with a clear and compelling question that engages the reader, starting with a headline and introductory paragraph(s).
  • Describe the data that could answer this question, where it lives, and why it’s relevant.
  • Discuss what biases this data might have and possible ethical issues relevant to its use.
  • Explain how you use libraries like tweepy, praw, requests, BeautifulSoup, etc. to retrieve this data from the web.
  • Give examples of bugs you encountered and how you fixed them, how you cleaned up this data into a more usable format, etc.
  • Perform some exploratory and explanatory data analysis that answers your research question.
  • Conclude with a discussion of the limitations of your scraping approach, the data, your analysis, ethics, etc.

Assignment 2 (Module 3): A 1000-word blog post on extracting and visualizing networks from social media data and social networking platforms. You should include discussion of the overall network (degree distribution, diameter, etc.), influential nodes in the social graph, and communities in this network. You can use any of the datasets we’ve discussed in class or that you’ve already collected.

  • Motivate with a compelling question that engages the reader: headline, intro, etc.
  • Discuss the provenance of the data, what the nodes and edges in your network represent, and how you built this graph from your data.
  • Use NetworkX, Gephi, or NodeXL to analyze and visualize the data.
  • Perform some exploratory and explanatory data analysis that answers your research question.
  • Include well-designed and easy-to-interpret figures and tables summarizing your findings.
  • Conclude with a discussion of the limitations of your approach, the data, your analysis, ethics, etc.

Assignment 3 (Module 4): A 1000-word blog post that continues the post you wrote for Assignment 2 and discusses attributes you can infer about individuals in your network. Your post should identify one such attribute that is either incomplete or missing in the data, discuss how you implemented a method to infer this attribute from your graph, and describe potential biases and ethical issues in this process.

  • Motivate with a compelling question that engages the reader: headline, intro, etc.
  • Identify the missing or incomplete attribute in your network and what applications knowing it may enable.
  • Clearly describe your implementation for inferring this attribute, paying special attention to how you validate your results (i.e., how do you know your inference is likely correct).
  • Identify potential ethical issues in this inference (e.g., is your attribute a “protected” one, would someone disagree with how you characterized them on this factor, etc.).
  • Include well-designed and easy-to-interpret figures and tables summarizing your findings.
  • Conclude with a discussion of the limitations of your approach, the data, your analysis, ethics, etc.

Assignment 4 (Module 5): A 1000-word blog post describing an event and how individuals on a social media platform responded to it. You should combine both graph- and content-mining approaches and can use both quantitative and qualitative approaches to analyze your data. You must use a different dataset than the one you used in Assignments 2 and 3.

  • Motivate with a compelling question that engages the reader: headline, intro, etc.
  • Describe the event on which you are focusing and why it is interesting or important.
  • Explain how you retrieved and cleaned the data around this event (either through your own collection or referencing someone else’s collection).
  • Discuss how different groups of individuals responded to this event.
  • Include well-designed and easy-to-interpret figures and tables summarizing your findings.
  • Conclude with a discussion of the limitations of your scraping approach, the data, your analysis, ethics, etc.

Assignment 5 (Module 6): A 1000-word blog post exploring the temporal dynamics or evolution of a social media dataset. Examples include changes in sentiment over time, extracting and analyzing time series data, or visualizations of propagation/cascades.

  • Motivate with a compelling question that engages the reader: headline, intro, etc.
  • Discuss the provenance of the data, potential biases, and ethical concerns in its use.
  • Describe the temporal granularity on which your analysis focuses and why.
  • Perform some exploratory and explanatory data analysis that answers your research question.
  • Include well-designed and easy-to-interpret figures and tables summarizing your findings.
  • Conclude with a discussion of the limitations of your scraping approach, the data, your analysis, ethics, etc.

Assignment 6: A 500-word blog post describing the design of a new social media platform, its affordances, and the behaviors it incentivizes.

  • Motivate with a compelling question that engages the reader: headline, intro, etc. (Here, this means identifying what’s missing from existing social media platform offerings and how your platform addresses this gap.)
  • Discuss the affordances your platform offers, how its different stakeholders may use them, how individuals will interact with your platform, and what behaviors are likely to be incentivized and disincentivized by your design.
  • Describe the responsibilities you see your platform having in controlling (or not controlling) the content users post or their behaviors on your platform.
  • Include at least one mockup of a user interface for your new platform (could be a wire diagram, simple sketch drawing, etc.).