Class Overview

Seminar-style course that covers the design and impact of computer-based systems for human communication, including email and IM, discussion boards, Computer-Supported Cooperative Work (CSCW), Group Decision Support Systems (GDSS), and Social Networking Systems. Topics include alternative design structures, impacts of primarily text-based group communication, and recent empirical studies of virtual teams, online communities, and systems used for social networking, including 3-D worlds such as Second Life and “micro-blogging” systems such as Twitter.

Course Assignments

This course has individual module-level assignments in the form of Medium posts, posted to the course’s Medium Publication.

Course Objectives

At the completion of this course, students will be able to:

Extract graph structures from social media data
Implement and analyze standard algorithms for graph analysis, community detection, network visualization, and other social media mining tasks
Characterize graph construction and evolution using temporal network analysis, including statistical change detection, exponential random graph modeling, and stochastic actor-oriented modeling
Describe social and ethical considerations around the use of social media data for mining, analysis, and inference
Work with a team to design and execute a social media analysis project on data that is not already structured for the analysis task, and to compare and evaluate the design choices

Textbooks

???
And selected readings
http://www.socialmediamining.info/slides/

Class Projects

Dataset creation
Modeling and prediction
- Link prediction
Search engine
- E.g., Lucene
- Elasticsearch

Class Overview

Social media data
- Semistructured Data
  - JSON
    - Newline-delimited JSON
Collecting social media content
- APIs
  - Twitter
  - Reddit
  - YouTube
  - FB/Instagram
Social media data as graphs
- Constructing graphs from web data
- Adjacency lists
- Adjacency matrices
- Friend/follower graphs
- Interaction graphs
- Visualizing graphs
Graph analysis
- Centrality metrics
- PageRank
- HITS
Clustering
- k-Means
- Spectral clustering
- Hierarchical Clustering
Community detection
- Graph partitioning
- Modularity
- Girvan-Newman
Graph Mining
- Friend-of-friend recommendation
- Preference mining
- Collaborative filtering
Text mining
- Sentiment analysis
Geolocation
- From digital trace data
Image mining
- pHash
- pre-trained image models
Predicting Phenomena in Social Media
- Social Ties and Link Prediction
  - Friend recommendation
  - Trust inference
- Health forecasting (Google Flu)
- Public perception/Election prediction
Online Diffusion and Contagion
- Information propagation
- Emotional contagion
Cross-platform alignment
- Cold- vs. Warm-Start for recommendation
Social media as streams
- Stream processing
- Computing Metrics for Streams
Inauthenticity in Social Media
- Bots and their detection
- Sock puppets and Trolls
- Rumors, Misinformation, and Disinformation
  - Political disinformation, astroturfing
  - Rumors during crises
Ethical considerations in social media data
- Privacy concerns
  - Predicting private attributes
- Content Creators’ Agency (deleted content)
- Bias and representative data
  - Population bias
  - System bias (e.g., racial bias in dating apps)
- Impacts of social media
Social responsibility of platforms

Modules

What?
Why do people use it?
What benefits and hazards does it have?
Course Logistics
- All readings should be reviewed prior to watching the lecture videos. I will assume you are familiar with them beforehand.
- Writing a Medium post
  - Sign up:
    - https://help.medium.com/hc/en-us/articles/115004915268
    - Send me your account info, so I can add you as a collaborator to the class’s publication
  - https://help.medium.com/hc/en-us/categories/200058025-Writing
  - https://help.medium.com/hc/en-us/articles/225168768-Write-a-post
  - https://blog.medium.com/best-practices-for-writing-on-medium-386506ae62b9
  - Where to find media:
    - https://unsplash.com/
    - WikiMedia Commons
- Potential ethical issues on which we can present

Ethnographic and Qualitative Analysis
Computer-Assisted Quantitative Analysis
- Large-scale text, graph, and image analysis
Data Collection and APIs
- RESTful APIs and JSON
  - Twitter
  - Reddit
  - YouTube
  - FB/Instagram
Bias and ethical questions
- What sort of data might we miss?
- Who might be oversampled?
- What regions might we miss/oversample?
- Are users okay with having their data collected?
- Is data collection allowed (GDPR, ToS)?
Homework questions
- Download data from a social media platform
- Define and extract simple statistics from your data (top-k…)
  - Most popular users
  - Most frequent hashtags
  - Most frequent domains
  - Most frequent words
- Build time series of data
  - Tweets per hour/day/week
  - Posts including a particular hashtag per hour/day/week
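
The top-k and time-series homework items above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed solution: the sample posts and the field names `created_at` and `text` are assumptions, and real platform exports will use their own schemas.

```python
import json
import re
from collections import Counter
from datetime import datetime

# Hypothetical newline-delimited JSON: one post per line.
# The field names "created_at" and "text" are assumptions for this sketch.
RAW = """\
{"created_at": "2020-09-01T10:05:00", "text": "Kicking off #sna with #python"}
{"created_at": "2020-09-01T10:40:00", "text": "More #python tips"}
{"created_at": "2020-09-01T11:15:00", "text": "Graphs everywhere #sna #Python"}
"""

HASHTAG = re.compile(r"#(\w+)")


def load_posts(ndjson_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]


def top_k_hashtags(posts, k):
    """Count case-insensitive hashtags and return the k most frequent."""
    counts = Counter()
    for post in posts:
        counts.update(tag.lower() for tag in HASHTAG.findall(post["text"]))
    return counts.most_common(k)


def posts_per_hour(posts):
    """Bucket posts by the hour in which they were created."""
    buckets = Counter()
    for post in posts:
        ts = datetime.fromisoformat(post["created_at"])
        buckets[ts.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(buckets.items()))


posts = load_posts(RAW)
print(top_k_hashtags(posts, 2))   # [('python', 3), ('sna', 2)]
print(posts_per_hour(posts))      # {'2020-09-01 10:00': 2, '2020-09-01 11:00': 1}
```

Changing the `strftime` format string is enough to bucket by day or week instead of by hour.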

Social media data as graphs
- Constructing graphs from social media
  - Friend/follower graphs
  - Interaction graphs
Graph types
- Directed and undirected
- Weighted and unweighted
- Non-negative
- Bipartite graphs
Graph representation
- Adjacency lists
- Adjacency matrices
- Node/Edge lists
Visualizing graphs
- NetworkX
- Gephi
Graph metrics
- Degree
- Degree distribution
- Path length
- Diameter
Subgraphs
- Egocentric networks
  - 1-degree, 1.5-degree, 2-degree, etc.
Measuring Influence
- Centrality metrics
  - Degree
  - Betweenness
  - Closeness
  - Eigenvector
- PageRank
- HITS
Community Detection
- Graph partitioning
- Modularity
- Girvan-Newman
- Louvain
Bias and ethical questions
- Do all ties mean the same thing? (weak vs. strong)
- Do people have the same number of connections?
Homework questions
- Extract a graph from a provided dataset
  - What type of graph did you extract?
  - Does your graph have weights?
  - Provide stats on the graph
- Construct a visualization of the graph
  - Explain an important observation you want your visualization to convey
- Pick a random node in your graph and answer the following:
  - What is this node’s degree? Is this degree higher or lower than the average degree?
  - Construct 5 egocentric networks, from the 1-degree to the 5-degree network
    - Build a plot of the percentage of the graph’s nodes that fall within each ego network
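
As a rough sketch of the ego-network exercise above, the following builds an adjacency list from an edge list and uses breadth-first search to collect the nodes within a given number of hops. The edge list and user names are invented for illustration, and the helper names are hypothetical. The function returns node sets only, so it does not distinguish a 1-degree from a 1.5-degree network (which differ in the edges they keep).

```python
from collections import defaultdict, deque

# Hypothetical undirected interaction edges (e.g., user A replied to user B).
EDGES = [("ana", "bo"), ("bo", "cy"), ("cy", "di"), ("di", "ed"), ("bo", "di")]


def build_adjacency(edges):
    """Turn an undirected edge list into an adjacency list of neighbor sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def ego_nodes(adj, ego, radius):
    """Nodes within `radius` hops of `ego` (including `ego`), found by BFS."""
    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the requested radius
        for nbr in adj[node]:
            if nbr not in distance:
                distance[nbr] = distance[node] + 1
                queue.append(nbr)
    return set(distance)


adj = build_adjacency(EDGES)
avg_degree = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
coverage = {r: len(ego_nodes(adj, "ana", r)) / len(adj) for r in range(1, 4)}

print(len(adj["ana"]), avg_degree)  # 1 2.0
print(coverage)                     # {1: 0.4, 2: 0.8, 3: 1.0}
```

Plotting `coverage` against the radius gives the curve the exercise asks for.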

Two main topics
- Predicting links
- Predicting attributes
Homophily
- Preference mining
- “Birds of a Feather” paper
NOTE TO SELF: Each task should have a task definition and evaluation
Friend recommendation/Link Prediction
- Forbidden Triads
Geolocation
Bias and ethical questions
- How might the different centrality metrics introduce bias?
- What happens when we’re wrong about these inferences?
  - E.g., recommending an assailant to their victim, since many assaults occur between people who know each other
- How might trust be impacted by these systems?
Homework questions
- Pick a new dataset, extract the graph, calculate each centrality metric, and show the top 20 nodes ranked by each metric
  - Describe differences you see in these rankings and explain why they might be different
- For your graph, apply two community detection algorithms
  - For each algorithm’s output, generate a visualization that shows these community assignments
- For your graph, identify a random node.
  - Construct two lists of 10 suggested “friends”, and describe the strategy you used to construct each list
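
One possible starting point for the centrality and friend-suggestion items above is sketched below. It is a minimal example under stated assumptions: the friendship graph is invented, and ranking candidates by shared neighbors (friend-of-friend) is just one strategy among many that could be used to build a suggestion list.

```python
from collections import Counter

# Hypothetical undirected friendship graph as an adjacency list.
ADJ = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "cy", "di"},
    "cy": {"ana", "bo", "di"},
    "di": {"bo", "cy", "ed"},
    "ed": {"di"},
}


def degree_centrality(adj):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def recommend_friends(adj, user, k):
    """Rank non-neighbors of `user` by the number of shared neighbors."""
    scores = Counter()
    for friend in adj[user]:
        for candidate in adj[friend]:
            if candidate != user and candidate not in adj[user]:
                scores[candidate] += 1
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


print(degree_centrality(ADJ)["bo"])       # 0.75
print(recommend_friends(ADJ, "ana", 10))  # [('di', 2)]
print(recommend_friends(ADJ, "ed", 10))   # [('bo', 1), ('cy', 1)]
```

A second suggestion list could swap the scoring rule, for example weighting each shared neighbor by the inverse of its degree, and then compare the two rankings.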

Text Mining
- Vectorization
  - TF-IDF
  - Embeddings
- Language Models
- Normalization
- Emoji
- Sentiment Analysis and Emotion
  - When does sentiment fail?
  - SemEval
- Topic modeling
  - LDA
  - ATM
Digital Story Telling
- Community characterization
- Sentiment analysis in communities
- Image Mosaics
Image Analysis
- pHash
Bias and ethical questions
- What happens when we’re wrong about these inferences?
  - Sentiment may be wrong?
- Who owns likenesses in images?
Homework Questions
- Pick a dataset
  - Extract top-20 hashtags from data
  - Extract top-level topics
  - Extract graph, and identify communities in that graph
    - For each community, show the top-20 hashtags
    - For each community, what is the distribution of topics?
    - Rank the communities by average sentiment
      - Averaged across text messages
      - Averaged across users
  - Extract a mosaic of images from data
    - Redo this for each community
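
The two averaging schemes in the sentiment-ranking item above (across messages versus across users) can give different answers when a few users post far more than others. The sketch below illustrates the difference. The lexicon, messages, and community labels are all made up, and a word-count lexicon is only a stand-in for whatever sentiment method is actually used.

```python
from collections import defaultdict
from statistics import mean

# Toy lexicon and messages; both are invented for illustration.
LEXICON = {"love": 1, "great": 1, "hate": -1, "awful": -1}

MESSAGES = [
    {"user": "ana", "community": "A", "text": "love this class"},
    {"user": "ana", "community": "A", "text": "great reading"},
    {"user": "ana", "community": "A", "text": "love the project"},
    {"user": "bo", "community": "A", "text": "awful traffic"},
    {"user": "cy", "community": "B", "text": "great game"},
    {"user": "di", "community": "B", "text": "great day"},
]


def score(text):
    """Sum lexicon values over whitespace-separated, lowercased words."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def by_message(messages):
    """Average sentiment per community, every message weighted equally."""
    per_comm = defaultdict(list)
    for m in messages:
        per_comm[m["community"]].append(score(m["text"]))
    return {comm: mean(scores) for comm, scores in per_comm.items()}


def by_user(messages):
    """Average sentiment per community, every user weighted equally."""
    per_user = defaultdict(list)
    for m in messages:
        per_user[(m["community"], m["user"])].append(score(m["text"]))
    per_comm = defaultdict(list)
    for (comm, _user), scores in per_user.items():
        per_comm[comm].append(mean(scores))
    return {comm: mean(scores) for comm, scores in per_comm.items()}


print(by_message(MESSAGES))  # {'A': 0.5, 'B': 1}
print(by_user(MESSAGES))     # {'A': 0, 'B': 1}
```

Community A looks mildly positive when averaged across messages only because one prolific user posts three positive messages. Averaged across users, it is neutral.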

Time series analysis
- Event detection
- Trending Topics
Generating graphs
- Erdős-Rényi/random graph
- Watts-Strogatz graph
- Barabási-Albert (BA) graph/preferential attachment
Scale-Free Graphs
- What does scale-free mean?
- Scale-free vs. log-normal
Dynamic graphs
Propagation and Information cascades
- Epidemiology models (SI, SIS, SIRS)
Bias questions
- Again, what happens when we’re wrong in propagation?
Homework Questions
- Pick a dataset
- Identify top-k hashtags
  - Build time series of data
    - Tweets per hour/day/week
    - Posts including a particular hashtag per hour/day/week
  - Identify peaks in this time series data and provide an explanation about what is driving that peak
  - Generate descriptive statistics on this graph
    - Number of nodes, average degree, degree distribution, diameter
  - Use at least three different graph generators to try and replicate your chosen graph’s statistics
- Build a visualization of graph spread
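
For the graph-generator item above, a minimal Erdős-Rényi G(n, p) generator can be written in a few lines. This sketch uses only the standard library, and the helper names are hypothetical. It picks p so that the expected average degree, p(n - 1), matches a target taken from an observed graph. Other statistics, such as the degree distribution, will generally not match, which is the point of comparing several generators.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    adj = {node: set() for node in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_degree(adj):
    """Mean number of neighbors per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def match_density(n, target_avg_degree):
    """Choose p so the expected average degree, p * (n - 1), hits the target."""
    return target_avg_degree / (n - 1)


# Example: imitate an observed graph with 101 nodes and average degree 4.
p = match_density(101, 4)
synthetic = erdos_renyi(101, p, seed=42)
print(p)                          # 0.04
print(average_degree(synthetic))  # close to 4, varies with the seed
```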

Health misinformation
- Vaccine information
- Eating disorders
- Mental health/suicidal ideation
Online Harassment
Rumors during crises
Inauthenticity in Social Media
- Bots and their detection
- Sock puppets and Trolls
- Rumors, Misinformation, and Disinformation
  - Political disinformation, astroturfing
Content moderation
Bias questions
- What are the social responsibilities of platforms?
- Who should be responsible for the information on platforms?
- Who might be disproportionately impacted by anti-social behaviors online?

Covered in Module 7
- What behaviors do platforms incentivize?
- Platform Affordances
  - “In Their Own Words” paper
Influencers and Content Creators
- YouTube
- Instagram
- TikTok
Advertising
- How are these platforms made available/able to provide the technology they provide?
  - E.g., how expensive is YouTube’s infrastructure? What’s being monetized?
- Influencers
Hunt Allcott’s paper on paying people to stay off social media
Social media platforms and news sources
- Impact of changes to social media platforms’ recsys and news revenue/views
Bias questions
- Do platforms have a pro- or anti-social impact on society?
- Which affordances do you think are pro- or anti-social and why?
Homework Questions
- Describe a new social media platform that would incentivize pro-social behaviors.
- Describe a new social media platform that would incentivize anti-social behaviors.

Class Sessions

Session 1 - Tuesday, Sept. 1 - Module 1, Intro

[Project] Proposal due, Sept 18

Session 4 - Monday, Sept. 21 - Module 3, cont.

Assignment 1 due

Session 5 - Monday, Sept. 28 - Module 4, graph mining

[Project] Data collection report due, Oct 2

Session 6 - Monday, Oct. 5 - Module 4, cont.

Assignment 2 due

Session 8 - Monday, Oct. 19 - Module 5, cont.

Assignment 3 due

Session 9 - Monday, Oct. 26 - Module 5, cont.

[Project] Intermediate Report due, Oct 30

Assignment 4 due

Session 11 - Monday, Nov. 9 - Module 6, cont.

Assignment 5 due

Assignment 6 due

Session 14 - Monday, Dec. 7 - Final project presentations

[Project] Virtual presentation due, Dec 7
[Project] Final report due, Dec 11

Assignments

- Motivate with a clear and compelling question that engages the reader, starting with a headline and introductory paragraph(s)
- Describe the data that could answer this question, where it lives, and why it’s relevant
- Discuss what biases this data might have and possible ethical issues relevant to its use
- Explain how you use libraries like tweepy, praw, requests, BeautifulSoup, etc. to retrieve this data from the web
- Give examples of bugs you encountered and how you fixed them, how you cleaned up this data into a more usable format, etc.
- Perform some exploratory and explanatory data analysis that answers your research question
- Conclude with a discussion of the limitations of your scraping approach, the data, your analysis, ethics, etc.

- Motivate with a compelling question that engages the reader: headline, intro, etc.
- Discuss the provenance of the data, what the nodes and edges in your network represent, and how you built this graph from your data
- Use NetworkX, Gephi, or NodeXL to analyze and visualize the data
- Perform some exploratory and explanatory data analysis that answers your research question
- Include well-designed and easy-to-interpret figures and tables summarizing your findings
- Conclude with a discussion of the limitations of your approach, the data, your analysis, ethics, etc.

Assignment 3 (Module 4): “A 1000-word blog post that continues the post you wrote for Assignment 2 and discusses attributes you can infer about individuals in your network. Your post should identify one such attribute that is either incomplete or missing in the data, discuss how you implemented a method to infer this attribute from your graph, and describe potential biases and ethical issues in this process.”

- Motivate with a compelling question that engages the reader: headline, intro, etc.
- Identify the missing or incomplete attribute in your network and what applications knowing it may enable
- Clearly describe your implementation for inferring this attribute, paying special attention to how you validate your results (i.e., how do you know your inference is likely correct)
- Identify potential ethical issues in this inference (e.g., is your attribute a “protected” one, would someone disagree with how you characterized them on this factor, etc.)
- Include well-designed and easy-to-interpret figures and tables summarizing your findings
- Conclude with a discussion of the limitations of your approach, the data, your analysis, ethics, etc.

- Motivate with a compelling question that engages the reader: headline, intro, etc.
- Describe the event on which you are focusing and why it is interesting or important
- Explain how you retrieved and cleaned the data around this event (either through your own collection or by referencing someone else’s collection)
- Discuss how different groups of individuals responded to this event
- Include well-designed and easy-to-interpret figures and tables summarizing your findings
- Conclude with a discussion of the limitations of your scraping approach, the data, your analysis, ethics, etc.

- Motivate with a compelling question that engages the reader: headline, intro, etc.
- Discuss the provenance of the data, potential biases, and ethical concerns in its use
- Describe the temporal granularity on which your analysis focuses and why
- Perform some exploratory and explanatory data analysis that answers your research question
- Include well-designed and easy-to-interpret figures and tables summarizing your findings
- Conclude with a discussion of the limitations of your scraping approach, the data, your analysis, ethics, etc.

- Motivate with a compelling question that engages the reader: headline, intro, etc. (Here, this means identifying what’s missing from existing social media platform offerings and how your platform addresses this gap)
- Discuss the affordances your platform offers, how its different stakeholders may use them, how individuals will interact with your platform, and what behaviors are likely to be incentivized and de-incentivized by your design
- Describe the responsibilities you see your platform having in controlling (or not controlling) the content users post or their behaviors on your platform
- Include at least one mockup of a user interface for your new platform (could be a wire diagram, simple sketch drawing, etc.)

Modules

Module 1 - What is Social Media and Why Do We Care?

Module 2 - Collecting and Analyzing Social Media Data

Module 3 - Building Graphs from Social Media

Module 4 - Graph Mining in Social Media

Module 5 - Content Mining in Social Media

Module 6 - Temporal Dynamics in Social Media

Module 7 - Anti-Social Social Media

Module 8 - Economics of Social Media

Class Sessions

Session 1 - Tuesday, Sept. 1 - Module 1, Intro

Session 2 - Tuesday, Sept. 8 - Module 2, Collecting Social Media Data

Session 3 - Monday, Sept. 14 - Module 3, Graphs from Social Media

Session 4 - Monday, Sept. 21 - Module 3, cont.

Session 5 - Monday, Sept. 28 - Module 4, graph mining

Session 6 - Monday, Oct. 5 - Module 4, cont.

Session 7 - Monday, Oct. 12 - Module 5, Content Mining in Social Media

Session 8 - Monday, Oct. 19 - Module 5, cont.

Session 9 - Monday, Oct. 26 - Module 5, cont.

Session 10 - Monday, Nov. 2 - Module 6, Temporal Dynamics in Social Media

Session 11 - Monday, Nov. 9 - Module 6, cont.

Session 12 - Monday, Nov. 16 - Module 7, Anti-Social Social Media

Session 13 - Monday, Nov. 30 - Module 8, Economics of Social Media

Session 14 - Monday, Dec. 7 - Final project presentations

Assignments

Assignment 1 (Module 2): “A 1000-word blog post on retrieving and analyzing social media data. You must collect this data yourself; you cannot use pre-existing or published data (Kaggle, data.world, etc.).”

Assignment 5 (Module 6): “A 1000-word blog post exploring the temporal dynamics or evolution of a social media dataset. Examples include changes in sentiment over time, extracting and analyzing time series data, or visualizations of propagation/cascades.”

Assignment 6: “A 500-word blog post describing the design of a new social media platform, its affordances, and the behaviors it incentivizes”