Class Overview

Web mining aims to discover useful information and knowledge from Web hyperlink structure, page content, and usage logs. It has direct applications in e-commerce, Web analytics, information retrieval and filtering, personalization, and recommender systems. People knowledgeable about web mining techniques and their applications are highly sought by major web companies such as Google, Amazon, Yahoo, MSN, and others that need to understand user behavior and use patterns discovered in terabytes of user profile data to design more intelligent applications.

The primary focus of this course is on web usage mining and its applications to business intelligence and search domains. You will learn techniques from machine learning, data mining, text mining, and databases to extract useful knowledge from the web and other unstructured or semi-structured, hypertextual, distributed information repositories. The extracted knowledge can be used for site management, automatic personalization, recommendation, and user profiling. Topics covered include crawling, indexing, ranking and filtering algorithms using text and link analysis, and applications to search, classification, tracking, monitoring, and Web intelligence. Programming assignments give hands-on experience. A group project highlights class topics.

Course Objectives

At the completion of this course, students will be able to:

Compare, contrast, and collect static web content/structure/usage data and data streams.
Convert un- and semi-structured data into an abstract data representation such as a vector, a set, or a matrix, with modeling considerations, for use in downstream data analysis.
Implement and analyze standard data mining algorithms for clustering, dimensionality reduction, regularized regression, graph analysis, and locality sensitive hashing.
Understand, discuss, and evaluate advanced data mining algorithms for clustering, dimensionality reduction, regularized regression, graph analysis, locality sensitive hashing, and managing noisy data.
Work with a team to design and execute a multi-faceted data mining project on data that is not already structured for the analysis task, and to compare and evaluate the design choices.
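
As a rough illustration of the data-representation objective above, the following sketch turns raw text into a set of tokens and a term-count matrix. The tokenizer rule, the function names (`tokenize`, `to_set`, `to_matrix`), and the sample documents are assumptions made for this example only; they are not taken from the course materials or the textbook.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_set(text):
    """Represent a document as the set of distinct tokens it contains."""
    return set(tokenize(text))


def to_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    # The vocabulary is the sorted union of every token seen in any document.
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Hypothetical sample documents, used only for illustration.
docs = ["Web mining mines the Web", "Mining usage logs"]
vocab, matrix = to_matrix(docs)
print(vocab)
print(matrix)
print(to_set(docs[1]))
```

Each row of the matrix is a bag-of-words vector for one document. The set form discards counts, which is the trade-off to weigh when choosing a representation for a downstream task.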

Course Assignments

This course has individual module-level assignments in the form of Medium posts, posted to the course’s Medium Publication.

Textbooks

Mining Massive Datasets
- July 2019 edition
- Available free here

Module Overview

Mining Web Data and Motivations [ch1/ch6]
Web Data as Graphs [ch5/ch10]
Measuring Similarity in Web Content [ch3]
Hashing and Dimensionality Reduction [ch11]
Clustering [ch7]
Recommendation Systems [ch9]
Mining Data Streams [ch4]
Computational Advertising [ch8]

Module Breakdown

Web data and web logs [ch6?]
- Frequent itemsets as a motivating problem
- Types of data
  - Structured Data
    - XML
    - Databases
  - Semistructured Data
    - JSON
      - Newline-delimited JSON
  - Web logs
- Collecting web content
  - APIs
  - Crawlers
  - Web text processing
    - XPath
    - BeautifulSoup
Web data as graphs [ch5/ch10]
- Constructing graphs from web data
  - Adjacency lists
  - Adjacency matrices
  - Visualizing graphs
- Graph analysis
  - Centrality metrics
  - PageRank
  - HITS
- Community detection
  - Graph partitioning
  - Modularity
  - Girvan-Newman
Vectorization and Similarity [???]
- Graph vectorization
- Text vectorization
  - Bag of words
  - TF-IDF
- Similarity metrics
  - L_p distance
  - Cosine
  - Jaccard
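
The three similarity metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference implementation: the function names are hypothetical, and returning 0.0 for a zero vector and 1.0 for two empty sets are conventions assumed here.

```python
import math


def lp_distance(x, y, p=2):
    """L_p distance between two equal-length numeric vectors (p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros (assumed convention)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


print(lp_distance([0, 0], [3, 4]))            # Euclidean distance
print(lp_distance([0, 0], [3, 4], p=1))       # Manhattan distance
print(cosine_similarity([1, 2], [2, 4]))      # parallel vectors
print(jaccard_similarity({"a", "b"}, {"b", "c"}))
```

L_p distance and cosine similarity operate on vectors such as bag-of-words or TF-IDF rows, while Jaccard similarity operates on sets such as the distinct tokens of a page.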
Hashing and Dimensionality Reduction [ch11]
- LSH
- MinHashing(?)
- PCA/SVD
- pHash
- Topic Modeling as a latent space
- Embeddings
  - word2vec
  - node2vec
Clustering [ch7]
- k-Means
- Spectral clustering
- Hierarchical Clustering
Recommendation Systems [ch9]
- Opinion mining
  - Sentiment analysis
- Content-based recommendation
- Association Rules
- Friend-of-friend
- Collaborative filtering
Web data as streams [ch4]
- Stream processing
- Computing Metrics for Streams
Computational Advertising [ch8]