(INST414) Data Science Techniques

This course explores the application of data science techniques to unstructured, real-world datasets including social media and open data sources. Students are introduced to basic concepts for data science, including network analysis, unsupervised and supervised learning, model evaluation, and data hygiene. Students write six public blog posts to demonstrate their understanding of a problem and communicate how they solve this problem for a particular stakeholder. The course culminates in a final project that applies multiple data science techniques to a specific decision problem faced by a particular stakeholder.

Class Overview

This course explores the application of data science techniques to unstructured, real-world datasets including social media and open data sources. The course focuses on techniques and approaches for extracting insights from large-scale datasets that are relevant to experts and non-experts in a wide range of areas, including smart cities, transportation, and public safety. The course covers the complete analytical funnel, from data extraction and cleaning to data analysis, interpretation, and visualization. The data analysis component focuses on supervised and unsupervised learning techniques, including clustering, classification, and regression. Through homework assignments, a project, exams, and in-class activities, students will practice working with these techniques and tools to extract relevant information from structured and unstructured data.

Course Objectives

At the completion of this course, students will be able to:

  1. Collect and clean large-scale datasets.
  2. Articulate the math behind supervised and unsupervised techniques.
  3. Execute supervised and unsupervised machine learning techniques.
  4. Select and evaluate various types of machine learning techniques.
  5. Explain the results coming out of the models.
  6. Critically evaluate the accuracy of different algorithms and the appropriateness of a given approach.

Course Assignments

This course has individual module-level assignments in the form of Medium posts, posted to the course’s Medium Publication.

Textbooks

The textbooks below provide useful background and reference material. Both are freely available to UMD students:

  • Introduction to Machine Learning with Python: A Guide for Data Scientists (IMLP) by Andreas C. Müller and Sarah Guido, ebook available at UMD library
  • Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas, available at: https://jakevdp.github.io/PythonDataScienceHandbook/

Software

Jupyter notebooks written in Python 3 will be used for all in-class examples and assignments. The Anaconda distribution of Python 3 is strongly recommended, since it provides Jupyter along with the other libraries used in the course. Students who wish to use an alternative data analysis environment (R, Matlab, Julia, etc.) are welcome to do so, but instructional support is only guaranteed for Python.

Jupyter also provides a ready-made Docker container for data science-style notebooks, available here: https://jupyter-docker-stacks.readthedocs.io/

Module Overview

  • Module 1 Data Science and Motivations
  • Module 2 Web Data as Graphs
  • Module 3 Similarity, Dimensionality Reduction, and Cleaning
  • Module 4 Clustering and Unsupervised Learning
  • Module 5 Probability and Bayes’ Theorem
  • Module 6 Supervised Machine Learning
  • Module 7 Evaluating Your Models

Grade Distribution

Grades for this class are broken down as follows:

  • Module Assignments: Students will complete independent projects for each module, exercising the skills learned. These assignments will be submitted in the form of Medium posts, which the professor will aggregate for the class – 35%

  • In-Class Labs/Quizzes: This course includes quizzes and time for students to work collectively on lab assignments in class, to practice skills from each module. These lab periods occur weekly, with output submitted via ELMS – 25%

  • Final Project: Over the semester, students will develop a project that integrates skills from across the course and applies them to a specific problem, and will present a final report on this project – 30%

  • Participation: Asking questions, participating in online discussion – 10%
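
As a rough illustration of how these weights combine into a final grade, the arithmetic can be sketched in Python. The weights come from the list above; the 0-100 scoring scale, the function name weighted_grade, and the example scores are assumptions made for illustration only, not part of the official grading policy.

```python
# Component weights from the grade distribution above (fractions of the final grade).
WEIGHTS = {
    "module_assignments": 0.35,
    "labs_quizzes": 0.25,
    "final_project": 0.30,
    "participation": 0.10,
}


def weighted_grade(scores):
    """Combine per-component scores (assumed to be on a 0-100 scale) into a weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_scores = {
    "module_assignments": 90,
    "labs_quizzes": 80,
    "final_project": 85,
    "participation": 100,
}
example_total = weighted_grade(example_scores)
print(round(example_total, 2))
```

With these hypothetical scores, the components contribute 31.5, 20, 25.5, and 10 points respectively, for a weighted total of 87.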