“Data Science for Global Applications”

This course material was originally prepared by Anna Haensch and Karin Knudson at the Fletcher Graduate School of Global Affairs at Tufts University. We describe some of our motivations and methodology in the 2022 SciPy Conference Proceedings, in “Python for Global Applications: teaching scientific Python in context to law and diplomacy students”.

Course Outline

This course is designed for a 14-week semester with two 80-minute class meetings per week. Here we present the main topics and assessment tools of each module.

Weekly Planner

In what follows we provide a week-by-week plan including learning objectives, exercises, and suggested readings.

Module I: Getting and Cleaning Data

Week 1

Learning Objectives:
- Students will execute their first Python code in Google Colab.
- Students will understand the difference between the Python variable types.
In-Class Exercises:
- Workbook 1: Intro to Python Basics
- Workbook 2: Lists and For Loops
Suggested Reading:
- A. Boyd et al., “Data Diplomacy” (2019).
- d. boyd & K. Crawford, “Critical Questions for Big Data.” (2012).
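
As a minimal sketch of the two Week 1 objectives (it is not taken from the course workbooks), the following shows Python's basic variable types and a for loop over a list. The variable names and values are made up purely for illustration.

```python
# Hypothetical values chosen only to illustrate the built-in types.
country = "Kenya"            # str: text
population_millions = 54.0   # float: number with a decimal part
year = 2021                  # int: whole number
is_un_member = True          # bool: True or False

values = [country, population_millions, year, is_un_member]

# Use a for loop to collect the type name of each value.
type_names = []
for value in values:
    type_names.append(type(value).__name__)

print(type_names)
```

Running the cell in Google Colab prints one type name per value, which makes the distinction between text, whole numbers, decimals, and booleans concrete.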

Week 2

Learning Objectives:
- Students will critique the assumptions and decisions that go into data acquisition and analysis.
- Students will load and inspect data using pandas.
- Students will execute for loops in Python.
In-Class Exercises:
- Workbook 3: Getting Started with Dataframes
- Workbook 4: Loading and Exploring Data
Suggested Reading:
- R. Chambers, “Editorial: Responsible Well-Being - A Personal Agenda for Development” (1997).
- C. D’Ignazio & L. Klein, “6. The Numbers Don’t Speak for Themselves.” In Data Feminism (2020).

Week 3

Learning Objectives:
- Students will contextualize data and the absence of data within social, political, and economic contexts.
- Students will experiment with methods to mitigate the “missingness” of data and justify decisions regarding the same.
- Students will manipulate and explore data with Python and pandas.
In-Class Exercises:
- Workbook 5: Dealing with Missing Data
Suggested Reading:
- C. D’Ignazio & L. Klein, “4. What Gets Counted Counts.” In Data Feminism (2020).
- H. Krause, “Data Biographies: Getting to Know Your Data” (2017).
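
To make the Week 3 objective about mitigating missingness concrete, here is a small sketch in plain Python that compares two common choices: dropping missing observations and filling them with the mean. The data and the function names `drop_missing` and `fill_with_mean` are hypothetical and are not drawn from Workbook 5.

```python
# Hypothetical yearly readings; None marks a missing observation.
readings = [12.0, None, 15.0, 18.0, None]

def drop_missing(values):
    """Return only the observed values (deletion of missing entries)."""
    return [v for v in values if v is not None]

def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = drop_missing(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

dropped = drop_missing(readings)
filled = fill_with_mean(readings)

print(dropped)
print(filled)
```

Dropping shrinks the sample, while mean filling keeps its size but flattens variation, so either choice needs to be justified against the context of the data. In pandas, `DataFrame.dropna` and `DataFrame.fillna` perform the corresponding operations.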

Week 4

Learning Objectives:
- Students will set up their local Python computing environment.
- Students will execute commands at the command line.
- Students will transition to using Python locally in JupyterLab.
In-Class Exercises:
- Workbook 6: Missing Sequential Data
Suggested Reading:
- T. Gillespie, “The Relevance of Algorithms.” (2013).

Module II: Visualizing Data

Week 5

Learning Objectives:
- Students will understand the fundamentals of data visualization, including the principle of proportional ink and chart junk.
- Students will identify and discuss effective and ineffective uses of these principles in real world charts.
- Students will have familiarity with the components of Matplotlib plots.
In-Class Exercises:
- Workbook 7: Intro to Plotting
Suggested Reading:
- C.O. Wilke, “Fundamentals of Data Visualization, Part I.” (2019).

Week 6

Learning Objectives:
- Students will understand the importance of color in data visualization.
- Students will differentiate between qualitative and sequential color scales and identify use cases for each.
- Students will become familiar with mapping tools in GeoPandas.
In-Class Exercises:
- Workbook 8: Mapping and Geocoding
Suggested Reading:
- C. Weaver, “The Unmitigated Chaos of America’s Attempt at Color-Coded Covid Guidance” (2021).

Week 7

Learning Objectives:
- Students will recognize the importance and impact of labels and annotation in plotting.
- Students will identify and discuss effective and ineffective uses of annotation in real world charts.
- Students will manipulate labels and annotation on existing plots.
- Students will become familiar with less common plot classes such as radial plots and heatmaps.
In-Class Exercises:
- Workbook 9: Other Ways to Plot
Suggested Reading:
- C.O. Wilke, “Fundamentals of Data Visualization, Part II.” (2019).

Week 8

Learning Objectives:
- Students will demonstrate an understanding of the use cases for various plot types by selecting a plot type for their own data.
- Students will execute a full Python visualization pipeline, from ingesting the data to producing a figure.
Suggested Reading:
- A. Cairo, “How Charts Lie, Ch. 1 & 2.” (2019).

Module III: Modeling Data

Week 9

Learning Objectives:
- Students will compute descriptive statistics with Python and interpret the results.
- Students will recognize the unifying structure of data models in scikit-learn.
- Students will train k-nearest neighbor and logistic regression classification models.
- Students will assess model performance using confusion matrices and accuracy scores.
In-Class Exercises:
- Workbook 10: Classification
Suggested Reading:
- L. Breiman, “Statistical Modeling: The Two Cultures.” (2021).
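
The Week 9 objective on assessing classifiers can be illustrated with a short sketch that builds the four cells of a binary confusion matrix and the accuracy score by hand. The labels, predictions, and function names below are hypothetical; in the course itself these quantities would more naturally come from scikit-learn's `confusion_matrix` and `accuracy_score`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def accuracy(counts):
    """Share of all predictions that were correct."""
    correct = counts["tp"] + counts["tn"]
    return correct / sum(counts.values())

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(accuracy(counts))
```

Here six of eight predictions are correct, but the confusion counts also show that the errors are split between one false positive and one false negative, which a single accuracy number hides.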

Week 10

Learning Objectives:
- Students will understand the difference between supervised and unsupervised learning.
- Students will split data into training and testing sets and understand the motivation for doing so.
- Students will train Gaussian mixture and k-means clustering models.
In-Class Exercises:
- Workbook 11: Clustering
Suggested Reading:
- J. Buolamwini & T. Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” (2018).
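
As a sketch of the Week 10 objective on splitting data, the following holds out a fraction of rows for testing after a seeded shuffle. The function name `split_train_test`, the 25% test fraction, and the toy rows are assumptions made for illustration; scikit-learn's `train_test_split` offers the same idea with more options.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction for testing."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 observations
train_rows, test_rows = split_train_test(rows)

print(len(train_rows), len(test_rows))
```

The motivation is that a model evaluated on the rows it was trained on will look better than it really is; the held-out rows give an estimate of performance on unseen data.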

Week 11

Learning Objectives:
- Students will gain experience looking at data and choosing the appropriate modeling class.
- Students will train a linear regression model and understand the use cases of regression models.
- Students will analyze and interpret coefficients and feature importance across model classes.
In-Class Exercises:
- Workbook 12: Linear Regression
Suggested Reading:
- J. Angwin, J. Larson, S. Mattu & L. Kirchner, “Machine Bias.” (2016).
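
To illustrate the Week 11 objective on interpreting regression coefficients, here is a minimal ordinary least squares fit for a single feature, written in plain Python. The function name `fit_line` and the toy data are hypothetical and not taken from Workbook 12.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1 with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The slope is read as the expected change in the outcome for a one-unit increase in the feature, and the intercept as the predicted outcome when the feature is zero.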

Week 12

Learning Objectives:
- Students will understand the terms AI, machine learning, neural networks, and deep learning.
- Students will train a neural network model.
- Students will become familiar with generative modeling concepts.
In-Class Exercises:
- Workbook 13: Neural Networks
Suggested Reading:
- E. Bender, T. Gebru, A. McMillan-Major & S. Shmitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” (2021).

Week 13

Learning Objectives:
- Students will execute an end-to-end machine learning pipeline using their own dataset.
- Students will interpret model results and design compelling visualizations to communicate the same.
Suggested Reading:
- Z.C. Lipton, “The Mythos of Model Interpretability.” (2017).

Week 14

Learning Objectives:
- Students will identify key takeaways from their data exploration and modeling.
- Students will concisely communicate their findings.

Python Problem Sets

To accompany each of the three course modules (i.e. data exploration, data visualization, and data modeling), there is a Python problem set. These usually take some time to complete, but should be doable with the skill set acquired over the course of the module.

The Policy Project

A core throughline of this course is the policy project. There are various checkpoints and benchmarks throughout the semester. This project culminates with a final policy paper due at the end of the semester, as well as an in-class presentation of key findings and takeaways. The critical checkpoints through the semester are as follows:

- Statement of Policy Question: Students will develop 2-3 possible policy questions to explore throughout the semester. Each question will be accompanied by a brief explanation that gives context to non-experts, and students will identify possible types of data that might be useful.
- Identification of Datasets of Interest: Students will choose one policy question of focus and identify 2-3 datasets that will help them answer this question. Students will identify the steps necessary to acquire these datasets, but not necessarily carry these steps out (yet).
- Draft Data Audit: Students will gain an understanding of the datasets they are using, with the goal of making sure that their data is reusable and their work is reproducible. This is also a mechanism for understanding the origins of the data to help uncover bias. Precise text of the data audit assignment can be found here.
- Draft Data Visualization: Students will use the knowledge gained in the data visualization module of the course to create at least one compelling visualization of their data. Visualizations will be accompanied by ~100 word alt-text.
- Presentation: Students will create a concise one-slide presentation highlighting key takeaways from the project and at least one illustrative data visualization. Students will present this slide in conversation with the class.
- Policy Paper: Students will prepare a policy paper that specifically addresses their policy question. This paper will include a “Data Overview” section that contains a narrative version of the data audit, a “Methodology” section that includes details of models used, a “Results” section that discusses model output and validation, and a “Conclusions” section that frames results in terms of the posed policy question. The rubric for the policy paper and presentation is below.

Contact Me

If you find these materials useful, let me know! If you have suggestions for other reading or materials that might work well with this class, also let me know: anna.haensch@tufts.edu