CCJS 418E: Applications of Data Science for Criminology (Fall 2023)

This page is an overview of Applications of Data Science for Criminology, an undergraduate course in the Department of Criminology and Criminal Justice at the University of Maryland. Because the course is relatively new, this page is meant to shed some light on the topics it covers. Coding in the Python programming language is a central component of the course. No prior knowledge of Python or coding is assumed, although some prior coding experience can make the initial learning curve easier. If you don’t have it, that’s also OK: a number of students with little programming background have done well in the course with the appropriate investment (i.e., making sure to keep up with the coding labs each week). The course covers data science topics through real-world examples in the areas of pretrial detention, shooting victimization, domestic violence, and child protective services. Don’t hesitate to reach out to me directly if you have any questions about the course.

Actors in the criminal justice system are often tasked with making consequential decisions that rely on an assessment, or prediction, of what will happen in the future. Examples of these decisions that hinge (at least in part) on a prediction include:

Decision: A judge releasing or jailing an arrestee.

Prediction: Chance that a defendant will skip future court dates, or be charged with a new crime.

Decision: A child protective specialist deciding whether a child should be removed from a home.

Prediction: Chance of future maltreatment or neglect.

Decision: Shelters allocating scarce housing units to victims and survivors of domestic violence.

Prediction: Chance that a survivor is at high risk of re-victimization in the near term.

Even though these decisions are made at some of the most critical points in the lives of the people involved, there is a surprising lack of scientific evidence on the quality of these human-generated predictions:

How accurate are they?
Do they show signs of bias? Are they unfair?
Do they lead to intended outcomes? (e.g. do detention decisions lead to more court appearances; does removal of a child from a home lead to better outcomes for the child?)

At the same time, spurred on by the success of machine learning algorithms in the private sector, there has been increasing interest in using them to generate what are hoped to be highly accurate predictions for applications in the public sector. But the use of algorithms in the high-stakes contexts listed above is understandably controversial. Central to the controversy is the concern that some of the data commonly used to build these algorithms (e.g., arrest, court, and victimization records) contain human bias, so the predictions these algorithms generate could help reproduce and reinforce those biases.

Through the lens of pretrial risk assessments (algorithms used in many jurisdictions across the U.S. to aid judges in the release/jail decision), this course asks, and attempts to answer, the following questions:

What does it mean to make a prediction?
How do we know if the predictions made by an algorithm are accurate?
How do we know if the predictions made by an algorithm are more accurate than those made by humans?
How do we know if the predictions made by an algorithm are fair and equitable?

What is a prediction?

In part one of the course, we will discuss what it means to make a prediction from data. To do that, we will go over the basics of probability theory (laws of probability, the difference between joint, marginal, and conditional probabilities, and Bayes’ theorem). By the end of this section, you will see that generating a prediction like the one made by a judge amounts to estimating a conditional probability.
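As a small illustration of these terms, the sketch below computes a joint, a marginal, and a conditional probability from a made-up 2x2 table of counts, then applies Bayes’ theorem to reverse the conditioning. The counts and variable names are invented for illustration and are not course data.

```python
# Hypothetical counts for 1,000 defendants (invented for illustration).
# Keys: (had_prior_miss, missed_court)
counts = {
    (True, True): 120,
    (True, False): 180,
    (False, True): 70,
    (False, False): 630,
}
total = sum(counts.values())

# Joint probability: had a prior miss AND missed court.
p_prior_and_miss = counts[(True, True)] / total

# Marginal probabilities: sum over the other variable.
p_prior = (counts[(True, True)] + counts[(True, False)]) / total
p_miss = (counts[(True, True)] + counts[(False, True)]) / total

# Conditional probability: missed court GIVEN a prior miss.
p_miss_given_prior = p_prior_and_miss / p_prior

# Bayes' theorem: P(prior | miss) = P(miss | prior) * P(prior) / P(miss)
p_prior_given_miss = p_miss_given_prior * p_prior / p_miss

print(f"P(prior miss and missed court) = {p_prior_and_miss:.3f}")
print(f"P(prior miss)                  = {p_prior:.3f}")
print(f"P(missed court)                = {p_miss:.3f}")
print(f"P(missed court | prior miss)   = {p_miss_given_prior:.3f}")
print(f"P(prior miss | missed court)   = {p_prior_given_miss:.3f}")
```

Note that the two conditional probabilities differ: conditioning on a prior miss is not the same as conditioning on a missed court date.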

A conditional probability tells us about the probability of an event happening given that something else has already happened. For example, a prediction about whether a defendant will show up to their next court date will most likely be different for a defendant who has never missed an appearance than for one who has missed court ten times. In other words, the chance, or probability, of missing a court date given zero prior misses is lower than the chance of missing court given ten prior misses. What using data and algorithms gives us here is an explicit estimate of these conditional probabilities.
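To make the court-date example concrete, here is a minimal sketch that estimates the probability of missing court given the number of prior misses, using nothing more than observed shares. The records and the function name `conditional_miss_rates` are hypothetical; the sketch uses plain Python rather than the tools used in the course labs.

```python
from collections import defaultdict

# Hypothetical records: (number of prior missed court dates, missed next date?)
records = [
    (0, False), (0, False), (0, False), (0, True),
    (10, True), (10, True), (10, False), (10, True),
]


def conditional_miss_rates(rows):
    """Estimate P(miss next date | prior misses) as a simple observed share."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for prior_misses, missed in rows:
        totals[prior_misses] += 1
        misses[prior_misses] += int(missed)
    return {k: misses[k] / totals[k] for k in totals}


rates = conditional_miss_rates(records)
for prior_misses, rate in sorted(rates.items()):
    print(f"P(miss | {prior_misses} prior misses) = {rate:.2f}")
```

With so few rows per group these shares are very noisy. Real data would also involve many more conditioning variables than a single count of prior misses.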

Generating accurate conditional probabilities

In part two of the course, we will learn that prediction algorithms aim to estimate these conditional probabilities from data. We will cover a sequence of machine learning models that generate these predictions: Naive Bayes, Linear/Logistic Regression, Decision Trees, Random Forest, and Gradient Boosting.

In the process, we will discuss where each algorithm falls short of our goal of estimating accurate (or calibrated) probabilities and how the next algorithm in the sequence overcomes that limitation.
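One common way to check calibration is to group predictions into bins and compare the average predicted probability in each bin with the share of cases where the outcome actually occurred. The sketch below does this on invented numbers. The function `calibration_table`, the choice of two bins, and the data are assumptions for illustration, not the course's method.

```python
def calibration_table(predicted, observed, n_bins=2):
    """Group predictions into equal-width bins and compare the average
    predicted probability with the observed outcome rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    table = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        table.append((round(mean_pred, 3), round(obs_rate, 3), len(members)))
    return table


# Hypothetical predicted probabilities and outcomes (1 = missed court).
predicted = [0.1, 0.2, 0.3, 0.2, 0.7, 0.8, 0.9, 0.8]
observed = [0, 0, 1, 0, 1, 1, 0, 1]

for mean_pred, obs_rate, n in calibration_table(predicted, observed):
    print(f"mean predicted = {mean_pred:.2f}, observed rate = {obs_rate:.2f}, n = {n}")
```

For well-calibrated predictions, the two columns should be close in every bin: among cases given a predicted probability near 0.8, roughly 80% should have the outcome.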

We will conclude this section with a discussion about the great challenge in evaluating whether these algorithms actually produce predictions that are more accurate than human assessments.

Bias in, bias out?

After discussing what a prediction is and how machine learning algorithms generate them, we will next turn to the problem of biased data. In this final part of the course, we will discuss three very reasonable conditions that an algorithm should meet in order to be called fair. (We will unfortunately, but maybe not surprisingly, also discuss why it’s nearly impossible for an algorithm to meet these conditions.) We will conclude with a discussion on the types of bias that are correctable with data, and the types that are not.
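This page does not list the three conditions, so the sketch below should not be read as one of them. It illustrates one check that often comes up in discussions of algorithmic fairness: comparing false positive rates across groups. The group labels, records, and function name are all hypothetical.

```python
def false_positive_rate(rows):
    """Share of people who did NOT have the outcome but were flagged high risk."""
    flags_for_negatives = [flagged for flagged, outcome in rows if not outcome]
    return sum(flags_for_negatives) / len(flags_for_negatives)


# Hypothetical records per group: (flagged_high_risk, outcome_occurred)
by_group = {
    "group_a": [(True, False), (False, False), (False, False),
                (True, True), (False, False)],
    "group_b": [(True, False), (True, False), (False, False),
                (True, True), (False, False)],
}

fpr = {group: false_positive_rate(rows) for group, rows in by_group.items()}
for group, rate in fpr.items():
    print(f"{group}: false positive rate = {rate:.2f}")
```

In this invented data, people in group_b who did not have the outcome were flagged twice as often as those in group_a. A gap like this is the kind of disparity a fairness condition might rule out.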

Hands-on skills

Throughout the course, we will ground the concepts covered in lectures through three problem sets, weekly quizzes, weekly labs, a midterm, and a final exam. A central part of the course will be learning how to code some of the concepts we discuss using the Python programming language (via Jupyter Notebooks). In particular, we will become very familiar with pandas DataFrames. No prior knowledge of Python is assumed, so we will work through the coding slowly, but it bears repeating that coding is a major part of the course. To that end, some prior coding experience will likely make the learning curve in the first part of the course easier. If you don’t have it, that’s also OK: a number of students with little programming background have done well in the course with the appropriate investment (i.e., making sure to keep up with the coding labs each week).

What this course is NOT

This is an introductory course intended for students with little prior coursework in the topics mentioned above. This is most salient in the coding: if you have prior Python experience and are comfortable coding, you may not find the class all that fulfilling. Email me if you have any questions about your particular situation.