INST 798/808: A.I.-Powered Research Assistants

Location + Time: TWS 0207, Thursdays, 2:00 to 4:45 p.m.

Course Description

This course explores how Large Language Models (LLMs) can transform labor-intensive research tasks in the social sciences. Using the challenge of tracing ideas and concepts through text as our primary lens, we examine how these powerful tools can aid both qualitative and quantitative research methodologies. Measuring the evolution of ideas through text presents uniquely complex challenges: concepts may be expressed through varied language, their meaning often shifts over time, and understanding them requires deep contextual knowledge that has traditionally relied heavily on human expertise.

The course begins by examining traditional approaches to concept measurement, from word embeddings to early neural architectures, before exploring how transformer-based models have revolutionized our ability to detect and track complex ideas in text. We then delve into recent advances in mechanistic interpretability to understand how these models internally represent and manipulate concepts. This foundation allows us to evaluate various approaches to concept tracing, from using LLMs to scale up qualitative research methods to exploring how modern neural topic modeling can capture evolving ideas across large corpora.

Throughout the course, we maintain a strong focus on validation and methodology, culminating in an examination of how to properly conduct downstream analyses using LLM-processed data. Through paper presentations, class discussions, hands-on labs, and a research paper, students will develop both theoretical understanding and practical experience applying these tools to real research problems.

By the course’s end, students will be equipped to evaluate when and how to effectively integrate LLMs into their research workflows, understand the methodological implications of using these tools, and implement appropriate validation strategies for LLM-assisted research. Most importantly, they will develop a critical perspective on both the transformative potential and the limitations of using LLMs as research assistants in the social and information sciences.

Course Objectives

After completing this course, students will be able to:

  1. Understand the fundamental concepts of natural language processing and their evolution
  2. Evaluate the capabilities and limitations of AI research tools
  3. Implement proper validation and error analysis techniques
  4. Design research workflows that appropriately incorporate AI assistance

Prerequisites

While you aren’t expected to have deep expertise in all areas, you should be comfortable with the following:

  • Python programming fundamentals (working with common data structures, functions, pandas)
  • Basic linear algebra. Understanding how language models represent and manipulate text requires familiarity with vector and matrix operations (addition, multiplication, transpose, distance, similarity).
  • Basic machine learning concepts (supervised vs unsupervised learning, common evaluation metrics)
  • Fundamental probability concepts (conditional probability, independence)
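
As a rough self-check for the linear algebra item above, the sketch below works through the listed operations (addition, multiplication, transpose, distance, similarity) on three toy vectors. The vectors and their values are made up for illustration and do not come from any real embedding model; the helper names are likewise hypothetical. Plain Python is used so the arithmetic stays visible, although a library such as NumPy would be the usual choice in practice.

```python
import math

# Toy 3-dimensional "embeddings"; the values are invented for illustration.
king = [0.8, 0.3, 0.1]
queen = [0.7, 0.4, 0.1]
apple = [0.1, 0.2, 0.9]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return norm([a - b for a, b in zip(u, v)])

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return dot(u, v) / (norm(u) * norm(v))

def transpose(m):
    """Swap the rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Matrix product: each entry is a row of a dotted with a column of b."""
    columns_of_b = transpose(b)
    return [[dot(row, col) for col in columns_of_b] for row in a]

# Stack the vectors into a matrix E (one row per word). E times its
# transpose gives every pairwise dot product in a single operation.
E = [king, queen, apple]
gram = matmul(E, transpose(E))

print(round(cosine_similarity(king, queen), 3))
print(round(cosine_similarity(king, apple), 3))
print(round(euclidean_distance(king, queen), 3))
```

If each function reads as familiar, the linear algebra used in the course should be manageable.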

Technical Requirements

This course explores cutting-edge AI technologies, but we’ll be working within practical computational constraints. While large language models like GPT-4 or Claude 3.5 require significant computing resources, we’ll focus on working with smaller, more manageable models that can run on personal computers. Students will need a laptop capable of running Python and handling lightweight language models (8GB RAM minimum, 16GB+ recommended). We’ll use TerpAI for tasks requiring more computational power, but part of the learning experience will involve understanding how to conduct meaningful research within resource limitations. The course will emphasize understanding core concepts and developing practical workflows that can scale from limited to abundant computational resources.

Assessment

Weekly questions/reflections (20%) You will submit weekly questions or reflections (maximum 500 words) by Tuesday at 5pm before each class. These submissions serve two purposes: they help shape our class discussion and they demonstrate your engagement with the readings. Be prepared to elaborate on your question or reflection during class. Your submission should do one of the following:
  • Pose a substantive question sparked by the readings. This could be about methodology, implications, connections to other work, or potential applications. Questions should go beyond basic clarification to engage with the material's concepts or implications.
  • Offer a reflection that connects multiple readings, relates the material to your own research, or critically examines the methodology or assumptions. Your reflection might consider how different papers approach similar problems, identify potential limitations, or propose new applications.
  • Express and explore areas of confusion in the readings. Some of our papers are technically challenging, and identifying what you don't understand is an important part of the learning process. When discussing confusing aspects, please:
    • Describe your current understanding of the concept
    • Identify specifically what aspect is unclear
Presentations (20%) Throughout the semester, you will present assigned papers to the class. Each 20-minute presentation should include 12-15 minutes of content followed by 5-8 minutes of discussion that you lead. In your presentation, you should:
  • Clearly state the paper's main contribution and why it matters
  • Walk through one or two illustrative examples from the paper
  • Discuss limitations and potential extensions
  • Prepare 2-3 discussion questions for the class
  • Be ready to facilitate brief discussion of these questions
Research Project Proposal (20%) The research proposal (3-5 pages) outlines your planned investigation into either evaluating LLMs for specific research tasks or using LLMs to study a substantive research question. You can work by yourself or with one other student in the course; I highly encourage working with a partner. Before submitting your proposal, you must schedule a meeting with me to discuss your ideas. Prepare a one-paragraph summary of your idea and 2-3 specific questions for our discussion. This meeting should take place at least one week before the proposal deadline. The proposal is due in week 8 of the class (3/20). Your proposal should be structured as follows:
  • Introduction (~1 page)
    Present your core research question or evaluation task. Whether you're assessing LLM capabilities or studying a substantive topic, explain why this question matters and how LLMs offer unique insights for addressing it. For instance, you might explore how LLMs could help trace the evolution of methodological discussions in your field, or evaluate how well they can identify theoretical frameworks in academic papers.
  • Proposed Methodology (1-2 pages)
    Detail your research design, including:
    • Data sources
    • Which models or tools you'll use
    • Your analytical approach
    • Why your chosen methods are appropriate for your question
    For example, if you're studying how newspaper coverage of artificial intelligence has evolved, explain why concept tracing with LLMs might capture subtle shifts in framing better than traditional content analysis.
  • Validation Strategy (~1 page)
    Describe how you'll verify your results and ensure methodological rigor. This might include:
    • Comparison with human coding
    • Use of multiple models
    • Development of benchmarks
    • Strategies for addressing potential biases
    Acknowledge the limitations of your approach and explain how you'll address them.
  • Timeline and Feasibility (~0.5 page)
    Provide a realistic schedule showing how you'll complete the work within the semester. Include key milestones such as:
    • Data collection
    • Initial analysis
    • Validation steps
    • Writing and revision
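
To illustrate the "comparison with human coding" item under Validation Strategy, here is a minimal sketch of one common chance-corrected agreement measure, Cohen's kappa, computed between a human coder and a model. It is one possible check and not a required method for the proposal. The function name, the label names, and the eight example labels are hypothetical and invented purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both coders pick the same label if each
    # labeled at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels for eight documents (made-up data).
human = ["frame", "frame", "other", "frame", "other", "other", "frame", "other"]
model = ["frame", "other", "other", "frame", "other", "frame", "frame", "other"]

print(cohens_kappa(human, model))
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance. Raw percent agreement alone can look high simply because one label dominates, which is why a chance-corrected measure is often reported alongside it.
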
Final Paper (40%) The final paper is due on the first day of the finals period.

Weekly Schedule

Week 1: Introduction to Natural Language Processing

Slides | Lab Notebook

Required Reading:

Highly Recommended Video:

Recommended Video:


Week 2: Word Embeddings (No Class)

Required Reading:

Optional Reading:


Week 3: Neural Networks

Required Reading:

Highly Recommended Videos:


Week 4: RNNs and Attention

Slides | Lab Notebook

Required Reading:

Highly Recommended Videos:

Recommended Videos:

Optional Reading:


Week 5: Transformers and Mechanistic Interpretability

Required Reading:

Optional Reading:


Week 6: Modern Topic Modeling

Required Reading:

Optional Reading:


Week 7: Qualitative Coding with LLMs I

Required Reading:

Optional Reading:


Week 8: Qualitative Coding with LLMs II

Required Reading:

Optional Reading:


Week 9: Error Analysis and Validation

Required Reading:

Optional Reading:


Week 10: Concept Tracing I

Required Reading:

Optional Reading:


Week 11: Concept Tracing II

Required Reading:


Weeks 12-14: To Be Determined