# Deductive Coding Analysis of Synthetic Abstracts Lab Notebook
## Introduction
This lab notebook documents the third phase of our research on concept detection capabilities in large language models (LLMs). After generating a corpus of synthetic academic abstracts and analyzing their semantic relationships, we now focus on evaluating how well different LLMs can extract the underlying conceptual content from these abstracts through deductive coding.

Deductive coding is a qualitative research method where pre-defined codes or categories are applied to data. In our experiment, we leverage multiple LLMs as "coders" to identify topics and subtopics in our synthetic abstracts based on a structured codebook. This approach allows us to systematically assess how accurately different models can recover the intended topic distributions that were specified during abstract generation.
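
One way to quantify how well a coder recovers an intended topic distribution is the mean absolute error between intended and coded proportions. The sketch below is illustrative only: the helper name `proportion_mae` is hypothetical, and it assumes both distributions are available as dicts mapping topic IDs to percentages, which may not match how the ground truth is actually stored.

```python
def proportion_mae(intended, coded):
    """Mean absolute error (percentage points) over the union of topic IDs.

    Hypothetical helper: assumes both arguments map topic ID -> percentage.
    A topic missing from one side is treated as 0%.
    """
    topic_ids = set(intended) | set(coded)
    if not topic_ids:
        return 0.0
    total_error = sum(abs(intended.get(t, 0.0) - coded.get(t, 0.0)) for t in topic_ids)
    return total_error / len(topic_ids)


# Illustrative values, not experimental results
intended = {"T1": 60, "T9": 40}
coded = {"T1": 70, "T9": 20, "T10": 10}
print(proportion_mae(intended, coded))
```

Averaging over the union of topic IDs penalizes both missed topics and spurious ones, which seems appropriate when the coder is free to pick any topic from the codebook.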

## Model Specifications
Our deductive coding experiments employ four different LLM systems with varying architectures and capabilities:

| Model Name    | Provider | Architecture           | Parameters | Access Method | Key Features                                 |
|---------------|----------|------------------------|------------|---------------|----------------------------------------------|
| llama3.2      | Ollama   | Llama 3.2              | 3.21B      | Local API     | Latest Llama iteration, self-hosted          |
| llama3.1      | Groq     | llama3-70b-8192        | 70B        | Remote API    | High-performance Llama 3 on accelerated cloud|
| qwen2.5       | Ollama   | Qwen 2.5               | 7.62B      | Local API     | Multilingual, trained on 18T tokens, Apache 2.0 license |
| deepseek-r1   | Groq     | deepseek-r1-distill-qwen-32b | 32B   | Remote API    | DeepSeek-R1 reasoning model distilled into a Qwen 32B base |

## Methodology Overview
The deductive coding process follows a structured workflow:

- **Codebook Creation**: We develop a comprehensive codebook defining the topics and subtopics from our original abstract generation experiment, along with detailed guidelines for distinguishing between conceptual categories.
- **Prompt Engineering**: For each abstract, we construct a detailed prompt that includes the codebook, disambiguation guidelines, and explicit instructions for analyzing and quantifying topic presence.
- **Multi-Model Analysis**: Each abstract is processed by all four LLM systems, generating structured outputs that identify topics, estimate their proportions, and indicate confidence levels.
- **Error Handling**: The system incorporates retry logic and careful JSON parsing to ensure reliable data collection despite potential API errors or malformed responses.
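
The structured outputs described above can be sanity-checked before analysis. The following is a minimal sketch, assuming each parsed response is a dict with a `topics` list whose entries carry `topic_id`, `proportion`, and `confidence`; the function name `validate_coding_result` and the 1-point tolerance on the proportion sum are assumptions made for illustration.

```python
def validate_coding_result(result, valid_topic_ids, tolerance=1.0):
    """Return a list of problems found in one parsed coding result (empty if none)."""
    topics = result.get("topics")
    if not isinstance(topics, list) or not topics:
        return ["missing or empty 'topics' list"]

    problems = []
    total = 0.0
    for entry in topics:
        if not isinstance(entry, dict):
            problems.append(f"topic entry is not an object: {entry!r}")
            continue
        topic_id = entry.get("topic_id")
        if topic_id not in valid_topic_ids:
            problems.append(f"unknown topic_id: {topic_id!r}")
        confidence = entry.get("confidence")
        if not isinstance(confidence, (int, float)) or not 1 <= confidence <= 5:
            problems.append(f"confidence out of range for {topic_id!r}: {confidence!r}")
        proportion = entry.get("proportion")
        if isinstance(proportion, (int, float)):
            total += proportion
        else:
            problems.append(f"non-numeric proportion for {topic_id!r}: {proportion!r}")

    # Proportions are requested as percentages that sum to 100
    if abs(total - 100.0) > tolerance:
        problems.append(f"proportions sum to {total}, expected 100")
    return problems


# Illustrative response with three deliberate problems
example = {"topics": [
    {"topic_id": "T1", "proportion": 70, "confidence": 4},
    {"topic_id": "T3", "proportion": 20, "confidence": 6},
]}
problems = validate_coding_result(example, {"T1", "T7", "T8", "T9", "T10"})
print(problems)
```

Returning a list of problems rather than raising lets a malformed response be logged alongside the raw output instead of aborting a long run.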



In [None]:
import pandas as pd
import numpy as np
import json
import requests
import time
import os
import re
from tqdm.auto import tqdm
import random
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from typing import Dict, List, Any, Tuple, Optional

# Define models and their configurations
MODELS = {
    "llama3.2": {
        "provider": "ollama",
        "model_name": "llama3.2",
        "max_tokens": 8000,
        "temperature": 0.1,
    },
    "llama3.1": {
        "provider": "groq",
        "model_name": "llama3-70b-8192",
        "max_tokens": 8000,
        "temperature": 0.1,
    },
    "qwen2.5": {
        "provider": "ollama",
        "model_name": "qwen2.5",
        "max_tokens": 8000,
        "temperature": 0.1,
    },
    "deepseek-r1": {
        "provider": "groq",
        "model_name": "deepseek-r1-distill-qwen-32b",
        "max_tokens": 8000,
        "temperature": 0.1,
    }
}

# API configuration
API_CONFIG = {
    "groq": {
        "api_key": os.environ.get("GROQ_API_KEY"),  # Read the Groq API key from the GROQ_API_KEY environment variable
        "api_url": "https://api.groq.com/openai/v1/chat/completions"
    },
    "ollama": {
        "api_url": "http://localhost:11434/api/chat"  # Assumes Ollama is running locally
    }
}

# Topic and subtopic definitions from the synthetic data generation phase
TOPICS = {
    "T1": {
        "name": "Machine Learning",
        "subtopics": ["Neural Networks", "Reinforcement Learning", "Supervised Learning", 
                      "Unsupervised Learning", "Transfer Learning"],
        "description": "Machine Learning involves developing algorithms and models that enable computers to learn from and make predictions or decisions based on data without being explicitly programmed."
    },
    "T7": {
        "name": "Sustainable Development",
        "subtopics": ["Renewable Energy", "Climate Change Mitigation", "Resource Management", 
                      "Environmental Monitoring", "Sustainable Cities"],
        "description": "Sustainable Development focuses on meeting present needs without compromising future generations, balancing economic growth, environmental protection, and social equity."
    },
    "T8": {
        "name": "Behavioral Economics",
        "subtopics": ["Decision Making", "Cognitive Biases", "Risk Assessment", 
                      "Social Preferences", "Intertemporal Choice"],
        "description": "Behavioral Economics studies how psychological, social, cognitive, and emotional factors influence economic decisions, challenging the assumption of perfect rationality."
    },
    "T9": {
        "name": "Digital Security",
        "subtopics": ["Cybersecurity", "Privacy Enhancing Technologies", "Authentication Methods", 
                     "Threat Detection", "Security Policy"],
        "description": "Digital Security encompasses technologies, protocols, and practices designed to protect computers, networks, programs, and data from attacks, damage, or unauthorized access."
    },
    "T10": {
        "name": "Public Health",
        "subtopics": ["Epidemiology", "Health Promotion", "Disease Prevention", 
                      "Health Equity", "Health Systems"],
        "description": "Public Health focuses on protecting and improving health at the population level through organized efforts, education, policies, and research."
    }
}

In [None]:
# Create the codebook for deductive coding
def create_codebook():
    """Create a structured codebook for deductive coding"""
    
    codebook = {
        "topics": {k: {
            "name": v["name"],
            "id": k,
            "subtopics": v["subtopics"],
            "description": v["description"]
        } for k, v in TOPICS.items()},
        
    }


    # Add disambiguation guidelines
    codebook["disambiguation_guidelines"] = """
        When coding abstracts, carefully distinguish between topics and subtopics:
    
    TOPICS refer to the academic fields, subjects, or methodologies that form the theoretical or methodological foundation of the research. They answer "what knowledge area is being studied or applied?"
    
    SUBTOPICS refer to a finer-grained subdivision of the relevant TOPIC in the abstract.
    
    For example, an abstract might describe using Machine Learning (TOPIC) to perform supervised learning (SUBTOPIC). Here, Machine Learning is the broad academic subject being discussed and supervised learning is the more fine-grained subject within it.
    
    Evidence for topics typically includes:
    - Specific methodologies, theories, or frameworks from that academic field
    - Technical terminology associated with the discipline
    - Citations or references to literature in that field
    
    
    Evidence for subtopics typically includes:
    - Evidence of the broader (less granular) parent topic
    - Specific methodologies, theories, or frameworks from that academic field
    - Technical terminology associated with the discipline
    - Citations or references to literature in that field
    
    Be aware that sometimes terminology can overlap. 
    """
    
    # Add examples illustrating the topic vs. subtopic distinction
    codebook["examples"] = [
        {
            "excerpt": "This study employed neural networks to predict player performance based on biometric data collected during professional basketball games.",
            "topic": "Machine Learning",
            "subtopic": "Neural Networks",
            "explanation": "Machine Learning is the TOPIC and Neural Networks is the finer-grained SUBTOPIC. Professional basketball is only the application setting and is not coded."
        },
        {
            "excerpt": "We explore how gamification elements on educational social media platforms affect student engagement and learning outcomes.",
            "topic": "Behavioral Economics",
            "subtopic": "Decision Making",
            "explanation": "The study focuses on engagement and decision-making behaviors (Behavioral Economics, the TOPIC), most specifically Decision Making (the SUBTOPIC). Educational social media platforms are the application setting."
        }
    ]
    
    return codebook

In [None]:
# Generate the deductive coding prompt
def create_deductive_coding_prompt(codebook, abstract):
    """Create a prompt for deductive coding with escaped curly braces for the example JSON format"""
    topics_json = json.dumps([{"id": t_id, "name": t_info["name"], "description": t_info["description"], "subtopics": t_info["subtopics"]} 
                             for t_id, t_info in codebook["topics"].items()], indent=2)
    
    # domains_json = json.dumps([{"name": d_info["name"], "description": d_info["description"]} 
    #                            for d_name, d_info in codebook["domains"]["subtopics"].items()], indent=2)
    
    # Literal braces in the JSON template below are doubled ({{ and }}) so the f-string leaves them intact
    prompt = f"""
You are a highly skilled research methodologist performing deductive coding on academic abstracts. Your task is to analyze the following abstract and systematically identify both TOPICS and SUBTOPICS based on a predefined codebook.

# CODEBOOK

{topics_json}

## DISAMBIGUATION GUIDELINES
{codebook["disambiguation_guidelines"]}

# CODING TASK

Analyze the following abstract:

---
{abstract}
---

Perform the following analysis:

1. TOPIC IDENTIFICATION:
   - Identify which topics (from the codebook) are present in the abstract.
   - For each identified topic, estimate the proportion (percentage) of the abstract devoted to it. Proportions should sum to 100%.
   - Rate your confidence for each topic identification on a scale of 1-5 (5 being highest).

2. SUBTOPIC IDENTIFICATION:
   - Identify which subtopic(s) from the codebook are represented in the abstract.
   - Rate your confidence for each subtopic identification on a scale of 1-5.


Return your analysis in the following JSON format without additional text:

{{
  "topics": [
    {{
      "topic_id": "topic_identifier_from_codebook",
      "proportion": percentage_as_number,
      "confidence": rating_from_1_to_5
    }}
  ],
  "domains": [
    {{
      "subtopic_name": "subtopic_name_from_codebook",
      "confidence": rating_from_1_to_5
    }}
  ],
  "disambiguation_explanation": "explanation_of_how_topics_and_subtopics_were_distinguished"
}}

Important:
- Be precise in your identification and thorough in your evidence.
- Ensure your analysis is based ONLY on content explicitly present in the abstract.
- Make sure to return strictly valid JSON without any markdown formatting, additional text, or explanations.
"""
    return prompt

# API request handlers with retry logic
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10), 
       retry=retry_if_exception_type((requests.RequestException, json.JSONDecodeError)))
def query_groq_api(prompt, model_config):
    """Send a request to Groq API with retry logic"""
    headers = {
        "Authorization": f"Bearer {API_CONFIG['groq']['api_key']}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": model_config["model_name"],
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": model_config["max_tokens"],
        "temperature": model_config["temperature"]
    }
    
    response = requests.post(
        API_CONFIG["groq"]["api_url"],
        headers=headers,
        json=payload,
        timeout=60  # 60 second timeout
    )
    
    response.raise_for_status()  # Raise an exception for 4XX/5XX responses
    
    result = response.json()
    if "choices" in result and len(result["choices"]) > 0:
        return result["choices"][0]["message"]["content"]
    else:
        raise ValueError(f"Unexpected response format from Groq: {result}")

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10), 
       retry=retry_if_exception_type((requests.RequestException, json.JSONDecodeError)))
def query_ollama_api(prompt, model_config):
    """Send a request to Ollama API with retry logic"""
    payload = {
        "model": model_config["model_name"],
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": model_config["temperature"],
            "num_predict": model_config["max_tokens"]
        }
    }
    
    response = requests.post(
        API_CONFIG["ollama"]["api_url"],
        json=payload,
        timeout=60  # 60 second timeout
    )
    
    response.raise_for_status()  # Raise an exception for 4XX/5XX responses
    
    result = response.json()
    if "message" in result and "content" in result["message"]:
        return result["message"]["content"]
    else:
        raise ValueError(f"Unexpected response format from Ollama: {result}")

        
def query_model(prompt, model_name):
    """Query the appropriate model API based on model name"""
    model_config = MODELS[model_name]
    provider = model_config["provider"]
    
    try:
        if provider == "groq":
            return query_groq_api(prompt, model_config)
        elif provider == "ollama":
            return query_ollama_api(prompt, model_config)
        else:
            raise ValueError(f"Unsupported provider: {provider}")
    except Exception as e:
        print(f"Error querying {model_name}: {str(e)}")
        return json.dumps({"error": str(e)})


def extract_json_from_response(response_text):
    """Extract JSON from potentially non-JSON response text"""
    # Try to find JSON block in the response
    json_pattern = r'({[\s\S]*})'
    json_match = re.search(json_pattern, response_text)
    
    if json_match:
        json_str = json_match.group(1)
        try:
            # Try to parse the extracted JSON
            return json.loads(json_str)
        except json.JSONDecodeError:
            # If parsing fails, try to clean the JSON string
            # Remove trailing commas
            json_str = re.sub(r',\s*}', '}', json_str)
            json_str = re.sub(r',\s*]', ']', json_str)
            
            try:
                return json.loads(json_str)
            except json.JSONDecodeError:
                # If still fails, return error
                return {"error": "Could not parse JSON from response", "raw_response": response_text}
    
    # If no JSON pattern is found
    return {"error": "No JSON found in response", "raw_response": response_text}


def run_deductive_coding(abstracts_df, 
                         model_names=None, 
                         max_samples=None, output_file="deductive_coding_results.json"):
    """Run deductive coding on abstracts using multiple models"""
    if model_names is None:
        model_names = list(MODELS.keys())
    
    # Create the codebook
    codebook = create_codebook()
    
    # Select samples (all or a subset)
    if max_samples is not None and max_samples < len(abstracts_df):
        selected_df = abstracts_df.sample(max_samples, random_state=42)
    else:
        selected_df = abstracts_df
    
    results = []
    
    # Process each abstract
    for idx, row in tqdm(selected_df.iterrows(), total=len(selected_df), desc="Processing abstracts"):
        abstract_id = row.get('id', idx)
        
        # Get both groq and ollama abstracts if available
        for provider in ['groq', 'ollama']:
            abstract_key = f"{provider}_abstract"
            if abstract_key in row and pd.notna(row[abstract_key]):
                abstract = row[abstract_key]
                
                # Get ground truth data for later evaluation
                ground_truth = {
                    "id": abstract_id,
                    "provider": provider,
                    "topic_mix": row.get('topic_mix', {}),
                    "topic_mix_str": row.get('topic_mix_str', ""),
                    "diversity_params": row.get('diversity_params', {}),
                }
                
                # Extract domain from diversity params if available
                if isinstance(ground_truth["diversity_params"], dict):
                    ground_truth["domain"] = ground_truth["diversity_params"].get("domain", "")
                elif isinstance(ground_truth["diversity_params"], str):
                    try:
                        diversity_dict = json.loads(ground_truth["diversity_params"].replace("'", "\""))
                        ground_truth["domain"] = diversity_dict.get("domain", "")
                    except json.JSONDecodeError:
                        ground_truth["domain"] = ""
                
                # Attach the ground-truth subtopics selected during abstract generation
                ground_truth["subtopics"] = row['selected_subtopics']

                # Process with each model
                for model_name in model_names:
                    print(f"Processing {provider} abstract {abstract_id} with {model_name}")
                    
                    # Create the coding prompt
                    prompt = create_deductive_coding_prompt(codebook, abstract)
                    
                    # Query the model
                    try:
                        # Add some delay to avoid rate limiting
                        time.sleep(random.uniform(0.5, 1.5))
                        response = query_model(prompt, model_name)
                        
                        # Try to parse JSON from response
                        coding_result = extract_json_from_response(response)
                        
                        # Store result with metadata
                        result_entry = {
                            "abstract_id": abstract_id,
                            "provider": provider,
                            "model": model_name,
                            "ground_truth": ground_truth,
                            "coding_result": coding_result,
                            "raw_response": response,
                            "timestamp": time.time()
                        }
                        
                        results.append(result_entry)
                        
                        # Periodically save results to avoid losing data
                        if len(results) % 10 == 0:
                            with open(output_file, 'w') as f:
                                json.dump(results, f, indent=2)
                    
                    except Exception as e:
                        print(f"Error processing abstract {abstract_id} with {model_name}: {str(e)}")
                        results.append({
                            "abstract_id": abstract_id,
                            "provider": provider,
                            "model": model_name,
                            "ground_truth": ground_truth,
                            "error": str(e),
                            "timestamp": time.time()
                        })
    
    # Save final results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    
    print(f"Deductive coding complete. Results saved to {output_file}")
    return results
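
Reasoning-style models such as deepseek-r1 may emit a `<think>...</think>` preamble before the final answer. Because the pattern in `extract_json_from_response` matches greedily from the first `{` to the last `}`, any braces inside that preamble would be swept into the candidate JSON string. The following is a minimal sketch of a pre-processing step, assuming that tag format; `strip_reasoning` is a hypothetical helper and is not wired into the pipeline above.

```python
import json
import re

# Non-greedy match so each <think>...</think> span is removed separately
THINK_BLOCK = re.compile(r"<think>[\s\S]*?</think>", re.IGNORECASE)


def strip_reasoning(response_text):
    """Remove <think>...</think> spans so brace matching only sees the final answer."""
    return THINK_BLOCK.sub("", response_text).strip()


# Illustrative response: the reasoning preamble itself contains braces
raw = (
    '<think>The schema looks like {"topics": [...]}, so I should follow it.</think>\n'
    '{"topics": [{"topic_id": "T1", "proportion": 100, "confidence": 5}]}'
)
cleaned = strip_reasoning(raw)
parsed = json.loads(cleaned)
print(parsed["topics"][0]["topic_id"])
```

If adopted, the call would sit just before the regex search, so that text without reasoning tags passes through unchanged.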

In [None]:
# Load the abstract data
abstracts_file = "generated_df_100_censored_longer.feather"

In [None]:
abstracts_df = pd.read_feather(abstracts_file)
print(f"Loaded {len(abstracts_df)} abstract entries from {abstracts_file}")

In [None]:
# Run deductive coding
# You can limit the number of samples for initial testing

In [None]:
model_names = ["llama3.2", "llama3.1", "qwen2.5", "deepseek-r1"]
# max_samples = 20  # Set to None to process all abstracts
# model_names = ["llama3.2", "qwen2.5"]

results = run_deductive_coding(
    abstracts_df, 
    model_names=model_names,
    max_samples=None,
    output_file="deductive_coding_results_censored.json"
)
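
Before comparing codes against ground truth, it can help to check how often each model failed to return usable JSON. The sketch below assumes the entry layout produced by `run_deductive_coding` (a `model` key plus either a `coding_result` dict or a top-level `error`); the helper name `parse_failure_rate` and the sample entries are hypothetical.

```python
from collections import defaultdict


def parse_failure_rate(results):
    """Fraction of entries per model with a missing or error-bearing coding result."""
    counts = defaultdict(lambda: [0, 0])  # model -> [failed, total]
    for entry in results:
        coding = entry.get("coding_result")
        failed = "error" in entry or not isinstance(coding, dict) or "error" in coding
        counts[entry["model"]][0] += int(failed)
        counts[entry["model"]][1] += 1
    return {model: failed / total for model, (failed, total) in counts.items()}


# Hypothetical entries for illustration, not experimental results
sample = [
    {"model": "llama3.2", "coding_result": {"topics": []}},
    {"model": "llama3.2", "coding_result": {"error": "No JSON found in response"}},
    {"model": "qwen2.5", "coding_result": {"topics": []}},
    {"model": "qwen2.5", "error": "timeout"},
]
rates = parse_failure_rate(sample)
print(rates)
```

In practice the same function could be applied to the `results` list returned above or to the saved JSON file after loading it with `json.load`.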
