# Synthetic Academic Abstract Generation Lab Notebook
## Introduction

This lab notebook documents our exploration of synthetic academic abstract generation, a methodology that serves as a foundation for evaluating concept detection capabilities across different large language models (LLMs). By creating a corpus of synthetic academic abstracts with controlled topic mixtures and linguistic characteristics, we can systematically assess how well various LLMs extract and analyze conceptual information from text.

The experiment follows a two-phase approach:

- **Generation Phase**: Creating synthetic abstracts with precise topic distributions, vocabulary constraints, and stylistic parameters using two different LLM engines
- **Evaluation Phase**: Testing how effectively other LLMs can detect, extract, and analyze the concepts deliberately embedded in these synthetic abstracts

This approach establishes a ground truth dataset with known concept distributions and vocabulary constraints, which provides a controlled testbed for evaluating concept detection. The key constraint is that each abstract must discuss its assigned topics without using the explicit topic or subtopic names, so the generating model has to rely on alternative vocabulary and conceptual framing.

**Important Caveat: Assumption of Generation Fidelity**

A critical assumption underlying this methodology is that the LLMs faithfully follow the generation instructions, particularly regarding topic distribution, vocabulary constraints, and stylistic parameters. There is no guarantee that a generated abstract matches the specified topic distribution or fully avoids the prohibited vocabulary. To partially address this limitation, the implementation includes a verification step that checks each generated title and abstract for the full topic names and each abstract for the selected subtopic names.
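
As a rough illustration of such a check, the sketch below flags forbidden terms with word-boundary matching instead of plain substring search, so that a term like "risk" does not match inside "brisk". The function name `find_forbidden_terms` and the sample sentence are hypothetical and not part of the notebook's pipeline, which uses simple substring matching on lowercased text.

```python
import re

def find_forbidden_terms(text, forbidden_terms):
    """Return the forbidden terms that occur in `text` as whole words or phrases."""
    found = []
    for term in forbidden_terms:
        # \b anchors restrict matches to whole words; re.escape protects special characters
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

# Hypothetical sample: "brisk" contains "risk" as a substring but not as a word
sample = "A brisk rollout of Public Health programs."
print(find_forbidden_terms(sample, ["risk", "public health", "machine learning"]))
# → ['public health']
```

Whole-word matching reduces false positives for short terms, at the cost of missing inflected forms such as plurals, so it is best treated as one possible variant of the check.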

For a more comprehensive validation, future work could include:

- Independent expert annotation of the generated abstracts to verify topic distributions
- Statistical analysis of linguistic features to confirm adherence to stylistic parameters
- Cross-model evaluation where different LLMs rate the adherence of generated abstracts to the original specifications
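
A first pass at the statistical analysis mentioned above could start from crude surface features. The sketch below computes word count, mean sentence length, and type-token ratio; the function name `surface_stats` and the choice of features are assumptions made for illustration, and these numbers are only rough proxies for parameters such as formality or terminology density, not validated measures.

```python
import re

def surface_stats(text):
    """Compute a few crude surface features of an abstract."""
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    # Split on sentence-ending punctuation and drop empty fragments
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    unique = {w.lower() for w in words}
    return {
        "word_count": len(words),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(unique) / len(words) if words else 0.0,
    }

stats = surface_stats("We ran a survey. The survey covered two schools.")
print(stats["word_count"], stats["mean_sentence_length"])
# → 9 4.5
```

The `word_count` field can also be compared against the 250-word minimum that the generation prompt requests.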

## Model Specifications
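
The generation phase employs two LLM providers with the specifications below. Note that the `CONFIG` cell in this notebook currently sets the Ollama model to `llama3.1:latest`, with `qwen2.5` left as a commented-out alternative, so the Ollama row applies only when that alternative is selected.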

| Provider | Model           | Architecture | Parameters | Quantization | API Access | Key Features                                            |
|----------|-----------------|--------------|------------|--------------|------------|---------------------------------------------------------|
| Groq     | llama3-70b-8192 | Llama 3      | 70B        | -            | Remote API | High performance, instruction-tuned                     |
| Ollama   | qwen2.5-7b      | Qwen 2.5     | 7.62B      | Q4_K_M       | Local API  | Multilingual, trained on 18T tokens, Apache 2.0 license |

## Methodology Overview

The lab employs the following methodology for generating synthetic abstracts:

- **Topic Network Construction**: Creating a toy weighted network of academic topics and subtopics
- **Conceptual Mixing**: Sampling topic combinations based on network proximity with controlled distribution ratios
- **Parameter Sampling**: Assigning diverse textual characteristics including methodology, formality, jargon density, and interdisciplinary orientation
- **Prompt Engineering**: Developing detailed prompts with explicit vocabulary constraints that require models to discuss topics without using their explicit names
- **Parallel Generation**: Generating abstracts using both Groq and Ollama services for comparative analysis
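
The conceptual mixing step can be pictured as a weighted draw over the neighbors of an already selected topic. The sketch below is a minimal stand-in that uses the standard library's `random.choices` rather than the NumPy-based sampling in the notebook's code; the function name `sample_neighbor` and the edge weights are hypothetical.

```python
import random

def sample_neighbor(weights_by_neighbor, rng):
    """Pick one neighbor with probability proportional to its edge weight."""
    neighbors = list(weights_by_neighbor)
    weights = [weights_by_neighbor[n] for n in neighbors]
    # random.choices normalizes the weights internally
    return rng.choices(neighbors, weights=weights, k=1)[0]

# Hypothetical edge weights from a seed topic to three neighboring topics
edges_from_seed = {"T7": 0.6, "T8": 0.3, "T9": 0.1}
rng = random.Random(0)  # seeded generator so repeated runs give the same draws
draws = [sample_neighbor(edges_from_seed, rng) for _ in range(1000)]
print({t: draws.count(t) / len(draws) for t in edges_from_seed})
```

Over many draws the empirical frequencies should approach the normalized edge weights, which is what makes closely connected topics more likely to be mixed together.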



In [None]:
# ==================================================
# Synthetic Abstract Generation for Concept Evaluation
# ==================================================
import numpy as np
import pandas as pd
import json
import requests
import time
import random
import re  # Import the regex module
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from collections import defaultdict, Counter

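# Set random seeds for reproducibility
# (NumPy drives the topic and parameter sampling; `random` drives the retry jitter)
np.random.seed(42)
random.seed(42)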

In [None]:
# --- Configuration ---
# !! SECURITY WARNING !!: Never hardcode API keys in notebooks or scripts.
# The key is read from the GROQ_API_KEY environment variable; the 'gsk_...'
# placeholder is what the checks further down treat as "not configured".
import os

groq_api_key = os.environ.get("GROQ_API_KEY", "gsk_...")

# --- Topic Space and Diversity Parameters ---
TOPICS = {
    "T1": {
        "name": "Machine Learning",
        "subtopics": ["Neural Networks", "Reinforcement Learning", "Supervised Learning", "Unsupervised Learning", "Transfer Learning"]
    },
    "T7": {
        "name": "Sustainable Development",
        "subtopics": ["Renewable Energy", "Climate Change Mitigation", "Resource Management", "Environmental Monitoring", "Sustainable Cities"]
    },
    "T8": {
        "name": "Behavioral Economics",
        "subtopics": ["Decision Making", "Cognitive Biases", "Risk Assessment", "Social Preferences", "Intertemporal Choice"]
    },
    "T9": {
        "name": "Digital Security",
        "subtopics": ["Cybersecurity", "Privacy Enhancing Technologies", "Authentication Methods", "Threat Detection", "Security Policy"]
    },
    "T10": {
        "name": "Public Health",
        "subtopics": ["Epidemiology", "Health Promotion", "Disease Prevention", "Health Equity", "Health Systems"]
    }
}

DOMAINS = ["Sports", "Marriage", "Childcare", "Exercise", "School", "Social Media", "Advertisement"]

DIVERSITY_PARAMS = {
    "methodological_approaches": [
        "Theory", "Qualitative", "Experimental", "Observational",
        "Quasi-Experimental", "Laboratory Science", "Survey",
        "Correlational", "Mixed-methods"
    ],
    "concept_granularity": ["General Principles", "Specific Applications", "Mixed"],
    "interdisciplinary_orientation": ["Pure-discipline", "Multi-disciplinary"],
    "rhetorical_structures": [
        "Problem-solution", "Contribution-focused",
        "Findings-centered", "Process-oriented"
    ],
    "formality_levels": ["Highly Formal", "Accessible"],
    "terminology_density": ["Terminology-rich", "Balanced", "Minimal Jargon"],
    "temporal_context": ["Contemporary", "Historical Context", "Future-oriented"],
    "concept_blending": ["Juxtaposition", "Terminology-level", "Methodology-level", "Deep Integration"]
}

# --- Helper Functions ---

def create_topic_network():
    """Create a weighted network of topics based on similarity/co-occurrence"""
    G = nx.Graph()
    for topic_id, topic_data in TOPICS.items():
        G.add_node(topic_id, name=topic_data["name"], subtopics=topic_data["subtopics"])

    for topic1 in TOPICS:
        for topic2 in TOPICS:
            if topic1 != topic2:
                subtopics1 = set(TOPICS[topic1]["subtopics"])
                subtopics2 = set(TOPICS[topic2]["subtopics"])
                jaccard = len(subtopics1.intersection(subtopics2)) / len(subtopics1.union(subtopics2))
                similarity = jaccard + np.random.normal(0, 0.1)
                similarity = max(0.05, min(0.95, similarity))
                if not G.has_edge(topic1, topic2): # Add edge only once
                    G.add_edge(topic1, topic2, weight=similarity)
    return G


def sample_concept_mix(topic_network, num_topics=None):
    """Sample a mix of topics from the topic network"""
    if num_topics is None:
        num_topics = np.random.choice([1, 2, 3], p=[0.2, 0.6, 0.2])

    all_topics = list(topic_network.nodes())

    if num_topics == 1:
        topic = np.random.choice(all_topics)
        return {topic: 1.0}
    else:
        selected_topics = [np.random.choice(all_topics)]
        for _ in range(num_topics - 1):
            all_neighbors = []
            neighbor_weights = []
            for t in selected_topics:
                for n in topic_network.neighbors(t):
                    # Ensure neighbor is not already selected and has weights
                    if n not in selected_topics and n in topic_network[t]:
                         # Check if neighbor exists and edge has weight data
                        if n not in all_neighbors:
                            all_neighbors.append(n)
                            neighbor_weights.append(topic_network[t][n]['weight'])
                        else:
                            # If neighbor already listed (from another selected topic),
                            # potentially average or sum weights? Let's just keep first found weight.
                            pass


            if not all_neighbors:
                remaining = list(set(all_topics) - set(selected_topics))
                if not remaining: break
                selected_topics.append(np.random.choice(remaining))
            else:
                total_weight = sum(neighbor_weights)
                if total_weight <= 0:  # Handle cases where all weights are zero or negative
                    normalized_weights = [1 / len(neighbor_weights)] * len(neighbor_weights)  # Equal probability
                else:
                    normalized_weights = [w / total_weight for w in neighbor_weights]

                # Ensure lengths match before choice
                if len(all_neighbors) != len(normalized_weights):
                    print(f"Warning: Mismatch in neighbors ({len(all_neighbors)}) and weights ({len(normalized_weights)}). Using uniform distribution.")
                    selected_topics.append(np.random.choice(all_neighbors))
                else:
                    selected_topics.append(np.random.choice(all_neighbors, p=normalized_weights))

        if len(selected_topics) == 2:
            weights = [0.7, 0.3]
        elif len(selected_topics) == 3:
            weights = [0.5, 0.3, 0.2]
        else:  # Handle cases where fewer than desired topics were found
            weights = [1.0 / len(selected_topics)] * len(selected_topics)

        # Ensure selected_topics and weights have the same length after sampling
        weights = weights[:len(selected_topics)]
        return {t: w for t, w in zip(selected_topics, weights)}


def sample_diversity_params():
    """Sample diversity parameters for an abstract"""
    params = {}
    for param_name, options in DIVERSITY_PARAMS.items():
        params[param_name] = np.random.choice(options)
    params["domain"] = np.random.choice(DOMAINS)
    return params





# --- Unified LLM Generation Function ---
def parse_llm_response(response_text):
    """Attempts to parse the LLM response to extract the JSON object."""
    response_text = response_text.strip()

    # Approach 1: Try direct JSON parsing
    try:
        # Find the first '{' and the last '}'
        start_index = response_text.find('{')
        end_index = response_text.rfind('}')
        if start_index != -1 and end_index != -1 and end_index > start_index:
            json_str = response_text[start_index : end_index + 1]
            # Basic cleaning for common issues like trailing commas
            json_str = re.sub(r',\s*}', '}', json_str)
            json_str = re.sub(r',\s*]', ']', json_str)
            result = json.loads(json_str)
            # Validate expected keys
            if "title" in result and "abstract" in result and "keywords" in result:
                 # Ensure keywords is a list
                 if not isinstance(result["keywords"], list):
                     result["keywords"] = [str(k).strip() for k in str(result["keywords"]).split(',')] # Attempt to convert if not list
                 return result
            else:
                 print("Warning: Parsed JSON missing required keys (title, abstract, keywords).")
        else:
            print("Warning: Could not find valid JSON structure '{...}'.")

    except json.JSONDecodeError as e:
        print(f"Direct JSON parse failed: {e}. Trying regex.")
        # Fallback to regex if direct parsing fails - less reliable
        try:
            title_match = re.search(r'"title"\s*:\s*"((?:\\"|[^"])*)"', response_text)
            # Regex for abstract, trying to handle escaped quotes within the abstract
            abstract_match = re.search(r'"abstract"\s*:\s*"((?:\\.|[^"\\])*)"', response_text, re.DOTALL)
            keywords_match = re.search(r'"keywords"\s*:\s*\[(.*?)\]', response_text, re.DOTALL)

            if title_match and abstract_match:
                # Decode JSON string escapes with json.loads; the unicode_escape codec would garble non-ASCII text
                title = json.loads(f'"{title_match.group(1)}"', strict=False)
                abstract = json.loads(f'"{abstract_match.group(1)}"', strict=False)

                keywords = []
                if keywords_match:
                    keywords_text = keywords_match.group(1)
                    # More robust keyword splitting
                    keywords = [k.strip(' "\'') for k in re.findall(r'"([^"]*)"', keywords_text)]
                    if not keywords: # Fallback if keywords aren't quoted
                        keywords = [k.strip(' "\'') for k in keywords_text.split(',') if k.strip()]


                return {
                    "title": title,
                    "abstract": abstract,
                    "keywords": keywords
                }
            else:
                 print("Regex extraction failed to find title/abstract.")

        except Exception as regex_e:
            print(f"Regex extraction attempt failed: {regex_e}")


    # If all parsing fails
    print(f"Could not parse LLM response. Raw response:\n---\n{response_text[:500]}...\n---")
    return {"error": "Failed to parse JSON response", "raw_response": response_text}


def generate_abstract(prompt, provider, config, retries=3, backoff_factor=2, session=None):
    """Calls the specified LLM provider API to generate an abstract."""
    # Use provided session or create a new one
    if session is None:
        session = requests.Session()

    provider_config = config.get(provider)
    if not provider_config:
        return {"error": f"Configuration for provider '{provider}' not found."}

    api_url = provider_config.get("api_url")
    model = provider_config.get("model")

    headers = {"Content-Type": "application/json"}
    
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": provider_config.get("temperature", 0.7),
         # Ensure max_tokens is included, default if not specified
        "max_tokens": provider_config.get("max_tokens", 1024),
    }

    if provider == 'ollama':
        # Ollama uses a slightly different payload structure
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": provider_config.get("stream", False),
             # Ollama options might be nested
            "options": {
                 "temperature": provider_config.get("temperature", 0.7),
                 "num_predict": provider_config.get("max_tokens", 4000), # Ollama uses num_predict,
                "num_thread" : 30,
                "mirostat" : 0,
                "repeat_penalty" : 1
             }
        }
        # No Authorization header needed for default local Ollama typically
    elif provider == 'groq':
        api_key = provider_config.get("api_key")
        if not api_key or api_key == 'gsk_...':
             return {"error": "Groq API key not configured or is placeholder."}
        headers["Authorization"] = f"Bearer {api_key}"
        # Groq payload is already mostly correct from the base structure
    else:
        return {"error": f"Unsupported provider: {provider}"}


    for attempt in range(retries):
        try:
            print(f"  Attempt {attempt + 1}/{retries} calling {provider.upper()} API ({model})...")
            response = session.post(api_url, headers=headers, json=payload, timeout=120) # Increased timeout

            if response.status_code == 429: # Rate limit error
                 wait_time = backoff_factor ** (attempt + 1) + random.uniform(0,1) # Add jitter
                 print(f"  Rate limit hit for {provider.upper()}. Retrying in {wait_time:.2f}s...")
                 time.sleep(wait_time)
                 continue # Retry the loop

            # Check for other HTTP errors
            response.raise_for_status() # Raises HTTPError for bad responses (4xx or 5xx)

            response_json = response.json()

            # --- Extract content based on provider ---
            if provider == 'ollama':
                if "message" in response_json and "content" in response_json["message"]:
                    response_text = response_json["message"]["content"]
                elif "error" in response_json:
                     raise Exception(f"Ollama API Error: {response_json['error']}")
                else:
                    raise Exception(f"Unexpected Ollama response format: {response_json}")
            elif provider == 'groq':
                if "choices" in response_json and len(response_json["choices"]) > 0:
                    message = response_json["choices"][0].get("message", {})
                    response_text = message.get("content")
                    if response_text is None:
                         # Check finish reason
                         finish_reason = response_json["choices"][0].get("finish_reason")
                         if finish_reason == "error":
                             raise Exception(f"Groq API Error reported in choice: {response_json}")
                         elif finish_reason == "length":
                              raise Exception(f"Groq generation stopped due to max_tokens limit.")
                         else:
                             raise Exception("Groq API response missing content in message.")
                elif "error" in response_json:
                     error_info = response_json['error']
                     raise Exception(f"Groq API Error: {error_info.get('message', 'Unknown error')}")
                else:
                    raise Exception(f"Unexpected Groq response format: {response_json}")
            else:
                 # Should not happen due to check at the start
                 raise Exception(f"Provider logic error for {provider}")

            # --- Parse the extracted text ---
            parsed_data = parse_llm_response(response_text)
            return parsed_data # Success


        except requests.exceptions.RequestException as e:
            print(f"  Network Error calling {provider.upper()} API: {e}")
            if attempt == retries - 1:
                 return {"error": f"Network Error after {retries} attempts: {e}"}
            wait_time = backoff_factor ** attempt + random.uniform(0,1)
            print(f"  Retrying in {wait_time:.2f}s...")
            time.sleep(wait_time)
        except Exception as e:
            print(f"  Error processing {provider.upper()} response (Attempt {attempt + 1}): {e}")
            if attempt == retries - 1:
                return {"error": f"Failed after {retries} attempts: {e}"}
            wait_time = backoff_factor ** attempt + random.uniform(0,1)
            print(f"  Retrying in {wait_time:.2f}s...")
            time.sleep(wait_time)

    # Should not be reached if retries loop works correctly, but as a fallback
    return {"error": f"Failed to generate abstract from {provider.upper()} after {retries} attempts."}

def create_prompt(topic_mix, diversity_params, selected_subtopics):
    """Create a prompt for the LLM to generate an abstract with consistent subtopics."""
    formatted_topics = []
    topic_names = []
    forbidden_words = []
    allowed_subtopic_words = []
    
    for topic_id, weight in topic_mix.items():
        topic_name = TOPICS[topic_id]["name"]
        topic_names.append(topic_name)
        forbidden_words.append(topic_name.lower())
        forbidden_words.extend([word.lower() for word in topic_name.split()])
        
        # Use the pre-selected subtopic
        subtopic = selected_subtopics[topic_id]
        allowed_subtopic_words.append(subtopic)
        
        percentage = int(weight * 100)
        formatted_topics.append(f"{topic_name} (specifically {subtopic}): {percentage}% focus")

    # Remove duplicates and create strings
    forbidden_words = list(set(forbidden_words))
    forbidden_words_str = ", ".join([f'"{w}"' for w in forbidden_words])
    allowed_subtopic_words_str = ", ".join([f'"{w}"' for w in allowed_subtopic_words])
    
    topics_text = "\n".join([f"- {t}" for t in formatted_topics])
    num_topics = len(topic_mix)

    # Define core topic names
    core_topic_names = [TOPICS[t]['name'] for t in topic_mix]
    focus_description = ' AND '.join(core_topic_names) if num_topics <= 2 else 'these topics (' + ', '.join(core_topic_names) + ')'

    # Create the prompt
    prompt = f"""
You are an academic expert simulating the creation of a research abstract. 
Your task is to generate ONE research abstract that fits a specific profile.

**CRITICAL REQUIREMENT: The generated 'abstract' field's text MUST be a minimum of 250 words long.** Do not generate short summaries.

Your paper synthesizes the following topics. Adhere strictly to this distribution:
{topics_text}

VOCABULARY RESTRICTIONS:
- FORBIDDEN WORDS: You must NOT use the following topic words in your abstract or title: {forbidden_words_str}
- FORBIDDEN WORDS: You must NOT use the following subtopic words in your abstract or title: {allowed_subtopic_words_str}
- DOMAIN: Do not use the word "{diversity_params['domain']}" explicitly in the abstract or title

REQUIRED ABSTRACT CONTENT GUIDELINES: 
- The study's focus: {focus_description}
- Domain application: {diversity_params['domain']}
- Methodology: {diversity_params['methodological_approaches']}
- Findings
- Conclude 
- Rhetorical style: {diversity_params['rhetorical_structures']}.

Ensure the abstract is cohesive, detailed, and meets the 250-word minimum requirement.

ADDITIONAL PAPER ATTRIBUTES TO REFLECT:
- Concept granularity: {diversity_params['concept_granularity']} (Reflects in the level of detail in findings)
- Interdisciplinary orientation: {diversity_params['interdisciplinary_orientation']} (Reflected if multiple topics are distinct)
- Temporal context: {diversity_params['temporal_context']} (Use appropriate tense/phrasing)

LINGUISTIC CHARACTERISTICS TO EMBODY:
- Formality level: {diversity_params['formality_levels']} 
- Terminology density: {diversity_params['terminology_density']} 
- Concept blending approach: {diversity_params['concept_blending']} 

MANDATORY INSTRUCTIONS:
1. **Generate ONE academic abstract where the 'abstract' text is MINIMUM 250 words.**
2. DO NOT USE THE TOPIC WORDS listed in FORBIDDEN WORDS in the abstract or title.
3. DO NOT USE THE SUBTOPIC words in your abstract or title. 
4. Strictly follow the content guidelines.
5. Adhere to the Topic Distribution percentages.
6. Include at least 3-5 specific, concrete findings, methods, or implications. AVOID VAGUENESS. Elaborate on points.
7. Ensure the abstract is academically plausible and internally consistent.
8. Do NOT mention the percentages, parameters, instructions, or section headers explicitly in the output abstract text.
9. Do NOT write a short summary; generate a detailed, well-developed abstract fulfilling the minimum word count.
10. **The final 'abstract' field content MUST meet the 250-word minimum.**

OUTPUT FORMAT:  
Return ONLY a single, valid JSON object containing the keys 'title', 'abstract', and 'keywords'.
- The 'abstract' value must be a single string containing the full abstract text (minimum 250 words).
- The 'keywords' value must be a list of 4-6 relevant strings.
- Do NOT include ```json markdown wrappers, comments, explanations, or any text outside the JSON structure.

Example JSON structure (fill with generated content):
{{
  "title": "A Plausible and Specific Academic Title Reflecting the Content",
  "abstract": "Abstract text here...\\n\\nMore text here... (Ensuring total abstract is >= 250 words)",
  "keywords": ["Relevant Keyword 1", "Keyword 2", "Topic Keyword", "Method Keyword", "Domain Keyword"]
}}
"""

    return prompt


# --- Main Pipeline ---

def generate_synthetic_dataset(config):
    """Generate a synthetic dataset of abstracts using configured LLM providers."""
    # Reuse one HTTP session per provider so connections are pooled across calls
    ollama_session = requests.Session()
    groq_session = requests.Session()

    num_documents = config["num_documents"]
    print("Creating topic network...")
    topic_network = create_topic_network()

    print(f"\nGenerating {num_documents} abstracts using Ollama and Groq...")
    dataset = []
    providers = ['ollama', 'groq'] # Define providers to use

    for i in tqdm(range(num_documents), desc="Generating Abstracts"):
        print(f"\n--- Generating Abstract Set {i+1}/{num_documents} ---")
        try:
            # 1. Sample concept mix & diversity params (same for both LLMs)
            topic_mix = sample_concept_mix(topic_network)
            diversity_params = sample_diversity_params()

            # Store forbidden topic words and selected subtopics for later checks
            forbidden_words = []
            selected_subtopics = {}
            
            for topic_id, weight in topic_mix.items():
                topic_name = TOPICS[topic_id]["name"]
                forbidden_words.append(topic_name.lower())
                forbidden_words.extend([word.lower() for word in topic_name.split()])
                
                # Select a subtopic randomly
                subtopic = np.random.choice(TOPICS[topic_id]["subtopics"])
                selected_subtopics[topic_id] = subtopic

            # Remove duplicates in forbidden words
            forbidden_words = list(set(forbidden_words))

            # 2. Create prompt - now passing selected_subtopics as an argument
            prompt = create_prompt(topic_mix, diversity_params, selected_subtopics)

            # 3. Generate abstract from each provider
            results = {}
            for provider in providers:
                print(f" Generating with {provider.upper()}...")
                start_time = time.time()
                # Pass the main config, the function will extract provider-specific settings
                result_data = generate_abstract(prompt, 
                                                provider, 
                                                config, 
                                                session=ollama_session if provider == 'ollama' else groq_session)
                end_time = time.time()
                print(f"  {provider.upper()} generation took {end_time - start_time:.2f}s")
                results[provider] = result_data
            
            # 4. Check for forbidden words and subtopic presence
            for provider in providers:
                # Skip checks if there was an error or no abstract
                if results[provider].get("error") or not results[provider].get("abstract"):
                    continue

                abstract_text = results[provider].get("abstract", "").lower()
                title_text = results[provider].get("title", "").lower()
                combined_text = abstract_text + " " + title_text

                # Check for exact topic phrases (ignoring case)
                found_forbidden_words = []
                for topic_id, weight in topic_mix.items():
                    topic_name = TOPICS[topic_id]["name"].lower()
                    if topic_name in combined_text:
                        found_forbidden_words.append(topic_name)

                results[provider]["contains_forbidden_words"] = len(found_forbidden_words) > 0
                results[provider]["found_forbidden_words"] = found_forbidden_words if found_forbidden_words else []

                # Check for subtopic presence
                subtopic_presence = {}
                for topic_id, subtopic in selected_subtopics.items():
                    subtopic_lower = subtopic.lower()
                    subtopic_presence[subtopic] = subtopic_lower in abstract_text

                results[provider]["contains_subtopics"] = subtopic_presence
                results[provider]["any_subtopic_present"] = any(subtopic_presence.values())

            # 5. Store results with metadata
            entry = {
                "id": i + 1,
                "topic_mix": topic_mix,
                "diversity_params": diversity_params,
                "forbidden_words": forbidden_words,
                "selected_subtopics": selected_subtopics,
                "prompt": prompt,
            }
            
            # Add provider-specific results
            for provider in providers:
                entry[f"{provider}_title"] = results[provider].get("title")
                entry[f"{provider}_abstract"] = results[provider].get("abstract")
                entry[f"{provider}_keywords"] = results[provider].get("keywords")
                entry[f"{provider}_error"] = results[provider].get("error")
                entry[f"{provider}_raw_response"] = results[provider].get("raw_response")
                
                # Add new flags
                if not results[provider].get("error") and results[provider].get("abstract"):
                    entry[f"{provider}_contains_forbidden_words"] = results[provider].get("contains_forbidden_words", False)
                    entry[f"{provider}_found_forbidden_words"] = results[provider].get("found_forbidden_words", [])
                    entry[f"{provider}_contains_subtopics"] = results[provider].get("contains_subtopics", {})
                    entry[f"{provider}_any_subtopic_present"] = results[provider].get("any_subtopic_present", False)
            
            dataset.append(entry)

            # Optional delay between generating sets of abstracts
            time.sleep(1)

        except Exception as e:
            print(f"!! Unexpected error during generation loop for abstract set {i+1}: {e}")
            # Add placeholder error entry
            dataset.append({
                "id": i + 1,
                "topic_mix": topic_mix if 'topic_mix' in locals() else {},
                "diversity_params": diversity_params if 'diversity_params' in locals() else {},
                "ollama_error": f"Outer loop error: {e}",
                "groq_error": f"Outer loop error: {e}",
            })

    # --- Post-processing and Saving ---
    print("\nProcessing results...")

    # Convert mix dictionaries to strings for easier CSV viewing
    for item in dataset:
        if isinstance(item.get("topic_mix"), dict):
             item["topic_mix_str"] = ", ".join([f"{TOPICS.get(k, {'name':k})['name']}: {v*100:.0f}%" for k, v in item["topic_mix"].items()])
        else:
             item["topic_mix_str"] = "Error processing mix"

        # Ensure keywords are stored as strings for CSV
        for provider in providers:
            kw_key = f"{provider}_keywords"
            if isinstance(item.get(kw_key), list):
                 item[kw_key] = json.dumps(item[kw_key]) # Store list as JSON string
            elif item.get(kw_key) is None and item.get(f"{provider}_error") is None:
                 item[kw_key] = json.dumps([]) # Empty list if generation was successful but no keywords found


    # Create DataFrame
    df = pd.DataFrame(dataset)

    # Reorder columns for better readability
    cols_order = ["id", "topic_mix_str"] 
    cols_order.extend([f"{p}_{field}" for p in providers for field in ["title", "abstract", "keywords", 
                                                                     "contains_forbidden_words", 
                                                                     "found_forbidden_words", 
                                                                     "contains_subtopics", 
                                                                     "any_subtopic_present",
                                                                     "error", "raw_response"]])
    cols_order.extend(["topic_mix", "diversity_params", "forbidden_words", "selected_subtopics", "prompt"])    
    cols_order = [col for col in cols_order if col in df.columns]
    df = df[cols_order]


    # Save dataset
    output_file = config["output_file"]
    try:
        df.to_csv(output_file, index=False, encoding='utf-8')
        print(f"\nDataset saved successfully to {output_file}")
    except Exception as e:
        print(f"\nError saving dataset to CSV: {e}")


    # Create topics DataFrame: one row per (abstract, topic) pair
    topic_rows = []
    for idx, row in df.iterrows():
        # Ensure topic_mix is a dictionary before iterating
        if isinstance(row.get("topic_mix"), dict):
            for topic_id, weight in row["topic_mix"].items():
                # Safely get topic name, handle missing IDs
                topic_data = TOPICS.get(topic_id, {"name": f"Unknown Topic ({topic_id})"})
                topic_name = topic_data.get("name", f"Unnamed Topic ({topic_id})")
                topic_rows.append({
                    "id": row["id"],
                    "topic_id": topic_id,
                    "topic_name": topic_name,
                    "weight": weight
                })

    topics_df = pd.DataFrame(topic_rows) if topic_rows else pd.DataFrame(columns=["id", "topic_id", "topic_name", "weight"])

    print(f"\nGenerated data for {len(df)} prompts.")
    # Count successes/failures per provider
    for provider in providers:
        success_count = df[f'{provider}_error'].isna().sum()
        print(f"  {provider.upper()}: {success_count} successful generations, {len(df) - success_count} failures/errors.")


    return df, topics_df

In [None]:
CONFIG = {
    "num_documents": 100,  # Reduced for faster testing, increase as needed
    "output_file": "synthetic_abstracts_dual_llm_censored.csv",
    "topic_network_file": "topic_network_censored.json",

    "ollama": {
        "api_url": "http://localhost:11434/api/chat",
        "model": "llama3.1:latest", #"qwen2.5",  # Or your preferred local model
        "max_tokens": 8000, # Adjust token limit for Ollama model if needed
        "temperature": 0.7,
        "stream": False,
    },

    "groq": {
        "api_url": "https://api.groq.com/openai/v1/chat/completions",
        "api_key": groq_api_key,
        # Choose a Groq model: 'llama3-8b-8192', 'llama3-70b-8192', 'mixtral-8x7b-32768', 'gemma-7b-it'
        "model": 'llama3-70b-8192',
        "max_tokens": 8000, # Groq often uses token limits differently, adjust as needed
        "temperature": 0.7,
    }
}

# CONFIG['num_documents'] = 10

In [None]:
# --- Execution ---
if __name__ == "__main__":
    # Ensure you have replaced 'gsk_...' in CONFIG with your actual Groq key

    # Check if Groq key is still the placeholder
    if CONFIG['groq']['api_key'] == 'gsk_...':
        print("="*60)
        print("ERROR: Please replace 'gsk_...' in the CONFIG dictionary")
        print("       with your actual Groq API key before running.")
        print("="*60)
    else:
        # Generate the dataset
        generated_df, topics_info_df = generate_synthetic_dataset(CONFIG)

        # Display first few rows of the generated data
        print("\n--- Sample Generated Data ---")
        print(generated_df[['id', 'topic_mix_str', 'ollama_title', 'groq_title']].head())



In [None]:
generated_df.to_feather('generated_df_100_censored_longer.feather')
topics_info_df.to_feather('topics_info_df_100_censored_longer.feather')
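
## Checking Constraint Adherence

Once the dataset is saved, one quick sanity check on generation fidelity is the share of abstracts per provider that still contain a forbidden topic name. The sketch below works on plain dictionaries so that it runs without the generated data; the helper name `forbidden_rate` and the sample records are hypothetical, while the column name follows the `{provider}_contains_forbidden_words` pattern used in the pipeline.

```python
def forbidden_rate(rows, provider):
    """Fraction of checked abstracts flagged as containing a forbidden topic name."""
    key = f"{provider}_contains_forbidden_words"
    flags = [row.get(key) for row in rows]
    # Rows where generation failed carry no flag (None or NaN), so keep booleans only
    checked = [bool(f) for f in flags if f in (True, False)]
    return sum(checked) / len(checked) if checked else float("nan")

# Hypothetical records; in the notebook these could come from generated_df.to_dict("records")
rows = [
    {"ollama_contains_forbidden_words": False},
    {"ollama_contains_forbidden_words": True},
    {"ollama_contains_forbidden_words": None},  # failed generation, not counted
    {"ollama_contains_forbidden_words": False},
]
print(round(forbidden_rate(rows, "ollama"), 3))
# → 0.333
```

A high rate for one provider would suggest that its abstracts need filtering or regeneration before they are used in the evaluation phase.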