AI-Theme-Analyzer

This Python tool uses Google's Gemini language model on Vertex AI to perform thematic analysis on textual data. It accepts input either as a directory of DOCX files or as an XLSX spreadsheet.

Theoretical Framework: HHitL-TA

This tool implements the Hybrid Human-in-the-Loop Thematic Analysis (HHitL-TA) framework (Wiebe et al., 2025). Unlike fully automated approaches, this 6-stage framework prioritizes human interpretation and transparency.

It utilizes a hybrid approach:

  1. Deductive: You provide high-level theoretical Constructs (e.g., Social Cognitive Career Theory factors).
  2. Inductive: The AI generates granular Codes derived from the data, nested within those constructs.

Features

  • Multi-Format Support: Parses DOCX files in a directory or reads rows from an XLSX spreadsheet.
  • Hybrid Coding: Generates inductive codes grounded in a priori theoretical constructs.
  • Human-in-the-Loop: The process is staged; the user reviews and refines JSON/Excel outputs at every step before proceeding.
  • Within-Case & Cross-Case Analysis: Analyzes relationships within specific documents/rows and summarizes patterns across the full dataset.
  • Network Visualization: Generates network graphs for code-theme relationships.
  • Magnitude Coding: Assigns intensity scores (e.g., 1-7) to coded extracts based on expression strength.

Prerequisites

  • Google Cloud Project: A project with Vertex AI enabled.
  • Model: Access to gemini-1.5-pro-002 (or similar).
  • Python: Version 3.9 or higher.
  • API Key: A valid Google Cloud API key or configured gcloud auth.

Installation

  1. Clone the Repository:

    git clone https://github.com/encorelab/ai-theme-analyser.git
    cd ai-theme-analyser
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Environment Setup: Create a .env file in the project root:

    PROJECT_ID=your-google-project-id
    LOCATION=us-central1
    GEMINI_MODEL=gemini-1.5-pro-002
  4. Authentication: Ensure your local environment is authenticated with Google Cloud:

    gcloud auth application-default login
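
Before running any stage, it can help to confirm that the .env file actually contains the three keys listed in step 3. The following is a minimal sketch using only the standard library. The helper names (parse_env, missing_keys) are hypothetical and are not part of this repository, and how the tool itself loads .env is not shown here, so treat this as an independent sanity check rather than the tool's own loader.

```python
from pathlib import Path

# Keys taken from the Environment Setup step of this README.
REQUIRED_KEYS = ("PROJECT_ID", "LOCATION", "GEMINI_MODEL")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    missing = missing_keys(parse_env(text))
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

Run it from the project root. An empty "Missing keys" report only means the file is well-formed; it does not verify that the project ID or model name is valid on Vertex AI.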

Usage: The 6-Stage Workflow

This guide assumes you are analyzing an Excel file named cscl_data.xlsx with data in a sheet named Abstracts. If you are using DOCX files instead, omit the --input_xlsx and --sheet_name flags; the tool then reads from the INPUT_DIR set in config.py.

Stage 0: Data & Construct Preparation

Goal: Prepare your data and define the theoretical lens.

  1. Data: Ensure cscl_data.xlsx has a header row. Each row is treated as a distinct "source."
  2. Constructs: Create a file named constructs.json in the root directory. Each construct is a "bucket" that the AI looks for in the data and nests its codes under.
    • Example Structure:
      [
        {
          "construct": "Self-Efficacy",
          "definition": "Belief in one's capability to organize and execute courses of action...",
          "examples": "I am confident I can solve this problem.",
          "exclude": "General confidence unrelated to the task."
        }
      ]
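
A malformed constructs.json is easier to catch before a model call than after. The sketch below checks a parsed constructs list against the four fields shown in the example structure above. It assumes all four fields are required and must be non-empty strings, which this README does not state explicitly; validate_constructs is a hypothetical helper, not a function from this repository.

```python
import json

# Field names copied from the example structure; treating all four as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = ("construct", "definition", "examples", "exclude")


def validate_constructs(constructs):
    """Return a list of problems; an empty list means the data looks usable."""
    if not isinstance(constructs, list):
        return ["top-level JSON value must be a list of construct objects"]
    problems = []
    seen = set()
    for i, item in enumerate(constructs):
        if not isinstance(item, dict):
            problems.append(f"entry {i}: expected an object")
            continue
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{field}'")
        name = item.get("construct")
        if isinstance(name, str) and name.strip():
            if name in seen:
                problems.append(f"entry {i}: duplicate construct '{name}'")
            seen.add(name)
    return problems


if __name__ == "__main__":
    sample = json.loads("""
    [
      {
        "construct": "Self-Efficacy",
        "definition": "Belief in one's capability to organize and execute courses of action...",
        "examples": "I am confident I can solve this problem.",
        "exclude": "General confidence unrelated to the task."
      }
    ]
    """)
    for problem in validate_constructs(sample) or ["no problems found"]:
        print(problem)
```

To check your own file, replace the inline sample with json.load on an open constructs.json.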

Stage 1 & 2: Familiarization & Initial Inductive Coding

Goal: Generate a verified "Codebook" of inductive codes based on a sample of your data.

  1. Generate Initial Codes: The tool will sample your data and generate codes fitting your constructs.

    python main.py --client generate_initial_codes \
      --input_xlsx "cscl_data.xlsx" \
      --sheet_name "Abstracts"
  2. Human Review (Crucial):

    • Open the resulting Excel file in output_files/.
    • Review the code_justifications sheet.
    • Delete irrelevant codes, rename codes, or refine definitions.
    • Save this refined list as codes.json.

Stage 2.5: Verify Codes (Optional)

Goal: Test your refined codes.json on a new sample to check for saturation, that is, to confirm that fresh data no longer calls for substantially new codes.

python main.py --client verify_initial_codes \
  --input_xlsx "cscl_data.xlsx" \
  --sheet_name "Abstracts"

Stage 3: Full Dataset Coding

Goal: Apply your verified codes to the entire dataset.

  1. Run Full Coding:

    python main.py --client generate_full_dataset_codes \
      --input_xlsx "cscl_data.xlsx" \
      --sheet_name "Abstracts"
  2. Merge Synonyms: If the AI created similar codes (e.g., "student motivation" and "pupil motivation"), use the merger client to consolidate them.

    python main.py --client merge_codes

    Follow the prompts to select your Stage 3 output file.

  3. Apply Merges: Update your dataset with the merged codes.

    python main.py --client replace_merged_codes

    Result: You now have a merged_codings.xlsx file.

Stage 4: Theme Generation

Goal: Group your verified codes into meaningful Themes and Sub-themes within your Constructs.

  1. Generate Hierarchy: The AI analyzes code co-occurrences to suggest a thematic structure.

    python main.py --client generate_themes
  2. Visualize: Create network graphs of your themes.

    python main.py --client visualize_themes

    Human Action: A JSON hierarchy and PNG images are generated. Review the JSON. If a theme is miscategorized, move it manually in the JSON file before the next step.

Stage 5: Within-Case Analysis & Triangulation

Goal: Examine specific documents in depth, or calculate code "intensity."

  1. Magnitude/Intensity Coding: Ask the AI to rate the intensity of every applied code on a numeric scale (e.g., 1-7). For example: how strongly was "Self-Efficacy" expressed in Abstract #42?

    python main.py --client generate_intensity_codes \
      --input_xlsx "cscl_data.xlsx" \
      --sheet_name "Abstracts"
  2. Qualitative Summaries: Generate text summaries for specific themes across the dataset or for specific subgroups.

    python main.py --client generate_theme_summaries \
      --input_xlsx "cscl_data.xlsx" \
      --sheet_name "Abstracts"

Stage 6: Final Reporting

Goal: Synthesize outputs for the manuscript.

Run the cross-document analyzer to detect high-level patterns, or compile the generated Excel statistics, Network Graphs, and Qualitative Summaries manually.

python main.py --client cross_document_analyzer
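
All of the stage commands in this guide share one shape: python main.py --client NAME, optionally followed by --input_xlsx and --sheet_name. The sketch below builds that command line from Python, which can be handy for logging exactly what was run at each stage. build_command is a hypothetical helper, not part of this repository, and it only uses the flags that appear in this README.

```python
import shlex


def build_command(client, input_xlsx=None, sheet_name=None):
    """Build the argv list for one stage; XLSX flags are added only if given."""
    cmd = ["python", "main.py", "--client", client]
    if input_xlsx:
        cmd += ["--input_xlsx", input_xlsx]
    if sheet_name:
        cmd += ["--sheet_name", sheet_name]
    return cmd


if __name__ == "__main__":
    # Spreadsheet input, as in Stage 3.
    stage3 = build_command("generate_full_dataset_codes", "cscl_data.xlsx", "Abstracts")
    print(shlex.join(stage3))

    # A client shown without input flags in this guide, as in Stage 4.
    stage4 = build_command("generate_themes")
    print(shlex.join(stage4))
```

The helper deliberately prints commands instead of chaining them. Each stage ends with a human review step, so running the stages back to back without inspecting the outputs would defeat the purpose of the workflow. To execute a single stage from Python, the list can be passed to subprocess.run(cmd, check=True).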

Troubleshooting

  • Rate Limits: If the script pauses frequently or crashes with 429 errors, check log.txt. You may need to increase time_between_calls in src/utils.py or request a quota increase for Vertex AI.
  • Empty Outputs: If an Excel output file is empty, check that your constructs.json definitions are broad enough to capture data in your text.
  • XLSX Formatting: Ensure your Excel file does not have complex formatting (merged cells, images) in the header row. The tool reads every column in a row and concatenates them into a single text block for analysis.
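
To see why stray columns matter, it helps to preview the text block a row turns into once its cells are concatenated. The following is a rough sketch of that idea, not the tool's actual implementation: row_to_text is a hypothetical helper, and the separator and the skipping of empty cells are assumptions made for illustration.

```python
def row_to_text(row, sep=" "):
    """Join every non-empty cell of a row into one text block.

    Assumptions for this sketch: cells are joined with a single space and
    empty or whitespace-only cells are skipped.
    """
    parts = []
    for cell in row:
        if cell is None:
            continue
        text = str(cell).strip()
        if text:
            parts.append(text)
    return sep.join(parts)


if __name__ == "__main__":
    # A hypothetical row: title, an empty cell, a year, and an abstract.
    row = ["Scripted collaboration in CSCL", None, 2021, "We examine how students..."]
    print(row_to_text(row))
```

Because every column ends up in the text, ID columns, internal notes, or leftover formulas will also reach the model. Removing columns you do not want analyzed before running the tool keeps the input clean.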

Citation

If you use this tool, please cite:

Wiebe, J. P., Khan, R., Burns, S., & Slotta, J. D. (2025). Qualitative Research in the Age of LLMs: A Human-in-the-Loop Approach to Hybrid Thematic Analysis. University of Toronto.
