This Python tool uses Google's Gemini language model on Vertex AI to perform thematic analysis on textual data. It supports processing data from DOCX directories or XLSX spreadsheets.
This tool implements the Hybrid Human-in-the-Loop Thematic Analysis (HHitL-TA) framework (Wiebe et al., 2025). Unlike fully automated approaches, this 6-stage framework prioritizes human interpretation and transparency.
It utilizes a hybrid approach:
- Deductive: You provide high-level theoretical Constructs (e.g., Social Cognitive Career Theory factors).
- Inductive: The AI generates granular Codes derived from the data, nested within those constructs.
- Multi-Format Support: Parses DOCX files in a directory or reads rows from an XLSX spreadsheet.
- Hybrid Coding: Generates inductive codes grounded in a priori theoretical constructs.
- Human-in-the-Loop: The process is staged; the user reviews and refines JSON/Excel outputs at every step before proceeding.
- Within-Case & Cross-Case Analysis: Analyzes relationships within specific documents/rows and summarizes patterns across the full dataset.
- Network Visualization: Generates network graphs for code-theme relationships.
- Magnitude Coding: Assigns intensity scores (e.g., 1-7) to coded extracts based on expression strength.
- Google Cloud Project: A project with Vertex AI enabled.
- Model: Access to gemini-1.5-pro-002 (or similar).
- Python: Version 3.9 or higher.
- API Key: A valid Google Cloud API key or configured gcloud auth.
- Clone the Repository:

      git clone https://github.com/encorelab/ai-theme-analyser.git
      cd ai-theme-analyser

- Install Dependencies:

      pip install -r requirements.txt

- Environment Setup: Create a .env file in the project root:

      PROJECT_ID=your-google-project-id
      LOCATION=us-central1
      GEMINI_MODEL=gemini-1.5-pro-002

- Authentication: Ensure your local environment is authenticated with Google Cloud:

      gcloud auth application-default login
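Before running any stage, it can help to confirm that the three variables from the Environment Setup step are actually present. The following is a minimal sketch, not part of the tool: the helper names `parse_env_text` and `missing_keys` are hypothetical, and it assumes the .env file uses plain KEY=VALUE lines as shown above.

```python
from pathlib import Path

# Keys listed in the Environment Setup step.
REQUIRED_KEYS = ("PROJECT_ID", "LOCATION", "GEMINI_MODEL")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    missing = missing_keys(parse_env_text(text))
    print("Missing keys:", missing if missing else "none")
```

Run it from the project root; an empty result means all three keys are set.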
This guide assumes you are analyzing an Excel file named cscl_data.xlsx with data in a sheet named Abstracts. If you are using DOCX files, omit the --input_xlsx and --sheet_name flags; the tool then defaults to the INPUT_DIR in config.py.
Goal: Prepare your data and define the theoretical lens.
- Data: Ensure cscl_data.xlsx has a header row. Each row is treated as a distinct "source."
- Constructs: Create a file named constructs.json in the root directory. These are the "buckets" the AI will look for.
  - Example Structure:

        [
          {
            "construct": "Self-Efficacy",
            "definition": "Belief in one's capability to organize and execute courses of action...",
            "examples": "I am confident I can solve this problem.",
            "exclude": "General confidence unrelated to the task."
          }
        ]
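A quick structural check of constructs.json can catch typos before a run. This is a minimal sketch under one assumption: that every entry should carry the four fields shown in the example structure (`construct`, `definition`, `examples`, `exclude`). Whether the tool itself requires all four is not documented here, and `validate_constructs` is a hypothetical helper.

```python
import json

# Fields taken from the example structure; treating all four as required
# is an assumption of this sketch.
EXPECTED_FIELDS = ("construct", "definition", "examples", "exclude")


def validate_constructs(constructs):
    """Return a list of readable problems; an empty list means none found."""
    if not isinstance(constructs, list):
        return ["Top-level JSON value should be a list of construct objects."]
    problems = []
    for index, entry in enumerate(constructs):
        if not isinstance(entry, dict):
            problems.append(f"Entry {index} is not an object.")
            continue
        for field in EXPECTED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"Entry {index} has a missing or empty '{field}'.")
    return problems


sample_text = """
[
  {
    "construct": "Self-Efficacy",
    "definition": "Belief in one's capability to organize and execute courses of action...",
    "examples": "I am confident I can solve this problem.",
    "exclude": "General confidence unrelated to the task."
  }
]
"""
sample = json.loads(sample_text)
print(validate_constructs(sample))
```

To check a real file, replace `sample_text` with the contents of constructs.json.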
Goal: Generate a verified "Codebook" of inductive codes based on a sample of your data.
- Generate Initial Codes: The tool will sample your data and generate codes fitting your constructs.

      python main.py --client generate_initial_codes \
        --input_xlsx "cscl_data.xlsx" \
        --sheet_name "Abstracts"

- Human Review (Crucial):
  - Open the resulting Excel file in output_files/.
  - Review the code_justifications sheet.
  - Delete irrelevant codes, rename codes, or refine definitions.
  - Save this refined list as codes.json.
Goal: Test your refined codes.json on a new sample to ensure saturation.
    python main.py --client verify_initial_codes \
      --input_xlsx "cscl_data.xlsx" \
      --sheet_name "Abstracts"

Goal: Apply your verified codes to the entire dataset.
- Run Full Coding:

      python main.py --client generate_full_dataset_codes \
        --input_xlsx "cscl_data.xlsx" \
        --sheet_name "Abstracts"

- Merge Synonyms: If the AI created similar codes (e.g., "student motivation" and "pupil motivation"), use the merger client to consolidate them.

      python main.py --client merge_codes

  Follow the prompts to select your Stage 3 output file.

- Apply Merges: Update your dataset with the merged codes.

      python main.py --client replace_merged_codes

  Result: You now have a merged_codings.xlsx file.
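Conceptually, applying merges means replacing each synonym with its canonical code and leaving every other code unchanged. The sketch below illustrates that idea only. The row layout (`source`, `code`) and the `merge_map` dictionary are hypothetical and are not the tool's actual file format.

```python
def apply_merges(codings, merge_map):
    """Return new coding rows with synonym codes replaced by canonical ones."""
    merged = []
    for row in codings:
        new_row = dict(row)  # copy so the original rows stay untouched
        new_row["code"] = merge_map.get(row["code"], row["code"])
        merged.append(new_row)
    return merged


# Hypothetical coded rows and a single merge decision.
codings = [
    {"source": "row_1", "code": "pupil motivation"},
    {"source": "row_2", "code": "student motivation"},
    {"source": "row_3", "code": "peer feedback"},
]
merge_map = {"pupil motivation": "student motivation"}

merged = apply_merges(codings, merge_map)
for row in merged:
    print(row["source"], "->", row["code"])
```

Copying each row keeps the pre-merge codings available for comparison during human review.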
Goal: Group your verified codes into meaningful Themes and Sub-themes within your Constructs.
- Generate Hierarchy: The AI analyzes code co-occurrences to suggest a thematic structure.

      python main.py --client generate_themes

- Visualize: Create network graphs of your themes.

      python main.py --client visualize_themes

Human Action: A JSON hierarchy and PNG images are generated. Review the JSON. If a theme is miscategorized, move it manually in the JSON file before the next step.
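To make "code co-occurrences" concrete, the sketch below counts how often two codes are applied to the same source. It illustrates the general idea only. How the tool computes or weights co-occurrence is not documented here, and the `codes_by_source` mapping is a hypothetical input.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(codes_by_source):
    """Count unordered code pairs that appear together in the same source."""
    counts = Counter()
    for codes in codes_by_source.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return counts


# Hypothetical codings for three sources.
codes_by_source = {
    "abstract_1": ["self-efficacy", "peer feedback", "goal setting"],
    "abstract_2": ["self-efficacy", "peer feedback"],
    "abstract_3": ["goal setting"],
}

pair_counts = count_cooccurrences(codes_by_source)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Pairs with high counts are candidates for grouping under one theme, which is the kind of suggestion to verify when reviewing the JSON hierarchy.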
Goal: Deep dive into specific documents or calculate "Intensity."
- Magnitude/Intensity Coding: Ask the AI to rate the intensity (e.g., 1-7) of every applied code (e.g., how strongly was "Self-Efficacy" expressed in Abstract #42?).

      python main.py --client generate_intensity_codes \
        --input_xlsx "cscl_data.xlsx" \
        --sheet_name "Abstracts"

- Qualitative Summaries: Generate text summaries for specific themes across the dataset or for specific subgroups.

      python main.py --client generate_theme_summaries \
        --input_xlsx "cscl_data.xlsx" \
        --sheet_name "Abstracts"
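Once intensity ratings exist, a common follow-up is to sanity-check the scale and average the scores per code. The sketch below assumes a 1-7 scale, as in the example above, and a hypothetical row layout (`source`, `code`, `intensity`). The tool's actual output columns may differ.

```python
from collections import defaultdict
from statistics import mean

# Assumed scale bounds, following the 1-7 example.
SCALE_MIN, SCALE_MAX = 1, 7


def mean_intensity_by_code(ratings):
    """Average intensity per code, rejecting scores outside the scale."""
    grouped = defaultdict(list)
    for rating in ratings:
        score = rating["intensity"]
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(
                f"Intensity {score} for {rating['source']} is outside "
                f"{SCALE_MIN}-{SCALE_MAX}."
            )
        grouped[rating["code"]].append(score)
    return {code: mean(scores) for code, scores in grouped.items()}


# Hypothetical ratings for illustration.
ratings = [
    {"source": "abstract_42", "code": "Self-Efficacy", "intensity": 6},
    {"source": "abstract_7", "code": "Self-Efficacy", "intensity": 4},
    {"source": "abstract_7", "code": "Peer Feedback", "intensity": 2},
]

averages = mean_intensity_by_code(ratings)
print(averages)
```

An out-of-range score raises an error instead of being averaged silently, so a malformed model response is flagged for review.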
Goal: Synthesize outputs for the manuscript.
Run the cross-document analyzer to detect high-level patterns, or compile the generated Excel statistics, Network Graphs, and Qualitative Summaries manually.
    python main.py --client cross_document_analyzer

- Rate Limits: If the script pauses frequently or crashes with 429 errors, check log.txt. You may need to increase time_between_calls in src/utils.py or request a quota increase for Vertex AI.
- Empty Outputs: If an Excel output file is empty, check that your constructs.json definitions are broad enough to capture data in your text.
- XLSX Formatting: Ensure your Excel file does not have complex formatting (merged cells, images) in the header row. The tool reads every column in a row and concatenates them into a single text block for analysis.
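If you wrap your own calls around the model, retrying with exponential backoff is a different technique from the fixed time_between_calls delay mentioned above, and it can reduce 429 failures. The sketch below is not how this tool handles rate limits. `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and `call_with_backoff` is an illustrative helper.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the exception raised on an HTTP 429."""


def call_with_backoff(func, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call func, doubling the wait after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))


# Demonstration with a fake call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


waits = []  # record requested delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

The `sleep` parameter is injectable so the demonstration runs instantly; in real use, leave it at its default.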
If you use this tool, please cite:
Wiebe, J. P., Khan, R., Burns, S., & Slotta, J. D. (2025). Qualitative Research in the Age of LLMs: A Human-in-the-Loop Approach to Hybrid Thematic Analysis. University of Toronto.