CLIO Implementation for PubMed Abstract Clustering

Project Overview

Complete implementation of the CLIO clustering system for PubMed abstracts about artificial intelligence, based on the CLIO paper methodology.

Virtual Environment Setup

IMPORTANT: Always use the virtual environment for this project:

# Activate virtual environment
source venv/bin/activate

# Deactivate when done
deactivate

Implementation Status ✅

All components have been successfully implemented and tested:

PubMed Data Collection (src/pubmed_collector.py)
- Fetches abstracts using PubMed E-utilities API
- Handles rate limiting and SSL issues
- Extracts title, abstract, PMID, authors, journal
- Caching: Automatically caches abstracts for 7 days to avoid redundant API calls
- Segment-level caching: Better handling of large queries with recursive date splitting
Embeddings (src/embedder.py)
- Uses Sentence-BERT (all-mpnet-base-v2)
- Generates 768-dimensional embeddings
- Includes similarity computation functions
- Offline Mode: Works with cached model to avoid download timeouts
Clustering (src/clusterer.py)
- K-means clustering with automatic k selection
- Multiple clustering methods: hierarchical, silhouette optimization, sqrt heuristic
- Representative abstract identification
- Normalized embeddings for better clustering
Cluster Naming (src/cluster_namer.py)
- Claude API integration (configurable model via CLAUDE_MODEL)
- Samples in-cluster and out-cluster abstracts
- Generates descriptive names and 2-sentence descriptions
- Robust JSON parsing with regex patterns
- Handles partial responses from Claude
Hierarchizer (src/hierarchizer.py)
- Multi-level hierarchy building with adaptive levels
- Neighborhood-based processing
- Parent cluster generation and deduplication
- Automatic top-level cluster determination (4-10 clusters)
- Query-based root cluster generation
Output Formatter (src/output_formatter.py)
- Text format (hierarchical tree view)
- JSON format (structured data)
- CSV format (flattened for analysis)

Running the Pipeline

Quick Test (20 abstracts)

source venv/bin/activate
export TOKENIZERS_PARALLELISM=false  # Suppress warnings
python src/run_clio_pipeline.py --max-abstracts 20 --min-cluster-size 3 --hierarchy-levels 2 --clustering-method hierarchical

Production Run (1000 abstracts)

source venv/bin/activate
export TOKENIZERS_PARALLELISM=false
python src/run_clio_pipeline.py --max-abstracts 1000 --min-cluster-size 10 --hierarchy-levels 3 --clustering-method silhouette

Command Line Options

--max-abstracts: Number of abstracts to fetch (default: 100)
--min-cluster-size: Minimum abstracts per cluster (default: 5)
--hierarchy-levels: Number of hierarchy levels (default: 3)
--clustering-method: Clustering strategy (default: hierarchical)
- hierarchical: Creates many clusters for better hierarchy building
- silhouette: Optimizes for cluster separation using silhouette score
- sqrt: Uses heuristic based on square root of number of abstracts
--output-dir: Output directory (default: output)
--no-cache: Disable caching of PubMed abstracts (cache enabled by default)

Output Structure

Each run creates a timestamped directory with:

output/run_YYYYMMDD_HHMMSS/
├── abstracts.json              # Raw PubMed data
├── embeddings.pkl              # Generated embeddings
├── clustering.pkl              # Clustering results
├── cluster_names.json          # Claude-generated names
├── complete_hierarchy.json     # Full hierarchy with top cluster and all levels
├── hierarchy.json              # Detailed structure with abstract IDs
├── hierarchy.txt               # Human-readable tree view
└── clusters.csv               # Flattened CSV for analysis

Key Output: complete_hierarchy.json

This is the main output for downstream applications, containing:

Root cluster with query-based name and description
Top-level clusters (automatically determined, typically 4-10)
Intermediate hierarchy levels (if any)
All base clusters sorted by size
Complete PMID lists for each cluster

Cost Estimation

100 abstracts: ~$0.30 (using Sonnet)
1000 abstracts: ~$3.00
10,000 abstracts: ~$30.00

Environment Variables

Set in .env file:

ANTHROPIC_API_KEY: Your Claude API key
PUBMED_EMAIL: Your email for PubMed API
CLAUDE_MODEL: Model choice (default: claude-sonnet-4-20250514)

Troubleshooting

SSL Certificate Errors: Already handled in code
Rate Limiting: Code includes delays between API calls
Memory Issues: Process abstracts in smaller batches
JSON Serialization: Numpy types are converted automatically

Example Output

ROOT CLUSTER:
▪ Artificial Intelligence Research
  A comprehensive collection of research on artificial intelligence applications. The abstracts cover diverse AI methodologies and their applications across various domains.

TOP LEVEL (5 clusters):

▪ Artificial Intelligence in Medical Imaging and Diagnostics
  This cluster focuses on AI applications in radiology and pathology...
  ▪ Deep Learning for Medical Images
    Applications of CNNs in X-ray and MRI analysis...
  ▪ AI in Cancer Detection
    Machine learning for early cancer diagnosis...

▪ AI Applications in Drug Discovery
  This cluster covers the use of AI in pharmaceutical research...
  ▪ Molecular Design with ML
    AI-driven drug molecule optimization...
  ▪ Clinical Trial Prediction
    Using AI to predict drug efficacy...

Behavioral Guidelines

If any part of the instructions is not clear, ask for clarifications
It is acceptable and recommended to ask questions to ensure accurate task completion

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLIO Implementation for PubMed Abstract Clustering

Project Overview

Virtual Environment Setup

Implementation Status ✅

Running the Pipeline

Quick Test (20 abstracts)

Production Run (1000 abstracts)

Command Line Options

Output Structure

Key Output: complete_hierarchy.json

Cost Estimation

Environment Variables

Troubleshooting

Example Output

Behavioral Guidelines

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLIO Implementation for PubMed Abstract Clustering

Project Overview

Virtual Environment Setup

Implementation Status ✅

Running the Pipeline

Quick Test (20 abstracts)

Production Run (1000 abstracts)

Command Line Options

Output Structure

Key Output: complete_hierarchy.json

Cost Estimation

Environment Variables

Troubleshooting

Example Output

Behavioral Guidelines