Complete implementation of the CLIO clustering system for PubMed abstracts about artificial intelligence, based on the CLIO paper methodology.
IMPORTANT: Always use the virtual environment for this project:
# Activate virtual environment
source venv/bin/activate
# Deactivate when done
deactivate
All components have been successfully implemented and tested:
- PubMed Data Collection (src/pubmed_collector.py)
  - Fetches abstracts using the PubMed E-utilities API
  - Handles rate limiting and SSL issues
  - Extracts title, abstract, PMID, authors, journal
  - Caching: Automatically caches abstracts for 7 days to avoid redundant API calls
  - Segment-level caching: Better handling of large queries with recursive date splitting
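The recursive date splitting mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in src/pubmed_collector.py: the function name, the `count_fn` callback (which stands in for a PubMed result-count query), and the 10,000-record cap are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple

DateRange = Tuple[date, date]


def split_date_range(
    start: date,
    end: date,
    count_fn: Callable[[date, date], int],  # hypothetical: number of hits in [start, end]
    max_per_segment: int = 10_000,          # assumed cap per segment
) -> List[DateRange]:
    """Recursively halve [start, end] until every segment fits under the cap."""
    # A single day cannot be split further, so it is accepted even if over the cap.
    if start >= end or count_fn(start, end) <= max_per_segment:
        return [(start, end)]
    mid = start + (end - start) // 2
    left = split_date_range(start, mid, count_fn, max_per_segment)
    right = split_date_range(mid + timedelta(days=1), end, count_fn, max_per_segment)
    return left + right
```

Each returned segment could then be fetched and cached on its own, so a failed or oversized query only invalidates one segment rather than the whole date range.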
- Embeddings (src/embedder.py)
  - Uses Sentence-BERT (all-mpnet-base-v2)
  - Generates 768-dimensional embeddings
  - Includes similarity computation functions
  - Offline Mode: Works with cached model to avoid download timeouts
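As a rough sketch of the embedding step, assuming the `sentence-transformers` package and an already-downloaded model: `embed_offline` and `cosine_similarity` are illustrative names, not necessarily the functions in src/embedder.py. The similarity helper is written in plain Python so it can be read without NumPy.

```python
import math
import os
from typing import List, Sequence


def embed_offline(texts: List[str], model_name: str = "all-mpnet-base-v2"):
    """Encode texts with an already-cached Sentence-BERT model."""
    # HF_HUB_OFFLINE=1 makes the Hugging Face hub client use cached files only.
    # It has to be set before the libraries are first imported.
    os.environ.setdefault("HF_HUB_OFFLINE", "1")
    from sentence_transformers import SentenceTransformer  # lazy, optional dependency

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True returns unit-length vectors.
    return model.encode(texts, normalize_embeddings=True)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```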
- Clustering (src/clusterer.py)
  - K-means clustering with automatic k selection
  - Multiple clustering methods: hierarchical, silhouette optimization, sqrt heuristic
  - Representative abstract identification
  - Normalized embeddings for better clustering
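Two of the ideas above, a square-root heuristic for k and picking representative abstracts, might look like the sketch below. The exact formula and selection rule used in src/clusterer.py are not stated here, so both `sqrt_k` (k near sqrt(n), capped by the minimum cluster size) and `representatives` (members closest to the normalized centroid) are assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def sqrt_k(n_abstracts: int, min_cluster_size: int = 5) -> int:
    """Assumed heuristic: k ~ sqrt(n), capped so clusters can reach the minimum size."""
    k = round(math.sqrt(n_abstracts))
    cap = max(1, n_abstracts // min_cluster_size)
    return max(1, min(k, cap))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def representatives(
    embeddings: Sequence[Sequence[float]], labels: Sequence[int], top_n: int = 1
) -> Dict[int, List[int]]:
    """Return, per cluster, the indices of the members closest to the centroid."""
    unit = [normalize(v) for v in embeddings]
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)
    result = {}
    for label, idxs in members.items():
        dim = len(unit[idxs[0]])
        centroid = normalize([sum(unit[i][d] for i in idxs) / len(idxs) for d in range(dim)])
        # On unit vectors, a larger dot product means a smaller angle to the centroid.
        ranked = sorted(idxs, key=lambda i: -sum(x * y for x, y in zip(unit[i], centroid)))
        result[label] = ranked[:top_n]
    return result
```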
- Cluster Naming (src/cluster_namer.py)
  - Claude API integration (configurable model via CLAUDE_MODEL)
  - Samples in-cluster and out-cluster abstracts
  - Generates descriptive names and 2-sentence descriptions
  - Robust JSON parsing with regex patterns
  - Handles partial responses from Claude
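One way to get the tolerant JSON parsing described above is to try strict JSON first and fall back to per-field regexes when the response is truncated. The sketch below assumes the model is asked for an object with `name` and `description` keys; those key names and the function name are illustrative and may not match src/cluster_namer.py.

```python
import json
import re
from typing import Dict, Optional

# Matches the contents of a JSON string, allowing backslash escapes.
_STRING_BODY = r'((?:[^"\\]|\\.)*)'


def parse_cluster_name(response: str) -> Optional[Dict[str, str]]:
    """Extract name/description from a model response, tolerating partial output."""
    # 1) Try the outermost {...} span as strict JSON.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, dict) and "name" in data:
                return {
                    "name": str(data["name"]),
                    "description": str(data.get("description", "")),
                }
        except json.JSONDecodeError:
            pass
    # 2) Fall back to per-field regexes; escape sequences are left as-is.
    name = re.search(r'"name"\s*:\s*"' + _STRING_BODY + '"', response)
    if not name:
        return None
    # No closing quote required, so a description cut off mid-sentence still matches.
    desc = re.search(r'"description"\s*:\s*"' + _STRING_BODY, response)
    return {"name": name.group(1), "description": desc.group(1) if desc else ""}
```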
- Hierarchizer (src/hierarchizer.py)
  - Multi-level hierarchy building with adaptive levels
  - Neighborhood-based processing
  - Parent cluster generation and deduplication
  - Automatic top-level cluster determination (4-10 clusters)
  - Query-based root cluster generation
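How many clusters each hierarchy level should have is not spelled out above. A common choice, assumed here purely for illustration, is to shrink the cluster count geometrically from the base level down to the top level, with the top level clamped to the 4-10 range. Both function names are hypothetical, and the sketch assumes the level count includes the base level.

```python
from typing import List


def clamp_top(n_top: int, lo: int = 4, hi: int = 10) -> int:
    """Keep the requested number of top-level clusters within [lo, hi]."""
    return max(lo, min(hi, n_top))


def level_sizes(n_base: int, n_top: int, levels: int) -> List[int]:
    """Target cluster counts per level, from base (first) to top (last)."""
    if levels < 2 or n_base <= n_top:
        return [n_base]
    # Constant shrink factor so that n_base * ratio**(levels - 1) == n_top.
    ratio = (n_top / n_base) ** (1 / (levels - 1))
    sizes = [max(n_top, round(n_base * ratio ** i)) for i in range(levels)]
    sizes[-1] = n_top  # guard against floating-point drift on the last level
    return sizes
```

For example, 100 base clusters reduced to 5 top-level clusters over 3 levels would pass through an intermediate level of about 22 clusters.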
- Output Formatter (src/output_formatter.py)
  - Text format (hierarchical tree view)
  - JSON format (structured data)
  - CSV format (flattened for analysis)
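The text and CSV formats above could be produced from a nested cluster structure along these lines. The node schema (`name`, `pmids`, `children`) and the CSV columns are assumptions for the example; the real files written by src/output_formatter.py may use different fields.

```python
import csv
import io
from typing import Dict, Iterator, List


def render_tree(node: Dict, depth: int = 0) -> List[str]:
    """Render a cluster node and its children as indented tree lines."""
    lines = ["  " * depth + "▪ " + node["name"]]
    for child in node.get("children", []):
        lines.extend(render_tree(child, depth + 1))
    return lines


def flatten(node: Dict, parent: str = "", level: int = 0) -> Iterator[Dict]:
    """Yield one flat row per cluster, depth-first."""
    yield {
        "level": level,
        "name": node["name"],
        "parent": parent,
        "n_abstracts": len(node.get("pmids", [])),
    }
    for child in node.get("children", []):
        yield from flatten(child, node["name"], level + 1)


def to_csv(root: Dict) -> str:
    """Serialize the flattened hierarchy as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["level", "name", "parent", "n_abstracts"])
    writer.writeheader()
    writer.writerows(flatten(root))
    return buf.getvalue()
```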
source venv/bin/activate
export TOKENIZERS_PARALLELISM=false # Suppress warnings
python src/run_clio_pipeline.py --max-abstracts 20 --min-cluster-size 3 --hierarchy-levels 2 --clustering-method hierarchical

source venv/bin/activate
export TOKENIZERS_PARALLELISM=false
python src/run_clio_pipeline.py --max-abstracts 1000 --min-cluster-size 10 --hierarchy-levels 3 --clustering-method silhouette

- --max-abstracts: Number of abstracts to fetch (default: 100)
- --min-cluster-size: Minimum abstracts per cluster (default: 5)
- --hierarchy-levels: Number of hierarchy levels (default: 3)
- --clustering-method: Clustering strategy (default: hierarchical)
  - hierarchical: Creates many clusters for better hierarchy building
  - silhouette: Optimizes for cluster separation using silhouette score
  - sqrt: Uses heuristic based on square root of number of abstracts
- --output-dir: Output directory (default: output)
- --no-cache: Disable caching of PubMed abstracts (cache enabled by default)
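For reference, the options and defaults listed above map onto an `argparse` parser roughly like this. It is a sketch built only from the flags documented here; the actual parser in src/run_clio_pipeline.py may define additional arguments or different help text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented pipeline options and their defaults."""
    parser = argparse.ArgumentParser(description="Run the CLIO pipeline on PubMed abstracts")
    parser.add_argument("--max-abstracts", type=int, default=100,
                        help="Number of abstracts to fetch")
    parser.add_argument("--min-cluster-size", type=int, default=5,
                        help="Minimum abstracts per cluster")
    parser.add_argument("--hierarchy-levels", type=int, default=3,
                        help="Number of hierarchy levels")
    parser.add_argument("--clustering-method", default="hierarchical",
                        choices=["hierarchical", "silhouette", "sqrt"],
                        help="Clustering strategy")
    parser.add_argument("--output-dir", default="output",
                        help="Output directory")
    # Caching is on by default; the flag switches it off.
    parser.add_argument("--no-cache", action="store_true",
                        help="Disable caching of PubMed abstracts")
    return parser
```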
Each run creates a timestamped directory with:
output/run_YYYYMMDD_HHMMSS/
├── abstracts.json # Raw PubMed data
├── embeddings.pkl # Generated embeddings
├── clustering.pkl # Clustering results
├── cluster_names.json # Claude-generated names
├── complete_hierarchy.json # Full hierarchy with top cluster and all levels
├── hierarchy.json # Detailed structure with abstract IDs
├── hierarchy.txt # Human-readable tree view
└── clusters.csv # Flattened CSV for analysis
complete_hierarchy.json is the main output for downstream applications. It contains:
- Root cluster with query-based name and description
- Top-level clusters (automatically determined, typically 4-10)
- Intermediate hierarchy levels (if any)
- All base clusters sorted by size
- Complete PMID lists for each cluster
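Because run directories are named run_YYYYMMDD_HHMMSS, their names sort chronologically, so a downstream script can find the newest run and read complete_hierarchy.json from it. The helpers below are a hypothetical sketch of that lookup; none of these function names come from the project, and the structure of the loaded JSON is not assumed.

```python
import json
import re
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional


def run_dir_name(now: datetime) -> str:
    """Build a directory name in the run_YYYYMMDD_HHMMSS pattern."""
    return "run_" + now.strftime("%Y%m%d_%H%M%S")


def latest_run(names: Iterable[str]) -> Optional[str]:
    """Pick the newest run directory name, ignoring anything that is not a run."""
    runs = sorted(n for n in names if re.fullmatch(r"run_\d{8}_\d{6}", n))
    return runs[-1] if runs else None


def load_complete_hierarchy(base: str = "output"):
    """Load complete_hierarchy.json from the most recent run under `base`."""
    latest = latest_run(p.name for p in Path(base).iterdir() if p.is_dir())
    if latest is None:
        raise FileNotFoundError(f"no run_* directories found in {base}")
    with open(Path(base) / latest / "complete_hierarchy.json", encoding="utf-8") as f:
        return json.load(f)
```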
Approximate Claude API cost per run (using Sonnet):
- 100 abstracts: ~$0.30
- 1,000 abstracts: ~$3.00
- 10,000 abstracts: ~$30.00
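The figures above scale linearly, at roughly $0.003 per abstract, so a quick estimate for other run sizes can be computed as below. The per-abstract rate is derived only from the numbers listed here and is an approximation; actual cost will depend on the model and on abstract length.

```python
# Derived from the ~$0.30 per 100 abstracts figure; an approximation, not a quoted price.
COST_PER_ABSTRACT_USD = 0.30 / 100


def estimate_cost(n_abstracts: int) -> float:
    """Rough cost in USD for a run, assuming linear scaling."""
    return round(n_abstracts * COST_PER_ABSTRACT_USD, 2)
```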
Set in the .env file:
- ANTHROPIC_API_KEY: Your Claude API key
- PUBMED_EMAIL: Your email for PubMed API
- CLAUDE_MODEL: Model choice (default: claude-sonnet-4-20250514)
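A small sketch of how these settings might be read once the .env file has been loaded into the environment (for example by a dotenv loader). The function name and the returned dictionary keys are hypothetical, and treating both the API key and the email as required is an assumption.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_MODEL = "claude-sonnet-4-20250514"  # documented default for CLAUDE_MODEL


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read pipeline settings from the environment, applying the default model."""
    env = os.environ if env is None else env
    # Assumed to be required; fail early with a clear message if either is absent.
    missing = [key for key in ("ANTHROPIC_API_KEY", "PUBMED_EMAIL") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {
        "api_key": env["ANTHROPIC_API_KEY"],
        "email": env["PUBMED_EMAIL"],
        "model": env.get("CLAUDE_MODEL") or DEFAULT_MODEL,
    }
```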
- SSL Certificate Errors: Already handled in code
- Rate Limiting: Code includes delays between API calls
- Memory Issues: Process abstracts in smaller batches
- JSON Serialization: Numpy types are converted automatically
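The automatic NumPy conversion mentioned in the last item can be done with a `default` hook for `json.dumps`. The sketch below is one possible approach, not necessarily the project's: it relies on the fact that NumPy scalars and arrays both expose `tolist()`, and it avoids importing NumPy directly.

```python
import json
from typing import Any


def to_jsonable(obj: Any) -> Any:
    """`default` hook for json.dumps: convert NumPy scalars/arrays to plain Python."""
    # np.ndarray.tolist() gives nested lists; np.generic.tolist() gives a Python scalar.
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def dumps(data: Any) -> str:
    """Serialize data that may contain NumPy values."""
    return json.dumps(data, default=to_jsonable)
```

The hook only applies to values; dictionary keys that are NumPy integers would still need converting before serialization.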
ROOT CLUSTER:
▪ Artificial Intelligence Research
  A comprehensive collection of research on artificial intelligence applications. The abstracts cover diverse AI methodologies and their applications across various domains.
TOP LEVEL (5 clusters):
▪ Artificial Intelligence in Medical Imaging and Diagnostics
  This cluster focuses on AI applications in radiology and pathology...
  ▪ Deep Learning for Medical Images
    Applications of CNNs in X-ray and MRI analysis...
  ▪ AI in Cancer Detection
    Machine learning for early cancer diagnosis...
▪ AI Applications in Drug Discovery
  This cluster covers the use of AI in pharmaceutical research...
  ▪ Molecular Design with ML
    AI-driven drug molecule optimization...
  ▪ Clinical Trial Prediction
    Using AI to predict drug efficacy...
- If any part of the instructions is unclear, ask for clarification
- It is acceptable and recommended to ask questions to ensure accurate task completion