palefire/
├── palefire-cli.py # Main CLI application
├── modules/ # Core modules
│ ├── __init__.py # Module exports
│ ├── PaleFireCore.py # Entity enrichment & question detection
│ ├── KeywordBase.py # Keyword extraction (Gensim)
│ └── api_models.py # Pydantic models for API
├── agents/ # AI Agent daemon and parsers
│ ├── __init__.py # Agent module exports
│ ├── AIAgent.py # ModelManager, AIAgentDaemon
│ ├── palefire-agent-service.py # Service script
│ ├── parsers/ # File parsers
│ │ ├── __init__.py # Parser registry
│ │ ├── base_parser.py # Base parser class
│ │ ├── txt_parser.py # Text file parser
│ │ ├── csv_parser.py # CSV parser
│ │ ├── pdf_parser.py # PDF parser
│ │ ├── spreadsheet_parser.py # Excel/ODS parser
│ │ └── url_parser.py # URL/HTML parser
│ ├── docker-compose.agent.yml # Docker compose for agent
│ ├── Dockerfile.agent # Dockerfile for agent
│ └── DOCKER.md # Docker documentation
├── requirements-ner.txt # NER dependencies
├── PALEFIRE_SETUP.md # Setup guide
├── QUICK_REFERENCE.md # Quick reference
├── RANKING_SYSTEM.md # Ranking documentation
├── NER_ENRICHMENT.md # NER documentation
├── QUESTION_TYPE_DETECTION.md # Question-type guide
└── revisions/ # Backup files
└── maintest.py # Original version
Contains the core classes that power Pale Fire's intelligent search:
- Extracts named entities from text (PER, LOC, ORG, DATE, etc.)
- Supports both spaCy (recommended) and pattern-based extraction
- Enriches episodes with entity metadata before ingestion
Key Methods:
- `extract_entities(text)` - Extract entities from text
- `enrich_episode(episode)` - Add entity metadata to episode
- `create_enriched_content(episode)` - Create annotated content
- Detects question type (WHO/WHERE/WHEN/WHAT/WHY/HOW)
- Maps question types to entity type weights
- Provides confidence scores for detection
Key Methods:
- `detect_question_type(query)` - Detect question type from query
- `apply_entity_type_weights(node, enriched_episode, entity_weights)` - Calculate weighted score
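A minimal sketch of how such detection can work, using plain regexes (the pattern table and confidence heuristic below are invented for illustration and are not the actual QuestionTypeDetector internals):

```python
import re

# Hypothetical pattern table; the real QUESTION_PATTERNS maps types to
# regexes, entity-type weights, and descriptions.
PATTERNS = {
    'WHO':   r'\bwho\b',
    'WHERE': r'\bwhere\b',
    'WHEN':  r'\bwhen\b',
    'WHY':   r'\bwhy\b',
    'HOW':   r'\bhow\b',
    'WHAT':  r'\bwhat\b',
}

def detect_question_type(query: str) -> dict:
    """Return the first matching question type with a naive confidence."""
    q = query.lower()
    for qtype, pattern in PATTERNS.items():
        match = re.search(pattern, q)
        if match:
            # Earlier match position -> higher confidence (toy heuristic)
            confidence = 1.0 if match.start() == 0 else 0.6
            return {'type': qtype, 'confidence': confidence}
    return {'type': 'GENERAL', 'confidence': 0.0}
```

Queries that match no pattern fall through to a `GENERAL` type with zero confidence, which downstream ranking can treat as "apply no entity-type weighting."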
The main CLI application that orchestrates everything:
Components:
- Configuration - Neo4j, LLM, episode data
- Helper Functions - Graph operations, temporal analysis, query matching
- Search Functions - 5 different ranking approaches
- Main Function - Ingestion and search workflows
- Agent Management - Start/stop/status daemon commands
- File Parsing - Parse various file formats (TXT, CSV, PDF, Spreadsheet, URL/HTML)
- Keyword Extraction - Extract keywords with optional daemon integration
Contains the AI Agent daemon system for keeping models loaded in memory:
- Thread-safe manager for loaded models
- Keeps KeywordExtractor (Gensim) and EntityEnricher (spaCy) in memory
- Provides singleton access to models
- Handles model initialization and reloading
Key Methods:
- `initialize(use_spacy=True)` - Load all models into memory
- `keyword_extractor` - Get KeywordExtractor instance
- `entity_enricher` - Get EntityEnricher instance
- `is_initialized()` - Check if models are loaded
- `reload()` - Reload all models
- Long-running daemon service
- Keeps models loaded to avoid startup delays
- Provides fast keyword and entity extraction
- Handles graceful shutdown with signal handlers
Key Methods:
- `start(daemon=False)` - Start daemon (foreground or background)
- `stop()` - Stop daemon gracefully
- `extract_keywords(text, **kwargs)` - Extract keywords using loaded models
- `extract_entities(text)` - Extract entities using loaded models
- `parse_file(file_path, **kwargs)` - Parse file and extract text
- `get_status()` - Get daemon status and capabilities
- Singleton function to get or create daemon instance
- Ensures only one daemon instance exists
- Thread-safe access
File parsing system for extracting text from various formats:
- Defines interface for all parsers
- Provides file validation utilities
- Standardizes ParseResult format
Key Methods:
- `parse(file_path, **kwargs)` - Parse file and extract text
- `get_supported_extensions()` - List supported file types
- `validate_file(file_path)` - Check if file is valid
- Parses plain text files (.txt)
- Supports custom encoding
- Splits text into pages/chunks
- Extracts metadata (line count, file size)
Supported: .txt, .text
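Splitting text into pages/chunks can be sketched as paragraph-aware chunking (a toy illustration; the real TXTParser's chunking rules and defaults may differ):

```python
def split_into_pages(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of roughly max_chars, breaking on paragraphs."""
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    pages, current = [], ''
    for para in paragraphs:
        # Start a new page when adding this paragraph would exceed the budget
        if current and len(current) + len(para) + 2 > max_chars:
            pages.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        pages.append(current)
    return pages
```

Chunking on paragraph boundaries (rather than a hard character cut) keeps sentences intact, which matters for downstream entity and keyword extraction.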
- Parses CSV files (.csv)
- Auto-detects delimiter
- Extracts table structure
- Supports header row handling
Supported: .csv
- Parses PDF files (.pdf)
- Supports PyPDF2 and pdfplumber
- Extracts text page-by-page
- Extracts tables (with pdfplumber)
- Preserves document metadata
Supported: .pdf
- Parses spreadsheet files (.xlsx, .xls, .ods)
- Supports multiple sheets
- Extracts table structure
- Handles headers and data rows
Supported: .xlsx, .xls, .xlsm, .ods
- Parses HTML pages from URLs
- Uses BeautifulSoup for HTML parsing
- Removes script/style tags by default
- Extracts metadata (title, description, keywords, links)
- Splits content into pages/sections based on headings
- Configurable timeout and headers
Supported: URLs (http://, https://)
Dependencies: requests>=2.31.0, beautifulsoup4>=4.12.0
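The real parser fetches pages with requests and parses them with BeautifulSoup; the core idea of dropping script/style content while collecting visible text can be sketched with the standard library alone:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside any script/style element
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    return ' '.join(extractor.chunks)
```

BeautifulSoup makes the same job shorter (`soup(['script', 'style'])` removal plus `get_text()`), but the stdlib version shows what "removes script/style tags by default" means mechanically.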
- Automatic parser selection based on file extension or URL detection
- Factory function `get_parser(file_path)` - detects URLs automatically
- Helper function `is_url(path)` - checks if path is a URL
- Extensible for new file types
```python
# Core modules
from modules import EntityEnricher, QuestionTypeDetector, KeywordExtractor

# AI Agent
from agents import ModelManager, AIAgentDaemon, get_daemon

# File parsers
from agents.parsers import TXTParser, CSVParser, PDFParser, SpreadsheetParser, URLParser, get_parser, is_url

# Usage
enricher = EntityEnricher(use_spacy=True)
detector = QuestionTypeDetector()

# Agent usage
daemon = get_daemon(use_spacy=True)
daemon.model_manager.initialize(use_spacy=True)
keywords = daemon.extract_keywords("text")
entities = daemon.extract_entities("text")

# Parser usage
parser = get_parser('document.pdf')
result = parser.parse('document.pdf')
```

1. Load episodes from data
↓
2. EntityEnricher.enrich_episode()
├─ Extract entities (PER, LOC, ORG, etc.)
├─ Group by type
└─ Create enriched content
↓
3. Add to Graphiti with annotations
↓
4. Build knowledge graph
1. User query input
↓
2. QuestionTypeDetector.detect_question_type()
├─ Identify question type (WHO/WHERE/etc.)
├─ Get entity type weights
└─ Calculate confidence
↓
3. Execute hybrid search (RRF)
↓
4. For each result:
├─ Get connection count
├─ Extract temporal info
├─ Calculate query match score
├─ Extract entities from node
└─ Calculate entity type score
↓
5. Combine scores with weights
↓
6. Rank and return top results
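Step 5 above, combining scores with weights, amounts to a weighted sum over normalized factors (the weight values below are illustrative, not the ones used by palefire-cli.py):

```python
def combine_scores(factors: dict, weights: dict) -> float:
    """Weighted sum of normalized ranking factors, each in [0, 1]."""
    return sum(weights.get(name, 0.0) * score for name, score in factors.items())

# Example: RRF dominates, entity-type intelligence nudges the ranking
weights = {'rrf': 0.5, 'connections': 0.2, 'temporal': 0.1,
           'query_match': 0.1, 'entity_type': 0.1}
factors = {'rrf': 0.8, 'connections': 0.5, 'temporal': 0.0,
           'query_match': 1.0, 'entity_type': 0.6}
score = combine_scores(factors, weights)  # ≈ 0.66
```

Keeping every factor normalized to [0, 1] before weighting is what makes the weights interpretable as relative importance.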
1. Start daemon
├─ Create AIAgentDaemon instance
├─ Initialize ModelManager
│ ├─ Load KeywordExtractor (Gensim)
│ └─ Load EntityEnricher (spaCy)
└─ Fork to background (if daemon=True)
↓
2. Daemon running
├─ Models stay loaded in memory
├─ Health check loop (every 60s)
└─ Ready for requests
↓
3. Process requests
├─ extract_keywords() - Fast (models loaded)
├─ extract_entities() - Fast (models loaded)
└─ parse_file() - Uses appropriate parser
↓
4. Stop daemon
├─ Receive SIGTERM/SIGINT
├─ Cleanup models
└─ Remove PID file
1. User provides file path or URL
↓
2. Parser registry selects parser
├─ Check if input is URL (is_url())
├─ If URL: Use URLParser
├─ If file: Check file extension
└─ Instantiate appropriate parser
↓
3. Parser validates input
├─ For files: Check file exists, readable, size > 0
├─ For URLs: Validate URL format (scheme, netloc)
└─ Return error if invalid
↓
4. Parse file/URL
├─ For files: Read from filesystem
├─ For URLs: Fetch with requests, parse HTML with BeautifulSoup
├─ Extract text content
├─ Extract metadata
├─ Extract tables (if applicable)
└─ Split into pages (if multi-page)
↓
5. Return ParseResult
├─ text: Full extracted text
├─ metadata: File/URL information
├─ pages: Page-by-page text (optional)
├─ tables: Extracted tables (optional)
└─ success: Boolean status
↓
6. Optional: Extract keywords
├─ Use daemon (if running)
└─ Extract keywords from parsed text
1. Check if daemon is running
├─ Read PID file
├─ Check process exists
└─ Start daemon if not running
↓
2. Get daemon instance
├─ Singleton pattern
└─ Ensure models initialized
↓
3. Extract keywords
├─ Use loaded KeywordExtractor
├─ Apply TF-IDF, TextRank, Word Frequency
├─ Extract n-grams (2-4 words)
└─ Combine and rank results
↓
4. Return keywords
└─ List of {keyword, score, type}
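The n-gram step (2-4 words) can be sketched with a naive whitespace tokenizer (a toy version; the KeywordExtractor's tokenization and scoring are more involved):

```python
from collections import Counter

def extract_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count word n-grams of length n_min..n_max (naive whitespace tokens)."""
    words = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts
```

In the real pipeline these counts would be one signal among several (TF-IDF, TextRank, word frequency) that get combined and ranked.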
- `search_episodes()` - Factors: RRF only - Use: Simple queries
- `search_episodes_with_custom_ranking()` - Factors: RRF + Connections - Use: Find central entities
- `search_episodes_with_temporal_ranking()` - Factors: RRF + Connections + Temporal - Use: Date-specific queries
- `search_episodes_with_multi_factor_ranking()` - Factors: RRF + Connections + Temporal + Query Match - Use: Complex queries
- `search_episodes_with_question_aware_ranking()` - Factors: All 4 + Entity Type Intelligence - Use: Natural language questions
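All five methods start from Reciprocal Rank Fusion (RRF), which scores a document by summing 1/(k + rank) over each result list it appears in; a minimal sketch with the conventional k = 60:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# 'b' ranks well in both lists, so it fuses to the top
fused = rrf_fuse([['a', 'b', 'c'], ['b', 'c', 'a']])
```

The constant k dampens the influence of top ranks, so a document that is consistently near the top across lists beats one that is first in only one list.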
NEO4J_URI=bolt://10.147.18.253:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
OPENAI_API_KEY=your_key

```python
llm_config = LLMConfig(
    api_key="ollama",
    model="deepseek-r1:7b",
    base_url="http://10.147.18.253:11434/v1"
)
```

- `graphiti-core` - Knowledge graph framework
- `python-dotenv` - Environment configuration
- `spacy` - Industrial NLP library
- `en_core_web_sm` - English language model
- `gensim>=4.3.0` - TF-IDF, TextRank algorithms
- `nltk` (optional) - Better stemming support
- `psutil>=5.9.0` - System monitoring and process management
- `PyPDF2>=3.0.0` - PDF parsing (or `pdfplumber>=0.9.0` for better table extraction)
- `openpyxl>=3.1.0` - Excel .xlsx files
- `xlrd>=2.0.0` - Excel .xls files
- `odfpy>=1.4.0` - OpenDocument Spreadsheet (.ods) files
- `requests>=2.31.0` - URL fetching
- `beautifulsoup4>=4.12.0` - HTML parsing
Edit `modules/PaleFireCore.py`:

```python
class EntityEnricher:
    ENTITY_TYPES = {
        # Add your custom types
        'CUSTOM_TYPE': 'CUSTOM',
        ...
    }
```

Edit `modules/PaleFireCore.py`:
```python
class QuestionTypeDetector:
    QUESTION_PATTERNS = {
        'CUSTOM_QUESTION': {
            'patterns': [r'\byour pattern\b'],
            'entity_weights': {'PER': 1.5, 'LOC': 1.2},
            'description': 'Your custom question type'
        },
        ...
    }
```

Add to `palefire-cli.py`:
```python
async def search_episodes_with_custom_method(graphiti, query, ...):
    # Your custom ranking logic
    pass
```

- Create a parser class in `agents/parsers/`:
```python
from .base_parser import BaseParser, ParseResult

class CustomParser(BaseParser):
    def parse(self, file_path: str, **kwargs) -> ParseResult:
        # Your parsing logic
        text = extract_text(file_path)
        return ParseResult(text=text, metadata={...})

    def get_supported_extensions(self) -> List[str]:
        return ['.custom']
```

- Register it in `agents/parsers/__init__.py`:
```python
from .custom_parser import CustomParser

PARSERS = {
    ...
    '.custom': CustomParser,
}
```

Add methods to the `AIAgentDaemon` class:
```python
class AIAgentDaemon:
    def custom_operation(self, data):
        """Custom operation using loaded models."""
        extractor = self.model_manager.keyword_extractor
        # Use extractor for custom logic
        return result
```

- spaCy: ~200-500 MB
- Pattern-based: ~10-20 MB
- Neo4j driver: ~50-100 MB
- Gensim (KeywordExtractor): ~100-300 MB
- AI Agent Daemon: ~300-800 MB total (with all models loaded)
- Model loading: 5-10 seconds (one-time)
- Keyword extraction: 0.5-1 second per request
- Entity extraction (spaCy): 50-500ms per node
- Entity extraction (pattern): 10-50ms per node
- Model loading: 5-10 seconds (one-time on startup)
- Keyword extraction: 0.01-0.1 second per request (10-100x faster!)
- Entity extraction: Same as above (models already loaded)
- File parsing: Varies by file type and size
- Question detection: 1-5ms
- Standard search: 100-300ms
- Question-aware search: 500-2000ms
- Use AI Agent Daemon for production - eliminates model loading delays
- Use spaCy for better accuracy
- Reduce `node_search_config.limit` for faster searches
- Cache enriched episodes
- Batch process large datasets
- Use connection pooling for Neo4j
- Keep daemon running - models stay loaded, requests are instant
- Parse files once - reuse parsed text for multiple operations
- Use appropriate parsers - PDF parsers vary in speed (pdfplumber slower but better)
```python
# Test entity extraction
enricher = EntityyEnricher(use_spacy=True) if False else EntityEnricher(use_spacy=True)
episode = {'content': 'Test text', 'type': 'text'}
result = enricher.enrich_episode(episode)
assert 'entities' in result

# Test question detection
detector = QuestionTypeDetector()
info = detector.detect_question_type("Who is John?")
assert info['type'] == 'WHO'
```

```shell
# Test full pipeline
python palefire-cli.py  # with ADD=True for ingestion
python palefire-cli.py  # with ADD=False for search
```

`ModuleNotFoundError: No module named 'modules'`
Solution: Ensure you're running from the palefire directory
`OSError: Can't find model 'en_core_web_sm'`
Solution: Run `python -m spacy download en_core_web_sm`
`ServiceUnavailable: Failed to establish connection`
Solution: Check that Neo4j is running and the credentials are correct
- Modular Design: Keep classes in modules/, functions in CLI
- Type Hints: Use type annotations for better IDE support
- Logging: Use logger for debugging, not print()
- Error Handling: Wrap external calls in try/except
- Documentation: Update docs when adding features
- Testing: Test new features before deployment
- Version Control: Keep revisions/ for rollback capability
Purpose: Thread-safe manager for keeping ML models loaded in memory.
Architecture:
ModelManager
├── Thread-safe lock (RLock)
├── KeywordExtractor (Gensim)
│ ├── TF-IDF models
│ ├── TextRank models
│ └── Word frequency counters
├── EntityEnricher (spaCy/Pattern)
│ ├── spaCy model (if available)
│ └── Pattern matchers
└── Initialization state
Thread Safety: Uses threading.RLock() to ensure safe concurrent access.
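The locking pattern can be sketched as follows (a simplified stand-in: dummy loader callables replace the real Gensim/spaCy loading, and the class name is invented to avoid confusion with the real ModelManager):

```python
import threading

class ModelManagerSketch:
    """Minimal thread-safe holder for lazily initialized models."""

    def __init__(self):
        self._lock = threading.RLock()
        self._models = {}
        self._initialized = False

    def initialize(self, loaders: dict) -> None:
        """Load each model once; safe to call from multiple threads."""
        with self._lock:
            if self._initialized:
                return  # idempotent: a second call is a no-op
            for name, loader in loaders.items():
                self._models[name] = loader()
            self._initialized = True

    def get(self, name: str):
        with self._lock:
            if not self._initialized:
                raise RuntimeError("call initialize() first")
            return self._models[name]

    def is_initialized(self) -> bool:
        with self._lock:
            return self._initialized
```

An RLock (rather than a plain Lock) lets a method that already holds the lock call another locked method on the same instance without deadlocking.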
Lifecycle:
- Create instance: `manager = ModelManager()`
- Initialize: `manager.initialize(use_spacy=True)`
- Access models: `manager.keyword_extractor`, `manager.entity_enricher`
- Reload (optional): `manager.reload()`
Purpose: Long-running service that keeps models loaded for fast access.
Architecture:
AIAgentDaemon
├── ModelManager (manages loaded models)
├── Process management
│ ├── PID file handling
│ ├── Signal handlers (SIGTERM, SIGINT)
│ └── Health check loop
├── Operations
│ ├── extract_keywords()
│ ├── extract_entities()
│ └── parse_file()
└── Status tracking
├── Running state
└── Model initialization state
Deployment Options:
- CLI: `python palefire-cli.py agent start --daemon`
- Docker: `docker-compose -f agents/docker-compose.agent.yml up -d`
- Systemd: Service file for Linux
- Launchd: Plist file for macOS
Communication:
- Currently: Direct Python import (singleton pattern)
- Future: Socket/HTTP API for remote access
Design Pattern: Strategy pattern with factory
Architecture:
Parser System
├── BaseParser (Abstract)
│ ├── parse() - Abstract method
│ ├── get_supported_extensions() - Abstract method
│ └── validate_file() - Concrete utility
├── Concrete Parsers
│ ├── TXTParser
│ ├── CSVParser
│ ├── PDFParser
│ ├── SpreadsheetParser
│ └── URLParser
└── Parser Registry
├── get_parser() - Factory function (auto-detects URLs)
└── is_url() - URL detection helper
ParseResult Structure:
```python
{
    'text': str,           # Full extracted text
    'metadata': dict,      # File metadata
    'pages': List[str],    # Page-by-page text (optional)
    'tables': List[dict],  # Extracted tables (optional)
    'success': bool,       # Parsing status
    'error': str | None    # Error message if failed
}
```

Parser Selection:
- Check if input is URL using `is_url()`
- If URL: Return `URLParser` instance
- If file: Extract file extension
- Look up in `PARSERS` registry
- Instantiate appropriate parser
- Return parser instance
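The selection steps above can be sketched as a small factory (the registry contents and the helper name `select_parser` are illustrative; the real `get_parser` returns parser instances, not names):

```python
from urllib.parse import urlparse
from pathlib import Path

def is_url(path: str) -> bool:
    """A path is a URL if it has an http(s) scheme and a network location."""
    parsed = urlparse(path)
    return parsed.scheme in ('http', 'https') and bool(parsed.netloc)

# Hypothetical registry mapping extensions to parser names
PARSERS = {'.txt': 'TXTParser', '.csv': 'CSVParser', '.pdf': 'PDFParser'}

def select_parser(path: str) -> str:
    """Pick a parser for a path or URL, mirroring the registry lookup."""
    if is_url(path):
        return 'URLParser'
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"no parser registered for {ext!r}")
    return PARSERS[ext]
```

Checking `netloc` as well as the scheme rejects strings like `http://` with no host, which a scheme-only check would accept.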
Error Handling:
- Invalid file: Returns `ParseResult` with `success=False` and error message
- Missing dependencies: Parser logs warning, returns error
- Encoding issues: TXT parser handles with `errors='replace'`
Agent Commands:
- `agent start` - Start daemon
- `agent stop` - Stop daemon
- `agent restart` - Restart daemon
- `agent status` - Check daemon status
Parse Commands:
- `parse <file>` - Auto-detect and parse file
- `parse-txt <file>` - Parse text file
- `parse-csv <file>` - Parse CSV file
- `parse-pdf <file>` - Parse PDF file
- `parse-spreadsheet <file>` - Parse spreadsheet
Keywords Command:
- Automatically checks for daemon
- Starts daemon if not running
- Uses daemon for faster extraction
```python
# Future API design
from agents import get_daemon

@app.on_event("startup")
async def startup():
    daemon = get_daemon(use_spacy=True)
    daemon.model_manager.initialize(use_spacy=True)

@app.post("/keywords")
async def extract_keywords(request: KeywordRequest):
    daemon = get_daemon()
    return daemon.extract_keywords(request.text)

@app.post("/parse")
async def parse_file(file: UploadFile):
    daemon = get_daemon()
    return daemon.parse_file(file.filename)
```

- AI Agent daemon for model persistence
- File parsers (TXT, CSV, PDF, Spreadsheet, URL/HTML)
- Keyword extraction with n-grams
- REST API wrapper
- Web UI
- Batch processing API
- Result caching
- Multi-language support
- Custom entity types per domain
- ML-based question detection
- Entity linking to knowledge bases
- Socket/HTTP communication for daemon
- Parser plugins system
- Additional file formats (DOCX, RTF, etc.)
```python
# Future API structure
from palefire import PaleFire

pf = PaleFire(neo4j_uri, neo4j_user, neo4j_password)
await pf.ingest_episodes(episodes)
results = await pf.search("Who was the AG?")
```

When adding features:
- Add classes to `modules/PaleFireCore.py`
- Add functions to `palefire-cli.py`
- Update documentation
- Test thoroughly
- Update ARCHITECTURE.md
Inherits license from parent Open WebUI project.