An advanced AI-powered customer support system that automatically classifies tickets and provides intelligent responses using state-of-the-art Retrieval-Augmented Generation (RAG) with hybrid search, query enhancement, and optimized chunking strategies.
- Bulk Ticket Classification: Automatically classify 30+ sample tickets with AI-powered categorization
- Interactive AI Agent: Real-time chat interface for new ticket submission and response
- Conversational Memory: Context-aware conversations using LangChain ChatMessageHistory with in-memory storage
- Smart Classification: Topic tags, sentiment analysis, and priority assignment
- Advanced RAG Responses: Intelligent answers powered by hybrid search and enhanced retrieval
- Source Citations: All responses include links to relevant documentation
- Search Transparency: Real-time indicators showing search methods used (vector, keyword, or hybrid)
- Dynamic Settings Management: Comprehensive settings page for real-time pipeline configuration
- Hybrid Search: Combines vector similarity and BM25 keyword search for optimal relevance
- Query Enhancement: GPT-4o powered query expansion for technical terms (configurable)
- Enhanced Chunking: Code block preservation with intelligent markdown structure awareness
- Smart Reranking: Configurable weighted merging of vector and keyword search results
- Quality Metrics: Chunk quality indicators including code detection and header analysis
- Real-time Configuration: Dynamic settings updates without application restart
- Settings Import/Export: JSON-based configuration backup and sharing
- Topic Tags: How-to, Product, Connector, Lineage, API/SDK, SSO, Glossary, Best practices, Sensitive data
- Sentiment: Frustrated, Curious, Angry, Neutral
- Priority: P0 (High), P1 (Medium), P2 (Low)
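The taxonomy above can be expressed as a small validation helper for the classifier's JSON output (a sketch; the field names `topic_tags`, `sentiment`, and `priority` are assumptions about the structured output, not confirmed project API):

```python
# Illustrative sketch of the classification taxonomy; field names are assumptions.
TOPIC_TAGS = {
    "How-to", "Product", "Connector", "Lineage", "API/SDK",
    "SSO", "Glossary", "Best practices", "Sensitive data",
}
SENTIMENTS = {"Frustrated", "Curious", "Angry", "Neutral"}
PRIORITIES = {"P0", "P1", "P2"}  # P0 = High, P1 = Medium, P2 = Low

def validate_classification(result: dict) -> bool:
    """Check that an LLM classification matches the taxonomy."""
    return (
        bool(result.get("topic_tags"))
        and set(result["topic_tags"]) <= TOPIC_TAGS
        and result.get("sentiment") in SENTIMENTS
        and result.get("priority") in PRIORITIES
    )

example = {"topic_tags": ["SSO", "How-to"], "sentiment": "Frustrated", "priority": "P0"}
assert validate_classification(example)
```

A guard like this is useful after any LLM call that promises structured JSON, since even models with reliable JSON output occasionally emit tags outside the schema.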
Decision: Separate data pipeline (scraping → storage → vectorization) from deployment application.
Why:
- Data Persistence: Web scraping is expensive and rate-limited. MongoDB storage allows reprocessing embeddings without re-scraping.
- Deployment Flexibility: App folder contains only deployment dependencies, enabling clean Streamlit Cloud deployment.
- Development Efficiency: Can iterate on AI logic without re-running expensive data collection.
- A/B Testing: Separate collections enable comparison between basic and enhanced RAG implementations.
Trade-off: Increased complexity vs. reliability, cost efficiency, and experimentation capability.
Decision: Implement hybrid search combining vector similarity and BM25 keyword search.
Why:
- Technical Term Precision: BM25 excels at exact matches for technical terms, APIs, and product names.
- Semantic Understanding: Vector search captures conceptual relationships and context.
- Complementary Strengths: Vector search for "how to authenticate" + BM25 for "SAML SSO" = comprehensive coverage.
- Fallback Strategy: Graceful degradation to vector-only if BM25 fails.
Trade-off: System complexity and processing overhead vs. significantly improved retrieval quality for technical documentation.
Decision: Optional GPT-4o query enhancement with configurable toggle.
Why:
- Technical Term Expansion: "SSO" → "SAML single sign-on authentication setup"
- Context Enrichment: "API rate limits" → "REST API rate limiting configuration and best practices"
- Acronym Resolution: Critical for technical documentation where acronyms are prevalent.
- Cost Control: Configurable feature allows optimization for different use cases.
Trade-off: Additional API costs and latency vs. dramatically improved retrieval for technical queries.
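Conceptually, the enhancement rewrites the query before retrieval. The sketch below uses a static acronym table as a stand-in for the GPT-4o call, purely to illustrate the input/output shape and the configurable toggle; the real pipeline sends the query to the LLM:

```python
# Stand-in for GPT-4o query enhancement: a static acronym map illustrating
# the kind of expansion the LLM performs. The table contents are illustrative.
ACRONYM_EXPANSIONS = {
    "SSO": "SAML single sign-on authentication",
    "API": "REST API",
}

def enhance_query(query: str, enabled: bool = True) -> str:
    """Expand known technical acronyms; mirrors the ENABLE_QUERY_ENHANCEMENT toggle."""
    if not enabled:
        return query
    out = []
    for word in query.split():
        key = word.strip("?.,!")  # ignore trailing punctuation when matching
        out.append(word.replace(key, ACRONYM_EXPANSIONS[key]) if key in ACRONYM_EXPANSIONS else word)
    return " ".join(out)

print(enhance_query("How to setup SSO?"))
# "How to setup SAML single sign-on authentication?"
```

The toggle matters because the real enhancement step costs an extra LLM round-trip per query.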
Decision: Advanced recursive splitting with code block preservation and quality metrics.
Why:
- Code Integrity: Preserves code blocks as single units to maintain functional examples.
- Structure Awareness: Respects markdown headers, lists, and procedures.
- Quality Tracking: Metadata enables optimization and debugging of retrieval quality.
- Context Preservation: Smart boundaries prevent splitting related instructions.
Trade-off: Processing complexity and storage overhead vs. significantly better content quality and retrieval accuracy.
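The code-preservation idea can be illustrated with a simplified splitter that treats fenced blocks as atomic (a sketch, not the project's actual chunker; the real pipeline uses a more elaborate recursive splitter):

```python
import re

def split_preserving_code(text: str, max_len: int = 1200) -> list[str]:
    """Split markdown into chunks, keeping ``` fenced code blocks atomic."""
    # Separate fenced code blocks from prose.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    chunks, current = [], ""
    for part in parts:
        is_code = part.startswith("```")
        if is_code or len(current) + len(part) <= max_len:
            if is_code and len(current) + len(part) > max_len and current:
                chunks.append(current)  # flush prose so the code block starts fresh
                current = ""
            current += part
        else:
            # Flush oversized prose at paragraph boundaries.
            for para in part.split("\n\n"):
                if len(current) + len(para) > max_len and current:
                    chunks.append(current)
                    current = ""
                current += para + "\n\n"
    if current.strip():
        chunks.append(current)
    return chunks
```

The key property is that a fenced block is never cut in half, so retrieved chunks contain runnable examples rather than fragments.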
Decision: Configurable enhancement toggles rather than fixed implementation.
Why:
- Deployment Flexibility: Different environments can optimize for cost vs. quality.
- Performance Tuning: Disable expensive features for high-volume scenarios.
- Gradual Rollout: Test advanced features incrementally in production.
- User Choice: Let users balance speed vs. comprehensive results.
Trade-off: Configuration complexity vs. deployment flexibility and performance optimization.
Decision: Configurable weighted fusion of vector and BM25 results with intelligent deduplication.
Why:
- Flexible Relevance: Configurable weights allow optimization for different use cases.
- Exact Match Boost: BM25 results receive configurable weight for technical precision.
- Deduplication: Documents found by both methods receive relevance boost.
- Empirical Optimization: Default weights can be tuned based on specific documentation types.
Trade-off: Algorithm complexity vs. superior result ranking and relevance.
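The fusion-and-boost logic can be sketched in a few lines (a simplified illustration; the weight and boost values here are placeholders, not the project's tuned defaults):

```python
def fuse_results(vector_hits, keyword_hits,
                 vector_weight=1.0, keyword_weight=0.5, boost=1.2):
    """Weighted fusion of (doc_id, score) hits from vector and BM25 search,
    with a multiplicative boost for documents found by both methods."""
    scores = {}
    for doc_id, score in vector_hits:
        scores[doc_id] = scores.get(doc_id, 0.0) + vector_weight * score
    for doc_id, score in keyword_hits:
        if doc_id in scores:
            # Found by both methods: combine and boost.
            scores[doc_id] = (scores[doc_id] + keyword_weight * score) * boost
        else:
            scores[doc_id] = keyword_weight * score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document matched by both searches ranks above one with a single strong match, which is exactly the deduplication-with-boost behaviour described above.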
Decision: Separate "enhanced" and "standard" Qdrant collections for A/B testing.
Why:
- Performance Comparison: Direct measurement of advanced features' impact.
- Risk Mitigation: Fallback to standard collection if enhanced features fail.
- Feature Validation: Quantitative assessment of enhancement value.
- Gradual Migration: Safe transition from basic to advanced implementations.
Trade-off: Storage overhead and maintenance complexity vs. risk reduction and optimization capability.
Decision: Dual storage with enhanced Qdrant collections for hybrid search.
Why:
- Data Integrity: MongoDB preserves original content for reprocessing and debugging.
- Hybrid Performance: Qdrant's vector capabilities + in-memory BM25 for keyword search.
- Collection Management: Separate enhanced collections for advanced features.
- Backup Strategy: Multiple data preservation layers prevent data loss.
Trade-off: Infrastructure complexity vs. performance, flexibility, and data safety.
Decision: OpenAI GPT-4o for classification, response generation, and query enhancement.
Why:
- Quality: Superior reasoning for complex ticket classification and technical query expansion.
- JSON Reliability: Consistent structured output for automated processing.
- Context Window: Large context enables conversation memory and comprehensive responses.
- Development Speed: No model training, fine-tuning, or hosting infrastructure needed.
Trade-off: Ongoing API costs vs. response quality, development speed, and advanced capabilities.
Decision: Hybrid embedding strategy with local FastEmbed and in-memory BM25.
Why:
- Cost Efficiency: Free local embeddings vs. OpenAI embedding API costs.
- Privacy: Document content never leaves local environment.
- Performance: 384-dim embeddings balance quality with speed.
- Hybrid Capability: BM25 enables exact term matching for technical precision.
Trade-off: Implementation complexity vs. cost savings, privacy, and enhanced search capabilities.
System Architecture (layers, top to bottom):

- User Interface Layer (user browser session)
  - Dashboard: bulk classification, 30+ sample tickets, summary statistics
  - Chat Agent: real-time chat, memory context, source citations
  - Settings: dynamic configuration, import/export, validation
  - Analytics: performance, search stats, usage metrics
- Streamlit Application Layer: main.py (port 8501), reached via HTTP
  - UI Controls: input forms, display logic, file uploads
  - Session State: user session, memory store, chat history
  - Event Handlers: button clicks, text input, page navigation
- AI Processing Layer: rag_pipeline.py (advanced RAG pipeline engine)
  - Classification: topic tags, sentiment, priority
  - Query Pipeline: enhancement, hybrid search, smart reranking
  - Response Generator: template rendering, citation assembly, context integration
- External AI APIs: OpenAI GPT-4o (HTTPS/REST)
  - Classification: structured JSON output
  - Query Enhancement: technical term expansion
  - RAG Response: contextual answers with citations
- Data Storage Layer
  - MongoDB Atlas: raw documents (markdown text, metadata, timestamps) plus JSON backup files for recovery
  - Qdrant Cloud: vector collections (atlan_docs_enhanced, 384-dim embeddings with payloads) and an in-memory BM25 index (keyword search, TF-IDF scoring, rank-bm25 library)
- Pipeline Scripts Layer (manually executed)
  - scrape.py: Firecrawl API (rate limits, content extraction), saving documents, metadata, and backups to MongoDB
  - qdrant_ingestion.py: text chunking (1200 tokens, 200 overlap, code-aware), local FastEmbed BGE embeddings (384 dims), quality metrics (code detection, headers), upload to Qdrant (collections, vectors, payloads)
- Data Sources (external documentation)
  - docs.atlan.com: product documentation (~1078 pages), user guides, feature explanations, best practices
  - developer.atlan.com: API documentation (~611 pages), SDK references, code examples, technical specifications
KEY INTERACTION FLOWS:

DATA PIPELINE FLOW:
docs.atlan.com → Firecrawl API → scrape.py → MongoDB → qdrant_ingestion.py → Qdrant

REAL-TIME SEARCH FLOW:
User Query → Query Enhancement (GPT-4o) → Hybrid Search (Vector + BM25) →
Smart Reranking → Context Assembly → Response Generation (GPT-4o) → User

CHAT INTERACTION FLOW:
User Input → Streamlit UI → rag_pipeline.py → Classification (GPT-4o) →
RAG/Routing Decision → Search & Generate → Display with Citations

CONFIGURATION FLOW:
.env Variables → Feature Toggles → Pipeline Behavior → Performance Optimization

ERROR HANDLING & RECOVERY:
- MongoDB backup files for data recovery
- Graceful degradation: Hybrid → Vector-only → Routing
- Rate limiting with exponential backoff
- Session state management for UI persistence
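The exponential-backoff behaviour listed above can be sketched as a small retry helper (illustrative; the parameter values are placeholders, not the project's actual settings):

```python
import random
import time

def with_backoff(fn, retries=5, base_delay=1.0, max_delay=30.0):
    """Call fn(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # exhausted retries: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd
```

Wrapping rate-limited calls (Firecrawl, OpenAI) this way turns transient 429s into delays instead of failures.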
- Data Pipeline (Root scripts): Web scraping → Storage → Vector preparation
- Deployment (App folder): Streamlit application with AI capabilities
- AI Services: OpenAI for classification and response generation
- Storage: MongoDB for documents, Qdrant for vector search
- OpenAI GPT-4o: LLM for classification, response generation, and query enhancement
- FastEmbed BAAI/bge-small-en-v1.5: Vector embeddings for semantic search (384 dimensions)
- Qdrant Cloud: Vector database with hybrid search capabilities
- rank-bm25: BM25 algorithm for keyword search and hybrid retrieval
- LangChain: Enhanced text processing with advanced chunking strategies
- Streamlit: Interactive web application framework
- Python: Core application logic and AI pipeline
- MongoDB: Document storage for scraped content
- Streamlit Components: Dashboard and chat interface
- Custom CSS: Styled components and responsive design
- Interactive Elements: Real-time classification and response generation
- Firecrawl API: Automated web scraping service for documentation
- docs.atlan.com: Product documentation and user guides
- developer.atlan.com: API and SDK documentation
- MongoDB: Persistent storage for all scraped content with metadata
- Qdrant: Vector database for semantic search and RAG retrieval
- Web Scraping: Firecrawl crawls documentation sites and extracts content
- Document Storage: Raw content stored in MongoDB with full metadata
- Vector Processing: Content chunked and embedded using FastEmbed BGE-small
- RAG Deployment: Streamlit app queries Qdrant for relevant context
- Python 3.8+ and pip
- OpenAI API key
- Qdrant Cloud instance (vector database)
- MongoDB Atlas instance (document storage)
- Firecrawl API key (if running custom scraping)
- Root directory: Data pipeline scripts (scrape.py, qdrant_ingestion.py)
- app/ directory: Streamlit deployment application with own requirements and .env
git clone https://github.com/kanugurajesh/Assistly
cd Assistly

Project Structure Overview:
crawling/
├── app/                     # Streamlit deployment
│   ├── main.py              # Main Streamlit application
│   ├── rag_pipeline.py      # AI pipeline implementation
│   ├── requirements.txt     # App dependencies
│   ├── .env.example         # Environment template
│   └── sample_tickets.json
├── memory_manager.py        # Conversational memory management
├── scrape.py                # Firecrawl web scraping
├── qdrant_ingestion.py      # Vector database ingestion
├── requirements.txt         # Data pipeline dependencies
└── README.md
Create .env file in the app/ directory (copy from app/.env.example):
OPENAI_API_KEY=your_openai_api_key
QDRANT_URI=your_qdrant_cloud_endpoint
QDRANT_API_KEY=your_qdrant_api_key
MONGODB_URI=your_mongodb_atlas_connection_string
FIRECRAWL_API_KEY=your_firecrawl_api_key

Deployment-Ready Structure: The `.env` file is located in the `app/` directory to enable standalone deployment. Root directory scripts (scrape.py, qdrant_ingestion.py) automatically load environment variables from `app/.env`, ensuring consistent configuration across the entire project while maintaining deployment flexibility for platforms like Streamlit Cloud.
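Loading environment files like this is typically done with python-dotenv's `load_dotenv`; the stdlib sketch below shows the same idea for illustration (simplified: no quoting or variable interpolation):

```python
import os
from pathlib import Path

def load_env(path: Path) -> dict:
    """Minimal .env parser: read KEY=VALUE lines, skip comments and blanks,
    and export each value without overwriting variables already set."""
    values = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
        os.environ.setdefault(key.strip(), value.strip())
    return values
```

`setdefault` matters here: variables already exported in the shell (or by Streamlit Cloud) win over the file, which is the conventional dotenv precedence.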
Required for Core Functionality:
- `OPENAI_API_KEY`: OpenAI API key for GPT-4o classification, response generation, and query enhancement
  - Obtain from: https://platform.openai.com/api-keys
  - Required permissions: GPT-4o model access and sufficient credits
  - Usage: Classification, RAG responses, and optional query enhancement
- `QDRANT_URI`: Qdrant Cloud vector database endpoint URL
  - Format: https://your-cluster-id.europe-west3-0.gcp.cloud.qdrant.io:6333
  - Obtain from: Qdrant Cloud dashboard after cluster creation
  - Usage: Vector similarity search and hybrid search operations
- `QDRANT_API_KEY`: Authentication key for Qdrant Cloud instance
  - Obtain from: Qdrant Cloud cluster settings → API Keys
  - Usage: Secure access to vector database operations
- `MONGODB_URI`: MongoDB Atlas connection string
  - Format: mongodb+srv://username:[email protected]/database
  - Obtain from: MongoDB Atlas dashboard → Connect → Application
  - Usage: Document storage, metadata persistence, and backup operations
Optional (Data Pipeline Only):
- `FIRECRAWL_API_KEY`: Firecrawl API key for web scraping (only needed for custom data ingestion)
  - Obtain from: https://www.firecrawl.dev/
  - Usage: Automated web scraping of documentation sites
  - Note: Pre-processed data is included, so this is optional for basic deployment
For deployment (Streamlit app):
pip install -r app/requirements.txt

For data pipeline (if running scraping/ingestion):

pip install -r requirements.txt

Step 1: Web Scraping with Firecrawl
# Basic scraping (pre-completed for Atlan docs)
python scrape.py https://docs.atlan.com --limit 3000
python scrape.py https://developer.atlan.com --limit 1000
# Custom scraping examples
python scrape.py https://your-docs.com --limit 500 --collection custom_docs

Note: The limits above (3000 for docs.atlan.com, 1000 for developer.atlan.com) are set higher than the actual page counts (~1078 and ~611 respectively) to ensure complete data scraping during development. You can use lower limits based on your needs; the crawler stops when all available pages are scraped regardless of the limit setting.
All scraped content automatically stored in MongoDB with metadata and backup files.
Step 2: Enhanced Vector Database Ingestion
# Create enhanced collection with advanced chunking
python qdrant_ingestion.py --qdrant-collection "atlan_docs_enhanced" --recreate
# Advanced ingestion with source filtering
python qdrant_ingestion.py --source-url "https://docs.atlan.com" --qdrant-collection "atlan_docs_enhanced"
# Incremental updates (recommended for production)
python qdrant_ingestion.py --qdrant-collection "atlan_docs_enhanced"

Enhanced chunking preserves code blocks, creates quality metrics, and generates embeddings with FastEmbed BGE-small for hybrid search.
Note: The application comes with pre-processed data, so this step is only needed for custom datasets or updates. For advanced configuration options, see the "Advanced Pipeline Options" section below.
Basic Command Structure:
python scrape.py <URL> [OPTIONS]

Available Options:

- `--limit <number>`: Maximum pages to crawl (default: 3000)
- `--collection <name>`: MongoDB collection name (default: atlan_developer_docs)
Common Scraping Scenarios:
# Scrape with custom page limit
python scrape.py https://docs.atlan.com --limit 500
# Scrape to custom MongoDB collection
python scrape.py https://docs.atlan.com --collection custom_docs
# Scrape developer docs with different limits
python scrape.py https://developer.atlan.com --limit 200 --collection dev_docs

Advanced Command Structure:

python qdrant_ingestion.py [OPTIONS]

Available Options:

- `--source-url <url>`: Filter documents by specific source URL
- `--collection <name>`: MongoDB collection name (default: atlan_developer_docs)
- `--qdrant-collection <name>`: Qdrant collection name (default: atlan_docs)
- `--recreate`: Delete and recreate Qdrant collection (removes existing data)
- `--no-incremental`: Process all documents (skip duplicate checking)
Advanced Ingestion Examples:
# Process only developer documentation
python qdrant_ingestion.py --source-url "https://developer.atlan.com"
# Process only general documentation
python qdrant_ingestion.py --source-url "https://docs.atlan.com"
# Recreate collection (fresh start)
python qdrant_ingestion.py --recreate
# Process all documents without incremental checking
python qdrant_ingestion.py --no-incremental
# Process custom collection with filtering
python qdrant_ingestion.py --collection custom_docs --source-url "https://example.com"
# Create custom Qdrant collection
python qdrant_ingestion.py --qdrant-collection "developer_vectors"
# Process custom MongoDB collection to custom Qdrant collection
python qdrant_ingestion.py --collection dev_docs --qdrant-collection "dev_vectors"
# Full rebuild with specific source and custom collection
python qdrant_ingestion.py --recreate --source-url "https://developer.atlan.com" --qdrant-collection "dev_only"

Selective Processing:
- Update only specific documentation domains
- Test pipeline with subset of data
- Separate processing schedules for different sites
Document Type Classification:
- Automatic categorization: `developer.atlan.com` → "developer" type
- All other sources → "docs" type
- Enables filtered search and analytics
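The categorization rule above amounts to a hostname check (a sketch; the function name is illustrative):

```python
from urllib.parse import urlparse

def classify_doc_type(source_url: str) -> str:
    """developer.atlan.com becomes "developer"; every other source is "docs"."""
    host = urlparse(source_url).netloc
    return "developer" if host == "developer.atlan.com" else "docs"
```

Stored as payload metadata, this field is what makes `--source-url` filtering and per-type analytics possible downstream.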
Performance Optimization:
- Process only changed documentation
- Reduce vector database update time
- Minimize embedding generation costs
Custom Qdrant Collections:
- Separate vector collections for different projects
- Independent collection lifecycle management
- Isolated testing and production environments
- Multiple documentation versions in parallel
Development & Testing:
# Create test collection with limited data
python scrape.py https://docs.atlan.com --limit 50 --collection test_docs
python qdrant_ingestion.py --collection test_docs --qdrant-collection test_vectors --recreate

Production Updates:
# Re-scrape updated content (overwrites existing URLs)
python scrape.py https://docs.atlan.com --limit 3000
# Incremental ingestion (only new/changed documents)
python qdrant_ingestion.py

Multi-Source Management:
# Separate ingestion for different documentation types
python qdrant_ingestion.py --source-url "https://developer.atlan.com"
python qdrant_ingestion.py --source-url "https://docs.atlan.com"

How It Works:
- Checks MongoDB document IDs already in Qdrant
- Skips processing of existing documents
- Only processes new or updated content
When to Use --no-incremental:
- After modifying chunking parameters
- When reprocessing is needed due to embedding model changes
- For debugging or validation purposes
Run the Streamlit app:
cd app
streamlit run main.py

The application will open automatically in your browser at http://localhost:8501
Deploy to Streamlit Community Cloud:
- Push your repository to GitHub
- Visit share.streamlit.io
- Connect your GitHub repository
- Set main file path: `app/main.py`
- Add environment variables in Streamlit Cloud settings
- Deploy your application
- Navigate to "Dashboard" in the sidebar
- Click "Load & Classify All Tickets"
- View AI-generated classifications for all 30+ sample tickets
- Analyze summary statistics and topic distributions
- Search and examine individual ticket classifications
- Navigate to "Chat Agent" in the sidebar
- Enter your question in the chat interface
- Toggle "Show internal analysis" to view classification details
- Get intelligent responses with source citations
- Experience context-aware conversations with memory
- Use the "Conversation Management" sidebar to view memory stats or clear history
- Try sample questions or submit your own tickets
- Navigate to "Analytics" in the sidebar
- Performance Metrics: View real-time search performance statistics
- Response times for different search methods (vector, hybrid, keyword)
- Query enhancement usage and effectiveness metrics
- Average retrieval quality scores and user satisfaction
- Usage Analytics: Monitor system utilization patterns
- Daily/weekly query volume trends
- Most common topic classifications and routing decisions
- Memory usage statistics across active sessions
- Search Method Distribution: Analyze search strategy effectiveness
- Breakdown of vector vs. hybrid vs. keyword search usage
- Success rates and fallback patterns for different methods
- Quality metrics per search type with comparative analysis
- System Health Monitoring: Track infrastructure performance
- Qdrant collection performance and vector database health
- MongoDB connection status and query response times
- OpenAI API usage, rate limits, and cost optimization insights
The system maintains conversation history to provide context-aware responses:
- Follow-up Questions: Ask related questions without repeating context
- Reference Previous Answers: The AI remembers what it told you earlier
- Natural Flow: Conversations feel more natural and coherent
Conversation Management Sidebar:
- Memory Statistics: View active sessions and total message count
- Current Session Info: See number of exchanges in current conversation
- Clear History: Manually reset conversation memory when needed
Automatic Features:
- Session Isolation: Each browser session has its own conversation memory
- Message Limits: Automatically trims to last 20 messages to prevent token overflow
- Auto Expiry: Sessions expire after 60 minutes of inactivity
- Smart Trimming: Removes oldest messages while preserving conversation pairs
User: "How do I connect Snowflake to Atlan?"
AI: "To connect Snowflake to Atlan, you need to configure..." [provides detailed steps]
User: "What permissions do I need for this?"
AI: "For the Snowflake connection we discussed, you'll need..." [remembers previous context]
User: "Are there any security considerations?"
AI: "Yes, for your Snowflake-Atlan integration, consider..." [builds on conversation]
- Backend: LangChain's `InMemoryChatMessageHistory` for pure RAM storage
- No Database: Conversations stored in Python dictionaries (no external dependencies)
- Session Management: UUID-based session identification with Streamlit session state
- Context Integration: Previous conversation included in RAG prompts for better responses
The ConversationMemoryManager class provides advanced conversation memory features:
Core Features:
- Session Isolation: Each browser session maintains separate conversation history
- Automatic Cleanup: Configurable auto-cleanup removes expired sessions (default: every 100 operations)
- Message Trimming: Automatically limits conversations to last 20 messages per session
- Session Timeout: Sessions expire after 60 minutes of inactivity
- Smart Trimming: Preserves conversation pairs (human + AI messages) when trimming
Configuration Options:
- `max_messages_per_session`: Maximum messages per conversation (default: 20)
- `session_timeout_minutes`: Session expiration time (default: 60 minutes)
- `auto_cleanup_interval`: Operations between automatic cleanup (default: 100)
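The pair-preserving trimming described above can be sketched as follows (illustrative, not the actual `ConversationMemoryManager` code; assumes messages alternate human/AI starting with a human turn):

```python
def trim_messages(messages: list, max_messages: int = 20) -> list:
    """Drop the oldest messages over the limit, rounding the drop count up
    to a whole (human, AI) pair so no exchange is left half-orphaned."""
    if len(messages) <= max_messages:
        return messages
    excess = len(messages) - max_messages
    excess += excess % 2  # round up to an even number of dropped messages
    return messages[excess:]
```

Dropping an even count keeps the human/AI alternation intact, so the context passed to the RAG prompt never starts mid-exchange.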
Memory Statistics:
- Active session count and total message tracking
- Per-session message counts and last activity timestamps
- Memory usage optimization with automatic garbage collection
- Real-time memory health monitoring via the Analytics page
- Navigate to "Settings" in the sidebar
- Collection Management: Select from available Qdrant collections with real-time discovery
- Collection Information: View collection points, vector size, and distance metrics
- Configure search parameters (TOP_K, score thresholds, configurable hybrid weights)
- Adjust model settings (temperature, max tokens, model selection)
- Toggle features (hybrid search, query enhancement)
- Customize UI preferences (show analysis default)
- Apply settings in real-time without restarting the application
- Export/import settings configurations as JSON files
- View configuration warnings for potentially problematic settings
- Troubleshooting: Built-in connection diagnostics and collection validation
The system analyzes tickets using structured prompts to generate:
- Topic Tags: Multiple relevant categories with high accuracy
- Sentiment: Emotional tone analysis for prioritization
- Priority: Business impact assessment with context awareness
- RAG Topics: How-to, Product, Best practices, API/SDK, SSO → Generate answers using hybrid search
- Routing Topics: Connector, Lineage, Glossary, Sensitive data → Route to specialized teams
- Query Processing: Optional GPT-4o enhancement expands technical terms
- Search Strategy: Hybrid vector + keyword search with smart reranking
- Response Generation: Context-aware answers with source attribution
- Input: Raw user query (e.g., "How to setup SSO?")
- Processing: GPT-4o expands technical terms and acronyms
- Output: Enhanced query (e.g., "How to configure SAML single sign-on authentication in Atlan?")
- Benefits: Better retrieval for technical documentation
- Toggle: Configurable via `ENABLE_QUERY_ENHANCEMENT`
- Vector Search: Semantic similarity using FastEmbed BGE-small (384 dim)
- Keyword Search: BM25 algorithm for exact term matching
- Fusion Strategy: Configurable weighted combination with smart deduplication
- Reranking: Boosts documents found by both methods
- Fallback: Graceful degradation to vector-only if BM25 fails
- Structure Preservation: Special handling for code blocks and headers
- Smart Separators: 15+ separator types for optimal boundaries
- Quality Metrics: Tracks code presence, headers, and word count
- Metadata Enhancement: Chunk-level quality indicators
- Context Maintenance: Preserves related content together
- Chunk Size: 1200 tokens with 200 token overlap
- Method: Advanced recursive character splitting with enhanced separators
- Code Preservation: Special handling for ``` code blocks and indented code
- Structure Awareness: Preserves headers, lists, procedures, and markdown formatting
- Quality Metrics: Tracks code blocks, headers, word count, and chunk quality scores
- Smart Boundaries: 15+ separator types for optimal semantic chunking
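The per-chunk quality metrics above might be computed along these lines (a sketch; the metric names are illustrative, not the project's actual schema):

```python
import re

def chunk_quality(chunk: str) -> dict:
    """Compute quality metadata for one chunk: code presence (fenced or
    indented), markdown header count, and word count."""
    return {
        "has_code": "```" in chunk or bool(re.search(r"^\s{4,}\S", chunk, re.M)),
        "header_count": len(re.findall(r"^#{1,6}\s", chunk, re.M)),
        "word_count": len(chunk.split()),
    }
```

Storing these as payload metadata lets you debug retrieval quality later, e.g. by checking whether code-related queries actually retrieve code-bearing chunks.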
- Vector Search: BAAI/bge-small-en-v1.5 (384 dimensions) with cosine similarity
- Keyword Search: BM25 algorithm for exact term matching
- Search Fusion: Configurable weighted combination of vector and keyword results
- Smart Reranking: Deduplication and relevance scoring with boost for multi-method matches
- Score Threshold: 0.3 minimum similarity for vector results
- Top-K Retrieval: 5 most relevant chunks from hybrid results
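Applying the retrieval defaults above (0.3 score threshold, top-5) to a list of scored hits can be sketched as:

```python
def select_top_k(hits, score_threshold=0.3, top_k=5):
    """Keep (doc_id, score) hits at or above the threshold, then return
    the top_k highest-scoring ones."""
    kept = [h for h in hits if h[1] >= score_threshold]
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:top_k]
```

In practice the vector database applies both filters server-side; the sketch just makes the two defaults concrete.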
- Memory Backend: LangChain's `InMemoryChatMessageHistory` for pure RAM storage
- Session Management: Unique session IDs for each browser session with automatic timeout
- Context Window: Last 5 message exchanges included in RAG prompts for continuity
- Memory Features:
- Automatic message trimming (max 20 messages per session)
- Session cleanup and expiration (60-minute timeout)
- Manual conversation clearing via UI
- Memory usage statistics and monitoring
- No External Dependencies: Pure in-memory storage without databases
- `OPENAI_API_KEY`: Required for GPT-4o classification, response generation, and query enhancement
- `QDRANT_URI`: Qdrant Cloud vector database endpoint for hybrid search
- `QDRANT_API_KEY`: Authentication for Qdrant Cloud instance
- `MONGODB_URI`: MongoDB Atlas connection string for document storage
- `FIRECRAWL_API_KEY`: Firecrawl API key for web scraping (data pipeline only)
- `ENABLE_QUERY_ENHANCEMENT`: Toggle GPT-4o query expansion (default: False)
- `ENABLE_HYBRID_SEARCH`: Toggle vector + BM25 hybrid search (default: True)
- `HYBRID_VECTOR_WEIGHT`: Configurable weight for vector search results (default: 1.0)
- `HYBRID_KEYWORD_WEIGHT`: Configurable weight for BM25 keyword results (default: 0.0)
- `COLLECTION_NAME`: Qdrant collection name (default: "atlan_docs_enhanced")
- `SCORE_THRESHOLD`: Minimum similarity threshold (default: 0.3)
- `TOP_K`: Number of search results to retrieve (default: 5)
- `MAX_TOKENS`: Maximum response length (default: 1000)
- `TEMPERATURE`: Response creativity level (default: 0.3)
- `LLM_MODEL`: OpenAI model for responses (default: "gpt-4o")
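Reading these toggles from the environment with their documented defaults might look like this (a sketch showing only a few variables; the helper names are illustrative):

```python
import os

def env_bool(name: str, default: bool) -> bool:
    """Interpret common truthy strings; fall back to the given default."""
    return os.environ.get(name, str(default)).strip().lower() in {"1", "true", "yes"}

def env_float(name: str, default: float) -> float:
    return float(os.environ.get(name, default))

# Illustrative reading of the toggles documented above.
ENABLE_HYBRID_SEARCH = env_bool("ENABLE_HYBRID_SEARCH", True)
HYBRID_VECTOR_WEIGHT = env_float("HYBRID_VECTOR_WEIGHT", 1.0)
SCORE_THRESHOLD = env_float("SCORE_THRESHOLD", 0.3)
TOP_K = int(os.environ.get("TOP_K", 5))
```

Centralizing the parsing this way keeps the pipeline code free of string-to-type conversions and makes the defaults auditable in one place.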
- Real-time Updates: All configuration changes apply immediately without restart
- Collection Management: Dynamic Qdrant collection discovery and switching
- Settings Validation: Built-in warnings for potentially problematic configurations
- Import/Export: JSON-based settings backup and sharing capabilities
- UI Integration: Settings page with tabbed interface for different parameter categories
- Configuration Persistence: Settings stored in session state and applied to pipeline
- Connection Diagnostics: Real-time collection validation and troubleshooting
- Fallback Handling: Graceful degradation when settings cause issues
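Settings import/export can be as simple as round-tripping the settings dict through JSON, with unrecognized keys dropped on import as a basic validation step. The function names here are illustrative:

```python
import json

def export_settings(settings: dict) -> str:
    """Serialize current pipeline settings to a shareable JSON string."""
    return json.dumps(settings, indent=2, sort_keys=True)

def import_settings(payload: str, known_keys: set) -> dict:
    """Parse a JSON settings backup, keeping only recognized keys."""
    loaded = json.loads(payload)
    return {k: v for k, v in loaded.items() if k in known_keys}
```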
- Scraping Parameters: Use `--limit` and `--collection` options in scrape.py for custom URLs and crawl limits
- Source Filtering: Use `--source-url` in qdrant_ingestion.py for selective document processing
- Collection Management: Use `--qdrant-collection`, `--recreate`, and `--no-incremental` options for collection lifecycle
- Custom Collections: Use `--qdrant-collection` to create separate vector collections for different projects
- Chunk Configuration: Adjust size and overlap in qdrant_ingestion.py (default: 1200 tokens, 200 overlap)
- Vector Search: Modify threshold and top-K in app/rag_pipeline.py (default: 0.3 threshold, 5 chunks)
This project implements two different Qdrant collections with distinct chunking strategies to optimize for different use cases and document types.
| Collection | Chunking Strategy | Branch Availability | Best For |
|---|---|---|---|
| `atlan_docs` | Basic Chunking | Development branch | Plain text, fast processing |
| `atlan_docs_enhanced` | Enhanced Chunking | Main & advanced-rag-enhancements branches | Technical docs with code |
Implementation Location: Available in the development branch of this repository
Technical Details:
- Uses a simple `RecursiveCharacterTextSplitter` with basic separators: `["\n\n", "\n", " ", ""]`
- Chunk size: ~1200 characters with 200 character overlap
- Keeps separators in chunks (`keep_separator=True`)
- Fast, straightforward processing
Metadata Structure:
{
"text": "chunk content...",
"source_url": "https://docs.atlan.com/...",
"title": "Document Title",
"doc_type": "docs" | "developer",
"chunk_index": 0,
"total_chunks": 5
}
Characteristics:
- ✅ Pros: Simple, fast, works well for plain text documents
- ❌ Cons:
  - Doesn't preserve fenced code blocks (may split them mid-block)
  - Markdown structure (headers, lists) may be broken
  - Chunks may cut through semantic boundaries → lower retrieval quality
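The basic strategy can be approximated without LangChain. The repository itself uses `RecursiveCharacterTextSplitter`, so the standalone sketch below only illustrates the greedy back-off idea behind it:

```python
def basic_chunk(text, chunk_size=1200, overlap=200):
    """Greedy approximation: prefer paragraph, then line, then word boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back off to the best separator inside the window
            for sep in ("\n\n", "\n", " "):
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)  # keep_separator-style behaviour
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks
```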
Implementation Location: Current implementation in main and advanced-rag-enhancements branches
Technical Details:
- Code Block Preservation: Uses `preserve_code_blocks()` function to surround code blocks with newlines
- Rich Separators: 15+ separator types for optimal semantic boundaries:

separators = [
    "\n\n\n",                              # Major section breaks
    "\n\n",                                # Paragraph breaks
    "\n```\n", "```\n",                    # Code block boundaries
    "\n# ", "\n## ", "\n### ", "\n#### ",  # Headers
    "\n- ", "\n* ", "\n1. ", "\n2. ",      # Lists
    "\n", ". ", "? ", "! ", "; ", ", ",    # Sentences & punctuation
    " ", ""                                # Words & characters
]
- Quality Metrics: Analyzes chunk content for optimization
Enhanced Metadata Structure:
{
"text": "chunk content...",
"source_url": "https://docs.atlan.com/...",
"title": "Document Title",
"doc_type": "docs" | "developer",
"chunk_index": 0,
"total_chunks": 5,
"word_count": 150,
"has_code": true,
"has_headers": true,
"chunk_quality": "high" | "medium"
}
Characteristics:
- ✅ Pros:
- Preserves semantic meaning better (respects headers, lists, sentences)
- Code blocks are chunked as whole units (maintains functional examples)
- Quality metadata enables downstream filtering and optimization
- Better context preservation for technical documentation
- ❌ Cons:
- More complex processing (slower)
- Slightly larger metadata footprint
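The `preserve_code_blocks()` step named earlier can be sketched as a regex pass that pads fenced blocks with blank lines, so the splitter's code-fence separators land on block boundaries rather than inside them. The function name follows the text above, but this body is an illustrative sketch, not the repository's implementation:

```python
import re

def preserve_code_blocks(markdown: str) -> str:
    """Surround fenced code blocks with blank lines so chunk boundaries
    fall outside them rather than splitting code mid-block."""
    fenced = re.compile(r"```.*?```", re.DOTALL)
    return fenced.sub(lambda m: "\n\n" + m.group(0) + "\n\n", markdown)
```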
| Aspect | Basic Chunking | Enhanced Chunking |
|---|---|---|
| Speed | Fast ⚡ | Moderate ⏱️ |
| Code Preservation | ❌ May split code blocks | ✅ Preserves complete code blocks |
| Markdown Awareness | ❌ Basic line/paragraph splitting | ✅ Respects headers, lists, structure |
| Quality Tracking | ❌ No quality metrics | ✅ Chunk quality indicators |
| Use Case | Raw text ingestion | Developer documentation |
| Retrieval Quality | Good for simple text | Superior for technical content |
Choose atlan_docs (Basic) when:
- Processing large volumes of plain text
- Speed is critical over quality
- Documents don't contain code examples
- Simple question-answering scenarios
Choose atlan_docs_enhanced (Enhanced) when:
- Processing technical documentation
- Documents contain code examples and structured content
- Quality of retrieval is more important than speed
- Need chunk-level quality metrics for optimization
Switching Collections:
- Navigate to "⚙️ Settings" in the sidebar
- Use the "Collection Management" section
- Select from available Qdrant collections
- Apply changes in real-time without restart
- Basic Chunking: ~40% faster processing, smaller storage footprint
- Enhanced Chunking: Higher retrieval accuracy for technical queries, better context preservation
Choose based on your specific use case: speed vs. quality trade-off.
Complete Fresh Setup:
# Scrape new documentation
python scrape.py https://new-docs.com --limit 500 --collection new_docs
# Create fresh vector database
python qdrant_ingestion.py --collection new_docs --recreate
Incremental Updates (Recommended):
# Re-scrape updated content (overwrites existing URLs)
python scrape.py https://docs.atlan.com --limit 700
# Incremental ingestion (only new/changed documents)
python qdrant_ingestion.py
Domain-Specific Processing:
# Update only developer documentation vectors
python qdrant_ingestion.py --source-url "https://developer.atlan.com"
# Update only general documentation vectors
python qdrant_ingestion.py --source-url "https://docs.atlan.com"
Testing and Development:
# Create test dataset
python scrape.py https://docs.atlan.com --limit 20 --collection test_data
# Test ingestion pipeline with custom Qdrant collection
python qdrant_ingestion.py --collection test_data --qdrant-collection test_vectors --recreate
Multiple Project Management:
# Project A: Customer documentation
python scrape.py https://customer-docs.com --collection customer_docs
python qdrant_ingestion.py --collection customer_docs --qdrant-collection customer_vectors
# Project B: Internal documentation
python scrape.py https://internal-docs.com --collection internal_docs
python qdrant_ingestion.py --collection internal_docs --qdrant-collection internal_vectors
- Classification Prompts: Edit prompts in app/rag_pipeline.py for custom categorization
- Response Templates: Modify RAG and routing responses in app/main.py
- UI Styling: Update custom CSS in app/main.py for branding
- Smart Scraping: Firecrawl with automated content extraction and metadata preservation
- Persistent Storage: MongoDB with backup capabilities and incremental processing
- Advanced Vector Ingestion: Batch processing with enhanced chunking and quality metrics
- Hybrid Search Performance: Combined vector + keyword search with intelligent reranking
- Multi-Method Retrieval: Hybrid search combines semantic and keyword matching
- Query Enhancement: GPT-4o expands technical terms for better retrieval (configurable)
- Smart Reranking: Configurable weighted fusion of vector and BM25 results
- Source Attribution: All RAG responses include original documentation URLs
- Relevance Scoring: Vector similarity + BM25 scoring with threshold 0.3
- Context Quality: Top-5 chunks from hybrid results for comprehensive answers
- Search Transparency: Real-time indicators showing search methods used
- Feature Toggles: Configurable query enhancement and hybrid search
- Collection Management: Separate enhanced and standard collections
- Incremental Processing: Skip already processed documents for efficiency
- Quality Metrics: Chunk-level quality indicators (code detection, headers, word count)
- Error Resilience: Graceful fallbacks for all advanced features
- Performance Monitoring: Search method tracking and optimization insights
1. Firecrawl Scraping Problems
- Verify Firecrawl API key in environment variables
- Check rate limits and adjust scraping delays in scrape.py
- Monitor MongoDB connection for storage issues
2. MongoDB Storage Issues
- Validate MongoDB Atlas connection string
- Check database and collection permissions
- Verify network access to MongoDB cluster
3. Vector Database Problems
- Verify Qdrant Cloud instance is accessible
- Check embedding dimensions match (384 for BGE-small)
- Validate collection exists and has correct configuration
The system includes automatic backup mechanisms for data protection:
Automatic Backup Features:
- Scraping Backup: All scraped content is automatically saved as backup files during data collection
- Document Persistence: MongoDB Atlas provides built-in automated backups (snapshots every 24 hours)
- Metadata Preservation: Full document metadata, URLs, and timestamps stored for recovery
Manual Backup Procedures:
# Export specific collection to JSON backup
# Use MongoDB Compass or Atlas export functionality
# Or use mongodump for command-line backup:
mongodump --uri "your_mongodb_uri" --collection scraped_pages --db Cluster0 --out ./backup/
# Export custom collection with date stamp
mongodump --uri "your_mongodb_uri" --collection custom_docs --db Cluster0 --out ./backup/$(date +%Y%m%d)/
Recovery Procedures:
# Restore from MongoDB Atlas snapshot (via Atlas UI)
# 1. Go to Atlas Dashboard → Clusters → Backup
# 2. Select snapshot date and restore to new cluster
# 3. Update MONGODB_URI in environment variables
# Restore from local backup
mongorestore --uri "your_mongodb_uri" --db Cluster0 ./backup/dump/Cluster0/
Qdrant Collection Backup:
- Vector collections can be recreated using `qdrant_ingestion.py --recreate`
- All embedding data is regenerated from MongoDB source documents
- Collection metadata and configuration preserved in code
Recovery Process:
# Full vector database recreation from MongoDB
python qdrant_ingestion.py --recreate --qdrant-collection atlan_docs_enhanced
# Partial recovery for specific sources
python qdrant_ingestion.py --source-url "https://docs.atlan.com" --recreate
# Verify collection health after recovery
python -c "
from app.rag_pipeline import RAGPipeline
pipeline = RAGPipeline()
print(f'Collection status: {pipeline.qdrant_client.get_collection(\"atlan_docs_enhanced\")}')"
- Environment Variables: Ensure `.env` files are backed up securely
- MongoDB: Verify Atlas automated backups are enabled
- Source Code: Regular git commits with configuration files
- API Keys: Secure storage of all service credentials
- Documentation: Keep setup instructions updated for recovery scenarios
4. Streamlit Deployment
- Ensure all environment variables are set in app/.env
- Check that app/requirements.txt includes all dependencies
- Verify OpenAI API key has sufficient credits
5. Classification Errors
- Review prompt templates in app/rag_pipeline.py
- Check JSON parsing logic for malformed responses
- Monitor OpenAI API rate limits
- Check Streamlit logs for detailed error messages
- Validate environment variable loading in app directory
- Test individual pipeline components (MongoDB, Qdrant, OpenAI)
- Monitor API usage and rate limits across all services
- Separation of Concerns: Clean separation between data pipeline (root) and deployment (app/)
- Environment Isolation: Each tier has independent requirements and configuration
- Feature Modularity: Advanced RAG features can be toggled independently
- Data Persistence: MongoDB enables reprocessing and experimentation
- Deployment Ready: Enhanced app/ folder with advanced search capabilities
- Advanced Data Pipeline: Firecrawl → MongoDB → Enhanced Qdrant → Advanced Streamlit
- Hybrid Search System: Vector + BM25 keyword search with intelligent fusion
- Query Enhancement: Optional GPT-4o query expansion for technical terms
- Enhanced Chunking: Code-aware splitting with quality metrics
- Smart Configuration: Feature toggles for different deployment scenarios
- Dual Collection Strategy: Standard vs enhanced collections for comparison
- Performance Optimization: Configurable search weights and thresholds
- Query Enhancement: Optional GPT-4o expansion vs direct search (configurable)
- Hybrid Search: Vector + keyword complexity vs pure vector simplicity
- Enhanced Chunking: Structure preservation vs simple character splitting
- Feature Toggles: Flexibility vs configuration complexity
- Dual Collections: Comparison capability vs storage overhead
- Search Transparency: User insight vs UI complexity
- Performance vs Features: Configurable enhancement levels for different use cases
- Collection Management: Enhanced vs standard collections for A/B testing
- Feature Flags: Runtime configuration of advanced features
- Quality Metrics: Chunk-level quality indicators for optimization
- Search Analytics: Real-time method tracking and performance insights
- Graceful Degradation: Fallbacks ensure system reliability
| Feature | Implementation |
|---|---|
| Search Method | Hybrid vector + BM25 keyword search |
| Query Processing | Optional GPT-4o query enhancement |
| Chunking | Code-aware splitting with quality metrics |
| Results | Smart reranking with configurable fusion weights |
| UI Feedback | Search method indicators + transparency |
| Collection | atlan_docs_enhanced with enhanced metadata |
| Configurability | Feature toggles for all enhancements |
| Performance | Graceful degradation and fallbacks |
| Settings Management | Dynamic configuration with real-time updates |
| Configuration | Import/export, validation, and persistence |
- ✅ Better Technical Term Handling: Hybrid search excels at exact matches
- ✅ Enhanced Code Examples: Preserved code blocks in chunking
- ✅ Query Expansion: GPT-4o expands acronyms and technical terms
- ✅ Search Transparency: Users see which methods found their answers
- ✅ Quality Metrics: Chunk-level indicators for optimization
- ✅ Configurable Features: Toggle enhancements based on needs
- ✅ Dynamic Settings Management: Real-time configuration without restart
- ✅ Settings Import/Export: JSON-based configuration sharing and backup
- ✅ Collection Management: Real-time Qdrant collection discovery and switching
- ✅ Connection Diagnostics: Built-in troubleshooting for collection issues
The project includes utility functions for MongoDB operations in the data pipeline:
Core Functions:
- `get_mongodb_client()`: Creates an authenticated MongoDB client using environment variables
  - Returns: `MongoClient` instance configured with `MONGODB_URI`
  - Handles connection string validation and error handling
  - Usage: For direct database operations and debugging
- `get_mongodb_collection(database_name, collection_name)`: Gets MongoDB client, database, and collection
  - Args: `database_name`: Target database (default: "Cluster0"); `collection_name`: Target collection (default: "scraped_pages")
  - Returns: Tuple of `(client, database, collection)`
  - Usage: For pipeline scripts that need database access
- `close_mongodb_client(client)`: Safely closes MongoDB connections
  - Args: `client`: MongoDB client instance to close
  - Includes error handling for cleanup operations
  - Usage: Ensures proper resource cleanup in scripts
Constants:
- `DEFAULT_DATABASE = "Cluster0"`: Default MongoDB database name
- `DEFAULT_COLLECTION = "scraped_pages"`: Default collection for scraped content
Example Usage:
from utils import get_mongodb_collection, close_mongodb_client
# Get database components
client, db, collection = get_mongodb_collection("Cluster0", "custom_docs")
# Perform operations
documents = collection.find({"source_url": {"$regex": "docs.atlan.com"}})
# Cleanup
close_mongodb_client(client)
- Fork the repository
- Create a feature branch
- Implement changes with tests
- Update documentation
- Submit pull request
For issues and questions:
- Check the troubleshooting section
- Review API documentation
- Create an issue with detailed reproduction steps