A sophisticated, hierarchically aware Retrieval-Augmented Generation (RAG) system for Vaadin documentation that understands document structure, provides framework-specific filtering, and enables intelligent parent-child navigation through documentation sections.
This project provides an advanced RAG system with enhanced hybrid search that:
- Understands Hierarchical Structure: Navigates parent-child relationships within and across documentation files
- Enhanced Hybrid Search: Combines semantic and intelligent keyword search with native Pinecone reranking for superior relevance
- Framework Filtering: Intelligently filters content for Vaadin Flow (Java) vs Hilla (React) frameworks
- Agent-Friendly: Provides an MCP (Model Context Protocol) server for seamless IDE assistant integration
- Production Ready: Clean architecture with dependency injection, comprehensive testing, and error handling
vaadin-documentation-services/
├── packages/
│   ├── core-types/              # Shared TypeScript interfaces
│   ├── 1-asciidoc-converter/    # AsciiDoc → Markdown + metadata extraction
│   ├── 2-embedding-generator/   # Markdown → Vector database with hierarchical chunking
│   ├── rest-server/             # Enhanced REST API with hybrid search + reranking
│   └── mcp-server/              # MCP server with hierarchical navigation
├── package.json                 # Bun workspace configuration
└── PROJECT_PLAN.md              # Complete project documentation
flowchart TD
subgraph "Step 1: Documentation Processing"
VaadinDocs["Vaadin Docs<br/>(AsciiDoc)"]
Converter["AsciiDoc Converter<br/>• Framework detection<br/>• URL generation<br/>• Markdown output"]
Processor["Embedding Generator<br/>• Hierarchical chunking<br/>• Parent-child relationships<br/>• OpenAI embeddings"]
end
subgraph "Step 2: Enhanced Retrieval"
Pinecone["Pinecone Vector DB<br/>• Rich metadata<br/>• Hierarchical relationships<br/>• Framework tags"]
RestAPI["REST API<br/>• Enhanced hybrid search<br/>• Native Pinecone reranking<br/>• Framework filtering"]
end
subgraph "Step 3: Agent Integration"
MCP["MCP Server<br/>• search_vaadin_docs<br/>• get_full_document<br/>• Full document retrieval"]
IDEs["IDE Assistants<br/>• Context-aware search<br/>• Hierarchical exploration<br/>• Framework-specific help"]
end
VaadinDocs --> Converter
Converter --> Processor
Processor --> Pinecone
Pinecone <--> RestAPI
RestAPI <--> MCP
MCP <--> IDEs
classDef processing fill:#e1f5fe,stroke:#01579b,stroke-width:2px
classDef storage fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
classDef api fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
classDef agent fill:#fff3e0,stroke:#e65100,stroke-width:2px
class VaadinDocs,Converter,Processor processing
class Pinecone,RestAPI storage
class MCP api
class IDEs agent
- Enhanced Hybrid Search: Combines semantic similarity with intelligent keyword extraction and scoring
- Native Pinecone Reranking: Uses Pinecone's bge-reranker-v2-m3 for optimal result ranking
- Framework Awareness: Filters Flow vs Hilla content with common content inclusion
- Query Preprocessing: Smart keyword extraction with stopword filtering for better search quality
- Parent-Child Relationships: Navigate from specific details to broader context
- Cross-File Links: Understand relationships between different documentation files
- Context Breadcrumbs: Maintain navigation context for better user experience
- MCP Integration: Standardized protocol for IDE assistant integration
- TypeScript: Full type safety across all packages
- Comprehensive Testing: Unit tests, integration tests, and hierarchical workflow validation
- Clean Architecture: Dependency injection and interface-based design
- Bun runtime
- OpenAI API key (for embeddings)
- Pinecone API key and index
# Clone and install dependencies
git clone https://github.com/vaadin/vaadin-documentation-services
cd vaadin-documentation-services
bun install
# Create .env file with your API keys
echo "OPENAI_API_KEY=your_openai_api_key" > .env
echo "PINECONE_API_KEY=your_pinecone_api_key" >> .env
echo "PINECONE_INDEX=your_pinecone_index" >> .env
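A service that depends on these three variables can fail fast when one is missing. The following is a minimal sketch, not the project's actual startup code: the variable names come from the `.env` entries above, while `ServiceConfig` and `loadConfig` are hypothetical names introduced for illustration.

```typescript
// Hypothetical config loader. Only the environment variable names are taken
// from the .env file above; the type and function names are assumptions.
interface ServiceConfig {
  openAiApiKey: string;
  pineconeApiKey: string;
  pineconeIndex: string;
}

function loadConfig(env: Record<string, string | undefined>): ServiceConfig {
  const required = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX"];
  // Treat unset and blank values the same way: both count as missing.
  const missing = required.filter((key) => !(env[key] ?? "").trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    openAiApiKey: env["OPENAI_API_KEY"] as string,
    pineconeApiKey: env["PINECONE_API_KEY"] as string,
    pineconeIndex: env["PINECONE_INDEX"] as string,
  };
}
```

Under Bun, which reads `.env` files automatically, this could be called as `loadConfig(process.env)` at startup.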
# Convert AsciiDoc to Markdown with metadata
cd packages/1-asciidoc-converter
bun run convert
# Generate embeddings and populate vector database
cd ../2-embedding-generator
bun run generate
# Start the REST server (run from the repository root)
cd packages/rest-server
bun run start
# Server runs at http://localhost:3001
The MCP server is deployed and available remotely via HTTP transport at:
https://vaadin-mcp.fly.dev/mcp
Configure your IDE assistant to use the Streamable HTTP transport:
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";
const transport = new StreamableHTTPClientTransport(
new URL("https://vaadin-mcp.fly.dev/mcp")
);
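Once a client is connected over this transport, tool invocations travel as JSON-RPC 2.0 messages with the method `tools/call`. The sketch below only illustrates that message shape; it does not open a connection or perform the MCP initialization handshake. `buildToolCall` is a hypothetical helper, and the `query` argument name is an assumption, so check the input schema that the server reports for `search_vaadin_docs` before relying on it.

```typescript
// Shape of an MCP tool invocation as a JSON-RPC 2.0 request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper that assembles the request object.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Assumption: the search tool accepts a `query` string argument.
const searchRequest = buildToolCall(1, "search_vaadin_docs", {
  query: "How do I bind a form to a bean?",
});
```

In practice the SDK's client builds and sends these messages itself, so hand-assembling them is mainly useful for debugging or for reading traffic logs.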
Shared TypeScript interfaces used across all packages:
- DocumentChunk: Core documentation chunk structure
- RetrievalResult: Search result with relevance scoring
- Framework: Type-safe framework definitions
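To give a feel for what such shared types might look like, here is a hedged sketch. Only the three type names and the Flow/Hilla/common distinction come from this README; every field name below is an assumption made for illustration, and the real definitions live in packages/core-types/.

```typescript
// Illustrative only: field names are assumptions, not the package's actual definitions.
const FRAMEWORKS = ["flow", "hilla", "common"] as const;
type Framework = (typeof FRAMEWORKS)[number];

interface DocumentChunk {
  chunkId: string;
  parentId: string | null; // null for a top-level section
  framework: Framework;
  title: string;
  content: string;
  sourceUrl: string;
}

interface RetrievalResult extends DocumentChunk {
  relevanceScore: number; // higher means more relevant
}

// Runtime guard for validating untyped input such as a request parameter.
function isFramework(value: string): value is Framework {
  return (FRAMEWORKS as readonly string[]).includes(value);
}
```

Deriving the `Framework` union from a constant array keeps the compile-time type and the runtime check in sync.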
Converts Vaadin AsciiDoc documentation to Markdown with metadata:
- Framework Detection: Automatically detects Flow/Hilla/common content
- URL Generation: Creates proper Vaadin.com documentation links
- Include Processing: Handles AsciiDoc include directives
- Metadata Extraction: Preserves semantic information in frontmatter
cd packages/1-asciidoc-converter
bun run convert # Convert all documentation
bun run test # Run framework detection tests
Creates vector embeddings with hierarchical relationships:
- Hierarchical Chunking: Preserves document structure and relationships
- Parent-Child Links: Creates cross-file and intra-file relationship mapping
- LangChain Integration: Uses MarkdownHeaderTextSplitter for intelligent chunking
- Batch Processing: Efficient embedding generation and Pinecone upsertion
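The idea behind hierarchical chunking with parent-child links can be sketched with a small hand-written splitter. This is not LangChain's MarkdownHeaderTextSplitter and not the package's implementation; it is a simplified stand-in that tracks a stack of open headings, and the `Chunk` shape and ID scheme are assumptions.

```typescript
interface Chunk {
  id: string;
  parentId: string | null; // null for a top-level heading
  level: number;           // heading depth: 1 for "#", 2 for "##", ...
  title: string;
  content: string;
}

function chunkByHeadings(markdown: string, fileId: string): Chunk[] {
  const fence = "`".repeat(3); // code-fence marker
  const chunks: Chunk[] = [];
  const stack: Chunk[] = [];   // currently open ancestors, shallowest first
  let inFence = false;

  for (const line of markdown.split("\n")) {
    // Track fenced code so that "# comment" lines inside it are not read as headings.
    if (line.trimStart().startsWith(fence)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);

    if (match) {
      const level = match[1].length;
      // Pop siblings and deeper sections so the top of the stack is the parent.
      while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
      const parent = stack.length > 0 ? stack[stack.length - 1] : null;
      const chunk: Chunk = {
        id: `${fileId}#${chunks.length}`,
        parentId: parent ? parent.id : null,
        level,
        title: match[2].trim(),
        content: "",
      };
      chunks.push(chunk);
      stack.push(chunk);
    } else if (stack.length > 0) {
      // Body text belongs to the most recently opened heading.
      // Text before the first heading is dropped in this sketch.
      stack[stack.length - 1].content += line + "\n";
    }
  }
  return chunks;
}
```

A real pipeline would additionally enforce token limits per chunk and link chunks across files, which this sketch leaves out.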
cd packages/2-embedding-generator
bun run generate # Generate embeddings from Markdown
bun run test # Run chunking and relationship tests
Enhanced API server with hybrid search capabilities:
- Hybrid Search: Semantic + keyword search merged with Reciprocal Rank Fusion (RRF)
- Framework Filtering: Flow/Hilla/common content filtering
- Document Navigation: /chunk/:chunkId endpoint for parent-child navigation
- Backward Compatibility: Maintains existing API contracts
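Reciprocal Rank Fusion merges several ranked lists by giving each document a score of 1 / (k + rank) per list and summing the contributions. The sketch below shows the general technique; the function name, the chunk IDs, and the smoothing constant k = 60 (a commonly used default) are assumptions, not values confirmed for this server.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d),
// with ranks starting at 1. Documents missing from a list contribute nothing for it.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical chunk IDs from a semantic search and a keyword search.
const semanticHits = ["grid-sorting", "grid-basics", "grid-filtering"];
const keywordHits = ["grid-filtering", "grid-sorting", "binder-intro"];
const fusedHits = reciprocalRankFusion([semanticHits, keywordHits]);
// fusedHits: ["grid-sorting", "grid-filtering", "grid-basics", "binder-intro"]
```

Because RRF uses only ranks, it can combine semantic and keyword results even though their raw scores are not on comparable scales.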
cd packages/rest-server
bun run start # Start production server
bun run test # Run comprehensive test suite
bun run test:verbose # Detailed test output
API Endpoints:
- POST /search - Hybrid search with framework filtering
- GET /chunk/:chunkId - Retrieve specific document chunk
- POST /ask - AI-generated answers (with streaming support)
- GET /health - Health check
- GET /vaadin-version - Get latest Vaadin version from GitHub releases
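The framework filtering applied by the search endpoint keeps content for the requested framework plus content tagged as common. A minimal sketch of that rule follows, once as an in-memory predicate and once as a Pinecone-style metadata filter using the `$in` operator. The `framework` metadata field name and both function names are assumptions.

```typescript
type FrameworkTag = "flow" | "hilla" | "common";
type RequestedFramework = "flow" | "hilla";

interface TaggedChunk {
  chunkId: string;
  framework: FrameworkTag;
}

// In-memory version: keep chunks for the requested framework plus common content.
function filterByFramework<T extends TaggedChunk>(
  chunks: T[],
  requested?: RequestedFramework
): T[] {
  if (!requested) return chunks; // no framework given: return everything
  return chunks.filter((c) => c.framework === requested || c.framework === "common");
}

// Same rule expressed as a metadata filter object.
// Assumption: chunks are stored with a `framework` metadata field.
function buildFrameworkFilter(
  requested?: RequestedFramework
): { framework: { $in: FrameworkTag[] } } | undefined {
  if (!requested) return undefined;
  return { framework: { $in: [requested, "common"] } };
}
```

Pushing the filter down to the vector database avoids fetching chunks that would be discarded afterwards.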
Model Context Protocol server for IDE assistant integration:
- Document Tools: search_vaadin_docs and get_full_document
- Full Document Retrieval: Complete documentation pages with context
- Framework Awareness: Intelligent framework detection and filtering
- Error Handling: Graceful degradation for missing content
cd packages/mcp-server
bun run build # Build for distribution
bun run test # Run document-based tests
Available Tools:
- search_vaadin_docs: Search with semantic and keyword matching
- get_full_document: Retrieve complete documentation pages
- get_vaadin_version: Get latest Vaadin version and release timestamp
Each package includes comprehensive test suites:
# Test individual packages
cd packages/1-asciidoc-converter && bun run test
cd packages/2-embedding-generator && bun run test
cd packages/rest-server && bun run test
cd packages/mcp-server && bun run test
# Run REST server tests against the live endpoint
cd packages/rest-server && bun run test:server
- 100% Framework Detection Accuracy: Flow, Hilla, and common content correctly identified
- Enhanced Hybrid Search: Semantic + keyword search with native Pinecone reranking dramatically improves relevance
- Contextual Navigation: Parent-child relationships enable better result exploration
- 4,982 Document Chunks: Complete coverage of 378 Vaadin documentation files with 5-level hierarchy
- Parallel Processing: Semantic and keyword search executed in parallel with intelligent merging
- Native Reranking: Pinecone's bge-reranker-v2-m3 provides superior result ranking
- Query Preprocessing: Smart keyword extraction with stopword filtering improves search quality
- Efficient Chunking: Optimized token limits with intelligent content splitting
- Clean Architecture: Dependency injection enables easy performance optimization
- 100% API Backward Compatibility: All existing integrations continue to work
- Robust Error Handling: Graceful fallbacks ensure system reliability
- Fresh Data: Recently updated with complete Vaadin documentation coverage
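The query preprocessing mentioned above, keyword extraction with stopword filtering, can be illustrated with a short sketch. The stopword list here is a small hypothetical sample and the tokenization is deliberately simple; the server's actual list and rules may differ.

```typescript
// Small illustrative stopword list; a production list would be longer.
const STOPWORDS = new Set([
  "a", "an", "and", "are", "can", "do", "does", "for", "how", "i",
  "in", "is", "it", "of", "on", "the", "to", "what", "with",
]);

function extractKeywords(query: string): string[] {
  const tokens = query
    .toLowerCase()
    .split(/[^a-z0-9]+/)          // split on anything that is not a letter or digit
    .filter((t) => t.length > 1)  // drop empty strings and single characters
    .filter((t) => !STOPWORDS.has(t));
  // Remove duplicates while preserving first-seen order.
  return Array.from(new Set(tokens));
}

const keywords = extractKeywords("How do I add a Button to the Grid?");
// keywords: ["add", "button", "grid"]
```

The remaining terms would feed the keyword side of the hybrid search, while the unmodified query is still used for the semantic embedding.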
The REST server is deployed to fly.io and available at:
- Production: https://vaadin-docs-search.fly.dev
- Health Check: https://vaadin-docs-search.fly.dev/health
The MCP server is deployed to fly.io and available at:
- Production: https://vaadin-mcp.fly.dev/mcp
- Health Check: https://vaadin-mcp.fly.dev/health
Automated via GitHub Actions:
- Daily Updates: Documentation re-processed automatically
- Manual Triggers: Can be triggered via GitHub Actions UI
- Error Notifications: Automated alerts for processing failures
This project uses Bun workspaces for package management:
bun install # Install all dependencies
bun run build # Build all packages
bun run test # Test all packages
- Core Types: Add interfaces to packages/core-types/
- Processing: Extend converters in packages/1-asciidoc-converter/ or packages/2-embedding-generator/
- API: Enhance search in packages/rest-server/
- Integration: Update MCP tools in packages/mcp-server/
- Single Responsibility: Each package has a clear, focused purpose
- Interface-Based Design: Clean contracts between components
- Dependency Injection: Testable and swappable implementations
- Type Safety: Full TypeScript coverage with strict configuration
- Project Plan: Complete project breakdown and progress tracking
- Project Brief: Original requirements and problem definition
- Package READMEs: Detailed documentation for each package
This project successfully delivered:
✅ Sophisticated RAG System: Replaced naive implementation with hierarchically aware search
✅ Enhanced User Experience: Agents can now navigate from specific details to broader context
✅ Production Quality: Clean architecture, comprehensive testing, and error handling
✅ Framework Intelligence: Accurate Flow/Hilla content separation with common content inclusion
✅ Developer Integration: Seamless IDE assistant integration via MCP
The system now provides intelligent, context-aware documentation search that understands the hierarchical structure of Vaadin documentation and enables sophisticated agent interactions.
MIT - See license file for details.
Built with ❤️ for the Vaadin developer community