Author: Mihai Criveti
The ultimate RSS feed parser and search server for the Model Context Protocol (MCP)
Advanced RSS feed parsing, searching, filtering, and statistical analysis server built with FastMCP. Features AI-powered semantic search, hybrid retrieval (BM25 + semantic), multi-schema podcast support, and comprehensive RSS analytics.
Warning: This is an unsupported sample server for demonstration and testing only. Never run untrusted MCP servers directly on your local filesystem — always use a sandbox, container, or microVM (e.g. Docker, gVisor, Firecracker) with restricted capabilities. Perform your own security evaluation before registering any remote MCP server, including servers from public catalogs.
- Quick Start
- Features
- Installation
- Quick Examples
- All Tools (25 Total)
- Advanced Topics
- JSON-RPC Usage
- Configuration
- Development & Testing
- Troubleshooting
- Example RSS Feeds
- Performance
- Contributing
- License
```bash
# Basic installation
cd mcp-servers/python/mcp-rss-search
make install

# With AI similarity and hybrid search (recommended!)
make install-similarity
```

```bash
# Stdio mode (for Claude Desktop, Cursor, etc.)
make dev

# HTTP mode (REST API on port 9100)
make serve-http
```

```python
import asyncio
from mcp_rss_search.server_fastmcp import rss_parser

async def main():
    # Fetch NPR News feed
    feed = await rss_parser.fetch_feed('https://feeds.npr.org/1001/rss.xml')
    print(f"Feed: {feed['metadata']['title']}")
    print(f"Entries: {feed['entry_count']}")

asyncio.run(main())
```

Add to `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "rss-search": {
      "command": "python",
      "args": ["-m", "mcp_rss_search.server_fastmcp"]
    }
  }
}
```

- Hybrid Search: Combines BM25 (keyword) and semantic similarity for best results
- Document-Wide Search: Search across all fields automatically
- Configurable Models: Change embedding models at runtime (6+ models supported)
- Similarity Search: Find content by meaning, not just keywords using sentence embeddings
- Duplicate Detection: Identify duplicate or near-duplicate content semantically
- Related Content: Discover "read more like this" recommendations
- Topic Clustering: Automatically group entries into topic clusters
- BM25 + Semantic: Best of both worlds - exact matching and meaning-based search
- iTunes Podcasts: Full support for the `itunes:*` namespace (subtitle, summary, episode, season, etc.)
- Google Play: Support for the `googleplay:*` namespace
- Standard RSS/Atom: Universal RSS 2.0 and Atom feed support
- Smart Field Detection: Automatically extracts subtitles, summaries, descriptions across all formats
- Schema Inspector: Discover available fields in any feed
- Semantic Search in Any Field: Search subtitles, summaries, descriptions, or custom field combinations
- Title Search: Find entries by title with regex support
- Content Search: Search descriptions and full content
- Multi-Field Search: Search across titles, descriptions, authors, and categories
- Author/Speaker Search: Find all episodes by specific authors or podcast speakers
- Regex Support: Use powerful regular expressions for complex queries
- Case-Sensitive Options: Control search precision
- Feed Statistics: Comprehensive feed analytics (entry counts, date ranges, etc.)
- Author Analytics: Count and distribution of authors/speakers
- Category Analytics: Topic distribution and tag analysis
- Content Metrics: Average content length, total entries, etc.
- Media Analytics: Track audio/video content in feeds
- Date Range Filtering: Filter entries by publication date
- Latest Entries: Get N most recent entries
- Author Filtering: Find all content from specific authors
- Category Browsing: Explore content by tags and categories
- Automatic HTML Removal: Strips all HTML tags from content
- XML Noise Filtering: Clean, structured data without XML clutter
- Entity Decoding: Handles HTML entities correctly
- Smart Caching: Configurable caching for performance
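The cleaning steps above (tag removal, entity decoding) can be sketched in a few stdlib lines. This is a simplified illustration of the idea, not the server's actual implementation:

```python
import html
import re

def clean_html(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", raw)  # drop tags (adequate for short feed snippets)
    text = html.unescape(text)          # decode entities like &amp; and &#8217;
    return re.sub(r"\s+", " ", text).strip()

print(clean_html("<p>AI &amp; ML: <b>what&#8217;s next?</b></p>"))
```

A regex tag-stripper is fine for feed descriptions; a full HTML parser would be needed for arbitrary markup.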
```bash
cd mcp-servers/python/mcp-rss-search

# Basic installation (keyword search only)
make install

# With AI similarity and hybrid search
make install-similarity

# With development tools
make dev-install
```

```bash
# Basic installation
pip install -e .

# With AI similarity and hybrid search features
pip install -e ".[similarity]"

# Full installation (similarity + dev tools)
pip install -e ".[full]"

# With uv (faster)
uv pip install -e ".[similarity]"
```

Core Dependencies (always installed):

- `fastmcp` >= 3.0.2
- `pydantic` >= 2.5.0
- `feedparser` >= 6.0.0
- `httpx` >= 0.27.0
- `python-dateutil` >= 2.8.0

Similarity & Hybrid Search (optional, install with `[similarity]`):

- `sentence-transformers` >= 2.2.0 (~80MB model, auto-downloaded)
- `numpy` >= 1.24.0
- `scikit-learn` >= 1.3.0
- `rank-bm25` >= 0.2.2 (for hybrid search with BM25 keyword matching)
Environment Variables:
- `RSS_EMBEDDING_MODEL` - Set the default embedding model (default: `all-MiniLM-L6-v2`)
The server automatically detects and extracts fields from:
- iTunes Podcasts: subtitle, summary, episode, season, duration, explicit flag
- Google Play: author, description, image
- Standard RSS/Atom: All standard fields
```python
import asyncio
from mcp_rss_search.server_fastmcp import rss_parser

async def fetch_example():
    feed = await rss_parser.fetch_feed('https://feeds.npr.org/1001/rss.xml')
    print(f"Feed: {feed['metadata']['title']}")
    print(f"Entries: {feed['entry_count']}")

asyncio.run(fetch_example())
```

```python
async def search_example():
    feed = await rss_parser.fetch_feed('https://feeds.npr.org/1001/rss.xml')
    results = rss_parser.search_entries(feed, 'climate', fields=['title'])
    print(f"Found {len(results)} climate-related articles")
    for result in results[:3]:
        print(f"  - {result['title']}")

asyncio.run(search_example())
```

```python
from mcp_rss_search.server_fastmcp import rss_parser, similarity_engine

async def semantic_search_example():
    feed = await rss_parser.fetch_feed('https://feeds.npr.org/1001/rss.xml')
    # Finds articles about AI, machine learning, automation, etc.
    results = similarity_engine.similarity_search(
        query="artificial intelligence and automation",
        entries=feed['entries'],
        top_k=5,
        threshold=0.5
    )
    for result in results:
        print(f"{result['entry']['title']} (score: {result['similarity']:.2f})")

asyncio.run(semantic_search_example())
```

```python
from mcp_rss_search.server_fastmcp import rss_parser, hybrid_engine

async def hybrid_search_example():
    feed = await rss_parser.fetch_feed('https://podcast-feed.xml')
    # Combines keyword matching (BM25) with semantic similarity
    results = hybrid_engine.hybrid_search(
        query="machine learning ethics",
        entries=feed['entries'],
        semantic_weight=0.6,  # 60% semantic, 40% keyword
        bm25_weight=0.4,
        top_k=10
    )
    for result in results:
        print(f"{result['entry']['title']}")
        print(f"  Hybrid: {result['hybrid_score']:.2f} "
              f"(Semantic: {result['semantic_score']:.2f}, "
              f"BM25: {result['bm25_score']:.2f})")

asyncio.run(hybrid_search_example())
```

```python
async def podcast_example():
    feed = await rss_parser.fetch_feed('https://podcast-feed.xml')
    # Find episodes with a specific guest
    results = rss_parser.find_by_author(
        feed,
        author="Neil deGrasse Tyson",
        exact_match=False
    )
    print(f"Found {len(results)} episodes with Neil deGrasse Tyson")

asyncio.run(podcast_example())
```

```python
async def stats_example():
    feed = await rss_parser.fetch_feed('https://blog-feed.xml')
    stats = rss_parser.get_statistics(feed)
    print(f"Total entries: {stats['total_entries']}")
    print(f"Unique authors: {stats['authors']['count']}")
    print(f"Date range: {stats['date_range']['earliest']} to {stats['date_range']['latest']}")

asyncio.run(stats_example())
```

The server provides 25 MCP tools organized into 5 categories:
Get information about the current embedding model configuration.
Returns: Model name, status, embedding dimensions, max sequence length
Example: Check which model is currently configured and loaded.
Configure/change the embedding model at runtime.
Parameters:

- `model_name` (string, required): Sentence transformer model to use

Common models:

- `all-MiniLM-L6-v2` (default) - Fast, lightweight (80MB)
- `all-mpnet-base-v2` - Higher quality (420MB)
- `multi-qa-mpnet-base-dot-v1` - Best for Q&A
- `paraphrase-multilingual-MiniLM-L12-v2` - Multilingual

Example: `configure_model("all-mpnet-base-v2")` - Switch to the higher-quality model
Combine BM25 (keyword) and semantic similarity for robust search.
Parameters:
- `url` (string, required): RSS feed URL
- `query` (string, required): Search query
- `fields` (list, optional): Fields to search (default: title, description)
- `top_k` (integer, default: 10): Number of results
- `semantic_weight` (float, default: 0.5): Weight for semantic score (0-1)
- `bm25_weight` (float, default: 0.5): Weight for BM25 score (0-1)
- `threshold` (float, default: 0.0): Minimum hybrid score
Returns: Results with hybrid_score, semantic_score, and bm25_score
Why Hybrid?: Combines the best of both worlds - exact keyword matching (BM25) and meaning-based matching (semantic).
Document-wide search across ALL fields with automatic hybrid retrieval.
Parameters:
- `url` (string, required): RSS feed URL
- `query` (string, required): Search query
- `top_k` (integer, default: 10): Number of results
- `use_semantic` (boolean, default: true): Enable semantic search
- `use_bm25` (boolean, default: true): Enable BM25 keyword search
Searches: title, subtitle, summary, description, author
Perfect for: When you don't know which field contains your information.
Inspect feed to discover available fields and podcast schema type.
Returns:
- Detected schemas (iTunes, Google Play, etc.)
- Available fields and coverage
- Recommended search fields
- Sample values
Search specifically in podcast episode subtitles (iTunes, Google Play formats).
Perfect for: Finding episodes by topic when subtitles are well-maintained.
Search in episode summaries and descriptions (handles all podcast formats).
Perfect for: Deep content search across iTunes summary, descriptions, etc.
Custom multi-field semantic search - choose exactly which fields to search.
Fields available: title, subtitle, summary, description, author, categories, etc.
Semantic similarity search using AI embeddings. Finds content by meaning, not just keywords.
Example: Query "climate crisis" finds articles about "global warming", "environmental catastrophe", etc.
Parameters:
- `url` (string, required): RSS feed URL
- `query` (string, required): Natural language search query
- `top_k` (integer, default: 10): Number of results
- `threshold` (float, default: 0.0): Minimum similarity (0-1)
- `use_cache` (boolean, default: true)
Find duplicate or near-duplicate entries using semantic similarity.
Use cases: Cross-posted content, republished articles, feed deduplication
Parameters:
- `url` (string, required): RSS feed URL
- `similarity_threshold` (float, default: 0.85): Duplicate threshold
- `use_cache` (boolean, default: true)
Find entries related to a specific entry - perfect for "read more like this".
Parameters:
- `url` (string, required): RSS feed URL
- `entry_index` (integer, required): Index of source entry
- `top_k` (integer, default: 5): Number of related entries
- `use_cache` (boolean, default: true)
Automatically cluster entries into topic groups using K-means.
Parameters:
- `url` (string, required): RSS feed URL
- `n_clusters` (integer, default: 5): Number of clusters (2-20)
- `use_cache` (boolean, default: true)
Fetch and parse an RSS feed from URL.
Parameters:
- `url` (string, required): RSS feed URL
- `use_cache` (boolean, default: true): Use cached feed if available
Returns: Complete feed data with metadata and entries
Search RSS feed entries by title.
Parameters:
- `url` (string, required): RSS feed URL
- `query` (string, required): Search query
- `case_sensitive` (boolean, default: false)
- `regex` (boolean, default: false): Use regex pattern matching
- `use_cache` (boolean, default: true)
Search in entry descriptions and content.
Parameters: Same as `search_titles`
Search across all fields (title, description, author, categories).
Parameters: Same as `search_titles`
Find all entries by a specific author or speaker.
Parameters:
- `url` (string, required): RSS feed URL
- `author` (string, required): Author/speaker name
- `exact_match` (boolean, default: false): Require exact name match
- `use_cache` (boolean, default: true)
List all unique authors/speakers with their entry counts.
Parameters:
- `url` (string, required): RSS feed URL
- `min_count` (integer, default: 1): Minimum entries per author
- `use_cache` (boolean, default: true)
List all unique categories/tags with counts.
Parameters:
- `url` (string, required): RSS feed URL
- `min_count` (integer, default: 1): Minimum entries per category
- `use_cache` (boolean, default: true)
Get comprehensive feed statistics and analysis.
Returns:
- Total entry count
- Date range (earliest to latest)
- Author statistics (count and distribution)
- Category statistics
- Media statistics (for podcasts)
- Content length statistics
Filter entries by date range.
Parameters:
- `url` (string, required): RSS feed URL
- `start_date` (string, optional): Start date (ISO format or natural language)
- `end_date` (string, optional): End date
- `use_cache` (boolean, default: true)
Example Dates:
- ISO format: "2024-01-01"
- Natural: "last week", "January 2024"
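For ISO-format dates, the filtering logic reduces to parsing and comparing timestamps. The sketch below is an illustrative stand-in for the tool's internals, using only stdlib `datetime.fromisoformat` (the server itself depends on python-dateutil for more flexible parsing, and natural-language phrases like "last week" require additional handling beyond either):

```python
from datetime import datetime

def filter_by_date(entries, start=None, end=None):
    """Keep entries whose ISO-format 'published' date falls within [start, end]."""
    start_dt = datetime.fromisoformat(start) if start else None
    end_dt = datetime.fromisoformat(end) if end else None
    kept = []
    for entry in entries:
        try:
            published = datetime.fromisoformat(entry["published"])
        except (KeyError, ValueError):
            continue  # skip entries without a parseable date
        if start_dt and published < start_dt:
            continue
        if end_dt and published > end_dt:
            continue
        kept.append(entry)
    return kept

entries = [
    {"title": "Old post", "published": "2023-06-01"},
    {"title": "New post", "published": "2024-02-15"},
]
recent = filter_by_date(entries, start="2024-01-01")
print([e["title"] for e in recent])  # ['New post']
```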
Extract feed-level metadata (title, description, author, etc.).
Get N most recent entries from the feed.
Parameters:
- `url` (string, required): RSS feed URL
- `count` (integer, default: 10, range: 1-100): Number of entries
- `use_cache` (boolean, default: true)
Comprehensive feed analysis with insights and recommendations.
Returns:
- Feed type detection (Podcast, Blog, News, etc.)
- Update frequency analysis
- Content patterns
- Automated insights
- Recommendations
Clear the internal RSS feed cache.
Semantic similarity search uses AI embeddings to find content by meaning, not just keywords.
- Text is converted to high-dimensional vectors (embeddings)
- Similarity is computed using cosine similarity
- Results are ranked by semantic similarity score (0-1)
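Conceptually, the ranking step is just cosine similarity between embedding vectors. A stdlib-only sketch with toy 3-dimensional vectors (real sentence-transformer embeddings have hundreds of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the query points mostly along the first axis
query = [0.9, 0.1, 0.0]
doc_on_topic = [0.8, 0.2, 0.1]   # similar direction, so high score
doc_off_topic = [0.0, 0.1, 0.9]  # different direction, so low score

print(round(cosine_similarity(query, doc_on_topic), 2))   # 0.98
print(round(cosine_similarity(query, doc_off_topic), 2))  # 0.01
```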
| Use Semantic Search When | Use Keyword Search When |
|---|---|
| Looking for concepts/ideas | Need exact phrase/term |
| Exploring related topics | Know specific keyword |
| Finding similar articles | Filtering by author/date |
| Duplicate detection | Precise field matching |
| Content recommendations | Fast, simple queries |
| Score | Meaning | Use Case |
|---|---|---|
| 0.9-1.0 | Nearly identical | Duplicate detection |
| 0.7-0.9 | Very similar | Same topic/event |
| 0.5-0.7 | Related | Recommended reading |
| 0.3-0.5 | Loosely related | Topic exploration |
| < 0.3 | Minimal relation | Filter out |
```python
# High precision search
results = similarity_engine.similarity_search(
    query="quantum computing applications in drug discovery",
    entries=feed['entries'],
    threshold=0.7,  # High threshold
    top_k=5
)

# Discovery mode (find related concepts)
results = similarity_engine.similarity_search(
    query="climate change",
    entries=feed['entries'],
    threshold=0.3,  # Lower threshold
    top_k=20  # More results
)

# Find duplicates
duplicates = similarity_engine.find_duplicates(
    entries=feed['entries'],
    similarity_threshold=0.85
)

# Topic clustering
clusters = similarity_engine.cluster_entries(
    entries=feed['entries'],
    n_clusters=5
)
```

Hybrid search combines traditional keyword-based ranking (BM25) with semantic similarity for robust search results.
Problem with Pure Semantic Search:
- May miss exact keyword matches
- Can be too "fuzzy" for specific queries
Problem with Pure Keyword Search:
- Misses synonyms and related concepts
- Requires exact word matches
Solution: Hybrid Search:
- Gets both exact matches AND semantic matches
- Configurable weights let you tune precision vs. recall
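At its core, the combination is a weighted sum of the two per-entry scores. This is a hedged sketch of the idea rather than the server's exact scoring code; note that raw BM25 scores are unbounded, so in practice they must be normalized to [0, 1] before being combined with the cosine-based semantic score:

```python
def hybrid_score(semantic, bm25, semantic_weight=0.5, bm25_weight=0.5):
    """Weighted sum of a semantic score and a pre-normalized BM25 score, both in [0, 1]."""
    return semantic_weight * semantic + bm25_weight * bm25

# A result with a strong exact-keyword match but only a moderate semantic match
score = hybrid_score(semantic=0.55, bm25=0.90, semantic_weight=0.4, bm25_weight=0.6)
print(round(score, 2))  # 0.76
```

Raising `bm25_weight` pulls exact-keyword matches up the ranking; raising `semantic_weight` favors conceptually related entries.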
| Use Case | Semantic Weight | BM25 Weight | Why |
|---|---|---|---|
| General search | 0.5 | 0.5 | Balanced |
| Concept discovery | 0.7 | 0.3 | Find related topics |
| Exact term search | 0.3 | 0.7 | Precise keyword matching |
| Technical docs | 0.4 | 0.6 | Technical terms matter |
| News articles | 0.6 | 0.4 | Concepts > exact words |
```python
# Balanced hybrid search
results = hybrid_engine.hybrid_search(
    query="artificial intelligence regulation",
    entries=feed['entries'],
    semantic_weight=0.5,
    bm25_weight=0.5,
    top_k=10
)

# Favor semantic (find related concepts)
results = hybrid_engine.hybrid_search(
    query="climate change policy",
    entries=feed['entries'],
    semantic_weight=0.7,
    bm25_weight=0.3,
    top_k=5
)

# Favor keyword (precise matching)
results = hybrid_engine.hybrid_search(
    query="Python 3.12 new features",
    entries=feed['entries'],
    semantic_weight=0.3,
    bm25_weight=0.7,
    fields=["title", "description"],
    top_k=10
)

# Document-wide search (all fields)
results = hybrid_engine.document_search(
    query="quantum computing interview",
    entries=feed['entries'],
    top_k=10
)
```

```json
{
  "success": true,
  "query": "machine learning ethics",
  "semantic_weight": 0.5,
  "bm25_weight": 0.5,
  "match_count": 5,
  "matches": [
    {
      "hybrid_score": 0.82,
      "semantic_score": 0.75,
      "bm25_score": 0.89,
      "entry": {
        "title": "AI Ethics and Responsible Development",
        "description": "...",
        ...
      }
    }
  ]
}
```

| Model | Size | Speed | Quality | Use Case |
|---|---|---|---|---|
| `all-MiniLM-L6-v2` | 80MB | ⚡⚡⚡ | Good | Default, fast |
| `all-MiniLM-L12-v2` | 120MB | ⚡⚡ | Better | Balanced |
| `all-mpnet-base-v2` | 420MB | ⚡ | Best | High quality |
| `multi-qa-mpnet-base-dot-v1` | 420MB | ⚡ | Best | Q&A, search |
| `all-distilroberta-v1` | 290MB | ⚡⚡ | Better | Fast, good quality |
| `paraphrase-multilingual-MiniLM-L12-v2` | 470MB | ⚡ | Good | Multilingual |
```python
# Get current model info
info = similarity_engine.get_model_info()
print(f"Current model: {info['configured_model']}")

# Switch to a higher-quality model
similarity_engine.model_name = "all-mpnet-base-v2"
similarity_engine.model = None  # Force reload

# Or via environment variable
import os
os.environ['RSS_EMBEDDING_MODEL'] = "all-mpnet-base-v2"
```

Use Larger Models When:

- Search quality is critical
- You have time for slower processing
- You're doing offline batch processing
- You need the best possible results

Use Smaller Models When:

- Speed is important
- You're running on limited hardware
- You need real-time responses
- Default quality is sufficient

Use Specialized Models When:

- Q&A format: `multi-qa-mpnet-base-dot-v1`
- Multilingual content: `paraphrase-multilingual-MiniLM-L12-v2`
- Code search: `all-mpnet-base-v2`
The server automatically detects and extracts fields from multiple podcast formats.
- iTunes Podcast - Full `itunes:*` namespace support
- Google Play Podcasts - `googleplay:*` namespace
- Standard RSS 2.0 - All standard fields
- Atom Feeds - Summary, content, author
- Media RSS - Thumbnails, media content
Subtitle: `itunes:subtitle` → `subtitle` → `googleplay:description` (if < 200 chars)

Summary: `itunes:summary` → `content` → `summary` → `description` → `googleplay:description`

Author: `itunes:author` → `googleplay:author` → `author`

Image: `itunes:image` → `googleplay:image` → `media:thumbnail` → `image`
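Each fallback chain amounts to "first non-empty field wins". A sketch of that logic, where the flattened `itunes_*` key names are an assumption about the raw entry layout (feedparser exposes namespaced fields this way) rather than the server's confirmed internals:

```python
def extract_field(entry, fallback_chain):
    """Return the first non-empty value found along a priority chain of keys."""
    for key in fallback_chain:
        value = entry.get(key)
        if value:
            return value
    return None

# Hypothetical raw entry: no iTunes subtitle, but a plain summary is present
entry = {"itunes_subtitle": "", "summary": "Episode notes on AI safety"}

subtitle = extract_field(entry, ["itunes_subtitle", "subtitle", "googleplay_description"])
summary = extract_field(entry, ["itunes_summary", "content", "summary", "description"])
print(subtitle)  # None, since every key in the chain is missing or empty
print(summary)   # Episode notes on AI safety
```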
- title, subtitle, summary, description
- link, author, published, updated
- categories, episode, season
- media_url, media_type, media_duration, media_size
- image, explicit, guid
```python
# 1. Inspect feed schema
schema = rss_parser.inspect_feed_schema(feed_url)
print(f"Schemas: {schema['detected_schemas']}")
print(f"Best fields: {schema['recommended_search_fields']}")

# 2. Search subtitles (topic-focused)
if 'subtitle' in schema['available_fields']:
    results = similarity_engine.similarity_search(
        query="machine learning applications",
        entries=feed['entries'],
        fields=["subtitle"]
    )

# 3. Search summaries (deep content)
results = similarity_engine.similarity_search(
    query="discussion about quantum computing",
    entries=feed['entries'],
    fields=["summary", "description"]
)

# 4. Custom field search
results = similarity_engine.similarity_search(
    query="climate change policy",
    entries=feed['entries'],
    fields=schema['recommended_search_fields']
)
```

When running in HTTP mode, the server supports the JSON-RPC 2.0 protocol.
```bash
make serve-http
# Server runs on http://0.0.0.0:9100/mcp/
```

```bash
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
  http://0.0.0.0:9100/mcp/ | python3 -m json.tool
```

```bash
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "fetch_rss",
      "arguments": {
        "url": "https://feeds.npr.org/1001/rss.xml",
        "use_cache": false
      }
    }
  }' \
  http://0.0.0.0:9100/mcp/ | python3 -m json.tool
```

```bash
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "search_titles",
      "arguments": {
        "url": "https://feeds.npr.org/1001/rss.xml",
        "query": "climate",
        "case_sensitive": false,
        "regex": false
      }
    }
  }' \
  http://0.0.0.0:9100/mcp/ | python3 -m json.tool
```

```bash
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {
      "name": "hybrid_search",
      "arguments": {
        "url": "https://feeds.npr.org/1001/rss.xml",
        "query": "artificial intelligence ethics",
        "semantic_weight": 0.6,
        "bm25_weight": 0.4,
        "top_k": 5
      }
    }
  }' \
  http://0.0.0.0:9100/mcp/ | python3 -m json.tool
```

```bash
# Set the default embedding model
export RSS_EMBEDDING_MODEL="all-mpnet-base-v2"

# Set the cache directory (optional)
export RSS_CACHE_DIR="/path/to/cache"
```

Add to `~/.config/claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "rss-search": {
      "command": "python",
      "args": ["-m", "mcp_rss_search.server_fastmcp"],
      "env": {
        "RSS_EMBEDDING_MODEL": "all-MiniLM-L6-v2"
      }
    }
  }
}
```

The server supports the standard MCP protocol via stdio. Refer to your client's documentation for configuration.
```bash
# Run all tests with coverage
make test

# Run a specific test
pytest tests/test_server.py::TestRSSParser::test_fetch_feed_success -v

# Check the coverage report
pytest --cov=mcp_rss_search --cov-report=html
open htmlcov/index.html
```

```bash
# Format code
make format

# Lint code
make lint

# Type checking
mypy src/mcp_rss_search
```

```bash
# Create a virtual environment
make venv

# Install development dependencies
make dev-install

# Run the development server with auto-reload
make dev

# Run all quality checks
make test lint format
```

Solution: Make sure the virtual environment is activated and the package is installed:
```bash
. .venv/bin/activate
cd mcp-servers/python/mcp-rss-search
pip install -e .
```

Error: "sentence-transformers not installed"

Solution: Install the similarity dependencies:

```bash
pip install -e ".[similarity]"
# Or
make install-similarity
```

Reason: The model (~80MB) needs to download on first use.

Solution: Wait 30-60 seconds on first use. The model is cached after that.

Solution: Process fewer entries at once:

```python
# Process in batches
batch_size = 50
for i in range(0, len(entries), batch_size):
    batch = entries[i:i+batch_size]
    results = similarity_engine.similarity_search(query, batch)
```

Solution: Check whether port 9100 is already in use:

```bash
lsof -i :9100

# Kill any process using the port
kill <PID>
```

Solution: Reinstall the dev dependencies:

```bash
pip install -e ".[dev]"
pytest -v
```

- NPR News: https://feeds.npr.org/1001/rss.xml
- NY Times: https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml
- BBC News: http://feeds.bbci.co.uk/news/rss.xml
- Reuters: http://feeds.reuters.com/reuters/topNews
- The Guardian: https://www.theguardian.com/world/rss
- Hacker News: https://news.ycombinator.com/rss
- TechCrunch: https://techcrunch.com/feed/
- Ars Technica: http://feeds.arstechnica.com/arstechnica/index
- Wired: https://www.wired.com/feed/rss
- The Verge: https://www.theverge.com/rss/index.xml
- The Daily (NYT): https://feeds.simplecast.com/54nAGcIl
- Planet Money: https://feeds.npr.org/510289/podcast.xml
- Radiolab: http://feeds.wnyc.org/radiolab
- This American Life: http://feed.thisamericanlife.org/talpodcast
- Feeds are cached by URL
- Cache persists for the server lifetime
- Manual cache clearing with the `clear_cache` tool
- Configurable via the `use_cache` parameter
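The lifecycle above corresponds to a simple in-memory map keyed by feed URL. This is a minimal sketch of the behavior, not the server's actual cache implementation:

```python
class FeedCache:
    """Minimal in-memory feed cache keyed by URL, cleared on demand."""

    def __init__(self):
        self._store = {}

    def get(self, url):
        return self._store.get(url)  # None on a cache miss

    def put(self, url, feed):
        self._store[url] = feed

    def clear(self):
        self._store.clear()

cache = FeedCache()
cache.put("https://feeds.npr.org/1001/rss.xml", {"title": "NPR News"})
print(cache.get("https://feeds.npr.org/1001/rss.xml"))  # {'title': 'NPR News'}
cache.clear()
print(cache.get("https://feeds.npr.org/1001/rss.xml"))  # None
```

Passing `use_cache=False` to a tool corresponds to bypassing `get()` and fetching fresh, while still calling `put()` with the new result.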
- Async HTTP requests
- Efficient feedparser usage
- Minimal memory footprint
- Fast regex-based text processing
- Use caching: Set `use_cache=True` for frequently accessed feeds
- Filter early: Use specific tools instead of fetching everything
- Regex carefully: Simple string searches are faster than regex
- Clear cache: Use `clear_cache()` when feeds update frequently
- Batch processing: For large feeds, process entries in batches
Contributions welcome! Areas for enhancement:
- Persistent caching (Redis, SQLite)
- Feed autodiscovery from website URLs
- OPML import/export
- Multi-feed aggregation and deduplication
- Custom feed filters and transformations
- RSS feed creation/generation
- Webhook support for feed updates
- More embedding models support
- GraphQL API support
Apache-2.0
Built with:
- FastMCP - MCP protocol implementation
- feedparser - RSS/Atom parsing
- httpx - HTTP client
- sentence-transformers - Semantic embeddings
- rank-bm25 - BM25 ranking
- ContextForge - ContextForge ecosystem
Happy RSS parsing with AI-powered search! 🎉🧠🔍✨