A Model Context Protocol (MCP) server that provides SQL query access to large-scale geospatial datasets stored in S3. Built with DuckDB for high-performance analytics on H3-indexed environmental, biodiversity, and geospatial data.
Add the hosted MCP endpoint to your LLM client. For VS Code, create a `.vscode/mcp.json` like the one in this repo:

```json
{
  "servers": {
    "duckdb-geo": {
      "url": "https://duckdb-mcp.nrp-nautilus.io/mcp"
    }
  }
}
```

Ask your chat client a question and it should answer by querying the database in SQL.
Add to your Claude Desktop configuration file:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux: `~/.config/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "duckdb-geo": {
      "url": "https://duckdb-mcp.nrp-nautilus.io/mcp"
    }
  }
}
```

After adding the configuration, restart Claude Desktop.
- Zero-Configuration SQL Access: Query petabytes of geospatial data without database setup
- H3 Geospatial Indexing: Efficient spatial operations using Uber's H3 hexagonal grid system
- Isolated Execution: Each query runs in a fresh DuckDB instance for security
- Stateless HTTP Mode: Fully horizontally scalable for cloud deployment
- Rich Dataset Catalog: Access to 10+ curated environmental and biodiversity datasets
- MCP Resources & Prompts: Browse datasets and get query guidance through MCP protocol
The example configuration provides access to the following datasets via S3:
- GLWD - Global Lakes and Wetlands Database
- Vulnerable Carbon - Conservation International carbon vulnerability data
- NCP - Nature Contributions to People biodiversity scores
- Countries & Regions - Global administrative boundaries (Overture Maps)
- WDPA - World Database on Protected Areas
- Ramsar Sites - Wetlands of International Importance
- HydroBASINS - Global watershed boundaries (levels 3-6)
- iNaturalist - Species occurrence range maps
- Corruption Index 2024 - Transparency International data
See datasets.md for detailed schema information. This file is consumed directly by the LLM; additional datasets can be added by describing them there.
You can also run the server locally. Install dependencies and run directly:

```shell
pip install -r requirements.txt
python server.py
```

You can now connect to the server over localhost (note `http`, not `https`, here), e.g. in VS Code:
```json
{
  "servers": {
    "duckdb-geo": {
      "url": "http://localhost:8000/mcp"
    }
  }
}
```

You can adjust the datasets and the instructions given to the LLM in the corresponding `.md` files (e.g. `datasets.md`). To run the server locally you will need to adjust `query-setup.md`, as it uses an endpoint and thread count that only work from inside our k8s cluster.
Running locally means your local CPU+network resources will be used for the computation, which will likely be much slower than the hosted k8s endpoint.
For heavier workloads, prefer the fully-hosted version described above.
- server.py - Main MCP server with FastMCP framework
- stac.py - STAC catalog integration for dynamic dataset discovery
- datasets.md - Dataset catalog and schema documentation
- query-setup.md - Required DuckDB configuration for all queries
- query-optimization.md - Performance optimization guidelines
- h3-guide.md - H3 geospatial operations reference
- Prompt Engineering: Injects strict SQL rules into tool descriptions to guide LLM behavior
- Isolation Engine: Each query gets a fresh DuckDB connection for security
- Context Injection: Documentation is embedded into MCP resources and tool descriptions
- Partition Pruning: Uses H3 resolution columns (`h0`) for efficient S3 reads
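The pruning idea can be sketched in plain Python. Assume a hypothetical Hive-style layout with one partition directory per `h0` cell (H3 has 122 resolution-0 cells); a query that constrains `h0` only has to touch a handful of files:

```python
# Hypothetical layout: one Parquet partition per h0 cell (122 res-0 cells in H3).
all_parts = [f"s3://bucket/dataset/h0={h0}/data.parquet" for h0 in range(122)]

def prune(parts, wanted_h0):
    """Keep only partitions whose h0 value appears in the query's filter."""
    keep = []
    for p in parts:
        h0 = int(p.split("h0=")[1].split("/")[0])  # parse h0 out of the path
        if h0 in wanted_h0:
            keep.append(p)
    return keep

pruned = prune(all_parts, {3, 17, 42})
print(len(pruned))  # → 3  (3 of 122 files read, the rest skipped)
```

This is what DuckDB does automatically when a filter or join predicate mentions the partition column, which is why including `h0` in joins pays off.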
Deploy to Kubernetes using the provided manifests:

```shell
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/ingress.yaml
```

The deployment:
- Runs 2 replicas for high availability
- Allocates 16GB memory per pod for large queries
- Uses `uv` for fast dependency installation
- Includes readiness probes for safe rollouts
- `query(sql_query)` - Execute DuckDB SQL with embedded optimization rules
NOTE: Some MCP clients, such as VS Code's, do not recognize "resources" and "prompts"; newer clients (Claude Code, Continue.dev, Antigravity) do.
- `catalog://list` - List all available datasets
- `catalog://{name}` - Get detailed schema for a specific dataset
- `geospatial-analyst` - Load complete context for the geospatial analysis persona
- Always include h0 in joins - Enables partition pruning for 5-20x speedup
- Use APPROX_COUNT_DISTINCT(h8) - Fast area calculations with H3 hexagons
- Filter small tables first - Create CTEs to reduce join cardinality
- Set THREADS=100 - Parallel S3 reads are I/O bound, not CPU bound
- Enable object cache - Reduces redundant S3 requests
See query-optimization.md for detailed guidance.
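The "filter small tables first" rule is plain join-cardinality reduction. A toy Python sketch (invented row counts, no real datasets) shows the effect:

```python
# Toy tables: a small dimension table (one row per h0 cell) and a large fact table.
small = [{"h0": i, "protected": i % 10 == 0} for i in range(122)]
big = [{"h0": i % 122, "value": i} for i in range(10_000)]

# CTE-style: restrict the small table FIRST, then join the big table against it.
wanted_h0 = {row["h0"] for row in small if row["protected"]}
joined = [row for row in big if row["h0"] in wanted_h0]

# Only rows matching the 13 surviving h0 cells take part in the join,
# instead of all 10,000 x 122 candidate pairs.
print(len(wanted_h0), len(joined))
```

In SQL this corresponds to putting the selective filter in a CTE and joining the large table to that CTE, rather than filtering after the join.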
All datasets use Uber's H3 hexagonal grid system for spatial indexing:
- Resolution 8 (h8): ~0.737 km² per hex
- Resolution 0-4 (h0-h4): Coarser resolutions for global analysis
- Use `h3_cell_to_parent()` to join datasets at different resolutions
- Use `APPROX_COUNT_DISTINCT(h8) * 0.737327598` to calculate areas in km²
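The area rule above can be checked in a few lines of Python (the cell IDs below are arbitrary example strings, not tied to any dataset):

```python
# Average hexagon area at H3 resolution 8, as used in the rule above.
AVG_H8_AREA_KM2 = 0.737327598

def approx_area_km2(h8_cells):
    """Mirror APPROX_COUNT_DISTINCT(h8) * 0.737327598: distinct cells x avg area."""
    return len(set(h8_cells)) * AVG_H8_AREA_KM2

# Three observations, but only two distinct hexes -> ~1.47 km².
cells = ["8828308281fffff", "8828308283fffff", "8828308281fffff"]
print(round(approx_area_km2(cells), 2))  # → 1.47
```

Counting distinct cells rather than rows is what makes this robust to duplicate observations, and `APPROX_COUNT_DISTINCT` keeps it fast at scale.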
```shell
# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_server.py

# Run with coverage
pytest --cov=. tests/
```

Environment variables:
- `THREADS` - DuckDB thread count (default: 100 for S3 workloads)
- `PORT` - HTTP server port (default: 8000)
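A minimal sketch of how these variables might be read; the `env_int` helper is illustrative, not the server's actual code:

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

# Defaults match the documented values; override with THREADS=... PORT=... env vars.
threads = env_int("THREADS", 100)  # DuckDB thread count
port = env_int("PORT", 8000)       # HTTP server port
```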
Required settings are documented in query-setup.md and automatically injected into query tool descriptions.
- Stateless Design: No persistent database or user data
- Read-Only Access: Server only reads from public S3 buckets
- Query Isolation: Each request gets a fresh DuckDB instance
- DNS Rebinding Protection: Disabled for MCP HTTP mode
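The query-isolation pattern is simple to sketch. Here stdlib `sqlite3` stands in for DuckDB so the example runs anywhere; `duckdb.connect()` is used the same way:

```python
import sqlite3  # stand-in for duckdb so the sketch runs without extra packages

def run_isolated(sql: str):
    """Execute one query on a fresh in-memory connection, then discard it."""
    con = sqlite3.connect(":memory:")
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()  # nothing survives between requests

# State created by one request is invisible to the next:
run_isolated("CREATE TABLE t (x INTEGER)")  # table lives only inside this call
print(run_isolated("SELECT 1 + 1"))         # → [(2,)]
```

Because every request gets its own throwaway connection, a malicious or buggy query cannot leave temp tables, settings, or attached databases behind for the next caller.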
MIT License - See repository for details
Contributions welcome! Key areas:
- Additional dataset integrations
- Query optimization patterns
- STAC catalog enhancements
- Documentation improvements
- Model Context Protocol
- DuckDB Documentation
- H3 Geospatial Indexing
- FastMCP Framework
- STAC Specification
For issues and questions:
- GitHub Issues: boettiger-lab/mcp-data-server
- Dataset questions: See datasets.md for data sources
