Releases: acryldata/mcp-server-datahub
v0.5.3
Highlights — New SQL-like Filter Syntax
The search and get_lineage tools now accept a human-readable, SQL-like filter string instead of the previous nested JSON dict. This makes filters dramatically easier for LLM agents (and humans) to write correctly.
Since MCP tools are discovered dynamically by LLM agents at the start of every session, this is not a breaking change — agents will automatically pick up the new filter parameter and its syntax documentation.
Before (0.5.x):

    search(query="*", filters={"entity_type": ["DATASET"]})
    search(query="*", filters={"and": [{"platform": ["snowflake"]}, {"env": ["PROD"]}]})

After (0.5.3):

    search(query="*", filter="entity_type = dataset")
    search(query="*", filter="platform = snowflake AND env = PROD")

The new syntax supports simple equality, IN lists, boolean logic (AND, OR, NOT, parentheses), comparisons (>, >=, <, <=), and existence checks (IS NULL, IS NOT NULL).
Added
- `search_filter_parser`: Full SQL-like filter parser with tokenizer and recursive-descent parser. Compiles human-readable filter strings into DataHub SDK `Filter` objects. Includes comprehensive `FILTER_DOCS` injected into tool descriptions so LLM agents always have the syntax reference.
- Modular tool architecture: Tools are now organized into dedicated modules under `tools/` (`search.py`, `entities.py`, `lineage.py`, `dataset_queries.py`, `assertions.py`) instead of being defined inline in `mcp_server.py`.
- `graphql_helpers`: Extracted shared GraphQL execution logic, token budgeting, and response processing into a dedicated module.
- `tool_context`: New module for tool-level context management.
- `view_preference`: Configurable view preference system (`UseDefaultView`, `NoView`, `CustomView`) for controlling which DataHub view is applied during search.
- `tools/assertions.py`: New tool module for data quality assertion checks.
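To illustrate the tokenizer plus recursive-descent approach named above, here is a minimal sketch. It is not the project's `search_filter_parser`: it covers only equality, `IN`, `AND`, `OR`, `NOT`, and parentheses, and it returns plain nested tuples rather than DataHub SDK `Filter` objects. The names `tokenize`, `_Parser`, and `parse_filter` are hypothetical.

```python
import re

# Assumed token shapes: punctuation ( ) , = or a bare word such as "snowflake".
_TOKEN_RE = re.compile(r"\s*([(),=]|[A-Za-z0-9_.:\-]+)")


def tokenize(text):
    """Split a filter string into a flat list of tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = _TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class _Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def peek_is(self, keyword):
        token = self.peek()
        return token is not None and token.upper() == keyword

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token.upper() != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {token!r}")
        self.pos += 1
        return token

    # or_expr := and_expr ("OR" and_expr)*
    def or_expr(self):
        node = self.and_expr()
        while self.peek_is("OR"):
            self.take()
            node = ("or", node, self.and_expr())
        return node

    # and_expr := not_expr ("AND" not_expr)*
    def and_expr(self):
        node = self.not_expr()
        while self.peek_is("AND"):
            self.take()
            node = ("and", node, self.not_expr())
        return node

    # not_expr := "NOT" not_expr | primary
    def not_expr(self):
        if self.peek_is("NOT"):
            self.take()
            return ("not", self.not_expr())
        return self.primary()

    # primary := "(" or_expr ")" | field "=" value | field "IN" "(" values ")"
    def primary(self):
        if self.peek() == "(":
            self.take()
            node = self.or_expr()
            self.take(")")
            return node
        field = self.take()
        if self.peek_is("IN"):
            self.take()
            self.take("(")
            values = [self.take()]
            while self.peek() == ",":
                self.take()
                values.append(self.take())
            self.take(")")
            return ("in", field, values)
        self.take("=")
        return ("eq", field, self.take())


def parse_filter(text):
    """Parse a filter string into nested tuples; raise ValueError on bad input."""
    parser = _Parser(tokenize(text))
    node = parser.or_expr()
    if parser.peek() is not None:
        raise ValueError(f"unexpected trailing token {parser.peek()!r}")
    return node
```

With this sketch, `parse_filter("platform = snowflake AND env = PROD")` returns `("and", ("eq", "platform", "snowflake"), ("eq", "env", "PROD"))`. Giving each precedence level its own method makes `NOT` bind tighter than `AND`, and `AND` tighter than `OR`, without a precedence table.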
Changed
- `search` tool: The `filters` parameter (JSON dict) is replaced by `filter` (string). See the Highlights section above.
- `get_lineage` tool: Also uses the new string-based `filter` parameter for filtering lineage results.
- `mcp_server.py`: Significantly slimmed down. Tool implementations moved to dedicated modules, GraphQL helpers extracted, filter parsing extracted.
- Smoke check safety: `smoke_check.py` now refuses to run against non-localhost DataHub instances to prevent accidental mutation of production data.
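A localhost guard of the kind described for the smoke check could look roughly like the sketch below. The helper names and the exact set of hosts treated as local are assumptions; the real `smoke_check.py` may decide this differently.

```python
from urllib.parse import urlparse

# Hosts treated as local in this sketch (an assumption, not the script's list).
_LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_local_url(url):
    """Return True when the URL's host is one of the known local hosts."""
    host = urlparse(url).hostname
    return host is not None and host.lower() in _LOCAL_HOSTS


def ensure_local(url):
    """Raise before any mutation test can touch a remote instance."""
    if not is_local_url(url):
        raise RuntimeError(f"refusing to run mutation checks against {url}")
```

Checking the parsed hostname rather than doing a substring match avoids accepting URLs such as `https://localhost.example.com`.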
Removed
- `test_custom_filter_conversion.py`: Removed obsolete test for the old dict-based filter format, replaced by `search_filter_parser`.
Full Changelog: v0.5.2...v0.5.3
v0.5.2
Fixed
- HTTP transport ContextVar propagation: Fixed `LookupError` for `_mcp_dh_client` ContextVar when running with HTTP transport (`stateless_http=True`). Each HTTP request runs in a separate async context that doesn't inherit ContextVars from the main thread, causing `DocumentToolsMiddleware` and `VersionFilterMiddleware` to fail. Added `_DataHubClientMiddleware` that sets the ContextVar at the start of every MCP message.
- `create_app()` initialization safety: The `_app_initialized` flag is now set only after all middleware is successfully added, so a failed setup can be retried.
- `--debug` middleware ordering: `LoggingMiddleware` is now added before other middlewares so it wraps the full request/response lifecycle for maximum visibility.
Added
- `create_app()` factory function: Extracted server setup into a factory function so that `fastmcp dev` / `fastmcp run` work correctly (they import the module but never call `main()`).
- Multi-mode smoke testing: `smoke_check.py` now supports `--url` and `--stdio-cmd` options to test against running HTTP/SSE servers or stdio subprocesses, in addition to the default in-process mode.
- `test_all_modes.sh` orchestrator: Runs smoke checks across all 5 transport modes (in-process, HTTP, SSE, stdio, `fastmcp run`), with per-mode log capture to `scripts/logs/`.
- `SMOKE_CHECK.md`: Documentation with step-by-step reproduction instructions for all transport modes.
- Core tool validation: Smoke check now verifies that all 8 core read-only tools are present, catching silent regressions in tool registration or middleware filtering.
Full Changelog: v0.5.1...v0.5.2
v0.5.1
Fixed
- `list_schema_fields`: Fixed crash when a dataset has no schema metadata. Now gracefully returns an empty fields list.
- `save_document`: Errors (e.g., authorization failures) are now raised as exceptions instead of being silently swallowed. LLM agents now see the actual error message.
- `update_description`: Hidden from OSS instances where entity-level description updates are not supported. Available on Cloud only.
Added
- `scripts/smoke_check.py`: Comprehensive smoke check script that exercises all available MCP tools against a live DataHub instance. Discovers URNs dynamically, respects version filtering middleware, and tests mutation tools with add-then-remove pairs. Usage: `uv run python scripts/smoke_check.py --all`
Changed
- Version-aware tool filtering: `update_description` now requires Cloud (`@min_version(cloud="0.3.16")`); previously it was also allowed on OSS >= 1.4.0.
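One way a version gate like `@min_version(cloud="0.3.16")` could work is sketched below. Only the decorator name and its `cloud` keyword appear in the release notes; the `oss` keyword, the `_min_versions` attribute, and `is_tool_available` are hypothetical, as is the rule that a deployment type with no stated minimum hides the tool.

```python
def _version_tuple(version):
    """Turn "0.3.16" into (0, 3, 16) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def min_version(cloud=None, oss=None):
    """Attach minimum-version requirements to a tool function."""
    def decorator(func):
        func._min_versions = {"cloud": cloud, "oss": oss}
        return func
    return decorator


def is_tool_available(func, deployment, server_version):
    """deployment is "cloud" or "oss" in this sketch."""
    required = getattr(func, "_min_versions", None)
    if required is None:
        return True  # undecorated tools are always available
    minimum = required.get(deployment)
    if minimum is None:
        return False  # no minimum given for this deployment: tool is hidden
    return _version_tuple(server_version) >= _version_tuple(minimum)


@min_version(cloud="0.3.16")
def update_description():
    """Placeholder tool body for the example."""
```

Under these assumptions, `is_tool_available(update_description, "oss", "1.4.0")` is `False`, which matches the Cloud-only behavior described above.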
Full Changelog: v0.5.0...v0.5.1
v0.5.0
New Tools
Mutation Tools
New tools for modifying metadata in DataHub. Enabled via TOOLS_IS_MUTATION_ENABLED=true.
- `add_tags` / `remove_tags`: Add or remove tags from entities or schema fields. Supports bulk operations.
- `add_terms` / `remove_terms`: Add or remove glossary terms from entities or schema fields.
- `add_owners` / `remove_owners`: Add or remove ownership assignments. Supports different ownership types.
- `set_domains` / `remove_domains`: Assign or remove domain membership for entities.
- `update_description`: Update, append to, or remove descriptions for entities or schema fields.
- `add_structured_properties` / `remove_structured_properties`: Manage typed metadata fields on entities.
User Tools
- `get_me`: Retrieve information about the currently authenticated user. Enabled via `TOOLS_IS_USER_ENABLED=true`.
Document Tools
New tools for working with documents (knowledge articles, runbooks, FAQs) stored in DataHub. Automatically hidden when no documents exist in the catalog.
- `search_documents`: Search for documents using keyword search with filters.
- `grep_documents`: Search within document content using regex patterns.
- `save_document`: Save standalone documents to DataHub's knowledge base.
Enhancements
- Semantic search support: Enable AI-powered semantic search for documents via `SEMANTIC_SEARCH_ENABLED=true`.
- Document tools middleware: Automatically hides document tools when no documents exist, with a cached existence check (1-minute TTL).
- Upgraded FastMCP to 2.14.5 (from 2.10.5) with MCP SDK 1.26.0 compatibility.
- Relaxed pydantic pin to `>=2.0,<3` (was `<2.12`).
Breaking Changes
- Python 3.11+ is now required (previously 3.10+).
- `acryl-datahub >= 1.3.1.7` is now required.
- MCP Inspector: Use `fastmcp dev` instead of `mcp dev` for development.
Security
- Added `SECURITY.md` with vulnerability reporting guidelines.
- Bumped `authlib` (1.6.0 → 1.6.6), `urllib3` (2.4.0 → 2.6.3), `aiohttp` (3.12.7 → 3.13.3), `python-multipart` (0.0.20 → 0.0.22).
New Environment Variables
| Variable | Default | Description |
|---|---|---|
| `TOOLS_IS_MUTATION_ENABLED` | `false` | Enable mutation tools |
| `TOOLS_IS_USER_ENABLED` | `false` | Enable user tools |
| `DATAHUB_MCP_DOCUMENT_TOOLS_DISABLED` | `false` | Completely disable document tools |
| `SAVE_DOCUMENT_TOOL_ENABLED` | `true` | Enable/disable `save_document` |
| `SAVE_DOCUMENT_PARENT_TITLE` | `Shared` | Parent folder title for saved documents |
| `SAVE_DOCUMENT_ORGANIZE_BY_USER` | `false` | Organize saved documents by user |
| `SAVE_DOCUMENT_RESTRICT_UPDATES` | `true` | Only allow updating documents in shared folder |
| `SEMANTIC_SEARCH_ENABLED` | `false` | Enable semantic (AI-powered) search |
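As a rough illustration of how the boolean variables in the table could be read, the sketch below applies the listed defaults when a variable is unset. The helper names and the set of strings accepted as true are assumptions; the server's own parsing may differ.

```python
import os

# Defaults copied from the table above (boolean variables only).
DEFAULTS = {
    "TOOLS_IS_MUTATION_ENABLED": False,
    "TOOLS_IS_USER_ENABLED": False,
    "DATAHUB_MCP_DOCUMENT_TOOLS_DISABLED": False,
    "SAVE_DOCUMENT_TOOL_ENABLED": True,
    "SAVE_DOCUMENT_ORGANIZE_BY_USER": False,
    "SAVE_DOCUMENT_RESTRICT_UPDATES": True,
    "SEMANTIC_SEARCH_ENABLED": False,
}

# Strings treated as true in this sketch (an assumption).
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default, environ=None):
    """Read one boolean flag, falling back to the default when unset or empty."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in _TRUTHY


def load_flags(environ=None):
    """Resolve every flag in DEFAULTS against the given environment mapping."""
    return {name: env_flag(name, default, environ) for name, default in DEFAULTS.items()}
```

Passing a plain dict as `environ` keeps the function easy to test without touching the real process environment.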
new mcp tools and other improvements
Response Token Budget Management
- New `TokenCountEstimator` class for fast token counting using character-based heuristics
- Automatic result truncation via `_select_results_within_budget()` to prevent context window issues
- Configurable token limits:
  - `TOOL_RESPONSE_TOKEN_LIMIT` environment variable (default: 80,000 tokens)
  - `ENTITY_SCHEMA_TOKEN_BUDGET` environment variable (default: 16,000 tokens per entity)
- 90% safety buffer to account for token estimation inaccuracies
- Ensures at least one result is always returned
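The budgeting behavior listed above can be sketched with a character-based estimate and a generator. This is an illustration only: the 4-characters-per-token ratio, the use of `json.dumps` to size a result, and the function names are assumptions, not the project's `TokenCountEstimator` or `_select_results_within_budget()`.

```python
import json

CHARS_PER_TOKEN = 4  # assumed heuristic ratio, not taken from the release notes


def estimate_tokens(text):
    """Cheap token estimate based on character count alone."""
    return len(text) // CHARS_PER_TOKEN + 1


def select_results_within_budget(results, token_limit=80_000, safety=0.9):
    """Yield results until the estimated budget is used up.

    The first result is always yielded, even if it alone exceeds the budget.
    """
    budget = int(token_limit * safety)  # keep a buffer for estimation error
    used = 0
    for index, result in enumerate(results):
        cost = estimate_tokens(json.dumps(result))
        if index > 0 and used + cost > budget:
            break
        used += cost
        yield result
```

Because the function is a generator, it stops consuming `results` as soon as the budget is exhausted, so a lazily produced result stream is never fully materialized.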
Enhanced Search Capabilities
- Enhanced Keyword Search:
  - Supports pagination with `start` parameter
  - Added `viewUrn` for view-based filtering
  - Added `sortInput` for custom sorting
Query Entity Support
- Native QueryEntity type support (SQL queries as first-class entities)
- New `query_entity.gql` GraphQL query
- Optimized entity retrieval with specialized query for QueryEntity types
- Includes query statement, subjects (datasets/fields), and platform information
GraphQL Compatibility
- Adaptive field detection for newer GMS versions
- Caching mechanism for GMS version detection
- Graceful fallback when newer fields aren't available
- Support for `#[CLOUD]` and `#[NEWER_GMS]` conditional field markers
- `DISABLE_NEWER_GMS_FIELD_DETECTION` environment variable override
Schema Field Optimization
- Smart field prioritization to stay within token budgets:
  - Primary key fields (`isPartOfKey=true`)
  - Partitioning key fields (`isPartitioningKey=true`)
  - Fields with descriptions
  - Fields with tags or glossary terms
  - Alphabetically by field path
- Generator-based approach for memory efficiency
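The priority order above can be expressed as a sort key. The sketch below treats each schema field as a plain dict; `isPartOfKey` and `isPartitioningKey` come from the list above, while the `description`, `tags`, `glossaryTerms`, and `fieldPath` keys and the function names are assumptions about the field shape. Note that `sorted` still reads the whole input, so only the output side is lazy here.

```python
def _priority(field):
    """Lower numbers are kept first when the token budget is tight."""
    if field.get("isPartOfKey"):
        return 0
    if field.get("isPartitioningKey"):
        return 1
    if field.get("description"):
        return 2
    if field.get("tags") or field.get("glossaryTerms"):
        return 3
    return 4


def prioritized_fields(fields):
    """Yield fields most important first; ties break alphabetically by path."""
    ordered = sorted(fields, key=lambda f: (_priority(f), f.get("fieldPath", "")))
    yield from ordered
```

A caller can consume this generator until a per-entity budget is reached, so the least important fields are the ones that get dropped.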
Error Handling & Security
- Enhanced error logging with full stack traces in `async_background` wrapper
- Logs function name, args, and kwargs on failures
- ReDoS protection in HTML sanitization with bounded regex patterns
- Query truncation function (configurable via `QUERY_LENGTH_HARD_LIMIT`, default: 5,000 chars)
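A query truncation helper along the lines described could look like the sketch below. The function name and the truncation marker are hypothetical; only the `QUERY_LENGTH_HARD_LIMIT` variable and its 5,000-character default come from the notes above.

```python
import os

DEFAULT_QUERY_LIMIT = 5_000  # default stated in the release notes


def truncate_query(query, limit=None):
    """Keep the first `limit` characters of a long query and append a marker."""
    if limit is None:
        limit = int(os.environ.get("QUERY_LENGTH_HARD_LIMIT", DEFAULT_QUERY_LIMIT))
    if len(query) <= limit:
        return query
    # The marker text is an assumption made for this example.
    return query[:limit] + "... [truncated]"
```

Queries at or under the limit are returned unchanged, so short statements never carry the marker.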
Default Views Support
- Automatic default view application for all search operations
- Fetches organization's default global view from DataHub
- 5-minute caching (configurable via `VIEW_CACHE_TTL_SECONDS`)
- Can be disabled via `DATAHUB_MCP_DISABLE_DEFAULT_VIEW` environment variable
- Ensures search results respect organization's data governance policies
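The time-based caching mentioned above can be sketched with the standard library. `TTLCachedValue` and its `fetch` callback are hypothetical; only the `VIEW_CACHE_TTL_SECONDS` variable and the 5-minute default come from the list above, and the real implementation may rely on a caching library instead.

```python
import os
import time


class TTLCachedValue:
    """Cache the result of fetch() and refresh it after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=None, clock=time.monotonic):
        if ttl_seconds is None:
            # 300 s mirrors the 5-minute default described above.
            ttl_seconds = float(os.environ.get("VIEW_CACHE_TTL_SECONDS", "300"))
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Injecting the clock makes expiry testable without sleeping, and using `time.monotonic` by default keeps the cache unaffected by wall-clock adjustments.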
Dependencies
- Added `cachetools>=5.0.0`: For GMS field detection caching
- Added `types-cachetools` (dev): Type stubs for mypy
Performance
- Memory efficiency: Generator-based result selection avoids loading all results into memory
- Caching: GMS version detection cached per graph instance
- Fast token estimation: Character-based heuristic (no tokenizer overhead)
- Smart truncation: Truncates less important schema fields first
v0.3.10
fix: workaround for https://github.com/jlowin/fastmcp/issues/1377 (#50)
v0.3.9
What's Changed
- feat: support get_dataset_queries, get_lineage for specific column by @mayurinehate in #37
- feat: add view definition in get_entity by @mayurinehate in #38
New Contributors
- @mayurinehate made their first contribution in #37
Full Changelog: v0.3.8...v0.3.9