Conversation
cbff0dc to
bc9711f
Compare
src/mcp_server_datahub/mcp_server.py
Outdated
| if schema_history := _get_schema_history(client, urn): | ||
| result["schemaHistory"] = { | ||
| "latestVersion": schema_history.latest_version.semantic_version, | ||
| "versions": sorted([v.semantic_version for v in schema_history.versions]), | ||
| } |
There was a problem hiding this comment.
Example
"schemaHistory": {
"latestVersion": "0.1.0",
"versions": [ "0.0.0", "0.1.0"]
}
src/mcp_server_datahub/mcp_server.py
Outdated
| variables = {"urn": dataset_urn, "versionStamp": target_version_stamp} | ||
| resp = _execute_graphql( | ||
| client._graph, | ||
| query=entity_details_fragment_gql, | ||
| variables=variables, | ||
| operation_name="getVersionedDataset", | ||
| ) | ||
| return resp.get("versionedDataset", {}) |
There was a problem hiding this comment.
Example
{
"schema": {
"fields": [
{
"fieldPath": "[version=2.0].[type=long].id",
"jsonPath": null,
"nullable": true,
"description": null,
"type": "NUMBER",
"nativeDataType": "BIGINT",
"recursive": false,
"isPartOfKey": false,
"isPartitioningKey": false,
"__typename": "SchemaField"
},
...(repeated every fields)
],
"lastObserved": 1749608884835,
"__typename": "Schema"
},
"editableSchemaMetadata": null,
"__typename": "VersionedDataset"
}
src/mcp_server_datahub/mcp_server.py
Outdated
| semantic_version: str | ||
| version_stamp: str |
There was a problem hiding this comment.
Example
{
"semanticVersion": "0.0.0",
"versionStamp": "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1",
}
src/mcp_server_datahub/mcp_server.py
Outdated
|
|
||
|
|
||
| @mcp.tool(description="Get schema from a dataset by its URN and version.") | ||
| @lru_cache |
There was a problem hiding this comment.
It's seems better to use cache since versioned_dataset is immutable.
There was a problem hiding this comment.
I'm not sure how frequently we would hit this cache. I don't however mind keeping this around, if we set a max size of few tens.
There was a problem hiding this comment.
I’m not sure how effective it will be either. But it would be good to have it since it can reduce at least some traffic when querying the same version of the schema.
I’ll set max_size to around 20.
There was a problem hiding this comment.
I’ll set max_size to around 20.
Changed
src/mcp_server_datahub/mcp_server.py
Outdated
| versions: list[SemanticVersionStruct] | ||
|
|
||
|
|
||
| def _get_schema_version_list( |
There was a problem hiding this comment.
note: getVersionedDataset retrieves each version’s schema correctly whereas getSchemaBlame is not which I tried to use at first.
bc9711f to
00347a9
Compare
Additional Test: Fixing outdated query with LLMPurposeCheck if LLM can fix outdated(column name changed) query with datahub MCP. note: Pls let me know if you need to test on different MCP hosts or models. PromptClaude Desktop, Sonet 4 |
|
@hsheth2 Could you please review again? 🙏 FYI:
|
src/mcp_server_datahub/mcp_server.py
Outdated
| return cls( | ||
| semantic_version=data["semanticVersion"], | ||
| version_stamp=data["versionStamp"], | ||
| ) |
There was a problem hiding this comment.
You may as well set alias for fields and use SemanticVersionStruct.model_validate with dict form instead of separate method.
src/mcp_server_datahub/mcp_server.py
Outdated
|
|
||
| _inject_urls_for_urns(client._graph, result, [""]) | ||
|
|
||
| if schema_version_list := _get_schema_version_list(client, urn): |
There was a problem hiding this comment.
I'm somewhat concerned about performance hit due to additional call to get schema versions every time for a dataset entity. I wonder if this needs its own separate tool, for performance reasons. cc: @hsheth2
Also we should skip this call for non-dataset entities.
There was a problem hiding this comment.
In the previous PR, @hsheth2 already mentioned about this.
my main worry here is that every new tool consumes additional tokens on every request. The more tools we have, the more likely it is that the LLM gets confused / doesn't call our other tools when it should. So I'd like to think about what we can do to reduce the number of tools while keeping our responses simple.
There was a problem hiding this comment.
@eagle-25 I would like to run some tests about how addition of this get_schema_version_list affects overall tool timings of get_entity for dataset entity. I might get to this next week. In the meantime, if you can get some numbers or have any observations, feel free to share.
There was a problem hiding this comment.
Also we should skip this call for non-dataset entities.
Changed
src/mcp_server_datahub/mcp_server.py
Outdated
|
|
||
|
|
||
| @mcp.tool(description="Get schema from a dataset by its URN and version.") | ||
| @lru_cache |
There was a problem hiding this comment.
I'm not sure how frequently we would hit this cache. I don't however mind keeping this around, if we set a max size of few tens.
|
@mayurinehate Added comments to your feedback. Could you check please? I will modify the code after resolve this conversation |
70239fc to
728948b
Compare
- Add a tool to retrieve the schema of a dataset - Modify get_entity so that when querying a dataset, it also returns the schema version
728948b to
0b61fa2
Compare
|
@mayurinehate Could you review these changes? Applied the following improvement feedbacks.
I also refactored the code into the DatasetSchemaAPI class to improve cohesion. |

Changes
get_entity().get_versioned_datasettool for retrieving schema by version.Motivation
Tests
I defined a test scenario and tried the available MCP hosts and models. Since I didn’t select them based on specific criteria, please suggest any additional tests.
Settings
DataHub: Self hosted (v1.1.0)
Test Dataset in DataHub
emailis renamed toemail_addressat v0.1.0)Test Scenario
0.0.0and0.1.0onsample.userstable.emailcolumn is renamed toemail_addresson0.1.0Prompt(common)
note: To also test that the version list is retrieved correctly, the prompt uses
latestandpreviousinstead of specifying concrete versions.Test Result Summary
Claude Desktop
Claude Sonnet4
Cursor
Claude Sonnet 4
gemini-2.5-pro
GPT-4.1
CLINE (VS Code)
GPT-4.1
note: This PR is recreated from #10