feat: add schema tools by eagle-25 · Pull Request #14 · acryldata/mcp-server-datahub

eagle-25 · 2025-06-12T02:52:22Z

Changes

Add a schema version listing (latest_version, versions) to the response of get_entity().
Add a get_versioned_dataset tool that retrieves a dataset's schema by version.

Motivation

Retrieving the schema by version lets the LLM detect changed columns and helps fix outdated queries.
This should be more productive than checking schema changes manually.

Tests

I defined a test scenario and tried the available MCP hosts and models. Since I didn’t select them based on specific criteria, please suggest any additional tests.

Settings

DataHub: Self hosted (v1.1.0)
Test Dataset in DataHub
- Type: Athena Table
- Name: sample.users
- Schema history (email is renamed to email_address at v0.1.0):

version	col1	col2	col3	col4
0.0.0	id	name	email	created_at
0.1.0	id	name	email_address	created_at

Test Scenario
- There are two schema versions, 0.0.0 and 0.1.0, on the sample.users table.
- The email column is renamed to email_address in 0.1.0.
- Confirm that the LLM can detect the differences between the two schemas.

Prompt (common)

=== Instructions ===
You are a DataHub AI agent. Your job is to answer user questions about DataHub metadata by calling the datahub MCP (Model Context Protocol) methods.

=== Input ===
Can you tell me the schema differences between the latest version and the previous version of the Athena named sample.users?

note: To also verify that the version list is retrieved correctly, the prompt says "latest" and "previous" instead of naming concrete versions.

Test Result Summary

MCP Host	Model	Worked as Expected	Notes
Claude Desktop	Claude Sonnet 4	Y	-
Cursor	Claude Sonnet 4	Y	'search' tool errored a few times, then recovered and succeeded.
Cursor	gemini-2.5-pro	N	Failed: argument not supported
Cursor	GPT-4.1	Y	'search' tool errored a few times, then recovered and succeeded.
Cline	GPT-4.1	Y	-

Claude Desktop

Claude Sonnet 4

Cursor

Claude Sonnet 4

gemini-2.5-pro

Failed

GPT-4.1

CLINE (VS Code)

GPT-4.1

note: This PR is recreated from #10

eagle-25 · 2025-06-12T02:55:50Z

src/mcp_server_datahub/mcp_server.py

+    if schema_history := _get_schema_history(client, urn):
+        result["schemaHistory"] = {
+            "latestVersion": schema_history.latest_version.semantic_version,
+            "versions": sorted([v.semantic_version for v in schema_history.versions]),
+        }
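
One thing to watch in the snippet above: `sorted()` on version strings compares them lexicographically, so "0.10.0" would sort before "0.2.0". A minimal sketch of a numeric ordering, assuming every version is a dotted string of integers like those in the example. The helper names `version_key` and `sort_versions` are hypothetical and not part of this PR.

```python
def version_key(semantic_version: str) -> tuple[int, ...]:
    """Turn "0.10.0" into (0, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in semantic_version.split("."))


def sort_versions(versions: list[str]) -> list[str]:
    """Return the versions in ascending numeric order."""
    return sorted(versions, key=version_key)


versions = ["0.10.0", "0.2.0", "0.0.0"]
print(sorted(versions))         # plain string ordering
print(sort_versions(versions))  # numeric ordering
```

With only 0.0.0 and 0.1.0 in the test dataset the two orderings agree, so this only matters once a version component reaches two digits.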


Example

"schemaHistory": { "latestVersion": "0.1.0", "versions": [ "0.0.0", "0.1.0"] }

eagle-25 · 2025-06-12T02:57:04Z

src/mcp_server_datahub/mcp_server.py

+    variables = {"urn": dataset_urn, "versionStamp": target_version_stamp}
+    resp = _execute_graphql(
+        client._graph,
+        query=entity_details_fragment_gql,
+        variables=variables,
+        operation_name="getVersionedDataset",
+    )
+    return resp.get("versionedDataset", {})


Example

{ "schema": { "fields": [ { "fieldPath": "[version=2.0].[type=long].id", "jsonPath": null, "nullable": true, "description": null, "type": "NUMBER", "nativeDataType": "BIGINT", "recursive": false, "isPartOfKey": false, "isPartitioningKey": false, "__typename": "SchemaField" }, ...(repeated for every field) ], "lastObserved": 1749608884835, "__typename": "Schema" }, "editableSchemaMetadata": null, "__typename": "VersionedDataset" }
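
To illustrate how two responses of this shape could be compared, here is a rough sketch that strips the bracketed annotations from each fieldPath and reports which columns disappeared or appeared. The helpers `simple_column_name` and `diff_columns` are hypothetical, and the `[type=string]` annotations in the sample data are assumptions; only the `id` field path is taken from the example above.

```python
import re

# Matches bracketed annotations such as "[version=2.0]" or "[type=long]".
_ANNOTATION = re.compile(r"\[[^\]]*\]")


def simple_column_name(field_path: str) -> str:
    """Strip bracketed annotations from a fieldPath, keeping the plain names."""
    parts = [p for p in _ANNOTATION.sub("", field_path).split(".") if p]
    return ".".join(parts)


def diff_columns(old_schema: dict, new_schema: dict) -> dict:
    """Report columns removed from and added to a versioned schema response."""
    def names(resp: dict) -> set:
        fields = (resp.get("schema") or {}).get("fields") or []
        return {simple_column_name(f["fieldPath"]) for f in fields}

    old, new = names(old_schema), names(new_schema)
    return {"removed": sorted(old - new), "added": sorted(new - old)}


def fake_response(paths: list) -> dict:
    """Build a trimmed-down response containing only the fieldPath values."""
    return {"schema": {"fields": [{"fieldPath": p} for p in paths]}}


v0 = fake_response(["[version=2.0].[type=long].id", "[version=2.0].[type=string].email"])
v1 = fake_response(["[version=2.0].[type=long].id", "[version=2.0].[type=string].email_address"])
print(diff_columns(v0, v1))
```

A set difference cannot tell a rename from a drop plus an add; that interpretation is left to the LLM, which is the behavior the test scenario checks.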

eagle-25 · 2025-06-12T03:00:01Z

src/mcp_server_datahub/mcp_server.py

+    semantic_version: str
+    version_stamp: str


Example

{ "semanticVersion": "0.0.0", "versionStamp": "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1" }
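
For readers unfamiliar with the versionStamp string, the sketch below splits it into per-aspect version numbers. The `aspect:version;aspect:version` layout is inferred from the single example above and is an assumption, as is the helper name `parse_version_stamp`.

```python
def parse_version_stamp(version_stamp: str) -> dict[str, int]:
    """Split "aspect:version;aspect:version" into {aspect: version}."""
    aspects = {}
    for entry in version_stamp.split(";"):
        if not entry:
            continue  # tolerate a trailing ";"
        name, _, version = entry.rpartition(":")
        aspects[name] = int(version)
    return aspects


stamp = "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1"
print(parse_version_stamp(stamp))
```

In this example only schemaMetadata has a non-zero version, which is consistent with the schema being the aspect that changed.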

eagle-25 · 2025-06-12T03:03:42Z

src/mcp_server_datahub/mcp_server.py



+@mcp.tool(description="Get schema from a dataset by its URN and version.")
+@lru_cache


It seems better to use a cache, since versioned_dataset is immutable.

I'm not sure how frequently we would hit this cache. However, I don't mind keeping it around if we set a max size of a few tens.

I’m not sure how effective it will be either. But it would be good to have it since it can reduce at least some traffic when querying the same version of the schema.

I’ll set maxsize to around 20.

Changed
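
A small sketch of what a bounded cache looks like with `functools.lru_cache`, whose size parameter is spelled `maxsize`. The function below is a stand-in that does not call DataHub, and the `PROD` environment in the URN is an assumption for illustration.

```python
from functools import lru_cache

calls = 0  # counts how often the stand-in backend is actually queried


@lru_cache(maxsize=20)
def get_versioned_schema(dataset_urn: str, semantic_version: str) -> tuple[str, ...]:
    """Stand-in for the real fetch; returns an immutable tuple of column names."""
    global calls
    calls += 1
    third = "email_address" if semantic_version == "0.1.0" else "email"
    return ("id", "name", third, "created_at")


urn = "urn:li:dataset:(urn:li:dataPlatform:athena,sample.users,PROD)"
get_versioned_schema(urn, "0.1.0")
get_versioned_schema(urn, "0.1.0")  # second call is served from the cache
print(calls, get_versioned_schema.cache_info())
```

Two caveats apply: all arguments must be hashable, and the cache hands back the same object on every hit, so a cached dict that a caller mutates would affect later callers.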

eagle-25 · 2025-06-12T09:47:17Z

src/mcp_server_datahub/mcp_server.py

+    versions: list[SemanticVersionStruct]
+
+
+def _get_schema_version_list(


note: getVersionedDataset retrieves each version’s schema correctly, whereas getSchemaBlame, which I tried first, does not.

eagle-25 · 2025-06-12T10:03:01Z

Additional Test: Fixing outdated query with LLM

Purpose

Check whether the LLM can fix an outdated query (one whose column name has changed) using the DataHub MCP server.

note: Please let me know if you would like this tested on different MCP hosts or models.

Prompt

=== Instructions ===
You are a DataHub AI agent. Your job is to answer user questions about DataHub metadata by calling the datahub MCP (Model Context Protocol) methods.

=== Input ===
The following query was written for schema version 0.0.0 of the sample.users table:

SELECT 
    id, 
    name, 
    email, 
    created_at 
FROM 
    sample.users
WHERE
    id = 123;

After the table schema was updated, an error occurs when executing the query.

Could you modify the query to match the latest schema?

Claude Desktop, Sonnet 4

eagle-25 · 2025-06-12T10:12:34Z

@hsheth2 Could you please review again? 🙏
I’ve added tests using various MCP hosts and models, along with some code changes.

FYI:

I’ve recreated this PR since it differs noticeably from the last PR.
The notable change is that I replaced the getSchemaBlame API with getVersionedDataset to fetch accurate versioned schemas.

src/mcp_server_datahub/gql/entity_details.gql

mayurinehate · 2025-07-11T15:49:18Z

src/mcp_server_datahub/mcp_server.py

+        return cls(
+            semantic_version=data["semanticVersion"],
+            version_stamp=data["versionStamp"],
+        )


You could instead set aliases on the fields and call SemanticVersionStruct.model_validate on the dict, rather than keeping a separate method.
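
A sketch of what this suggestion might look like, assuming Pydantic v2 (which `model_validate` implies). The field names come from the diff above; the `populate_by_name` setting is an added assumption so that the Python names keep working too.

```python
from pydantic import BaseModel, ConfigDict, Field


class SemanticVersionStruct(BaseModel):
    # populate_by_name lets callers use either the alias or the Python name.
    model_config = ConfigDict(populate_by_name=True)

    semantic_version: str = Field(alias="semanticVersion")
    version_stamp: str = Field(alias="versionStamp")


raw = {
    "semanticVersion": "0.0.0",
    "versionStamp": "browsePaths:0;dataPlatformInstance:0;datasetKey:0;schemaMetadata:1",
}
struct = SemanticVersionStruct.model_validate(raw)
print(struct.semantic_version, struct.version_stamp)
```

With aliases in place, the camelCase GraphQL payload validates directly and the separate conversion classmethod is no longer needed.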

mayurinehate · 2025-07-11T15:53:00Z

src/mcp_server_datahub/mcp_server.py


    _inject_urls_for_urns(client._graph, result, [""])

+    if schema_version_list := _get_schema_version_list(client, urn):


I'm somewhat concerned about the performance hit from the additional call to get schema versions every time a dataset entity is fetched. I wonder if this needs its own separate tool, for performance reasons. cc: @hsheth2

Also we should skip this call for non-dataset entities.

@hsheth2 already raised this point in the previous PR:

my main worry here is that every new tool consumes additional tokens on every request. The more tools we have, the more likely it is that the LLM gets confused / doesn't call our other tools when it should. So I'd like to think about what we can do to reduce the number of tools while keeping our responses simple.

@eagle-25 I would like to run some tests on how the addition of get_schema_version_list affects the overall tool timings of get_entity for dataset entities. I might get to this next week. In the meantime, if you can get some numbers or have any observations, feel free to share.

Also we should skip this call for non-dataset entities.

Changed
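
A rough sketch of the kind of guard discussed here, assuming the entity type can be read from the URN prefix (dataset URNs start with `urn:li:dataset:`). The helper names, the `schemaVersionList` key, and the injected `fetch_versions` callable are hypothetical and not taken from the PR.

```python
from typing import Callable

DATASET_URN_PREFIX = "urn:li:dataset:"


def is_dataset_urn(urn: str) -> bool:
    """True when the URN identifies a dataset entity."""
    return urn.startswith(DATASET_URN_PREFIX)


def enrich_entity(result: dict, urn: str, fetch_versions: Callable[[str], list]) -> dict:
    """Attach schema versions only for datasets; other entities skip the call."""
    if is_dataset_urn(urn):
        versions = fetch_versions(urn)
        if versions:
            result["schemaVersionList"] = versions
    return result


fetched = []  # records which URNs triggered the extra lookup


def fake_fetch(urn: str) -> list:
    fetched.append(urn)
    return ["0.0.0", "0.1.0"]


dataset = "urn:li:dataset:(urn:li:dataPlatform:athena,sample.users,PROD)"
print(enrich_entity({}, dataset, fake_fetch))
print(enrich_entity({}, "urn:li:corpuser:eagle-25", fake_fetch))
```

Passing the fetcher in as a parameter also makes it straightforward to wrap it with a timer when measuring the overhead on get_entity.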

mayurinehate · 2025-07-11T15:55:44Z

src/mcp_server_datahub/mcp_server.py



+@mcp.tool(description="Get schema from a dataset by its URN and version.")
+@lru_cache


I'm not sure how frequently we would hit this cache. However, I don't mind keeping it around if we set a max size of a few tens.

src/mcp_server_datahub/mcp_server.py

eagle-25 · 2025-07-16T06:03:16Z

@mayurinehate I've replied to your feedback. Could you please check?

I will modify the code once this conversation is resolved.


eagle-25 · 2025-08-11T15:32:38Z

@mayurinehate Could you review these changes?

I applied the following review feedback:

- set aliases on the fields and use SemanticVersionStruct.model_validate
- skip the schema version call for non-dataset entities
- set maxsize=20 on lru_cache

I also refactored the code into the DatasetSchemaAPI class to improve cohesion.

eagle-25 force-pushed the feat/ass-schema-history-tools branch 2 times, most recently from cbff0dc to bc9711f Compare June 12, 2025 04:40

eagle-25 commented Jun 12, 2025


eagle-25 force-pushed the feat/ass-schema-history-tools branch from bc9711f to 00347a9 Compare June 12, 2025 09:55

eagle-25 marked this pull request as ready for review June 12, 2025 10:13

hsheth2 requested a review from mayurinehate July 4, 2025 03:35

mayurinehate reviewed Jul 11, 2025


eagle-25 requested a review from mayurinehate July 17, 2025 04:17

eagle-25 force-pushed the feat/ass-schema-history-tools branch 3 times, most recently from 70239fc to 728948b Compare August 10, 2025 14:07

feat: add dataset schema support

0b61fa2

- Add a tool to retrieve the schema of a dataset
- Modify get_entity so that when querying a dataset, it also returns the schema version

eagle-25 force-pushed the feat/ass-schema-history-tools branch from 728948b to 0b61fa2 Compare August 10, 2025 14:09

