-
Notifications
You must be signed in to change notification settings - Fork 40
feat(docker): add support for a docker build image of datahub-mcp-server #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 5 commits
cfbe4bc
6910393
5b19d28
0c74fc4
aca8f82
d2483a1
f1e206a
9baad66
ce63cc2
1b17321
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| name: Docker | ||
|
|
||
| on: | ||
| release: | ||
| types: [published] | ||
|
|
||
| jobs: | ||
| build_and_publish: | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: read | ||
| packages: write | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - name: Extract version from release tag | ||
| id: version | ||
| run: | | ||
| echo "version=${{ github.event.release.tag_name }}" >> "$GITHUB_OUTPUT" | ||
|
|
||
| - name: Log in to Docker Hub | ||
| uses: docker/login-action@v3 | ||
| with: | ||
| username: ${{ secrets.DOCKER_USERNAME }} | ||
| password: ${{ secrets.DOCKER_PASSWORD }} | ||
|
|
||
| - name: Log in to GitHub Container Registry | ||
| uses: docker/login-action@v3 | ||
| with: | ||
| registry: ghcr.io | ||
| username: ${{ github.actor }} | ||
| password: ${{ secrets.GITHUB_TOKEN }} | ||
|
|
||
| - name: Set up Docker Buildx | ||
| uses: docker/setup-buildx-action@v3 | ||
|
|
||
| - name: Build and push | ||
| uses: docker/build-push-action@v6 | ||
| with: | ||
| context: . | ||
| push: true | ||
| build-args: | | ||
| VERSION=${{ steps.version.outputs.version }} | ||
| tags: | | ||
| acryldata/mcp-server-datahub:${{ steps.version.outputs.version }} | ||
| acryldata/mcp-server-datahub:latest | ||
| ghcr.io/${{ github.repository }}:${{ steps.version.outputs.version }} | ||
| ghcr.io/${{ github.repository }}:latest | ||
| cache-from: type=gha | ||
| cache-to: type=gha,mode=max |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| FROM python:3.11-slim | ||
|
|
||
| WORKDIR /app | ||
|
|
||
| # Install uv | ||
| COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv | ||
|
|
||
| # Copy dependency files | ||
| COPY pyproject.toml uv.lock ./ | ||
|
|
||
| # Install dependencies (no dev deps, no editable install yet) | ||
| RUN uv sync --frozen --no-dev --no-install-project | ||
|
|
||
| # Copy source | ||
| COPY src/ ./src/ | ||
|
|
||
| # Inject version at build time so setuptools-scm fallback (0.0.0) is not used. | ||
| # The .git directory is not available during Docker builds, so we write | ||
| # _version.py directly from the VERSION build arg. | ||
| ARG VERSION=0.0.0 | ||
| RUN printf '__version__ = version = "%s"\n__version_tuple__ = version_tuple = tuple(int(x) if x.isdigit() else x for x in "%s".lstrip("v").split("."))\n__commit_id__ = commit_id = None\n' \ | ||
| "$VERSION" "$VERSION" \ | ||
| > src/mcp_server_datahub/_version.py | ||
|
|
||
| # Install the project itself | ||
| RUN uv sync --frozen --no-dev | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Docker image always reports version 0.0.0Medium Severity The project uses
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed this |
||
|
|
||
| ENV PATH="/app/.venv/bin:$PATH" | ||
|
|
||
| EXPOSE 8000 | ||
|
|
||
| CMD ["mcp-server-datahub", "--transport", "http"] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| services: | ||
| mcp-server-datahub: | ||
| build: . | ||
| ports: | ||
| - "${MCP_SERVER_PORT:-8000}:8000" | ||
| environment: | ||
| DATAHUB_GMS_URL: ${DATAHUB_GMS_URL} | ||
| TOOLS_IS_MUTATION_ENABLED: ${TOOLS_IS_MUTATION_ENABLED:-false} | ||
| TOOLS_IS_USER_ENABLED: ${TOOLS_IS_USER_ENABLED:-false} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,11 +1,15 @@ | ||
| import logging | ||
| from typing import Any | ||
| import os | ||
| from typing import Any, Optional | ||
|
|
||
| import click | ||
| from datahub.ingestion.graph.config import ClientMode | ||
| from datahub.ingestion.graph.config import ClientMode, DatahubClientConfig | ||
| from datahub.sdk.main_client import DataHubClient | ||
| from datahub.telemetry import telemetry | ||
| from fastmcp import FastMCP | ||
| from fastmcp.server.auth import TokenVerifier | ||
| from fastmcp.server.auth.auth import AccessToken | ||
| from fastmcp.server.dependencies import get_http_request | ||
| from fastmcp.server.middleware import Middleware | ||
| from fastmcp.server.middleware.logging import LoggingMiddleware | ||
| from starlette.requests import Request | ||
|
|
@@ -24,6 +28,64 @@ | |
| register_all_tools(is_oss=True) | ||
|
|
||
|
|
||
| _GET_ME_QUERY = "query getMe { me { corpUser { urn username } } }" | ||
|
|
||
|
|
||
| def _build_client(server_url: str, token: str) -> DataHubClient: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we want to have a cache for this?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I dont think we need to have a cache for this. There's no http calls here.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it results in the new connection for every call, no? I'm pretty certain we need to cache |
||
| return DataHubClient( | ||
| config=DatahubClientConfig( | ||
| server=server_url, | ||
| token=token, | ||
| client_mode=ClientMode.SDK, | ||
| datahub_component=f"mcp-server-datahub/{__version__}", | ||
| ) | ||
| ) | ||
|
|
||
|
|
||
| def _verify_client(client: DataHubClient) -> None: | ||
| """Verify the client can authenticate by calling the me query.""" | ||
| client._graph.execute_graphql(_GET_ME_QUERY) | ||
|
|
||
|
|
||
| def _token_from_request() -> Optional[str]: | ||
| """Extract a DataHub token from the current HTTP request. | ||
|
|
||
| Reads the ``Authorization: Bearer <token>`` header. | ||
| """ | ||
| try: | ||
| request = get_http_request() | ||
| except RuntimeError: | ||
| return None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. when runtime error would be thrown? should we propagate the exception up instead of returning None here?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is called outside of an http context. It should never happen in the current implementation.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we should propagate the exception then, why changing it to None? |
||
| auth = request.headers.get("authorization", "") | ||
| if auth.startswith("Bearer "): | ||
| return auth[len("Bearer ") :] | ||
| return None | ||
|
|
||
|
|
||
| class _DataHubTokenVerifier(TokenVerifier): | ||
| """FastMCP TokenVerifier that validates DataHub bearer tokens. | ||
|
|
||
| Called by FastMCP's BearerAuthBackend for every HTTP request that carries | ||
| an Authorization: Bearer header. If the token is valid a synthetic | ||
| AccessToken is returned; otherwise None causes FastMCP to reply with | ||
| 401 WWW-Authenticate: Bearer automatically. | ||
| """ | ||
|
|
||
| def __init__(self, server_url: str) -> None: | ||
| super().__init__() | ||
| self._server_url = server_url | ||
|
|
||
| async def verify_token(self, token: str) -> Optional[AccessToken]: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we want to have this cacheable? |
||
| try: | ||
| client = _build_client(self._server_url, token) | ||
| _verify_client(client) | ||
| return AccessToken( | ||
| client_id=f"mcp-server-datahub/{__version__}", scopes=[], token=token | ||
| ) | ||
| except Exception: | ||
|
||
| return None | ||
|
|
||
|
|
||
| class _DataHubClientMiddleware(Middleware): | ||
| """Middleware that propagates the DataHub client ContextVar into each request. | ||
|
|
||
|
|
@@ -32,18 +94,34 @@ class _DataHubClientMiddleware(Middleware): | |
| thread. This middleware ensures the DataHub client is available in every request | ||
| context by setting the ContextVar at the start of each MCP message. | ||
|
|
||
| Token validation is handled upstream by ``_DataHubTokenVerifier`` for Bearer | ||
| tokens. This middleware only needs to build the client for the current request | ||
| (or fall back to the default token when a global token is configured). | ||
|
|
||
| Must be added as the first middleware so it wraps all other middlewares. | ||
| """ | ||
|
|
||
| def __init__(self, client: DataHubClient) -> None: | ||
| self._client = client | ||
| def __init__(self, server_url: str, default_token: Optional[str] = None) -> None: | ||
| self._server_url = server_url | ||
| self._default_token = default_token | ||
|
|
||
| def _client_for_request(self) -> DataHubClient: | ||
| token = _token_from_request() | ||
| if token is not None: | ||
| # Token already validated by _DataHubTokenVerifier. | ||
| return _build_client(self._server_url, token) | ||
| if self._default_token is not None: | ||
| return _build_client(self._server_url, self._default_token) | ||
| raise ValueError( | ||
| "No DataHub token provided. Supply a token via the Authorization header." | ||
| ) | ||
|
|
||
| async def on_message( | ||
| self, | ||
| context: Any, | ||
| call_next: Any, | ||
| ) -> Any: | ||
| with with_datahub_client(self._client): | ||
| with with_datahub_client(self._client_for_request()): | ||
| return await call_next(context) | ||
|
|
||
|
|
||
|
|
@@ -72,16 +150,19 @@ def create_app() -> FastMCP: | |
| if _app_initialized: | ||
| return mcp | ||
|
|
||
| client = DataHubClient.from_env( | ||
| client_mode=ClientMode.SDK, | ||
| datahub_component=f"mcp-server-datahub/{__version__}", | ||
| ) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Claude is right - this is a breaking change
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. will fix. |
||
| server_url = os.environ.get("DATAHUB_GMS_URL") | ||
| if not server_url: | ||
| raise RuntimeError("DATAHUB_GMS_URL environment variable is required.") | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we often configure gms and token vs datahubEnv, this will break this flow |
||
|
|
||
| global_token = os.environ.get("DATAHUB_GMS_TOKEN") | ||
| if global_token: | ||
| _verify_client(_build_client(server_url, global_token)) | ||
|
||
|
|
||
| # _DataHubClientMiddleware must be first so the client ContextVar is | ||
| # available to all subsequent middlewares and tool handlers. This is | ||
| # especially important for HTTP transport where each request runs in a | ||
| # separate async context. | ||
| mcp.add_middleware(_DataHubClientMiddleware(client)) | ||
| mcp.add_middleware(_DataHubClientMiddleware(server_url, global_token)) | ||
| mcp.add_middleware(TelemetryMiddleware()) | ||
| mcp.add_middleware(VersionFilterMiddleware()) | ||
| mcp.add_middleware(DocumentToolsMiddleware()) | ||
|
|
@@ -115,7 +196,15 @@ def main(transport: Literal["stdio", "sse", "http"], debug: bool) -> None: | |
| create_app() | ||
|
|
||
| if transport == "http": | ||
| mcp.run(transport=transport, show_banner=False, stateless_http=True) | ||
| server_url = os.environ.get("DATAHUB_GMS_URL", "") | ||
| if not os.environ.get("DATAHUB_GMS_TOKEN"): | ||
| mcp.auth = _DataHubTokenVerifier(server_url) | ||
| mcp.run( | ||
| transport=transport, | ||
| show_banner=False, | ||
| stateless_http=True, | ||
| host="0.0.0.0", | ||
| ) | ||
| else: | ||
| mcp.run(transport=transport, show_banner=False) | ||
|
|
||
|
|
||


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unpinned
uv:latesttag risks breaking Docker buildsMedium Severity
The
COPY --from=ghcr.io/astral-sh/uv:latestuses an unpinned:latesttag, making the Docker build non-reproducible. Ifuvreleases a breaking change (e.g., moving the binary path from/uv, or changing CLI behavior), builds will silently break. The existingwheels.ymlworkflow pinsastral-sh/setup-uv@v6, but this Dockerfile has no version constraint at all. Pinning to a specific version or major version tag (e.g.,uv:0.6) would prevent unexpected build failures.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignoring this as other use cases use uv:latest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not pin? where are the other use cases that use uv:latest?