DX-118395: Make SearchTableAndViews resilient to broken catalog entries #103
Merged
+240
−11
9 commits
cae689b DX-118395: Make SearchTableAndViews resilient to broken catalog entries (ssaumitra)
f4926eb DX-118395: Address PR review feedback (ssaumitra)
c29e6a7 Incorporating review comments (ssaumitra)
04992d0 Fixing test failures (ssaumitra)
f1d3610 Adding not found column in the dataframe (ssaumitra)
38de766 Incorporating review feedbacks (ssaumitra)
1335b7b Fixing the broken test (ssaumitra)
2f4f58e Returning the common search schema from EnterpriseSearchResultsWrappe… (ssaumitra)
6b8e6bc Incorporating review feedbacks (ssaumitra)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| # Dremio MCP Server | ||
| | ||
| ## Project Overview | ||
| | ||
| An MCP (Model Context Protocol) server that enables LLM integration with Dremio. It allows LLMs like Claude to query and interact with Dremio data sources via the MCP protocol. Supports local (stdio) and remote (streaming HTTP) deployment modes. | ||
| | ||
| ## Tech Stack | ||
| | ||
| - **Language**: Python 3.11+ | ||
| - **Package Manager**: `uv` (not pip) | ||
| - **Build System**: Hatchling | ||
| - **Framework**: FastMCP / FastAPI / Starlette | ||
| - **Key Libraries**: mcp, pydantic, structlog, typer, PyJWT, LaunchDarkly SDK | ||
| - **Testing**: pytest with pytest-asyncio (strict mode) | ||
| | ||
| ## Project Structure | ||
| | ||
| ``` | ||
| src/dremioai/ | ||
| ├── api/                  # API clients (Dremio REST, Prometheus, CLI) | ||
| │   ├── dremio/           # Dremio API client | ||
| │   ├── prometheus/       # Prometheus API client | ||
| │   └── cli/              # CLI helpers | ||
| ├── config/               # Configuration management (YAML-based) | ||
| ├── servers/              # MCP server implementation | ||
| │   ├── mcp.py            # Main MCP server entry point (CLI via typer) | ||
| │   ├── jwks_verifier.py  # JWT/JWKS auth verification | ||
| │   └── frameworks/       # Framework integrations (langchain, beeai) | ||
| ├── tools/                # MCP tool definitions | ||
| │   └── tools.py          # Base Tools class | ||
| ├── metrics/              # Prometheus metrics | ||
| └── resources/            # MCP resources | ||
| ``` | ||
| | ||
| ## Common Commands | ||
| | ||
| ```bash | ||
| # Install dependencies | ||
| uv sync | ||
| | ||
| # Run the MCP server | ||
| uv run dremio-mcp-server run | ||
| | ||
| # Run with custom config | ||
| uv run dremio-mcp-server run --config-file <path> | ||
| | ||
| # Run all tests | ||
| uv run pytest tests | ||
| | ||
| # Run a specific test file | ||
| uv run pytest tests/test_chart.py | ||
| | ||
| # Manage config | ||
| uv run dremio-mcp-server config create dremioai --uri <uri> --pat <pat> | ||
| uv run dremio-mcp-server config list --type dremioai | ||
| | ||
| # Build Docker image | ||
| docker build -t dremio-mcp:0.1.0 . | ||
| ``` | ||
| | ||
| ## Development Guidelines | ||
| | ||
| - Follow PEP 8 style guidelines | ||
| - Use type hints for function arguments and return values | ||
| - Async-first: tools and server handlers are async (`asyncio_mode = strict`) | ||
| - New tools must inherit from the `Tools` base class in `dremioai.tools.tools` | ||
| - Tools are categorized by `ToolType`: `FOR_DATA_PATTERNS`, `FOR_SELF`, `FOR_PROMETHEUS` | ||
| - Config is YAML-based, located at `~/.config/dremioai/config.yaml` by default | ||
| - Commit messages start with a JIRA ticket ID (e.g., `DX-XXXXX: description`) | ||
| - Branch from `main` for all changes | ||
| | ||
| ## Testing | ||
| | ||
| - Test files live in `tests/` mirroring the `src/` structure | ||
| - pytest config is in `pytest.ini` with `-v --showlocals -x` defaults | ||
| - Tests use strict asyncio mode — use `@pytest.mark.asyncio` for async tests | ||
| - E2E tests are in `tests/e2e/` | ||
| | ||
| ## Deployment | ||
| | ||
| - **Local**: stdio mode via `uv run dremio-mcp-server run` | ||
| - **Remote/K8s**: Helm chart in `helm/dremio-mcp/` with streaming HTTP mode | ||
| - Auth: PAT (dev/local) or OAuth + External Token Provider (production) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,132 @@ | ||
| # | ||
| # Copyright (C) 2017-2025 Dremio Corporation | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| # | ||
| | ||
| from unittest.mock import patch | ||
| from types import SimpleNamespace | ||
| | ||
| import pandas as pd | ||
| import pytest | ||
| from aiohttp import ClientResponseError | ||
| | ||
| from dremioai.api.dremio import catalog, search | ||
| from dremioai.tools import tools as tools_mod | ||
| | ||
| | ||
| def _client_response_error(status: int, message: str) -> ClientResponseError: | ||
|     request_info = SimpleNamespace( | ||
|         real_url="http://test/catalog/by-path/x", method="GET", headers={}, url="http://test" | ||
|     ) | ||
|     return ClientResponseError( | ||
|         request_info=request_info, history=(), status=status, message=message | ||
|     ) | ||
| | ||
| | ||
| @pytest.mark.asyncio | ||
| async def test_get_schemas_all_success(): | ||
|     async def fake_get_schema(p, *_a, **_kw): | ||
|         return {"schema": {"col": "VARCHAR"}, "path": p} | ||
| | ||
|     with patch.object(catalog, "get_schema", side_effect=fake_get_schema): | ||
|         result = await catalog.get_schemas([["a", "b"], ["c"]]) | ||
| | ||
|     assert len(result) == 2 | ||
|     assert result[0]["path"] == ["a", "b"] | ||
|     assert result[1]["path"] == ["c"] | ||
| | ||
| | ||
| @pytest.mark.asyncio | ||
| async def test_get_schemas_propagates_http_exception(): | ||
|     """get_schemas does not swallow errors; exceptions bubble up to the caller (DX-118395).""" | ||
| | ||
|     async def fake_get_schema(*_a, **_kw): | ||
|         raise _client_response_error(400, "Bad Request") | ||
| | ||
|     with patch.object(catalog, "get_schema", side_effect=fake_get_schema): | ||
|         with pytest.raises(ClientResponseError): | ||
|             await catalog.get_schemas([["ok", "one"], ["bad", "view"]]) | ||
| | ||
| | ||
| @pytest.mark.asyncio | ||
| async def test_get_schemas_propagates_non_http_exception(): | ||
|     async def fake_get_schema(*_a, **_kw): | ||
|         raise ValueError("kapow") | ||
| | ||
|     with patch.object(catalog, "get_schema", side_effect=fake_get_schema): | ||
|         with pytest.raises(ValueError, match="kapow"): | ||
|             await catalog.get_schemas([["boom"]]) | ||
| | ||
| | ||
| @pytest.mark.asyncio | ||
| async def test_populate_schemas_marks_not_found_on_failure(): | ||
|     """One broken catalog entry must not fail the whole search (DX-118395). | ||
|     Error is embedded per-row via schema_not_found on EnterpriseSearchCatalogObject.""" | ||
| | ||
|     async def fake_get_schema(dataset_path_or_id, *_a, **_kw): | ||
|         if dataset_path_or_id == ["bad", "view"]: | ||
|             raise _client_response_error(400, "Bad Request") | ||
|         return {"schema": {"col": "VARCHAR"}} | ||
| | ||
|     ok = search.EnterpriseSearchCatalogObject(path=["ok", "one"], labels=[]) | ||
|     bad = search.EnterpriseSearchCatalogObject(path=["bad", "view"], labels=[]) | ||
| | ||
|     with patch.object(search, "get_schema", side_effect=fake_get_schema): | ||
|         await ok.populate_schemas() | ||
|         await bad.populate_schemas() | ||
| | ||
|     assert ok.schema == {"col": "VARCHAR"} | ||
|     assert bad.schema is None | ||
| | ||
| | ||
| @pytest.mark.asyncio | ||
| async def test_search_table_and_views_drops_broken_entries_and_returns_healthy_ones(): | ||
|     """DX-118395: one broken catalog entry must not fail the whole tool call. | ||
| | ||
|     The tool silently drops entries whose schema could not be fetched and | ||
|     returns the healthy ones. | ||
|     """ | ||
| | ||
|     ok_df = pd.DataFrame( | ||
|         [{"path": ["ok", "tbl"], "name": "ok.tbl", "schema": {"a": "INT"}}] | ||
|     ) | ||
|     bad_df = pd.DataFrame( | ||
|         [{"path": ["bad", "view"], "name": "bad.view", "schema": None}] | ||
|     ) | ||
| | ||
|     async def fake_search(search_obj, use_df=False): | ||
|         if search_obj.filter == 'category in ["TABLE"]': | ||
|             return ok_df | ||
|         return bad_df | ||
| | ||
|     with patch.object(tools_mod.search, "get_search_results", side_effect=fake_search): | ||
|         result = await tools_mod.SearchTableAndViews().invoke("NYC bike trips") | ||
| | ||
|     assert set(result.keys()) == {"results"} | ||
|     names = {row["name"] for row in result["results"]} | ||
|     assert "ok.tbl" in names | ||
| | ||
| | ||
| @pytest.mark.asyncio | ||
| async def test_get_descriptions_raises_on_schema_fetch_error(): | ||
|     """get_descriptions must remain fail-fast so GetDescriptionOfTableOrSchema | ||
|     surfaces an error instead of silently returning partial data.""" | ||
| | ||
|     async def fake_get_schema(p, *_a, **_kw): | ||
|         raise _client_response_error(400, "Bad Request") | ||
| | ||
|     with patch.object(catalog, "get_schema", side_effect=fake_get_schema): | ||
|         with pytest.raises(ClientResponseError): | ||
|             await catalog.get_descriptions([["a", "b"]]) | ||
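
The tests above contrast two error-handling styles: batch helpers such as `get_schemas` and `get_descriptions` stay fail-fast, while the per-entry `populate_schemas` path absorbs a failure and records it on the row. The production code is not shown in this diff, so the following is only a minimal sketch of that contrast, not the project's implementation. `Entry`, `SchemaFetchError`, `fetch_schema`, `fetch_all_fail_fast`, and `search_entries` are hypothetical names introduced for illustration, and the decision to catch only one exception type is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


class SchemaFetchError(Exception):
    """Hypothetical stand-in for an HTTP error raised by a catalog lookup."""


async def fetch_schema(path: list[str]) -> dict[str, Any]:
    """Hypothetical fetcher: fails for one known-bad path, succeeds otherwise."""
    await asyncio.sleep(0)  # yield to the event loop, as a real I/O call would
    if path == ["bad", "view"]:
        raise SchemaFetchError("400 Bad Request")
    return {"col": "VARCHAR"}


@dataclass
class Entry:
    """Hypothetical search hit; not the project's EnterpriseSearchCatalogObject."""

    path: list[str]
    schema: Optional[dict[str, Any]] = None
    schema_not_found: bool = False

    async def populate_schema(self) -> None:
        # Fail-soft: record the failure on this row instead of raising.
        try:
            self.schema = await fetch_schema(self.path)
        except SchemaFetchError:
            self.schema = None
            self.schema_not_found = True


async def fetch_all_fail_fast(paths: list[list[str]]) -> list[dict[str, Any]]:
    # Fail-fast: gather() re-raises the first exception to the caller.
    return await asyncio.gather(*(fetch_schema(p) for p in paths))


async def search_entries(paths: list[list[str]]) -> list[Entry]:
    # Each entry handles its own failure, so one bad path cannot sink the batch.
    entries = [Entry(path=p) for p in paths]
    await asyncio.gather(*(e.populate_schema() for e in entries))
    return entries
```

Calling `asyncio.run(search_entries([["ok", "one"], ["bad", "view"]]))` returns both entries, with the second one flagged, whereas `fetch_all_fail_fast` on the same input raises. Keeping the tolerance at the per-entry level leaves the shared batch helper strict for callers that need an error rather than partial data.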