
Commit 29ee673

feat: replace verbose tdml docstrings with compact curated summaries (#309)
* feat: replace verbose tdml docstrings with compact curated summaries (#265)

  Convert TD_ANALYTIC_FUNCS from list to dict[str, str] mapping each teradataml function to a one-line curated summary. Replace convert_tdml_docstring_to_mcp_docstring() with build_tdml_tool_docstring() that generates compact parameter descriptions directly from live function metadata, significantly reducing MCP tool description size. Adds scripts/seed_tdml_summaries.py utility for regenerating summaries from a live DB.

* style: ruff format utils/__init__.py
1 parent 2afe062 commit 29ee673

8 files changed

Lines changed: 507 additions & 183 deletions


CLAUDE.md

Lines changed: 9 additions & 0 deletions
@@ -66,6 +66,15 @@ Tools are organized into domain modules under `src/teradata_mcp_server/tools/`:
 
 Profiles (defined in `config/profiles.yml`) control which modules load. The `module_loader.py` uses regex pattern matching against tool name prefixes to determine which modules to import. Available profiles: `all`, `dba`, `dataScientist`, `eda`, `bar`, `llmUser`, `tester`.
 
+### teradataml Analytic Function Tools (`tdml_*`)
+
+The ~89 `tdml_*` tools (e.g., `tdml_KMeans`, `tdml_XGBoost`) are registered dynamically in `app.py`, separate from the `handle_*` module pattern. Key files:
+
+- **`tools/constants.py`** `TD_ANALYTIC_FUNCS`: a `dict[str, str]` mapping teradataml function name → curated one-line summary. This is the authoritative list of which functions to register. To add a new function, add one entry here.
+- **`tools/utils/__init__.py`** `build_tdml_tool_docstring(summary, func_metadata, partition_order_cols)`: builds the compact MCP tool description at registration time by reading parameter names, descriptions, and types from the live teradataml JSON store.
+
+Tools are only registered when `enable_analytic_functions` is true, teradataml is installed, and a database connection is available. Functions missing from the connected system are skipped with a warning.
+
 ### Configuration System
 
 Layered config loading (`config_loader.py`):
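
Reviewer note on the CLAUDE.md addition: the registration conditions it describes (feature flag on, function present on the connected system, otherwise skip with a warning) can be sketched as below. This is a minimal illustration, not the code in `app.py`. The helper name `select_functions_to_register`, the logger name, and the two placeholder summaries are assumptions made for the example.

```python
import logging

logger = logging.getLogger("tdml_registration_sketch")  # hypothetical logger name

# Stand-in for TD_ANALYTIC_FUNCS in tools/constants.py; summaries are placeholders.
TD_ANALYTIC_FUNCS = {
    "KMeans": "Placeholder summary for KMeans.",
    "XGBoost": "Placeholder summary for XGBoost.",
}


def select_functions_to_register(funcs, available, enabled):
    """Return {tool_name: summary} for the functions that can be registered.

    funcs:     dict of teradataml function name -> one-line summary
    available: set of function names reported by the connected system
    enabled:   value of the enable_analytic_functions flag
    """
    if not enabled:
        return {}
    selected = {}
    for name, summary in funcs.items():
        if name not in available:
            # Missing functions are skipped with a warning, not an error.
            logger.warning("Skipping %s: not available on the connected system", name)
            continue
        selected["tdml_" + name] = summary
    return selected


if __name__ == "__main__":
    print(select_functions_to_register(TD_ANALYTIC_FUNCS, {"KMeans"}, enabled=True))
```

Keeping the selection step separate from the registration step makes the skip behaviour easy to test without a database.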

docs/developer_guide/DEVELOPER_GUIDE.md

Lines changed: 16 additions & 2 deletions
@@ -373,6 +373,19 @@ This section explains how the pieces fit together at runtime.
 - The wrapper delegates execution to `execute_db_tool` which:
 - Injects a DB connection (SQLAlchemy `Connection` preferred)
 - Sets QueryBand based on request context (`tools/utils/queryband.py`)
+- Dynamically registers `tdml_*` analytic function tools when teradataml is installed and a database connection is available (see below).
+
+### Dynamic teradataml Analytic Function Registration
+
+The ~89 `tdml_*` tools (e.g., `tdml_KMeans`, `tdml_XGBoost`) are not defined as `handle_*` functions. Instead, `app.py` generates and registers them at startup when `enable_analytic_functions` is true:
+
+1. **`tools/constants.py`** — `TD_ANALYTIC_FUNCS` is a `dict[str, str]` mapping each teradataml function name to a curated one-line summary (e.g., `"KMeans": "Groups observations into k clusters..."`). This dict is the authoritative list of which functions to register.
+
+2. **`tools/utils/__init__.py`** — `build_tdml_tool_docstring(summary, func_metadata, partition_order_cols)` builds the compact MCP tool description at registration time. It reads parameter names, descriptions, Required/Optional, and types directly from `func_metadata.arguments` (teradataml's live JSON store, populated from the database), combining them with the curated summary.
+
+3. **`app.py`** — Iterates `TD_ANALYTIC_FUNCS.items()`, queries the live JSON store for each function's metadata, generates a Python function string via `exec()`, and registers it with `mcp.tool()`. If a function from the dict is not present in the connected database's function list, it is skipped with a warning.
+
+**To add a new analytic function:** add one entry to `TD_ANALYTIC_FUNCS` in `tools/constants.py` with a concise one-line description. No other code changes are needed.
 
 ## Project Layout
 
@@ -394,11 +407,12 @@ teradata-mcp-server/
 │ └─ profiles.yml
 └─ tools/
 ├─ __init__.py # Lazy module loader + explicit exports (e.g., TDConn)
+├─ constants.py # TD_ANALYTIC_FUNCS dict: teradataml function name → one-line summary
 ├─ module_loader.py # Profiles → load only needed tool modules (+ YAMLs)
 ├─ td_connect.py # SQLAlchemy connection + auth validation helpers
 ├─ utils/
-│ ├─ __init__.py # JSON helpers, auth header parsing, exports queryband
-+ │ └─ queryband.py # Build Teradata QueryBand from request context
+│ ├─ __init__.py # JSON helpers, auth header parsing, tdml docstring builder
+│ └─ queryband.py # Build Teradata QueryBand from request context
 ├─ base/ ... # Tool groups (base, dba, sec, qlty, rag, fs, tdvs, ...)
 └─ fs/... # Optional extras; imported only if profile enables them
 ```
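
Reviewer note on the developer guide addition: step 2 says the description combines the curated summary with one line per parameter (name, Required/Optional, types, description). The body of `build_tdml_tool_docstring` is not part of this diff, so the following is only a sketch of that idea. `ArgSketch` and `MetadataSketch` are hypothetical stand-ins for teradataml's metadata objects, whose real attribute names may differ, and the output layout is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ArgSketch:
    """Hypothetical stand-in for one entry of func_metadata.arguments."""
    name: str
    description: str
    required: bool
    types: tuple = ()


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for the function metadata object."""
    arguments: list = field(default_factory=list)


def build_docstring_sketch(summary, func_metadata, partition_order_cols):
    """Assemble a compact description: summary, then one line per parameter."""
    lines = [summary.strip(), "", "Parameters:"]
    for arg in func_metadata.arguments:
        requirement = "Required" if arg.required else "Optional"
        type_names = ", ".join(arg.types) or "any"
        # Collapse multi-line descriptions so each parameter stays on one line.
        description = " ".join(arg.description.split())
        lines.append(f"- {arg.name} ({requirement}, {type_names}): {description}")
    # Partition/order column docs are appended as already-formatted strings.
    lines.extend(doc.strip() for doc in partition_order_cols)
    return "\n".join(lines)


if __name__ == "__main__":
    meta = MetadataSketch([ArgSketch("num_clusters", "Number of\n  clusters.", False, ("int",))])
    print(build_docstring_sketch("Groups observations.", meta, []))
```

Building from structured metadata rather than re-parsing a long docstring is what keeps the description compact: each parameter contributes exactly one line.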

docs/developer_guide/HOW_TO_ADD_YOUR_FUNCTION.md

Lines changed: 30 additions & 1 deletion
@@ -126,4 +126,33 @@ Use MCP Inspector or your client (Claude Desktop) to call the tool once it’s e
 | MCP wrapper (auto) | Auto-generated MCP wrapper around your handler (built at startup). |
 | `execute_db_tool` (internal) | Central adapter: sets QueryBand, handles errors/formatting, reconnects. |
 
-Let me know if you'd like this as a template or reusable decorator for many functions.
+---
+
+## ➕ Adding a teradataml Analytic Function (`tdml_*`)
+
+The `tdml_*` tools (e.g., `tdml_KMeans`, `tdml_XGBoost`) are registered dynamically from the teradataml library. They do **not** follow the `handle_*` pattern — instead, they are driven by a curated dictionary.
+
+### How it works
+
+1. `tools/constants.py` contains `TD_ANALYTIC_FUNCS`, a `dict[str, str]` mapping each teradataml function name to a curated one-line summary.
+2. At startup, `app.py` iterates this dict, queries the live teradataml JSON store for each function's parameter metadata, and generates + registers a `tdml_<FunctionName>` MCP tool automatically.
+3. `build_tdml_tool_docstring()` in `tools/utils/__init__.py` assembles the compact description (summary + one line per parameter with Required/Optional and types).
+
+### Steps to add a new function
+
+1. Open `src/teradata_mcp_server/tools/constants.py`.
+2. Add one entry to `TD_ANALYTIC_FUNCS`:
+
+```python
+TD_ANALYTIC_FUNCS = {
+    ...
+    "MyNewFunction": "One-sentence description of what this function does.",
+}
+```
+
+That's it — no other code changes are needed. The server will register `tdml_MyNewFunction` automatically on next startup, provided the function exists in the connected database's teradataml version.
+
+### Notes
+- The summary should be one sentence, no longer than ~120 characters.
+- If the function is not present in the connected database, it is skipped with a warning — no error.
+- The `fs` extra (`uv sync --extra fs`) must be installed for any `tdml_*` tools to register.
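
Reviewer note on the Notes above: the summary guideline (one line, roughly 120 characters) could be checked mechanically before a new entry is merged. The sketch below is not part of this commit. The helper name `check_summaries` is hypothetical, and treating the approximate 120-character guideline as a hard cap is an assumption.

```python
MAX_SUMMARY_CHARS = 120  # assumption: the "~120 characters" note applied as a hard limit


def check_summaries(funcs):
    """Return a list of human-readable problems found in a name -> summary dict."""
    problems = []
    for name, summary in funcs.items():
        if not summary.strip():
            problems.append(f"{name}: empty summary")
        elif "\n" in summary:
            problems.append(f"{name}: summary spans multiple lines")
        elif len(summary) > MAX_SUMMARY_CHARS:
            problems.append(f"{name}: {len(summary)} chars (limit {MAX_SUMMARY_CHARS})")
    return problems


if __name__ == "__main__":
    sample = {
        "MyNewFunction": "One-sentence description of what this function does.",
        "TooLong": "x" * 121,
    }
    for problem in check_summaries(sample):
        print(problem)
```

A check like this would fit naturally in a unit test that imports the real `TD_ANALYTIC_FUNCS` and asserts the returned list is empty.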

scripts/seed_tdml_summaries.py

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
+"""
+Seed script: connects to Teradata, extracts one-line summaries from teradataml
+__init__.__doc__ for all TD_ANALYTIC_FUNCS, then prints the new dict[str, str]
+block ready to paste into constants.py.
+
+Requires DATABASE_URI env var:
+    export DATABASE_URI="teradata://user:pass@host:1025/db"
+    uv run python scripts/seed_tdml_summaries.py
+"""
+
+import os
+import re
+import textwrap
+import warnings
+
+warnings.filterwarnings("ignore")
+
+import teradataml as tdml  # noqa: E402
+
+# Connect so that teradataml populates __init__.__doc__ on each class
+_uri = os.environ.get("DATABASE_URI", "")
+if _uri:
+    _m = re.match(r"teradata://([^:]+):([^@]+)@([^:]+):(\d+)/(.+)", _uri)
+    if _m:
+        tdml.create_context(
+            host=_m.group(3),
+            username=_m.group(1),
+            password=_m.group(2),
+            database=_m.group(5),
+        )
+    else:
+        raise ValueError(f"Cannot parse DATABASE_URI: {_uri}")
+else:
+    raise EnvironmentError("DATABASE_URI not set — docstrings require a live connection")
+
+FUNCS = [
+    "ANOVA",
+    "Attribution",
+    "Antiselect",
+    "Apriori",
+    "BincodeFit",
+    "BincodeTransform",
+    "CFilter",
+    "CategoricalSummary",
+    "ChiSq",
+    "ClassificationEvaluator",
+    "ColumnSummary",
+    "ColumnTransformer",
+    "ConvertTo",
+    "DecisionForest",
+    "FTest",
+    "FillRowId",
+    "Fit",
+    "GetFutileColumns",
+    "GetRowsWithMissingValues",
+    "GetRowsWithoutMissingValues",
+    "GLM",
+    "GLMPerSegment",
+    "Histogram",
+    "KMeans",
+    "KMeansPredict",
+    "KNN",
+    "MovingAverage",
+    "NERExtractor",
+    "NGramSplitter",
+    "NaiveBayesTextClassifierPredict",
+    "NaiveBayesTextClassifierTrainer",
+    "NonLinearCombineFit",
+    "NonLinearCombineTransform",
+    "NumApply",
+    "NPath",
+    "OneClassSVM",
+    "OneClassSVMPredict",
+    "OneHotEncodingFit",
+    "OneHotEncodingTransform",
+    "OrdinalEncodingFit",
+    "OrdinalEncodingTransform",
+    "OutlierFilterFit",
+    "OutlierFilterTransform",
+    "Pack",
+    "PolynomialFeaturesFit",
+    "PolynomialFeaturesTransform",
+    "Pivoting",
+    "QQNorm",
+    "ROC",
+    "RandomProjectionFit",
+    "RandomProjectionMinComponents",
+    "RandomProjectionTransform",
+    "RegressionEvaluator",
+    "RoundColumns",
+    "RowNormalizeFit",
+    "RowNormalizeTransform",
+    "SMOTE",
+    "SVM",
+    "SVMPredict",
+    "ScaleFit",
+    "ScaleTransform",
+    "Sessionize",
+    "SentimentExtractor",
+    "Shap",
+    "Silhouette",
+    "SimpleImputeFit",
+    "SimpleImputeTransform",
+    "StrApply",
+    "StringSimilarity",
+    "TDDecisionForestPredict",
+    "TDGLMPredict",
+    "TDNaiveBayesPredict",
+    "TFIDF",
+    "TargetEncodingFit",
+    "TargetEncodingTransform",
+    "TextMorph",
+    "TextParser",
+    "TrainTestSplit",
+    "Transform",
+    "UnivariateStatistics",
+    "Unpack",
+    "Unpivoting",
+    "VectorDistance",
+    "WhichMax",
+    "WhichMin",
+    "WordEmbeddings",
+    "XGBoost",
+    "XGBoostPredict",
+    "ZTest",
+]
+
+
+def extract_summary(func_name: str) -> str:
+    """Pull the first meaningful sentence from the teradataml __init__ docstring."""
+    func_obj = getattr(tdml, func_name, None)
+    if func_obj is None:
+        return f"Teradata ML analytic function {func_name}."
+
+    raw = getattr(func_obj.__init__, "__doc__", None) or ""
+    # Dedent and strip leading blank lines
+    raw = textwrap.dedent(raw).strip()
+
+    # The teradataml pattern is:
+    #   DESCRIPTION:
+    #       <summary text, may span multiple lines>
+    #
+    #   PARAMETERS:
+    # Try to grab the DESCRIPTION block first.
+    desc_match = re.search(r"DESCRIPTION\s*:\s*\n(.*?)(?:\n\s*\n|\n\s*PARAMETERS\s*:)", raw, re.DOTALL)
+    if desc_match:
+        block = desc_match.group(1)
+    else:
+        # Fallback: take the first non-empty paragraph
+        block = raw.split("\n\n")[0]
+
+    # Collapse internal whitespace / newlines into a single line
+    block = re.sub(r"\s+", " ", block).strip()
+
+    # Replace teradataml-specific terminology
+    block = block.replace("teradataml DataFrame", "table name")
+    block = block.replace("DataFrame", "table name")
+
+    # Truncate at the first sentence boundary (a period followed by whitespace).
+    # Keep the trailing period.
+    sent_match = re.search(r"^(.*?\.)\s", block)
+    if sent_match:
+        summary = sent_match.group(1)
+    else:
+        # No sentence boundary — use the whole block but cap length
+        summary = block[:200].rstrip()
+        if not summary.endswith("."):
+            summary += "."
+
+    return summary
+
+
+def main():
+    results: list[tuple[str, str]] = []
+    missing: list[str] = []
+
+    for name in FUNCS:
+        summary = extract_summary(name)
+        results.append((name, summary))
+        if "analytic function" in summary and name in summary:
+            missing.append(name)
+
+    # Print the dict literal ready to paste into constants.py
+    print("TD_ANALYTIC_FUNCS = {")
+    for name, summary in results:
+        # Escape any quotes inside the summary
+        safe = summary.replace('"', '\\"')
+        print(f' "{name}": "{safe}",')
+    print("}")
+
+    if missing:
+        print(f"\n# WARNING: {len(missing)} functions had no extractable docstring — fallback used:")
+        for m in missing:
+            print(f"# {m}")
+
+
+if __name__ == "__main__":
+    main()
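
Reviewer note on `extract_summary`: the extraction logic can be exercised without a database by applying the same two regular expressions to a hand-written docstring. The sketch below copies those expressions from the script but omits the `DataFrame` terminology replacement. `first_sentence` and `SAMPLE` are hypothetical names, and the sample text is invented for illustration rather than taken from teradataml.

```python
import re
import textwrap


def first_sentence(raw):
    """Return the first sentence of the DESCRIPTION block, or of the first paragraph."""
    raw = textwrap.dedent(raw).strip()
    match = re.search(r"DESCRIPTION\s*:\s*\n(.*?)(?:\n\s*\n|\n\s*PARAMETERS\s*:)", raw, re.DOTALL)
    block = match.group(1) if match else raw.split("\n\n")[0]
    # Collapse newlines and runs of spaces into single spaces.
    block = re.sub(r"\s+", " ", block).strip()
    # A sentence boundary here means a period followed by whitespace.
    sentence = re.search(r"^(.*?\.)\s", block)
    if sentence:
        return sentence.group(1)
    # No boundary found: fall back to the capped block, ensuring a final period.
    summary = block[:200].rstrip()
    return summary if summary.endswith(".") else summary + "."


# Invented docstring laid out in the DESCRIPTION / PARAMETERS style the script expects.
SAMPLE = """
    DESCRIPTION:
        The Example() function groups rows into clusters.
        It repeats until the result converges.

    PARAMETERS:
        data:
            Required Argument.
"""

if __name__ == "__main__":
    print(first_sentence(SAMPLE))  # The Example() function groups rows into clusters.
```

One limitation worth knowing: because the boundary is simply the first period followed by whitespace, an abbreviation such as "e.g. " inside the description would cut the summary short, which is one reason the generated summaries are reviewed by hand before being pasted into `constants.py`.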

src/teradata_mcp_server/app.py

Lines changed: 6 additions & 8 deletions
@@ -38,7 +38,7 @@
 from teradata_mcp_server.tools import ContextCatalog
 from teradata_mcp_server.tools.graph.graph_edge_contract import GRAPH_EDGE_CONTRACT
 from teradata_mcp_server.tools.utils import (
-    convert_tdml_docstring_to_mcp_docstring,
+    build_tdml_tool_docstring,
     execute_analytic_function,
     get_anlytic_function_signature,
     get_dynamic_function_definition,
@@ -587,7 +587,7 @@ def execute_tool(
     if enable_analytic_functions:
         tdml_processed_funcs = set(tdml.analytics.json_parser.json_store._JsonStore._get_function_list()[0].keys())
 
-        for func_name in funcs:
+        for func_name, summary in funcs.items():
             # Before adding the function, check if function is existed or not.
             # Connection is not mandatory for MCP server. If connection is not there, then
             # functions can not be added.
@@ -596,30 +596,28 @@ def execute_tool(
                 continue
 
             func_metadata = tdml.analytics.json_parser.json_store._JsonStore.get_function_metadata(func_name)
-            func_obj = getattr(tdml, func_name, None)
             func_params = func_metadata.function_params
 
             inp_data = [t.get_lang_name() for t in func_metadata.input_tables]
             # Add partition_by parameters for func parameters.
-            additional_args_docs = []
+            partition_order_cols = []
             for table in inp_data:
                 func_params[f"{table}_partition_column"] = None
                 func_params[f"{table}_order_column"] = None
-                additional_args_docs.append(get_partition_col_order_col_doc_string(table))
+                partition_order_cols.append(get_partition_col_order_col_doc_string(table))
 
             # Generate function argument string.
             func_args_str = get_anlytic_function_signature(func_params)
 
             full_func_name = "tdml_" + func_name
-            init_doc = func_obj.__init__.__doc__  # type: ignore[misc]
             func_str = get_dynamic_function_definition().format(
                 analytic_function=full_func_name,
-                doc_string=init_doc,
+                doc_string=summary,
                 func_args_str=func_args_str,
                 tables_to_df=json.dumps(inp_data),
             )
 
-            doc_string = convert_tdml_docstring_to_mcp_docstring(init_doc, additional_args_docs)
+            doc_string = build_tdml_tool_docstring(summary, func_metadata, partition_order_cols)
 
             # Execute the generated function definition in the global scope.
             # Global scope will have all other functions. So reference to other functions will work.
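
Reviewer note on the `app.py` change: the loop formats a function definition as a string, executes it, and attaches the compact description. The template returned by `get_dynamic_function_definition()` is not shown in this diff, so the sketch below only illustrates the general technique. `make_tool`, the inline source template, and `fake_runner` are hypothetical, and returning the function stands in for registering it with `mcp.tool()`.

```python
import keyword


def _is_safe_name(name):
    """Accept only plain identifiers so that formatted source cannot inject code."""
    return name.isidentifier() and not keyword.iskeyword(name)


def make_tool(func_name, params, doc, runner):
    """Generate `tdml_<func_name>` from a source string and attach its description."""
    tool_name = "tdml_" + func_name
    if not _is_safe_name(tool_name) or not all(_is_safe_name(p) for p in params):
        raise ValueError("function and parameter names must be valid identifiers")
    signature = ", ".join(f"{p}=None" for p in params)
    forwarded = ", ".join(f"{p}={p}" for p in params)
    source = (
        f"def {tool_name}({signature}):\n"
        f"    return _runner({func_name!r}, dict({forwarded}))\n"
    )
    namespace = {"_runner": runner}
    exec(source, namespace)  # defines the function inside `namespace`
    tool = namespace[tool_name]
    tool.__doc__ = doc  # description is attached after generation
    return tool


if __name__ == "__main__":
    def fake_runner(name, kwargs):
        # Stand-in for executing the analytic function against the database.
        return f"{name} called with {kwargs}"

    kmeans = make_tool("KMeans", ["data", "num_clusters"], "Groups observations.", fake_runner)
    print(kmeans.__name__, "->", kmeans(data="my_table", num_clusters=3))
```

Generating real signatures, rather than accepting `**kwargs`, matters because tool frameworks typically derive the input schema from the function's parameters. Validating names before formatting keeps the `exec()` call from running anything other than the intended definition.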
