Da 1766 ironing the testing pipeline for spider2.0 testing of cb mcp server #1
Open
Arjunnr-cb wants to merge 9 commits into
Pull request overview
This PR adds an end-to-end, Makefile-driven pipeline to generate SQL++ via the Couchbase MCP server, postprocess the generated queries, and evaluate/analyze results against a Couchbase cluster, with configuration flowing from config.json → .env.
Changes:
- Introduces a `make setup`/`run`/`quicktest` workflow and a `scripts/setup.py` generator to produce `.env` from `config.json`.
- Updates `run_mcp.sh` to load `.env`, start iQ-FastAPI automatically, run MCP generation, postprocess, and evaluate/analyze into timestamped `runs/<IST_timestamp>[_tag]/`.
- Adds a Snowflake question set file and supporting docs/ignore rules.
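The timestamped `runs/<IST_timestamp>[_tag]/` layout can be illustrated with a small sketch. This is not the PR's code (the naming lives in `run_mcp.sh`); the function name `run_dir_name` and the `%Y%m%d_%H%M%S` timestamp format are assumptions made for illustration. Only the directory shape and the use of IST come from the overview.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

# IST is a fixed UTC+05:30 offset with no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def run_dir_name(tag: str = "", now: Optional[datetime] = None) -> Path:
    """Build runs/<IST_timestamp>[_tag]; the timestamp format is assumed."""
    moment = (now or datetime.now(timezone.utc)).astimezone(IST)
    stamp = moment.strftime("%Y%m%d_%H%M%S")
    name = f"{stamp}_{tag}" if tag else stamp
    return Path("runs") / name
```

Passing an explicit `now` keeps the function deterministic in tests; an empty tag simply omits the suffix.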
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| `test/run.py` | Makes MCP server path/venv/log path configurable via environment variables and remaps `MCP_CB_*` to `CB_*`. |
| `test/questions_snowflake.jsonl` | Adds Snowflake question set (currently JSONL format). |
| `scripts/setup.py` | Adds generator to create `.env` from `config.json` values. |
| `run_mcp.sh` | Adds `.env` loading, iQ-FastAPI lifecycle management, run naming, stale output cleanup, postprocess step, and updated evaluation invocation. |
| `Makefile` | Adds `setup`, `run`, and `quicktest` targets driven by `.env` variables. |
| `docs/testing-pipeline-for-semantic-catalog.md` | Documents setup/configuration and pipeline flow. |
| `config_example.json` | Adds structured config template for required paths/credentials and pipeline defaults. |
| `.gitignore` | Ignores generated artifacts (`runs/`, `test/output/`) and `config.json`. |
        exit 1
    fi

    # Python for evaluation scripts (needs pandas, couchbase, tqdm — use iQ-FastAPI venv)
Comment on lines 200 to 204

        echo " (limit: $LIMIT questions)"
    fi
    "$PYTHON_BIN" "${TEST_DIR}/run.py" --questions_file "$QUESTIONS_FILE" $LIMIT_ARG
    MCP_SERVER_LOG_FILE="${LOG_DIR}/mcp_server.log" \
    "${MCP_SERVER_VENV_PATH}/bin/python" "${TEST_DIR}/run.py" --questions_file "$QUESTIONS_FILE" $LIMIT_ARG
    echo ""
    if [[ "$LIMIT" -gt 0 ]]; then
        LIMIT_ARG="--limit $LIMIT"
        echo " (limit: $LIMIT questions)"
    fi
Comment on lines +228 to 233

    "$EVAL_PYTHON" "${EVAL_DIR}/evaluate_sqlpp_catalog.py" \
    --result_dir "$SUBMISSION_DIR" \
    --gold_dir "$GOLD_DIR" \
    --max_workers "$MAX_WORKERS" \
    --timeout "$TIMEOUT" \
    2>&1 | tee "${LOG_DIR}/evaluate.log"
    --timeout "$TIMEOUT"
Comment on lines +38 to +43

    for var in group["variables"]:
        key = var["key"]
        value = var.get("value", "")
        lines.append(f"{key}={value}")
        if var.get("required") and not value:
            missing_required.append(key)
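For context, the loop above could sit inside a generator along the following lines. This is a hedged sketch rather than the contents of `scripts/setup.py`: the top-level `"groups"` key and the function names `render_env` and `write_env` are assumptions, while the per-variable `key`, `value`, and `required` fields are taken from the snippet.

```python
import json
from pathlib import Path
from typing import Any


def render_env(config: dict[str, Any]) -> tuple[str, list[str]]:
    """Return the .env text and the keys that are required but empty."""
    lines: list[str] = []
    missing_required: list[str] = []
    for group in config.get("groups", []):  # "groups" is an assumed key name
        for var in group["variables"]:
            key = var["key"]
            value = var.get("value", "")
            lines.append(f"{key}={value}")
            if var.get("required") and not value:
                missing_required.append(key)
    return "\n".join(lines) + "\n", missing_required


def write_env(config_path: Path, env_path: Path) -> list[str]:
    """Read config JSON, write the .env file, and report missing keys."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    text, missing = render_env(config)
    env_path.write_text(text, encoding="utf-8")
    return missing
```

Values are written unquoted here, so anything containing spaces or `#` would likely need quoting before a shell sources the file.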
    def _resolve_server_dir() -> Path:
        env_path = os.environ.get("MCP_SERVER_PATH", "").strip()
        if env_path:
            return Path(env_path)
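The excerpt stops before the fallback branch, and the file summary also mentions remapping `MCP_CB_*` to `CB_*`. A minimal sketch of both ideas follows, under stated assumptions: `DEFAULT_SERVER_DIR` is a hypothetical placeholder (the real default is not shown), and `remap_cb_env` keeps the original `MCP_CB_*` entries alongside the new names, which may differ from what `test/run.py` actually does.

```python
import os
from pathlib import Path
from typing import Mapping

# Hypothetical fallback; the PR's real default is not visible in the excerpt.
DEFAULT_SERVER_DIR = Path.home() / "mcp-server-couchbase"


def resolve_server_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Prefer MCP_SERVER_PATH when it is set and non-blank."""
    env_path = env.get("MCP_SERVER_PATH", "").strip()
    if env_path:
        return Path(env_path)
    return DEFAULT_SERVER_DIR


def remap_cb_env(env: Mapping[str, str]) -> dict[str, str]:
    """Copy env, adding a CB_* name for every MCP_CB_* variable."""
    remapped = dict(env)
    for key, value in env.items():
        if key.startswith("MCP_CB_"):
            remapped[key[len("MCP_"):]] = value
    return remapped
```

Taking the environment as a parameter keeps both helpers testable without mutating `os.environ`.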
Comment on lines +90 to +95

    | `PIPELINE_DATASET` | File |
    |--------------------|------|
    | `sqlite` | `test/questions_sqlite.json` |
    | `snowflake` | `test/questions_snowflake.json` |
    | `bird` | `test/questions_bird.json` |
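The dataset-to-file mapping in this table could be resolved in code roughly as below. The helper name `questions_file_for` and its error message are illustrative assumptions; the mapping itself is copied from the table. Note that the file summary lists the Snowflake set as `test/questions_snowflake.jsonl`, so the `.json` entry may need adjusting.

```python
from pathlib import Path

# Mapping copied from the documentation table; extensions are as documented.
QUESTION_FILES = {
    "sqlite": "test/questions_sqlite.json",
    "snowflake": "test/questions_snowflake.json",
    "bird": "test/questions_bird.json",
}


def questions_file_for(dataset: str, repo_root: Path = Path(".")) -> Path:
    """Map a PIPELINE_DATASET value to its questions file path."""
    try:
        relative = QUESTION_FILES[dataset.strip().lower()]
    except KeyError:
        known = ", ".join(sorted(QUESTION_FILES))
        raise ValueError(
            f"unknown PIPELINE_DATASET {dataset!r}; expected one of: {known}"
        ) from None
    return repo_root / relative
```

Failing fast on an unknown dataset name avoids starting a run that would only break once the questions file is opened.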
    │
    ├── 1. Start iQ-FastAPI (query generation backend)
    ├── 2. Run MCP client → generate SQL++ for each question
    ├── 3. Copy .sqlpp files to submission/
Comment on lines +1 to +16

    {"instance_id": "sf_bq029", "db": "PATENTS", "question": "Get the average number of inventors per patent and the total count of patent publications in Canada (CA) for each 5-year period from 1960 to 2020, based on publication dates. Only include patents that have at least one inventor listed, and group results by 5-year intervals (1960-1964, 1965-1969, etc.).", "external_knowledge": null}
    {"instance_id": "sf_bq026", "db": "PATENTS", "question": "For the assignee who has been the most active in the patent category 'A61', I'd like to know the five patent jurisdictions code where they filed the most patents during their busiest year, separated by commas.", "external_knowledge": null}
    {"instance_id": "sf_bq091", "db": "PATENTS", "question": "In which year did the assignee with the most applications in the patent category 'A61' file the most?", "external_knowledge": null}
    {"instance_id": "sf_bq099", "db": "PATENTS", "question": "For patent class A01B3, I want to analyze the information of the top 3 assignees based on the total number of applications. Please provide the following five pieces of information: the name of this assignee, total number of applications, the year with the most applications, the number of applications in that year, and the country code with the most applications during that year.", "external_knowledge": null}
    {"instance_id": "sf_bq033", "db": "PATENTS", "question": "How many U.S. publications related to IoT (where the abstract includes the phrase 'internet of things') were filed each month from 2008 to 2022, including months with no filings?", "external_knowledge": null}
    {"instance_id": "sf_bq209", "db": "PATENTS", "question": "Can you calculate the number of utility patents that were granted in 2010 and have exactly one forward citation within a 10-year window following their application/filing date? For this analysis, forward citations should be counted as distinct citing application numbers that cited the patent within 10 years after the patent's own filing date.", "external_knowledge": null}
    {"instance_id": "sf_bq027", "db": "PATENTS", "question": "For patents granted between 2010 and 2018, provide the publication number of each patent and the number of backward citations it has received in the SEA category.", "external_knowledge": null}
    {"instance_id": "sf_bq210", "db": "PATENTS", "question": "How many US B2 patents granted between 2008 and 2018 contain claims that do not include the word 'claim'?", "external_knowledge": null}
    {"instance_id": "sf_bq211", "db": "PATENTS", "question": "Among patents granted between 2010 and 2023 in CN, how many of them belong to families that have a total of over one distinct applications?", "external_knowledge": null}
    {"instance_id": "sf_bq213", "db": "PATENTS", "question": "What is the most common 4-digit IPC code among US B2 utility patents granted from June to August in 2022?", "external_knowledge": "patents_info.md"}
    {"instance_id": "sf_bq212", "db": "PATENTS", "question": "For United States utility patents under the B2 classification granted between June and September of 2022, identify the most frequent 4-digit IPC code for each patent. Then, list the publication numbers and IPC4 codes of patents where this code appears 10 or more times.", "external_knowledge": "patents_info.md"}
    {"instance_id": "sf_bq214", "db": "PATENTS_GOOGLE", "question": "For United States utility patents under the B2 classification granted between 2010 and 2014, find the one with the most forward citations within a month of its filing date, and identify the most similar patent from the same filing year, regardless of its type.", "external_knowledge": "patents_info.md"}
    {"instance_id": "sf_bq216", "db": "PATENTS_GOOGLE", "question": "Identify the top five patents filed in the same year as `US-9741766-B2` that are most similar to it based on technological similarities. Please provide the publication numbers.", "external_knowledge": "patents_info.md"}
    {"instance_id": "sf_bq247", "db": "PATENTS_GOOGLE", "question": "From the publications dataset, first identify the top six families with the most publications whose family_id is not '-1'. Then, using the abs_and_emb table (joined on publication_number), provide each of those families\u2019 IDs alongside every non-empty abstract associated with their publications.", "external_knowledge": null}
    {"instance_id": "sf_bq127", "db": "PATENTS_GOOGLE", "question": "For each publication family whose earliest publication was first published in January 2015, please provide the earliest publication date, the distinct publication numbers, their country codes, the distinct CPC and IPC codes, distinct families (namely, the ids) that cite and are cited by this publication family. Please present all lists as comma-separated values, sorted alphabetically", "external_knowledge": null}
    {"instance_id": "sf_bq215", "db": "PATENTS", "question": "Which US patent (with a B2 kind code and a grant date between 2015 and 2018) has the highest originality score calculated as 1 - (the sum of squared occurrences of distinct 4-digit IPC codes in its backward citations divided by the square of the total occurrences of these 4-digit IPC codes)?", "external_knowledge": "patents_info.md"}
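Because each line of this file is a standalone JSON object, a reader has to parse it line by line rather than with a single `json.load`. The sketch below shows one way to do that with an optional limit; `iter_questions` is a hypothetical helper, not a function from `test/run.py`, and treating a limit of zero or less as "no limit" mirrors the `[[ "$LIMIT" -gt 0 ]]` check in `run_mcp.sh`.

```python
import json
from typing import Any, Iterable, Iterator


def iter_questions(lines: Iterable[str], limit: int = 0) -> Iterator[dict[str, Any]]:
    """Yield one question dict per non-blank JSONL line; limit <= 0 means no limit."""
    count = 0
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
        yield record
        count += 1
        if limit > 0 and count >= limit:
            return
```

An open file handle can be passed directly as `lines`, for example `iter_questions(open(path, encoding="utf-8"), limit=5)`.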
    },
    {
        "key": "PIPELINE_TAG",
        "description": "Run name suffix appended to the runs/ folder name (e.g. v2 \u2192 runs/mcp_v2/)",
This pull request introduces a pipeline for end-to-end testing and evaluation of the Couchbase MCP server with SQL++ query generation and analysis. It adds a Makefile-based workflow, a structured configuration system, and robust automation for setup, execution, and logging. The changes focus on making the pipeline easy to configure, run, and debug, while supporting multiple datasets and environments.