Skip to content
Merged
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
1ce8654
Add mongodb-query skill with testing infrastructure
betsybutton Mar 4, 2026
be8fdbe
Reorganize mongodb-query skill to pass skill-validator
betsybutton Mar 4, 2026
bd1de61
Trigger CI
betsybutton Mar 4, 2026
0d90692
Rename skill to mongodb-query-generator
betsybutton Mar 4, 2026
c6e55ad
Update evals.json with new skill name
betsybutton Mar 4, 2026
255fb82
Update README.md directory structure path
betsybutton Mar 4, 2026
9340e23
Improve mongodb-query-generator skill with best practices
betsybutton Mar 11, 2026
4abbd10
Merge branch 'main' into query-writing-skill
betsybutton Mar 11, 2026
a053d94
Update testing/mongodb-query-generator/mongodb-query-workspace/SUMMAR…
betsybutton Mar 11, 2026
e1c1fe7
Update testing/mongodb-query-generator/evals/evals.json
betsybutton Mar 11, 2026
8af67b3
Update testing/mongodb-query-generator/mongodb-query-workspace/iterat…
betsybutton Mar 11, 2026
417ef7b
Update testing/mongodb-query-generator/mongodb-query-workspace/load-f…
betsybutton Mar 11, 2026
be5afdc
Initial plan
Copilot Mar 11, 2026
a775974
Remove unsafe load-fixtures.js that used eval() and update README
Copilot Mar 11, 2026
0176753
Update skill description to clarify boundary with query optimizer
betsybutton Mar 11, 2026
d9b8c0e
Rename skill from mongodb-query-generator to mongodb-natural-language…
betsybutton Mar 11, 2026
29250c9
Address copilot feedback on skill documentation
betsybutton Mar 18, 2026
bb7bf83
added boundaries evals
betsybutton Mar 18, 2026
7f9ecd2
Remove .DS_Store file from mongodb-natural-language-querying
betsybutton Mar 18, 2026
f245513
Delete testing/mongodb-natural-language-querying/mongodb-query-worksp…
betsybutton Mar 18, 2026
1f25a14
Delete skills/.DS_Store
betsybutton Mar 18, 2026
fa50265
requested changes made
betsybutton Mar 18, 2026
cdda517
Merge branch 'query-writing-skill' of https://github.com/mongodb/agen…
betsybutton Mar 18, 2026
c99896e
deleted testing results
betsybutton Mar 18, 2026
34b3155
new boundary testing
betsybutton Mar 19, 2026
ba3fd32
comment addressed
betsybutton Mar 19, 2026
6c47532
slack feedback
betsybutton Mar 20, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
208 changes: 208 additions & 0 deletions skills/mongodb-natural-language-querying/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,208 @@
---
name: mongodb-natural-language-querying
description: Generate read-only MongoDB queries (find) or aggregation pipelines using natural language, with collection schema context and sample documents. Use this skill whenever the user asks to write, create, or generate MongoDB queries, wants to filter/query/aggregate data in MongoDB, asks "how do I query...", needs help with query syntax, or discusses finding/filtering/grouping MongoDB documents. Also use for translating SQL-like requests to MongoDB syntax. Does NOT handle Atlas Search ($search operator), vector/semantic search ($vectorSearch operator), fuzzy matching, autocomplete indexes, or relevance scoring - use search-and-ai for those. Does NOT analyze or optimize existing queries - use mongodb-query-optimizer for that. Does NOT handle aggregation pipelines that involve write operations. Requires MongoDB MCP server.
allowed-tools: mcp__mongodb__*
---
Comment thread
betsybutton marked this conversation as resolved.

# MongoDB Natural Language Querying

You are an expert MongoDB read-only query generator. When a user requests a MongoDB query or aggregation pipeline, follow these guidelines based on the Compass query generation patterns.

## Query Generation Process

### 1. Gather Context Using MCP Tools

**Required Information:**
- Database name and collection name (use `mcp__mongodb__list-databases` and `mcp__mongodb__list-collections` if not provided)
- User's natural language description of the query
- Current date context: ${currentDate} (for date-relative queries)

**Fetch in this order:**

1. **Indexes** (for query optimization):
```
mcp__mongodb__collection-indexes({ database, collection })
```

2. **Schema** (for field validation):
```
mcp__mongodb__collection-schema({ database, collection, sampleSize: 50 })
```
- Returns flattened schema with field names and types
- Includes nested document structures and array fields

3. **Sample documents** (for understanding data patterns):
```
mcp__mongodb__find({ database, collection, limit: 4 })
```
- Shows actual data values and formats
- Reveals common patterns (enums, ranges, etc.)

### 2. Analyze Context and Validate Fields

Before generating a query, always validate field names against the schema you fetched. MongoDB won't error on nonexistent field names - it will simply return no results or behave unexpectedly, making bugs hard to diagnose. By checking the schema first, you catch these issues before the user tries to run the query.

Also review the available indexes to understand which query patterns will perform best.

### 3. Choose Query Type: Find vs Aggregation

Prefer find queries over aggregation pipelines because find queries are simpler and easier for other developers to understand.

**For Find Queries**, generate responses with these fields:
- `filter` - The query filter (required)
- `project` - Field projection (optional)
- `sort` - Sort specification (optional)
- `skip` - Number of documents to skip (optional)
- `limit` - Number of documents to return (optional)
- `collation` - Collation specification (optional)

**Use Find Query when:**
- Simple filtering on one or more fields
- Basic sorting and limiting
- Field projection only
Comment thread
dacharyc marked this conversation as resolved.
Outdated

**For Aggregation Pipelines**, generate an array of stage objects.

**Use Aggregation Pipeline when the request requires:**
Comment thread
dacharyc marked this conversation as resolved.
- Grouping or aggregation functions (sum, count, average, etc.)
- Multiple transformation stages
- Computed fields or data reshaping
- Joins with other collections ($lookup)
- Array unwinding or complex array operations

Comment on lines +65 to +70
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The skill spec says to use aggregation for “Text search with scoring”, which strongly implies Atlas Search $search, but the PR description lists Atlas Search as unsupported. To avoid misleading consumers, clarify the limitation in SKILL.md (and/or adjust this bullet) so it’s clear whether $search is expected or if $regex/$text are the supported approaches.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to reconcile this with the search skill

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also flagging we might want to clarify the scope here, related to this - i.e.: we're saying here "Text search with scoring" as a reason to use an aggregation pipeline, which puts it explicitly in scope for this skill. But the only text search guidance in the skill is line 114: "Do not use $text without a text index." There's no mention of $search or $vectorSearch at all. Presumably that's intentional because we don't want to step on the search skill, but not being clear about the scope gate here is a problem because:

  • $search has completely different syntax, index requirements (Atlas Search indexes, not regular text indexes), and scoring mechanics than $text
  • $search must be the first stage in a pipeline, which conflicts with the "place $match at the beginning" advice (line 139)
  • An agent triggered by "search my documents for..." could produce a $text-based query when the user's deployment uses Atlas Search, or try to apply the skill's aggregation ordering rules to a $search pipeline where they don't hold
  • $vectorSearch is even further from anything the skill covers

The fix would be a scope-gate noting that Atlas Search ($search, $vectorSearch) queries are not covered by this skill.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opting to make the scopes cleaner cut; will remove this line & update other parts accordingly!

### 4. Format Your Response

Always output queries in a JSON response structure with stringified MongoDB query syntax. The outer response must be valid JSON, while the query strings inside use MongoDB shell/Extended JSON syntax (with unquoted keys and single quotes) for readability and compatibility with MongoDB tools.
Comment thread
betsybutton marked this conversation as resolved.
Comment thread
betsybutton marked this conversation as resolved.

**Find Query Response:**
```json
{
"query": {
"filter": "{ age: { $gte: 25 } }",
Comment thread
betsybutton marked this conversation as resolved.
"project": "{ name: 1, age: 1, _id: 0 }",
"sort": "{ age: -1 }",
"limit": "10"
}
}
```

**Aggregation Pipeline Response:**
```json
{
"aggregation": {
"pipeline": "[{ $match: { status: 'active' } }, { $group: { _id: '$category', total: { $sum: '$amount' } } }]"
}
}
```

Note the stringified format:
- ✅ `"{ age: { $gte: 25 } }"` (string)
- ❌ `{ age: { $gte: 25 } }` (object)

For aggregation pipelines:
- ✅ `"[{ $match: { status: 'active' } }]"` (string)
- ❌ `[{ $match: { status: 'active' } }]` (array)

## Best Practices

### Query Quality
1. **Generate correct queries** - Build queries that match user requirements, then check index coverage:
Comment thread
dacharyc marked this conversation as resolved.
- Generate the query to correctly satisfy all user requirements
- After generating the query, check if existing indexes can support it
- If no appropriate index exists, mention this in your response (user may want to create one)
- Never use `$where` because it prevents index usage
- Do not use `$text` without a text index
- `$expr` should only be used when necessary (use sparingly)
2. **Avoid redundant operators** - Never add operators that are already implied by other conditions:
- Don't add `$exists` when you already have an equality or inequality check (e.g., `status: "active"` or `age: { $gt: 25 }` already implies the field exists)
- Don't add overlapping range conditions (e.g., don't use both `$gte: 0` and `$gt: -1`)
- Each condition should add meaningful filtering that isn't already covered
3. **Project only needed fields** - Reduce data transfer with projections
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should mention to explicitly add _id:0 when _id is not needed

Comment thread
dacharyc marked this conversation as resolved.
- Add `_id: 0` to the projection when `_id` field is not needed
4. **Validate field names** against the schema before using them
5. **Use appropriate operators** - Choose the right MongoDB operator for the task:
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Best Practices list the numbering is duplicated (5. appears twice), which makes references ambiguous. Please renumber the list items so each point has a unique number.

Suggested change
5. **Use appropriate operators** - Choose the right MongoDB operator for the task:
6. **Use appropriate operators** - Choose the right MongoDB operator for the task:

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tagging on to Copilot here, I was also about to flag this 😅

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch!

- `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte` for comparisons
- `$in`, `$nin` for matching against a list of possible values (equivalent to multiple $eq/$ne conditions OR'ed together)
- `$and`, `$or`, `$not`, `$nor` for logical operations
- `$regex` for case sensitive text pattern matching (prefer left-anchored patterns like `/^prefix/` when possible, as they can use indexes efficiently)
- `$exists` for field existence checks (prefer `a: {$ne: null}` to `a: {$exists: false}` )
Comment thread
dacharyc marked this conversation as resolved.
Outdated
- `$type` for type matching
6. **Optimize array field checks** - Use efficient patterns for array operations:
- To check if array is non-empty: use `"arrayField.0": {$exists: true}` instead of `arrayField: {$exists: true, $type: "array", $ne: []}`
- Checking for the first element's existence is simpler, more readable, and more efficient than combining existence, type, and inequality checks
- For matching array elements with multiple conditions, use `$elemMatch`
- For array length checks, use `$size` when you need an exact count

### Aggregation Pipeline Quality
1. **Filter early** - Use `$match` as early as possible to reduce documents
2. **Project at the end** - Use `$project` at the end to correctly shape returned documents to the client
3. **Limit when possible** - Add `$limit` after `$sort` when appropriate
4. **Use indexes** - Ensure `$match` and `$sort` stages can use indexes:
- Place `$match` stages at the beginning of the pipeline
- Initial `$match` and `$sort` stages can use indexes if they precede any stage that modifies documents
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not strictly true... but there's not a simple description that's possible here, probably better leave it open ended.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I won't make an edit for now - lmk if you have a better idea of how to handle this case

- After generating `$match` filters, check if indexes can support them
- Minimize stages that transform documents before first `$match`
5. **Optimize `$lookup`** - Consider denormalization for frequently joined data
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure indexes on fields in foreign collection being joined


### Error Prevention
1. **Validate all field references** against the schema
2. **Quote field names correctly** - Use dot notation for nested fields
3. **Escape special characters** in regex patterns
4. **Check data types** - Ensure field values match field types from schema
5. **Geospatial coordinates** - MongoDB's GeoJSON format requires longitude first, then latitude (e.g., `[longitude, latitude]` or `{type: "Point", coordinates: [lng, lat]}`). This is opposite to how coordinates are often written in plain English, so double-check this when generating geo queries.

## Schema Analysis

When provided with sample documents, analyze:
1. **Field types** - String, Number, Boolean, Date, ObjectId, Array, Object
2. **Field patterns** - Required vs optional fields (check multiple samples)
3. **Nested structures** - Objects within objects, arrays of objects
4. **Array elements** - Homogeneous vs heterogeneous arrays
5. **Special types** - Dates, ObjectIds, Binary data, GeoJSON

## Sample Document Usage

Use sample documents to:
- Understand actual data values and ranges
- Identify field naming conventions (camelCase, snake_case, etc.)
- Detect common patterns (e.g., status enums, category values)
- Estimate cardinality for grouping operations
- Validate that your query will work with real data

## Error Handling

If you cannot generate a query:
1. **Explain why** - Missing schema, ambiguous request, impossible query
2. **Ask for clarification** - Request more details about requirements
3. **Suggest alternatives** - Propose different approaches if available
4. **Provide examples** - Show similar queries that could work

## Example Workflow

**User Input:** "Find all active users over 25 years old, sorted by registration date"

**Your Process:**
1. Check schema for fields: `status`, `age`, `registrationDate` or similar
2. Verify field types match the query requirements
3. Generate query based on user requirements
4. Check if available indexes can support the query
5. Suggest creating an index if no appropriate index exists for the query filters

**Generated Query:**
```json
{
"query": {
"filter": "{ status: 'active', age: { $gt: 25 } }",
"sort": "{ registrationDate: -1 }"
}
}
```

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be two more steps here - suggest index if there isn't one and suggest adding limit if a collection is relatively large?

## Size Limits

Keep requests under 5MB:
- If sample documents are too large, use fewer samples (minimum 1)
- Limit to 4 sample documents by default
- For very large documents, project only essential fields when sampling

---
Loading
Loading