
Commit 7a1d69c

Merge pull request #143 from tgonzalezc5/feat/exa-search-skill
feat: add Exa AI-powered search tool
2 parents 37a148b + 8e6540e commit 7a1d69c

7 files changed

Lines changed: 781 additions & 0 deletions


docs/scientific-skills.md

Lines changed: 1 addition & 0 deletions
@@ -207,4 +207,5 @@
 - **What-If Oracle** - Run structured What-If scenario analysis with multi-branch possibility exploration. Use for speculative questions, scenario planning, risk analysis, contingency planning, strategic options evaluation, stress-testing ideas, and thinking through consequences before committing

 ### Web Search & Information Retrieval
+- **Exa Search** - Web search and URL content extraction via the Exa API. Use for high-quality web search tuned for scientific and technical content, scholarly filtering via `category="research paper"` plus academic domain allowlists, and batch URL extraction
 - **Parallel Web** - Search the web, extract URL content, and run deep research using the Parallel Chat API and Extract API. Use for web searches, research queries, and general information gathering with synthesized summaries and citations
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
---
name: exa-search
description: "Web toolkit powered by Exa, tuned for scientific and technical content. Use this skill when the user needs to search the web or fetch/extract URL content. Covers: web search (semantic lookups, research, current info — with optional research-paper category and academic domain filtering) and URL extraction (fetching pages, articles, academic PDFs in batch). Use this skill for web-related tasks when the user wants high-quality search or scholarly filtering via category=research paper. Triggers on requests to search, look up, fetch a page, or extract an article."
compatibility: Requires exa-py Python SDK, an EXA_API_KEY, and internet access.
license: MIT
metadata:
  skill-author: Exa
  website: https://exa.ai
  docs: https://exa.ai/docs
---

# Exa Web Toolkit

A skill for web-powered research tasks backed by [Exa](https://exa.ai): web search and URL extraction. Exa's index combines high-quality keyword and semantic retrieval, which makes it well-suited to scientific, technical, and conceptual queries.

## Routing — pick the right capability

Read the user's request and match it to one of the capabilities below. Read the corresponding reference file for detailed instructions before running commands.

| User wants to... | Capability | Where |
|---|---|---|
| Look something up, research a topic, find current info | **Web Search** | `references/web-search.md` |
| Fetch content from a specific URL (webpage, article, PDF) | **Web Extract** | `references/web-extract.md` |
| Install or authenticate | **Setup** | Below |

### Decision guide

- **Default to Web Search** for topic lookups, research questions, or "what is X?" queries. When the topic is scientific or technical, pass `--category "research paper"` to bias toward scholarly sources, add an academic `--include-domains` allowlist, or do both. See `references/web-search.md` for the two-pass academic strategy.
- **Use Web Extract** when the user provides a URL or asks you to read/fetch a specific page. Prefer this over the built-in WebFetch for batch extraction (multiple URLs in one call) and for academic PDFs.

### Academic source priority

For technical or scientific queries, prefer academic and scientific sources:
- Peer-reviewed journal articles and conference proceedings over blog posts or news
- Preprints (arXiv, bioRxiv, medRxiv) when peer-reviewed versions aren't available
- Institutional and government sources (NIH, WHO, NASA, NIST) over commercial sites
- Primary research over secondary summaries

Two levers to steer Exa toward scholarly content:
1. `--category "research paper"` biases retrieval toward scholarly sources.
2. `--include-domains` with a scholarly allowlist (arxiv.org, nature.com, pubmed.ncbi.nlm.nih.gov, etc.) restricts the domain pool.

Combine both for strictly academic results. See `references/web-search.md` for the full pattern.

When citing academic sources, include author names and publication year where available (e.g., [Smith et al., 2025](url)) in addition to the standard citation format. If a DOI is present, prefer the DOI link.

---

## Setup

This skill uses the [`exa-py`](https://github.com/exa-labs/exa-py) Python SDK. The scripts in `scripts/` declare their dependencies via PEP 723 inline metadata, so you can run them directly with `uv run` without a separate install step:

```bash
uv run --with exa-py python "$SKILL_PATH/scripts/exa_search.py" --help
```

If you prefer a persistent install:

```bash
uv pip install "exa-py>=1.14.0"
```

### Authentication

All commands read the API key from the `EXA_API_KEY` environment variable. Get your Exa API key at [dashboard.exa.ai/api-keys](https://dashboard.exa.ai/api-keys).

First, check if a `.env` file exists in the project root and contains `EXA_API_KEY`. If so, load it:

```bash
dotenv -f .env run -- uv run --with exa-py python "$SKILL_PATH/scripts/exa_search.py" "your query"
```

If `dotenv` isn't available, install it: `pip install "python-dotenv[cli]"` or `uv pip install "python-dotenv[cli]"`. The quotes keep shells such as zsh from treating the brackets as a glob pattern.

If there's no `.env`, export the key for the session:

```bash
export EXA_API_KEY="your-key"
```

Verify the setup by running any script with `--help`. It exits cleanly once the SDK is importable; the API key is checked only when a real query is made.
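The "check the environment first, then `.env`" step above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `find_exa_key` is a hypothetical helper that is not part of the skill, and it does a simple line scan rather than full dotenv parsing.

```python
from __future__ import annotations

import os
from pathlib import Path


def find_exa_key(env_file: Path, environ: dict[str, str]) -> str | None:
    """Report where EXA_API_KEY is defined: "environment", "dotenv", or None.

    Hypothetical helper for illustration; it never returns the key itself.
    """
    if environ.get("EXA_API_KEY"):
        return "environment"
    if env_file.is_file():
        for raw in env_file.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Accept both `EXA_API_KEY=...` and `export EXA_API_KEY=...`.
            if line.startswith("export "):
                line = line[len("export "):].lstrip()
            name, sep, value = line.partition("=")
            # Commented-out lines fail the name check and are skipped.
            if sep and name.strip() == "EXA_API_KEY" and value.strip().strip("'\""):
                return "dotenv"
    return None


if __name__ == "__main__":
    print(find_exa_key(Path(".env"), dict(os.environ)))
```

A result of `"dotenv"` suggests using the `dotenv -f .env run -- ...` form shown earlier; `None` means the key still needs to be exported.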

### Tracking header

Every script in this skill sets the `x-exa-integration` request header to `k-dense-ai--scientific-agent-skills` so Exa can attribute usage from the K-Dense AI scientific-agent-skills repo to this integration. Do not remove or rename this header when adapting the scripts.

---

## Files in this skill

- `SKILL.md` — this file (routing and setup)
- `references/web-search.md` — detailed web search reference with academic strategy
- `references/web-extract.md` — URL content extraction reference
- `scripts/exa_search.py` — CLI wrapper around `client.search_and_contents`
- `scripts/exa_extract.py` — CLI wrapper around `client.get_contents`
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# URL Extraction

Extract content from: $ARGUMENTS

## Command

Choose a short, descriptive filename based on the URL or content (e.g., `alphafold-paper`, `nature-editorial`). Use lowercase with hyphens, no spaces.

```bash
uv run --with exa-py python "$SKILL_PATH/scripts/exa_extract.py" "$ARGUMENTS" \
  --text \
  -o "$FILENAME.json"
```

You can pass multiple URLs as positional arguments — the script batches them in a single `/contents` call, which is faster and cheaper than looping.

Content modes:

- `--text` (default if nothing else is passed) returns full-text content
- `--highlights` returns extracted passages instead of full text

## Academic content handling

When extracting from academic sources (arXiv, PubMed, journal sites, conference proceedings), use `--text` to get the full paper text:

```bash
uv run --with exa-py python "$SKILL_PATH/scripts/exa_extract.py" "$URL" \
  --text \
  -o "$FILENAME.json"
```

For arXiv, either the `/abs/` page URL or the raw PDF URL works. Prefer `/abs/` when available — it has cleaner metadata (title, authors, published date) attached to the result.
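As a rough sketch of the `/abs/` preference, a PDF link can be rewritten before it is passed to the extractor. `prefer_arxiv_abs` is a hypothetical helper, not part of the skill's scripts, and it assumes the usual `arxiv.org/pdf/<id>` URL shape with an optional `.pdf` suffix.

```python
import re

# Matches arXiv PDF links such as https://arxiv.org/pdf/2401.04088v2.pdf
_ARXIV_PDF = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/pdf/(?P<id>.+?)(?:\.pdf)?/?$"
)


def prefer_arxiv_abs(url: str) -> str:
    """Rewrite an arXiv PDF URL to its /abs/ page; leave other URLs untouched."""
    match = _ARXIV_PDF.match(url.strip())
    if match is None:
        return url
    return f"https://arxiv.org/abs/{match.group('id')}"
```

URLs that do not match, including existing `/abs/` links and non-arXiv pages, pass through unchanged, so the helper can be applied to a whole batch.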

## Response format

Return content as:

**[Page Title](URL)**

For academic papers, include structured metadata when available:
- **Authors:** list of authors (from the `author` field)
- **Published:** from `published_date`

Then the extracted content, with these rules:
- Keep content verbatim — do not paraphrase or summarize
- Parse lists exhaustively — extract EVERY numbered/bulleted item
- Strip only obvious noise: nav menus, footers, ads
- Preserve all facts, names, numbers, dates, quotes
- For academic papers, preserve figure/table captions and references

**Partial-result handling** — when batching multiple URLs, one or more may fail (paywall, robots.txt, timeout). Report which URLs extracted successfully and which failed, rather than silently dropping failures.

After the response, mention the output file path (`$FILENAME.json`) so the user knows it's available for follow-up questions.
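The partial-result check described above can be sketched against the JSON that `exa_extract.py` writes, which has a top-level `urls` list and a `results` list. `split_extraction` is a hypothetical helper; it assumes a result's `url` matches the requested URL apart from a trailing slash, which may not hold if the API normalizes or redirects URLs.

```python
from typing import Any


def _norm(url: str) -> str:
    """Normalize a URL just enough to ignore a trailing slash."""
    return url.strip().rstrip("/")


def split_extraction(payload: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (succeeded, failed) URL lists for an exa_extract.py payload."""
    # A URL counts as extracted only if its result carries some content.
    with_content = {
        _norm(result.get("url") or "")
        for result in payload.get("results", [])
        if result.get("text") or result.get("highlights")
    }
    succeeded: list[str] = []
    failed: list[str] = []
    for url in payload.get("urls", []):
        (succeeded if _norm(url) in with_content else failed).append(url)
    return succeeded, failed
```

Listing both groups in the response keeps failures visible instead of silently dropping them.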
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Web Search

Search the web for: $ARGUMENTS

## Command

Choose a short, descriptive filename based on the query (e.g., `ai-chip-news`, `crispr-off-target`). Use lowercase with hyphens, no spaces.

```bash
uv run --with exa-py python "$SKILL_PATH/scripts/exa_search.py" "$ARGUMENTS" \
  --text --highlights \
  -o "$FILENAME.json"
```

`$SKILL_PATH` is the path to this skill directory. The `-o` flag saves the full results to a JSON file so follow-up questions can reuse them without re-querying.
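The filename convention above (lowercase, hyphens, no spaces) can be sketched as a small slug function. `slug_filename` and its `max_words` cutoff are illustrative assumptions, not something the skill defines.

```python
import re


def slug_filename(query: str, max_words: int = 4) -> str:
    """Build a short lowercase, hyphen-separated stem from a query string."""
    # Keep only ASCII letters and digits; everything else separates words.
    words = re.findall(r"[a-z0-9]+", query.lower())
    # Fall back to a generic stem if the query has no usable characters.
    return "-".join(words[:max_words]) or "results"
```

The returned stem would stand in for `$FILENAME` in the command, with `.json` appended by the `-o` argument.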

**Search type selection.** `--type` controls the retrieval mode:

| Mode | When to use |
|---|---|
| `auto` (default) | Exa's general-purpose search. Use this unless you have a reason not to. |
| `fast` | Lowest latency. Use for simple lookups where speed matters more than nuance. |
| `deep` | Slowest but highest quality. Use for hard, conceptual, or exhaustive research queries where recall matters more than latency. |

**Content modes** — add any combination:

- `--text` returns full-text content per result
- `--highlights` returns the most relevant passages (good signal-to-noise, lower token cost than full text)

Default to `--highlights` for broad searches (cheaper, more skimmable). Add `--text` only when you need to quote or extract in detail.

**Filtering options** — Exa supports rich filtering via the SDK:

- `--start-published-date YYYY-MM-DD` / `--end-published-date YYYY-MM-DD` for time-sensitive queries
- `--include-domains domain1.com,domain2.com` to restrict to an allowlist
- `--exclude-domains spam.com,low-quality.com` to drop a blocklist
- `--category "research paper"` to bias toward scholarly content (also: `company`, `news`, `github`, `personal site`, `financial report`, `people`)
- `--user-location US` for locale-specific results
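To make the flag list concrete, here is one way the options could be gathered into keyword arguments for an SDK search call. This is a sketch, not the contents of `scripts/exa_search.py` (which is not shown here): `build_search_kwargs` is hypothetical, and the keyword names are assumed to mirror the CLI flags in snake_case, with `search_type` forwarded as `type`. `--user-location` is left out because its SDK spelling is not confirmed here.

```python
from __future__ import annotations

from typing import Any


def _split_csv(value: str | None) -> list[str]:
    """Turn "a.com, b.com" into ["a.com", "b.com"]; empty input gives []."""
    return [part.strip() for part in (value or "").split(",") if part.strip()]


def build_search_kwargs(
    *,
    search_type: str = "auto",
    category: str | None = None,
    include_domains: str | None = None,
    exclude_domains: str | None = None,
    start_published_date: str | None = None,
    end_published_date: str | None = None,
    text: bool = False,
    highlights: bool = False,
) -> dict[str, Any]:
    """Collect only the options the caller set (assumed keyword names)."""
    kwargs: dict[str, Any] = {"type": search_type}
    optional = {
        "category": category,
        "include_domains": _split_csv(include_domains),
        "exclude_domains": _split_csv(exclude_domains),
        "start_published_date": start_published_date,
        "end_published_date": end_published_date,
        "text": text,
        "highlights": highlights,
    }
    # Drop unset values (None, False, empty list) so API defaults apply.
    kwargs.update({key: value for key, value in optional.items() if value})
    return kwargs
```

Under those assumptions, the resulting dictionary could be unpacked into a call such as `client.search_and_contents(query, **kwargs)`.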

## Academic source strategy

For scientific or technical queries, Exa has two strong levers:

### 1. Use `--category "research paper"`

```bash
uv run --with exa-py python "$SKILL_PATH/scripts/exa_search.py" "$ARGUMENTS" \
  --category "research paper" \
  --text --highlights \
  -o "$FILENAME-academic.json"
```

This biases retrieval toward papers indexed as scholarly content (journals, preprint servers, conference proceedings) rather than blogs or news coverage.

### 2. Restrict to scholarly domains

For stricter academic filtering, combine the category with an explicit domain allowlist:

```bash
uv run --with exa-py python "$SKILL_PATH/scripts/exa_search.py" "$ARGUMENTS" \
  --category "research paper" \
  --include-domains "arxiv.org,biorxiv.org,medrxiv.org,pubmed.ncbi.nlm.nih.gov,nature.com,science.org" \
  --text --highlights \
  -o "$FILENAME-academic.json"
```

### Two-pass pattern for comprehensive coverage

Run **both** an academic-focused search and an unrestricted one, then merge with academic sources first:

1. Academic pass: `--category "research paper"` with the scholarly domain allowlist above.
2. General pass: the standard command without `--category` or `--include-domains`, to catch relevant non-academic sources (news coverage, lab blogs, institutional pages).

Merge results, leading with academic sources. If the query is clearly non-scientific, skip the academic pass.

**When to use the two-pass pattern:** Any query involving scientific claims, medical information, research findings, technical mechanisms, statistical data, or anything where primary literature would be more reliable than secondary reporting.
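The merge step of the two-pass pattern can be sketched as an order-preserving, URL-deduplicating concatenation. `merge_passes` and the added `pass` label are illustrative assumptions; the inputs are taken to be the `results` lists loaded from the two JSON output files.

```python
from typing import Any


def merge_passes(
    academic: list[dict[str, Any]],
    general: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Merge two result lists, academic first, keeping one entry per URL."""
    merged: list[dict[str, Any]] = []
    seen: set[str] = set()
    for label, results in (("academic", academic), ("general", general)):
        for result in results:
            url = result.get("url")
            # Skip results with no URL or one already taken from an earlier pass.
            if not url or url in seen:
                continue
            seen.add(url)
            merged.append({**result, "pass": label})
    return merged
```

Because the academic list is processed first, a paper found by both passes keeps its academic label and position.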

## Parsing results

Parse the JSON output. Each result includes:

- `title`, `url`, `published_date`, `author`
- `score` — Exa's relevance score for the query
- `text` (if `--text`), `highlights` + `highlight_scores` (if `--highlights`)

**Snippet fallback** — any combination of content fields may be present. Cascade through them: prefer `highlights` (tight, pre-selected passages), fall back to a truncated slice of `text`. Never assume exactly one is present.
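The cascade described above can be sketched as follows. `pick_snippet` and its `max_chars` default are hypothetical choices for illustration; the field names `highlights` and `text` are the ones listed in the parsed JSON.

```python
from typing import Any


def pick_snippet(result: dict[str, Any], max_chars: int = 500) -> str:
    """Prefer highlights; otherwise fall back to a truncated slice of text."""
    highlights = [h.strip() for h in (result.get("highlights") or []) if h and h.strip()]
    if highlights:
        return " ... ".join(highlights)
    text = (result.get("text") or "").strip()
    if len(text) <= max_chars:
        # Covers the case where neither field is present (empty string).
        return text
    return text[:max_chars].rstrip() + "..."
```

Returning an empty string when both fields are missing lets the caller decide whether to skip the result or cite it by title alone.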

## Response format

**CRITICAL: Every claim must have an inline citation.** Use markdown links pulling only from the JSON output. Never invent or guess URLs.

For academic sources, use author-year citation style where metadata is available:
- Academic: [Smith et al., 2025](url) or [Smith & Jones, 2024](url)
- Non-academic: [Source Title](url)

Synthesize a response that:
- Leads with findings from peer-reviewed or preprint sources when available
- Clearly distinguishes between claims backed by primary research vs. secondary reporting
- Includes specific facts, names, numbers, dates
- Cites every fact inline — do not leave any claim uncited
- Organizes by theme if multiple topics
- Notes the evidence quality (e.g., "a randomized controlled trial found..." vs. "a blog post reports...")

**End with a Sources section** listing every URL referenced, grouped by type:

```
Sources:

Academic / Peer-reviewed:
- [Smith et al., 2025 — Title of Paper](https://doi.org/...) (Nature, 2025)
- [Jones & Lee, 2024 — Title of Paper](https://arxiv.org/...) (arXiv preprint)

Other:
- [Source Title](https://example.com/article) (Feb 2026)
```

This Sources section is mandatory. Do not omit it. If no academic sources were found, note that and explain why (e.g., the topic is too recent, not yet studied, or inherently non-academic).

After the Sources section, mention the output file path (`$FILENAME.json`) so the user knows it's available for follow-up questions.
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = ["exa-py>=1.14.0"]
# ///
"""Fetch and extract content from URLs using Exa's /contents endpoint.

Example:
    uv run exa_extract.py \\
        https://arxiv.org/abs/2401.04088 \\
        https://www.nature.com/articles/s41586-024-07566-y \\
        --text \\
        -o extracted.json
"""
from __future__ import annotations

import argparse
import json
import os
import sys
from dataclasses import asdict, dataclass, field
from typing import Any

try:
    from exa_py import Exa
except ImportError:
    print(
        "exa_py not installed. Run: uv pip install exa-py (or invoke with: uv run --with exa-py)",
        file=sys.stderr,
    )
    sys.exit(2)


EXA_INTEGRATION_HEADER = "k-dense-ai--scientific-agent-skills"


@dataclass
class ExtractedDocument:
    """Typed view of a single extracted document for JSON export."""

    url: str
    id: str | None
    title: str | None
    author: str | None
    published_date: str | None
    text: str | None = None
    highlights: list[str] = field(default_factory=list)


def _build_contents(text: bool, highlights: bool) -> dict[str, Any]:
    contents: dict[str, Any] = {}
    if text:
        contents["text"] = True
    if highlights:
        contents["highlights"] = True
    if not contents:
        # Default to full text when the caller doesn't pick anything.
        contents["text"] = True
    return contents


def _to_typed(item: Any) -> ExtractedDocument:
    return ExtractedDocument(
        url=getattr(item, "url", ""),
        id=getattr(item, "id", None),
        title=getattr(item, "title", None),
        author=getattr(item, "author", None),
        published_date=getattr(item, "published_date", None),
        text=getattr(item, "text", None),
        highlights=list(getattr(item, "highlights", None) or []),
    )


def run(args: argparse.Namespace) -> dict[str, Any]:
    api_key = os.environ.get("EXA_API_KEY")
    if not api_key:
        print("EXA_API_KEY environment variable is not set.", file=sys.stderr)
        sys.exit(2)

    client = Exa(api_key=api_key)
    client.headers["x-exa-integration"] = EXA_INTEGRATION_HEADER

    contents = _build_contents(args.text, args.highlights)
    response = client.get_contents(urls=args.urls, **contents)

    typed = [_to_typed(item) for item in getattr(response, "results", []) or []]
    return {
        "urls": list(args.urls),
        "num_results": len(typed),
        "results": [asdict(doc) for doc in typed],
    }


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract content from URLs with Exa.")
    parser.add_argument("urls", nargs="+", help="One or more URLs to extract.")
    parser.add_argument("--text", action="store_true", help="Return full-text content.")
    parser.add_argument("--highlights", action="store_true", help="Return extracted highlight snippets.")
    parser.add_argument("-o", "--output", default=None, help="Write JSON to this file (default: stdout).")
    return parser


def main(argv: list[str] | None = None) -> int:
    args = build_parser().parse_args(argv)
    payload = run(args)
    text = json.dumps(payload, indent=2, ensure_ascii=False)
    if args.output:
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(text)
        print(f"Wrote {len(payload['results'])} documents to {args.output}")
    else:
        print(text)
    return 0


if __name__ == "__main__":
    sys.exit(main())
