Last updated: 2026-05-04 (Tier 2: automatic version discovery)
All phases complete. GHA workflow runs 13 sources (ES 8.3/8.4/8.5, Enterprise 10.0/10.2, Cloud 10.2.2510/10.3.2512, admin-manual 10.0/10.2, SOAR on-prem 8.4/8.5, SOAR Cloud, Lantern). `splunk-discover-versions` auto-detects current versions before each crawl and commits the updated `versions.json` back to the repo.
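The `discover.py` entry in this document notes that version discovery parses `<select id="version-select">` on help.splunk.com. A minimal sketch of that idea using only the standard library follows. The sample markup, the use of the `value` attribute on each `<option>`, and the `{"es": [...]}` output shape are assumptions for illustration; the real page and the real `versions.json` layout may differ.

```python
import json
from html.parser import HTMLParser


class VersionSelectParser(HTMLParser):
    """Collect <option value="..."> entries inside <select id="version-select">."""

    def __init__(self):
        super().__init__()
        self._in_select = False
        self.versions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == "version-select":
            self._in_select = True
        elif tag == "option" and self._in_select:
            value = attrs.get("value")
            if value:
                self.versions.append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self._in_select = False


def discover_versions(html: str) -> list[str]:
    """Return version strings in page order (hypothetical helper)."""
    parser = VersionSelectParser()
    parser.feed(html)
    return parser.versions


# Hypothetical page fragment, not copied from help.splunk.com.
SAMPLE = """
<select id="other"><option value="x">x</option></select>
<select id="version-select">
  <option value="8.5">8.5</option>
  <option value="8.4">8.4</option>
</select>
"""

versions = discover_versions(SAMPLE)
print(json.dumps({"es": versions}))  # assumed shape, for illustration only
```

Options in any other `<select>` are ignored, so unrelated dropdowns on the page do not leak into the version list.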
| File | Status | Notes |
|---|---|---|
| `pyproject.toml` | ✅ Done | Deps + entry points: splunk-mcp, splunk-crawl, splunk-setup, splunk-merge, splunk-discover-versions |
| `.gitignore` | ✅ Done | Includes merge temp patterns (`*.tmp`, `*.tmp-wal`, `*.tmp-shm`) |
| `.python-version` | ✅ Done | 3.12 |
| `src/splunk_docs_mcp/__init__.py` | ✅ Done | |
| `src/splunk_docs_mcp/config.py` | ✅ Done | 13 active sources; `_enterprise_source()` / `_cloud_source()` factories; `version_discovery_url` field; null-version filtering |
| `src/splunk_docs_mcp/discover.py` | ✅ Done | splunk-discover-versions CLI; parses `<select id="version-select">` on help.splunk.com; updates versions.json |
| `src/splunk_docs_mcp/db.py` | ✅ Done | Schema + all helpers; `content_md_hash`; `version_tags`; `run_version_merge_pass()`; `run_dedup_pass()` uses `content_md_hash`; version filter matches `json_each(version_tags)` |
| `src/splunk_docs_mcp/extractor.py` | ✅ Done | |
| `src/splunk_docs_mcp/server.py` | ✅ Done | 6 tools; `version=` filter on `search_docs` + `search_docs_semantic`; source instructions |
| `src/splunk_docs_mcp/cli.py` | ✅ Done | `--delay-jitter`; `_dedup_pass()`; exit 1 only if failure rate >5% |
| `src/splunk_docs_mcp/crawler.py` | ✅ Done | Retry pass after BFS; failed URLs excluded from visited set; auth-redirect detection (4xx after off-domain redirect → skipped, not failed) |
| `src/splunk_docs_mcp/merge.py` | ✅ Done | `merge_dbs()`, `export_sources()`, splunk-merge CLI |
| `src/splunk_docs_mcp/setup.py` | ✅ Done | Grouped hierarchical menu (product → versions); n-1 auto-adds parent; total MB shown per entry; WAL cleanup after merge |
| `tests/test_extractor.py` | ✅ Done | 18 tests for `parse_url_metadata()` |
| `tests/test_crawler.py` | ✅ Done | 18 tests for `_normalise_url`, `_is_target_url`, `_section_from_url` |
| `.github/workflows/crawl-and-release.yml` | ✅ Done | 10-source matrix (crawl + crawl-derived + merge-and-release); resilient merge (skips missing DBs) |
| `README.md` | ✅ Done | Hallucination motivation at top; uv install instructions; simplified sources table; n−1 coverage model |
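The `crawler.py` and `cli.py` notes in the table describe two related rules: a 4xx response after an off-domain redirect is skipped rather than failed, and the CLI exits with status 1 only if the failure rate exceeds 5%. The sketch below shows one way these rules could fit together. The function names are hypothetical, and the choice to leave skipped pages out of the failure-rate denominator is an assumption, not something the notes confirm.

```python
from urllib.parse import urlparse


def classify_fetch(start_url: str, final_url: str, status: int) -> str:
    """Return 'ok', 'skipped', or 'failed' for one fetch result (hypothetical helper)."""
    if 200 <= status < 300:
        return "ok"
    # A redirect that ends on a different host is treated as an auth/SSO bounce.
    off_domain = urlparse(start_url).hostname != urlparse(final_url).hostname
    if off_domain and 400 <= status < 500:
        return "skipped"
    return "failed"


def exit_code(outcomes: list[str], threshold: float = 0.05) -> int:
    """Return 1 only if the failure rate is strictly above the threshold."""
    # Assumption: skipped pages do not count as attempts.
    attempted = [o for o in outcomes if o != "skipped"]
    if not attempted:
        return 0
    failure_rate = attempted.count("failed") / len(attempted)
    return 1 if failure_rate > threshold else 0


outcomes = [
    classify_fetch("https://help.splunk.com/a", "https://help.splunk.com/a", 200),
    classify_fetch("https://help.splunk.com/b", "https://login.example.com/sso", 403),
    classify_fetch("https://help.splunk.com/c", "https://help.splunk.com/c", 404),
]
print(outcomes, exit_code(outcomes))
```

Here `login.example.com` is a placeholder host standing in for an external SSO provider.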
- MCP server: all 6 tools; `version=` filter on both search tools
- Multi-version search: `search_docs(query, version="8.4")` matches rows by `version` column AND `version_tags` JSON array
- Cross-version dedup (Option B): `run_version_merge_pass` collapses same-content n-1 rows into parent rows tagged with both versions; DB size stays bounded as more n-1 sources are added
- Cross-source dedup: `is_duplicate=1` suppresses duplicate content (now using `content_md_hash`, which fixes Enterprise/Cloud Markdown-identical pages); bypassed when `version=` is set
- BM25 keyword search: FTS5, BM25 ranked, title weighted 10×, snippets
- Semantic search: all-MiniLM-L6-v2 embeddings, matrix cached at startup
- Crawler retry pass: after main BFS, failed URLs are re-attempted once
- Auth-redirect detection: pages that redirect to external SSO (403) are skipped cleanly, not counted as failures
- Incremental re-crawl: failed URLs excluded from visited set so they're retried on next run
- `splunk-merge`: merges per-source DBs + exports per-source files + `manifest.json`
- `splunk-setup`: interactive menu; single-source skips merge; multi-source merges; WAL cleanup
- 36 passing tests: `parse_url_metadata`, `_normalise_url`, `_is_target_url`, `_section_from_url`
- GHA workflow: 7-job matrix, per-source DB caching, `continue-on-error`, resilient merge
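To make the multi-version filter above concrete, here is a small sketch of a query that matches a row either by its `version` column or by membership in its `version_tags` JSON array via `json_each`. The table name `pages`, its columns other than `version` and `version_tags`, and the sample rows are assumptions for illustration; the real schema in `db.py` may differ. It requires an SQLite build with the JSON functions enabled.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    "id INTEGER PRIMARY KEY, title TEXT, version TEXT, version_tags TEXT)"
)
rows = [
    # A parent row that a version-merge pass has tagged with both versions.
    ("Risk-based alerting", "8.5", json.dumps(["8.5", "8.4"])),
    ("New in 8.5", "8.5", json.dumps(["8.5"])),
    ("Legacy dashboards", "8.3", json.dumps(["8.3"])),
]
conn.executemany(
    "INSERT INTO pages (title, version, version_tags) VALUES (?, ?, ?)", rows
)


def titles_for_version(conn, version):
    """Return titles whose version column or version_tags array contains `version`."""
    sql = """
        SELECT title FROM pages
        WHERE version = :v
           OR EXISTS (
               SELECT 1 FROM json_each(pages.version_tags)
               WHERE json_each.value = :v
           )
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, {"v": version})]


print(titles_for_version(conn, "8.4"))
print(titles_for_version(conn, "8.5"))
```

Because the merged parent row carries both tags, a search for 8.4 still finds it even though its `version` column says 8.5, which is how the database can stay bounded without losing n-1 coverage.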
No blocking issues. The previously noted Enterprise/Cloud dedup gap is resolved by `content_md_hash` in `run_dedup_pass`.
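As a rough illustration of why hashing the Markdown closes that gap, the sketch below flags later pages whose normalised Markdown matches an earlier page, regardless of source. SHA-256, the whitespace normalisation, the first-seen-wins rule, and the dictionary keys are all assumptions; the actual `run_dedup_pass` may work differently.

```python
import hashlib


def content_md_hash(markdown: str) -> str:
    """Hash Markdown after trimming trailing whitespace (assumed normalisation)."""
    normalised = "\n".join(line.rstrip() for line in markdown.strip().splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def mark_duplicates(pages: list[dict]) -> list[dict]:
    """Return copies of pages with is_duplicate set; the first occurrence wins."""
    seen = set()
    result = []
    for page in pages:
        digest = content_md_hash(page["content_md"])
        result.append({**page, "is_duplicate": 1 if digest in seen else 0})
        seen.add(digest)
    return result


# Hypothetical rows: the Enterprise and Cloud pages differ only in trailing spaces.
pages = [
    {"source": "enterprise", "content_md": "# Search\n\nUse SPL.  \n"},
    {"source": "cloud", "content_md": "# Search\n\nUse SPL.\n"},
    {"source": "cloud", "content_md": "# Cloud only\n"},
]
flags = [p["is_duplicate"] for p in mark_duplicates(pages)]
print(flags)
```

Hashing the extracted Markdown rather than the raw HTML means pages that differ only in surrounding site chrome can still be recognised as the same content.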
- Item 10 ✅: `crawled_at` in search results
- Item 4 ✅: Exponential backoff retry (3 attempts, 2/4/8 s)
- Item 3 ✅: Embedding matrix cache at startup
- Item 8 ✅: Smart chunking (heading → paragraph → character fallback) + `--rechunk`
- Item 2 ✅: Lantern sitemap seeding + `<lastmod>` skip + BFS fallback
- Item 6 ✅: Embedding reuse via `content_hash`
- Item 1 ✅: GHA matrix (7 parallel jobs) + `merge_dbs()` + `splunk-merge` CLI
- Item 7 ✅: Multi-version crawling (ES 8.3/8.4) + `version=` filter on search tools
- Item 5 ✅: Cross-source deduplication (`is_duplicate` column; version-bypass logic)
- Item 5b ✅: Extend dedup to use `content_md_hash` for Enterprise/Cloud overlap (done in Option B)
- Item 9 ✅: `splunk-setup` version selection UI
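The exponential backoff in Item 4 can be sketched as follows. This reading assumes one initial try followed by up to three retries, waiting 2, 4, and 8 seconds before each retry; the item text could also mean three attempts in total. The function name, the injectable `sleep` parameter, and the broad `except Exception` are illustrative choices, not the crawler's actual code.

```python
import time


def fetch_with_backoff(fetch, url, delays=(2, 4, 8), sleep=time.sleep):
    """Call fetch(url); on failure, wait 2, 4, then 8 seconds before each retry."""
    last_error = None
    for delay in (0, *delays):
        if delay:
            sleep(delay)
        try:
            return fetch(url)
        except Exception as exc:  # a real crawler would catch narrower errors
            last_error = exc
    raise last_error


# Demo with a recorded sleep so the example finishes instantly.
slept = []
calls = {"n": 0}


def flaky_fetch(url):
    """Hypothetical fetcher that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok:" + url


result = fetch_with_backoff(flaky_fetch, "https://help.splunk.com/", sleep=slept.append)
print(result, slept)
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays.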