
Commit 1ef9cba

Feature/prompt templates and lmstudio sdk (#171)
* Add prompt template support and LM Studio SDK integration

  Features:
  - Prompt template support for embedding models (via --embedding-prompt-template)
  - LM Studio SDK integration for automatic context length detection
  - Hybrid token limit discovery (Ollama → LM Studio → Registry → Default)
  - Client-side token truncation to prevent silent failures
  - Automatic persistence of embedding_options to .meta.json

  Implementation:
  - Added _query_lmstudio_context_limit() with Node.js subprocess bridge
  - Modified compute_embeddings_openai() to apply prompt templates before truncation
  - Extended CLI with --embedding-prompt-template flag for build and search
  - URL detection for LM Studio (port 1234 or lmstudio/lm.studio keywords)
  - HTTP→WebSocket URL conversion for SDK compatibility

  Tests:
  - 60 passing tests across 5 test files
  - Comprehensive coverage of prompt templates, LM Studio integration, and token handling
  - Parametrized tests for maintainability and clarity

* Add integration tests and fix LM Studio SDK bridge

  Features:
  - End-to-end integration tests for prompt template with EmbeddingGemma
  - Integration tests for hybrid token limit discovery mechanism
  - Tests verify real-world functionality with live services (LM Studio, Ollama)

  Fixes:
  - LM Studio SDK bridge now uses client.embedding.load() for embedding models
  - Fixed NODE_PATH resolution to include npm global modules
  - Fixed integration test to use WebSocket URL (ws://) for SDK bridge

  Tests:
  - test_prompt_template_e2e.py: 8 integration tests covering:
    - Prompt template prepending with LM Studio (EmbeddingGemma)
    - LM Studio SDK bridge for context length detection
    - Ollama dynamic token limit detection
    - Hybrid discovery fallback mechanism (registry, default)
  - All tests marked with @pytest.mark.integration for selective execution
  - Tests gracefully skip when services unavailable

  Documentation:
  - Updated tests/README.md with integration test section
  - Added prerequisites and running instructions
  - Documented that prompt templates are ONLY for EmbeddingGemma
  - Added integration marker to pyproject.toml

  Test results:
  - All 8 integration tests passing with live services
  - Confirmed prompt templates work correctly with EmbeddingGemma
  - Verified LM Studio SDK bridge auto-detects context length (2048)
  - Validated hybrid token limit discovery across all backends

* Add prompt template support to Ollama mode

  Extends prompt template functionality from OpenAI mode to Ollama for backend consistency.

  Changes:
  - Add provider_options parameter to compute_embeddings_ollama()
  - Apply prompt template before token truncation (lines 1005-1011)
  - Pass provider_options through compute_embeddings() call chain

  Tests:
  - test_ollama_embedding_with_prompt_template: verifies templates work with Ollama
  - test_ollama_prompt_template_affects_embeddings: confirms embeddings differ with/without template
  - Both tests pass with live Ollama service (2/2 passing)

  Usage: leann build --embedding-mode ollama --embedding-prompt-template "query: " ...

* Fix LM Studio SDK bridge to respect JIT auto-evict settings

  Problem: the SDK bridge called client.embedding.load(), which loaded models into LM Studio memory and bypassed JIT auto-evict settings, causing duplicate model instances to accumulate.

  Root cause analysis (from Perplexity research):
  - Explicit SDK load() commands are treated as "pinned" models
  - JIT auto-evict only applies to models loaded reactively via API requests
  - SDK-loaded models remain in memory until explicitly unloaded

  Solutions implemented:
  1. Add model.unload() after metadata query (line 243)
     - Load model temporarily to get context length
     - Unload immediately to hand control back to JIT system
     - Subsequent API requests trigger JIT load with auto-evict
  2. Add token limit caching to prevent repeated SDK calls
     - Cache discovered limits in _token_limit_cache dict (line 48)
     - Key: (model_name, base_url); value: token_limit
     - Prevents duplicate load/unload cycles within same process
     - Cache shared across all discovery methods (Ollama, SDK, registry)

  Tests:
  - TestTokenLimitCaching: 5 tests for cache behavior (integrated into test_token_truncation.py)
  - Manual testing confirmed no duplicate models in LM Studio after fix
  - All existing tests pass

  Impact:
  - Respects user's LM Studio JIT and auto-evict settings
  - Reduces model memory footprint
  - Faster subsequent builds (cached limits)

* Document prompt template and LM Studio SDK features

  Added comprehensive documentation for new optional embedding features.

  Configuration Guide (docs/configuration-guide.md):
  - New section: "Optional Embedding Features"
  - Task-Specific Prompt Templates subsection:
    - Explains EmbeddingGemma use case with document/query prompts
    - CLI and Python API examples
    - Clear warnings about compatible vs incompatible models
    - References to GitHub issue #155 and HuggingFace blog
  - LM Studio Auto-Detection subsection:
    - Prerequisites (Node.js + @lmstudio/sdk)
    - How auto-detection works (4-step process)
    - Benefits and optional nature clearly stated

  FAQ (docs/faq.md):
  - FAQ #2: When should I use prompt templates?
    - DO/DON'T guidance with examples
    - Links to detailed configuration guide
  - FAQ #3: Why is LM Studio loading multiple copies?
    - Explains the JIT auto-evict fix
    - Troubleshooting steps if still seeing issues
  - FAQ #4: Do I need Node.js and @lmstudio/sdk?
    - Clarifies it's completely optional
    - Lists benefits if installed
    - Installation instructions

  Cross-references between documents for easy navigation between quick reference and detailed guides.

* Add separate build/query template support for task-specific models

  Task-specific models like EmbeddingGemma require different templates for indexing vs searching. Store both templates at build time and auto-apply the query template during search, with backward compatibility.

* Consolidate prompt template tests from 44 to 37 tests

  Merged redundant no-op tests, removed low-value implementation tests, consolidated parameterized CLI tests, and removed a hanging over-mocked test. All tests pass with improved focus on behavioral testing.

* Fix query template application in compute_query_embedding

  Query templates were only applied in the fallback code path, not when using the embedding server (the default path). This meant stored query templates in index metadata were ignored during MCP and CLI searches.

  Changes:
  - Move template application to before any computation path (searcher_base.py:109-110)
  - Add comprehensive tests for both server and fallback paths
  - Consolidate tests into test_prompt_template_persistence.py

  Tests verify:
  - Template applied when using embedding server
  - Template applied in fallback path
  - Consistent behavior between both paths

* Apply ruff formatting and fix linting issues

  - Remove unused imports
  - Fix import ordering
  - Remove unused variables
  - Apply code formatting

* Fix CI test failures: mock OPENAI_API_KEY in tests

  Tests were failing in CI because compute_embeddings_openai() checks for OPENAI_API_KEY before using the mocked client. Added monkeypatch to set a fake API key in the test fixture.
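The hybrid token-limit discovery and caching described in these commits can be pictured with a small sketch. This is a minimal illustration of the fallback order, not LEANN's actual implementation: the helper callables and the default value are placeholders, while the cache key and the Ollama → LM Studio SDK → registry → default order follow the commit message.

```python
from typing import Callable, Optional

# Cache keyed by (model_name, base_url), as described in the commit message,
# so repeated builds in one process do not trigger extra SDK load/unload cycles.
_token_limit_cache: dict[tuple[str, str], int] = {}

FALLBACK_TOKEN_LIMIT = 512  # placeholder; the real default lives in LEANN's registry code


def is_lmstudio_url(base_url: str) -> bool:
    """URL heuristic from the commit message: port 1234 or lmstudio/lm.studio keywords."""
    return ":1234" in base_url or "lmstudio" in base_url or "lm.studio" in base_url


def discover_token_limit(
    model_name: str,
    base_url: str,
    query_ollama: Callable[[str, str], Optional[int]],
    query_lmstudio_sdk: Callable[[str, str], Optional[int]],
    registry: dict[str, int],
) -> int:
    """Resolve a model's token limit: Ollama -> LM Studio SDK -> registry -> default."""
    key = (model_name, base_url)
    if key in _token_limit_cache:
        return _token_limit_cache[key]

    limit = query_ollama(model_name, base_url)  # returns None if the backend is not Ollama
    if limit is None and is_lmstudio_url(base_url):
        limit = query_lmstudio_sdk(model_name, base_url)  # Node.js subprocess bridge
    if limit is None:
        limit = registry.get(model_name)  # built-in registry of known models
    if limit is None:
        limit = FALLBACK_TOKEN_LIMIT

    _token_limit_cache[key] = limit
    return limit


# Example with no live services reachable: the static registry answers.
print(
    discover_token_limit(
        "text-embedding-nomic-embed-text-v1.5",
        "http://localhost:1234/v1",
        query_ollama=lambda model, url: None,
        query_lmstudio_sdk=lambda model, url: None,
        registry={"text-embedding-nomic-embed-text-v1.5": 2048},
    )
)  # -> 2048
```

Once a limit is resolved, it is cached per (model, URL) pair, which is why the commit notes faster subsequent builds and no repeated load/unload cycles.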
1 parent a635509 · commit 1ef9cba

15 files changed: +3095 −15 lines

docs/configuration-guide.md

Lines changed: 89 additions & 0 deletions
@@ -158,6 +158,95 @@ builder.build_index("./indexes/my-notes", chunks)

`embedding_options` is persisted to the index `meta.json`, so subsequent `LeannSearcher` or `LeannChat` sessions automatically reuse the same provider settings (the embedding server manager forwards them to the provider for you).

## Optional Embedding Features

### Task-Specific Prompt Templates

Some embedding models are trained with task-specific prompts to differentiate between documents and queries. The most notable example is **Google's EmbeddingGemma**, which requires different prompts depending on the use case:

- **Indexing documents**: `"title: none | text: "`
- **Search queries**: `"task: search result | query: "`

LEANN supports automatic prompt prepending via the `--embedding-prompt-template` flag:

```bash
# Build index with EmbeddingGemma (via LM Studio or Ollama)
leann build my-docs \
  --docs ./documents \
  --embedding-mode openai \
  --embedding-model text-embedding-embeddinggemma-300m-qat \
  --embedding-api-base http://localhost:1234/v1 \
  --embedding-prompt-template "title: none | text: " \
  --force

# Search with query-specific prompt
leann search my-docs \
  --query "What is quantum computing?" \
  --embedding-prompt-template "task: search result | query: "
```

**Important Notes:**
- **Only use with compatible models**: EmbeddingGemma and similar task-specific models
- **NOT for regular models**: Adding prompts to models like `nomic-embed-text`, `text-embedding-3-small`, or `bge-base-en-v1.5` will corrupt embeddings
- **Template is saved**: Build-time templates are saved to `.meta.json` for reference
- **Flexible prompts**: You can use any prompt string, or leave it empty (`""`)

**Python API:**
```python
from leann.api import LeannBuilder

builder = LeannBuilder(
    embedding_mode="openai",
    embedding_model="text-embedding-embeddinggemma-300m-qat",
    embedding_options={
        "base_url": "http://localhost:1234/v1",
        "api_key": "lm-studio",
        "prompt_template": "title: none | text: ",
    },
)
builder.build_index("./indexes/my-docs", chunks)
```
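For the query side, the stored template (or an explicit override) can be supplied through the same options mechanism. A minimal sketch, assuming a `LeannSearcher` opened from the index path; the `provider_options` override follows the `search()` parameter added in this commit, but the constructor usage shown here is illustrative:

```python
from leann.api import LeannSearcher

searcher = LeannSearcher("./indexes/my-docs")  # illustrative constructor usage

# provider_options takes highest priority in the template fallback chain,
# so this search uses EmbeddingGemma's query prompt instead of the build prompt.
results = searcher.search(
    "What is quantum computing?",
    provider_options={"prompt_template": "task: search result | query: "},
)
for hit in results:
    print(hit)
```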

**References:**
- [HuggingFace Blog: EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) - Technical details

### LM Studio Auto-Detection (Optional)

When using LM Studio with the OpenAI-compatible API, LEANN can optionally auto-detect model context lengths via the LM Studio SDK. This eliminates manual configuration of token limits.

**Prerequisites:**
```bash
# Install Node.js (if not already installed)
# Then install the LM Studio SDK globally
npm install -g @lmstudio/sdk
```

**How it works:**
1. LEANN detects LM Studio URLs (`:1234`, `lmstudio` in URL; see the sketch below)
2. Queries model metadata via Node.js subprocess
3. Automatically unloads model after query (respects your JIT auto-evict settings)
4. Falls back to static registry if SDK unavailable
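The detection and SDK hand-off in steps 1-2 can be pictured with a small sketch. The function names are illustrative (not LEANN's public API); the URL heuristic and the HTTP→WebSocket conversion follow the commit message:

```python
from urllib.parse import urlparse


def is_lmstudio_url(base_url: str) -> bool:
    """Heuristic: LM Studio's default port 1234, or 'lmstudio'/'lm.studio' in the host."""
    parsed = urlparse(base_url)
    host = (parsed.hostname or "").lower()
    return parsed.port == 1234 or "lmstudio" in host or "lm.studio" in host


def to_websocket_url(base_url: str) -> str:
    """Convert the OpenAI-compatible HTTP base URL to the WebSocket form the SDK bridge uses."""
    parsed = urlparse(base_url)
    return f"ws://{parsed.hostname}:{parsed.port or 1234}"


print(is_lmstudio_url("http://localhost:1234/v1"))   # True
print(to_websocket_url("http://localhost:1234/v1"))  # ws://localhost:1234
```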

**No configuration needed** - it works automatically when the SDK is installed:

```bash
leann build my-docs \
  --docs ./documents \
  --embedding-mode openai \
  --embedding-model text-embedding-nomic-embed-text-v1.5 \
  --embedding-api-base http://localhost:1234/v1
# Context length auto-detected if SDK available
# Falls back to registry (2048) if not
```

**Benefits:**
- ✅ Automatic token limit detection
- ✅ Respects LM Studio JIT auto-evict settings
- ✅ No manual registry maintenance
- ✅ Graceful fallback if SDK unavailable

**Note:** This is completely optional. LEANN works perfectly fine without the SDK, using the built-in token limit registry.

## Index Selection: Matching Your Scale

### HNSW (Hierarchical Navigable Small World)

docs/faq.md

Lines changed: 48 additions & 0 deletions
@@ -8,3 +8,51 @@ You can speed up the process by using a lightweight embedding model. Add this to
--embedding-model sentence-transformers/all-MiniLM-L6-v2
```
**Model sizes:** `all-MiniLM-L6-v2` (30M parameters), `facebook/contriever` (~100M parameters), `Qwen3-0.6B` (600M parameters)

## 2. When should I use prompt templates?

**Use prompt templates ONLY with task-specific embedding models** like Google's EmbeddingGemma. These models are specially trained to use different prompts for documents vs queries.

**DO NOT use with regular models** like `nomic-embed-text`, `text-embedding-3-small`, or `bge-base-en-v1.5` - adding prompts to these models will corrupt the embeddings.

**Example usage with EmbeddingGemma:**
```bash
# Build with document prompt
leann build my-docs --embedding-prompt-template "title: none | text: "

# Search with query prompt
leann search my-docs --query "your question" --embedding-prompt-template "task: search result | query: "
```

See the [Configuration Guide: Task-Specific Prompt Templates](configuration-guide.md#task-specific-prompt-templates) for detailed usage.

## 3. Why is LM Studio loading multiple copies of my model?

This was fixed in recent versions. LEANN now properly unloads models after querying metadata, respecting your LM Studio JIT auto-evict settings.

**If you still see duplicates:**
- Update to the latest LEANN version
- Restart LM Studio to clear loaded models
- Check that you have JIT auto-evict enabled in LM Studio settings

**How it works now:**
1. LEANN loads the model temporarily to get its context length
2. Immediately unloads it after the query
3. LM Studio JIT loads the model on demand for actual embeddings
4. Auto-evicts per your settings

## 4. Do I need Node.js and @lmstudio/sdk?

**No, it's completely optional.** LEANN works perfectly fine without them, using a built-in token limit registry.

**Benefits if you install it:**
- Automatic context length detection for LM Studio models
- No manual registry maintenance
- Always gets accurate token limits from the model itself

**To install (optional):**
```bash
npm install -g @lmstudio/sdk
```

See [Configuration Guide: LM Studio Auto-Detection](configuration-guide.md#lm-studio-auto-detection-optional) for details.

packages/leann-core/src/leann/api.py

Lines changed: 15 additions & 0 deletions
@@ -916,6 +916,7 @@ def search(
        metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None,
        batch_size: int = 0,
        use_grep: bool = False,
+       provider_options: Optional[dict[str, Any]] = None,
        **kwargs,
    ) -> list[SearchResult]:
        """
@@ -979,10 +980,24 @@ def search(

        start_time = time.time()

+       # Extract query template from stored embedding_options with fallback chain:
+       # 1. Check provider_options override (highest priority)
+       # 2. Check query_prompt_template (new format)
+       # 3. Check prompt_template (old format for backward compat)
+       # 4. None (no template)
+       query_template = None
+       if provider_options and "prompt_template" in provider_options:
+           query_template = provider_options["prompt_template"]
+       elif "query_prompt_template" in self.embedding_options:
+           query_template = self.embedding_options["query_prompt_template"]
+       elif "prompt_template" in self.embedding_options:
+           query_template = self.embedding_options["prompt_template"]
+
        query_embedding = self.backend_impl.compute_query_embedding(
            query,
            use_server_if_available=recompute_embeddings,
            zmq_port=zmq_port,
+           query_template=query_template,
        )
        logger.info(f" Generated embedding shape: {query_embedding.shape}")
        embedding_time = time.time() - start_time

packages/leann-core/src/leann/cli.py

Lines changed: 26 additions & 0 deletions
@@ -144,6 +144,18 @@ def create_parser(self) -> argparse.ArgumentParser:
            default=None,
            help="API key for embedding service (defaults to OPENAI_API_KEY)",
        )
+       build_parser.add_argument(
+           "--embedding-prompt-template",
+           type=str,
+           default=None,
+           help="Prompt template to prepend to all texts for embedding (e.g., 'query: ' for search)",
+       )
+       build_parser.add_argument(
+           "--query-prompt-template",
+           type=str,
+           default=None,
+           help="Prompt template for queries (different from build template for task-specific models)",
+       )
        build_parser.add_argument(
            "--force", "-f", action="store_true", help="Force rebuild existing index"
        )
@@ -260,6 +272,12 @@ def create_parser(self) -> argparse.ArgumentParser:
            action="store_true",
            help="Display file paths and metadata in search results",
        )
+       search_parser.add_argument(
+           "--embedding-prompt-template",
+           type=str,
+           default=None,
+           help="Prompt template to prepend to query for embedding (e.g., 'query: ' for search)",
+       )

        # Ask command
        ask_parser = subparsers.add_parser("ask", help="Ask questions")
@@ -1398,6 +1416,14 @@ async def build_index(self, args):
        resolved_embedding_key = resolve_openai_api_key(args.embedding_api_key)
        if resolved_embedding_key:
            embedding_options["api_key"] = resolved_embedding_key
+       if args.query_prompt_template:
+           # New format: separate templates
+           if args.embedding_prompt_template:
+               embedding_options["build_prompt_template"] = args.embedding_prompt_template
+           embedding_options["query_prompt_template"] = args.query_prompt_template
+       elif args.embedding_prompt_template:
+           # Old format: single template (backward compat)
+           embedding_options["prompt_template"] = args.embedding_prompt_template

        builder = LeannBuilder(
            backend_name=args.backend_name,
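To make the branching above concrete, here is an illustrative sketch of the template-related keys that end up in `embedding_options` (and hence in `.meta.json`) for the two flag combinations. Values are example prompts; the key names follow the diff above:

```python
# leann build ... --embedding-prompt-template "title: none | text: " \
#                 --query-prompt-template "task: search result | query: "
new_format = {
    "build_prompt_template": "title: none | text: ",
    "query_prompt_template": "task: search result | query: ",
}

# leann build ... --embedding-prompt-template "title: none | text: "   (backward compatible)
old_format = {
    "prompt_template": "title: none | text: ",  # applied to documents and, as a fallback, to queries
}
```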
