Skip to content

Commit 832eead

Browse files
author
embed-rerank-bot
committed
feat: Add dynamic embedding dimension detection and configuration
Major improvements: - Automatic dimension detection from model config (hidden_size, d_model, embedding_size, model_dim, dim) - Optional fixed output dimension controls via OUTPUT_EMBEDDING_DIMENSION and DIMENSION_STRATEGY env vars - OpenAI-compatible 'dimensions' request field for per-request dimension control - Enhanced MLX backend to properly read config with multiple dimension key fallbacks - Fixed placeholder/fallback path to respect config dict values - Updated router docstrings to reflect dynamic dimensions - Comprehensive README documentation with dimension configuration best practices - Added LightRAG integration guidance and Qwen similarity scaling notes Breaking changes: None - fully backward compatible Version: 1.3.0
1 parent 5602678 commit 832eead

File tree

6 files changed

+116
-72
lines changed

6 files changed

+116
-72
lines changed

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2222
- These updates maintain full compatibility with existing OpenAI SDK usage; default remains `encoding_format="float"`.
2323

2424

25+
## [1.3.0] - 2025-11-04
26+
27+
### Added
28+
- 📏 Dynamic embedding dimension documentation in README, including supported config keys and best practices for vector DBs
29+
- 🧩 Optional fixed output-dimension controls (disabled by default)
30+
- Env: `OUTPUT_EMBEDDING_DIMENSION`, `DIMENSION_STRATEGY` (pad|trim)
31+
- OpenAI-compatible `dimensions` request field mapped to per-request trim when no global override is set
32+
33+
### Fixed
34+
- 🛠 MLX backend embedding dimension alignment
35+
- Properly reads model config dimension using multiple keys (`hidden_size`, `d_model`, `embedding_size`, `model_dim`, `dim`)
36+
- Placeholder/fallback path now respects config dict values (no unintended 4096 defaulting)
37+
38+
### Changed
39+
- 🧹 Removed hardcoded dimension assumptions from docs; router docstrings updated to reflect dynamic dimensions
40+
- 📚 README updated with “Dynamic Embedding Dimensions” section and optional compatibility knobs
41+
2542
### Added
2643
- 🆕 **Cohere API v1/v2 Compatibility**: Full support for Cohere reranking API
2744
- `/v1/rerank` endpoint (legacy format support)

README.md

Lines changed: 52 additions & 52 deletions
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,7 @@
1313
Lightning-fast local embeddings & reranking for Apple Silicon (MLX-first). OpenAI, TEI, and Cohere compatible.
1414

1515
## 🔧 Troubleshooting
16-
1716
### Common Issues
18-
1917
**"Embedding service not initialized" Error**: Fixed in v1.2.0. If you encounter this error:
2018
1. Update to the latest version: `pip install --upgrade embed-rerank`
2119
2. For source installations, ensure proper service initialization in `main.py`
@@ -40,35 +38,15 @@ For comprehensive troubleshooting, see [docs/TROUBLESHOOTING.md](docs/TROUBLESHO
4038

4139
Recent MLX versions removed `mx.array` in favor of `mx.asarray` (and `mx.numpy.array`). This repository includes a compatibility helper that automatically forwards to the appropriate API, so Apple Silicon embeddings continue to work across MLX versions.
4240

43-
What changed:
41+
**What changed:**
4442
- Internal `mx.array(...)` calls now use a helper that tries, in order: `mx.array``mx.asarray``mx.numpy.array`.
45-
- Placeholder embedding fallback now respects the model configuration using `config['hidden_size']` (previously some error paths defaulted to 4096).
43+
- Placeholder embedding fallback now respects the model configuration using multiple dimension keys.
4644

47-
Why this matters:
45+
**Why this matters:**
4846
- Prevents runtime error: `module 'mlx.core' has no attribute 'array'` on newer MLX.
49-
- Ensures embedding dimension matches the loaded model, avoiding vector size mismatches (e.g., when updating existing ChromaDB collections).
50-
51-
Quick validation (Apple Silicon + MLX installed):
52-
```python
53-
import asyncio
54-
from app.backends.factory import BackendFactory
55-
56-
async def main():
57-
backend = BackendFactory.create_backend("mlx", "mlx-community/Qwen3-Embedding-4B-4bit-DWQ")
58-
await backend.load_model()
59-
res = await backend.embed_texts(["hello", "world"])
60-
print("shape:", res.vectors.shape) # (2, <model_hidden_size>)
61-
62-
asyncio.run(main())
63-
```
64-
65-
Notes:
66-
- Optional dependency for MLX (macOS only): `pip install "embed-rerank[mlx]"` or see `pyproject.toml` (`mlx>=0.4.0`, `mlx-lm>=0.2.0`).
67-
- If you maintain an existing ChromaDB collection, verify that new embeddings match the existing dimension before upsert.
68-
69-
---
70-
47+
- Ensures embedding dimension matches the loaded model, avoiding vector size mismatches.
7148

49+
**Optional dependency for MLX (macOS only):** `pip install "embed-rerank[mlx]"` or see `pyproject.toml` (`mlx>=0.4.0`, `mlx-lm>=0.2.0`).
7250

7351
---
7452

@@ -213,13 +191,35 @@ The service automatically handles long texts with intelligent processing:
213191
- **Auto-Truncation**: Texts exceeding token limits are automatically reduced by ~75%
214192
- **Smart Summarization**: Key sentences are preserved while removing redundancy
215193
- **Dynamic Token Limits**: Automatically detected from model metadata (e.g., 512 tokens for Qwen3)
216-
- **Dimension Detection**: Vector dimensions auto-configured from model (e.g., 1024D for Qwen3)
194+
- **Dynamic Dimension Detection**: Vector dimensions auto-configured from model metadata
217195
- **Processing Transparency**: Optional processing info in API responses
218196

219197
**Example: 8000+ character text → 2037 tokens automatically**
220198

221199
---
222200

201+
### 📏 Dynamic Embedding Dimensions
202+
203+
- The service derives embedding dimension directly from the loaded model’s config.
204+
- Supported config keys (priority): `hidden_size``d_model``embedding_size``model_dim``dim`.
205+
- Backend and health endpoints report the actual vector size; clients should not assume a fixed dimension.
206+
- Tip for vector DBs (e.g., Qdrant): create the collection with the reported dimension.
207+
208+
#### Optional: Fixed Output Dimension (Compatibility)
209+
210+
If you already have an index built at a specific dimension (e.g., 4096), you can ask the service to pad/trim output vectors to that size:
211+
212+
```env
213+
# Optional – force output vectors to a fixed size
214+
OUTPUT_EMBEDDING_DIMENSION=4096
215+
# Strategy: pad with zeros or trim leading dimensions (then re-normalize)
216+
DIMENSION_STRATEGY=pad # or trim
217+
```
218+
219+
- Service-level setting takes precedence over per-request settings.
220+
- OpenAI-compatible `dimensions` request field is supported and maps to trim behavior when no global override is set.
221+
- For cosine similarity, zero-padding + re-normalization is safe; for other metrics, prefer retraining/reindexing.
222+
223223
### 📂 Model Cache Management
224224

225225
The service automatically manages model downloads and caching:
@@ -272,11 +272,7 @@ response = client.embeddings.create(
272272
model="text-embedding-ada-002"
273273
)
274274
# 🚀 10x faster than OpenAI, same code!
275-
276275
```
277-
278-
#### Base64 encoding (OpenAI-compatible)
279-
280276
You can request base64-encoded embeddings by setting `encoding_format="base64"`. This is useful when transporting vectors through systems that expect strings only.
281277

282278
```python
@@ -330,25 +326,6 @@ response = requests.post("http://localhost:9000/v1/rerank", json={
330326
})
331327
```
332328

333-
---
334-
335-
## 🧩 LightRAG Integration
336-
337-
We validated an end-to-end workflow using LightRAG with this service:
338-
- Embeddings via the OpenAI-compatible endpoint (`/v1/embeddings`)
339-
- Reranking via the Cohere-compatible endpoint (`/v1/rerank` or `/v2/rerank`)
340-
341-
Results: the integration tests succeeded using OpenAI embeddings and Cohere reranking.
342-
343-
Qwen Embedding similarity scaling note: when using the Qwen Embedding model, we observed cosine similarity values that appear very small (e.g., `0.02`, `0.03`). This is expected due to vector scaling differences and does not indicate poor retrieval by itself. As a starting point, we recommend disabling the retrieval threshold in LightRAG to avoid filtering out good matches prematurely:
344-
345-
```
346-
# === Retrieval threshold ===
347-
COSINE_THRESHOLD=0.0
348-
```
349-
350-
Adjust upward later based on your dataset and evaluation results.
351-
352329
### Native API
353330

354331
```bash
@@ -416,6 +393,9 @@ embed-rerank --test full --test-url http://localhost:9000
416393

417394
### 🔧 Advanced Testing (Source Code)
418395

396+
```bash
397+
### 🔧 Advanced Testing (Source Code)
398+
419399
For development and comprehensive testing with the source code:
420400

421401
```bash
@@ -474,6 +454,7 @@ embed-rerank --port 9000 &
474454
```
475455

476456
> **Windows Support**: Coming soon! Currently optimized for macOS/Linux.
457+
```
477458

478459
---
479460

@@ -510,7 +491,7 @@ embed-rerank --port 9000 &
510491

511492
---
512493

513-
## 📝 Quick Reference
494+
## Quick Reference
514495

515496
### Installation & Startup
516497
```bash
@@ -554,6 +535,25 @@ flake8 app/ tests/ --max-line-length=120 --extend-ignore=E203,W503 # Linting
554535

555536
---
556537

538+
## 🧩 LightRAG Integration
539+
540+
We validated an end-to-end workflow using LightRAG with this service:
541+
- Embeddings via the OpenAI-compatible endpoint (`/v1/embeddings`)
542+
- Reranking via the Cohere-compatible endpoint (`/v1/rerank` or `/v2/rerank`)
543+
544+
Results: the integration tests succeeded using OpenAI embeddings and Cohere reranking.
545+
546+
Qwen Embedding similarity scaling note: when using the Qwen Embedding model, we observed cosine similarity values that appear very small (e.g., `0.02`, `0.03`). This is expected due to vector scaling differences and does not indicate poor retrieval by itself. As a starting point, we recommend disabling the retrieval threshold in LightRAG to avoid filtering out good matches prematurely:
547+
548+
```
549+
# === Retrieval threshold ===
550+
COSINE_THRESHOLD=0.0
551+
```
552+
553+
Adjust upward later based on your dataset and evaluation results.
554+
555+
---
556+
557557
## 📄 License
558558

559559
MIT License - build amazing things with this code!

app/__init__.py

Lines changed: 9 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -7,15 +7,17 @@
77
- Apple Silicon MLX optimization with PyTorch fallback
88
- Multi-API compatibility: Native, OpenAI, TEI, and Cohere formats
99
10-
🚀 NEW in v1.2.3: OpenAI base64 encoding support + docs update
11-
- Fixed Cohere API tests with proper environment variable handling
12-
- Resolved pytest environment variable propagation issues
13-
- Eliminated false warnings while maintaining 100% API compatibility
14-
- Enhanced test suite reliability and consistency
15-
- All API formats (Native, OpenAI, TEI, Cohere) now show clean success status
10+
🚀 NEW in v1.3.0: Dynamic embedding dimensions and enhanced model configuration!
11+
- Automatic embedding dimension detection from model config (hidden_size, d_model, etc.)
12+
- Optional fixed output dimension controls (OUTPUT_EMBEDDING_DIMENSION, DIMENSION_STRATEGY)
13+
- OpenAI-compatible 'dimensions' request field support for per-request dimension control
14+
- OpenAI base64 encoding support (encoding_format="base64")
15+
- MLX backend now properly reads config with multiple dimension key fallbacks
16+
- Enhanced README documentation with dimension configuration best practices
17+
- LightRAG integration guidance and Qwen similarity scaling notes
1618
1719
Author: joonsoo-me
1820
"""
1921

20-
__version__ = "1.2.3"
22+
__version__ = "1.3.0"
2123
__author__ = "joonsoo-me"

app/backends/mlx_backend.py

Lines changed: 19 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -264,6 +264,16 @@ def _load_model_sync(self):
264264
logger.info("config.json missing - using default config for Qwen3 model")
265265
config = {"hidden_size": 4096, "max_position_embeddings": 32768}
266266

267+
# Normalize config keys: some models (e.g., Qwen3) use 'd_model'
268+
# Ensure 'hidden_size' is present for downstream logic
269+
if isinstance(config, dict):
270+
if 'hidden_size' not in config and 'd_model' in config:
271+
try:
272+
config['hidden_size'] = int(config['d_model'])
273+
except Exception:
274+
# Fallback silently if casting fails
275+
config['hidden_size'] = config['d_model']
276+
267277
# Attempt to locate and load MLX weights
268278
weights_path = self._find_weights_file(model_dir)
269279
if weights_path:
@@ -279,7 +289,8 @@ def _load_model_sync(self):
279289
logger.info("No MLX weights found - creating compatible embedding model")
280290

281291
# Create a compatible MLX embedding model
282-
hidden_size = config.get('hidden_size', 4096)
292+
# Prefer explicit hidden_size, fall back to d_model
293+
hidden_size = config.get('hidden_size') or config.get('d_model', 4096)
283294
model = self._create_placeholder_model(hidden_size)
284295
config['hidden_size'] = hidden_size
285296
logger.info("🧪 Created MLX-compatible embedding model", hidden_size=hidden_size)
@@ -393,7 +404,8 @@ class MLXEmbeddingModel:
393404
def __init__(self, config, weights):
394405
self.config = config
395406
self.weights = weights
396-
self.hidden_size = config.get('hidden_size', 4096)
407+
# Some configs expose 'd_model' rather than 'hidden_size'
408+
self.hidden_size = config.get('hidden_size') or config.get('d_model', 4096)
397409
self.max_position_embeddings = config.get('max_position_embeddings', 32768)
398410

399411
def embed(self, input_ids):
@@ -524,8 +536,11 @@ def _embed_sync(self, texts: List[str], batch_size: int) -> np.ndarray:
524536

525537
def _generate_placeholder_embeddings(self, texts: List[str]) -> np.ndarray:
526538
"""Generate placeholder embeddings for fallback."""
527-
# self.config is a dict; use dict.get to reflect actual model settings
528-
embedding_dim = self.config.get('hidden_size', 4096) if self.config else 4096
539+
# self.config is a dict; prefer hidden_size then d_model for dynamic dimension support
540+
if isinstance(self.config, dict):
541+
embedding_dim = self.config.get('hidden_size') or self.config.get('d_model') or 4096
542+
else:
543+
embedding_dim = 4096
529544

530545
# Use text hash for deterministic embeddings
531546
embeddings = []

app/routers/embedding_router.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@
77
88
⚡ Performance Highlights:
99
- Sub-millisecond text embedding generation
10-
- 320-dimensional vectors optimized for semantic search
10+
- Dynamic vector dimension (auto-detected from model)
1111
- Batch processing with MLX acceleration
1212
- Zero-copy operations on Apple Silicon
1313
@@ -89,7 +89,7 @@ async def generate_embeddings(request: EmbedRequest, service: EmbeddingService =
8989
✨ What happens here:
9090
- Text tokenization optimized for Apple Silicon
9191
- MLX-accelerated model inference through unified memory
92-
- 320-dimensional vector generation in <1ms
92+
- Dynamic vector dimension auto-detected from model
9393
- Automatic normalization for cosine similarity
9494
9595
🎯 Perfect for:

app/utils/model_metadata.py

Lines changed: 17 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -99,8 +99,18 @@ async def extract_metadata(model_path_or_name: str) -> Dict[str, Any]:
9999
}
100100
)
101101

102-
# 임베딩 차원 결정 (보통 hidden_size와 동일)
103-
metadata["embedding_dimension"] = metadata["hidden_size"]
102+
# 임베딩 차원 결정 (우선순위: hidden_size > d_model > embedding_size > model_dim > dim)
103+
candidate_keys = [
104+
"hidden_size",
105+
"d_model",
106+
"embedding_size",
107+
"model_dim",
108+
"dim",
109+
]
110+
for k in candidate_keys:
111+
if k in config and isinstance(config[k], int):
112+
metadata["embedding_dimension"] = int(config[k])
113+
break
104114

105115
# 권장 최대 토큰 계산 (성능 최적화)
106116
max_pos = metadata["max_position_embeddings"]
@@ -180,11 +190,11 @@ def extract_metadata_from_path(self, model_path: str) -> Dict[str, Any]:
180190
with open(config_file, 'r', encoding='utf-8') as f:
181191
config = json.load(f)
182192

183-
# 임베딩 차원 추출
184-
if "hidden_size" in config:
185-
metadata["embedding_dimension"] = config["hidden_size"]
186-
elif "d_model" in config:
187-
metadata["embedding_dimension"] = config["d_model"]
193+
# 임베딩 차원 추출 (여러 키 지원)
194+
for k in ["hidden_size", "d_model", "embedding_size", "model_dim", "dim"]:
195+
if k in config and isinstance(config[k], int):
196+
metadata["embedding_dimension"] = int(config[k])
197+
break
188198

189199
# 최대 포지션 임베딩 추출
190200
if "max_position_embeddings" in config:

0 commit comments

Comments
 (0)