feat: Add dynamic embedding dimension detection and configuration

embed-rerank-bot · embed-rerank-bot · commit 832eead3f51a · 2025-11-05T00:18:10.000+09:00
Major improvements:
- Automatic dimension detection from model config (hidden_size, d_model, embedding_size, model_dim, dim)
- Optional fixed output dimension controls via OUTPUT_EMBEDDING_DIMENSION and DIMENSION_STRATEGY env vars
- OpenAI-compatible 'dimensions' request field for per-request dimension control
- Enhanced MLX backend to properly read config with multiple dimension key fallbacks
- Fixed placeholder/fallback path to respect config dict values
- Updated router docstrings to reflect dynamic dimensions
- Comprehensive README documentation with dimension configuration best practices
- Added LightRAG integration guidance and Qwen similarity scaling notes

Breaking changes: None - fully backward compatible
Version: 1.3.0
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -22,6 +22,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - These updates maintain full compatibility with existing OpenAI SDK usage; default remains `encoding_format="float"`.
 
 
+## [1.3.0] - 2025-11-04
+
+### Added
+- 📏 Dynamic embedding dimension documentation in README, including supported config keys and best practices for vector DBs
+- 🧩 Optional fixed output-dimension controls (disabled by default)
+  - Env: `OUTPUT_EMBEDDING_DIMENSION`, `DIMENSION_STRATEGY` (pad|trim)
+  - OpenAI-compatible `dimensions` request field mapped to per-request trim when no global override is set
+
+### Fixed
+- 🛠 MLX backend embedding dimension alignment
+  - Properly reads model config dimension using multiple keys (`hidden_size`, `d_model`, `embedding_size`, `model_dim`, `dim`)
+  - Placeholder/fallback path now respects config dict values (no unintended 4096 defaulting)
+
+### Changed
+- 🧹 Removed hardcoded dimension assumptions from docs; router docstrings updated to reflect dynamic dimensions
+- 📚 README updated with “Dynamic Embedding Dimensions” section and optional compatibility knobs
+
 ### Added
 - 🆕 **Cohere API v1/v2 Compatibility**: Full support for Cohere reranking API
   - `/v1/rerank` endpoint (legacy format support)
diff --git a/README.md b/README.md
@@ -13,9 +13,7 @@
 Lightning-fast local embeddings & reranking for Apple Silicon (MLX-first). OpenAI, TEI, and Cohere compatible.
 
 ## 🔧 Troubleshooting
-
 ### Common Issues
-
 **"Embedding service not initialized" Error**: Fixed in v1.2.0. If you encounter this error:
 1. Update to the latest version: `pip install --upgrade embed-rerank`
 2. For source installations, ensure proper service initialization in `main.py`
@@ -40,35 +38,15 @@ For comprehensive troubleshooting, see [docs/TROUBLESHOOTING.md](docs/TROUBLESHO
 
 Recent MLX versions removed `mx.array` in favor of `mx.asarray` (and `mx.numpy.array`). This repository includes a compatibility helper that automatically forwards to the appropriate API, so Apple Silicon embeddings continue to work across MLX versions.
 
-What changed:
+**What changed:**
 - Internal `mx.array(...)` calls now use a helper that tries, in order: `mx.array` → `mx.asarray` → `mx.numpy.array`.
-- Placeholder embedding fallback now respects the model configuration using `config['hidden_size']` (previously some error paths defaulted to 4096).
+- Placeholder embedding fallback now respects the model configuration using multiple dimension keys.
 
-Why this matters:
+**Why this matters:**
 - Prevents runtime error: `module 'mlx.core' has no attribute 'array'` on newer MLX.
-- Ensures embedding dimension matches the loaded model, avoiding vector size mismatches (e.g., when updating existing ChromaDB collections).
-
-Quick validation (Apple Silicon + MLX installed):
-```python
-import asyncio
-from app.backends.factory import BackendFactory
-
-async def main():
-    backend = BackendFactory.create_backend("mlx", "mlx-community/Qwen3-Embedding-4B-4bit-DWQ")
-    await backend.load_model()
-    res = await backend.embed_texts(["hello", "world"])
-    print("shape:", res.vectors.shape)  # (2, <model_hidden_size>)
-
-asyncio.run(main())
-```
-
-Notes:
-- Optional dependency for MLX (macOS only): `pip install "embed-rerank[mlx]"` or see `pyproject.toml` (`mlx>=0.4.0`, `mlx-lm>=0.2.0`).
-- If you maintain an existing ChromaDB collection, verify that new embeddings match the existing dimension before upsert.
-
----
-
+- Ensures embedding dimension matches the loaded model, avoiding vector size mismatches.
 
+**Optional dependency for MLX (macOS only):** `pip install "embed-rerank[mlx]"` or see `pyproject.toml` (`mlx>=0.4.0`, `mlx-lm>=0.2.0`).
 
 ---
 
@@ -213,13 +191,35 @@ The service automatically handles long texts with intelligent processing:
 - **Auto-Truncation**: Texts exceeding token limits are automatically reduced by ~75%
 - **Smart Summarization**: Key sentences are preserved while removing redundancy
 - **Dynamic Token Limits**: Automatically detected from model metadata (e.g., 512 tokens for Qwen3)
-- **Dimension Detection**: Vector dimensions auto-configured from model (e.g., 1024D for Qwen3)
+- **Dynamic Dimension Detection**: Vector dimensions auto-configured from model metadata
 - **Processing Transparency**: Optional processing info in API responses
 
 **Example: 8000+ character text → 2037 tokens automatically**
 
 ---
 
+### 📏 Dynamic Embedding Dimensions
+
+- The service derives embedding dimension directly from the loaded model’s config.
+- Supported config keys (priority): `hidden_size` → `d_model` → `embedding_size` → `model_dim` → `dim`.
+- Backend and health endpoints report the actual vector size; clients should not assume a fixed dimension.
+- Tip for vector DBs (e.g., Qdrant): create the collection with the reported dimension.
+
+#### Optional: Fixed Output Dimension (Compatibility)
+
+If you already have an index built at a specific dimension (e.g., 4096), you can ask the service to pad/trim output vectors to that size:
+
+```env
+# Optional – force output vectors to a fixed size
+OUTPUT_EMBEDDING_DIMENSION=4096
+# Strategy: pad with zeros or trim leading dimensions (then re-normalize)
+DIMENSION_STRATEGY=pad   # or trim
+```
+
+- Service-level setting takes precedence over per-request settings.
+- OpenAI-compatible `dimensions` request field is supported and maps to trim behavior when no global override is set.
+- For cosine similarity, zero-padding + re-normalization is safe; for other metrics, prefer retraining/reindexing.
+
 ### 📂 Model Cache Management
 
 The service automatically manages model downloads and caching:
@@ -272,11 +272,7 @@ response = client.embeddings.create(
     model="text-embedding-ada-002"
 )
 # 🚀 10x faster than OpenAI, same code!
-
 ```
-
-#### Base64 encoding (OpenAI-compatible)
-
 You can request base64-encoded embeddings by setting `encoding_format="base64"`. This is useful when transporting vectors through systems that expect strings only.
 
 ```python
@@ -330,25 +326,6 @@ response = requests.post("http://localhost:9000/v1/rerank", json={
 })
 ```
 
----
-
-## 🧩 LightRAG Integration
-
-We validated an end-to-end workflow using LightRAG with this service:
-- Embeddings via the OpenAI-compatible endpoint (`/v1/embeddings`)
-- Reranking via the Cohere-compatible endpoint (`/v1/rerank` or `/v2/rerank`)
-
-Results: the integration tests succeeded using OpenAI embeddings and Cohere reranking.
-
-Qwen Embedding similarity scaling note: when using the Qwen Embedding model, we observed cosine similarity values that appear very small (e.g., `0.02`, `0.03`). This is expected due to vector scaling differences and does not indicate poor retrieval by itself. As a starting point, we recommend disabling the retrieval threshold in LightRAG to avoid filtering out good matches prematurely:
-
-```
-# === Retrieval threshold ===
-COSINE_THRESHOLD=0.0
-```
-
-Adjust upward later based on your dataset and evaluation results.
-
 ### Native API
 
 ```bash
@@ -416,6 +393,9 @@ embed-rerank --test full --test-url http://localhost:9000
 
 ### 🔧 Advanced Testing (Source Code)
 
+```bash
+### 🔧 Advanced Testing (Source Code)
+
 For development and comprehensive testing with the source code:
 
 ```bash
@@ -474,6 +454,7 @@ embed-rerank --port 9000 &
 ```
 
 > **Windows Support**: Coming soon! Currently optimized for macOS/Linux.
+```
 
 ---
 
@@ -510,7 +491,7 @@ embed-rerank --port 9000 &
 
 ---
 
-## 📝 Quick Reference
+## Quick Reference
 
 ### Installation & Startup
 ```bash
@@ -554,6 +535,25 @@ flake8 app/ tests/ --max-line-length=120 --extend-ignore=E203,W503  # Linting
 
 ---
 
+## 🧩 LightRAG Integration
+
+We validated an end-to-end workflow using LightRAG with this service:
+- Embeddings via the OpenAI-compatible endpoint (`/v1/embeddings`)
+- Reranking via the Cohere-compatible endpoint (`/v1/rerank` or `/v2/rerank`)
+
+Results: the integration tests succeeded using OpenAI embeddings and Cohere reranking.
+
+Qwen Embedding similarity scaling note: when using the Qwen Embedding model, we observed cosine similarity values that appear very small (e.g., `0.02`, `0.03`). This is expected due to vector scaling differences and does not indicate poor retrieval by itself. As a starting point, we recommend disabling the retrieval threshold in LightRAG to avoid filtering out good matches prematurely:
+
+```
+# === Retrieval threshold ===
+COSINE_THRESHOLD=0.0
+```
+
+Adjust upward later based on your dataset and evaluation results.
+
+---
+
 ## 📄 License
 
 MIT License - build amazing things with this code!
diff --git a/app/__init__.py b/app/__init__.py
@@ -7,15 +7,17 @@
 - Apple Silicon MLX optimization with PyTorch fallback
 - Multi-API compatibility: Native, OpenAI, TEI, and Cohere formats
 
-🚀 NEW in v1.2.3: OpenAI base64 encoding support + docs update
-- Fixed Cohere API tests with proper environment variable handling
-- Resolved pytest environment variable propagation issues
-- Eliminated false warnings while maintaining 100% API compatibility
-- Enhanced test suite reliability and consistency
-- All API formats (Native, OpenAI, TEI, Cohere) now show clean success status
+🚀 NEW in v1.3.0: Dynamic embedding dimensions and enhanced model configuration!
+- Automatic embedding dimension detection from model config (hidden_size, d_model, etc.)
+- Optional fixed output dimension controls (OUTPUT_EMBEDDING_DIMENSION, DIMENSION_STRATEGY)
+- OpenAI-compatible 'dimensions' request field support for per-request dimension control
+- OpenAI base64 encoding support (encoding_format="base64")
+- MLX backend now properly reads config with multiple dimension key fallbacks
+- Enhanced README documentation with dimension configuration best practices
+- LightRAG integration guidance and Qwen similarity scaling notes
 
 Author: joonsoo-me
 """
 
-__version__ = "1.2.3"
+__version__ = "1.3.0"
 __author__ = "joonsoo-me"
diff --git a/app/backends/mlx_backend.py b/app/backends/mlx_backend.py
@@ -264,6 +264,16 @@ def _load_model_sync(self):
                 logger.info("config.json missing - using default config for Qwen3 model")
                 config = {"hidden_size": 4096, "max_position_embeddings": 32768}
 
+            # Normalize config keys: some models (e.g., Qwen3) use 'd_model'
+            # Ensure 'hidden_size' is present for downstream logic
+            if isinstance(config, dict):
+                if 'hidden_size' not in config and 'd_model' in config:
+                    try:
+                        config['hidden_size'] = int(config['d_model'])
+                    except Exception:
+                        # Fallback silently if casting fails
+                        config['hidden_size'] = config['d_model']
+
             # Attempt to locate and load MLX weights
             weights_path = self._find_weights_file(model_dir)
             if weights_path:
@@ -279,7 +289,8 @@ def _load_model_sync(self):
                 logger.info("No MLX weights found - creating compatible embedding model")
 
             # Create a compatible MLX embedding model
-            hidden_size = config.get('hidden_size', 4096)
+            # Prefer explicit hidden_size, fall back to d_model
+            hidden_size = config.get('hidden_size') or config.get('d_model', 4096)
             model = self._create_placeholder_model(hidden_size)
             config['hidden_size'] = hidden_size
             logger.info("🧪 Created MLX-compatible embedding model", hidden_size=hidden_size)
@@ -393,7 +404,8 @@ class MLXEmbeddingModel:
                 def __init__(self, config, weights):
                     self.config = config
                     self.weights = weights
-                    self.hidden_size = config.get('hidden_size', 4096)
+                    # Some configs expose 'd_model' rather than 'hidden_size'
+                    self.hidden_size = config.get('hidden_size') or config.get('d_model', 4096)
                     self.max_position_embeddings = config.get('max_position_embeddings', 32768)
 
                 def embed(self, input_ids):
@@ -524,8 +536,11 @@ def _embed_sync(self, texts: List[str], batch_size: int) -> np.ndarray:
 
     def _generate_placeholder_embeddings(self, texts: List[str]) -> np.ndarray:
         """Generate placeholder embeddings for fallback."""
-        # self.config is a dict; use dict.get to reflect actual model settings
-        embedding_dim = self.config.get('hidden_size', 4096) if self.config else 4096
+        # self.config is a dict; prefer hidden_size then d_model for dynamic dimension support
+        if isinstance(self.config, dict):
+            embedding_dim = self.config.get('hidden_size') or self.config.get('d_model') or 4096
+        else:
+            embedding_dim = 4096
 
         # Use text hash for deterministic embeddings
         embeddings = []
diff --git a/app/routers/embedding_router.py b/app/routers/embedding_router.py
@@ -7,7 +7,7 @@
 
 ⚡ Performance Highlights:
 - Sub-millisecond text embedding generation
-- 320-dimensional vectors optimized for semantic search
+- Dynamic vector dimension (auto-detected from model)
 - Batch processing with MLX acceleration
 - Zero-copy operations on Apple Silicon
 
@@ -89,7 +89,7 @@ async def generate_embeddings(request: EmbedRequest, service: EmbeddingService =
     ✨ What happens here:
     - Text tokenization optimized for Apple Silicon
     - MLX-accelerated model inference through unified memory
-    - 320-dimensional vector generation in <1ms
+    - Dynamic vector dimension auto-detected from model
     - Automatic normalization for cosine similarity
 
     🎯 Perfect for:
diff --git a/app/utils/model_metadata.py b/app/utils/model_metadata.py
@@ -99,8 +99,18 @@ async def extract_metadata(model_path_or_name: str) -> Dict[str, Any]:
                     }
                 )
 
-                # 임베딩 차원 결정 (보통 hidden_size와 동일)
-                metadata["embedding_dimension"] = metadata["hidden_size"]
+                # 임베딩 차원 결정 (우선순위: hidden_size > d_model > embedding_size > model_dim > dim)
+                candidate_keys = [
+                    "hidden_size",
+                    "d_model",
+                    "embedding_size",
+                    "model_dim",
+                    "dim",
+                ]
+                for k in candidate_keys:
+                    if k in config and isinstance(config[k], int):
+                        metadata["embedding_dimension"] = int(config[k])
+                        break
 
                 # 권장 최대 토큰 계산 (성능 최적화)
                 max_pos = metadata["max_position_embeddings"]
@@ -180,11 +190,11 @@ def extract_metadata_from_path(self, model_path: str) -> Dict[str, Any]:
                 with open(config_file, 'r', encoding='utf-8') as f:
                     config = json.load(f)
 
-                # 임베딩 차원 추출
-                if "hidden_size" in config:
-                    metadata["embedding_dimension"] = config["hidden_size"]
-                elif "d_model" in config:
-                    metadata["embedding_dimension"] = config["d_model"]
+                # 임베딩 차원 추출 (여러 키 지원)
+                for k in ["hidden_size", "d_model", "embedding_size", "model_dim", "dim"]:
+                    if k in config and isinstance(config[k], int):
+                        metadata["embedding_dimension"] = int(config[k])
+                        break
 
                 # 최대 포지션 임베딩 추출
                 if "max_position_embeddings" in config: