Description
Environment
- LM Studio Version: 0.3.34
- Operating System: Windows 11
- VRAM: 8GB
- Models Tested:
  - `nomic-embed-text-v1.5-GGUF` (Q4_K_M and Q5_0 quantizations)
  - `gpt-oss:20b`
  - `gemma3:27b`
Issue Summary
After updating to LM Studio v0.3.34, embedding generation crashes consistently during batch processing, and LLM inference performance has degraded significantly (3x slower response times).
Problem 1: Embedding Model Crashes
Expected Behavior
- In previous LM Studio versions (< 0.3.34), `nomic-embed-text-v1.5` processed embeddings reliably with 4 concurrent threads
- Model handled standard text inputs (~1500 characters) without crashes
- No memory leaks or model unloading during batch operations
Current Behavior (v0.3.34)
- Model crashes after processing 2-5 embeddings with error:

  ```text
  Error code: 400 - {'error': 'The model has crashed without additional information. (Exit code: 18446744072635812000).'}
  ```

- Model reports as "unloaded or crashed" in subsequent requests
- Requires manual model reload after each crash
Reproduction Steps
- Load `nomic-embed-text-v1.5-GGUF` (Q4_K_M or Q5_0) in LM Studio v0.3.34
- Send multiple embedding requests via the OpenAI-compatible API endpoint:

  ```text
  POST http://localhost:1234/v1/embeddings
  {
      "model": "text-embedding-nomic-embed-text-v1.5",
      "input": "",
      "encoding_format": "base64"
  }
  ```

- Process 4+ embeddings concurrently (`ThreadPoolExecutor` with 4 workers)
- Observe crash after 2-5 successful embeddings
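The steps above can be sketched in Python using only the standard library (a hypothetical reproduction snippet, not code from the report; the endpoint and model name come from the request shown above, and the default float encoding is used instead of `base64` to keep response parsing simple):

```python
# Reproduction sketch: embed several ~1500-character texts through LM Studio's
# OpenAI-compatible endpoint with 4 concurrent worker threads.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

LMSTUDIO_URL = "http://localhost:1234/v1/embeddings"

def embed_one(text):
    """POST a single embedding request and return the vector."""
    payload = json.dumps({
        "model": "text-embedding-nomic-embed-text-v1.5",
        "input": text,
    }).encode()
    req = urllib.request.Request(
        LMSTUDIO_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # raises on the HTTP 400 crash
        return json.load(resp)["data"][0]["embedding"]

def embed_many(embed_fn, texts, workers=4):
    """Fan texts out over `workers` threads; result order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed_fn, texts))
```

Running `embed_many(embed_one, docs)` against a live v0.3.34 instance should hit the HTTP 400 crash after a handful of requests; on earlier versions the batch completes.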
Sample Request Logs
```text
2025-12-11 16:54:21 - HTTP 200 OK (embedding 1) ✓
2025-12-11 16:54:21 - HTTP 200 OK (embedding 2) ✓
2025-12-11 16:54:29 - HTTP 400 Bad Request (embedding 3) ✗
Error: "The model has crashed without additional information. (Exit code: 18446744072635812000)"
2025-12-11 16:54:29 - HTTP 400 Bad Request (embedding 4) ✗
Error: "Model has unloaded or crashed.."
```
Workaround Attempted
- Re-downloading fresh model weights: Temporarily resolves issue but crashes return after 2-10 embeddings
- Reducing concurrent threads to 2: Still crashes (slower but same result)
- Restarting LM Studio: Only temporary fix
Problem 2: LLM Inference Performance Degradation
Expected Behavior (Previous Versions)
- `gpt-oss:20b` and `gemma3:27b` with small context windows (~2000 tokens)
- Average response time: ~1 minute
- 8GB VRAM sufficient for inference
Current Behavior (v0.3.34)
- Same models with identical prompts/settings
- Average response time: ~3 minutes (3x slower)
- Some requests timeout without response
- VRAM usage appears normal (~6-7GB)
Reproduction Steps
- Load `gpt-oss:20b` or `gemma3:27b` in LM Studio v0.3.34
- Send inference request with ~2000 token context via API
- Measure time-to-first-token and total response time
- Compare with previous LM Studio versions
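A backend-agnostic way to take the time measurements above (a hypothetical helper, not from the report) is to time the iterable of streamed response chunks, e.g. the stream returned by `client.chat.completions.create(..., stream=True)` in the `openai` SDK:

```python
import time

def measure_stream(chunks):
    """Consume an iterable of streamed response chunks and return
    (time_to_first_token, total_time) in seconds."""
    start = time.perf_counter()
    ttft = None
    for _ in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
    return ttft, time.perf_counter() - start
```

Comparing the two numbers between v0.3.33 and v0.3.34 would distinguish a prompt-processing slowdown (higher time-to-first-token) from a token-generation slowdown (higher total time at similar TTFT).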
System Impact
- Workflow Blocked: Batch embedding generation for forensic document analysis (thousands of documents) is infeasible due to constant crashes
- Production Impact: Applications relying on LM Studio embeddings are now unstable
- Workaround Cost: Manually restarting after every ~5 embeddings is not scalable
Suspected Root Cause
Based on logs and testing:
- Memory Management Regression: Model appears to leak memory or fail to release VRAM between requests
- Concurrency Bug: Multi-threaded requests may trigger race condition in v0.3.34
- Quantization Handling: Both Q4_K_M and Q5_0 exhibit same behavior (not quantization-specific)
Request
- Urgent: Please investigate memory handling changes in v0.3.34 for embedding models
- Alternative: Provide rollback instructions or direct download link for v0.3.33 (last stable version)
- Logs: Happy to provide additional debug logs if needed
Additional Context
- API Format: OpenAI-compatible endpoint (`/v1/embeddings`)
- Client Library: `openai` Python SDK (v1.x)
- Network: Local inference (no remote API calls)
- Previous Versions Tested: v0.3.30-v0.3.33 worked reliably with identical code
References
- Model: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF
- Error Code: `18446744072635812000` (appears to be a negative exit code printed as an unsigned 64-bit integer, consistent with a hard crash such as an access violation or OOM rather than a clean exit)
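That unsigned-overflow reading can be checked with two lines of arithmetic: subtracting 2^64 recovers the signed value the process actually returned (interpreting the resulting status is an assumption; the low 32 bits merely fall in the Windows NTSTATUS error range, 0xC0000000 and above):

```python
# The logged exit code is 2**64 plus a negative number, i.e. a signed exit
# status printed as unsigned. Undo the wraparound to see the raw value.
code = 18446744072635812000
signed = code - 2**64 if code >= 2**63 else code
print(signed)                    # -1073739616
print(hex(signed & 0xFFFFFFFF))  # 0xc00008a0 (low 32 bits, NTSTATUS-style)
```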
Workaround Confirmed
Migrating to Ollama resolves the issue completely.
Test Results
- LM Studio v0.3.34: Crashes after 2-5 embeddings with exit code `18446744072635812000`
- Ollama (same model): Stable, no crashes after 100+ embeddings
Migration Details
Installed Ollama v0.5.4:

```text
ollama pull nomic-embed-text:v1.5
```

Updated API endpoint:

```text
FROM: http://localhost:1234/v1/embeddings (LM Studio)
TO:   http://localhost:11434/api/embeddings (Ollama)
```
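The client-side change is small but not a drop-in swap: Ollama's native `/api/embeddings` route takes a `prompt` field and returns a single `embedding` array rather than the OpenAI-style `data` list (a sketch based on Ollama's documented API; verify the field names against your installed version):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_payload(model, text):
    """Ollama's native embeddings route uses 'prompt', not 'input'."""
    return json.dumps({"model": model, "prompt": text}).encode()

def embed_ollama(text, model="nomic-embed-text:v1.5"):
    req = urllib.request.Request(
        OLLAMA_URL, data=build_payload(model, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]  # 768-dim vector for this model
```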
Performance Comparison
| Metric | LM Studio v0.3.34 | Ollama |
|---|---|---|
| Stability | ❌ Crashes every 2-5 requests | ✅ Stable (100+ requests) |
| Speed | ~8s per embedding (when working) | ~0.13s per embedding |
| VRAM | Memory leak suspected | Efficient management |
| Concurrent requests | Fails with 4 threads | Works with 4 threads |
Logs (Ollama - Working)
```text
2025-12-11 17:22:39 - Client connected successfully: 192.168.1.101:11434
2025-12-11 17:22:39 - Embedding generated: dimension 768 (0.128s) ✓
2025-12-11 17:22:39 - Embedding generated: dimension 768 (0.010s) ✓
2025-12-11 17:22:39 - Item added to collection successfully
```
Conclusion: This confirms the issue is specific to LM Studio v0.3.34 and not related to:
- Model quantization (Q4_K_M, Q5_0)
- Input text content
- API client library (OpenAI SDK)
- Concurrent request handling (works fine in Ollama)
The regression appears to be in LM Studio's embedding server implementation or memory management.
Request for Resolution
Since downgrading to v0.3.33 is not officially supported, please:
- Investigate memory management changes in v0.3.34 embedding server
- Provide rollback option or direct link to v0.3.33 installer
- Fix the crash in upcoming release
Thank you for looking into this critical issue affecting production workflows.