
Bug Report: Embedding Model Crashes and Performance Degradation in LM Studio v0.3.34 #171

@urlan

Description

Environment

  • LM Studio Version: 0.3.34
  • Operating System: Windows 11
  • VRAM: 8GB
  • Models Tested:
    • nomic-embed-text-v1.5-GGUF (Q4_K_M and Q5_0 quantizations)
    • gpt-oss:20b
    • gemma3:27b

Issue Summary

After updating to LM Studio v0.3.34, embedding generation crashes consistently during batch processing, and LLM inference performance has degraded significantly (3x slower response times).


Problem 1: Embedding Model Crashes

Expected Behavior

  • In previous LM Studio versions (< 0.3.34), nomic-embed-text-v1.5 processed embeddings reliably with 4 concurrent threads
  • Model handled standard text inputs (~1500 characters) without crashes
  • No memory leaks or model unloading during batch operations

Current Behavior (v0.3.34)

  • Model crashes after processing 2-5 embeddings with error:
    Error code: 400 - {'error': 'The model has crashed without additional information. (Exit code: 18446744072635812000).'}
  • Model reports as "unloaded or crashed" in subsequent requests
  • Requires manual model reload after each crash

Reproduction Steps

  1. Load nomic-embed-text-v1.5-GGUF (Q4_K_M or Q5_0) in LM Studio v0.3.34
  2. Send multiple embedding requests to the OpenAI-compatible API endpoint:
     POST http://localhost:1234/v1/embeddings
     {
       "model": "text-embedding-nomic-embed-text-v1.5",
       "input": "",
       "encoding_format": "base64"
     }
  3. Process 4+ embeddings concurrently (ThreadPoolExecutor with 4 workers; see the sketch after these steps)
  4. Observe a crash after 2-5 successful embeddings
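
For reference, a minimal repro sketch in Python (the endpoint and model name match the request above; the sample texts, worker count, and the use of encoding_format="float" instead of "base64" are simplifications):

import concurrent.futures

from openai import OpenAI

# Point the OpenAI SDK at the local LM Studio server (the API key is unused locally).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def embed(text):
    # Each call issues POST /v1/embeddings against the LM Studio server.
    resp = client.embeddings.create(
        model="text-embedding-nomic-embed-text-v1.5",
        input=text,
        encoding_format="float",  # the report uses "base64"; "float" keeps the sketch simple
    )
    return resp.data[0].embedding

texts = ["sample document %d ..." % i for i in range(20)]  # ~1500-char inputs in practice

# 4 concurrent workers, matching the configuration that triggers the crash.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for i, vec in enumerate(pool.map(embed, texts)):
        print("embedding %d: dimension %d" % (i + 1, len(vec)))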

Sample Request Logs

2025-12-11 16:54:21 - HTTP 200 OK (embedding 1) ✓
2025-12-11 16:54:21 - HTTP 200 OK (embedding 2) ✓
2025-12-11 16:54:29 - HTTP 400 Bad Request (embedding 3) ✗
Error: "The model has crashed without additional information. (Exit code: 18446744072635812000)"
2025-12-11 16:54:29 - HTTP 400 Bad Request (embedding 4) ✗
Error: "Model has unloaded or crashed.."

Workarounds Attempted

  • Re-downloading fresh model weights: temporarily resolves the issue, but crashes return after 2-10 embeddings
  • Reducing concurrent threads to 2: still crashes (slower, but same result)
  • Restarting LM Studio: only a temporary fix

Problem 2: LLM Inference Performance Degradation

Expected Behavior (Previous Versions)

  • gpt-oss:20b and gemma3:27b with small context windows (~2000 tokens)
  • Average response time: ~1 minute
  • 8GB VRAM sufficient for inference

Current Behavior (v0.3.34)

  • Same models with identical prompts/settings
  • Average response time: ~3 minutes (3x slower)
  • Some requests timeout without response
  • VRAM usage appears normal (~6-7GB)

Reproduction Steps

  1. Load gpt-oss:20b or gemma3:27b in LM Studio v0.3.34
  2. Send an inference request with a ~2000 token context via the API
  3. Measure time-to-first-token and total response time (see the timing sketch below)
  4. Compare with previous LM Studio versions
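
A hedged timing sketch (it assumes the chat completions endpoint with streaming; the model identifiers are taken from this report, though the keys LM Studio exposes may differ):

import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = "..."  # ~2000-token context in practice
start = time.perf_counter()
first_token = None

# Stream the response so time-to-first-token can be measured separately
# from total response time.
stream = client.chat.completions.create(
    model="gpt-oss:20b",  # or "gemma3:27b"
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    if first_token is None and chunk.choices and chunk.choices[0].delta.content:
        first_token = time.perf_counter() - start
total = time.perf_counter() - start

if first_token is not None:
    print("time-to-first-token: %.2fs" % first_token)
print("total response time: %.2fs" % total)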

System Impact

  • Workflow Blocked: Batch embedding generation for forensic document analysis (thousands of documents) is infeasible due to constant crashes
  • Production Impact: Applications relying on LM Studio embeddings are now unstable
  • Workaround Cost: Manually restarting every ~5 embeddings is not scalable

Suspected Root Cause

Based on logs and testing:

  1. Memory Management Regression: Model appears to leak memory or fail to release VRAM between requests
  2. Concurrency Bug: Multi-threaded requests may trigger a race condition in v0.3.34 (a serialized control test is sketched after this list)
  3. Quantization Handling: Both Q4_K_M and Q5_0 exhibit the same behavior (not quantization-specific)
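
To help separate hypothesis 1 from hypothesis 2, a fully serialized control run (one request at a time, no thread pool) would be informative; a minimal sketch using the requests library, with illustrative sample inputs:

import requests

# Serialized control test: one request at a time, no thread pool.
# If crashes still occur here, the race-condition hypothesis (2) becomes
# less likely than the memory-leak hypothesis (1).
for i in range(20):
    resp = requests.post(
        "http://localhost:1234/v1/embeddings",
        json={
            "model": "text-embedding-nomic-embed-text-v1.5",
            "input": "sample document %d ..." % i,  # ~1500-char inputs in practice
        },
        timeout=60,
    )
    print("embedding %d: HTTP %d" % (i + 1, resp.status_code))
    resp.raise_for_status()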

Request

  • Urgent: Please investigate memory handling changes in v0.3.34 for embedding models
  • Alternative: Provide rollback instructions or a direct download link for v0.3.33 (the last stable version)
  • Logs: Happy to provide additional debug logs if needed

Additional Context

  • API Format: Using OpenAI-compatible endpoint (/v1/embeddings)
  • Client Library: openai Python SDK (v1.x)
  • Network: Local inference (no remote API calls)
  • Previous Versions Tested: v0.3.30-v0.3.33 worked reliably with identical code

Workaround Confirmed

Migrating to Ollama resolves the issue completely.

Test Results

  • LM Studio v0.3.34: Crashes after 2-5 embeddings with exit code 18446744072635812000
  • Ollama (same model): Stable, no crashes after 100+ embeddings

Migration Details

Installed Ollama v0.5.4:
ollama pull nomic-embed-text:v1.5

Updated the API endpoint:
FROM: http://localhost:1234/v1/embeddings (LM Studio)
TO:   http://localhost:11434/api/embeddings (Ollama)
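
For completeness, a minimal client sketch against the Ollama endpoint (it assumes Ollama's documented /api/embeddings request shape with "model" and "prompt" fields; the input text is illustrative):

import requests

# Request a single embedding from Ollama's native endpoint.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "nomic-embed-text:v1.5",
        "prompt": "sample document text ...",
    },
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]
print("dimension:", len(embedding))  # the logs below show dimension 768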

Performance Comparison

Metric              | LM Studio v0.3.34                 | Ollama
--------------------|-----------------------------------|--------------------------
Stability           | ❌ Crashes every 2-5 requests      | ✅ Stable (100+ requests)
Speed               | ~8s per embedding (when working)  | ~0.13s per embedding
VRAM                | Memory leak suspected             | Efficient management
Concurrent requests | Fails with 4 threads              | Works with 4 threads

Logs (Ollama - Working)

2025-12-11 17:22:39 - Client connected successfully: 192.168.1.101:11434
2025-12-11 17:22:39 - Embedding generated: dimension 768 (0.128s) ✓
2025-12-11 17:22:39 - Embedding generated: dimension 768 (0.010s) ✓
2025-12-11 17:22:39 - Item added to collection successfully


Conclusion: This confirms the issue is specific to LM Studio v0.3.34 and not related to:

  • Model quantization (Q4_K_M, Q5_0)
  • Input text content
  • API client library (OpenAI SDK)
  • Concurrent request handling (works fine in Ollama)

The regression appears to be in LM Studio's embedding server implementation or memory management.


Request for Resolution

Since downgrading to v0.3.33 is not officially supported, please:

  1. Investigate memory-management changes in the v0.3.34 embedding server
  2. Provide a rollback option or a direct link to the v0.3.33 installer
  3. Fix the crash in an upcoming release

Thank you for looking into this critical issue affecting production workflows.
