
Bug Report: Embedding Model Crashes and Performance Degradation in LM Studio v0.3.34 #171

@urlan

Description

Environment

  • LM Studio Version: 0.3.34
  • Operating System: Windows 11
  • VRAM: 8GB
  • Models Tested:
    • nomic-embed-text-v1.5-GGUF (Q4_K_M and Q5_0 quantizations)
    • gpt-oss:20b
    • gemma3:27b

Issue Summary

After updating to LM Studio v0.3.34, embedding generation crashes consistently during batch processing, and LLM inference performance has degraded significantly (3x slower response times).


Problem 1: Embedding Model Crashes

Expected Behavior

  • In previous LM Studio versions (< 0.3.34), nomic-embed-text-v1.5 processed embeddings reliably with 4 concurrent threads
  • Model handled standard text inputs (~1500 characters) without crashes
  • No memory leaks or model unloading during batch operations

Current Behavior (v0.3.34)

  • Model crashes after processing 2-5 embeddings with error:
    Error code: 400 - {'error': 'The model has crashed without additional information. (Exit code: 18446744072635812000).'}
  • Model reports as "unloaded or crashed" in subsequent requests
  • Requires manual model reload after each crash

Reproduction Steps

  1. Load nomic-embed-text-v1.5-GGUF (Q4_K_M or Q5_0) in LM Studio v0.3.34
  2. Send multiple embedding requests to the OpenAI-compatible API endpoint:
     POST http://localhost:1234/v1/embeddings
     {
       "model": "text-embedding-nomic-embed-text-v1.5",
       "input": "",
       "encoding_format": "base64"
     }
  3. Process 4+ embeddings concurrently (ThreadPoolExecutor with 4 workers; see the sketch after these steps)
  4. Observe a crash after 2-5 successful embeddings
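
For reference, a minimal repro sketch in Python (the endpoint and model name match the request above; the sample texts, worker count, and the use of encoding_format="float" instead of "base64" are simplifications):

import concurrent.futures

from openai import OpenAI

# Point the OpenAI SDK at the local LM Studio server (the API key is unused locally).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def embed(text):
    # Each call issues POST /v1/embeddings against the LM Studio server.
    resp = client.embeddings.create(
        model="text-embedding-nomic-embed-text-v1.5",
        input=text,
        encoding_format="float",  # the report uses "base64"; "float" keeps the sketch simple
    )
    return resp.data[0].embedding

texts = ["sample document %d ..." % i for i in range(20)]  # ~1500-char inputs in practice

# 4 concurrent workers, matching the configuration that triggers the crash.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for i, vec in enumerate(pool.map(embed, texts)):
        print("embedding %d: dimension %d" % (i + 1, len(vec)))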

Sample Request Logs

2025-12-11 16:54:21 - HTTP 200 OK (embedding 1) ✓
2025-12-11 16:54:21 - HTTP 200 OK (embedding 2) ✓
2025-12-11 16:54:29 - HTTP 400 Bad Request (embedding 3) ✗
Error: "The model has crashed without additional information. (Exit code: 18446744072635812000)"
2025-12-11 16:54:29 - HTTP 400 Bad Request (embedding 4) ✗
Error: "Model has unloaded or crashed.."

Workarounds Attempted

  • Re-downloading fresh model weights: temporarily resolves the issue, but crashes return after 2-10 embeddings
  • Reducing concurrent threads to 2: still crashes (slower, but same result)
  • Restarting LM Studio: only a temporary fix

Problem 2: LLM Inference Performance Degradation

Expected Behavior (Previous Versions)

  • gpt-oss:20b and gemma3:27b with small context windows (~2000 tokens)
  • Average response time: ~1 minute
  • 8GB VRAM sufficient for inference

Current Behavior (v0.3.34)

  • Same models with identical prompts/settings
  • Average response time: ~3 minutes (3x slower)
  • Some requests timeout without response
  • VRAM usage appears normal (~6-7GB)

Reproduction Steps

  1. Load gpt-oss:20b or gemma3:27b in LM Studio v0.3.34
  2. Send an inference request with a ~2000 token context via the API
  3. Measure time-to-first-token and total response time (see the timing sketch below)
  4. Compare with previous LM Studio versions
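
A hedged timing sketch (it assumes the chat completions endpoint with streaming; the model identifiers are taken from this report, though the keys LM Studio exposes may differ):

import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = "..."  # ~2000-token context in practice
start = time.perf_counter()
first_token = None

# Stream the response so time-to-first-token can be measured separately
# from total response time.
stream = client.chat.completions.create(
    model="gpt-oss:20b",  # or "gemma3:27b"
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    if first_token is None and chunk.choices and chunk.choices[0].delta.content:
        first_token = time.perf_counter() - start
total = time.perf_counter() - start

if first_token is not None:
    print("time-to-first-token: %.2fs" % first_token)
print("total response time: %.2fs" % total)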

System Impact

  • Workflow Blocked: Batch embedding generation for forensic document analysis (thousands of documents) is infeasible due to constant crashes
  • Production Impact: Applications relying on LM Studio embeddings are now unstable
  • Workaround Cost: Manually restarting every ~5 embeddings is not scalable

Suspected Root Cause

Based on logs and testing:

  1. Memory Management Regression: Model appears to leak memory or fail to release VRAM between requests
  2. Concurrency Bug: Multi-threaded requests may trigger a race condition in v0.3.34 (a serialized control test is sketched after this list)
  3. Quantization Handling: Both Q4_K_M and Q5_0 exhibit the same behavior (not quantization-specific)
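
To help separate hypothesis 1 from hypothesis 2, a fully serialized control run (one request at a time, no thread pool) would be informative; a minimal sketch using the requests library, with illustrative sample inputs:

import requests

# Serialized control test: one request at a time, no thread pool.
# If crashes still occur here, the race-condition hypothesis (2) becomes
# less likely than the memory-leak hypothesis (1).
for i in range(20):
    resp = requests.post(
        "http://localhost:1234/v1/embeddings",
        json={
            "model": "text-embedding-nomic-embed-text-v1.5",
            "input": "sample document %d ..." % i,  # ~1500-char inputs in practice
        },
        timeout=60,
    )
    print("embedding %d: HTTP %d" % (i + 1, resp.status_code))
    resp.raise_for_status()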

Request

  • Urgent: Please investigate memory handling changes in v0.3.34 for embedding models
  • Alternative: Provide rollback instructions or a direct download link for v0.3.33 (the last stable version)
  • Logs: Happy to provide additional debug logs if needed

Additional Context

  • API Format: Using OpenAI-compatible endpoint (/v1/embeddings)
  • Client Library: openai Python SDK (v1.x)
  • Network: Local inference (no remote API calls)
  • Previous Versions Tested: v0.3.30-v0.3.33 worked reliably with identical code

Workaround Confirmed

Migrating to Ollama resolves the issue completely.

Test Results

  • LM Studio v0.3.34: Crashes after 2-5 embeddings with exit code 18446744072635812000
  • Ollama (same model): Stable, no crashes after 100+ embeddings

Migration Details

Installed Ollama v0.5.4:
ollama pull nomic-embed-text:v1.5

Updated the API endpoint:
FROM: http://localhost:1234/v1/embeddings (LM Studio)
TO:   http://localhost:11434/api/embeddings (Ollama)
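
For completeness, a minimal client sketch against the Ollama endpoint (it assumes Ollama's documented /api/embeddings request shape with "model" and "prompt" fields; the input text is illustrative):

import requests

# Request a single embedding from Ollama's native endpoint.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "nomic-embed-text:v1.5",
        "prompt": "sample document text ...",
    },
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]
print("dimension:", len(embedding))  # the logs below show dimension 768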

Performance Comparison

Metric              | LM Studio v0.3.34                 | Ollama
--------------------|-----------------------------------|--------------------------
Stability           | ❌ Crashes every 2-5 requests      | ✅ Stable (100+ requests)
Speed               | ~8s per embedding (when working)  | ~0.13s per embedding
VRAM                | Memory leak suspected             | Efficient management
Concurrent requests | Fails with 4 threads              | Works with 4 threads

Logs (Ollama - Working)

2025-12-11 17:22:39 - Client connected successfully: 192.168.1.101:11434
2025-12-11 17:22:39 - Embedding generated: dimension 768 (0.128s) ✓
2025-12-11 17:22:39 - Embedding generated: dimension 768 (0.010s) ✓
2025-12-11 17:22:39 - Item added to collection successfully


Conclusion: This confirms the issue is specific to LM Studio v0.3.34 and not related to:

  • Model quantization (Q4_K_M, Q5_0)
  • Input text content
  • API client library (OpenAI SDK)
  • Concurrent request handling (works fine in Ollama)

The regression appears to be in LM Studio's embedding server implementation or memory management.


Request for Resolution

Since downgrading to v0.3.33 is not officially supported, please:

  1. Investigate memory-management changes in the v0.3.34 embedding server
  2. Provide a rollback option or a direct link to the v0.3.33 installer
  3. Fix the crash in an upcoming release

Thank you for looking into this critical issue affecting production workflows.
