
Different Embeddings from sentence-transformers/all-MiniLM-L6-v2 compared to Python #188

@udaychandra

Description

First, thank you for the amazing work on Jlama! It's great to have native Java libraries for embeddings and LLMs.

Issue

We're getting different embedding values from Jlama compared to Python's sentence-transformers for the all-MiniLM-L6-v2 model, even though we've verified that tokenization is identical.

Java Code (Jlama)

// Imports assumed from the Jlama artifacts; package paths follow the Jlama README
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.util.Downloader;

var modelName = "sentence-transformers/all-MiniLM-L6-v2";
var workingDirectory = System.getProperty("user.home") + "/.jlama/models/";
var downloader = new Downloader(workingDirectory, modelName);
var modelPath = downloader.huggingFaceModel();

var model = ModelSupport.loadEmbeddingModel(modelPath, DType.F32, DType.F32);

String text = "This is a test document about machine learning";
float[] embedding = model.embed(text, Generator.PoolingType.AVG);

System.out.println("First 10 values:");
for (int i = 0; i < 10; i++) {
    System.out.println("  [" + i + "] = " + embedding[i]);
}

Java Output:

Magnitude: 1.0000001
[0] = -0.0009431843
[1] = 0.006532612
[2] = 0.070363656
[3] = 0.0154365115

Python Code (sentence-transformers)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
text = "This is a test document about machine learning"
embedding = model.encode(text)

print("First 10 values:")
for i in range(10):
    print(f"  [{i}] = {embedding[i]}")

Python Output:

Magnitude: 1.0
[0] = -0.038466498255729675
[1] = 0.00013165567361284047
[2] = 0.01088548544794321
[3] = 0.040931958705186844

What We've Verified

  1. Tokenization is identical: Both produce the same token IDs: [101, 2023, 2003, 1037, 3231, 6254, 2055, 3698, 4083, 102]
  2. Same pooling strategy: Both use mean/average pooling (PoolingType.AVG in Java, pooling_mode_mean_tokens=True in Python)
  3. Same model source: Both download from HuggingFace sentence-transformers/all-MiniLM-L6-v2
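For reference, the pooling step both stacks claim to perform (Jlama's `PoolingType.AVG`, sentence-transformers' `pooling_mode_mean_tokens` followed by its `Normalize` module) can be sketched as mean pooling plus L2 normalization over the per-token vectors. This is an illustrative standalone sketch, not Jlama API; the class and method names are ours.

```java
// Sketch of mean pooling + L2 normalization over per-token embeddings.
// MeanPool and meanPoolAndNormalize are hypothetical names, not Jlama API.
public class MeanPool {
    // Average the token vectors element-wise, then scale to unit length.
    static float[] meanPoolAndNormalize(float[][] tokenEmbeddings) {
        int dim = tokenEmbeddings[0].length;
        float[] pooled = new float[dim];
        for (float[] token : tokenEmbeddings)
            for (int d = 0; d < dim; d++)
                pooled[d] += token[d] / tokenEmbeddings.length;
        double norm = 0;
        for (float v : pooled) norm += v * v;
        norm = Math.sqrt(norm);
        for (int d = 0; d < dim; d++) pooled[d] /= (float) norm;
        return pooled;
    }

    public static void main(String[] args) {
        // mean of [1,3] and [3,1] is [2,2]; normalized, both entries ≈ 0.7071
        float[] out = meanPoolAndNormalize(new float[][]{{1f, 3f}, {3f, 1f}});
        System.out.println(out[0] + " " + out[1]);
    }
}
```

If both libraries implement this step the same way, the remaining divergence would have to come from the transformer forward pass itself (e.g. attention-mask handling or which hidden layer is pooled).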

The Problem

The actual embedding values are completely different (not just minor floating-point differences).
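One way to quantify "completely different" is the cosine similarity between the two vectors; since both outputs report a magnitude of ~1.0, this reduces to a dot product. A minimal sketch (the class and method names are ours, not from either library):

```java
// Cosine similarity between two embeddings; for unit-length vectors this
// reduces to the dot product. Cosine/cosine are illustrative names only.
public class Cosine {
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        float[] v = {0.1f, 0.2f, 0.3f};
        System.out.println(cosine(v, v)); // identical vectors give ≈ 1.0
    }
}
```

Running this over the full 384-dimensional vectors from both stacks would show whether the divergence is numerical noise (cosine ≈ 1) or a genuinely different embedding.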

Questions

  1. Is the all-MiniLM-L6-v2 model fully supported/tested with Jlama?
  2. Are we missing any configuration or preprocessing steps?

Any guidance would be greatly appreciated!
