First, thank you for the amazing work on Jlama! It's great to have native Java libraries for embeddings and LLMs.
Issue
We're getting different embedding values from Jlama compared to Python's sentence-transformers for the all-MiniLM-L6-v2 model, even though we've verified that tokenization is identical.
Java Code (Jlama)
var modelName = "sentence-transformers/all-MiniLM-L6-v2";
var workingDirectory = System.getProperty("user.home") + "/.jlama/models/";
var downloader = new Downloader(workingDirectory, modelName);
var modelPath = downloader.huggingFaceModel();
var model = ModelSupport.loadEmbeddingModel(modelPath, DType.F32, DType.F32);
String text = "This is a test document about machine learning";
float[] embedding = model.embed(text, Generator.PoolingType.AVG);
System.out.println("First 10 values:");
for (int i = 0; i < 10; i++) {
    System.out.println(" [" + i + "] = " + embedding[i]);
}
Java Output:
Magnitude: 1.0000001
[0] = -0.0009431843
[1] = 0.006532612
[2] = 0.070363656
[3] = 0.0154365115
Python Code (sentence-transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
text = "This is a test document about machine learning"
embedding = model.encode(text)
print("First 10 values:")
for i in range(10):
    print(f" [{i}] = {embedding[i]}")
Python Output:
Magnitude: 1.0
[0] = -0.038466498255729675
[1] = 0.00013165567361284047
[2] = 0.01088548544794321
[3] = 0.040931958705186844
What We've Verified
- Tokenization is identical: Both produce the same token IDs:
  [101, 2023, 2003, 1037, 3231, 6254, 2055, 3698, 4083, 102]
- Same pooling strategy: Both use mean/average pooling (PoolingType.AVG in Java, pooling_mode_mean_tokens=True in Python)
- Same model source: Both download from HuggingFace sentence-transformers/all-MiniLM-L6-v2
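One way to narrow this down further is to reproduce the sentence-transformers reference values with plain transformers and an explicit pooling step, which separates the transformer forward pass from the pooling/normalization logic. A minimal sketch, assuming transformers and torch are installed (the manual mean pooling and L2 normalization below mirror what the all-MiniLM-L6-v2 sentence-transformers pipeline does; they are spelled out here only for comparison, not taken from Jlama):

import torch
from transformers import AutoTokenizer, AutoModel

# Reproduce the reference pipeline step by step:
# tokenize -> transformer forward pass -> mask-weighted mean pooling -> L2 normalization.
model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "This is a test document about machine learning"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # shape (1, seq_len, 384)

# Mean pooling over real tokens only (padding masked out via the attention mask).
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# This model's sentence-transformers config also L2-normalizes the pooled vector.
normalized = torch.nn.functional.normalize(pooled, p=2, dim=1)

print(normalized[0, :10])

If these values match the sentence-transformers output but not Jlama's, the divergence is in the transformer forward pass or in how the hidden states are pooled, rather than in tokenization.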
The Problem
The actual embedding values are completely different (not just minor floating-point differences).
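To make "completely different" concrete, it may help to report the cosine similarity of the two full 384-dimensional vectors rather than only the first few components. A minimal sketch (the file names are placeholders for however the vectors are exported, one float per line):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical files holding the full vectors from each implementation.
jlama_vec = np.loadtxt("jlama_embedding.txt")
st_vec = np.loadtxt("sentence_transformers_embedding.txt")

print("cosine similarity:", cosine_similarity(jlama_vec, st_vec))

A value near 1.0 would suggest minor numeric drift, while a value near 0 would indicate a genuinely different computation (for example, different pooling or different weights).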
Questions
- Is the all-MiniLM-L6-v2 model fully supported/tested with Jlama?
- Are we missing any configuration or preprocessing steps?
Any guidance would be greatly appreciated!