fix: convert Dataset Column to list before SentenceTransformer.encode #116

Open
octo-patch wants to merge 1 commit into HandsOnLLM:main from octo-patch:fix/issue-79-sentencetransformer-column-encode

Conversation

@octo-patch

Fixes #79

Problem

SentenceTransformer.encode() sorts input sentences by length internally, then indexes the input with the resulting numpy.int64 indices. HuggingFace Dataset column objects (e.g. data["train"]["text"]) do not support indexing with numpy.int64, so encode raises a TypeError.
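The failure mode described above can be reproduced in miniature without installing either library. The sketch below is a simplified model, not the library code: `IndexLikeInt` is a hypothetical stand-in for numpy.int64 (an integer-like object that is not a built-in int), and `StrictColumn` is a hypothetical stand-in for a Dataset column that rejects such indices.

```python
class IndexLikeInt:
    """Hypothetical stand-in for numpy.int64: integer-like, but not a built-in int."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        # Built-in sequences such as list call __index__ to accept this as an index.
        return self.value


class StrictColumn:
    """Hypothetical stand-in for a column type that only accepts int or slice keys."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __iter__(self):
        return iter(self._values)

    def __getitem__(self, key):
        if not isinstance(key, (int, slice)):
            raise TypeError(f"unsupported index type: {type(key).__name__}")
        return self._values[key]


def sort_by_length(sentences):
    """Mimic a sort-by-length step: compute an order, then index the container with it."""
    order = sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))
    # Wrap each position so it behaves like a NumPy integer rather than a built-in int.
    return [sentences[IndexLikeInt(i)] for i in order]


column = StrictColumn(["a", "ccc", "bb"])

try:
    sort_by_length(column)
    column_failed = False
except TypeError:
    column_failed = True

# A plain list accepts the integer-like index through __index__.
sorted_from_list = sort_by_length(list(column))
```

Under these assumptions the column raises a TypeError while the list is sorted normally, which is the same asymmetry the list() wrapper relies on.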

Solution

Wrap the column with list() before passing it to model.encode(), converting it to a plain Python list, which accepts numpy.int64 indices:

# Before
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings  = model.encode(data["test"]["text"],  show_progress_bar=True)

# After
train_embeddings = model.encode(list(data["train"]["text"]), show_progress_bar=True)
test_embeddings  = model.encode(list(data["test"]["text"]),  show_progress_bar=True)
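If the same conversion is needed in several cells, it could be factored into a small helper. This is only a sketch: `encode_column` is a hypothetical name that is not part of this PR or of sentence-transformers, and it assumes nothing about `model` beyond an `encode` method that takes a list of strings plus keyword arguments.

```python
def encode_column(model, column, **encode_kwargs):
    """Hypothetical helper: materialize `column` as a plain list, then encode it.

    `model` is assumed to expose encode(texts, **kwargs); `column` can be any
    iterable of strings, such as a list or a Dataset column.
    """
    texts = list(column)  # plain list, so integer-like indices are accepted
    return model.encode(texts, **encode_kwargs)
```

With such a helper, the changed lines would read `train_embeddings = encode_column(model, data["train"]["text"], show_progress_bar=True)`. The inline list() call in the diff above does the same thing with less indirection, which suits a two-line notebook change.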

Testing

The fix matches the workaround confirmed in the issue thread. The list() conversion does not change the encode results; it only changes the container type before the sort-by-length step runs.
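The claim that only the container type changes can be illustrated with a toy encoder. The sketch below is an assumption-laden model, not SentenceTransformer.encode: `toy_encode` and `_embed` are hypothetical names, and the "embedding" is just a (length, vowel count) pair. It sorts by length internally and then restores the input order, so the output depends only on the sequence of texts.

```python
def _embed(sentence):
    """Toy 'embedding': (character count, vowel count)."""
    return (len(sentence), sum(ch in "aeiou" for ch in sentence))


def toy_encode(sentences):
    """Toy encoder that sorts by length internally, then restores input order."""
    order = sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))
    sorted_embeddings = [_embed(sentences[i]) for i in order]
    # Undo the internal sort so that row k corresponds to sentences[k].
    restored = [None] * len(sentences)
    for position, original_index in enumerate(order):
        restored[original_index] = sorted_embeddings[position]
    return restored


texts = ("a", "ccc", "bb")            # any indexable container of strings
from_original = toy_encode(texts)
from_list = toy_encode(list(texts))   # same texts, plain list container
```

Here `from_original` and `from_list` are equal, because list() preserves both the elements and their order.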

…fixes HandsOnLLM#79)

SentenceTransformer.encode() sorts sentences by length internally using
numpy.int64 indices, which HuggingFace Dataset Column objects do not support.
This causes a TypeError at encode-time.

Wrapping the column with list() converts it to a plain Python list before
the call, eliminating the incompatibility.

@seanv507

IMO this problem is fixed by pinning sentence-transformers, as mentioned in this issue:
#107 (comment)

(No issue when running on Colab with transformers, sentence-transformers, and peft pinned to the versions in requirements.txt.)

@octo-patch

Author

Thanks. Pinning works as a workaround, but this 2-line defensive change keeps the notebook running on whatever sentence-transformers / datasets versions a reader happens to install (e.g. fresh Colab runtimes with newer pins). It costs almost nothing and keeps users from hitting the same TypeError the next time the upstream pins drift. Happy to defer if you prefer to handle this purely via requirements.txt.

Development

Successfully merging this pull request may close these issues.

Chapter 4 --- Supervised Classification

2 participants