fix: convert Dataset Column to list before SentenceTransformer.encode by octo-patch · Pull Request #116 · HandsOnLLM/Hands-On-Large-Language-Models

octo-patch · 2026-04-23T04:23:58Z

Fixes #79

Problem

SentenceTransformer.encode() sorts input sentences by length internally using numpy.int64 indices. HuggingFace Dataset column objects (e.g. data["train"]["text"]) do not support indexing with numpy.int64, which raises a TypeError during encode.

Solution

Wrap the column with list() before passing it to model.encode(). This converts it to a plain Python list, which accepts numpy.int64 indices:

# Before
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings  = model.encode(data["test"]["text"],  show_progress_bar=True)

# After
train_embeddings = model.encode(list(data["train"]["text"]), show_progress_bar=True)
test_embeddings  = model.encode(list(data["test"]["text"]),  show_progress_bar=True)
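The mechanism can be reproduced without installing either library. The sketch below is a simplified illustration, not the actual sentence-transformers or datasets code: StrictColumn, Int64Like and sort_by_length are hypothetical stand-ins. Int64Like mimics the relevant property of numpy.int64 (it is not a built-in int but implements __index__), and StrictColumn is assumed to reject any key that is not a built-in int, which is the behaviour described in the Problem section.

```python
class StrictColumn:
    """Hypothetical stand-in for a Dataset column that only accepts built-in int keys."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __iter__(self):
        return iter(self._values)

    def __getitem__(self, key):
        # Assumption: anything other than a plain int is rejected.
        if type(key) is not int:
            raise TypeError(f"unsupported index type: {type(key).__name__}")
        return self._values[key]


class Int64Like:
    """Hypothetical stand-in for numpy.int64: not an int, but usable as an index."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


def sort_by_length(sentences):
    """Simplified sort-by-length step: longest first, indexing with non-int indices."""
    order = sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))
    return [sentences[Int64Like(i)] for i in order]


column = StrictColumn(["aa", "a", "aaaa"])

try:
    sort_by_length(column)
    column_failed = False
except TypeError:
    column_failed = True

# list() copies the values into a container that honours __index__.
sorted_from_list = sort_by_length(list(column))
```

Here column_failed ends up True while the list-based call succeeds, which is the same shape as the before/after change in this PR: only the container type differs, not the values being encoded.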

Testing

The fix matches the workaround confirmed in the issue thread. The list() conversion does not change the encode results; it only changes the container type before the sort-by-length step runs.

…fixes HandsOnLLM#79) SentenceTransformer.encode() sorts sentences by length internally using numpy.int64 indices, which HuggingFace Dataset Column objects do not support. This causes a TypeError at encode-time. Wrapping the column with list() converts it to a plain Python list before the call, eliminating the incompatibility.

seanv507 · 2026-04-29T11:07:03Z

IMO this problem is fixed by pinning sentence-transformers, as mentioned in this issue: #107 (comment)

(no issue when running on Colab with transformers, sentence-transformers and peft pinned to the versions in requirements.txt)

octo-patch · 2026-04-29T12:08:19Z

Thanks — pinning works as a workaround, but this 2-line defensive change keeps the notebook running on whatever sentence-transformers / datasets versions a reader happens to install (e.g. fresh Colab runtimes with newer pins). It costs almost nothing and avoids users hitting the same TypeError again the next time the upstream pins drift. Happy to defer if you prefer to handle this purely via requirements.txt.

octo-patch wants to merge 1 commit into HandsOnLLM:main from octo-patch:fix/issue-79-sentencetransformer-column-encode
