Fix SoloSeq embedding truncation behavior #580

Open
samanyuckulkarni123 wants to merge 1 commit into aqlaboratory:main from samanyuckulkarni123:main

@samanyuckulkarni123

Summary

Fix SoloSeq embedding truncation behavior in scripts/precompute_embeddings.py.

The previous implementation truncated with toks[:1022], which slices the batch axis rather than the token-length axis. This PR fixes that bug and makes long-sequence handling explicit.
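To illustrate the difference between the two axes, here is a minimal sketch that uses plain nested lists as a stand-in for the token batch. The batch size of 3 and the token length of 1500 are made-up values for illustration, not taken from the script. With a 2-D tensor, the token-axis slice would typically be written `toks[:, :LIMIT]`.

```python
# Hypothetical stand-in for a tokenized batch: 3 sequences, 1500 tokens each.
LIMIT = 1022
toks = [[0] * 1500 for _ in range(3)]

# Slicing the outer (batch) axis keeps at most LIMIT *rows*.
# With only 3 rows, nothing is removed and every row keeps all 1500 tokens.
batch_sliced = toks[:LIMIT]

# Slicing the inner (token-length) axis keeps at most LIMIT tokens per row.
token_sliced = [row[:LIMIT] for row in toks]

batch_shape = (len(batch_sliced), len(batch_sliced[0]))
token_shape = (len(token_sliced), len(token_sliced[0]))
print(batch_shape)  # (3, 1500)
print(token_shape)  # (3, 1022)
```

The batch-axis slice only has a visible effect once a batch holds more than 1022 sequences, and then it drops whole sequences instead of shortening them.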

Details

  • truncate ESM inputs on the token-length axis instead of the batch axis
  • preserve saved embedding length based on the effective truncated sequence length
  • replace the old truncate flag behavior with a real --truncate / --no-truncate interface
  • emit a warning when sequences longer than 1022 residues are truncated
  • raise a clear error when truncation is disabled for overlength sequences
  • add focused regression tests
  • update SoloSeq docs to match the implemented behavior
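The flag and overlength handling described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: `build_parser`, `prepare_sequence`, and `ESM1B_MAX_RESIDUES` are hypothetical names, and using `argparse.BooleanOptionalAction` (Python 3.9+) is just one way to get a paired `--truncate` / `--no-truncate` option.

```python
import argparse
import warnings

# ESM-1b residue limit referenced in this PR.
ESM1B_MAX_RESIDUES = 1022


def build_parser():
    """Hypothetical parser exposing --truncate / --no-truncate (default: on)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--truncate",
        action=argparse.BooleanOptionalAction,  # also registers --no-truncate
        default=True,
        help="Truncate sequences longer than 1022 residues.",
    )
    return parser


def prepare_sequence(label, seq, truncate):
    """Hypothetical helper: return seq, truncated if allowed, else raise."""
    if len(seq) <= ESM1B_MAX_RESIDUES:
        return seq
    if not truncate:
        raise ValueError(
            f"Sequence {label!r} has {len(seq)} residues, which exceeds "
            f"{ESM1B_MAX_RESIDUES}; rerun with --truncate to shorten it."
        )
    warnings.warn(
        f"Sequence {label!r} truncated from {len(seq)} to "
        f"{ESM1B_MAX_RESIDUES} residues."
    )
    return seq[:ESM1B_MAX_RESIDUES]
```

With this shape, `build_parser().parse_args([])` enables truncation by default, while passing `--no-truncate` turns an overlength sequence into an explicit error rather than a silent change.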

Test plan

  • Verified the updated files parse successfully:
    • python3 -m py_compile scripts/precompute_embeddings.py tests/test_precompute_embeddings.py
  • Ran the new targeted regression test module in a local virtual environment with torch installed:
    • .venv/bin/python -m unittest tests.test_precompute_embeddings
  • Confirmed the test module covers:
    • default CLI behavior (--truncate defaults to enabled)
    • explicit --no-truncate behavior
    • correct truncation of overlength sequences to the ESM-1b 1022 residue limit
    • truncation on the token-length axis rather than dropping batch rows
    • clear failure when truncation is disabled for overlength input

- fix truncation in scripts/precompute_embeddings.py to slice token length instead of batch size
- add real --truncate / --no-truncate CLI behavior
- warn when overlength sequences are truncated
- raise a clear error when truncation is disabled for overlength sequences
- add regression tests for parser and long-sequence handling