Fix pandas 2.x ValueError in monomer SDF loading #20
Open
anagnorisis2peripeteia wants to merge 1 commit into
Pandas 2.x introduced Arrow-backed string columns that reject list assignment via loc/at, raising:

    ValueError: Must have equal len keys and value when setting with an iterable

Fix: cast the three list-valued columns (m_Rgroups, m_RgroupIdx, m_attachmentPointIdx) to object dtype before writing parsed list values into them, and use df.at[] (single-cell scalar assignment) instead of df.loc[] (which pandas 2.x interprets as a multi-row broadcast when given an iterable).

Applied in both get_monomer_info() (sequence.py) and _load_monomer_sdf() (monomerlib.py), which duplicate the same loading pattern.

Closes Boehringer-Ingelheim#18
Summary
Fixes #18
Pandas 2.x introduced Arrow-backed storage for string columns. When
get_monomer_info() and _load_monomer_sdf() attempt to write a parsed
Python list into these columns via df.loc[idx, col] = [...], pandas 2.x
raises:

    ValueError: Must have equal len keys and value when setting with an iterable

because it interprets the list as multiple row values rather than a single
list object for one cell.
Fix
Two changes, applied in both sequence.py and monomerlib.py:

1. Cast the three list-valued columns (m_Rgroups, m_RgroupIdx, m_attachmentPointIdx) to object dtype before writing into them. Only these three columns are cast; the rest of the DataFrame retains its pandas 2.x type optimisations.
2. Use df.at[idx, col] (single-cell assignment) instead of df.loc[idx, col] (which pandas 2.x misinterprets as a broadcast when given an iterable).
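
The two changes can be illustrated on a toy DataFrame. This is only a minimal sketch: the m_name column, the sample values, and the comma-separated raw format are assumptions made for illustration and are not taken from sequence.py or monomerlib.py.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built from the monomer SDF.
# The raw "H,OH" string format is an assumption for this sketch.
df = pd.DataFrame(
    {
        "m_name": ["Ala", "Gly"],
        "m_Rgroups": ["H,OH", "H,OH"],
    }
)

# Change 1: cast only the list-valued column to object dtype so that a
# single cell is able to hold an arbitrary Python object such as a list.
df = df.astype({"m_Rgroups": object})

# Change 2: write the parsed list with .at, which addresses exactly one
# row/column pair and stores the list as a single cell value.
idx = df.index[0]
df.at[idx, "m_Rgroups"] = df.at[idx, "m_Rgroups"].split(",")
```

After this runs, the first row holds a Python list in m_Rgroups while the second row still holds the raw string, which shows that the write touched one cell only.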
Note on duplication
get_monomer_info() in sequence.py and _load_monomer_sdf() in monomerlib.py implement the same SDF loading and list-parsing logic independently. This fix is applied to both. A follow-up refactor could consolidate them into a single shared loader to avoid this kind of divergence in future. Happy to open a separate PR for that if useful.
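
One possible shape for such a shared helper is sketched below. The function name parse_list_columns, its signature, and the comma separator are hypothetical and do not exist in the repository; the sketch only shows how both call sites could delegate the cast-then-assign step to one place.

```python
import pandas as pd

# Column names are taken from this PR; everything else is hypothetical.
LIST_COLUMNS = ("m_Rgroups", "m_RgroupIdx", "m_attachmentPointIdx")


def parse_list_columns(df, columns=LIST_COLUMNS, sep=","):
    """Return a copy of df in which string cells of `columns` become lists."""
    # Cast only the requested columns to object dtype; other columns keep theirs.
    out = df.astype({col: object for col in columns})
    for col in columns:
        for idx in out.index:
            value = out.at[idx, col]
            # Leave missing or already-parsed cells untouched.
            if isinstance(value, str):
                out.at[idx, col] = value.split(sep)
    return out


# Toy input standing in for the DataFrame loaded from the monomer SDF.
raw = pd.DataFrame(
    {
        "m_Rgroups": ["H,OH"],
        "m_RgroupIdx": ["1,2"],
        "m_attachmentPointIdx": ["0,3"],
    }
)
parsed = parse_list_columns(raw)
```

Returning a new DataFrame rather than mutating the argument is a design choice of this sketch; the real loaders may prefer in-place updates.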