
PERF: Big slowdown of searchsorted on pd.Series #65840

Open
kjmin622 wants to merge 2 commits into pandas-dev:main from kjmin622:issue65837

Conversation

kjmin622 (Contributor) commented Jun 10, 2026

kjmin622 marked this pull request as draft June 10, 2026 00:20
kjmin622 marked this pull request as ready for review June 10, 2026 01:13
rhshadrach changed the title from "BUG: Big slowdown of searchsorted on pd.Series" to "PERF: Big slowdown of searchsorted on pd.Series" Jun 10, 2026
rhshadrach added the labels Performance (memory or execution speed performance), Strings (string extension data type and string data), and Sorting (e.g. sort_index, sort_values) Jun 10, 2026
rhshadrach modified the milestones: 3.1, 3.0.4 Jun 10, 2026
Comment thread pandas/core/arrays/string_.py Outdated
"""
if self._hasna:
ndarray = self._ndarray
if len(ndarray) and libmissing.checknull(ndarray[-1]):

rhshadrach (Member), Jun 10, 2026

While NA values generally get sorted to the end, they can also be sorted to the front via na_position. I think we should treat NA values at either the front or the back as sorted, so we also need to check the front of the array.

Can you also add a test for this?
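
To illustrate the point about na_position, here is a minimal pure-Python sketch. It is not the pandas implementation: `sort_with_na` and `na_at_either_end` are hypothetical helpers, and `None` stands in for a missing value. It shows that a check on the last element alone misses missing values that were sorted to the front.

```python
def sort_with_na(values, na_position="last"):
    """Sort strings, placing None (a stand-in for NA) first or last."""
    present = sorted(v for v in values if v is not None)
    missing = [None] * (len(values) - len(present))
    return missing + present if na_position == "first" else present + missing


def na_at_either_end(values):
    """Edge check: True if the first or the last element is missing."""
    return len(values) > 0 and (values[0] is None or values[-1] is None)


na_last = sort_with_na(["b", None, "a"], na_position="last")
na_first = sort_with_na(["b", None, "a"], na_position="first")

# Looking only at the last element misses the NA-first layout.
last_only_check = na_first[-1] is None
```

With these inputs, the last-element check reports no missing value for the NA-first layout, while the two-sided check catches both layouts.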

kjmin622 (Contributor, Author)

@rhshadrach Thank you for the review! I've added code changes and tests that account for na_position.

rhshadrach (Member) left a comment

lgtm

@jbrockmendel - this PR replaces the O(n) check for NA values with an O(1) check, under the assumption that the input is sorted; this assumption is already being made about the non-NA values. Are you okay with that?
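
The trade-off described here can be sketched in plain Python. This is an illustration under stated assumptions, not the code in the PR: `has_na_scan` and `has_na_edges` are hypothetical names, and `None` stands in for a missing value. On input that is sorted with missing values grouped at one end, the two checks agree; on unsorted input the O(1) check can miss a value, which is the assumption being accepted.

```python
def has_na_scan(values):
    """O(n): inspect every element."""
    return any(v is None for v in values)


def has_na_edges(values):
    """O(1): inspect only the first and last elements.

    Only valid when missing values are grouped at one end (sorted input).
    """
    return bool(values) and (values[0] is None or values[-1] is None)


sorted_inputs = (
    ["a", "b", None],   # NA sorted last
    [None, "a", "b"],   # NA sorted first
    ["a", "b", "c"],    # no NA
)
unsorted = ["a", None, "b"]  # NA in the middle violates the assumption

agree_on_sorted = all(has_na_scan(v) == has_na_edges(v) for v in sorted_inputs)
missed_on_unsorted = has_na_scan(unsorted) and not has_na_edges(unsorted)
```

Since a binary search already gives meaningless results on unsorted input, relying on the same assumption for the NA check adds no new requirement on the caller.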

jbrockmendel (Member)

Yes

ndarray = self._ndarray
if len(ndarray) and (
libmissing.checknull(ndarray[0]) or libmissing.checknull(ndarray[-1])
):

Member


Can you add a comment explaining why this isn't using self._hasna?


Labels

Performance (memory or execution speed performance), Sorting (e.g. sort_index, sort_values), Strings (string extension data type and string data)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PERF: Big slowdown of searchsorted on pd.Series

3 participants