Bug: Save original index and remap after function completes #61116

Open
wants to merge 6 commits into
base: main

Jeffrharr

@Jeffrharr Jeffrharr commented Mar 13, 2025

Note: I'm new to this project, so this is my first PR.

Saves the index in SeriesNLargest when the algorithm starts and restores it before returning. This fixes the slowdown that occurs when the index has many duplicate values.
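The idea can be illustrated outside pandas internals. The sketch below is not the code in this PR: `nlargest_via_positions` is a hypothetical helper that assumes the selection can run on a unique positional index, with the original labels mapped back at the end.

```python
import pandas as pd


def nlargest_via_positions(ser: pd.Series, n: int = 5) -> pd.Series:
    """Hypothetical helper: run nlargest on positions, then remap the labels."""
    # Save the original (possibly duplicated) index.
    original_index = ser.index
    # Replace it with a unique 0..len-1 index so no step sees duplicate labels.
    positional = ser.reset_index(drop=True)
    top = positional.nlargest(n)
    # top.index now holds integer positions; map them back to the saved labels.
    top.index = original_index.take(top.index.to_numpy())
    return top


ser = pd.Series([0.1, 0.9, 0.5, 0.7], index=[1, 1, 1, 0])
result = nlargest_via_positions(ser, 2)
```

Here `result` keeps the duplicated label `1` for the largest value, because labels are looked up by position rather than by value.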

Results:

  • The original timings are in the linked issue, where slow_df took several ms per call. Timings with this change:
In [4]: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: N = 1500
   ...: N_HALF = 750
   ...: 
   ...: slow_df = pd.DataFrame({'a': np.random.rand(N)}, index=np.concatenate([[1] * N_HALF, np.arange(N_HALF)]))
   ...: print("slow_df")
   ...: %timeit slow_df['a'].nlargest()
   ...: 
   ...: fast_df = pd.DataFrame({'a': np.random.rand(N)})
   ...: print("fast_df")
   ...: %timeit fast_df['a'].nlargest()

slow_df
427 μs ± 11.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
fast_df
420 μs ± 5.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Tests

The existing tests should cover this change. If we want a dedicated performance check, a benchmark could be added to asv_bench.
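If a dedicated benchmark is wanted, it might look roughly like the sketch below. The class name `NLargestDuplicateIndex` is hypothetical and not part of this PR; the sketch follows the usual asv convention of a `setup` method plus `time_*` methods and reuses the sizes from the timing session above.

```python
import numpy as np
import pandas as pd


class NLargestDuplicateIndex:
    # Hypothetical benchmark class; the name is an assumption.

    def setup(self):
        n = 1500
        half = n // 2
        # First half of the index repeats the label 1; second half is 0..half-1.
        index = np.concatenate([np.ones(half, dtype=np.int64), np.arange(half)])
        self.ser = pd.Series(np.random.rand(n), index=index)

    def time_nlargest(self):
        # asv times the body of every method whose name starts with "time_".
        self.ser.nlargest()
```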

Addendum

I also changed the sort call to sort(kind="stable") to get consistent ordering, which is what the equivalent Frame method already does (it was using kind="mergesort", which is equivalent to kind="stable" but kept for backwards compatibility). I can remove this change; it may be better suited to a separate PR.
https://numpy.org/doc/stable/reference/generated/numpy.sort.html#numpy.sort
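A small illustration of the stable-sort point, using plain NumPy rather than the pandas code touched here: with tied values, a stable sort keeps equal elements in their original relative order, and `kind="mergesort"` gives the same result as `kind="stable"`.

```python
import numpy as np

values = np.array([2, 1, 2, 1])

# Ties are resolved by original position: the 1 at index 1 comes before the 1 at index 3.
stable_order = np.argsort(values, kind="stable")
mergesort_order = np.argsort(values, kind="mergesort")

print(stable_order)
print(np.array_equal(stable_order, mergesort_order))
```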

@Jeffrharr Jeffrharr marked this pull request as ready for review March 13, 2025 22:13
@Jeffrharr Jeffrharr changed the title from "Save original index and remap after function completes." to "Bug: Save original index and remap after function completes" Mar 14, 2025
Development

Successfully merging this pull request may close these issues.

PERF: Surprisingly slow nlargest with duplicates in the index