fix: deduplicate DataFrame index before asfreq() in _apply_df_freq_horizon#843
Open
hossamnagy wants to merge 1 commit into
Open
Conversation
Prevents ValueError: cannot reindex on an axis with duplicate labels when naive-mpc-optim is called with an InfluxDB backend whose time window (due to UTC offsets) spans two calendar days, producing ~2x the expected data points. The resulting concat of PV + load forecast Series can carry duplicate timestamps that cause asfreq() to fail. Uses the same ~df.index.duplicated() pattern already present in set_df_index_freq() and prepare_data() elsewhere in the codebase.
Contributor
Reviewer's guide (collapsed on small PRs)Reviewer's GuideThis PR hardens the _apply_df_freq_horizon pipeline by deduplicating the DataFrame index before resampling with asfreq(), preventing failures when concatenated forecast data contains duplicate timestamps due to timezone-spanning query ranges. Flow diagram for DataFrame index deduplication before asfreq in _apply_df_freq_horizonflowchart LR
A[_apply_df_freq_horizon called with df] --> B[Check retrieve_hass_conf optimization_time_step]
B --> C[Convert step to Timedelta if needed]
C --> D[Filter df with ~df.index.duplicated keep=last]
D --> E[Call df.asfreq step]
E --> F[Return resampled df]
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Since this duplicate-index filtering pattern now appears in several places (here,
set_df_index_freq(), andprepare_data()), consider extracting a small helper (e.g.,utils.drop_duplicate_index(df, keep='last')) to keep behavior consistent and easier to maintain. - If the upstream concatenation does not guarantee the index is sorted, consider calling
df = df.sort_index()before dropping duplicates so thatkeep='last'is deterministic and not dependent on concatenation order.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Since this duplicate-index filtering pattern now appears in several places (here, `set_df_index_freq()`, and `prepare_data()`), consider extracting a small helper (e.g., `utils.drop_duplicate_index(df, keep='last')`) to keep behavior consistent and easier to maintain.
- If the upstream concatenation does not guarantee the index is sorted, consider calling `df = df.sort_index()` before dropping duplicates so that `keep='last'` is deterministic and not dependent on concatenation order.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
naive-mpc-optimraisesValueError: cannot reindex on an axis with duplicate labelsinside_apply_df_freq_horizonwhendf.asfreq(step)is called on the concatenated PV + load forecast DataFrame.Root cause:
get_days_list(n)spans two UTC calendar dates (due to timezone offsets), so InfluxDB returns roughly 2x the expected data points for a nominal "1 day" window. After the PV and load forecast Series are concatenated onaxis=1, the resulting DatetimeIndex can contain duplicate timestamps, which causesasfreq()to fail.Traceback:
Fix
Deduplicate the index immediately before calling
asfreq, using the same~df.index.duplicated()pattern already present inset_df_index_freq()andprepare_data()elsewhere in the codebase:Tested on
EMHASS 0.17.2, InfluxDB backend, 30-min optimization time step,
naiveload forecast method,Australia/Sydneytimezone (UTC+10, no DST at time of testing).Summary by Sourcery
Bug Fixes: