Skip to content

Integrate Sylvan Energy heat rates analysis#5190

Open
katie-lamb wants to merge 8 commits into
mainfrom
sylvan-heat-rates
Open

Integrate Sylvan Energy heat rates analysis#5190
katie-lamb wants to merge 8 commits into
mainfrom
sylvan-heat-rates

Conversation

@katie-lamb

@katie-lamb katie-lamb commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Overview

Running Sylvan's script was actually quite fast for me, the slow part was reading in / downloading CEMS. But still, parallelizing and making it available to run on all states at once seems good.

Closes #5106 .

What problem does this address?

Dagsterizes Sylvan's analysis of plant characteristics and heat rate calculations.

I was able to vectorize almost all of it, except some of the binning (see comments). It runs pretty fast for me in Dagster, but I haven't tried running it on all of the states. Maybe the binning needs to be vectorized as well, but then it likely won't match Jaxon's script outputs exactly.

I didn't yet create assets for the other CSVs produced by the script (eia_860_plant_unit_summary.csv, eia_860_plant_summary.csv, eia_860_plant_gen_summary.csv), but not sure if these all need to be assets.

What did you change?

Documentation

Make sure to update relevant aspects of the documentation:

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

I tested this by materializing out_epacems__yearly_operational_characteristics on just California with 3 years of data and comparing the asset to the epa_op_char_output_df.csv output from Jaxon's script. They were identical except for a .1 difference in heat_rate_at_min_stable_level_mmbtu_per_mwh. This could be a rounding error?

I didn't touch anything dbt, so that will need to get added in. It's currently failing the unit tests because of this.

To-do list

  • Migrate to polars as much as possible. Verify output still the same.
  • Decide whether to create assets for the other CSVs produced by the script (eia_860_plant_unit_summary.csv, eia_860_plant_summary.csv, eia_860_plant_gen_summary.csv) -- not useful for users, add only if a useful intermediate debugging output
  • Run script and ID any additional performance blocks to handle
  • Run script and compare to outputs from code
  • Create an asset graph to kick off per-year and write to DB
  • Add DBT tests

out of scope

  • Fill in gaps with similarly binned plants
  • Expand this to calculate rolling characteristics for other years

Testing todos

  • If updating analyses or data processing functions: make sure to update row count expectations in dbt tests.
  • Run pixi run prek-run to run linters and static code analysis checks.
  • Run pixi run pytest-ci locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.
  • For PRs that change the PUDL outputs significantly, run the full ETL locally and then run the data validations using dbt. If you can't run the ETL locally then run the build-deploy-pudl GitHub Action manually and ensure that it succeeds.


from . import (
allocate_gen_fuel,
derived_plant_characteristics,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i chose this name, but maybe it's not right

HEAT_RATE_ANALYSIS_CONFIG_SCHEMA = {
"final_year": Field(
int,
default_value=2024, # derive from dataset settings instead of hard coding?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i said in comment: derive from dataset settings instead of hard coding?

),
"eia_epa_mapping_year": Field(
int,
default_value=2024, # derive from dataset settings instead of hard coding?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this could be derived from dataset settings

Comment on lines +110 to +125
if isinstance(core_epacems__hourly_emissions, pl.LazyFrame):
cems_lf = core_epacems__hourly_emissions.select(cems_columns).filter(
pl.col("year").is_between(start_year, final_year, closed="both")
)
if states:
cems_lf = cems_lf.filter(pl.col("state").is_in(states))

cems = cems_lf.collect(engine="streaming").to_pandas()
else:
cems = core_epacems__hourly_emissions.loc[
core_epacems__hourly_emissions["year"].between(start_year, final_year),
cems_columns,
].copy()

if states:
cems = cems[cems["state"].isin(states)].copy()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

materializing the asset was crashing for me when loading cems as a pandas dataframe, so i had to keep it as a polars lazy frame for as long as possible

"""Estimate EPA CEMS unit operational characteristics."""
heat_rate_config = _get_heat_rate_analysis_config(context)
return (
pl.from_pandas(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandera yelled at me when I tried returning this as a pandas dataframe, so i made it a polars lazy frame. not sure what convention is here.

Comment on lines +617 to +623
if valid_for_binning.any():
binned = (
cems.loc[valid_for_binning]
.groupby(unit_cols, group_keys=False)[load_factor_col]
.apply(lambda s: pd.cut(s, bins=10, right=True, include_lowest=False))
.astype("object")
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this groupby apply is not vectorized... but couldn't figure out how to avoid it while still matching the output of jaxon's script

@katie-lamb katie-lamb marked this pull request as ready for review April 23, 2026 23:34
@katie-lamb katie-lamb requested a review from cmgosnell April 23, 2026 23:34
return (~(same_unit & consecutive_hour & same_state)).cumsum()


def _assign_groupwise_load_factor_bins(

@katie-lamb katie-lamb Apr 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diverges slightly from the original script. It no longer loops over the units and does everything within the loop; it now does per-unit pd.cut just for bin assignment, then feeds those bins into (mostly) vectorized operations.

return ramp_input.groupby(unit_cols).apply(summarize_unit).reset_index()


def estimate_operational_characteristics_by_unit(

@katie-lamb katie-lamb Apr 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does several things: per-unit max-load prep, binning, stable-run detection, stable-bin selection, heat-rate summarization, up/down run summarization, and final column shaping. It could probably be broken out into more pieces.

return cems


def _summarize_ramp_rates(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentionally not fully vectorize, because all the vectorized implementations I tried led to meaningful changes in the final output compared to the script's final output.

Comment on lines +750 to +764
stable_runs = (
binned_cems.loc[binned_cems["load_factor_bin_ordinal"] > 1]
.groupby(
unit_cols
+ [
"load_factor_bin_ordinal",
"load_factor_bin_left",
"load_factor_bin",
"bin_run_id",
],
as_index=False,
)
.size()
.rename(columns={"size": "run_length"})
)

@katie-lamb katie-lamb Apr 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here this identifies candidate stable runs from binned_cems, then chooses one stable bin per unit to calculate heat_rate_at_min_stable_level_mmbtu_per_mwh and min_up_time_hr.

There are two different types of "stable bin" here:

load_factor_bin_ordinal: a numeric ordering of bins within a unit
min_stable_bin: the actual pd.Interval object from pd.cut

The ordinal is good for sorting and picking the first qualifying stable bin. But to match the outputs of the script (for the heat_rate_at_min_stable_level_mmbtu_per_mwh and min_up_time_hr ), the actual interval object matters, because the original script selects rows based on exact bin membership rather than just bin order.

Comment thread src/pudl/metadata/fields.py Outdated
),
"unit": "MW",
},
"min_down_time_hr": {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably be "min_down_time_hours"? and same for "min_up_time_hours".

@cmgosnell cmgosnell added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 output Exporting data from PUDL into other platforms or interchange formats. analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. epacems Integration and analysis of the EPA CEMS dataset. labels Apr 27, 2026
@e-belfer e-belfer self-assigned this May 21, 2026
"usage_warnings": ["estimated_values"],
"additional_details_text": """This table summarizes several inferred
operational characteristics for each EPA CEMS emissions unit using hourly CEMS
gross load and fuel heat content over a configurable multi-year window.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To-do: For basically all users this will not be configurable - should we just note that this is 3 years by default?

if ramp.limit(1).collect().is_empty():
return np.nan, np.nan

# Cast to pandas to qcut bins

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider whether to change binning method to work in polars.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 epacems Integration and analysis of the EPA CEMS dataset. output Exporting data from PUDL into other platforms or interchange formats.

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Integrate Sylvan's CEMS and EIA inferred plant characteristics

3 participants