Integrate Sylvan Energy heat rates analysis by katie-lamb · Pull Request #5190 · catalyst-cooperative/pudl

katie-lamb · 2026-04-17T01:19:57Z

Overview

Running Sylvan's script was actually quite fast for me, the slow part was reading in / downloading CEMS. But still, parallelizing and making it available to run on all states at once seems good.

Closes #5106 .

What problem does this address?

Dagsterizes Sylvan's analysis of plant characteristics and heat rate calculations.

I was able to vectorize almost all of it, except some of the binning (see comments). It runs pretty fast for me in Dagster, but I haven't tried running it on all of the states. Maybe the binning needs to be vectorized as well, but then it likely won't match Jaxon's script outputs exactly.

I didn't yet create assets for the other CSVs produced by the script (eia_860_plant_unit_summary.csv, eia_860_plant_summary.csv, eia_860_plant_gen_summary.csv), but not sure if these all need to be assets.

What did you change?

Documentation

Make sure to update relevant aspects of the documentation:

Update the release notes: reference the PR and related issues.
Update relevant Data Source jinja templates (see docs/data_sources/templates).
Update relevant table or source description metadata (see src/metadata).
Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

I tested this by materializing out_epacems__yearly_operational_characteristics on just California with 3 years of data and comparing the asset to the epa_op_char_output_df.csv output from Jaxon's script. They were identical except for a .1 difference in heat_rate_at_min_stable_level_mmbtu_per_mwh. This could be a rounding error?

I didn't touch anything dbt, so that will need to get added in. It's currently failing the unit tests because of this.

To-do list

Migrate to polars as much as possible. Verify output still the same.
Decide whether to create assets for the other CSVs produced by the script (eia_860_plant_unit_summary.csv, eia_860_plant_summary.csv, eia_860_plant_gen_summary.csv) -- not useful for users, add only if a useful intermediate debugging output
Run script and ID any additional performance blocks to handle
Run script and compare to outputs from code
Create an asset graph to kick off per-year and write to DB
Add DBT tests

out of scope

Fill in gaps with similarly binned plants
Expand this to calculate rolling characteristics for other years

Testing todos

If updating analyses or data processing functions: make sure to update row count expectations in dbt tests.
Run pixi run prek-run to run linters and static code analysis checks.
Run pixi run pytest-ci locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have.
For PRs that change the PUDL outputs significantly, run the full ETL locally and then run the data validations using dbt. If you can't run the ETL locally then run the build-deploy-pudl GitHub Action manually and ensure that it succeeds.

katie-lamb · 2026-04-23T23:21:34Z


 from . import (
    allocate_gen_fuel,
+    derived_plant_characteristics,


i chose this name, but maybe it's not right

katie-lamb · 2026-04-23T23:22:02Z

+HEAT_RATE_ANALYSIS_CONFIG_SCHEMA = {
+    "final_year": Field(
+        int,
+        default_value=2024,  # derive from dataset settings instead of hard coding?


As i said in comment: derive from dataset settings instead of hard coding?

katie-lamb · 2026-04-23T23:22:21Z

+    ),
+    "eia_epa_mapping_year": Field(
+        int,
+        default_value=2024,  # derive from dataset settings instead of hard coding?


again this could be derived from dataset settings

katie-lamb · 2026-04-23T23:25:35Z

+    if isinstance(core_epacems__hourly_emissions, pl.LazyFrame):
+        cems_lf = core_epacems__hourly_emissions.select(cems_columns).filter(
+            pl.col("year").is_between(start_year, final_year, closed="both")
+        )
+        if states:
+            cems_lf = cems_lf.filter(pl.col("state").is_in(states))
+
+        cems = cems_lf.collect(engine="streaming").to_pandas()
+    else:
+        cems = core_epacems__hourly_emissions.loc[
+            core_epacems__hourly_emissions["year"].between(start_year, final_year),
+            cems_columns,
+        ].copy()
+
+        if states:
+            cems = cems[cems["state"].isin(states)].copy()


materializing the asset was crashing for me when loading cems as a pandas dataframe, so i had to keep it as a polars lazy frame for as long as possible

katie-lamb · 2026-04-23T23:28:12Z

+    """Estimate EPA CEMS unit operational characteristics."""
+    heat_rate_config = _get_heat_rate_analysis_config(context)
+    return (
+        pl.from_pandas(


Pandera yelled at me when I tried returning this as a pandas dataframe, so i made it a polars lazy frame. not sure what convention is here.

katie-lamb · 2026-04-23T23:32:07Z

+    if valid_for_binning.any():
+        binned = (
+            cems.loc[valid_for_binning]
+            .groupby(unit_cols, group_keys=False)[load_factor_col]
+            .apply(lambda s: pd.cut(s, bins=10, right=True, include_lowest=False))
+            .astype("object")
+        )


this groupby apply is not vectorized... but couldn't figure out how to avoid it while still matching the output of jaxon's script

katie-lamb · 2026-04-25T00:05:40Z

+    return (~(same_unit & consecutive_hour & same_state)).cumsum()
+
+
+def _assign_groupwise_load_factor_bins(


This diverges slightly from the original script. It no longer loops over the units and does everything within the loop; it now does per-unit pd.cut just for bin assignment, then feeds those bins into (mostly) vectorized operations.

katie-lamb · 2026-04-25T00:06:56Z

+    return ramp_input.groupby(unit_cols).apply(summarize_unit).reset_index()
+
+
+def estimate_operational_characteristics_by_unit(


This does several things: per-unit max-load prep, binning, stable-run detection, stable-bin selection, heat-rate summarization, up/down run summarization, and final column shaping. It could probably be broken out into more pieces.

katie-lamb · 2026-04-25T00:09:23Z

+    return cems
+
+
+def _summarize_ramp_rates(


This is intentionally not fully vectorize, because all the vectorized implementations I tried led to meaningful changes in the final output compared to the script's final output.

katie-lamb · 2026-04-25T00:14:34Z

+    stable_runs = (
+        binned_cems.loc[binned_cems["load_factor_bin_ordinal"] > 1]
+        .groupby(
+            unit_cols
+            + [
+                "load_factor_bin_ordinal",
+                "load_factor_bin_left",
+                "load_factor_bin",
+                "bin_run_id",
+            ],
+            as_index=False,
+        )
+        .size()
+        .rename(columns={"size": "run_length"})
+    )


Here this identifies candidate stable runs from binned_cems, then chooses one stable bin per unit to calculate heat_rate_at_min_stable_level_mmbtu_per_mwh and min_up_time_hr.

There are two different types of "stable bin" here:

load_factor_bin_ordinal: a numeric ordering of bins within a unit
min_stable_bin: the actual pd.Interval object from pd.cut

The ordinal is good for sorting and picking the first qualifying stable bin. But to match the outputs of the script (for the heat_rate_at_min_stable_level_mmbtu_per_mwh and min_up_time_hr ), the actual interval object matters, because the original script selects rows based on exact bin membership rather than just bin order.

katie-lamb · 2026-04-25T00:24:43Z

        ),
        "unit": "MW",
    },
+    "min_down_time_hr": {


should probably be "min_down_time_hours"? and same for "min_up_time_hours".

e-belfer · 2026-05-21T14:38:00Z

+            "usage_warnings": ["estimated_values"],
+            "additional_details_text": """This table summarizes several inferred
+operational characteristics for each EPA CEMS emissions unit using hourly CEMS
+gross load and fuel heat content over a configurable multi-year window.


To-do: For basically all users this will not be configurable - should we just note that this is 3 years by default?

e-belfer · 2026-05-25T16:38:44Z

+    if ramp.limit(1).collect().is_empty():
+        return np.nan, np.nan
+
+    # Cast to pandas to qcut bins


Consider whether to change binning method to work in polars.

initial template

490a54e

github-project-automation Bot added this to Catalyst Megaproject Apr 17, 2026

github-project-automation Bot moved this to New in Catalyst Megaproject Apr 17, 2026

cmgosnell moved this from New to In progress in Catalyst Megaproject Apr 20, 2026

katie-lamb added 2 commits April 22, 2026 11:44

Merge branch 'main' into sylvan-heat-rates

a364811

update to vectorize and match output

3ae61ed

katie-lamb commented Apr 23, 2026

View reviewed changes

remove old module name

2ddb7eb

katie-lamb marked this pull request as ready for review April 23, 2026 23:34

katie-lamb requested a review from cmgosnell April 23, 2026 23:34

katie-lamb commented Apr 25, 2026

View reviewed changes

update metadata

a229a39

e-belfer self-assigned this May 21, 2026

Merge branch 'main' into sylvan-heat-rates

2fef45e

e-belfer reviewed May 21, 2026

View reviewed changes

Rename fields

db9b773

WIP migration to polars

41f10d4

e-belfer reviewed May 25, 2026

View reviewed changes

		return (~(same_unit & consecutive_hour & same_state)).cumsum()


		def _assign_groupwise_load_factor_bins(

		return ramp_input.groupby(unit_cols).apply(summarize_unit).reset_index()


		def estimate_operational_characteristics_by_unit(

Uh oh!

Conversation

katie-lamb commented Apr 17, 2026 • edited by e-belfer Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

What problem does this address?

What did you change?

Documentation

Testing

To-do list

out of scope

Testing todos

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

katie-lamb Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

katie-lamb Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

katie-lamb Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

katie-lamb commented Apr 17, 2026 •

edited by e-belfer

Loading

katie-lamb Apr 25, 2026 •

edited

Loading

katie-lamb Apr 25, 2026 •

edited

Loading

katie-lamb Apr 25, 2026 •

edited

Loading